Commit Graph

25710 Commits

Author SHA1 Message Date
Daniel Lezcano
c80081b920 genirq: Allow to pass the IRQF_TIMER flag with percpu irq request
The irq timings infrastructure tracks when interrupts occur in order to
statistically predict te next interrupt event.

There is no point to track timer interrupts and try to predict them because
the next expiration time is already known. This can be avoided via the
IRQF_TIMER flag which is passed by timer drivers in request_irq(). It marks
the interrupt as timer based which alloes to ignore these interrupts in the
timings code.

Per CPU interrupts which are requested via request_percpu_+irq() have no
flag argument, so marking per cpu timer interrupts is not possible and they
get tracked pointlessly.

Add __request_percpu_irq() as a variant of request_percpu_irq() with a
flags argument and make request_percpu_irq() an inline wrapper passing
flags = 0.

The flag parameter is restricted to IRQF_TIMER as all other IRQF_ flags
make no sense for per cpu interrupts.

The next step is to convert all existing users of request_percpu_irq() and
then remove the wrapper and the underscores.

[ tglx: Massaged changelog ]

Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: peterz@infradead.org
Cc: nicolas.pitre@linaro.org
Cc: vincent.guittot@linaro.org
Cc: rafael@kernel.org
Link: http://lkml.kernel.org/r/1499344144-3964-1-git-send-email-daniel.lezcano@linaro.org
2017-07-06 23:16:22 +02:00
Linus Torvalds
9ced560b82 Merge branch 'for-4.13' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup changes from Tejun Heo:

 - Waiman made the debug controller work and a lot more useful on
   cgroup2

 - There were a couple issues with cgroup subtree delegation. The
   documentation on delegating to a non-root user was missing some part
   and cgroup namespace support wasn't factoring in delegation at all.
   The documentation is updated and the now there is a mount option to
   make cgroup namespace fit for delegation

* 'for-4.13' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
  cgroup: implement "nsdelegate" mount option
  cgroup: restructure cgroup_procs_write_permission()
  cgroup: "cgroup.subtree_control" should be writeable by delegatee
  cgroup: fix lockdep warning in debug controller
  cgroup: refactor cgroup_masks_read() in the debug controller
  cgroup: make debug an implicit controller on cgroup2
  cgroup: Make debug cgroup support v2 and thread mode
  cgroup: Make Kconfig prompt of debug cgroup more accurate
  cgroup: Move debug cgroup to its own file
  cgroup: Keep accurate count of tasks in each css_set
2017-07-06 09:52:09 -07:00
Michael Sartain
99c621d704 tracing: Add saved_tgids file to show cached pid to tgid mappings
Export the cached pid / tgid mappings in debugfs tracing saved_tgids file.
This allows user apps to translate the pids from a trace to their respective
thread group.

Example saved_tgids file with pid / tgid values separated by ' ':

  # cat saved_tgids
  1048 1048
  1047 1047
  7 7
  1049 1047
  1054 1047
  1053 1047

Link: http://lkml.kernel.org/r/20170630004023.064965233@goodmis.org
Link: http://lkml.kernel.org/r/20170706040713.unwkumbta5menygi@mikesart-cos

Reviewed-by: Joel Fernandes <joelaf@google.com>
Signed-off-by: Michael Sartain <mikesart@fastmail.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-07-06 10:11:53 -04:00
Thomas Gleixner
9cd4f1a4e7 smp/hotplug: Move unparking of percpu threads to the control CPU
Vikram reported the following backtrace:

   BUG: scheduling while atomic: swapper/7/0/0x00000002
   CPU: 7 PID: 0 Comm: swapper/7 Not tainted 4.9.32-perf+ #680
   schedule
   schedule_hrtimeout_range_clock
   schedule_hrtimeout
   wait_task_inactive
   __kthread_bind_mask
   __kthread_bind
   __kthread_unpark
   kthread_unpark
   cpuhp_online_idle
   cpu_startup_entry
   secondary_start_kernel

He analyzed correctly that a parked cpu hotplug thread of an offlined CPU
was still on the runqueue when the CPU came back online and tried to unpark
it. This causes the thread which invoked kthread_unpark() to call
wait_task_inactive() and subsequently schedule() with preemption disabled.
His proposed workaround was to "make sure" that a parked thread has
scheduled out when the CPU goes offline, so the situation cannot happen.

But that's still wrong because the root cause is not the fact that the
percpu thread is still on the runqueue and neither that preemption is
disabled, which could be simply solved by enabling preemption before
calling kthread_unpark().

The real issue is that the calling thread is the idle task of the upcoming
CPU, which is not supposed to call anything which might sleep.  The moron,
who wrote that code, missed completely that kthread_unpark() might end up
in schedule().

The solution is simpler than expected. The thread which controls the
hotplug operation is waiting for the CPU to call complete() on the hotplug
state completion. So the idle task of the upcoming CPU can set its state to
CPUHP_AP_ONLINE_IDLE and invoke complete(). This in turn wakes the control
task on a different CPU, which then can safely do the unpark and kick the
now unparked hotplug thread of the upcoming CPU to complete the bringup to
the final target state.

Control CPU                     AP

bringup_cpu();
  __cpu_up()  ------------>
				bringup_ap();
  bringup_wait_for_ap()
    wait_for_completion();
                                cpuhp_online_idle();
                <------------    complete();
    unpark(AP->stopper);
    unpark(AP->hotplugthread);
                                while(1)
                                  do_idle();
    kick(AP->hotplugthread);
    wait_for_completion();	hotplug_thread()
				  run_online_callbacks();
				  complete();

Fixes: 8df3e07e7f ("cpu/hotplug: Let upcoming cpu bring itself fully up")
Reported-by: Vikram Mulukutla <markivx@codeaurora.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Sebastian Sewior <bigeasy@linutronix.de>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Tejun Heo <tj@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1707042218020.2131@nanos
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2017-07-06 10:55:10 +02:00
David Howells
4cc7c1864b bpf: Implement show_options
Implement the show_options superblock op for bpf as part of a bid to get
rid of s_options and generic_show_options() to make it easier to implement
a context-based mount where the mount options can be passed individually
over a file descriptor.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Alexei Starovoitov <ast@kernel.org>
cc: Daniel Borkmann <daniel@iogearbox.net>
cc: netdev@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-07-06 03:31:46 -04:00
Linus Torvalds
7114f51fcb Merge branch 'work.memdup_user' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull memdup_user() conversions from Al Viro:
 "A fairly self-contained series - hunting down open-coded memdup_user()
  and memdup_user_nul() instances"

* 'work.memdup_user' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  bpf: don't open-code memdup_user()
  kimage_file_prepare_segments(): don't open-code memdup_user()
  ethtool: don't open-code memdup_user()
  do_ip_setsockopt(): don't open-code memdup_user()
  do_ipv6_setsockopt(): don't open-code memdup_user()
  irda: don't open-code memdup_user()
  xfrm_user_policy(): don't open-code memdup_user()
  ima_write_policy(): don't open-code memdup_user_nul()
  sel_write_validatetrans(): don't open-code memdup_user_nul()
2017-07-05 16:05:24 -07:00
Linus Torvalds
ea3b25e132 Merge branch 'timers-compat' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull timer-related user access updates from Al Viro:
 "Continuation of timers-related stuff (there had been more, but my
  parts of that series are already merged via timers/core). This is more
  of y2038 work by Deepa Dinamani, partially disrupted by the
  unification of native and compat timers-related syscalls"

* 'timers-compat' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  posix_clocks: Use get_itimerspec64() and put_itimerspec64()
  timerfd: Use get_itimerspec64() and put_itimerspec64()
  nanosleep: Use get_timespec64() and put_timespec64()
  posix-timers: Use get_timespec64() and put_timespec64()
  posix-stubs: Conditionally include COMPAT_SYS_NI defines
  time: introduce {get,put}_itimerspec64
  time: add get_timespec64 and put_timespec64
2017-07-05 15:34:35 -07:00
Linus Torvalds
4be95131bf Merge branch 'work.sys_wait' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull wait syscall updates from Al Viro:
 "Consolidating sys_wait* and compat counterparts.

  Gets rid of set_fs()/double-copy mess, simplifies the whole thing
  (lifting the copyouts to the syscalls means less headache in the part
  that does actual work - fewer failure exits, to start with), gets rid
  of the overhead of field-by-field __put_user()"

* 'work.sys_wait' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  osf_wait4: switch to kernel_wait4()
  waitid(): switch copyout of siginfo to unsafe_put_user()
  wait_task_zombie: consolidate info logics
  kill wait_noreap_copyout()
  lift getrusage() from wait_noreap_copyout()
  waitid(2): leave copyout of siginfo to syscall itself
  kernel_wait4()/kernel_waitid(): delay copying status to userland
  wait4(2)/waitid(2): separate copying rusage to userland
  move compat wait4 and waitid next to native variants
2017-07-05 14:10:19 -07:00
Linus Torvalds
5518b69b76 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next
Pull networking updates from David Miller:
 "Reasonably busy this cycle, but perhaps not as busy as in the 4.12
  merge window:

   1) Several optimizations for UDP processing under high load from
      Paolo Abeni.

   2) Support pacing internally in TCP when using the sch_fq packet
      scheduler for this is not practical. From Eric Dumazet.

   3) Support mutliple filter chains per qdisc, from Jiri Pirko.

   4) Move to 1ms TCP timestamp clock, from Eric Dumazet.

   5) Add batch dequeueing to vhost_net, from Jason Wang.

   6) Flesh out more completely SCTP checksum offload support, from
      Davide Caratti.

   7) More plumbing of extended netlink ACKs, from David Ahern, Pablo
      Neira Ayuso, and Matthias Schiffer.

   8) Add devlink support to nfp driver, from Simon Horman.

   9) Add RTM_F_FIB_MATCH flag to RTM_GETROUTE queries, from Roopa
      Prabhu.

  10) Add stack depth tracking to BPF verifier and use this information
      in the various eBPF JITs. From Alexei Starovoitov.

  11) Support XDP on qed device VFs, from Yuval Mintz.

  12) Introduce BPF PROG ID for better introspection of installed BPF
      programs. From Martin KaFai Lau.

  13) Add bpf_set_hash helper for TC bpf programs, from Daniel Borkmann.

  14) For loads, allow narrower accesses in bpf verifier checking, from
      Yonghong Song.

  15) Support MIPS in the BPF selftests and samples infrastructure, the
      MIPS eBPF JIT will be merged in via the MIPS GIT tree. From David
      Daney.

  16) Support kernel based TLS, from Dave Watson and others.

  17) Remove completely DST garbage collection, from Wei Wang.

  18) Allow installing TCP MD5 rules using prefixes, from Ivan
      Delalande.

  19) Add XDP support to Intel i40e driver, from Björn Töpel

  20) Add support for TC flower offload in nfp driver, from Simon
      Horman, Pieter Jansen van Vuuren, Benjamin LaHaise, Jakub
      Kicinski, and Bert van Leeuwen.

  21) IPSEC offloading support in mlx5, from Ilan Tayari.

  22) Add HW PTP support to macb driver, from Rafal Ozieblo.

  23) Networking refcount_t conversions, From Elena Reshetova.

  24) Add sock_ops support to BPF, from Lawrence Brako. This is useful
      for tuning the TCP sockopt settings of a group of applications,
      currently via CGROUPs"

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1899 commits)
  net: phy: dp83867: add workaround for incorrect RX_CTRL pin strap
  dt-bindings: phy: dp83867: provide a workaround for incorrect RX_CTRL pin strap
  cxgb4: Support for get_ts_info ethtool method
  cxgb4: Add PTP Hardware Clock (PHC) support
  cxgb4: time stamping interface for PTP
  nfp: default to chained metadata prepend format
  nfp: remove legacy MAC address lookup
  nfp: improve order of interfaces in breakout mode
  net: macb: remove extraneous return when MACB_EXT_DESC is defined
  bpf: add missing break in for the TCP_BPF_SNDCWND_CLAMP case
  bpf: fix return in load_bpf_file
  mpls: fix rtm policy in mpls_getroute
  net, ax25: convert ax25_cb.refcount from atomic_t to refcount_t
  net, ax25: convert ax25_route.refcount from atomic_t to refcount_t
  net, ax25: convert ax25_uid_assoc.refcount from atomic_t to refcount_t
  net, sctp: convert sctp_ep_common.refcnt from atomic_t to refcount_t
  net, sctp: convert sctp_transport.refcnt from atomic_t to refcount_t
  net, sctp: convert sctp_chunk.refcnt from atomic_t to refcount_t
  net, sctp: convert sctp_datamsg.refcnt from atomic_t to refcount_t
  net, sctp: convert sctp_auth_bytes.refcnt from atomic_t to refcount_t
  ...
2017-07-05 12:31:59 -07:00
Linus Torvalds
e24dd9ee53 Merge branch 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security
Pull security layer updates from James Morris:

 - a major update for AppArmor. From JJ:

     * several bug fixes and cleanups

     * the patch to add symlink support to securityfs that was floated
       on the list earlier and the apparmorfs changes that make use of
       securityfs symlinks

     * it introduces the domain labeling base code that Ubuntu has been
       carrying for several years, with several cleanups applied. And it
       converts the current mediation over to using the domain labeling
       base, which brings domain stacking support with it. This finally
       will bring the base upstream code in line with Ubuntu and provide
       a base to upstream the new feature work that Ubuntu carries.

     * This does _not_ contain any of the newer apparmor mediation
       features/controls (mount, signals, network, keys, ...) that
       Ubuntu is currently carrying, all of which will be RFC'd on top
       of this.

 - Notable also is the Infiniband work in SELinux, and the new file:map
   permission. From Paul:

      "While we're down to 21 patches for v4.13 (it was 31 for v4.12),
       the diffstat jumps up tremendously with over 2k of line changes.

       Almost all of these changes are the SELinux/IB work done by
       Daniel Jurgens; some other noteworthy changes include a NFS v4.2
       labeling fix, a new file:map permission, and reporting of policy
       capabilities on policy load"

   There's also now genfscon labeling support for tracefs, which was
   lost in v4.1 with the separation from debugfs.

 - Smack incorporates a safer socket check in file_receive, and adds a
   cap_capable call in privilege check.

 - TPM as usual has a bunch of fixes and enhancements.

 - Multiple calls to security_add_hooks() can now be made for the same
   LSM, to allow LSMs to have hook declarations across multiple files.

 - IMA now supports different "ima_appraise=" modes (eg. log, fix) from
   the boot command line.

* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security: (126 commits)
  apparmor: put back designators in struct initialisers
  seccomp: Switch from atomic_t to recount_t
  seccomp: Adjust selftests to avoid double-join
  seccomp: Clean up core dump logic
  IMA: update IMA policy documentation to include pcr= option
  ima: Log the same audit cause whenever a file has no signature
  ima: Simplify policy_func_show.
  integrity: Small code improvements
  ima: fix get_binary_runtime_size()
  ima: use ima_parse_buf() to parse template data
  ima: use ima_parse_buf() to parse measurements headers
  ima: introduce ima_parse_buf()
  ima: Add cgroups2 to the defaults list
  ima: use memdup_user_nul
  ima: fix up #endif comments
  IMA: Correct Kconfig dependencies for hash selection
  ima: define is_ima_appraise_enabled()
  ima: define Kconfig IMA_APPRAISE_BOOTPARAM option
  ima: define a set of appraisal rules requiring file signatures
  ima: extend the "ima_policy" boot command line to support multiple policies
  ...
2017-07-05 11:26:35 -07:00
Linus Torvalds
7391786a64 Merge branch 'stable-4.13' of git://git.infradead.org/users/pcmoore/audit
Pull audit updates from Paul Moore:
 "Things are relatively quiet on the audit front for v4.13, just five
  patches for a total diffstat of 102 lines.

  There are two patches from Richard to consistently record the POSIX
  capabilities and add the ambient capability information as well.

  I also chipped in two patches to fix a race condition with the auditd
  tracking code and ensure we don't skip sending any records to the
  audit multicast group.

  Finally a single style fix that I accepted because I must have been in
  a good mood that day.

  Everything passes our test suite, and should be relatively harmless,
  please merge for v4.13"

* 'stable-4.13' of git://git.infradead.org/users/pcmoore/audit:
  audit: make sure we never skip the multicast broadcast
  audit: fix a race condition with the auditd tracking code
  audit: style fix
  audit: add ambient capabilities to CAPSET and BPRM_FCAPS records
  audit: unswing cap_* fields in PATH records
2017-07-05 11:24:05 -07:00
Linus Torvalds
eed1fc8779 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/pmladek/printk
Pull printk updates from Petr Mladek:

 - Store printk() messages into the main log buffer directly even in NMI
   when the lock is available. It is the best effort to print even large
   chunk of text. It is handy, for example, when all ftrace messages are
   printed during the system panic in NMI.

 - Add missing annotations to calm down compiler warnings

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/pmladek/printk:
  printk: add __printf attributes to internal functions
  printk: Use the main logbuf in NMI when logbuf_lock is available
2017-07-05 11:11:26 -07:00
Jeffrey Hugo
65a4433aeb sched/fair: Fix load_balance() affinity redo path
If load_balance() fails to migrate any tasks because all tasks were
affined, load_balance() removes the source CPU from consideration and
attempts to redo and balance among the new subset of CPUs.

There is a bug in this code path where the algorithm considers all active
CPUs in the system (minus the source that was just masked out).  This is
not valid for two reasons: some active CPUs may not be in the current
scheduling domain and one of the active CPUs is dst_cpu. These CPUs should
not be considered, as we cannot pull load from them.

Instead of failing out of load_balance(), we may end up redoing the search
with no valid CPUs and incorrectly concluding the domain is balanced.
Additionally, if the group_imbalance flag was just set, it may also be
incorrectly unset, thus the flag will not be seen by other CPUs in future
load_balance() runs as that algorithm intends.

Fix the check by removing CPUs not in the current domain and the dst_cpu
from considertation, thus limiting the evaluation to valid remaining CPUs
from which load might be migrated.

Co-authored-by: Austin Christ <austinwc@codeaurora.org>
Co-authored-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Tested-by: Tyler Baicar <tbaicar@codeaurora.org>
Signed-off-by: Jeffrey Hugo <jhugo@codeaurora.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Austin Christ <austinwc@codeaurora.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Timur Tabi <timur@codeaurora.org>
Link: http://lkml.kernel.org/r/1496863138-11322-2-git-send-email-jhugo@codeaurora.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-07-05 16:28:48 +02:00
Steven Rostedt (VMware)
69d71879d2 ftrace: Test for NULL iter->tr in regex for stack_trace_filter changes
As writing into stack_trace_filter, the iter-tr is not set and is NULL.
Check if it is NULL before dereferencing it in ftrace_regex_release().

Fixes: 8c08f0d5c6 ("ftrace: Have cached module filters be an active filter")
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-07-05 09:52:18 -04:00
Steven Rostedt (VMware)
4dce17b26b Merge commit '0f17976568b3f72e676450af0c0db6f8752253d6' into trace/ftrace/core
Need to get the changes from 0f17976568 ("ftrace: Fix regression with
module command in stack_trace_filter") as it is required to fix some other
changes with stack_trace_filter and the new development code.

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-07-05 09:51:24 -04:00
Kirill Tkhai
a0c4acd2c2 locking/rwsem-spinlock: Fix EINTR branch in __down_write_common()
If a writer could been woken up, the above branch

	if (sem->count == 0)
		break;

would have moved us to taking the sem. So, it's
not the time to wake a writer now, and only readers
are allowed now. Thus, 0 must be passed to __rwsem_do_wake().

Next, __rwsem_do_wake() wakes readers unconditionally.
But we mustn't do that if the sem is owned by writer
in the moment. Otherwise, writer and reader own the sem
the same time, which leads to memory corruption in
callers.

rwsem-xadd.c does not need that, as:

  1) the similar check is made lockless there,
  2) in __rwsem_mark_wake::try_reader_grant we test,

that sem is not owned by writer.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: <stable@vger.kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Niklas Cassel <niklas.cassel@axis.com>
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: 17fcbd590d "locking/rwsem: Fix down_write_killable() for CONFIG_RWSEM_GENERIC_SPINLOCK=y"
Link: http://lkml.kernel.org/r/149762063282.19811.9129615532201147826.stgit@localhost.localdomain
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-07-05 12:26:29 +02:00
Wanpeng Li
2a42eb9594 sched/cputime: Accumulate vtime on top of nsec clocksource
Currently the cputime source used by vtime is jiffies. When we cross
a context boundary and jiffies have changed since the last snapshot, the
pending cputime is accounted to the switching out context.

This system works ok if the ticks are not aligned across CPUs. If they
instead are aligned (ie: all fire at the same time) and the CPUs run in
userspace, the jiffies change is only observed on tick exit and therefore
the user cputime is accounted as system cputime. This is because the
CPU that maintains timekeeping fires its tick at the same time as the
others. It updates jiffies in the middle of the tick and the other CPUs
see that update on IRQ exit:

    CPU 0 (timekeeper)                  CPU 1
    -------------------              -------------
                      jiffies = N
    ...                              run in userspace for a jiffy
    tick entry                       tick entry (sees jiffies = N)
    set jiffies = N + 1
    tick exit                        tick exit (sees jiffies = N + 1)
                                                account 1 jiffy as stime

Fix this with using a nanosec clock source instead of jiffies. The
cputime is then accumulated and flushed everytime the pending delta
reaches a jiffy in order to mitigate the accounting overhead.

[ fweisbec: changelog, rebase on struct vtime, field renames, add delta
  on cputime readers, keep idle vtime as-is (low overhead accounting),
  harmonize clock sources. ]

Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Reported-by: Luiz Capitulino <lcapitulino@redhat.com>
Tested-by: Luiz Capitulino <lcapitulino@redhat.com>
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Wanpeng Li <kernellwp@gmail.com>
Link: http://lkml.kernel.org/r/1498756511-11714-6-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-07-05 09:54:15 +02:00
Frederic Weisbecker
bac5b6b6b1 sched/cputime: Move the vtime task fields to their own struct
We are about to add vtime accumulation fields to the task struct. Let's
avoid more bloatification and gather vtime information to their own
struct.

Tested-by: Luiz Capitulino <lcapitulino@redhat.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Wanpeng Li <kernellwp@gmail.com>
Link: http://lkml.kernel.org/r/1498756511-11714-5-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-07-05 09:54:15 +02:00
Frederic Weisbecker
60a9ce57e7 sched/cputime: Rename vtime fields
The current "snapshot" based naming on vtime fields suggests we record
some past event but that's a low level picture of their actual purpose
which comes out blurry. The real point of these fields is to run a basic
state machine that tracks down cputime entry while switching between
contexts.

So lets reflect that with more meaningful names.

Tested-by: Luiz Capitulino <lcapitulino@redhat.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Wanpeng Li <kernellwp@gmail.com>
Link: http://lkml.kernel.org/r/1498756511-11714-4-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-07-05 09:54:14 +02:00
Frederic Weisbecker
9fa57cf5a5 sched/cputime: Always set tsk->vtime_snap_whence after accounting vtime
Even though it doesn't have functional consequences, setting
the task's new context state after we actually accounted the pending
vtime from the old context state makes more sense from a review
perspective.

vtime_user_exit() is the only function that doesn't follow that rule
and that can bug the reviewer for a little while until he realizes there
is no reason for this special case.

Tested-by: Luiz Capitulino <lcapitulino@redhat.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Wanpeng Li <kernellwp@gmail.com>
Link: http://lkml.kernel.org/r/1498756511-11714-3-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-07-05 09:54:14 +02:00
Frederic Weisbecker
1c3eda01a7 vtime, sched/cputime: Remove vtime_account_user()
It's an unnecessary function between vtime_user_exit() and
account_user_time().

Tested-by: Luiz Capitulino <lcapitulino@redhat.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Wanpeng Li <kernellwp@gmail.com>
Link: http://lkml.kernel.org/r/1498756511-11714-2-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-07-05 09:54:14 +02:00
Linus Torvalds
408c9861c6 Power management updates for v4.13-rc1
- Rework suspend-to-idle to allow it to take wakeup events signaled
    by the EC into account on ACPI-based platforms in order to properly
    support power button wakeup from suspend-to-idle on recent Dell
    laptops (Rafael Wysocki).
 
    That includes the core suspend-to-idle code rework, support for
    the Low Power S0 _DSM interface, and support for the ACPI INT0002
    Virtual GPIO device from Hans de Goede (required for USB keyboard
    wakeup from suspend-to-idle to work on some machines).
 
  - Stop trying to export the current CPU frequency via /proc/cpuinfo
    on x86 as that is inaccurate and confusing (Len Brown).
 
  - Rework the way in which the current CPU frequency is exported by
    the kernel (over the cpufreq sysfs interface) on x86 systems with
    the APERF and MPERF registers by always using values read from
    these registers, when available, to compute the current frequency
    regardless of which cpufreq driver is in use (Len Brown).
 
  - Rework the PCI/ACPI device wakeup infrastructure to remove the
    questionable and artificial distinction between "devices that
    can wake up the system from sleep states" and "devices that can
    generate wakeup signals in the working state" from it, which
    allows the code to be simplified quite a bit (Rafael Wysocki).
 
  - Fix the wakeup IRQ framework by making it use SRCU instead of
    RCU which doesn't allow sleeping in the read-side critical
    sections, but which in turn is expected to be allowed by the
    IRQ bus locking infrastructure (Thomas Gleixner).
 
  - Modify some computations in the intel_pstate driver to avoid
    rounding errors resulting from them (Srinivas Pandruvada).
 
  - Reduce the overhead of the intel_pstate driver in the HWP
    (hardware-managed P-states) mode and when the "performance"
    P-state selection algorithm is in use by making it avoid
    registering scheduler callbacks in those cases (Len Brown).
 
  - Rework the energy_performance_preference sysfs knob in
    intel_pstate by changing the values that correspond to
    different symbolic hint names used by it (Len Brown).
 
  - Make it possible to use more than one cpuidle driver at the same
    time on ARM (Daniel Lezcano).
 
  - Make it possible to prevent the cpuidle menu governor from using
    the 0 state by disabling it via sysfs (Nicholas Piggin).
 
  - Add support for FFH (Fixed Functional Hardware) MWAIT in ACPI C1
    on AMD systems (Yazen Ghannam).
 
  - Make the CPPC cpufreq driver take the lowest nonlinear performance
    information into account (Prashanth Prakash).
 
  - Add support for hi3660 to the cpufreq-dt driver, fix the
    imx6q driver and clean up the sfi, exynos5440 and intel_pstate
    drivers (Colin Ian King, Krzysztof Kozlowski, Octavian Purdila,
    Rafael Wysocki, Tao Wang).
 
  - Fix a few minor issues in the generic power domains (genpd)
    framework and clean it up somewhat (Krzysztof Kozlowski,
    Mikko Perttunen, Viresh Kumar).
 
  - Fix a couple of minor issues in the operating performance points
    (OPP) framework and clean it up somewhat (Viresh Kumar).
 
  - Fix a CONFIG dependency in the hibernation core and clean it up
    slightly (Balbir Singh, Arvind Yadav, BaoJun Luo).
 
  - Add rk3228 support to the rockchip-io adaptive voltage scaling
    (AVS) driver (David Wu).
 
  - Fix an incorrect bit shift operation in the RAPL power capping
    driver (Adam Lessnau).
 
  - Add support for the EPP field in the HWP (hardware managed
    P-states) control register, HWP.EPP, to the x86_energy_perf_policy
    tool and update msr-index.h with HWP.EPP values (Len Brown).
 
  - Fix some minor issues in the turbostat tool (Len Brown).
 
  - Add support for AMD family 0x17 CPUs to the cpupower tool and fix
    a minor issue in it (Sherry Hurwitz).
 
  - Assorted cleanups, mostly related to the constification of some
    data structures (Arvind Yadav, Joe Perches, Kees Cook, Krzysztof
    Kozlowski).
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQIcBAABCAAGBQJZWrICAAoJEILEb/54YlRxZYMQAIRhfbyDxKq+ByvSilUS8kTA
 AItwJ8FFzykhiwN75Cqabg4rAGyWma7IRs1vzU7zeC1aEQIn+bTQtvk+utZNI+g2
 ANFlDha20q/sXsP/CDMMTIAdW9tSOC0TOvFI9s2V2Y8dJZhoekO4ctx34FAfUS5d
 Ao6rwSAWCMsCXcGaTAlqTA+TEJmBG7u6Iq6hq6ngltoFwOv3mWWBVn52VVaJ7SMp
 9/IPbbLGMFAedrgEBRGCR+MME1xZZpvcZIJaTt1Mgn7Cx3cJaysIUAvqY/SsvFGq
 5FcUTcF2qpK3+AGawiAxZIjvOBsGRtIwqKinNIzYWs/NjiIdzmgVAmTeuPtTqp+5
 HFehUdtkFcnuDnLqSNzAaZUa7tw84cJkwnbVMnesx0MkG6rZ1SeL22E2Sabpcdsh
 3Yo1ThzJSxi59DhiiE92EQnNCEjmCldRy+8q5Ag035muxl6EJYvuNBMnZv/BMCUn
 ltSNOrmps1DlN+Col8ORIeNzQ1YjYzWMqKAYzSbyccm4ug/iSHx0/DuESmQ4GTlF
 YCwkmqyWiHrBwpl51jc+4a7SGlMmKRqU+MJes0CjagaaqoUAb8qeBOpzEJ0yNwjZ
 wtI41l6blE6kbMD3yqGdCfiB2S7GlPVoxa15eX1wRyLH3fLjwwrzJirEaiBS86tI
 1PzHZEOmBlh3DYC6DBKA
 =Wsph
 -----END PGP SIGNATURE-----

Merge tag 'pm-4.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull power management updates from Rafael Wysocki:
 "The big ticket items here are the rework of suspend-to-idle in order
  to add proper support for power button wakeup from it on recent Dell
  laptops and the rework of interfaces exporting the current CPU
  frequency on x86.

  In addition to that, support for a few new pieces of hardware is
  added, the PCI/ACPI device wakeup infrastructure is simplified
  significantly and the wakeup IRQ framework is fixed to unbreak the IRQ
  bus locking infrastructure.

  Also, there are some functional improvements for intel_pstate, tools
  updates and small fixes and cleanups all over.

  Specifics:

   - Rework suspend-to-idle to allow it to take wakeup events signaled
     by the EC into account on ACPI-based platforms in order to properly
     support power button wakeup from suspend-to-idle on recent Dell
     laptops (Rafael Wysocki).

     That includes the core suspend-to-idle code rework, support for the
     Low Power S0 _DSM interface, and support for the ACPI INT0002
     Virtual GPIO device from Hans de Goede (required for USB keyboard
     wakeup from suspend-to-idle to work on some machines).

   - Stop trying to export the current CPU frequency via /proc/cpuinfo
     on x86 as that is inaccurate and confusing (Len Brown).

   - Rework the way in which the current CPU frequency is exported by
     the kernel (over the cpufreq sysfs interface) on x86 systems with
     the APERF and MPERF registers by always using values read from
     these registers, when available, to compute the current frequency
     regardless of which cpufreq driver is in use (Len Brown).

   - Rework the PCI/ACPI device wakeup infrastructure to remove the
     questionable and artificial distinction between "devices that can
     wake up the system from sleep states" and "devices that can
     generate wakeup signals in the working state" from it, which allows
     the code to be simplified quite a bit (Rafael Wysocki).

   - Fix the wakeup IRQ framework by making it use SRCU instead of RCU
     which doesn't allow sleeping in the read-side critical sections,
     but which in turn is expected to be allowed by the IRQ bus locking
     infrastructure (Thomas Gleixner).

   - Modify some computations in the intel_pstate driver to avoid
     rounding errors resulting from them (Srinivas Pandruvada).

   - Reduce the overhead of the intel_pstate driver in the HWP
     (hardware-managed P-states) mode and when the "performance" P-state
     selection algorithm is in use by making it avoid registering
     scheduler callbacks in those cases (Len Brown).

   - Rework the energy_performance_preference sysfs knob in intel_pstate
     by changing the values that correspond to different symbolic hint
     names used by it (Len Brown).

   - Make it possible to use more than one cpuidle driver at the same
     time on ARM (Daniel Lezcano).

   - Make it possible to prevent the cpuidle menu governor from using
     the 0 state by disabling it via sysfs (Nicholas Piggin).

   - Add support for FFH (Fixed Functional Hardware) MWAIT in ACPI C1 on
     AMD systems (Yazen Ghannam).

   - Make the CPPC cpufreq driver take the lowest nonlinear performance
     information into account (Prashanth Prakash).

   - Add support for hi3660 to the cpufreq-dt driver, fix the imx6q
     driver and clean up the sfi, exynos5440 and intel_pstate drivers
     (Colin Ian King, Krzysztof Kozlowski, Octavian Purdila, Rafael
     Wysocki, Tao Wang).

   - Fix a few minor issues in the generic power domains (genpd)
     framework and clean it up somewhat (Krzysztof Kozlowski, Mikko
     Perttunen, Viresh Kumar).

   - Fix a couple of minor issues in the operating performance points
     (OPP) framework and clean it up somewhat (Viresh Kumar).

   - Fix a CONFIG dependency in the hibernation core and clean it up
     slightly (Balbir Singh, Arvind Yadav, BaoJun Luo).

   - Add rk3228 support to the rockchip-io adaptive voltage scaling
     (AVS) driver (David Wu).

   - Fix an incorrect bit shift operation in the RAPL power capping
     driver (Adam Lessnau).

   - Add support for the EPP field in the HWP (hardware managed
     P-states) control register, HWP.EPP, to the x86_energy_perf_policy
     tool and update msr-index.h with HWP.EPP values (Len Brown).

   - Fix some minor issues in the turbostat tool (Len Brown).

   - Add support for AMD family 0x17 CPUs to the cpupower tool and fix a
     minor issue in it (Sherry Hurwitz).

   - Assorted cleanups, mostly related to the constification of some
     data structures (Arvind Yadav, Joe Perches, Kees Cook, Krzysztof
     Kozlowski)"

* tag 'pm-4.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (69 commits)
  cpufreq: Update scaling_cur_freq documentation
  cpufreq: intel_pstate: Clean up after performance governor changes
  PM: hibernate: constify attribute_group structures.
  cpuidle: menu: allow state 0 to be disabled
  intel_idle: Use more common logging style
  PM / Domains: Fix missing default_power_down_ok comment
  PM / Domains: Fix unsafe iteration over modified list of domains
  PM / Domains: Fix unsafe iteration over modified list of domain providers
  PM / Domains: Fix unsafe iteration over modified list of device links
  PM / Domains: Handle safely genpd_syscore_switch() call on non-genpd device
  PM / Domains: Call driver's noirq callbacks
  PM / core: Drop run_wake flag from struct dev_pm_info
  PCI / PM: Simplify device wakeup settings code
  PCI / PM: Drop pme_interrupt flag from struct pci_dev
  ACPI / PM: Consolidate device wakeup settings code
  ACPI / PM: Drop run_wake from struct acpi_device_wakeup_flags
  PM / QoS: constify *_attribute_group.
  PM / AVS: rockchip-io: add io selectors and supplies for rk3228
  powercap/RAPL: prevent overridding bits outside of the mask
  PM / sysfs: Constify attribute groups
  ...
2017-07-04 13:39:41 -07:00
Arvind Yadav
1d0c6e5930 PM / sleep: constify attribute_group structures
attribute_groups are not supposed to change at runtime. All functions
working with attribute_groups provided by <linux/sysfs.h> work with const
attribute_group. So mark the non-const structs as const.

File size before:
   text	   data	    bss	    dec	    hex	filename
   3802	    624	     32	   4458	   116a	kernel/power/main.o

File size After adding 'const':
   text	   data	    bss	    dec	    hex	filename
   3866	    560	     32	   4458	   116a	kernel/power/main.o

Signed-off-by: Arvind Yadav <arvind.yadav.cs@gmail.com>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2017-07-04 22:01:16 +02:00
Thomas Gleixner
2343877fbd genirq/timings: Move free timings out of spinlocked region
No point to do memory management from a interrupt disabled spin locked
region.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Marc Zyngier <marc.zyngier@arm.com>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Heiko Stuebner <heiko@sntech.de>
Cc: Julia Cartwright <julia@ni.com>
Cc: Linus Walleij <linus.walleij@linaro.org>
Cc: Brian Norris <briannorris@chromium.org>
Cc: Doug Anderson <dianders@chromium.org>
Cc: linux-rockchip@lists.infradead.org
Cc: John Keeping <john@metanate.com>
Cc: linux-gpio@vger.kernel.org
Link: http://lkml.kernel.org/r/20170629214344.196130646@linutronix.de
2017-07-04 12:46:16 +02:00
Thomas Gleixner
46e48e2573 genirq: Move irq resource handling out of spinlocked region
Aside of being conceptually wrong, there is also an actual (hard to
trigger and mostly theoretical) problem.

CPU0				CPU1
free_irq(X)			interrupt X
				spin_lock(desc->lock)
				wake irq thread()
				spin_unlock(desc->lock)
spin_lock(desc->lock)
remove action()
shutdown_irq()			
release_resources()		thread_handler()
spin_unlock(desc->lock)		  access released resources.

synchronize_irq()

Move the release resources invocation after synchronize_irq() so it's
guaranteed that the threaded handler has finished.

Move the resource request call out of the desc->lock held region as well,
so the invocation context is the same for both request and release.

This solves the problems with those functions on RT as well.
 
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Marc Zyngier <marc.zyngier@arm.com>
Cc: Heiko Stuebner <heiko@sntech.de>
Cc: Julia Cartwright <julia@ni.com>
Cc: Linus Walleij <linus.walleij@linaro.org>
Cc: Brian Norris <briannorris@chromium.org>
Cc: Doug Anderson <dianders@chromium.org>
Cc: linux-rockchip@lists.infradead.org
Cc: John Keeping <john@metanate.com>
Cc: linux-gpio@vger.kernel.org
Link: http://lkml.kernel.org/r/20170629214344.117028181@linutronix.de
2017-07-04 12:46:16 +02:00
Thomas Gleixner
9114014cf4 genirq: Add mutex to irq desc to serialize request/free_irq()
The irq_request/release_resources() callbacks ar currently invoked under
desc->lock with interrupts disabled. This is a source of problems on RT and
conceptually not required.

Add a seperate mutex to struct irq_desc which allows to serialize
request/free_irq(), which can be used to move the resource functions out of
the desc->lock held region.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Marc Zyngier <marc.zyngier@arm.com>
Cc: Heiko Stuebner <heiko@sntech.de>
Cc: Julia Cartwright <julia@ni.com>
Cc: Linus Walleij <linus.walleij@linaro.org>
Cc: Brian Norris <briannorris@chromium.org>
Cc: Doug Anderson <dianders@chromium.org>
Cc: linux-rockchip@lists.infradead.org
Cc: John Keeping <john@metanate.com>
Cc: linux-gpio@vger.kernel.org
Link: http://lkml.kernel.org/r/20170629214344.039220922@linutronix.de
2017-07-04 12:46:16 +02:00
Thomas Gleixner
3a90795e1e genirq: Move bus locking into __setup_irq()
There is no point in having the irq_bus_lock() protection around all
callers to __setup_irq().

Move it into __setup_irq(). This is also a preparatory patch for addressing
the issues with the irq resource callbacks.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Marc Zyngier <marc.zyngier@arm.com>
Cc: Heiko Stuebner <heiko@sntech.de>
Cc: Julia Cartwright <julia@ni.com>
Cc: Linus Walleij <linus.walleij@linaro.org>
Cc: Brian Norris <briannorris@chromium.org>
Cc: Doug Anderson <dianders@chromium.org>
Cc: linux-rockchip@lists.infradead.org
Cc: John Keeping <john@metanate.com>
Cc: linux-gpio@vger.kernel.org
Link: http://lkml.kernel.org/r/20170629214343.960949031@linutronix.de
2017-07-04 12:46:15 +02:00
Geert Uytterhoeven
2372a519f6 genirq: Force inlining of __irq_startup_managed to prevent build failure
If CONFIG_SMP=n, and gcc (e.g. 4.1.2) decides not to inline
__irq_startup_managed(), the build fails with:

    kernel/built-in.o: In function `irq_startup':
    (.text+0x38ed8): undefined reference to `irq_set_affinity_locked'

Fix this by forcing inlining of __irq_startup_managed().

Fixes: 761ea388e8 ("genirq: Handle managed irqs gracefully in irq_startup()")
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Arnd Bergmann <arnd@arndb.de>
Link: http://lkml.kernel.org/r/1499162761-12398-1-git-send-email-geert@linux-m68k.org
2017-07-04 12:36:44 +02:00
Sebastian Ott
e5682b4eec genirq/debugfs: Fix build for !CONFIG_IRQ_DOMAIN
Fix this build error:

kernel/irq/internals.h:440:20: error: inlining failed in call to always_inline
  'irq_domain_debugfs_init': function body not available
kernel/irq/debugfs.c:202:2: note: called from here
  irq_domain_debugfs_init(root_dir);
  ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Signed-off-by: Sebastian Ott <sebott@linux.vnet.ibm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/alpine.LFD.2.20.1707041124000.1712@schleppi
2017-07-04 12:36:43 +02:00
Ingo Molnar
3b9c08ae3d Revert "sched/cputime: Refactor the cputime_adjust() code"
This reverts commit 72298e5c92.

As Peter explains:

> Argh, no... That code was perfectly fine. The new code otoh is
> convoluted.
>
> The old code had the following form:
>
>         if (exception1)
>           deal with exception1
>
>         if (execption2)
>           deal with exception2
>
>         do normal stuff
>
> Which is as simple and straight forward as it gets.
>
> The new code otoh reads like:
>
>         if (!exception1) {
>                 if (exception2)
>                   deal with exception 2
>                 else
>                   do normal stuff
>         }

So restore the old form.

Also fix the comment describing the logic, as it was confusing.

Requested-by: Peter Zijlstra <peterz@infradead.org>
Cc: Gustavo A. R. Silva <garsilva@embeddedor.com>
Cc: Frans Klaver <fransklaver@gmail.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-07-04 11:58:05 +02:00
Linus Torvalds
650fc870a2 There has been a fair amount of activity in the docs tree this time
around.  Highlights include:
 
  - Conversion of a bunch of security documentation into RST
 
  - The conversion of the remaining DocBook templates by The Amazing
    Mauro Machine.  We can now drop the entire DocBook build chain.
 
  - The usual collection of fixes and minor updates.
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJZWkGAAAoJEI3ONVYwIuV6rf0P/0B3JTiVPKS/WUx53+jzbAi4
 1BN7dmmuMxE1bWpgdEq+ac4aKxm07iAojuntuMj0qz/ZB1WARcmvEqqzI5i4wfq9
 5MrLduLkyuWfr4MOPseKJ2VK83p8nkMOiO7jmnBsilu7fE4nF+5YY9j4cVaArfMy
 cCQvAGjQzvej2eiWMGUSLHn4QFKh00aD7cwKyBVsJ08b27C9xL0J2LQyCDZ4yDgf
 37/MH3puEd3HX/4qAwLonIxT3xrIrrbDturqLU7OSKcWTtGZNrYyTFbwR3RQtqWd
 H8YZVg2Uyhzg9MYhkbQ2E5dEjUP4mkegcp6/JTINH++OOPpTbdTJgirTx7VTkSf1
 +kL8t7+Ayxd0FH3+77GJ5RMj8LUK6rj5cZfU5nClFQKWXP9UL3IelQ3Nl+SpdM8v
 ZAbR2KjKgH9KS6+cbIhgFYlvY+JgPkOVruwbIAc7wXVM3ibk1sWoBOFEujcbueWh
 yDpQv3l1UX0CKr3jnevJoW26LtEbGFtC7gSKZ+3btyeSBpWFGlii42KNycEGwUW0
 ezlwryDVHzyTUiKllNmkdK4v73mvPsZHEjgmme4afKAIiUilmcUF4XcqD86hISFT
 t+UJLA/zEU+0sJe26o2nK6GNJzmo4oCtVyxfhRe26Ojs1n80xlYgnZRfuIYdd31Z
 nwLBnwDCHAOyX91WXp9G
 =cVjZ
 -----END PGP SIGNATURE-----

Merge tag 'docs-4.13' of git://git.lwn.net/linux

Pull documentation updates from Jonathan Corbet:
 "There has been a fair amount of activity in the docs tree this time
  around. Highlights include:

   - Conversion of a bunch of security documentation into RST

   - The conversion of the remaining DocBook templates by The Amazing
     Mauro Machine. We can now drop the entire DocBook build chain.

   - The usual collection of fixes and minor updates"

* tag 'docs-4.13' of git://git.lwn.net/linux: (90 commits)
  scripts/kernel-doc: handle DECLARE_HASHTABLE
  Documentation: atomic_ops.txt is core-api/atomic_ops.rst
  Docs: clean up some DocBook loose ends
  Make the main documentation title less Geocities
  Docs: Use kernel-figure in vidioc-g-selection.rst
  Docs: fix table problems in ras.rst
  Docs: Fix breakage with Sphinx 1.5 and upper
  Docs: Include the Latex "ifthen" package
  doc/kokr/howto: Only send regression fixes after -rc1
  docs-rst: fix broken links to dynamic-debug-howto in kernel-parameters
  doc: Document suitability of IBM Verse for kernel development
  Doc: fix a markup error in coding-style.rst
  docs: driver-api: i2c: remove some outdated information
  Documentation: DMA API: fix a typo in a function name
  Docs: Insert missing space to separate link from text
  doc/ko_KR/memory-barriers: Update control-dependencies example
  Documentation, kbuild: fix typo "minimun" -> "minimum"
  docs: Fix some formatting issues in request-key.rst
  doc: ReSTify keys-trusted-encrypted.txt
  doc: ReSTify keys-request-key.txt
  ...
2017-07-03 21:13:25 -07:00
Linus Torvalds
f4dd029ee0 Char/Misc patches for 4.13-rc1
Here is the "big" char/misc driver patchset for 4.13-rc1.
 
 Lots of stuff in here, a large thunderbolt update, w1 driver header
 reorg, the new mux driver subsystem, google firmware driver updates, and
 a raft of other smaller things.  Full details in the shortlog.
 
 All of these have been in linux-next for a while with the only reported
 issue being a merge problem with this tree and the jc-docs tree in the
 w1 documentation area.  The fix should be obvious for what to do when it
 happens, if not, we can send a follow-up patch for it afterward.
 
 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
 -----BEGIN PGP SIGNATURE-----
 
 iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCWVpXKA8cZ3JlZ0Brcm9h
 aC5jb20ACgkQMUfUDdst+ynLrQCdG9SxRjAbOd6pT9Fr2NAzpUG84YsAoLw+I3iO
 EMi60UXWqAFJbtVMS9Aj
 =yrSq
 -----END PGP SIGNATURE-----

Merge tag 'char-misc-4.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc

Pull char/misc updates from Greg KH:
 "Here is the "big" char/misc driver patchset for 4.13-rc1.

  Lots of stuff in here, a large thunderbolt update, w1 driver header
  reorg, the new mux driver subsystem, google firmware driver updates,
  and a raft of other smaller things. Full details in the shortlog.

  All of these have been in linux-next for a while with the only
  reported issue being a merge problem with this tree and the jc-docs
  tree in the w1 documentation area"

* tag 'char-misc-4.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc: (147 commits)
  misc: apds990x: Use sysfs_match_string() helper
  mei: drop unreachable code in mei_start
  mei: validate the message header only in first fragment.
  DocBook: w1: Update W1 file locations and names in DocBook
  mux: adg792a: always require I2C support
  nvmem: rockchip-efuse: add support for rk322x-efuse
  nvmem: core: add locking to nvmem_find_cell
  nvmem: core: Call put_device() in nvmem_unregister()
  nvmem: core: fix leaks on registration errors
  nvmem: correct Broadcom OTP controller driver writes
  w1: Add subsystem kernel public interface
  drivers/fsi: Add module license to core driver
  drivers/fsi: Use asynchronous slave mode
  drivers/fsi: Add hub master support
  drivers/fsi: Add SCOM FSI client device driver
  drivers/fsi/gpio: Add tracepoints for GPIO master
  drivers/fsi: Add GPIO based FSI master
  drivers/fsi: Document FSI master sysfs files in ABI
  drivers/fsi: Add error handling for slave
  drivers/fsi: Add tracepoints for low-level operations
  ...
2017-07-03 20:55:59 -07:00
Linus Torvalds
974668417b driver core patches for 4.13-rc1
Here is the big driver core update for 4.13-rc1.
 
 The large majority of this is a lot of cleanup of old fields in the
 driver core structures and their remaining usages in random drivers.
 All of those fixes have been reviewed by the various subsystem
 maintainers.  There's also some small firmware updates in here, a new
 kobject uevent api interface that makes userspace interaction easier,
 and a few other minor things.
 
 All of these have been in linux-next for a long while with no reported
 issues.
 
 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
 -----BEGIN PGP SIGNATURE-----
 
 iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCWVpX4A8cZ3JlZ0Brcm9h
 aC5jb20ACgkQMUfUDdst+ymobgCfd0d13IfpZoq1N41wc6z2Z0xD7cwAnRMeH1/p
 kEeISGpHPYP9f8PBh9FO
 =Hfqt
 -----END PGP SIGNATURE-----

Merge tag 'driver-core-4.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core

Pull driver core updates from Greg KH:
 "Here is the big driver core update for 4.13-rc1.

  The large majority of this is a lot of cleanup of old fields in the
  driver core structures and their remaining usages in random drivers.
  All of those fixes have been reviewed by the various subsystem
  maintainers. There's also some small firmware updates in here, a new
  kobject uevent api interface that makes userspace interaction easier,
  and a few other minor things.

  All of these have been in linux-next for a long while with no reported
  issues"

* tag 'driver-core-4.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (56 commits)
  arm: mach-rpc: ecard: fix build error
  zram: convert remaining CLASS_ATTR() to CLASS_ATTR_RO()
  driver-core: remove struct bus_type.dev_attrs
  powerpc: vio_cmo: use dev_groups and not dev_attrs for bus_type
  powerpc: vio: use dev_groups and not dev_attrs for bus_type
  USB: usbip: convert to use DRIVER_ATTR_RW
  s390: drivers: convert to use DRIVER_ATTR_RO/WO
  platform: thinkpad_acpi: convert to use DRIVER_ATTR_RO/RW
  pcmcia: ds: convert to use DRIVER_ATTR_RO
  wireless: ipw2x00: convert to use DRIVER_ATTR_RW
  net: ehea: convert to use DRIVER_ATTR_RO
  net: caif: convert to use DRIVER_ATTR_RO
  TTY: hvc: convert to use DRIVER_ATTR_RW
  PCI: pci-driver: convert to use DRIVER_ATTR_WO
  IB: nes: convert to use DRIVER_ATTR_RW
  HID: hid-core: convert to use DRIVER_ATTR_RO and drv_groups
  arm: ecard: fix dev_groups patch typo
  tty: serdev: use dev_groups and not dev_attrs for bus_type
  sparc: vio: use dev_groups and not dev_attrs for bus_type
  hid: intel-ish-hid: use dev_groups and not dev_attrs for bus_type
  ...
2017-07-03 20:27:48 -07:00
Linus Torvalds
9a9594efe5 Merge branch 'smp-hotplug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull SMP hotplug updates from Thomas Gleixner:
 "This update is primarily a cleanup of the CPU hotplug locking code.

  The hotplug locking mechanism is an open coded RWSEM, which allows
  recursive locking. The main problem with that is the recursive nature
  as it evades the full lockdep coverage and hides potential deadlocks.

  The rework replaces the open coded RWSEM with a percpu RWSEM and
  establishes full lockdep coverage that way.

  The bulk of the changes fix up recursive locking issues and address
  the now fully reported potential deadlocks all over the place. Some of
  these deadlocks have been observed in the RT tree, but on mainline the
  probability was low enough to hide them away."

* 'smp-hotplug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (37 commits)
  cpu/hotplug: Constify attribute_group structures
  powerpc: Only obtain cpu_hotplug_lock if called by rtasd
  ARM/hw_breakpoint: Fix possible recursive locking for arch_hw_breakpoint_init
  cpu/hotplug: Remove unused check_for_tasks() function
  perf/core: Don't release cred_guard_mutex if not taken
  cpuhotplug: Link lock stacks for hotplug callbacks
  acpi/processor: Prevent cpu hotplug deadlock
  sched: Provide is_percpu_thread() helper
  cpu/hotplug: Convert hotplug locking to percpu rwsem
  s390: Prevent hotplug rwsem recursion
  arm: Prevent hotplug rwsem recursion
  arm64: Prevent cpu hotplug rwsem recursion
  kprobes: Cure hotplug lock ordering issues
  jump_label: Reorder hotplug lock and jump_label_lock
  perf/tracing/cpuhotplug: Fix locking order
  ACPI/processor: Use cpu_hotplug_disable() instead of get_online_cpus()
  PCI: Replace the racy recursion prevention
  PCI: Use cpu_hotplug_disable() instead of get_online_cpus()
  perf/x86/intel: Drop get_online_cpus() in intel_snb_check_microcode()
  x86/perf: Drop EXPORT of perf_check_microcode
  ...
2017-07-03 18:08:06 -07:00
Linus Torvalds
03ffbcdd78 Merge branch 'irq-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irq updates from Thomas Gleixner:
 "The irq department delivers:

   - Expand the generic infrastructure handling the irq migration on CPU
     hotplug and convert X86 over to it. (Thomas Gleixner)

     Aside of consolidating code this is a preparatory change for:

   - Finalizing the affinity management for multi-queue devices. The
     main change here is to shut down interrupts which are affine to a
     outgoing CPU and reenabling them when the CPU comes online again.
     That avoids moving interrupts pointlessly around and breaking and
     reestablishing affinities for no value. (Christoph Hellwig)

     Note: This contains also the BLOCK-MQ and NVME changes which depend
     on the rework of the irq core infrastructure. Jens acked them and
     agreed that they should go with the irq changes.

   - Consolidation of irq domain code (Marc Zyngier)

   - State tracking consolidation in the core code (Jeffy Chen)

   - Add debug infrastructure for hierarchical irq domains (Thomas
     Gleixner)

   - Infrastructure enhancement for managing generic interrupt chips via
     devmem (Bartosz Golaszewski)

   - Constification work all over the place (Tobias Klauser)

   - Two new interrupt controller drivers for MVEBU (Thomas Petazzoni)

   - The usual set of fixes, updates and enhancements all over the
     place"

* 'irq-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (112 commits)
  irqchip/or1k-pic: Fix interrupt acknowledgement
  irqchip/irq-mvebu-gicp: Allocate enough memory for spi_bitmap
  irqchip/gic-v3: Fix out-of-bound access in gic_set_affinity
  nvme: Allocate queues for all possible CPUs
  blk-mq: Create hctx for each present CPU
  blk-mq: Include all present CPUs in the default queue mapping
  genirq: Avoid unnecessary low level irq function calls
  genirq: Set irq masked state when initializing irq_desc
  genirq/timings: Add infrastructure for estimating the next interrupt arrival time
  genirq/timings: Add infrastructure to track the interrupt timings
  genirq/debugfs: Remove pointless NULL pointer check
  irqchip/gic-v3-its: Don't assume GICv3 hardware supports 16bit INTID
  irqchip/gic-v3-its: Add ACPI NUMA node mapping
  irqchip/gic-v3-its-platform-msi: Make of_device_ids const
  irqchip/gic-v3-its: Make of_device_ids const
  irqchip/irq-mvebu-icu: Add new driver for Marvell ICU
  irqchip/irq-mvebu-gicp: Add new driver for Marvell GICP
  dt-bindings/interrupt-controller: Add DT binding for the Marvell ICU
  genirq/irqdomain: Remove auto-recursive hierarchy support
  irqchip/MSI: Use irq_domain_update_bus_token instead of an open coded access
  ...
2017-07-03 16:50:31 -07:00
Linus Torvalds
1b044f1cfc Merge branch 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer updates from Thomas Gleixner:
 "A rather large update for timers/timekeeping:

   - compat syscall consolidation (Al Viro)

   - Posix timer consolidation (Christoph Helwig / Thomas Gleixner)

   - Cleanup of the device tree based initialization for clockevents and
     clocksources (Daniel Lezcano)

   - Consolidation of the FTTMR010 clocksource/event driver (Linus
     Walleij)

   - The usual set of small fixes and updates all over the place"

* 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (93 commits)
  timers: Make the cpu base lock raw
  clocksource/drivers/mips-gic-timer: Fix an error code in 'gic_clocksource_of_init()'
  clocksource/drivers/fsl_ftm_timer: Unmap region obtained by of_iomap
  clocksource/drivers/tcb_clksrc: Make IO endian agnostic
  clocksource/drivers/sun4i: Switch to the timer-of common init
  clocksource/drivers/timer-of: Fix invalid iomap check
  Revert "ktime: Simplify ktime_compare implementation"
  clocksource/drivers: Fix uninitialized variable use in timer_of_init
  kselftests: timers: Add test for frequency step
  kselftests: timers: Fix inconsistency-check to not ignore first timestamp
  time: Add warning about imminent deprecation of CONFIG_GENERIC_TIME_VSYSCALL_OLD
  time: Clean up CLOCK_MONOTONIC_RAW time handling
  posix-cpu-timers: Make timespec to nsec conversion safe
  itimer: Make timeval to nsec conversion range limited
  timers: Fix parameter description of try_to_del_timer_sync()
  ktime: Simplify ktime_compare implementation
  clocksource/drivers/fttmr010: Factor out clock read code
  clocksource/drivers/fttmr010: Implement delay timer
  clocksource/drivers: Add timer-of common init routine
  clocksource/drivers/tcb_clksrc: Save timer context on suspend/resume
  ...
2017-07-03 16:14:51 -07:00
Linus Torvalds
241e5e6f04 m68k updates for 4.13
- Nubus improvements and cleanups,
   - Defconfig updates,
   - Fix debugger syscall restart interactions, leading to the global
     removal of ptrace_signal_deliver().
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJZWfHjAAoJEEgEtLw/Ve779c0QAKUjc6Gow7Z6RC5QdtjZs9QR
 LWREsHc/GRP2J477yUYSGonKvI10O4HL6boruux3BVdNVMZBpIy/LB8dTCYDnzCI
 i8+bJ6eeG6q/xut2anV19v+RiblEFxP7dcZj5acoxrlmEhJEUFbZe8jFHD8LJoiZ
 A1er1sUGfsCF+nlMPeh0+QuomGOXSJZBEtAB/zjddU+JGpFDnr0qQfAR2rAno44S
 i2GbfIAtPXrvSupfDIjEJTL0MuY+3q3HlhlrPvlOLwNJL0CnmdQDsMFupg5F9A2P
 ycIKRJ862Kvg/6GDxgbtap6gUj13FmxwXs1yOu5R2Laevhr5gAR0v1vLB4iynvRt
 Ib3CTUfKSord11dbLH89xB9Z7hPMSf5wh+GA1KGYCzkP16GWO72O89qmu9QkI03H
 naVDpSIvje0zgidrtQ5LIPZi1BRrDKte5EcSJyQoNcrG0H7L5Rq0FHSCa22kObSt
 ZBCNReCWImZ2RT7FTx3jyu/8xF20g39C+oPdJrR0NwqcfsNO0aRlmzPxiqGR9XZg
 XtpCbL23wdR0aKiriDMDip52GGhQ4vdCzdCn+KCcI1DfaCGk+7/08YgBRThHmU8Y
 pa7yl9sjxvTr+U/mz8S0ZwFxvB0zoEIkx0efiVOJYpqCrrpz7iuPLOqEkHvIFQ9G
 rY49u5m8Lo+Y2QEQ1jQU
 =kAE1
 -----END PGP SIGNATURE-----

Merge tag 'm68k-for-v4.13-tag1' of git://git.kernel.org/pub/scm/linux/kernel/git/geert/linux-m68k

Pull m68k updates from Geert Uytterhoeven:

  - NuBus improvements and cleanups

  - defconfig updates

  - Fix debugger syscall restart interactions, leading to the global
    removal of ptrace_signal_deliver()

* tag 'm68k-for-v4.13-tag1' of git://git.kernel.org/pub/scm/linux/kernel/git/geert/linux-m68k:
  m68k: Remove ptrace_signal_deliver
  m68k/defconfig: Update defconfigs for v4.12-rc1
  nubus: Fix pointer validation
  nubus: Remove slot zero probe
2017-07-03 15:12:52 -07:00
Linus Torvalds
59b60185b4 Merge branch 'timers-nohz-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull nohz updates from Ingo Molnar:
 "The main changes in this cycle relate to fixing another bad (but
  sporadic and hard to detect) interaction between the dynticks
  scheduler tick and hrtimers, plus related improvements to better
  detection and handling of similar problems - by Frédéric Weisbecker"

* 'timers-nohz-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  nohz: Fix spurious warning when hrtimer and clockevent get out of sync
  nohz: Fix buggy tick delay on IRQ storms
  nohz: Reset next_tick cache even when the timer has no regs
  nohz: Fix collision between tick and other hrtimers, again
  nohz: Add hrtimer sanity check
2017-07-03 13:33:57 -07:00
Linus Torvalds
9bd42183b9 Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler updates from Ingo Molnar:
 "The main changes in this cycle were:

   - Add the SYSTEM_SCHEDULING bootup state to move various scheduler
     debug checks earlier into the bootup. This turns silent and
     sporadically deadly bugs into nice, deterministic splats. Fix some
     of the splats that triggered. (Thomas Gleixner)

   - A round of restructuring and refactoring of the load-balancing and
     topology code (Peter Zijlstra)

   - Another round of consolidating ~20 of incremental scheduler code
     history: this time in terms of wait-queue nomenclature. (I didn't
     get much feedback on these renaming patches, and we can still
     easily change any names I might have misplaced, so if anyone hates
     a new name, please holler and I'll fix it.) (Ingo Molnar)

   - sched/numa improvements, fixes and updates (Rik van Riel)

   - Another round of x86/tsc scheduler clock code improvements, in hope
     of making it more robust (Peter Zijlstra)

   - Improve NOHZ behavior (Frederic Weisbecker)

   - Deadline scheduler improvements and fixes (Luca Abeni, Daniel
     Bristot de Oliveira)

   - Simplify and optimize the topology setup code (Lauro Ramos
     Venancio)

   - Debloat and decouple scheduler code some more (Nicolas Pitre)

   - Simplify code by making better use of llist primitives (Byungchul
     Park)

   - ... plus other fixes and improvements"

* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (103 commits)
  sched/cputime: Refactor the cputime_adjust() code
  sched/debug: Expose the number of RT/DL tasks that can migrate
  sched/numa: Hide numa_wake_affine() from UP build
  sched/fair: Remove effective_load()
  sched/numa: Implement NUMA node level wake_affine()
  sched/fair: Simplify wake_affine() for the single socket case
  sched/numa: Override part of migrate_degrades_locality() when idle balancing
  sched/rt: Move RT related code from sched/core.c to sched/rt.c
  sched/deadline: Move DL related code from sched/core.c to sched/deadline.c
  sched/cpuset: Only offer CONFIG_CPUSETS if SMP is enabled
  sched/fair: Spare idle load balancing on nohz_full CPUs
  nohz: Move idle balancer registration to the idle path
  sched/loadavg: Generalize "_idle" naming to "_nohz"
  sched/core: Drop the unused try_get_task_struct() helper function
  sched/fair: WARN() and refuse to set buddy when !se->on_rq
  sched/debug: Fix SCHED_WARN_ON() to return a value on !CONFIG_SCHED_DEBUG as well
  sched/wait: Disambiguate wq_entry->task_list and wq_head->task_list naming
  sched/wait: Move bit_wait_table[] and related functionality from sched/core.c to sched/wait_bit.c
  sched/wait: Split out the wait_bit*() APIs from <linux/wait.h> into <linux/wait_bit.h>
  sched/wait: Re-adjust macro line continuation backslashes in <linux/wait.h>
  ...
2017-07-03 13:08:04 -07:00
Linus Torvalds
7447d56217 Merge branch 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf updates from Ingo Molnar:
 "Most of the changes are for tooling, the main changes in this cycle were:

   - Improve Intel-PT hardware tracing support, both on the kernel and
     on the tooling side: PTWRITE instruction support, power events for
     C-state tracing, etc. (Adrian Hunter)

   - Add support to measure SMI cost to the x86 architecture, with
     tooling support in 'perf stat' (Kan Liang)

   - Support function filtering in 'perf ftrace', plus related
     improvements (Namhyung Kim)

   - Allow adding and removing fields to the default 'perf script'
     columns, using + or - as field prefixes to do so (Andi Kleen)

   - Allow resolving the DSO name with 'perf script -F brstack{sym,off},dso'
     (Mark Santaniello)

   - Add perf tooling unwind support for PowerPC (Paolo Bonzini)

   - ... and various other improvements as well"

* 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (84 commits)
  perf auxtrace: Add CPU filter support
  perf intel-pt: Do not use TSC packets for calculating CPU cycles to TSC
  perf intel-pt: Update documentation to include new ptwrite and power events
  perf intel-pt: Add example script for power events and PTWRITE
  perf intel-pt: Synthesize new power and "ptwrite" events
  perf intel-pt: Move code in intel_pt_synth_events() to simplify attr setting
  perf intel-pt: Factor out intel_pt_set_event_name()
  perf intel-pt: Tidy messages into called function intel_pt_synth_event()
  perf intel-pt: Tidy Intel PT evsel lookup into separate function
  perf intel-pt: Join needlessly wrapped lines
  perf intel-pt: Remove unused instructions_sample_period
  perf intel-pt: Factor out common code synthesizing event samples
  perf script: Add synthesized Intel PT power and ptwrite events
  perf/x86/intel: Constify the 'lbr_desc[]' array and make a function static
  perf script: Add 'synth' field for synthesized event payloads
  perf auxtrace: Add itrace option to output power events
  perf auxtrace: Add itrace option to output ptwrite events
  tools include: Add byte-swapping macros to kernel.h
  perf script: Add 'synth' event type for synthesized events
  x86/insn: perf tools: Add new ptwrite instruction
  ...
2017-07-03 12:40:46 -07:00
Linus Torvalds
892ad5acca Merge branch 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull locking updates from Ingo Molnar:
 "The main changes in this cycle were:

   - Add CONFIG_REFCOUNT_FULL=y to allow the disabling of the 'full'
     (robustness checked) refcount_t implementation with slightly lower
     runtime overhead. (Kees Cook)

     The lighter weight variant is the default. The two variants use the
     same API. Having this variant was a precondition by some
     maintainers to merge refcount_t cleanups.

   - Add lockdep support for rtmutexes (Peter Zijlstra)

   - liblockdep fixes and improvements (Sasha Levin, Ben Hutchings)

   - ... misc fixes and improvements"

* 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (30 commits)
  locking/refcount: Remove the half-implemented refcount_sub() API
  locking/refcount: Create unchecked atomic_t implementation
  locking/rtmutex: Don't initialize lockdep when not required
  locking/selftest: Add RT-mutex support
  locking/selftest: Remove the bad unlock ordering test
  rt_mutex: Add lockdep annotations
  MAINTAINERS: Claim atomic*_t maintainership
  locking/x86: Remove the unused atomic_inc_short() methd
  tools/lib/lockdep: Remove private kernel headers
  tools/lib/lockdep: Hide liblockdep output from test results
  tools/lib/lockdep: Add dummy current_gfp_context()
  tools/include: Add IS_ERR_OR_NULL to err.h
  tools/lib/lockdep: Add empty __is_[module,kernel]_percpu_address
  tools/lib/lockdep: Include err.h
  tools/include: Add (mostly) empty include/linux/sched/mm.h
  tools/lib/lockdep: Use LDFLAGS
  tools/lib/lockdep: Remove double-quotes from soname
  tools/lib/lockdep: Fix object file paths used in an out-of-tree build
  tools/lib/lockdep: Fix compilation for 4.11
  tools/lib/lockdep: Don't mix fd-based and stream IO
  ...
2017-07-03 12:14:18 -07:00
Linus Torvalds
330e9e4625 Merge branch 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull RCU updates from Ingo Molnar:
 "The sole purpose of these changes is to shrink and simplify the RCU
  code base, which has suffered from creeping bloat over the past couple
  of years. The end result is a net removal of ~2700 lines of code:

     79 files changed, 1496 insertions(+), 4211 deletions(-)

  Plus there's a marked reduction in the Kconfig space complexity as
  well, here's the number of matches on 'grep RCU' in the .config:

                               before       after

     x86-defconfig                 17          15
     x86-allmodconfig              33          20"

* 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (86 commits)
  rcu: Remove RCU CPU stall warnings from Tiny RCU
  rcu: Remove event tracing from Tiny RCU
  rcu: Move RCU debug Kconfig options to kernel/rcu
  rcu: Move RCU non-debug Kconfig options to kernel/rcu
  rcu: Eliminate NOCBs CPU-state Kconfig options
  rcu: Remove debugfs tracing
  srcu: Remove Classic SRCU
  srcu: Fix rcutorture-statistics typo
  rcu: Remove SPARSE_RCU_POINTER Kconfig option
  rcu: Remove the now-obsolete PROVE_RCU_REPEATEDLY Kconfig option
  rcu: Remove typecheck() from RCU locking wrapper functions
  rcu: Remove #ifdef moving rcu_end_inkernel_boot from rcupdate.h
  rcu: Remove nohz_full full-system-idle state machine
  rcu: Remove the RCU_KTHREAD_PRIO Kconfig option
  rcu: Remove *_SLOW_* Kconfig options
  srcu: Use rnp->lock wrappers to replace explicit memory barriers
  rcu: Move rnp->lock wrappers for SRCU use
  rcu: Convert rnp->lock wrappers to macros for SRCU use
  rcu: Refactor #includes from include/linux/rcupdate.h
  bcm47xx: Fix build regression
  ...
2017-07-03 11:34:53 -07:00
Linus Torvalds
e94693f797 Merge branch 'core-objtool-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull objtool updates from Ingo Molnar:
 "This is an extensive rewrite of the objdump tool to track all stack
  pointer modifications through the machine instructions of disassembled
  functions found in kernel .o files.

  This re-design removes the prior dependency on CONFIG_FRAME_POINTERS,
  with the goal to prepare the tool to generate kernel debuginfo data in
  the future. There's also an increase in checking/tracking robustness
  as a side effect as well.

  No (intended) changes to existing functionality"

* 'core-objtool-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  objtool: Silence warnings for functions which use IRET
  objtool: Implement stack validation 2.0
  objtool, x86: Add several functions and files to the objtool whitelist
  objtool: Move checking code to check.c
2017-07-03 11:12:04 -07:00
Linus Torvalds
c6b1e36c8f Merge branch 'for-4.13/block' of git://git.kernel.dk/linux-block
Pull core block/IO updates from Jens Axboe:
 "This is the main pull request for the block layer for 4.13. Not a huge
  round in terms of features, but there's a lot of churn related to some
  core cleanups.

  Note this depends on the UUID tree pull request, that Christoph
  already sent out.

  This pull request contains:

   - A series from Christoph, unifying the error/stats codes in the
     block layer. We now use blk_status_t everywhere, instead of using
     different schemes for different places.

   - Also from Christoph, some cleanups around request allocation and IO
     scheduler interactions in blk-mq.

   - And yet another series from Christoph, cleaning up how we handle
     and do bounce buffering in the block layer.

   - A blk-mq debugfs series from Bart, further improving on the support
     we have for exporting internal information to aid debugging IO
     hangs or stalls.

   - Also from Bart, a series that cleans up the request initialization
     differences across types of devices.

   - A series from Goldwyn Rodrigues, allowing the block layer to return
     failure if we will block and the user asked for non-blocking.

   - Patch from Hannes for supporting setting loop devices block size to
     that of the underlying device.

   - Two series of patches from Javier, fixing various issues with
     lightnvm, particular around pblk.

   - A series from me, adding support for write hints. This comes with
     NVMe support as well, so applications can help guide data placement
     on flash to improve performance, latencies, and write
     amplification.

   - A series from Ming, improving and hardening blk-mq support for
     stopping/starting and quiescing hardware queues.

   - Two pull requests for NVMe updates. Nothing major on the feature
     side, but lots of cleanups and bug fixes. From the usual crew.

   - A series from Neil Brown, greatly improving the bio rescue set
     support. Most notably, this kills the bio rescue work queues, if we
     don't really need them.

   - Lots of other little bug fixes that are all over the place"

* 'for-4.13/block' of git://git.kernel.dk/linux-block: (217 commits)
  lightnvm: pblk: set line bitmap check under debug
  lightnvm: pblk: verify that cache read is still valid
  lightnvm: pblk: add initialization check
  lightnvm: pblk: remove target using async. I/Os
  lightnvm: pblk: use vmalloc for GC data buffer
  lightnvm: pblk: use right metadata buffer for recovery
  lightnvm: pblk: schedule if data is not ready
  lightnvm: pblk: remove unused return variable
  lightnvm: pblk: fix double-free on pblk init
  lightnvm: pblk: fix bad le64 assignations
  nvme: Makefile: remove dead build rule
  blk-mq: map all HWQ also in hyperthreaded system
  nvmet-rdma: register ib_client to not deadlock in device removal
  nvme_fc: fix error recovery on link down.
  nvmet_fc: fix crashes on bad opcodes
  nvme_fc: Fix crash when nvme controller connection fails.
  nvme_fc: replace ioabort msleep loop with completion
  nvme_fc: fix double calls to nvme_cleanup_cmd()
  nvme-fabrics: verify that a controller returns the correct NQN
  nvme: simplify nvme_dev_attrs_are_visible
  ...
2017-07-03 10:34:51 -07:00
Linus Torvalds
81e3e04489 UUID/GUID updates:
- introduce the new uuid_t/guid_t types that are going to replace
    the somewhat confusing uuid_be/uuid_le types and make the terminology
    fit the various specs, as well as the userspace libuuid library.
    (me, based on a previous version from Amir)
  - consolidated generic uuid/guid helper functions lifted from XFS
    and libnvdimm (Amir and me)
  - conversions to the new types and helpers (Amir, Andy and me)
 -----BEGIN PGP SIGNATURE-----
 
 iQI/BAABCAApFiEEgdbnc3r/njty3Iq9D55TZVIEUYMFAllZfmILHGhjaEBsc3Qu
 ZGUACgkQD55TZVIEUYMvyg/9EvWHOOsSdeDykCK3KdH2uIqnxwpl+m7ljccaGJIc
 MmaH0KnsP9p/Cuw5hESh2tYlmCYN7pmYziNXpf/LRS65/HpEYbs4oMqo8UQsN0UM
 2IXHfXY0HnCoG5OixH8RNbFTkxuGphsTY8meaiDr6aAmqChDQI2yGgQLo3WM2/Qe
 R9N1KoBWH/bqY6dHv+urlFwtsREm2fBH+8ovVma3TO73uZCzJGLJBWy3anmZN+08
 uYfdbLSyRN0T8rqemVdzsZ2SrpHYkIsYGUZV43F581vp8e/3OKMoMxpWRRd9fEsa
 MXmoaHcLJoBsyVSFR9lcx3axKrhAgBPZljASbbA0h49JneWXrzghnKBQZG2SnEdA
 ktHQ2sE4Yb5TZSvvWEKMQa3kXhEfIbTwgvbHpcDr5BUZX8WvEw2Zq8e7+Mi4+KJw
 QkvFC1S96tRYO2bxdJX638uSesGUhSidb+hJ/edaOCB/GK+sLhUdDTJgwDpUGmyA
 xVXTF51ramRS2vhlbzN79x9g33igIoNnG4/PV0FPvpCTSqxkHmPc5mK6Vals1lqt
 cW6XfUjSQECq5nmTBtYDTbA/T+8HhBgSQnrrvmferjJzZUFGr/7MXl+Evz2x4CjX
 OBQoAMu241w6Vp3zoXqxzv+muZ/NLar52M/zbi9TUjE0GvvRNkHvgCC4NmpIlWYJ
 Sxg=
 =J/4P
 -----END PGP SIGNATURE-----

Merge tag 'uuid-for-4.13' of git://git.infradead.org/users/hch/uuid

Pull uuid subsystem from Christoph Hellwig:
 "This is the new uuid subsystem, in which Amir, Andy and I have started
  consolidating our uuid/guid helpers and improving the types used for
  them. Note that various other subsystems have pulled in this tree, so
  I'd like it to go in early.

  UUID/GUID summary:

   - introduce the new uuid_t/guid_t types that are going to replace the
     somewhat confusing uuid_be/uuid_le types and make the terminology
     fit the various specs, as well as the userspace libuuid library.
     (me, based on a previous version from Amir)

   - consolidated generic uuid/guid helper functions lifted from XFS and
     libnvdimm (Amir and me)

   - conversions to the new types and helpers (Amir, Andy and me)"

* tag 'uuid-for-4.13' of git://git.infradead.org/users/hch/uuid: (34 commits)
  ACPI: hns_dsaf_acpi_dsm_guid can be static
  mmc: sdhci-pci: make guid intel_dsm_guid static
  uuid: Take const on input of uuid_is_null() and guid_is_null()
  thermal: int340x_thermal: fix compile after the UUID API switch
  thermal: int340x_thermal: Switch to use new generic UUID API
  acpi: always include uuid.h
  ACPI: Switch to use generic guid_t in acpi_evaluate_dsm()
  ACPI / extlog: Switch to use new generic UUID API
  ACPI / bus: Switch to use new generic UUID API
  ACPI / APEI: Switch to use new generic UUID API
  acpi, nfit: Switch to use new generic UUID API
  MAINTAINERS: add uuid entry
  tmpfs: generate random sb->s_uuid
  scsi_debug: switch to uuid_t
  nvme: switch to uuid_t
  sysctl: switch to use uuid_t
  partitions/ldm: switch to use uuid_t
  overlayfs: use uuid_t instead of uuid_be
  fs: switch ->s_uuid to uuid_t
  ima/policy: switch to use uuid_t
  ...
2017-07-03 09:55:26 -07:00
Petr Mladek
a5707eef79 Merge branch 'for-4.13' into for-linus 2017-07-03 15:33:39 +02:00
Rafael J. Wysocki
8f8e5c3e27 Merge branch 'acpi-pm'
* acpi-pm:
  PM / core: Drop run_wake flag from struct dev_pm_info
  PCI / PM: Simplify device wakeup settings code
  PCI / PM: Drop pme_interrupt flag from struct pci_dev
  ACPI / PM: Consolidate device wakeup settings code
  ACPI / PM: Drop run_wake from struct acpi_device_wakeup_flags
  ACPI / sleep: EC-based wakeup from suspend-to-idle on recent systems
  platform: x86: intel-hid: Wake up the system from suspend-to-idle
  platform: x86: intel-vbtn: Wake up the system from suspend-to-idle
  ACPI / PM: Ignore spurious SCI wakeups from suspend-to-idle
  platform/x86: Add driver for ACPI INT0002 Virtual GPIO device
  PCI / PM: Restore PME Enable if skipping wakeup setup
  PM / sleep: Print timing information if debug is enabled
  ACPI / PM: Clean up device wakeup enable/disable code
  ACPI / PM: Change log level of wakeup-related message
  USB / PCI / PM: Allow the PCI core to do the resume cleanup
  ACPI / PM: Run wakeup notify handlers synchronously

Conflicts:
	drivers/base/power/main.c
2017-07-03 14:23:09 +02:00
Rafael J. Wysocki
301f8d7463 Merge branch 'pm-sleep'
* pm-sleep:
  PM: hibernate: constify attribute_group structures.
  PM / hibernate: Drop redundant parameter of swsusp_alloc()
  PM / hibernate: Use CONFIG_HAVE_SET_MEMORY for include condition
  x86/power/64: Use char arrays for asm function names
2017-07-03 14:21:33 +02:00
Rafael J. Wysocki
875aabf52e Merge branch 'uuid-types'
Merge 'uuid-types' from git://git.infradead.org/users/hch/uuid.git
2017-07-03 14:13:44 +02:00
John Fastabend
43188702b3 bpf, verifier: add additional patterns to evaluate_reg_imm_alu
Currently the verifier does not track imm across alu operations when
the source register is of unknown type. This adds additional pattern
matching to catch this and track imm. We've seen LLVM generating this
pattern while working on cilium.

Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-03 02:22:52 -07:00
John Fastabend
7bda4b40c5 bpf: extend bpf_trace_printk to support %i
Currently, bpf_trace_printk does not support common formatting
symbol '%i' however vsprintf does and is what eventually gets
called by bpf helper. If users are used to '%i' and currently
make use of it, then bpf_trace_printk will just return with
error without dumping anything to the trace pipe, so just add
support for '%i' to the helper.

Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-03 02:22:52 -07:00
Daniel Borkmann
9780c0ab1a bpf: export whether tail call has jited owner
We do export through fdinfo already whether a prog is JITed or not,
given a program load can fail in case of either prog or tail call map
has JITed property, but neither both are JITed or not JITed, we can
facilitate error reporting in loaders like iproute2 through exporting
owner_jited of tail call map. We already do export owner_prog_type
through this facility, so parser can pick up both for comparison.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-03 02:22:52 -07:00
Daniel Borkmann
f96da09473 bpf: simplify narrower ctx access
This work tries to make the semantics and code around the
narrower ctx access a bit easier to follow. Right now
everything is done inside the .is_valid_access(). Offset
matching is done differently for read/write types, meaning
writes don't support narrower access and thus matching only
on offsetof(struct foo, bar) is enough whereas for read
case that supports narrower access we must check for
offsetof(struct foo, bar) + offsetof(struct foo, bar) +
sizeof(<bar>) - 1 for each of the cases. For read cases of
individual members that don't support narrower access (like
packet pointers or skb->cb[] case which has its own narrow
access logic), we check as usual only offsetof(struct foo,
bar) like in write case. Then, for the case where narrower
access is allowed, we also need to set the aux info for the
access. Meaning, ctx_field_size and converted_op_size have
to be set. First is the original field size e.g. sizeof(<bar>)
as in above example from the user facing ctx, and latter
one is the target size after actual rewrite happened, thus
for the kernel facing ctx. Also here we need the range match
and we need to keep track changing convert_ctx_access() and
converted_op_size from is_valid_access() as both are not at
the same location.

We can simplify the code a bit: check_ctx_access() becomes
simpler in that we only store ctx_field_size as a meta data
and later in convert_ctx_accesses() we fetch the target_size
right from the location where we do convert. Should the verifier
be misconfigured we do reject for BPF_WRITE cases or target_size
that are not provided. For the subsystems, we always work on
ranges in is_valid_access() and add small helpers for ranges
and narrow access, convert_ctx_accesses() sets target_size
for the relevant instruction.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Cc: Yonghong Song <yhs@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-03 02:22:52 -07:00
Lawrence Brakmo
40304b2a15 bpf: BPF support for sock_ops
Created a new BPF program type, BPF_PROG_TYPE_SOCK_OPS, and a corresponding
struct that allows BPF programs of this type to access some of the
socket's fields (such as IP addresses, ports, etc.). It uses the
existing bpf cgroups infrastructure so the programs can be attached per
cgroup with full inheritance support. The program will be called at
appropriate times to set relevant connections parameters such as buffer
sizes, SYN and SYN-ACK RTOs, etc., based on connection information such
as IP addresses, port numbers, etc.

Alghough there are already 3 mechanisms to set parameters (sysctls,
route metrics and setsockopts), this new mechanism provides some
distinct advantages. Unlike sysctls, it can set parameters per
connection. In contrast to route metrics, it can also use port numbers
and information provided by a user level program. In addition, it could
set parameters probabilistically for evaluation purposes (i.e. do
something different on 10% of the flows and compare results with the
other 90% of the flows). Also, in cases where IPv6 addresses contain
geographic information, the rules to make changes based on the distance
(or RTT) between the hosts are much easier than route metric rules and
can be global. Finally, unlike setsockopt, it oes not require
application changes and it can be updated easily at any time.

Although the bpf cgroup framework already contains a sock related
program type (BPF_PROG_TYPE_CGROUP_SOCK), I created the new type
(BPF_PROG_TYPE_SOCK_OPS) beccause the existing type expects to be called
only once during the connections's lifetime. In contrast, the new
program type will be called multiple times from different places in the
network stack code.  For example, before sending SYN and SYN-ACKs to set
an appropriate timeout, when the connection is established to set
congestion control, etc. As a result it has "op" field to specify the
type of operation requested.

The purpose of this new program type is to simplify setting connection
parameters, such as buffer sizes, TCP's SYN RTO, etc. For example, it is
easy to use facebook's internal IPv6 addresses to determine if both hosts
of a connection are in the same datacenter. Therefore, it is easy to
write a BPF program to choose a small SYN RTO value when both hosts are
in the same datacenter.

This patch only contains the framework to support the new BPF program
type, following patches add the functionality to set various connection
parameters.

This patch defines a new BPF program type: BPF_PROG_TYPE_SOCKET_OPS
and a new bpf syscall command to load a new program of this type:
BPF_PROG_LOAD_SOCKET_OPS.

Two new corresponding structs (one for the kernel one for the user/BPF
program):

/* kernel version */
struct bpf_sock_ops_kern {
        struct sock *sk;
        __u32  op;
        union {
                __u32 reply;
                __u32 replylong[4];
        };
};

/* user version
 * Some fields are in network byte order reflecting the sock struct
 * Use the bpf_ntohl helper macro in samples/bpf/bpf_endian.h to
 * convert them to host byte order.
 */
struct bpf_sock_ops {
        __u32 op;
        union {
                __u32 reply;
                __u32 replylong[4];
        };
        __u32 family;
        __u32 remote_ip4;     /* In network byte order */
        __u32 local_ip4;      /* In network byte order */
        __u32 remote_ip6[4];  /* In network byte order */
        __u32 local_ip6[4];   /* In network byte order */
        __u32 remote_port;    /* In network byte order */
        __u32 local_port;     /* In host byte horder */
};

Currently there are two types of ops. The first type expects the BPF
program to return a value which is then used by the caller (or a
negative value to indicate the operation is not supported). The second
type expects state changes to be done by the BPF program, for example
through a setsockopt BPF helper function, and they ignore the return
value.

The reply fields of the bpf_sockt_ops struct are there in case a bpf
program needs to return a value larger than an integer.

Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-01 16:15:13 -07:00
Linus Torvalds
c0a0c7a4e1 Two fixes:
One is for a crash when using the :mod: trace probe command into
  stack_trace_filter. This bug was introduced during the last merge
  window.
 
  The other was there forever. It's a small bug that makes it impossible
  to name a module function for kprobes when the module starts with a digit.
 -----BEGIN PGP SIGNATURE-----
 
 iQExBAABCAAbBQJZVsqbFBxyb3N0ZWR0QGdvb2RtaXMub3JnAAoJEMm5BfJq2Y3L
 AsoH+wWK8tsDqI6MBTXO+v5RwLUu/zHClcMEcJGKLFsHRZ8HOJ9Afg+c1LbTujRR
 Ck20l+U/DibVO1AnjJ9elJDj7/3ajTUfCrTVCKf5B9XbdzAD2qZle3byynhZvm2Q
 CwVbzMvRYd5jZzEiO95YKOhH6iIDfXOZM7vQzz0F/bDZn9uAxiFumwdiSNA+f2wP
 6Ykeuth/IZh0xdbaTqsH1XvhweUSpIIjhOZH/V/uAg+LuRffh4sOBR/wjYUMuegX
 tgpGfJu8VVsa+GIBcThdCkjy1k48GLBsZ6j/47Lhc5r3GNIVR7D7mb9x3zTn1U30
 WTeDu+betctH7b03hiH88kKoEfg=
 =RXOC
 -----END PGP SIGNATURE-----

Merge tag 'trace-v4.12-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace

Pull last-minute tracing fixes from Steven Rostedt:
 "Two fixes:

  One is for a crash when using the :mod: trace probe command into
  stack_trace_filter. This bug was introduced during the last merge
  window.

  The other was there forever. It's a small bug that makes it impossible
  to name a module function for kprobes when the module starts with a
  digit"

* tag 'trace-v4.12-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
  tracing/kprobes: Allow to create probe with a module name starting with a digit
  ftrace: Fix regression with module command in stack_trace_filter
2017-06-30 17:18:57 -07:00
Kees Cook
3859a271a0 randstruct: Mark various structs for randomization
This marks many critical kernel structures for randomization. These are
structures that have been targeted in the past in security exploits, or
contain functions pointers, pointers to function pointer tables, lists,
workqueues, ref-counters, credentials, permissions, or are otherwise
sensitive. This initial list was extracted from Brad Spengler/PaX Team's
code in the last public patch of grsecurity/PaX based on my understanding
of the code. Changes or omissions from the original code are mine and
don't reflect the original grsecurity/PaX code.

Left out of this list is task_struct, which requires special handling
and will be covered in a subsequent patch.

Signed-off-by: Kees Cook <keescook@chromium.org>
2017-06-30 12:00:51 -07:00
David S. Miller
b079115937 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
A set of overlapping changes in macvlan and the rocker
driver, nothing serious.

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-30 12:43:08 -04:00
Josh Poimboeuf
c207aee480 objtool, x86: Add several functions and files to the objtool whitelist
In preparation for an objtool rewrite which will have broader checks,
whitelist functions and files which cause problems because they do
unusual things with the stack.

These whitelists serve as a TODO list for which functions and files
don't yet have undwarf unwinder coverage.  Eventually most of the
whitelists can be removed in favor of manual CFI hint annotations or
objtool improvements.

Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Jiri Slaby <jslaby@suse.cz>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: live-patching@vger.kernel.org
Link: http://lkml.kernel.org/r/7f934a5d707a574bda33ea282e9478e627fb1829.1498659915.git.jpoimboe@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-06-30 10:19:19 +02:00
Deepa Dinamani
725816e8aa posix_clocks: Use get_itimerspec64() and put_itimerspec64()
Usage of these apis and their compat versions makes
the syscalls: timer_settime and timer_gettime and their
compat implementations simpler.

This patch also serves as a preparatory patch for changing
syscalls to use new time_t data types to support the
y2038 effort by isolating the processing of user pointers
through these apis.

Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-06-30 04:15:02 -04:00
Deepa Dinamani
c0edd7c9ac nanosleep: Use get_timespec64() and put_timespec64()
Usage of these apis and their compat versions makes
the syscalls: clock_nanosleep and nanosleep and
their compat implementations simpler.

This is a preparatory patch to isolate data conversions to
struct timespec64 at userspace boundaries. This helps contain
the changes needed to transition to new y2038 safe types.

Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-06-30 04:14:14 -04:00
Deepa Dinamani
5c4994102f posix-timers: Use get_timespec64() and put_timespec64()
Usage of these apis and their compat versions makes
the syscalls: clock_gettime, clock_settime, clock_getres
and their compat implementations simpler.

This is a preparatory patch to isolate data conversions to
struct timespec64 at userspace boundaries. This helps contain
the changes needed to transition to new y2038 safe types.

Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-06-30 04:13:19 -04:00
Gustavo A. R. Silva
72298e5c92 sched/cputime: Refactor the cputime_adjust() code
Address a Coverity false positive, which is caused by overly
convoluted code:

Value assigned to variable 'utime' at line 619:utime = rtime;
is overwritten at line 642:utime = rtime - stime; before it
can be used. This makes such variable assignment useless.

Remove this variable assignment and refactor the code related.

Addresses-Coverity-ID: 1371643
Signed-off-by: Gustavo A. R. Silva <garsilva@embeddedor.com>
Cc: Frans Klaver <fransklaver@gmail.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Link: http://lkml.kernel.org/r/20170629184128.GA5271@embeddedgus
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-06-30 09:37:59 +02:00
Arvind Yadav
993647a293 cpu/hotplug: Constify attribute_group structures
attribute_groups are not supposed to change at runtime. All functions
working with attribute_groups provided by <linux/sysfs.h> work with const
attribute_group.

So mark the non-const structs as const:

File size before:
   text	   data	    bss	    dec	    hex	filename
  12582	  15361	     20	  27963	   6d3b	kernel/cpu.o

File size After adding 'const':
   text	   data	    bss	    dec	    hex	filename
  12710	  15265	     20	  27995	   6d5b	kernel/cpu.o

Signed-off-by: Arvind Yadav <arvind.yadav.cs@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: anna-maria@linutronix.de
Cc: bigeasy@linutronix.de
Cc: boris.ostrovsky@oracle.com
Cc: rcochran@linutronix.de
Link: http://lkml.kernel.org/r/f9079e94e12b36d245e7adbf67d312bc5d0250c6.1498737970.git.arvind.yadav.cs@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-06-30 09:34:39 +02:00
Daniel Bristot de Oliveira
48365b3884 sched/debug: Expose the number of RT/DL tasks that can migrate
Add the value of the rt_rq.rt_nr_migratory and dl_rq.dl_nr_migratory
to the sched_debug output, for instance:

 rt_rq[0]:
   .rt_nr_running                 : 2
   .rt_nr_migratory               : 1     <--- Like this
   .rt_throttled                  : 0
   .rt_time                       : 828.645877
   .rt_runtime                    : 1000.000000

This is useful to debug problems related to the RT/DL schedulers.

This also fixes the format of some variables, that were unsigned, rather
than signed.

Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Clark Williams <williams@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Luis Claudio R. Goncalves <lgoncalv@redhat.com>
Cc: Luiz Capitulino <lcapitulino@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-rt-users <linux-rt-users@vger.kernel.org>
Link: http://lkml.kernel.org/r/7896f71cada54ee7dd8507bb666063a2e051c3d4.1498482127.git.bristot@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-06-30 09:32:07 +02:00
Al Viro
e4448ed87c bpf: don't open-code memdup_user()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-06-30 02:04:11 -04:00
Al Viro
a9bd8dfa53 kimage_file_prepare_segments(): don't open-code memdup_user()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-06-30 02:04:10 -04:00
Sabrina Dubroca
9e52b32567 tracing/kprobes: Allow to create probe with a module name starting with a digit
Always try to parse an address, since kstrtoul() will safely fail when
given a symbol as input. If that fails (which will be the case for a
symbol), try to parse a symbol instead.

This allows creating a probe such as:

    p:probe/vlan_gro_receive 8021q:vlan_gro_receive+0

Which is necessary for this command to work:

    perf probe -m 8021q -a vlan_gro_receive

Link: http://lkml.kernel.org/r/fd72d666f45b114e2c5b9cf7e27b91de1ec966f1.1498122881.git.sd@queasysnail.net

Cc: stable@vger.kernel.org
Fixes: 413d37d1e ("tracing: Add kprobe-based event tracer")
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-06-29 23:13:23 -04:00
Linus Torvalds
4d8a991d46 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:

 1) Need to access netdev->num_rx_queues behind an accessor in netvsc
    driver otherwise the build breaks with some configs, from Arnd
    Bergmann.

 2) Add dummy xfrm_dev_event() so that build doesn't fail when
    CONFIG_XFRM_OFFLOAD is not set. From Hangbin Liu.

 3) Don't OOPS when pfkey_msg2xfrm_state() signals an erros, from Dan
    Carpenter.

 4) Fix MCDI command size for filter operations in sfc driver, from
    Martin Habets.

 5) Fix UFO segmenting so that we don't calculate incorrect checksums,
    from Michal Kubecek.

 6) When ipv6 datagram connects fail, reset destination address and
    port. From Wei Wang.

 7) TCP disconnect must reset the cached receive DST, from WANG Cong.

 8) Fix sign extension bug on 32-bit in dev_get_stats(), from Eric
    Dumazet.

 9) fman driver has to depend on HAS_DMA, from Madalin Bucur.

10) Fix bpf pointer leak with xadd in verifier, from Daniel Borkmann.

11) Fix negative page counts with GFO, from Michal Kubecek.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (41 commits)
  sfc: fix attempt to translate invalid filter ID
  net: handle NAPI_GRO_FREE_STOLEN_HEAD case also in napi_frags_finish()
  bpf: prevent leaking pointer via xadd on unpriviledged
  arcnet: com20020-pci: add missing pdev setup in netdev structure
  arcnet: com20020-pci: fix dev_id calculation
  arcnet: com20020: remove needless base_addr assignment
  Trivial fix to spelling mistake in arc_printk message
  arcnet: change irq handler to lock irqsave
  rocker: move dereference before free
  mlxsw: spectrum_router: Fix NULL pointer dereference
  net: sched: Fix one possible panic when no destroy callback
  virtio-net: serialize tx routine during reset
  net: usb: asix88179_178a: Add support for the Belkin B2B128
  fsl/fman: add dependency on HAS_DMA
  net: prevent sign extension in dev_get_stats()
  tcp: reset sk_rx_dst in tcp_disconnect()
  net: ipv6: reset daddr and dport in sk if connect() fails
  bnx2x: Don't log mc removal needlessly
  bnxt_en: Fix netpoll handling.
  bnxt_en: Add missing logic to handle TPA end error conditions.
  ...
2017-06-29 14:30:07 -07:00
Arvind Yadav
59494fe2c8 PM: hibernate: constify attribute_group structures.
attribute_groups are not supposed to change at runtime. All functions
working with attribute_groups provided by <linux/sysfs.h> work with const
attribute_group. So mark the non-const structs as const.

File size before:
   text	   data	    bss	    dec	    hex	filename
   6332	    488	    308	   7128	   1bd8	kernel/power/hibernate.o

File size After adding 'const':
   text	   data	    bss	    dec	    hex	filename
   6396	    424	    308	   7128	   1bd8	kernel/power/hibernate.o

Signed-off-by: Arvind Yadav <arvind.yadav.cs@gmail.com>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2017-06-29 23:05:48 +02:00
Daniel Borkmann
6bdf6abc56 bpf: prevent leaking pointer via xadd on unpriviledged
Leaking kernel addresses on unpriviledged is generally disallowed,
for example, verifier rejects the following:

  0: (b7) r0 = 0
  1: (18) r2 = 0xffff897e82304400
  3: (7b) *(u64 *)(r1 +48) = r2
  R2 leaks addr into ctx

Doing pointer arithmetic on them is also forbidden, so that they
don't turn into unknown value and then get leaked out. However,
there's xadd as a special case, where we don't check the src reg
for being a pointer register, e.g. the following will pass:

  0: (b7) r0 = 0
  1: (7b) *(u64 *)(r1 +48) = r0
  2: (18) r2 = 0xffff897e82304400 ; map
  4: (db) lock *(u64 *)(r1 +48) += r2
  5: (95) exit

We could store the pointer into skb->cb, loose the type context,
and then read it out from there again to leak it eventually out
of a map value. Or more easily in a different variant, too:

   0: (bf) r6 = r1
   1: (7a) *(u64 *)(r10 -8) = 0
   2: (bf) r2 = r10
   3: (07) r2 += -8
   4: (18) r1 = 0x0
   6: (85) call bpf_map_lookup_elem#1
   7: (15) if r0 == 0x0 goto pc+3
   R0=map_value(ks=8,vs=8,id=0),min_value=0,max_value=0 R6=ctx R10=fp
   8: (b7) r3 = 0
   9: (7b) *(u64 *)(r0 +0) = r3
  10: (db) lock *(u64 *)(r0 +0) += r6
  11: (b7) r0 = 0
  12: (95) exit

  from 7 to 11: R0=inv,min_value=0,max_value=0 R6=ctx R10=fp
  11: (b7) r0 = 0
  12: (95) exit

Prevent this by checking xadd src reg for pointer types. Also
add a couple of test cases related to this.

Fixes: 1be7f75d16 ("bpf: enable non-root eBPF programs")
Fixes: 17a5267067 ("bpf: verifier (add verifier core)")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-29 15:44:34 -04:00
Martin KaFai Lau
8007e40a24 bpf: Fix out-of-bound access on interpreters[]
The index is off-by-one when fp->aux->stack_depth
has already been rounded up to 32.  In particular,
if stack_depth is 512, the index will be 16.

The fix is to round_up and then takes -1 instead of round_down.

[   22.318680] ==================================================================
[   22.319745] BUG: KASAN: global-out-of-bounds in bpf_prog_select_runtime+0x48a/0x670
[   22.320737] Read of size 8 at addr ffffffff82aadae0 by task sockex3/1946
[   22.321646]
[   22.321858] CPU: 1 PID: 1946 Comm: sockex3 Tainted: G        W       4.12.0-rc6-01680-g2ee87db3a287 #22
[   22.323061] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.3-1.el7.centos 04/01/2014
[   22.324260] Call Trace:
[   22.324612]  dump_stack+0x67/0x99
[   22.325081]  print_address_description+0x1e8/0x290
[   22.325734]  ? bpf_prog_select_runtime+0x48a/0x670
[   22.326360]  kasan_report+0x265/0x350
[   22.326860]  __asan_report_load8_noabort+0x19/0x20
[   22.327484]  bpf_prog_select_runtime+0x48a/0x670
[   22.328109]  bpf_prog_load+0x626/0xd40
[   22.328637]  ? __bpf_prog_charge+0xc0/0xc0
[   22.329222]  ? check_nnp_nosuid.isra.61+0x100/0x100
[   22.329890]  ? __might_fault+0xf6/0x1b0
[   22.330446]  ? lock_acquire+0x360/0x360
[   22.331013]  SyS_bpf+0x67c/0x24d0
[   22.331491]  ? trace_hardirqs_on+0xd/0x10
[   22.332049]  ? __getnstimeofday64+0xaf/0x1c0
[   22.332635]  ? bpf_prog_get+0x20/0x20
[   22.333135]  ? __audit_syscall_entry+0x300/0x600
[   22.333770]  ? syscall_trace_enter+0x540/0xdd0
[   22.334339]  ? exit_to_usermode_loop+0xe0/0xe0
[   22.334950]  ? do_syscall_64+0x48/0x410
[   22.335446]  ? bpf_prog_get+0x20/0x20
[   22.335954]  do_syscall_64+0x181/0x410
[   22.336454]  entry_SYSCALL64_slow_path+0x25/0x25
[   22.337121] RIP: 0033:0x7f263fe81f19
[   22.337618] RSP: 002b:00007ffd9a3440c8 EFLAGS: 00000202 ORIG_RAX: 0000000000000141
[   22.338619] RAX: ffffffffffffffda RBX: 0000000000aac5fb RCX: 00007f263fe81f19
[   22.339600] RDX: 0000000000000030 RSI: 00007ffd9a3440d0 RDI: 0000000000000005
[   22.340470] RBP: 0000000000a9a1e0 R08: 0000000000a9a1e0 R09: 0000009d00000001
[   22.341430] R10: 0000000000000000 R11: 0000000000000202 R12: 0000000000010000
[   22.342411] R13: 0000000000a9a023 R14: 0000000000000001 R15: 0000000000000003
[   22.343369]
[   22.343593] The buggy address belongs to the variable:
[   22.344241]  interpreters+0x80/0x980
[   22.344708]
[   22.344908] Memory state around the buggy address:
[   22.345556]  ffffffff82aad980: 00 00 00 04 fa fa fa fa 04 fa fa fa fa fa fa fa
[   22.346449]  ffffffff82aada00: 00 00 00 00 00 fa fa fa fa fa fa fa 00 00 00 00
[   22.347361] >ffffffff82aada80: 00 00 00 00 00 00 00 00 00 00 00 00 fa fa fa fa
[   22.348301]                                                        ^
[   22.349142]  ffffffff82aadb00: 00 01 fa fa fa fa fa fa 00 00 00 00 00 00 00 00
[   22.350058]  ffffffff82aadb80: 00 00 07 fa fa fa fa fa 00 00 05 fa fa fa fa fa
[   22.350984] ==================================================================

Fixes: b870aa901f ("bpf: use different interpreter depending on required stack size")
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Alexei Starovoitov <ast@fb.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-29 15:37:04 -04:00
Martin KaFai Lau
14dc6f04f4 bpf: Add syscall lookup support for fd array and htab
This patch allows userspace to do BPF_MAP_LOOKUP_ELEM on
BPF_MAP_TYPE_PROG_ARRAY,
BPF_MAP_TYPE_ARRAY_OF_MAPS and
BPF_MAP_TYPE_HASH_OF_MAPS.

The lookup returns a prog-id or map-id to the userspace.
The userspace can then use the BPF_PROG_GET_FD_BY_ID
or BPF_MAP_GET_FD_BY_ID to get a fd.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-29 13:13:25 -04:00
Steven Rostedt (VMware)
0f17976568 ftrace: Fix regression with module command in stack_trace_filter
When doing the following command:

 # echo ":mod:kvm_intel" > /sys/kernel/tracing/stack_trace_filter

it triggered a crash.

This happened with the clean up of probes. It required all callers to the
regex function (doing ftrace filtering) to have ops->private be a pointer to
a trace_array. But for the stack tracer, that is not the case.

Allow for the ops->private to be NULL, and change the function command
callbacks to handle the trace_array pointer being NULL as well.

Fixes: d2afd57a4b ("tracing/ftrace: Allow instances to have their own function probes")
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-06-29 10:05:45 -04:00
Luis R. Rodriguez
96b5b19459 module: make the modinfo name const
This can be accomplished by making blacklisted() also accept const.

Signed-off-by: Luis R. Rodriguez <mcgrof@kernel.org>
Acked-by: Kees Cook <keescook@chromium.org>
[jeyu: fix typo]
Signed-off-by: Jessica Yu <jeyu@kernel.org>
2017-06-29 14:19:17 +02:00
Thomas Gleixner
ff801b716e sched/numa: Hide numa_wake_affine() from UP build
Stephen reported the following build warning in UP:

kernel/sched/fair.c:2657:9: warning: 'struct sched_domain' declared inside
parameter list
         ^
/home/sfr/next/next/kernel/sched/fair.c:2657:9: warning: its scope is only this
definition or declaration, which is probably not what you want

Hide the numa_wake_affine() inline stub on UP builds to get rid of it.

Fixes: 3fed382b46 ("sched/numa: Implement NUMA node level wake_affine()")
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
2017-06-29 08:25:52 +02:00
Sebastian Andrzej Siewior
2287d8664f timers: Make the cpu base lock raw
The timers cpu base lock could not be converted to a raw spinlock becaue
the lock held time was non-deterministic due to cascading and long lasting
timer wheel traversals.

The rework of the timer wheel to the new non-cascading model removed also
the wheel traversals and the lock held times are deterministic now. This
allows to make the lock raw and thereby unbreaks NOHz* on preempt-RT.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Link: http://lkml.kernel.org/r/20170627161538.30257-1-bigeasy@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2017-06-29 00:27:24 +02:00
Tejun Heo
5136f6365c cgroup: implement "nsdelegate" mount option
Currently, cgroup only supports delegation to !root users and cgroup
namespaces don't get any special treatments.  This limits the
usefulness of cgroup namespaces as they by themselves can't be safe
delegation boundaries.  A process inside a cgroup can change the
resource control knobs of the parent in the namespace root and may
move processes in and out of the namespace if cgroups outside its
namespace are visible somehow.

This patch adds a new mount option "nsdelegate" which makes cgroup
namespaces delegation boundaries.  If set, cgroup behaves as if write
permission based delegation took place at namespace boundaries -
writes to the resource control knobs from the namespace root are
denied and migration crossing the namespace boundary aren't allowed
from inside the namespace.

This allows cgroup namespace to function as a delegation boundary by
itself.

v2: Silently ignore nsdelegate specified on !init mounts.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Aravind Anbudurai <aru7@fb.com>
Cc: Serge Hallyn <serge@hallyn.com>
Cc: Eric Biederman <ebiederm@xmission.com>
2017-06-28 14:45:21 -04:00
Tejun Heo
824ecbe01c cgroup: restructure cgroup_procs_write_permission()
Restructure cgroup_procs_write_permission() to make extending
permission logic easier.

This patch doesn't cause any functional changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
2017-06-28 14:45:02 -04:00
Steven Rostedt (VMware)
4ec7846785 ftrace: Decrement count for dyn_ftrace_total_info for init functions
Init boot up functions may be traced, but they are also freed when the
kernel finishes booting. These are removed from the ftrace tables, and the
debug variable for dyn_ftrace_total_info needs to reflect that as well.

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-06-28 11:57:03 -04:00
Steven Rostedt (VMware)
3b58a3c72f ftrace: Unlock hash mutex on failed allocation in process_mod_list()
If the new_hash fails to allocate, then unlock the hash mutex on error.

Reported-by: Julia Lawall <julia.lawall@lip6.fr>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-06-28 09:09:38 -04:00
Luis R. Rodriguez
165d1cc007 kmod: reduce atomic operations on kmod_concurrent and simplify
When checking if we want to allow a kmod thread to kick off we increment,
then read to see if we should enable a thread. If we were over the allowed
limit limit we decrement. Splitting the increment far apart from decrement
means there could be a time where two increments happen potentially
giving a false failure on a thread which should have been allowed.

CPU1			CPU2
atomic_inc()
			atomic_inc()
atomic_read()
			atomic_read()
atomic_dec()
			atomic_dec()

In this case a read on CPU1 gets the atomic_inc()'s and we could negate
it from getting a kmod thread. We could try to prevent this with a lock
or preemption but that is overkill. We can fix by reducing the number of
atomic operations. We do this by inverting the logic of of the enabler,
instead of incrementing kmod_concurrent as we get new kmod users, define the
variable kmod_concurrent_max as the max number of currently allowed kmod
users and as we get new kmod users just decrement it if its still positive.
This combines the dec and read in one atomic operation.

In this case we no longer get the same false failure:

CPU1			CPU2
atomic_dec_if_positive()
			atomic_dec_if_positive()
atomic_inc()
			atomic_inc()

The number of threads is computed at init, and since the current computation
of kmod_concurrent includes the thread count we can avoid setting
kmod_concurrent_max later in boot through an init call by simply sticking to
50 as the kmod_concurrent_max. The assumption here is a system with modules
must at least have ~16 MiB of RAM.

Suggested-by: Petr Mladek <pmladek@suse.com>
Suggested-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
Signed-off-by: Luis R. Rodriguez <mcgrof@kernel.org>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Jessica Yu <jeyu@kernel.org>
2017-06-27 19:36:31 +02:00
Luis R. Rodriguez
93437353da module: use list_for_each_entry_rcu() on find_module_all()
The module list has been using RCU in a lot of other calls
for a while now, we just overlooked changing this one over to
use RCU.

Signed-off-by: Luis R. Rodriguez <mcgrof@kernel.org>
Signed-off-by: Jessica Yu <jeyu@kernel.org>
2017-06-27 19:35:52 +02:00
Joel Fernandes
441dae8f2f tracing: Add support for display of tgid in trace output
Earlier patches introduced ability to record the tgid using the 'record-tgid'
option. Here we read the tgid and output it if the option is enabled.

Link: http://lkml.kernel.org/r/20170626053844.5746-3-joelaf@google.com

Cc: kernel-team@android.com
Cc: Ingo Molnar <mingo@redhat.com>
Tested-by: Michael Sartain <mikesart@gmail.com>
Signed-off-by: Joel Fernandes <joelaf@google.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-06-27 13:30:28 -04:00
Joel Fernandes
d914ba37d7 tracing: Add support for recording tgid of tasks
Inorder to support recording of tgid, the following changes are made:

* Introduce a new API (tracing_record_taskinfo) to additionally record the tgid
  along with the task's comm at the same time. This has has the benefit of not
  setting trace_cmdline_save before all the information for a task is saved.
* Add a new API tracing_record_taskinfo_sched_switch to record task information
  for 2 tasks at a time (previous and next) and use it from sched_switch probe.
* Preserve the old API (tracing_record_cmdline) and create it as a wrapper
  around the new one so that existing callers aren't affected.
* Reuse the existing sched_switch and sched_wakeup probes to record tgid
  information and add a new option 'record-tgid' to enable recording of tgid

When record-tgid option isn't enabled to being with, we take care to make sure
that there's isn't memory or runtime overhead.

Link: http://lkml.kernel.org/r/20170627020155.5139-1-joelaf@google.com

Cc: kernel-team@android.com
Cc: Ingo Molnar <mingo@redhat.com>
Tested-by: Michael Sartain <mikesart@gmail.com>
Signed-off-by: Joel Fernandes <joelaf@google.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-06-27 13:30:28 -04:00
Steven Rostedt (VMware)
83dd14933e ftrace: Decrement count for dyn_ftrace_total_info file
The dyn_ftrace_total_info file is used to show how many functions have been
converted into nops and can be used by ftrace. The problem is that it does
not get decremented when functions are removed (init boot code being freed,
and modules being freed). That means the number is very inaccurate everytime
functions are removed from the ftrace tables. Decrement it when functions
are removed.

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-06-27 13:30:27 -04:00
Steven Rostedt (VMware)
6a9c981b1e ftrace: Remove unused function ftrace_arch_read_dyn_info()
ftrace_arch_read_dyn_info() was used so that archs could add its own debug
information into the dyn_ftrace_total_info in the tracefs file system. That
file is for debugging usage of dynamic ftrace. No arch uses that function
anymore, so just get rid of it.

This also allows for tracing_read_dyn_info() to be cleaned up a bit.

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-06-27 13:30:22 -04:00
BaoJun Luo
eba74c2944 PM / hibernate: Drop redundant parameter of swsusp_alloc()
The first parameter of swsusp_alloc is not used, so drop it.

Signed-off-by: BaoJun Luo <baojun.luo@samsung.com>
[ rjw: Subject & changelog ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2017-06-27 02:10:44 +02:00
Balbir Singh
49368a47f6 PM / hibernate: Use CONFIG_HAVE_SET_MEMORY for include condition
Kbuild reported a build failure when CONFIG_STRICT_KERNEL_RWX was
enabled on powerpc. We don't yet have ARCH_HAS_SET_MEMORY and ppc32
saw a build failure.

I've only done a basic compile test with a config that has
hibernation enabled.

Fixes: 50327ddfbc (kernel/power/snapshot.c: use set_memory.h header)
Reported-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Balbir Singh <bsingharora@gmail.com>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2017-06-27 02:05:28 +02:00
Kees Cook
0b5fa22906 seccomp: Switch from atomic_t to recount_t
This switches the seccomp usage tracking from atomic_t to refcount_t to
gain refcount overflow protections.

Cc: Elena Reshetova <elena.reshetova@intel.com>
Cc: David Windsor <dwindsor@gmail.com>
Cc: Hans Liljestrand <hans.liljestrand@aalto.fi>
Signed-off-by: Kees Cook <keescook@chromium.org>
2017-06-26 09:24:00 -07:00
Kees Cook
131b635159 seccomp: Clean up core dump logic
This just cleans up the core dumping logic to avoid the braces around
the RET_KILL case.

Signed-off-by: Kees Cook <keescook@chromium.org>
2017-06-26 09:22:33 -07:00
Steven Rostedt (VMware)
8c08f0d5c6 ftrace: Have cached module filters be an active filter
When a module filter is added to set_ftrace_filter, if the module is not
loaded, it is cached. This should be considered an active filter, and
function tracing should be filtered by this. That is, if a cached module
filter is the only filter set, then no function tracing should be happening,
as all the functions available will be filtered out.

This makes sense, as the reason to add a cached module filter, is to trace
the module when you load it. There shouldn't be any other tracing happening
until then.

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-06-26 11:53:04 -04:00
Steven Rostedt (VMware)
d7fbf8df7c ftrace: Implement cached modules tracing on module load
If a module is cached in the set_ftrace_filter, and that module is loaded,
then enable tracing on that module as if the cached module text was written
into set_ftrace_filter just as the module is loaded.

  # echo ":mod:kvm_intel" >
  # cat /sys/kernel/tracing/set_ftrace_filter
 #### all functions enabled ####
 :mod:kvm_intel
  # modprobe kvm_intel
  # cat /sys/kernel/tracing/set_ftrace_filter
 vmx_get_rflags [kvm_intel]
 vmx_get_pkru [kvm_intel]
 vmx_get_interrupt_shadow [kvm_intel]
 vmx_rdtscp_supported [kvm_intel]
 vmx_invpcid_supported [kvm_intel]
 [..]

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-06-26 11:53:03 -04:00
Steven Rostedt (VMware)
5985ea8bd5 ftrace: Have the cached module list show in set_ftrace_filter
When writing in a module filter into set_ftrace_filter for a module that is
not yet loaded, it it cached, and will be executed when the module is loaded
(although that is not implemented yet at this commit). Display the list of
cached modules to be traced.

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-06-26 11:53:02 -04:00
Steven Rostedt (VMware)
673feb9d76 ftrace: Add :mod: caching infrastructure to trace_array
This is the start of the infrastructure work to allow for tracing module
functions before it is loaded.

Currently the following command:

  # echo :mod:some-mod > set_ftrace_filter

will enable tracing of all functions within the module "some-mod" if it is
loaded. What we want, is if the module is not loaded, that line will be
saved. When the module is loaded, then the "some-mod" will have that line
executed on it, so that the functions within it starts being traced.

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-06-26 11:53:02 -04:00
Corentin Labbe
1ba5c08b58 kernel/module.c: suppress warning about unused nowarn variable
This patch fix the following warning:
kernel/module.c: In function 'add_usage_links':
kernel/module.c:1653:6: warning: variable 'nowarn' set but not used [-Wunused-but-set-variable]

[jeyu: folded in first patch since it only swapped the function order
so that del_usage_links can be called from add_usage_links]
Signed-off-by: Corentin Labbe <clabbe.montjoie@gmail.com>
Signed-off-by: Jessica Yu <jeyu@kernel.org>
2017-06-26 17:23:19 +02:00
Jeffy Chen
bf22ff45be genirq: Avoid unnecessary low level irq function calls
Check irq state in enable/disable/unmask/mask_irq to avoid unnecessary
low level irq function calls.

This has two advantages:
    - Conditionals are faster than hardware access

    - Solves issues with the underlying refcounting of the pinctrl
      infrastructure

Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Jeffy Chen <jeffy.chen@rock-chips.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: tfiga@chromium.org
Cc: briannorris@chromium.org
Cc: dianders@chromium.org
Link: http://lkml.kernel.org/r/1498476814-12563-2-git-send-email-jeffy.chen@rock-chips.com
2017-06-26 15:47:00 +02:00
Jeffy Chen
d829b8fb24 genirq: Set irq masked state when initializing irq_desc
The irq default state is set to disabled when allocating irq desc, but the
masked state flag is not set. This is inconsistent vs. the state tracking
logic which is used to prevent unnecessary calls to hardware level irq chip
functions.

Set the masked state flag as well.

Signed-off-by: Jeffy Chen <jeffy.chen@rock-chips.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: tfiga@chromium.org
Cc: briannorris@chromium.org
Cc: dianders@chromium.org
Link: http://lkml.kernel.org/r/1498476814-12563-1-git-send-email-jeffy.chen@rock-chips.com
2017-06-26 14:05:41 +02:00
Deepa Dinamani
63a766a178 posix-stubs: Conditionally include COMPAT_SYS_NI defines
These apis only need to be defined if CONFIG_COMPAT is
enabled.

Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-06-25 21:58:46 -04:00
Deepa Dinamani
d5b7ffbfbd time: introduce {get,put}_itimerspec64
As we change the user space type for the timerfd and posix timer
functions to newer data types, we need some form of conversion
helpers to avoid duplicating that logic.

Suggested-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-06-25 21:58:46 -04:00
Deepa Dinamani
f59dd9c886 time: add get_timespec64 and put_timespec64
Add helper functions to convert between struct timespec64 and
struct timespec at userspace boundaries.

This is a preparatory patch to use timespec64 as the basic type
internally in the kernel as timespec is not y2038 safe on 32 bit systems.
The patch helps the cause by containing all data conversions at the
userspace boundaries within these functions.

Suggested-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-06-25 21:58:46 -04:00
Linus Torvalds
5f4b37d878 Merge branch 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer fixes from Thomas Gleixner:
 "A few fixes for timekeeping and timers:

   - Plug a subtle race due to a missing READ_ONCE() in the timekeeping
     code where reloading of a pointer results in an inconsistent
     callback argument being supplied to the clocksource->read function.

   - Correct the CLOCK_MONOTONIC_RAW sub-nanosecond accounting in the
     time keeping core code, to prevent a possible discontuity.

   - Apply a similar fix to the arm64 vdso clock_gettime()
     implementation

   - Add missing includes to clocksource drivers, which relied on
     indirect includes which fails in certain configs.

   - Use the proper iomem pointer for read/iounmap in a probe function"

* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  arm64/vdso: Fix nsec handling for CLOCK_MONOTONIC_RAW
  time: Fix CLOCK_MONOTONIC_RAW sub-nanosecond accounting
  time: Fix clock->read(clock) race around clocksource changes
  clocksource: Explicitly include linux/clocksource.h when needed
  clocksource/drivers/arm_arch_timer: Fix read and iounmap of incorrect variable
2017-06-25 11:59:19 -07:00
Linus Torvalds
35d8d5d47c Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fixes from Thomas Gleixner:
 "Three fixlets for perf:

   - Return the proper error code if aux buffers for a event are not
     supported.

   - Calculate the probe offset for inlined functions correctly

   - Update the Skylake DTLB load/store miss event so it can count 1G
     TLB entries as well"

* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf probe: Fix probe definition for inlined functions
  perf/x86/intel: Add 1G DTLB load/store miss support for SKL
  perf/aux: Correct return code of rb_alloc_aux() if !has_aux(ev)
2017-06-25 11:55:21 -07:00
Daniel Lezcano
e1c9214955 genirq/timings: Add infrastructure for estimating the next interrupt arrival time
An interrupt behaves with a burst of activity with periodic interval of time
followed by one or two peaks of longer interval.

As the time intervals are periodic, statistically speaking they follow a normal
distribution and each interrupts can be tracked individually.

Add a mechanism to compute the statistics on all interrupts, except the
timers which are deterministic from a prediction point of view, as their
expiry time is known.

The goal is to extract the periodicity for each interrupt, with the last
timestamp and sum them, so the next event can be predicted to a certain
extent.

Taking the earliest prediction gives the expected wakeup on the system
(assuming a timer won't expire before).

Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Nicolas Pitre <nicolas.pitre@linaro.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: "Rafael J . Wysocki" <rafael@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Link: http://lkml.kernel.org/r/1498227072-5980-2-git-send-email-daniel.lezcano@linaro.org
2017-06-24 11:44:39 +02:00
Daniel Lezcano
b2d3d61adb genirq/timings: Add infrastructure to track the interrupt timings
The interrupt framework gives a lot of information about each interrupt. It
does not keep track of when those interrupts occur though, which is a
prerequisite for estimating the next interrupt arrival for power management
purposes.

Add a mechanism to record the timestamp for each interrupt occurrences in a
per-CPU circular buffer to help with the prediction of the next occurrence
using a statistical model.

Each CPU can store up to IRQ_TIMINGS_SIZE events <irq, timestamp>, the
current value of IRQ_TIMINGS_SIZE is 32.

Each event is encoded into a single u64, where the high 48 bits are used
for the timestamp and the low 16 bits are for the irq number.

A static key is introduced so when the irq prediction is switched off at
runtime, the overhead is near to zero.

It results in most of the code in internals.h for inline reasons and a very
few in the new file timings.c. The latter will contain more in the next patch
which will provide the statistical model for the next event prediction.

Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Nicolas Pitre <nicolas.pitre@linaro.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: "Rafael J . Wysocki" <rafael@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Link: http://lkml.kernel.org/r/1498227072-5980-1-git-send-email-daniel.lezcano@linaro.org
2017-06-24 11:44:11 +02:00
Thomas Gleixner
c2ce34c0a0 genirq/debugfs: Remove pointless NULL pointer check
debugfs_remove() has it's own NULL pointer check. Remove the conditional
and make irq_remove_debugfs_entry() an inline helper

Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2017-06-24 11:43:53 +02:00
Linus Torvalds
f65013d655 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
Pull timer fix from Eric Biederman:
 "This fixes an issue of confusing injected signals with the signals
  from posix timers that has existed since posix timers have been in the
  kernel.

  This patch is slightly simpler than my earlier version of this patch
  as I discovered in testing that I had misspelled "#ifdef
  CONFIG_POSIX_TIMERS". So I deleted that unnecessary test and made
  setting of resched_timer uncondtional.

  I have tested this and verified that without this patch there is a
  nasty hang that is easy to trigger, and with this patch everything
  works properly"

Thomas Gleixner dixit:
 "It fixes the problem at hand and covers the ptrace case as well, which
  I missed.

  Reviewed-and-tested-by: Thomas Gleixner <tglx@linutronix.de>"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
  signal: Only reschedule timers on signals timers have sent
2017-06-24 02:24:53 -07:00
Rik van Riel
815abf5af4 sched/fair: Remove effective_load()
The effective_load() function was only used by the NUMA balancing
code, and not by the regular load balancing code. Now that the
NUMA balancing code no longer uses it either, get rid of it.

Signed-off-by: Rik van Riel <riel@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: jhladky@redhat.com
Cc: linux-kernel@vger.kernel.org
Link: http://lkml.kernel.org/r/20170623165530.22514-5-riel@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-06-24 08:57:53 +02:00
Rik van Riel
3fed382b46 sched/numa: Implement NUMA node level wake_affine()
Since select_idle_sibling() can place a task anywhere on a socket,
comparing loads between individual CPU cores makes no real sense
for deciding whether to do an affine wakeup across sockets, either.

Instead, compare the load between the sockets in a similar way the
load balancer and the numa balancing code do.

Signed-off-by: Rik van Riel <riel@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: jhladky@redhat.com
Cc: linux-kernel@vger.kernel.org
Link: http://lkml.kernel.org/r/20170623165530.22514-4-riel@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-06-24 08:57:52 +02:00
Rik van Riel
7d894e6e34 sched/fair: Simplify wake_affine() for the single socket case
Then 'this_cpu' and 'prev_cpu' are in the same socket, select_idle_sibling()
will do its thing regardless of the return value of wake_affine().

Just return true and don't look at all the other things.

Signed-off-by: Rik van Riel <riel@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: jhladky@redhat.com
Cc: linux-kernel@vger.kernel.org
Link: http://lkml.kernel.org/r/20170623165530.22514-3-riel@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-06-24 08:57:52 +02:00
Rik van Riel
739294fb03 sched/numa: Override part of migrate_degrades_locality() when idle balancing
Several tests in the NAS benchmark seem to run a lot slower with
NUMA balancing enabled, than with NUMA balancing disabled. The
slower run time corresponds with increased idle time.

Overriding the final test of migrate_degrades_locality (but still
doing the other NUMA tests first) seems to improve performance
of those benchmarks.

Reported-by: Jirka Hladky <jhladky@redhat.com>
Signed-off-by: Rik van Riel <riel@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Link: http://lkml.kernel.org/r/20170623165530.22514-2-riel@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-06-24 08:57:46 +02:00
Ingo Molnar
1bc3cd4dfa Merge branch 'linus' into sched/core, to pick up fixes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-06-24 08:57:20 +02:00
Yonghong Song
239946314e bpf: possibly avoid extra masking for narrower load in verifier
Commit 31fd85816d ("bpf: permits narrower load from bpf program
context fields") permits narrower load for certain ctx fields.
The commit however will already generate a masking even if
the prog-specific ctx conversion produces the result with
narrower size.

For example, for __sk_buff->protocol, the ctx conversion
loads the data into register with 2-byte load.
A narrower 2-byte load should not generate masking.
For __sk_buff->vlan_present, the conversion function
set the result as either 0 or 1, essentially a byte.
The narrower 2-byte or 1-byte load should not generate masking.

To avoid unnecessary masking, prog-specific *_is_valid_access
now passes converted_op_size back to verifier, which indicates
the valid data width after perceived future conversion.
Based on this information, verifier is able to avoid
unnecessary marking.

Since we want more information back from prog-specific
*_is_valid_access checking, all of them are packed into
one data structure for more clarity.

Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-23 14:04:11 -04:00
Nicolas Pitre
8887cd9903 sched/rt: Move RT related code from sched/core.c to sched/rt.c
This helps making sched/core.c smaller and hopefully easier to understand and maintain.

Signed-off-by: Nicolas Pitre <nico@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20170621182203.30626-3-nicolas.pitre@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-06-23 10:46:45 +02:00
Nicolas Pitre
06a76fe08d sched/deadline: Move DL related code from sched/core.c to sched/deadline.c
This helps making sched/core.c smaller and hopefully easier to understand and maintain.

Signed-off-by: Nicolas Pitre <nico@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20170621182203.30626-2-nicolas.pitre@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-06-23 10:46:45 +02:00
Nicolas Pitre
e1d4eeec5a sched/cpuset: Only offer CONFIG_CPUSETS if SMP is enabled
Make CONFIG_CPUSETS=y depend on SMP as this feature makes no sense
on UP. This allows for configuring out cpuset_cpumask_can_shrink()
and task_can_attach() entirely, which shrinks the kernel a bit.

Signed-off-by: Nicolas Pitre <nico@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20170614171926.8345-2-nicolas.pitre@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-06-23 10:46:44 +02:00
Steven Rostedt (VMware)
feaf1283d1 tracing: Show address when function names are not found
Currently, when a function is not found in kallsyms, instead of simply
showing the function address, it shows nothing at all:

 # echo ':mod:kvm_intel' > /sys/kernel/tracing/set_ftrace_filter
 # echo function > /sys/kernel/tracing/set_ftrace_filter
 # qemu -enable-kvm /home/my-qemu-image
   <Ctrl-C>
 # rmmod kvm_intel
 # cat /sys/kernel/tracing/trace
 qemu-system-x86-2408  [001] d..2   135.013238:  <-kvm_arch_hardware_enable
 qemu-system-x86-2408  [001] ....   135.014574:  <-kvm_arch_vm_ioctl
 qemu-system-x86-2408  [001] ....   135.015420:  <-kvm_vm_ioctl_check_extension
 qemu-system-x86-2408  [001] ....   135.045411:  <-__do_cpuid_ent
 qemu-system-x86-2408  [001] ....   135.045412:  <-__do_cpuid_ent
 qemu-system-x86-2408  [001] ....   135.045412:  <-__do_cpuid_ent
 qemu-system-x86-2408  [001] ....   135.045412:  <-__do_cpuid_ent
 qemu-system-x86-2408  [001] ...1   135.045413:  <-__do_cpuid_ent
 qemu-system-x86-2408  [001] ....   135.045413:  <-__do_cpuid_ent

When it should show:

 qemu-system-x86-2408  [001] d..2   135.013238: 0xffffffffa02a39f0 <-kvm_arch_hardware_enable
 qemu-system-x86-2408  [001] ....   135.014574: 0xffffffffa02a2ba0 <-kvm_arch_vm_ioctl
 qemu-system-x86-2408  [001] ....   135.015420: 0xffffffffa029e4e0 <-kvm_vm_ioctl_check_extension
 qemu-system-x86-2408  [001] ....   135.045411: 0xffffffffa02a1380 <-__do_cpuid_ent
 qemu-system-x86-2408  [001] ....   135.045412: 0xffffffffa029e160 <-__do_cpuid_ent
 qemu-system-x86-2408  [001] ....   135.045412: 0xffffffffa029e180 <-__do_cpuid_ent
 qemu-system-x86-2408  [001] ....   135.045412: 0xffffffffa029e520 <-__do_cpuid_ent
 qemu-system-x86-2408  [001] ...1   135.045413: 0xffffffffa02a13b0 <-__do_cpuid_ent
 qemu-system-x86-2408  [001] ....   135.045413: 0xffffffffa02a1380 <-__do_cpuid_ent

instead.

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-06-22 17:10:06 -04:00
Marc Zyngier
6a6544e520 genirq/irqdomain: Remove auto-recursive hierarchy support
It did seem like a good idea at the time, but it never really
caught on, and auto-recursive domains remain unused 3 years after
having been introduced.

Oh well, time for a late spring cleanup.

Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2017-06-22 18:29:34 +02:00
Marc Zyngier
61d0a000b7 genirq/irqdomain: Add irq_domain_update_bus_token helper
We can have irq domains that are identified by the same fwnode
(because they are serviced by the same HW), and yet have different
functionnality (because they serve different busses, for example).
This is what we use the bus_token field.

Since we don't use this field when generating the domain name,
all the aliasing domains will get the same name, and the debugfs
file creation fails. Also, bus_token is updated by individual drivers,
and the core code is unaware of that update.

In order to sort this mess, let's introduce a helper that takes care
of updating bus_token, and regenerate the debugfs file.

A separate patch will update all the individual users.

Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2017-06-22 18:28:45 +02:00
Christoph Hellwig
9a0ef98e18 genirq/affinity: Assign vectors to all present CPUs
Currently the irq vector spread algorithm is restricted to online CPUs,
which ties the IRQ mapping to the currently online devices and doesn't deal
nicely with the fact that CPUs could come and go rapidly due to e.g. power
management.

Instead assign vectors to all present CPUs to avoid this churn.

Build a map of all possible CPUs for a given node, as the architectures
only provide a map of all onlines CPUs. Do this dynamically on each call
for the vector assingments, which is a bit suboptimal and could be
optimized in the future by provinding a mapping from the arch code.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: linux-block@vger.kernel.org
Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: Marc Zyngier <marc.zyngier@arm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: linux-nvme@lists.infradead.org
Cc: Keith Busch <keith.busch@intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20170603140403.27379-5-hch@lst.de
2017-06-22 18:21:26 +02:00
Thomas Gleixner
8f31a9845d genirq/cpuhotplug: Avoid irq affinity setting for single targets
Avoid trying to add a newly online CPU to the effective affinity mask of an
started up interrupt. That interrupt will either stay on the already online
CPU or move around for no value.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Marc Zyngier <marc.zyngier@arm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Keith Busch <keith.busch@intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Christoph Hellwig <hch@lst.de>
Link: http://lkml.kernel.org/r/20170619235447.431321047@linutronix.de
2017-06-22 18:21:25 +02:00
Thomas Gleixner
d52dd44175 genirq: Introduce IRQD_SINGLE_TARGET flag
Many interrupt chips allow only a single CPU as interrupt target. The core
code has no knowledge about that. That's unfortunate as it could avoid
trying to readd a newly online CPU to the effective affinity mask.

Add the status flag and the necessary accessors.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Marc Zyngier <marc.zyngier@arm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Keith Busch <keith.busch@intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Christoph Hellwig <hch@lst.de>
Link: http://lkml.kernel.org/r/20170619235447.352343969@linutronix.de
2017-06-22 18:21:25 +02:00
Thomas Gleixner
c5cb83bb33 genirq/cpuhotplug: Handle managed IRQs on CPU hotplug
If a CPU goes offline, interrupts affine to the CPU are moved away. If the
outgoing CPU is the last CPU in the affinity mask the migration code breaks
the affinity and sets it it all online cpus.

This is a problem for affinity managed interrupts as CPU hotplug is often
used for power management purposes. If the affinity is broken, the
interrupt is not longer affine to the CPUs to which it was allocated.

The affinity spreading allows to lay out multi queue devices in a way that
they are assigned to a single CPU or a group of CPUs. If the last CPU goes
offline, then the queue is not longer used, so the interrupt can be
shutdown gracefully and parked until one of the assigned CPUs comes online
again.

Add a graceful shutdown mechanism into the irq affinity breaking code path,
mark the irq as MANAGED_SHUTDOWN and leave the affinity mask unmodified.

In the online path, scan the active interrupts for managed interrupts and
if the interrupt is functional and the newly online CPU is part of the
affinity mask, restart the interrupt if it is marked MANAGED_SHUTDOWN or if
the interrupts is started up, try to add the CPU back to the effective
affinity mask.

Originally-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Marc Zyngier <marc.zyngier@arm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Keith Busch <keith.busch@intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20170619235447.273417334@linutronix.de
2017-06-22 18:21:25 +02:00
Thomas Gleixner
761ea388e8 genirq: Handle managed irqs gracefully in irq_startup()
Affinity managed interrupts should keep their assigned affinity accross CPU
hotplug. To avoid magic hackery in device drivers, the core code shall
manage them transparently and set these interrupts into a managed shutdown
state when the last CPU of the assigned affinity mask goes offline. The
interrupt will be restarted when one of the CPUs in the assigned affinity
mask comes back online.

Add the necessary logic to irq_startup(). If an interrupt is requested and
started up, the code checks whether it is affinity managed and if so, it
checks whether a CPU in the interrupts affinity mask is online. If not, it
puts the interrupt into managed shutdown state. 

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Marc Zyngier <marc.zyngier@arm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Keith Busch <keith.busch@intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Christoph Hellwig <hch@lst.de>
Link: http://lkml.kernel.org/r/20170619235447.189851170@linutronix.de
2017-06-22 18:21:24 +02:00
Thomas Gleixner
4cde9c6b82 genirq: Add force argument to irq_startup()
In order to handle managed interrupts gracefully on irq_startup() so they
won't lose their assigned affinity, it's necessary to allow startups which
keep the interrupts in managed shutdown state, if none of the assigend CPUs
is online. This allows drivers to request interrupts w/o the CPUs being
online, which avoid online/offline churn in drivers.

Add a force argument which can override that decision and let only
request_irq() and enable_irq() allow the managed shutdown
handling. enable_irq() is required, because the interrupt might be
requested with IRQF_NOAUTOEN and enable_irq() invokes irq_startup() which
would then wreckage the assignment again. All other callers force startup
and potentially break the assigned affinity.

No functional change as this only adds the function argument.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Marc Zyngier <marc.zyngier@arm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Keith Busch <keith.busch@intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Christoph Hellwig <hch@lst.de>
Link: http://lkml.kernel.org/r/20170619235447.112094565@linutronix.de
2017-06-22 18:21:24 +02:00
Thomas Gleixner
708d174b6c genirq: Split out irq_startup() code
Split out the inner workings of irq_startup() so it can be reused to handle
managed interrupts gracefully.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Marc Zyngier <marc.zyngier@arm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Keith Busch <keith.busch@intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Christoph Hellwig <hch@lst.de>
Link: http://lkml.kernel.org/r/20170619235447.033235144@linutronix.de
2017-06-22 18:21:24 +02:00
Thomas Gleixner
54fdf6a087 genirq: Introduce IRQD_MANAGED_SHUTDOWN
Affinity managed interrupts should keep their assigned affinity accross CPU
hotplug. To avoid magic hackery in device drivers, the core code shall
manage them transparently. This will set these interrupts into a managed
shutdown state when the last CPU of the assigned affinity mask goes
offline. The interrupt will be restarted when one of the CPUs in the
assigned affinity mask comes back online.

Introduce the necessary state flag and the accessor functions.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Marc Zyngier <marc.zyngier@arm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Keith Busch <keith.busch@intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Christoph Hellwig <hch@lst.de>
Link: http://lkml.kernel.org/r/20170619235446.954523476@linutronix.de
2017-06-22 18:21:23 +02:00
Thomas Gleixner
415fcf1a22 genirq/cpuhotplug: Use effective affinity mask
If the architecture supports the effective affinity mask, migrating
interrupts away which are not targeted by the effective mask is
pointless.

They can stay in the user or system supplied affinity mask, but won't be
targetted at any given point as the affinity setter functions need to
validate against the online cpu mask anyway.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Marc Zyngier <marc.zyngier@arm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Keith Busch <keith.busch@intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Christoph Hellwig <hch@lst.de>
Link: http://lkml.kernel.org/r/20170619235446.328488490@linutronix.de
2017-06-22 18:21:21 +02:00
Thomas Gleixner
0d3f54257d genirq: Introduce effective affinity mask
There is currently no way to evaluate the effective affinity mask of a
given interrupt. Many irq chips allow only a single target CPU or a subset
of CPUs in the affinity mask.

Updating the mask at the time of setting the affinity to the subset would
be counterproductive because information for cpu hotplug about assigned
interrupt affinities gets lost. On CPU hotplug it's also pointless to force
migrate an interrupt, which is not targeted at the CPU effectively. But
currently the information is not available.

Provide a seperate mask to be updated by the irq_chip->irq_set_affinity()
implementations. Implement the read only proc files so the user can see the
effective mask as well w/o trying to deduce it from /proc/interrupts.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Marc Zyngier <marc.zyngier@arm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Keith Busch <keith.busch@intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Christoph Hellwig <hch@lst.de>
Link: http://lkml.kernel.org/r/20170619235446.247834245@linutronix.de
2017-06-22 18:21:20 +02:00
Thomas Gleixner
c1a8038696 genirq/proc: Replace ever repeating type cast
The proc file setup repeats the same ugly type cast for the irq number over
and over. Do it once and hand in the local void pointer.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Marc Zyngier <marc.zyngier@arm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Keith Busch <keith.busch@intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Christoph Hellwig <hch@lst.de>
Link: http://lkml.kernel.org/r/20170619235446.160866358@linutronix.de
2017-06-22 18:21:20 +02:00
Thomas Gleixner
4ab764c336 genirq: Remove pointless gfp argument
All callers hand in GPF_KERNEL. No point to have an extra argument for
that.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Marc Zyngier <marc.zyngier@arm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Keith Busch <keith.busch@intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Christoph Hellwig <hch@lst.de>
Link: http://lkml.kernel.org/r/20170619235446.082544752@linutronix.de
2017-06-22 18:21:19 +02:00
Thomas Gleixner
047dc6331d genirq: Remove pointless arg from show_irq_affinity
The third argument of the internal helper function is unused. Remove it.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Marc Zyngier <marc.zyngier@arm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Keith Busch <keith.busch@intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Christoph Hellwig <hch@lst.de>
Link: http://lkml.kernel.org/r/20170619235446.004958600@linutronix.de
2017-06-22 18:21:19 +02:00
Thomas Gleixner
36d84fb451 genirq: Move irq_fixup_move_pending() to core
Now that x86 uses the generic code, the function declaration and inline
stub can move to the core internal header.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Marc Zyngier <marc.zyngier@arm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Keith Busch <keith.busch@intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Christoph Hellwig <hch@lst.de>
Link: http://lkml.kernel.org/r/20170619235445.928156166@linutronix.de
2017-06-22 18:21:19 +02:00
Thomas Gleixner
77f85e66aa genirq/cpuhotplug: Set force affinity flag on hotplug migration
Set the force migration flag when migrating interrupts away from an
outgoing CPU.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Marc Zyngier <marc.zyngier@arm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Keith Busch <keith.busch@intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Christoph Hellwig <hch@lst.de>
Link: http://lkml.kernel.org/r/20170619235445.681874648@linutronix.de
2017-06-22 18:21:18 +02:00
Thomas Gleixner
47a06d3a78 genirq/cpuhotplug: Add support for conditional masking
Interrupts which cannot be migrated in process context, need to be masked
before the affinity is changed forcefully.

Add support for that. Will be compiled out for architectures which do not
have this x86 specific issue.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Marc Zyngier <marc.zyngier@arm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Keith Busch <keith.busch@intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Christoph Hellwig <hch@lst.de>
Link: http://lkml.kernel.org/r/20170619235445.604565591@linutronix.de
2017-06-22 18:21:17 +02:00
Thomas Gleixner
f0383c24b4 genirq/cpuhotplug: Add support for cleaning up move in progress
In order to move x86 to the generic hotplug migration code, add support for
cleaning up move in progress bits.

On architectures which have this x86 specific (mis)feature not enabled,
this is optimized out by the compiler.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Marc Zyngier <marc.zyngier@arm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Keith Busch <keith.busch@intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Christoph Hellwig <hch@lst.de>
Link: http://lkml.kernel.org/r/20170619235445.525817311@linutronix.de
2017-06-22 18:21:17 +02:00
Thomas Gleixner
91f26cb4cd genirq/cpuhotplug: Do not migrated shutdown irqs
Interrupts, which are shut down are tried to be migrated as well. That's
pointless because the interrupt cannot fire and the next startup will move
it to the proper place anyway.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Marc Zyngier <marc.zyngier@arm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Keith Busch <keith.busch@intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Christoph Hellwig <hch@lst.de>
Link: http://lkml.kernel.org/r/20170619235445.447550992@linutronix.de
2017-06-22 18:21:17 +02:00
Thomas Gleixner
e8a7035039 genirq/cpuhotplug: Reorder check logic
Move the checks for a valid irq chip and the irq_set_affinity() callback
right in front of the whole migration logic. No point in doing a gazillion
of other things when the interrupt cannot be migrated at all.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Marc Zyngier <marc.zyngier@arm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Keith Busch <keith.busch@intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Christoph Hellwig <hch@lst.de>
Link: http://lkml.kernel.org/r/20170619235445.354181630@linutronix.de
2017-06-22 18:21:16 +02:00
Thomas Gleixner
735c09524d genirq/cpuhotplug: Dont claim success on error
In case the affinity of an interrupt was broken, a printk is emitted.

But if the affinity cannot be set at all due to a missing
irq_set_affinity() callback or due to a failing callback, the message is
still printed preceeded by a warning/error.

That makes no sense whatsoever.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Marc Zyngier <marc.zyngier@arm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Keith Busch <keith.busch@intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Christoph Hellwig <hch@lst.de>
Link: http://lkml.kernel.org/r/20170619235445.274852976@linutronix.de
2017-06-22 18:21:16 +02:00
Thomas Gleixner
0dd945ff46 genirq/cpuhotplug: Remove irq disabling logic
This is called from stop_machine() with interrupts disabled. No point in
disabling them some more.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Marc Zyngier <marc.zyngier@arm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Keith Busch <keith.busch@intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Christoph Hellwig <hch@lst.de>
Link: http://lkml.kernel.org/r/20170619235445.198042748@linutronix.de
2017-06-22 18:21:16 +02:00
Christoph Hellwig
137221df69 genirq: Move pending helpers to internal.h
So that the affinity code can reuse them.


Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Marc Zyngier <marc.zyngier@arm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Keith Busch <keith.busch@intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20170619235445.109426284@linutronix.de
2017-06-22 18:21:15 +02:00
Thomas Gleixner
2e051552df genirq: Move initial affinity setup to irq_startup()
The startup vs. setaffinity ordering of interrupts depends on the
IRQF_NOAUTOEN flag. Chained interrupts are not getting any affinity
assignment at all.

A regular interrupt is started up and then the affinity is set. A
IRQF_NOAUTOEN marked interrupt is not started up, but the affinity is set
nevertheless.

Move the affinity setup to startup_irq() so the ordering is always the same
and chained interrupts get the proper default affinity assigned as well.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Marc Zyngier <marc.zyngier@arm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Keith Busch <keith.busch@intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Christoph Hellwig <hch@lst.de>
Link: http://lkml.kernel.org/r/20170619235445.020534783@linutronix.de
2017-06-22 18:21:15 +02:00
Thomas Gleixner
43564bd97d genirq: Rename setup_affinity() to irq_setup_affinity()
Rename it with a proper irq_ prefix and make it available for other files
in the core code. Preparatory patch for moving the irq affinity setup
around.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Marc Zyngier <marc.zyngier@arm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Keith Busch <keith.busch@intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Christoph Hellwig <hch@lst.de>
Link: http://lkml.kernel.org/r/20170619235444.928501004@linutronix.de
2017-06-22 18:21:14 +02:00
Thomas Gleixner
cba4235e60 genirq: Remove mask argument from setup_affinity()
No point to have this alloc/free dance of cpumasks. Provide a static mask
for setup_affinity() and protect it proper.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Marc Zyngier <marc.zyngier@arm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Keith Busch <keith.busch@intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Christoph Hellwig <hch@lst.de>
Link: http://lkml.kernel.org/r/20170619235444.851571573@linutronix.de
2017-06-22 18:21:14 +02:00
Thomas Gleixner
cdd16365b0 genirq: Provide irq_fixup_move_pending()
If an CPU goes offline, the interrupts are migrated away, but a eventually
pending interrupt move, which has not yet been made effective is kept
pending even if the outgoing CPU is the sole target of the pending affinity
mask. What's worse is, that the pending affinity mask is discarded even if
it would contain a valid subset of the online CPUs.

Implement a helper function which allows to avoid these issues.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Marc Zyngier <marc.zyngier@arm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Keith Busch <keith.busch@intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Christoph Hellwig <hch@lst.de>
Link: http://lkml.kernel.org/r/20170619235444.691345468@linutronix.de
2017-06-22 18:21:13 +02:00
Thomas Gleixner
087cdfb662 genirq/debugfs: Add proper debugfs interface
Debugging (hierarchical) interupt domains is tedious as there is no
information about the hierarchy and no information about states of
interrupts in the various domain levels.

Add a debugfs directory 'irq' and subdirectories 'domains' and 'irqs'.

The domains directory contains the domain files. The content is information
about the domain. If the domain is part of a hierarchy then the parent
domains are printed as well.

# ls /sys/kernel/debug/irq/domains/
default     INTEL-IR-2	    INTEL-IR-MSI-2  IO-APIC-IR-2  PCI-MSI
DMAR-MSI    INTEL-IR-3	    INTEL-IR-MSI-3  IO-APIC-IR-3  unknown-1
INTEL-IR-0  INTEL-IR-MSI-0  IO-APIC-IR-0    IO-APIC-IR-4  VECTOR
INTEL-IR-1  INTEL-IR-MSI-1  IO-APIC-IR-1    PCI-HT

# cat /sys/kernel/debug/irq/domains/VECTOR 
name:   VECTOR
 size:   0
 mapped: 216
 flags:  0x00000041

# cat /sys/kernel/debug/irq/domains/IO-APIC-IR-0 
name:   IO-APIC-IR-0
 size:   24
 mapped: 19
 flags:  0x00000041
 parent: INTEL-IR-3
    name:   INTEL-IR-3
     size:   65536
     mapped: 167
     flags:  0x00000041
     parent: VECTOR
        name:   VECTOR
         size:   0
         mapped: 216
         flags:  0x00000041

Unfortunately there is no per cpu information about the VECTOR domain (yet).

The irqs directory contains detailed information about mapped interrupts.

# cat /sys/kernel/debug/irq/irqs/3
handler:  handle_edge_irq
status:   0x00004000
istate:   0x00000000
ddepth:   1
wdepth:   0
dstate:   0x01018000
            IRQD_IRQ_DISABLED
            IRQD_SINGLE_TARGET
            IRQD_MOVE_PCNTXT
node:     0
affinity: 0-143
effectiv: 0
pending:  
domain:  IO-APIC-IR-0
 hwirq:   0x3
 chip:    IR-IO-APIC
  flags:   0x10
             IRQCHIP_SKIP_SET_WAKE
 parent:
    domain:  INTEL-IR-3
     hwirq:   0x20000
     chip:    INTEL-IR
      flags:   0x0
     parent:
        domain:  VECTOR
         hwirq:   0x3
         chip:    APIC
          flags:   0x0

This was developed to simplify the debugging of the managed affinity
changes.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Marc Zyngier <marc.zyngier@arm.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Keith Busch <keith.busch@intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Christoph Hellwig <hch@lst.de>
Link: http://lkml.kernel.org/r/20170619235444.537566163@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2017-06-22 18:21:13 +02:00
Thomas Gleixner
9dc6be3d41 genirq/irqdomain: Add map counter
Add a map counter instead of counting radix tree entries for
diagnosis. That also gives correct information for linear domains.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Marc Zyngier <marc.zyngier@arm.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Keith Busch <keith.busch@intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Christoph Hellwig <hch@lst.de>
Link: http://lkml.kernel.org/r/20170619235444.459397746@linutronix.de
2017-06-22 18:21:12 +02:00
Thomas Gleixner
d59f6617ee genirq: Allow fwnode to carry name information only
In order to provide proper debug interface it's required to have domain
names available when the domain is added. Non fwnode based architectures
like x86 have no way to do so.

It's not possible to use domain ops or host data for this as domain ops
might be the same for several instances, but the names have to be unique.

Extend the irqchip fwnode to allow transporting the domain name. If no node
is supplied, create a 'unknown-N' placeholder.

Warn if an invalid node is supplied and treat it like no node. This happens
e.g. with i2 devices on x86 which hand in an ACPI type node which has no
interface for retrieving the name.

[ Folded a fix from Marc to make DT name parsing work ]

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Marc Zyngier <marc.zyngier@arm.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Keith Busch <keith.busch@intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Christoph Hellwig <hch@lst.de>
Link: http://lkml.kernel.org/r/20170619235443.588784933@linutronix.de
2017-06-22 18:21:08 +02:00
Thomas Gleixner
0165308a2f genirq/msi: Prevent overwriting domain name
Prevent overwriting an already assigned domain name. Remove the extra check
for chip->name, because if domain->name is NULL overwriting it with NULL is
not a problem.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Marc Zyngier <marc.zyngier@arm.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Keith Busch <keith.busch@intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Christoph Hellwig <hch@lst.de>
Link: http://lkml.kernel.org/r/20170619235443.510684976@linutronix.de
2017-06-22 18:21:08 +02:00
Rafael J. Wysocki
d07ff6523b Merge branch 'uuid-types'
Merge branch 'uuid-types' from git://git.infradead.org/users/hch/uuid.git
to satisfy dependencies.
2017-06-22 16:28:35 +02:00
Frederic Weisbecker
387bc8b553 sched/fair: Spare idle load balancing on nohz_full CPUs
Although idle load balancing obviously only concerns idle CPUs, it can
be a disturbance on a busy nohz_full CPU. Indeed a CPU can only get rid
of an idle load balancing duty once a tick fires while it runs a task
and this can take a while on a nohz_full CPU.

We could fix that and escape the idle load balancing duty from the very
idle exit path but that would bring unecessary overhead. Lets just not
bother and leave that job to housekeeping CPUs (those outside nohz_full
range). The nohz_full CPUs simply don't want any disturbance.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1497838322-10913-4-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-06-22 11:30:02 +02:00
Frederic Weisbecker
a0db971e4e nohz: Move idle balancer registration to the idle path
The idle load balancing registration path assumes that we only stop the
tick when the CPU is idle, ignoring the nohz full case. As a result, a
nohz full CPU that is running a task may be chosen to perform idle load
balancing.

Lets make sure that only CPUs in dynticks idle mode can be picked as
idle load balancers.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1497838322-10913-3-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-06-22 11:30:01 +02:00
Frederic Weisbecker
3c85d6db5e sched/loadavg: Generalize "_idle" naming to "_nohz"
The loadavg naming code still assumes that nohz == idle whereas its code
is actually handling well both nohz idle and nohz full.

So lets fix the naming according to what the code actually does, to
unconfuse the reader.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1497838322-10913-2-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-06-22 11:30:01 +02:00
Ingo Molnar
f9e1698831 Merge branch 'linus' into locking/core, to pick up fixes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-06-22 10:19:14 +02:00
David S. Miller
3d09198243 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Two entries being added at the same time to the IFLA
policy table, whilst parallel bug fixes to decnet
routing dst handling overlapping with the dst gc removal
in net-next.

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-21 17:35:22 -04:00
Linus Torvalds
dcba71086e Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/livepatching
Pull livepatching fix from Jiri Kosina:
 "Fix the way how livepatches are being stacked with respect to RCU,
  from Petr Mladek"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/livepatching:
  livepatch: Fix stacking of patches with respect to RCU
2017-06-21 12:02:48 -07:00
Bartosz Golaszewski
30fd8fc5c9 irq/generic-chip: Provide devm_irq_setup_generic_chip()
Provide a resource managed variant of irq_setup_generic_chip().

Signed-off-by: Bartosz Golaszewski <brgl@bgdev.pl>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Marc Zyngier <marc.zyngier@arm.com>
Cc: linux-doc@vger.kernel.org
Cc: Jonathan Corbet <corbet@lwn.net>
Link: http://lkml.kernel.org/r/1496246820-13250-6-git-send-email-brgl@bgdev.pl
2017-06-21 15:53:11 +02:00
Bartosz Golaszewski
1c3e36309f irq/generic-chip: Provide devm_irq_alloc_generic_chip()
Provide a resource managed variant of irq_alloc_generic_chip().

Signed-off-by: Bartosz Golaszewski <brgl@bgdev.pl>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Marc Zyngier <marc.zyngier@arm.com>
Cc: linux-doc@vger.kernel.org
Cc: Jonathan Corbet <corbet@lwn.net>
Link: http://lkml.kernel.org/r/1496246820-13250-5-git-send-email-brgl@bgdev.pl
2017-06-21 15:53:11 +02:00
Bartosz Golaszewski
f160203986 irq/generic-chip: Export irq_init_generic_chip() locally
This function will be used in the devres variant of
irq_alloc_generic_chip().

Signed-off-by: Bartosz Golaszewski <brgl@bgdev.pl>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Marc Zyngier <marc.zyngier@arm.com>
Cc: linux-doc@vger.kernel.org
Cc: Jonathan Corbet <corbet@lwn.net>
Link: http://lkml.kernel.org/r/1496246820-13250-4-git-send-email-brgl@bgdev.pl
2017-06-21 15:53:11 +02:00
Hendrik Brueckner
8a1898db51 perf/aux: Correct return code of rb_alloc_aux() if !has_aux(ev)
If the event for which an AUX area is about to be allocated, does
not support setting up an AUX area, rb_alloc_aux() return -ENOTSUPP.

This error condition is being returned unfiltered to the user space,
and, for example, the perf tools fails with:

  failed to mmap with 524 (INTERNAL ERROR: strerror_r(524, 0x3fff497a1c8, 512)=22)

This error can be easily seen with "perf record -m 128,256 -e cpu-clock".

The 524 error code maps to -ENOTSUPP (in rb_alloc_aux()). The -ENOTSUPP
error code shall be only used within the kernel.  So the correct error
code would then be -EOPNOTSUPP.

With this commit, the perf tool then reports:

  failed to mmap with 95 (Operation not supported)

which is more clear.

Signed-off-by: Hendrik Brueckner <brueckner@linux.vnet.ibm.com>
Acked-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Pu Hou <bjhoupu@linux.vnet.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Thomas-Mich Richter <tmricht@linux.vnet.ibm.com>
Cc: acme@kernel.org
Cc: linux-s390@vger.kernel.org
Link: http://lkml.kernel.org/r/1497954399-6355-1-git-send-email-brueckner@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-06-21 11:58:30 +02:00
Thomas Gleixner
17d9d6875c Merge branch 'fortglx/4.13/time' of https://git.linaro.org/people/john.stultz/linux into timers/core
Merge time(keeping) updates from John Stultz:

  "Just a small set of changes, the biggest changes being the MONOTONIC_RAW
   handling cleanup, and a new kselftest from Miroslav. Also a a clear
   warning deprecating CONFIG_GENERIC_TIME_VSYSCALL_OLD, which affects ppc
   and ia64."
2017-06-21 09:08:13 +02:00
Thomas Gleixner
f0cd9ae5d0 Merge branch 'timers/urgent' into timers/core
Pick up dependent changes.
2017-06-21 09:07:52 +02:00
John Stultz
369adf04d8 time: Add warning about imminent deprecation of CONFIG_GENERIC_TIME_VSYSCALL_OLD
CONFIG_GENERIC_TIME_VSYSCALL_OLD was introduced five years ago
to allow a transition from the old vsyscall implementations to
the new method (which simplified internal accounting and made
timekeeping more precise).

However, PPC and IA64 have yet to make the transition, despite
in some cases me sending test patches to try to help it along.

http://patches.linaro.org/patch/30501/
http://patches.linaro.org/patch/35412/

If its helpful, my last pass at the patches can be found here:
https://git.linaro.org/people/john.stultz/linux.git dev/oldvsyscall-cleanup

So I think its time to set a deadline and make it clear this
is going away. So this patch adds warnings about this
functionality being dropped. Likely to be in v4.15.

Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Miroslav Lichvar <mlichvar@redhat.com>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Anton Blanchard <anton@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Signed-off-by: John Stultz <john.stultz@linaro.org>
2017-06-20 22:14:10 -07:00
John Stultz
fc6eead7c1 time: Clean up CLOCK_MONOTONIC_RAW time handling
Now that we fixed the sub-ns handling for CLOCK_MONOTONIC_RAW,
remove the duplicitive tk->raw_time.tv_nsec, which can be
stored in tk->tkr_raw.xtime_nsec (similarly to how its handled
for monotonic time).

Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Miroslav Lichvar <mlichvar@redhat.com>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Stephen Boyd <stephen.boyd@linaro.org>
Cc: Kevin Brodsky <kevin.brodsky@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Daniel Mentz <danielmentz@google.com>
Tested-by: Daniel Mentz <danielmentz@google.com>
Signed-off-by: John Stultz <john.stultz@linaro.org>
2017-06-20 22:13:59 -07:00
Thomas Gleixner
b50fb7c992 Merge branch 'linus' into irq/core
Get upstream changes so pending patches won't conflict.
2017-06-20 22:08:32 +02:00
Thomas Gleixner
098b0e01a9 posix-cpu-timers: Make timespec to nsec conversion safe
The expiry time of a posix cpu timer is supplied through sys_timer_set()
via a struct timespec. The timespec is validated for correctness.

In the actual set timer implementation the timespec is converted to a
scalar nanoseconds value. If the tv_sec part of the time spec is large
enough the conversion to nanoseconds (sec * NSEC_PER_SEC) overflows 64bit.

Mitigate that by using the timespec_to_ktime() conversion function, which
checks the tv_sec part for a potential mult overflow and clamps the result
to KTIME_MAX, which is about 292 years. 

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Xishi Qiu <qiuxishi@huawei.com>
Cc: John Stultz <john.stultz@linaro.org>
Link: http://lkml.kernel.org/r/20170620154113.588276707@linutronix.de
2017-06-20 21:33:56 +02:00
Thomas Gleixner
35eb7258c0 itimer: Make timeval to nsec conversion range limited
The expiry time of a itimer is supplied through sys_setitimer() via a
struct timeval. The timeval is validated for correctness.

In the actual set timer implementation the timeval is converted to a
scalar nanoseconds value. If the tv_sec part of the time spec is large
enough the conversion to nanoseconds (sec * NSEC_PER_SEC) overflows 64bit.

Mitigate that by using the timeval_to_ktime() conversion function, which
checks the tv_sec part for a potential mult overflow and clamps the result
to KTIME_MAX, which is about 292 years. 

Reported-by: Xishi Qiu <qiuxishi@huawei.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: John Stultz <john.stultz@linaro.org>
Link: http://lkml.kernel.org/r/20170620154113.505981643@linutronix.de
2017-06-20 21:33:56 +02:00
Peter Meerwald-Stadler
d15bc69aff timers: Fix parameter description of try_to_del_timer_sync()
Signed-off-by: Peter Meerwald-Stadler <pmeerw@pmeerw.net>
Link: http://lkml.kernel.org/r/20170530194103.7454-1-pmeerw@pmeerw.net
Cc: John Stultz <john.stultz@linaro.org>
Cc: trivial@rustcorp.com.au
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2017-06-20 21:33:55 +02:00
Davidlohr Bueso
f11cc0760b sched/core: Drop the unused try_get_task_struct() helper function
This function was introduced by:

  150593bf86 ("sched/api: Introduce task_rcu_dereference() and try_get_task_struct()")

... to allow easier usage of task_rcu_dereference(), however no users
were ever added. Drop the helper.

Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dave@stgolabs.net
Link: http://lkml.kernel.org/r/20170615023730.22827-1-dave@stgolabs.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-06-20 12:48:37 +02:00
Ingo Molnar
902b319413 Merge branch 'WIP.sched/core' into sched/core
Conflicts:
	kernel/sched/Makefile

Pick up the waitqueue related renames - it didn't get much feedback,
so it appears to be uncontroversial. Famous last words? ;-)

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-06-20 12:28:21 +02:00
Daniel Axtens
c5ae366e12 sched/fair: WARN() and refuse to set buddy when !se->on_rq
If we set a next or last buddy for a se that is not on_rq, we will
end up taking a NULL pointer dereference in wakeup_preempt_entity
via pick_next_task_fair.

Detect when we would be about to do that, throw a warning and
then refuse to actually set it.

This has been suggested at least twice:

  https://marc.info/?l=linux-kernel&m=146651668921468&w=2
  https://lkml.org/lkml/2016/6/16/663

I recently had to debug a problem with these (we hadn't backported
Konstantin's patches in this area) and this would have saved a lot
of time/pain.

Just do it.

Signed-off-by: Daniel Axtens <dja@axtens.net>
Cc: Ben Segall <bsegall@google.com>
Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20170510201139.16236-1-dja@axtens.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-06-20 12:26:52 +02:00
Ingo Molnar
6d3aed3d8a sched/debug: Fix SCHED_WARN_ON() to return a value on !CONFIG_SCHED_DEBUG as well
This definition of SCHED_WARN_ON():

 #define SCHED_WARN_ON(x)        ((void)(x))

is not fully compatible with the 'real' WARN_ON_ONCE() primitive, as it
has no return value, so it cannot be used in conditionals.

Fix it.

Cc: Daniel Axtens <dja@axtens.net>
Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-06-20 12:26:52 +02:00
Ingo Molnar
2055da9738 sched/wait: Disambiguate wq_entry->task_list and wq_head->task_list naming
So I've noticed a number of instances where it was not obvious from the
code whether ->task_list was for a wait-queue head or a wait-queue entry.

Furthermore, there's a number of wait-queue users where the lists are
not for 'tasks' but other entities (poll tables, etc.), in which case
the 'task_list' name is actively confusing.

To clear this all up, name the wait-queue head and entry list structure
fields unambiguously:

	struct wait_queue_head::task_list	=> ::head
	struct wait_queue_entry::task_list	=> ::entry

For example, this code:

	rqw->wait.task_list.next != &wait->task_list

... is was pretty unclear (to me) what it's doing, while now it's written this way:

	rqw->wait.head.next != &wait->entry

... which makes it pretty clear that we are iterating a list until we see the head.

Other examples are:

	list_for_each_entry_safe(pos, next, &x->task_list, task_list) {
	list_for_each_entry(wq, &fence->wait.task_list, task_list) {

... where it's unclear (to me) what we are iterating, and during review it's
hard to tell whether it's trying to walk a wait-queue entry (which would be
a bug), while now it's written as:

	list_for_each_entry_safe(pos, next, &x->head, entry) {
	list_for_each_entry(wq, &fence->wait.head, entry) {

Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-06-20 12:19:14 +02:00
Ingo Molnar
5822a454d6 sched/wait: Move bit_wait_table[] and related functionality from sched/core.c to sched/wait_bit.c
The key hashed waitqueue data structures and their initialization
was done in the main scheduler file for no good reason, move them
to sched/wait_bit.c instead.

Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-06-20 12:19:14 +02:00
Ingo Molnar
5dd43ce2f6 sched/wait: Split out the wait_bit*() APIs from <linux/wait.h> into <linux/wait_bit.h>
The wait_bit*() types and APIs are mixed into wait.h, but they
are a pretty orthogonal extension of wait-queues.

Furthermore, only about 50 kernel files use these APIs, while
over 1000 use the regular wait-queue functionality.

So clean up the main wait.h by moving the wait-bit functionality
out of it, into a separate .h and .c file:

  include/linux/wait_bit.h  for types and APIs
  kernel/sched/wait_bit.c   for the implementation

Update all header dependencies.

This reduces the size of wait.h rather significantly, by about 30%.

Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-06-20 12:19:09 +02:00
Ingo Molnar
76c85ddc46 sched/wait: Standardize wait_bit_queue naming
So wait-bit-queue head variables are often named:

	struct wait_bit_queue *q

... which is a bit ambiguous and super confusing, because
they clearly suggest wait-queue head semantics and behavior
(they rhyme with the old wait_queue_t *q naming), while they
are extended wait-queue _entries_, not heads!

They are misnomers in two ways:

 - the 'wait_bit_queue' leaves open the question of whether
   it's an entry or a head

 - the 'q' parameter and local variable naming falsely implies
   that it's a 'queue' - while it's an entry.

This resulted in sometimes confusing cases such as:

	finish_wait(wq, &q->wait);

where the 'q' is not a wait-queue head, but a wait-bit-queue entry.

So improve this all by standardizing wait-bit-queue nomenclature
similar to wait-queue head naming:

	struct wait_bit_queue   => struct wait_bit_queue_entry
	q			=> wbq_entry

Which makes it all a much clearer:

	struct wait_bit_queue_entry *wbq_entry

... and turns the former confusing piece of code into:

	finish_wait(wq_head, &wbq_entry->wq_entry;

which IMHO makes it apparently clear what we are doing,
without having to analyze the context of the code: we are
adding a wait-queue entry to a regular wait-queue head,
which entry is embedded in a wait-bit-queue entry.

I'm not a big fan of acronyms, but repeating wait_bit_queue_entry
in field and local variable names is too long, so Hopefully it's
clear enough that 'wq_' prefixes stand for wait-queues, while
'wbq_' prefixes stand for wait-bit-queues.

Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-06-20 12:18:29 +02:00
Ingo Molnar
2141713616 sched/wait: Standardize 'struct wait_bit_queue' wait-queue entry field name
Rename 'struct wait_bit_queue::wait' to ::wq_entry, to more clearly
name it as a wait-queue entry.

Propagate it to a couple of usage sites where the wait-bit-queue internals
are exposed.

Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-06-20 12:18:28 +02:00
Ingo Molnar
9d9d676f59 sched/wait: Standardize internal naming of wait-queue heads
The wait-queue head parameters and variables are named in a
couple of ways, we have the following variants currently:

	wait_queue_head_t *q
	wait_queue_head_t *wq
	wait_queue_head_t *head

In particular the 'wq' naming is ambiguous in the sense whether it's
a wait-queue head or entry name - as entries were often named 'wait'.

( Not to mention the confusion of any readers coming over from
  workqueue-land. )

Standardize all this around a single, unambiguous parameter and
variable name:

	struct wait_queue_head *wq_head

which is easy to grep for and also rhymes nicely with the wait-queue
entry naming:

	struct wait_queue_entry *wq_entry

Also rename:

	struct __wait_queue_head => struct wait_queue_head

... and use this struct type to migrate from typedefs usage to 'struct'
usage, which is more in line with existing kernel practices.

Don't touch any external users and preserve the main wait_queue_head_t
typedef.

Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-06-20 12:18:28 +02:00
Ingo Molnar
50816c4899 sched/wait: Standardize internal naming of wait-queue entries
So the various wait-queue entry variables in include/linux/wait.h
and kernel/sched/wait.c are named in a colorfully inconsistent
way:

	wait_queue_entry_t *wait
	wait_queue_entry_t *__wait	(even in plain C code!)
	wait_queue_entry_t *q		(!)
	wait_queue_entry_t *new		(making anyone who knows C++ cringe)
	wait_queue_entry_t *old

I think part of the reason for the inconsistency is the constant
apparent confusion about what a wait queue 'head' versus 'entry' is.

( Some of the documentation talks about a 'wait descriptor', which is
  the wait-queue entry itself - further adding to the confusion. )

The most common name is 'wait', but that in itself is somewhat
ambiguous as well, as it does not really make it clear whether
it's a wait-queue entry or head.

To improve all this name the wait-queue entry structure parameters
and variables consistently and push through this naming into all
the wait.h and wait.c code:

	struct wait_queue_entry *wq_entry

The 'wq_' prefix makes it easy to grep for, and we also use the
opportunity to move away from the typedef to a plain 'struct' naming:
in the kernel we typically reserve typedefs for cases where a
C structure is really small and somewhat opaque - such as pte_t.

wait-queue entries are neither small nor opaque, so use the more
standard 'struct xxx_entry' list management code nomenclature instead.

( We don't touch external users, and we preserve the typedef as well
  for actual wait-queue users, to reduce unnecessary churn. )

Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-06-20 12:18:27 +02:00
Ingo Molnar
ac6424b981 sched/wait: Rename wait_queue_t => wait_queue_entry_t
Rename:

	wait_queue_t		=>	wait_queue_entry_t

'wait_queue_t' was always a slight misnomer: its name implies that it's a "queue",
but in reality it's a queue *entry*. The 'real' queue is the wait queue head,
which had to carry the name.

Start sorting this out by renaming it to 'wait_queue_entry_t'.

This also allows the real structure name 'struct __wait_queue' to
lose its double underscore and become 'struct wait_queue_entry',
which is the more canonical nomenclature for such data types.

Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-06-20 12:18:27 +02:00
Levin, Alexander (Sasha Levin)
cde50a6739 locking/rtmutex: Don't initialize lockdep when not required
pi_mutex isn't supposed to be tracked by lockdep, but just
passing NULLs for name and key will cause lockdep to spew a
warning and die, which is not what we want it to do.

Skip lockdep initialization if the caller passed NULLs for
name and key, suggesting such initialization isn't desired.

Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: f5694788ad ("rt_mutex: Add lockdep annotations")
Link: http://lkml.kernel.org/r/20170618140548.4763-1-alexander.levin@verizon.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-06-20 11:53:09 +02:00
Ingo Molnar
2eb0fc9bfe Linux 4.12-rc6
-----BEGIN PGP SIGNATURE-----
 
 iQEbBAABAgAGBQJZR92EAAoJEHm+PkMAQRiGT1oH+KW2FLrRaYxtut+KyGA6l7tc
 R/hFx1n9BibkjXeqD+y6/4SjRTe6/pT8Zkihv3/19eZ5algUWeQa0Hm+/455sl58
 IdIXx/pzuCO3kqR3//fP9ZFD657GNDsuQ0fYnZESItFwiWQtO1TNfZD6KQjkqBdI
 L3MVhDUVBZA2ZtPwC4ERei5/sToV9woykKoJ/A3+OkWjgX6w4SBimqgkSEFk4uE8
 xS0pycyDZci93rJlECi1UueewdODTKSmhwdC80qvGEiYXzsC6UFtaF0Fj66XO1AL
 UMjxqI/gkm5ZuCIjRsmPmJjgL5q1RT6UX/qtw9yn71XTmcgMiPW2/DF/v/OaTg==
 =XbW2
 -----END PGP SIGNATURE-----

Merge tag 'v4.12-rc6' into perf/core, to pick up fixes

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-06-20 10:48:44 +02:00
Petr Mladek
842c088464 livepatch: Fix stacking of patches with respect to RCU
rcu_read_(un)lock(), list_*_rcu(), and synchronize_rcu() are used for a secure
access and manipulation of the list of patches that modify the same function.
In particular, it is the variable func_stack that is accessible from the ftrace
handler via struct ftrace_ops and klp_ops.

Of course, it synchronizes also some states of the patch on the top of the
stack, e.g. func->transition in klp_ftrace_handler.

At the same time, this mechanism guards also the manipulation of
task->patch_state. It is modified according to the state of the transition and
the state of the process.

Now, all this works well as long as RCU works well. Sadly livepatching might
get into some corner cases when this is not true. For example, RCU is not
watching when rcu_read_lock() is taken in idle threads.  It is because they
might sleep and prevent reaching the grace period for too long.

There are ways how to make RCU watching even in idle threads, see
rcu_irq_enter(). But there is a small location inside RCU infrastructure when
even this does not work.

This small problematic location can be detected either before calling
rcu_irq_enter() by rcu_irq_enter_disabled() or later by rcu_is_watching().
Sadly, there is no safe way how to handle it.  Once we detect that RCU was not
watching, we might see inconsistent state of the function stack and the related
variables in klp_ftrace_handler(). Then we could do a wrong decision, use an
incompatible implementation of the function and break the consistency of the
system. We could warn but we could not avoid the damage.

Fortunately, ftrace has similar problems and they seem to be solved well there.
It uses a heavy weight implementation of some RCU operations. In particular, it
replaces:

  + rcu_read_lock() with preempt_disable_notrace()
  + rcu_read_unlock() with preempt_enable_notrace()
  + synchronize_rcu() with schedule_on_each_cpu(sync_work)

My understanding is that this is RCU implementation from a stone age. It meets
the core RCU requirements but it is rather ineffective. Especially, it does not
allow to batch or speed up the synchronize calls.

On the other hand, it is very trivial. It allows to safely trace and/or
livepatch even the RCU core infrastructure.  And the effectiveness is a not a
big issue because using ftrace or livepatches on productive systems is a rare
operation.  The safety is much more important than a negligible extra load.

Note that the alternative implementation follows the RCU principles. Therefore,
     we could and actually must use list_*_rcu() variants when manipulating the
     func_stack.  These functions allow to access the pointers in the right
     order and with the right barriers. But they do not use any other
     information that would be set only by rcu_read_lock().

Also note that there are actually two problems solved in ftrace:

First, it cares about the consistency of RCU read sections.  It is being solved
the way as described and used in this patch.

Second, ftrace needs to make sure that nobody is inside the dynamic trampoline
when it is being freed. For this, it also calls synchronize_rcu_tasks() in
preemptive kernel in ftrace_shutdown().

Livepatch has similar problem but it is solved by ftrace for free.
klp_ftrace_handler() is a good guy and never sleeps. In addition, it is
registered with FTRACE_OPS_FL_DYNAMIC. It causes that
unregister_ftrace_function() calls:

	* schedule_on_each_cpu(ftrace_sync) - always
	* synchronize_rcu_tasks() - in preemptive kernel

The effect is that nobody is neither inside the dynamic trampoline nor inside
the ftrace handler after unregister_ftrace_function() returns.

[jkosina@suse.cz: reformat changelog, fix comment]
Signed-off-by: Petr Mladek <pmladek@suse.com>
Acked-by: Josh Poimboeuf <jpoimboe@redhat.com>
Acked-by: Miroslav Benes <mbenes@suse.cz>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2017-06-20 10:42:19 +02:00
John Stultz
3d88d56c58 time: Fix CLOCK_MONOTONIC_RAW sub-nanosecond accounting
Due to how the MONOTONIC_RAW accumulation logic was handled,
there is the potential for a 1ns discontinuity when we do
accumulations. This small discontinuity has for the most part
gone un-noticed, but since ARM64 enabled CLOCK_MONOTONIC_RAW
in their vDSO clock_gettime implementation, we've seen failures
with the inconsistency-check test in kselftest.

This patch addresses the issue by using the same sub-ns
accumulation handling that CLOCK_MONOTONIC uses, which avoids
the issue for in-kernel users.

Since the ARM64 vDSO implementation has its own clock_gettime
calculation logic, this patch reduces the frequency of errors,
but failures are still seen. The ARM64 vDSO will need to be
updated to include the sub-nanosecond xtime_nsec values in its
calculation for this issue to be completely fixed.

Signed-off-by: John Stultz <john.stultz@linaro.org>
Tested-by: Daniel Mentz <danielmentz@google.com>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Kevin Brodsky <kevin.brodsky@arm.com>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Stephen Boyd <stephen.boyd@linaro.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: "stable #4 . 8+" <stable@vger.kernel.org>
Cc: Miroslav Lichvar <mlichvar@redhat.com>
Link: http://lkml.kernel.org/r/1496965462-20003-3-git-send-email-john.stultz@linaro.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2017-06-20 10:41:50 +02:00
John Stultz
ceea5e3771 time: Fix clock->read(clock) race around clocksource changes
In tests, which excercise switching of clocksources, a NULL
pointer dereference can be observed on AMR64 platforms in the
clocksource read() function:

u64 clocksource_mmio_readl_down(struct clocksource *c)
{
	return ~(u64)readl_relaxed(to_mmio_clksrc(c)->reg) & c->mask;
}

This is called from the core timekeeping code via:

	cycle_now = tkr->read(tkr->clock);

tkr->read is the cached tkr->clock->read() function pointer.
When the clocksource is changed then tkr->clock and tkr->read
are updated sequentially. The code above results in a sequential
load operation of tkr->read and tkr->clock as well.

If the store to tkr->clock hits between the loads of tkr->read
and tkr->clock, then the old read() function is called with the
new clock pointer. As a consequence the read() function
dereferences a different data structure and the resulting 'reg'
pointer can point anywhere including NULL.

This problem was introduced when the timekeeping code was
switched over to use struct tk_read_base. Before that, it was
theoretically possible as well when the compiler decided to
reload clock in the code sequence:

     now = tk->clock->read(tk->clock);

Add a helper function which avoids the issue by reading
tk_read_base->clock once into a local variable clk and then issue
the read function via clk->read(clk). This guarantees that the
read() function always gets the proper clocksource pointer handed
in.

Since there is now no use for the tkr.read pointer, this patch
also removes it, and to address stopping the fast timekeeper
during suspend/resume, it introduces a dummy clocksource to use
rather then just a dummy read function.

Signed-off-by: John Stultz <john.stultz@linaro.org>
Acked-by: Ingo Molnar <mingo@kernel.org>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Stephen Boyd <stephen.boyd@linaro.org>
Cc: stable <stable@vger.kernel.org>
Cc: Miroslav Lichvar <mlichvar@redhat.com>
Cc: Daniel Mentz <danielmentz@google.com>
Link: http://lkml.kernel.org/r/1496965462-20003-2-git-send-email-john.stultz@linaro.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2017-06-20 10:41:50 +02:00
Andreas Schwab
204a2be30a m68k: Remove ptrace_signal_deliver
This fixes debugger syscall restart interactions.  A debugger that
modifies the tracee's program counter is expected to set the orig_d0
pseudo register to -1, to disable a possible syscall restart.

This removes the last user of the ptrace_signal_deliver hook in the ptrace
signal handling, so remove that as well.

Signed-off-by: Andreas Schwab <schwab@linux-m68k.org>
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
2017-06-19 19:41:48 +02:00
Linus Torvalds
4f51d57f3f Merge branch 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer fixes from Thomas Gleixner:
 "Three fixlets for timers:

   - Two hot-fixes for the alarmtimer based posix timers, which prevent
     a nasty DOS by self rescheduling timers. The proper cleanup of that
     mess is queued for 4.13

   - Make a function static"

* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  tick/broadcast: Make tick_broadcast_setup_oneshot() static
  alarmtimer: Rate limit periodic intervals
  alarmtimer: Prevent overflow of relative timers
2017-06-18 18:46:51 +09:00
Linus Torvalds
0be5255c88 Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fixes from Thomas Gleixner:
 "Two small fixes for the schedulre core:

   - Use the proper switch_mm() variant in idle_task_exit() because that
     code is not called with interrupts disabled.

   - Fix a confusing typo in a printk"

* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/core: Idle_task_exit() shouldn't use switch_mm_irqs_off()
  sched/fair: Fix typo in printk message
2017-06-18 18:45:17 +09:00
Linus Torvalds
2277ba7cfd Merge branch 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irq fix from Thomas Gleixner:
 "Add a missing resource release to an error path"

* 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  genirq: Release resources in __setup_irq() error path
2017-06-18 18:40:41 +09:00
Eric W. Biederman
57db7e4a2d signal: Only reschedule timers on signals timers have sent
Thomas Gleixner  wrote:
> The CRIU support added a 'feature' which allows a user space task to send
> arbitrary (kernel) signals to itself. The changelog says:
>
>   The kernel prevents sending of siginfo with positive si_code, because
>   these codes are reserved for kernel.  I think we can allow a task to
>   send such a siginfo to itself.  This operation should not be dangerous.
>
> Quite contrary to that claim, it turns out that it is outright dangerous
> for signals with info->si_code == SI_TIMER. The following code sequence in
> a user space task allows to crash the kernel:
>
>    id = timer_create(CLOCK_XXX, ..... signo = SIGX);
>    timer_set(id, ....);
>    info->si_signo = SIGX;
>    info->si_code = SI_TIMER:
>    info->_sifields._timer._tid = id;
>    info->_sifields._timer._sys_private = 2;
>    rt_[tg]sigqueueinfo(..., SIGX, info);
>    sigemptyset(&sigset);
>    sigaddset(&sigset, SIGX);
>    rt_sigtimedwait(sigset, info);
>
> For timers based on CLOCK_PROCESS_CPUTIME_ID, CLOCK_THREAD_CPUTIME_ID this
> results in a kernel crash because sigwait() dequeues the signal and the
> dequeue code observes:
>
>   info->si_code == SI_TIMER && info->_sifields._timer._sys_private != 0
>
> which triggers the following callchain:
>
>  do_schedule_next_timer() -> posix_cpu_timer_schedule() -> arm_timer()
>
> arm_timer() executes a list_add() on the timer, which is already armed via
> the timer_set() syscall. That's a double list add which corrupts the posix
> cpu timer list. As a consequence the kernel crashes on the next operation
> touching the posix cpu timer list.
>
> Posix clocks which are internally implemented based on hrtimers are not
> affected by this because hrtimer_start() can handle already armed timers
> nicely, but it's a reliable way to trigger the WARN_ON() in
> hrtimer_forward(), which complains about calling that function on an
> already armed timer.

This problem has existed since the posix timer code was merged into
2.5.63. A few releases earlier in 2.5.60 ptrace gained the ability to
inject not just a signal (which linux has supported since 1.0) but the
full siginfo of a signal.

The core problem is that the code will reschedule in response to
signals getting dequeued not just for signals the timers sent but
for other signals that happen to a si_code of SI_TIMER.

Avoid this confusion by testing to see if the queued signal was
preallocated as all timer signals are preallocated, and so far
only the timer code preallocates signals.

Move the check for if a timer needs to be rescheduled up into
collect_signal where the preallocation check must be performed,
and pass the result back to dequeue_signal where the code reschedules
timers.   This makes it clear why the code cares about preallocated
timers.

Cc: stable@vger.kernel.org
Reported-by: Thomas Gleixner <tglx@linutronix.de>
History Tree: https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git
Reference: 66dd34ad31 ("signal: allow to send any siginfo to itself")
Reference: 1669ce53e2ff ("Add PTRACE_GETSIGINFO and PTRACE_SETSIGINFO")
Fixes: db8b50ba75f2 ("[PATCH] POSIX clocks & timers")
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2017-06-17 12:20:22 -05:00
Paul Moore
cd33f5f2cb audit: make sure we never skip the multicast broadcast
When the auditd connection is reset, either intentionally or due to
a failure, any records that were in the main backlog queue would not
be sent in a multicast broadcast.  This patch fixes this problem by
not flushing the main backlog queue on a connection reset, the main
kauditd_thread() will take care of that normally.

Resolves: https://github.com/linux-audit/audit-kernel/issues/41
Reviewed-by: Richard Guy Briggs <rgb@redhat.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
2017-06-16 11:51:00 -04:00
David S. Miller
0ddead90b2 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
The conflicts were two cases of overlapping changes in
batman-adv and the qed driver.

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-15 11:59:32 -04:00
Rafael J. Wysocki
f63e4f7d41 Merge branches 'pm-cpufreq', 'pm-cpuidle' and 'pm-devfreq'
* pm-cpufreq:
  cpufreq: conservative: Allow down_threshold to take values from 1 to 10
  Revert "cpufreq: schedutil: Reduce frequencies slower"

* pm-cpuidle:
  cpuidle: dt: Add missing 'of_node_put()'

* pm-devfreq:
  PM / devfreq: exynos-ppmu: Staticize event list
  PM / devfreq: exynos-ppmu: Handle return value of clk_prepare_enable
  PM / devfreq: exynos-nocp: Handle return value of clk_prepare_enable
2017-06-15 01:51:33 +02:00
Rafael J. Wysocki
33e4f80ee6 ACPI / PM: Ignore spurious SCI wakeups from suspend-to-idle
The ACPI SCI (System Control Interrupt) is set up as a wakeup IRQ
during suspend-to-idle transitions and, consequently, any events
signaled through it wake up the system from that state.  However,
on some systems some of the events signaled via the ACPI SCI while
suspended to idle should not cause the system to wake up.  In fact,
quite often they should just be discarded.

Arguably, systems should not resume entirely on such events, but in
order to decide which events really should cause the system to resume
and which are spurious, it is necessary to resume up to the point
when ACPI SCIs are actually handled and processed, which is after
executing dpm_resume_noirq() in the system resume path.

For this reasons, add a loop around freeze_enter() in which the
platforms can process events signaled via multiplexed IRQ lines
like the ACPI SCI and add suspend-to-idle hooks that can be
used for this purpose to struct platform_freeze_ops.

In the ACPI case, the ->wake hook is used for checking if the SCI
has triggered while suspended and deferring the interrupt-induced
system wakeup until the events signaled through it are actually
processed sufficiently to decide whether or not the system should
resume.  In turn, the ->sync hook allows all of the relevant event
queues to be flushed so as to prevent events from being missed due
to race conditions.

In addition to that, some ACPI code processing wakeup events needs
to be modified to use the "hard" version of wakeup triggers, so that
it will cause a system resume to happen on device-induced wakeup
events even if the "soft" mechanism to prevent the system from
suspending is not enabled.  However, to preserve the existing
behavior with respect to suspend-to-RAM, this only is done in
the suspend-to-idle case and only if an SCI has occurred while
suspended.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2017-06-15 00:55:44 +02:00
Tejun Heo
b6053d40e3 cgroup: fix lockdep warning in debug controller
The debug controller grabs cgroup_mutex from interface file show
functions which can deadlock and triggers lockdep warnings.  Fix it by
using cgroup_kn_lock_live()/cgroup_kn_unlock() instead.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Waiman Long <longman@redhat.com>
2017-06-14 16:01:41 -04:00
Tejun Heo
2866c0b4cf cgroup: refactor cgroup_masks_read() in the debug controller
Factor out cgroup_masks_read_one() out of cgroup_masks_read() for
simplicity.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Waiman Long <longman@redhat.com>
2017-06-14 16:01:36 -04:00
Tejun Heo
8cc38fa7fa cgroup: make debug an implicit controller on cgroup2
Make debug an implicit controller on cgroup2 which is enabled by
"cgroup_debug" boot param.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Waiman Long <longman@redhat.com>
2017-06-14 16:01:32 -04:00
Waiman Long
575313f40f cgroup: Make debug cgroup support v2 and thread mode
Besides supporting cgroup v2 and thread mode, the following changes
are also made:
 1) current_* cgroup files now resides only at the root as we don't
    need duplicated files of the same function all over the cgroup
    hierarchy.
 2) The cgroup_css_links_read() function is modified to report
    the number of tasks that are skipped because of overflow.
 3) The number of extra unaccounted references are displayed.
 4) The current_css_set_read() function now prints out the addresses of
    the css'es associated with the current css_set.
 5) A new cgroup_subsys_states file is added to display the css objects
    associated with a cgroup.
 6) A new cgroup_masks file is added to display the various controller
    bit masks in the cgroup.

tj: Dropped thread mode related information for now so that debug
    controller changes aren't blocked on the thread mode.

Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2017-06-14 16:01:21 -04:00
Waiman Long
23b0be480f cgroup: Make Kconfig prompt of debug cgroup more accurate
The Kconfig prompt and description of the debug cgroup controller
more accurate by saying that it is for debug purpose only and its
interfaces are unstable.

Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2017-06-14 16:01:21 -04:00
Waiman Long
a28f8f5e99 cgroup: Move debug cgroup to its own file
The debug cgroup currently resides within cgroup-v1.c and is enabled
only for v1 cgroup. To enable the debug cgroup also for v2, it makes
sense to put the code into its own file as it will no longer be v1
specific. There is no change to the debug cgroup specific code.

Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2017-06-14 16:01:21 -04:00
Waiman Long
73a7242a06 cgroup: Keep accurate count of tasks in each css_set
The reference count in the css_set data structure was used as a
proxy of the number of tasks attached to that css_set. However, that
count is actually not an accurate measure especially with thread mode
support. So a new variable nr_tasks is added to the css_set to keep
track of the actual task count. This new variable is protected by
the css_set_lock. Functions that require the actual task count are
updated to use the new variable.

tj: s/task_count/nr_tasks/ for consistency with cgroup_root->nr_cgrps.
    Refreshed on top of cgroup/for-v4.13 which dropped on
    css_set_populated() -> nr_tasks conversion.

Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2017-06-14 16:01:21 -04:00
Yonghong Song
31fd85816d bpf: permits narrower load from bpf program context fields
Currently, verifier will reject a program if it contains an
narrower load from the bpf context structure. For example,
        __u8 h = __sk_buff->hash, or
        __u16 p = __sk_buff->protocol
        __u32 sample_period = bpf_perf_event_data->sample_period
which are narrower loads of 4-byte or 8-byte field.

This patch solves the issue by:
  . Introduce a new parameter ctx_field_size to carry the
    field size of narrower load from prog type
    specific *__is_valid_access validator back to verifier.
  . The non-zero ctx_field_size for a memory access indicates
    (1). underlying prog type specific convert_ctx_accesses
         supporting non-whole-field access
    (2). the current insn is a narrower or whole field access.
  . In verifier, for such loads where load memory size is
    less than ctx_field_size, verifier transforms it
    to a full field load followed by proper masking.
  . Currently, __sk_buff and bpf_perf_event_data->sample_period
    are supporting narrowing loads.
  . Narrower stores are still not allowed as typical ctx stores
    are just normal stores.

Because of this change, some tests in verifier will fail and
these tests are removed. As a bonus, rename some out of bound
__sk_buff->cb access to proper field name and remove two
redundant "skb cb oob" tests.

Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-14 14:56:25 -04:00
Thomas Gleixner
938e7cf2d5 posix-timers: Make nanosleep timespec argument const
No nanosleep implementation modifies the rqtp argument. Mark is const.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Peter Zijlstra <peterz@infradead.org>
2017-06-14 00:00:47 +02:00
Thomas Gleixner
343d8fc208 posix-cpu-timers: Avoid timespec conversion in do_cpu_nanosleep()
No point in converting the expiry time back and forth.

No point either to update the value in the caller supplied variable. mark
the rqtp argument const.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Peter Zijlstra <peterz@infradead.org>
2017-06-14 00:00:46 +02:00
Al Viro
2b2d02856b time: Move compat_gettimeofday()/settimeofday() to native
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20170607084241.28657-16-viro@ZenIV.linux.org.uk
2017-06-14 00:00:46 +02:00
Al Viro
b180db2c8c time: Move compat_time()/stime() to native
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20170607084241.28657-15-viro@ZenIV.linux.org.uk
2017-06-14 00:00:45 +02:00
Al Viro
2482097c6c posix-timers: Move compat_timer_create() to native, get rid of set_fs()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20170607084241.28657-14-viro@ZenIV.linux.org.uk
2017-06-14 00:00:45 +02:00
Al Viro
d822cdcce4 posix-timers: Move compat versions of clock_gettime/settime/getres
Move them to the native implementations and get rid of the set_fs() hackery.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20170607084241.28657-13-viro@ZenIV.linux.org.uk
2017-06-14 00:00:44 +02:00
Al Viro
54ad9c46c2 itimers: Move compat itimer syscalls to native ones
get rid of set_fs(), sanitize compat copyin/copyout.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20170607084241.28657-12-viro@ZenIV.linux.org.uk
2017-06-14 00:00:44 +02:00
Al Viro
b0dc12426e posix-timers: Take compat timer_gettime(2) to native one
... and get rid of set_fs() in there

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20170607084241.28657-11-viro@ZenIV.linux.org.uk
2017-06-14 00:00:43 +02:00
Al Viro
1acbe7708b posix-timers: Take compat timer_settime(2) to native one
... and get rid of set_fs() in there

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20170607084241.28657-10-viro@ZenIV.linux.org.uk
2017-06-14 00:00:43 +02:00
Al Viro
3a4d44b616 ntp: Move adjtimex related compat syscalls to native counterparts
Get rid of set_fs() mess and sanitize compat_{get,put}_timex(),
while we are at it.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20170607084241.28657-9-viro@ZenIV.linux.org.uk
2017-06-14 00:00:43 +02:00
Al Viro
fb923c4a3c posix-timers: Kill ->nsleep_restart()
No more users.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20170607084241.28657-8-viro@ZenIV.linux.org.uk
2017-06-14 00:00:42 +02:00
Al Viro
ce41aaf47a hrtimers/posix-timers: Merge nanosleep timespec copyout logics into a new helper
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20170607084241.28657-7-viro@ZenIV.linux.org.uk
2017-06-14 00:00:42 +02:00
Al Viro
edbeda4632 time/posix-timers: Move the compat copyouts to the nanosleep implementations
Turn restart_block.nanosleep.{rmtp,compat_rmtp} into a tagged union (kind =
1 -> native, kind = 2 -> compat, kind = 0 -> nothing) and make the places
doing actual copyout handle compat as well as native (that will become a
helper in the next commit).  Result: compat wrappers, messing with
reassignments, etc. are gone.

[ tglx: Folded in a variant of Peter Zijlstras enum patch ]

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20170607084241.28657-6-viro@ZenIV.linux.org.uk
2017-06-14 00:00:42 +02:00
Al Viro
99e6c0e6ec posix-timers: Store rmtp into restart_block in sys_clock_nanosleep()
... instead of doing that in every ->nsleep() instance

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20170607084241.28657-5-viro@ZenIV.linux.org.uk
2017-06-14 00:00:41 +02:00
Al Viro
a7602681fc hrtimer: Move copyout of remaining time to do_nanosleep()
The hrtimer nanosleep() implementation can be simplified by moving the copy
out of the remaining time to do_nanosleep() which is shared between the
real nanosleep function and the restart function.

The pointer to the timespec64 which is updated is already stored in the
restart block at the call site, so the seperate handling of nanosleep and
restart function can be avoided.

[ tglx: Added changelog ]

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20170607084241.28657-4-viro@ZenIV.linux.org.uk
2017-06-14 00:00:41 +02:00
Al Viro
192a82f900 hrtimer_nanosleep(): Pass rmtp in restart_block
Store the pointer to the timespec which gets updated with the remaining
time in the restart block and remove the function argument.

[ tglx: Added changelog ]

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20170607084241.28657-3-viro@ZenIV.linux.org.uk
2017-06-14 00:00:40 +02:00
Al Viro
15f27ce24c alarmtimer: Move copyout and freeze handling into alarmtimer_do_nsleep()
The alarmtimer nanosleep() implementation can be simplified by moving the
copy out of the remaining time to alarmtimer_do_nsleep() which is shared
between the real nanosleep function and the restart function.

The pointer to the timespec64 which is updated has to be stored in the
restart block anyway. Instead of storing it only in the restart case, store
it before calling alarmtimer_do_nsleep() and copy the remaining time in the
signal exit path.

[ tglx: Added changelog ]

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20170607084241.28657-2-viro@ZenIV.linux.org.uk
2017-06-14 00:00:40 +02:00
Al Viro
86a9c446c1 posix-cpu-timers: Move copyout of timespec into do_cpu_nanosleep()
The posix-cpu-timer nanosleep() implementation can be simplified by moving
the copy out of the remaining time to do_cpu_nanosleep() which is shared
between the real nanosleep function and the restart function.

The pointer to the timespec64 which is updated has to be stored in the
restart block anyway. Instead of storing it only in the restart case, store
it before calling do_cpu_nanosleep() and copy the remaining time in the
signal exit path.

[ tglx: Added changelog ]

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20170607084241.28657-1-viro@ZenIV.linux.org.uk
2017-06-14 00:00:40 +02:00
Jeremy Linton
681bec0367 tracing: Rename update the enum_map file
The enum_map file is used to display a list of symbol
to name conversions. As its now used to resolve sizeof
lets update the name and description.

Link: http://lkml.kernel.org/r/20170531215653.3240-13-jeremy.linton@arm.com

Signed-off-by: Jeremy Linton <jeremy.linton@arm.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-06-13 17:13:06 -04:00
Jeremy Linton
67ec0d8595 tracing: Rename enum_replace to eval_replace
The enum_replace stanza works as is for sizeof()
calls as well as enums. Rename it as well.

Link: http://lkml.kernel.org/r/20170531215653.3240-9-jeremy.linton@arm.com

Signed-off-by: Jeremy Linton <jeremy.linton@arm.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-06-13 17:10:34 -04:00
Jeremy Linton
f57a41434f trace: rename enum_map functions
Rename the core trace enum routines to use eval, to
reflect their use by more than just enum to value mapping.

Link: http://lkml.kernel.org/r/20170531215653.3240-8-jeremy.linton@arm.com

Signed-off-by: Jeremy Linton <jeremy.linton@arm.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-06-13 17:10:23 -04:00
Jeremy Linton
5f60b351a7 trace: rename trace.c enum functions
Rename the init and trace_enum_jmp_to_tail() routines
to reflect their use by more than enumerated types.

Link: http://lkml.kernel.org/r/20170531215653.3240-7-jeremy.linton@arm.com

Signed-off-by: Jeremy Linton <jeremy.linton@arm.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-06-13 17:10:12 -04:00
Jeremy Linton
1793ed939b trace: rename trace_enum_mutex to trace_eval_mutex
There is a lock protecting the trace_enum_map, rename
it to reflect the use by more than enums.

Link: http://lkml.kernel.org/r/20170531215653.3240-6-jeremy.linton@arm.com

Signed-off-by: Jeremy Linton <jeremy.linton@arm.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-06-13 17:10:01 -04:00
Jeremy Linton
23bf8cb8dc trace: rename trace enum data structures in trace.c
The enum map entries can be exported to userspace
via a sys enum_map file. Rename those functions
and structures to reflect the fact that we are using
them for more than enums.

Link: http://lkml.kernel.org/r/20170531215653.3240-5-jeremy.linton@arm.com

Signed-off-by: Jeremy Linton <jeremy.linton@arm.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-06-13 17:09:50 -04:00
Jeremy Linton
99be647c58 trace: rename struct module entry for trace enums
Each module has a list of enum's its contributing to the
enum map, rename that entry to reflect its use by more than
enums.

Link: http://lkml.kernel.org/r/20170531215653.3240-4-jeremy.linton@arm.com

Signed-off-by: Jeremy Linton <jeremy.linton@arm.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-06-13 17:09:31 -04:00
Jeremy Linton
00f4b652b6 trace: rename trace_enum_map to trace_eval_map
Each enum is loaded into the trace_enum_map, as we
are now using this for more than enums rename it.

Link: http://lkml.kernel.org/r/20170531215653.3240-3-jeremy.linton@arm.com

Signed-off-by: Jeremy Linton <jeremy.linton@arm.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-06-13 17:08:57 -04:00
Jeremy Linton
02fd7f68f5 trace: rename kernel enum section to eval
The kernel and its modules have sections containing the enum
string to value conversions. Rename this section because we
intend to store more than enums in it.

Link: http://lkml.kernel.org/r/20170531215653.3240-2-jeremy.linton@arm.com

Signed-off-by: Jeremy Linton <jeremy.linton@arm.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-06-13 17:08:46 -04:00
Paul Moore
c81be52a3a audit: fix a race condition with the auditd tracking code
Originally reported by Adam and Dusty, it appears we have a small
race window in kauditd_thread(), as documented in the Fedora BZ:

 * https://bugzilla.redhat.com/show_bug.cgi?id=1459326#c35

 "This issue is partly due to the read-copy nature of RCU, and
  partly due to how we sync the auditd_connection state across
  kauditd_thread and the audit control channel.  The kauditd_thread
  thread is always running so it can service the record queues and
  emit the multicast messages, if it happens to be just past the
  "main_queue" label, but before the "if (sk == NULL || ...)"
  if-statement which calls auditd_reset() when the new auditd
  connection is registered it could end up resetting the auditd
  connection, regardless of if it is valid or not.  This is a rather
  small window and the variable nature of multi-core scheduling
  explains why this is proving rather difficult to reproduce."

The fix is to have functions only call auditd_reset() when they
believe that the kernel/auditd connection is still valid, e.g.
non-NULL, and to have these callers pass their local copy of the
auditd_connection pointer to auditd_reset() where it can be compared
with the current connection state before resetting.  If the caller
has a stale state tracking pointer then the reset is ignored.

We also make a small change to kauditd_thread() so that if the
kernel/auditd connection is dead we skip the retry queue and send the
records straight to the hold queue.  This is necessary as we used to
rely on auditd_reset() to occasionally purge the retry queue but we
are going to be calling the reset function much less now and we want
to make sure the retry queue doesn't grow unbounded.

Reported-by: Adam Williamson <awilliam@redhat.com>
Reported-by: Dusty Mabe <dustymabe@redhat.com>
Reviewed-by: Richard Guy Briggs <rgb@redhat.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
2017-06-13 15:19:43 -04:00
Nicolas Iooss
f4e981cba2 printk: add __printf attributes to internal functions
When compiling with -Wsuggest-attribute=format, gcc complains that some
functions in kernel/printk/printk_safe.c transmit their argument to
printf-like functions without having a printf attribute. Silence these
warnings by adding relevant __printf attributes.

Link: http://lkml.kernel.org/r/20170524054950.6722-1-nicolas.iooss_linux@m4x.org
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Nicolas Iooss <nicolas.iooss_linux@m4x.org>
Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
2017-06-13 16:36:02 +02:00
Joel Fernandes
6a5ae63a0c tracing: Remove unused declaration of trace_stop_cmdline_recording
trace_stop_cmdline_recording declaration isn't in use, remove it.

Link: http://lkml.kernel.org/r/20170609025327.9508-2-joelaf@google.com

Cc: kernel-team@android.com
Cc: Ingo Molnar <mingo@redhat.com>
Signed-off-by: Joel Fernandes <joelaf@google.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-06-13 09:40:52 -04:00
Christoph Hellwig
fdd050b5b3 Merge branch 'uuid-types' of bombadil.infradead.org:public_git/uuid into nvme-base 2017-06-13 11:45:14 +02:00
Frederic Weisbecker
d4af6d933c nohz: Fix spurious warning when hrtimer and clockevent get out of sync
The sanity check ensuring that the tick expiry cache (ts->next_tick)
is actually in sync with the hardware clock (dev->next_event) makes the
wrong assumption that the clock can't be programmed later than the
hrtimer deadline.

In fact the clock hardware can be programmed later on some conditions
such as:

    * The hrtimer deadline is already in the past.
    * The hrtimer deadline is earlier than the minimum delay supported
      by the hardware.

Such conditions can be met when we program the tick, for example if the
last jiffies update hasn't been seen by the current CPU yet, we may
program the hrtimer to a deadline that is earlier than ktime_get()
because last_jiffies_update is our timestamp base to compute the next
tick.

As a result, we can randomly observe such warning:

	WARNING: CPU: 5 PID: 0 at kernel/time/tick-sched.c:794 tick_nohz_stop_sched_tick kernel/time/tick-sched.c:791 [inline]
	Call Trace:
	 tick_nohz_irq_exit
	 tick_irq_exit
	 irq_exit
	 exiting_irq
	 smp_call_function_interrupt
	 smp_call_function_single_interrupt
	 call_function_single_interrupt

Therefore, let's rather make sure that the tick expiry cache is sync'ed
with the tick hrtimer deadline, against which it is not supposed to
drift away. The clock hardware instead has its own will and can't be
used as a reliable comparison point.

Reported-and-tested-by: Sasha Levin <alexander.levin@verizon.com>
Reported-and-tested-by: Abdul Haleem <abdhalee@linux.vnet.ibm.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: James Hartsock <hartsjc@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tim Wright <tim@binbash.co.uk>
Link: http://lkml.kernel.org/r/1497326654-14122-1-git-send-email-fweisbec@gmail.com
[ Minor readability edit. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-06-13 08:45:43 +02:00
Ingo Molnar
567b64aaef Merge branch 'for-mingo' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu into core/rcu
Pull RCU updates from Paul E. McKenney:

 "The largest feature of this series is shrinking and simplification,
  with the following diffstat summary:

     79 files changed, 1496 insertions(+), 4211 deletions(-)

  In other words, this series represents a net reduction of more than 2700
  lines of code."

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-06-13 08:40:03 +02:00
Heiner Kallweit
fa07ab72cb genirq: Release resources in __setup_irq() error path
In case __irq_set_trigger() fails the resources requested via
irq_request_resources() are not released.

Add the missing release call into the error handling path.

Fixes: c1bacbae81 ("genirq: Provide irq_request/release_resources chip callbacks")
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Link: http://lkml.kernel.org/r/655538f5-cb20-a892-ff15-fbd2dd1fa4ec@gmail.com
2017-06-13 00:40:39 +02:00
Derek Robson
e4c1a0d15b audit: style fix
Fixed checkpatch.pl warnings of "function definition argument FOO
should also have an identifier name"

Signed-off-by: Derek Robson <robsonde@gmail.com>
Reviewed-by: Richard Guy Briggs <rgb@redhat.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
2017-06-12 18:07:43 -04:00
Thomas Gleixner
67edab48ca posix-timers: Handle relative posix-timers correctly
The recent rework of the posix timer internals broke the magic posix
mechanism, which requires that relative timers are not affected by
modifications of the underlying clock. That means relative CLOCK_REALTIME
timers cannot use CLOCK_REALTIME, because that can be set and adjusted. The
underlying hrtimer switches the clock for these timers to CLOCK_MONOTONIC.

That still works, but reading the remaining time of such a timer has been
broken in the rework. The old code used the hrtimer internals directly and
avoided the posix clock callbacks. Now common_timer_get() uses the
underlying kclock->timer_get() callback, which is still CLOCK_REALTIME
based. So the remaining time of such a timer is calculated against the
wrong time base.

Handle it by switching the k_itimer->kclock pointer according to the
resulting hrtimer mode. k_itimer->it_clock still contains CLOCK_REALTIME
because the timer might be set with ABSTIME later and then it needs to
switch back to the realtime posix clock implementation.

Fixes: eae1c4ae27 ("posix-timers: Make use of cancel/arm callbacks")
Reported-by: Andrei Vagin <avagin@virtuozzo.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Link: http://lkml.kernel.org/r/20170609201156.GB21491@outlook.office365.com
2017-06-12 21:07:41 +02:00
Thomas Gleixner
5c7a3a3d20 posix-timers: Zero out oldval itimerspec
The recent posix timer rework moved the clearing of the itimerspec to the
real syscall implementation, but forgot that the kclock->timer_get() is
used by timer_settime() as well. That results in an uninitialized variable
and bogus values returned to user space.

Add the missing memset to timer_settime().

Fixes: eabdec0438 ("posix-timers: Zero settings value in common code")
Reported-by: Andrei Vagin <avagin@virtuozzo.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Link: http://lkml.kernel.org/r/20170609201156.GB21491@outlook.office365.com
2017-06-12 21:07:40 +02:00
Arnd Bergmann
57de72125d cpu/hotplug: Remove unused check_for_tasks() function
clang -Wunused-function found one remaining function that was
apparently meant to be removed in a recent code cleanup:

kernel/cpu.c:565:20: warning: unused function 'check_for_tasks' [-Wunused-function]

Sebastian explained: The function became unused unintentionally, but there
is already a failure check, when a task cannot be removed from the outgoing
cpu in the scheduler code, so bringing it back is not really giving any
extra value.

Fixes: 530e9b76ae ("cpu/hotplug: Remove obsolete cpu hotplug register/unregister functions")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Anna-Maria Gleixner <anna-maria@linutronix.de>
Link: http://lkml.kernel.org/r/20170608085544.2257132-1-arnd@arndb.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2017-06-12 19:00:55 +02:00
Stephen Boyd
94114c3675 tick/broadcast: Make tick_broadcast_setup_oneshot() static
This function isn't used outside of tick-broadcast.c, so let's
mark it static.

Signed-off-by: Stephen Boyd <sboyd@codeaurora.org>
Link: http://lkml.kernel.org/r/20170608063603.13276-1-sboyd@codeaurora.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2017-06-12 18:56:01 +02:00
Thomas Gleixner
c6503be587 posix-timers: Fix inverted SIGEV_NONE logic in common_timer_get()
The refactoring of the posix-timer core to allow better code sharing
introduced inverted logic vs. SIGEV_NONE timers in common_timer_get().

That causes hrtimer_forward() to be called on active timers, which
rightfully triggers the warning hrtimer_forward().

Make sig_none what it says: signal mode == SIGEV_NONE.

Fixes: 91d57bae08 ("posix-timers: Make use of forward/remaining callbacks")
Reported-by: Ye Xiaolong <xiaolong.ye@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: John Stultz <john.stultz@linaro.org>
Link: http://lkml.kernel.org/r/20170609104457.GA39907@inn.lkp.intel.com
2017-06-12 17:29:07 +02:00
Jens Axboe
8f66439eec Linux 4.12-rc5
-----BEGIN PGP SIGNATURE-----
 
 iQEcBAABAgAGBQJZPdbLAAoJEHm+PkMAQRiGx4wH/1nCjfnl6fE8oJ24/1gEAOUh
 biFdqJkYZmlLYHVtYfLm4Ueg4adJdg0wx6qM/4RaAzmQVvLfDV34bc1qBf1+P95G
 kVF+osWyXrZo5cTwkwapHW/KNu4VJwAx2D1wrlxKDVG5AOrULH1pYOYGOpApEkZU
 4N+q5+M0ce0GJpqtUZX+UnI33ygjdDbBxXoFKsr24B7eA0ouGbAJ7dC88WcaETL+
 2/7tT01SvDMo0jBSV0WIqlgXwZ5gp3yPGnklC3F4159Yze6VFrzHMKS/UpPF8o8E
 W9EbuzwxsKyXUifX2GY348L1f+47glen/1sedbuKnFhP6E9aqUQQJXvEO7ueQl4=
 =m2Gx
 -----END PGP SIGNATURE-----

Merge tag 'v4.12-rc5' into for-4.13/block

We've already got a few conflicts and upcoming work depends on some of the
changes that have gone into mainline as regression fixes for this series.

Pull in 4.12-rc5 to resolve these conflicts and make it easier on down stream
trees to continue working on 4.13 changes.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-12 08:30:13 -06:00
Rafael J. Wysocki
ff0a6d6f93 Revert "cpufreq: schedutil: Reduce frequencies slower"
Revert commit 39b64aa1c0 (cpufreq: schedutil: Reduce frequencies
slower) that introduced unintentional changes in behavior leading
to adverse effects on some systems.

Reported-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2017-06-12 14:16:16 +02:00
Greg Kroah-Hartman
069a0f32c9 Merge 4.12-rc5 into char-misc-next
We want the char/misc driver fixes in here as well.

Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-06-12 08:18:10 +02:00
Andy Lutomirski
252d2a4117 sched/core: Idle_task_exit() shouldn't use switch_mm_irqs_off()
idle_task_exit() can be called with IRQs on x86 on and therefore
should use switch_mm(), not switch_mm_irqs_off().

This doesn't seem to cause any problems right now, but it will
confuse my upcoming TLB flush changes.  Nonetheless, I think it
should be backported because it's trivial.  There won't be any
meaningful performance impact because idle_task_exit() is only
used when offlining a CPU.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@suse.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Fixes: f98db6013c ("sched/core: Add switch_mm_irqs_off() and use it in the scheduler")
Link: http://lkml.kernel.org/r/ca3d1a9fa93a0b49f5a8ff729eda3640fb6abdf9.1497034141.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-06-11 10:58:17 +02:00
Marcin Nowakowski
f67abed585 sched/fair: Fix typo in printk message
'schedstats' kernel parameter should be set to enable/disable, so
correct the printk hint saying that it should be set to 'enable'
rather than 'enabled' to enable scheduler tracepoints.

Signed-off-by: Marcin Nowakowski <marcin.nowakowski@imgtec.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1496995229-31245-1-git-send-email-marcin.nowakowski@imgtec.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-06-11 10:00:33 +02:00
Daniel Borkmann
36e24c0030 bpf: reset id on spilled regs in clear_all_pkt_pointers
Right now, we don't reset the id of spilled registers in case of
clear_all_pkt_pointers(). Given pkt_pointers are highly likely to
contain an id, do so by reusing __mark_reg_unknown_value().

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-10 19:05:45 -04:00
Daniel Borkmann
4a2ff55aa4 bpf: reset id on CONST_IMM transition
Whenever we set the register to the type CONST_IMM, we currently don't
reset the id to 0. id member is not used in CONST_IMM case, so don't
let it become stale, where pruning won't be able to match later on.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-10 19:05:45 -04:00
Daniel Borkmann
d25da6caa2 bpf: don't check spilled reg state for non-STACK_SPILLed type slots
spilled_regs[] state is only used for stack slots of type STACK_SPILL,
never for STACK_MISC. Right now, in states_equal(), even if we have
old and current stack state of type STACK_MISC, we compare spilled_regs[]
for that particular offset. Just skip these like we do everywhere else.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-10 19:05:45 -04:00
Daniel Borkmann
20b9d7ac48 bpf: avoid excessive stack usage for perf_sample_data
perf_sample_data consumes 386 bytes on stack, reduce excessive stack
usage and move it to per cpu buffer. It's allowed due to preemption
being disabled for tracing, xdp and tc programs, thus at all times
only one program can run on a specific CPU and programs cannot run
from interrupt. We similarly also handle bpf_pt_regs.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-10 19:05:45 -04:00
David S. Miller
95c4629d92 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc 2017-06-10 14:06:46 -07:00
Linus Torvalds
45b44f0f28 Merge branch 'smp-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull CPU hotplug fix from Ingo Molnar:
 "An error handling corner case fix"

* 'smp-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  cpu/hotplug: Drop the device lock on error
2017-06-10 10:49:42 -07:00
Linus Torvalds
6b7ed4588c Merge branch 'rcu-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull RCU fixes from Ingo Molnar:
 "Fix an SRCU bug affecting KVM IRQ injection"

* 'rcu-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  srcu: Allow use of Classic SRCU from both process and interrupt context
  srcu: Allow use of Tiny/Tree SRCU from both process and interrupt context
2017-06-10 10:22:35 -07:00
Linus Torvalds
f701d860af Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fixes from Ingo Molnar:
 "This is mostly tooling fixes, plus an instruction pointer filtering
  fix.

  It's more fixes than usual - Arnaldo got back from a longer vacation
  and there was a backlog"

* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (27 commits)
  perf symbols: Kill dso__build_id_is_kmod()
  perf symbols: Keep DSO->symtab_type after decompress
  perf tests: Decompress kernel module before objdump
  perf tools: Consolidate error path in __open_dso()
  perf tools: Decompress kernel module when reading DSO data
  perf annotate: Use dso__decompress_kmodule_path()
  perf tools: Introduce dso__decompress_kmodule_{fd,path}
  perf tools: Fix a memory leak in __open_dso()
  perf annotate: Fix symbolic link of build-id cache
  perf/core: Drop kernel samples even though :u is specified
  perf script python: Remove dups in documentation examples
  perf script python: Updated trace_unhandled() signature
  perf script python: Fix wrong code snippets in documentation
  perf script: Fix documentation errors
  perf script: Fix outdated comment for perf-trace-python
  perf probe: Fix examples section of documentation
  perf report: Ensure the perf DSO mapping matches what libdw sees
  perf report: Include partial stacks unwound with libdw
  perf annotate: Add missing powerpc triplet
  perf test: Disable breakpoint signal tests for powerpc
  ...
2017-06-10 10:15:47 -07:00
Al Viro
1b3c872c83 rt_sigtimedwait(): move compat to native
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-06-09 23:57:12 -04:00
Al Viro
7668b679c3 put_compat_rusage(): switch to copy_to_user()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-06-09 23:56:13 -04:00
Al Viro
8f13621abc sigpending(): move compat to native
... and kill set_fs() use

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-06-09 23:51:38 -04:00
Al Viro
d9e968cb9f getrlimit()/setrlimit(): move compat to native
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-06-09 23:51:33 -04:00
Al Viro
ca2406ed58 times(2): move compat to native
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-06-09 23:51:28 -04:00
Al Viro
1e1fc13348 compat_{get,put}_bitmap(): use unsafe_{get,put}_user()
unroll the inner loops, while we are at it

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-06-09 23:51:17 -04:00
Christoph Hellwig
4e4cbee93d block: switch bios to blk_status_t
Replace bi_error with a new bi_status to allow for a clear conversion.
Note that device mapper overloaded bi_error with a private value, which
we'll have to keep arround at least for now and thus propagate to a
proper blk_status_t value.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-06-09 09:27:32 -06:00
Roberto Pereira
9e69dd0179 config: android-base: disable CONFIG_NFSD and CONFIG_NFS_FS
Disable Network file system support.

Reviewed-at: https://android-review.googlesource.com/#/c/409559/

Signed-off-by: Roberto Pereira <rpere@google.com>
[AmitP: cherry-picked this change from Android common kernel
        and updated commit message]
Signed-off-by: Amit Pundir <amit.pundir@linaro.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-06-09 11:47:38 +02:00
Chenbo Feng
2edfe6be20 config: android-base: add CGROUP_BPF
Add CONFIG_CGROUP_BPF as a default configuration in android base config
since it is used to replace XT_QTAGUID in future.

Reviewed-at: https://android-review.googlesource.com/#/c/400374/

Signed-off-by: Chenbo Feng <fengc@google.com>
[AmitP: cherry-picked this change from Android common kernel]
Signed-off-by: Amit Pundir <amit.pundir@linaro.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-06-09 11:47:38 +02:00
Greg Kroah-Hartman
2096e17063 config: android-base: add CONFIG_MODULES option
This adds CONFIG_MODULES, CONFIG_MODULE_UNLOAD, and CONFIG_MODVERSIONS
which are required by the O release.

Reviewed-at: https://android-review.googlesource.com/#/c/364554/

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
[AmitP: cherry-picked this change from Android common kernel]
Signed-off-by: Amit Pundir <amit.pundir@linaro.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-06-09 11:47:38 +02:00
Greg Kroah-Hartman
5b89db2fa5 config: android-base: add CONFIG_IKCONFIG option
This adds CONFIG_IKCONFIG and CONFIG_IKCONFIG_PROC options, which are a
requirement for the O release.

Reviewed-at: https://android-review.googlesource.com/#/c/364553/

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
[AmitP: cherry-picked this change from Android common kernel]
Signed-off-by: Amit Pundir <amit.pundir@linaro.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-06-09 11:47:38 +02:00
Sami Tolvanen
fb0b153898 config: android-recommended: enable CONFIG_CPU_SW_DOMAIN_PAN
Enable CPU domain PAN to ensure that normal kernel accesses are
unable to access userspace addresses.

Reviewed-at: https://android-review.googlesource.com/#/c/334035/

Signed-off-by: Sami Tolvanen <samitolvanen@google.com>
[AmitP: cherry-picked this change from Android common kernel, updated
        the commit message and re-placed the CONFIG_STRICT_KERNEL_RWX
        config in sorted order]
Signed-off-by: Amit Pundir <amit.pundir@linaro.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-06-09 11:47:38 +02:00
Max Shi
c1ebc2febd config: android-base: disable CONFIG_USELIB and CONFIG_FHANDLE
Turn off the two kernel configs to disable related system ABI.

Reviewed-at: https://android-review.googlesource.com/#/c/264976/

Signed-off-by: Max Shi <meixuanshi@google.com>
[AmitP: cherry-picked this change from Android common kernel]
Signed-off-by: Amit Pundir <amit.pundir@linaro.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-06-09 11:47:38 +02:00
Sami Tolvanen
0c9238c7a1 config: android-recommended: enable CONFIG_ARM64_SW_TTBR0_PAN
Enable PAN emulation using TTBR0_EL1 switching.

Reviewed-at: https://android-review.googlesource.com/#/c/325997/

Signed-off-by: Sami Tolvanen <samitolvanen@google.com>
[AmitP: cherry-picked this change from Android common kernel
        and updated the commit message]
Signed-off-by: Amit Pundir <amit.pundir@linaro.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-06-09 11:47:37 +02:00
Jeff Vander Stoep
67e707bd68 config: android-recommended: enable fstack-protector-strong
If compiler has stack protector support, set
CONFIG_CC_STACKPROTECTOR_STRONG.

Reviewed-at: https://android-review.googlesource.com/#/c/238388/

Signed-off-by: Jeff Vander Stoep <jeffv@google.com>
[AmitP: cherry-picked this change from Android common kernel]
Signed-off-by: Amit Pundir <amit.pundir@linaro.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-06-09 11:47:37 +02:00
Ingo Molnar
8affb06737 Merge branch 'rcu/urgent' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu into rcu/urgent
Pull RCU fix from Paul E. McKenney:

" This series enables srcu_read_lock() and srcu_read_unlock() to be used from
  interrupt handlers, which fixes a bug in KVM's use of SRCU in delivery
  of interrupts to guest OSes. "

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-06-09 08:17:10 +02:00
Paul E. McKenney
6d48152eaf rcu: Remove RCU CPU stall warnings from Tiny RCU
Tiny RCU's job is to be tiny, so this commit removes its RCU CPU
stall warning code.  After this, there is no longer any need for
rcu_sched_ctrlblk and rcu_bh_ctrlblk to be in tiny_plugin.h, so this
commit also moves them to tiny.c.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2017-06-08 18:52:45 -07:00
Paul E. McKenney
c23484f0e7 rcu: Remove event tracing from Tiny RCU
This commit saves a few lines by getting rid of Tiny RCU's event tracing.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2017-06-08 18:52:45 -07:00
Paul E. McKenney
43a0a2a7d7 rcu: Move RCU debug Kconfig options to kernel/rcu
RCU's debugging Kconfig options are in the unintuitive location
lib/Kconfig.debug, and there are enough of them that it would be good for
them to be more centralized.  This commit therefore extracts RCU's Kconfig
options from init/Kconfig into a new kernel/rcu/Kconfig.debug file.

Reported-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2017-06-08 18:52:44 -07:00
Paul E. McKenney
0af92d4609 rcu: Move RCU non-debug Kconfig options to kernel/rcu
RCU's Kconfig options are scattered, and there are enough of them
that it would be good for them to be more centralized.  This commit
therefore extracts RCU's Kconfig options from init/Kconfig into a new
kernel/rcu/Kconfig file.

Reported-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2017-06-08 18:52:44 -07:00
Paul E. McKenney
44c65ff2e3 rcu: Eliminate NOCBs CPU-state Kconfig options
The CONFIG_RCU_NOCB_CPU_ALL, CONFIG_RCU_NOCB_CPU_NONE, and
CONFIG_RCU_NOCB_CPU_ZERO Kconfig options are used only in testing and
are redundant with the rcu_nocbs= boot parameter.  This commit therefore
removes these three Kconfig options and adjusts the rcutorture scripts
to use the boot parameter instead.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2017-06-08 18:52:43 -07:00
Paul E. McKenney
ae91aa0adb rcu: Remove debugfs tracing
RCU's debugfs tracing used to be the only reasonable low-level debug
information available, but ftrace and event tracing has since surpassed
the RCU debugfs level of usefulness.  This commit therefore removes
RCU's debugfs tracing.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2017-06-08 18:52:43 -07:00
Paul E. McKenney
bd8cc5a062 srcu: Remove Classic SRCU
Classic SRCU was only ever intended to be a fallback in case of issues
with Tree/Tiny SRCU, and the latter two are doing quite well in testing.
This commit therefore removes Classic SRCU.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2017-06-08 18:52:42 -07:00
Paul E. McKenney
7f0cd63330 srcu: Fix rcutorture-statistics typo
The function srcutorture_get_gp_data() duplicated the check for
sp->batch_check0.head instead of also checking sp->batch_check1.head.
The only effect of this typo would be for rcutorture statistics to
understate the fraction of time that an SRCU grace period was in flight,
and only for Classic SRCU.  This commit fixes this typo.

Reported-by: David Binderman <dcb314@hotmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2017-06-08 18:52:42 -07:00
Paul E. McKenney
c4a09ff752 rcu: Remove the now-obsolete PROVE_RCU_REPEATEDLY Kconfig option
The PROVE_RCU_REPEATEDLY Kconfig option was initially added due to
the volume of messages from PROVE_RCU: Doing just one per boot would
have required excessive numbers of boots to locate them all.  However,
PROVE_RCU messages are now relatively rare, so there is no longer any
reason to need more than one such message per boot.  This commit therefore
removes the PROVE_RCU_REPEATEDLY Kconfig option.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Ingo Molnar <mingo@kernel.org>
2017-06-08 18:52:41 -07:00
Paul E. McKenney
4e4bea7427 rcu: Remove typecheck() from RCU locking wrapper functions
Because raw_spin_lock_irqsave() and raw_spin_unlock_irqrestore()
both do typecheck() on their flags argument, there is no point in
duplicating this check in raw_spin_lock_irqsave_rcu_node() and
raw_spin_unlock_irqrestore_rcu_node().  This commit therefore saves
a few lines by removing this duplicated check.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2017-06-08 18:52:40 -07:00
Paul E. McKenney
fe5ac724d8 rcu: Remove nohz_full full-system-idle state machine
The NO_HZ_FULL_SYSIDLE full-system-idle capability was added in 2013
by commit 0edd1b1784 ("nohz_full: Add full-system-idle state machine"),
but has not been used.  This commit therefore removes it.

If it turns out to be needed later, this commit can always be reverted.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Ingo Molnar <mingo@kernel.org>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-06-08 18:52:39 -07:00
Paul E. McKenney
f7a10a9750 rcu: Remove the RCU_KTHREAD_PRIO Kconfig option
Anything that can be done with the RCU_KTHREAD_PRIO Kconfig option can
also be done with the rcutree.kthread_prio kernel boot parameter.
This commit therefore removes this Kconfig option.

Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Rik van Riel <riel@redhat.com>
2017-06-08 18:52:39 -07:00
Paul E. McKenney
90040c9e30 rcu: Remove *_SLOW_* Kconfig options
The RCU_TORTURE_TEST_SLOW_PREINIT, RCU_TORTURE_TEST_SLOW_PREINIT_DELAY,
RCU_TORTURE_TEST_SLOW_PREINIT_DELAY, RCU_TORTURE_TEST_SLOW_INIT,
RCU_TORTURE_TEST_SLOW_INIT_DELAY, RCU_TORTURE_TEST_SLOW_CLEANUP,
and RCU_TORTURE_TEST_SLOW_CLEANUP_DELAY Kconfig options are only
useful for torture testing, and there are the rcutree.gp_cleanup_delay,
rcutree.gp_init_delay, and rcutree.gp_preinit_delay kernel boot parameters
that rcutorture can use instead.  The effect of these parameters is to
artificially slow down grace period initialization and cleanup in order
to make some types of race conditions happen more often.

This commit therefore simplifies Tree RCU a bit by removing the Kconfig
options and adding the corresponding kernel parameters to rcutorture's
.boot files instead.  However, this commit also leaves out the kernel
parameters for TREE02, TREE04, and TREE07 in order to have about the
same number of tests slowed as not slowed.  TREE01, TREE03, TREE05,
and TREE06 are slowed, and the rest are not slowed.

Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2017-06-08 18:52:38 -07:00
Paul E. McKenney
a3883df393 srcu: Use rnp->lock wrappers to replace explicit memory barriers
This commit uses TREE RCU's rnp->lock wrappers to replace a few explicit
memory barriers.  This change also has the advantage of making SRCU's
memory-ordering properties be implemented in roughly the same way as they
are in Tree RCU.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2017-06-08 18:52:38 -07:00
Paul E. McKenney
83d40bd3bc rcu: Move rnp->lock wrappers for SRCU use
This commit moves the now-generic rnp->lock wrapper macros from
kernel/rcu/tree.h to kernel/rcu/rcu.h, thus allowing SRCU to use them.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2017-06-08 18:52:38 -07:00
Paul E. McKenney
bf32c76540 rcu: Convert rnp->lock wrappers to macros for SRCU use
Use of smp_mb__after_unlock_lock() would allow SRCU to omit a full
memory barrier during callback execution, so this commit converts
raw_spin_lock_rcu_node() from inline functions to type-generic macros
to allow them to handle locks in srcu_node structures as well as
rcu_node structures.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2017-06-08 18:52:37 -07:00
Paul E. McKenney
681fbec881 lockdep: Use consistent printing primitives
Commit a5dd63efda ("lockdep: Use "WARNING" tag on lockdep splats")
substituted pr_warn() for printk() in places called out by Dmitry Vyukov.
However, this resulted in an ugly mix of pr_warn() and printk().  This
commit therefore changes printk() to pr_warn() or pr_cont(), depending
on the absence or presence of KERN_CONT.  This is done in all functions
that had printk() changed to pr_warn() by the aforementioned commit.

Reported-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2017-06-08 18:52:36 -07:00
Paul E. McKenney
2464dd940e srcu: Apply trivial callback lists to shrink Tiny SRCU
The rcu_segcblist structure provides quite a bit of functionality, and
Tiny SRCU needs almost none of it.  So this commit replaces Tiny SRCU's
uses of rcu_segcblist with a simple singly linked list with tail pointer.
This change significantly reduces Tiny SRCU's memory footprint, more
than making up for the growth caused by the creation of rcu_segcblist.c

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2017-06-08 18:52:35 -07:00
Paul E. McKenney
5a0465e17a srcu: Shrink srcu.h by moving docbook and private function
The call_srcu() docbook entry is currently in include/linux/srcu.h,
which causes needless processing for each include point.  This commit
therefore moves this entry to kernel/rcu/srcutree.c, which the compiler
reads only once.  In addition, the srcu_batches_completed() function is
used only within RCU and its torture-test suites.  This commit therefore
also moves this function's declaration from include/linux/srcutiny.h,
include/linux/srcutree.h, and include/linux/srcuclassic.h to
kernel/rcu/rcu.h.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2017-06-08 18:52:35 -07:00
Paul E. McKenney
c350c00829 srcu: Prevent sdp->srcu_gp_seq_needed counter wrap
If a given CPU never happens to ever start an SRCU grace period, the
grace-period sequence counter might wrap.  If this CPU were to decide to
finally start a grace period, the state of its sdp->srcu_gp_seq_needed
might make it appear that it has already requested this grace period,
which would prevent starting the grace period.  If no other CPU ever started
a grace period again, this would look like a grace-period hang.  Even
if some other CPU took pity and started the needed grace period, the
leaf rcu_node structure's ->srcu_data_have_cbs field won't have record
of the fact that this CPU has a callback pending, which would look like
a very localized grace-period hang.

This might seem very unlikely, but SRCU grace periods can take less than
a microsecond on small systems, which means that overflow can happen
in much less than an hour on a 32-bit embedded system.  And embedded
systems are especially likely to have long-term idle CPUs.  Therefore,
it makes sense to prevent this scenario from happening.

This commit therefore scans each srcu_data structure occasionally,
with frequency controlled by the srcutree.counter_wrap_check kernel
boot parameter.  This parameter can be set to something like 255
in order to exercise the counter-wrap-prevention code.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2017-06-08 18:52:34 -07:00
Paul E. McKenney
fe21a27e8c rcu: Move rcu_request_urgent_qs_task() out of rcutiny.h and rcutree.h
The rcu_request_urgent_qs_task() function is used only within RCU,
so there is no point in exporting it to the rest of the kernel from
nclude/linux/rcutiny.h and include/linux/rcutree.h.  This commit therefore
moves this function to kernel/rcu/rcu.h.

Reported-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2017-06-08 18:52:33 -07:00
Paul E. McKenney
e3c8d51e1a rcu: Move torture-related functions out of rcutiny.h and rcutree.h
The various functions similar to rcu_batches_started(), the
function show_rcu_gp_kthreads(), the various functions similar to
rcu_force_quiescent_state(), and the variables rcutorture_testseq and
rcutorture_vernum are used only within RCU.  There is therefore no point
in exporting them to the kernel at large from include/linux/rcutiny.h
and include/linux/rcutree.h.  This commit therefore moves all of these
to kernel/rcu/rcu.h.

Reported-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2017-06-08 18:52:33 -07:00
Paul E. McKenney
b8989b7605 rcu: Move rcu_ftrace_dump() from rcupdate.h to rcu.h
The rcu_ftrace_dump() function is used only internally to RCU.  This
commit therefore moves its declaration from include/linux/rcupdate.h
to kernel/rcu/rcu.h.

Reported-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2017-06-08 18:52:32 -07:00
Paul E. McKenney
3d54f7983f rcu: Move rcu_is_nocb_cpu() from rcupdate.h to rcu.h
The rcu_is_nocb_cpu() function is used only internally to RCU.  This
commit therefore moves its declaration from include/linux/rcupdate.h
to kernel/rcu/rcu.h.

Reported-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2017-06-08 18:52:31 -07:00
Paul E. McKenney
fa3c664769 rcu: Improve __call_rcu() debug-objects error message
The "__call_rcu(): Leaked duplicate callback" error message from
__call_rcu() has proven to be unhelpful.  This commit therefore changes
it to "__call_rcu(): Double-freed CB" and adds the value of the pointer
passed in.  The value of the pointer improves debuggability by allowing
correlation with tracing output, for example, the rcu:rcu_callback trace
event.

Reported-by: Vegard Nossum <vegard.nossum@oracle.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2017-06-08 18:52:31 -07:00
Paul E. McKenney
82118249d0 rcu: Move the RCU_SCHEDULER_ definitions from rcupdate.h
The RCU_SCHEDULER_INACTIVE, RCU_SCHEDULER_INIT, and RCU_SCHEDULER_RUNNING
definitions are used only within RCU, so this commit moves them from
include/linux/rcupdate.h to kernel/rcu/rcu.h.

Reported-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2017-06-08 18:52:30 -07:00
Paul E. McKenney
791875d16e rcu: Eliminate the unused __rcu_is_watching() function
The __rcu_is_watching() function is currently not used, aside from
to implement the rcu_is_watching() function.  This commit therefore
eliminates __rcu_is_watching(), which has the beneficial side-effect
of shrinking include/linux/rcupdate.h a bit.

Reported-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2017-06-08 18:52:30 -07:00
Paul E. McKenney
cad7b38972 rcu: Move torture-related definitions from rcupdate.h to rcu.h
The include/linux/rcupdate.h file contains a number of definitions that
are used only to communicate between rcutorture, rcuperf, and the RCU code
itself.  There is no point in having these definitions exposed globally
throughout the kernel, so this commit moves them to kernel/rcu/rcu.h.
This change has the added benefit of shrinking rcupdate.h.

Reported-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2017-06-08 18:52:28 -07:00
Paul E. McKenney
25c36329a3 rcu: Move expediting-related access/control out of rcupdate.h
The rcu_gp_is_normal(), rcu_gp_is_expedited(), rcu_expedite_gp(), and
rcu_unexpedite_gp() functions are intended only for use within the
RCU implementation itself -- the sysfs access is what should be used
outside of RCU.  This commit therefore moves the declarations for
these functions to kernel/rcu/rcu.h, and also includes this file into
kernel/rcu/rcutorture.c and kernel/rcu/rcuperf.c.  This also has the
beneficial effect of shrinking rcupdate.c a bit.

Reported-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2017-06-08 18:52:28 -07:00
Paul E. McKenney
3caec62fbb rcu: Move rcu_expedited and rcu_normal externs from rcupdate.h
The rcu_expedited and rcu_normal variables are used only by sysctl
and kernel/rcu/update.c, so it does not make sense to their extern
declarations in rcupdate.h.  This commit therefore moves these
extern declarations to update.c.

Reported-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2017-06-08 18:52:27 -07:00
Paul E. McKenney
a68a2bb28b rcu: Move docbook comments out of rcupdate.h
The include/linux/rcupdate.h file is included by more than 200
files, so shrinking it should provide some build-time benefits.
This commit therefore moves several docbook comments from rcupdate.h to
kernel/rcu/update.c, kernel/rcu/tree.c, and kernel/rcu/tree_plugin.h, thus
reducing the number of times that the compiler has to scan these comments.
This likely provides only a small benefit, but every little bit helps.

This commit also fixes a malformed bulleted list noted by the 0day
Test Robot.

Reported-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2017-06-08 18:52:27 -07:00
Paul E. McKenney
6b5fc3a133 rcu: Add memory barriers for NOCB leader wakeup
Wait/wakeup operations do not guarantee ordering on their own.  Instead,
either locking or memory barriers are required.  This commit therefore
adds memory barriers to wake_nocb_leader() and nocb_leader_wait().

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Krister Johansen <kjlx@templeofstupid.com>
Cc: <stable@vger.kernel.org> # 4.6.x
2017-06-08 18:51:59 -07:00
Linus Torvalds
0d22df90c7 Power management fixes for v4.12-rc5
- Revert a recent commit that attempted to avoid spurious wakeups
    from suspend-to-idle via ACPI SCI, but introduced regressions on
    some systems (Rafael Wysocki).
 
    We will get back to the problem it tried to address in the next
    cycle.
 
  - Fix a possible division by 0 during intel_pstate initialization
    due to a missing check (Rafael Wysocki).
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQIcBAABCAAGBQJZOeX+AAoJEILEb/54YlRxYksQAIczTx5toDqcmMbyXUbS/KBp
 /kQK5iwGo9WV5V4ZsWugmwrTY9R4xwBdeBow/nacO5NGVtjYix95wTlGlw84RLIi
 2QO4SzN0ZKqh8DUxvjro/exIe+YPqX4Bodvo9CX4hr60xjutcZRVz5Dq6ZS80KPw
 o0/fWUlQzT1BM9IorfJ0YT2XEdMpdVbPWrwTpm8l2G42vWXSQwsd6fFnOLpTrd46
 580JAH5Dux6YNMsSejTazQon/3P0sChYxbJkpm6nvv819EMbFDy8p+ebIgceBlos
 l6Zlckd7ETwDwL3G3OGi5/Zcpb/YMg5Slm3+IGM/J5ccVIfdG8gjqTJklrxH7I4S
 /+0MzwlXUbRYEvurB6nsP3kIofrZN+t+c609ewmIFLy2QIDJF9BiVhKnRrjNfsuU
 KrY0zFATtxGy/0CfkZCmSWZBid06tAQ0ZZ1dYkO/1Qf5dn1ge+Yr8tcc0WKqJqbq
 NR+BfsCVa84b6s+uBsMdKR6kAg0tz7uG5njXlH7bupH3ZCtttbbWp0znmd/ow/jU
 usJyEHStvzhjC4T9s0tRzEi96B2/MlsGYmL+qq9GBScdhYKc6K/4xzdUkp+yGQwD
 311sheKvQZ06kwj0v+hK1aBOH2y3pjcBMvKjCdr/IiMtX8/kD0tsKx1+0QESR91D
 H80L6EjjEAUm1KEsQcpy
 =oWdX
 -----END PGP SIGNATURE-----

Merge tag 'pm-4.12-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull power management fixes from Rafael Wysocki:
 "These revert one problematic commit related to system sleep and fix
  one recent intel_pstate regression.

  Specifics:

   - Revert a recent commit that attempted to avoid spurious wakeups
     from suspend-to-idle via ACPI SCI, but introduced regressions on
     some systems (Rafael Wysocki).

     We will get back to the problem it tried to address in the next
     cycle.

   - Fix a possible division by 0 during intel_pstate initialization
     due to a missing check (Rafael Wysocki)"

* tag 'pm-4.12-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  Revert "ACPI / sleep: Ignore spurious SCI wakeups from suspend-to-idle"
  cpufreq: intel_pstate: Avoid division by 0 in min_perf_pct_min()
2017-06-08 17:40:32 -07:00
Rafael J. Wysocki
fbd78afe34 Merge branches 'intel_pstate' and 'pm-sleep'
* intel_pstate:
  cpufreq: intel_pstate: Avoid division by 0 in min_perf_pct_min()

* pm-sleep:
  Revert "ACPI / sleep: Ignore spurious SCI wakeups from suspend-to-idle"
2017-06-09 01:25:16 +02:00
Linus Torvalds
dc0cf5a77d Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/pmladek/printk
Pull printk fix from Petr Mladek:
 "This reverts a fix added into 4.12-rc1. It caused the kernel log to be
  printed on another console when two consoles of the same type were
  defined, e.g. console=ttyS0 console=ttyS1.

  This configuration was never supported by kernel itself, but it
  started to make sense with systemd. In other words, the commit broke
  userspace"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/pmladek/printk:
  Revert "printk: fix double printing with earlycon"
2017-06-08 10:50:04 -07:00
Paul E. McKenney
511324e462 rcu: Use RCU_NOCB_WAKE rather than RCU_NOGP_WAKE
The RCU_NOGP_WAKE_NOT, RCU_NOGP_WAKE, and RCU_NOGP_WAKE_FORCE flags
are used to mediate wakeups for the no-CBs CPU kthreads.  The "NOGP"
really doesn't make any sense, so this commit does s/NOGP/NOCB/.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2017-06-08 08:25:40 -07:00
Paul E. McKenney
d7d34d5e46 sched: Rely on synchronize_rcu_mult() de-duplication
The synchronize_rcu_mult() function now detects duplicate requests
for the same grace-period flavor and waits only once for each flavor.
This commit therefore removes the ugly #ifdef from sched_cpu_deactivate()
because synchronize_rcu_mult(call_rcu, call_rcu_sched) now does what
the #ifdef used to be needed for.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
2017-06-08 08:25:39 -07:00
Paul E. McKenney
68ab0b4263 rcu: Make synchronize_rcu_mult() check for duplicates
Currently, doing synchronize_rcu_mult(call_rcu, call_rcu) might
(or might not) wait for two RCU grace periods.  One approach is
of course "don't do that!", but in CONFIG_PREEMPT=n kernels,
synchronize_rcu_mult(call_rcu, call_rcu_sched) does exactly that.
This results in an ugly #ifdef in sched_cpu_deactivate().

This commit therefore makes __wait_rcu_gp() check for duplicates,
which in turn allows duplicates to be passed to synchronize_rcu_mult()
without risk of waiting twice on the same type of grace period.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2017-06-08 08:25:39 -07:00
Paul E. McKenney
a602538e46 srcu: Add DEBUG_OBJECTS_RCU_HEAD functionality
This commit adds DEBUG_OBJECTS_RCU_HEAD checking to detect call_srcu()
counterparts to double-free bugs.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2017-06-08 08:25:39 -07:00
Paul E. McKenney
d4efe6c5ad srcu: Shrink Tiny SRCU a bit
In Tiny SRCU, __srcu_read_lock() is a trivial function, outweighed by
its EXPORT_SYMBOL_GPL(), and on many architectures, its call sequence.
This commit therefore moves it to srcutiny.h so that it can be inlined.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2017-06-08 08:25:38 -07:00
Paul E. McKenney
ea9b0c8a26 rcu: Add lockdep_assert_held() teeth to tree_plugin.h
Comments can be helpful, but assertions carry more force.  This commit
therefore adds lockdep_assert_held() and RCU_LOCKDEP_WARN() calls to
enforce lock-held and interrupt-disabled preconditions.

Reported-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2017-06-08 08:25:37 -07:00
Paul E. McKenney
c0b334c5bf rcu: Add lockdep_assert_held() teeth to tree.c
Comments can be helpful, but assertions carry more force.  This
commit therefore adds lockdep_assert_held() and RCU_LOCKDEP_WARN()
calls to enforce lock-held and interrupt-disabled preconditions.

Reported-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2017-06-08 08:25:37 -07:00
Paul E. McKenney
0c8e0e3c37 srcu: Print non-default exp_holdoff values at boot time
This commit makes srcu_bootup_announce() check for non-default values
of the auto-expedite holdoff time exp_holdoff and print a message if so.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2017-06-08 08:25:36 -07:00
Paul E. McKenney
b5815e6cd3 srcu: Make exp_holdoff module parameter be static
Because exp_holdoff is not used outside of srcutree.c, it can be static.
This commit therefore makes this change.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2017-06-08 08:25:36 -07:00
Paul E. McKenney
17c7798bea rcu: Update rcu_bootup_announce_oddness()
This commit updates rcu_bootup_announce_oddness() to check additional
Kconfig options and module/boot parameters.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2017-06-08 08:25:35 -07:00
Paul E. McKenney
59d80fd835 rcu: Print out rcupdate.c non-default boot-time settings
This commit adds a rcupdate_announce_bootup_oddness() function to
print out non-default values of significant kernel boot parameter
settings to aid in debugging.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2017-06-08 08:25:35 -07:00
Paul E. McKenney
f4687d2637 rcu: Add preemptibility checks in rcu_sched_qs() and rcu_bh_qs()
This commit adds WARN_ON_ONCE() calls that trigger if either
rcu_sched_qs() or rcu_bh_qs() are invoked with preemption enabled.
In the immortal words of Peter Zijlstra: "these are much harder to ignore
than comments".

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2017-06-08 08:25:34 -07:00
Paul E. McKenney
820687a7b9 rcuperf: Add writer_holdoff boot parameter
This commit adds a writer_holdoff boot parameter to rcuperf, which is
intended to be used to test Tree SRCU's auto-expediting.  This
boot parameter is in microseconds, and defaults to zero (that is,
disabled).  Set it to a bit larger than srcutree.exp_holdoff,
keeping the nanosecond/microsecond conversion, to force Tree SRCU
to auto-expedite more aggressively.

This commit also adds documentation for this parameter, and fixes some
alphabetization while in the neighborhood.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2017-06-08 08:25:32 -07:00
Paul E. McKenney
492b95e597 rcuperf: Set more user-friendly defaults
Common-case use of rcuperf must set rcuperf.nreaders=0 and if not built
as a module, rcuperf.shutdown.  This commit therefore sets the default
for rcuperf.nreaders to zero and sets the default for rcuperf.shutdown
to zero if rcuperf is built as a module and to one otherwise.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2017-06-08 08:25:31 -07:00
Paul E. McKenney
3ddf20c953 srcu: Shrink Tiny SRCU a bit more
This commit rearranges Tiny SRCU's srcu_struct structure, substitutes
u8 for bool, and shrinks counters down to short.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2017-06-08 08:25:31 -07:00
Paul E. McKenney
1f4f6da1c8 srcu: Make Classic and Tree SRCU announce themselves at bootup
Currently, the only way to tell whether a given kernel is running
Classic, Tiny, or Tree SRCU is to look at the .config file, which
can easily be lost or associated with the wrong kernel.  This commit
therefore has Classic and Tree SRCU identify themselves at boot time.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2017-06-08 08:25:30 -07:00
Paul E. McKenney
f60cb4d4c8 rcuperf: Add test for dynamically initialized srcu_struct
This commit adds a perf_type of "srcud", which species that rcuperf
test SRCU on a dynamically initialized srcu_struct.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2017-06-08 08:25:28 -07:00
Paul E. McKenney
dcfc315b7b rcu: Make sync_rcu_preempt_exp_done() return bool
The sync_rcu_preempt_exp_done() function returns a logical expression,
but its return type is nevertheless int.  This commit therefore changes
the return type to bool.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2017-06-08 08:25:27 -07:00
Paul E. McKenney
881ed593a3 rcuperf: Add ability to performance-test call_rcu() and friends
This commit upgrades rcuperf so that it can do performance testing on
asynchronous grace-period primitives such as call_srcu().  There is
a new rcuperf.gp_async module parameter that specifies this new behavior,
with the pre-existing rcuperf.gp_exp testing expedited grace periods such as
synchronize_rcu_expedited, and with the default being to test synchronous
non-expedited grace periods such as synchronize_rcu().

There is also a new rcuperf.gp_async_max module parameter that specifies
the maximum number of outstanding callbacks per writer kthread, defaulting
to 1,000.  When this limit is exceeded, the writer thread invokes the
appropriate flavor of rcu_barrier() to wait for callbacks to drain.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
[ paulmck: Removed the redundant initialization noted by Arnd Bergmann. ]
2017-06-08 08:25:26 -07:00
Paul E. McKenney
e28371c891 rcu: Remove obsolete reference to synchronize_kernel()
The synchronize_kernel() primitive was removed in favor of
synchronize_sched() more than a decade ago, and it seems likely that
rather few kernel hackers are familiar with it.  Its continued presence
is therefore providing more confusion than enlightenment.  This commit
therefore removes the reference from the synchronize_sched() header
comment, and adds the corresponding information to the synchronize_rcu(0
header comment.

Reported-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2017-06-08 08:25:25 -07:00
Paul E. McKenney
9683937df9 rcuperf: Defer expedited/normal check to end of test
Current rcuperf startup checks to see if the user asked to measure
only expedited grace periods, yet constrained all grace periods to be
normal, or if the user asked to measure only normal grace periods, yet
constrained all grace periods to be expedited.  Useless tests of this
sort are aborted.

Unfortunately, making RCU work through the mid-boot dead zone [1] puts
RCU into expedited-only mode during that zone.  Which happens to also
be the exact time that rcuperf carries out the aforementioned check.
So if the user asks rcuperf to measure only normal grace periods (the
default), rcuperf will now always complain and terminate the test.

This commit therefore moves the checks to rcu_perf_cleanup().  This has
the disadvantage of failing to abort useless tests, but avoids the need to
create yet another kthread and the need to do fiddly checks involving the
holdoff time.  (Yes, another approach is to do the checks in a late-stage
init function, but that would require some way to communicate badness
to rcuperf's kthreads, and seems not worth the bother.)

[1] https://lwn.net/Articles/716148/

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2017-06-08 08:25:24 -07:00
Paul E. McKenney
5b72f9643b rcu: Complain if blocking in preemptible RCU read-side critical section
Although preemptible RCU allows its read-side critical sections to be
preempted, general blocking is forbidden.  The reason for this is that
excessive preemption times can be handled by CONFIG_RCU_BOOST=y, but a
voluntarily blocked task doesn't care how high you boost its priority.
Because preemptible RCU is a global mechanism, one ill-behaved reader
hurts everyone.  Hence the prohibition against general blocking in
RCU-preempt read-side critical sections.  Preemption yes, blocking no.

This commit enforces this prohibition.

There is a special exception for the -rt patchset (which they kindly
volunteered to implement):  It is OK to block (as opposed to merely being
preempted) within an RCU-preempt read-side critical section, but only if
the blocking is subject to priority inheritance.  This exception permits
CONFIG_RCU_BOOST=y to get -rt RCU readers out of trouble.

Why doesn't this exception also apply to mainline's rt_mutex?  Because
of the possibility that someone does general blocking while holding
an rt_mutex.  Yes, the priority boosting will affect the rt_mutex,
but it won't help with the task doing general blocking while holding
that rt_mutex.

Reported-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2017-06-08 08:25:24 -07:00
Paul E. McKenney
881ec9d209 srcu: Eliminate possibility of destructive counter overflow
Earlier versions of Tree SRCU were subject to a counter overflow bug that
could theoretically result in too-short grace periods.  This commit
eliminates this problem by adding an update-side memory barrier.
The short explanation is that if the updater sums the unlock counts
too late to see a given __srcu_read_unlock() increment, that CPU's
next __srcu_read_lock() must see the new value of ->srcu_idx, thus
incrementing the other bank of counters.  This eliminates the possibility
of destructive counter overflow as long as the srcu_read_lock() nesting
level does not exceed floor(ULONG_MAX/NR_CPUS/2), which should be an
eminently reasonable nesting limit, especially on 64-bit systems.

Reported-by: Lance Roy <ldr709@gmail.com>
Suggested-by: Lance Roy <ldr709@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2017-06-08 08:25:23 -07:00
Paul E. McKenney
f92c734f02 rcu: Prevent rcu_barrier() from starting needless grace periods
Currently rcu_barrier() uses call_rcu() to enqueue new callbacks
on each CPU with a non-empty callback list.  This works, but means
that rcu_barrier() forces grace periods that are not otherwise needed.
The key point is that rcu_barrier() never needs to wait for a grace
period, but instead only for all pre-existing callbacks to be invoked.
This means that rcu_barrier()'s new callbacks should be placed in
the callback-list segment containing the last pre-existing callback.

This commit makes this change using the new rcu_segcblist_entrain()
function.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2017-06-08 08:25:22 -07:00
Paolo Bonzini
1123a60416 srcu: Allow use of Classic SRCU from both process and interrupt context
Linu Cherian reported a WARN in cleanup_srcu_struct() when shutting
down a guest running iperf on a VFIO assigned device.  This happens
because irqfd_wakeup() calls srcu_read_lock(&kvm->irq_srcu) in interrupt
context, while a worker thread does the same inside kvm_set_irq().  If the
interrupt happens while the worker thread is executing __srcu_read_lock(),
updates to the Classic SRCU ->lock_count[] field or the Tree SRCU
->srcu_lock_count[] field can be lost.

The docs say you are not supposed to call srcu_read_lock() and
srcu_read_unlock() from irq context, but KVM interrupt injection happens
from (host) interrupt context and it would be nice if SRCU supported the
use case.  KVM is using SRCU here not really for the "sleepable" part,
but rather due to its IPI-free fast detection of grace periods.  It is
therefore not desirable to switch back to RCU, which would effectively
revert commit 719d93cd5f ("kvm/irqchip: Speed up KVM_SET_GSI_ROUTING",
2014-01-16).

However, the docs are overly conservative.  You can have an SRCU instance
only has users in irq context, and you can mix process and irq context
as long as process context users disable interrupts.  In addition,
__srcu_read_unlock() actually uses this_cpu_dec() on both Tree SRCU and
Classic SRCU.  For those two implementations, only srcu_read_lock()
is unsafe.

When Classic SRCU's __srcu_read_unlock() was changed to use this_cpu_dec(),
in commit 5a41344a3d ("srcu: Simplify __srcu_read_unlock() via
this_cpu_dec()", 2012-11-29), __srcu_read_lock() did two increments.
Therefore it kept __this_cpu_inc(), with preempt_disable/enable in
the caller.  Tree SRCU however only does one increment, so on most
architectures it is more efficient for __srcu_read_lock() to use
this_cpu_inc(), and any performance differences appear to be down in
the noise.

Cc: stable@vger.kernel.org
Fixes: 719d93cd5f ("kvm/irqchip: Speed up KVM_SET_GSI_ROUTING")
Reported-by: Linu Cherian <linuc.decode@gmail.com>
Suggested-by: Linu Cherian <linuc.decode@gmail.com>
Cc: kvm@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2017-06-08 08:25:19 -07:00
Paolo Bonzini
cdf7abc461 srcu: Allow use of Tiny/Tree SRCU from both process and interrupt context
Linu Cherian reported a WARN in cleanup_srcu_struct() when shutting
down a guest running iperf on a VFIO assigned device.  This happens
because irqfd_wakeup() calls srcu_read_lock(&kvm->irq_srcu) in interrupt
context, while a worker thread does the same inside kvm_set_irq().  If the
interrupt happens while the worker thread is executing __srcu_read_lock(),
updates to the Classic SRCU ->lock_count[] field or the Tree SRCU
->srcu_lock_count[] field can be lost.

The docs say you are not supposed to call srcu_read_lock() and
srcu_read_unlock() from irq context, but KVM interrupt injection happens
from (host) interrupt context and it would be nice if SRCU supported the
use case.  KVM is using SRCU here not really for the "sleepable" part,
but rather due to its IPI-free fast detection of grace periods.  It is
therefore not desirable to switch back to RCU, which would effectively
revert commit 719d93cd5f ("kvm/irqchip: Speed up KVM_SET_GSI_ROUTING",
2014-01-16).

However, the docs are overly conservative.  You can have an SRCU instance
only has users in irq context, and you can mix process and irq context
as long as process context users disable interrupts.  In addition,
__srcu_read_unlock() actually uses this_cpu_dec() on both Tree SRCU and
Classic SRCU.  For those two implementations, only srcu_read_lock()
is unsafe.

When Classic SRCU's __srcu_read_unlock() was changed to use this_cpu_dec(),
in commit 5a41344a3d ("srcu: Simplify __srcu_read_unlock() via
this_cpu_dec()", 2012-11-29), __srcu_read_lock() did two increments.
Therefore it kept __this_cpu_inc(), with preempt_disable/enable in
the caller.  Tree SRCU however only does one increment, so on most
architectures it is more efficient for __srcu_read_lock() to use
this_cpu_inc(), and any performance differences appear to be down in
the noise.

Unlike Classic and Tree SRCU, Tiny SRCU does increments and decrements on
a single variable.  Therefore, as Peter Zijlstra pointed out, Tiny SRCU's
implementation already supports mixed-context use of srcu_read_lock()
and srcu_read_unlock(), at least as long as uses of srcu_read_lock()
and srcu_read_unlock() in each handler are nested and paired properly.
In other words, it is still illegal to (say) invoke srcu_read_lock()
in an interrupt handler and to invoke the matching srcu_read_unlock()
in a softirq handler.  Therefore, the only change required for Tiny SRCU
is to its comments.

Fixes: 719d93cd5f ("kvm/irqchip: Speed up KVM_SET_GSI_ROUTING")
Reported-by: Linu Cherian <linuc.decode@gmail.com>
Suggested-by: Linu Cherian <linuc.decode@gmail.com>
Cc: kvm@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Paolo Bonzini <pbonzini@redhat.com>
2017-06-08 08:24:26 -07:00
Petr Mladek
dac8bbbae1 Revert "printk: fix double printing with earlycon"
This reverts commit cf39bf58af.

The commit regression to users that define both console=ttyS1
and console=ttyS0 on the command line, see
https://lkml.kernel.org/r/20170509082915.GA13236@bistromath.localdomain

The kernel log messages always appeared only on one serial port. It is
even documented in Documentation/admin-guide/serial-console.rst:

"Note that you can only define one console per device type (serial,
video)."

The above mentioned commit changed the order in which the command line
parameters are searched. As a result, the kernel log messages go to
the last mentioned ttyS* instead of the first one.

We long thought that using two console=ttyS* on the command line
did not make sense. But then we realized that console= parameters
were handled also by systemd, see
http://0pointer.de/blog/projects/serial-console.html

"By default systemd will instantiate one serial-getty@.service on
the main kernel console, if it is not a virtual terminal."

where

"[4] If multiple kernel consoles are used simultaneously, the main
console is the one listed first in /sys/class/tty/console/active,
which is the last one listed on the kernel command line."

This puts the original report into another light. The system is running
in qemu. The first serial port is used to store the messages into a file.
The second one is used to login to the system via a socket. It depends
on systemd and the historic kernel behavior.

By other words, systemd causes that it makes sense to define both
console=ttyS1 console=ttyS0 on the command line. The kernel fix
caused regression related to userspace (systemd) and need to be
reverted.

In addition, it went out that the fix helped only partially.
The messages still were duplicated when the boot console was
removed early by late_initcall(printk_late_init). Then the entire
log was replayed when the same console was registered as a normal one.

Link: 20170606160339.GC7604@pathway.suse.cz
Cc: Aleksey Makarov <aleksey.makarov@linaro.org>
Cc: Sabrina Dubroca <sd@queasysnail.net>
Cc: Sudeep Holla <sudeep.holla@arm.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Peter Hurley <peter@hurleysoftware.com>
Cc: Jiri Slaby <jslaby@suse.com>
Cc: Robin Murphy <robin.murphy@arm.com>,
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: "Nair, Jayachandran" <Jayachandran.Nair@cavium.com>
Cc: linux-serial@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Reported-by: Sabrina Dubroca <sd@queasysnail.net>
Acked-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
2017-06-08 12:06:00 +02:00
Peter Zijlstra
f5694788ad rt_mutex: Add lockdep annotations
Now that (PI) futexes have their own private RT-mutex interface and
implementation we can easily add lockdep annotations to the existing
RT-mutex interface.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-06-08 10:35:49 +02:00
Aubrey Li
ebfa4c02fa sched/idle: Add deferrable vmstat_updater back
Deferrable vmstat_updater was missing in commit:

  c1de45ca83 ("sched/idle: Add support for tasks that inject idle")

Add it back.

Signed-off-by: Aubrey Li <aubrey.li@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Aubrey Li <aubrey.li@intel.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1496803742-38274-1-git-send-email-aubrey.li@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-06-08 10:32:09 +02:00
Nicolas Pitre
f5832c1998 sched/core: Omit building stop_sched_class when !SMP
The stop class is invoked through stop_machine only.
This is dead code on UP builds.

Signed-off-by: Nicolas Pitre <nico@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20170529210302.26868-3-nicolas.pitre@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-06-08 10:32:04 +02:00
Daniel Bristot de Oliveira
3effcb4247 sched/deadline: Use the revised wakeup rule for suspending constrained dl tasks
We have been facing some problems with self-suspending constrained
deadline tasks. The main reason is that the original CBS was not
designed for such sort of tasks.

One problem reported by Xunlei Pang takes place when a task
suspends, and then is awakened before the deadline, but so close
to the deadline that its remaining runtime can cause the task
to have an absolute density higher than allowed. In such situation,
the original CBS assumes that the task is facing an early activation,
and so it replenishes the task and set another deadline, one deadline
in the future. This rule works fine for implicit deadline tasks.
Moreover, it allows the system to adapt the period of a task in which
the external event source suffered from a clock drift.

However, this opens the window for bandwidth leakage for constrained
deadline tasks. For instance, a task with the following parameters:

  runtime   = 5 ms
  deadline  = 7 ms
  [density] = 5 / 7 = 0.71
  period    = 1000 ms

If the task runs for 1 ms, and then suspends for another 1ms,
it will be awakened with the following parameters:

  remaining runtime = 4
  laxity = 5

presenting a absolute density of 4 / 5 = 0.80.

In this case, the original CBS would assume the task had an early
wakeup. Then, CBS will reset the runtime, and the absolute deadline will
be postponed by one relative deadline, allowing the task to run.

The problem is that, if the task runs this pattern forever, it will keep
receiving bandwidth, being able to run 1ms every 2ms. Following this
behavior, the task would be able to run 500 ms in 1 sec. Thus running
more than the 5 ms / 1 sec the admission control allowed it to run.

Trying to address the self-suspending case, Luca Abeni, Giuseppe
Lipari, and Juri Lelli [1] revisited the CBS in order to deal with
self-suspending tasks. In the new approach, rather than
replenishing/postponing the absolute deadline, the revised wakeup rule
adjusts the remaining runtime, reducing it to fit into the allowed
density.

A revised version of the idea is:

At a given time t, the maximum absolute density of a task cannot be
higher than its relative density, that is:

  runtime / (deadline - t) <= dl_runtime / dl_deadline

Knowing the laxity of a task (deadline - t), it is possible to move
it to the other side of the equality, thus enabling to define max
remaining runtime a task can use within the absolute deadline, without
over-running the allowed density:

  runtime = (dl_runtime / dl_deadline) * (deadline - t)

For instance, in our previous example, the task could still run:

  runtime = ( 5 / 7 ) * 5
  runtime = 3.57 ms

Without causing damage for other deadline tasks. It is note worthy
that the laxity cannot be negative because that would cause a negative
runtime. Thus, this patch depends on the patch:

  df8eac8caf ("sched/deadline: Throttle a constrained deadline task activated after the deadline")

Which throttles a constrained deadline task activated after the
deadline.

Finally, it is also possible to use the revised wakeup rule for
all other tasks, but that would require some more discussions
about pros and cons.

Reported-by: Xunlei Pang <xpang@redhat.com>
Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com>
[peterz: replaced dl_is_constrained with dl_is_implicit]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Luca Abeni <luca.abeni@santannapisa.it>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Romulo Silva de Oliveira <romulo.deoliveira@ufsc.br>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tommaso Cucinotta <tommaso.cucinotta@sssup.it>
Link: http://lkml.kernel.org/r/5c800ab3a74a168a84ee5f3f84d12a02e11383be.1495803804.git.bristot@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-06-08 10:32:03 +02:00
Xunlei Pang
ae83b56a56 sched/deadline: Zero out positive runtime after throttling constrained tasks
When a contrained task is throttled by dl_check_constrained_dl(),
it may carry the remaining positive runtime, as a result when
dl_task_timer() fires and calls replenish_dl_entity(), it will
not be replenished correctly due to the positive dl_se->runtime.

This patch assigns its runtime to 0 if positive after throttling.

Signed-off-by: Xunlei Pang <xlpang@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Juri Lelli <juri.lelli@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Luca Abeni <luca.abeni@santannapisa.it>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: df8eac8caf ("sched/deadline: Throttle a constrained deadline task activated after the deadline)
Link: http://lkml.kernel.org/r/1494421417-27550-1-git-send-email-xlpang@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-06-08 10:31:58 +02:00
Luca Abeni
daec579836 sched/deadline: Reclaim bandwidth not used by dl tasks
This commit introduces a per-runqueue "extra utilization" that can be
reclaimed by deadline tasks. In this way, the maximum fraction of CPU
time that can reclaimed by deadline tasks is fixed (and configurable)
and does not depend on the total deadline utilization.
The GRUB accounting rule is modified to add this "extra utilization"
to the inactive utilization of the runqueue, and to avoid reclaiming
more than a maximum fraction of the CPU time.

Tested-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Signed-off-by: Luca Abeni <luca.abeni@santannapisa.it>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Claudio Scordino <claudio@evidence.eu.com>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Juri Lelli <juri.lelli@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mathieu Poirier <mathieu.poirier@linaro.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tommaso Cucinotta <tommaso.cucinotta@sssup.it>
Link: http://lkml.kernel.org/r/1495138417-6203-10-git-send-email-luca.abeni@santannapisa.it
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-06-08 10:31:55 +02:00
Luca Abeni
9f0d1a5077 sched/deadline: Base GRUB reclaiming on the inactive utilization
Instead of decreasing the runtime as "dq = -Uact dt" (eventually
divided by the maximum utilization available for deadline tasks),
decrease it as "dq = -max{u, (1 - Uinact)} dt", where u is the task
utilization and Uinact is the "inactive utilization".
In this way, the maximum fraction of CPU time that can be reclaimed
is given by the total utilization of deadline tasks.
This approach solves a fairness issue with "traditional" global GRUB
reclaiming: using the traditional GRUB algorithm, if tasks are
allocated to the various cores in a non-uniform way, the
reclaiming mechanism allows some tasks to reclaim more time than
others. This issue is visible starting 11 time-consuming tasks with
runtime 10ms and period 30ms (total utilization 3.666) on a 4-cores
system: some tasks will receive much more than the reserved runtime
(thanks to the reclaiming mechanism), while other tasks will receive
less than the reserved runtime.

Tested-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Signed-off-by: Luca Abeni <luca.abeni@santannapisa.it>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Claudio Scordino <claudio@evidence.eu.com>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Juri Lelli <juri.lelli@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mathieu Poirier <mathieu.poirier@linaro.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tommaso Cucinotta <tommaso.cucinotta@sssup.it>
Link: http://lkml.kernel.org/r/1495138417-6203-9-git-send-email-luca.abeni@santannapisa.it
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-06-08 10:31:54 +02:00
Luca Abeni
8fd27231c3 sched/deadline: Track the "total rq utilization" too
The total rq utilization is defined as the sum of the utilisations of
tasks that are "assigned" to a runqueue, independently from their state
(TASK_RUNNING or blocked)

Tested-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Signed-off-by: Luca Abeni <luca.abeni@santannapisa.it>
Signed-off-by: Claudio Scordino <claudio@evidence.eu.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Juri Lelli <juri.lelli@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mathieu Poirier <mathieu.poirier@linaro.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tommaso Cucinotta <tommaso.cucinotta@sssup.it>
Link: http://lkml.kernel.org/r/1495138417-6203-8-git-send-email-luca.abeni@santannapisa.it
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-06-08 10:31:53 +02:00
Luca Abeni
2d4283e9d5 sched/deadline: Make GRUB a task's flag
This patch introduces the SCHED_FLAG_RECLAIM flag to specify
that a DL task is allowed to reclaim unused CPU time (using
the GRUB algorithm).

Tested-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Signed-off-by: Luca Abeni <luca.abeni@santannapisa.it>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Claudio Scordino <claudio@evidence.eu.com>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Juri Lelli <juri.lelli@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mathieu Poirier <mathieu.poirier@linaro.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tommaso Cucinotta <tommaso.cucinotta@sssup.it>
Link: http://lkml.kernel.org/r/1495138417-6203-7-git-send-email-luca.abeni@santannapisa.it
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-06-08 10:31:52 +02:00
Luca Abeni
4da3abcefe sched/deadline: Do not reclaim the whole CPU bandwidth
Original GRUB tends to reclaim 100% of the CPU time... And this
allows a CPU hog to starve non-deadline tasks.
To address this issue, allow the scheduler to reclaim only a
specified fraction of CPU time, stored in the new "bw_ratio"
field of the dl runqueue structure.

Tested-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Signed-off-by: Luca Abeni <luca.abeni@santannapisa.it>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Claudio Scordino <claudio@evidence.eu.com>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Juri Lelli <juri.lelli@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mathieu Poirier <mathieu.poirier@linaro.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tommaso Cucinotta <tommaso.cucinotta@sssup.it>
Link: http://lkml.kernel.org/r/1495138417-6203-6-git-send-email-luca.abeni@santannapisa.it
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-06-08 10:31:51 +02:00
Luca Abeni
c52f14d384 sched/deadline: Implement GRUB accounting
According to the GRUB (Greedy Reclaimation of Unused Bandwidth)
reclaiming algorithm, the runtime is not decreased as "dq = -dt",
but as "dq = -Uact dt" (where Uact is the per-runqueue active
utilization).
Hence, this commit modifies the runtime accounting rule in
update_curr_dl() to implement the GRUB rule.

Tested-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Signed-off-by: Luca Abeni <luca.abeni@santannapisa.it>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Claudio Scordino <claudio@evidence.eu.com>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Juri Lelli <juri.lelli@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mathieu Poirier <mathieu.poirier@linaro.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tommaso Cucinotta <tommaso.cucinotta@sssup.it>
Link: http://lkml.kernel.org/r/1495138417-6203-5-git-send-email-luca.abeni@santannapisa.it
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-06-08 10:31:51 +02:00
Luca Abeni
387e31300b sched/deadline: Fix the update of the total -deadline utilization
Now that the inactive timer can be armed to fire at the 0-lag time,
it is possible to use inactive_task_timer() to update the total
-deadline utilization (dl_b->total_bw) at the correct time, fixing
dl_overflow() and __setparam_dl().

Tested-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Signed-off-by: Luca Abeni <luca.abeni@santannapisa.it>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Claudio Scordino <claudio@evidence.eu.com>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Juri Lelli <juri.lelli@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mathieu Poirier <mathieu.poirier@linaro.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tommaso Cucinotta <tommaso.cucinotta@sssup.it>
Link: http://lkml.kernel.org/r/1495138417-6203-4-git-send-email-luca.abeni@santannapisa.it
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-06-08 10:31:50 +02:00
Luca Abeni
209a0cbda7 sched/deadline: Improve the tracking of active utilization
This patch implements a more theoretically sound algorithm for
tracking active utilization: instead of decreasing it when a
task blocks, use a timer (the "inactive timer", named after the
"Inactive" task state of the GRUB algorithm) to decrease the
active utilization at the so called "0-lag time".

Tested-by: Claudio Scordino <claudio@evidence.eu.com>
Tested-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Signed-off-by: Luca Abeni <luca.abeni@santannapisa.it>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Juri Lelli <juri.lelli@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mathieu Poirier <mathieu.poirier@linaro.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tommaso Cucinotta <tommaso.cucinotta@sssup.it>
Link: http://lkml.kernel.org/r/1495138417-6203-3-git-send-email-luca.abeni@santannapisa.it
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-06-08 10:31:49 +02:00
Luca Abeni
e36d8677bf sched/deadline: Track the active utilization
Active utilization is defined as the total utilization of active
(TASK_RUNNING) tasks queued on a runqueue. Hence, it is increased
when a task wakes up and is decreased when a task blocks.

When a task is migrated from CPUi to CPUj, immediately subtract the
task's utilization from CPUi and add it to CPUj. This mechanism is
implemented by modifying the pull and push functions.
Note: this is not fully correct from the theoretical point of view
(the utilization should be removed from CPUi only at the 0 lag
time), a more theoretically sound solution is presented in the
next patches.

Tested-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Signed-off-by: Luca Abeni <luca.abeni@unitn.it>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Juri Lelli <juri.lelli@arm.com>
Cc: Claudio Scordino <claudio@evidence.eu.com>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mathieu Poirier <mathieu.poirier@linaro.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tommaso Cucinotta <tommaso.cucinotta@sssup.it>
Link: http://lkml.kernel.org/r/1495138417-6203-2-git-send-email-luca.abeni@santannapisa.it
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-06-08 10:27:56 +02:00
Peter Zijlstra
1ad3aaf3fc sched/core: Implement new approach to scale select_idle_cpu()
Hackbench recently suffered a bunch of pain, first by commit:

  4c77b18cf8 ("sched/fair: Make select_idle_cpu() more aggressive")

and then by commit:

  c743f0a5c5 ("sched/fair, cpumask: Export for_each_cpu_wrap()")

which fixed a bug in the initial for_each_cpu_wrap() implementation
that made select_idle_cpu() even more expensive. The bug was that it
would skip over CPUs when bits were consequtive in the bitmask.

This however gave me an idea to fix select_idle_cpu(); where the old
scheme was a cliff-edge throttle on idle scanning, this introduces a
more gradual approach. Instead of stopping to scan entirely, we limit
how many CPUs we scan.

Initial benchmarks show that it mostly recovers hackbench while not
hurting anything else, except Mason's schbench, but not as bad as the
old thing.

It also appears to recover the tbench high-end, which also suffered like
hackbench.

Tested-by: Matt Fleming <matt@codeblueprint.co.uk>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Chris Mason <clm@fb.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: hpa@zytor.com
Cc: kitsunyan <kitsunyan@inbox.ru>
Cc: linux-kernel@vger.kernel.org
Cc: lvenanci@redhat.com
Cc: riel@redhat.com
Cc: xiaolong.ye@intel.com
Link: http://lkml.kernel.org/r/20170517105350.hk5m4h4jb6dfr65a@hirez.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-06-08 10:25:17 +02:00
Matthias Kaehlcke
d0fabd1cb8 perf/core: Remove unused perf_cgroup_event_cgrp_time() function
The function was added by commit e5d1367f17 ("perf: Add cgroup
support") in 2011 and hasn't been used since then. Removing it fixes the
following warning when building with Clang:

    kernel/events/core.c:696:19: error: unused function 'perf_cgroup_event_cgrp_time' [-Werror,-Wunused-function]

Signed-off-by: Matthias Kaehlcke <mka@chromium.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Douglas Anderson <dianders@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20170523215132.189049-1-mka@chromium.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-06-08 10:12:39 +02:00
Peter Zijlstra
ba5213ae6b perf/core: Correct event creation with PERF_FORMAT_GROUP
Andi was asking about PERF_FORMAT_GROUP vs inherited events, which led
to the discovery of a bug from commit:

  3dab77fb1b ("perf: Rework/fix the whole read vs group stuff")

 -       PERF_SAMPLE_GROUP                       = 1U << 4,
 +       PERF_SAMPLE_READ                        = 1U << 4,

 -       if (attr->inherit && (attr->sample_type & PERF_SAMPLE_GROUP))
 +       if (attr->inherit && (attr->read_format & PERF_FORMAT_GROUP))

is a clear fail :/

While this changes user visible behaviour; it was previously possible
to create an inherited event with PERF_SAMPLE_READ; this is deemed
acceptible because its results were always incorrect.

Reported-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vince@deater.net>
Fixes:  3dab77fb1b ("perf: Rework/fix the whole read vs group stuff")
Link: http://lkml.kernel.org/r/20170530094512.dy2nljns2uq7qa3j@hirez.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-06-08 10:12:38 +02:00
Ingo Molnar
a5506c46a4 Merge branch 'perf/urgent' into perf/core, to pick up fixes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-06-08 10:12:12 +02:00
Jin Yao
cc1582c231 perf/core: Drop kernel samples even though :u is specified
When doing sampling, for example:

  perf record -e cycles:u ...

On workloads that do a lot of kernel entry/exits we see kernel
samples, even though :u is specified. This is due to skid existing.

This might be a security issue because it can leak kernel addresses even
though kernel sampling support is disabled.

The patch drops the kernel samples if exclude_kernel is specified.

For example, test on Haswell desktop:

  perf record -e cycles:u <mgen>
  perf report --stdio

Before patch applied:

    99.77%  mgen     mgen              [.] buf_read
     0.20%  mgen     mgen              [.] rand_buf_init
     0.01%  mgen     [kernel.vmlinux]  [k] apic_timer_interrupt
     0.00%  mgen     mgen              [.] last_free_elem
     0.00%  mgen     libc-2.23.so      [.] __random_r
     0.00%  mgen     libc-2.23.so      [.] _int_malloc
     0.00%  mgen     mgen              [.] rand_array_init
     0.00%  mgen     [kernel.vmlinux]  [k] page_fault
     0.00%  mgen     libc-2.23.so      [.] __random
     0.00%  mgen     libc-2.23.so      [.] __strcasestr
     0.00%  mgen     ld-2.23.so        [.] strcmp
     0.00%  mgen     ld-2.23.so        [.] _dl_start
     0.00%  mgen     libc-2.23.so      [.] sched_setaffinity@@GLIBC_2.3.4
     0.00%  mgen     ld-2.23.so        [.] _start

We can see kernel symbols apic_timer_interrupt and page_fault.

After patch applied:

    99.79%  mgen     mgen           [.] buf_read
     0.19%  mgen     mgen           [.] rand_buf_init
     0.00%  mgen     libc-2.23.so   [.] __random_r
     0.00%  mgen     mgen           [.] rand_array_init
     0.00%  mgen     mgen           [.] last_free_elem
     0.00%  mgen     libc-2.23.so   [.] vfprintf
     0.00%  mgen     libc-2.23.so   [.] rand
     0.00%  mgen     libc-2.23.so   [.] __random
     0.00%  mgen     libc-2.23.so   [.] _int_malloc
     0.00%  mgen     libc-2.23.so   [.] _IO_doallocbuf
     0.00%  mgen     ld-2.23.so     [.] do_lookup_x
     0.00%  mgen     ld-2.23.so     [.] open_verify.constprop.7
     0.00%  mgen     ld-2.23.so     [.] _dl_important_hwcaps
     0.00%  mgen     libc-2.23.so   [.] sched_setaffinity@@GLIBC_2.3.4
     0.00%  mgen     ld-2.23.so     [.] _start

There are only userspace symbols.

Signed-off-by: Jin Yao <yao.jin@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: <stable@vger.kernel.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: acme@kernel.org
Cc: jolsa@kernel.org
Cc: kan.liang@intel.com
Cc: mark.rutland@arm.com
Cc: will.deacon@arm.com
Cc: yao.jin@intel.com
Link: http://lkml.kernel.org/r/1495706947-3744-1-git-send-email-yao.jin@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-06-08 10:11:50 +02:00
Masami Hiramatsu
c3ca46ef71 ftrace/kprobes: selftests: Check kretprobe maxactive is supported
Check the kretprobe maxactive is supported by kprobe_events
interface. To ensure the kernel feature, this changes ftrace
README to describe it.

Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Shuah Khan <shuahkh@osg.samsung.com>
2017-06-07 10:07:22 -06:00
David S. Miller
216fe8f021 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Just some simple overlapping changes in marvell PHY driver
and the DSA core code.

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-06 22:20:08 -04:00
Rafael J. Wysocki
f3b7eaae1b Revert "ACPI / sleep: Ignore spurious SCI wakeups from suspend-to-idle"
Revert commit eed4d47efe (ACPI / sleep: Ignore spurious SCI wakeups
from suspend-to-idle) as it turned out to be premature and triggered
a number of different issues on various systems.

That includes, but is not limited to, premature suspend-to-RAM aborts
on Dell XPS 13 (9343) reported by Dominik.

The issue the commit in question attempted to address is real and
will need to be taken care of going forward, but evidently more work
is needed for this purpose.

Reported-by: Dominik Brodowski <linux@dominikbrodowski.net>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2017-06-07 00:57:37 +02:00
Daniel Borkmann
92046578ac bpf: cgroup skb progs cannot access ld_abs/ind
Commit fb9a307d11 ("bpf: Allow CGROUP_SKB eBPF program to
access sk_buff") enabled programs of BPF_PROG_TYPE_CGROUP_SKB
type to use ld_abs/ind instructions. However, at this point,
we cannot use them, since offsets relative to SKF_LL_OFF will
end up pointing skb_mac_header(skb) out of bounds since in the
egress path it is not yet set at that point in time, but only
after __dev_queue_xmit() did a general reset on the mac header.
bpf_internal_load_pointer_neg_helper() will then end up reading
data from a wrong offset.

BPF_PROG_TYPE_CGROUP_SKB programs can use bpf_skb_load_bytes()
already to access packet data, which is also more flexible than
the insns carried over from cBPF.

Fixes: fb9a307d11 ("bpf: Allow CGROUP_SKB eBPF program to access sk_buff")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Cc: Chenbo Feng <fengc@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-06 16:39:48 -04:00
Martin KaFai Lau
1e27097690 bpf: Add BPF_OBJ_GET_INFO_BY_FD
A single BPF_OBJ_GET_INFO_BY_FD cmd is used to obtain the info
for both bpf_prog and bpf_map.  The kernel can figure out the
fd is associated with a bpf_prog or bpf_map.

The suggested struct bpf_prog_info and struct bpf_map_info are
not meant to be a complete list and it is not the goal of this patch.
New fields can be added in the future patch.

The focus of this patch is to create the interface,
BPF_OBJ_GET_INFO_BY_FD cmd for exposing the bpf_prog's and
bpf_map's info.

The obj's info, which will be extended (and get bigger) over time, is
separated from the bpf_attr to avoid bloating the bpf_attr.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Alexei Starovoitov <ast@fb.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-06 15:41:24 -04:00
Martin KaFai Lau
bd5f5f4ecb bpf: Add BPF_MAP_GET_FD_BY_ID
Add BPF_MAP_GET_FD_BY_ID command to allow user to get a fd
from a bpf_map's ID.

bpf_map_inc_not_zero() is added and is called with map_idr_lock
held.

__bpf_map_put() is also added which has the 'bool do_idr_lock'
param to decide if the map_idr_lock should be acquired when
freeing the map->id.

In the error path of bpf_map_inc_not_zero(), it may have to
call __bpf_map_put(map, false) which does not need
to take the map_idr_lock when freeing the map->id.

It is currently limited to CAP_SYS_ADMIN which we can
consider to lift it in followup patches.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Alexei Starovoitov <ast@fb.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-06 15:41:23 -04:00
Martin KaFai Lau
b16d9aa4c2 bpf: Add BPF_PROG_GET_FD_BY_ID
Add BPF_PROG_GET_FD_BY_ID command to allow user to get a fd
from a bpf_prog's ID.

bpf_prog_inc_not_zero() is added and is called with prog_idr_lock
held.

__bpf_prog_put() is also added which has the 'bool do_idr_lock'
param to decide if the prog_idr_lock should be acquired when
freeing the prog->id.

In the error path of bpf_prog_inc_not_zero(), it may have to
call __bpf_prog_put(map, false) which does not need
to take the prog_idr_lock when freeing the prog->id.

It is currently limited to CAP_SYS_ADMIN which we can
consider to lift it in followup patches.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Alexei Starovoitov <ast@fb.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-06 15:41:23 -04:00
Martin KaFai Lau
34ad5580f8 bpf: Add BPF_(PROG|MAP)_GET_NEXT_ID command
This patch adds BPF_PROG_GET_NEXT_ID and BPF_MAP_GET_NEXT_ID
to allow userspace to iterate all bpf_prog IDs and bpf_map IDs.

The API is trying to be consistent with the existing
BPF_MAP_GET_NEXT_KEY.

It is currently limited to CAP_SYS_ADMIN which we can
consider to lift it in followup patches.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Alexei Starovoitov <ast@fb.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-06 15:41:23 -04:00
Martin KaFai Lau
f3f1c054c2 bpf: Introduce bpf_map ID
This patch generates an unique ID for each created bpf_map.
The approach is similar to the earlier patch for bpf_prog ID.

It is worth to note that the bpf_map's ID and bpf_prog's ID
are in two independent ID spaces and both have the same valid range:
[1, INT_MAX).

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Alexei Starovoitov <ast@fb.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-06 15:41:22 -04:00
Martin KaFai Lau
dc4bb0e235 bpf: Introduce bpf_prog ID
This patch generates an unique ID for each BPF_PROG_LOAD-ed prog.
It is worth to note that each BPF_PROG_LOAD-ed prog will have
a different ID even they have the same bpf instructions.

The ID is generated by the existing idr_alloc_cyclic().
The ID is ranged from [1, INT_MAX).  It is allocated in cyclic manner,
so an ID will get reused every 2 billion BPF_PROG_LOAD.

The bpf_prog_alloc_id() is done after bpf_prog_select_runtime()
because the jit process may have allocated a new prog.  Hence,
we need to ensure the value of pointer 'prog' will not be changed
any more before storing the prog to the prog_idr.

After bpf_prog_select_runtime(), the prog is read-only.  Hence,
the id is stored in 'struct bpf_prog_aux'.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Alexei Starovoitov <ast@fb.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-06 15:41:22 -04:00
Linus Torvalds
ba7b2387ad Merge branch 'for-4.12-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup fixes from Tejun Heo:
 "Two cgroup fixes. One to address RCU delay of cpuset removal affecting
  userland visible behaviors. The other fixes a race condition between
  controller disable and cgroup removal"

* 'for-4.12-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
  cpuset: consider dying css as offline
  cgroup: Prevent kill_css() from being called more than once
2017-06-05 15:37:03 -07:00
Christoph Hellwig
680895d6ef sysctl: switch to use uuid_t
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
2017-06-05 16:59:15 +02:00
Frederic Weisbecker
f99973e18b nohz: Fix buggy tick delay on IRQ storms
When the tick is stopped and we reach the dynticks evaluation code on
IRQ exit, we perform a soft tick restart if we observe an expired timer
from there. It means we program the nearest possible tick but we stay in
dynticks mode (ts->tick_stopped = 1) because we may need to stop the tick
again after that expired timer is handled.

Now this solution works most of the time but if we suffer an IRQ storm
and those interrupts trigger faster than the hardware clockevents min
delay, our tick won't fire until that IRQ storm is finished.

Here is the problem: on IRQ exit we reprog the timer to at least
NOW() + min_clockevents_delay. Another IRQ fires before the tick so we
reschedule again to NOW() + min_clockevents_delay, etc... The tick
is eternally rescheduled min_clockevents_delay ahead.

A solution is to simply remove this soft tick restart. After all
the normal dynticks evaluation path can handle 0 delay just fine. And
by doing that we benefit from the optimization branch which avoids
clock reprogramming if the clockevents deadline hasn't changed since
the last reprog. This fixes our issue because we don't do repetitive
clock reprog that always add hardware min delay.

As a side effect it should even optimize the 0 delay path in general.

Reported-and-tested-by: Octavian Purdila <octavian.purdila@nxp.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1496328429-13317-1-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-06-05 09:33:50 +02:00
Alexei Starovoitov
f91840a32d perf, bpf: Add BPF support to all perf_event types
Allow BPF_PROG_TYPE_PERF_EVENT program types to attach to all
perf_event types, including HW_CACHE, RAW, and dynamic pmu events.
Only tracepoint/kprobe events are treated differently which require
BPF_PROG_TYPE_TRACEPOINT/BPF_PROG_TYPE_KPROBE program types accordingly.

Also add support for reading all event counters using
bpf_perf_event_read() helper.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-04 21:58:01 -04:00
Thomas Gleixner
f2c45807d3 alarmtimer: Switch over to generic set/get/rearm routine
All required callbacks are in place. Switch the alarm timer based posix
interval timer callbacks to the common implementation and remove the
incorrect private implementation.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: John Stultz <john.stultz@linaro.org>
Link: http://lkml.kernel.org/r/20170530211657.825471962@linutronix.de
2017-06-04 15:40:32 +02:00
Thomas Gleixner
b3bf6f369d alarmtimer: Implement arm callback
Preparatory change to utilize the common posix timer mechanisms.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: John Stultz <john.stultz@linaro.org>
Link: http://lkml.kernel.org/r/20170530211657.747567162@linutronix.de
2017-06-04 15:40:31 +02:00
Thomas Gleixner
e344c9e76b alarmtimer: Implement try_to_cancel callback
Preparatory change to utilize the common posix timer mechanisms.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: John Stultz <john.stultz@linaro.org>
Link: http://lkml.kernel.org/r/20170530211657.670026824@linutronix.de
2017-06-04 15:40:31 +02:00
Thomas Gleixner
d653d8457c alarmtimer: Implement remaining callback
Preparatory change to utilize the common posix timer mechanisms.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: John Stultz <john.stultz@linaro.org>
Link: http://lkml.kernel.org/r/20170530211657.592676753@linutronix.de
2017-06-04 15:40:31 +02:00
Thomas Gleixner
e7561f1633 alarmtimer: Implement forward callback
Preparatory change to utilize the common posix timer mechanisms.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: John Stultz <john.stultz@linaro.org>
Link: http://lkml.kernel.org/r/20170530211657.513694229@linutronix.de
2017-06-04 15:40:30 +02:00
Thomas Gleixner
b3db80f77a alarmtimer: Implement timer_rearm() callback
Preparatory change to utilize the common posix timer mechanisms.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: John Stultz <john.stultz@linaro.org>
Link: http://lkml.kernel.org/r/20170530211657.434598989@linutronix.de
2017-06-04 15:40:30 +02:00
Thomas Gleixner
eae1c4ae27 posix-timers: Make use of cancel/arm callbacks
Replace the hrtimer calls by calls to the new try_to_cancel()/arm() kclock
callbacks and move the hrtimer specific implementation into the
corresponding callback functions.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: John Stultz <john.stultz@linaro.org>
Link: http://lkml.kernel.org/r/20170530211657.355396667@linutronix.de
2017-06-04 15:40:29 +02:00
Thomas Gleixner
525b8ed916 posix-timers: Add cancel/arm callbacks
Add timer_try_to_cancel() and timer_arm() callbacks to kclock which allow
to make common_timer_set() usable by both hrtimer and alarmtimer based
clocks.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: John Stultz <john.stultz@linaro.org>
Link: http://lkml.kernel.org/r/20170530211657.278022962@linutronix.de
2017-06-04 15:40:28 +02:00
Thomas Gleixner
eabdec0438 posix-timers: Zero settings value in common code
Zero out the settings struct in the common code so the callbacks do not
have to do it themself.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: John Stultz <john.stultz@linaro.org>
Link: http://lkml.kernel.org/r/20170530211657.200870713@linutronix.de
2017-06-04 15:40:28 +02:00
Thomas Gleixner
91d57bae08 posix-timers: Make use of forward/remaining callbacks
Replace the hrtimer calls by calls to the new forward/remaining kclock
callbacks and move the hrtimer specific implementation into the
corresponding callback functions.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: John Stultz <john.stultz@linaro.org>
Link: http://lkml.kernel.org/r/20170530211657.121437232@linutronix.de
2017-06-04 15:40:27 +02:00
Thomas Gleixner
63841b2a69 posix-timers: Add forward/remaining callbacks
Add two callbacks to kclock which allow using common_)timer_get() for both
hrtimer and alarm timer based clocks.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: John Stultz <john.stultz@linaro.org>
Link: http://lkml.kernel.org/r/20170530211657.044915536@linutronix.de
2017-06-04 15:40:27 +02:00
Thomas Gleixner
21e55c1f83 posix-timers: Add active flag to k_itimer
Keep track of the activation state of posix timers. This is a preparatory
change for making common_timer_get() usable by both hrtimer and alarm timer
implementations.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: John Stultz <john.stultz@linaro.org>
Link: http://lkml.kernel.org/r/20170530211656.967783982@linutronix.de
2017-06-04 15:40:26 +02:00
Thomas Gleixner
f37fb0aa4f posix-timers: Use timer_rearm() callback in posixtimer_rearm()
Use the new timer_rearm() callback to replace the conditional hardcoded
calls into the hrtimer and cpu timer code.

This allows later to bring the same logic to alarmtimers.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: John Stultz <john.stultz@linaro.org>
Link: http://lkml.kernel.org/r/20170530211656.889661919@linutronix.de
2017-06-04 15:40:26 +02:00
Thomas Gleixner
96fe3b072f posix-timers: Rename do_schedule_next_timer
That function is a misnomer. Rename it with a proper prefix to
posixtimer_rearm().

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: John Stultz <john.stultz@linaro.org>
Link: http://lkml.kernel.org/r/20170530211656.811362578@linutronix.de
2017-06-04 15:40:25 +02:00
Thomas Gleixner
3080294589 posix-timers: Add timer_rearm() callback
Add a timer_rearm() callback which is used to make the rescheduling of
posix interval timers independent of the underlying clock implementation.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: John Stultz <john.stultz@linaro.org>
Link: http://lkml.kernel.org/r/20170530211656.732632167@linutronix.de
2017-06-04 15:40:25 +02:00
Thomas Gleixner
d97bb75ddd posix-timers: Store k_clock pointer in k_itimer
Having the k_clock pointer in the k_itimer struct avoids the lookup in
several code pathes and makes the next steps of unification of the hrtimer
and alarmtimer based posix timers simpler.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: John Stultz <john.stultz@linaro.org>
Link: http://lkml.kernel.org/r/20170530211656.641222072@linutronix.de
2017-06-04 15:40:25 +02:00
Thomas Gleixner
80105cd0e6 posix-timers: Move interval out of the union
Preparatory patch to unify the alarm timer and hrtimer based posix interval
timer handling.

The interval is used as a criteria for rearming decisions so moving it out
of the clock specific data structures allows later unification.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: John Stultz <john.stultz@linaro.org>
Link: http://lkml.kernel.org/r/20170530211656.563922908@linutronix.de
2017-06-04 15:40:24 +02:00
Thomas Gleixner
af888d677a posix-timers: Unify overrun/requeue_pending handling
hrtimer based posix-timers and posix-cpu-timers handle the update of the
rearming and overflow related status fields differently.

Move that update to the common rearming code.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: John Stultz <john.stultz@linaro.org>
Link: http://lkml.kernel.org/r/20170530211656.484936964@linutronix.de
2017-06-04 15:40:24 +02:00
Thomas Gleixner
bab0aae9dc posix-timers: Move posix-timer internals to core
None of these declarations is required outside of kernel/time. Move them to
an internal header.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Christoph Hellwig <hch@lst.de>
Link: http://lkml.kernel.org/r/20170530211656.394803853@linutronix.de
2017-06-04 15:40:23 +02:00
Thomas Gleixner
6631fa12c1 posix-timers: Avoid gazillions of forward declarations
Move it below the actual implementations as there are new callbacks coming
which would require even more forward declarations.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: John Stultz <john.stultz@linaro.org>
Link: http://lkml.kernel.org/r/20170530211656.238209952@linutronix.de
2017-06-04 15:40:23 +02:00
Thomas Gleixner
3a06c7ac24 posix-clocks: Remove interval timer facility and mmap/fasync callbacks
The only user of this facility is ptp_clock, which does not implement any of
those functions.

Remove them to prevent accidental users. Especially the interval timer
interfaces are now more or less impossible to implement because the
necessary infrastructure has been confined to the core code. Aside of that
it's really complex to make these callbacks implemented according to spec
as the alarm timer implementation demonstrates. If at all then a nanosleep
callback might be a reasonable extension. For now keep just what ptp_clock
needs.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: John Stultz <john.stultz@linaro.org>
Link: http://lkml.kernel.org/r/20170530211656.145036286@linutronix.de
2017-06-04 15:40:22 +02:00
Thomas Gleixner
a81129e5a1 posix-timers: Remove unused export of posix_timer_event()
Since the removal of the mmtimer driver the export is not longer needed.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: John Stultz <john.stultz@linaro.org>
Link: http://lkml.kernel.org/r/20170530211656.052744418@linutronix.de
2017-06-04 15:40:22 +02:00
Thomas Gleixner
18c700c4e3 alarmtimer: Remove pointless config conditional
Having a IF_ENABLED(CONFIG_POSIX_TIMERS) inside of a
#ifdef CONFIG_POSIX_TIMERS section is pointless.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: John Stultz <john.stultz@linaro.org>
Link: http://lkml.kernel.org/r/20170530211655.975218056@linutronix.de
2017-06-04 15:40:22 +02:00
Thomas Gleixner
978267b643 Merge branch 'timers/urgent' into WIP.timers
Pick up urgent fixes to avoid conflicts.
2017-06-04 15:21:52 +02:00
Thomas Gleixner
ff86bf0c65 alarmtimer: Rate limit periodic intervals
The alarmtimer code has another source of potentially rearming itself too
fast. Interval timers with a very samll interval have a similar CPU hog
effect as the previously fixed overflow issue.

The reason is that alarmtimers do not implement the normal protection
against this kind of problem which the other posix timer use:

  timer expires -> queue signal -> deliver signal -> rearm timer

This scheme brings the rearming under scheduler control and prevents
permanently firing timers which hog the CPU.

Bringing this scheme to the alarm timer code is a major overhaul because it
lacks all the necessary mechanisms completely.

So for a quick fix limit the interval to one jiffie. This is not
problematic in practice as alarmtimers are usually backed by an RTC for
suspend which have 1 second resolution. It could be therefor argued that
the resolution of this clock should be set to 1 second in general, but
that's outside the scope of this fix.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Kostya Serebryany <kcc@google.com>
Cc: syzkaller <syzkaller@googlegroups.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: stable@vger.kernel.org
Link: http://lkml.kernel.org/r/20170530211655.896767100@linutronix.de
2017-06-04 15:21:18 +02:00
Thomas Gleixner
f4781e76f9 alarmtimer: Prevent overflow of relative timers
Andrey reported a alartimer related RCU stall while fuzzing the kernel with
syzkaller.

The reason for this is an overflow in ktime_add() which brings the
resulting time into negative space and causes immediate expiry of the
timer. The following rearm with a small interval does not bring the timer
back into positive space due to the same issue.

This results in a permanent firing alarmtimer which hogs the CPU.

Use ktime_add_safe() instead which detects the overflow and clamps the
result to KTIME_SEC_MAX.

Reported-by: Andrey Konovalov <andreyknvl@google.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Kostya Serebryany <kcc@google.com>
Cc: syzkaller <syzkaller@googlegroups.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: stable@vger.kernel.org
Link: http://lkml.kernel.org/r/20170530211655.802921648@linutronix.de
2017-06-04 15:21:18 +02:00
Christoph Hellwig
31ea70e030 posix-timers: Move the do_schedule_next_timer declaration
Having it in asm-generic/siginfo.h doesn't make any sense as it is in no way
architecture specific.  Move it to posix-timers.h instead.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-arch@vger.kernel.org
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: linux-ia64@vger.kernel.org
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: sparclinux@vger.kernel.org
Cc: "David S. Miller" <davem@davemloft.net>
Link: http://lkml.kernel.org/r/20170603190102.28866-4-hch@lst.de
2017-06-04 15:11:46 +02:00
Thomas Gleixner
04c848d398 genirq: Warn when IRQ_NOAUTOEN is used with shared interrupts
Shared interrupts do not go well with disabling auto enable:

1) The sharing interrupt might request it while it's still disabled and
   then wait for interrupts forever.

2) The interrupt might have been requested by the driver sharing the line
   before IRQ_NOAUTOEN has been set. So the driver which expects that
   disabled state after calling request_irq() will not get what it wants.
   Even worse, when it calls enable_irq() later, it will trigger the
   unbalanced enable_irq() warning.

Reported-by: Brian Norris <briannorris@chromium.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: dianders@chromium.org
Cc: jeffy <jeffy.chen@rock-chips.com>
Cc: Marc Zyngier <marc.zyngier@arm.com>
Cc: tfiga@chromium.org
Link: http://lkml.kernel.org/r/20170531100212.210682135@linutronix.de
2017-06-04 14:38:41 +02:00
Thomas Gleixner
201d7f47f3 genirq: Handle NOAUTOEN interrupt setup proper
If an interrupt is marked NOAUTOEN then request_irq() installs the action,
but does not enable the interrupt via startup_irq().  The interrupt is
enabled via enable_irq() later from the driver. enable_irq() calls
irq_enable().

That means that for interrupts which have a irq_startup() callback this
callback is never invoked. Neither is irq_domain_activate_irq() invoked for
such interrupts.

If an interrupt depends on irq_startup() or irq_domain_activate_irq() then
the enable via irq_enable() is not enough.

Add a status flag IRQD_IRQ_STARTED_UP and use this to select the proper
mechanism in enable_irq(). Use the flag also to avoid pointless calls into
the low level functions.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Marc Zyngier <marc.zyngier@arm.com>
Cc: dianders@chromium.org
Cc: jeffy <jeffy.chen@rock-chips.com>
Cc: Brian Norris <briannorris@chromium.org>
Cc: tfiga@chromium.org
Link: http://lkml.kernel.org/r/20170531100212.130986205@linutronix.de
2017-06-04 14:35:13 +02:00
Sebastian Andrzej Siewior
40da1b11f0 cpu/hotplug: Drop the device lock on error
If a custom CPU target is specified and that one is not available _or_
can't be interrupted then the code returns to userland without dropping a
lock as notices by lockdep:

|echo 133 > /sys/devices/system/cpu/cpu7/hotplug/target
| ================================================
| [ BUG: lock held when returning to user space! ]
| ------------------------------------------------
| bash/503 is leaving the kernel with locks still held!
| 1 lock held by bash/503:
|  #0:  (device_hotplug_lock){+.+...}, at: [<ffffffff815b5650>] lock_device_hotplug_sysfs+0x10/0x40

So release the lock then.

Fixes: 757c989b99 ("cpu/hotplug: Make target state writeable")
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Link: http://lkml.kernel.org/r/20170602142714.3ogo25f2wbq6fjpj@linutronix.de
2017-06-03 09:35:04 +02:00
Alexander Levin
e5aeee51f6 perf/core: Don't release cred_guard_mutex if not taken
If we failed to acquire task's cred_guard_mutex we shouldn't proceed
to release it in the error path.

Fixes: a63fbed776 ("perf/tracing/cpuhotplug: Fix locking order")
Signed-off-by: Alexander Levin <alexander.levin@verizon.com>
Cc: peterz@infradead.org
Cc: rostedt@goodmis.org
Cc: mathieu.desnoyers@efficios.com
Cc: mhiramat@kernel.org
Cc: paulmck@linux.vnet.ibm.com
Cc: bigeasy@linutronix.de
Link: http://lkml.kernel.org/r/20170603033903.12056-1-alexander.levin@verizon.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2017-06-03 09:28:45 +02:00
Chenbo Feng
80b7d81912 bpf: Remove the capability check for cgroup skb eBPF program
Currently loading a cgroup skb eBPF program require a CAP_SYS_ADMIN
capability while attaching the program to a cgroup only requires the
user have CAP_NET_ADMIN privilege. We can escape the capability
check when load the program just like socket filter program to make
the capability requirement consistent.

Change since v1:
Change the code style in order to be compliant with checkpatch.pl
preference

Signed-off-by: Chenbo Feng <fengc@google.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-02 14:24:40 -04:00
Chenbo Feng
fb9a307d11 bpf: Allow CGROUP_SKB eBPF program to access sk_buff
This allows cgroup eBPF program to classify packet based on their
protocol or other detail information. Currently program need
CAP_NET_ADMIN privilege to attach a cgroup eBPF program, and A
process with CAP_NET_ADMIN can already see all packets on the system,
for example, by creating an iptables rules that causes the packet to
be passed to userspace via NFLOG.

Signed-off-by: Chenbo Feng <fengc@google.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-02 14:24:40 -04:00
Linus Torvalds
035f1456f9 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/livepatching
Pull livepatching fix from Jiri Kosina:
 "Kconfig dependency fix for livepatching infrastructure from Miroslav
  Benes"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/livepatching:
  livepatch: Make livepatch dependent on !TRIM_UNUSED_KSYMS
2017-06-02 08:59:17 -07:00
Alexei Starovoitov
b870aa901f bpf: use different interpreter depending on required stack size
16 __bpf_prog_run() interpreters for various stack sizes add .text
but not a lot comparing to run-time stack savings

   text	   data	    bss	    dec	    hex	filename
  26350   10328     624   37302    91b6 kernel/bpf/core.o.before_split
  25777   10328     624   36729    8f79 kernel/bpf/core.o.after_split
  26970	  10328	    624	  37922	   9422	kernel/bpf/core.o.now

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-31 19:29:48 -04:00
Alexei Starovoitov
80a58d0255 bpf: reconcile bpf_tail_call and stack_depth
The next set of patches will take advantage of stack_depth tracking,
so make sure that the program that does bpf_tail_call() has
stack depth large enough for the callee.
We could have tracked the stack depth of the prog_array owner program
and only allow insertion of the programs with stack depth less
than the owner, but it will break existing applications.
Some of them have trivial root bpf program that only does
multiple bpf_tail_calls and at init time the prog array is empty.
In the future we may add a flag to do such tracking optionally,
but for now play simple and safe.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-31 19:29:47 -04:00
Alexei Starovoitov
8726679a0f bpf: teach verifier to track stack depth
teach verifier to track bpf program stack depth

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-31 19:29:47 -04:00
Alexei Starovoitov
f696b8f471 bpf: split bpf core interpreter
split __bpf_prog_run() interpreter into stack allocation and execution parts.
The code section shrinks which helps interpreter performance in some cases.
   text	   data	    bss	    dec	    hex	filename
  26350	  10328	    624	  37302	   91b6	kernel/bpf/core.o.before
  25777	  10328	    624	  36729	   8f79	kernel/bpf/core.o.after

Very short programs got slower (due to extra function call):
Before:
test_bpf: #89 ALU64_ADD_K: 1 + 2 = 3 jited:0 7 PASS
test_bpf: #90 ALU64_ADD_K: 3 + 0 = 3 jited:0 8 PASS
test_bpf: #91 ALU64_ADD_K: 1 + 2147483646 = 2147483647 jited:0 7 PASS
test_bpf: #92 ALU64_ADD_K: 4294967294 + 2 = 4294967296 jited:0 11 PASS
test_bpf: #93 ALU64_ADD_K: 2147483646 + -2147483647 = -1 jited:0 7 PASS
After:
test_bpf: #89 ALU64_ADD_K: 1 + 2 = 3 jited:0 11 PASS
test_bpf: #90 ALU64_ADD_K: 3 + 0 = 3 jited:0 11 PASS
test_bpf: #91 ALU64_ADD_K: 1 + 2147483646 = 2147483647 jited:0 11 PASS
test_bpf: #92 ALU64_ADD_K: 4294967294 + 2 = 4294967296 jited:0 14 PASS
test_bpf: #93 ALU64_ADD_K: 2147483646 + -2147483647 = -1 jited:0 10 PASS

Longer programs got faster:
Before:
test_bpf: #266 BPF_MAXINSNS: Ctx heavy transformations jited:0 20286 20513 PASS
test_bpf: #267 BPF_MAXINSNS: Call heavy transformations jited:0 31853 31768 PASS
test_bpf: #268 BPF_MAXINSNS: Jump heavy test jited:0 9815 PASS
test_bpf: #269 BPF_MAXINSNS: Very long jump backwards jited:0 6 PASS
test_bpf: #270 BPF_MAXINSNS: Edge hopping nuthouse jited:0 13959 PASS
test_bpf: #271 BPF_MAXINSNS: Jump, gap, jump, ... jited:0 210 PASS
test_bpf: #272 BPF_MAXINSNS: ld_abs+get_processor_id jited:0 21724 PASS
test_bpf: #273 BPF_MAXINSNS: ld_abs+vlan_push/pop jited:0 19118 PASS
After:
test_bpf: #266 BPF_MAXINSNS: Ctx heavy transformations jited:0 19008 18827 PASS
test_bpf: #267 BPF_MAXINSNS: Call heavy transformations jited:0 29238 28450 PASS
test_bpf: #268 BPF_MAXINSNS: Jump heavy test jited:0 9485 PASS
test_bpf: #269 BPF_MAXINSNS: Very long jump backwards jited:0 12 PASS
test_bpf: #270 BPF_MAXINSNS: Edge hopping nuthouse jited:0 13257 PASS
test_bpf: #271 BPF_MAXINSNS: Jump, gap, jump, ... jited:0 213 PASS
test_bpf: #272 BPF_MAXINSNS: ld_abs+get_processor_id jited:0 19389 PASS
test_bpf: #273 BPF_MAXINSNS: ld_abs+vlan_push/pop jited:0 19583 PASS

For real world production programs the difference is noise.

This patch is first step towards reducing interpreter stack consumption.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-31 19:29:47 -04:00
Alexei Starovoitov
71189fa9b0 bpf: free up BPF_JMP | BPF_CALL | BPF_X opcode
free up BPF_JMP | BPF_CALL | BPF_X opcode to be used by actual
indirect call by register and use kernel internal opcode to
mark call instruction into bpf_tail_call() helper.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-31 19:29:47 -04:00
Richard Guy Briggs
7786f6b6df audit: add ambient capabilities to CAPSET and BPRM_FCAPS records
Capabilities were augmented to include ambient capabilities in v4.3
commit 58319057b7 ("capabilities: ambient capabilities").

Add ambient capabilities to the audit BPRM_FCAPS and CAPSET records.

The record contains fields "old_pp", "old_pi", "old_pe", "new_pp",
"new_pi", "new_pe" so in keeping with the previous record
normalizations, change the "new_*" variants to simply drop the "new_"
prefix.

A sample of the replaced BPRM_FCAPS record:
RAW: type=BPRM_FCAPS msg=audit(1491468034.252:237): fver=2
fp=0000000000200000 fi=0000000000000000 fe=1 old_pp=0000000000000000
old_pi=0000000000000000 old_pe=0000000000000000 old_pa=0000000000000000
pp=0000000000200000 pi=0000000000000000 pe=0000000000200000
pa=0000000000000000

INTERPRET: type=BPRM_FCAPS msg=audit(04/06/2017 04:40:34.252:237):
fver=2 fp=sys_admin fi=none fe=chown old_pp=none old_pi=none
old_pe=none old_pa=none pp=sys_admin pi=none pe=sys_admin pa=none

A sample of the replaced CAPSET record:
RAW: type=CAPSET msg=audit(1491469502.371:242): pid=833
cap_pi=0000003fffffffff cap_pp=0000003fffffffff cap_pe=0000003fffffffff
cap_pa=0000000000000000

INTERPRET: type=CAPSET msg=audit(04/06/2017 05:05:02.371:242) : pid=833
cap_pi=chown,dac_override,dac_read_search,fowner,fsetid,kill,
setgid,setuid,setpcap,linux_immutable,net_bind_service,net_broadcast,
net_admin,net_raw,ipc_lock,ipc_owner,sys_module,sys_rawio,sys_chroot,
sys_ptrace,sys_pacct,sys_admin,sys_boot,sys_nice,sys_resource,sys_time,
sys_tty_config,mknod,lease,audit_write,audit_control,setfcap,
mac_override,mac_admin,syslog,wake_alarm,block_suspend,audit_read
cap_pp=chown,dac_override,dac_read_search,fowner,fsetid,kill,setgid,
setuid,setpcap,linux_immutable,net_bind_service,net_broadcast,
net_admin,net_raw,ipc_lock,ipc_owner,sys_module,sys_rawio,sys_chroot,
sys_ptrace,sys_pacct,sys_admin,sys_boot,sys_nice,sys_resource,
sys_time,sys_tty_config,mknod,lease,audit_write,audit_control,setfcap,
mac_override,mac_admin,syslog,wake_alarm,block_suspend,audit_read
cap_pe=chown,dac_override,dac_read_search,fowner,fsetid,kill,setgid,
setuid,setpcap,linux_immutable,net_bind_service,net_broadcast,
net_admin,net_raw,ipc_lock,ipc_owner,sys_module,sys_rawio,sys_chroot,
sys_ptrace,sys_pacct,sys_admin,sys_boot,sys_nice,sys_resource,
sys_time,sys_tty_config,mknod,lease,audit_write,audit_control,setfcap,
mac_override,mac_admin,syslog,wake_alarm,block_suspend,audit_read
cap_pa=none

See: https://github.com/linux-audit/audit-kernel/issues/40

Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
Acked-by: Serge Hallyn <serge@hallyn.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
2017-05-30 17:36:11 -04:00
Frederic Weisbecker
7c25904508 nohz: Reset next_tick cache even when the timer has no regs
Handle tick interrupts whose regs are NULL, out of general paranoia. It happens
when hrtimer_interrupt() is called from non-interrupt contexts, such as hotplug
CPU down events.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-05-30 18:35:32 +02:00
Al Viro
bcfe8ad8ef do_sigaltstack(): lift copying to/from userland into callers
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-05-27 15:38:17 -04:00
Al Viro
613763a1f0 take compat_sys_old_getrlimit() to native syscall
... and sanitize the ifdefs in there

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-05-27 15:38:06 -04:00
Linus Torvalds
39b8ab31bc Merge branch 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer fixlet from Thomas Gleixner:
 "Silence dmesg spam by making the posix cpu timer printks depend on
  print_fatal_signals"

* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  posix-timers: Make signal printks conditional
2017-05-27 09:14:24 -07:00
Linus Torvalds
805f286907 Merge branch 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull locking fix from Thomas Gleixner:
 "A fix for a state leak which was introduced in the recent rework of
  futex/rtmutex interaction"

* 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  futex,rt_mutex: Fix rt_mutex_cleanup_proxy_lock()
2017-05-27 08:59:37 -07:00
Linus Torvalds
d024baa58a Merge branch 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull kthread fix from Thomas Gleixner:
 "A single fix which prevents a use after free when kthread fork fails"

* 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  kthread: Fix use-after-free if kthread fork fails
2017-05-27 08:52:27 -07:00
Linus Torvalds
77d6465695 There's been a few memory issues found with ftrace.
One was simply a memory leak where not all was being freed that should
 have been in releasing a file pointer on set_graph_function.
 
 Then Thomas found that the ftrace trampolines were marked for read/write
 as well as execute. To shrink the possible attack surface, he added
 calls to set them to ro. Which also uncovered some other issues with
 freeing module allocated memory that had its permissions changed.
 
 Kprobes had a similar issue which is fixed and a selftest was added
 to trigger that issue again.
 -----BEGIN PGP SIGNATURE-----
 
 iQExBAABCAAbBQJZKOiVFBxyb3N0ZWR0QGdvb2RtaXMub3JnAAoJEMm5BfJq2Y3L
 vBoH/jxVozuAEVCv+Nbj6fhRxe4emjo0lZZb32EbEaSV/nUQGqHIZFdDQtbt+ld+
 sn06/BSMBI+L4BqLj1BCAW0e/zIn/4birIg53SX5jQwc3AlhUG7HS2d+RJZZCrp9
 Zofq9L6xZ4Hl2XjkPXqwEgtrwxQtkIPLlJqeYDJ6BVrlPfOPEwB7bfR7B684wiYT
 6h2Qo7f/ZQzgJ1sK8N2IjHEnAgE08KCYcj4IB4WHJk6SqQz3bv1Y00WBg2UQihVT
 TPPSVhYLnrSw53fxyALqZbHo2DvnQf1TnNadWxvSIpbvgm/T5GG60FDtvHgNfbwz
 yKuKAog+P9xBLkoAcfvODLY9O5s=
 =75TZ
 -----END PGP SIGNATURE-----

Merge tag 'trace-v4.12-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace

Pull ftrace fixes from Steven Rostedt:
 "There's been a few memory issues found with ftrace.

  One was simply a memory leak where not all was being freed that should
  have been in releasing a file pointer on set_graph_function.

  Then Thomas found that the ftrace trampolines were marked for
  read/write as well as execute. To shrink the possible attack surface,
  he added calls to set them to ro. Which also uncovered some other
  issues with freeing module allocated memory that had its permissions
  changed.

  Kprobes had a similar issue which is fixed and a selftest was added to
  trigger that issue again"

* tag 'trace-v4.12-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
  x86/ftrace: Make sure that ftrace trampolines are not RWX
  x86/mm/ftrace: Do not bug in early boot on irqs_disabled in cpu_flush_range()
  selftests/ftrace: Add a testcase for many kprobe events
  kprobes/x86: Fix to set RWX bits correctly before releasing trampoline
  ftrace: Fix memory leak in ftrace_graph_release()
2017-05-27 08:30:30 -07:00
Thomas Gleixner
b6b3b80fce alarmtimer: Fix posix-timer constification fallout
Some freezer related variables are only used when either CONFIG_POSIX_TIMER
or CONFIG_RTC_CLASS are enabled. Hide them when both are off.

Fixes: d3ba5a9a34 ("posix-timers: Make posix_clocks immutable")
Reported-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Christoph Helwig <hch@lst.de>
2017-05-27 12:23:47 +02:00
Christoph Hellwig
d3ba5a9a34 posix-timers: Make posix_clocks immutable
There are no more modular users providing a posix clock. The register
function is now pointless so the posix clock array can be initialized
statically at compile time and the array including the various k_clock
structs can be marked 'const'.

Inspired by changes in the Grsecurity patch set, but done proper.

[ tglx: Massaged changelog and fixed the POSIX_TIMER=n case ]

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Mike Travis <mike.travis@hpe.com>
Cc: Dimitri Sivanich <sivanich@hpe.com>
Link: http://lkml.kernel.org/r/20170526090311.3377-3-hch@lst.de
2017-05-27 09:46:35 +02:00
Masami Hiramatsu
c93f5cf571 kprobes/x86: Fix to set RWX bits correctly before releasing trampoline
Fix kprobes to set(recover) RWX bits correctly on trampoline
buffer before releasing it. Releasing readonly page to
module_memfree() crash the kernel.

Without this fix, if kprobes user register a bunch of kprobes
in function body (since kprobes on function entry usually
use ftrace) and unregister it, kernel hits a BUG and crash.

Link: http://lkml.kernel.org/r/149570868652.3518.14120169373590420503.stgit@devbox

Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Fixes: d0381c81c2 ("kprobes/x86: Set kprobes pages read-only")
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-05-26 22:37:00 -04:00
Luis Henriques
f9797c2f20 ftrace: Fix memory leak in ftrace_graph_release()
ftrace_hash is being kfree'ed in ftrace_graph_release(), however the
->buckets field is not.  This results in a memory leak that is easily
captured by kmemleak:

unreferenced object 0xffff880038afe000 (size 8192):
  comm "trace-cmd", pid 238, jiffies 4294916898 (age 9.736s)
  hex dump (first 32 bytes):
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
  backtrace:
    [<ffffffff815f561e>] kmemleak_alloc+0x4e/0xb0
    [<ffffffff8113964d>] __kmalloc+0x12d/0x1a0
    [<ffffffff810bf6d1>] alloc_ftrace_hash+0x51/0x80
    [<ffffffff810c0523>] __ftrace_graph_open.isra.39.constprop.46+0xa3/0x100
    [<ffffffff810c05e8>] ftrace_graph_open+0x68/0xa0
    [<ffffffff8114003d>] do_dentry_open.isra.1+0x1bd/0x2d0
    [<ffffffff81140df7>] vfs_open+0x47/0x60
    [<ffffffff81150f95>] path_openat+0x2a5/0x1020
    [<ffffffff81152d6a>] do_filp_open+0x8a/0xf0
    [<ffffffff811411df>] do_sys_open+0x12f/0x200
    [<ffffffff811412ce>] SyS_open+0x1e/0x20
    [<ffffffff815fa6e0>] entry_SYSCALL_64_fastpath+0x13/0x94
    [<ffffffffffffffff>] 0xffffffffffffffff

Link: http://lkml.kernel.org/r/20170525152038.7661-1-lhenriques@suse.com

Cc: stable@vger.kernel.org
Fixes: b9b0c831be ("ftrace: Convert graph filter to use hash tables")
Signed-off-by: Luis Henriques <lhenriques@suse.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-05-26 22:35:48 -04:00
Miroslav Benes
5720acf4bf livepatch: Make livepatch dependent on !TRIM_UNUSED_KSYMS
If TRIM_UNUSED_KSYMS is enabled, all unneeded exported symbols are made
unexported. Two-pass build of the kernel is done to find out which
symbols are needed based on a configuration. This effectively
complicates things for out-of-tree modules.

Livepatch exports functions to (un)register and enable/disable a live
patch. The only in-tree module which uses these functions is a sample in
samples/livepatch/. If the sample is disabled, the functions are
trimmed and out-of-tree live patches cannot be built.

Note that live patches are intended to be built out-of-tree.

Suggested-by: Michal Marek <mmarek@suse.com>
Acked-by: Josh Poimboeuf <jpoimboe@redhat.com>
Acked-by: Jessica Yu <jeyu@redhat.com>
Signed-off-by: Miroslav Benes <mbenes@suse.cz>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2017-05-27 00:27:37 +02:00
Linus Torvalds
6741d51699 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:

 1) Fix state pruning in bpf verifier wrt. alignment, from Daniel
    Borkmann.

 2) Handle non-linear SKBs properly in SCTP ICMP parsing, from Davide
    Caratti.

 3) Fix bit field definitions for rss_hash_type of descriptors in mlx5
    driver, from Jesper Brouer.

 4) Defer slave->link updates until bonding is ready to do a full commit
    to the new settings, from Nithin Sujir.

 5) Properly reference count ipv4 FIB metrics to avoid use after free
    situations, from Eric Dumazet and several others including Cong Wang
    and Julian Anastasov.

 6) Fix races in llc_ui_bind(), from Lin Zhang.

 7) Fix regression of ESP UDP encapsulation for TCP packets, from
    Steffen Klassert.

 8) Fix mdio-octeon driver Kconfig deps, from Randy Dunlap.

 9) Fix regression in setting DSCP on ipv6/GRE encapsulation, from Peter
    Dawson.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (43 commits)
  ipv4: add reference counting to metrics
  net: ethernet: ax88796: don't call free_irq without request_irq first
  ip6_tunnel, ip6_gre: fix setting of DSCP on encapsulated packets
  sctp: fix ICMP processing if skb is non-linear
  net: llc: add lock_sock in llc_ui_bind to avoid a race condition
  bonding: Don't update slave->link until ready to commit
  test_bpf: Add a couple of tests for BPF_JSGE.
  bpf: add various verifier test cases
  bpf: fix wrong exposure of map_flags into fdinfo for lpm
  bpf: add bpf_clone_redirect to bpf_helper_changes_pkt_data
  bpf: properly reset caller saved regs after helper call and ld_abs/ind
  bpf: fix incorrect pruning decision when alignment must be tracked
  arp: fixed -Wuninitialized compiler warning
  tcp: avoid fastopen API to be used on AF_UNSPEC
  net: move somaxconn init from sysctl code
  net: fix potential null pointer dereference
  geneve: fix fill_info when using collect_metadata
  virtio-net: enable TSO/checksum offloads for Q-in-Q vlans
  be2net: Fix offload features for Q-in-Q packets
  vlan: Fix tcp checksum offloads in Q-in-Q vlans
  ...
2017-05-26 13:51:01 -07:00
Vincent Legoll
5a29ef2209 genirq: Make early_irq_init() print out more informative
The printk in early_irq_init() is cryptic and badly formatted:

  NR_IRQS:33024 nr_irqs:968 16

The last number is the number of preallocated interrupts, so add a prefix
to it:

  NR_IRQS: 33024, nr_irqs: 968, preallocated irqs: 16

Cleanup the formatting for better readability as well.

Signed-off-by: Vincent Legoll <vincent.legoll@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1494318849-6733-1-git-send-email-vincent.legoll@gmail.com
2017-05-26 14:54:05 +02:00
Thomas Gleixner
49dfe2a677 cpuhotplug: Link lock stacks for hotplug callbacks
The CPU hotplug callbacks are not covered by lockdep versus the cpu hotplug
rwsem.

CPU0						CPU1
cpuhp_setup_state(STATE, startup, teardown);
 cpus_read_lock();
  invoke_callback_on_ap();
    kick_hotplug_thread(ap);
    wait_for_completion();			hotplug_thread_fn()
    						  lock(m);
						  do_stuff();
						  unlock(m);

Lockdep does not know about this dependency and will not trigger on the
following code sequence:

	  lock(m);
	  cpus_read_lock();
	  
Add a lockdep map and connect the initiators lock chain with the hotplug
thread lock chain, so potential deadlocks can be detected.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Acked-by: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sebastian Siewior <bigeasy@linutronix.de>
Cc: Steven Rostedt <rostedt@goodmis.org>
Link: http://lkml.kernel.org/r/20170524081549.709375845@linutronix.de
2017-05-26 10:10:48 +02:00
Thomas Gleixner
fc8dffd379 cpu/hotplug: Convert hotplug locking to percpu rwsem
There are no more (known) nested calls to get_online_cpus() and all
observed lock ordering problems have been addressed.

Replace the magic nested 'rwsem' hackery with a percpu-rwsem.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Acked-by: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sebastian Siewior <bigeasy@linutronix.de>
Cc: Steven Rostedt <rostedt@goodmis.org>
Link: http://lkml.kernel.org/r/20170524081549.447014063@linutronix.de
2017-05-26 10:10:46 +02:00
Thomas Gleixner
2d1e38f566 kprobes: Cure hotplug lock ordering issues
Converting the cpu hotplug locking to a percpu rwsem unearthed hidden lock
ordering problems.

There is a wide range of locks involved in this: kprobe_mutex,
jump_label_mutex, ftrace_lock, text_mutex, event_mutex, module_mutex,
func_hash->regex_lock and a gazillion of lock order permutations with
nested get_online_cpus() calls.

Some of those permutations are potential deadlocks even with the current
nesting hotplug locking scheme, but they can't be discovered by lockdep.

The conversion of the hotplug locking to a percpu rwsem requires to prevent
nested locking, so it's required to take the hotplug rwsem early in the
call chain and establish a proper lock order.

After quite some analysis and going down the wrong road severa times the
following lock order has been chosen:

kprobe_mutex -> cpus_rwsem -> jump_label_mutex -> text_mutex

For kprobes which hook on an ftrace function trace point, it's required to
drop cpus_rwsem before calling into the ftrace code to avoid a deadlock on
the func_hash->regex_lock.

[ Steven: Ftrace interaction fixes ]

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sebastian Siewior <bigeasy@linutronix.de>
Link: http://lkml.kernel.org/r/20170524081549.104864779@linutronix.de
2017-05-26 10:10:45 +02:00
Thomas Gleixner
f2545b2d4c jump_label: Reorder hotplug lock and jump_label_lock
The conversion of the hotplug locking to a percpu rwsem unearthed lock
ordering issues all over the place.

The jump_label code has two issues:

 1) Nested get_online_cpus() invocations

 2) Ordering problems vs. the cpus rwsem and the jump_label_mutex

To cure these, the following lock order has been established;

   cpus_rwsem -> jump_label_lock -> text_mutex

Even if not all architectures need protection against CPU hotplug, taking
cpus_rwsem before jump_label_lock is now mandatory in code pathes which
actually modify code and therefor need text_mutex protection.

Move the get_online_cpus() invocations into the core jump label code and
establish the proper lock order where required.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Ingo Molnar <mingo@kernel.org>
Acked-by: "David S. Miller" <davem@davemloft.net>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Chris Metcalf <cmetcalf@mellanox.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sebastian Siewior <bigeasy@linutronix.de>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Jason Baron <jbaron@akamai.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Link: http://lkml.kernel.org/r/20170524081549.025830817@linutronix.de
2017-05-26 10:10:45 +02:00
Thomas Gleixner
a63fbed776 perf/tracing/cpuhotplug: Fix locking order
perf, tracing, kprobes and jump_labels have a gazillion of ways to create
dependency lock chains. Some of those involve nested invocations of
get_online_cpus().

The conversion of the hotplug locking to a percpu rwsem requires to avoid
such nested calls. sys_perf_event_open() protects most of the syscall logic
against cpu hotplug. This causes nested calls and lock inversions versus
ftrace and kprobes in various interesting ways.

It's impossible to move the hotplug locking to the outer end of all call
chains in the involved facilities, so the hotplug protection in
sys_perf_event_open() needs to be solved differently.

Introduce 'pmus_mutex' which protects a perf private online cpumask. This
mutex is taken when the mask is updated in the cpu hotplug callbacks and
can be taken in sys_perf_event_open() to protect the swhash setup/teardown
code and when the final judgement about a valid event has to be made.

[ tglx: Produced changelog and fixed the swhash interaction ]

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Ingo Molnar <mingo@kernel.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Sebastian Siewior <bigeasy@linutronix.de>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Link: http://lkml.kernel.org/r/20170524081548.930941109@linutronix.de
2017-05-26 10:10:44 +02:00
Sebastian Andrzej Siewior
210e21331f cpu/hotplug: Use stop_machine_cpuslocked() in takedown_cpu()
takedown_cpu() is a cpu hotplug function invoking stop_machine(). The cpu
hotplug machinery holds the hotplug lock for write.

stop_machine() invokes get_online_cpus() as well. This is correct, but
prevents the conversion of the hotplug locking to a percpu rwsem.

Use stop_machine_cpuslocked() to avoid the nested call.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Acked-by: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Link: http://lkml.kernel.org/r/20170524081548.423292433@linutronix.de
2017-05-26 10:10:42 +02:00
Sebastian Andrzej Siewior
c5a81c8ff8 padata: Avoid nested calls to cpus_read_lock() in pcrypt_init_padata()
pcrypt_init_padata()
   cpus_read_lock()
   padata_alloc_possible()
     padata_alloc()
       cpus_read_lock()

The nested call to cpus_read_lock() works with the current implementation,
but prevents the conversion to a percpu rwsem.

The other caller of padata_alloc_possible() is pcrypt_init_padata() which
calls from a cpus_read_lock() protected region as well.

Remove the cpus_read_lock() call in padata_alloc() and document the
calling convention.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Acked-by: Ingo Molnar <mingo@kernel.org>
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: linux-crypto@vger.kernel.org
Link: http://lkml.kernel.org/r/20170524081547.571278910@linutronix.de
2017-05-26 10:10:37 +02:00
Thomas Gleixner
9596695ee1 padata: Make padata_alloc() static
No users outside of padata.c

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Acked-by: Ingo Molnar <mingo@kernel.org>
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sebastian Siewior <bigeasy@linutronix.de>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: linux-crypto@vger.kernel.org
Link: http://lkml.kernel.org/r/20170524081547.491457256@linutronix.de
2017-05-26 10:10:37 +02:00
Sebastian Andrzej Siewior
fe5595c074 stop_machine: Provide stop_machine_cpuslocked()
Some call sites of stop_machine() are within a get_online_cpus() protected
region.

stop_machine() calls get_online_cpus() as well, which is possible in the
current implementation but prevents converting the hotplug locking to a
percpu rwsem.

Provide stop_machine_cpuslocked() to avoid nested calls to get_online_cpus().

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Acked-by: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Link: http://lkml.kernel.org/r/20170524081547.400700852@linutronix.de
2017-05-26 10:10:36 +02:00
Thomas Gleixner
9805c67333 cpu/hotplug: Add __cpuhp_state_add_instance_cpuslocked()
Add cpuslocked() variants for the multi instance registration so this can
be called from a cpus_read_lock() protected region.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Acked-by: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sebastian Siewior <bigeasy@linutronix.de>
Cc: Steven Rostedt <rostedt@goodmis.org>
Link: http://lkml.kernel.org/r/20170524081547.321782217@linutronix.de
2017-05-26 10:10:35 +02:00
Sebastian Andrzej Siewior
71def423fe cpu/hotplug: Provide cpuhp_setup/remove_state[_nocalls]_cpuslocked()
Some call sites of cpuhp_setup/remove_state[_nocalls]() are within a
cpus_read locked region.

cpuhp_setup/remove_state[_nocalls]() call cpus_read_lock() as well, which
is possible in the current implementation but prevents converting the
hotplug locking to a percpu rwsem.

Provide locked versions of the interfaces to avoid nested calls to
cpus_read_lock().

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Acked-by: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Link: http://lkml.kernel.org/r/20170524081547.239600868@linutronix.de
2017-05-26 10:10:35 +02:00
Thomas Gleixner
8f553c498e cpu/hotplug: Provide cpus_read|write_[un]lock()
The counting 'rwsem' hackery of get|put_online_cpus() is going to be
replaced by percpu rwsem.

Rename the functions to make it clear that it's locking and not some
refcount style interface. These new functions will be used for the
preparatory patches which make the code ready for the percpu rwsem
conversion.

Rename all instances in the cpu hotplug code while at it.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Acked-by: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sebastian Siewior <bigeasy@linutronix.de>
Cc: Steven Rostedt <rostedt@goodmis.org>
Link: http://lkml.kernel.org/r/20170524081547.080397752@linutronix.de
2017-05-26 10:10:34 +02:00
Babu Moger
9ab6055f95 kernel/locking: Fix compile error with qrwlock.c
Saw these compile errors on SPARC when queued rwlock feature is enabled.

 CC      kernel/locking/qrwlock.o
kernel/locking/qrwlock.c: In function ‘queued_read_lock_slowpath’:
kernel/locking/qrwlock.c:89: error: implicit declaration of function ‘arch_spin_lock’
kernel/locking/qrwlock.c:102: error: implicit declaration of function ‘arch_spin_unlock’
make[4]: *** [kernel/locking/qrwlock.o] Error 1

Include spinlock.h in qrwlock.c to fix it.

Signed-off-by: Babu Moger <babu.moger@oracle.com>
Reviewed-by: Håkon Bugge <haakon.bugge@oracle.com>
Reviewed-by: Jane Chu <jane.chu@oracle.com>
Reviewed-by: Shannon Nelson <shannon.nelson@oracle.com>
Reviewed-by: Vijay Kumar <vijay.ac.kumar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-25 12:06:50 -07:00
Daniel Borkmann
a316338cb7 bpf: fix wrong exposure of map_flags into fdinfo for lpm
trie_alloc() always needs to have BPF_F_NO_PREALLOC passed in via
attr->map_flags, since it does not support preallocation yet. We
check the flag, but we never copy the flag into trie->map.map_flags,
which is later on exposed into fdinfo and used by loaders such as
iproute2. Latter uses this in bpf_map_selfcheck_pinned() to test
whether a pinned map has the same spec as the one from the BPF obj
file and if not, bails out, which is currently the case for lpm
since it exposes always 0 as flags.

Also copy over flags in array_map_alloc() and stack_map_alloc().
They always have to be 0 right now, but we should make sure to not
miss to copy them over at a later point in time when we add actual
flags for them to use.

Fixes: b95a5c4db0 ("bpf: add a longest prefix match trie map implementation")
Reported-by: Jarno Rajahalme <jarno@covalent.io>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-25 13:44:28 -04:00
Daniel Borkmann
a9789ef9af bpf: properly reset caller saved regs after helper call and ld_abs/ind
Currently, after performing helper calls, we clear all caller saved
registers, that is r0 - r5 and fill r0 depending on struct bpf_func_proto
specification. The way we reset these regs can affect pruning decisions
in later paths, since we only reset register's imm to 0 and type to
NOT_INIT. However, we leave out clearing of other variables such as id,
min_value, max_value, etc, which can later on lead to pruning mismatches
due to stale data.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-25 13:44:27 -04:00
Daniel Borkmann
1ad2f5838d bpf: fix incorrect pruning decision when alignment must be tracked
Currently, when we enforce alignment tracking on direct packet access,
the verifier lets the following program pass despite doing a packet
write with unaligned access:

  0: (61) r2 = *(u32 *)(r1 +76)
  1: (61) r3 = *(u32 *)(r1 +80)
  2: (61) r7 = *(u32 *)(r1 +8)
  3: (bf) r0 = r2
  4: (07) r0 += 14
  5: (25) if r7 > 0x1 goto pc+4
   R0=pkt(id=0,off=14,r=0) R1=ctx R2=pkt(id=0,off=0,r=0)
   R3=pkt_end R7=inv,min_value=0,max_value=1 R10=fp
  6: (2d) if r0 > r3 goto pc+1
   R0=pkt(id=0,off=14,r=14) R1=ctx R2=pkt(id=0,off=0,r=14)
   R3=pkt_end R7=inv,min_value=0,max_value=1 R10=fp
  7: (63) *(u32 *)(r0 -4) = r0
  8: (b7) r0 = 0
  9: (95) exit

  from 6 to 8:
   R0=pkt(id=0,off=14,r=0) R1=ctx R2=pkt(id=0,off=0,r=0)
   R3=pkt_end R7=inv,min_value=0,max_value=1 R10=fp
  8: (b7) r0 = 0
  9: (95) exit

  from 5 to 10:
   R0=pkt(id=0,off=14,r=0) R1=ctx R2=pkt(id=0,off=0,r=0)
   R3=pkt_end R7=inv,min_value=2 R10=fp
  10: (07) r0 += 1
  11: (05) goto pc-6
  6: safe                           <----- here, wrongly found safe
  processed 15 insns

However, if we enforce a pruning mismatch by adding state into r8
which is then being mismatched in states_equal(), we find that for
the otherwise same program, the verifier detects a misaligned packet
access when actually walking that path:

  0: (61) r2 = *(u32 *)(r1 +76)
  1: (61) r3 = *(u32 *)(r1 +80)
  2: (61) r7 = *(u32 *)(r1 +8)
  3: (b7) r8 = 1
  4: (bf) r0 = r2
  5: (07) r0 += 14
  6: (25) if r7 > 0x1 goto pc+4
   R0=pkt(id=0,off=14,r=0) R1=ctx R2=pkt(id=0,off=0,r=0)
   R3=pkt_end R7=inv,min_value=0,max_value=1
   R8=imm1,min_value=1,max_value=1,min_align=1 R10=fp
  7: (2d) if r0 > r3 goto pc+1
   R0=pkt(id=0,off=14,r=14) R1=ctx R2=pkt(id=0,off=0,r=14)
   R3=pkt_end R7=inv,min_value=0,max_value=1
   R8=imm1,min_value=1,max_value=1,min_align=1 R10=fp
  8: (63) *(u32 *)(r0 -4) = r0
  9: (b7) r0 = 0
  10: (95) exit

  from 7 to 9:
   R0=pkt(id=0,off=14,r=0) R1=ctx R2=pkt(id=0,off=0,r=0)
   R3=pkt_end R7=inv,min_value=0,max_value=1
   R8=imm1,min_value=1,max_value=1,min_align=1 R10=fp
  9: (b7) r0 = 0
  10: (95) exit

  from 6 to 11:
   R0=pkt(id=0,off=14,r=0) R1=ctx R2=pkt(id=0,off=0,r=0)
   R3=pkt_end R7=inv,min_value=2
   R8=imm1,min_value=1,max_value=1,min_align=1 R10=fp
  11: (07) r0 += 1
  12: (b7) r8 = 0
  13: (05) goto pc-7                <----- mismatch due to r8
  7: (2d) if r0 > r3 goto pc+1
   R0=pkt(id=0,off=15,r=15) R1=ctx R2=pkt(id=0,off=0,r=15)
   R3=pkt_end R7=inv,min_value=2
   R8=imm0,min_value=0,max_value=0,min_align=2147483648 R10=fp
  8: (63) *(u32 *)(r0 -4) = r0
  misaligned packet access off 2+15+-4 size 4

The reason why we fail to see it in states_equal() is that the
third test in compare_ptrs_to_packet() ...

  if (old->off <= cur->off &&
      old->off >= old->range && cur->off >= cur->range)
          return true;

... will let the above pass. The situation we run into is that
old->off <= cur->off (14 <= 15), meaning that prior walked paths
went with smaller offset, which was later used in the packet
access after successful packet range check and found to be safe
already.

For example: Given is R0=pkt(id=0,off=0,r=0). Adding offset 14
as in above program to it, results in R0=pkt(id=0,off=14,r=0)
before the packet range test. Now, testing this against R3=pkt_end
with 'if r0 > r3 goto out' will transform R0 into R0=pkt(id=0,off=14,r=14)
for the case when we're within bounds. A write into the packet
at offset *(u32 *)(r0 -4), that is, 2 + 14 -4, is valid and
aligned (2 is for NET_IP_ALIGN). After processing this with
all fall-through paths, we later on check paths from branches.
When the above skb->mark test is true, then we jump near the
end of the program, perform r0 += 1, and jump back to the
'if r0 > r3 goto out' test we've visited earlier already. This
time, R0 is of type R0=pkt(id=0,off=15,r=0), and we'll prune
that part because this time we'll have a larger safe packet
range, and we already found that with off=14 all further insn
were already safe, so it's safe as well with a larger off.
However, the problem is that the subsequent write into the packet
with 2 + 15 -4 is then unaligned, and not caught by the alignment
tracking. Note that min_align, aux_off, and aux_off_align were
all 0 in this example.

Since we cannot tell at this time what kind of packet access was
performed in the prior walk and what minimal requirements it has
(we might do so in the future, but that requires more complexity),
fix it to disable this pruning case for strict alignment for now,
and let the verifier do check such paths instead. With that applied,
the test cases pass and reject the program due to misalignment.

Fixes: d117441674 ("bpf: Track alignment of register values in the verifier.")
Reference: http://patchwork.ozlabs.org/patch/761909/
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-25 13:44:27 -04:00
Peter Rajnoha
f36776fafb kobject: support passing in variables for synthetic uevents
This patch makes it possible to pass additional arguments in addition
to uevent action name when writing /sys/.../uevent attribute. These
additional arguments are then inserted into generated synthetic uevent
as additional environment variables.

Before, we were not able to pass any additional uevent environment
variables for synthetic uevents. This made it hard to identify such uevents
properly in userspace to make proper distinction between genuine uevents
originating from kernel and synthetic uevents triggered from userspace.
Also, it was not possible to pass any additional information which would
make it possible to optimize and change the way the synthetic uevents are
processed back in userspace based on the originating environment of the
triggering action in userspace. With the extra additional variables, we are
able to pass through this extra information needed and also it makes it
possible to synchronize with such synthetic uevents as they can be clearly
identified back in userspace.

The format for writing the uevent attribute is following:

    ACTION [UUID [KEY=VALUE ...]

There's no change in how "ACTION" is recognized - it stays the same
("add", "change", "remove"). The "ACTION" is the only argument required
to generate synthetic uevent, the rest of arguments, that this patch
adds support for, are optional.

The "UUID" is considered as transaction identifier so it's possible to
use the same UUID value for one or more synthetic uevents in which case
we logically group these uevents together for any userspace listeners.
The "UUID" is expected to be in "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"
format where "x" is a hex digit. The value appears in uevent as
"SYNTH_UUID=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx" environment variable.

The "KEY=VALUE" pairs can contain alphanumeric characters only. It's
possible to define zero or more more pairs - each pair is then delimited
by a space character " ". Each pair appears in synthetic uevents as
"SYNTH_ARG_KEY=VALUE" environment variable. That means the KEY name gains
"SYNTH_ARG_" prefix to avoid possible collisions with existing variables.
To pass the "KEY=VALUE" pairs, it's also required to pass in the "UUID"
part for the synthetic uevent first.

If "UUID" is not passed in, the generated synthetic uevent gains
"SYNTH_UUID=0" environment variable automatically so it's possible to
identify this situation in userspace when reading generated uevent and so
we can still make a difference between genuine and synthetic uevents.

Signed-off-by: Peter Rajnoha <prajnoha@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-05-25 18:30:51 +02:00
Tejun Heo
41c25707d2 cpuset: consider dying css as offline
In most cases, a cgroup controller don't care about the liftimes of
cgroups.  For the controller, a css becomes online when ->css_online()
is called on it and offline when ->css_offline() is called.

However, cpuset is special in that the user interface it exposes cares
whether certain cgroups exist or not.  Combined with the RCU delay
between cgroup removal and css offlining, this can lead to user
visible behavior oddities where operations which should succeed after
cgroup removals fail for some time period.  The effects of cgroup
removals are delayed when seen from userland.

This patch adds css_is_dying() which tests whether offline is pending
and updates is_cpuset_online() so that the function returns false also
while offline is pending.  This gets rid of the userland visible
delays.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Link: http://lkml.kernel.org/r/327ca1f5-7957-fbb9-9e5f-9ba149d40ba2@oracle.com
Cc: stable@vger.kernel.org
Signed-off-by: Tejun Heo <tj@kernel.org>
2017-05-24 12:43:30 -04:00
Linus Torvalds
2426125ab4 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
Pull ptrace fix from Eric Biederman:
 "This fixes a brown paper bag bug. When I fixed the ptrace interaction
  with user namespaces I added a new field ptracer_cred in struct_task
  and I failed to properly initialize it on fork.

  This dangling pointer wound up breaking runing setuid applications run
  from the enlightenment window manager.

  As this is the worst sort of bug. A regression breaking user space for
  no good reason let's get this fixed"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
  ptrace: Properly initialize ptracer_cred on fork
2017-05-24 08:28:59 -07:00
Peter Zijlstra
45aea32167 sched/clock: Fix early boot preempt assumption in __set_sched_clock_stable()
The more strict early boot preemption warnings found that
__set_sched_clock_stable() was incorrectly assuming we'd still be
running on a single CPU:

  BUG: using smp_processor_id() in preemptible [00000000] code: swapper/0/1
  caller is debug_smp_processor_id+0x1c/0x1e
  CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.12.0-rc2-00108-g1c3c5ea #1
  Call Trace:
   dump_stack+0x110/0x192
   check_preemption_disabled+0x10c/0x128
   ? set_debug_rodata+0x25/0x25
   debug_smp_processor_id+0x1c/0x1e
   sched_clock_init_late+0x27/0x87
  [...]

Fix it by disabling IRQs.

Reported-by: kernel test robot <xiaolong.ye@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: lkp@01.org
Cc: tipbuild@zytor.com
Link: http://lkml.kernel.org/r/20170524065202.v25vyu7pvba5mhpd@hirez.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-05-24 09:10:00 +02:00
Thomas Gleixner
43fe8b8eb8 posix-timers: Make signal printks conditional
A recent commit added extra printks for CPU/RT limits. This can result in
excessive spam in dmesg.

Make the printks conditional on print_fatal_signals.

Fixes: e7ea7c9806 ("rlimits: Print more information when CPU/RT limits are exceeded")
Reported-by: Dave Jones <davej@codemonkey.org.uk>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Arun Raghavan <arun@arunraghavan.net>
2017-05-23 23:39:57 +02:00
Kees Cook
3e2e857f9c module: Add module name to modinfo
Accessing the mod structure (e.g. for mod->name) prior to having completed
check_modstruct_version() can result in writing garbage to the error logs
if the layout of the mod structure loaded from disk doesn't match the
running kernel's mod structure layout. This kind of mismatch will become
much more likely if a kernel is built with different randomization seed
for the struct layout randomization plugin.

Instead, add and use a new modinfo string for logging the module name.

Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Jessica Yu <jeyu@redhat.com>
2017-05-23 14:08:31 -07:00
Kees Cook
4901942696 module: Pass struct load_info into symbol checks
Since we're already using values from struct load_info, just pass this
pointer in directly and use what's needed as we need it. This allows us
to access future fields in struct load_info too.

Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Jessica Yu <jeyu@redhat.com>
2017-05-23 14:08:18 -07:00
Richard Guy Briggs
4b3e4ed6b0 audit: unswing cap_* fields in PATH records
The cap_* fields swing in and out of PATH records.
If no capabilities are set, the cap_* fields are completely missing and when
one of the cap_fi or cap_fp values is empty, that field is omitted.

Original:
type=PATH msg=audit(04/20/2017 12:17:11.222:193) : item=1 name=/lib64/ld-linux-x86-64.so.2 inode=787694 dev=08:03 mode=file,755 ouid=root ogid=root rdev=00:00 obj=system_u:object_r:ld_so_t:s0 nametype=NORMAL
type=PATH msg=audit(04/20/2017 12:17:11.222:193) : item=0 name=/home/sleep inode=1319469 dev=08:03 mode=file,suid,755 ouid=root ogid=root rdev=00:00 obj=system_u:object_r:bin_t:s0 nametype=NORMAL cap_fp=sys_admin cap_fe=1 cap_fver=2

Normalize the PATH record by always printing all 4 cap_* fields.

Fixed:
type=PATH msg=audit(04/20/2017 13:01:31.679:201) : item=1 name=/lib64/ld-linux-x86-64.so.2 inode=787694 dev=08:03 mode=file,755 ouid=root ogid=root rdev=00:00 obj=system_u:object_r:ld_so_t:s0 nametype=NORMAL cap_fp=none cap_fi=none cap_fe=0 cap_fver=0
type=PATH msg=audit(04/20/2017 13:01:31.679:201) : item=0 name=/home/sleep inode=1319469 dev=08:03 mode=file,suid,755 ouid=root ogid=root rdev=00:00 obj=system_u:object_r:bin_t:s0 nametype=NORMAL cap_fp=sys_admin cap_fi=none cap_fe=1 cap_fver=2

See: https://github.com/linux-audit/audit-kernel/issues/42

Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
2017-05-23 16:50:02 -04:00
Eric W. Biederman
c70d9d809f ptrace: Properly initialize ptracer_cred on fork
When I introduced ptracer_cred I failed to consider the weirdness of
fork where the task_struct copies the old value by default.  This
winds up leaving ptracer_cred set even when a process forks and
the child process does not wind up being ptraced.

Because ptracer_cred is not set on non-ptraced processes whose
parents were ptraced this has broken the ability of the enlightenment
window manager to start setuid children.

Fix this by properly initializing ptracer_cred in ptrace_init_task

This must be done with a little bit of care to preserve the current value
of ptracer_cred when ptrace carries through fork.  Re-reading the
ptracer_cred from the ptracing process at this point is inconsistent
with how PT_PTRACE_CAP has been maintained all of these years.

Tested-by: Takashi Iwai <tiwai@suse.de>
Fixes: 64b875f7ac ("ptrace: Capture the ptracer's creds not PT_PTRACE_CAP")
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2017-05-23 07:40:44 -05:00
Thomas Gleixner
1c3c5eab17 sched/core: Enable might_sleep() and smp_processor_id() checks early
might_sleep() and smp_processor_id() checks are enabled after the boot
process is done. That hides bugs in the SMP bringup and driver
initialization code.

Enable it right when the scheduler starts working, i.e. when init task and
kthreadd have been created and right before the idle task enables
preemption.

Tested-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Mark Rutland <mark.rutland@arm.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Link: http://lkml.kernel.org/r/20170516184736.272225698@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-05-23 10:01:38 +02:00
Thomas Gleixner
ff48cd26fc printk: Adjust system_state checks
To enable smp_processor_id() and might_sleep() debug checks earlier, it's
required to add system states between SYSTEM_BOOTING and SYSTEM_RUNNING.

Adjust the system_state check in boot_delay_msec() to handle the extra
states.

Tested-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20170516184736.027534895@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-05-23 10:01:37 +02:00
Thomas Gleixner
0594729c24 extable: Adjust system_state checks
To enable smp_processor_id() and might_sleep() debug checks earlier, it's
required to add system states between SYSTEM_BOOTING and SYSTEM_RUNNING.

Adjust the system_state check in core_kernel_text() to handle the extra
states, i.e. to cover init text up to the point where the system switches
to state RUNNING.

Tested-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20170516184735.949992741@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-05-23 10:01:37 +02:00
Thomas Gleixner
b4def42724 async: Adjust system_state checks
To enable smp_processor_id() and might_sleep() debug checks earlier, it's
required to add system states between SYSTEM_BOOTING and SYSTEM_RUNNING.

Adjust the system_state check in async_run_entry_fn() and
async_synchronize_cookie_domain() to handle the extra states.

Tested-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Arjan van de Ven <arjan@linux.intel.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Link: http://lkml.kernel.org/r/20170516184735.865155020@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-05-23 10:01:37 +02:00
Vlastimil Babka
8655d54977 sched/numa: Use down_read_trylock() for the mmap_sem
A customer has reported a soft-lockup when running an intensive
memory stress test, where the trace on multiple CPU's looks like this:

 RIP: 0010:[<ffffffff810c53fe>]
  [<ffffffff810c53fe>] native_queued_spin_lock_slowpath+0x10e/0x190
...
 Call Trace:
  [<ffffffff81182d07>] queued_spin_lock_slowpath+0x7/0xa
  [<ffffffff811bc331>] change_protection_range+0x3b1/0x930
  [<ffffffff811d4be8>] change_prot_numa+0x18/0x30
  [<ffffffff810adefe>] task_numa_work+0x1fe/0x310
  [<ffffffff81098322>] task_work_run+0x72/0x90

Further investigation showed that the lock contention here is pmd_lock().

The task_numa_work() function makes sure that only one thread is let to perform
the work in a single scan period (via cmpxchg), but if there's a thread with
mmap_sem locked for writing for several periods, multiple threads in
task_numa_work() can build up a convoy waiting for mmap_sem for read and then
all get unblocked at once.

This patch changes the down_read() to the trylock version, which prevents the
build up. For a workload experiencing mmap_sem contention, it's probably better
to postpone the NUMA balancing work anyway. This seems to have fixed the soft
lockups involving pmd_lock(), which is in line with the convoy theory.

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20170515131316.21909-1-vbabka@suse.cz
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-05-23 10:01:34 +02:00
Dave Kleikamp
c249f255aa sched/rt: Minimize rq->lock contention in do_sched_rt_period_timer()
With CONFIG_RT_GROUP_SCHED=y, do_sched_rt_period_timer() sequentially
takes each CPU's rq->lock. On a large, busy system, the cumulative time it
takes to acquire each lock can be excessive, even triggering a watchdog
timeout.

If rt_rq->rt_time and rt_rq->rt_nr_running are both zero, this function does
nothing while holding the lock, so don't bother taking it at all.

Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/a767637b-df85-912f-ba69-c90ee00a3fb6@oracle.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-05-23 10:01:34 +02:00
Steven Rostedt (VMware)
896bbb2522 sched/core: Allow __sched_setscheduler() in interrupts when PI is not used
When priority inheritance was added back in 2.6.18 to sched_setscheduler(), it
added a path to taking an rt-mutex wait_lock, which is not IRQ safe. As PI
is not a common occurrence, lockdep will likely never trigger if
sched_setscheduler was called from interrupt context. A BUG_ON() was added
to trigger if __sched_setscheduler() was ever called from interrupt context
because there was a possibility to take the wait_lock.

Today the wait_lock is irq safe, but the path to taking it in
sched_setscheduler() is the same as the path to taking it from normal
context. The wait_lock is taken with raw_spin_lock_irq() and released with
raw_spin_unlock_irq() which will indiscriminately enable interrupts,
which would be bad in interrupt context.

The problem is that normalize_rt_tasks, which is called by triggering the
sysrq nice-all-RT-tasks was changed to call __sched_setscheduler(), and this
is done from interrupt context!

Now __sched_setscheduler() takes a "pi" parameter that is used to know if
the priority inheritance should be called or not. As the BUG_ON() only cares
about calling the PI code, it should only bug if called from interrupt
context with the "pi" parameter set to true.

Reported-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
Tested-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@osdl.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: dbc7f069b9 ("sched: Use replace normalize_task() with __sched_setscheduler()")
Link: http://lkml.kernel.org/r/20170308124654.10e598f2@gandalf.local.home
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-05-23 10:01:34 +02:00
Byungchul Park
a776b968e5 sched/deadline: Remove unnecessary condition in push_dl_task()
pick_next_pushable_dl_task(rq) has BUG_ON(rq->cpu != task_cpu(task))
when it returns a task other than NULL, which means that task_cpu(task)
must be rq->cpu. So if task == next_task, then task_cpu(next_task) must
be rq->cpu as well. Remove the redundant condition and make the code simpler.

This way one unnecessary branch and two LOAD operations can be avoided.

Signed-off-by: Byungchul Park <byungchul.park@lge.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Reviewed-by: Juri Lelli <juri.lelli@arm.com>
Reviewed-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: <kernel-team@lge.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1494551159-22367-1-git-send-email-byungchul.park@lge.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-05-23 10:01:33 +02:00
Byungchul Park
de16b91eff sched/rt: Remove unnecessary condition in push_rt_task()
pick_next_pushable_task(rq) has BUG_ON(rq_cpu != task_cpu(task)) when
it returns a task other than NULL, which means that task_cpu(task) must
be rq->cpu. So if task == next_task, then task_cpu(next_task) must be
rq->cpu as well. Remove the redundant condition and make the code simpler.

This way one unnecessary branch and two LOAD operations can be avoided.

Signed-off-by: Byungchul Park <byungchul.park@lge.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Reviewed-by: Juri Lelli <juri.lelli@arm.com>
Reviewed-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: <kernel-team@lge.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1494551143-22219-1-git-send-email-byungchul.park@lge.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-05-23 10:01:33 +02:00
Byungchul Park
73215849df sched/core: Use the new llist_for_each_entry_safe() primitive
Now that we've added llist_for_each_entry_safe(), use it to simplify
an open coded version of it in sched_ttwu_pending().

Signed-off-by: Byungchul Park <byungchul.park@lge.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: <kernel-team@lge.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1494549584-11730-1-git-send-email-byungchul.park@lge.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-05-23 10:01:33 +02:00
Peter Zijlstra
6c8557bdb2 smp, cpumask: Use non-atomic cpumask_{set,clear}_cpu()
The cpumasks in smp_call_function_many() are private and not subject
to concurrency, atomic bitops are pointless and expensive.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-05-23 10:01:32 +02:00
Aaron Lu
3fc5b3b6a8 smp: Avoid sending needless IPI in smp_call_function_many()
Inter-Processor-Interrupt(IPI) is needed when a page is unmapped and the
process' mm_cpumask() shows the process has ever run on other CPUs. page
migration, page reclaim all need IPIs. The number of IPI needed to send
to different CPUs is especially large for multi-threaded workload since
mm_cpumask() is per process.

For smp_call_function_many(), whenever a CPU queues a CSD to a target
CPU, it will send an IPI to let the target CPU to handle the work.
This isn't necessary - we need only send IPI when queueing a CSD
to an empty call_single_queue.

The reason:

flush_smp_call_function_queue() that is called upon a CPU receiving an
IPI will empty the queue and then handle all of the CSDs there. So if
the target CPU's call_single_queue is not empty, we know that:
i.  An IPI for the target CPU has already been sent by 'previous queuers';
ii. flush_smp_call_function_queue() hasn't emptied that CPU's queue yet.
Thus, it's safe for us to just queue our CSD there without sending an
addtional IPI. And for the 'previous queuers', we can limit it to the
first queuer.

To demonstrate the effect of this patch, a multi-thread workload that
spawns 80 threads to equally consume 100G memory is used. This is tested
on a 2 node broadwell-EP which has 44cores/88threads and 32G memory. So
after 32G memory is used up, page reclaiming starts to happen a lot.

With this patch, IPI number dropped 88% and throughput increased about
15% for the above workload.

Signed-off-by: Aaron Lu <aaron.lu@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Link: http://lkml.kernel.org/r/20170519075331.GE2084@aaronlu.sh.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-05-23 10:01:32 +02:00
Ingo Molnar
386b554888 Merge branch 'linus' into sched/core, to pick up fixes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-05-23 09:50:35 +02:00
Dan Carpenter
36cc2b9222 perf/core: Fix error handling in perf_event_alloc()
We don't set an error code here which means that perf_event_alloc()
returns ERR_PTR(0) (in other words NULL).  The callers are not expecting
that and would Oops.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Fixes: 375637bc52 ("perf/core: Introduce address range filtering")
Link: http://lkml.kernel.org/r/20170522090418.hvs6icgpdo53wkn5@mwanda
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-05-23 09:50:05 +02:00
Dan Carpenter
85c617abc7 perf/core: Remove some dead code
perf_init_event() can't return NULL.  If it did, the error handling is
incomplete and we would crash.  I have removed this confusing dead code.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Link: http://lkml.kernel.org/r/20170522090348.5g7yyld5en3yeky4@mwanda
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-05-23 09:50:05 +02:00
Linus Torvalds
801099bed0 Power management updates for v4.12-rc3
- Fix RTC wakeup from suspend-to-idle broken by the recent rework
    of ACPI wakeup handling (Rafael Wysocki).
 
  - Update intel_pstate driver documentation to reflect the current
    code and explain how it works in more detail (Rafael Wysocki).
 
  - Fix an issue related to CPU idleness detection on systems with
    shared cpufreq policies in the schedutil governor (Juri Lelli).
 
  - Fix a possible build issue in the dbx500 cpufreq driver (Arnd
    Bergmann).
 
  - Fix a function in the power capping framework core to return
    an error code instead of 0 when there's an error (Dan Carpenter).
 
  - Clean up variable definition in the hibernation core (Pushkar
    Jambhlekar).
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQIcBAABCAAGBQJZIzszAAoJEILEb/54YlRxMG0P/R4VpPMB1l+wxQRmCMwzOupC
 GJ1jTa2mQQpPy57QPjaCDlUPxSaZA97S4MO0eMn4Or6LX3rG7kTUoe1WaYvRhWNk
 Ul2UfoLdVeFJwvQrzOZKB2xnEGA/nD2jlsD/9zYzy9FxMPjiG0F//RZvhZJVChpg
 wycz9Rw1T2x+1URAD5wkS4xLWzQEv5NqH6mc/KAoP/ntxe+7ahs5SnWmF9MLpHj7
 jXM9651BUSYp3QzHCHFObvsVZfbZz7isFIADmwsxzTy7vTPb1oIyo7EQ5QMcsivS
 LlJjrYy9JN0alwND0mistVlAmFVvvldckjR8zHSEiFt8IeMccrFw0inGir2ngghY
 53kMnJ/QoL1A/C539MHoAmfnpqB0QUd56QjXngungC47YpVHi5DaSXU7rln2xy/C
 7o7gbHUKUbStSvDLjRcQ915HANOuXkJk84BMIGUSlT3K/MvGAMKUNxZV7KOOngpb
 WR4G2lxjYTIHKB+YP5AmG2kMF4GlbGnIQts5Ryd5FijIH3/MYJ4W2Kas+GvbnoBb
 7NtDjyBJgjxleTv3fV89Pod+dKdFzrTRl+mr6bsn/WCiMjUHoXcTnOHh3OO/fJ8F
 AW/dywk9+Hx5DyjY04EJyklflfne97T7/NjJ99Zjzh/EC+uePeM+dMd+o66PpYG5
 +FJgyPc5ZaX1f2thAgv+
 =2sNW
 -----END PGP SIGNATURE-----

Merge tag 'pm-4.12-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull power management fixes from Rafael Wysocki:
 "These fix RTC wakeup from suspend-to-idle broken recently, fix CPU
  idleness detection condition in the schedutil cpufreq governor, fix a
  cpufreq driver build failure, fix an error code path in the power
  capping framework, clean up the hibernate core and update the
  intel_pstate documentation.

  Specifics:

   - Fix RTC wakeup from suspend-to-idle broken by the recent rework of
     ACPI wakeup handling (Rafael Wysocki).

   - Update intel_pstate driver documentation to reflect the current
     code and explain how it works in more detail (Rafael Wysocki).

   - Fix an issue related to CPU idleness detection on systems with
     shared cpufreq policies in the schedutil governor (Juri Lelli).

   - Fix a possible build issue in the dbx500 cpufreq driver (Arnd
     Bergmann).

   - Fix a function in the power capping framework core to return an
     error code instead of 0 when there's an error (Dan Carpenter).

   - Clean up variable definition in the hibernation core (Pushkar
     Jambhlekar)"

* tag 'pm-4.12-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  cpufreq: dbx500: add a Kconfig symbol
  PM / hibernate: Declare variables as static
  PowerCap: Fix an error code in powercap_register_zone()
  RTC: rtc-cmos: Fix wakeup from suspend-to-idle
  PM / wakeup: Fix up wakeup_source_report_event()
  cpufreq: intel_pstate: Document the current behavior and user interface
  cpufreq: schedutil: use now as reference when aggregating shared policy requests
2017-05-22 19:24:32 -07:00
Marc Zyngier
a97b852b4d genirq/msi: Populate the domain name if provided by the irqchip
In order to ease debug, let's populate the domain name upfront, before any
MSI gets requested. This allows the domain to appear in the
irq_domain_mapping, and the user to easily find the expected data.

Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Link: http://lkml.kernel.org/r/20170512115538.10767-4-marc.zyngier@arm.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2017-05-22 22:29:45 +02:00
Marc Zyngier
2370c00dc7 irqdomain: Let irq_domain_mapping display ACPI fwnode attributes
If the system is using ACPI, there is no of_node to display. But ACPI can
use a struct irqchip_fwid as a domain identifier, and it can be used to
display the name contained in that structure.

The output on such a system will look like this:

 pMSI      0           0           0  irqchip@00000000e1180000
 MSI      37           0           0  irqchip@00000000e1180000
 GICv2m   37           0           0  irqchip@00000000e1180000
 GICv2   448         448           0  irqchip@ffff000008003000

Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Link: http://lkml.kernel.org/r/20170512115538.10767-3-marc.zyngier@arm.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2017-05-22 22:29:44 +02:00
Marc Zyngier
fe17a42e70 irqdomain: Let irq_domain_mapping display hierarchical domains
Hierarchical domains seem to be hard to grasp, and a number of
aspiring kernel hackers find them utterly discombobulating.

In order to ease their pain, let's make them appear in
/sys/kernel/debug/irq_domain_mapping, such as the following:

   96  0x81808  MSI    0x          (null) RADIX   MSI
   96+ 0x00063  GICv2m 0xffff8003ee116980 RADIX   GICv2m
   96+ 0x00063  GICv2  0xffff00000916bfd8 LINEAR  GICv2

[output compressed to fit in a commit log]

This shows that IRQ96 is implemented by a stack of three domains,
the + sign indicating the stacking.

Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Link: http://lkml.kernel.org/r/20170512115538.10767-2-marc.zyngier@arm.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2017-05-22 22:29:44 +02:00
Vegard Nossum
4d6501dce0 kthread: Fix use-after-free if kthread fork fails
If a kthread forks (e.g. usermodehelper since commit 1da5c46fa9) but
fails in copy_process() between calling dup_task_struct() and setting
p->set_child_tid, then the value of p->set_child_tid will be inherited
from the parent and get prematurely freed by free_kthread_struct().

    kthread()
     - worker_thread()
        - process_one_work()
        |  - call_usermodehelper_exec_work()
        |     - kernel_thread()
        |        - _do_fork()
        |           - copy_process()
        |              - dup_task_struct()
        |                 - arch_dup_task_struct()
        |                    - tsk->set_child_tid = current->set_child_tid // implied
        |              - ...
        |              - goto bad_fork_*
        |              - ...
        |              - free_task(tsk)
        |                 - free_kthread_struct(tsk)
        |                    - kfree(tsk->set_child_tid)
        - ...
        - schedule()
           - __schedule()
              - wq_worker_sleeping()
                 - kthread_data(task)->flags // UAF

The problem started showing up with commit 1da5c46fa9 since it reused
->set_child_tid for the kthread worker data.

A better long-term solution might be to get rid of the ->set_child_tid
abuse. The comment in set_kthread_struct() also looks slightly wrong.

Debugged-by: Jamie Iles <jamie.iles@oracle.com>
Fixes: 1da5c46fa9 ("kthread: Make struct kthread kmalloc'ed")
Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Jamie Iles <jamie.iles@oracle.com>
Cc: stable@vger.kernel.org
Link: http://lkml.kernel.org/r/20170509073959.17858-1-vegard.nossum@oracle.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2017-05-22 22:21:16 +02:00
Michael Hernandez
6f9a22bc57 PCI/MSI: Ignore affinity if pre/post vector count is more than min_vecs
min_vecs is the minimum amount of vectors needed to operate in MSI-X mode
which may just include the vectors that don't need affinity.

Disabling affinity settings causes the qla2xxx driver scsi_add_host() to fail
when blk_mq is enabled as the blk_mq_pci_map_queues() expects affinity masks
on each vector.

Fixes: dfef358bd1 ("PCI/MSI: Don't apply affinity if there aren't enough vectors left")
Signed-off-by: Michael Hernandez <michael.hernandez@cavium.com>
Signed-off-by: Himanshu Madhani <himanshu.madhani@cavium.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: stable@vger.kernel.org	# v4.10+
2017-05-22 15:06:05 -05:00
Peter Zijlstra
04dc1b2fff futex,rt_mutex: Fix rt_mutex_cleanup_proxy_lock()
Markus reported that the glibc/nptl/tst-robustpi8 test was failing after
commit:

  cfafcd117d ("futex: Rework futex_lock_pi() to use rt_mutex_*_proxy_lock()")

The following trace shows the problem:

 ld-linux-x86-64-2161  [019] ....   410.760971: SyS_futex: 00007ffbeb76b028: 80000875  op=FUTEX_LOCK_PI
 ld-linux-x86-64-2161  [019] ...1   410.760972: lock_pi_update_atomic: 00007ffbeb76b028: curval=80000875 uval=80000875 newval=80000875 ret=0
 ld-linux-x86-64-2165  [011] ....   410.760978: SyS_futex: 00007ffbeb76b028: 80000875  op=FUTEX_UNLOCK_PI
 ld-linux-x86-64-2165  [011] d..1   410.760979: do_futex: 00007ffbeb76b028: curval=80000875 uval=80000875 newval=80000871 ret=0
 ld-linux-x86-64-2165  [011] ....   410.760980: SyS_futex: 00007ffbeb76b028: 80000871 ret=0000
 ld-linux-x86-64-2161  [019] ....   410.760980: SyS_futex: 00007ffbeb76b028: 80000871 ret=ETIMEDOUT

Task 2165 does an UNLOCK_PI, assigning the lock to the waiter task 2161
which then returns with -ETIMEDOUT. That wrecks the lock state, because now
the owner isn't aware it acquired the lock and removes the pending robust
list entry.

If 2161 is killed, the robust list will not clear out this futex and the
subsequent acquire on this futex will then (correctly) result in -ESRCH
which is unexpected by glibc, triggers an internal assertion and dies.

Task 2161			Task 2165

rt_mutex_wait_proxy_lock()
   timeout();
   /* T2161 is still queued in  the waiter list */
   return -ETIMEDOUT;

				futex_unlock_pi()
				spin_lock(hb->lock);
				rtmutex_unlock()
				  remove_rtmutex_waiter(T2161);
				   mark_lock_available();
				/* Make the next waiter owner of the user space side */
				futex_uval = 2161;
				spin_unlock(hb->lock);
spin_lock(hb->lock);
rt_mutex_cleanup_proxy_lock()
  if (rtmutex_owner() !== current)
     ...
     return FAIL;
....
return -ETIMEOUT;

This means that rt_mutex_cleanup_proxy_lock() needs to call
try_to_take_rt_mutex() so it can take over the rtmutex correctly which was
assigned by the waker. If the rtmutex is owned by some other task then this
call is harmless and just confirmes that the waiter is not able to acquire
it.

While there, fix what looks like a merge error which resulted in
rt_mutex_cleanup_proxy_lock() having two calls to
fixup_rt_mutex_waiters() and rt_mutex_wait_proxy_lock() not having any.
Both should have one, since both potentially touch the waiter list.

Fixes: 38d589f2fd ("futex,rt_mutex: Restructure rt_mutex_finish_proxy_lock()")
Reported-by: Markus Trippelsdorf <markus@trippelsdorf.de>
Bug-Spotted-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Florian Weimer <fweimer@redhat.com>
Cc: Darren Hart <dvhart@infradead.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Markus Trippelsdorf <markus@trippelsdorf.de>
Link: http://lkml.kernel.org/r/20170519154850.mlomgdsd26drq5j6@hirez.programming.kicks-ass.net
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2017-05-22 21:57:18 +02:00
Linus Torvalds
86ca984cef Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:
 "Mostly netfilter bug fixes in here, but we have some bits elsewhere as
  well.

   1) Don't do SNAT replies for non-NATed connections in IPVS, from
      Julian Anastasov.

   2) Don't delete conntrack helpers while they are still in use, from
      Liping Zhang.

   3) Fix zero padding in xtables's xt_data_to_user(), from Willem de
      Bruijn.

   4) Add proper RCU protection to nf_tables_dump_set() because we
      cannot guarantee that we hold the NFNL_SUBSYS_NFTABLES lock. From
      Liping Zhang.

   5) Initialize rcv_mss in tcp_disconnect(), from Wei Wang.

   6) smsc95xx devices can't handle IPV6 checksums fully, so don't
      advertise support for offloading them. From Nisar Sayed.

   7) Fix out-of-bounds access in __ip6_append_data(), from Eric
      Dumazet.

   8) Make atl2_probe() propagate the error code properly on failures,
      from Alexey Khoroshilov.

   9) arp_target[] in bond_check_params() is used uninitialized. This
      got changes from a global static to a local variable, which is how
      this mistake happened. Fix from Jarod Wilson.

  10) Fix fallout from unnecessary NULL check removal in cls_matchall,
      from Jiri Pirko. This is definitely brown paper bag territory..."

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (26 commits)
  net: sched: cls_matchall: fix null pointer dereference
  vsock: use new wait API for vsock_stream_sendmsg()
  bonding: fix randomly populated arp target array
  net: Make IP alignment calulations clearer.
  bonding: fix accounting of active ports in 3ad
  net: atheros: atl2: don't return zero on failure path in atl2_probe()
  ipv6: fix out of bound writes in __ip6_append_data()
  bridge: start hello_timer when enabling KERNEL_STP in br_stp_start
  smsc95xx: Support only IPv4 TCP/UDP csum offload
  arp: always override existing neigh entries with gratuitous ARP
  arp: postpone addr_type calculation to as late as possible
  arp: decompose is_garp logic into a separate function
  arp: fixed error in a comment
  tcp: initialize rcv_mss to TCP_MIN_MSS instead of 0
  netfilter: xtables: fix build failure from COMPAT_XT_ALIGN outside CONFIG_COMPAT
  ebtables: arpreply: Add the standard target sanity check
  netfilter: nf_tables: revisit chain/object refcounting from elements
  netfilter: nf_tables: missing sanitization in data from userspace
  netfilter: nf_tables: can't assume lock is acquired when dumping set elems
  netfilter: synproxy: fix conntrackd interaction
  ...
2017-05-22 12:42:02 -07:00
Rafael J. Wysocki
bb47e96417 Merge branches 'pm-sleep' and 'powercap'
* pm-sleep:
  PM / hibernate: Declare variables as static
  RTC: rtc-cmos: Fix wakeup from suspend-to-idle
  PM / wakeup: Fix up wakeup_source_report_event()

* powercap:
  PowerCap: Fix an error code in powercap_register_zone()
2017-05-22 20:32:05 +02:00
Rafael J. Wysocki
079c1812a2 Merge branches 'intel_pstate', 'pm-cpufreq' and 'pm-cpufreq-sched'
* intel_pstate:
  cpufreq: intel_pstate: Document the current behavior and user interface

* pm-cpufreq:
  cpufreq: dbx500: add a Kconfig symbol

* pm-cpufreq-sched:
  cpufreq: schedutil: use now as reference when aggregating shared policy requests
2017-05-22 20:28:22 +02:00
David S. Miller
e4eda884db net: Make IP alignment calulations clearer.
The assignmnet:

	ip_align = strict ? 2 : NET_IP_ALIGN;

in compare_pkt_ptr_alignment() trips up Coverity because we can only
get to this code when strict is true, therefore ip_align will always
be 2 regardless of NET_IP_ALIGN's value.

So just assign directly to '2' and explain the situation in the
comment above.

Reported-by: "Gustavo A. R. Silva" <garsilva@embeddedor.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-22 12:27:07 -04:00
Linus Torvalds
970c305aa8 Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fix from Thomas Gleixner:
 "A single scheduler fix:

  Prevent idle task from ever being preempted. That makes sure that
  synchronize_rcu_tasks() which is ignoring idle task does not pretend
  that no task is stuck in preempted state. If that happens and idle was
  preempted on a ftrace trampoline the machine crashes due to
  inconsistent state"

* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/core: Call __schedule() from do_idle() without enabling preemption
2017-05-21 11:52:00 -07:00
Linus Torvalds
e7a3d62749 Merge branch 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irq fixes from Thomas Gleixner:
 "A set of small fixes for the irq subsystem:

   - Cure a data ordering problem with chained interrupts

   - Three small fixlets for the mbigen irq chip"

* 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  genirq: Fix chained interrupt data ordering
  irqchip/mbigen: Fix the clear register offset calculation
  irqchip/mbigen: Fix potential NULL dereferencing
  irqchip/mbigen: Fix memory mapping code
2017-05-21 11:45:26 -07:00
Al Viro
92ebce5ac5 osf_wait4: switch to kernel_wait4()
... and sanitize copying rusage to userland

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-05-21 13:16:26 -04:00
Al Viro
4c48abe91b waitid(): switch copyout of siginfo to unsafe_put_user()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-05-21 13:16:18 -04:00
Al Viro
76d9871e11 wait_task_zombie: consolidate info logics
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-05-21 13:14:59 -04:00
Al Viro
bb380ec33a kill wait_noreap_copyout()
folds into callers

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-05-21 13:13:53 -04:00
Al Viro
e61a250229 lift getrusage() from wait_noreap_copyout()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-05-21 13:13:17 -04:00
Al Viro
67d7ddded3 waitid(2): leave copyout of siginfo to syscall itself
have kernel_waitid() collect the information needed for siginfo into
a small structure (waitid_info) passed to it; deal with copyout in
sys_waitid()/compat_sys_waitid().

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-05-21 13:13:07 -04:00
Al Viro
359566faef kernel_wait4()/kernel_waitid(): delay copying status to userland
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-05-21 13:11:07 -04:00
Al Viro
ce72a16fa7 wait4(2)/waitid(2): separate copying rusage to userland
New helpers: kernel_waitid() and kernel_wait4().  sys_waitid(),
sys_wait4() and their compat variants switched to those.  Copying
struct rusage to userland is left to syscall itself.  For
compat_sys_wait4() that eliminates the use of set_fs() completely.
For compat_sys_waitid() it's still needed (for siginfo handling);
that will change shortly.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-05-21 13:11:00 -04:00
Al Viro
7e95a22590 move compat wait4 and waitid next to native variants
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-05-21 13:10:51 -04:00
Linus Torvalds
56f410cf45 This fixes a bug caused by not cleaning up the new instance unique triggers
when deleting an instance. It also creates a selftest that triggers that bug.
 
 Fix the delayed optimization happening after kprobes boot up self tests
 being removed by freeing of init memory.
 
 Comment kprobes on why the delay optimization is not a problem for removal
 of modules, to keep other developers from searching that riddle.
 
 Fix another rcu isn't watching in stack trace tracing.
 
 Naveen N. Rao (4):
       ftrace: Simplify glob handling in unregister_ftrace_function_probe_func()
       ftrace/instances: Clear function triggers when removing instances
       selftests/ftrace: Fix bashisms
       selftests/ftrace: Add test to remove instance with active event triggers
 
 Steven Rostedt (1):
       tracing: Move postpone selftests to core from early_initcall
 
 Steven Rostedt (VMware) (3):
       ftrace: Remove #ifdef from code and add clear_ftrace_function_probes() stub
       kprobes: Document how optimized kprobes are removed from module unload
       tracing: Make sure RCU is watching before calling a stack trace
 
 Thomas Gleixner (1):
       tracing/kprobes: Enforce kprobes teardown after testing
 -----BEGIN PGP SIGNATURE-----
 
 iQExBAABCAAbBQJZIQapFBxyb3N0ZWR0QGdvb2RtaXMub3JnAAoJEMm5BfJq2Y3L
 A6MIAKFLb6mQ4flRBXpWd2tD2B4DQpQ0H7SovseZnlH6Q7grU6POY/qbNl9xXiBA
 3NavxqbIYokH8cxEqGAusL7ASUFPXJj6erMM1uc1WRuAzMpIjvgNacOtW5R+c5S9
 ofR1xtKlBo/854J/IP6M3J0WqrK+B7TsS1WYKohe/tFMBpolbnFloHVfMMZlaL58
 CQhCoAhkjJRsta6dJhbo+HoQy03VGyWsfFHtutBpIwsf81Naq4Stpxp7jdZLWhB8
 Di5QdOji9lDayK6Uk7DDZqHxbjC9z6cCS9nVWIGHkE4AMpR3peYtsyCaAOBjVMLV
 2OuhuREfZgKaYVMjUfdeYCayDAY=
 =1gek
 -----END PGP SIGNATURE-----

Merge tag 'trace-v4.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace

Pull tracing fixes from Steven Rostedt:

 - Fix a bug caused by not cleaning up the new instance unique triggers
   when deleting an instance. It also creates a selftest that triggers
   that bug.

 - Fix the delayed optimization happening after kprobes boot up self
   tests being removed by freeing of init memory.

 - Comment kprobes on why the delay optimization is not a problem for
   removal of modules, to keep other developers from searching that
   riddle.

 - Fix another case of rcu not watching in stack trace tracing.

* tag 'trace-v4.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
  tracing: Make sure RCU is watching before calling a stack trace
  kprobes: Document how optimized kprobes are removed from module unload
  selftests/ftrace: Add test to remove instance with active event triggers
  selftests/ftrace: Fix bashisms
  ftrace: Remove #ifdef from code and add clear_ftrace_function_probes() stub
  ftrace/instances: Clear function triggers when removing instances
  ftrace: Simplify glob handling in unregister_ftrace_function_probe_func()
  tracing/kprobes: Enforce kprobes teardown after testing
  tracing: Move postpone selftests to core from early_initcall
2017-05-20 23:39:03 -07:00
Linus Torvalds
894e21642d Merge branch 'for-linus' of git://git.kernel.dk/linux-block
Pull block fixes from Jens Axboe:
 "A small collection of fixes that should go into this cycle.

   - a pull request from Christoph for NVMe, which ended up being
     manually applied to avoid pulling in newer bits in master. Mostly
     fibre channel fixes from James, but also a few fixes from Jon and
     Vijay

   - a pull request from Konrad, with just a single fix for xen-blkback
     from Gustavo.

   - a fuseblk bdi fix from Jan, fixing a regression in this series with
     the dynamic backing devices.

   - a blktrace fix from Shaohua, replacing sscanf() with kstrtoull().

   - a request leak fix for drbd from Lars, fixing a regression in the
     last series with the kref changes. This will go to stable as well"

* 'for-linus' of git://git.kernel.dk/linux-block:
  nvmet: release the sq ref on rdma read errors
  nvmet-fc: remove target cpu scheduling flag
  nvme-fc: stop queues on error detection
  nvme-fc: require target or discovery role for fc-nvme targets
  nvme-fc: correct port role bits
  nvme: unmap CMB and remove sysfs file in reset path
  blktrace: fix integer parse
  fuseblk: Fix warning in super_setup_bdi_name()
  block: xen-blkback: add null check to avoid null pointer dereference
  drbd: fix request leak introduced by locking/atomic, kref: Kill kref_sub()
2017-05-20 16:12:30 -07:00
Shaohua Li
5f3394530f blktrace: fix integer parse
sscanf is a very poor way to parse integer. For example, I input
"discard" for act_mask, it gets 0xd and completely messes up. Using
correct API to do integer parse.

This patch also makes attributes accept any base of integer.

Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-05-19 09:21:15 -06:00
Petr Mladek
719f6a7040 printk: Use the main logbuf in NMI when logbuf_lock is available
The commit 42a0bb3f71 ("printk/nmi: generic solution for safe
printk in NMI") caused that printk stores messages into a temporary
buffer in NMI context.

The buffer is per-CPU and therefore the size is rather limited.
It works quite well for NMI backtraces. But there are longer logs
that might get printed in NMI context, for example, lockdep
warnings, ftrace_dump_on_oops.

The temporary buffer is used to avoid deadlocks caused by
logbuf_lock. Also it is needed to avoid races with the other
temporary buffer that is used when PRINTK_SAFE_CONTEXT is entered.
But the main buffer can be used in NMI if the lock is available
and we did not interrupt PRINTK_SAFE_CONTEXT.

The lock is checked using raw_spin_is_locked(). It might cause
false negatives when the lock is taken on another CPU and
this CPU is in the safe context from other reasons. Note that
the safe context is used also to get console semaphore or when
calling console drivers. For this reason, we do the check in
printk_nmi_enter(). It makes the handling consistent for
the entire NMI handler and avoids reshuffling of the messages.

The patch also defines special printk context that allows
to use printk_deferred() in NMI. Note that we could not flush
the messages to the consoles because console drivers might use
many other internal locks.

The newly created vprintk_deferred() disables the preemption
only around the irq work handling. It is needed there to keep
the consistency between the two per-CPU variables. But there
is no reason to disable preemption around vprintk_emit().

Finally, the patch puts back explicit serialization of the NMI
backtraces from different CPUs. It was removed by the
commit a9edc88093 ("x86/nmi: Perform a safe
NMI stack trace on all CPUs"). It was not needed because
the flushing of the temporary per-CPU buffers was serialized.

Link: http://lkml.kernel.org/r/1493912763-24873-1-git-send-email-pmladek@suse.com
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Russell King <rack+kernel@arm.linux.org.uk>
Cc: Daniel Thompson <daniel.thompson@linaro.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Chris Metcalf <cmetcalf@ezchip.com>
Cc: x86@kernel.org
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-kernel@vger.kernel.org
Suggested-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Acked-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
2017-05-19 14:42:19 +02:00
Steven Rostedt (VMware)
a33d7d94ee tracing: Make sure RCU is watching before calling a stack trace
As stack tracing now requires "rcu watching", force RCU to be watching when
recording a stack trace.

Link: http://lkml.kernel.org/r/20170512172449.879684501@goodmis.org

Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-05-18 23:57:56 -04:00
Linus Torvalds
667f867c93 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:

 1) Don't allow negative TCP reordering values, from Soheil Hassas
    Yeganeh.

 2) Don't overflow while parsing ipv6 header options, from Craig Gallek.

 3) Handle more cleanly the case where an individual route entry during
    a dump will not fit into the allocated netlink SKB, from David
    Ahern.

 4) Add missing CONFIG_INET dependency for mlx5e, from Arnd Bergmann.

 5) Allow neighbour updates to converge more quickly via gratuitous
    ARPs, from Ihar Hrachyshka.

 6) Fix compile error from CONFIG_INET is disabled, from Eric Dumazet.

 7) Fix use after free in x25 protocol init, from Lin Zhang.

 8) Valid VLAN pvid ranges passed into br_validate(), from Tobias
    Jungel.

 9) NULL out address lists in child sockets in SCTP, this is similar to
    the fix we made for inet connection sockets last week. From Eric
    Dumazet.

10) Fix NULL deref in mlxsw driver, from Ido Schimmel.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (27 commits)
  mlxsw: spectrum: Avoid possible NULL pointer dereference
  sh_eth: Do not print an error message for probe deferral
  sh_eth: Use platform device for printing before register_netdev()
  mlxsw: spectrum_router: Fix rif counter freeing routine
  mlxsw: spectrum_dpipe: Fix incorrect entry index
  cxgb4: update latest firmware version supported
  qmi_wwan: add another Lenovo EM74xx device ID
  sctp: do not inherit ipv6_{mc|ac|fl}_list from parent
  udp: make *udp*_queue_rcv_skb() functions static
  bridge: netlink: check vlan_default_pvid range
  net: ethernet: faraday: To support device tree usage.
  net: x25: fix one potential use-after-free issue
  bpf: adjust verifier heuristics
  ipv6: Check ip6_find_1stfragopt() return value properly.
  selftests/bpf: fix broken build due to types.h
  bnxt_en: Check status of firmware DCBX agent before setting DCB_CAP_DCBX_HOST.
  bnxt_en: Call bnxt_dcb_init() after getting firmware DCBX configuration.
  net: fix compile error in skb_orphan_partial()
  ipv6: Prevent overrun when parsing v6 header options
  neighbour: update neigh timestamps iff update is effective
  ...
2017-05-18 11:40:21 -07:00
Linus Torvalds
16d95c4396 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
Pull pid namespace fixes from Eric Biederman:
 "These are two bugs that turn out to have simple fixes that were
  reported during the merge window. Both of these issues have existed
  for a while and it just happens that they both were reported at almost
  the same time"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
  pid_ns: Fix race between setns'ed fork() and zap_pid_ns_processes()
  pid_ns: Sleep in TASK_INTERRUPTIBLE in zap_pid_ns_processes
2017-05-18 10:04:42 -07:00
Jonathan Corbet
6312811be2 Merge remote-tracking branch 'mauro-exp/docbook3' into death-to-docbook
Mauro says:

This patch series convert the remaining DocBooks to ReST.

The first version was originally
send as 3 patch series:

   [PATCH 00/36] Convert DocBook documents to ReST
   [PATCH 0/5] Convert more books to ReST
   [PATCH 00/13] Get rid of DocBook

The lsm book was added as if it were a text file under
Documentation. The plan is to merge it with another file
under Documentation/security, after both this series and
a security Documentation patch series gets merged.

It also adjusts some Sphinx-pedantic errors/warnings on
some kernel-doc markups.

I also added some patches here to add PDF output for all
existing ReST books.
2017-05-18 11:03:08 -06:00
Kees Cook
af777cd1b8 doc: ReSTify credentials.txt
This updates the credentials API documentation to ReST markup and moves
it under the security subsection of kernel API documentation.

Cc: David Howells <dhowells@redhat.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
2017-05-18 10:30:19 -06:00
Daniel Borkmann
3c2ce60bdd bpf: adjust verifier heuristics
Current limits with regards to processing program paths do not
really reflect today's needs anymore due to programs becoming
more complex and verifier smarter, keeping track of more data
such as const ALU operations, alignment tracking, spilling of
PTR_TO_MAP_VALUE_ADJ registers, and other features allowing for
smarter matching of what LLVM generates.

This also comes with the side-effect that we result in fewer
opportunities to prune search states and thus often need to do
more work to prove safety than in the past due to different
register states and stack layout where we mismatch. Generally,
it's quite hard to determine what caused a sudden increase in
complexity, it could be caused by something as trivial as a
single branch somewhere at the beginning of the program where
LLVM assigned a stack slot that is marked differently throughout
other branches and thus causing a mismatch, where verifier
then needs to prove safety for the whole rest of the program.
Subsequently, programs with even less than half the insn size
limit can get rejected. We noticed that while some programs
load fine under pre 4.11, they get rejected due to hitting
limits on more recent kernels. We saw that in the vast majority
of cases (90+%) pruning failed due to register mismatches. In
case of stack mismatches, majority of cases failed due to
different stack slot types (invalid, spill, misc) rather than
differences in spilled registers.

This patch makes pruning more aggressive by also adding markers
that sit at conditional jumps as well. Currently, we only mark
jump targets for pruning. For example in direct packet access,
these are usually error paths where we bail out. We found that
adding these markers, it can reduce number of processed insns
by up to 30%. Another option is to ignore reg->id in probing
PTR_TO_MAP_VALUE_OR_NULL registers, which can help pruning
slightly as well by up to 7% observed complexity reduction as
stand-alone. Meaning, if a previous path with register type
PTR_TO_MAP_VALUE_OR_NULL for map X was found to be safe, then
in the current state a PTR_TO_MAP_VALUE_OR_NULL register for
the same map X must be safe as well. Last but not least the
patch also adds a scheduling point and bumps the current limit
for instructions to be processed to a more adequate value.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-17 22:55:27 -04:00
Steven Rostedt (VMware)
545a028190 kprobes: Document how optimized kprobes are removed from module unload
Thomas discovered a bug where the kprobe trace tests had a race
condition where the kprobe_optimizer called from a delayed work queue
that does the optimizing and "unoptimizing" of a kprobe, can try to
modify the text after it has been freed by the init code.

The kprobe trace selftest is a special case, and Thomas and myself
investigated to see if there's a chance that this could also be a bug
with module unloading, as the code is not obvious to how it handles
this. After adding lots of printks, I figured it out. Thomas suggested
that this should be commented so that others will not have to go
through this exercise again.

Link: http://lkml.kernel.org/r/20170516145835.3827d3aa@gandalf.local.home

Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-05-17 21:55:58 -04:00
Steven Rostedt (VMware)
8a49f3e03c ftrace: Remove #ifdef from code and add clear_ftrace_function_probes() stub
No need to add ugly #ifdefs in the code. Having a standard stub file is much
prettier.

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-05-17 21:53:32 -04:00
Naveen N. Rao
a0e6369e4b ftrace/instances: Clear function triggers when removing instances
If instance directories are deleted while there are registered function
triggers:

  # cd /sys/kernel/debug/tracing/instances
  # mkdir test
  # echo "schedule:enable_event:sched:sched_switch" > test/set_ftrace_filter
  # rmdir test
  Unable to handle kernel paging request for data at address 0x00000008
  Unable to handle kernel paging request for data at address 0x00000008
  Faulting instruction address: 0xc0000000021edde8
  Oops: Kernel access of bad area, sig: 11 [#1]
  SMP NR_CPUS=2048
  NUMA
  pSeries
  Modules linked in: iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT nf_reject_ipv4 xt_tcpudp tun bridge stp llc kvm iptable_filter fuse binfmt_misc pseries_rng rng_core vmx_crypto ib_iser rdma_cm iw_cm ib_cm ib_core libiscsi scsi_transport_iscsi ip_tables x_tables autofs4 btrfs raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c multipath virtio_net virtio_blk virtio_pci crc32c_vpmsum virtio_ring virtio
  CPU: 8 PID: 8694 Comm: rmdir Not tainted 4.11.0-nnr+ #113
  task: c0000000bab52800 task.stack: c0000000baba0000
  NIP: c0000000021edde8 LR: c0000000021f0590 CTR: c000000002119620
  REGS: c0000000baba3870 TRAP: 0300   Not tainted  (4.11.0-nnr+)
  MSR: 8000000000009033 <SF,EE,ME,IR,DR,RI,LE>
    CR: 22002422  XER: 20000000
  CFAR: 00007fffabb725a8 DAR: 0000000000000008 DSISR: 40000000 SOFTE: 0
  GPR00: c00000000220f750 c0000000baba3af0 c000000003157e00 0000000000000000
  GPR04: 0000000000000040 00000000000000eb 0000000000000040 0000000000000000
  GPR08: 0000000000000000 0000000000000113 0000000000000000 c00000000305db98
  GPR12: c000000002119620 c00000000fd42c00 0000000000000000 0000000000000000
  GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
  GPR20: 0000000000000000 0000000000000000 c0000000bab52e90 0000000000000000
  GPR24: 0000000000000000 00000000000000eb 0000000000000040 c0000000baba3bb0
  GPR28: c00000009cb06eb0 c0000000bab52800 c00000009cb06eb0 c0000000baba3bb0
  NIP [c0000000021edde8] ring_buffer_lock_reserve+0x8/0x4e0
  LR [c0000000021f0590] trace_event_buffer_lock_reserve+0xe0/0x1a0
  Call Trace:
  [c0000000baba3af0] [c0000000021f96c8] trace_event_buffer_commit+0x1b8/0x280 (unreliable)
  [c0000000baba3b60] [c00000000220f750] trace_event_buffer_reserve+0x80/0xd0
  [c0000000baba3b90] [c0000000021196b8] trace_event_raw_event_sched_switch+0x98/0x180
  [c0000000baba3c10] [c0000000029d9980] __schedule+0x6e0/0xab0
  [c0000000baba3ce0] [c000000002122230] do_task_dead+0x70/0xc0
  [c0000000baba3d10] [c0000000020ea9c8] do_exit+0x828/0xd00
  [c0000000baba3dd0] [c0000000020eaf70] do_group_exit+0x60/0x100
  [c0000000baba3e10] [c0000000020eb034] SyS_exit_group+0x24/0x30
  [c0000000baba3e30] [c00000000200bcec] system_call+0x38/0x54
  Instruction dump:
  60000000 60420000 7d244b78 7f63db78 4bffaa09 393efff8 793e0020 39200000
  4bfffecc 60420000 3c4c00f7 3842a020 <81230008> 2f890000 409e02f0 a14d0008
  ---[ end trace b917b8985d0e650b ]---
  Unable to handle kernel paging request for data at address 0x00000008
  Faulting instruction address: 0xc0000000021edde8
  Unable to handle kernel paging request for data at address 0x00000008
  Faulting instruction address: 0xc0000000021edde8
  Faulting instruction address: 0xc0000000021edde8

To address this, let's clear all registered function probes before
deleting the ftrace instance.

Link: http://lkml.kernel.org/r/c5f1ca624043690bd94642bb6bffd3f2fc504035.1494956770.git.naveen.n.rao@linux.vnet.ibm.com

Reported-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-05-17 21:52:22 -04:00
Naveen N. Rao
cbab567c3d ftrace: Simplify glob handling in unregister_ftrace_function_probe_func()
Handle a NULL glob properly and simplify the check.

Link: http://lkml.kernel.org/r/5df74d4ffb4721db6d5a22fa08ca031d62ead493.1494956770.git.naveen.n.rao@linux.vnet.ibm.com

Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-05-17 21:51:54 -04:00
Thomas Gleixner
30e7d894c1 tracing/kprobes: Enforce kprobes teardown after testing
Enabling the tracer selftest triggers occasionally the warning in
text_poke(), which warns when the to be modified page is not marked
reserved.

The reason is that the tracer selftest installs kprobes on functions marked
__init for testing. These probes are removed after the tests, but that
removal schedules the delayed kprobes_optimizer work, which will do the
actual text poke. If the work is executed after the init text is freed,
then the warning triggers. The bug can be reproduced reliably when the work
delay is increased.

Flush the optimizer work and wait for the optimizing/unoptimizing lists to
become empty before returning from the kprobes tracer selftest. That
ensures that all operations which were queued due to the probes removal
have completed.

Link: http://lkml.kernel.org/r/20170516094802.76a468bb@gandalf.local.home

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Cc: stable@vger.kernel.org
Fixes: 6274de498 ("kprobes: Support delayed unoptimizing")
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-05-17 21:50:27 -04:00
Steven Rostedt
b9ef0326c0 tracing: Move postpone selftests to core from early_initcall
I hit the following lockdep splat when booting with ftrace selftests
enabled, as well as CONFIG_PREEMPT and LOCKDEP.

 Testing dynamic ftrace ops #1:
 (1 0 1 0 0)
 (1 1 2 0 0)
 (2 1 3 0 169)
 (2 2 4 0 50066)
 ------------[ cut here ]------------
 WARNING: CPU: 0 PID: 13 at kernel/rcu/srcutree.c:202 check_init_srcu_struct+0x60/0x70
 Modules linked in:
 CPU: 0 PID: 13 Comm: rcu_tasks_kthre Not tainted 4.12.0-rc1-test+ #587
 Hardware name: Hewlett-Packard HP Compaq Pro 6300 SFF/339A, BIOS K01 v02.05 05/07/2012
 task: ffff880119628040 task.stack: ffffc900006a4000
 RIP: 0010:check_init_srcu_struct+0x60/0x70
 RSP: 0000:ffffc900006a7d98 EFLAGS: 00010246
 RAX: 0000000000000246 RBX: 0000000000000000 RCX: 0000000000000000
 RDX: ffff880119628040 RSI: 00000000ffffffff RDI: ffffffff81e5fb40
 RBP: ffffc900006a7e20 R08: 00000023b403c000 R09: 0000000000000001
 R10: ffffc900006a7e40 R11: 0000000000000000 R12: ffffffff81e5fb40
 R13: 0000000000000286 R14: ffff880119628040 R15: ffffc900006a7e98
 FS:  0000000000000000(0000) GS:ffff88011ea00000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: ffff88011edff000 CR3: 0000000001e0f000 CR4: 00000000001406f0
 Call Trace:
  ? __synchronize_srcu+0x6e/0x140
  ? lock_acquire+0xdc/0x1d0
  ? ktime_get_mono_fast_ns+0x5d/0xb0
  synchronize_srcu+0x6f/0x110
  ? synchronize_srcu+0x6f/0x110
  rcu_tasks_kthread+0x20a/0x540
  kthread+0x114/0x150
  ? __rcu_read_unlock+0x70/0x70
  ? kthread_create_on_node+0x40/0x40
  ret_from_fork+0x2e/0x40
 Code: f6 83 70 06 00 00 03 49 89 c5 74 0d be 01 00 00 00 48 89 df e8 42 fa ff ff 4c 89 ee 4c 89 e7 e8 b7 42 75 00 5b 41 5c 41 5d 5d c3 <0f> ff eb aa 66 90 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00
 ---[ end trace 5c3f4206ce50f6ac ]---

What happens is that the selftests include a creating of a dynamically
allocated ftrace_ops, which requires the use of synchronize_rcu_tasks()
which uses srcu, and triggers the above warning.

It appears that synchronize_rcu_tasks() is not set up at early_initcall(),
but it is at core_initcall(). By moving the tests down to that location
works out properly.

Link: http://lkml.kernel.org/r/20170517111435.7388c033@gandalf.local.home

Acked-by: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-05-17 21:39:38 -04:00
Waiman Long
33c35aa481 cgroup: Prevent kill_css() from being called more than once
The kill_css() function may be called more than once under the condition
that the css was killed but not physically removed yet followed by the
removal of the cgroup that is hosting the css. This patch prevents any
harmm from being done when that happens.

Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: stable@vger.kernel.org # v4.5+
2017-05-17 16:58:32 -04:00
Frederic Weisbecker
411fe24e6b nohz: Fix collision between tick and other hrtimers, again
This restores commit:

  24b91e360e: ("nohz: Fix collision between tick and other hrtimers")

... which got reverted by commit:

  558e8e27e7: ('Revert "nohz: Fix collision between tick and other hrtimers"')

... due to a regression where CPUs spuriously stopped ticking.

The bug happened when a tick fired too early past its expected expiration:
on IRQ exit the tick was scheduled again to the same deadline but skipped
reprogramming because ts->next_tick still kept in cache the deadline.
This has been fixed now with resetting ts->next_tick from the tick
itself. Extra care has also been taken to prevent from obsolete values
throughout CPU hotplug operations.

When the tick is stopped and an interrupt occurs afterward, we check on
that interrupt exit if the next tick needs to be rescheduled. If it
doesn't need any update, we don't want to do anything.

In order to check if the tick needs an update, we compare it against the
clockevent device deadline. Now that's a problem because the clockevent
device is at a lower level than the tick itself if it is implemented
on top of hrtimer.

Every hrtimer share this clockevent device. So comparing the next tick
deadline against the clockevent device deadline is wrong because the
device may be programmed for another hrtimer whose deadline collides
with the tick. As a result we may end up not reprogramming the tick
accidentally.

In a worst case scenario under full dynticks mode, the tick stops firing
as it is supposed to every 1hz, leaving /proc/stat stalled:

      Task in a full dynticks CPU
      ----------------------------

      * hrtimer A is queued 2 seconds ahead
      * the tick is stopped, scheduled 1 second ahead
      * tick fires 1 second later
      * on tick exit, nohz schedules the tick 1 second ahead but sees
        the clockevent device is already programmed to that deadline,
        fooled by hrtimer A, the tick isn't rescheduled.
      * hrtimer A is cancelled before its deadline
      * tick never fires again until an interrupt happens...

In order to fix this, store the next tick deadline to the tick_sched
local structure and reuse that value later to check whether we need to
reprogram the clock after an interrupt.

On the other hand, ts->sleep_length still wants to know about the next
clock event and not just the tick, so we want to improve the related
comment to avoid confusion.

Reported-and-tested-by: Tim Wright <tim@binbash.co.uk>
Reported-and-tested-by: Pavel Machek <pavel@ucw.cz>
Reported-by: James Hartsock <hartsjc@redhat.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: stable@vger.kernel.org
Link: http://lkml.kernel.org/r/1492783255-5051-2-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-05-17 08:19:47 +02:00
Frederic Weisbecker
ce6cf9a15d nohz: Add hrtimer sanity check
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-05-17 08:19:26 +02:00
Thomas Gleixner
2c4569ca26 genirq: Fix chained interrupt data ordering
irq_set_chained_handler_and_data() sets up the chained interrupt and then
stores the handler data.

That's racy against an immediate interrupt which gets handled before the
store of the handler data happened. The handler will dereference a NULL
pointer and crash.

Cure it by storing handler data before installing the chained handler.

Reported-by: Borislav Petkov <bp@alien8.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
2017-05-16 15:03:26 +02:00
Mauro Carvalho Chehab
c0c6e08505 irq: update genericirq book location
This book got converted from DocBook. Update its references to
point to the current location.

Signed-off-by: Mauro Carvalho Chehab <mchehab@s-opensource.com>
2017-05-16 08:44:21 -03:00
Mauro Carvalho Chehab
7b4ff1adb5 mutex, futex: adjust kernel-doc markups to generate ReST
There are a few issues on some kernel-doc markups that was
causing troubles with kernel-doc output on ReST format:

./kernel/futex.c:492: WARNING: Inline emphasis start-string without end-string.
./kernel/futex.c:1264: WARNING: Block quote ends without a blank line; unexpected unindent.
./kernel/futex.c:1721: WARNING: Block quote ends without a blank line; unexpected unindent.
./kernel/futex.c:2338: WARNING: Block quote ends without a blank line; unexpected unindent.
./kernel/futex.c:2426: WARNING: Block quote ends without a blank line; unexpected unindent.
./kernel/futex.c:2899: WARNING: Block quote ends without a blank line; unexpected unindent.
./kernel/futex.c:2972: WARNING: Block quote ends without a blank line; unexpected unindent.

Fix them.

No functional changes.

Acked-by: Darren Hart (VMware) <dvhart@infradead.org>
Signed-off-by: Mauro Carvalho Chehab <mchehab@s-opensource.com>
2017-05-16 08:43:25 -03:00
Linus Torvalds
a95cfad947 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:

 1) Track alignment in BPF verifier so that legitimate programs won't be
    rejected on !CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS architectures.

 2) Make tail calls work properly in arm64 BPF JIT, from Deniel
    Borkmann.

 3) Make the configuration and semantics Generic XDP make more sense and
    don't allow both generic XDP and a driver specific instance to be
    active at the same time. Also from Daniel.

 4) Don't crash on resume in xen-netfront, from Vitaly Kuznetsov.

 5) Fix use-after-free in VRF driver, from Gao Feng.

 6) Use netdev_alloc_skb_ip_align() to avoid unaligned IP headers in
    qca_spi driver, from Stefan Wahren.

 7) Always run cleanup routines in BPF samples when we get SIGTERM, from
    Andy Gospodarek.

 8) The mdio phy code should bring PHYs out of reset using the shared
    GPIO lines before invoking bus->reset(). From Florian Fainelli.

 9) Some USB descriptor access endian fixes in various drivers from
    Johan Hovold.

10) Handle PAUSE advertisements properly in mlx5 driver, from Gal
    Pressman.

11) Fix reversed test in mlx5e_setup_tc(), from Saeed Mahameed.

12) Cure netdev leak in AF_PACKET when using timestamping via control
    messages. From Douglas Caetano dos Santos.

13) netcp doesn't support HWTSTAMP_FILTER_ALl, reject it. From Miroslav
    Lichvar.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (52 commits)
  ldmvsw: stop the clean timer at beginning of remove
  ldmvsw: unregistering netdev before disable hardware
  net: netcp: fix check of requested timestamping filter
  ipv6: avoid dad-failures for addresses with NODAD
  qed: Fix uninitialized data in aRFS infrastructure
  mdio: mux: fix device_node_continue.cocci warnings
  net/packet: fix missing net_device reference release
  net/mlx4_core: Use min3 to select number of MSI-X vectors
  macvlan: Fix performance issues with vlan tagged packets
  net: stmmac: use correct pointer when printing normal descriptor ring
  net/mlx5: Use underlay QPN from the root name space
  net/mlx5e: IPoIB, Only support regular RQ for now
  net/mlx5e: Fix setup TC ndo
  net/mlx5e: Fix ethtool pause support and advertise reporting
  net/mlx5e: Use the correct pause values for ethtool advertising
  vmxnet3: ensure that adapter is in proper state during force_close
  sfc: revert changes to NIC revision numbers
  net: ch9200: add missing USB-descriptor endianness conversions
  net: irda: irda-usb: fix firmware name on big-endian hosts
  net: dsa: mv88e6xxx: add default case to switch
  ...
2017-05-15 15:50:49 -07:00
Tejun Heo
a9e7f6544b sched/fair: Fix O(nr_cgroups) in load balance path
Currently, rq->leaf_cfs_rq_list is a traversal ordered list of all
live cfs_rqs which have ever been active on the CPU; unfortunately,
this makes update_blocked_averages() O(# total cgroups) which isn't
scalable at all.

This shows up as a small CPU consumption and scheduling latency
increase in the load balancing path in systems with CPU controller
enabled across most cgroups.  In an edge case where temporary cgroups
were leaking, this caused the kernel to consume good several tens of
percents of CPU cycles running update_blocked_averages(), each run
taking multiple millisecs.

This patch fixes the issue by taking empty and fully decayed cfs_rqs
off the rq->leaf_cfs_rq_list.

Signed-off-by: Tejun Heo <tj@kernel.org>
[ Added cfs_rq_is_decayed() ]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Chris Mason <clm@fb.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul Turner <pjt@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20170426004350.GB3222@wtj.duckdns.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-05-15 12:07:44 +02:00
Peter Zijlstra
502ce005ab sched/fair: Use task_groups instead of leaf_cfs_rq_list to walk all cfs_rqs
In order to allow leaf_cfs_rq_list to remove entries switch the
bandwidth hotplug code over to the task_groups list.

Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Chris Mason <clm@fb.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul Turner <pjt@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20170504133122.a6qjlj3hlblbjxux@hirez.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-05-15 10:15:35 +02:00
Peter Zijlstra
ae4df9d6c9 sched/topology: Rename sched_group_cpus()
There's a discrepancy in naming between the sched_domain and
sched_group cpumask accessor. Since we're doing changes, fix it.

  $ git grep sched_group_cpus | wc -l
  28
  $ git grep sched_domain_span | wc -l
  38

Suggests changing sched_group_cpus() into sched_group_span():

  for i  in `git grep -l sched_group_cpus`
  do
    sed -ie 's/sched_group_cpus/sched_group_span/g' $i
  done

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-05-15 10:15:34 +02:00
Peter Zijlstra
e5c14b1fb8 sched/topology: Rename sched_group_mask()
Since sched_group_mask() is now an independent cpumask (it no longer
masks sched_group_cpus()), rename the thing.

Suggested-by: Lauro Ramos Venancio <lvenanci@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-05-15 10:15:33 +02:00
Peter Zijlstra
af218122b1 sched/topology: Simplify sched_group_mask() usage
While writing the comments, it occurred to me that:

  sg_cpus & sg_mask == sg_mask

at least conceptually; the !overlap case sets the all 1s mask. If we
correct that we can simplify things and directly use sg_mask.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-05-15 10:15:33 +02:00
Peter Zijlstra
0c0e776a9b sched/topology: Rewrite get_group()
We want to attain:

  sg_cpus() & sg_mask() == sg_mask()

for this to be so we must initialize sg_mask() to sg_cpus() for the
!overlap case (its currently cpumask_setall()).

Since the code makes my head hurt bad, rewrite it into a simpler form,
inspired by the now fixed overlap code.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-05-15 10:15:32 +02:00
Peter Zijlstra
35a566e6e8 sched/topology: Add a few comments
Try and describe what this code is about..

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-05-15 10:15:31 +02:00
Peter Zijlstra
1676330ecf sched/topology: Fix overlapping sched_group_capacity
When building the overlapping groups we need to attach a consistent
sched_group_capacity structure. That is, all 'identical' sched_group's
should have the _same_ sched_group_capacity.

This can (once again) be demonstrated with a topology like:

  node   0   1   2   3
    0:  10  20  30  20
    1:  20  10  20  30
    2:  30  20  10  20
    3:  20  30  20  10

But we need at least 2 CPUs per node for this to show up, after all,
if there is only one CPU per node, our CPU @i is per definition a
unique CPU that reaches this domain (aka balance-cpu).

Given the above NUMA topo and 2 CPUs per node:

  [] CPU0 attaching sched-domain(s):
  []  domain-0: span=0,4 level=DIE
  []   groups: 0:{ span=0 }, 4:{ span=4 }
  []   domain-1: span=0-1,3-5,7 level=NUMA
  []    groups: 0:{ span=0,4 mask=0,4 cap=2048 }, 1:{ span=1,5 mask=1,5 cap=2048 }, 3:{ span=3,7 mask=3,7 cap=2048 }
  []    domain-2: span=0-7 level=NUMA
  []     groups: 0:{ span=0-1,3-5,7 mask=0,4 cap=6144 }, 2:{ span=1-3,5-7 mask=2,6 cap=6144 }
  [] CPU1 attaching sched-domain(s):
  []  domain-0: span=1,5 level=DIE
  []   groups: 1:{ span=1 }, 5:{ span=5 }
  []   domain-1: span=0-2,4-6 level=NUMA
  []    groups: 1:{ span=1,5 mask=1,5 cap=2048 }, 2:{ span=2,6 mask=2,6 cap=2048 }, 4:{ span=0,4 mask=0,4 cap=2048 }
  []    domain-2: span=0-7 level=NUMA
  []     groups: 1:{ span=0-2,4-6 mask=1,5 cap=6144 }, 3:{ span=0,2-4,6-7 mask=3,7 cap=6144 }

Observe how CPU0-domain1-group0 and CPU1-domain1-group4 are the
'same' but have a different id (0 vs 4).

To fix this, use the group balance CPU to select the SGC. This means
we have to compute the full mask for each CPU and require a second
temporary mask to store the group mask in (it otherwise lives in the
SGC).

The fixed topology looks like:

  [] CPU0 attaching sched-domain(s):
  []  domain-0: span=0,4 level=DIE
  []   groups: 0:{ span=0 }, 4:{ span=4 }
  []   domain-1: span=0-1,3-5,7 level=NUMA
  []    groups: 0:{ span=0,4 mask=0,4 cap=2048 }, 1:{ span=1,5 mask=1,5 cap=2048 }, 3:{ span=3,7 mask=3,7 cap=2048 }
  []    domain-2: span=0-7 level=NUMA
  []     groups: 0:{ span=0-1,3-5,7 mask=0,4 cap=6144 }, 2:{ span=1-3,5-7 mask=2,6 cap=6144 }
  [] CPU1 attaching sched-domain(s):
  []  domain-0: span=1,5 level=DIE
  []   groups: 1:{ span=1 }, 5:{ span=5 }
  []   domain-1: span=0-2,4-6 level=NUMA
  []    groups: 1:{ span=1,5 mask=1,5 cap=2048 }, 2:{ span=2,6 mask=2,6 cap=2048 }, 0:{ span=0,4 mask=0,4 cap=2048 }
  []    domain-2: span=0-7 level=NUMA
  []     groups: 1:{ span=0-2,4-6 mask=1,5 cap=6144 }, 3:{ span=0,2-4,6-7 mask=3,7 cap=6144 }

Debugged-by: Lauro Ramos Venancio <lvenanci@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Fixes: e3589f6c81 ("sched: Allow for overlapping sched_domain spans")
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-05-15 10:15:30 +02:00
Peter Zijlstra
005f874dd2 sched/topology: Add sched_group_capacity debugging
Add sgc::id to easier spot domain construction issues.

Take the opportunity to slightly rework the group printing, because
adding more "(id: %d)" strings makes the entire thing very hard to
read. Also the individual groups are very hard to separate, so add
explicit visual grouping, which allows replacing all the "(%s: %d)"
format things with shorter "%s=%d" variants.

Then fix up some inconsistencies in surrounding prints for domains.

The end result looks like:

  [] CPU0 attaching sched-domain(s):
  []  domain-0: span=0,4 level=DIE
  []   groups: 0:{ span=0 }, 4:{ span=4 }
  []   domain-1: span=0-1,3-5,7 level=NUMA
  []    groups: 0:{ span=0,4 mask=0,4 cap=2048 }, 1:{ span=1,5 mask=1,5 cap=2048 }, 3:{ span=3,7 mask=3,7 cap=2048 }
  []    domain-2: span=0-7 level=NUMA
  []     groups: 0:{ span=0-1,3-5,7 mask=0,4 cap=6144 }, 2:{ span=1-3,5-7 mask=2,6 cap=6144 }

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-05-15 10:15:30 +02:00
Peter Zijlstra
8d5dc5126b sched/topology: Small cleanup
Move the allocation of topology specific cpumasks into the topology
code.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-05-15 10:15:29 +02:00
Peter Zijlstra
73bb059f9b sched/topology: Fix overlapping sched_group_mask
The point of sched_group_mask is to select those CPUs from
sched_group_cpus that can actually arrive at this balance domain.

The current code gets it wrong, as can be readily demonstrated with a
topology like:

  node   0   1   2   3
    0:  10  20  30  20
    1:  20  10  20  30
    2:  30  20  10  20
    3:  20  30  20  10

Where (for example) domain 1 on CPU1 ends up with a mask that includes
CPU0:

  [] CPU1 attaching sched-domain:
  []  domain 0: span 0-2 level NUMA
  []   groups: 1 (mask: 1), 2, 0
  []   domain 1: span 0-3 level NUMA
  []    groups: 0-2 (mask: 0-2) (cpu_capacity: 3072), 0,2-3 (cpu_capacity: 3072)

This causes sched_balance_cpu() to compute the wrong CPU and
consequently should_we_balance() will terminate early resulting in
missed load-balance opportunities.

The fixed topology looks like:

  [] CPU1 attaching sched-domain:
  []  domain 0: span 0-2 level NUMA
  []   groups: 1 (mask: 1), 2, 0
  []   domain 1: span 0-3 level NUMA
  []    groups: 0-2 (mask: 1) (cpu_capacity: 3072), 0,2-3 (cpu_capacity: 3072)

(note: this relies on OVERLAP domains to always have children, this is
 true because the regular topology domains are still here -- this is
 before degenerate trimming)

Debugged-by: Lauro Ramos Venancio <lvenanci@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Cc: stable@vger.kernel.org
Fixes: e3589f6c81 ("sched: Allow for overlapping sched_domain spans")
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-05-15 10:15:28 +02:00
Peter Zijlstra
af85596c74 sched/topology: Remove FORCE_SD_OVERLAP
Its an obsolete debug mechanism and future code wants to rely on
properties this undermines.

Namely, it would be good to assume that SD_OVERLAP domains have
children, but if we build the entire hierarchy with SD_OVERLAP this is
obviously false.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-05-15 10:15:28 +02:00
Lauro Ramos Venancio
c20e1ea4b6 sched/topology: Move comment about asymmetric node setups
Signed-off-by: Lauro Ramos Venancio <lvenanci@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: lwang@redhat.com
Cc: riel@redhat.com
Link: http://lkml.kernel.org/r/1492717903-5195-4-git-send-email-lvenanci@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-05-15 10:15:27 +02:00
Lauro Ramos Venancio
f32d782e31 sched/topology: Optimize build_group_mask()
The group mask is always used in intersection with the group CPUs. So,
when building the group mask, we don't have to care about CPUs that are
not part of the group.

Signed-off-by: Lauro Ramos Venancio <lvenanci@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: lwang@redhat.com
Cc: riel@redhat.com
Link: http://lkml.kernel.org/r/1492717903-5195-2-git-send-email-lvenanci@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-05-15 10:15:26 +02:00
Peter Zijlstra
a420b06303 sched/topology: Verify the first group matches the child domain
We want sched_groups to be sibling child domains (or individual CPUs
when there are no child domains). Furthermore, since the first group
of a domain should include the CPU of that domain, the first group of
each domain should match the child domain.

Verify this is indeed so.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-05-15 10:15:26 +02:00
Peter Zijlstra
b0151c2554 sched/debug: Print the scheduler topology group mask
In order to determine the balance_cpu (for should_we_balance()) we need
the sched_group_mask() for overlapping domains.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-05-15 10:15:25 +02:00
Peter Zijlstra
91eaed0d61 sched/topology: Simplify build_overlap_sched_groups()
Now that the first group will always be the previous domain of this
@cpu this can be simplified.

In fact, writing the code now removed should've been a big clue I was
doing it wrong :/

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-05-15 10:15:24 +02:00
Peter Zijlstra
0372dd2736 sched/topology: Fix building of overlapping sched-groups
When building the overlapping groups, we very obviously should start
with the previous domain of _this_ @cpu, not CPU-0.

This can be readily demonstrated with a topology like:

  node   0   1   2   3
    0:  10  20  30  20
    1:  20  10  20  30
    2:  30  20  10  20
    3:  20  30  20  10

Where (for example) CPU1 ends up generating the following nonsensical groups:

  [] CPU1 attaching sched-domain:
  []  domain 0: span 0-2 level NUMA
  []   groups: 1 2 0
  []   domain 1: span 0-3 level NUMA
  []    groups: 1-3 (cpu_capacity = 3072) 0-1,3 (cpu_capacity = 3072)

Where the fact that domain 1 doesn't include a group with span 0-2 is
the obvious fail.

With patch this looks like:

  [] CPU1 attaching sched-domain:
  []  domain 0: span 0-2 level NUMA
  []   groups: 1 0 2
  []   domain 1: span 0-3 level NUMA
  []    groups: 0-2 (cpu_capacity = 3072) 0,2-3 (cpu_capacity = 3072)

Debugged-by: Lauro Ramos Venancio <lvenanci@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Cc: stable@vger.kernel.org
Fixes: e3589f6c81 ("sched: Allow for overlapping sched_domain spans")
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-05-15 10:15:23 +02:00
Peter Zijlstra
c743f0a5c5 sched/fair, cpumask: Export for_each_cpu_wrap()
More users for for_each_cpu_wrap() have appeared. Promote the construct
to generic cpumask interface.

The implementation is slightly modified to reduce arguments.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Lauro Ramos Venancio <lvenanci@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: lwang@redhat.com
Link: http://lkml.kernel.org/r/20170414122005.o35me2h5nowqkxbv@hirez.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-05-15 10:15:23 +02:00
Lauro Ramos Venancio
8c0334697d sched/topology: Refactor function build_overlap_sched_groups()
Create functions build_group_from_child_sched_domain() and
init_overlap_sched_group(). No functional change.

Signed-off-by: Lauro Ramos Venancio <lvenanci@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1492091769-19879-2-git-send-email-lvenanci@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-05-15 10:15:22 +02:00
Peter Zijlstra
7708d5f04d sched/clock: Print a warning recommending 'tsc=unstable'
With our switch to stable delayed until late_initcall(), the most
likely cause of hitting mark_tsc_unstable() is the watchdog. The
watchdog typically only triggers when creative BIOS'es fiddle with the
TSC to hide SMI latency.

Since the watchdog can only detect TSC fiddling after the fact all TSC
clocks (including userspace GTOD) can already have reported funny
values.

The only way to fully avoid this, is manually marking the TSC unstable
at boot. Suggest people do this on their broken systems.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-05-15 10:15:21 +02:00
Peter Zijlstra
2e44b7ddf8 sched/clock: Use late_initcall() instead of sched_init_smp()
Core2 marks its TSC unstable in ACPI Processor Idle, which is probed
after sched_init_smp(). Luckily it appears both acpi_processor and
intel_idle (which has a similar check) are mandatory built-in.

This means we can delay switching to stable until after these drivers
have ran (if they were modules, this would be impossible).

Delay the stable switch to late_initcall() to allow these drivers to
mark TSC unstable and avoid difficult stable->unstable transitions.

Reported-by: Lofstedt, Marta <marta.lofstedt@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-05-15 10:15:21 +02:00
Peter Zijlstra
f9fccdb9ef cpuidle: Fix idle time tracking
Ville reported that on his Core2, which has TSC stop in idle, we would
always report very short idle durations. He tracked this down to
commit:

  e93e59ce5b ("cpuidle: Replace ktime_get() with local_clock()")

which replaces ktime_get() with local_clock().

Add a sched_clock_idle_wakeup_event() call, which will re-sync the
clock with ktime_get_ns() when TSC is unstable and no-op otherwise.

Reported-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
Tested-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Fixes: e93e59ce5b ("cpuidle: Replace ktime_get() with local_clock()")
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-05-15 10:15:20 +02:00
Peter Zijlstra
3067a33d5f sched/clock: Remove watchdog touching
Commit:

  2bacec8c31 ("sched: touch softlockup watchdog after idling")

introduced the touch_softlockup_watchdog_sched() call without
justification and I feel sched_clock management is not the right
place, it should only be concerned with producing semi coherent time.

If this causes watchdog thingies, we can find a better place.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-05-15 10:15:19 +02:00
Peter Zijlstra
ac1e843f09 sched/clock: Remove unused argument to sched_clock_idle_wakeup_event()
The argument to sched_clock_idle_wakeup_event() has not been used in a
long time. Remove it.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-05-15 10:15:18 +02:00
Peter Zijlstra
b421b22b00 x86/tsc, sched/clock, clocksource: Use clocksource watchdog to provide stable sync points
Currently we keep sched_clock_tick() active for stable TSC in order to
keep the per-CPU state semi up-to-date. The (obvious) problem is that
by the time we detect TSC is borked, our per-CPU state is also borked.

So hook into the clocksource watchdog and call a method after we've
found it to still be stable.

There's the obvious race where the TSC goes wonky between finding it
stable and us running the callback, but closing that is too much work
and not really worth it, since we're already detecting TSC wobbles
after the fact, so we cannot, per definition, fully avoid funny clock
values.

And since the watchdog runs less often than the tick, this is also an
optimization.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-05-15 10:15:18 +02:00
Peter Zijlstra
cf15ca8ded sched/clock: Initialize all per-CPU state before switching (back) to unstable
In preparation for not keeping the sched_clock_tick() active for
stable TSC, we need to explicitly initialize all per-CPU state
before switching back to unstable.

Note: this patch looses the __gtod_offset calculation; it will be
restored in the next one.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-05-15 10:15:17 +02:00
Vincent Guittot
625ed2bf04 sched/cfs: Make util/load_avg more stable
In the current implementation of load/util_avg, we assume that the
ongoing time segment has fully elapsed, and util/load_sum is divided
by LOAD_AVG_MAX, even if part of the time segment still remains to
run. As a consequence, this remaining part is considered as idle time
and generates unexpected variations of util_avg of a busy CPU in the
range [1002..1024[ whereas util_avg should stay at 1023.

In order to keep the metric stable, we should not consider the ongoing
time segment when computing load/util_avg but only the segments that
have already fully elapsed. But to not consider the current time
segment adds unwanted latency in the load/util_avg responsivness
especially when the time is scaled instead of the contribution.

Instead of waiting for the current time segment to have fully elapsed
before accounting it in load/util_avg, we can already account the
elapsed part but change the range used to compute load/util_avg
accordingly.

At the very beginning of a new time segment, the past segments have
been decayed and the max value is LOAD_AVG_MAX*y. At the very end of
the current time segment, the max value becomes:

  LOAD_AVG_MAX*y + 1024(us)  (== LOAD_AVG_MAX)

In fact, the max value is:

  LOAD_AVG_MAX*y + sa->period_contrib

at any time in the time segment.

Taking advantage of the fact that:

  LOAD_AVG_MAX*y == LOAD_AVG_MAX-1024

the range becomes [0..LOAD_AVG_MAX-1024+sa->period_contrib].

As the elapsed part is already accounted in load/util_sum, we update
the max value according to the current position in the time segment
instead of removing its contribution.

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Morten.Rasmussen@arm.com
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bsegall@google.com
Cc: dietmar.eggemann@arm.com
Cc: pjt@google.com
Cc: yuyang.du@intel.com
Link: http://lkml.kernel.org/r/1493188076-2767-1-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-05-15 10:15:13 +02:00
Steven Rostedt (VMware)
8663effb24 sched/core: Call __schedule() from do_idle() without enabling preemption
I finally got around to creating trampolines for dynamically allocated
ftrace_ops with using synchronize_rcu_tasks(). For users of the ftrace
function hook callbacks, like perf, that allocate the ftrace_ops
descriptor via kmalloc() and friends, ftrace was not able to optimize
the functions being traced to use a trampoline because they would also
need to be allocated dynamically. The problem is that they cannot be
freed when CONFIG_PREEMPT is set, as there's no way to tell if a task
was preempted on the trampoline. That was before Paul McKenney
implemented synchronize_rcu_tasks() that would make sure all tasks
(except idle) have scheduled out or have entered user space.

While testing this, I triggered this bug:

 BUG: unable to handle kernel paging request at ffffffffa0230077
 ...
 RIP: 0010:0xffffffffa0230077
 ...
 Call Trace:
  schedule+0x5/0xe0
  schedule_preempt_disabled+0x18/0x30
  do_idle+0x172/0x220

What happened was that the idle task was preempted on the trampoline.
As synchronize_rcu_tasks() ignores the idle thread, there's nothing
that lets ftrace know that the idle task was preempted on a trampoline.

The idle task shouldn't need to ever enable preemption. The idle task
is simply a loop that calls schedule or places the cpu into idle mode.
In fact, having preemption enabled is inefficient, because it can
happen when idle is just about to call schedule anyway, which would
cause schedule to be called twice. Once for when the interrupt came in
and was returning back to normal context, and then again in the normal
path that the idle loop is running in, which would be pointless, as it
had already scheduled.

The only reason schedule_preempt_disable() enables preemption is to be
able to call sched_submit_work(), which requires preemption enabled. As
this is a nop when the task is in the RUNNING state, and idle is always
in the running state, there's no reason that idle needs to enable
preemption. But that means it cannot use schedule_preempt_disable() as
other callers of that function require calling sched_submit_work().

Adding a new function local to kernel/sched/ that allows idle to call
the scheduler without enabling preemption, fixes the
synchronize_rcu_tasks() issue, as well as removes the pointless spurious
schedule calls caused by interrupts happening in the brief window where
preemption is enabled just before it calls schedule.

Reviewed: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20170414084809.3dacde2a@gandalf.local.home
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-05-15 10:09:12 +02:00
Pushkar Jambhlekar
0bae5fd333 PM / hibernate: Declare variables as static
Fixing sparse warnings: 'symbol not declared. Should it be static?'

Signed-off-by: Pushkar Jambhlekar <pushkar.iit@gmail.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2017-05-14 13:33:24 +02:00
Kirill Tkhai
3fd3722621 pid_ns: Fix race between setns'ed fork() and zap_pid_ns_processes()
Imagine we have a pid namespace and a task from its parent's pid_ns,
which made setns() to the pid namespace. The task is doing fork(),
while the pid namespace's child reaper is dying. We have the race
between them:

Task from parent pid_ns             Child reaper
copy_process()                      ..
  alloc_pid()                       ..
  ..                                zap_pid_ns_processes()
  ..                                  disable_pid_allocation()
  ..                                  read_lock(&tasklist_lock)
  ..                                  iterate over pids in pid_ns
  ..                                    kill tasks linked to pids
  ..                                  read_unlock(&tasklist_lock)
  write_lock_irq(&tasklist_lock);   ..
  attach_pid(p, PIDTYPE_PID);       ..
  ..                                ..

So, just created task p won't receive SIGKILL signal,
and the pid namespace will be in contradictory state.
Only manual kill will help there, but does the userspace
care about this? I suppose, the most users just inject
a task into a pid namespace and wait a SIGCHLD from it.

The patch fixes the problem. It simply checks for
(pid_ns->nr_hashed & PIDNS_HASH_ADDING) in copy_process().
We do it under the tasklist_lock, and can't skip
PIDNS_HASH_ADDING as noted by Oleg:

"zap_pid_ns_processes() does disable_pid_allocation()
and then takes tasklist_lock to kill the whole namespace.
Given that copy_process() checks PIDNS_HASH_ADDING
under write_lock(tasklist) they can't race;
if copy_process() takes this lock first, the new child will
be killed, otherwise copy_process() can't miss
the change in ->nr_hashed."

If allocation is disabled, we just return -ENOMEM
like it's made for such cases in alloc_pid().

v2: Do not move disable_pid_allocation(), do not
introduce a new variable in copy_process() and simplify
the patch as suggested by Oleg Nesterov.
Account the problem with double irq enabling
found by Eric W. Biederman.

Fixes: c876ad7682 ("pidns: Stop pid allocation when init dies")
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Ingo Molnar <mingo@kernel.org>
CC: Peter Zijlstra <peterz@infradead.org>
CC: Oleg Nesterov <oleg@redhat.com>
CC: Mike Rapoport <rppt@linux.vnet.ibm.com>
CC: Michal Hocko <mhocko@suse.com>
CC: Andy Lutomirski <luto@kernel.org>
CC: "Eric W. Biederman" <ebiederm@xmission.com>
CC: Andrei Vagin <avagin@openvz.org>
CC: Cyrill Gorcunov <gorcunov@openvz.org>
CC: Serge Hallyn <serge@hallyn.com>
Cc: stable@vger.kernel.org
Acked-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2017-05-13 17:26:02 -05:00
Eric W. Biederman
b9a985db98 pid_ns: Sleep in TASK_INTERRUPTIBLE in zap_pid_ns_processes
The code can potentially sleep for an indefinite amount of time in
zap_pid_ns_processes triggering the hung task timeout, and increasing
the system average.  This is undesirable.  Sleep with a task state of
TASK_INTERRUPTIBLE instead of TASK_UNINTERRUPTIBLE to remove these
undesirable side effects.

Apparently under heavy load this has been allowing Chrome to trigger
the hung time task timeout error and cause ChromeOS to reboot.

Reported-by: Vovo Yang <vovoy@google.com>
Reported-by: Guenter Roeck <linux@roeck-us.net>
Tested-by: Guenter Roeck <linux@roeck-us.net>
Fixes: 6347e90091 ("pidns: guarantee that the pidns init will be the last pidns process reaped")
Cc: stable@vger.kernel.org
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2017-05-13 17:26:01 -05:00
Martin Liska
0538421343 gcov: support GCC 7.1
Starting from GCC 7.1, __gcov_exit is a new symbol expected to be
implemented in a profiling runtime.

[akpm@linux-foundation.org: coding-style fixes]
[mliska@suse.cz: v2]
  Link: http://lkml.kernel.org/r/e63a3c59-0149-c97e-4084-20ca8f146b26@suse.cz
Link: http://lkml.kernel.org/r/8c4084fa-3885-29fe-5fc4-0d4ca199c785@suse.cz
Signed-off-by: Martin Liska <mliska@suse.cz>
Acked-by: Peter Oberparleiter <oberpar@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-12 15:57:15 -07:00
Deepa Dinamani
572e0ca9b9 time: delete current_fs_time()
All uses of the current_fs_time() function have been replaced by other
time interfaces.

And, its use cases can be fulfilled by current_time() or ktime_get_*
variants.

Link: http://lkml.kernel.org/r/1491613030-11599-13-git-send-email-deepa.kernel@gmail.com
Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-12 15:57:15 -07:00
Linus Torvalds
e0c4a5fc75 Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf updates/fixes from Ingo Molnar:
 "Mostly tooling updates, but also two kernel fixes: a call chain
  handling robustness fix and an x86 PMU driver event definition fix"

* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf/callchain: Force USER_DS when invoking perf_callchain_user()
  tools build: Fixup sched_getcpu feature test
  perf tests kmod-path: Don't fail if compressed modules aren't supported
  perf annotate: Fix AArch64 comment char
  perf tools: Fix spelling mistakes
  perf/x86: Fix Broadwell-EP DRAM RAPL events
  perf config: Refactor a duplicated code for obtaining config file name
  perf symbols: Allow user probes on versioned symbols
  perf symbols: Accept symbols starting at address 0
  tools lib string: Adopt prefixcmp() from perf and subcmd
  perf units: Move parse_tag_value() to units.[ch]
  perf ui gtk: Move gtk .so name to the only place where it is used
  perf tools: Move HAS_BOOL define to where perl headers are used
  perf memswap: Split the byteswap memory range wrappers from util.[ch]
  perf tools: Move event prototypes from util.h to event.h
  perf buildid: Move prototypes from util.h to build-id.h
2017-05-12 10:45:36 -07:00
Linus Torvalds
1b84fc1503 Merge branch 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull stackprotector fixlet from Ingo Molnar:
 "A single fix/enhancement to increase stackprotector canary randomness
  on 64-bit kernels with very little cost"

* 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  stackprotector: Increase the per-task stack canary's random range from 32 bits to 64 bits on 64-bit platforms
2017-05-12 10:41:45 -07:00
David S. Miller
6832a333ed bpf: Handle multiple variable additions into packet pointers in verifier.
We must accumulate into reg->aux_off rather than use a plain assignment.

Add a test for this situation to test_align.

Reported-by: Alexei Starovoitov <ast@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-11 19:48:58 -07:00
David S. Miller
e07b98d9bf bpf: Add strict alignment flag for BPF_PROG_LOAD.
Add a new field, "prog_flags", and an initial flag value
BPF_F_STRICT_ALIGNMENT.

When set, the verifier will enforce strict pointer alignment
regardless of the setting of CONFIG_EFFICIENT_UNALIGNED_ACCESS.

The verifier, in this mode, will also use a fixed value of "2" in
place of NET_IP_ALIGN.

This facilitates test cases that will exercise and validate this part
of the verifier even when run on architectures where alignment doesn't
matter.

Signed-off-by: David S. Miller <davem@davemloft.net>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
2017-05-11 14:19:00 -04:00
David S. Miller
c5fc9692d1 bpf: Do per-instruction state dumping in verifier when log_level > 1.
If log_level > 1, do a state dump every instruction and emit it in
a more compact way (without a leading newline).

This will facilitate more sophisticated test cases which inspect the
verifier log for register state.

Signed-off-by: David S. Miller <davem@davemloft.net>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
2017-05-11 14:19:00 -04:00
David S. Miller
d117441674 bpf: Track alignment of register values in the verifier.
Currently if we add only constant values to pointers we can fully
validate the alignment, and properly check if we need to reject the
program on !CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS architectures.

However, once an unknown value is introduced we only allow byte sized
memory accesses which is too restrictive.

Add logic to track the known minimum alignment of register values,
and propagate this state into registers containing pointers.

The most common paradigm that makes use of this new logic is computing
the transport header using the IP header length field.  For example:

	struct ethhdr *ep = skb->data;
	struct iphdr *iph = (struct iphdr *) (ep + 1);
	struct tcphdr *th;
 ...
	n = iph->ihl;
	th = ((void *)iph + (n * 4));
	port = th->dest;

The existing code will reject the load of th->dest because it cannot
validate that the alignment is at least 2 once "n * 4" is added the
the packet pointer.

In the new code, the register holding "n * 4" will have a reg->min_align
value of 4, because any value multiplied by 4 will be at least 4 byte
aligned.  (actually, the eBPF code emitted by the compiler in this case
is most likely to use a shift left by 2, but the end result is identical)

At the critical addition:

	th = ((void *)iph + (n * 4));

The register holding 'th' will start with reg->off value of 14.  The
pointer addition will transform that reg into something that looks like:

	reg->aux_off = 14
	reg->aux_off_align = 4

Next, the verifier will look at the th->dest load, and it will see
a load offset of 2, and first check:

	if (reg->aux_off_align % size)

which will pass because aux_off_align is 4.  reg_off will be computed:

	reg_off = reg->off;
 ...
		reg_off += reg->aux_off;

plus we have off==2, and it will thus check:

	if ((NET_IP_ALIGN + reg_off + off) % size != 0)

which evaluates to:

	if ((NET_IP_ALIGN + 14 + 2) % size != 0)

On strict alignment architectures, NET_IP_ALIGN is 2, thus:

	if ((2 + 14 + 2) % size != 0)

which passes.

These pointer transformations and checks work regardless of whether
the constant offset or the variable with known alignment is added
first to the pointer register.

Signed-off-by: David S. Miller <davem@davemloft.net>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
2017-05-11 14:19:00 -04:00
Linus Torvalds
29250d301b This is a trivial patch that changes a check for a cpumask from a NULL
pointer to using cpumask_available(), which will do the check. This is
 because cpumasks when not allocated are always set, and clang complains
 about it.
 -----BEGIN PGP SIGNATURE-----
 
 iQExBAABCAAbBQJZEcUIFBxyb3N0ZWR0QGdvb2RtaXMub3JnAAoJEMm5BfJq2Y3L
 zygH/051hNj3aZlSZCD1pXwPkHgeVBn7lSB9k6hcJ5J/OknL/hNXws3Dv4Lb7Dzj
 cZhg62LTwwS6PVCJtOHk+PE/c5FIdY9o1mXJpAst6wbl9Sp1lzPJbFum45UadvWn
 UyU3RP0ncSgfojyrwIu6XyND7/NatdYk9irTMWL9+cDuy9xGvJgRX1sf7tXmxj4C
 AbZzQorDw7XDczDbvFM1XyPU3ApGUDqQ7VhCEBP6ivE+5Ceoo9xi/z7yfKyjLeb+
 H7+/eA8ztaMLgTzLWwkFKdP/knqwPmAb+MHTR0DoLHcVe7fbbxFS7x+cfR8mfIA9
 8tA5SUxc7bymRvDAcN2dMrtL7f8=
 =3hKI
 -----END PGP SIGNATURE-----

Merge tag 'trace-v4.12-3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace

Pull tracing fix from Steven Rostedt:
 "This is a trivial patch that changes a check for a cpumask from a NULL
  pointer to using cpumask_available(), which will do the check. This is
  because cpumasks when not allocated are always set, and clang
  complains about it"

* tag 'trace-v4.12-3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
  tracing: Use cpumask_available() to check if cpumask variable may be used
2017-05-10 11:26:25 -07:00
Linus Torvalds
de4d195308 Merge branch 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull RCU updates from Ingo Molnar:
 "The main changes are:

   - Debloat RCU headers

   - Parallelize SRCU callback handling (plus overlapping patches)

   - Improve the performance of Tree SRCU on a CPU-hotplug stress test

   - Documentation updates

   - Miscellaneous fixes"

* 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (74 commits)
  rcu: Open-code the rcu_cblist_n_lazy_cbs() function
  rcu: Open-code the rcu_cblist_n_cbs() function
  rcu: Open-code the rcu_cblist_empty() function
  rcu: Separately compile large rcu_segcblist functions
  srcu: Debloat the <linux/rcu_segcblist.h> header
  srcu: Adjust default auto-expediting holdoff
  srcu: Specify auto-expedite holdoff time
  srcu: Expedite first synchronize_srcu() when idle
  srcu: Expedited grace periods with reduced memory contention
  srcu: Make rcutorture writer stalls print SRCU GP state
  srcu: Exact tracking of srcu_data structures containing callbacks
  srcu: Make SRCU be built by default
  srcu: Fix Kconfig botch when SRCU not selected
  rcu: Make non-preemptive schedule be Tasks RCU quiescent state
  srcu: Expedite srcu_schedule_cbs_snp() callback invocation
  srcu: Parallelize callback handling
  kvm: Move srcu_struct fields to end of struct kvm
  rcu: Fix typo in PER_RCU_NODE_PERIOD header comment
  rcu: Use true/false in assignment to bool
  rcu: Use bool value directly
  ...
2017-05-10 10:30:46 -07:00
Linus Torvalds
2e4ab937ec More power management updates for v4.12-rc1
- Add Intel Gemini Lake CPU IDs to the intel_idle and intel_rapl
    drivers (David Box).
 
  - Add a NULL pointer check to the cpuidle core to prevent it from
    crashing on platforms with incomplete cpuidle configuration (Fei
    Li).
 
  - Fix DT-related documentation in the generic power domains (genpd)
    framework and add a MAINTAINERS entry for DT-related material in
    genpd (Viresh Kumar).
 
  - Update the system suspend/resume infrastructure to improve the
    handling of aborts of suspend transitions in progress in the
    wakeup framework and rework the suspend-to-idle core loop to make
    it possible to filter out spurious wakeup events (specifically the
    ones coming from ACPI) without resuming all the way up to user
    space every time (Rafael Wysocki).
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQIcBAABCAAGBQJZEkVZAAoJEILEb/54YlRxypUP/RFSaiK7j/l9oBWfzzZYVqli
 V5WtbaG04/2thHT7kycaxQLxMknT6MrhlHsGw9hM3VtLHHJBoBCYNybl8rTJlWdp
 4eTe5wzYR+depKJcqo2DC4kiKDkghehG6QXG3dr40neV+hc+C54sssYddUXeK3Wm
 nkAGqMEc/xjkul+DzmTy0KQd60cp9vWKedPkj1knrUpqkHa7fNj7YH3nX/OIXn+X
 YMpj6hro5O2Zi3l2bz0M7x0aTqAynAOC+7vXTm+S4GDfTX/aPVpYonEt82N/NSGP
 0esiLQDdX/52XAMwuIxyE092R4zatGMYTPLkG4SvZnkpdcnhUw5uCCJEUtBJYaKQ
 5h9KaJ244XW1EPCrw/dzOnt4ysC7HXJBcqYSiysPlF+GB2vO8WxcOk0Duvzlyifp
 9TPNTo+8h/uX5dCPDUfd4Usv7+yvDxhLJ635XAOsWlhhg8rm3RFYCUaG+snE7/Q0
 GJVyNvywvxHu1CHC+bNeQ9OcMwzeiqn5CgBx2g2UjLlt9H8GS/8jsgQXnug1XPRC
 wBVCYCQm9+JoKvKoCulov/KGAKQEXkJEDNZ0Ulc0vI0x1CWfEI1E07yHLEE8k+Ej
 hcheMCFXlBgiIX6AeMK1asHNTcw4L/xo4gnrS9MS7DEOwZRqifI4CwLfU7iqcFdF
 JBEIg6CNkc90Cszj6Xtx
 =K1Q+
 -----END PGP SIGNATURE-----

Merge tag 'pm-extra-4.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull more power management updates from Rafael Wysocki:
 "These add new CPU IDs to a couple of drivers, fix a possible NULL
  pointer dereference in the cpuidle core, update DT-related things in
  the generic power domains framework and finally update the
  suspend/resume infrastructure to improve the handling of wakeups from
  suspend-to-idle.

  Specifics:

   - Add Intel Gemini Lake CPU IDs to the intel_idle and intel_rapl
     drivers (David Box).

   - Add a NULL pointer check to the cpuidle core to prevent it from
     crashing on platforms with incomplete cpuidle configuration (Fei
     Li).

   - Fix DT-related documentation in the generic power domains (genpd)
     framework and add a MAINTAINERS entry for DT-related material in
     genpd (Viresh Kumar).

   - Update the system suspend/resume infrastructure to improve the
     handling of aborts of suspend transitions in progress in the wakeup
     framework and rework the suspend-to-idle core loop to make it
     possible to filter out spurious wakeup events (specifically the
     ones coming from ACPI) without resuming all the way up to user
     space every time (Rafael Wysocki)"

* tag 'pm-extra-4.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  ACPI / sleep: Ignore spurious SCI wakeups from suspend-to-idle
  PM / wakeup: Integrate mechanism to abort transitions in progress
  x86/intel_idle: add Gemini Lake support
  cpuidle: check dev before usage in cpuidle_use_deepest_state()
  powercap: intel_rapl: Add support for Gemini Lake
  PM / Domains: Add DT file to MAINTAINERS
  PM / Domains: Fix DT example
2017-05-10 09:12:30 -07:00
Will Deacon
88b0193d94 perf/callchain: Force USER_DS when invoking perf_callchain_user()
Perf can generate and record a user callchain in response to a synchronous
request, such as a tracepoint firing. If this happens under set_fs(KERNEL_DS),
then we can end up walking the user stack (and dereferencing/saving whatever we
find there) without the protections usually afforded by checks such as
access_ok.

Rather than play whack-a-mole with each architecture's stack unwinding
implementation, fix the root of the problem by ensuring that we force USER_DS
when invoking perf_callchain_user from the perf core.

Reported-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Will Deacon <will.deacon@arm.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-05-10 07:54:00 +02:00
Linus Torvalds
50fb55d88c Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:

 1) Fix multiqueue in stmmac driver on PCI, from Andy Shevchenko.

 2) cdc_ncm doesn't actually fully zero out the padding area is
    allocates on TX, from Jim Baxter.

 3) Don't leak map addresses in BPF verifier, from Daniel Borkmann.

 4) If we randomize TCP timestamps, we have to do it everywhere
    including SYN cookies. From Eric Dumazet.

 5) Fix "ethtool -S" crash in aquantia driver, from Pavel Belous.

 6) Fix allocation size for ntp filter bitmap in bnxt_en driver, from
    Dan Carpenter.

 7) Add missing memory allocation return value check to DSA loop driver,
    from Christophe Jaillet.

 8) Fix XDP leak on driver unload in qed driver, from Suddarsana Reddy
    Kalluru.

 9) Don't inherit MC list from parent inet connection sockets, another
    syzkaller spotted gem. Fix from Eric Dumazet.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (43 commits)
  dccp/tcp: do not inherit mc_list from parent
  qede: Split PF/VF ndos.
  qed: Correct doorbell configuration for !4Kb pages
  qed: Tell QM the number of tasks
  qed: Fix VF removal sequence
  qede: Fix XDP memory leak on unload
  net/mlx4_core: Reduce harmless SRIOV error message to debug level
  net/mlx4_en: Avoid adding steering rules with invalid ring
  net/mlx4_en: Change the error print to debug print
  drivers: net: wimax: i2400m: i2400m-usb: Use time_after for time comparison
  DECnet: Use container_of() for embedded struct
  Revert "ipv4: restore rt->fi for reference counting"
  net: mdio-mux: bcm-iproc: call mdiobus_free() in error path
  net: ethernet: ti: cpsw: adjust cpsw fifos depth for fullduplex flow control
  ipv6: reorder ip6_route_dev_notifier after ipv6_dev_notf
  net: cdc_ncm: Fix TX zero padding
  stmmac: pci: split out common_default_data() helper
  stmmac: pci: RX queue routing configuration
  stmmac: pci: TX and RX queue priority configuration
  stmmac: pci: set default number of rx and tx queues
  ...
2017-05-09 15:42:31 -07:00
Rafael J. Wysocki
80449d89d4 Merge branches 'pm-domains', 'pm-cpuidle', 'pm-sleep' and 'powercap'
* pm-domains:
  PM / Domains: Add DT file to MAINTAINERS
  PM / Domains: Fix DT example

* pm-cpuidle:
  x86/intel_idle: add Gemini Lake support
  cpuidle: check dev before usage in cpuidle_use_deepest_state()

* pm-sleep:
  ACPI / sleep: Ignore spurious SCI wakeups from suspend-to-idle
  PM / wakeup: Integrate mechanism to abort transitions in progress

* powercap:
  powercap: intel_rapl: Add support for Gemini Lake
2017-05-09 23:21:46 +02:00