Commit Graph

662301 Commits

Author SHA1 Message Date
Paul E. McKenney
a278d47189 rcu: Fix dyntick-idle tracing
The tracing subsystem started using rcu_irq_entry() and rcu_irq_exit()
(with my blessing) to allow the current _rcuidle alternative tracepoint
name to be dispensed with while still maintaining good performance.
Unfortunately, this causes RCU's dyntick-idle entry code's tracing to
appear to RCU like an interrupt that occurs where RCU is not designed
to handle interrupts.

This commit fixes this problem by moving the zeroing of ->dynticks_nesting
after the offending trace_rcu_dyntick() statement, which narrows the
window of vulnerability to a pair of adjacent statements that are now
marked with comments to that effect.

Link: http://lkml.kernel.org/r/20170405093207.404f8deb@gandalf.local.home
Link: http://lkml.kernel.org/r/20170405193928.GM1600@linux.vnet.ibm.com

Reported-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-04-10 15:21:57 -04:00
Steven Rostedt (VMware)
8aaf1ee70e tracing: Rename trace_active to disable_stack_tracer and inline its modification
In order to eliminate a function call, make "trace_active" into
"disable_stack_tracer" and convert stack_tracer_disable() and friends into
static inline functions.

Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-04-10 15:21:47 -04:00
Steven Rostedt (VMware)
5367278cb7 tracing: Add stack_tracer_disable/enable() functions
There are certain parts of the kernel that cannot let stack tracing
proceed (namely in RCU), because the stack tracer uses RCU, and parts of RCU
internals cannot handle having RCU read side locks taken.

Add stack_tracer_disable() and stack_tracer_enable() functions to let RCU
stop stack tracing on the current CPU when it is in those critical sections.

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-04-10 14:34:10 -04:00
Steven Rostedt (VMware)
252babcd52 tracing: Replace the per_cpu() with __this_cpu*() in trace_stack.c
The updates to the trace_active per cpu variable can be updated with the
__this_cpu_*() functions as it only gets updated on the CPU that the variable
is on.

Thanks to Paul McKenney for suggesting __this_cpu_* instead of this_cpu_*.

Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-04-10 14:33:54 -04:00
Steven Rostedt (VMware)
0598e4f08e ftrace: Add use of synchronize_rcu_tasks() with dynamic trampolines
The function tracer needs to be more careful than other subsystems when it
comes to freeing data. Especially if that data is actually executable code.
When a single function is traced, a trampoline can be dynamically allocated
which is called to jump to the function trace callback. When the callback is
no longer needed, the dynamic allocated trampoline needs to be freed. This
is where the issues arise. The dynamically allocated trampoline must not be
used again. As function tracing can trace all subsystems, including
subsystems that are used to serialize aspects of freeing (namely RCU), it
must take extra care when doing the freeing.

Before synchronize_rcu_tasks() was around, there was no way for the function
tracer to know that nothing was using the dynamically allocated trampoline
when CONFIG_PREEMPT was enabled. That's because a task could be indefinitely
preempted while sitting on the trampoline. Now with synchronize_rcu_tasks(),
it will wait till all tasks have either voluntarily scheduled (not on the
trampoline) or goes into userspace (not on the trampoline). Then it is safe
to free the trampoline even with CONFIG_PREEMPT set.

Acked-by: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-04-07 09:41:51 -04:00
Alban Crequy
696ced4fb1 tracing/kprobes: expose maxactive for kretprobe in kprobe_events
When a kretprobe is installed on a kernel function, there is a maximum
limit of how many calls in parallel it can catch (aka "maxactive"). A
kernel module could call register_kretprobe() and initialize maxactive
(see example in samples/kprobes/kretprobe_example.c).

But that is not exposed to userspace and it is currently not possible to
choose maxactive when writing to /sys/kernel/debug/tracing/kprobe_events

The default maxactive can be as low as 1 on single-core with a
non-preemptive kernel. This is too low and we need to increase it not
only for recursive functions, but for functions that sleep or resched.

This patch updates the format of the command that can be written to
kprobe_events so that maxactive can be optionally specified.

I need this for a bpf program attached to the kretprobe of
inet_csk_accept, which can sleep for a long time.

This patch includes a basic selftest:

> # ./ftracetest -v  test.d/kprobe/
> === Ftrace unit tests ===
> [1] Kprobe dynamic event - adding and removing	[PASS]
> [2] Kprobe dynamic event - busy event check	[PASS]
> [3] Kprobe dynamic event with arguments	[PASS]
> [4] Kprobes event arguments with types	[PASS]
> [5] Kprobe dynamic event with function tracer	[PASS]
> [6] Kretprobe dynamic event with arguments	[PASS]
> [7] Kretprobe dynamic event with maxactive	[PASS]
>
> # of passed:  7
> # of failed:  0
> # of unresolved:  0
> # of untested:  0
> # of unsupported:  0
> # of xfailed:  0
> # of undefined(test bug):  0

BugLink: https://github.com/iovisor/bcc/issues/1072
Link: http://lkml.kernel.org/r/1491215782-15490-1-git-send-email-alban@kinvolk.io

Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Alban Crequy <alban@kinvolk.io>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-04-04 10:32:03 -04:00
Steven Rostedt (VMware)
b80f0f6c9e ftrace: Have init/main.c call ftrace directly to free init memory
Relying on free_reserved_area() to call ftrace to free init memory proved to
not be sufficient. The issue is that on x86, when debug_pagealloc is
enabled, the init memory is not freed, but simply set as not present. Since
ftrace was uninformed of this, starting function tracing still tries to
update pages that are not present according to the page tables, causing
ftrace to bug, as well as killing the kernel itself.

Instead of relying on free_reserved_area(), have init/main.c call ftrace
directly just before it frees the init memory. Then it needs to use
__init_begin and __init_end to know where the init memory location is.
Looking at all archs (and testing what I can), it appears that this should
work for each of them.

Reported-by: kernel test robot <xiaolong.ye@intel.com>
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-04-03 14:04:00 -04:00
Steven Rostedt (VMware)
5bd84629a7 ftrace: Create separate t_func_next() to simplify the function / hash logic
I noticed that if I use dd to read the set_ftrace_filter file that the first
hash command is repeated.

 # cd /sys/kernel/debug/tracing
 # echo schedule > set_ftrace_filter
 # echo do_IRQ >> set_ftrace_filter
 # echo schedule:traceoff >> set_ftrace_filter
 # echo do_IRQ:traceoff >> set_ftrace_filter

 # cat set_ftrace_filter
 schedule
 do_IRQ
 schedule:traceoff:unlimited
 do_IRQ:traceoff:unlimited

 # dd if=set_ftrace_filter bs=1
 schedule
 do_IRQ
 schedule:traceoff:unlimited
 schedule:traceoff:unlimited
 do_IRQ:traceoff:unlimited
 98+0 records in
 98+0 records out
 98 bytes copied, 0.00265011 s, 37.0 kB/s

This is due to the way t_start() calls t_next() as well as the seq_file
calls t_next() and the state is slightly different between the two. Namely,
t_start() will call t_next() with a local "pos" variable.

By separating out the function listing from t_next() into its own function,
we can have better control of outputting the functions and the hash of
triggers. This simplifies the code.

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-03-31 18:00:45 -04:00
Steven Rostedt (VMware)
43ff926a0c ftrace: Update func_pos in t_start() when all functions are enabled
If all functions are enabled, there's a comment displayed in the file to
denote that:

  # cd /sys/kernel/debug/tracing
  # cat set_ftrace_filter
 #### all functions enabled ####

If a function trigger is set, those are displayed as well:

  # echo schedule:traceoff >> /debug/tracing/set_ftrace_filter
  # cat set_ftrace_filter
 #### all functions enabled ####
 schedule:traceoff:unlimited

But if you read that file with dd, the output can change:

  # dd if=/debug/tracing/set_ftrace_filter bs=1
 #### all functions enabled ####
 32+0 records in
 32+0 records out
 32 bytes copied, 7.0237e-05 s, 456 kB/s

This is because the "pos" variable is updated for the comment, but func_pos
is not. "func_pos" is used by the triggers (or hashes) to know how many
functions were printed and it bases its index from the pos - func_pos.
func_pos should be 1 to count for the comment printed. But since it is not,
t_hash_start() thinks that one trigger was already printed.

The cat gets to t_hash_start() via t_next() and not t_start() which updates
both pos and func_pos.

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-03-31 18:00:37 -04:00
Steven Rostedt (VMware)
2d71d98900 ftrace: Return NULL at end of t_start() instead of calling t_hash_start()
The loop in t_start() of calling t_next() will call t_hash_start() if the
pos is beyond the functions and enters the hash items. There's no reason to
check if p is NULL and call t_hash_start(), as that would be redundant.

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-03-31 18:00:37 -04:00
Steven Rostedt (VMware)
c20489dad1 ftrace: Assign iter->hash to filter or notrace hashes on seq read
Instead of testing if the hash to use is the filter_hash or the notrace_hash
at each iteration, do the test at open, and set the iter->hash to point to
the corresponding filter or notrace hash. Then use that directly instead of
testing which hash needs to be used each iteration.

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-03-31 18:00:36 -04:00
Steven Rostedt (VMware)
c1bc5919f6 ftrace: Clean up __seq_open_private() return check
The return status check of __seq_open_private() is rather strange:

	iter = __seq_open_private();
	if (iter) {
		/* do stuff */
	}

	return iter ? 0 : -ENOMEM;

It makes much more sense to do the return of failure right away:

	iter = __seq_open_private();
	if (!iter)
		return -ENOMEM;

	/* do stuff */

	return 0;

This clean up will make updates to this code a bit nicer.

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-03-31 18:00:35 -04:00
Steven Rostedt (VMware)
2b87965a1e ftrace/x86: Do no run CPU sync when there is only one CPU online
Moving enabling of function tracing to early boot, even before scheduling is
enabled, means that it is not safe to enable interrupts. When function
tracing was enabled at boot up, it use to happen after scheduling and the
other CPUs were brought up. That required running a sync across all CPUs
when modifying the function hook locations in the code. To do the
synchronization, interrupts had to be enabled. Now function tracing can be
started before the other CPUs are brought up, and enabling interrupts in
that case is dangerous. As only tho boot CPU is active, there is no reason
to run the synchronization. If the online CPU count is one, do not bother
doing the synchronization. This removes the need to enable interrupts.

Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-03-28 12:35:16 -04:00
Steven Rostedt (VMware)
af0009fc16 tracing: Move trace_handle_return() out of line
Currently trace_handle_return() looks like this:

 static inline enum print_line_t trace_handle_return(struct trace_seq *s)
 {
        return trace_seq_has_overflowed(s) ?
                TRACE_TYPE_PARTIAL_LINE : TRACE_TYPE_HANDLED;
 }

Where trace_seq_overflowed(s) is:

 static inline bool trace_seq_has_overflowed(struct trace_seq *s)
 {
	return s->full || seq_buf_has_overflowed(&s->seq);
 }

And seq_buf_has_overflowed(&s->seq) is:

 static inline bool
 seq_buf_has_overflowed(struct seq_buf *s)
 {
	return s->len > s->size;
 }

Making trace_handle_return() into:

 return (s->full || (s->seq->len > s->seq->size)) ?
           TRACE_TYPE_PARTIAL_LINE :
           TRACE_TYPE_HANDLED;

One would think this is not an issue to keep as an inline. But because this
is used in the TRACE_EVENT() macro, it is extended for every tracepoint in
the system. Taking a look at a single tracepoint x86_irq_vector (was the
first one I randomly chosen). As trace_handle_return is used in the
TRACE_EVENT() macro of trace_raw_output_##call() we disassemble
trace_raw_output_x86_irq_vector and do a diff:

- is the original
+ is the out-of-line code

I removed identical lines that were different just due to different
addresses.

--- /tmp/irq-vec-orig	2017-03-16 09:12:48.569384851 -0400
+++ /tmp/irq-vec-ool	2017-03-16 09:13:39.378153385 -0400
@@ -6,27 +6,23 @@
        53                      push   %rbx
        48 89 fb                mov    %rdi,%rbx
        4c 8b a7 c0 20 00 00    mov    0x20c0(%rdi),%r12
        e8 f7 72 13 00          callq  ffffffff81155c80 <trace_raw_output_prep>
        83 f8 01                cmp    $0x1,%eax
        74 05                   je     ffffffff8101e993 <trace_raw_output_x86_irq_vector+0x23>
        5b                      pop    %rbx
        41 5c                   pop    %r12
        5d                      pop    %rbp
        c3                      retq
        41 8b 54 24 08          mov    0x8(%r12),%edx
-       48 8d bb 98 10 00 00    lea    0x1098(%rbx),%rdi
+       48 81 c3 98 10 00 00    add    $0x1098,%rbx
-       48 c7 c6 7b 8a a0 81    mov    $0xffffffff81a08a7b,%rsi
+       48 c7 c6 ab 8a a0 81    mov    $0xffffffff81a08aab,%rsi
-       e8 c5 85 13 00          callq  ffffffff81156f70 <trace_seq_printf>

 === here's the start of the main difference ===

+       48 89 df                mov    %rbx,%rdi
+       e8 62 7e 13 00          callq  ffffffff81156810 <trace_seq_printf>
-       8b 93 b8 20 00 00       mov    0x20b8(%rbx),%edx
-       31 c0                   xor    %eax,%eax
-       85 d2                   test   %edx,%edx
-       75 11                   jne    ffffffff8101e9c8 <trace_raw_output_x86_irq_vector+0x58>
-       48 8b 83 a8 20 00 00    mov    0x20a8(%rbx),%rax
-       48 39 83 a0 20 00 00    cmp    %rax,0x20a0(%rbx)
-       0f 93 c0                setae  %al
+       48 89 df                mov    %rbx,%rdi
+       e8 4a c5 12 00          callq  ffffffff8114af00 <trace_handle_return>
        5b                      pop    %rbx
-       0f b6 c0                movzbl %al,%eax

 === end ===

        41 5c                   pop    %r12
        5d                      pop    %rbp
        c3                      retq

If you notice, the original has 22 bytes of text more than the out of line
version. As this is for every TRACE_EVENT() defined in the system, this can
become quite large.

   text	   data	    bss	    dec	    hex	filename
8690305	5450490	1298432	15439227	 eb957b	vmlinux-orig
8681725	5450490	1298432	15430647	 eb73f7	vmlinux-handle

This change has a total of 8580 bytes in savings.

 $ objdump -dr /tmp/vmlinux-orig | grep '^[0-9a-f]* <trace_raw_output' | wc -l
324

That's 324 tracepoints. But this does not include modules (which contain
many more tracepoints). For an allyesconfig build:

 $ objdump -dr vmlinux-allyes-orig | grep '^[0-9a-f]* <trace_raw_output' | wc -l
1401

That's 1401 tracepoints giving us:

   text    data     bss     dec     hex filename
137920629       140221067       53264384        331406080       13c0db00 vmlinux-allyes-orig
137827709       140221067       53264384        331313160       13bf7008 vmlinux-allyes-handle

92920 bytes in savings!!!

Link: http://lkml.kernel.org/r/20170315021431.13107-2-andi@firstfloor.org

Reported-by: Andi Kleen <andi@firstfloor.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-03-24 20:51:50 -04:00
Steven Rostedt (VMware)
42c269c88d ftrace: Allow for function tracing to record init functions on boot up
Adding a hook into free_reserve_area() that informs ftrace that boot up init
text is being free, lets ftrace safely remove those init functions from its
records, which keeps ftrace from trying to modify text that no longer
exists.

Note, this still does not allow for tracing .init text of modules, as
modules require different work for freeing its init code.

Link: http://lkml.kernel.org/r/1488502497.7212.24.camel@linux.intel.com

Cc: linux-mm@kvack.org
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Peter Zijlstra <peterz@infradead.org>
Requested-by: Todd Brandt <todd.e.brandt@linux.intel.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-03-24 20:51:49 -04:00
Steven Rostedt (VMware)
dbeafd0d61 ftrace: Have function tracing start in early boot up
Register the function tracer right after the tracing buffers are initialized
in early boot up. This will allow function tracing to begin early if it is
enabled via the kernel command line.

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-03-24 20:51:48 -04:00
Steven Rostedt (VMware)
9afecfbb95 tracing: Postpone tracer start-up tests till the system is more robust
As tracing can now be enabled very early in boot up, even before some
critical system services (like scheduling), do not run the tracer selftests
until after early_initcall() is performed. If a tracer is registered before
such time, it is saved off in a list and the test is run when the system is
able to handle more diverse functions.

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-03-24 20:51:46 -04:00
Steven Rostedt (VMware)
f631718de3 ftrace: Move ftrace_init() to right after memory initialization
Initialize the ftrace records immediately after memory initialization, as
that is all that is required for the records to be created. This will allow
for future work to get function tracing started earlier in the boot process.

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-03-24 13:08:44 -04:00
Steven Rostedt (VMware)
e725c731e3 tracing: Split tracing initialization into two for early initialization
Create an early_trace_init() function that will initialize the buffers and
allow for ealier use of trace_printk(). This will also allow for future work
to have function tracing start earlier at boot up.

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-03-24 13:08:43 -04:00
Linus Torvalds
97da3854c5 Linux 4.11-rc3 2017-03-19 19:09:39 -07:00
Linus Torvalds
452b94b8c8 mm/swap: don't BUG_ON() due to uninitialized swap slot cache
This BUG_ON() triggered for me once at shutdown, and I don't see a
reason for the check.  The code correctly checks whether the swap slot
cache is usable or not, so an uninitialized swap slot cache is not
actually problematic afaik.

I've temporarily just switched the BUG_ON() to a WARN_ON_ONCE(), since
I'm not sure why that seemingly pointless check was there.  I suspect
the real fix is to just remove it entirely, but for now we'll warn about
it but not bring the machine down.

Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-03-19 19:00:47 -07:00
Linus Torvalds
a07a6e4121 powerpc fixes for 4.11 #5
- Wire up statx() syscall
  - Don't print a warning on memory hotplug when HPT resizing isn't available
 
 Thanks to:
   David Gibson, Chandan Rajendra.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJYzjnLAAoJEFHr6jzI4aWAwcAQAIeOYvCaoxQg6IEFoWJVc+hv
 YfMy86TotkWFojQDV6OUddlz7S14AypjniegVq9aWOPIVwGqmDfp+W8qRfhUh3ab
 S+uorqmXzAFhIUpzVpVpX/nnK6C3eFQdaKCLOxM0Ev413AWUMu70cPmXlCR+x0Kx
 GxeV+nfIfBdyIL4yhK/aDHSItfUwNTmTPSyaUj17/cwLu7DwMjcwKWjrCYCzgBZW
 1hQeWo+yrfvJ1U4yMEGdnDfuTPnWQZsHMw3qSuPzPCnVMgV6YS3HC7/ZEUvBSSBJ
 Bwr6vVEsniLkHDgyFu3665y4abfvqw2iojXKuTpUQMp40T3RCZlT1FQoMvIoJQGC
 FSMUsqnEudjliURi92zq4ImSbPbezB2bG3EsK1jKOOQ9gGbQa2qjXc2CSJiCtZwE
 zsUCcwRFo8Pl1D/KYMTl52nQG3oOMQV/2ceX73HxajsdShXcyJxFEpMPv6aEKi1E
 YT5i2KUXq3sfFPNWe/4gtfOYX9j+tSqi+CRvnwDNZG+a3XLUk/VRptCeIyntH965
 8GYs28CxLz1qljBeAA7czo+1gpNhl1h6FNl1nhqyWRM6jdcLucMeEml0RYCYPmnk
 +jLO6CkYYpIraM0mRqAolM7fYAtm6dOphaeeezCJIMvD/NTdk+pH86JoA64GQZvE
 v1HXCc4lYyUFlvASrCNl
 =TfiW
 -----END PGP SIGNATURE-----

Merge tag 'powerpc-4.11-5' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux

Pull more powerpc fixes from Michael Ellerman:
 "A couple of minor powerpc fixes for 4.11:

   - wire up statx() syscall

   - don't print a warning on memory hotplug when HPT resizing isn't
     available

  Thanks to: David Gibson, Chandan Rajendra"

* tag 'powerpc-4.11-5' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
  powerpc/pseries: Don't give a warning when HPT resizing isn't available
  powerpc: Wire up statx() syscall
2017-03-19 18:49:28 -07:00
Linus Torvalds
4571bc5abf Merge branch 'parisc-4.11-2' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux
Pull parisc fixes from Helge Deller:

 - Mikulas Patocka added support for R_PARISC_SECREL32 relocations in
   modules with CONFIG_MODVERSIONS.

 - Dave Anglin optimized the cache flushing for vmap ranges.

 - Arvind Yadav provided a fix for a potential NULL pointer dereference
   in the parisc perf code (and some code cleanups).

 - I wired up the new statx system call, fixed some compiler warnings
   with the access_ok() macro and fixed shutdown code to really halt a
   system at shutdown instead of crashing & rebooting.

* 'parisc-4.11-2' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux:
  parisc: Fix system shutdown halt
  parisc: perf: Fix potential NULL pointer dereference
  parisc: Avoid compiler warnings with access_ok()
  parisc: Wire up statx system call
  parisc: Optimize flush_kernel_vmap_range and invalidate_kernel_vmap_range
  parisc: support R_PARISC_SECREL32 relocation in modules
2017-03-19 18:11:13 -07:00
Linus Torvalds
8aa3417255 Merge git://git.kernel.org/pub/scm/linux/kernel/git/nab/target-pending
Pull SCSI target fixes from Nicholas Bellinger:
 "The bulk of the changes are in qla2xxx target driver code to address
  various issues found during Cavium/QLogic's internal testing (stable
  CC's included), along with a few other stability and smaller
  miscellaneous improvements.

  There are also a couple of different patch sets from Mike Christie,
  which have been a result of his work to use target-core ALUA logic
  together with tcm-user backend driver.

  Finally, a patch to address some long standing issues with
  pass-through SCSI export of TYPE_TAPE + TYPE_MEDIUM_CHANGER devices,
  which will make folks using physical (or virtual) magnetic tape happy"

* git://git.kernel.org/pub/scm/linux/kernel/git/nab/target-pending: (28 commits)
  qla2xxx: Update driver version to 9.00.00.00-k
  qla2xxx: Fix delayed response to command for loop mode/direct connect.
  qla2xxx: Change scsi host lookup method.
  qla2xxx: Add DebugFS node to display Port Database
  qla2xxx: Use IOCB interface to submit non-critical MBX.
  qla2xxx: Add async new target notification
  qla2xxx: Export DIF stats via debugfs
  qla2xxx: Improve T10-DIF/PI handling in driver.
  qla2xxx: Allow relogin to proceed if remote login did not finish
  qla2xxx: Fix sess_lock & hardware_lock lock order problem.
  qla2xxx: Fix inadequate lock protection for ABTS.
  qla2xxx: Fix request queue corruption.
  qla2xxx: Fix memory leak for abts processing
  qla2xxx: Allow vref count to timeout on vport delete.
  tcmu: Convert cmd_time_out into backend device attribute
  tcmu: make cmd timeout configurable
  tcmu: add helper to check if dev was configured
  target: fix race during implicit transition work flushes
  target: allow userspace to set state to transitioning
  target: fix ALUA transition timeout handling
  ...
2017-03-19 18:06:31 -07:00
Linus Torvalds
1b8df61908 Merge branch 'libnvdimm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm
Pull device-dax fixes from Dan Williams:
 "The device-dax driver was not being careful to handle falling back to
  smaller fault-granularity sizes.

  The driver already fails fault attempts that are smaller than the
  device's alignment, but it also needs to handle the cases where a
  larger page mapping could be established. For simplicity of the
  immediate fix the implementation just signals VM_FAULT_FALLBACK until
  fault-size == device-alignment.

  One fix is for -stable to address pmd-to-pte fallback from the
  original implementation, another fix is for the new (introduced in
  4.11-rc1) pud-to-pmd regression, and a typo fix comes along for the
  ride.

  These have received a build success notification from the kbuild
  robot"

* 'libnvdimm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
  device-dax: fix debug output typo
  device-dax: fix pud fault fallback handling
  device-dax: fix pmd/pte fault fallback handling
2017-03-19 15:45:02 -07:00
Himanshu Madhani
6c611d18f3 qla2xxx: Update driver version to 9.00.00.00-k
Signed-off-by: Himanshu Madhani <himanshu.madhani@cavium.com>
signed-off-by: Giridhar Malavali <giridhar.malavali@cavium.com>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
2017-03-18 17:28:38 -07:00
Quinn Tran
ec7193e260 qla2xxx: Fix delayed response to command for loop mode/direct connect.
Current driver wait for FW to be in the ready state before
processing in-coming commands. For Arbitrated Loop or
Point-to- Point (not switch), FW Ready state can take a while.
FW will transition to ready state after all Nports have been
logged in. In the mean time, certain initiators have completed
the login and starts IO. Driver needs to start processing all
queues if FW is already started.

Signed-off-by: Quinn Tran <quinn.tran@cavium.com>
Signed-off-by: Himanshu Madhani <himanshu.madhani@cavium.com>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
2017-03-18 17:28:38 -07:00
Quinn Tran
482c9dc792 qla2xxx: Change scsi host lookup method.
For target mode, when new scsi command arrive, driver first performs
a look up of the SCSI Host. The current look up method is based on
the ALPA portion of the NPort ID. For Cisco switch, the ALPA can
not be used as the index. Instead, the new search method is based
on the full value of the Nport_ID via btree lib.

Signed-off-by: Quinn Tran <quinn.tran@cavium.com>
Signed-off-by: Himanshu Madhani <himanshu.madhani@cavium.com>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
2017-03-18 17:28:37 -07:00
Himanshu Madhani
c423437e3f qla2xxx: Add DebugFS node to display Port Database
Signed-off-by: Himanshu Madhani <himanshu.madhani@cavium.com>
Signed-off-by: Giridhar Malavali <giridhar.malavali@cavium.com>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
2017-03-18 17:28:37 -07:00
Quinn Tran
15f30a5752 qla2xxx: Use IOCB interface to submit non-critical MBX.
The Mailbox interface is currently over subscribed. We like
to reserve the Mailbox interface for the chip managment and
link initialization. Any non essential Mailbox command will
be routed through the IOCB interface. The IOCB interface is
able to absorb more commands.

Following commands are being routed through IOCB interface

- Get ID List (007Ch)
- Get Port DB (0064h)
- Get Link Priv Stats (006Dh)

Signed-off-by: Quinn Tran <quinn.tran@cavium.com>
Signed-off-by: Himanshu Madhani <himanshu.madhani@cavium.com>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
2017-03-18 17:28:37 -07:00
Quinn Tran
f1443eebca qla2xxx: Add async new target notification
Signed-off-by: Quinn Tran <quinn.tran@cavium.com>
Signed-off-by: Himanshu Madhani <himanshu.madhani@cavium.com>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
2017-03-18 17:28:37 -07:00
Anil Gurumurthy
54b9993c8c qla2xxx: Export DIF stats via debugfs
Signed-off-by: Anil Gurumurthy <anil.gurumurthy@cavium.com>
Signed-off-by: Himanshu Madhani <himanshu.madhani@cavium.com>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
2017-03-18 17:28:36 -07:00
Quinn Tran
be25152c0d qla2xxx: Improve T10-DIF/PI handling in driver.
Add routines to support T10 DIF tag.

Signed-off-by: Quinn Tran <quinn.tran@cavium.com>
Signed-off-by: Anil Gurumurthy <anil.gurumurthy@cavium.com>
Signed-off-by: Himanshu Madhani <himanshu.madhani@cavium.com>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
2017-03-18 17:28:36 -07:00
Quinn Tran
5b33469a05 qla2xxx: Allow relogin to proceed if remote login did not finish
If the remote port have started the login process, then the
PLOGI and PRLI should be back to back. Driver will allow
the remote port to complete the process. For the case where
the remote port decide to back off from sending PRLI, this
local port sets an expiration timer for the PRLI. Once the
expiration time passes, the relogin retry logic is allowed
to go through and perform login with the remote port.

Signed-off-by: Quinn Tran <quinn.tran@qlogic.com>
Signed-off-by: Himanshu Madhani <himanshu.madhani@cavium.com>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
2017-03-18 17:28:36 -07:00
Quinn Tran
f159b3c7cd qla2xxx: Fix sess_lock & hardware_lock lock order problem.
The main lock that needs to be held for CMD or TMR submission
to upper layer is the sess_lock. The sess_lock is used to
serialize cmd submission and session deletion. The addition
of hardware_lock being held is not necessary. This patch removes
hardware_lock dependency from CMD/TMR submission.

Use hardware_lock only for error response in this case.

Path1
       CPU0                    CPU1
       ----                    ----
  lock(&(&ha->tgt.sess_lock)->rlock);
                               lock(&(&ha->hardware_lock)->rlock);
                               lock(&(&ha->tgt.sess_lock)->rlock);
  lock(&(&ha->hardware_lock)->rlock);

Path2/deadlock
*** DEADLOCK ***
Call Trace:
dump_stack+0x85/0xc2
print_circular_bug+0x1e3/0x250
__lock_acquire+0x1425/0x1620
lock_acquire+0xbf/0x210
_raw_spin_lock_irqsave+0x53/0x70
qlt_sess_work_fn+0x21d/0x480 [qla2xxx]
process_one_work+0x1f4/0x6e0

Cc: <stable@vger.kernel.org>
Cc: Bart Van Assche <Bart.VanAssche@sandisk.com>
Reported-by: Bart Van Assche <Bart.VanAssche@sandisk.com>
Signed-off-by: Quinn Tran <quinn.tran@cavium.com>
Signed-off-by: Himanshu Madhani <himanshu.madhani@cavium.com>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
2017-03-18 17:28:08 -07:00
Quinn Tran
8f6fc8d4e7 qla2xxx: Fix inadequate lock protection for ABTS.
Normally, ABTS is sent to Target Core as Task MGMT command.
In the case of error, qla2xxx needs to send response, hardware_lock
is required to prevent request queue corruption.

Cc: <stable@vger.kernel.org>
Signed-off-by: Quinn Tran <quinn.tran@cavium.com>
Signed-off-by: Himanshu Madhani <himanshu.madhani@cavium.com>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
2017-03-18 17:28:08 -07:00
Quinn Tran
8b666809e1 qla2xxx: Fix request queue corruption.
When FW notify driver or driver detects low FW resource,
driver tries to send out Busy SCSI Status to tell Initiator
side to back off. During the send process, the lock was not held.

Cc: <stable@vger.kernel.org>
Signed-off-by: Quinn Tran <quinn.tran@qlogic.com>
Signed-off-by: Himanshu Madhani <himanshu.madhani@cavium.com>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
2017-03-18 17:28:08 -07:00
Quinn Tran
ae940f2c47 qla2xxx: Fix memory leak for abts processing
Cc: <stable@vger.kernel.org>
Signed-off-by: Quinn Tran <quinn.tran@cavium.com>
Signed-off-by: Himanshu Madhani <himanshu.madhani@cavium.com>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
2017-03-18 17:28:08 -07:00
Joe Carnuccio
c4a9b538ab qla2xxx: Allow vref count to timeout on vport delete.
Cc: <stable@vger.kernel.org>
Signed-off-by: Joe Carnuccio <joe.carnuccio@cavium.com>
Signed-off-by: Himanshu Madhani <himanshu.madhani@cavium.com>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
2017-03-18 17:27:56 -07:00
Nicholas Bellinger
7d7a743543 tcmu: Convert cmd_time_out into backend device attribute
Instead of putting cmd_time_out under ../target/core/user_0/foo/control,
which has historically been used by parameters needed for initial
backend device configuration, go ahead and move cmd_time_out into
a backend device attribute.

In order to do this, tcmu_module_init() has been updated to create
a local struct configfs_attribute **tcmu_attrs, that is based upon
the existing passthrough_attrib_attrs along with the new cmd_time_out
attribute.  Once **tcm_attrs has been setup, go ahead and point
it at tcmu_ops->tb_dev_attrib_attrs so it's picked up by target-core.

Also following MNC's previous change, ->cmd_time_out is stored in
milliseconds but exposed via configfs in seconds.  Also, note this
patch restricts the modification of ->cmd_time_out to before +
after the TCMU device has been configured, but not while it has
active fabric exports.

Cc: Mike Christie <mchristi@redhat.com>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
2017-03-18 16:32:30 -07:00
Mike Christie
af980e46a2 tcmu: make cmd timeout configurable
A single daemon could implement multiple types of devices
using multuple types of real devices that may not support
restarting from crashes and/or handling tcmu timeouts. This
makes the cmd timeout configurable, so handlers that do not
support it can turn if off for now.

Signed-off-by: Mike Christie <mchristi@redhat.com>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
2017-03-18 16:32:27 -07:00
Mike Christie
972c7f1679 tcmu: add helper to check if dev was configured
This adds a helper to check if the dev was configured. It
will be used in the next patch to prevent updates to some
config settings after the device has been setup.

Signed-off-by: Mike Christie <mchristi@redhat.com>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
2017-03-18 16:32:19 -07:00
Linus Torvalds
93afaa4513 OpenRISC fixes for 4.11
-----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJYy4GpAAoJEMOzHC1eZifkRm8P/jD5fc3XqdBy9hHX2x5p0mtz
 UNO+tHIO5wjadsf6iMcS/2Ozc8SOXdAietnCK8khodn0RvXBUEW+samSxGKNY8zO
 0Oz3EcUEN3SEJ03sfMwE0G+O+LTUIgdSQ/2galTy6WqrrHVZV5Gp/z9JjDm19kUW
 j5U2/rqVBENSPb+nhOgeq59KpUep7RacXXLDCl34vmQKtFY2fMHa2ddOdtTEBJQh
 HXi7js0EfWeU6LNhMzcuUk1gwitfgNl85up2XBZmgxqe9J7Ysbosv8yH4GyrNjZU
 7TOM4NTUM+LHfVQHCemJ0Ie4CJqMzafY4cdrXC6YMIB86bbLZSMw64FBINcqg8XF
 +RASeFbzUZO38dRccepfRxmSoOPjrN4Nsj83pzdh/4jyD1gDVuQgIPwLONZag3sU
 g9W7bBPPTPaRXn4MA1iyqyLwck2RIpjSMW867LXrgs6cmZXLpccW/psh/c8w08h8
 8c4e++6IoiEuOp4TN0Z38TVT82apfQyfUkHepL26nI5YYgilfM8/9oPAM9l7WRSR
 Zyn7i3TEDX0kmt29ImUjf1ZBGIHzP3krepDkFi+gP5Xycf4sYLXjTffoqlbx9s4U
 8FVdXdiXJGAE3tm+H0DvbS+oTC3eW175SfJ88GZiYog3eWWPnYyMysk/Iv2qAZuE
 KUR4aVx5Rb5+gJAfga2/
 =lNoR
 -----END PGP SIGNATURE-----

Merge tag 'openrisc-for-linus' of git://github.com/openrisc/linux

Pull OpenRISC fixes from Stafford Horne:
 "OpenRISC fixes for build issues that were exposed by kbuild robots
  after 4.11 merge. All from allmodconfig builds. This includes:

   - bug in the handling of 8-byte get_user() calls

   - module build failure due to multile missing symbol exports"

* tag 'openrisc-for-linus' of git://github.com/openrisc/linux:
  openrisc: Export symbols needed by modules
  openrisc: fix issue handling 8 byte get_user calls
  openrisc: xchg: fix `computed is not used` warning
2017-03-18 15:50:39 -07:00
Mike Christie
760bf578ed target: fix race during implicit transition work flushes
This fixes the following races:

1. core_alua_do_transition_tg_pt could have read
tg_pt_gp_alua_access_state and gone into this if chunk:

if (!explicit &&
        atomic_read(&tg_pt_gp->tg_pt_gp_alua_access_state) ==
           ALUA_ACCESS_STATE_TRANSITION) {

and then core_alua_do_transition_tg_pt_work could update the
state. core_alua_do_transition_tg_pt would then only set
tg_pt_gp_alua_pending_state and the tg_pt_gp_alua_access_state would
not get updated with the second calls state.

2. core_alua_do_transition_tg_pt could be setting
tg_pt_gp_transition_complete while the tg_pt_gp_transition_work
is already completing. core_alua_do_transition_tg_pt then waits on the
completion that will never be called.

To handle these issues, we just call flush_work which will return when
core_alua_do_transition_tg_pt_work has completed so there is no need
to do the complete/wait. And, if core_alua_do_transition_tg_pt_work
was running, instead of trying to sneak in the state change, we just
schedule up another core_alua_do_transition_tg_pt_work call.

Note that this does not handle a possible race where there are multiple
threads call core_alua_do_transition_tg_pt at the same time. I think
we need a mutex in target_tg_pt_gp_alua_access_state_store.

Signed-off-by: Mike Christie <mchristi@redhat.com>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
2017-03-18 14:47:29 -07:00
Mike Christie
1ca4d4fa3b target: allow userspace to set state to transitioning
Userspace target_core_user handlers like tcmu-runner may want to set the
ALUA state to transitioning while it does implicit transitions. This
patch allows that state when set from configfs.

Signed-off-by: Mike Christie <mchristi@redhat.com>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
2017-03-18 14:47:28 -07:00
Mike Christie
d7175373f2 target: fix ALUA transition timeout handling
The implicit transition time tells initiators the min time
to wait before timing out a transition. We currently schedule
the transition to occur in tg_pt_gp_implicit_trans_secs
seconds so there is no room for delays. If
core_alua_do_transition_tg_pt_work->core_alua_update_tpg_primary_metadata
needs to write out info to a remote file, then the initiator can
easily time out the operation.

Signed-off-by: Mike Christie <mchristi@redhat.com>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
2017-03-18 14:47:28 -07:00
Mike Christie
207ee84133 target: Use system workqueue for ALUA transitions
If tcmu-runner is processing a STPG and needs to change the kernel's
ALUA state then we cannot use the same work queue for task management
requests and ALUA transitions, because we could deadlock. The problem
occurs when a STPG times out before tcmu-runner is able to
call into target_tg_pt_gp_alua_access_state_store->
core_alua_do_port_transition -> core_alua_do_transition_tg_pt ->
queue_work. In this case, the tmr is on the work queue waiting for
the STPG to complete, but the STPG transition is now queued behind
the waiting tmr.

Note:
This bug will also be fixed by this patch:
http://www.spinics.net/lists/target-devel/msg14560.html
which switches the tmr code to use the system workqueues.

For both, I am not sure if we need a dedicated workqueue since
it is not a performance path and I do not think we need WQ_MEM_RECLAIM
to make forward progress to free up memory like the block layer does.

Signed-off-by: Mike Christie <mchristi@redhat.com>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
2017-03-18 14:47:27 -07:00
Mike Christie
0a41457298 target: fail ALUA transitions for pscsi
We do not setup the LU group for pscsi devices, so if you write
a state to alua_access_state that will cause a transition you will
get a NULL pointer dereference.

This patch will fail attempts to try and transition the path
for backend devices that set the TRANSPORT_FLAG_PASSTHROUGH_ALUA
flag.

Signed-off-by: Mike Christie <mchristi@redhat.com>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
2017-03-18 14:47:26 -07:00
Mike Christie
530c6891b1 target: allow ALUA setup for some passthrough backends
This patch allows passthrough backends to use the core/base LIO
ALUA setup and state checks, but still handle the execution of
commands.

This will allow the target_core_user module to execute STPG and RTPG
in userspace, and not have to duplicate the ALUA state checks, path
information (needed so we can check if command is executable on
specific paths) and setup (rtslib sets/updates the configfs ALUA
interface like it does for iblock or file).

For STPG, the target_core_user userspace daemon, tcmu-runner will
still execute the STPG, and to update the core/base LIO state it
will use the existing configfs interface. For RTPG, tcmu-runner
will loop over configfs and/or cache the state.

Signed-off-by: Mike Christie <mchristi@redhat.com>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
2017-03-18 14:47:25 -07:00
Mike Christie
2579325ca0 tcmu: return on first Opt parse failure
We only were returing failure if the last opt to be parsed failed.
This has a return failure when we first detect a failure.

Signed-off-by: Mike Christie <mchristi@redhat.com>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
2017-03-18 14:47:25 -07:00