Stop including the event type in the definitions for the notice type.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Fix to return error code -ENOMEM from the memory alloc fail error
handling case instead of 0, as done elsewhere in this function.
Fixes: d5eff33ee6 ("nvmet: add simple file backed ns support")
Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.e>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Without this we can't cleanly shut down.
Based on analysis an an earlier patch from Hannes Reinecke.
Fixes: bb06ec3145 ("nvme: expand nvmf_check_if_ready checks")
Reported-by: Hannes Reinecke <hare@suse.de>
Tested-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: James Smart <james.smart@broadcom.com>
Print a useful warning instead.
Reported-by: Finn Thain <fthain@telegraphics.com.au>
Tested-by: Finn Thain <fthain@telegraphics.com.au>
Signed-off-by: Christoph Hellwig <hch@lst.de>
BFQ can deem a bfq_queue as soft real-time only if the queue
- periodically becomes completely idle, i.e., empty and with
no still-outstanding I/O request;
- after becoming idle, gets new I/O only after a special reference
time soft_rt_next_start.
In this respect, after commit "block, bfq: consider also past I/O in
soft real-time detection", the value of soft_rt_next_start can never
decrease. This causes a problem with the following special updating
case for soft_rt_next_start: to prevent queues that are not completely
idle to be wrongly detected as soft real-time (when they become
non-empty again), soft_rt_next_start is temporarily set to infinity
for empty queues with still outstanding I/O requests. But, if such an
update is actually performed, then, because of the above commit,
soft_rt_next_start will be stuck at infinity forever, and the queue
will have no more chance to be considered soft real-time.
On slow systems, this problem does cause actual soft real-time
applications to be occasionally not detected as such.
This commit addresses this issue by eliminating the pushing of
soft_rt_next_start to infinity, and by changing the way non-empty
queues are prevented from being wrongly detected as soft
real-time. Simply, a queue that becomes non-empty again can now be
detected as soft real-time only if it has no outstanding I/O request.
Signed-off-by: Davide Sapienza <sapienza.dav@gmail.com>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The maximum possible duration of the weight-raising period for
interactive applications is limited to 13 seconds, as this is the time
needed to load the largest application that we considered when tuning
weight raising. Unfortunately, in such an evaluation, we did not
consider the case of very slow virtual machines.
For example, on a QEMU/KVM virtual machine
- running in a slow PC;
- with a virtual disk stacked on a slow low-end 5400rpm HDD;
- serving a heavy I/O workload, such as the sequential reading of
several files;
mplayer takes 23 seconds to start, if constantly weight-raised.
To address this issue, this commit conservatively sets the upper limit
for weight-raising duration to 25 seconds.
Signed-off-by: Davide Sapienza <sapienza.dav@gmail.com>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
BFQ computes the duration of weight raising for interactive
applications automatically, using some reference parameters. In
particular, BFQ uses the best durations (see comments in the code for
how these durations have been assessed) for two classes of systems:
slow and fast ones. Examples of slow systems are old phones or systems
using micro HDDs. Fast systems are all the remaining ones. Using these
parameters, BFQ computes the actual duration of the weight raising,
for the system at hand, as a function of the relative speed of the
system w.r.t. the speed of a reference system, belonging to the same
class of systems as the system at hand.
This slow vs fast differentiation proved to be useful in the past, but
happens to have little meaning with current hardware. Even worse, it
does cause problems in virtual systems, where the speed of the system
can vary frequently, and so widely to just confuse the class-detection
mechanism, and, as we have verified experimentally, to cause BFQ to
compute non-sensical weight-raising durations.
This commit addresses this issue by removing the slow class and the
class-detection mechanism.
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
A description of how weight raising works is missing in BFQ
sources. In addition, the code for handling weight raising is
scattered across a few functions. This makes it rather hard to
understand the mechanism and its rationale. This commits adds such a
description at the beginning of the main source file.
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Since bfq_finish_request() is always called on the request 'next',
after bfq_requests_merged() is finished, and bfq_finish_request()
removes 'next' from its bfq_queue if needed, it isn't necessary to do
such a removal in advance in bfq_merged_requests().
This commit removes such a useless 'next' removal.
Signed-off-by: Filippo Muzzini <filippo.muzzini@outlook.it>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The request rq passed to the function bfq_requests_merged is always in
a bfq_queue, so the check !RB_EMPTY_NODE(&rq->rb_node) at the
beginning of bfq_requests_merged always succeeds, and the control
flow systematically skips to the end of the function. This implies
that the body of the function is never executed, i.e., the
repositioning of rq is never performed.
On the opposite end, a control is missing in the body of the function:
'next' must be removed only if it is inside a bfq_queue.
This commit removes the wrong check on rq, and adds the missing check
on 'next'. In addition, this commit adds comments on
bfq_requests_merged.
Signed-off-by: Filippo Muzzini <filippo.muzzini@outlook.it>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
In bfq_requests_merged(), there is a deadlock because the lock on
bfqq->bfqd->lock is held by the calling function, but the code of
this function tries to grab the lock again.
This deadlock is currently hidden by another bug (fixed by next commit
for this source file), which causes the body of bfq_requests_merged()
to be never executed.
This commit removes the deadlock by removing the lock/unlock pair.
Signed-off-by: Filippo Muzzini <filippo.muzzini@outlook.it>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Fix NULL pointer dereference in asus-wmi on rfkill cleanup.
The following is an automated git shortlog grouped by driver:
asus-wmi:
- Fix NULL pointer dereference
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEhiZOUlnC9oKN3n3AmT3/83c5Sy0FAlsP6SQACgkQmT3/83c5
Sy2rGA//b4nN3aa2LAKYiMzwQUlrYY8QGxBC4VVyJB+ewTK9R0wEQbIGiwX08Ism
cz23kP5Ewn3SjKikytRBaqwZA92hkke1YSOPCXwgbLFz7e4Q7dxH+oPOxZB2dQoF
lSYe4FEh6XolsEq6aOvzGIUkQl4Ne97rpNGMozIYXnIwzlTTG/NPnwo3ghZL6YTb
XyTekpfLKUycUqKaADIkEGEAXTxDgeeoQnQLLQzBM6cuEkHAk8zyPsE1mVr9G9ns
3CCQlh4U2PFY9k2X32v5YMUrTZh1EtcVDG1gF4kV7OVFqeMjN3cQJti+W6dOjc6u
WoElplxWZ1yy5XKvyN1V4/Fm157t5Hf3BxytgHnwhfi5RvWP526alR42dfHYb6ss
desWCFPmG9nzDbzkVWZapnfVJpb9gqkyev4O9KYLk4ZJrUqEdiaumQicRAnwjwlV
NE1BYGmnj5N9EIw+k5/zYiBuAtYbtDlv5cf1EPbSMg/opXD4JZu+460Ea1y//D7N
mcq6KW6ZjFdsz9Ppw8cWcaEhTnXxYXRyT9IduIwlJ+6wttafGhFKumWMTxgu3WTj
+fhM+D+PXgFIgEoJNOVDSSK1zw2ajYzOyTXzXGo9FWxuFPIw0WfvgubArnOMTspi
kNe6WFJSpE+7UX0xCK0J00ORG6cquHVthYj+rB0seNixyM0+z78=
=GJY6
-----END PGP SIGNATURE-----
Merge tag 'platform-drivers-x86-v4.17-4' of git://git.infradead.org/linux-platform-drivers-x86
Pull x86 platform driver fix from Andy Shevchenko:
"Fix NULL pointer dereference in asus-wmi on rfkill cleanup.
The effective change is just one new condition - two lines of code.
But it required moving one static helper function, which is why the
diff looks a bit bigger"
* tag 'platform-drivers-x86-v4.17-4' of git://git.infradead.org/linux-platform-drivers-x86:
platform/x86: asus-wmi: Fix NULL pointer dereference
The counters in client IMC uncore are free running counters, not fixed
counters. It should be corrected. The new infrastructure for free
running counter should be applied.
Introducing a new type SNB_PCI_UNCORE_IMC_DATA for client IMC free
running counters.
Keeping the customized event_init() function to be compatible with old
event encoding.
Clean up other customized event_*() functions.
Signed-off-by: Kan Liang <kan.liang@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: acme@kernel.org
Cc: eranian@google.com
Link: http://lkml.kernel.org/r/1525371913-10597-8-git-send-email-kan.liang@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Some uncores have customized PMU. For customized PMU, it does not need
to customize everything. For example, it only needs to customize init()
function for client IMC uncore. Other functions like
add()/del()/start()/stop()/read() can use generic code.
Expose the uncore_pmu_event_add/del/start/stop() functions.
Signed-off-by: Kan Liang <kan.liang@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: acme@kernel.org
Cc: eranian@google.com
Link: http://lkml.kernel.org/r/1525371913-10597-7-git-send-email-kan.liang@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
As of Skylake Server, there are a number of free running counters in
each IIO Box that collect counts of per-box IO clocks and per-port
Input/Output x BW/Utilization.
The free running counters cannot be part of the existing IIO BOX,
because, quoting from Peter Zijlstra:
"This will result in some (probably) unexpected scheduling artifacts.
Probably the only way to really cure that is to have the free running
counters in their own PMU and not share with the GP counters of this
box."
So let's add a new PMU for the free running counters, as suggested.
The free-running counter is read-only and always active. Counting will
be suspended only when the IIO Box is powered down.
There are three types of IIO free-running counters on Skylake server, IO
CLOCKS counter, BANDWIDTH counters and UTILIZATION counters.
IO CLOCKS counter is a clock of IIO box.
BANDWIDTH counters are to count inbound(PCIe->CPU)/outbound(CPU->PCIe)
bandwidth.
UTILIZATION counters are to count input/output utilization.
The bit width of the free-running counters is 36-bits.
Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Kan Liang <kan.liang@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: acme@kernel.org
Cc: eranian@google.com
Link: http://lkml.kernel.org/r/1525371913-10597-6-git-send-email-kan.liang@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
There are a number of free running counters introduced for uncore, which
provide highly valuable information to a wide array of customers.
However, the generic uncore code doesn't support them yet.
The free running counters will be specially handled based on their
unique attributes:
- They are read-only. They cannot be enabled/disabled.
- The event and the counter are always 1:1 mapped. It doesn't need to
be assigned nor tracked by event_list.
- They are always active. It doesn't need to check the availability.
- They have different bit width.
Also, using inline helpers to replace the check for fixed counter and
free running counter.
Signed-off-by: Kan Liang <kan.liang@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: acme@kernel.org
Cc: eranian@google.com
Link: http://lkml.kernel.org/r/1525371913-10597-5-git-send-email-kan.liang@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
There are a number of free running counters introduced for uncore, which
provide highly valuable information to a wide array of customers.
For example, Skylake Server has IIO free running counters to collect
Input/Output x BW/Utilization.
There is NO event available on the general purpose counters, that is
exactly the same as the free running counters. The generic uncore code
needs to be enhanced to support the new counters.
In the uncore document, there is no event-code assigned to free running
counters. Some events need to be defined to indicate the free running
counters. The events are encoded as event-code + umask-code.
The event-code for all free running counters is 0xff, which is the same
as the fixed counters:
- It has not been decided what code will be used for common events on
future platforms. 0xff is the only one which will definitely not be
used as any common event-code.
- Cannot re-use current events on the general purpose counters. Because
there is NO event available, that is exactly the same as the free
running counters.
- Even in the existing codes, the fixed counters for core, that have the
same event-code, may count different things. Hence, it should not
surprise the users if the free running counters that share the same
event-code also count different things.
Umask will be used to distinguish the counters.
The umask-code is used to distinguish a fixed counter and a free running
counter, and different types of free running counters.
For fixed counters, the umask-code is 0x0X, where X indicates the index
of the fixed counter, which starts from 0.
- Compatible with the old event encoding.
- Currently, there is only one fixed counter. There are still 15
reserved spaces for extension.
For free running counters, the umask-code uses the rest of the space.
It would follow the format of 0xXY:
- X stands for the type of free running counters, which starts from 1.
- Y stands for the index of free running counters of same type, which
starts from 0.
- The free running counters do different thing. It can be categorized to
several types, according to the MSR location, bit width and
definition. E.g. there are three types of IIO free running counters on
Skylake server to monitor IO CLOCKS, BANDWIDTH and UTILIZATION on
different ports. It makes it easy to locate the free running counter
of a specific type.
- So far, there are at most 8 counters of each type. There are still 8
reserved spaces for extension.
Introducing a new index to indicate the free running counters. Only one
index is enough for all free running counters. Because the free running
counters are always active, and the event and free running counter are
always 1:1 mapped, it does not need extra index to indicate the assigned
counter.
Introducing a new data structure to store free running counters related
information for each type. It includes the number of counters, bit
width, base address, offset between counters and offset between boxes.
Introducing several inline helpers to check index for fixed counter and
free running counter, validate free running counter event, and retrieve
the free running counter information according to box and event.
Signed-off-by: Kan Liang <kan.liang@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: acme@kernel.org
Cc: eranian@google.com
Link: http://lkml.kernel.org/r/1525371913-10597-4-git-send-email-kan.liang@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
There is no index which is bigger than UNCORE_PMC_IDX_FIXED. The only
exception is client IMC uncore, which has been specially handled.
For generic code, it is not correct to use >= to check fixed counter.
The code quality issue will bring problem when a new counter index is
introduced.
Signed-off-by: Kan Liang <kan.liang@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: acme@kernel.org
Cc: eranian@google.com
Link: http://lkml.kernel.org/r/1525371913-10597-3-git-send-email-kan.liang@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
For Nehalem and Westmere, there is only one fixed counter for W-Box.
There is no index which is bigger than UNCORE_PMC_IDX_FIXED.
It is not correct to use >= to check fixed counter.
The code quality issue will bring problem when new counter index is
introduced.
Signed-off-by: Kan Liang <kan.liang@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: acme@kernel.org
Cc: eranian@google.com
Link: http://lkml.kernel.org/r/1525371913-10597-2-git-send-email-kan.liang@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
There are two free-running counters for client IMC uncore. The
customized event_init() function hard codes their index to
'UNCORE_PMC_IDX_FIXED' and 'UNCORE_PMC_IDX_FIXED + 1'.
To support the index 'UNCORE_PMC_IDX_FIXED + 1', the generic
uncore_perf_event_update is obscurely hacked.
The code quality issue will bring problems when a new counter index is
introduced into the generic code, for example, a new index for
free-running counter.
Introducing a customized event_read() function for client IMC uncore.
The customized function is copied from previous generic
uncore_pmu_event_read().
The index 'UNCORE_PMC_IDX_FIXED + 1' will be isolated for client IMC
uncore only.
Signed-off-by: Kan Liang <kan.liang@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: acme@kernel.org
Cc: eranian@google.com
Link: http://lkml.kernel.org/r/1525371913-10597-1-git-send-email-kan.liang@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
A missing clock update is causing the following warning:
rq->clock_update_flags < RQCF_ACT_SKIP
WARNING: CPU: 10 PID: 0 at kernel/sched/sched.h:963 inactive_task_timer+0x5d6/0x720
Call Trace:
<IRQ>
__hrtimer_run_queues+0x10f/0x530
hrtimer_interrupt+0xe5/0x240
smp_apic_timer_interrupt+0x79/0x2b0
apic_timer_interrupt+0xf/0x20
</IRQ>
do_idle+0x203/0x280
cpu_startup_entry+0x6f/0x80
start_secondary+0x1b0/0x200
secondary_startup_64+0xa5/0xb0
hardirqs last enabled at (793919): [<ffffffffa27c5f6e>] cpuidle_enter_state+0x9e/0x360
hardirqs last disabled at (793920): [<ffffffffa2a0096e>] interrupt_entry+0xce/0xe0
softirqs last enabled at (793922): [<ffffffffa20bef78>] irq_enter+0x68/0x70
softirqs last disabled at (793921): [<ffffffffa20bef5d>] irq_enter+0x4d/0x70
This happens because inactive_task_timer() calls sub_running_bw() (if
TASK_DEAD and non_contending) that might trigger a schedutil update,
which might access the clock. Clock is however currently updated only
later in inactive_task_timer() function.
Fix the problem by updating the clock right after task_rq_lock().
Reported-by: kernel test robot <xiaolong.ye@intel.com>
Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Claudio Scordino <claudio@evidence.eu.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Luca Abeni <luca.abeni@santannapisa.it>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20180530160809.9074-1-juri.lelli@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
select_task_rq() is used in a few paths to select the CPU upon which a
thread should be run - for example it is used by try_to_wake_up() & by
fork or exec balancing. As-is it allows use of any online CPU that is
present in the task's cpus_allowed mask.
This presents a problem because there is a period whilst CPUs are
brought online where a CPU is marked online, but is not yet fully
initialized - ie. the period where CPUHP_AP_ONLINE_IDLE <= state <
CPUHP_ONLINE. Usually we don't run any user tasks during this window,
but there are corner cases where this can happen. An example observed
is:
- Some user task A, running on CPU X, forks to create task B.
- sched_fork() calls __set_task_cpu() with cpu=X, setting task B's
task_struct::cpu field to X.
- CPU X is offlined.
- Task A, currently somewhere between the __set_task_cpu() in
copy_process() and the call to wake_up_new_task(), is migrated to
CPU Y by migrate_tasks() when CPU X is offlined.
- CPU X is onlined, but still in the CPUHP_AP_ONLINE_IDLE state. The
scheduler is now active on CPU X, but there are no user tasks on
the runqueue.
- Task A runs on CPU Y & reaches wake_up_new_task(). This calls
select_task_rq() with cpu=X, taken from task B's task_struct,
and select_task_rq() allows CPU X to be returned.
- Task A enqueues task B on CPU X's runqueue, via activate_task() &
enqueue_task().
- CPU X now has a user task on its runqueue before it has reached the
CPUHP_ONLINE state.
In most cases, the user tasks that schedule on the newly onlined CPU
have no idea that anything went wrong, but one case observed to be
problematic is if the task goes on to invoke the sched_setaffinity
syscall. The newly onlined CPU reaches the CPUHP_AP_ONLINE_IDLE state
before the CPU that brought it online calls stop_machine_unpark(). This
means that for a portion of the window of time between
CPUHP_AP_ONLINE_IDLE & CPUHP_ONLINE the newly onlined CPU's struct
cpu_stopper has its enabled field set to false. If a user thread is
executed on the CPU during this window and it invokes sched_setaffinity
with a CPU mask that does not include the CPU it's running on, then when
__set_cpus_allowed_ptr() calls stop_one_cpu() intending to invoke
migration_cpu_stop() and perform the actual migration away from the CPU
it will simply return -ENOENT rather than calling migration_cpu_stop().
We then return from the sched_setaffinity syscall back to the user task
that is now running on a CPU which it just asked not to run on, and
which is not present in its cpus_allowed mask.
This patch resolves the problem by having select_task_rq() enforce that
user tasks run on CPUs that are active - the same requirement that
select_fallback_rq() already enforces. This should ensure that newly
onlined CPUs reach the CPUHP_AP_ACTIVE state before being able to
schedule user tasks, and also implies that bringup_wait_for_ap() will
have called stop_machine_unpark() which resolves the sched_setaffinity
issue above.
I haven't yet investigated them, but it may be of interest to review
whether any of the actions performed by hotplug states between
CPUHP_AP_ONLINE_IDLE & CPUHP_AP_ACTIVE could have similar unintended
effects on user tasks that might schedule before they are reached, which
might widen the scope of the problem from just affecting the behaviour
of sched_setaffinity.
Signed-off-by: Paul Burton <paul.burton@mips.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20180526154648.11635-2-paul.burton@mips.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
As already enforced by the WARN() in __set_cpus_allowed_ptr(), the rules
for running on an online && !active CPU are stricter than just being a
kthread, you need to be a per-cpu kthread.
If you're not strictly per-CPU, you have better CPUs to run on and
don't need the partially booted one to get your work done.
The exception is to allow smpboot threads to bootstrap the CPU itself
and get kernel 'services' initialized before we allow userspace on it.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: 955dbdf4ce ("sched: Allow migrating kthreads into online but inactive CPUs")
Link: http://lkml.kernel.org/r/20170725165821.cejhb7v2s3kecems@hirez.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The correct form is "a high-level", so fix it accordingly.
Signed-off-by: Fabio Estevam <fabio.estevam@nxp.com>
Reviewed-by: Boris Brezillon <boris.brezillon@bootlin.com>
Signed-off-by: Mark Brown <broonie@kernel.org>
Add unprivileged version of ino_lookup ioctl BTRFS_IOC_INO_LOOKUP_USER
to allow normal users to call "btrfs subvolume list/show" etc. in
combination with BTRFS_IOC_GET_SUBVOL_INFO/BTRFS_IOC_GET_SUBVOL_ROOTREF.
This can be used like BTRFS_IOC_INO_LOOKUP but the argument is
different. This is because it always searches the fs/file tree
correspoinding to the fd with which this ioctl is called and also
returns the name of bottom subvolume.
The main differences from original ino_lookup ioctl are:
1. Read + Exec permission will be checked using inode_permission()
during path construction. -EACCES will be returned in case
of failure.
2. Path construction will be stopped at the inode number which
corresponds to the fd with which this ioctl is called. If
constructed path does not exist under fd's inode, -EACCES
will be returned.
3. The name of bottom subvolume is also searched and filled.
Note that the maximum length of path is shorter 256 (BTRFS_VOL_NAME_MAX+1)
bytes than ino_lookup ioctl because of space of subvolume's name.
Reviewed-by: Gu Jinxiang <gujx@cn.fujitsu.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Tested-by: Gu Jinxiang <gujx@cn.fujitsu.com>
Signed-off-by: Tomohiro Misono <misono.tomohiro@jp.fujitsu.com>
[ style fixes ]
Signed-off-by: David Sterba <dsterba@suse.com>
Add unprivileged ioctl BTRFS_IOC_GET_SUBVOL_ROOTREF which returns
ROOT_REF information of the subvolume containing this inode except the
subvolume name (this is because to prevent potential name leak). The
subvolume name will be gained by user version of ino_lookup ioctl
(BTRFS_IOC_INO_LOOKUP_USER) which also performs permission check.
The min id of root ref's subvolume to be searched is specified by
@min_id in struct btrfs_ioctl_get_subvol_rootref_args. After the search
ends, @min_id is set to the last searched root ref's subvolid + 1. Also,
if there are more root refs than BTRFS_MAX_ROOTREF_BUFFER_NUM,
-EOVERFLOW is returned. Therefore the caller can just call this ioctl
again without changing the argument to continue search.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Gu Jinxiang <gujx@cn.fujitsu.com>
Tested-by: Gu Jinxiang <gujx@cn.fujitsu.com>
Signed-off-by: Tomohiro Misono <misono.tomohiro@jp.fujitsu.com>
[ style fixes and struct item renames ]
Signed-off-by: David Sterba <dsterba@suse.com>
Add new unprivileged ioctl BTRFS_IOC_GET_SUBVOL_INFO which returns
the information of subvolume containing this inode.
(i.e. returns the information in ROOT_ITEM and ROOT_BACKREF.)
Reviewed-by: Gu Jinxiang <gujx@cn.fujitsu.com>
Tested-by: Gu Jinxiang <gujx@cn.fujitsu.com>
Signed-off-by: Tomohiro Misono <misono.tomohiro@jp.fujitsu.com>
[ minor style fixes, update struct comments ]
Signed-off-by: David Sterba <dsterba@suse.com>
We may derference an invalid pointer in the error path of
xfrm_bundle_create(). Fix this by returning this error
pointer directly instead of assigning it to xdst0.
Fixes: 45b018bedd ("ipsec: Create and use new helpers for dst child access.")
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
This avoids a WARNING splat when loading the macsonic or macmace driver.
Please see commit 205e1b7f51 ("dma-mapping: warn when there is no
coherent_dma_mask").
This implementation of arch_setup_pdev_archdata() differs from the
powerpc one, in that this one avoids clobbering a device dma mask
which has already been initialized.
Cc: Christoph Hellwig <hch@lst.de>
Cc: Greg Ungerer <gerg@linux-m68k.org>
Signed-off-by: Finn Thain <fthain@telegraphics.com.au>
Acked-by: Greg Ungerer <gerg@linux-m68k.org>
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
In inode_init_always(), we clear the inode mapping flags, which clears
any retained error (AS_EIO, AS_ENOSPC) bits. Unfortunately, we do not
also clear wb_err, which means that old mapping errors can leak through
to new inodes.
This is crucial for the XFS inode allocation path because we recycle old
in-core inodes and we do not want error state from an old file to leak
into the new file. This bug was discovered by running generic/036 and
generic/047 in a loop and noticing that the EIOs generated by the
collision of direct and buffered writes in generic/036 would survive the
remount between 036 and 047, and get reported to the fsyncs (on
different files!) in generic/047.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Missed converting the bioset_integrity_create() bounce bio set
call.
Fixes: 338aa96d56 ("block: convert bounce, q->bio_split to bioset_init()/mempool_init()")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABCAAGBQJbDwViAAoJEPfTWPspceCmv+gQANwV5cPSvXEsGdsI4FX+ThUD
li2vscItK2firIAPnDMN9K3pKCWN46Rm8oLvzo2vois/vAtIvbg+eJqxJmNFPNJP
X8wcFb108DF/4I8K1Vklq1ILl5+CNmb1KjdiJjrUbDhDlQ4lsHo0eUAabbZ4LTZE
mT5BZjHiYCLW/om/AXZWV5TWVopMefp/hBX8ICSts8vCMRnxA8xoFkC1QB3WgG4g
2P49RsLehfqIJTNOmwcw0TShNWu0gFjdhrTv1NpzsHcJuGvCUJ+OhhUyZAmGNSCb
VyGp7xb1T9NqyfVLnBeMHM1V2PudT3lX4VZkDFG5ZKkFdam9J2SvaFaNwl2pKear
znQAnpGH1fXlVSx4FCPTPESj3buEpJ8UY99+gpHQ5+cN1jVJEG+UprQnuXFpNhFM
sOizGiqOU5edwkxcMl6M/kJwmbNPnxFs/fDgUgWacorTBZEG7bxcrI10rKnuPs3A
Zl3OCYXq7Ls0lU9QBRqSPNVd87i/BTuOWUqkLjjVH8xsOiw1MFVqLKfpZVe9AWZy
4Y8u9ShHo/tKO219Sq2onUUGUmdCtMQJwx0vnuD3r0B7VMIUBNc6D8xwre2xoNrr
1BgnNooyXrC+7ESrEdaO2ORn0W3VqZkTMfc0k3Cms/ZbEc7/1+8bWTfmHDYh+VbV
gJ9jl85Qd2aGjK+f0s8+
=Dk2i
-----END PGP SIGNATURE-----
Merge tag 'for-linus-20180530' of git://git.kernel.dk/linux-block
Pull block fix from Jens Axboe:
"Just a single fix that should make it into this release, fixing a
regression with T10-DIF on NVMe"
* tag 'for-linus-20180530' of git://git.kernel.dk/linux-block:
nvme: fix extended data LBA supported setting
-----BEGIN PGP SIGNATURE-----
iQJIBAABCAAyFiEEcQCq365ubpQNLgrWVeRaWujKfIoFAlsOyMwUHHBhdWxAcGF1
bC1tb29yZS5jb20ACgkQVeRaWujKfIo4BRAApHiqQUZkaN705wf/YUQ7RpBbCdYZ
Ycg31Ls0OWaI7UNTr6tID4CFz9JIAhLqEQ6k9iDWc8sDnq9vNELRiBz7vZePxbYx
AKkXrgRKVROtMGgvoJFNrYGArX+UvSbZ1qhYFhVH2IptW4q/q2atEOXdQOSdNZhD
lWgz3Woi1ZBXPvKdXtj6Rme0C5VghOMDXX3gPfog7O+SWDW8lFOupZ9YbcnUF+kV
mgk/bNYKBZIKaL/XYuuF+4SIqSpbusQr9T4juT3xBKifSNatusLYDAIUGBSv2I7T
xc0yP+nHB50T+T2nec2eBcUXxKckbsrc4K+CdV1hWGj624Boq/MqEWzmw/4L/oHg
YLFrLjXYt0WtNg0SnjCKBNrsAuICx6g2m2imVMb3IQMnf03Q42jQtJvZDXbv24zR
7vDLUoK8z0RiyjAOgV0vK7qpZ6IncuX4twK1627ziMyE52gtHoR+T8f/x9BShaOi
8svPGz8xW6yfvCR3m5KDAqJrSFNA0ex20BkOP9Mi2QctMyfODJmQx9JJCEFSuYFz
mzDlQeOKCOO0p25GDXHPwHSgcfopWZzQgHB7RYz0uuz2pCZc5t3kfQvciIvjjhO0
MJSk69GgoQojc/Z9c+Y1+dQ5lXWmQEIJmAtsSZBVkI9lVutr6OUtgMd2fQZwo3/W
BBH5q85bwdagLAw=
=3tiZ
-----END PGP SIGNATURE-----
Merge tag 'selinux-pr-20180530' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/selinux
Pull SELinux fix from Paul Moore:
"One more small fix for SELinux: a small string length fix found by
KASAN.
I dislike sending patches this late in the release cycle, but this
patch fixes a legitimate problem, is very small, limited in scope, and
well understood.
There are two threads with more information on the problem, the latest
is linked below:
https://marc.info/?t=152723737400001&r=1&w=2
Stephen points out in the thread linked above:
'Such a setxattr() call can only be performed by a process with
CAP_MAC_ADMIN that is also allowed mac_admin permission in SELinux
policy. Consequently, this is never possible on Android (no process
is allowed mac_admin permission, always enforcing) and is only
possible in Fedora/RHEL for a few domains (if enforcing)'"
* tag 'selinux-pr-20180530' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/selinux:
selinux: KASAN: slab-out-of-bounds in xattr_getsecurity
All users have been converted to bioset_init(), kill off the
old API.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Convert XFS to embedded bio sets.
Acked-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Convert btrfs to embedded bio sets.
Acked-by: Chris Mason <clm@fb.com>
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Convert block DIO code to embedded bio sets.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Convert the target code to embedded bio sets.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Convert dm to embedded bio sets.
Acked-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>