Enumerate AVX512 Half-precision floating point (FP16) CPUID feature
flag. Compared with using FP32, using FP16 cut the number of bits
required for storage in half, reducing the exponent from 8 bits to 5,
and the mantissa from 23 bits to 10. Using FP16 also enables developers
to train and run inference on deep learning models fast when all
precision or magnitude (FP32) is not needed.
A processor supports AVX512 FP16 if CPUID.(EAX=7,ECX=0):EDX[bit 23]
is present. The AVX512 FP16 requires AVX512BW feature be implemented
since the instructions for manipulating 32bit masks are associated with
AVX512BW.
The only in-kernel usage of this is kvm passthrough. The CPU feature
flag is shown as "avx512_fp16" in /proc/cpuinfo.
Signed-off-by: Kyung Min Park <kyung.min.park@intel.com>
Acked-by: Dave Hansen <dave.hansen@intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Message-Id: <20201208033441.28207-2-kyung.min.park@intel.com>
Acked-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
For SEV, all DMA to and from guest has to use shared (un-encrypted) pages.
SEV uses SWIOTLB to make this happen without requiring changes to device
drivers. However, depending on the workload being run, the default 64MB
of it might not be enough and it may run out of buffers to use for DMA,
resulting in I/O errors and/or performance degradation for high
I/O workloads.
Adjust the default size of SWIOTLB for SEV guests using a
percentage of the total memory available to guest for the SWIOTLB buffers.
Adds a new sev_setup_arch() function which is invoked from setup_arch()
and it calls into a new swiotlb generic code function swiotlb_adjust_size()
to do the SWIOTLB buffer adjustment.
v5 fixed build errors and warnings as
Reported-by: kbuild test robot <lkp@intel.com>
Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
Co-developed-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
The value freq_max/freq_base is a fundamental component of frequency
invariance calculations. It may come from a variety of sources such as MSRs
or ACPI data, tracking it down when troubleshooting a system could be
non-trivial. It is worth saving it in the kernel logs.
# dmesg | grep 'Estimated ratio of average max'
[ 14.024036] smpboot: Estimated ratio of average max frequency by base frequency (times 1024): 1289
Signed-off-by: Giovanni Gherdovich <ggherdovich@suse.cz>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lkml.kernel.org/r/20201112182614.10700-4-ggherdovich@suse.cz
This is the first pass in creating the ability to calculate the
frequency invariance on AMD systems. This approach uses the CPPC
highest performance and nominal performance values that range from
0 - 255 instead of a high and base frquency. This is because we do
not have the ability on AMD to get a highest frequency value.
On AMD systems the highest performance and nominal performance
vaues do correspond to the highest and base frequencies for the system
so using them should produce an appropriate ratio but some tweaking
is likely necessary.
Due to CPPC being initialized later in boot than when the frequency
invariant calculation is currently made, I had to create a callback
from the CPPC init code to do the calculation after we have CPPC
data.
Special thanks to "kernel test robot <lkp@intel.com>" for reporting that
compilation of drivers/acpi/cppc_acpi.c is conditional to
CONFIG_ACPI_CPPC_LIB, not just CONFIG_ACPI.
[ ggherdovich@suse.cz: made safe under CPU hotplug, edited changelog. ]
Signed-off-by: Nathan Fontenot <nathan.fontenot@amd.com>
Signed-off-by: Giovanni Gherdovich <ggherdovich@suse.cz>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lkml.kernel.org/r/20201112182614.10700-2-ggherdovich@suse.cz
Mark tripped over the creative irqflags handling in the IO-APIC timer
delivery check which ends up doing:
local_irq_save(flags);
local_irq_enable();
local_irq_restore(flags);
which triggered a new consistency check he's working on required for
replacing the POPF based restore with a conditional STI.
That code is a historical mess and none of this is needed. Make it
straightforward use local_irq_disable()/enable() as that's all what is
required. It is invoked from interrupt enabled code nowadays.
Reported-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Mark Rutland <mark.rutland@arm.com>
Link: https://lore.kernel.org/r/87k0tpju47.fsf@nanos.tec.linutronix.de
Prarit reported that depending on the affinity setting the
' irq $N: Affinity broken due to vector space exhaustion.'
message is showing up in dmesg, but the vector space on the CPUs in the
affinity mask is definitely not exhausted.
Shung-Hsi provided traces and analysis which pinpoints the problem:
The ordering of trying to assign an interrupt vector in
assign_irq_vector_any_locked() is simply wrong if the interrupt data has a
valid node assigned. It does:
1) Try the intersection of affinity mask and node mask
2) Try the node mask
3) Try the full affinity mask
4) Try the full online mask
Obviously #2 and #3 are in the wrong order as the requested affinity
mask has to take precedence.
In the observed cases #1 failed because the affinity mask did not contain
CPUs from node 0. That made it allocate a vector from node 0, thereby
breaking affinity and emitting the misleading message.
Revert the order of #2 and #3 so the full affinity mask without the node
intersection is tried before actually affinity is broken.
If no node is assigned then only the full affinity mask and if that fails
the full online mask is tried.
Fixes: d6ffc6ac83 ("x86/vector: Respect affinity mask in irq descriptor")
Reported-by: Prarit Bhargava <prarit@redhat.com>
Reported-by: Shung-Hsi Yu <shung-hsi.yu@suse.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Shung-Hsi Yu <shung-hsi.yu@suse.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/87ft4djtyp.fsf@nanos.tec.linutronix.de
The MBA software controller (mba_sc) is a feedback loop which
periodically reads MBM counters and tries to restrict the bandwidth
below a user-specified value. It tags along the MBM counter overflow
handler to do the updates with 1s interval in mbm_update() and
update_mba_bw().
The purpose of mbm_update() is to periodically read the MBM counters to
make sure that the hardware counter doesn't wrap around more than once
between user samplings. mbm_update() calls __mon_event_count() for local
bandwidth updating when mba_sc is not enabled, but calls mbm_bw_count()
instead when mba_sc is enabled. __mon_event_count() will not be called
for local bandwidth updating in MBM counter overflow handler, but it is
still called when reading MBM local bandwidth counter file
'mbm_local_bytes', the call path is as below:
rdtgroup_mondata_show()
mon_event_read()
mon_event_count()
__mon_event_count()
In __mon_event_count(), m->chunks is updated by delta chunks which is
calculated from previous MSR value (m->prev_msr) and current MSR value.
When mba_sc is enabled, m->chunks is also updated in mbm_update() by
mistake by the delta chunks which is calculated from m->prev_bw_msr
instead of m->prev_msr. But m->chunks is not used in update_mba_bw() in
the mba_sc feedback loop.
When reading MBM local bandwidth counter file, m->chunks was changed
unexpectedly by mbm_bw_count(). As a result, the incorrect local
bandwidth counter which calculated from incorrect m->chunks is shown to
the user.
Fix this by removing incorrect m->chunks updating in mbm_bw_count() in
MBM counter overflow handler, and always calling __mon_event_count() in
mbm_update() to make sure that the hardware local bandwidth counter
doesn't wrap around.
Test steps:
# Run workload with aggressive memory bandwidth (e.g., 10 GB/s)
git clone https://github.com/intel/intel-cmt-cat && cd intel-cmt-cat
&& make
./tools/membw/membw -c 0 -b 10000 --read
# Enable MBA software controller
mount -t resctrl resctrl -o mba_MBps /sys/fs/resctrl
# Create control group c1
mkdir /sys/fs/resctrl/c1
# Set MB throttle to 6 GB/s
echo "MB:0=6000;1=6000" > /sys/fs/resctrl/c1/schemata
# Write PID of the workload to tasks file
echo `pidof membw` > /sys/fs/resctrl/c1/tasks
# Read local bytes counters twice with 1s interval, the calculated
# local bandwidth is not as expected (approaching to 6 GB/s):
local_1=`cat /sys/fs/resctrl/c1/mon_data/mon_L3_00/mbm_local_bytes`
sleep 1
local_2=`cat /sys/fs/resctrl/c1/mon_data/mon_L3_00/mbm_local_bytes`
echo "local b/w (bytes/s):" `expr $local_2 - $local_1`
Before fix:
local b/w (bytes/s): 11076796416
After fix:
local b/w (bytes/s): 5465014272
Fixes: ba0f26d852 (x86/intel_rdt/mba_sc: Prepare for feedback loop)
Signed-off-by: Xiaochen Shen <xiaochen.shen@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Cc: <stable@vger.kernel.org>
Link: https://lkml.kernel.org/r/1607063279-19437-1-git-send-email-xiaochen.shen@intel.com
In preparation to enable -Wimplicit-fallthrough for Clang, fix a warning
by explicitly adding a break statement instead of letting the code fall
through to the next case.
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://github.com/KSPP/linux/issues/115
In preparation to enable -Wimplicit-fallthrough for Clang, fix a warning
by explicitly adding a break statement instead of just letting the code
fall through to the next case.
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://github.com/KSPP/linux/issues/115
Fix to restore BTF if single-stepping causes a page fault and
it is cancelled.
Usually the BTF flag was restored when the single stepping is done
(in resume_execution()). However, if a page fault happens on the
single stepping instruction, the fault handler is invoked and
the single stepping is cancelled. Thus, the BTF flag is not
restored.
Fixes: 1ecc798c67 ("x86: debugctlmsr kprobes")
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/160389546985.106936.12727996109376240993.stgit@devnote2
Commit
26bfa5f894 ("x86, amd: Cleanup init_amd")
moved the code that remaps the TSEG region using 4k pages from
init_amd() to bsp_init_amd().
However, bsp_init_amd() is executed well before the direct mapping is
actually created:
setup_arch()
-> early_cpu_init()
-> early_identify_cpu()
-> this_cpu->c_bsp_init()
-> bsp_init_amd()
...
-> init_mem_mapping()
So the change effectively disabled the 4k remapping, because
pfn_range_is_mapped() is always false at this point.
It has been over six years since the commit, and no-one seems to have
noticed this, so just remove the code. The original code was also
incomplete, since it doesn't check how large the TSEG address range
actually is, so it might remap only part of it in any case.
Hygon has copied the incorrect version, so the code has never run on it
since the cpu support was added two years ago. Remove it from there as
well.
Committer notes:
This workaround is incomplete anyway:
1. The code must check MSRC001_0113.TValid (SMM TSeg Mask MSR) first, to
check whether the TSeg address range is enabled.
2. The code must check whether the range is not 2M aligned - if it is,
there's nothing to work around.
3. In all the BIOSes tested, the TSeg range is in a e820 reserved area
and those are not mapped anymore, after
66520ebc2d ("x86, mm: Only direct map addresses that are marked as E820_RAM")
which means, there's nothing to be worked around either.
So let's rip it out.
Signed-off-by: Arvind Sankar <nivedita@alum.mit.edu>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20201127171324.1846019-1-nivedita@alum.mit.edu
After having collected the majority of reports about MSRs being written
by userspace tools and what tools those are, and all newer reports
mostly repeating, add an URL where detailed information is gathered and
kept up-to-date.
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20201205002825.19107-1-bp@alien8.de
Pull x86 fixes from Thomas Gleixner:
"A set of fixes for x86:
- Make the AMD L3 QoS code and data priorization enable/disable
mechanism work correctly.
The control bit was only set/cleared on one of the CPUs in a L3
domain, but it has to be modified on all CPUs in the domain. The
initial documentation was not clear about this, but the updated one
from Oct 2020 spells it out.
- Fix an off by one in the UV platform detection code which causes
the UV hubs to be identified wrongly.
The chip revisions start at 1 not at 0.
- Fix a long standing bug in the evaluation of prefixes in the
uprobes code which fails to handle repeated prefixes properly.
The aggregate size of the prefixes can be larger than the bytes
array but the code blindly iterated over the aggregate size beyond
the array boundary. Add a macro to handle this case properly and
use it at the affected places"
* tag 'x86-urgent-2020-12-06' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/sev-es: Use new for_each_insn_prefix() macro to loop over prefixes bytes
x86/insn-eval: Use new for_each_insn_prefix() macro to loop over prefixes bytes
x86/uprobes: Do not use prefixes.nbytes when looping over prefixes.bytes
x86/platform/uv: Fix UV4 hub revision adjustment
x86/resctrl: Fix AMD L3 QOS CDP enable/disable
Currently, if an MCE happens in user-mode or while the kernel is copying
data from user space, 'kill_it' is used to check if execution of the
interrupted task can be recovered or not; the flag name however is not
very meaningful, hence rename it to match its goal.
[ bp: Massage commit message, rename the queue_task_work() arg too. ]
Signed-off-by: Gabriele Paoloni <gabriele.paoloni@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20201127161819.3106432-6-gabriele.paoloni@intel.com
Right now, for local MCEs the machine calls panic(), if needed, right
after lmce is set. For MCE broadcasting, mce_reign() takes care of
calling mce_panic().
Hence:
- improve readability by moving the conditional evaluation of
tolerant up to when kill_it is set first;
- move the mce_panic() call up into the statement where mce_end()
fails.
[ bp: Massage, remove comment in the mce_end() failure case because it
is superfluous; use local ptr 'cfg' in both tests. ]
Signed-off-by: Gabriele Paoloni <gabriele.paoloni@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Link: https://lkml.kernel.org/r/20201127161819.3106432-3-gabriele.paoloni@intel.com
Commit
fd8d9db355 ("x86/resctrl: Remove superfluous kernfs_get() calls to prevent refcount leak")
removed superfluous kernfs_get() calls in rdtgroup_ctrl_remove() and
rdtgroup_rmdir_ctrl(). That change resulted in an unused function
parameter to these two functions.
Clean up the unused function parameter in rdtgroup_ctrl_remove(),
rdtgroup_rmdir_mon() and their callers rdtgroup_rmdir_ctrl() and
rdtgroup_rmdir().
Signed-off-by: Xiaochen Shen <xiaochen.shen@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://lkml.kernel.org/r/1606759618-13181-1-git-send-email-xiaochen.shen@intel.com
When the AMD QoS feature CDP (code and data prioritization) is enabled
or disabled, the CDP bit in MSR 0000_0C81 is written on one of the CPUs
in an L3 domain (core complex). That is not correct - the CDP bit needs
to be updated on all the logical CPUs in the domain.
This was not spelled out clearly in the spec earlier. The specification
has been updated and the updated document, "AMD64 Technology Platform
Quality of Service Extensions Publication # 56375 Revision: 1.02 Issue
Date: October 2020" is available now. Refer the section: Code and Data
Prioritization.
Fix the issue by adding a new flag arch_has_per_cpu_cfg in rdt_cache
data structure.
The documentation can be obtained at:
https://developer.amd.com/wp-content/resources/56375.pdf
Link: https://bugzilla.kernel.org/show_bug.cgi?id=206537
[ bp: Massage commit message. ]
Fixes: 4d05bf71f1 ("x86/resctrl: Introduce AMD QOS feature")
Signed-off-by: Babu Moger <babu.moger@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://lkml.kernel.org/r/160675180380.15628.3309402017215002347.stgit@bmoger-ubuntu
Pull locking fixes from Thomas Gleixner:
"Two more places which invoke tracing from RCU disabled regions in the
idle path.
Similar to the entry path the low level idle functions have to be
non-instrumentable"
* tag 'locking-urgent-2020-11-29' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
intel_idle: Fix intel_idle() vs tracing
sched/idle: Fix arch_cpu_idle() vs tracing
Pull x86 fixes from Borislav Petkov:
"A couple of urgent fixes which accumulated this last week:
- Two resctrl fixes to prevent refcount leaks when manipulating the
resctrl fs (Xiaochen Shen)
- Correct prctl(PR_GET_SPECULATION_CTRL) reporting (Anand K Mistry)
- A fix to not lose already seen MCE severity which determines
whether the machine can recover (Gabriele Paoloni)"
* tag 'x86_urgent_for_v5.10-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/mce: Do not overwrite no_way_out if mce_end() fails
x86/speculation: Fix prctl() when spectre_v2_user={seccomp,prctl},ibpb
x86/resctrl: Add necessary kernfs_put() calls to prevent refcount leak
x86/resctrl: Remove superfluous kernfs_get() calls to prevent refcount leak
Pull iommu fixes from Will Deacon:
"Here's another round of IOMMU fixes for -rc6 consisting mainly of a
bunch of independent driver fixes. Thomas agreed for me to take the
x86 'tboot' fix here, as it fixes a regression introduced by a vt-d
change.
- Fix intel iommu driver when running on devices without VCCAP_REG
- Fix swiotlb and "iommu=pt" interaction under TXT (tboot)
- Fix missing return value check during device probe()
- Fix probe ordering for Qualcomm SMMU implementation
- Ensure page-sized mappings are used for AMD IOMMU buffers with SNP
RMP"
* tag 'iommu-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
iommu/vt-d: Don't read VCCAP register unless it exists
x86/tboot: Don't disable swiotlb when iommu is forced on
iommu: Check return of __iommu_attach_device()
arm-smmu-qcom: Ensure the qcom_scm driver has finished probing
iommu/amd: Enforce 4k mapping for certain IOMMU data structures
Currently, if mce_end() fails, no_way_out - the variable denoting
whether the machine can recover from this MCE - is determined by whether
the worst severity that was found across the MCA banks associated with
the current CPU, is of panic severity.
However, at this point no_way_out could have been already set by
mca_start() after looking at all severities of all CPUs that entered the
MCE handler. If mce_end() fails, check first if no_way_out is already
set and, if so, stick to it, otherwise use the local worst value.
[ bp: Massage. ]
Signed-off-by: Gabriele Paoloni <gabriele.paoloni@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Cc: <stable@vger.kernel.org>
Link: https://lkml.kernel.org/r/20201127161819.3106432-2-gabriele.paoloni@intel.com
When spectre_v2_user={seccomp,prctl},ibpb is specified on the command
line, IBPB is force-enabled and STIPB is conditionally-enabled (or not
available).
However, since
21998a3515 ("x86/speculation: Avoid force-disabling IBPB based on STIBP and enhanced IBRS.")
the spectre_v2_user_ibpb variable is set to SPECTRE_V2_USER_{PRCTL,SECCOMP}
instead of SPECTRE_V2_USER_STRICT, which is the actual behaviour.
Because the issuing of IBPB relies on the switch_mm_*_ibpb static
branches, the mitigations behave as expected.
Since
1978b3a53a ("x86/speculation: Allow IBPB to be conditionally enabled on CPUs with always-on STIBP")
this discrepency caused the misreporting of IB speculation via prctl().
On CPUs with STIBP always-on and spectre_v2_user=seccomp,ibpb,
prctl(PR_GET_SPECULATION_CTRL) would return PR_SPEC_PRCTL |
PR_SPEC_ENABLE instead of PR_SPEC_DISABLE since both IBPB and STIPB are
always on. It also allowed prctl(PR_SET_SPECULATION_CTRL) to set the IB
speculation mode, even though the flag is ignored.
Similarly, for CPUs without SMT, prctl(PR_GET_SPECULATION_CTRL) should
also return PR_SPEC_DISABLE since IBPB is always on and STIBP is not
available.
[ bp: Massage commit message. ]
Fixes: 21998a3515 ("x86/speculation: Avoid force-disabling IBPB based on STIBP and enhanced IBRS.")
Fixes: 1978b3a53a ("x86/speculation: Allow IBPB to be conditionally enabled on CPUs with always-on STIBP")
Signed-off-by: Anand K Mistry <amistry@google.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: <stable@vger.kernel.org>
Link: https://lkml.kernel.org/r/20201110123349.1.Id0cbf996d2151f4c143c90f9028651a5b49a5908@changeid
Get rid of the __call_single_node union and cleanup the API a little
to avoid external code relying on the structure layout as much.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
We call arch_cpu_idle() with RCU disabled, but then use
local_irq_{en,dis}able(), which invokes tracing, which relies on RCU.
Switch all arch_cpu_idle() implementations to use
raw_local_irq_{en,dis}able() and carefully manage the
lockdep,rcu,tracing state like we do in entry.
(XXX: we really should change arch_cpu_idle() to not return with
interrupts enabled)
Reported-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Mark Rutland <mark.rutland@arm.com>
Tested-by: Mark Rutland <mark.rutland@arm.com>
Link: https://lkml.kernel.org/r/20201120114925.594122626@infradead.org
On resource group creation via a mkdir an extra kernfs_node reference is
obtained by kernfs_get() to ensure that the rdtgroup structure remains
accessible for the rdtgroup_kn_unlock() calls where it is removed on
deletion. Currently the extra kernfs_node reference count is only
dropped by kernfs_put() in rdtgroup_kn_unlock() while the rdtgroup
structure is removed in a few other locations that lack the matching
reference drop.
In call paths of rmdir and umount, when a control group is removed,
kernfs_remove() is called to remove the whole kernfs nodes tree of the
control group (including the kernfs nodes trees of all child monitoring
groups), and then rdtgroup structure is freed by kfree(). The rdtgroup
structures of all child monitoring groups under the control group are
freed by kfree() in free_all_child_rdtgrp().
Before calling kfree() to free the rdtgroup structures, the kernfs node
of the control group itself as well as the kernfs nodes of all child
monitoring groups still take the extra references which will never be
dropped to 0 and the kernfs nodes will never be freed. It leads to
reference count leak and kernfs_node_cache memory leak.
For example, reference count leak is observed in these two cases:
(1) mount -t resctrl resctrl /sys/fs/resctrl
mkdir /sys/fs/resctrl/c1
mkdir /sys/fs/resctrl/c1/mon_groups/m1
umount /sys/fs/resctrl
(2) mkdir /sys/fs/resctrl/c1
mkdir /sys/fs/resctrl/c1/mon_groups/m1
rmdir /sys/fs/resctrl/c1
The same reference count leak issue also exists in the error exit paths
of mkdir in mkdir_rdt_prepare() and rdtgroup_mkdir_ctrl_mon().
Fix this issue by following changes to make sure the extra kernfs_node
reference on rdtgroup is dropped before freeing the rdtgroup structure.
(1) Introduce rdtgroup removal helper rdtgroup_remove() to wrap up
kernfs_put() and kfree().
(2) Call rdtgroup_remove() in rdtgroup removal path where the rdtgroup
structure is about to be freed by kfree().
(3) Call rdtgroup_remove() or kernfs_put() as appropriate in the error
exit paths of mkdir where an extra reference is taken by kernfs_get().
Fixes: f3cbeacaa0 ("x86/intel_rdt/cqm: Add rmdir support")
Fixes: e02737d5b8 ("x86/intel_rdt: Add tasks files")
Fixes: 60cf5e101f ("x86/intel_rdt: Add mkdir to resctrl file system")
Reported-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Xiaochen Shen <xiaochen.shen@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/1604085088-31707-1-git-send-email-xiaochen.shen@intel.com
Willem reported growing of kernfs_node_cache entries in slabtop when
repeatedly creating and removing resctrl subdirectories as well as when
repeatedly mounting and unmounting the resctrl filesystem.
On resource group (control as well as monitoring) creation via a mkdir
an extra kernfs_node reference is obtained to ensure that the rdtgroup
structure remains accessible for the rdtgroup_kn_unlock() calls where it
is removed on deletion. The kernfs_node reference count is dropped by
kernfs_put() in rdtgroup_kn_unlock().
With the above explaining the need for one kernfs_get()/kernfs_put()
pair in resctrl there are more places where a kernfs_node reference is
obtained without a corresponding release. The excessive amount of
reference count on kernfs nodes will never be dropped to 0 and the
kernfs nodes will never be freed in the call paths of rmdir and umount.
It leads to reference count leak and kernfs_node_cache memory leak.
Remove the superfluous kernfs_get() calls and expand the existing
comments surrounding the remaining kernfs_get()/kernfs_put() pair that
remains in use.
Superfluous kernfs_get() calls are removed from two areas:
(1) In call paths of mount and mkdir, when kernfs nodes for "info",
"mon_groups" and "mon_data" directories and sub-directories are
created, the reference count of newly created kernfs node is set to 1.
But after kernfs_create_dir() returns, superfluous kernfs_get() are
called to take an additional reference.
(2) kernfs_get() calls in rmdir call paths.
Fixes: 17eafd0762 ("x86/intel_rdt: Split resource group removal in two")
Fixes: 4af4a88e0c ("x86/intel_rdt/cqm: Add mount,umount support")
Fixes: f3cbeacaa0 ("x86/intel_rdt/cqm: Add rmdir support")
Fixes: d89b737901 ("x86/intel_rdt/cqm: Add mon_data")
Fixes: c7d9aac613 ("x86/intel_rdt/cqm: Add mkdir support for RDT monitoring")
Fixes: 5dc1d5c6ba ("x86/intel_rdt: Simplify info and base file lists")
Fixes: 60cf5e101f ("x86/intel_rdt: Add mkdir to resctrl file system")
Fixes: 4e978d06de ("x86/intel_rdt: Add "info" files to resctrl file system")
Reported-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Xiaochen Shen <xiaochen.shen@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Tested-by: Willem de Bruijn <willemb@google.com>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/1604085053-31639-1-git-send-email-xiaochen.shen@intel.com
Fix
./arch/x86/kernel/cpu/sgx/ioctl.c:666: warning: Function parameter or member \
'encl' not described in 'sgx_ioc_enclave_provision'
./arch/x86/kernel/cpu/sgx/ioctl.c:666: warning: Excess function parameter \
'enclave' description in 'sgx_ioc_enclave_provision'
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20201123181922.0c009406@canb.auug.org.au
Previously we were not clearing non-uapi flag bits in
sigaction.sa_flags when storing the userspace-provided sa_flags or
when returning them via oldact. Start doing so.
This allows userspace to detect missing support for flag bits and
allows the kernel to use non-uapi bits internally, as we are already
doing in arch/x86 for two flag bits. Now that this change is in
place, we no longer need the code in arch/x86 that was hiding these
bits from userspace, so remove it.
This is technically a userspace-visible behavior change for sigaction, as
the unknown bits returned via oldact.sa_flags are no longer set. However,
we are free to define the behavior for unknown bits exactly because
their behavior is currently undefined, so for now we can define the
meaning of each of them to be "clear the bit in oldact.sa_flags unless
the bit becomes known in the future". Furthermore, this behavior is
consistent with OpenBSD [1], illumos [2] and XNU [3] (FreeBSD [4] and
NetBSD [5] fail the syscall if unknown bits are set). So there is some
precedent for this behavior in other kernels, and in particular in XNU,
which is probably the most popular kernel among those that I looked at,
which means that this change is less likely to be a compatibility issue.
Link: [1] f634a6a4b5/sys/kern/kern_sig.c (L278)
Link: [2] 76f19f5fdc/usr/src/uts/common/syscall/sigaction.c (L86)
Link: [3] a449c6a3b8/bsd/kern/kern_sig.c (L480)
Link: [4] eded70c370/sys/kern/kern_sig.c (L699)
Link: [5] 3365779bec/sys/kern/sys_sig.c (L473)
Signed-off-by: Peter Collingbourne <pcc@google.com>
Reviewed-by: Dave Martin <Dave.Martin@arm.com>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Link: https://linux-review.googlesource.com/id/I35aab6f5be932505d90f3b3450c083b4db1eca86
Link: https://lkml.kernel.org/r/878dbcb5f47bc9b11881c81f745c0bef5c23f97f.1605235762.git.pcc@google.com
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Pull x86 fixes from Borislav Petkov:
- An IOMMU VT-d build fix when CONFIG_PCI_ATS=n along with a revert of
same because the proper one is going through the IOMMU tree (Thomas
Gleixner)
- An Intel microcode loader fix to save the correct microcode patch to
apply during resume (Chen Yu)
- A fix to not access user memory of other processes when dumping
opcode bytes (Thomas Gleixner)
* tag 'x86_urgent_for_v5.10-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
Revert "iommu/vt-d: Take CONFIG_PCI_ATS into account"
x86/dumpstack: Do not try to access user space code of other tasks
x86/microcode/intel: Check patch signature before saving microcode for early loading
iommu/vt-d: Take CONFIG_PCI_ATS into account
The kernel uses ACPI Boot Error Record Table (BERT) to report fatal
errors that occurred in a previous boot. The MCA errors in the BERT are
reported using the x86 Processor Error Common Platform Error Record
(CPER) format. Currently, the record prints out the raw MSR values and
AMD relies on the raw record to provide MCA information.
Extract the raw MSR values of MCA registers from the BERT and feed them
into mce_log() to decode them properly.
The implementation is SMCA-specific as the raw MCA register values are
given in the register offset order of the SMCA address space.
[ bp: Massage. ]
[ Fix a build breakage in patch v1. ]
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Smita Koralahalli <Smita.KoralahalliChannabasappa@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Punit Agrawal <punit1.agrawal@toshiba.co.jp>
Acked-by: Ard Biesheuvel <ardb@kernel.org>
Link: https://lkml.kernel.org/r/20201119182938.151155-1-Smita.KoralahalliChannabasappa@amd.com
Pull iommu fixes from Will Deacon:
"Two straightforward vt-d fixes:
- Fix boot when intel iommu initialisation fails under TXT (tboot)
- Fix intel iommu compilation error when DMAR is enabled without ATS
and temporarily update IOMMU MAINTAINERs entry"
* tag 'iommu-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
MAINTAINERS: Temporarily add myself to the IOMMU entry
iommu/vt-d: Fix compile error with CONFIG_PCI_ATS not set
iommu/vt-d: Avoid panic if iommu init fails in tboot system
The call to rcu_cpu_starting() in mtrr_ap_init() is not early enough
in the CPU-hotplug onlining process, which results in lockdep splats
as follows:
=============================
WARNING: suspicious RCU usage
5.9.0+ #268 Not tainted
-----------------------------
kernel/kprobes.c:300 RCU-list traversed in non-reader section!!
other info that might help us debug this:
RCU used illegally from offline CPU!
rcu_scheduler_active = 1, debug_locks = 1
no locks held by swapper/1/0.
stack backtrace:
CPU: 1 PID: 0 Comm: swapper/1 Not tainted 5.9.0+ #268
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.10.2-1ubuntu1 04/01/2014
Call Trace:
dump_stack+0x77/0x97
__is_insn_slot_addr+0x15d/0x170
kernel_text_address+0xba/0xe0
? get_stack_info+0x22/0xa0
__kernel_text_address+0x9/0x30
show_trace_log_lvl+0x17d/0x380
? dump_stack+0x77/0x97
dump_stack+0x77/0x97
__lock_acquire+0xdf7/0x1bf0
lock_acquire+0x258/0x3d0
? vprintk_emit+0x6d/0x2c0
_raw_spin_lock+0x27/0x40
? vprintk_emit+0x6d/0x2c0
vprintk_emit+0x6d/0x2c0
printk+0x4d/0x69
start_secondary+0x1c/0x100
secondary_startup_64_no_verify+0xb8/0xbb
This is avoided by moving the call to rcu_cpu_starting up near
the beginning of the start_secondary() function. Note that the
raw_smp_processor_id() is required in order to avoid calling into lockdep
before RCU has declared the CPU to be watched for readers.
Link: https://lore.kernel.org/lkml/160223032121.7002.1269740091547117869.tip-bot2@tip-bot2/
Reported-by: Qian Cai <cai@redhat.com>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
The only usage of the kf_ops field in the rftype struct is to pass
it as argument to __kernfs_create_file(), which accepts a pointer to
const. Make it a pointer to const. This makes it possible to make
rdtgroup_kf_single_ops and kf_mondata_ops const, which allows the
compiler to put them in read-only memory.
Signed-off-by: Rikard Falkeborn <rikard.falkeborn@gmail.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://lkml.kernel.org/r/20201110230228.801785-1-rikard.falkeborn@gmail.com