Userspace could have munmapped the area before it was unmapped from
the gmap. This would leave us with a valid vmaddr, but an invalid vma
from which we would try to zap memory.
Let's check before using the vma.
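A minimal sketch of the resulting check, assuming the loop shape of
gmap_discard(); the surrounding variables (gaddr, to, vmaddr) are taken
from that context:

  struct vm_area_struct *vma;

  /* vmaddr can still look valid after userspace munmapped the VMA */
  vma = find_vma(gmap->mm, vmaddr);
  if (!vma)
          continue;
  size = min(to - gaddr, PMD_SIZE - (gaddr & ~PMD_MASK));
  zap_page_range(vma, vmaddr, size);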
Fixes: 1e133ab296 ("s390/mm: split arch/s390/mm/pgtable.c")
Signed-off-by: Janosch Frank <frankja@linux.ibm.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Message-Id: <20180816082432.78828-1-frankja@linux.ibm.com>
Signed-off-by: Janosch Frank <frankja@linux.ibm.com>
Use the new return type vm_fault_t for fault handlers. For now, this is just
documenting that the function returns a VM_FAULT value rather than an
errno. Once all instances are converted, vm_fault_t will become a
distinct type.
Ref-> commit 1c8f422059 ("mm: change return type to vm_fault_t")
In this patch all the callers of handle_mm_fault() are changed to return
the vm_fault_t type.
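As an illustration, a hedged sketch of a typical caller conversion (not
an exact hunk from the series):

  /* before: a plain int that mixes VM_FAULT_* bits with errno style */
  int fault;

  /* after: the dedicated type documents what the value really is */
  vm_fault_t fault;

  fault = handle_mm_fault(vma, address, flags);
  if (fault & VM_FAULT_ERROR) {
          /* map VM_FAULT_OOM, VM_FAULT_SIGBUS, ... to the arch response */
  }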
Link: http://lkml.kernel.org/r/20180617084810.GA6730@jordon-HP-15-Notebook-PC
Signed-off-by: Souptick Joarder <jrdr.linux@gmail.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Richard Kuo <rkuo@codeaurora.org>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Michal Simek <monstr@monstr.eu>
Cc: James Hogan <jhogan@kernel.org>
Cc: Ley Foon Tan <lftan@altera.com>
Cc: Jonas Bonn <jonas@southpole.se>
Cc: James E.J. Bottomley <jejb@parisc-linux.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Palmer Dabbelt <palmer@sifive.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: David S. Miller <davem@davemloft.net>
Cc: Richard Weinberger <richard@nod.at>
Cc: Guan Xuetao <gxt@pku.edu.cn>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: "Levin, Alexander (Sasha Levin)" <alexander.levin@verizon.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull s390 updates from Heiko Carstens:
"Since Martin is on vacation you get the s390 pull request from me:
- Host large page support for KVM guests. As the patches have large
impact on arch/s390/mm/ this series goes out via both the KVM and
the s390 tree.
- Add an option for no compression to the "Kernel compression mode"
menu, this will come in handy with the rework of the early boot
code.
- A large rework of the early boot code that will make life easier
for KASAN and KASLR. With the rework the bootable uncompressed
image is not generated anymore, only the bzImage is available. For
debugging purposes the new "no compression" option is used.
- Re-enable the gcc plugins as the issue with the latent entropy
plugin is solved with the early boot code rework.
- More spectre related changes:
+ Detect the etoken facility and remove expolines automatically.
+ Add expolines to a few more indirect branches.
- A rewrite of the common I/O layer trace points to make them
consumable by 'perf stat'.
- Add support for format-3 PCI function measurement blocks.
- Changes for the zcrypt driver:
+ Add attributes to indicate the load of cards and queues.
+ Restructure some code for the upcoming AP device support in KVM.
- Build flags improvements in various Makefiles.
- A few fixes for the kdump support.
- A couple of patches for gcc 8 compile warning cleanup.
- Cleanup s390 specific proc handlers.
- Add s390 support to the restartable sequence self tests.
- Some PTR_RET vs PTR_ERR_OR_ZERO cleanup.
- Lots of bug fixes"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: (107 commits)
s390/dasd: fix hanging offline processing due to canceled worker
s390/dasd: fix panic for failed online processing
s390/mm: fix addressing exception after suspend/resume
rseq/selftests: add s390 support
s390: fix br_r1_trampoline for machines without exrl
s390/lib: use expoline for all bcr instructions
s390/numa: move initial setup of node_to_cpumask_map
s390/kdump: Fix elfcorehdr size calculation
s390/cpum_sf: save TOD clock base in SDBs for time conversion
KVM: s390: Add huge page enablement control
s390/mm: Add huge page gmap linking support
s390/mm: hugetlb pages within a gmap can not be freed
KVM: s390: Add skey emulation fault handling
s390/mm: Add huge pmd storage key handling
s390/mm: Clear skeys for newly mapped huge guest pmds
s390/mm: Clear huge page storage keys on enable_skey
s390/mm: Add huge page dirty sync support
s390/mm: Add gmap pmd invalidation and clearing
s390/mm: Add gmap pmd notification bit setting
s390/mm: Add gmap pmd linking
...
Commit c9b5ad546e "s390/mm: tag normal pages vs pages used in page tables"
accidentally changed the logic in arch_set_page_states(), which is used by
the suspend/resume code. set_page_stable(page, order) was changed to
set_page_stable_dat(page, 0). After this, only the first page of higher order
pages will be set to stable, and a write to one of the unstable pages will
result in an addressing exception.
Fix this by using "order" again, instead of "0".
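A sketch of the one-line fix in arch_set_page_states():

  /* broken: only the first page of a 1 << order block became stable */
  set_page_stable_dat(page, 0);

  /* fixed: all pages of the higher order block are set to stable */
  set_page_stable_dat(page, order);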
Fixes: c9b5ad546e ("s390/mm: tag normal pages vs pages used in page tables")
Cc: stable@vger.kernel.org # 4.14+
Reviewed-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Merge tag 'hlp_stage1' of git://git.kernel.org/pub/scm/linux/kernel/git/kvms390/linux into features
Pull hlp_stage1 from Christian Borntraeger with the following changes:
KVM: s390: initial host large page support
- must be enabled via module parameter hpage=1
- cannot be used together with nested
- does support migration
- does support hugetlbfs
- no THP yet
Let's allow huge pmd linking when enabled through the
KVM_CAP_S390_HPAGE_1M capability. We can also now restrict gmap
invalidation and notification to the cases where the capability has
been activated, and save some cycles when that's not the case.
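From userspace the capability is enabled per VM; a hedged sketch (vm_fd
is an assumed, already-created VM file descriptor):

  #include <linux/kvm.h>
  #include <sys/ioctl.h>

  struct kvm_enable_cap cap = {
          .cap = KVM_CAP_S390_HPAGE_1M,
  };

  /* fails e.g. when nested is enabled or hpage=1 is not set */
  if (ioctl(vm_fd, KVM_ENABLE_CAP, &cap) < 0)
          perror("KVM_ENABLE_CAP");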
Signed-off-by: Janosch Frank <frankja@linux.ibm.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Guests backed by huge pages could theoretically free unused pages via
the diagnose 10 instruction. We currently don't allow that, so we
don't have to refault the memory once it's needed again.
Signed-off-by: Dominik Dingel <dingel@linux.vnet.ibm.com>
Reviewed-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Janosch Frank <frankja@linux.ibm.com>
Storage keys for guests with huge page mappings have to be managed in
hardware. There are no PGSTEs for PMDs that we could use to retain the
guest's logical view of the key.
Signed-off-by: Janosch Frank <frankja@linux.vnet.ibm.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Similarly to the pte skey handling, where we set the storage key to
the default key for each newly mapped pte, we also have to do that for
huge pmds.
With the PG_arch_1 flag we keep track of whether the area has already
been cleared of its skeys.
Signed-off-by: Janosch Frank <frankja@linux.ibm.com>
Reviewed-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
When a guest starts using storage keys, we trap and set a default one
for its whole valid address space. With this patch we are now able to
do that for large pages.
To speed up the storage key insertion, we use
__storage_key_init_range, which in turn will use sske_frame to set
multiple storage keys with one instruction. As it has previously been
used for debugging, we have to get rid of the default key check and
make it quiescing.
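A hedged sketch of the once-per-segment pattern, with PG_arch_1 as the
"skeys cleared" marker and paddr as an assumed physical segment address:

  struct page *page = pmd_page(*pmdp);

  /* clear the storage keys of a 1M segment exactly once */
  if (!test_and_set_bit(PG_arch_1, &page->flags))
          __storage_key_init_range(paddr, paddr + HPAGE_SIZE - 1);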
Signed-off-by: Dominik Dingel <dingel@linux.vnet.ibm.com>
Signed-off-by: Janosch Frank <frankja@linux.vnet.ibm.com>
[replaced page_set_storage_key loop with __storage_key_init_range]
Reviewed-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
To do dirty logging with huge pages, we protect huge pmds in the
gmap. When they are written to, we unprotect them and mark them dirty.
We introduce the function gmap_test_and_clear_dirty_pmd which handles
dirty sync for huge pages.
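A rough sketch of how the helper is meant to be used during dirty sync;
mark_segment_dirty() is a hypothetical bookkeeping stand-in:

  /* re-protect the huge pmd; returns true if it had been written to */
  if (gmap_test_and_clear_dirty_pmd(gmap, pmdp, gaddr))
          mark_segment_dirty(gaddr);      /* hypothetical helper */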
Signed-off-by: Janosch Frank <frankja@linux.ibm.com>
Acked-by: David Hildenbrand <david@redhat.com>
If the host invalidates a pmd, we also have to invalidate the
corresponding gmap pmds, as well as flush them from the TLB. This is
necessary, as we don't share the pmd tables between host and guest as
we do with ptes.
The clearing part of these three new functions sets a guest pmd entry
to _SEGMENT_ENTRY_EMPTY, so the guest will fault on it and we will
re-link it.
Flushing the gmap is not necessary in the host's lazy local and csp
cases. Both purge the TLB completely.
Signed-off-by: Janosch Frank <frankja@linux.vnet.ibm.com>
Reviewed-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Acked-by: David Hildenbrand <david@redhat.com>
Like for ptes, we also need invalidation notification for pmds, to
make sure the guest lowcore pages are always accessible and to prepare
for the later addition of shadowed pmds.
With PMDs we do not have PGSTEs or some other bits we could use in the
host PMD. Instead we pick one of the free bits in the gmap PMD. Every
time a host pmd is invalidated, we check whether the respective gmap
PMD has the bit set and, in that case, fire up the notifier.
Signed-off-by: Janosch Frank <frankja@linux.ibm.com>
Let's allow pmds to be linked into the gmap for the upcoming s390 KVM
huge page support.
Before this patch we copied the full userspace pmd entry. This is not
correct, as it contains SW defined bits that might be interpreted
differently in the GMAP context. Now we only copy over the hardware
relevant information, leaving out the software bits.
Signed-off-by: Janosch Frank <frankja@linux.ibm.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Currently we use the software PGSTE bits PGSTE_IN_BIT and
PGSTE_VSIE_BIT to notify before an invalidation occurs on a prefix
page or a VSIE page respectively. Both bits are pgste specific, but
are used when protecting a memory range.
Let's introduce abstract GMAP_NOTIFY_* bits that will be realized into
the respective bits when gmap DAT table entries are protected.
Signed-off-by: Janosch Frank <frankja@linux.vnet.ibm.com>
Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
This patch reworks the gmap_protect_range logic and extracts the pte
handling into its own function. We also now walk to the pmd and make
it accessible in the function for later use. This way we can add huge
page handling logic more easily.
Signed-off-by: Janosch Frank <frankja@linux.vnet.ibm.com>
Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com>
Reviewed-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
When the oom killer kills a userspace process in the page fault handler
while in guest context, the fault handler fails to release the mm_sem
if the FAULT_FLAG_RETRY_NOWAIT option is set. This leads to a deadlock
when tearing down the mm when the process terminates. This bug can only
happen when pfault is enabled, so only KVM clients are affected.
The problem arises in the rare cases in which handle_mm_fault does not
release the mm_sem. This patch fixes the issue by manually releasing
the mm_sem when needed.
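A hedged sketch of the fix in the s390 fault handler; the label is
illustrative:

  fault = handle_mm_fault(vma, address, flags);
  if (fault & VM_FAULT_RETRY) {
          if (flags & FAULT_FLAG_RETRY_NOWAIT) {
                  /* handle_mm_fault kept mmap_sem held in this case,
                   * so drop it manually before bailing out */
                  up_read(&mm->mmap_sem);
                  goto out;       /* illustrative label */
          }
  }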
Fixes: 24eb3a824c ("KVM: s390: Add FAULT_FLAG_RETRY_NOWAIT for guest fault")
Cc: <stable@vger.kernel.org> # 3.15+
Signed-off-by: Claudio Imbrenda <imbrenda@linux.vnet.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
cmm_set_timer could be called concurrently from cmm_thread, the cmm proc
handler, upon cmm smsg receive, and from the timer function itself. To
avoid a potential race condition and hitting the BUG_ON in add_timer on
an already pending timer, simply reuse mod_timer, which according to the
documentation is "the only safe way to modify the timeout" with multiple
unserialized concurrent users. mod_timer can handle both active and
inactive timers, which allows for a minor code simplification as well.
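A minimal sketch of the conversion (the timer and interval names are
assumptions):

  #include <linux/timer.h>

  /* before: racy when several contexts can arm the timer */
  if (!timer_pending(&cmm_timer)) {
          cmm_timer.expires = jiffies + seconds * HZ;
          add_timer(&cmm_timer);  /* BUG_ON if already pending */
  }

  /* after: safe with multiple unserialized concurrent users */
  mod_timer(&cmm_timer, jiffies + seconds * HZ);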
Reviewed-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Split cmm_pages_handler into cmm_pages_handler and
cmm_timed_pages_handler, each handling a separate proc entry, and reuse
proc_doulongvec_minmax to simplify the proc handlers. Min/max values are
optional and are omitted here.
Reviewed-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Since proc_dointvec does not perform value range control,
proc_dointvec_minmax should be used to limit the value range, which is
clearly intended here, given the internal representation of the value:
unsigned int alloc_pgste:1;
In fact it currently works, since we have
mm->context.alloc_pgste = page_table_allocate_pgste || ...
... since commit 23fefe119c ("s390/kvm: avoid global config of vm.alloc_pgste=1")
Before that it was
mm->context.alloc_pgste = page_table_allocate_pgste;
which was broken. That was introduced with commit 0b46e0a3ec ("s390/kvm:
remove delayed reallocation of page tables for KVM").
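A hedged sketch of the resulting sysctl entry, clamping the one-bit
value to 0..1:

  static int zero;
  static int one = 1;

  static struct ctl_table page_table_sysctl[] = {
          {
                  .procname     = "allocate_pgste",
                  .data         = &page_table_allocate_pgste,
                  .maxlen       = sizeof(int),
                  .mode         = 0644,
                  .proc_handler = proc_dointvec_minmax,
                  .extra1       = &zero,  /* lower bound */
                  .extra2       = &one,   /* upper bound */
          },
          { }
  };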
Fixes: 0b46e0a3ec ("s390/kvm: remove delayed reallocation of page tables for KVM")
Acked-by: Christian Borntraeger <borntraeger@de.ibm.com>
Reviewed-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
s390 no longer uses the _mapcount field in struct page to identify
the page table format being used. While the code was diligent in handling
the different mappings, it neglected to turn "off" the map bits when
alloc_pgste was being used. This resulted in bits remaining "on" in the
_refcount field, and thus an artificially huge "in use" count that prevents
the pages from actually being released by __free_page.
There's opportunity for improvement in the "1 vs 3" vs "1U vs 3U" vs
"0x1 vs 0x11" etc. variations for all these calls, I am just keeping
things simple compared to neighboring code.
Fixes: 620b4e9031 ("s390: use _refcount for pgtables")
Reported-by: Halil Pasic <pasic@linux.ibm.com>
Bisected-by: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Eric Farman <farman@linux.ibm.com>
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
arch/s390/mm/extmem.c: In function '__segment_load':
arch/s390/mm/extmem.c:436:2: warning: 'strncat' specified bound 7 equals
source length [-Wstringop-overflow=]
strncat(seg->res_name, " (DCSS)", 7);
What gcc complains about here is the misuse of the strncat function,
which in this case does not limit the number of bytes taken from "src",
so it is in the end the same as strcat(seg->res_name, " (DCSS)");
Keeping in mind that res_name is 15 bytes, strncat in this case
would overflow the buffer and write a 0 into the alignment byte between
the fields in the struct. To avoid that, increase the res_name size to
16 and reuse strlcat.
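A sketch of the replacement; the bound now describes the destination
buffer, not the source:

  char res_name[16];      /* was 15; room for " (DCSS)" plus NUL */

  /* strncat's bound limits bytes taken from the source, not the
   * total size of the destination, so it could not prevent this */
  strlcat(seg->res_name, " (DCSS)", sizeof(seg->res_name));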
Reviewed-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Merge tag 'overflow-v4.18-rc1-part2' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux
Pull more overflow updates from Kees Cook:
"The rest of the overflow changes for v4.18-rc1.
This includes the explicit overflow fixes from Silvio, further
struct_size() conversions from Matthew, and a bug fix from Dan.
But the bulk of it is the treewide conversions to use either the
2-factor argument allocators (e.g. kmalloc(a * b, ...) into
kmalloc_array(a, b, ...)) or the array_size() macros (e.g. vmalloc(a *
b) into vmalloc(array_size(a, b))).
Coccinelle was fighting me on several fronts, so I've done a bunch of
manual whitespace updates in the patches as well.
Summary:
- Error path bug fix for overflow tests (Dan)
- Additional struct_size() conversions (Matthew, Kees)
- Explicitly reported overflow fixes (Silvio, Kees)
- Add missing kvcalloc() function (Kees)
- Treewide conversions of allocators to use either 2-factor argument
variant when available, or array_size() and array3_size() as needed
(Kees)"
* tag 'overflow-v4.18-rc1-part2' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: (26 commits)
treewide: Use array_size in f2fs_kvzalloc()
treewide: Use array_size() in f2fs_kzalloc()
treewide: Use array_size() in f2fs_kmalloc()
treewide: Use array_size() in sock_kmalloc()
treewide: Use array_size() in kvzalloc_node()
treewide: Use array_size() in vzalloc_node()
treewide: Use array_size() in vzalloc()
treewide: Use array_size() in vmalloc()
treewide: devm_kzalloc() -> devm_kcalloc()
treewide: devm_kmalloc() -> devm_kmalloc_array()
treewide: kvzalloc() -> kvcalloc()
treewide: kvmalloc() -> kvmalloc_array()
treewide: kzalloc_node() -> kcalloc_node()
treewide: kzalloc() -> kcalloc()
treewide: kmalloc() -> kmalloc_array()
mm: Introduce kvcalloc()
video: uvesafb: Fix integer overflow in allocation
UBIFS: Fix potential integer overflow in allocation
leds: Use struct_size() in allocation
Convert intel uncore to struct_size
...
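The conversion pattern itself is mechanical; a sketch of the two
variants, with n and the pointers as placeholders:

  #include <linux/overflow.h>
  #include <linux/slab.h>
  #include <linux/vmalloc.h>

  /* 2-factor allocator: the multiplication is overflow-checked */
  ptr = kmalloc_array(n, sizeof(*ptr), GFP_KERNEL);
  /* was: ptr = kmalloc(n * sizeof(*ptr), GFP_KERNEL); */

  /* no 2-factor variant, so compute the size via array_size() */
  buf = vmalloc(array_size(n, sizeof(*buf)));
  /* was: buf = vmalloc(n * sizeof(*buf)); */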
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull KVM updates from Paolo Bonzini:
"Small update for KVM:
ARM:
- lazy context-switching of FPSIMD registers on arm64
- "split" regions for vGIC redistributor
s390:
- cleanups for nested
- clock handling
- crypto
- storage keys
- control register bits
x86:
- many bugfixes
- implement more Hyper-V super powers
- implement lapic_timer_advance_ns even when the LAPIC timer is
emulated using the processor's VMX preemption timer.
- two security-related bugfixes at the top of the branch"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (79 commits)
kvm: fix typo in flag name
kvm: x86: use correct privilege level for sgdt/sidt/fxsave/fxrstor access
KVM: x86: pass kvm_vcpu to kvm_read_guest_virt and kvm_write_guest_virt_system
KVM: x86: introduce linear_{read,write}_system
kvm: nVMX: Enforce cpl=0 for VMX instructions
kvm: nVMX: Add support for "VMWRITE to any supported field"
kvm: nVMX: Restrict VMX capability MSR changes
KVM: VMX: Optimize tscdeadline timer latency
KVM: docs: nVMX: Remove known limitations as they do not exist now
KVM: docs: mmu: KVM support exposing SLAT to guests
kvm: no need to check return value of debugfs_create functions
kvm: Make VM ioctl do valloc for some archs
kvm: Change return type to vm_fault_t
KVM: docs: mmu: Fix link to NPT presentation from KVM Forum 2008
kvm: x86: Amend the KVM_GET_SUPPORTED_CPUID API documentation
KVM: x86: hyperv: declare KVM_CAP_HYPERV_TLBFLUSH capability
KVM: x86: hyperv: simplistic HVCALL_FLUSH_VIRTUAL_ADDRESS_{LIST,SPACE}_EX implementation
KVM: x86: hyperv: simplistic HVCALL_FLUSH_VIRTUAL_ADDRESS_{LIST,SPACE} implementation
KVM: introduce kvm_make_vcpus_request_mask() API
KVM: x86: hyperv: do rep check for each hypercall separately
...
Patch series "Rearrange struct page", v6.
As presented at LSFMM, this patch-set rearranges struct page to give
more contiguous usable space to users who have allocated a struct page
for their own purposes. For a graphical view of before-and-after, see
the first two tabs of
https://docs.google.com/spreadsheets/d/1tvCszs_7FXrjei9_mtFiKV6nW1FLnYyvPvW-qNZhdog/edit?usp=sharing
Highlights:
- deferred_list now really exists in struct page instead of just a comment.
- hmm_data also exists in struct page instead of being a nasty hack.
- x86's PGD pages have a real pointer to the mm_struct.
- VMalloc pages now have all sorts of extra information stored in them
to help with debugging and tuning.
- rcu_head is no longer tied to slab in case anyone else wants to
free pages by RCU.
- slub's counters no longer share space with _refcount.
- slub's freelist+counters are now naturally dword aligned.
- slub loses a parameter to a lot of functions and a sysfs file.
This patch (of 17):
s390 borrows the storage used for _mapcount in struct page in order to
account whether the bottom or top half is being used for 2kB page tables.
I want to use that for something else, so use the top byte of _refcount
instead of the bottom byte of _mapcount. _refcount may temporarily be
incremented by other CPUs that see a stale pointer to this page in the
page cache, but each CPU can only increment it by one, and there are no
systems with 2^24 CPUs today, so they will not change the upper byte of
_refcount. We do have to be a little careful not to lose any of their
writes (as they will subsequently decrement the counter).
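A sketch of why an atomic read-modify-write on the top byte cannot lose
concurrent refcount updates; the helper mirrors the s390-style
atomic_xor_bits and is illustrative:

  #include <linux/atomic.h>

  /* flip tag bits in the top byte; the cmpxchg loop retries until no
   * concurrent +1/-1 update of the low 24 bits intervened */
  static unsigned int atomic_xor_bits(atomic_t *v, unsigned int bits)
  {
          unsigned int old, new;

          do {
                  old = atomic_read(v);
                  new = old ^ bits;
          } while (atomic_cmpxchg(v, old, new) != old);
          return new;
  }

  /* e.g. mark the lower 2kB half of a page as a page table in use */
  atomic_xor_bits(&page->_refcount, 0x01U << 24);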
Link: http://lkml.kernel.org/r/20180518194519.3820-2-willy@infradead.org
Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Up to now we always expected to have the storage key facility
available for our (non-VSIE) KVM guests. For huge page support, we
need to be able to disable it, so let's introduce that now.
We add the use_skf variable to manage KVM storage key facility
usage. We also rename use_skey in the mm context struct to uses_skeys
to make it clearer that it is an indication that the vm actively
uses storage keys.
Signed-off-by: Janosch Frank <frankja@linux.vnet.ibm.com>
Reviewed-by: Farhan Ali <alifm@linux.vnet.ibm.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Acked-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Filling in struct siginfo before calling force_sig_info is a tedious and
error prone process, where once in a great while the wrong fields
are filled out, and siginfo has been inconsistently cleared.
Simplify this process by using the helper force_sig_fault, which
takes as parameters all of the information it needs, ensures
all of the fiddly bits of filling in struct siginfo are done properly,
and then calls force_sig_info.
In short, this is about a 5 line reduction in code for every time
force_sig_info is called, which makes the calling function clearer.
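A sketch of a call site before and after (address is an assumed fault
address):

  /* before: several lines of by-hand siginfo setup per call site */
  struct siginfo info;

  clear_siginfo(&info);
  info.si_signo = SIGSEGV;
  info.si_errno = 0;
  info.si_code  = SEGV_MAPERR;
  info.si_addr  = (void __user *)address;
  force_sig_info(SIGSEGV, &info, current);

  /* after: one self-describing helper call */
  force_sig_fault(SIGSEGV, SEGV_MAPERR, (void __user *)address, current);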
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: linux-s390@vger.kernel.org
Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Call clear_siginfo to ensure every stack allocated siginfo is properly
initialized before being passed to the signal sending functions.
Note: It is not safe to depend on C initializers to initialize struct
siginfo on the stack because C is allowed to skip holes when
initializing a structure.
The initialization of struct siginfo in tracehook_report_syscall_exit
was moved from the helper user_single_step_siginfo into
tracehook_report_syscall_exit itself, to make it clear that the local
variable siginfo gets fully initialized.
In a few cases the scope of struct siginfo has been reduced to make it
clear that siginfo is not used on other paths in the function
in which it is declared.
Instances of using memset to initialize siginfo have been replaced
with calls to clear_siginfo for clarity.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
__get_user_pages_fast handles errors differently from
get_user_pages_fast: the former always returns the number of pages
pinned, the latter might return a negative error code.
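A sketch of the two conventions side by side:

  /* never negative: the number of pages actually pinned, possibly 0 */
  nr = __get_user_pages_fast(start, nr_pages, write, pages);

  /* may return a partial count or a negative error code */
  ret = get_user_pages_fast(start, nr_pages, write, pages);
  if (ret < 0)
          return ret;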
Link: http://lkml.kernel.org/r/1522962072-182137-6-git-send-email-mst@redhat.com
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Thorsten Leemhuis <regressions@leemhuis.info>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "exec: Pin stack limit during exec".
Attempts to solve problems with the stack limit changing during exec
continue to be frustrated[1][2]. In addition to the specific issues
around the Stack Clash family of flaws, Andy Lutomirski pointed out[3]
other places during exec where the stack limit is used and is assumed to
be unchanging. Given the many places it gets used and the fact that it
can be manipulated/raced via setrlimit() and prlimit(), I think the only
way to handle this is to move away from the "current" view of the stack
limit and instead attach it to the bprm, and plumb this down into the
functions that need to know the stack limits. This series implements
the approach.
[1] 04e35f4495 ("exec: avoid RLIMIT_STACK races with prlimit()")
[2] 779f4e1c6c ("Revert "exec: avoid RLIMIT_STACK races with prlimit()"")
[3] to security@kernel.org, "Subject: existing rlimit races?"
This patch (of 3):
Since it is possible that the stack rlimit can change externally during
exec (either via another thread calling setrlimit() or another process
calling prlimit()), provide a way to pass the rlimit down into the
per-architecture mm layout functions so that the rlimit can stay in the
bprm structure instead of sitting in the signal structure until exec is
finalized.
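The end state is visible in the mm-layout hook: the rlimit arrives as a
parameter instead of being read from current. A hedged sketch; the
mmap_base() call stands in for the arch's base calculation:

  void arch_pick_mmap_layout(struct mm_struct *mm,
                             struct rlimit *rlim_stack)
  {
          if (mmap_is_legacy(rlim_stack)) {
                  mm->mmap_base = TASK_UNMAPPED_BASE;
                  mm->get_unmapped_area = arch_get_unmapped_area;
          } else {
                  mm->mmap_base = mmap_base(mm, rlim_stack);
                  mm->get_unmapped_area = arch_get_unmapped_area_topdown;
          }
  }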
Link: http://lkml.kernel.org/r/1518638796-20819-2-git-send-email-keescook@chromium.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Ben Hutchings <ben@decadent.org.uk>
Cc: Willy Tarreau <w@1wt.eu>
Cc: Hugh Dickins <hughd@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: "Jason A. Donenfeld" <Jason@zx2c4.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Laura Abbott <labbott@redhat.com>
Cc: Greg KH <greg@kroah.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Ben Hutchings <ben.hutchings@codethink.co.uk>
Cc: Brad Spengler <spender@grsecurity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Common code defines linker symbols which denote section start/end in
the form of char []. Referencing those symbols as _symbol or &_symbol
yields the same result, but the "_symbol" form is more widespread across
newly written code. Convert s390 specific code to this style.
Also remove the unused _text symbol definition in boot/compressed/misc.c.
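A short sketch of the pattern with the common bss bounds:

  /* section bounds exported by the linker script */
  extern char __bss_start[], __bss_stop[];

  /* "_symbol" and "&_symbol" evaluate to the same address here */
  memset(__bss_start, 0, __bss_stop - __bss_start);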
Signed-off-by: Vasily Gorbik <gor@linux.vnet.ibm.com>
Acked-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Provide base_asce_alloc() and base_asce_free() helper functions which
can be used to allocate an ASCE and all required region, segment and
page tables required to access memory regions of the virtual kernel
address space.
Both the ASCE and all tables do not use any features that correspond
to e.g. enhanced DAT. This is required for some I/O functions
that pass an ASCE, such as some service call requests, which may
not use any enhanced features.
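Usage looks roughly like the following sketch; treat the exact
prototypes as assumptions:

  unsigned long asce;

  /* build region/segment/page tables covering num_pages starting at
   * addr, without enhanced-DAT features, and return a matching ASCE */
  asce = base_asce_alloc(addr, num_pages);
  if (!asce)
          return -ENOMEM;
  /* ... pass the ASCE to the service call / I/O function ... */
  base_asce_free(asce);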
Acked-by: Peter Oberparleiter <oberpar@linux.vnet.ibm.com>
Acked-by: Janosch Frank <frankja@linux.vnet.ibm.com>
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Merge tag 'kvm-4.16-1' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull KVM updates from Radim Krčmář:
"ARM:
- icache invalidation optimizations, improving VM startup time
- support for forwarded level-triggered interrupts, improving
performance for timers and passthrough platform devices
- a small fix for power-management notifiers, and some cosmetic
changes
PPC:
- add MMIO emulation for vector loads and stores
- allow HPT guests to run on a radix host on POWER9 v2.2 CPUs without
requiring the complex thread synchronization of older CPU versions
- improve the handling of escalation interrupts with the XIVE
interrupt controller
- support decrement register migration
- various cleanups and bugfixes.
s390:
- Cornelia Huck passed maintainership to Janosch Frank
- exitless interrupts for emulated devices
- cleanup of cpuflag handling
- kvm_stat counter improvements
- VSIE improvements
- mm cleanup
x86:
- hypervisor part of SEV
- UMIP, RDPID, and MSR_SMI_COUNT emulation
- paravirtualized TLB shootdown using the new KVM_VCPU_PREEMPTED bit
- allow guests to see TOPOEXT, GFNI, VAES, VPCLMULQDQ, and more
AVX512 features
- show vcpu id in its anonymous inode name
- many fixes and cleanups
- per-VCPU MSR bitmaps (already merged through x86/pti branch)
- stable KVM clock when nesting on Hyper-V (merged through
x86/hyperv)"
* tag 'kvm-4.16-1' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (197 commits)
KVM: PPC: Book3S: Add MMIO emulation for VMX instructions
KVM: PPC: Book3S HV: Branch inside feature section
KVM: PPC: Book3S HV: Make HPT resizing work on POWER9
KVM: PPC: Book3S HV: Fix handling of secondary HPTEG in HPT resizing code
KVM: PPC: Book3S PR: Fix broken select due to misspelling
KVM: x86: don't forget vcpu_put() in kvm_arch_vcpu_ioctl_set_sregs()
KVM: PPC: Book3S PR: Fix svcpu copying with preemption enabled
KVM: PPC: Book3S HV: Drop locks before reading guest memory
kvm: x86: remove efer_reload entry in kvm_vcpu_stat
KVM: x86: AMD Processor Topology Information
x86/kvm/vmx: do not use vm-exit instruction length for fast MMIO when running nested
kvm: embed vcpu id to dentry of vcpu anon inode
kvm: Map PFN-type memory regions as writable (if possible)
x86/kvm: Make it compile on 32bit and with HYPYERVISOR_GUEST=n
KVM: arm/arm64: Fixup userspace irqchip static key optimization
KVM: arm/arm64: Fix userspace_irqchip_in_use counting
KVM: arm/arm64: Fix incorrect timer_is_pending logic
MAINTAINERS: update KVM/s390 maintainers
MAINTAINERS: add Halil as additional vfio-ccw maintainer
MAINTAINERS: add David as a reviewer for KVM/s390
...
We never call it with anything but PROT_READ. This is a leftover from
an old prototype. For creation of shadow page tables, we always only
have to protect the original table in guest memory from write accesses,
so we can properly invalidate the shadow on writes. Other protections
are not needed.
Signed-off-by: David Hildenbrand <david@redhat.com>
Message-Id: <20180123212618.32611-1-david@redhat.com>
Reviewed-by: Janosch Frank <frankja@linux.vnet.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
It seems it hasn't even been used before the last cleanup and was
overlooked.
Signed-off-by: Janosch Frank <frankja@linux.vnet.ibm.com>
Message-Id: <1513169613-13509-12-git-send-email-frankja@linux.vnet.ibm.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
gmap_mprotect_notify() refuses shadow gmaps. Turns out that
a) gmap_protect_range()
b) gmap_read_table()
c) gmap_pte_op_walk()
are never called for gmap shadows, and never should be. This dates back
to gmap shadow prototypes where we allowed calling mprotect_notify() on
the gmap shadow (to get notified about the prefix pages getting removed).
This is avoided by always getting notified about any change on the gmap
shadow.
The only real function for walking page tables on shadow gmaps is
gmap_table_walk().
So, essentially, these functions should never get called and
gmap_pte_op_walk() can be cleaned up. Add some checks to callers of
gmap_pte_op_walk().
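A sketch of the kind of check added to the callers:

  static int gmap_protect_range(struct gmap *gmap, unsigned long gaddr,
                                unsigned long len, int prot,
                                unsigned long bits)
  {
          BUG_ON(gmap_is_shadow(gmap));   /* never legal for shadows */
          /* ... walk and protect the ptes ... */
  }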
Signed-off-by: David Hildenbrand <david@redhat.com>
Message-Id: <20171110151805.7541-1-david@redhat.com>
Reviewed-by: Janosch Frank <frankja@linux.vnet.ibm.com>
Acked-by: Cornelia Huck <cohuck@redhat.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
We can just pass this on instead of having to do a radix tree lookup
without proper locking a few levels into the callchain.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
We can just pass this on instead of having to do a radix tree lookup
without proper locking 2 levels into the callchain.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
We can just pass this on instead of having to do a radix tree lookup
without proper locking a few levels into the callchain.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
We can just pass this on instead of having to do a radix tree lookup
without proper locking 2 levels into the callchain.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Martin Cermak reported that setting a uprobe doesn't work. Reason for
this is that the common uprobes code tries to get an unmapped area at
the last possible page within an address space.
This broke with commit 1aea9b3f92 ("s390/mm: implement 5 level pages
tables") which introduced an off-by-one bug which prevents to map
anything at the last possible page within an address space.
The check with the off-by-one bug however can be removed since with
commit 8ab867cb08 ("s390/mm: fix BUG_ON in crst_table_upgrade") the
necessary check is done at both call sites.
Reported-by: Martin Cermak <mcermak@redhat.com>
Bisected-by: Thomas Richter <tmricht@linux.vnet.ibm.com>
Fixes: 1aea9b3f92 ("s390/mm: implement 5 level pages tables")
Cc: <stable@vger.kernel.org> # v4.13+
Reviewed-by: Hendrik Brueckner <brueckner@linux.vnet.ibm.com>
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Pull s390 fixes from Martin Schwidefsky:
- SPDX identifiers are added to more of the s390 specific files.
- The ELF_ET_DYN_BASE base patch from Kees is reverted, with the change
some old 31-bit programs crash.
- Bug fixes and cleanups.
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: (29 commits)
s390/gs: add compat regset for the guarded storage broadcast control block
s390: revert ELF_ET_DYN_BASE base changes
s390: Remove redundant license text
s390: crypto: Remove redundant license text
s390: include: Remove redundant license text
s390: kernel: Remove redundant license text
s390: add SPDX identifiers to the remaining files
s390: appldata: add SPDX identifiers to the remaining files
s390: pci: add SPDX identifiers to the remaining files
s390: mm: add SPDX identifiers to the remaining files
s390: crypto: add SPDX identifiers to the remaining files
s390: kernel: add SPDX identifiers to the remaining files
s390: sthyi: add SPDX identifiers to the remaining files
s390: drivers: Remove redundant license text
s390: crypto: Remove redundant license text
s390: virtio: add SPDX identifiers to the remaining files
s390: scsi: zfcp_aux: add SPDX identifier
s390: net: add SPDX identifiers to the remaining files
s390: char: add SPDX identifiers to the remaining files
s390: cio: add SPDX identifiers to the remaining files
...
Now that the SPDX tag is in all arch/s390/ files, it identifies the
license in a specific and legally-defined manner. So the extra GPL text
wording in the remaining files can be removed, as it is no longer needed
at all.
This is done on a quest to remove the 700+ different ways that files in
the kernel describe the GPL license text. And there's unneeded stuff
like the address (sometimes incorrect) for the FSF which is never
needed.
No copyright headers or other non-license-description text was removed.
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
It's good to have SPDX identifiers in all files to make it easier to
audit the kernel tree for correct licenses.
Update the arch/s390/mm/ files with the correct SPDX license
identifier based on the license text in the file itself. The SPDX
identifier is a legally binding shorthand, which can be used instead of
the full boiler plate text.
This work is based on a script and data from Thomas Gleixner, Philippe
Ombredanne, and Kate Stewart.
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Cornelia Huck <cohuck@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Kate Stewart <kstewart@linuxfoundation.org>
Cc: Philippe Ombredanne <pombredanne@nexb.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
In preparation for unconditionally passing the struct timer_list pointer to
all timer callbacks, switch to using the new timer_setup() and from_timer()
to pass the timer pointer explicitly.
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Guenter Roeck <linux@roeck-us.net>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: linux-s390@vger.kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
The hardware sampler creates samples that are processed at a later
point in time. The PID and TID values of the perf samples that are
created for hardware samples are initialized with values from the
current task. Hence, the PID and TID values are not correct and
perf samples are associated with wrong processes.
The PID and TID values are obtained from the Host Program Parameter
(HPP) field in the basic-sampling data entries. These PIDs are
valid in the init PID namespace. Ensure that the PIDs in the perf
samples are resolved considering the PID namespace in which the
perf event was created.
To correct the PID and TID values in the created perf samples,
a special overflow handler is installed. It replaces the default
overflow handler and does not become effective if any other
overflow handler is used. With the special overflow handler most
of the perf samples are associated with the right processes.
For processes that no longer exist, the association might
still be wrong.
Signed-off-by: Hendrik Brueckner <brueckner@linux.vnet.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
The vdso code for the getcpu() and the clock_gettime() call use the access
register mode to access the per-CPU vdso data page with the current code.
An alternative to the complicated AR mode is to use the secondary space
mode. This makes the vdso faster and quite a bit simpler. The downside is
that the uaccess code has to be changed quite a bit.
Which instructions are used depends on the machine and what kind of uaccess
operation is requested. The instruction dictates which ASCE value needs
to be loaded into %cr1 and %cr7.
The different cases:
* User copy with MVCOS for z10 and newer machines
The MVCOS instruction can copy between the primary space (aka user) and
the home space (aka kernel) directly. For set_fs(KERNEL_DS) the kernel
ASCE is loaded into %cr1. For set_fs(USER_DS) the user space is already
loaded in %cr1.
* User copy with MVCP/MVCS for older machines
To be able to execute the MVCP/MVCS instructions the kernel needs to
switch to primary mode. The control register %cr1 has to be set to the
kernel ASCE and %cr7 to either the kernel ASCE or the user ASCE dependent
on set_fs(KERNEL_DS) vs set_fs(USER_DS).
* Data access in the user address space for strnlen / futex
To use "normal" instruction with data from the user address space the
secondary space mode is used. The kernel needs to switch to primary mode,
%cr1 has to contain the kernel ASCE and %cr7 either the user ASCE or the
kernel ASCE, dependent on set_fs.
To load a new value into %cr1 or %cr7 is an expensive operation, the kernel
tries to be lazy about it. E.g. for multiple user copies in a row with
MVCP/MVCS the replacement of the vdso ASCE in %cr7 with the user ASCE is
done only once. On return to user space a CPU bit is checked that loads the
vdso ASCE again.
To enable and disable the data access via the secondary space two new
functions are added, enable_sacf_uaccess and disable_sacf_uaccess. The fact
that a context is in secondary space uaccess mode is stored in the
mm_segment_t value for the task. The code of an interrupt may use set_fs
as long as it returns to the previous state it got with get_fs with another
call to set_fs. The code in finish_arch_post_lock_switch simply has to do a
set_fs with the current mm_segment_t value for the task.
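A sketch of how a secondary-space access is bracketed by the new
helpers:

  mm_segment_t old_fs;

  old_fs = enable_sacf_uaccess();  /* set up %cr1/%cr7, enter sacf mode */
  /* ... MVCP/MVCS copy or futex/strnlen access via secondary space ... */
  disable_sacf_uaccess(old_fs);    /* restore the previous state */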
For CPUs with MVCOS:
CPU running in                       | %cr1 ASCE | %cr7 ASCE |
-------------------------------------|-----------|-----------|
user space                           | user      | vdso      |
kernel, USER_DS, normal-mode         | user      | vdso      |
kernel, USER_DS, normal-mode, lazy   | user      | user      |
kernel, USER_DS, sacf-mode           | kernel    | user      |
kernel, KERNEL_DS, normal-mode       | kernel    | vdso      |
kernel, KERNEL_DS, normal-mode, lazy | kernel    | kernel    |
kernel, KERNEL_DS, sacf-mode         | kernel    | kernel    |
For CPUs without MVCOS:
CPU running in                       | %cr1 ASCE | %cr7 ASCE |
-------------------------------------|-----------|-----------|
user space                           | user      | vdso      |
kernel, USER_DS, normal-mode         | user      | vdso      |
kernel, USER_DS, normal-mode, lazy   | kernel    | user      |
kernel, USER_DS, sacf-mode           | kernel    | user      |
kernel, KERNEL_DS, normal-mode       | kernel    | vdso      |
kernel, KERNEL_DS, normal-mode, lazy | kernel    | kernel    |
kernel, KERNEL_DS, sacf-mode         | kernel    | kernel    |
The lines with "lazy" refer to the state after a copy via the secondary
space with a delayed reload of %cr1 and %cr7.
There are three hardware address spaces that can cause a DAT exception,
primary, secondary and home space. The exception can be related to
four different fault types: user space fault, vdso fault, kernel fault,
and the gmap faults.
Dependent on the set_fs state and normal vs. sacf mode there are a number
of fault combinations:
1) user address space fault via the primary ASCE
2) gmap address space fault via the primary ASCE
3) kernel address space fault via the primary ASCE for machines with
MVCOS and set_fs(KERNEL_DS)
4) vdso address space faults via the secondary ASCE with an invalid
address while running in secondary space in problem state
5) user address space fault via the secondary ASCE for user-copy
based on the secondary space mode, e.g. futex_ops or strnlen_user
6) kernel address space fault via the secondary ASCE for user-copy
with secondary space mode with set_fs(KERNEL_DS)
7) kernel address space fault via the primary ASCE for user-copy
with secondary space mode with set_fs(USER_DS) on machines without
MVCOS.
8) kernel address space fault via the home space ASCE
Replace user_space_fault() with a new function get_fault_type() that
can distinguish all four different fault types.
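A sketch of the distinction the new helper draws (the enum mirrors
arch/s390/mm/fault.c):

  enum fault_type {
          KERNEL_FAULT,
          USER_FAULT,
          VDSO_FAULT,
          GMAP_FAULT,
  };

  static enum fault_type get_fault_type(struct pt_regs *regs);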
With these changes the futex atomic ops from the kernel and the
strnlen_user will get a little bit slower, as well as the old style
uaccess with MVCP/MVCS. All user accesses based on MVCOS will be as
fast as before. On the positive side, the user space vdso code is a
lot faster and Linux ceases to use the complicated AR mode.
Reviewed-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
The identification of guest faults currently relies on the PF_VCPU flag.
This is set in guest_entry_irqoff and cleared in guest_exit_irqoff.
Both functions are called by __vcpu_run, so the PF_VCPU flag is set for
quite a lot of kernel code outside of the actual guest execution.
Replace the PF_VCPU scheme with PIF_GUEST_FAULT in the pt_regs and
make the program check handler code in entry.S set the bit only for
exceptions that occurred between the .Lsie_gmap and .Lsie_done labels.
Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>