Commit Graph

3830 Commits

Author SHA1 Message Date
Linus Torvalds
c42b59bfaa - Convert a couple of paravirt callbacks to asm to prevent
-fzero-call-used-regs builds from zeroing live registers because
   paravirt hides the CALLs from the compiler so latter doesn't know
   there's a CALL in the first place
 
 - Merge two paravirt callbacks into one, as their functionality is
   identical
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmRGjhwACgkQEsHwGGHe
 VUpRUA//QduAqsDoIJ1U/y7JxbLriFR9X9rCY4u5pq2ZSroQZ861eBeUCJnlc+MQ
 +zlAsmGkLpb9b5be85vMByz++rO4ZTfBVamqJORD9Zj99RY0F7ym/HYXK4CP6J+E
 IrgMTTLPd8kMH/5Pb/tiNXOuKnNw99MdKhE5CPKGnvtM7eSzLrIuN9sHAUb9SMuV
 l+TWYh4vkf8+XfzSpp5WYaCgyDBN8tMWD3cBeLTljT3OEOh9vIQYWjRliKQyxjWG
 FJ8BnL8Nx+3kDkRjHyK4/h0P0KQYB6hnRSOrZyaae2H3N7uSMQbcLuRC6aXz1amm
 9AKoubhzx/A5hwGx8jKtGuLCkEtSakdcbiF0l3gek3Auecxcg6x8W+cCNvpq8FGV
 DJ349RPqR7TlKJwyvPp7dHRozVrY2sdbWZILxLhKDvAoOR4F927dt9+A96glc5dP
 VTnrlptj1vX+dSkKgKRTmPUKbsXM2h003qTiAUVzjMP0PcKUKknpBhz7kLQ3gpFc
 7rxyjHWANQJpY39WHvuIv+pzVUodrUGioA1LcEisx8FCM/iAIoejLi+ybbRMyc/2
 NN3TMxoEl3RIQCOFgsM8NxAvOL9P6+82NiM+0v0TgzszMlso7RzbjBeaaWRtxX+O
 82p9mTLDQuxESkA0HEwoTQa/xfO51zCi+SeLfhFO6A4s93Sjjb0=
 =Eu5f
 -----END PGP SIGNATURE-----

Merge tag 'x86_paravirt_for_v6.4_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 paravirt updates from Borislav Petkov:

 - Convert a couple of paravirt callbacks to asm to prevent
   '-fzero-call-used-regs' builds from zeroing live registers because
   paravirt hides the CALLs from the compiler so latter doesn't know
   there's a CALL in the first place

 - Merge two paravirt callbacks into one, as their functionality is
   identical

* tag 'x86_paravirt_for_v6.4_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/paravirt: Convert simple paravirt functions to asm
  x86/paravirt: Merge activate_mm() and dup_mmap() callbacks
2023-04-25 10:32:51 -07:00
Linus Torvalds
c23f28975a Commit volume in documentation is relatively low this time, but there is
still a fair amount going on, including:
 
 - Reorganizing the architecture-specific documentation under
   Documentation/arch.  This makes the structure match the source directory
   and helps to clean up the mess that is the top-level Documentation
   directory a bit.  This work creates the new directory and moves x86 and
   most of the less-active architectures there.  The current plan is to move
   the rest of the architectures in 6.5, with the patches going through the
   appropriate subsystem trees.
 
 - Some more Spanish translations and maintenance of the Italian
   translation.
 
 - A new "Kernel contribution maturity model" document from Ted.
 
 - A new tutorial on quickly building a trimmed kernel from Thorsten.
 
 Plus the usual set of updates and fixes.
 -----BEGIN PGP SIGNATURE-----
 
 iQFDBAABCAAtFiEEIw+MvkEiF49krdp9F0NaE2wMflgFAmRGze0PHGNvcmJldEBs
 d24ubmV0AAoJEBdDWhNsDH5Y/VsH/RyWqinorRVFZmHqRJMRhR0j7hE2pAgK5prE
 dGXYVtHHNQ+25thNaqhZTOLYFbSX6ii2NG7sLRXmyOTGIZrhUCFFXCHkuq4ZUypR
 gJpMUiKQVT4dhln3gIZ0k09NSr60gz8UTcq895N9UFpUdY1SCDhbCcLc4uXTRajq
 NrdgFaHWRkPb+gBRbXOExYm75DmCC6Ny5AyGo2rXfItV//ETjWIJVQpJhlxKrpMZ
 3LgpdYSLhEFFnFGnXJ+EAPJ7gXDi2Tg5DuPbkvJyFOTouF3j4h8lSS9l+refMljN
 xNRessv+boge/JAQidS6u8F2m2ESSqSxisv/0irgtKIMJwXaoX4=
 =1//8
 -----END PGP SIGNATURE-----

Merge tag 'docs-6.4' of git://git.lwn.net/linux

Pull documentation updates from Jonathan Corbet:
 "Commit volume in documentation is relatively low this time, but there
  is still a fair amount going on, including:

   - Reorganize the architecture-specific documentation under
     Documentation/arch

     This makes the structure match the source directory and helps to
     clean up the mess that is the top-level Documentation directory a
     bit. This work creates the new directory and moves x86 and most of
     the less-active architectures there.

     The current plan is to move the rest of the architectures in 6.5,
     with the patches going through the appropriate subsystem trees.

   - Some more Spanish translations and maintenance of the Italian
     translation

   - A new "Kernel contribution maturity model" document from Ted

   - A new tutorial on quickly building a trimmed kernel from Thorsten

  Plus the usual set of updates and fixes"

* tag 'docs-6.4' of git://git.lwn.net/linux: (47 commits)
  media: Adjust column width for pdfdocs
  media: Fix building pdfdocs
  docs: clk: add documentation to log which clocks have been disabled
  docs: trace: Fix typo in ftrace.rst
  Documentation/process: always CC responsible lists
  docs: kmemleak: adjust to config renaming
  ELF: document some de-facto PT_* ABI quirks
  Documentation: arm: remove stih415/stih416 related entries
  docs: turn off "smart quotes" in the HTML build
  Documentation: firmware: Clarify firmware path usage
  docs/mm: Physical Memory: Fix grammar
  Documentation: Add document for false sharing
  dma-api-howto: typo fix
  docs: move m68k architecture documentation under Documentation/arch/
  docs: move parisc documentation under Documentation/arch/
  docs: move ia64 architecture docs under Documentation/arch/
  docs: Move arc architecture docs under Documentation/arch/
  docs: move nios2 documentation under Documentation/arch/
  docs: move openrisc documentation under Documentation/arch/
  docs: move superh documentation under Documentation/arch/
  ...
2023-04-24 12:35:49 -07:00
Nick Alcock
569e4d25f4 x86/mm/dump_pagetables: remove MODULE_LICENSE in non-modules
Since commit 8b41fc4454 ("kbuild: create modules.builtin without
Makefile.modbuiltin or tristate.conf"), MODULE_LICENSE declarations
are used to identify modules. As a consequence, uses of the macro
in non-modules will cause modprobe to misidentify their containing
object file as a module when it is not (false positives), and modprobe
might succeed rather than failing with a suitable error message.

So remove it in the files in this commit, none of which can be built as
modules.

Signed-off-by: Nick Alcock <nick.alcock@oracle.com>
Suggested-by: Luis Chamberlain <mcgrof@kernel.org>
Cc: Luis Chamberlain <mcgrof@kernel.org>
Cc: linux-modules@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: Hitomi Hasegawa <hasegawa-hitomi@fujitsu.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: x86@kernel.org
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
2023-04-13 13:13:54 -07:00
Suren Baghdasaryan
0bff0aaea0 x86/mm: try VMA lock-based page fault handling first
Attempt VMA lock-based page fault handling first, and fall back to the
existing mmap_lock-based handling if that fails.

Link: https://lkml.kernel.org/r/20230227173632.3292573-30-surenb@google.com
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-04-05 20:03:01 -07:00
Jonathan Corbet
ff61f0791c docs: move x86 documentation into Documentation/arch/
Move the x86 documentation under Documentation/arch/ as a way of cleaning
up the top-level directory and making the structure of our docs more
closely match the structure of the source directories it describes.

All in-kernel references to the old paths have been updated.

Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: linux-arch@vger.kernel.org
Cc: x86@kernel.org
Cc: Borislav Petkov <bp@alien8.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/lkml/20230315211523.108836-1-corbet@lwn.net/
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
2023-03-30 12:58:51 -06:00
Ma Wupeng
d155df53f3 x86/mm/pat: clear VM_PAT if copy_p4d_range failed
Syzbot reports a warning in untrack_pfn().  Digging into the root we found
that this is due to memory allocation failure in pmd_alloc_one.  And this
failure is produced due to failslab.

In copy_page_range(), memory alloaction for pmd failed.  During the error
handling process in copy_page_range(), mmput() is called to remove all
vmas.  While untrack_pfn this empty pfn, warning happens.

Here's a simplified flow:

dup_mm
  dup_mmap
    copy_page_range
      copy_p4d_range
        copy_pud_range
          copy_pmd_range
            pmd_alloc
              __pmd_alloc
                pmd_alloc_one
                  page = alloc_pages(gfp, 0);
                    if (!page)
                      return NULL;
    mmput
        exit_mmap
          unmap_vmas
            unmap_single_vma
              untrack_pfn
                follow_phys
                  WARN_ON_ONCE(1);

Since this vma is not generate successfully, we can clear flag VM_PAT.  In
this case, untrack_pfn() will not be called while cleaning this vma.

Function untrack_pfn_moved() has also been renamed to fit the new logic.

Link: https://lkml.kernel.org/r/20230217025615.1595558-1-mawupeng1@huawei.com
Signed-off-by: Ma Wupeng <mawupeng1@huawei.com>
Reported-by: <syzbot+5f488e922d047d8f00cc@syzkaller.appspotmail.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@suse.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: Toshi Kani <toshi.kani@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-03-28 16:20:07 -07:00
Michael Kelley
812b0597fb x86/hyperv: Change vTOM handling to use standard coco mechanisms
Hyper-V guests on AMD SEV-SNP hardware have the option of using the
"virtual Top Of Memory" (vTOM) feature specified by the SEV-SNP
architecture. With vTOM, shared vs. private memory accesses are
controlled by splitting the guest physical address space into two
halves.

vTOM is the dividing line where the uppermost bit of the physical
address space is set; e.g., with 47 bits of guest physical address
space, vTOM is 0x400000000000 (bit 46 is set).  Guest physical memory is
accessible at two parallel physical addresses -- one below vTOM and one
above vTOM.  Accesses below vTOM are private (encrypted) while accesses
above vTOM are shared (decrypted). In this sense, vTOM is like the
GPA.SHARED bit in Intel TDX.

Support for Hyper-V guests using vTOM was added to the Linux kernel in
two patch sets[1][2]. This support treats the vTOM bit as part of
the physical address. For accessing shared (decrypted) memory, these
patch sets create a second kernel virtual mapping that maps to physical
addresses above vTOM.

A better approach is to treat the vTOM bit as a protection flag, not
as part of the physical address. This new approach is like the approach
for the GPA.SHARED bit in Intel TDX. Rather than creating a second kernel
virtual mapping, the existing mapping is updated using recently added
coco mechanisms.

When memory is changed between private and shared using
set_memory_decrypted() and set_memory_encrypted(), the PTEs for the
existing kernel mapping are changed to add or remove the vTOM bit in the
guest physical address, just as with TDX. The hypercalls to change the
memory status on the host side are made using the existing callback
mechanism. Everything just works, with a minor tweak to map the IO-APIC
to use private accesses.

To accomplish the switch in approach, the following must be done:

* Update Hyper-V initialization to set the cc_mask based on vTOM
  and do other coco initialization.

* Update physical_mask so the vTOM bit is no longer treated as part
  of the physical address

* Remove CC_VENDOR_HYPERV and merge the associated vTOM functionality
  under CC_VENDOR_AMD. Update cc_mkenc() and cc_mkdec() to set/clear
  the vTOM bit as a protection flag.

* Code already exists to make hypercalls to inform Hyper-V about pages
  changing between shared and private.  Update this code to run as a
  callback from __set_memory_enc_pgtable().

* Remove the Hyper-V special case from __set_memory_enc_dec()

* Remove the Hyper-V specific call to swiotlb_update_mem_attributes()
  since mem_encrypt_init() will now do it.

* Add a Hyper-V specific implementation of the is_private_mmio()
  callback that returns true for the IO-APIC and vTPM MMIO addresses

  [1] https://lore.kernel.org/all/20211025122116.264793-1-ltykernel@gmail.com/
  [2] https://lore.kernel.org/all/20211213071407.314309-1-ltykernel@gmail.com/

  [ bp: Touchups. ]

Signed-off-by: Michael Kelley <mikelley@microsoft.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/1679838727-87310-7-git-send-email-mikelley@microsoft.com
2023-03-27 09:31:43 +02:00
Michael Kelley
c7b5254bd8 x86/mm: Handle decryption/re-encryption of bss_decrypted consistently
sme_postprocess_startup() decrypts the bss_decrypted section when
sme_me_mask is non-zero.

mem_encrypt_free_decrypted_mem() re-encrypts the unused portion based
on CC_ATTR_MEM_ENCRYPT.

In a Hyper-V guest VM using vTOM, these conditions are not equivalent
as sme_me_mask is always zero when using vTOM. Consequently,
mem_encrypt_free_decrypted_mem() attempts to re-encrypt memory that was
never decrypted.

So check sme_me_mask in mem_encrypt_free_decrypted_mem() too.

Hyper-V guests using vTOM don't need the bss_decrypted section to be
decrypted, so skipping the decryption/re-encryption doesn't cause a
problem.

Signed-off-by: Michael Kelley <mikelley@microsoft.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Link: https://lore.kernel.org/r/1678329614-3482-5-git-send-email-mikelley@microsoft.com
2023-03-27 09:23:21 +02:00
Michael Kelley
88e378d400 x86/ioremap: Add hypervisor callback for private MMIO mapping in coco VM
Current code always maps MMIO devices as shared (decrypted) in a
confidential computing VM. But Hyper-V guest VMs on AMD SEV-SNP with vTOM
use a paravisor running in VMPL0 to emulate some devices, such as the
IO-APIC and TPM. In such a case, the device must be accessed as private
(encrypted) because the paravisor emulates the device at an address below
vTOM, where all accesses are encrypted.

Add a new hypervisor callback to determine if an MMIO address should
be mapped private. The callback allows hypervisor-specific code to handle
any quirks, the use of a paravisor, etc. in determining whether a mapping
must be private. If the callback is not used by a hypervisor, default
to returning "false", which is consistent with normal coco VM behavior.

Use this callback as another special case to check for when doing
ioremap().  Just checking the starting address is sufficient as an
ioremap range must be all private or all shared.

Also make the callback in early boot IO-APIC mapping code that uses the
fixmap.

  [ bp: Touchups. ]

Signed-off-by: Michael Kelley <mikelley@microsoft.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/1678329614-3482-2-git-send-email-mikelley@microsoft.com
2023-03-26 23:42:40 +02:00
Michal Koutný
a3f547addc x86/mm: Do not shuffle CPU entry areas without KASLR
The commit 97e3d26b5e ("x86/mm: Randomize per-cpu entry area") fixed
an omission of KASLR on CPU entry areas. It doesn't take into account
KASLR switches though, which may result in unintended non-determinism
when a user wants to avoid it (e.g. debugging, benchmarking).

Generate only a single combination of CPU entry areas offsets -- the
linear array that existed prior randomization when KASLR is turned off.

Since we have 3f148f3318 ("x86/kasan: Map shadow for percpu pages on
demand") and followups, we can use the more relaxed guard
kasrl_enabled() (in contrast to kaslr_memory_enabled()).

Fixes: 97e3d26b5e ("x86/mm: Randomize per-cpu entry area")
Signed-off-by: Michal Koutný <mkoutny@suse.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/all/20230306193144.24605-1-mkoutny%40suse.com
2023-03-22 10:42:47 -07:00
Kirill A. Shutemov
74c228d20a x86/uaccess: Provide untagged_addr() and remove tags before address check
untagged_addr() is a helper used by the core-mm to strip tag bits and
get the address to the canonical shape based on rules of the current
thread. It only handles userspace addresses.

The untagging mask is stored in per-CPU variable and set on context
switching to the task.

The tags must not be included into check whether it's okay to access the
userspace address. Strip tags in access_ok().

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Alexander Potapenko <glider@google.com>
Link: https://lore.kernel.org/all/20230312112612.31869-7-kirill.shutemov%40linux.intel.com
2023-03-16 13:08:39 -07:00
Kirill A. Shutemov
82721d8b25 x86/mm: Handle LAM on context switch
Linear Address Masking mode for userspace pointers encoded in CR3 bits.
The mode is selected per-process and stored in mm_context_t.

switch_mm_irqs_off() now respects selected LAM mode and constructs CR3
accordingly.

The active LAM mode gets recorded in the tlb_state.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Alexander Potapenko <glider@google.com>
Link: https://lore.kernel.org/all/20230312112612.31869-5-kirill.shutemov%40linux.intel.com
2023-03-16 13:08:39 -07:00
Nikita Zhandarovich
cbebd68f59 x86/mm: Fix use of uninitialized buffer in sme_enable()
cmdline_find_option() may fail before doing any initialization of
the buffer array. This may lead to unpredictable results when the same
buffer is used later in calls to strncmp() function.  Fix the issue by
returning early if cmdline_find_option() returns an error.

Found by Linux Verification Center (linuxtesting.org) with static
analysis tool SVACE.

Fixes: aca20d5462 ("x86/mm: Add support to make use of Secure Memory Encryption")
Signed-off-by: Nikita Zhandarovich <n.zhandarovich@fintech.ru>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Acked-by: Tom Lendacky <thomas.lendacky@amd.com>
Cc: <stable@kernel.org>
Link: https://lore.kernel.org/r/20230306160656.14844-1-n.zhandarovich@fintech.ru
2023-03-16 12:22:25 +01:00
Juergen Gross
c9ae1b10d9 x86/paravirt: Merge activate_mm() and dup_mmap() callbacks
The two paravirt callbacks .mmu.activate_mm() and .mmu.dup_mmap() are
sharing the same implementations in all cases: for Xen PV guests they
are pinning the PGD of the new mm_struct, and for all other cases they
are a NOP.

In the end, both callbacks are meant to register an address space with
the underlying hypervisor, so there needs to be only a single callback
for that purpose.

So merge them to a common callback .mmu.enter_mmap() (in contrast to the
corresponding already existing .mmu.exit_mmap()).

As the first parameter of the old callbacks isn't used, drop it from the
replacement one.

  [ bp: Remove last occurrence of paravirt_activate_mm() in
    asm/mmu_context.h ]

Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Reviewed-by: Srivatsa S. Bhat (VMware) <srivatsa@csail.mit.edu>
Link: https://lore.kernel.org/r/20230207075902.7539-1-jgross@suse.com
2023-03-06 09:41:37 +01:00
Linus Torvalds
3822a7c409 - Daniel Verkamp has contributed a memfd series ("mm/memfd: add
F_SEAL_EXEC") which permits the setting of the memfd execute bit at
   memfd creation time, with the option of sealing the state of the X bit.
 
 - Peter Xu adds a patch series ("mm/hugetlb: Make huge_pte_offset()
   thread-safe for pmd unshare") which addresses a rare race condition
   related to PMD unsharing.
 
 - Several folioification patch serieses from Matthew Wilcox, Vishal
   Moola, Sidhartha Kumar and Lorenzo Stoakes
 
 - Johannes Weiner has a series ("mm: push down lock_page_memcg()") which
   does perform some memcg maintenance and cleanup work.
 
 - SeongJae Park has added DAMOS filtering to DAMON, with the series
   "mm/damon/core: implement damos filter".  These filters provide users
   with finer-grained control over DAMOS's actions.  SeongJae has also done
   some DAMON cleanup work.
 
 - Kairui Song adds a series ("Clean up and fixes for swap").
 
 - Vernon Yang contributed the series "Clean up and refinement for maple
   tree".
 
 - Yu Zhao has contributed the "mm: multi-gen LRU: memcg LRU" series.  It
   adds to MGLRU an LRU of memcgs, to improve the scalability of global
   reclaim.
 
 - David Hildenbrand has added some userfaultfd cleanup work in the
   series "mm: uffd-wp + change_protection() cleanups".
 
 - Christoph Hellwig has removed the generic_writepages() library
   function in the series "remove generic_writepages".
 
 - Baolin Wang has performed some maintenance on the compaction code in
   his series "Some small improvements for compaction".
 
 - Sidhartha Kumar is doing some maintenance work on struct page in his
   series "Get rid of tail page fields".
 
 - David Hildenbrand contributed some cleanup, bugfixing and
   generalization of pte management and of pte debugging in his series "mm:
   support __HAVE_ARCH_PTE_SWP_EXCLUSIVE on all architectures with swap
   PTEs".
 
 - Mel Gorman and Neil Brown have removed the __GFP_ATOMIC allocation
   flag in the series "Discard __GFP_ATOMIC".
 
 - Sergey Senozhatsky has improved zsmalloc's memory utilization with his
   series "zsmalloc: make zspage chain size configurable".
 
 - Joey Gouly has added prctl() support for prohibiting the creation of
   writeable+executable mappings.  The previous BPF-based approach had
   shortcomings.  See "mm: In-kernel support for memory-deny-write-execute
   (MDWE)".
 
 - Waiman Long did some kmemleak cleanup and bugfixing in the series
   "mm/kmemleak: Simplify kmemleak_cond_resched() & fix UAF".
 
 - T.J.  Alumbaugh has contributed some MGLRU cleanup work in his series
   "mm: multi-gen LRU: improve".
 
 - Jiaqi Yan has provided some enhancements to our memory error
   statistics reporting, mainly by presenting the statistics on a per-node
   basis.  See the series "Introduce per NUMA node memory error
   statistics".
 
 - Mel Gorman has a second and hopefully final shot at fixing a CPU-hog
   regression in compaction via his series "Fix excessive CPU usage during
   compaction".
 
 - Christoph Hellwig does some vmalloc maintenance work in the series
   "cleanup vfree and vunmap".
 
 - Christoph Hellwig has removed block_device_operations.rw_page() in ths
   series "remove ->rw_page".
 
 - We get some maple_tree improvements and cleanups in Liam Howlett's
   series "VMA tree type safety and remove __vma_adjust()".
 
 - Suren Baghdasaryan has done some work on the maintainability of our
   vm_flags handling in the series "introduce vm_flags modifier functions".
 
 - Some pagemap cleanup and generalization work in Mike Rapoport's series
   "mm, arch: add generic implementation of pfn_valid() for FLATMEM" and
   "fixups for generic implementation of pfn_valid()"
 
 - Baoquan He has done some work to make /proc/vmallocinfo and
   /proc/kcore better represent the real state of things in his series
   "mm/vmalloc.c: allow vread() to read out vm_map_ram areas".
 
 - Jason Gunthorpe rationalized the GUP system's interface to the rest of
   the kernel in the series "Simplify the external interface for GUP".
 
 - SeongJae Park wishes to migrate people from DAMON's debugfs interface
   over to its sysfs interface.  To support this, we'll temporarily be
   printing warnings when people use the debugfs interface.  See the series
   "mm/damon: deprecate DAMON debugfs interface".
 
 - Andrey Konovalov provided the accurately named "lib/stackdepot: fixes
   and clean-ups" series.
 
 - Huang Ying has provided a dramatic reduction in migration's TLB flush
   IPI rates with the series "migrate_pages(): batch TLB flushing".
 
 - Arnd Bergmann has some objtool fixups in "objtool warning fixes".
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCY/PoPQAKCRDdBJ7gKXxA
 jlvpAPsFECUBBl20qSue2zCYWnHC7Yk4q9ytTkPB/MMDrFEN9wD/SNKEm2UoK6/K
 DmxHkn0LAitGgJRS/W9w81yrgig9tAQ=
 =MlGs
 -----END PGP SIGNATURE-----

Merge tag 'mm-stable-2023-02-20-13-37' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull MM updates from Andrew Morton:

 - Daniel Verkamp has contributed a memfd series ("mm/memfd: add
   F_SEAL_EXEC") which permits the setting of the memfd execute bit at
   memfd creation time, with the option of sealing the state of the X
   bit.

 - Peter Xu adds a patch series ("mm/hugetlb: Make huge_pte_offset()
   thread-safe for pmd unshare") which addresses a rare race condition
   related to PMD unsharing.

 - Several folioification patch serieses from Matthew Wilcox, Vishal
   Moola, Sidhartha Kumar and Lorenzo Stoakes

 - Johannes Weiner has a series ("mm: push down lock_page_memcg()")
   which does perform some memcg maintenance and cleanup work.

 - SeongJae Park has added DAMOS filtering to DAMON, with the series
   "mm/damon/core: implement damos filter".

   These filters provide users with finer-grained control over DAMOS's
   actions. SeongJae has also done some DAMON cleanup work.

 - Kairui Song adds a series ("Clean up and fixes for swap").

 - Vernon Yang contributed the series "Clean up and refinement for maple
   tree".

 - Yu Zhao has contributed the "mm: multi-gen LRU: memcg LRU" series. It
   adds to MGLRU an LRU of memcgs, to improve the scalability of global
   reclaim.

 - David Hildenbrand has added some userfaultfd cleanup work in the
   series "mm: uffd-wp + change_protection() cleanups".

 - Christoph Hellwig has removed the generic_writepages() library
   function in the series "remove generic_writepages".

 - Baolin Wang has performed some maintenance on the compaction code in
   his series "Some small improvements for compaction".

 - Sidhartha Kumar is doing some maintenance work on struct page in his
   series "Get rid of tail page fields".

 - David Hildenbrand contributed some cleanup, bugfixing and
   generalization of pte management and of pte debugging in his series
   "mm: support __HAVE_ARCH_PTE_SWP_EXCLUSIVE on all architectures with
   swap PTEs".

 - Mel Gorman and Neil Brown have removed the __GFP_ATOMIC allocation
   flag in the series "Discard __GFP_ATOMIC".

 - Sergey Senozhatsky has improved zsmalloc's memory utilization with
   his series "zsmalloc: make zspage chain size configurable".

 - Joey Gouly has added prctl() support for prohibiting the creation of
   writeable+executable mappings.

   The previous BPF-based approach had shortcomings. See "mm: In-kernel
   support for memory-deny-write-execute (MDWE)".

 - Waiman Long did some kmemleak cleanup and bugfixing in the series
   "mm/kmemleak: Simplify kmemleak_cond_resched() & fix UAF".

 - T.J. Alumbaugh has contributed some MGLRU cleanup work in his series
   "mm: multi-gen LRU: improve".

 - Jiaqi Yan has provided some enhancements to our memory error
   statistics reporting, mainly by presenting the statistics on a
   per-node basis. See the series "Introduce per NUMA node memory error
   statistics".

 - Mel Gorman has a second and hopefully final shot at fixing a CPU-hog
   regression in compaction via his series "Fix excessive CPU usage
   during compaction".

 - Christoph Hellwig does some vmalloc maintenance work in the series
   "cleanup vfree and vunmap".

 - Christoph Hellwig has removed block_device_operations.rw_page() in
   ths series "remove ->rw_page".

 - We get some maple_tree improvements and cleanups in Liam Howlett's
   series "VMA tree type safety and remove __vma_adjust()".

 - Suren Baghdasaryan has done some work on the maintainability of our
   vm_flags handling in the series "introduce vm_flags modifier
   functions".

 - Some pagemap cleanup and generalization work in Mike Rapoport's
   series "mm, arch: add generic implementation of pfn_valid() for
   FLATMEM" and "fixups for generic implementation of pfn_valid()"

 - Baoquan He has done some work to make /proc/vmallocinfo and
   /proc/kcore better represent the real state of things in his series
   "mm/vmalloc.c: allow vread() to read out vm_map_ram areas".

 - Jason Gunthorpe rationalized the GUP system's interface to the rest
   of the kernel in the series "Simplify the external interface for
   GUP".

 - SeongJae Park wishes to migrate people from DAMON's debugfs interface
   over to its sysfs interface. To support this, we'll temporarily be
   printing warnings when people use the debugfs interface. See the
   series "mm/damon: deprecate DAMON debugfs interface".

 - Andrey Konovalov provided the accurately named "lib/stackdepot: fixes
   and clean-ups" series.

 - Huang Ying has provided a dramatic reduction in migration's TLB flush
   IPI rates with the series "migrate_pages(): batch TLB flushing".

 - Arnd Bergmann has some objtool fixups in "objtool warning fixes".

* tag 'mm-stable-2023-02-20-13-37' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (505 commits)
  include/linux/migrate.h: remove unneeded externs
  mm/memory_hotplug: cleanup return value handing in do_migrate_range()
  mm/uffd: fix comment in handling pte markers
  mm: change to return bool for isolate_movable_page()
  mm: hugetlb: change to return bool for isolate_hugetlb()
  mm: change to return bool for isolate_lru_page()
  mm: change to return bool for folio_isolate_lru()
  objtool: add UACCESS exceptions for __tsan_volatile_read/write
  kmsan: disable ftrace in kmsan core code
  kasan: mark addr_has_metadata __always_inline
  mm: memcontrol: rename memcg_kmem_enabled()
  sh: initialize max_mapnr
  m68k/nommu: add missing definition of ARCH_PFN_OFFSET
  mm: percpu: fix incorrect size in pcpu_obj_full_size()
  maple_tree: reduce stack usage with gcc-9 and earlier
  mm: page_alloc: call panic() when memoryless node allocation fails
  mm: multi-gen LRU: avoid futile retries
  migrate_pages: move THP/hugetlb migration support check to simplify code
  migrate_pages: batch flushing TLB
  migrate_pages: share more code between _unmap and _move
  ...
2023-02-23 17:09:35 -08:00
Linus Torvalds
238b05ec99 Micro-optimize __flush_tlb_all().
Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmPzdIoRHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1iHNQ//UJ49mOUEHiyChwMxEWcHjY4KXffafg54
 ahXCMgT3mwaA0OEjGvVON0w8Fe6NVeMv2n8/mR+X/r6euBpk/XJDCAa/tUfx3wUh
 9AiWSktlAgR/2lgof8/UDdAaHxjabXJCS/qi2u+0D/cKr5odDR5SogCca6AeG1op
 ZskTJPg3teSDbW8vd9k7wkTYy8xRAHUmdBibXvbVXdehzm8Qw2RDC6pfONlI26up
 UEteiBT9FjxyjrLhZ2snsnNPei1JFMbvjikJCPEwRES4sFNZbevHb2NlDxUSC7M+
 wwb8VnUI9UX9HMZFiqVTXfwDNAlKpQmYuSfL5juFPgLgoiiHPpdKr4tw9jiNdB++
 NLeIsbQLS70wpC2PflFo4Kmpr3FQU5FA1TesBaD3c6wjey+PGvojCRyh2pbTL2tQ
 dM/hPZnsmYkbzUgFV8rSXCON7LdXFgkKBJJKgW9eT1njQRbxHL/6RVh67fa1640O
 gamXYi7icrrkaMPRCB+TwtA14S7FBVKo/I4m+4j880zY3Hrvv7pzoHPQilu5Hg32
 4yZC0XJcoQcWfCOxMCOFdbeAOx+PIxPJxIuSh6HdpJ7nL5igbXJtIBea78ru2RhS
 22T+0h4dpm8ZygeGvMCunZBvzrEUUJdEbIC1csOcSKPzzMzVfEf0rr+QLAFXUDVn
 S/rgNqnV+Hc=
 =iekt
 -----END PGP SIGNATURE-----

Merge tag 'x86-mm-2023-02-20' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 mm update from Ingo Molnar:
 "Micro-optimize __flush_tlb_all()"

* tag 'x86-mm-2023-02-20' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/cpu: Use cpu_feature_enabled() when checking global pages support
2023-02-20 19:00:00 -08:00
Linus Torvalds
6e649d0856 Updates for this cycle were:
- rwsem micro-optimizations
  - spinlock micro-optimizations
  - cleanups, simplifications
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmPzZkURHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1gBvA/6A5RMmSdOIVojmiwUYvZInA/Kpm2wuW0q
 bQWW9maLb96JMpj3FB5Xs5U993WiF0Gt9aGHoND9V2wOYbnv01ElCKKgsw7zLnXb
 c++txpmD+HoUGp94H8T2nA3szPLR7OpPpLmfjTWHKeWQRTStJobTTqi5jVTUZT37
 92MZ2tVzapQJq5VESk0C+0FBFDobh0gTX8hwkEj83ubXK4rC071/gJD4JHZt4nWN
 Up9YGNoNvw+ns7upo2C1XJ4H4ucFoCXT2smH4Oh0gk8Cfs6oP1k5H8J5aJQ+23fT
 EOWzkk0vJdpukNXI1+4G4KMwCO6zv+xVxXpEBizEeTgKWwbJpgBeGrisheAUyMHT
 bwfztsn+NQET11NsccmtRzspscUT42Nc+FUW0KeR2LiBKZhuD6l1Tac3w2HolycA
 2YjMQx3ATOEnMFgv4jGlldlasIAnYj0qitw6wCGqkJSvrC3au/LfcBHn45SxkBWc
 KZV1Oj26aH1hDxYSLyZRmFEvf/46D9CHmv8ReuFbmM6FIYwL+go+Odw/MHMFAbZR
 aP9YR4e94p6WmaNwMqzozP+wN67E4TME2vG6+T1n/szKDogoBlcn/wl053pkcHa/
 CsjELY82/CRRrDgWSnSKUZEFvnnBujyEiSz7pTZCzdBTMc/EcxK5CHOSyN23x+LI
 TvvxFn7KM/o=
 =yTly
 -----END PGP SIGNATURE-----

Merge tag 'locking-core-2023-02-20' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull locking updates from Ingo Molnar:

 - rwsem micro-optimizations

 - spinlock micro-optimizations

 - cleanups, simplifications

* tag 'locking-core-2023-02-20' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  vduse: Remove include of rwlock.h
  locking/lockdep: Remove lockdep_init_map_crosslock.
  x86/ACPI/boot: Use try_cmpxchg() in __acpi_{acquire,release}_global_lock()
  x86/PAT: Use try_cmpxchg() in set_page_memtype()
  locking/rwsem: Disable preemption in all down_write*() and up_write() code paths
  locking/rwsem: Disable preemption in all down_read*() and up_read() code paths
  locking/rwsem: Prevent non-first waiter from spinning in down_write() slowpath
  locking/qspinlock: Micro-optimize pending state waiting for unlock
2023-02-20 17:18:23 -08:00
Juergen Gross
f9f57da2c2 x86/mtrr: Revert 90b926e68f ("x86/pat: Fix pat_x_mtrr_type() for MTRR disabled case")
Commit

  90b926e68f ("x86/pat: Fix pat_x_mtrr_type() for MTRR disabled case")

broke the use case of running Xen dom0 kernels on machines with an
external disk enclosure attached via USB, see Link tag.

What this commit was originally fixing - SEV-SNP guests on Hyper-V - is
a more specialized situation which has other issues at the moment anyway
so reverting this now and addressing the issue properly later is the
prudent thing to do.

So revert it in time for the 6.2 proper release.

  [ bp: Rewrite commit message. ]

Reported-by: Christian Kujau <lists@nerdbynature.de>
Tested-by: Christian Kujau <lists@nerdbynature.de>
Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/4fe9541e-4d4c-2b2a-f8c8-2d34a7284930@nerdbynature.de
2023-02-14 10:16:34 +01:00
Suren Baghdasaryan
68f48381d7 mm: introduce __vm_flags_mod and use it in untrack_pfn
There are scenarios when vm_flags can be modified without exclusive
mmap_lock, such as:
- after VMA was isolated and mmap_lock was downgraded or dropped
- in exit_mmap when there are no other mm users and locking is unnecessary
Introduce __vm_flags_mod to avoid assertions when the caller takes
responsibility for the required locking.
Pass a hint to untrack_pfn to conditionally use __vm_flags_mod for
flags modification to avoid assertion.

Link: https://lkml.kernel.org/r/20230126193752.297968-7-surenb@google.com
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arjun Roy <arjunroy@google.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: David Rientjes <rientjes@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jann Horn <jannh@google.com>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kent Overstreet <kent.overstreet@linux.dev>
Cc: Laurent Dufour <ldufour@linux.ibm.com>
Cc: Liam R. Howlett <Liam.Howlett@Oracle.com>
Cc: Lorenzo Stoakes <lstoakes@gmail.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Minchan Kim <minchan@google.com>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Peter Oskolkov <posk@google.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Punit Agrawal <punit.agrawal@bytedance.com>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Sebastian Reichel <sebastian.reichel@collabora.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Soheil Hassas Yeganeh <soheil@google.com>
Cc: Song Liu <songliubraving@fb.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-02-09 16:51:40 -08:00
Suren Baghdasaryan
1c71222e5f mm: replace vma->vm_flags direct modifications with modifier calls
Replace direct modifications to vma->vm_flags with calls to modifier
functions to be able to track flag changes and to keep vma locking
correctness.

[akpm@linux-foundation.org: fix drivers/misc/open-dice.c, per Hyeonggon Yoo]
Link: https://lkml.kernel.org/r/20230126193752.297968-5-surenb@google.com
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
Acked-by: Sebastian Reichel <sebastian.reichel@collabora.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arjun Roy <arjunroy@google.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: David Rientjes <rientjes@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jann Horn <jannh@google.com>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kent Overstreet <kent.overstreet@linux.dev>
Cc: Laurent Dufour <ldufour@linux.ibm.com>
Cc: Lorenzo Stoakes <lstoakes@gmail.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Minchan Kim <minchan@google.com>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Peter Oskolkov <posk@google.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Punit Agrawal <punit.agrawal@bytedance.com>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Soheil Hassas Yeganeh <soheil@google.com>
Cc: Song Liu <songliubraving@fb.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-02-09 16:51:39 -08:00
Alexander Potapenko
1e15d374bb Revert "x86: kmsan: sync metadata pages on page fault"
This reverts commit 3f1e2c7a90.

As noticed by Qun-Wei Lin, arch_sync_kernel_mappings() in
arch/x86/mm/fault.c is only used with CONFIG_X86_32, whereas KMSAN is only
supported on x86_64, where this code is not compiled.

The patch in question dates back to downstream KMSAN branch based on
v5.8-rc5, it sneaked into upstream unnoticed in v6.1.

Link: https://lkml.kernel.org/r/20230111101806.3236991-1-glider@google.com
Signed-off-by: Alexander Potapenko <glider@google.com>
Reported-by: Qun-Wei Lin <qun-wei.lin@mediatek.com>
  Link: https://github.com/google/kmsan/issues/91
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Marco Elver <elver@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-02-02 22:33:01 -08:00
Uros Bizjak
50fd4d5e69 x86/PAT: Use try_cmpxchg() in set_page_memtype()
Use try_cmpxchg instead of cmpxchg (*ptr, old, new) == old in
set_page_memtype.  x86 CMPXCHG instruction returns success in ZF flag,
so this change saves a compare after cmpxchg (and related move
instruction in front of cmpxchg).

Also, try_cmpxchg implicitly assigns old *ptr value to "old" when cmpxchg
fails. There is no need to re-read the value in the loop.

Note that the value from *ptr should be read using READ_ONCE to prevent
the compiler from merging, refetching or reordering the read.

No functional change intended.

Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20230116163446.4734-1-ubizjak@gmail.com
2023-01-26 11:49:39 +01:00
Borislav Petkov (AMD)
ebd3ad60a6 x86/cpu: Use cpu_feature_enabled() when checking global pages support
X86_FEATURE_PGE determines whether the CPU has enabled global page
translations support. Use the faster cpu_feature_enabled() check to
shave off some more cycles when flushing all TLB entries, including the
global ones.

What this practically saves is:

   mov    0x82eb308(%rip),%rax        # 0xffffffff8935bec8 <boot_cpu_data+40>
   test   $0x20,%ah

... which test the bit. Not a lot, but TLB flushing is a timing-sensitive
path, so anything to make it even faster.

No functional changes.

Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20230125075013.9292-1-bp@alien8.de
2023-01-25 10:32:06 +01:00
Juergen Gross
26ce6ec364 x86/mm: fix poking_init() for Xen PV guests
Commit 3f4c8211d9 ("x86/mm: Use mm_alloc() in poking_init()") broke
the kernel for running as Xen PV guest.

It seems as if the new address space is never activated before being
used, resulting in Xen rejecting to accept the new CR3 value (the PGD
isn't pinned).

Fix that by adding the now missing call of paravirt_arch_dup_mmap() to
poking_init(). That call was previously done by dup_mm()->dup_mmap() and
it is a NOP for all cases but for Xen PV, where it is just doing the
pinning of the PGD.

Fixes: 3f4c8211d9 ("x86/mm: Use mm_alloc() in poking_init()")
Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20230109150922.10578-1-jgross@suse.com
2023-01-12 11:22:20 +01:00
Juergen Gross
90b926e68f x86/pat: Fix pat_x_mtrr_type() for MTRR disabled case
Since

  72cbc8f04f ("x86/PAT: Have pat_enabled() properly reflect state when running on Xen")

PAT can be enabled without MTRR.

This has resulted in problems e.g. for a SEV-SNP guest running under Hyper-V,
when trying to establish a new mapping via memremap() with WB caching mode, as
pat_x_mtrr_type() will call mtrr_type_lookup(), which in turn is returning
MTRR_TYPE_INVALID due to MTRR being disabled in this configuration.

The result is a mapping with UC- caching, leading to severe performance
degradation.

Fix that by handling MTRR_TYPE_INVALID the same way as MTRR_TYPE_WRBACK
in pat_x_mtrr_type() because MTRR_TYPE_INVALID means MTRRs are disabled.

  [ bp: Massage commit message. ]

Fixes: 72cbc8f04f ("x86/PAT: Have pat_enabled() properly reflect state when running on Xen")
Reported-by: Michael Kelley (LINUX) <mikelley@microsoft.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Michael Kelley <mikelley@microsoft.com>
Tested-by: Michael Kelley <mikelley@microsoft.com>
Cc: <stable@kernel.org>
Link: https://lore.kernel.org/r/20230110065427.20767-1-jgross@suse.com
2023-01-10 17:21:53 +01:00
Jason A. Donenfeld
3c202d14a9 prandom: remove prandom_u32_max()
Convert the final two users of prandom_u32_max() that slipped in during
6.2-rc1 to use get_random_u32_below().

Then, with no more users left, we can finally remove the deprecated
function.

Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2022-12-20 03:13:45 +01:00
Linus Torvalds
4f292c4de4 New Feature:
* Randomize the per-cpu entry areas
 Cleanups:
 * Have CR3_ADDR_MASK use PHYSICAL_PAGE_MASK instead of open
   coding it
 * Move to "native" set_memory_rox() helper
 * Clean up pmd_get_atomic() and i386-PAE
 * Remove some unused page table size macros
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEV76QKkVc4xCGURexaDWVMHDJkrAFAmOc53UACgkQaDWVMHDJ
 krCUHw//SGZ+La0hLZLAiAiZTXLZZHpYkOmg1Oj1+11qSU11uZzTFqDpauhaKpRS
 cJCSh+D+RXe5e2ipgt0+Zl0hESLt7pJf8258OE4ra0DL/IlyO9uqruAs9Kn3eRS/
 Fk76nG8gdEU+JKJqpG02GqOLslYQuIy96n9hpuj1x25b614+uezPfC7S4XEat0NT
 MbJQ+jnVDf16aJIJkzT+iSwhubDVeh+bSHeO0SSCzX23WLUqDeg5NvlyxoCHGbBh
 UpUTWggV/0pYAkBKRHToeJs8qTWREwuuH/8JGewpe9A0tjdB5wyZfNL2PuracweN
 9MauXC3T5f0+Ca4yIIaPq1fF7Ny/PR2dBFihk27rOD0N7tjaZxNwal2pB1sZcmvZ
 +PAokjyTPVH5ZXjkMYGGAUe1jyjwr2+TgFSZxhTnDuGtyVQiY4pihGKOifLCX6tv
 x6khvYeTBw7wfaDRtKEAf+2kLHYn+71HszHP/8bNKX9T03h+Zf0i1wdZu5xbM5Gc
 VK2wR7bCC+UftJJYG0pldcHg2qaF19RBHK2tLwp7zngUv7lTbkKfkgKjre73KV2a
 D4b76lrqdUMo6UYwYdw7WtDyarZS4OVLq2DcNhwwMddBCaX8kyN5a4AqwQlZYJ0u
 dM+kuMofE8U3yMxmMhJimkZUsj09yLHIqfynY0jbAcU3nhKZZNY=
 =wwVF
 -----END PGP SIGNATURE-----

Merge tag 'x86_mm_for_6.2_v2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 mm updates from Dave Hansen:
 "New Feature:

   - Randomize the per-cpu entry areas

  Cleanups:

   - Have CR3_ADDR_MASK use PHYSICAL_PAGE_MASK instead of open coding it

   - Move to "native" set_memory_rox() helper

   - Clean up pmd_get_atomic() and i386-PAE

   - Remove some unused page table size macros"

* tag 'x86_mm_for_6.2_v2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (35 commits)
  x86/mm: Ensure forced page table splitting
  x86/kasan: Populate shadow for shared chunk of the CPU entry area
  x86/kasan: Add helpers to align shadow addresses up and down
  x86/kasan: Rename local CPU_ENTRY_AREA variables to shorten names
  x86/mm: Populate KASAN shadow for entire per-CPU range of CPU entry area
  x86/mm: Recompute physical address for every page of per-CPU CEA mapping
  x86/mm: Rename __change_page_attr_set_clr(.checkalias)
  x86/mm: Inhibit _PAGE_NX changes from cpa_process_alias()
  x86/mm: Untangle __change_page_attr_set_clr(.checkalias)
  x86/mm: Add a few comments
  x86/mm: Fix CR3_ADDR_MASK
  x86/mm: Remove P*D_PAGE_MASK and P*D_PAGE_SIZE macros
  mm: Convert __HAVE_ARCH_P..P_GET to the new style
  mm: Remove pointless barrier() after pmdp_get_lockless()
  x86/mm/pae: Get rid of set_64bit()
  x86_64: Remove pointless set_64bit() usage
  x86/mm/pae: Be consistent with pXXp_get_and_clear()
  x86/mm/pae: Use WRITE_ONCE()
  x86/mm/pae: Don't (ab)use atomic64
  mm/gup: Fix the lockless PMD access
  ...
2022-12-17 14:06:53 -06:00
Linus Torvalds
fe36bb8736 Tracing updates for 6.2:
- Add options to the osnoise tracer
   o panic_on_stop option that panics the kernel if osnoise is greater than some
     user defined threshold.
   o preempt option, to test noise while preemption is disabled
   o irq option, to test noise when interrupts are disabled
 
 - Add .percent and .graph suffix to histograms to give different outputs
 
 - Add nohitcount to disable showing hitcount in histogram output
 
 - Add new __cpumask() to trace event fields to annotate that a unsigned long
   array is a cpumask to user space and should be treated as one.
 
 - Add trace_trigger kernel command line parameter to enable trace event
   triggers at boot up. Useful to trace stack traces, disable tracing and take
   snapshots.
 
 - Fix x86/kmmio mmio tracer to work with the updates to lockdep
 
 - Unify the panic and die notifiers
 
 - Add back ftrace_expect reference that is used to extract more information in
   the ftrace_bug() code.
 
 - Have trigger filter parsing errors show up in the tracing error log.
 
 - Updated MAINTAINERS file to add kernel tracing  mailing list and patchwork
   info
 
 - Use IDA to keep track of event type numbers.
 
 - And minor fixes and clean ups
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCY5vIcxQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qlO7AQCmtZbriadAR6N7Llj092YXmYfzrxyi
 1WS35vhpZsBJ8gEA8j68l+LrgNt51N2gXlTXEHgXzdBgL/TKAPSX4D99GQY=
 =z1pe
 -----END PGP SIGNATURE-----

Merge tag 'trace-v6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull tracing updates from Steven Rostedt:

 - Add options to the osnoise tracer:
      - 'panic_on_stop' option that panics the kernel if osnoise is
        greater than some user defined threshold.
      - 'preempt' option, to test noise while preemption is disabled
      - 'irq' option, to test noise when interrupts are disabled

 - Add .percent and .graph suffix to histograms to give different
   outputs

 - Add nohitcount to disable showing hitcount in histogram output

 - Add new __cpumask() to trace event fields to annotate that a unsigned
   long array is a cpumask to user space and should be treated as one.

 - Add trace_trigger kernel command line parameter to enable trace event
   triggers at boot up. Useful to trace stack traces, disable tracing
   and take snapshots.

 - Fix x86/kmmio mmio tracer to work with the updates to lockdep

 - Unify the panic and die notifiers

 - Add back ftrace_expect reference that is used to extract more
   information in the ftrace_bug() code.

 - Have trigger filter parsing errors show up in the tracing error log.

 - Updated MAINTAINERS file to add kernel tracing mailing list and
   patchwork info

 - Use IDA to keep track of event type numbers.

 - And minor fixes and clean ups

* tag 'trace-v6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: (44 commits)
  tracing: Fix cpumask() example typo
  tracing: Improve panic/die notifiers
  ftrace: Prevent RCU stall on PREEMPT_VOLUNTARY kernels
  tracing: Do not synchronize freeing of trigger filter on boot up
  tracing: Remove pointer (asterisk) and brackets from cpumask_t field
  tracing: Have trigger filter parsing errors show up in error_log
  x86/mm/kmmio: Remove redundant preempt_disable()
  tracing: Fix infinite loop in tracing_read_pipe on overflowed print_trace_line
  Documentation/osnoise: Add osnoise/options documentation
  tracing/osnoise: Add preempt and/or irq disabled options
  tracing/osnoise: Add PANIC_ON_STOP option
  Documentation/osnoise: Escape underscore of NO_ prefix
  tracing: Fix some checker warnings
  tracing/osnoise: Make osnoise_options static
  tracing: remove unnecessary trace_trigger ifdef
  ring-buffer: Handle resize in early boot up
  tracing/hist: Fix issue of losting command info in error_log
  tracing: Fix issue of missing one synthetic field
  tracing/hist: Fix out-of-bound write on 'action_data.var_ref_idx'
  tracing/hist: Fix wrong return value in parse_action_params()
  ...
2022-12-15 18:01:16 -08:00
Dave Hansen
3e844d842d x86/mm: Ensure forced page table splitting
There are a few kernel users like kfence that require 4k pages to work
correctly and do not support large mappings.  They use set_memory_4k()
to break down those large mappings.

That, in turn relies on cpa_data->force_split option to indicate to
set_memory code that it should split page tables regardless of whether
the need to be.

But, a recent change added an optimization which would return early
if a set_memory request came in that did not change permissions.  It
did not consult ->force_split and would mistakenly optimize away the
splitting that set_memory_4k() needs.  This broke kfence.

Skip the same-permission optimization when ->force_split is set.

Fixes: 127960a05548 ("x86/mm: Inhibit _PAGE_NX changes from cpa_process_alias()")
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Tested-by: Marco Elver <elver@google.com>
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/all/CA+G9fYuFxZTxkeS35VTZMXwQvohu73W3xbZ5NtjebsVvH6hCuA@mail.gmail.com/
2022-12-15 10:37:28 -08:00
Sean Christopherson
1cfaac2400 x86/kasan: Populate shadow for shared chunk of the CPU entry area
Popuplate the shadow for the shared portion of the CPU entry area, i.e.
the read-only IDT mapping, during KASAN initialization.  A recent change
modified KASAN to map the per-CPU areas on-demand, but forgot to keep a
shadow for the common area that is shared amongst all CPUs.

Map the common area in KASAN init instead of letting idt_map_in_cea() do
the dirty work so that it Just Works in the unlikely event more shared
data is shoved into the CPU entry area.

The bug manifests as a not-present #PF when software attempts to lookup
an IDT entry, e.g. when KVM is handling IRQs on Intel CPUs (KVM performs
direct CALL to the IRQ handler to avoid the overhead of INTn):

 BUG: unable to handle page fault for address: fffffbc0000001d8
 #PF: supervisor read access in kernel mode
 #PF: error_code(0x0000) - not-present page
 PGD 16c03a067 P4D 16c03a067 PUD 0
 Oops: 0000 [#1] PREEMPT SMP KASAN
 CPU: 5 PID: 901 Comm: repro Tainted: G        W          6.1.0-rc3+ #410
 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
 RIP: 0010:kasan_check_range+0xdf/0x190
  vmx_handle_exit_irqoff+0x152/0x290 [kvm_intel]
  vcpu_run+0x1d89/0x2bd0 [kvm]
  kvm_arch_vcpu_ioctl_run+0x3ce/0xa70 [kvm]
  kvm_vcpu_ioctl+0x349/0x900 [kvm]
  __x64_sys_ioctl+0xb8/0xf0
  do_syscall_64+0x2b/0x50
  entry_SYSCALL_64_after_hwframe+0x46/0xb0

Fixes: 9fd429c28073 ("x86/kasan: Map shadow for percpu pages on demand")
Reported-by: syzbot+8cdd16fd5a6c0565e227@syzkaller.appspotmail.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20221110203504.1985010-6-seanjc@google.com
2022-12-15 10:37:28 -08:00
Sean Christopherson
bde258d974 x86/kasan: Add helpers to align shadow addresses up and down
Add helpers to dedup code for aligning shadow address up/down to page
boundaries when translating an address to its shadow.

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Link: https://lkml.kernel.org/r/20221110203504.1985010-5-seanjc@google.com
2022-12-15 10:37:28 -08:00
Sean Christopherson
7077d2ccb9 x86/kasan: Rename local CPU_ENTRY_AREA variables to shorten names
Rename the CPU entry area variables in kasan_init() to shorten their
names, a future fix will reference the beginning of the per-CPU portion
of the CPU entry area, and shadow_cpu_entry_per_cpu_begin is a bit much.

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Link: https://lkml.kernel.org/r/20221110203504.1985010-4-seanjc@google.com
2022-12-15 10:37:28 -08:00
Sean Christopherson
97650148a1 x86/mm: Populate KASAN shadow for entire per-CPU range of CPU entry area
Populate a KASAN shadow for the entire possible per-CPU range of the CPU
entry area instead of requiring that each individual chunk map a shadow.
Mapping shadows individually is error prone, e.g. the per-CPU GDT mapping
was left behind, which can lead to not-present page faults during KASAN
validation if the kernel performs a software lookup into the GDT.  The DS
buffer is also likely affected.

The motivation for mapping the per-CPU areas on-demand was to avoid
mapping the entire 512GiB range that's reserved for the CPU entry area,
shaving a few bytes by not creating shadows for potentially unused memory
was not a goal.

The bug is most easily reproduced by doing a sigreturn with a garbage
CS in the sigcontext, e.g.

  int main(void)
  {
    struct sigcontext regs;

    syscall(__NR_mmap, 0x1ffff000ul, 0x1000ul, 0ul, 0x32ul, -1, 0ul);
    syscall(__NR_mmap, 0x20000000ul, 0x1000000ul, 7ul, 0x32ul, -1, 0ul);
    syscall(__NR_mmap, 0x21000000ul, 0x1000ul, 0ul, 0x32ul, -1, 0ul);

    memset(&regs, 0, sizeof(regs));
    regs.cs = 0x1d0;
    syscall(__NR_rt_sigreturn);
    return 0;
  }

to coerce the kernel into doing a GDT lookup to compute CS.base when
reading the instruction bytes on the subsequent #GP to determine whether
or not the #GP is something the kernel should handle, e.g. to fixup UMIP
violations or to emulate CLI/STI for IOPL=3 applications.

  BUG: unable to handle page fault for address: fffffbc8379ace00
  #PF: supervisor read access in kernel mode
  #PF: error_code(0x0000) - not-present page
  PGD 16c03a067 P4D 16c03a067 PUD 15b990067 PMD 15b98f067 PTE 0
  Oops: 0000 [#1] PREEMPT SMP KASAN
  CPU: 3 PID: 851 Comm: r2 Not tainted 6.1.0-rc3-next-20221103+ #432
  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
  RIP: 0010:kasan_check_range+0xdf/0x190
  Call Trace:
   <TASK>
   get_desc+0xb0/0x1d0
   insn_get_seg_base+0x104/0x270
   insn_fetch_from_user+0x66/0x80
   fixup_umip_exception+0xb1/0x530
   exc_general_protection+0x181/0x210
   asm_exc_general_protection+0x22/0x30
  RIP: 0003:0x0
  Code: Unable to access opcode bytes at 0xffffffffffffffd6.
  RSP: 0003:0000000000000000 EFLAGS: 00000202
  RAX: 0000000000000000 RBX: 0000000000000000 RCX: 00000000000001d0
  RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
  RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
  R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
  R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
   </TASK>

Fixes: 9fd429c28073 ("x86/kasan: Map shadow for percpu pages on demand")
Reported-by: syzbot+ffb4f000dc2872c93f62@syzkaller.appspotmail.com
Suggested-by: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Link: https://lkml.kernel.org/r/20221110203504.1985010-3-seanjc@google.com
2022-12-15 10:37:28 -08:00
Sean Christopherson
80d72a8f76 x86/mm: Recompute physical address for every page of per-CPU CEA mapping
Recompute the physical address for each per-CPU page in the CPU entry
area, a recent commit inadvertantly modified cea_map_percpu_pages() such
that every PTE is mapped to the physical address of the first page.

Fixes: 9fd429c28073 ("x86/kasan: Map shadow for percpu pages on demand")
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Link: https://lkml.kernel.org/r/20221110203504.1985010-2-seanjc@google.com
2022-12-15 10:37:28 -08:00
Peter Zijlstra
e996365ee7 x86/mm: Rename __change_page_attr_set_clr(.checkalias)
Now that the checkalias functionality is taken by CPA_NO_CHECK_ALIAS
rename the argument to better match is remaining purpose: primary,
matching __change_page_attr().

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20221110125544.661001508%40infradead.org
2022-12-15 10:37:28 -08:00
Peter Zijlstra
d597416683 x86/mm: Inhibit _PAGE_NX changes from cpa_process_alias()
There is a cludge in change_page_attr_set_clr() that inhibits
propagating NX changes to the aliases (directmap and highmap) -- this
is a cludge twofold:

 - it also inhibits the primary checks in __change_page_attr();
 - it hard depends on single bit changes.

The introduction of set_memory_rox() triggered this last issue for
clearing both _PAGE_RW and _PAGE_NX.

Explicitly ignore _PAGE_NX in cpa_process_alias() instead.

Fixes: b38994948567 ("x86/mm: Implement native set_memory_rox()")
Reported-by: kernel test robot <oliver.sang@intel.com>
Debugged-by: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20221110125544.594991716%40infradead.org
2022-12-15 10:37:28 -08:00
Peter Zijlstra
ef9ab81af6 x86/mm: Untangle __change_page_attr_set_clr(.checkalias)
The .checkalias argument to __change_page_attr_set_clr() is overloaded
and serves two different purposes:

 - it inhibits the call to cpa_process_alias() -- as suggested by the
   name; however,

 - it also serves as 'primary' indicator for __change_page_attr()
   ( which in turn also serves as a recursion terminator for
     cpa_process_alias() ).

Untangle these by extending the use of CPA_NO_CHECK_ALIAS to all
callsites that currently use .checkalias=0 for this purpose.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20221110125544.527267183%40infradead.org
2022-12-15 10:37:28 -08:00
Peter Zijlstra
5ceeee7571 x86/mm: Add a few comments
It's a shame to hide useful comments in Changelogs, add some to the
code.

Shamelessly stolen from commit:

  c40a56a781 ("x86/mm/init: Remove freed kernel image areas from alias mapping")

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20221110125544.460677011%40infradead.org
2022-12-15 10:37:28 -08:00
Pasha Tatashin
82328227db x86/mm: Remove P*D_PAGE_MASK and P*D_PAGE_SIZE macros
Other architectures and the common mm/ use P*D_MASK, and P*D_SIZE.
Remove the duplicated P*D_PAGE_MASK and P*D_PAGE_SIZE which are only
used in x86/*.

Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Acked-by: Mike Rapoport <rppt@linux.ibm.com>
Link: https://lore.kernel.org/r/20220516185202.604654-1-tatashin@google.com
2022-12-15 10:37:27 -08:00
Peter Zijlstra
60463628c9 x86/mm: Implement native set_memory_rox()
Provide a native implementation of set_memory_rox(), avoiding the
double set_memory_ro();set_memory_x(); calls.

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2022-12-15 10:37:27 -08:00
Peter Zijlstra
414ebf148c x86/mm: Do verify W^X at boot up
Straight up revert of commit:

  a970174d7a ("x86/mm: Do not verify W^X at boot up")

now that the root cause has been fixed.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20221025201058.011279208@infradead.org
2022-12-15 10:37:26 -08:00
Peter Zijlstra
3f4c8211d9 x86/mm: Use mm_alloc() in poking_init()
Instead of duplicating init_mm, allocate a fresh mm. The advantage is
that mm_alloc() has much simpler dependencies. Additionally it makes
more conceptual sense, init_mm has no (and must not have) user state
to duplicate.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20221025201057.816175235@infradead.org
2022-12-15 10:37:26 -08:00
Peter Zijlstra
97e3d26b5e x86/mm: Randomize per-cpu entry area
Seth found that the CPU-entry-area; the piece of per-cpu data that is
mapped into the userspace page-tables for kPTI is not subject to any
randomization -- irrespective of kASLR settings.

On x86_64 a whole P4D (512 GB) of virtual address space is reserved for
this structure, which is plenty large enough to randomize things a
little.

As such, use a straight forward randomization scheme that avoids
duplicates to spread the existing CPUs over the available space.

  [ bp: Fix le build. ]

Reported-by: Seth Jenkins <sethjenkins@google.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
2022-12-15 10:37:26 -08:00
Andrey Ryabinin
3f148f3318 x86/kasan: Map shadow for percpu pages on demand
KASAN maps shadow for the entire CPU-entry-area:
  [CPU_ENTRY_AREA_BASE, CPU_ENTRY_AREA_BASE + CPU_ENTRY_AREA_MAP_SIZE]

This will explode once the per-cpu entry areas are randomized since it
will increase CPU_ENTRY_AREA_MAP_SIZE to 512 GB and KASAN fails to
allocate shadow for such big area.

Fix this by allocating KASAN shadow only for really used cpu entry area
addresses mapped by cea_map_percpu_pages()

Thanks to the 0day folks for finding and reporting this to be an issue.

[ dhansen: tweak changelog since this will get committed before peterz's
	   actual cpu-entry-area randomization ]

Signed-off-by: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Tested-by: Yujie Liu <yujie.liu@intel.com>
Cc: kernel test robot <yujie.liu@intel.com>
Link: https://lore.kernel.org/r/202210241508.2e203c3d-yujie.liu@intel.com
2022-12-15 10:37:26 -08:00
Linus Torvalds
48ea09cdda hardening updates for v6.2-rc1
- Convert flexible array members, fix -Wstringop-overflow warnings,
   and fix KCFI function type mismatches that went ignored by
   maintainers (Gustavo A. R. Silva, Nathan Chancellor, Kees Cook).
 
 - Remove the remaining side-effect users of ksize() by converting
   dma-buf, btrfs, and coredump to using kmalloc_size_roundup(),
   add more __alloc_size attributes, and introduce full testing
   of all allocator functions. Finally remove the ksize() side-effect
   so that each allocation-aware checker can finally behave without
   exceptions.
 
 - Introduce oops_limit (default 10,000) and warn_limit (default off)
   to provide greater granularity of control for panic_on_oops and
   panic_on_warn (Jann Horn, Kees Cook).
 
 - Introduce overflows_type() and castable_to_type() helpers for
   cleaner overflow checking.
 
 - Improve code generation for strscpy() and update str*() kern-doc.
 
 - Convert strscpy and sigphash tests to KUnit, and expand memcpy
   tests.
 
 - Always use a non-NULL argument for prepare_kernel_cred().
 
 - Disable structleak plugin in FORTIFY KUnit test (Anders Roxell).
 
 - Adjust orphan linker section checking to respect CONFIG_WERROR
   (Xin Li).
 
 - Make sure siginfo is cleared for forced SIGKILL (haifeng.xu).
 
 - Fix um vs FORTIFY warnings for always-NULL arguments.
 -----BEGIN PGP SIGNATURE-----
 
 iQJKBAABCgA0FiEEpcP2jyKd1g9yPm4TiXL039xtwCYFAmOZSOoWHGtlZXNjb29r
 QGNocm9taXVtLm9yZwAKCRCJcvTf3G3AJjAAD/0YkvpU7f03f8hcQMJK6wv//24K
 AW41hEaBikq9RcmkuvkLLrJRibGgZ5O2xUkUkxRs/HxhkhrZ0kEw8sbwZe8MoWls
 F4Y9+TDjsrdHmjhfcBZdLnVxwcKK5wlaEcpjZXtbsfcdhx3TbgcDA23YELl5t0K+
 I11j4kYmf9SLl4CwIrSP5iACml8CBHARDh8oIMF7FT/LrjNbM8XkvBcVVT6hTbOV
 yjgA8WP2e9GXvj9GzKgqvd0uE/kwPkVAeXLNFWopPi4FQ8AWjlxbBZR0gamA6/EB
 d7TIs0ifpVU2JGQaTav4xO6SsFMj3ntoUI0qIrFaTxZAvV4KYGrPT/Kwz1O4SFaG
 rN5lcxseQbPQSBTFNG4zFjpywTkVCgD2tZqDwz5Rrmiraz0RyIokCN+i4CD9S0Ds
 oEd8JSyLBk1sRALczkuEKo0an5AyC9YWRcBXuRdIHpLo08PsbeUUSe//4pe303cw
 0ApQxYOXnrIk26MLElTzSMImlSvlzW6/5XXzL9ME16leSHOIfDeerPnc9FU9Eb3z
 ODv22z6tJZ9H/apSUIHZbMciMbbVTZ8zgpkfydr08o87b342N/ncYHZ5cSvQ6DWb
 jS5YOIuvl46/IhMPT16qWC8p0bP5YhxoPv5l6Xr0zq0ooEj0E7keiD/SzoLvW+Qs
 AHXcibguPRQBPAdiPQ==
 =yaaN
 -----END PGP SIGNATURE-----

Merge tag 'hardening-v6.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux

Pull kernel hardening updates from Kees Cook:

 - Convert flexible array members, fix -Wstringop-overflow warnings, and
   fix KCFI function type mismatches that went ignored by maintainers
   (Gustavo A. R. Silva, Nathan Chancellor, Kees Cook)

 - Remove the remaining side-effect users of ksize() by converting
   dma-buf, btrfs, and coredump to using kmalloc_size_roundup(), add
   more __alloc_size attributes, and introduce full testing of all
   allocator functions. Finally remove the ksize() side-effect so that
   each allocation-aware checker can finally behave without exceptions

 - Introduce oops_limit (default 10,000) and warn_limit (default off) to
   provide greater granularity of control for panic_on_oops and
   panic_on_warn (Jann Horn, Kees Cook)

 - Introduce overflows_type() and castable_to_type() helpers for cleaner
   overflow checking

 - Improve code generation for strscpy() and update str*() kern-doc

 - Convert strscpy and sigphash tests to KUnit, and expand memcpy tests

 - Always use a non-NULL argument for prepare_kernel_cred()

 - Disable structleak plugin in FORTIFY KUnit test (Anders Roxell)

 - Adjust orphan linker section checking to respect CONFIG_WERROR (Xin
   Li)

 - Make sure siginfo is cleared for forced SIGKILL (haifeng.xu)

 - Fix um vs FORTIFY warnings for always-NULL arguments

* tag 'hardening-v6.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: (31 commits)
  ksmbd: replace one-element arrays with flexible-array members
  hpet: Replace one-element array with flexible-array member
  um: virt-pci: Avoid GCC non-NULL warning
  signal: Initialize the info in ksignal
  lib: fortify_kunit: build without structleak plugin
  panic: Expose "warn_count" to sysfs
  panic: Introduce warn_limit
  panic: Consolidate open-coded panic_on_warn checks
  exit: Allow oops_limit to be disabled
  exit: Expose "oops_count" to sysfs
  exit: Put an upper limit on how often we can oops
  panic: Separate sysctl logic from CONFIG_SMP
  mm/pgtable: Fix multiple -Wstringop-overflow warnings
  mm: Make ksize() a reporting-only function
  kunit/fortify: Validate __alloc_size attribute results
  drm/sti: Fix return type of sti_{dvo,hda,hdmi}_connector_mode_valid()
  drm/fsl-dcu: Fix return type of fsl_dcu_drm_connector_mode_valid()
  driver core: Add __alloc_size hint to devm allocators
  overflow: Introduce overflows_type() and castable_to_type()
  coredump: Proactively round up to kmalloc bucket size
  ...
2022-12-14 12:20:00 -08:00
Linus Torvalds
e2ca6ba6ba MM patches for 6.2-rc1.
- More userfaultfs work from Peter Xu.
 
 - Several convert-to-folios series from Sidhartha Kumar and Huang Ying.
 
 - Some filemap cleanups from Vishal Moola.
 
 - David Hildenbrand added the ability to selftest anon memory COW handling.
 
 - Some cpuset simplifications from Liu Shixin.
 
 - Addition of vmalloc tracing support by Uladzislau Rezki.
 
 - Some pagecache folioifications and simplifications from Matthew Wilcox.
 
 - A pagemap cleanup from Kefeng Wang: we have VM_ACCESS_FLAGS, so use it.
 
 - Miguel Ojeda contributed some cleanups for our use of the
   __no_sanitize_thread__ gcc keyword.  This series shold have been in the
   non-MM tree, my bad.
 
 - Naoya Horiguchi improved the interaction between memory poisoning and
   memory section removal for huge pages.
 
 - DAMON cleanups and tuneups from SeongJae Park
 
 - Tony Luck fixed the handling of COW faults against poisoned pages.
 
 - Peter Xu utilized the PTE marker code for handling swapin errors.
 
 - Hugh Dickins reworked compound page mapcount handling, simplifying it
   and making it more efficient.
 
 - Removal of the autonuma savedwrite infrastructure from Nadav Amit and
   David Hildenbrand.
 
 - zram support for multiple compression streams from Sergey Senozhatsky.
 
 - David Hildenbrand reworked the GUP code's R/O long-term pinning so
   that drivers no longer need to use the FOLL_FORCE workaround which
   didn't work very well anyway.
 
 - Mel Gorman altered the page allocator so that local IRQs can remnain
   enabled during per-cpu page allocations.
 
 - Vishal Moola removed the try_to_release_page() wrapper.
 
 - Stefan Roesch added some per-BDI sysfs tunables which are used to
   prevent network block devices from dirtying excessive amounts of
   pagecache.
 
 - David Hildenbrand did some cleanup and repair work on KSM COW
   breaking.
 
 - Nhat Pham and Johannes Weiner have implemented writeback in zswap's
   zsmalloc backend.
 
 - Brian Foster has fixed a longstanding corner-case oddity in
   file[map]_write_and_wait_range().
 
 - sparse-vmemmap changes for MIPS, LoongArch and NIOS2 from Feiyang
   Chen.
 
 - Shiyang Ruan has done some work on fsdax, to make its reflink mode
   work better under xfstests.  Better, but still not perfect.
 
 - Christoph Hellwig has removed the .writepage() method from several
   filesystems.  They only need .writepages().
 
 - Yosry Ahmed wrote a series which fixes the memcg reclaim target
   beancounting.
 
 - David Hildenbrand has fixed some of our MM selftests for 32-bit
   machines.
 
 - Many singleton patches, as usual.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCY5j6ZwAKCRDdBJ7gKXxA
 jkDYAP9qNeVqp9iuHjZNTqzMXkfmJPsw2kmy2P+VdzYVuQRcJgEAgoV9d7oMq4ml
 CodAgiA51qwzId3GRytIo/tfWZSezgA=
 =d19R
 -----END PGP SIGNATURE-----

Merge tag 'mm-stable-2022-12-13' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull MM updates from Andrew Morton:

 - More userfaultfs work from Peter Xu

 - Several convert-to-folios series from Sidhartha Kumar and Huang Ying

 - Some filemap cleanups from Vishal Moola

 - David Hildenbrand added the ability to selftest anon memory COW
   handling

 - Some cpuset simplifications from Liu Shixin

 - Addition of vmalloc tracing support by Uladzislau Rezki

 - Some pagecache folioifications and simplifications from Matthew
   Wilcox

 - A pagemap cleanup from Kefeng Wang: we have VM_ACCESS_FLAGS, so use
   it

 - Miguel Ojeda contributed some cleanups for our use of the
   __no_sanitize_thread__ gcc keyword.

   This series should have been in the non-MM tree, my bad

 - Naoya Horiguchi improved the interaction between memory poisoning and
   memory section removal for huge pages

 - DAMON cleanups and tuneups from SeongJae Park

 - Tony Luck fixed the handling of COW faults against poisoned pages

 - Peter Xu utilized the PTE marker code for handling swapin errors

 - Hugh Dickins reworked compound page mapcount handling, simplifying it
   and making it more efficient

 - Removal of the autonuma savedwrite infrastructure from Nadav Amit and
   David Hildenbrand

 - zram support for multiple compression streams from Sergey Senozhatsky

 - David Hildenbrand reworked the GUP code's R/O long-term pinning so
   that drivers no longer need to use the FOLL_FORCE workaround which
   didn't work very well anyway

 - Mel Gorman altered the page allocator so that local IRQs can remnain
   enabled during per-cpu page allocations

 - Vishal Moola removed the try_to_release_page() wrapper

 - Stefan Roesch added some per-BDI sysfs tunables which are used to
   prevent network block devices from dirtying excessive amounts of
   pagecache

 - David Hildenbrand did some cleanup and repair work on KSM COW
   breaking

 - Nhat Pham and Johannes Weiner have implemented writeback in zswap's
   zsmalloc backend

 - Brian Foster has fixed a longstanding corner-case oddity in
   file[map]_write_and_wait_range()

 - sparse-vmemmap changes for MIPS, LoongArch and NIOS2 from Feiyang
   Chen

 - Shiyang Ruan has done some work on fsdax, to make its reflink mode
   work better under xfstests. Better, but still not perfect

 - Christoph Hellwig has removed the .writepage() method from several
   filesystems. They only need .writepages()

 - Yosry Ahmed wrote a series which fixes the memcg reclaim target
   beancounting

 - David Hildenbrand has fixed some of our MM selftests for 32-bit
   machines

 - Many singleton patches, as usual

* tag 'mm-stable-2022-12-13' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (313 commits)
  mm/hugetlb: set head flag before setting compound_order in __prep_compound_gigantic_folio
  mm: mmu_gather: allow more than one batch of delayed rmaps
  mm: fix typo in struct pglist_data code comment
  kmsan: fix memcpy tests
  mm: add cond_resched() in swapin_walk_pmd_entry()
  mm: do not show fs mm pc for VM_LOCKONFAULT pages
  selftests/vm: ksm_functional_tests: fixes for 32bit
  selftests/vm: cow: fix compile warning on 32bit
  selftests/vm: madv_populate: fix missing MADV_POPULATE_(READ|WRITE) definitions
  mm/gup_test: fix PIN_LONGTERM_TEST_READ with highmem
  mm,thp,rmap: fix races between updates of subpages_mapcount
  mm: memcg: fix swapcached stat accounting
  mm: add nodes= arg to memory.reclaim
  mm: disable top-tier fallback to reclaim on proactive reclaim
  selftests: cgroup: make sure reclaim target memcg is unprotected
  selftests: cgroup: refactor proactive reclaim code to reclaim_until()
  mm: memcg: fix stale protection of reclaim target memcg
  mm/mmap: properly unaccount memory on mas_preallocate() failure
  omfs: remove ->writepage
  jfs: remove ->writepage
  ...
2022-12-13 19:29:45 -08:00
Linus Torvalds
3ef3ace4e2 - Split MTRR and PAT init code to accomodate at least Xen PV and TDX
guests which do not get MTRRs exposed but only PAT. (TDX guests do not
 support the cache disabling dance when setting up MTRRs so they fall
 under the same category.) This is a cleanup work to remove all the ugly
 workarounds for such guests and init things separately (Juergen Gross)
 
 - Add two new Intel CPUs to the list of CPUs with "normal" Energy
 Performance Bias, leading to power savings
 
 - Do not do bus master arbitration in C3 (ARB_DISABLE) on modern Centaur
 CPUs
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmOYhIMACgkQEsHwGGHe
 VUpxug//ZKw3hYFroKhsULJi/e0j2nGARiSlJrJcFHl2vgh9yGvDsnYUyM/rgjgt
 cM3uCLbEG7nA6uhB3nupzaXZ8lBM1nU9kiEl/kjQ5oYf9nmJ48fLttvWGfxYN4s3
 kj5fYVhlOZpntQXIWrwxnPqghUysumMnZmBJeKYiYNNfkj62l3xU2Ni4Gnjnp02I
 9MmUhl7pj1aEyOQfM8rovy+wtYCg5WTOmXVlyVN+b9MwfYeK+stojvCZHxtJs9BD
 fezpJjjG+78xKUC7vVZXCh1p1N5Qvj014XJkVl9Hg0n7qizKFZRtqi8I769G2ptd
 exP8c2nDXKCqYzE8vK6ukWgDANQPs3d6Z7EqUKuXOCBF81PnMPSUMyNtQFGNM6Wp
 S5YSvFfCgUjp50IunOpvkDABgpM+PB8qeWUq72UFQJSOymzRJg/KXtE2X+qaMwtC
 0i6VLXfMddGcmqNKDppfGtCjq2W5VrNIIJedtAQQGyl+pl3XzZeNomhJpm/0mVfJ
 8UrlXZeXl/EUQ7qk40gC/Ash27pU9ZDx4CMNMy1jDIQqgufBjEoRIDSFqQlghmZq
 An5/BqMLhOMxUYNA7bRUnyeyxCBypetMdQt5ikBmVXebvBDmArXcuSNAdiy1uBFX
 KD8P3Y1AnsHIklxkLNyZRUy7fb4mgMFenUbgc0vmbYHbFl0C0pQ=
 =Zmgh
 -----END PGP SIGNATURE-----

Merge tag 'x86_cpu_for_v6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 cpu updates from Borislav Petkov:

 - Split MTRR and PAT init code to accomodate at least Xen PV and TDX
   guests which do not get MTRRs exposed but only PAT. (TDX guests do
   not support the cache disabling dance when setting up MTRRs so they
   fall under the same category)

   This is a cleanup work to remove all the ugly workarounds for such
   guests and init things separately (Juergen Gross)

 - Add two new Intel CPUs to the list of CPUs with "normal" Energy
   Performance Bias, leading to power savings

 - Do not do bus master arbitration in C3 (ARB_DISABLE) on modern
   Centaur CPUs

* tag 'x86_cpu_for_v6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (26 commits)
  x86/mtrr: Make message for disabled MTRRs more descriptive
  x86/pat: Handle TDX guest PAT initialization
  x86/cpuid: Carve out all CPUID functionality
  x86/cpu: Switch to cpu_feature_enabled() for X86_FEATURE_XENPV
  x86/cpu: Remove X86_FEATURE_XENPV usage in setup_cpu_entry_area()
  x86/cpu: Drop 32-bit Xen PV guest code in update_task_stack()
  x86/cpu: Remove unneeded 64-bit dependency in arch_enter_from_user_mode()
  x86/cpufeatures: Add X86_FEATURE_XENPV to disabled-features.h
  x86/acpi/cstate: Optimize ARB_DISABLE on Centaur CPUs
  x86/mtrr: Simplify mtrr_ops initialization
  x86/cacheinfo: Switch cache_ap_init() to hotplug callback
  x86: Decouple PAT and MTRR handling
  x86/mtrr: Add a stop_machine() handler calling only cache_cpu_init()
  x86/mtrr: Let cache_aps_delayed_init replace mtrr_aps_delayed_init
  x86/mtrr: Get rid of __mtrr_enabled bool
  x86/mtrr: Simplify mtrr_bp_init()
  x86/mtrr: Remove set_all callback from struct mtrr_ops
  x86/mtrr: Disentangle MTRR init from PAT init
  x86/mtrr: Move cache control code to cacheinfo.c
  x86/mtrr: Split MTRR-specific handling from cache dis/enabling
  ...
2022-12-13 14:56:56 -08:00
Linus Torvalds
268325bda5 Random number generator updates for Linux 6.2-rc1.
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEq5lC5tSkz8NBJiCnSfxwEqXeA64FAmOU+U8ACgkQSfxwEqXe
 A67NnQ//Y5DltmvibyPd7r1TFT2gUYv+Rx3sUV9ZE1NYptd/SWhhcL8c5FZ70Fuw
 bSKCa1uiWjOxosjXT1kGrWq3de7q7oUpAPSOGxgxzoaNURIt58N/ajItCX/4Au8I
 RlGAScHy5e5t41/26a498kB6qJ441fBEqCYKQpPLINMBAhe8TQ+NVp0rlpUwNHFX
 WrUGg4oKWxdBIW3HkDirQjJWDkkAiklRTifQh/Al4b6QDbOnRUGGCeckNOhixsvS
 waHWTld+Td8jRrA4b82tUb2uVZ2/b8dEvj/A8CuTv4yC0lywoyMgBWmJAGOC+UmT
 ZVNdGW02Jc2T+Iap8ZdsEmeLHNqbli4+IcbY5xNlov+tHJ2oz41H9TZoYKbudlr6
 /ReAUPSn7i50PhbQlEruj3eg+M2gjOeh8OF8UKwwRK8PghvyWQ1ScW0l3kUhPIhI
 PdIG6j4+D2mJc1FIj2rTVB+Bg933x6S+qx4zDxGlNp62AARUFYf6EgyD6aXFQVuX
 RxcKb6cjRuFkzFiKc8zkqg5edZH+IJcPNuIBmABqTGBOxbZWURXzIQvK/iULqZa4
 CdGAFIs6FuOh8pFHLI3R4YoHBopbHup/xKDEeAO9KZGyeVIuOSERDxxo5f/ITzcq
 APvT77DFOEuyvanr8RMqqh0yUjzcddXqw9+ieufsAyDwjD9DTuE=
 =QRhK
 -----END PGP SIGNATURE-----

Merge tag 'random-6.2-rc1-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/crng/random

Pull random number generator updates from Jason Donenfeld:

 - Replace prandom_u32_max() and various open-coded variants of it,
   there is now a new family of functions that uses fast rejection
   sampling to choose properly uniformly random numbers within an
   interval:

       get_random_u32_below(ceil) - [0, ceil)
       get_random_u32_above(floor) - (floor, U32_MAX]
       get_random_u32_inclusive(floor, ceil) - [floor, ceil]

   Coccinelle was used to convert all current users of
   prandom_u32_max(), as well as many open-coded patterns, resulting in
   improvements throughout the tree.

   I'll have a "late" 6.1-rc1 pull for you that removes the now unused
   prandom_u32_max() function, just in case any other trees add a new
   use case of it that needs to converted. According to linux-next,
   there may be two trivial cases of prandom_u32_max() reintroductions
   that are fixable with a 's/.../.../'. So I'll have for you a final
   conversion patch doing that alongside the removal patch during the
   second week.

   This is a treewide change that touches many files throughout.

 - More consistent use of get_random_canary().

 - Updates to comments, documentation, tests, headers, and
   simplification in configuration.

 - The arch_get_random*_early() abstraction was only used by arm64 and
   wasn't entirely useful, so this has been replaced by code that works
   in all relevant contexts.

 - The kernel will use and manage random seeds in non-volatile EFI
   variables, refreshing a variable with a fresh seed when the RNG is
   initialized. The RNG GUID namespace is then hidden from efivarfs to
   prevent accidental leakage.

   These changes are split into random.c infrastructure code used in the
   EFI subsystem, in this pull request, and related support inside of
   EFISTUB, in Ard's EFI tree. These are co-dependent for full
   functionality, but the order of merging doesn't matter.

 - Part of the infrastructure added for the EFI support is also used for
   an improvement to the way vsprintf initializes its siphash key,
   replacing an sleep loop wart.

 - The hardware RNG framework now always calls its correct random.c
   input function, add_hwgenerator_randomness(), rather than sometimes
   going through helpers better suited for other cases.

 - The add_latent_entropy() function has long been called from the fork
   handler, but is a no-op when the latent entropy gcc plugin isn't
   used, which is fine for the purposes of latent entropy.

   But it was missing out on the cycle counter that was also being mixed
   in beside the latent entropy variable. So now, if the latent entropy
   gcc plugin isn't enabled, add_latent_entropy() will expand to a call
   to add_device_randomness(NULL, 0), which adds a cycle counter,
   without the absent latent entropy variable.

 - The RNG is now reseeded from a delayed worker, rather than on demand
   when used. Always running from a worker allows it to make use of the
   CPU RNG on platforms like S390x, whose instructions are too slow to
   do so from interrupts. It also has the effect of adding in new inputs
   more frequently with more regularity, amounting to a long term
   transcript of random values. Plus, it helps a bit with the upcoming
   vDSO implementation (which isn't yet ready for 6.2).

 - The jitter entropy algorithm now tries to execute on many different
   CPUs, round-robining, in hopes of hitting even more memory latencies
   and other unpredictable effects. It also will mix in a cycle counter
   when the entropy timer fires, in addition to being mixed in from the
   main loop, to account more explicitly for fluctuations in that timer
   firing. And the state it touches is now kept within the same cache
   line, so that it's assured that the different execution contexts will
   cause latencies.

* tag 'random-6.2-rc1-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/crng/random: (23 commits)
  random: include <linux/once.h> in the right header
  random: align entropy_timer_state to cache line
  random: mix in cycle counter when jitter timer fires
  random: spread out jitter callback to different CPUs
  random: remove extraneous period and add a missing one in comments
  efi: random: refresh non-volatile random seed when RNG is initialized
  vsprintf: initialize siphash key using notifier
  random: add back async readiness notifier
  random: reseed in delayed work rather than on-demand
  random: always mix cycle counter in add_latent_entropy()
  hw_random: use add_hwgenerator_randomness() for early entropy
  random: modernize documentation comment on get_random_bytes()
  random: adjust comment to account for removed function
  random: remove early archrandom abstraction
  random: use random.trust_{bootloader,cpu} command line option only
  stackprotector: actually use get_random_canary()
  stackprotector: move get_random_canary() into stackprotector.h
  treewide: use get_random_u32_inclusive() when possible
  treewide: use get_random_u32_{above,below}() instead of manual loop
  treewide: use get_random_u32_below() instead of deprecated function
  ...
2022-12-12 16:22:22 -08:00
Linus Torvalds
c1f0fcd85d cxl for 6.2
- Add the cpu_cache_invalidate_memregion() API for cache flushing in
   response to physical memory reconfiguration, or memory-side data
   invalidation from operations like secure erase or memory-device unlock.
 
 - Add a facility for the kernel to warn about collisions between kernel
   and userspace access to PCI configuration registers
 
 - Add support for Restricted CXL Host (RCH) topologies (formerly CXL 1.1)
 
 - Add handling and reporting of CXL errors reported via the PCIe AER
   mechanism
 
 - Add support for CXL Persistent Memory Security commands
 
 - Add support for the "XOR" algorithm for CXL host bridge interleave
 
 - Rework / simplify CXL to NVDIMM interactions
 
 - Miscellaneous cleanups and fixes
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQSbo+XnGs+rwLz9XGXfioYZHlFsZwUCY5UpyAAKCRDfioYZHlFs
 Z0ttAP4uxCjIibKsFVyexpSgI4vaZqQ9yt9NesmPwonc0XookwD+PlwP6Xc0d0Ox
 t0gJ6+pwdh11NRzhcNE1pAaPcJZU4gs=
 =HAQk
 -----END PGP SIGNATURE-----

Merge tag 'cxl-for-6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl

Pull cxl updates from Dan Williams:
 "Compute Express Link (CXL) updates for 6.2.

  While it may seem backwards, the CXL update this time around includes
  some focus on CXL 1.x enabling where the work to date had been with
  CXL 2.0 (VH topologies) in mind.

  First generation CXL can mostly be supported via BIOS, similar to DDR,
  however it became clear there are use cases for OS native CXL error
  handling and some CXL 3.0 endpoint features can be deployed on CXL 1.x
  hosts (Restricted CXL Host (RCH) topologies). So, this update brings
  RCH topologies into the Linux CXL device model.

  In support of the ongoing CXL 2.0+ enabling two new core kernel
  facilities are added.

  One is the ability for the kernel to flag collisions between userspace
  access to PCI configuration registers and kernel accesses. This is
  brought on by the PCIe Data-Object-Exchange (DOE) facility, a hardware
  mailbox over config-cycles.

  The other is a cpu_cache_invalidate_memregion() API that maps to
  wbinvd_on_all_cpus() on x86. To prevent abuse it is disabled in guest
  VMs and architectures that do not support it yet. The CXL paths that
  need it, dynamic memory region creation and security commands (erase /
  unlock), are disabled when it is not present.

  As for the CXL 2.0+ this cycle the subsystem gains support Persistent
  Memory Security commands, error handling in response to PCIe AER
  notifications, and support for the "XOR" host bridge interleave
  algorithm.

  Summary:

   - Add the cpu_cache_invalidate_memregion() API for cache flushing in
     response to physical memory reconfiguration, or memory-side data
     invalidation from operations like secure erase or memory-device
     unlock.

   - Add a facility for the kernel to warn about collisions between
     kernel and userspace access to PCI configuration registers

   - Add support for Restricted CXL Host (RCH) topologies (formerly CXL
     1.1)

   - Add handling and reporting of CXL errors reported via the PCIe AER
     mechanism

   - Add support for CXL Persistent Memory Security commands

   - Add support for the "XOR" algorithm for CXL host bridge interleave

   - Rework / simplify CXL to NVDIMM interactions

   - Miscellaneous cleanups and fixes"

* tag 'cxl-for-6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl: (71 commits)
  cxl/region: Fix memdev reuse check
  cxl/pci: Remove endian confusion
  cxl/pci: Add some type-safety to the AER trace points
  cxl/security: Drop security command ioctl uapi
  cxl/mbox: Add variable output size validation for internal commands
  cxl/mbox: Enable cxl_mbox_send_cmd() users to validate output size
  cxl/security: Fix Get Security State output payload endian handling
  cxl: update names for interleave ways conversion macros
  cxl: update names for interleave granularity conversion macros
  cxl/acpi: Warn about an invalid CHBCR in an existing CHBS entry
  tools/testing/cxl: Require cache invalidation bypass
  cxl/acpi: Fail decoder add if CXIMS for HBIG is missing
  cxl/region: Fix spelling mistake "memergion" -> "memregion"
  cxl/regs: Fix sparse warning
  cxl/acpi: Set ACPI's CXL _OSC to indicate RCD mode support
  tools/testing/cxl: Add an RCH topology
  cxl/port: Add RCD endpoint port enumeration
  cxl/mem: Move devm_cxl_add_endpoint() from cxl_core to cxl_mem
  tools/testing/cxl: Add XOR Math support to cxl_test
  cxl/acpi: Support CXL XOR Interleave Math (CXIMS)
  ...
2022-12-12 13:55:31 -08:00
Steven Rostedt (Google)
3e12758392 x86/mm/kmmio: Remove redundant preempt_disable()
Now that kmmio uses rcu_read_lock_sched_notrace() there's no reason to
call preempt_disable() as the read_lock_sched_notrace() already does that
and is redundant.

This also removes the preempt_enable_no_resched() as the "no_resched()"
portion was bogus as there's no reason to do that.

Link: https://lkml.kernel.org/r/20221212103703.7129cc5d@gandalf.local.home

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "x86@kernel.org" <x86@kernel.org>
Cc: Karol Herbst <karolherbst@gmail.com>
Cc: Pekka Paalanen <ppaalanen@gmail.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2022-12-12 10:54:48 -05:00
Feiyang Chen
2045a3b891 mm/sparse-vmemmap: generalise vmemmap_populate_hugepages()
Generalise vmemmap_populate_hugepages() so ARM64 & X86 & LoongArch can
share its implementation.

Link: https://lkml.kernel.org/r/20221027125253.3458989-4-chenhuacai@loongson.cn
Signed-off-by: Feiyang Chen <chenfeiyang@loongson.cn>
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
Acked-by: Will Deacon <will@kernel.org>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dinh Nguyen <dinguyen@kernel.org>
Cc: Guo Ren <guoren@kernel.org>
Cc: Jiaxun Yang <jiaxun.yang@flygoat.com>
Cc: Min Zhou <zhoumin@loongson.cn>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Philippe Mathieu-Daudé <philmd@linaro.org>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Xuefeng Li <lixuefeng@loongson.cn>
Cc: Xuerui Wang <kernel@xen0n.name>
Cc: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-12-11 18:12:12 -08:00
Steven Rostedt
20fb6c9976 x86/mm/kmmio: Use rcu_read_lock_sched_notrace()
The mmiotrace tracer is "special". The purpose is to help reverse engineer
binary drivers by removing the memory allocated by the driver and when the
driver goes to access it, a fault occurs, the mmiotracer will record what
the driver was doing and then do the work on its behalf by single stepping
through the process.

But to achieve this ability, it must do some special things. One is to
take the rcu_read_lock() when the fault occurs, and then release it in the
breakpoint that is single stepping. This makes lockdep unhappy, as it
changes the state of RCU from within an exception that is not contained in
that exception, and we get a nasty splat from lockdep.

Instead, switch to rcu_read_lock_sched_notrace() as the RCU sched variant
has the same grace period as normal RCU. This is basically the same as
rcu_read_lock() but does not make lockdep complain about it.

Note, the preempt_disable() is still needed as it uses preempt_enable_no_resched().

Link: https://lore.kernel.org/linux-trace-kernel/20221209134144.04f33626@gandalf.local.home

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Karol Herbst <karolherbst@gmail.com>
Cc: Pekka Paalanen <ppaalanen@gmail.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Acked-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2022-12-10 13:35:46 -05:00
Steven Rostedt
4994e387d7 x86/mm/kmmio: Switch to arch_spin_lock()
The mmiotrace tracer is "special". The purpose is to help reverse engineer
binary drivers by removing the memory allocated by the driver and when the
driver goes to access it, a fault occurs, the mmiotracer will record what
the driver was doing and then do the work on its behalf by single stepping
through the process.

But to achieve this ability, it must do some special things. One is it
needs to grab a lock while in the breakpoint handler. This is considered
an NMI state, and then lockdep warns that the lock is being held in both
an NMI state (really a breakpoint handler) and also in normal context.

As the breakpoint/NMI state only happens when the driver is accessing
memory, there's no concern of a race condition against the setup and
tear-down of mmiotracer.

To make lockdep and mmiotrace work together, convert the locks used in the
breakpoint handler into arch_spin_lock().

Link: https://lkml.kernel.org/r/20221206191229.656244029@goodmis.org
Link: https://lore.kernel.org/lkml/20221201213126.620b7dd3@gandalf.local.home/
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Karol Herbst <karolherbst@gmail.com>
Cc: Pekka Paalanen <ppaalanen@gmail.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2022-12-09 23:48:06 -05:00
Juergen Gross
c11ca45441 x86/pat: Handle TDX guest PAT initialization
With the decoupling of PAT and MTRR initialization, PAT will be used
even with MTRRs disabled. This seems to break booting up as TDX guest,
as the recommended sequence to set the PAT MSR across CPUs can't work
in TDX guests due to disabling caches via setting CR0.CD isn't allowed
in TDX mode.

This is an inconsistency in the Intel documentation between the SDM
and the TDX specification. For now handle TDX mode the same way as Xen
PV guest mode by just accepting the current PAT MSR setting without
trying to modify it.

  [ bp: Align conditions for better readability. ]

Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20221205080433.16643-2-jgross@suse.com
2022-12-05 11:03:27 +01:00
Gustavo A. R. Silva
25226df4b9 mm/pgtable: Fix multiple -Wstringop-overflow warnings
The actual size of the following arrays at run-time depends on
CONFIG_X86_PAE.

427         pmd_t *u_pmds[MAX_PREALLOCATED_USER_PMDS];
428         pmd_t *pmds[MAX_PREALLOCATED_PMDS];

If CONFIG_X86_PAE is not enabled, their final size will be zero (which
is technically not a legal storage size in C, but remains "valid" via
the GNU extension). In that case, the compiler complains about trying to
access objects of size zero when calling functions where these objects
are passed as arguments.

Fix this by sanity-checking the size of those arrays just before the
function calls. Also, the following warnings are fixed by these changes
when building with GCC 11+ and -Wstringop-overflow enabled:

arch/x86/mm/pgtable.c:437:13: warning: ‘preallocate_pmds.constprop’ accessing 8 bytes in a region of size 0 [-Wstringop-overflow=]
arch/x86/mm/pgtable.c:440:13: warning: ‘preallocate_pmds.constprop’ accessing 8 bytes in a region of size 0 [-Wstringop-overflow=]
arch/x86/mm/pgtable.c:462:9: warning: ‘free_pmds.constprop’ accessing 8 bytes in a region of size 0 [-Wstringop-overflow=]
arch/x86/mm/pgtable.c:455:9: warning: ‘pgd_prepopulate_user_pmd’ accessing 8 bytes in a region of size 0 [-Wstringop-overflow=]
arch/x86/mm/pgtable.c:464:9: warning: ‘free_pmds.constprop’ accessing 8 bytes in a region of size 0 [-Wstringop-overflow=]

This is one of the last cases in the ongoing effort to globally enable
-Wstringop-overflow.

The alternative to this is to make the originally suggested change:
make the pmds argument from an array pointer to a pointer pointer. That
situation is considered "legal" for C in the sense that it does not have
a way to reason about the storage. i.e.:

-static void pgd_prepopulate_pmd(struct mm_struct *mm, pgd_t *pgd, pmd_t *pmds[])
+static void pgd_prepopulate_pmd(struct mm_struct *mm, pgd_t *pgd, pmd_t **pmds)

With the above change, there's no difference in binary output, and the
compiler warning is silenced.

However, with this patch, the compiler can actually figure out that it
isn't using the code at all, and it gets dropped:

   text    data     bss     dec     hex filename
   8218     718      32    8968    2308 arch/x86/mm/pgtable.o.before
   7765     694      32    8491    212b arch/x86/mm/pgtable.o.after

So this case (fixing a warning and reducing image size) is a clear win.

Additionally drops an old work-around for GCC in the same code.

Link: https://github.com/KSPP/linux/issues/203
Link: https://github.com/KSPP/linux/issues/181
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/Yytb67xvrnctxnEe@work
2022-12-01 08:50:38 -08:00
Andrew Morton
a38358c934 Merge branch 'mm-hotfixes-stable' into mm-stable 2022-11-30 14:58:42 -08:00
Juergen Gross
d76c4f7a61 x86/cpu: Remove X86_FEATURE_XENPV usage in setup_cpu_entry_area()
Testing of X86_FEATURE_XENPV in setup_cpu_entry_area() can be removed,
as this code path is 32-bit only, and Xen PV guests are 64-bit only.

Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lore.kernel.org/r/20221104072701.20283-5-jgross@suse.com
2022-11-22 16:16:25 +01:00
Michael Kelley
4dbd6a3e90 x86/ioremap: Fix page aligned size calculation in __ioremap_caller()
Current code re-calculates the size after aligning the starting and
ending physical addresses on a page boundary. But the re-calculation
also embeds the masking of high order bits that exceed the size of
the physical address space (via PHYSICAL_PAGE_MASK). If the masking
removes any high order bits, the size calculation results in a huge
value that is likely to immediately fail.

Fix this by re-calculating the page-aligned size first. Then mask any
high order bits using PHYSICAL_PAGE_MASK.

Fixes: ffa71f33a8 ("x86, ioremap: Fix incorrect physical address handling in PAE mode")
Signed-off-by: Michael Kelley <mikelley@microsoft.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: <stable@kernel.org>
Link: https://lore.kernel.org/r/1668624097-14884-2-git-send-email-mikelley@microsoft.com
2022-11-22 12:21:16 +01:00
Jason A. Donenfeld
8032bf1233 treewide: use get_random_u32_below() instead of deprecated function
This is a simple mechanical transformation done by:

@@
expression E;
@@
- prandom_u32_max
+ get_random_u32_below
  (E)

Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Acked-by: Darrick J. Wong <djwong@kernel.org> # for xfs
Reviewed-by: SeongJae Park <sj@kernel.org> # for damon
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> # for infiniband
Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> # for arm
Acked-by: Ulf Hansson <ulf.hansson@linaro.org> # for mmc
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2022-11-18 02:15:15 +01:00
Davidlohr Bueso
1156b4418d memregion: Add cpu_cache_invalidate_memregion() interface
With CXL security features, and CXL dynamic provisioning, global CPU
cache flushing nvdimm requirements are no longer specific to that
subsystem, even beyond the scope of security_ops. CXL will need such
semantics for features not necessarily limited to persistent memory.

The functionality this is enabling is to be able to instantaneously
secure erase potentially terabytes of memory at once and the kernel
needs to be sure that none of the data from before the erase is still
present in the cache. It is also used when unlocking a memory device
where speculative reads and firmware accesses could have cached poison
from before the device was unlocked. Lastly this facility is used when
mapping new devices, or new capacity into an established physical
address range. I.e. when the driver switches DeviceA mapping AddressX to
DeviceB mapping AddressX then any cached data from DeviceA:AddressX
needs to be invalidated.

This capability is typically only used once per-boot (for unlock), or
once per bare metal provisioning event (secure erase), like when handing
off the system to another tenant or decommissioning a device. It may
also be used for dynamic CXL region provisioning.

Users must first call cpu_cache_has_invalidate_memregion() to know
whether this functionality is available on the architecture. On x86 this
respects the constraints of when wbinvd() is tolerable. It is already
the case that wbinvd() is problematic to allow in VMs due its global
performance impact and KVM, for example, has been known to just trap and
ignore the call. With confidential computing guest execution of wbinvd()
may even trigger an exception. Given guests should not be messing with
the bare metal address map via CXL configuration changes
cpu_cache_has_invalidate_memregion() returns false in VMs.

While this global cache invalidation facility, is exported to modules,
since NVDIMM and CXL support can be built as a module, it is not for
general use. The intent is that this facility is not available outside
of specific "device-memory" use cases. To make that expectation as clear
as possible the API is scoped to a new "DEVMEM" module namespace that
only the NVDIMM and CXL subsystems are expected to import.

Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: x86@kernel.org
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Tested-by: Dave Jiang <dave.jiang@intel.com>
Signed-off-by: Davidlohr Bueso <dave@stgolabs.net>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Co-developed-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2022-11-14 10:07:22 -08:00
Juergen Gross
adfe7512e1 x86: Decouple PAT and MTRR handling
Today, PAT is usable only with MTRR being active, with some nasty tweaks
to make PAT usable when running as a Xen PV guest which doesn't support
MTRR.

The reason for this coupling is that both PAT MSR changes and MTRR
changes require a similar sequence and so full PAT support was added
using the already available MTRR handling.

Xen PV PAT handling can work without MTRR, as it just needs to consume
the PAT MSR setting done by the hypervisor without the ability and need
to change it. This in turn has resulted in a convoluted initialization
sequence and wrong decisions regarding cache mode availability due to
misguiding PAT availability flags.

Fix all of that by allowing to use PAT without MTRR and by reworking
the current PAT initialization sequence to match better with the newly
introduced generic cache initialization.

This removes the need of the recently added pat_force_disabled flag, so
remove the remnants of the patch adding it.

Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20221102074713.21493-14-jgross@suse.com
Signed-off-by: Borislav Petkov <bp@suse.de>
2022-11-10 13:12:45 +01:00
Kefeng Wang
e025ab842e mm: remove kern_addr_valid() completely
Most architectures (except arm64/x86/sparc) simply return 1 for
kern_addr_valid(), which is only used in read_kcore(), and it calls
copy_from_kernel_nofault() which could check whether the address is a
valid kernel address.  So as there is no need for kern_addr_valid(), let's
remove it.

Link: https://lkml.kernel.org/r/20221018074014.185687-1-wangkefeng.wang@huawei.com
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>	[m68k]
Acked-by: Heiko Carstens <hca@linux.ibm.com>		[s390]
Acked-by: Christoph Hellwig <hch@lst.de>
Acked-by: Helge Deller <deller@gmx.de>			[parisc]
Acked-by: Michael Ellerman <mpe@ellerman.id.au>		[powerpc]
Acked-by: Guo Ren <guoren@kernel.org>			[csky]
Acked-by: Catalin Marinas <catalin.marinas@arm.com>	[arm64]
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Anton Ivanov <anton.ivanov@cambridgegreys.com>
Cc: <aou@eecs.berkeley.edu>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Chris Zankel <chris@zankel.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Dinh Nguyen <dinguyen@kernel.org>
Cc: Greg Ungerer <gerg@linux-m68k.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Huacai Chen <chenhuacai@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Cc: Johannes Berg <johannes@sipsolutions.net>
Cc: Jonas Bonn <jonas@southpole.se>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Palmer Dabbelt <palmer@rivosinc.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Richard Henderson <richard.henderson@linaro.org>
Cc: Richard Weinberger <richard@nod.at>
Cc: Rich Felker <dalias@libc.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Stafford Horne <shorne@gmail.com>
Cc: Stefan Kristiansson <stefan.kristiansson@saunalahti.fi>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vineet Gupta <vgupta@kernel.org>
Cc: Will Deacon <will@kernel.org>
Cc: Xuerui Wang <kernel@xen0n.name>
Cc: Yoshinori Sato <ysato@users.osdn.me>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-11-08 17:37:18 -08:00
Naoya Horiguchi
1fdbed657a arch/x86/mm/hugetlbpage.c: pud_huge() returns 0 when using 2-level paging
The following bug is reported to be triggered when starting X on x86-32
system with i915:

  [  225.777375] kernel BUG at mm/memory.c:2664!
  [  225.777391] invalid opcode: 0000 [#1] PREEMPT SMP
  [  225.777405] CPU: 0 PID: 2402 Comm: Xorg Not tainted 6.1.0-rc3-bdg+ #86
  [  225.777415] Hardware name:  /8I865G775-G, BIOS F1 08/29/2006
  [  225.777421] EIP: __apply_to_page_range+0x24d/0x31c
  [  225.777437] Code: ff ff 8b 55 e8 8b 45 cc e8 0a 11 ec ff 89 d8 83 c4 28 5b 5e 5f 5d c3 81 7d e0 a0 ef 96 c1 74 ad 8b 45 d0 e8 2d 83 49 00 eb a3 <0f> 0b 25 00 f0 ff ff 81 eb 00 00 00 40 01 c3 8b 45 ec 8b 00 e8 76
  [  225.777446] EAX: 00000001 EBX: c53a3b58 ECX: b5c00000 EDX: c258aa00
  [  225.777454] ESI: b5c00000 EDI: b5900000 EBP: c4b0fdb4 ESP: c4b0fd80
  [  225.777462] DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068 EFLAGS: 00010202
  [  225.777470] CR0: 80050033 CR2: b5900000 CR3: 053a3000 CR4: 000006d0
  [  225.777479] Call Trace:
  [  225.777486]  ? i915_memcpy_init_early+0x63/0x63 [i915]
  [  225.777684]  apply_to_page_range+0x21/0x27
  [  225.777694]  ? i915_memcpy_init_early+0x63/0x63 [i915]
  [  225.777870]  remap_io_mapping+0x49/0x75 [i915]
  [  225.778046]  ? i915_memcpy_init_early+0x63/0x63 [i915]
  [  225.778220]  ? mutex_unlock+0xb/0xd
  [  225.778231]  ? i915_vma_pin_fence+0x6d/0xf7 [i915]
  [  225.778420]  vm_fault_gtt+0x2a9/0x8f1 [i915]
  [  225.778644]  ? lock_is_held_type+0x56/0xe7
  [  225.778655]  ? lock_is_held_type+0x7a/0xe7
  [  225.778663]  ? 0xc1000000
  [  225.778670]  __do_fault+0x21/0x6a
  [  225.778679]  handle_mm_fault+0x708/0xb21
  [  225.778686]  ? mt_find+0x21e/0x5ae
  [  225.778696]  exc_page_fault+0x185/0x705
  [  225.778704]  ? doublefault_shim+0x127/0x127
  [  225.778715]  handle_exception+0x130/0x130
  [  225.778723] EIP: 0xb700468a

Recently pud_huge() got aware of non-present entry by commit 3a194f3f8a
("mm/hugetlb: make pud_huge() and follow_huge_pud() aware of non-present
pud entry") to handle some special states of gigantic page.  However, it's
overlooked that pud_none() always returns false when running with 2-level
paging, and as a result pud_huge() can return true pointlessly.

Introduce "#if CONFIG_PGTABLE_LEVELS > 2" to pud_huge() to deal with this.

Link: https://lkml.kernel.org/r/20221107021010.2449306-1-naoya.horiguchi@linux.dev
Fixes: 3a194f3f8a ("mm/hugetlb: make pud_huge() and follow_huge_pud() aware of non-present pud entry")
Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Reported-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
Tested-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Liu Shixin <liushixin2@huawei.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-11-08 15:57:25 -08:00
Steven Rostedt (Google)
a970174d7a x86/mm: Do not verify W^X at boot up
Adding on the kernel command line "ftrace=function" triggered:

  CPA detected W^X violation: 8000000000000063 -> 0000000000000063 range: 0xffffffffc0013000 - 0xffffffffc0013fff PFN 10031b
  WARNING: CPU: 0 PID: 0 at arch/x86/mm/pat/set_memory.c:609
  verify_rwx+0x61/0x6d
  Call Trace:
     __change_page_attr_set_clr+0x146/0x8a6
     change_page_attr_set_clr+0x135/0x268
     change_page_attr_clear.constprop.0+0x16/0x1c
     set_memory_x+0x2c/0x32
     arch_ftrace_update_trampoline+0x218/0x2db
     ftrace_update_trampoline+0x16/0xa1
     __register_ftrace_function+0x93/0xb2
     ftrace_startup+0x21/0xf0
     register_ftrace_function_nolock+0x26/0x40
     register_ftrace_function+0x4e/0x143
     function_trace_init+0x7d/0xc3
     tracer_init+0x23/0x2c
     tracing_set_tracer+0x1d5/0x206
     register_tracer+0x1c0/0x1e4
     init_function_trace+0x90/0x96
     early_trace_init+0x25c/0x352
     start_kernel+0x424/0x6e4
     x86_64_start_reservations+0x24/0x2a
     x86_64_start_kernel+0x8c/0x95
     secondary_startup_64_no_verify+0xe0/0xeb

This is because at boot up, kernel text is writable, and there's no
reason to do tricks to updated it.  But the verifier does not
distinguish updates at boot up and at run time, and causes a warning at
time of boot.

Add a check for system_state == SYSTEM_BOOTING and allow it if that is
the case.

[ These SYSTEM_BOOTING special cases are all pretty horrid, but the x86
  text_poke() code does some odd things at bootup, forcing this for now
    - Linus ]

Link: https://lore.kernel.org/r/20221024112730.180916b3@gandalf.local.home
Fixes: 652c5bf380 ("x86/mm: Refuse W^X violations")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-10-24 18:05:27 -07:00
Jason A. Donenfeld
81895a65ec treewide: use prandom_u32_max() when possible, part 1
Rather than incurring a division or requesting too many random bytes for
the given range, use the prandom_u32_max() function, which only takes
the minimum required bytes from the RNG and avoids divisions. This was
done mechanically with this coccinelle script:

@basic@
expression E;
type T;
identifier get_random_u32 =~ "get_random_int|prandom_u32|get_random_u32";
typedef u64;
@@
(
- ((T)get_random_u32() % (E))
+ prandom_u32_max(E)
|
- ((T)get_random_u32() & ((E) - 1))
+ prandom_u32_max(E * XXX_MAKE_SURE_E_IS_POW2)
|
- ((u64)(E) * get_random_u32() >> 32)
+ prandom_u32_max(E)
|
- ((T)get_random_u32() & ~PAGE_MASK)
+ prandom_u32_max(PAGE_SIZE)
)

@multi_line@
identifier get_random_u32 =~ "get_random_int|prandom_u32|get_random_u32";
identifier RAND;
expression E;
@@

-       RAND = get_random_u32();
        ... when != RAND
-       RAND %= (E);
+       RAND = prandom_u32_max(E);

// Find a potential literal
@literal_mask@
expression LITERAL;
type T;
identifier get_random_u32 =~ "get_random_int|prandom_u32|get_random_u32";
position p;
@@

        ((T)get_random_u32()@p & (LITERAL))

// Add one to the literal.
@script:python add_one@
literal << literal_mask.LITERAL;
RESULT;
@@

value = None
if literal.startswith('0x'):
        value = int(literal, 16)
elif literal[0] in '123456789':
        value = int(literal, 10)
if value is None:
        print("I don't know how to handle %s" % (literal))
        cocci.include_match(False)
elif value == 2**32 - 1 or value == 2**31 - 1 or value == 2**24 - 1 or value == 2**16 - 1 or value == 2**8 - 1:
        print("Skipping 0x%x for cleanup elsewhere" % (value))
        cocci.include_match(False)
elif value & (value + 1) != 0:
        print("Skipping 0x%x because it's not a power of two minus one" % (value))
        cocci.include_match(False)
elif literal.startswith('0x'):
        coccinelle.RESULT = cocci.make_expr("0x%x" % (value + 1))
else:
        coccinelle.RESULT = cocci.make_expr("%d" % (value + 1))

// Replace the literal mask with the calculated result.
@plus_one@
expression literal_mask.LITERAL;
position literal_mask.p;
expression add_one.RESULT;
identifier FUNC;
@@

-       (FUNC()@p & (LITERAL))
+       prandom_u32_max(RESULT)

@collapse_ret@
type T;
identifier VAR;
expression E;
@@

 {
-       T VAR;
-       VAR = (E);
-       return VAR;
+       return E;
 }

@drop_var@
type T;
identifier VAR;
@@

 {
-       T VAR;
        ... when != VAR
 }

Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Yury Norov <yury.norov@gmail.com>
Reviewed-by: KP Singh <kpsingh@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz> # for ext4 and sbitmap
Reviewed-by: Christoph Böhmwalder <christoph.boehmwalder@linbit.com> # for drbd
Acked-by: Jakub Kicinski <kuba@kernel.org>
Acked-by: Heiko Carstens <hca@linux.ibm.com> # for s390
Acked-by: Ulf Hansson <ulf.hansson@linaro.org> # for mmc
Acked-by: Darrick J. Wong <djwong@kernel.org> # for xfs
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2022-10-11 17:42:55 -06:00
Linus Torvalds
27bc50fc90 - Yu Zhao's Multi-Gen LRU patches are here. They've been under test in
linux-next for a couple of months without, to my knowledge, any negative
   reports (or any positive ones, come to that).
 
 - Also the Maple Tree from Liam R.  Howlett.  An overlapping range-based
   tree for vmas.  It it apparently slight more efficient in its own right,
   but is mainly targeted at enabling work to reduce mmap_lock contention.
 
   Liam has identified a number of other tree users in the kernel which
   could be beneficially onverted to mapletrees.
 
   Yu Zhao has identified a hard-to-hit but "easy to fix" lockdep splat
   (https://lkml.kernel.org/r/CAOUHufZabH85CeUN-MEMgL8gJGzJEWUrkiM58JkTbBhh-jew0Q@mail.gmail.com).
   This has yet to be addressed due to Liam's unfortunately timed
   vacation.  He is now back and we'll get this fixed up.
 
 - Dmitry Vyukov introduces KMSAN: the Kernel Memory Sanitizer.  It uses
   clang-generated instrumentation to detect used-unintialized bugs down to
   the single bit level.
 
   KMSAN keeps finding bugs.  New ones, as well as the legacy ones.
 
 - Yang Shi adds a userspace mechanism (madvise) to induce a collapse of
   memory into THPs.
 
 - Zach O'Keefe has expanded Yang Shi's madvise(MADV_COLLAPSE) to support
   file/shmem-backed pages.
 
 - userfaultfd updates from Axel Rasmussen
 
 - zsmalloc cleanups from Alexey Romanov
 
 - cleanups from Miaohe Lin: vmscan, hugetlb_cgroup, hugetlb and memory-failure
 
 - Huang Ying adds enhancements to NUMA balancing memory tiering mode's
   page promotion, with a new way of detecting hot pages.
 
 - memcg updates from Shakeel Butt: charging optimizations and reduced
   memory consumption.
 
 - memcg cleanups from Kairui Song.
 
 - memcg fixes and cleanups from Johannes Weiner.
 
 - Vishal Moola provides more folio conversions
 
 - Zhang Yi removed ll_rw_block() :(
 
 - migration enhancements from Peter Xu
 
 - migration error-path bugfixes from Huang Ying
 
 - Aneesh Kumar added ability for a device driver to alter the memory
   tiering promotion paths.  For optimizations by PMEM drivers, DRM
   drivers, etc.
 
 - vma merging improvements from Jakub Matěn.
 
 - NUMA hinting cleanups from David Hildenbrand.
 
 - xu xin added aditional userspace visibility into KSM merging activity.
 
 - THP & KSM code consolidation from Qi Zheng.
 
 - more folio work from Matthew Wilcox.
 
 - KASAN updates from Andrey Konovalov.
 
 - DAMON cleanups from Kaixu Xia.
 
 - DAMON work from SeongJae Park: fixes, cleanups.
 
 - hugetlb sysfs cleanups from Muchun Song.
 
 - Mike Kravetz fixes locking issues in hugetlbfs and in hugetlb core.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCY0HaPgAKCRDdBJ7gKXxA
 joPjAQDZ5LlRCMWZ1oxLP2NOTp6nm63q9PWcGnmY50FjD/dNlwEAnx7OejCLWGWf
 bbTuk6U2+TKgJa4X7+pbbejeoqnt5QU=
 =xfWx
 -----END PGP SIGNATURE-----

Merge tag 'mm-stable-2022-10-08' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull MM updates from Andrew Morton:

 - Yu Zhao's Multi-Gen LRU patches are here. They've been under test in
   linux-next for a couple of months without, to my knowledge, any
   negative reports (or any positive ones, come to that).

 - Also the Maple Tree from Liam Howlett. An overlapping range-based
   tree for vmas. It it apparently slightly more efficient in its own
   right, but is mainly targeted at enabling work to reduce mmap_lock
   contention.

   Liam has identified a number of other tree users in the kernel which
   could be beneficially onverted to mapletrees.

   Yu Zhao has identified a hard-to-hit but "easy to fix" lockdep splat
   at [1]. This has yet to be addressed due to Liam's unfortunately
   timed vacation. He is now back and we'll get this fixed up.

 - Dmitry Vyukov introduces KMSAN: the Kernel Memory Sanitizer. It uses
   clang-generated instrumentation to detect used-unintialized bugs down
   to the single bit level.

   KMSAN keeps finding bugs. New ones, as well as the legacy ones.

 - Yang Shi adds a userspace mechanism (madvise) to induce a collapse of
   memory into THPs.

 - Zach O'Keefe has expanded Yang Shi's madvise(MADV_COLLAPSE) to
   support file/shmem-backed pages.

 - userfaultfd updates from Axel Rasmussen

 - zsmalloc cleanups from Alexey Romanov

 - cleanups from Miaohe Lin: vmscan, hugetlb_cgroup, hugetlb and
   memory-failure

 - Huang Ying adds enhancements to NUMA balancing memory tiering mode's
   page promotion, with a new way of detecting hot pages.

 - memcg updates from Shakeel Butt: charging optimizations and reduced
   memory consumption.

 - memcg cleanups from Kairui Song.

 - memcg fixes and cleanups from Johannes Weiner.

 - Vishal Moola provides more folio conversions

 - Zhang Yi removed ll_rw_block() :(

 - migration enhancements from Peter Xu

 - migration error-path bugfixes from Huang Ying

 - Aneesh Kumar added ability for a device driver to alter the memory
   tiering promotion paths. For optimizations by PMEM drivers, DRM
   drivers, etc.

 - vma merging improvements from Jakub Matěn.

 - NUMA hinting cleanups from David Hildenbrand.

 - xu xin added aditional userspace visibility into KSM merging
   activity.

 - THP & KSM code consolidation from Qi Zheng.

 - more folio work from Matthew Wilcox.

 - KASAN updates from Andrey Konovalov.

 - DAMON cleanups from Kaixu Xia.

 - DAMON work from SeongJae Park: fixes, cleanups.

 - hugetlb sysfs cleanups from Muchun Song.

 - Mike Kravetz fixes locking issues in hugetlbfs and in hugetlb core.

Link: https://lkml.kernel.org/r/CAOUHufZabH85CeUN-MEMgL8gJGzJEWUrkiM58JkTbBhh-jew0Q@mail.gmail.com [1]

* tag 'mm-stable-2022-10-08' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (555 commits)
  hugetlb: allocate vma lock for all sharable vmas
  hugetlb: take hugetlb vma_lock when clearing vma_lock->vma pointer
  hugetlb: fix vma lock handling during split vma and range unmapping
  mglru: mm/vmscan.c: fix imprecise comments
  mm/mglru: don't sync disk for each aging cycle
  mm: memcontrol: drop dead CONFIG_MEMCG_SWAP config symbol
  mm: memcontrol: use do_memsw_account() in a few more places
  mm: memcontrol: deprecate swapaccounting=0 mode
  mm: memcontrol: don't allocate cgroup swap arrays when memcg is disabled
  mm/secretmem: remove reduntant return value
  mm/hugetlb: add available_huge_pages() func
  mm: remove unused inline functions from include/linux/mm_inline.h
  selftests/vm: add selftest for MADV_COLLAPSE of uffd-minor memory
  selftests/vm: add file/shmem MADV_COLLAPSE selftest for cleared pmd
  selftests/vm: add thp collapse shmem testing
  selftests/vm: add thp collapse file and tmpfs testing
  selftests/vm: modularize thp collapse memory operations
  selftests/vm: dedup THP helpers
  mm/khugepaged: add tracepoint to hpage_collapse_scan_file()
  mm/madvise: add file and shmem support to MADV_COLLAPSE
  ...
2022-10-10 17:53:04 -07:00
Linus Torvalds
70442fc54e * Detect insecure W+X mappings and warn about them, including a
few bug fixes and relaxing the enforcement
  * Do a long-overdue defconfig update and enabling W+X boot-time
    detection
  * Cleanup _PAGE_PSE handling (follow-up on an earlier bug)
  * Rename a change_page_attr function
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEV76QKkVc4xCGURexaDWVMHDJkrAFAmM+EPcACgkQaDWVMHDJ
 krCTJBAAyCJOcywm9bhKuxCKFzeDwKf5znW3sJVQYPlcskLlLyQdqbPN32BSMH4t
 rM+m7c3nChjx0wwwXbhqxcxmMAN8E3EC2GrPxJg6WY119F9rFsaxica54/ZGvnd3
 7YjShCqmNpQS1PbhfB5aJMRiZU6CkEZ2ER9ewypEg3na5ShhWKiT++nd9jwn8b6R
 8nQyVCBFVS+dcJ3G8XBwikvKB2o1KGysDWI1NfJGhd4XxUMFWL8KVbjgnRIMuPfJ
 A3lvoIFgwTgZuoZA0DfGPnmXqryq+15ADpjK/4R4rllPEawvBcLMGngvyAv5LNT0
 I9yOjBIm8AXMMJf9rHJQlxwouE/QapwVH9rN0QZgRAhPsD5sAaMAD8FqZ11ovtlf
 VGsndZnv2LqIdnf1TV23IPTFoU58diStoYcuwvqqiWhBlMc2UYSNNlwGTeloZVK1
 /JFL9HZWLZzUfq+/cODsvHBJsVS5/X3FdYN1WDzHXSiiN8c/IbaYPduRpZdwbQ2q
 nWGep8z9QkihrSA1sjkfDrtBwA5EKC8U//M4ve7jf3xVv2mcjnSxYKlTrq0e4+q+
 +6F+CvApPavkK4hmdqKBlJeCQo7rhiaD9iSpaO4XQZ463TowoYzj5F3hNtYi/p7I
 +n63UPUhhh6hHAtXKSfpiRpsei4dtRgbtaHQ1m3Z5t35id/gqhA=
 =KkQf
 -----END PGP SIGNATURE-----

Merge tag 'x86_mm_for_v6.1_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 mm updates from Dave Hansen:
 "There are some small things here, plus one big one.

  The big one detected and refused to create W+X kernel mappings. This
  caused a bit of trouble and it is entirely disabled on 32-bit due to
  known unfixable EFI issues. It also oopsed on some systemd eBPF use,
  which kept some users from booting.

  The eBPF issue is fixed, but those troubles were caught relatively
  recently which made me nervous that there are more lurking. The final
  commit in here retains the warnings, but doesn't actually refuse to
  create W+X mappings.

  Summary:

   - Detect insecure W+X mappings and warn about them, including a few
     bug fixes and relaxing the enforcement

   - Do a long-overdue defconfig update and enabling W+X boot-time
     detection

   - Cleanup _PAGE_PSE handling (follow-up on an earlier bug)

   - Rename a change_page_attr function"

* tag 'x86_mm_for_v6.1_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/mm: Ease W^X enforcement back to just a warning
  x86/mm: Disable W^X detection and enforcement on 32-bit
  x86/mm: Add prot_sethuge() helper to abstract out _PAGE_PSE handling
  x86/mm/32: Fix W^X detection when page tables do not support NX
  x86/defconfig: Enable CONFIG_DEBUG_WX=y
  x86/defconfig: Refresh the defconfigs
  x86/mm: Refuse W^X violations
  x86/mm: Rename set_memory_present() to set_memory_p()
2022-10-10 17:13:07 -07:00
Dave Hansen
c5129ecc12 x86/mm: Ease W^X enforcement back to just a warning
Currently, the "change_page_attr" (CPA) code refuses to create
W+X mappings on 64-bit kernels.  There have been reports both
from 32-bit[1] and from BPF[2] users where this change kept the
system from booting.

These reports are showing up even after about a month of soak
time in -next.

To avoid breaking anything, never enforce W^X.  Always warn
and return the requested permissions even if a problem is
detected.

1. https://lore.kernel.org/all/CAMj1kXHcF_iK_g0OZSkSv56Wmr=eQGQwNstcNjLEfS=mm7a06w@mail.gmail.com/
2. https://lore.kernel.org/bpf/c84cc27c1a5031a003039748c3c099732a718aec.camel@kernel.org/T/#u

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: x86@kernel.org
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Peter Zijlstra <peterz@infradead.org>
2022-10-04 12:53:52 -07:00
Linus Torvalds
7db99f01d1 - Print the CPU number at segfault time. The number printed is not
always accurate (preemption is enabled at that time) but the print string
   contains "likely" and after a lot of back'n'forth on this, this was the
   consensus that was reached. See thread starting at:
   https://lore.kernel.org/r/5d62c1d0-7425-d5bb-ecb5-1dc3b4d7d245@intel.com
 
 - After a *lot* of testing and polishing, finally the clear_user()
   improvements to inline REP; STOSB by default
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmM780YACgkQEsHwGGHe
 VUoVmRAAhTTUYqe81XRAX1Egge1RVwXgZFZQQD54239IveLt80kpg1AFR697z3/G
 GSaL5dy9/LEwyE0r9u7VN1SuY3Bz8LEZLYfurKlGfMGv3mlcUedLSWHQNaNZ4+cx
 nwP1VjrfH80Qwn7l99hOZ7kwCRlUWdsamcMsv6n7Fq0YnM7vW6MgmQGlqCGADQKI
 GGElgn3VaU5pEXF4YtZE0qfy17dgkW+RJD7RlaAtzmKcdKYiKgfX0Mh9bkQ5VVcb
 973peg9GHKHoP0w54LYKNgF/WkYPpBwcNIkJW//aMLGS5ofalliu0W281okAhZ7w
 Sknnx+umoprCAV9ljn/HZbQnseXPXKlZWUxiwfEZD2fcBI6+60HK86zVSEvrYxNs
 COZRhpj/bwvOt5LtNj7dcV6cbbBvJJIEJSNazXHscds2zUh79fidzkjZ/Z0Hbo7D
 sr7SM80STaURGFPTLoD41z4TY3V2GH8JUbWG9+I4fBjmSDyX+tQKKJhH886MRSSW
 ZQ0pY1uhwF8QBMlp3t87pZ2gbdjMr40PSYJGU+Xqsb/khq7xOWnP1CiIkVUHLF2L
 n/hm6VF6ifwFstFhOBc8nTcz7kGqzKvlQssFpf+DAIl/b7mZ3AKHVIJxTpZ1hcis
 BqQdeNcVEcNCRZWFzrG5hCs0b7xtaJrI30RPSMsQP2IcaX+DNyg=
 =xLSS
 -----END PGP SIGNATURE-----

Merge tag 'x86_cpu_for_v6.1_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 cpu updates from Borislav Petkov:

 - Print the CPU number at segfault time.

   The number printed is not always accurate (preemption is enabled at
   that time) but the print string contains "likely" and after a lot of
   back'n'forth on this, this was the consensus that was reached. See
   thread at [1].

 - After a *lot* of testing and polishing, finally the clear_user()
   improvements to inline REP; STOSB by default

Link: https://lore.kernel.org/r/5d62c1d0-7425-d5bb-ecb5-1dc3b4d7d245@intel.com [1]

* tag 'x86_cpu_for_v6.1_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/mm: Print likely CPU at segfault time
  x86/clear_user: Make it faster
2022-10-04 09:21:30 -07:00
Alexander Potapenko
ce732a7520 x86: kmsan: handle CPU entry area
Among other data, CPU entry area holds exception stacks, so addresses from
this area can be passed to kmsan_get_metadata().

This previously led to kmsan_get_metadata() returning NULL, which in turn
resulted in a warning that triggered further attempts to call
kmsan_get_metadata() in the exception context, which quickly exhausted the
exception stack.

This patch allocates shadow and origin for the CPU entry area on x86 and
introduces arch_kmsan_get_meta_or_null(), which performs arch-specific
metadata mapping.

Link: https://lkml.kernel.org/r/20220928123219.1101883-1-glider@google.com
Signed-off-by: Alexander Potapenko <glider@google.com>
Fixes: 21d723a7c1409 ("kmsan: add KMSAN runtime core")
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Andrey Konovalov <andreyknvl@google.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Eric Biggers <ebiggers@google.com>
Cc: Eric Biggers <ebiggers@kernel.org>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Ilya Leoshkevich <iii@linux.ibm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Marco Elver <elver@google.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vegard Nossum <vegard.nossum@oracle.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-10-03 14:03:26 -07:00
Alexander Potapenko
3f1e2c7a90 x86: kmsan: sync metadata pages on page fault
KMSAN assumes shadow and origin pages for every allocated page are
accessible.  For pages between [VMALLOC_START, VMALLOC_END] those metadata
pages start at KMSAN_VMALLOC_SHADOW_START and KMSAN_VMALLOC_ORIGIN_START,
therefore we must sync a bigger memory region.

Link: https://lkml.kernel.org/r/20220915150417.722975-37-glider@google.com
Signed-off-by: Alexander Potapenko <glider@google.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Andrey Konovalov <andreyknvl@google.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Eric Biggers <ebiggers@google.com>
Cc: Eric Biggers <ebiggers@kernel.org>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Ilya Leoshkevich <iii@linux.ibm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Marco Elver <elver@google.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vegard Nossum <vegard.nossum@oracle.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-10-03 14:03:24 -07:00
Alexander Potapenko
93324e6842 x86: kmsan: disable instrumentation of unsupported code
Instrumenting some files with KMSAN will result in kernel being unable to
link, boot or crashing at runtime for various reasons (e.g.  infinite
recursion caused by instrumentation hooks calling instrumented code
again).

Completely omit KMSAN instrumentation in the following places:
 - arch/x86/boot and arch/x86/realmode/rm, as KMSAN doesn't work for i386;
 - arch/x86/entry/vdso, which isn't linked with KMSAN runtime;
 - three files in arch/x86/kernel - boot problems;
 - arch/x86/mm/cpu_entry_area.c - recursion.

Link: https://lkml.kernel.org/r/20220915150417.722975-33-glider@google.com
Signed-off-by: Alexander Potapenko <glider@google.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Andrey Konovalov <andreyknvl@google.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Eric Biggers <ebiggers@google.com>
Cc: Eric Biggers <ebiggers@kernel.org>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Ilya Leoshkevich <iii@linux.ibm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Marco Elver <elver@google.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vegard Nossum <vegard.nossum@oracle.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-10-03 14:03:24 -07:00
Alexander Potapenko
b073d7f8ae mm: kmsan: maintain KMSAN metadata for page operations
Insert KMSAN hooks that make the necessary bookkeeping changes:
 - poison page shadow and origins in alloc_pages()/free_page();
 - clear page shadow and origins in clear_page(), copy_user_highpage();
 - copy page metadata in copy_highpage(), wp_page_copy();
 - handle vmap()/vunmap()/iounmap();

Link: https://lkml.kernel.org/r/20220915150417.722975-15-glider@google.com
Signed-off-by: Alexander Potapenko <glider@google.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Andrey Konovalov <andreyknvl@google.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Eric Biggers <ebiggers@google.com>
Cc: Eric Biggers <ebiggers@kernel.org>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Ilya Leoshkevich <iii@linux.ibm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Marco Elver <elver@google.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vegard Nossum <vegard.nossum@oracle.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-10-03 14:03:20 -07:00
Alexander Potapenko
1a167ddd3c x86: kmsan: pgtable: reduce vmalloc space
KMSAN is going to use 3/4 of existing vmalloc space to hold the metadata,
therefore we lower VMALLOC_END to make sure vmalloc() doesn't allocate
past the first 1/4.

Link: https://lkml.kernel.org/r/20220915150417.722975-10-glider@google.com
Signed-off-by: Alexander Potapenko <glider@google.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Andrey Konovalov <andreyknvl@google.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Eric Biggers <ebiggers@google.com>
Cc: Eric Biggers <ebiggers@kernel.org>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Ilya Leoshkevich <iii@linux.ibm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Marco Elver <elver@google.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vegard Nossum <vegard.nossum@oracle.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-10-03 14:03:19 -07:00
Dave Hansen
8c4934f475 x86/mm: Disable W^X detection and enforcement on 32-bit
The 32-bit code is in a weird spot.  Some 32-bit builds (non-PAE) do not
even have NX support.  Even PAE builds that support NX have to contend
with things like EFI data and code mixed in the same pages where W+X
is unavoidable.

The folks still running X86_32=y kernels are unlikely to care much about
NX.  That combined with the fundamental inability fix _all_ of the W+X
things means this code had little value on X86_32=y.  Disable the checks.

Reported-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Darren Hart <dvhart@infradead.org>
Cc: Andy Shevchenko <andy@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: x86@kernel.org
Cc: linux-efi@vger.kernel.org
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/all/CAMj1kXHcF_iK_g0OZSkSv56Wmr=eQGQwNstcNjLEfS=mm7a06w@mail.gmail.com/
2022-10-03 13:12:23 -07:00
Linus Torvalds
334b2cea81 x86/mm: Add prot_sethuge() helper to abstract out _PAGE_PSE handling
We still have some historic cases of direct fiddling of page
attributes with (dangerous & fragile) type casting and address shifting.

Add the prot_sethuge() helper instead that gets the types right and
doesn't have to transform addresses.

( Also add a debug check to make sure this doesn't get applied
  to _PAGE_BIT_PAT/_PAGE_BIT_PAT_LARGE pages. )

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Dave Hansen <dave.hansen@intel.com>
2022-09-29 18:01:40 +02:00
Yu Zhao
eed9a328aa mm: x86: add CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG
Some architectures support the accessed bit in non-leaf PMD entries, e.g.,
x86 sets the accessed bit in a non-leaf PMD entry when using it as part of
linear address translation [1].  Page table walkers that clear the
accessed bit may use this capability to reduce their search space.

Note that:
1. Although an inline function is preferable, this capability is added
   as a configuration option for consistency with the existing macros.
2. Due to the little interest in other varieties, this capability was
   only tested on Intel and AMD CPUs.

Thanks to the following developers for their efforts [2][3].
  Randy Dunlap <rdunlap@infradead.org>
  Stephen Rothwell <sfr@canb.auug.org.au>

[1]: Intel 64 and IA-32 Architectures Software Developer's Manual
     Volume 3 (June 2021), section 4.8
[2] https://lore.kernel.org/r/bfdcc7c8-922f-61a9-aa15-7e7250f04af7@infradead.org/
[3] https://lore.kernel.org/r/20220413151513.5a0d7a7e@canb.auug.org.au/

Link: https://lkml.kernel.org/r/20220918080010.2920238-3-yuzhao@google.com
Signed-off-by: Yu Zhao <yuzhao@google.com>
Reviewed-by: Barry Song <baohua@kernel.org>
Acked-by: Brian Geffon <bgeffon@google.com>
Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Acked-by: Steven Barrett <steven@liquorix.net>
Acked-by: Suleiman Souhlal <suleiman@google.com>
Tested-by: Daniel Byrne <djbyrne@mtu.edu>
Tested-by: Donald Carr <d@chaos-reins.com>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
Tested-by: Sofia Trinh <sofia.trinh@edi.works>
Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Michael Larabel <Michael@MichaelLarabel.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-09-26 19:46:08 -07:00
Peter Xu
be45a4902c mm/swap: cache maximum swapfile size when init swap
We used to have swapfile_maximum_size() fetching a maximum value of
swapfile size per-arch.

As the caller of max_swapfile_size() grows, this patch introduce a
variable "swapfile_maximum_size" and cache the value of old
max_swapfile_size(), so that we don't need to calculate the value every
time.

Caching the value in swapfile_init() is safe because when reaching the
phase we should have initialized all the relevant information.  Here the
major arch to take care of is x86, which defines the max swapfile size
based on L1TF mitigation.

Here both X86_BUG_L1TF or l1tf_mitigation should have been setup properly
when reaching swapfile_init().  As a reference, the code path looks like
this for x86:

- start_kernel
  - setup_arch
    - early_cpu_init
      - early_identify_cpu --> setup X86_BUG_L1TF
  - parse_early_param
    - l1tf_cmdline --> set l1tf_mitigation
  - check_bugs
    - l1tf_select_mitigation --> set l1tf_mitigation
  - arch_call_rest_init
    - rest_init
      - kernel_init
        - kernel_init_freeable
          - do_basic_setup
            - do_initcalls --> calls swapfile_init() (initcall level 4)

The swapfile size only depends on swp pte format on non-x86 archs, so
caching it is safe too.

Since at it, rename max_swapfile_size() to arch_max_swapfile_size()
because arch can define its own function, so it's more straightforward to
have "arch_" as its prefix.  At the meantime, export swapfile_maximum_size
to replace the old usages of max_swapfile_size().

[peterx@redhat.com: declare arch_max_swapfile_size) in swapfile.h]
  Link: https://lkml.kernel.org/r/YxTh1GuC6ro5fKL5@xz-m1.local
Link: https://lkml.kernel.org/r/20220811161331.37055-7-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Andi Kleen <andi.kleen@intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Nadav Amit <nadav.amit@gmail.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-09-26 19:46:05 -07:00
Dave Hansen
a3d3163fbe x86/mm/32: Fix W^X detection when page tables do not support NX
The x86 MM code now actively refuses to create writable+executable mappings,
and warns when there is an attempt to create one.

The 0day test robot ran across a warning triggered by module unloading on
32-bit kernels.  This was only seen on CPUs with NX support, but where a
32-bit kernel was built without PAE support.

On those systems, there is no room for the NX bit in the page
tables and _PAGE_NX is #defined to 0, breaking some of the W^X
detection logic in verify_rwx().  The X86_FEATURE_NX check in
there does not do any good here because the CPU itself supports
NX.

Fix it by checking for _PAGE_NX support directly instead of
checking CPU support for NX.

Note that since _PAGE_NX is actually defined to be 0 at
compile-time this fix should also end up letting the compiler
optimize away most of verify_rwx() on non-PAE kernels.

Fixes: 652c5bf380 ("x86/mm: Refuse W^X violations")
Reported-by: kernel test robot <yujie.liu@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/all/fcf89147-440b-e478-40c9-228c9fe56691@intel.com/
2022-09-21 10:02:55 +02:00
Naohiro Aota
818c4fdaa9 x86/mm: disable instrumentations of mm/pgprot.c
Commit 4867fbbdd6 ("x86/mm: move protection_map[] inside the platform")
moved accesses to protection_map[] from mem_encrypt_amd.c to pgprot.c.  As
a result, the accesses are now targets of KASAN (and other
instrumentations), leading to the crash during the boot process.

Disable the instrumentations for pgprot.c like commit 67bb8e999e
("x86/mm: Disable various instrumentations of mm/mem_encrypt.c and
mm/tlb.c").

Before this patch, my AMD machine cannot boot since v6.0-rc1 with KASAN
enabled, without anything printed.  After the change, it successfully
boots up.

Fixes: 4867fbbdd6 ("x86/mm: move protection_map[] inside the platform")
Link: https://lkml.kernel.org/r/20220824084726.2174758-1-naohiro.aota@wdc.com
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-09-11 16:22:30 -07:00
Ingo Molnar
1043897681 Merge branch 'linus' into x86/mm, to refresh the branch
This branch is ~14k commits behind upstream, and has an old merge base
from early into the merge window, refresh it to v6.0-rc3+fixes before
queueing up new commits.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2022-09-02 10:33:33 +02:00
Peter Zijlstra
652c5bf380 x86/mm: Refuse W^X violations
x86 has STRICT_*_RWX, but not even a warning when someone violates it.

Add this warning and fully refuse the transition.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lkml.kernel.org/r/YwySW3ROc21hN7g9@hirez.programming.kicks-ass.net
2022-09-01 11:10:19 -07:00
Linus Torvalds
2f23a7c914 Misc fixes:
- Fix PAT on Xen, which caused i915 driver failures
  - Fix compat INT 80 entry crash on Xen PV guests
  - Fix 'MMIO Stale Data' mitigation status reporting on older Intel CPUs
  - Fix RSB stuffing regressions
  - Fix ORC unwinding on ftrace trampolines
  - Add Intel Raptor Lake CPU model number
  - Fix (work around) a SEV-SNP bootloader bug providing bogus values in
    boot_params->cc_blob_address, by ignoring the value on !SEV-SNP bootups.
  - Fix SEV-SNP early boot failure
  - Fix the objtool list of noreturn functions and annotate snp_abort(),
    which bug confused objtool on gcc-12.
  - Fix the documentation for retbleed
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmMLgUERHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1hU1hAAj9bKO+D07gROpkRhXpLvbqm+mUqk6It8
 qEuyLkC/xvD9N//Pya0ZPPE+l23EHQxM/L0tXCYwsc8Zlx3fr678FQktHZuxULaP
 qlGmwub8PvsyMesXBGtgHJ4RT8L07FelBR+E/TpqCSbv/geM7mcc0ojyOKo+OmbD
 vbsUTDNqsMYndzePUBrv/vJRqVLGlbd1a22DKE/YiYBbZu5hughNUB0bHYhYdsml
 mutcRsVTKRbZOi4wiuY/pnveMf/z9wAwKOWd9EBaDWILygVSHkBJJQyhHigBh3F8
 H1hFn1q4ybULdrAUT4YND9Tjb6U8laU0fxg8Z43bay7bFXXElqoj3W2qka9NjAim
 QvWSFvTYQRTIWt6sCshVsUBWj6fBZdHxcJ8Fh+ucJJp+JWl9/aD0A41vK2Jx0LIt
 p8YzBRKEqd1Q/7QD855BArN7HNtQGgShNt4oPVZ7nPnjuQt0+lngu8NR+6X3RKpX
 r8rordyPvzgPL6W+1uV+8hnz1w+YU2xplAbK+zTijwgJVgyf8khSlZQNpFKULrou
 zjtzo/2nB+4C4bvfetNnaOGhi1/AdCHZHyZE35rotpd73SLHvdOrH0Ll9oCVfVrC
 UWbC1E67cHQw97Ni/4CrCsJRBULK01uyszVCxlEkSYInf0UsKlnmy+TqxizwsCVy
 reYQ0ePyWg0=
 =ZpCJ
 -----END PGP SIGNATURE-----

Merge tag 'x86-urgent-2022-08-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull misc x86 fixes from Ingo Molnar:

 - Fix PAT on Xen, which caused i915 driver failures

 - Fix compat INT 80 entry crash on Xen PV guests

 - Fix 'MMIO Stale Data' mitigation status reporting on older Intel CPUs

 - Fix RSB stuffing regressions

 - Fix ORC unwinding on ftrace trampolines

 - Add Intel Raptor Lake CPU model number

 - Fix (work around) a SEV-SNP bootloader bug providing bogus values in
   boot_params->cc_blob_address, by ignoring the value on !SEV-SNP
   bootups.

 - Fix SEV-SNP early boot failure

 - Fix the objtool list of noreturn functions and annotate snp_abort(),
   which bug confused objtool on gcc-12.

 - Fix the documentation for retbleed

* tag 'x86-urgent-2022-08-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  Documentation/ABI: Mention retbleed vulnerability info file for sysfs
  x86/sev: Mark snp_abort() noreturn
  x86/sev: Don't use cc_platform_has() for early SEV-SNP calls
  x86/boot: Don't propagate uninitialized boot_params->cc_blob_address
  x86/cpu: Add new Raptor Lake CPU model number
  x86/unwind/orc: Unwind ftrace trampolines with correct ORC entry
  x86/nospec: Fix i386 RSB stuffing
  x86/nospec: Unwreck the RSB stuffing
  x86/bugs: Add "unknown" reporting for MMIO Stale Data
  x86/entry: Fix entry_INT80_compat for Xen PV guests
  x86/PAT: Have pat_enabled() properly reflect state when running on Xen
2022-08-28 10:10:23 -07:00
Rik van Riel
c926087eb3 x86/mm: Print likely CPU at segfault time
In a large enough fleet of computers, it is common to have a few bad CPUs.
Those can often be identified by seeing that some commonly run kernel code,
which runs fine everywhere else, keeps crashing on the same CPU core on one
particular bad system.

However, the failure modes in CPUs that have gone bad over the years are
often oddly specific, and the only bad behavior seen might be segfaults
in programs like bash, python, or various system daemons that run fine
everywhere else.

Add a printk() to show_signal_msg() to print the CPU, core, and socket
at segfault time.

This is not perfect, since the task might get rescheduled on another
CPU between when the fault hit, and when the message is printed, but in
practice this has been good enough to help people identify several bad
CPU cores.

For example:

  segfault[1349]: segfault at 0 ip 000000000040113a sp 00007ffc6d32e360 error 4 in \
	  segfault[401000+1000] likely on CPU 0 (core 0, socket 0)

This printk can be controlled through /proc/sys/debug/exception-trace.

  [ bp: Massage a bit, add "likely" to the printed line to denote that
    the CPU number is not always reliable. ]

Signed-off-by: Rik van Riel <riel@surriel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20220805101644.2e674553@imladris.surriel.com
2022-08-24 12:48:05 +02:00
Aaron Lu
88e0a74902 x86/mm: Use proper mask when setting PUD mapping
Commit c164fbb40c43f("x86/mm: thread pgprot_t through
init_memory_mapping()") mistakenly used __pgprot() which doesn't respect
__default_kernel_pte_mask when setting PUD mapping.

Fix it by only setting the one bit we actually need (PSE) and leaving
the other bits (that have been properly masked) alone.

Fixes: c164fbb40c ("x86/mm: thread pgprot_t through init_memory_mapping()")
Signed-off-by: Aaron Lu <aaron.lu@intel.com>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-08-19 08:17:43 -07:00
Linus Torvalds
c4e34dd99f x86: simplify load_unaligned_zeropad() implementation
The exception for the "unaligned access at the end of the page, next
page not mapped" never happens, but the fixup code ends up causing
trouble for compilers to optimize well.

clang in particular ends up seeing it being in the middle of a loop, and
tries desperately to optimize the exception fixup code that is never
really reached.

The simple solution is to just move all the fixups into the exception
handler itself, which moves it all out of the hot case code, and means
that the compiler never sees it or needs to worry about it.

Acked-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-08-16 11:03:38 -07:00
Jan Beulich
72cbc8f04f x86/PAT: Have pat_enabled() properly reflect state when running on Xen
After commit ID in the Fixes: tag, pat_enabled() returns false (because
of PAT initialization being suppressed in the absence of MTRRs being
announced to be available).

This has become a problem: the i915 driver now fails to initialize when
running PV on Xen (i915_gem_object_pin_map() is where I located the
induced failure), and its error handling is flaky enough to (at least
sometimes) result in a hung system.

Yet even beyond that problem the keying of the use of WC mappings to
pat_enabled() (see arch_can_pci_mmap_wc()) means that in particular
graphics frame buffer accesses would have been quite a bit less optimal
than possible.

Arrange for the function to return true in such environments, without
undermining the rest of PAT MSR management logic considering PAT to be
disabled: specifically, no writes to the PAT MSR should occur.

For the new boolean to live in .init.data, init_cache_modes() also needs
moving to .init.text (where it could/should have lived already before).

  [ bp: This is the "small fix" variant for stable. It'll get replaced
    with a proper PAT and MTRR detection split upstream but that is too
    involved for a stable backport.
    - additional touchups to commit msg. Use cpu_feature_enabled(). ]

Fixes: bdd8b6c982 ("drm/i915: replace X86_FEATURE_PAT with pat_enabled()")
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Ingo Molnar <mingo@kernel.org>
Cc: <stable@vger.kernel.org>
Cc: Juergen Gross <jgross@suse.com>
Cc: Lucas De Marchi <lucas.demarchi@intel.com>
Link: https://lore.kernel.org/r/9385fa60-fa5d-f559-a137-6608408f88b0@suse.com
2022-08-15 10:51:23 +02:00
Naoya Horiguchi
3a194f3f8a mm/hugetlb: make pud_huge() and follow_huge_pud() aware of non-present pud entry
follow_pud_mask() does not support non-present pud entry now.  As long as
I tested on x86_64 server, follow_pud_mask() still simply returns
no_page_table() for non-present_pud_entry() due to pud_bad(), so no severe
user-visible effect should happen.  But generally we should call
follow_huge_pud() for non-present pud entry for 1GB hugetlb page.

Update pud_huge() and follow_huge_pud() to handle non-present pud entries.
The changes are similar to previous works for pud entries commit
e66f17ff71 ("mm/hugetlb: take page table lock in follow_huge_pmd()") and
commit cbef8478be ("mm/hugetlb: pmd_huge() returns true for non-present
hugepage").

Link: https://lkml.kernel.org/r/20220714042420.1847125-3-naoya.horiguchi@linux.dev
Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: kernel test robot <lkp@intel.com>
Cc: Liu Shixin <liushixin2@huawei.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Yang Shi <shy828301@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-08-08 18:06:43 -07:00
Linus Torvalds
1612c382ff Misc fixes:
- an old(er) binutils build fix,
  - a new-GCC build fix,
  - and a kexec boot environment fix.
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmLuv4URHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1it3A//fGfrzHGtjHraiBy0H1Erlz0dUa4q/r6v
 xPQVFYteGwL/Ynv2rOJreiEXNhv9pRv0cXXNS5iWh8IcP8IUNw6rfYmgr1aDpXdq
 WkbJvwouX6JSo3g/CMekKd+Mf7NgA4O1OO65E80c4WJnxgd0AYvr6IxJRLR7X0C7
 HwU6p6PmP/RHWT5T170z6sgun+6QdDEYSwFYOhxawL+BJaKEBYnQ0LLQgJazhe7z
 uVxONQA9OdWBwMzvZygbOuTzc990jCHRPYgvYQhSZ8CUPuVzaa7IB9KUXh6lu93d
 a7nqM3GlWTowBULY6Xq7gWJaJ7jsVWXjqo8SWVlb6YwoLR9dgGSW5bCGV0rOA6o3
 yPjQhIQ9H4NOx126wPcCRBh3osGFjqlWUXVw7W51aNgd7hCvlbpWWmREeI/Pm1Ew
 WBjQqpf4l0S+0On5FEFaF7swAG3b6KSNSKw7WBmpmTNt5eWOot0EtnjGW75ATpxM
 +j2fj/1MIZ/Zp+wYaNK/+abM4sXHhYvU9gpPdJslRr+r2AVjy9gCZ/0zuUIVytwC
 gOdV9KhqzlXPJCTm+py7fBt2qM2P5rKT2HBQYiJwIquB2njI0kjUBOJWXsGQ/F/y
 hGd6WY8uDuwzzg5JtyfwE6fPGovxL5GCc4w9CYz0DbP0txPYuhMOdkHtAYLyraAj
 wtdalMt3cT8=
 =EM/G
 -----END PGP SIGNATURE-----

Merge tag 'x86-urgent-2022-08-06' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 fixes from Ingo Molnar:

 - build fix for old(er) binutils

 - build fix for new GCC

 - kexec boot environment fix

* tag 'x86-urgent-2022-08-06' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/entry: Build thunk_$(BITS) only if CONFIG_PREEMPTION=y
  x86/numa: Use cpumask_available instead of hardcoded NULL check
  x86/bus_lock: Don't assume the init value of DEBUGCTLMSR.BUS_LOCK_DETECT to be zero
2022-08-06 17:45:37 -07:00
Borislav Petkov
86af8230ce x86/mm: Rename set_memory_present() to set_memory_p()
Have it adhere to the naming convention for those helpers.

No functional changes.

Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20220805140702.31538-1-bp@alien8.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2022-08-06 20:46:13 +02:00
Linus Torvalds
6614a3c316 - The usual batches of cleanups from Baoquan He, Muchun Song, Miaohe
Lin, Yang Shi, Anshuman Khandual and Mike Rapoport
 
 - Some kmemleak fixes from Patrick Wang and Waiman Long
 
 - DAMON updates from SeongJae Park
 
 - memcg debug/visibility work from Roman Gushchin
 
 - vmalloc speedup from Uladzislau Rezki
 
 - more folio conversion work from Matthew Wilcox
 
 - enhancements for coherent device memory mapping from Alex Sierra
 
 - addition of shared pages tracking and CoW support for fsdax, from
   Shiyang Ruan
 
 - hugetlb optimizations from Mike Kravetz
 
 - Mel Gorman has contributed some pagealloc changes to improve latency
   and realtime behaviour.
 
 - mprotect soft-dirty checking has been improved by Peter Xu
 
 - Many other singleton patches all over the place
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCYuravgAKCRDdBJ7gKXxA
 jpqSAQDrXSdII+ht9kSHlaCVYjqRFQz/rRvURQrWQV74f6aeiAD+NHHeDPwZn11/
 SPktqEUrF1pxnGQxqLh1kUFUhsVZQgE=
 =w/UH
 -----END PGP SIGNATURE-----

Merge tag 'mm-stable-2022-08-03' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull MM updates from Andrew Morton:
 "Most of the MM queue. A few things are still pending.

  Liam's maple tree rework didn't make it. This has resulted in a few
  other minor patch series being held over for next time.

  Multi-gen LRU still isn't merged as we were waiting for mapletree to
  stabilize. The current plan is to merge MGLRU into -mm soon and to
  later reintroduce mapletree, with a view to hopefully getting both
  into 6.1-rc1.

  Summary:

   - The usual batches of cleanups from Baoquan He, Muchun Song, Miaohe
     Lin, Yang Shi, Anshuman Khandual and Mike Rapoport

   - Some kmemleak fixes from Patrick Wang and Waiman Long

   - DAMON updates from SeongJae Park

   - memcg debug/visibility work from Roman Gushchin

   - vmalloc speedup from Uladzislau Rezki

   - more folio conversion work from Matthew Wilcox

   - enhancements for coherent device memory mapping from Alex Sierra

   - addition of shared pages tracking and CoW support for fsdax, from
     Shiyang Ruan

   - hugetlb optimizations from Mike Kravetz

   - Mel Gorman has contributed some pagealloc changes to improve
     latency and realtime behaviour.

   - mprotect soft-dirty checking has been improved by Peter Xu

   - Many other singleton patches all over the place"

 [ XFS merge from hell as per Darrick Wong in

   https://lore.kernel.org/all/YshKnxb4VwXycPO8@magnolia/ ]

* tag 'mm-stable-2022-08-03' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (282 commits)
  tools/testing/selftests/vm/hmm-tests.c: fix build
  mm: Kconfig: fix typo
  mm: memory-failure: convert to pr_fmt()
  mm: use is_zone_movable_page() helper
  hugetlbfs: fix inaccurate comment in hugetlbfs_statfs()
  hugetlbfs: cleanup some comments in inode.c
  hugetlbfs: remove unneeded header file
  hugetlbfs: remove unneeded hugetlbfs_ops forward declaration
  hugetlbfs: use helper macro SZ_1{K,M}
  mm: cleanup is_highmem()
  mm/hmm: add a test for cross device private faults
  selftests: add soft-dirty into run_vmtests.sh
  selftests: soft-dirty: add test for mprotect
  mm/mprotect: fix soft-dirty check in can_change_pte_writable()
  mm: memcontrol: fix potential oom_lock recursion deadlock
  mm/gup.c: fix formatting in check_and_migrate_movable_page()
  xfs: fail dax mount if reflink is enabled on a partition
  mm/memcontrol.c: remove the redundant updating of stats_flush_threshold
  userfaultfd: don't fail on unrecognized features
  hugetlb_cgroup: fix wrong hugetlb cgroup numa stat
  ...
2022-08-05 16:32:45 -07:00
Linus Torvalds
7447691ef9 xen: branch for v6.0-rc1
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQRTLbB6QfY48x44uB6AXGG7T9hjvgUCYuooOQAKCRCAXGG7T9hj
 vmmlAPoCfYBh4jKwRnvGvyn+sPQed/r0TH0wnsGK1ccONhyIvAD+IZcSTPsnp4Cj
 m1URGGff2PvAyjOIAzQZbKZomtfICwM=
 =z2e5
 -----END PGP SIGNATURE-----

Merge tag 'for-linus-6.0-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip

Pull xen updates from Juergen Gross:

 - a series fine tuning virtio support for Xen guests, including removal
   the now again unused "platform_has()" feature.

 - a fix for host admin triggered reboot of Xen guests

 - a simple spelling fix

* tag 'for-linus-6.0-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
  xen: don't require virtio with grants for non-PV guests
  kernel: remove platform_has() infrastructure
  virtio: replace restricted mem access flag with callback
  xen: Fix spelling mistake
  xen/manage: Use orderly_reboot() to reboot
2022-08-04 15:10:55 -07:00
Siddh Raman Pant
625395c4a0 x86/numa: Use cpumask_available instead of hardcoded NULL check
GCC-12 started triggering a new warning:

  arch/x86/mm/numa.c: In function ‘cpumask_of_node’:
  arch/x86/mm/numa.c:916:39: warning: the comparison will always evaluate as ‘false’ for the address of ‘node_to_cpumask_map’ will never be NULL [-Waddress]
    916 |         if (node_to_cpumask_map[node] == NULL) {
        |                                       ^~

node_to_cpumask_map is of type cpumask_var_t[].

When CONFIG_CPUMASK_OFFSTACK is set, cpumask_var_t is typedef'd to a
pointer for dynamic allocation, else to an array of one element. The
"wicked game" can be checked on line 700 of include/linux/cpumask.h.

The original code in debug_cpumask_set_cpu() and cpumask_of_node() were
probably written by the original authors with CONFIG_CPUMASK_OFFSTACK=y
(i.e. dynamic allocation) in mind, checking if the cpumask was available
via a direct NULL check.

When CONFIG_CPUMASK_OFFSTACK is not set, GCC gives the above warning
while compiling the kernel.

Fix that by using cpumask_available(), which does the NULL check when
CONFIG_CPUMASK_OFFSTACK is set, otherwise returns true. Use it wherever
such checks are made.

Conditional definitions of cpumask_available() can be found along with
the definition of cpumask_var_t. Check the cpumask.h reference mentioned
above.

Fixes: c032ef60d1 ("cpumask: convert node_to_cpumask_map[] to cpumask_var_t")
Fixes: de2d9445f1 ("x86: Unify node_to_cpumask_map handling between 32 and 64bit")
Signed-off-by: Siddh Raman Pant <code@siddh.me>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20220731160913.632092-1-code@siddh.me
2022-08-03 11:44:57 +02:00
Linus Torvalds
7d9d077c78 RCU pull request for v5.20 (or whatever)
This pull request contains the following branches:
 
 doc.2022.06.21a: Documentation updates.
 
 fixes.2022.07.19a: Miscellaneous fixes.
 
 nocb.2022.07.19a: Callback-offload updates, perhaps most notably a new
 	RCU_NOCB_CPU_DEFAULT_ALL Kconfig option that causes all CPUs to
 	be offloaded at boot time, regardless of kernel boot parameters.
 	This is useful to battery-powered systems such as ChromeOS
 	and Android.  In addition, a new RCU_NOCB_CPU_CB_BOOST kernel
 	boot parameter prevents offloaded callbacks from interfering
 	with real-time workloads and with energy-efficiency mechanisms.
 
 poll.2022.07.21a: Polled grace-period updates, perhaps most notably
 	making these APIs account for both normal and expedited grace
 	periods.
 
 rcu-tasks.2022.06.21a: Tasks RCU updates, perhaps most notably reducing
 	the CPU overhead of RCU tasks trace grace periods by more than
 	a factor of two on a system with 15,000 tasks.	The reduction
 	is expected to increase with the number of tasks, so it seems
 	reasonable to hypothesize that a system with 150,000 tasks might
 	see a 20-fold reduction in CPU overhead.
 
 torture.2022.06.21a: Torture-test updates.
 
 ctxt.2022.07.05a: Updates that merge RCU's dyntick-idle tracking into
 	context tracking, thus reducing the overhead of transitioning to
 	kernel mode from either idle or nohz_full userspace execution
 	for kernels that track context independently of RCU.  This is
 	expected to be helpful primarily for kernels built with
 	CONFIG_NO_HZ_FULL=y.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEbK7UrM+RBIrCoViJnr8S83LZ+4wFAmLgMcgTHHBhdWxtY2tA
 a2VybmVsLm9yZwAKCRCevxLzctn7jArXD/0fjbCwqpRjHVTzjMY8jN4zDkqZZD6m
 g8Fx27hZ4ToNFwRptyHwNezrNj14skjAJEXfdjaVw32W62ivXvf0HINvSzsTLCSq
 k2kWyBdXLc9CwY5p5W4smnpn5VoAScjg5PoPL59INoZ/Zziji323C7Zepl/1DYJt
 0T6bPCQjo1ZQoDUCyVpSjDmAqxnderWG0MeJVt74GkLqmnYLANg0GH8c7mH4+9LL
 kVGlLp5nlPgNJ4FEoFdMwNU8T/ETmaVld/m2dkiawjkXjJzB2XKtBigU91DDmXz5
 7DIdV4ABrxiy4kGNqtIe/jFgnKyVD7xiDpyfjd6KTeDr/rDS8u2ZH7+1iHsyz3g0
 Np/tS3vcd0KR+gI/d0eXxPbgm5sKlCmKw/nU2eArpW/+4LmVXBUfHTG9Jg+LJmBc
 JrUh6aEdIZJZHgv/nOQBNig7GJW43IG50rjuJxAuzcxiZNEG5lUSS23ysaA9CPCL
 PxRWKSxIEfK3kdmvVO5IIbKTQmIBGWlcWMTcYictFSVfBgcCXpPAksGvqA5JiUkc
 egW+xLFo/7K+E158vSKsVqlWZcEeUbsNJ88QOlpqnRgH++I2Yv/LhK41XfJfpH+Y
 ALxVaDd+mAq6v+qSHNVq9wT3ozXIPy/zK1hDlMIqx40h2YvaEsH4je+521oSoN9r
 vX60+QNxvUBLwA==
 =vUNm
 -----END PGP SIGNATURE-----

Merge tag 'rcu.2022.07.26a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu

Pull RCU updates from Paul McKenney:

 - Documentation updates

 - Miscellaneous fixes

 - Callback-offload updates, perhaps most notably a new
   RCU_NOCB_CPU_DEFAULT_ALL Kconfig option that causes all CPUs to be
   offloaded at boot time, regardless of kernel boot parameters.

   This is useful to battery-powered systems such as ChromeOS and
   Android. In addition, a new RCU_NOCB_CPU_CB_BOOST kernel boot
   parameter prevents offloaded callbacks from interfering with
   real-time workloads and with energy-efficiency mechanisms

 - Polled grace-period updates, perhaps most notably making these APIs
   account for both normal and expedited grace periods

 - Tasks RCU updates, perhaps most notably reducing the CPU overhead of
   RCU tasks trace grace periods by more than a factor of two on a
   system with 15,000 tasks.

   The reduction is expected to increase with the number of tasks, so it
   seems reasonable to hypothesize that a system with 150,000 tasks
   might see a 20-fold reduction in CPU overhead

 - Torture-test updates

 - Updates that merge RCU's dyntick-idle tracking into context tracking,
   thus reducing the overhead of transitioning to kernel mode from
   either idle or nohz_full userspace execution for kernels that track
   context independently of RCU.

   This is expected to be helpful primarily for kernels built with
   CONFIG_NO_HZ_FULL=y

* tag 'rcu.2022.07.26a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu: (98 commits)
  rcu: Add irqs-disabled indicator to expedited RCU CPU stall warnings
  rcu: Diagnose extended sync_rcu_do_polled_gp() loops
  rcu: Put panic_on_rcu_stall() after expedited RCU CPU stall warnings
  rcutorture: Test polled expedited grace-period primitives
  rcu: Add polled expedited grace-period primitives
  rcutorture: Verify that polled GP API sees synchronous grace periods
  rcu: Make Tiny RCU grace periods visible to polled APIs
  rcu: Make polled grace-period API account for expedited grace periods
  rcu: Switch polled grace-period APIs to ->gp_seq_polled
  rcu/nocb: Avoid polling when my_rdp->nocb_head_rdp list is empty
  rcu/nocb: Add option to opt rcuo kthreads out of RT priority
  rcu: Add nocb_cb_kthread check to rcu_is_callbacks_kthread()
  rcu/nocb: Add an option to offload all CPUs on boot
  rcu/nocb: Fix NOCB kthreads spawn failure with rcu_nocb_rdp_deoffload() direct call
  rcu/nocb: Invert rcu_state.barrier_mutex VS hotplug lock locking order
  rcu/nocb: Add/del rdp to iterate from rcuog itself
  rcu/tree: Add comment to describe GP-done condition in fqs loop
  rcu: Initialize first_gp_fqs at declaration in rcu_gp_fqs()
  rcu/kvfree: Remove useless monitor_todo flag
  rcu: Cleanup RCU urgency state for offline CPU
  ...
2022-08-02 19:12:45 -07:00
Linus Torvalds
ecf9b7bfea - Have invalid MSR accesses warnings appear only once after a
pr_warn_once() change broke that
 
 - Simplify {JMP,CALL}_NOSPEC and let the objtool retpoline patching
 infra take care of them instead of having unreadable alternative macros
 there
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmLnvG8ACgkQEsHwGGHe
 VUpbPw//WUoBMnR9B9xpwuk6EBA0rVbQqCt0QYy/h27QuD9aIC+lYduL8CId1AbS
 J+bqDvpqFQRqc9idAvgkjspJNMnbOSqzAx1AbT4gBCH33hPYrB/6F7XasXSDZn0M
 OVJDyvhOhror2I6YuFc1uwjMPBZj8+fgv823io/RKqvJfj/WUoIBK1oxUlzm4rNs
 LkrxgKpvs3QznJ0RNnZUP1kKnezzC1RtxIdz8QD3rirpuZITF8HD04jIUyK7gohh
 XP5Vgt+9BzRCHyn43XT7UEobz93WASDn0YOlDCdOuMxhJONYpnvk8dYJkss2BReG
 oCGZNiUHKNALd/FWUr5EeWoa84XQwRwm2Y+jjdH3al7LIaIbERGXbTIHsWH6JtVk
 +EiXqj+B6d2muZpG/ka0PSgPvbeTeipeugNDzXGCLuWHkjLRwJRV/afUmzLuT7lJ
 zzrejZeXACtWmTLXXt5EPAPWsibapiuA6/4ucyr+jJj7CCY8pmrOGcxqHKlq7MHU
 9Lk50F6Y9jbEe7mRPmR67dj/P9QAFYewjVDLEgOdyganrNNGfRkddJVbckYYZTVF
 YSUsoZBAc5E3Fzo/yQkG3+6K0d13Syuf3QRhGxe2JvYlWt2mktaFsRtcTGSCd9KM
 AJw8tV0FDezxmdKQ8DNkeLG+tAt7m22dRJkhbSUilYQsNR2Cwxg=
 =70Gc
 -----END PGP SIGNATURE-----

Merge tag 'x86_core_for_v6.0_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 core updates from Borislav Petkov:

 - Have invalid MSR accesses warnings appear only once after a
   pr_warn_once() change broke that

 - Simplify {JMP,CALL}_NOSPEC and let the objtool retpoline patching
   infra take care of them instead of having unreadable alternative
   macros there

* tag 'x86_core_for_v6.0_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/extable: Fix ex_handler_msr() print condition
  x86,nospec: Simplify {JMP,CALL}_NOSPEC
2022-08-01 10:04:00 -07:00
Linus Torvalds
92598ae22f - Rename a PKRU macro to make more sense when reading the code
- Update pkeys documentation
 
 - Avoid reading contended mm's TLB generation var if not absolutely
 necessary along with fixing a case where arch_tlbbatch_flush() doesn't
 adhere to the generation scheme and thus violates the conditions for the
 above avoidance.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmLnmpYACgkQEsHwGGHe
 VUrINQ/9FGnQya6mTJitM3Ohdzu1lOrHm5+XAxCO3SVzPPQlx0mRZmszzDOIZpG/
 9iCEDhSi+kLdkTwIXk8Nmm1imNT2MSqswjQYr8KDtl69/j12W8Y0Pb5C5tnQnUyi
 FXPiVVCAk0iegNg+QvarQa8Ou6tGWDqFMLzdrq9XNokdBmFq7FCDsOjdwd8So3IY
 95755wDtCxgBXc2TVr08qSpD0Q/VlHKqb5shtzuoBe9a0YLEaRmWne9UzTOx5U6c
 //qk8lmy9ohL8dmN7SgcRITzfpU8ue+/J4oZ+GV9mc/UTW5Ah2WNX+3BFnmCqZrK
 gr7G5pukuuJxFj8yGzGbGIM28OHKYIE+So2Q5pA6Vrqst/oyDJS+pcoxyhAYGYCQ
 hDjp4yu5AUnsPky6h6VHaR8Er5Nvo7YwhdSazcGD+HC7smwbnVEzI5H7MUgcJ05F
 1CkAQSy2TVZe0hhilOu8dcHN23+2ISF8BzxKbn4qtZOsJTN6/U4MYFWl6VPh8P80
 vjZcIJYZ4i6Gz03m7ITk2bHwfOD8f/7UkbZEggO/GYm1BgmxaMB0IogoIkSUG9vN
 CLGZomRMfBcVVS1DTWJsUzRLbNx3x3pL41NrlxPbC/rTmvts5eJAvcDcffPfRGzx
 tCqcASRdV7tQBgMT5MLjmIY8cM1aphdGSdlKVD7QHZ11bJVFZE4=
 =aD0S
 -----END PGP SIGNATURE-----

Merge tag 'x86_mm_for_v6.0_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 mm updates from Borislav Petkov:

 - Rename a PKRU macro to make more sense when reading the code

 - Update pkeys documentation

 - Avoid reading contended mm's TLB generation var if not absolutely
   necessary along with fixing a case where arch_tlbbatch_flush()
   doesn't adhere to the generation scheme and thus violates the
   conditions for the above avoidance.

* tag 'x86_mm_for_v6.0_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/mm/tlb: Ignore f->new_tlb_gen when zero
  x86/pkeys: Clarify PKRU_AD_KEY macro
  Documentation/protection-keys: Clean up documentation for User Space pkeys
  x86/mm/tlb: Avoid reading mm_tlb_gen when possible
2022-08-01 09:34:39 -07:00
Linus Torvalds
94e37e8489 - A single CONFIG_ symbol correction in a comment
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmLnlkQACgkQEsHwGGHe
 VUp9ZA//VO6OYECA3ff6x1KBqB6dbC/W2dc54CPy2MSw5xwUlpPuD0ipbRMM7UEn
 zhnMNSwXgtqJdtdJZfvT9XwvrBjigAJJNjR/E3SA7bJ9jk10MEWFY/0YHtshV9vp
 y75c028m3yIxf+97Cku59k2Y69u1LWgM1mpFYVgJJCwkprliY9JMyvhboSXO63v9
 SR9m6jnDOJV1dIdmp7SgQvKEy2gRkz/kli5yzgNgw9Q0t5yodudaq9nEVKh0tWqP
 EQ6TzyORVGmqFZ5Jxti3W6NqUardrCeWwmodC1KwWm7vAcfaJET9ADTj3sAYPG4m
 m26i0fdjixnYAu27adiG7txtVgoZ3JkkNMLkDa30S2Dau1zhmGxdwAFrzLP2P84T
 UGQqsZ7TNkhtLp2Jlb5pfOAdj5q8mI4BDT6KLIXgJYRm0kLkV7W069mcYCKFD2S4
 BCHmDNG1ZVFQUVLG16gdZe4mRlf8mJ8WLCkDbXEAOGHCCB944xLuspgh2L88VrhR
 s8EP8mPn9rXGM4drAsao6+gseS4Q1gJ0E2gFh8YTOb0fFGvlTbcvqWCQSVC6P65z
 xRv1MWKg/VZcPo3Xpe40CELMcobOGxmxXaKQRJp6KSY7SdvidSb41DGv6LKffmPJ
 xckm/13Aip3zVXeBE8hrjulphOAKIfPkujbW0jSTOepTnVBc3Hc=
 =ps1Q
 -----END PGP SIGNATURE-----

Merge tag 'x86_cleanups_for_v6.0_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 cleanup from Borislav Petkov:

 - A single CONFIG_ symbol correction in a comment

* tag 'x86_cleanups_for_v6.0_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/mm: Refer to the intended config STRICT_DEVMEM in a comment
2022-08-01 09:33:17 -07:00
Juergen Gross
a603002eea virtio: replace restricted mem access flag with callback
Instead of having a global flag to require restricted memory access
for all virtio devices, introduce a callback which can select that
requirement on a per-device basis.

For convenience add a common function returning always true, which can
be used for use cases like SEV.

Per default use a callback always returning false.

As the callback needs to be set in early init code already, add a
virtio anchor which is builtin in case virtio is enabled.

Signed-off-by: Juergen Gross <jgross@suse.com>
Tested-by: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com> # Arm64 guest using Xen
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
Link: https://lore.kernel.org/r/20220622063838.8854-2-jgross@suse.com
Signed-off-by: Juergen Gross <jgross@suse.com>
2022-08-01 07:42:49 +02:00
Peter Zijlstra
a1a5482a2c x86/extable: Fix ex_handler_msr() print condition
On Fri, Jun 17, 2022 at 02:08:52PM +0300, Stephane Eranian wrote:
> Some changes to the way invalid MSR accesses are reported by the
> kernel is causing some problems with messages printed on the
> console.
>
> We have seen several cases of ex_handler_msr() printing invalid MSR
> accesses once but the callstack multiple times causing confusion on
> the console.

> The problem here is that another earlier commit (5.13):
>
> a358f40600 ("once: implement DO_ONCE_LITE for non-fast-path "do once" functionality")
>
> Modifies all the pr_*_once() calls to always return true claiming
> that no caller is ever checking the return value of the functions.
>
> This is why we are seeing the callstack printed without the
> associated printk() msg.

Extract the ONCE_IF(cond) part into __ONCE_LTE_IF() and use that to
implement DO_ONCE_LITE_IF() and fix the extable code.

Fixes: a358f40600 ("once: implement DO_ONCE_LITE for non-fast-path "do once" functionality")
Reported-by: Stephane Eranian <eranian@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Stephane Eranian <eranian@google.com>
Link: https://lkml.kernel.org/r/YqyVFsbviKjVGGZ9@worktop.programming.kicks-ass.net
2022-07-21 10:39:42 +02:00
Nadav Amit
8f1d56f64f x86/mm/tlb: Ignore f->new_tlb_gen when zero
Commit aa44284960 ("x86/mm/tlb: Avoid reading mm_tlb_gen when
possible") introduced an optimization to skip superfluous TLB
flushes based on the generation provided in flush_tlb_info.

However, arch_tlbbatch_flush() does not provide any generation in
flush_tlb_info and populates the flush_tlb_info generation with
0.  This 0 is causes the flush_tlb_info to be interpreted as a
superfluous, old flush.  As a result, try_to_unmap_one() would
not perform any TLB flushes.

Fix it by checking whether f->new_tlb_gen is nonzero. Zero value
is anyhow is an invalid generation value. To avoid future
confusion, introduce TLB_GENERATION_INVALID constant and use it
properly. Add warnings to ensure no partial flushes are done with
TLB_GENERATION_INVALID or when f->mm is NULL, since this does not
make any sense.

In addition, add the missing unlikely().

[ dhansen: change VM_BUG_ON() -> VM_WARN_ON(), clarify changelog ]

Fixes: aa44284960 ("x86/mm/tlb: Avoid reading mm_tlb_gen when possible")
Reported-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Nadav Amit <namit@vmware.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Tested-by: Hugh Dickins <hughd@google.com>
Link: https://lkml.kernel.org/r/20220710232837.3618-1-namit@vmware.com
2022-07-19 09:04:52 -07:00
Anshuman Khandual
4867fbbdd6 x86/mm: move protection_map[] inside the platform
This moves protection_map[] inside the platform and makes it a static. 
This also defines a helper function add_encrypt_protection_map() that can
update the protection_map[] array with pgprot_encrypted().

Link: https://lkml.kernel.org/r/20220711070600.2378316-7-anshuman.khandual@arm.com
Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Brian Cain <bcain@quicinc.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Chris Zankel <chris@zankel.net>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Dinh Nguyen <dinguyen@kernel.org>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Guo Ren <guoren@kernel.org>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Huacai Chen <chenhuacai@kernel.org>
Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: Jonas Bonn <jonas@southpole.se>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Rich Felker <dalias@libc.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Sam Ravnborg <sam@ravnborg.org>
Cc: Stafford Horne <shorne@gmail.com>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vineet Gupta <vgupta@kernel.org>
Cc: WANG Xuerui <kernel@xen0n.name>
Cc: Will Deacon <will@kernel.org>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-07-17 17:14:38 -07:00
Juergen Gross
230ec83d42 x86/pat: Fix x86_has_pat_wp()
x86_has_pat_wp() is using a wrong test, as it relies on the normal
PAT configuration used by the kernel. In case the PAT MSR has been
setup by another entity (e.g. Xen hypervisor) it might return false
even if the PAT configuration is allowing WP mappings. This due to the
fact that when running as Xen PV guest the PAT MSR is setup by the
hypervisor and cannot be changed by the guest. This results in the WP
related entry to be at a different position when running as Xen PV
guest compared to the bare metal or fully virtualized case.

The correct way to test for WP support is:

1. Get the PTE protection bits needed to select WP mode by reading
   __cachemode2pte_tbl[_PAGE_CACHE_MODE_WP] (depending on the PAT MSR
   setting this might return protection bits for a stronger mode, e.g.
   UC-)
2. Translate those bits back into the real cache mode selected by those
   PTE bits by reading __pte2cachemode_tbl[__pte2cm_idx(prot)]
3. Test for the cache mode to be _PAGE_CACHE_MODE_WP

Fixes: f88a68facd ("x86/mm: Extend early_memremap() support with additional attrs")
Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: <stable@vger.kernel.org> # 4.14
Link: https://lore.kernel.org/r/20220503132207.17234-1-jgross@suse.com
2022-07-13 12:44:04 +02:00
Lukas Bulwahn
9de76f41ea x86/mm: Refer to the intended config STRICT_DEVMEM in a comment
Commit a4866aa812 ("mm: Tighten x86 /dev/mem with zeroing reads") adds a
comment to the function devmem_is_allowed() referring to a non-existing
config STRICT_IOMEM, whereas the comment very likely intended to refer to
the config STRICT_DEVMEM, as the commit adds some behavior for the config
STRICT_DEVMEM.

Most of the initial analysis was actually done by Dave Hansen in the
email thread below (see Link).

Refer to the intended and existing config STRICT_DEVMEM.

Suggested-by: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: Lukas Bulwahn <lukas.bulwahn@gmail.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lore.kernel.org/r/f9074e8d-9314-9d7d-7bf5-5b5538c8be8d@intel.com/
Link: https://lkml.kernel.org/r/20220707115442.21107-1-lukas.bulwahn@gmail.com
2022-07-07 09:51:36 -07:00
Frederic Weisbecker
6f0e6c1598 context_tracking: Take IRQ eqs entrypoints over RCU
The RCU dynticks counter is going to be merged into the context tracking
subsystem. Prepare with moving the IRQ extended quiescent states
entrypoints to context tracking. For now those are dumb redirection to
existing RCU calls.

[ paulmck: Apply Stephen Rothwell feedback from -next. ]
[ paulmck: Apply Nathan Chancellor feedback. ]

Acked-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Cc: Uladzislau Rezki <uladzislau.rezki@sony.com>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Nicolas Saenz Julienne <nsaenz@kernel.org>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Xiongfeng Wang <wangxiongfeng2@huawei.com>
Cc: Yu Liao <liaoyu15@huawei.com>
Cc: Phil Auld <pauld@redhat.com>
Cc: Paul Gortmaker<paul.gortmaker@windriver.com>
Cc: Alex Belits <abelits@marvell.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Nicolas Saenz Julienne <nsaenzju@redhat.com>
Tested-by: Nicolas Saenz Julienne <nsaenzju@redhat.com>
2022-07-05 13:32:59 -07:00
Peter Zijlstra
9bb2ec608a objtool: Update Retpoline validation
Update retpoline validation with the new CONFIG_RETPOLINE requirement of
not having bare naked RET instructions.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
2022-06-27 10:33:59 +02:00
Kim Phillips
0ee9073000 x86/sev: Avoid using __x86_return_thunk
Specifically, it's because __enc_copy() encrypts the kernel after
being relocated outside the kernel in sme_encrypt_execute(), and the
RET macro's jmp offset isn't amended prior to execution.

Signed-off-by: Kim Phillips <kim.phillips@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
2022-06-27 10:33:58 +02:00
Peter Xu
d92725256b mm: avoid unnecessary page fault retires on shared memory types
I observed that for each of the shared file-backed page faults, we're very
likely to retry one more time for the 1st write fault upon no page.  It's
because we'll need to release the mmap lock for dirty rate limit purpose
with balance_dirty_pages_ratelimited() (in fault_dirty_shared_page()).

Then after that throttling we return VM_FAULT_RETRY.

We did that probably because VM_FAULT_RETRY is the only way we can return
to the fault handler at that time telling it we've released the mmap lock.

However that's not ideal because it's very likely the fault does not need
to be retried at all since the pgtable was well installed before the
throttling, so the next continuous fault (including taking mmap read lock,
walk the pgtable, etc.) could be in most cases unnecessary.

It's not only slowing down page faults for shared file-backed, but also add
more mmap lock contention which is in most cases not needed at all.

To observe this, one could try to write to some shmem page and look at
"pgfault" value in /proc/vmstat, then we should expect 2 counts for each
shmem write simply because we retried, and vm event "pgfault" will capture
that.

To make it more efficient, add a new VM_FAULT_COMPLETED return code just to
show that we've completed the whole fault and released the lock.  It's also
a hint that we should very possibly not need another fault immediately on
this page because we've just completed it.

This patch provides a ~12% perf boost on my aarch64 test VM with a simple
program sequentially dirtying 400MB shmem file being mmap()ed and these are
the time it needs:

  Before: 650.980 ms (+-1.94%)
  After:  569.396 ms (+-1.38%)

I believe it could help more than that.

We need some special care on GUP and the s390 pgfault handler (for gmap
code before returning from pgfault), the rest changes in the page fault
handlers should be relatively straightforward.

Another thing to mention is that mm_account_fault() does take this new
fault as a generic fault to be accounted, unlike VM_FAULT_RETRY.

I explicitly didn't touch hmm_vma_fault() and break_ksm() because they do
not handle VM_FAULT_RETRY even with existing code, so I'm literally keeping
them as-is.

Link: https://lkml.kernel.org/r/20220530183450.42886-1-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Vineet Gupta <vgupta@kernel.org>
Acked-by: Guo Ren <guoren@kernel.org>
Acked-by: Max Filippov <jcmvbkbc@gmail.com>
Acked-by: Christian Borntraeger <borntraeger@linux.ibm.com>
Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc)
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Reviewed-by: Alistair Popple <apopple@nvidia.com>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>	[arm part]
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Stafford Horne <shorne@gmail.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Johannes Berg <johannes@sipsolutions.net>
Cc: Brian Cain <bcain@quicinc.com>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Richard Weinberger <richard@nod.at>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Janosch Frank <frankja@linux.ibm.com>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Anton Ivanov <anton.ivanov@cambridgegreys.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Jonas Bonn <jonas@southpole.se>
Cc: Will Deacon <will@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Stefan Kristiansson <stefan.kristiansson@saunalahti.fi>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Chris Zankel <chris@zankel.net>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dinh Nguyen <dinguyen@kernel.org>
Cc: Rich Felker <dalias@libc.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Helge Deller <deller@gmx.de>
Cc: Yoshinori Sato <ysato@users.osdn.me>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-06-16 19:48:27 -07:00
Peter Xu
cd16dd0373 mm/x86: remove dead code for hugetlbpage.c
It seems to exist since the old times and never used once.  Remove them.

Link: https://lkml.kernel.org/r/20220525195220.10241-1-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Acked-by: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-06-16 19:48:27 -07:00
Ira Weiny
54ee184404 x86/pkeys: Clarify PKRU_AD_KEY macro
When changing the PKRU_AD_KEY macro to be used for PKS the name came
into question.[1]

The intent of PKRU_AD_KEY is to set an initial value for the PKRU
register but that is just a mask value.

Clarify this by changing the name to PKRU_AD_MASK().

NOTE the checkpatch errors are ignored for the init_pkru_value to align
the values in the code.

[1] https://lore.kernel.org/lkml/eff862e2-bfaa-9e12-42b5-a12467d72a22@intel.com/

Suggested-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lkml.kernel.org/r/20220419170649.1022246-3-ira.weiny@intel.com
2022-06-07 16:06:33 -07:00
Nadav Amit
aa44284960 x86/mm/tlb: Avoid reading mm_tlb_gen when possible
On extreme TLB shootdown storms, the mm's tlb_gen cacheline is highly
contended and reading it should (arguably) be avoided as much as
possible.

Currently, flush_tlb_func() reads the mm's tlb_gen unconditionally,
even when it is not necessary (e.g., the mm was already switched).
This is wasteful.

Moreover, one of the existing optimizations is to read mm's tlb_gen to
see if there are additional in-flight TLB invalidations and flush the
entire TLB in such a case. However, if the request's tlb_gen was already
flushed, the benefit of checking the mm's tlb_gen is likely to be offset
by the overhead of the check itself.

Running will-it-scale with tlb_flush1_threads show a considerable
benefit on 56-core Skylake (up to +24%):

threads		Baseline (v5.17+)	+Patch
1		159960			160202
5		310808			308378 (-0.7%)
10		479110			490728
15		526771			562528
20		534495			587316
25		547462			628296
30		579616			666313
35		594134			701814
40		612288			732967
45		617517			749727
50		637476			735497
55		614363			778913 (+24%)

Signed-off-by: Nadav Amit <namit@vmware.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Andy Lutomirski <luto@kernel.org>
Link: https://lkml.kernel.org/r/20220606180123.2485171-1-namit@vmware.com
2022-06-07 08:48:03 -07:00
Juergen Gross
3f9dfbebdc virtio: replace arch_has_restricted_virtio_memory_access()
Instead of using arch_has_restricted_virtio_memory_access() together
with CONFIG_ARCH_HAS_RESTRICTED_VIRTIO_MEMORY_ACCESS, replace those
with platform_has() and a new platform feature
PLATFORM_VIRTIO_RESTRICTED_MEM_ACCESS.

Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>
Tested-by: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com> # Arm64 only
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Borislav Petkov <bp@suse.de>
2022-06-06 08:22:01 +02:00
Linus Torvalds
0b7da15c21 Use PAGE_ALIGNED() instead of open coding it in the x86/mm code.
-----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmKcdlsTHHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoaDeD/oDyKpbOf+ta83AyZm+h6Q+KzYf44n6
 XRaT8Ip1f/aUk+Ucv2YLpZT/LRiBenmJpsdDRaGhje83pS2cnXStd8UhEBjiWmeq
 vBGR8QpXJtFna1Ouizwwq+Pp2mnktK+YEk+Qf17z9UwrRzqXlgTIJqPWsVY1plmV
 lHVhThdkr09PR4aSBN9TQzTFxG0Qtf3mHbsRyPTzODldMtSbdCy9Zy2xlTigDIJX
 uTU/ykoB8qZpJguacAZpExwxvoRHy95cpxNCQfC5paeAv3gJ0U1b3PAJA4VkVlko
 h/2crEzHV0MZRx12o1le72K/2UfuvTeb3949VD3ZBIE5fPSH/K/niwl9BtntXSQj
 Bs0Kbm3k94tJz2VunG0FO7ayJ8/KL98zLHXQR6yRBTGvtkj9m7vhUMaN268NSLQH
 meAIuMreJejfsKV2Z/z9y+nSWebKdEQ0xkOq51qe3s2ivroKshTUNR0h7S35xyw5
 +R8lzJZH4rVVBOcxzCLTaIVsJbVHGfE65gb6JIyUT2sRRxXm53gYPC2rvIRzi8eA
 1h1ODpz0pn3LewXsigOJGmiK11Dda5uJAF58t1BJmGi9kXSH/CVfYBTcDOXcYNHy
 sA+oq+yrQ5RSywv1D9zuDzOgbcEZ1mB82aWuy4Li+k9uhdCyNIsNy7Nbw9paCDZk
 Oy+IMDLroCieSw==
 =KN22
 -----END PGP SIGNATURE-----

Merge tag 'x86-mm-2022-06-05' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 mm cleanup from Thomas Gleixner:
 "Use PAGE_ALIGNED() instead of open coding it in the x86/mm code"

* tag 'x86-mm-2022-06-05' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/mm: Use PAGE_ALIGNED(x) instead of IS_ALIGNED(x, PAGE_SIZE)
2022-06-05 10:57:35 -07:00
Linus Torvalds
35cdd8656e libnvdimm for 5.19
- Add support for clearing memory error via pwrite(2) on DAX
 
 - Fix 'security overwrite' support in the presence of media errors
 
 - Miscellaneous cleanups and fixes for nfit_test (nvdimm unit tests)
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQSbo+XnGs+rwLz9XGXfioYZHlFsZwUCYpFPcQAKCRDfioYZHlFs
 Z9A3AQCdfoT5sY3OK+I/3oTvJ//6lw2MtXrnXFM046ICKPi9sgD8CzR9mRAHA+vj
 kxOtJEU2bA9naninXGORsDUndiNkwQo=
 =gVIn
 -----END PGP SIGNATURE-----

Merge tag 'libnvdimm-for-5.19' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm

Pull libnvdimm and DAX updates from Dan Williams:
 "New support for clearing memory errors when a file is in DAX mode,
  alongside with some other fixes and cleanups.

  Previously it was only possible to clear these errors using a truncate
  or hole-punch operation to trigger the filesystem to reallocate the
  block, now, any page aligned write can opportunistically clear errors
  as well.

  This change spans x86/mm, nvdimm, and fs/dax, and has received the
  appropriate sign-offs. Thanks to Jane for her work on this.

  Summary:

   - Add support for clearing memory error via pwrite(2) on DAX

   - Fix 'security overwrite' support in the presence of media errors

   - Miscellaneous cleanups and fixes for nfit_test (nvdimm unit tests)"

* tag 'libnvdimm-for-5.19' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
  pmem: implement pmem_recovery_write()
  pmem: refactor pmem_clear_poison()
  dax: add .recovery_write dax_operation
  dax: introduce DAX_RECOVERY_WRITE dax access mode
  mce: fix set_mce_nospec to always unmap the whole page
  x86/mce: relocate set{clear}_mce_nospec() functions
  acpi/nfit: rely on mce->misc to determine poison granularity
  testing: nvdimm: asm/mce.h is not needed in nfit.c
  testing: nvdimm: iomap: make __nfit_test_ioremap a macro
  nvdimm: Allow overwrite in the presence of disabled dimms
  tools/testing/nvdimm: remove unneeded flush_workqueue
2022-05-27 15:49:30 -07:00
Fanjun Kong
e19d11267f x86/mm: Use PAGE_ALIGNED(x) instead of IS_ALIGNED(x, PAGE_SIZE)
The <linux/mm.h> already provides the PAGE_ALIGNED() macro. Let's
use this macro instead of IS_ALIGNED() and passing PAGE_SIZE directly.

No change in functionality.

[ mingo: Tweak changelog. ]

Signed-off-by: Fanjun Kong <bh1scw@gmail.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20220526142038.1582839-1-bh1scw@gmail.com
2022-05-27 12:19:56 +02:00
Linus Torvalds
98931dd95f Yang Shi has improved the behaviour of khugepaged collapsing of readonly
file-backed transparent hugepages.
 
 Johannes Weiner has arranged for zswap memory use to be tracked and
 managed on a per-cgroup basis.
 
 Munchun Song adds a /proc knob ("hugetlb_optimize_vmemmap") for runtime
 enablement of the recent huge page vmemmap optimization feature.
 
 Baolin Wang contributes a series to fix some issues around hugetlb
 pagetable invalidation.
 
 Zhenwei Pi has fixed some interactions between hwpoisoned pages and
 virtualization.
 
 Tong Tiangen has enabled the use of the presently x86-only
 page_table_check debugging feature on arm64 and riscv.
 
 David Vernet has done some fixup work on the memcg selftests.
 
 Peter Xu has taught userfaultfd to handle write protection faults against
 shmem- and hugetlbfs-backed files.
 
 More DAMON development from SeongJae Park - adding online tuning of the
 feature and support for monitoring of fixed virtual address ranges.  Also
 easier discovery of which monitoring operations are available.
 
 Nadav Amit has done some optimization of TLB flushing during mprotect().
 
 Neil Brown continues to labor away at improving our swap-over-NFS support.
 
 David Hildenbrand has some fixes to anon page COWing versus
 get_user_pages().
 
 Peng Liu fixed some errors in the core hugetlb code.
 
 Joao Martins has reduced the amount of memory consumed by device-dax's
 compound devmaps.
 
 Some cleanups of the arch-specific pagemap code from Anshuman Khandual.
 
 Muchun Song has found and fixed some errors in the TLB flushing of
 transparent hugepages.
 
 Roman Gushchin has done more work on the memcg selftests.
 
 And, of course, many smaller fixes and cleanups.  Notably, the customary
 million cleanup serieses from Miaohe Lin.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCYo52xQAKCRDdBJ7gKXxA
 jtJFAQD238KoeI9z5SkPMaeBRYSRQmNll85mxs25KapcEgWgGQD9FAb7DJkqsIVk
 PzE+d9hEfirUGdL6cujatwJ6ejYR8Q8=
 =nFe6
 -----END PGP SIGNATURE-----

Merge tag 'mm-stable-2022-05-25' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull MM updates from Andrew Morton:
 "Almost all of MM here. A few things are still getting finished off,
  reviewed, etc.

   - Yang Shi has improved the behaviour of khugepaged collapsing of
     readonly file-backed transparent hugepages.

   - Johannes Weiner has arranged for zswap memory use to be tracked and
     managed on a per-cgroup basis.

   - Munchun Song adds a /proc knob ("hugetlb_optimize_vmemmap") for
     runtime enablement of the recent huge page vmemmap optimization
     feature.

   - Baolin Wang contributes a series to fix some issues around hugetlb
     pagetable invalidation.

   - Zhenwei Pi has fixed some interactions between hwpoisoned pages and
     virtualization.

   - Tong Tiangen has enabled the use of the presently x86-only
     page_table_check debugging feature on arm64 and riscv.

   - David Vernet has done some fixup work on the memcg selftests.

   - Peter Xu has taught userfaultfd to handle write protection faults
     against shmem- and hugetlbfs-backed files.

   - More DAMON development from SeongJae Park - adding online tuning of
     the feature and support for monitoring of fixed virtual address
     ranges. Also easier discovery of which monitoring operations are
     available.

   - Nadav Amit has done some optimization of TLB flushing during
     mprotect().

   - Neil Brown continues to labor away at improving our swap-over-NFS
     support.

   - David Hildenbrand has some fixes to anon page COWing versus
     get_user_pages().

   - Peng Liu fixed some errors in the core hugetlb code.

   - Joao Martins has reduced the amount of memory consumed by
     device-dax's compound devmaps.

   - Some cleanups of the arch-specific pagemap code from Anshuman
     Khandual.

   - Muchun Song has found and fixed some errors in the TLB flushing of
     transparent hugepages.

   - Roman Gushchin has done more work on the memcg selftests.

  ... and, of course, many smaller fixes and cleanups. Notably, the
  customary million cleanup serieses from Miaohe Lin"

* tag 'mm-stable-2022-05-25' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (381 commits)
  mm: kfence: use PAGE_ALIGNED helper
  selftests: vm: add the "settings" file with timeout variable
  selftests: vm: add "test_hmm.sh" to TEST_FILES
  selftests: vm: check numa_available() before operating "merge_across_nodes" in ksm_tests
  selftests: vm: add migration to the .gitignore
  selftests/vm/pkeys: fix typo in comment
  ksm: fix typo in comment
  selftests: vm: add process_mrelease tests
  Revert "mm/vmscan: never demote for memcg reclaim"
  mm/kfence: print disabling or re-enabling message
  include/trace/events/percpu.h: cleanup for "percpu: improve percpu_alloc_percpu event trace"
  include/trace/events/mmflags.h: cleanup for "tracing: incorrect gfp_t conversion"
  mm: fix a potential infinite loop in start_isolate_page_range()
  MAINTAINERS: add Muchun as co-maintainer for HugeTLB
  zram: fix Kconfig dependency warning
  mm/shmem: fix shmem folio swapoff hang
  cgroup: fix an error handling path in alloc_pagecache_max_30M()
  mm: damon: use HPAGE_PMD_SIZE
  tracing: incorrect isolate_mote_t cast in mm_vmscan_lru_isolate
  nodemask.h: fix compilation error with GCC12
  ...
2022-05-26 12:32:41 -07:00
Linus Torvalds
3f306ea2e1 dma-mapping updates for Linux 5.19
- don't over-decrypt memory (Robin Murphy)
  - takes min align mask into account for the swiotlb max mapping size
    (Tianyu Lan)
  - use GFP_ATOMIC in dma-debug (Mikulas Patocka)
  - fix DMA_ATTR_NO_KERNEL_MAPPING on xen/arm (me)
  - don't fail on highmem CMA pages in dma_direct_alloc_pages (me)
  - cleanup swiotlb initialization and share more code with swiotlb-xen
    (me, Stefano Stabellini)
 -----BEGIN PGP SIGNATURE-----
 
 iQI/BAABCgApFiEEgdbnc3r/njty3Iq9D55TZVIEUYMFAmKObTQLHGhjaEBsc3Qu
 ZGUACgkQD55TZVIEUYObmA//dIcDB/q4iFGD+WJh4MhM+asx0ZsdF2OJz42WEhgT
 Z9duOrgcneEQundCamqJP9rNTs980LHDA8uWQC5rZEc9vxuRVOdS7bSgYRUwWh6B
 r0ZjOsvQCn+ChoZML8uyk4rfmEINq+EvJuec3G5fgecZOhPuJS2i2uzzv5cHwqgP
 ChC0fwyZlkfdECXgvZXbEoCJLfTgGNlziN6Ai8dirSoqgEQUoCsY89/M7OiEBvV2
 R4XUWD7OvQERfB4t6xLuUHyzf9PAuWB+OiblRVNeAmK3lMjxVrc3k4kIowgklnzD
 8hfmphAa9Zou3zdfi6Gd4fiQRHRVOwKVp1rtqUmJ+lPSiwyMzu64z9ld2+2qac0h
 V4sSr/yJkhxnBT4/0MkTChvhnRobisackpUzNRpiM4ck7cNVb7eAvkISsbH+pWI9
 aEexPhbyskjlV+GOyM4QL4ygG0dpXY0HSyoh6uaSVsaXMycnWIsJCPidXxV1HGV0
 q2/RLHuHwYxia8cYCF01/DQvwOKSjwbU0zModxtRezGD5GYh2C0a+SrA1aX+qiTu
 yGJCs2UHtSQstAt78tTVp499YeDeL/oGSQkPAu8zyRkSczzF+CncGTuXyoJbAWyK
 otcgERWljgZ4scxjfu1uacfoVhKQ7nOu7hiJokL0U80FESAennLC3ZlocvB9h/ff
 HNA=
 =n2rk
 -----END PGP SIGNATURE-----

Merge tag 'dma-mapping-5.19-2022-05-25' of git://git.infradead.org/users/hch/dma-mapping

Pull dma-mapping updates from Christoph Hellwig:

 - don't over-decrypt memory (Robin Murphy)

 - takes min align mask into account for the swiotlb max mapping size
   (Tianyu Lan)

 - use GFP_ATOMIC in dma-debug (Mikulas Patocka)

 - fix DMA_ATTR_NO_KERNEL_MAPPING on xen/arm (me)

 - don't fail on highmem CMA pages in dma_direct_alloc_pages (me)

 - cleanup swiotlb initialization and share more code with swiotlb-xen
   (me, Stefano Stabellini)

* tag 'dma-mapping-5.19-2022-05-25' of git://git.infradead.org/users/hch/dma-mapping: (23 commits)
  dma-direct: don't over-decrypt memory
  swiotlb: max mapping size takes min align mask into account
  swiotlb: use the right nslabs-derived sizes in swiotlb_init_late
  swiotlb: use the right nslabs value in swiotlb_init_remap
  swiotlb: don't panic when the swiotlb buffer can't be allocated
  dma-debug: change allocation mode from GFP_NOWAIT to GFP_ATIOMIC
  dma-direct: don't fail on highmem CMA pages in dma_direct_alloc_pages
  swiotlb-xen: fix DMA_ATTR_NO_KERNEL_MAPPING on arm
  x86: remove cruft from <asm/dma-mapping.h>
  swiotlb: remove swiotlb_init_with_tbl and swiotlb_init_late_with_tbl
  swiotlb: merge swiotlb-xen initialization into swiotlb
  swiotlb: provide swiotlb_init variants that remap the buffer
  swiotlb: pass a gfp_mask argument to swiotlb_init_late
  swiotlb: add a SWIOTLB_ANY flag to lift the low memory restriction
  swiotlb: make the swiotlb_init interface more useful
  x86: centralize setting SWIOTLB_FORCE when guest memory encryption is enabled
  x86: remove the IOMMU table infrastructure
  MIPS/octeon: use swiotlb_init instead of open coding it
  arm/xen: don't check for xen_initial_domain() in xen_create_contiguous_region
  swiotlb: rename swiotlb_late_init_with_default_size
  ...
2022-05-25 19:18:36 -07:00
Linus Torvalds
0bf13a8436 kernel-hardening updates for v5.19-rc1
- usercopy hardening expanded to check other allocation types
   (Matthew Wilcox, Yuanzheng Song)
 
 - arm64 stackleak behavioral improvements (Mark Rutland)
 
 - arm64 CFI code gen improvement (Sami Tolvanen)
 
 - LoadPin LSM block dev API adjustment (Christoph Hellwig)
 
 - Clang randstruct support (Bill Wendling, Kees Cook)
 -----BEGIN PGP SIGNATURE-----
 
 iQJKBAABCgA0FiEEpcP2jyKd1g9yPm4TiXL039xtwCYFAmKL1kMWHGtlZXNjb29r
 QGNocm9taXVtLm9yZwAKCRCJcvTf3G3AJlz6D/9lYEwDQYwKVK6fsXdgcs/eUkqc
 P06KGm7jDiYiua34LMpgu35wkRcxVDzB92kzQmt7yaVqhlIGjO9wnP+uZrq8q/LS
 X9FSb457fREg0XLPX5XC60abHYyikvgJMf06dSLaBcRq1Wzqwp5JZPpLZJUAM2ab
 rM1Vq0brfF1+lPAPECx1sYYNksP9XTw0dtzUu8D9tlTQDFAhKYhV6Io5yRFkA4JH
 ELSHjJHlNgLYeZE5IfWHRQBb+yofjnt61IwoVkqa5lSfoyvKpBPF5G+3gOgtdkyv
 A8So2aG/bMNUUY80Th5ojiZ6V7z5SYjUmHRil6I/swAdkc825n2wM+AQqsxv6U4I
 VvGz3cxaKklERw5N+EJw4amivcgm1jEppZ7qCx9ysLwVg/LI050qhv/T10TYPmOX
 0sQEpZvbKuqGb6nzWo6DME8OpZ27yIa/oRzBHdkIkfkEefYlKWS+dfvWb/73cltj
 jx066Znk1hHZWGT48EsRmxdGAHn4kfIMcMgIs1ki1OO2II6LoXyaFJ0wSAYItxpz
 5gCmDMjkGFRrtXXPEhi6kfKKpOuQux+BmpbVfEzox7Gnrf45sp92cYLncmpAsFB3
 91nPa4/utqb/9ijFCIinazLdcUBPO8I1C8FOHDWSFCnNt4d3j2ozpLbrKWyQsm7+
 RCGdcy+NU/FH1FwZlg==
 =nxsC
 -----END PGP SIGNATURE-----

Merge tag 'kernel-hardening-v5.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux

Pull kernel hardening updates from Kees Cook:

 - usercopy hardening expanded to check other allocation types (Matthew
   Wilcox, Yuanzheng Song)

 - arm64 stackleak behavioral improvements (Mark Rutland)

 - arm64 CFI code gen improvement (Sami Tolvanen)

 - LoadPin LSM block dev API adjustment (Christoph Hellwig)

 - Clang randstruct support (Bill Wendling, Kees Cook)

* tag 'kernel-hardening-v5.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: (34 commits)
  loadpin: stop using bdevname
  mm: usercopy: move the virt_addr_valid() below the is_vmalloc_addr()
  gcc-plugins: randstruct: Remove cast exception handling
  af_unix: Silence randstruct GCC plugin warning
  niu: Silence randstruct warnings
  big_keys: Use struct for internal payload
  gcc-plugins: Change all version strings match kernel
  randomize_kstack: Improve docs on requirements/rationale
  lkdtm/stackleak: fix CONFIG_GCC_PLUGIN_STACKLEAK=n
  arm64: entry: use stackleak_erase_on_task_stack()
  stackleak: add on/off stack variants
  lkdtm/stackleak: check stack boundaries
  lkdtm/stackleak: prevent unexpected stack usage
  lkdtm/stackleak: rework boundary management
  lkdtm/stackleak: avoid spurious failure
  stackleak: rework poison scanning
  stackleak: rework stack high bound handling
  stackleak: clarify variable names
  stackleak: rework stack low bound handling
  stackleak: remove redundant check
  ...
2022-05-24 12:27:09 -07:00
Linus Torvalds
abc8babefb - A gargen variety of fixes which don't fit any other tip bucket:
- Remove function export
  - Correct asm constraint
  - Fix __setup handlers retval
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmKL6VkACgkQEsHwGGHe
 VUqs6g/+Ikpd4Mrou4P5Ul8QNdN9mEzwUfW6i8VpoA3h1L6mKkZxbUsbSz9xInjw
 MAhrcevujW6GwdQdus2sUcSlX+jxl6c/IlMdf8RegNPY/JBPDX4dRA7rPetvZEDm
 ZiIYVTiEzJoOzPDJeO7a3v5EHPsY6CjsCFhGz7hjIcrwQjzCLkL5MqG+WDAtebe+
 QVdbllD2RlZNPDyHYE5Lqh1h+Y0e4n6kS7LCWxexfHlNOZ5KBRVyIJvz/xOZFZ1/
 9oX0UDD2gfH5chLs8GKsr7cZYERMtNlKBPoxGzl8iKF4iUeiksdj3P5y+mdcFaDG
 YbM7aXewmbyLyiCkh1zXU6Mw3lK1VfUtVXtEYj+qXf1jWp59ctNEJkc6/VAcaKh7
 oS7MNG7Y44B8XwdH7MiqDE7eVCnqEjIR+BIiwjyXNLFP1AXZMAXuBzXPF/vZ3Gyf
 3N5vzO4VNEN6Oa1TReSspKwYvq2uPtHMjLX2rT6Py2ru32mj2dCc5E7GD83RKL8V
 vDIz4VGOZyGfjp6gClMBsyK4mYwSwgXbnOci7DJn56mMf2qzBJITILXc31zz4gX2
 E9kiBu/4Mwjnrx9QRpCNXu7iddBA3YM2NMtNlwBcCgZOFaFz/yOx9TpnugF17WHQ
 VVtQi8wlcsS+F05Y11b7euusMQyk1EpWabIrw8UQd+61Dwpz58Q=
 =/WGB
 -----END PGP SIGNATURE-----

Merge tag 'x86_misc_for_v5.19_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull misc x86 updates from Borislav Petkov:
 "A variety of fixes which don't fit any other tip bucket:

   - Remove unnecessary function export

   - Correct asm constraint

   - Fix __setup handlers retval"

* tag 'x86_misc_for_v5.19_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/mm: Cleanup the control_va_addr_alignment() __setup handler
  x86: Fix return value of __setup handlers
  x86/delay: Fix the wrong asm constraint in delay_loop()
  x86/amd_nb: Unexport amd_cache_northbridges()
2022-05-23 19:32:59 -07:00
Linus Torvalds
c415b53ad0 - A sparse address space annotation fix
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmKLxsIACgkQEsHwGGHe
 VUrm0w//foA+nrqdDxGHEj7xLSdymnqZ5KOfjDQyL8hdMOSVdwdUf1XSwQY8TjD6
 7U4gGw0TqBBooVLguKgY6NQz/wiucJ5FuU8uibkESMXfFO5oGPTwRupNAs8tV1T6
 Jhs513VWHrT/8g3XYFqVQMo3MmBmtX3tF0kj+FxT9yOHNQphJ6y2c6CbcpLMZt3s
 MMTaaQh6g65pzi9sW6Gl4iSEhR0COYHEq+zPPFdZOc1vB3xjhyEkz1Fuq/lMR+pr
 K5tTzaZ9tuelcjw7ZPni+WoyapI092f1EWV6nZQb6EV6bXasM+HBDLFPKen4GhyZ
 TqWYqaycDqNFiMkrKrXDHpOCNKWWYshwdFTZ6k4+FZSCx+4y2Sjjt4KdUEJW2qee
 lWBsiqRPNfwkPrg9sMO+3TDt4U28KywsaA6U+oQ+AJn7kLe9dQqNnlzVvajBer32
 +1kk2pY8m245ulVE+pcY+mhVlrOyuVpYk8H/7LnRiZAgucrkhHeJXSoW6VUn1G0G
 MmqLxCQPQ6kF4617m+BXavrUdLV2sotA0xZJ8mG4XNSCMwQ1Ymr2UJM6u3Jrq7SA
 p6XJQv/1e4NluC2h97N1e6nHxM61JJdFxYG/y+eDidnctH+TU6wIM2TFRXBNCsy3
 vqkIimuuV/9PfmMdzdr8o7KX9gmpY4dBeCrOSPlHiA0yaYYdmbY=
 =0DKB
 -----END PGP SIGNATURE-----

Merge tag 'x86_mm_for_v5.19_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 mm fixlet from Borislav Petkov:

 - A sparse address space annotation fix

* tag 'x86_mm_for_v5.19_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/fault: Cast an argument to the proper address space in prefetch()
2022-05-23 18:59:24 -07:00
Linus Torvalds
a13dc4d409 - Serious sanitization and cleanup of the whole APERF/MPERF and
frequency invariance code along with removing the need for unnecessary IPIs
 
 - Finally remove a.out support
 
 - The usual trivial cleanups and fixes all over x86
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmKLn48ACgkQEsHwGGHe
 VUpbkg/+PELrc0y/qxLM/+dyftKYY16Rhk6ZVAXfwqlh5ldyVQcLMUgKwDqYyTn2
 XmgdI3cTcFlH2K7j6ANWLu0I9NPaviimUcEdMVcXt7aY5mGWk/q4hIyCYM8d41sV
 qKx4OjNSdyoofG6MtwFLJDuoeVg99Bqgvm4nP9BuxL0dZJ2hfcUZ7MTxYCx9ZYjK
 /3trx0NV287Yg/wm91EU0nLQzy9xbGS7WCmMnse6uxiUdm2vXbBt8oNFF4f747Dj
 0cArfNrMgYq4Cv5bgt/Ki0NU/n4EOGDpJUSyQwlnjDKeN81ESPy7IWtTQ6cE/rJK
 BZeUIPiGiYHwtqXv0UTAPGLG8cAqKeab8u0xAOyrFVDkTc0+WlPJRsUAOmRRGIGE
 M8ZjoxrLeuFgxw6vKpVjaA+mDRj3qEpSH+IrTcekS98PN7gmVzvq03GobgGbT7YB
 xmtbThJa+514FfUVckkyC0+A56BknUIgVxwFPqrthE2atzYTbH67hW4U0yVWXXr7
 2VI7ttozBrYVgHCWhD9eoT0uhyD74Vl6pqHnqzY9ShIfKVUGvMgKHHg04nLLtF7W
 hm87xV3Q5UEmXhTmDzT1rUZ99mBUxGbWxk227I9raMugIh7pp9wIr57+7O0LRYfX
 TdnE2+tL8RMi7+XzRH5iLhnwkrvahBESeHSQ7GVI1Y2zMmmFN+0=
 =Dks/
 -----END PGP SIGNATURE-----

Merge tag 'x86_cleanups_for_v5.19_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 cleanups from Borislav Petkov:

 - Serious sanitization and cleanup of the whole APERF/MPERF and
   frequency invariance code along with removing the need for
   unnecessary IPIs

 - Finally remove a.out support

 - The usual trivial cleanups and fixes all over x86

* tag 'x86_cleanups_for_v5.19_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (21 commits)
  x86: Remove empty files
  x86/speculation: Add missing srbds=off to the mitigations= help text
  x86/prctl: Remove pointless task argument
  x86/aperfperf: Make it correct on 32bit and UP kernels
  x86/aperfmperf: Integrate the fallback code from show_cpuinfo()
  x86/aperfmperf: Replace arch_freq_get_on_cpu()
  x86/aperfmperf: Replace aperfmperf_get_khz()
  x86/aperfmperf: Store aperf/mperf data for cpu frequency reads
  x86/aperfmperf: Make parts of the frequency invariance code unconditional
  x86/aperfmperf: Restructure arch_scale_freq_tick()
  x86/aperfmperf: Put frequency invariance aperf/mperf data into a struct
  x86/aperfmperf: Untangle Intel and AMD frequency invariance init
  x86/aperfmperf: Separate AP/BP frequency invariance init
  x86/smp: Move APERF/MPERF code where it belongs
  x86/aperfmperf: Dont wake idle CPUs in arch_freq_get_on_cpu()
  x86/process: Fix kernel-doc warning due to a changed function name
  x86: Remove a.out support
  x86/mm: Replace nodes_weight() with nodes_empty() where appropriate
  x86: Replace cpumask_weight() with cpumask_empty() where appropriate
  x86/pkeys: Remove __arch_set_user_pkey_access() declaration
  ...
2022-05-23 18:17:09 -07:00
Linus Torvalds
c5a3d3c01e - Remove a bunch of chicken bit options to turn off CPU features which
are not really needed anymore
 
 - Misc fixes and cleanups
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmKLdfgACgkQEsHwGGHe
 VUpB5Q//TIGVgmnSd0YYxY2cIe047lfcd34D+3oEGk0d2FidtirP/tjgBqIXRuY5
 UncoveqBuI/6/7bodP/ANg9DNVXv2489eFYyZtEOLSGnfzV2AU10aw95cuQQG+BW
 YIc6bGSsgfiNo8Vtj4L3xkVqxOrqaCYnh74GTSNNANht3i8KH8Qq9n3qZTuMiF6R
 fH9xWak3TZB2nMzHdYrXh0sSR6eBHN3KYSiT0DsdlU9PUlavlSPFYQRiAlr6FL6J
 BuYQdlUaCQbINvaviGW4SG7fhX32RfF/GUNaBajB40TO6H98KZLpBBvstWQ841xd
 /o44o5wbghoGP1ne8OKwP+SaAV2bE6twd5eO1lpwcpXnQfATvjQ2imxvOiRhy5LY
 pFPt/hko9gKWJ6SI0SQ4tiKJALFPLWD6561scHU6PoriFhv0SRIaPmJyEsDYynMz
 bCXaPPsoovRwwwBfAxxQjljIlhQSBVt3gWZ8NWD1tYbNaqM+WK7xKBaONGh3OCw3
 iK7lsbbljtM0zmANImYyeo7+Hr1NVOmMiK2WZYbxhxgzH3l8v/6EbDt3I70WU57V
 9apCU3/nk/HFpX65SdW5qmuiWLVdH9NXrEqbvaUB4ApT18MdUUugewBhcGnf3Umu
 wEtltzziqcIkxzDoXXpBGWpX31S7PsM2XVDqYC7dwuNttgEw2Fc=
 =7AUX
 -----END PGP SIGNATURE-----

Merge tag 'x86_cpu_for_v5.19_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 CPU feature updates from Borislav Petkov:

 - Remove a bunch of chicken bit options to turn off CPU features which
   are not really needed anymore

 - Misc fixes and cleanups

* tag 'x86_cpu_for_v5.19_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/speculation: Add missing prototype for unpriv_ebpf_notify()
  x86/pm: Fix false positive kmemleak report in msr_build_context()
  x86/speculation/srbds: Do not try to turn mitigation off when not supported
  x86/cpu: Remove "noclflush"
  x86/cpu: Remove "noexec"
  x86/cpu: Remove "nosmep"
  x86/cpu: Remove CONFIG_X86_SMAP and "nosmap"
  x86/cpu: Remove "nosep"
  x86/cpu: Allow feature bit names from /proc/cpuinfo in clearcpuid=
2022-05-23 18:01:31 -07:00
Linus Torvalds
3a755ebcc2 Intel Trust Domain Extensions
This is the Intel version of a confidential computing solution called
 Trust Domain Extensions (TDX). This series adds support to run the
 kernel as part of a TDX guest. It provides similar guest protections to
 AMD's SEV-SNP like guest memory and register state encryption, memory
 integrity protection and a lot more.
 
 Design-wise, it differs from AMD's solution considerably: it uses
 a software module which runs in a special CPU mode called (Secure
 Arbitration Mode) SEAM. As the name suggests, this module serves as sort
 of an arbiter which the confidential guest calls for services it needs
 during its lifetime.
 
 Just like AMD's SNP set, this series reworks and streamlines certain
 parts of x86 arch code so that this feature can be properly accomodated.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmKLbisACgkQEsHwGGHe
 VUqZLg/7B55iygCwzz0W/KLcXL2cISatUpzGbFs1XTbE9DMz06BPkOsEjF2k8ckv
 kfZjgqhSx3GvUI80gK0Tn2M2DfIj3nKuNSXd1pfextP7AxEf68FFJsQz1Ju7bHpT
 pZaG+g8IK4+mnEHEKTCO9ANg/Zw8yqJLdtsCaCNE9SUGUfQ6m/ujTEfsambXDHNm
 khyCAgpIGSOt51/4apoR9ebyrNCaeVbDawpIPjTy+iyFRc/WyaLFV9CQ8klw4gbw
 r/90x2JYxvAf0/z/ifT9Wa+TnYiQ0d4VjFbfr0iJ4GcPn5L3EIoIKPE8vPGMpoSX
 fLSzoNmAOT3ja57ytUUQ3o0edoRUIPEdixOebf9qWvE/aj7W37YRzrlJ8Ej/x9Jy
 HcI4WZF6Dr1bh6FnI/xX2eVZRzLOL4j9gNyPCwIbvgr1NjDqQnxU7nhxVMmQhJrs
 IdiEcP5WYerLKfka/uF//QfWUg5mDBgFa1/3xK57Z3j0iKWmgjaPpR0SWlOKjj8G
 tr0gGN9ejikZTqXKGsHn8fv/R3bjXvbVD8z0IEcx+MIrRmZPnX2QBlg7UA1AXV5n
 HoVwPFdH1QAtjZq1MRcL4hTOjz3FkS68rg7ZH0f2GWJAzWmEGytBIhECRnN/PFFq
 VwRB4dCCt0bzqRxkiH5lzdgR+xqRe61juQQsMzg+Flv/trpXDqM=
 =ac9K
 -----END PGP SIGNATURE-----

Merge tag 'x86_tdx_for_v5.19_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull Intel TDX support from Borislav Petkov:
 "Intel Trust Domain Extensions (TDX) support.

  This is the Intel version of a confidential computing solution called
  Trust Domain Extensions (TDX). This series adds support to run the
  kernel as part of a TDX guest. It provides similar guest protections
  to AMD's SEV-SNP like guest memory and register state encryption,
  memory integrity protection and a lot more.

  Design-wise, it differs from AMD's solution considerably: it uses a
  software module which runs in a special CPU mode called (Secure
  Arbitration Mode) SEAM. As the name suggests, this module serves as
  sort of an arbiter which the confidential guest calls for services it
  needs during its lifetime.

  Just like AMD's SNP set, this series reworks and streamlines certain
  parts of x86 arch code so that this feature can be properly
  accomodated"

* tag 'x86_tdx_for_v5.19_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (34 commits)
  x86/tdx: Fix RETs in TDX asm
  x86/tdx: Annotate a noreturn function
  x86/mm: Fix spacing within memory encryption features message
  x86/kaslr: Fix build warning in KASLR code in boot stub
  Documentation/x86: Document TDX kernel architecture
  ACPICA: Avoid cache flush inside virtual machines
  x86/tdx/ioapic: Add shared bit for IOAPIC base address
  x86/mm: Make DMA memory shared for TD guest
  x86/mm/cpa: Add support for TDX shared memory
  x86/tdx: Make pages shared in ioremap()
  x86/topology: Disable CPU online/offline control for TDX guests
  x86/boot: Avoid #VE during boot for TDX platforms
  x86/boot: Set CR0.NE early and keep it set during the boot
  x86/acpi/x86/boot: Add multiprocessor wake-up support
  x86/boot: Add a trampoline for booting APs via firmware handoff
  x86/tdx: Wire up KVM hypercalls
  x86/tdx: Port I/O: Add early boot support
  x86/tdx: Port I/O: Add runtime hypercalls
  x86/boot: Port I/O: Add decompression-time support for TDX
  x86/boot: Port I/O: Allow to hook up alternative helpers
  ...
2022-05-23 17:51:12 -07:00
Linus Torvalds
eb39e37d5c AMD SEV-SNP support
Add to confidential guests the necessary memory integrity protection
 against malicious hypervisor-based attacks like data replay, memory
 remapping and others, thus achieving a stronger isolation from the
 hypervisor.
 
 At the core of the functionality is a new structure called a reverse
 map table (RMP) with which the guest has a say in which pages get
 assigned to it and gets notified when a page which it owns, gets
 accessed/modified under the covers so that the guest can take an
 appropriate action.
 
 In addition, add support for the whole machinery needed to launch a SNP
 guest, details of which is properly explained in each patch.
 
 And last but not least, the series refactors and improves parts of the
 previous SEV support so that the new code is accomodated properly and
 not just bolted on.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmKLU2AACgkQEsHwGGHe
 VUpb/Q//f4LGiJf4nw1flzpe90uIsHNwAafng3NOjeXmhI/EcOlqPf23WHPCgg3Z
 2umfa4sRZyj4aZubDd7tYAoq4qWrQ7pO7viWCNTh0InxBAILOoMPMuq2jSAbq0zV
 ASUJXeQ2bqjYxX4JV4N5f3HT2l+k68M0mpGLN0H+O+LV9pFS7dz7Jnsg+gW4ZP25
 PMPLf6FNzO/1tU1aoYu80YDP1ne4eReLrNzA7Y/rx+S2NAetNwPn21AALVgoD4Nu
 vFdKh4MHgtVbwaQuh0csb/+4vD+tDXAhc8lbIl+Abl9ZxJaDWtAJW5D9e2CnsHk1
 NOkHwnrzizzhtGK1g56YPUVRFAWhZYMOI1hR0zGPLQaVqBnN4b+iahPeRiV0XnGE
 PSbIHSfJdeiCkvLMCdIAmpE5mRshhRSUfl1CXTCdetMn8xV/qz/vG6bXssf8yhTV
 cfLGPHU7gfVmsbR9nk5a8KZ78PaytxOxfIDXvCy8JfQwlIWtieaCcjncrj+sdMJy
 0fdOuwvi4jma0cyYuPolKiS1Hn4ldeibvxXT7CZQlIx6jZShMbpfpTTJs11XdtHm
 PdDAc1TY3AqI33mpy9DhDQmx/+EhOGxY3HNLT7evRhv4CfdQeK3cPVUWgo4bGNVv
 ZnFz7nvmwpyufltW9K8mhEZV267174jXGl6/idxybnlVE7ESr2Y=
 =Y8kW
 -----END PGP SIGNATURE-----

Merge tag 'x86_sev_for_v5.19_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull AMD SEV-SNP support from Borislav Petkov:
 "The third AMD confidential computing feature called Secure Nested
  Paging.

  Add to confidential guests the necessary memory integrity protection
  against malicious hypervisor-based attacks like data replay, memory
  remapping and others, thus achieving a stronger isolation from the
  hypervisor.

  At the core of the functionality is a new structure called a reverse
  map table (RMP) with which the guest has a say in which pages get
  assigned to it and gets notified when a page which it owns, gets
  accessed/modified under the covers so that the guest can take an
  appropriate action.

  In addition, add support for the whole machinery needed to launch a
  SNP guest, details of which is properly explained in each patch.

  And last but not least, the series refactors and improves parts of the
  previous SEV support so that the new code is accomodated properly and
  not just bolted on"

* tag 'x86_sev_for_v5.19_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (60 commits)
  x86/entry: Fixup objtool/ibt validation
  x86/sev: Mark the code returning to user space as syscall gap
  x86/sev: Annotate stack change in the #VC handler
  x86/sev: Remove duplicated assignment to variable info
  x86/sev: Fix address space sparse warning
  x86/sev: Get the AP jump table address from secrets page
  x86/sev: Add missing __init annotations to SEV init routines
  virt: sevguest: Rename the sevguest dir and files to sev-guest
  virt: sevguest: Change driver name to reflect generic SEV support
  x86/boot: Put globals that are accessed early into the .data section
  x86/boot: Add an efi.h header for the decompressor
  virt: sevguest: Fix bool function returning negative value
  virt: sevguest: Fix return value check in alloc_shared_pages()
  x86/sev-es: Replace open-coded hlt-loop with sev_es_terminate()
  virt: sevguest: Add documentation for SEV-SNP CPUID Enforcement
  virt: sevguest: Add support to get extended report
  virt: sevguest: Add support to derive key
  virt: Add SEV-SNP guest driver
  x86/sev: Register SEV-SNP guest request platform device
  x86/sev: Provide support for SNP guest request NAEs
  ...
2022-05-23 17:38:01 -07:00
Jane Chu
5898b43af9 mce: fix set_mce_nospec to always unmap the whole page
The set_memory_uc() approach doesn't work well in all cases.
As Dan pointed out when "The VMM unmapped the bad page from
guest physical space and passed the machine check to the guest."
"The guest gets virtual #MC on an access to that page. When
the guest tries to do set_memory_uc() and instructs cpa_flush()
to do clean caches that results in taking another fault / exception
perhaps because the VMM unmapped the page from the guest."

Since the driver has special knowledge to handle NP or UC,
mark the poisoned page with NP and let driver handle it when
it comes down to repair.

Please refer to discussions here for more details.
https://lore.kernel.org/all/CAPcyv4hrXPb1tASBZUg-GgdVs0OOFKXMXLiHmktg_kFi7YBMyQ@mail.gmail.com/

Now since poisoned page is marked as not-present, in order to
avoid writing to a not-present page and trigger kernel Oops,
also fix pmem_do_write().

Fixes: 284ce4011b ("x86/memory_failure: Introduce {set, clear}_mce_nospec()")
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Jane Chu <jane.chu@oracle.com>
Acked-by: Tony Luck <tony.luck@intel.com>
Link: https://lore.kernel.org/r/165272615484.103830.2563950688772226611.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2022-05-16 11:46:44 -07:00
Jane Chu
b3fdf9398a x86/mce: relocate set{clear}_mce_nospec() functions
Relocate the twin mce functions to arch/x86/mm/pat/set_memory.c
file where they belong.

While at it, fixup a function name in a comment.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Jane Chu <jane.chu@oracle.com>
Acked-by: Borislav Petkov <bp@suse.de>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
[sfr: gate {set,clear}_mce_nospec() by CONFIG_X86_64]
Link: https://lore.kernel.org/r/165272527328.90175.8336008202048685278.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2022-05-16 11:46:44 -07:00
Li kunyu
c8db8c2628 mm: functions may simplify the use of return values
p4d_clear_huge may be optimized for void return type and function usage. 
vunmap_p4d_range function saves a few steps here.

Link: https://lkml.kernel.org/r/20220507150630.90399-1-kunyu@nfschina.com
Signed-off-by: Li kunyu <kunyu@nfschina.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-13 07:20:18 -07:00
Nadav Amit
4f83145721 mm: avoid unnecessary flush on change_huge_pmd()
Calls to change_protection_range() on THP can trigger, at least on x86,
two TLB flushes for one page: one immediately, when pmdp_invalidate() is
called by change_huge_pmd(), and then another one later (that can be
batched) when change_protection_range() finishes.

The first TLB flush is only necessary to prevent the dirty bit (and with a
lesser importance the access bit) from changing while the PTE is modified.
However, this is not necessary as the x86 CPUs set the dirty-bit
atomically with an additional check that the PTE is (still) present.  One
caveat is Intel's Knights Landing that has a bug and does not do so.

Leverage this behavior to eliminate the unnecessary TLB flush in
change_huge_pmd().  Introduce a new arch specific pmdp_invalidate_ad()
that only invalidates the access and dirty bit from further changes.

Link: https://lkml.kernel.org/r/20220401180821.1986781-4-namit@vmware.com
Signed-off-by: Nadav Amit <namit@vmware.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will@kernel.org>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Nick Piggin <npiggin@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-13 07:20:05 -07:00
Adrian-Ken Rueegsegger
280abe14b6 x86/mm: Fix marking of unused sub-pmd ranges
The unused part precedes the new range spanned by the start, end parameters
of vmemmap_use_new_sub_pmd(). This means it actually goes from
ALIGN_DOWN(start, PMD_SIZE) up to start.

Use the correct address when applying the mark using memset.

Fixes: 8d400913c2 ("x86/vmemmap: handle unpopulated sub-pmd ranges")
Signed-off-by: Adrian-Ken Rueegsegger <ken@codelabs.ch>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20220509090637.24152-2-ken@codelabs.ch
2022-05-13 12:41:21 +02:00
Kees Cook
595b893e20 randstruct: Reorganize Kconfigs and attribute macros
In preparation for Clang supporting randstruct, reorganize the Kconfigs,
move the attribute macros, and generalize the feature to be named
CONFIG_RANDSTRUCT for on/off, CONFIG_RANDSTRUCT_FULL for the full
randomization mode, and CONFIG_RANDSTRUCT_PERFORMANCE for the cache-line
sized mode.

Cc: linux-hardening@vger.kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20220503205503.3054173-4-keescook@chromium.org
2022-05-08 01:33:06 -07:00
Randy Dunlap
12441ccdf5 x86: Fix return value of __setup handlers
__setup() handlers should return 1 to obsolete_checksetup() in
init/main.c to indicate that the boot option has been handled. A return
of 0 causes the boot option/value to be listed as an Unknown kernel
parameter and added to init's (limited) argument (no '=') or environment
(with '=') strings. So return 1 from these x86 __setup handlers.

Examples:

  Unknown kernel command line parameters "apicpmtimer
    BOOT_IMAGE=/boot/bzImage-517rc8 vdso=1 ring3mwait=disable", will be
    passed to user space.

  Run /sbin/init as init process
   with arguments:
     /sbin/init
     apicpmtimer
   with environment:
     HOME=/
     TERM=linux
     BOOT_IMAGE=/boot/bzImage-517rc8
     vdso=1
     ring3mwait=disable

Fixes: 2aae950b21 ("x86_64: Add vDSO for x86-64 with gettimeofday/clock_gettime/getcpu")
Fixes: 77b52b4c5c ("x86: add "debugpat" boot option")
Fixes: e16fd002af ("x86/cpufeature: Enable RING3MWAIT for Knights Landing")
Fixes: b8ce335906 ("x86_64: convert to clock events")
Reported-by: Igor Zhbanov <i.zhbanov@omprussia.ru>
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/64644a2f-4a20-bab3-1e15-3b2cdd0defe3@omprussia.ru
Link: https://lore.kernel.org/r/20220314012725.26661-1-rdunlap@infradead.org
2022-05-04 16:47:57 +02:00
Sean Christopherson
643d95aac5 Revert "x86/mm: Introduce lookup_address_in_mm()"
Drop lookup_address_in_mm() now that KVM is providing it's own variant
of lookup_address_in_pgd() that is safe for use with user addresses, e.g.
guards against page tables being torn down.  A variant that provides a
non-init mm is inherently dangerous and flawed, as the only reason to use
an mm other than init_mm is to walk a userspace mapping, and
lookup_address_in_pgd() does not play nice with userspace mappings, e.g.
doesn't disable IRQs to block TLB shootdowns and doesn't use READ_ONCE()
to ensure an upper level entry isn't converted to a huge page between
checking the PAGE_SIZE bit and grabbing the address of the next level
down.

This reverts commit 13c72c060f.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <YmwIi3bXr/1yhYV/@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-29 12:40:41 -04:00
Muchun Song
47010c040d mm: hugetlb_vmemmap: cleanup CONFIG_HUGETLB_PAGE_FREE_VMEMMAP*
The word of "free" is not expressive enough to express the feature of
optimizing vmemmap pages associated with each HugeTLB, rename this keywork
to "optimize".  In this patch , cheanup configs to make code more
expressive.

Link: https://lkml.kernel.org/r/20220404074652.68024-4-songmuchun@bytedance.com
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-04-28 23:16:15 -07:00
Christoph Hellwig
e10cd4b009 x86/mm: enable ARCH_HAS_VM_GET_PAGE_PROT
This defines and exports a platform specific custom vm_get_page_prot() via
subscribing ARCH_HAS_VM_GET_PAGE_PROT.  This also unsubscribes from config
ARCH_HAS_FILTER_PGPROT, after dropping off arch_filter_pgprot() and
arch_vm_get_page_prot().

Link: https://lkml.kernel.org/r/20220414062125.609297-6-anshuman.khandual@arm.com
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: David S. Miller <davem@davemloft.net>
Cc: Khalid Aziz <khalid.aziz@oracle.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-04-28 23:16:13 -07:00
Tom Lendacky
5196401556 x86/mm: Fix spacing within memory encryption features message
The spacing is off in the memory encryption features message on AMD
platforms that support memory encryption, e.g.:

  "Memory Encryption Features active:AMD  SEV SEV-ES"

There is no space before "AMD" and two spaces after it. Fix this so that
the message is spaced properly:

  "Memory Encryption Features active: AMD SEV SEV-ES"

Fixes: 968b493173 ("x86/mm: Make DMA memory shared for TD guest")
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Link: https://lkml.kernel.org/r/02401f3024b18e90bc2508147e22e729436cb6d9.1650298573.git.thomas.lendacky@amd.com
2022-04-19 08:04:17 -07:00
Christoph Hellwig
a3e2309267 x86: centralize setting SWIOTLB_FORCE when guest memory encryption is enabled
Move enabling SWIOTLB_FORCE for guest memory encryption into common code.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Tested-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
2022-04-18 07:21:10 +02:00
Yury Norov
c2a911d302 x86/mm: Replace nodes_weight() with nodes_empty() where appropriate
Various mm code calls nodes_weight() to check if any bit of a given
nodemask is set.

This can be done more efficiently with nodes_empty() because nodes_empty()
stops traversing the nodemask as soon as it finds first set bit, while
nodes_weight() counts all bits unconditionally.

Signed-off-by: Yury Norov <yury.norov@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20220210224933.379149-26-yury.norov@gmail.com
2022-04-10 22:35:38 +02:00
Yury Norov
3a5ff1f6dd x86: Replace cpumask_weight() with cpumask_empty() where appropriate
In some cases, x86 code calls cpumask_weight() to check if any bit of a
given cpumask is set.

This can be done more efficiently with cpumask_empty() because
cpumask_empty() stops traversing the cpumask as soon as it finds first set
bit, while cpumask_weight() counts all bits unconditionally.

Signed-off-by: Yury Norov <yury.norov@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Steve Wahl <steve.wahl@hpe.com>
Link: https://lore.kernel.org/r/20220210224933.379149-17-yury.norov@gmail.com
2022-04-10 22:35:38 +02:00
Kirill A. Shutemov
968b493173 x86/mm: Make DMA memory shared for TD guest
Intel TDX doesn't allow VMM to directly access guest private memory.
Any memory that is required for communication with the VMM must be
shared explicitly. The same rule applies for any DMA to and from the
TDX guest. All DMA pages have to be marked as shared pages. A generic way
to achieve this without any changes to device drivers is to use the
SWIOTLB framework.

The previous patch ("Add support for TDX shared memory") gave TDX guests
the _ability_ to make some pages shared, but did not make any pages
shared. This actually marks SWIOTLB buffers *as* shared.

Start returning true for cc_platform_has(CC_ATTR_GUEST_MEM_ENCRYPT) in
TDX guests.  This has several implications:

 - Allows the existing mem_encrypt_init() to be used for TDX which
   sets SWIOTLB buffers shared (aka. "decrypted").
 - Ensures that all DMA is routed via the SWIOTLB mechanism (see
   pci_swiotlb_detect())

Stop selecting DYNAMIC_PHYSICAL_MASK directly. It will get set
indirectly by selecting X86_MEM_ENCRYPT.

mem_encrypt_init() is currently under an AMD-specific #ifdef. Move it to
a generic area of the header.

Co-developed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lkml.kernel.org/r/20220405232939.73860-28-kirill.shutemov@linux.intel.com
2022-04-07 08:27:53 -07:00
Kirill A. Shutemov
9aa6ea6985 x86/tdx: Make pages shared in ioremap()
In TDX guests, guest memory is protected from host access. If a guest
performs I/O, it needs to explicitly share the I/O memory with the host.

Make all ioremap()ed pages that are not backed by normal memory
(IORES_DESC_NONE or IORES_DESC_RESERVED) mapped as shared.

The permissions in PAGE_KERNEL_IO already work for "decrypted" memory
on AMD SEV/SME systems.  That means that they have no need to make a
pgprot_decrypted() call.

TDX guests, on the other hand, _need_ change to PAGE_KERNEL_IO for
"decrypted" mappings.  Add a pgprot_decrypted() for TDX.

Co-developed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20220405232939.73860-26-kirill.shutemov@linux.intel.com
2022-04-07 08:27:53 -07:00
Michael Roth
b190a043c4 x86/sev: Add SEV-SNP feature detection/setup
Initial/preliminary detection of SEV-SNP is done via the Confidential
Computing blob. Check for it prior to the normal SEV/SME feature
initialization, and add some sanity checks to confirm it agrees with
SEV-SNP CPUID/MSR bits.

Signed-off-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20220307213356.2797205-39-brijesh.singh@amd.com
2022-04-07 16:47:11 +02:00
Brijesh Singh
dc3f3d2474 x86/mm: Validate memory when changing the C-bit
Add the needed functionality to change pages state from shared
to private and vice-versa using the Page State Change VMGEXIT as
documented in the GHCB spec.

Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20220307213356.2797205-22-brijesh.singh@amd.com
2022-04-06 13:24:53 +02:00
Brijesh Singh
5e5ccff60a x86/sev: Add helper for validating pages in early enc attribute changes
early_set_memory_{encrypted,decrypted}() are used for changing the page
state from decrypted (shared) to encrypted (private) and vice versa.

When SEV-SNP is active, the page state transition needs to go through
additional steps.

If the page is transitioned from shared to private, then perform the
following after the encryption attribute is set in the page table:

1. Issue the page state change VMGEXIT to add the page as a private
   in the RMP table.
2. Validate the page after its successfully added in the RMP table.

To maintain the security guarantees, if the page is transitioned from
private to shared, then perform the following before clearing the
encryption attribute from the page table.

1. Invalidate the page.
2. Issue the page state change VMGEXIT to make the page shared in the
   RMP table.

early_set_memory_{encrypted,decrypted}() can be called before the GHCB
is setup so use the SNP page state MSR protocol VMGEXIT defined in the
GHCB specification to request the page state change in the RMP table.

While at it, add a helper snp_prep_memory() which will be used in
probe_roms(), in a later patch.

  [ bp: Massage commit message. ]

Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Venu Busireddy <venu.busireddy@oracle.com>
Link: https://lore.kernel.org/r/20220307213356.2797205-19-brijesh.singh@amd.com
2022-04-06 13:22:54 +02:00
Brijesh Singh
f742b90e61 x86/mm: Extend cc_attr to include AMD SEV-SNP
The CC_ATTR_GUEST_SEV_SNP can be used by the guest to query whether the
SNP (Secure Nested Paging) feature is active.

Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20220307213356.2797205-10-brijesh.singh@amd.com
2022-04-06 13:02:34 +02:00
Lukas Bulwahn
944fad4583 x86/fault: Cast an argument to the proper address space in prefetch()
Commit in Fixes uses accessors based on the access mode, i.e., it
distinguishes its access if instr carries a user address or a kernel
address.

Since that commit, sparse complains about passing an argument without
__user annotation to get_user(), which expects a pointer of the __user
address space:

  arch/x86/mm/fault.c:152:29: warning: incorrect type in argument 1 (different address spaces)
  arch/x86/mm/fault.c:152:29:    expected void const volatile [noderef] __user *ptr
  arch/x86/mm/fault.c:152:29:    got unsigned char *[assigned] instr

Cast instr to __user when accessing user memory.

No functional change. No change in the generated object code.

  [ bp: Simplify commit message. ]

Fixes: 35f1c89b0c ("x86/fault: Fix AMD erratum #91 errata fixup for user code")
Signed-off-by: Lukas Bulwahn <lukas.bulwahn@gmail.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20220201144055.5670-1-lukas.bulwahn@gmail.com
2022-04-04 20:08:26 +02:00
Dave Hansen
d39268ad24 x86/mm/tlb: Revert retpoline avoidance approach
0day reported a regression on a microbenchmark which is intended to
stress the TLB flushing path:

	https://lore.kernel.org/all/20220317090415.GE735@xsang-OptiPlex-9020/

It pointed at a commit from Nadav which intended to remove retpoline
overhead in the TLB flushing path by taking the 'cond'-ition in
on_each_cpu_cond_mask(), pre-calculating it, and incorporating it into
'cpumask'.  That allowed the code to use a bunch of earlier direct
calls instead of later indirect calls that need a retpoline.

But, in practice, threads can go idle (and into lazy TLB mode where
they don't need to flush their TLB) between the early and late calls.
It works in this direction and not in the other because TLB-flushing
threads tend to hold mmap_lock for write.  Contention on that lock
causes threads to _go_ idle right in this early/late window.

There was not any performance data in the original commit specific
to the retpoline overhead.  I did a few tests on a system with
retpolines:

	https://lore.kernel.org/all/dd8be93c-ded6-b962-50d4-96b1c3afb2b7@intel.com/

which showed a possible small win.  But, that small win pales in
comparison with the bigger loss induced on non-retpoline systems.

Revert the patch that removed the retpolines.  This was not a
clean revert, but it was self-contained enough not to be too painful.

Fixes: 6035152d8e ("x86/mm/tlb: Open-code on_each_cpu_cond_mask() for tlb_is_not_lazy()")
Reported-by: kernel test robot <oliver.sang@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Nadav Amit <namit@vmware.com>
Cc: <stable@vger.kernel.org>
Link: https://lkml.kernel.org/r/164874672286.389.7021457716635788197.tip-bot2@tip-bot2
2022-04-04 19:41:36 +02:00
Borislav Petkov
76ea0025a2 x86/cpu: Remove "noexec"
It doesn't make any sense to disable non-executable mappings -
security-wise or else.

So rip out that switch and move the remaining code into setup.c and
delete setup_nx.c

Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20220127115626.14179-6-bp@alien8.de
2022-04-04 10:17:03 +02:00
Linus Torvalds
1930a6e739 ptrace: Cleanups for v5.18
This set of changes removes tracehook.h, moves modification of all of
 the ptrace fields inside of siglock to remove races, adds a missing
 permission check to ptrace.c
 
 The removal of tracehook.h is quite significant as it has been a major
 source of confusion in recent years.  Much of that confusion was
 around task_work and TIF_NOTIFY_SIGNAL (which I have now decoupled
 making the semantics clearer).
 
 For people who don't know tracehook.h is a vestiage of an attempt to
 implement uprobes like functionality that was never fully merged, and
 was later superseeded by uprobes when uprobes was merged.  For many
 years now we have been removing what tracehook functionaly a little
 bit at a time.  To the point where now anything left in tracehook.h is
 some weird strange thing that is difficult to understand.
 
 Eric W. Biederman (15):
       ptrace: Move ptrace_report_syscall into ptrace.h
       ptrace/arm: Rename tracehook_report_syscall report_syscall
       ptrace: Create ptrace_report_syscall_{entry,exit} in ptrace.h
       ptrace: Remove arch_syscall_{enter,exit}_tracehook
       ptrace: Remove tracehook_signal_handler
       task_work: Remove unnecessary include from posix_timers.h
       task_work: Introduce task_work_pending
       task_work: Call tracehook_notify_signal from get_signal on all architectures
       task_work: Decouple TIF_NOTIFY_SIGNAL and task_work
       signal: Move set_notify_signal and clear_notify_signal into sched/signal.h
       resume_user_mode: Remove #ifdef TIF_NOTIFY_RESUME in set_notify_resume
       resume_user_mode: Move to resume_user_mode.h
       tracehook: Remove tracehook.h
       ptrace: Move setting/clearing ptrace_message into ptrace_stop
       ptrace: Return the signal to continue with from ptrace_stop
 
 Jann Horn (1):
       ptrace: Check PTRACE_O_SUSPEND_SECCOMP permission on PTRACE_SEIZE
 
 Yang Li (1):
       ptrace: Remove duplicated include in ptrace.c
 
  MAINTAINERS                          |   1 -
  arch/Kconfig                         |   5 +-
  arch/alpha/kernel/ptrace.c           |   5 +-
  arch/alpha/kernel/signal.c           |   4 +-
  arch/arc/kernel/ptrace.c             |   5 +-
  arch/arc/kernel/signal.c             |   4 +-
  arch/arm/kernel/ptrace.c             |  12 +-
  arch/arm/kernel/signal.c             |   4 +-
  arch/arm64/kernel/ptrace.c           |  14 +--
  arch/arm64/kernel/signal.c           |   4 +-
  arch/csky/kernel/ptrace.c            |   5 +-
  arch/csky/kernel/signal.c            |   4 +-
  arch/h8300/kernel/ptrace.c           |   5 +-
  arch/h8300/kernel/signal.c           |   4 +-
  arch/hexagon/kernel/process.c        |   4 +-
  arch/hexagon/kernel/signal.c         |   1 -
  arch/hexagon/kernel/traps.c          |   6 +-
  arch/ia64/kernel/process.c           |   4 +-
  arch/ia64/kernel/ptrace.c            |   6 +-
  arch/ia64/kernel/signal.c            |   1 -
  arch/m68k/kernel/ptrace.c            |   5 +-
  arch/m68k/kernel/signal.c            |   4 +-
  arch/microblaze/kernel/ptrace.c      |   5 +-
  arch/microblaze/kernel/signal.c      |   4 +-
  arch/mips/kernel/ptrace.c            |   5 +-
  arch/mips/kernel/signal.c            |   4 +-
  arch/nds32/include/asm/syscall.h     |   2 +-
  arch/nds32/kernel/ptrace.c           |   5 +-
  arch/nds32/kernel/signal.c           |   4 +-
  arch/nios2/kernel/ptrace.c           |   5 +-
  arch/nios2/kernel/signal.c           |   4 +-
  arch/openrisc/kernel/ptrace.c        |   5 +-
  arch/openrisc/kernel/signal.c        |   4 +-
  arch/parisc/kernel/ptrace.c          |   7 +-
  arch/parisc/kernel/signal.c          |   4 +-
  arch/powerpc/kernel/ptrace/ptrace.c  |   8 +-
  arch/powerpc/kernel/signal.c         |   4 +-
  arch/riscv/kernel/ptrace.c           |   5 +-
  arch/riscv/kernel/signal.c           |   4 +-
  arch/s390/include/asm/entry-common.h |   1 -
  arch/s390/kernel/ptrace.c            |   1 -
  arch/s390/kernel/signal.c            |   5 +-
  arch/sh/kernel/ptrace_32.c           |   5 +-
  arch/sh/kernel/signal_32.c           |   4 +-
  arch/sparc/kernel/ptrace_32.c        |   5 +-
  arch/sparc/kernel/ptrace_64.c        |   5 +-
  arch/sparc/kernel/signal32.c         |   1 -
  arch/sparc/kernel/signal_32.c        |   4 +-
  arch/sparc/kernel/signal_64.c        |   4 +-
  arch/um/kernel/process.c             |   4 +-
  arch/um/kernel/ptrace.c              |   5 +-
  arch/x86/kernel/ptrace.c             |   1 -
  arch/x86/kernel/signal.c             |   5 +-
  arch/x86/mm/tlb.c                    |   1 +
  arch/xtensa/kernel/ptrace.c          |   5 +-
  arch/xtensa/kernel/signal.c          |   4 +-
  block/blk-cgroup.c                   |   2 +-
  fs/coredump.c                        |   1 -
  fs/exec.c                            |   1 -
  fs/io-wq.c                           |   6 +-
  fs/io_uring.c                        |  11 +-
  fs/proc/array.c                      |   1 -
  fs/proc/base.c                       |   1 -
  include/asm-generic/syscall.h        |   2 +-
  include/linux/entry-common.h         |  47 +-------
  include/linux/entry-kvm.h            |   2 +-
  include/linux/posix-timers.h         |   1 -
  include/linux/ptrace.h               |  81 ++++++++++++-
  include/linux/resume_user_mode.h     |  64 ++++++++++
  include/linux/sched/signal.h         |  17 +++
  include/linux/task_work.h            |   5 +
  include/linux/tracehook.h            | 226 -----------------------------------
  include/uapi/linux/ptrace.h          |   2 +-
  kernel/entry/common.c                |  19 +--
  kernel/entry/kvm.c                   |   9 +-
  kernel/exit.c                        |   3 +-
  kernel/livepatch/transition.c        |   1 -
  kernel/ptrace.c                      |  47 +++++---
  kernel/seccomp.c                     |   1 -
  kernel/signal.c                      |  62 +++++-----
  kernel/task_work.c                   |   4 +-
  kernel/time/posix-cpu-timers.c       |   1 +
  mm/memcontrol.c                      |   2 +-
  security/apparmor/domain.c           |   1 -
  security/selinux/hooks.c             |   1 -
  85 files changed, 372 insertions(+), 495 deletions(-)
 
 Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEgjlraLDcwBA2B+6cC/v6Eiajj0AFAmJCQkoACgkQC/v6Eiaj
 j0DCWQ/5AZVFU+hX32obUNCLackHTwgcCtSOs3JNBmNA/zL/htPiYYG0ghkvtlDR
 Dw5J5DnxC6P7PVAdAqrpvx2uX2FebHYU0bRlyLx8LYUEP5dhyNicxX9jA882Z+vw
 Ud0Ue9EojwGWS76dC9YoKUj3slThMATbhA2r4GVEoof8fSNJaBxQIqath44t0FwU
 DinWa+tIOvZANGBZr6CUUINNIgqBIZCH/R4h6ArBhMlJpuQ5Ufk2kAaiWFwZCkX4
 0LuuAwbKsCKkF8eap5I2KrIg/7zZVgxAg9O3cHOzzm8OPbKzRnNnQClcDe8perqp
 S6e/f3MgpE+eavd1EiLxevZ660cJChnmikXVVh8ZYYoefaMKGqBaBSsB38bNcLjY
 3+f2dB+TNBFRnZs1aCujK3tWBT9QyjZDKtCBfzxDNWBpXGLhHH6j6lA5Lj+Cef5K
 /HNHFb+FuqedlFZh5m1Y+piFQ70hTgCa2u8b+FSOubI2hW9Zd+WzINV0ANaZ2LvZ
 4YGtcyDNk1q1+c87lxP9xMRl/xi6rNg+B9T2MCo4IUnHgpSVP6VEB3osgUmrrrN0
 eQlUI154G/AaDlqXLgmn1xhRmlPGfmenkxpok1AuzxvNJsfLKnpEwQSc13g3oiZr
 disZQxNY0kBO2Nv3G323Z6PLinhbiIIFez6cJzK5v0YJ2WtO3pY=
 =uEro
 -----END PGP SIGNATURE-----

Merge tag 'ptrace-cleanups-for-v5.18' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace

Pull ptrace cleanups from Eric Biederman:
 "This set of changes removes tracehook.h, moves modification of all of
  the ptrace fields inside of siglock to remove races, adds a missing
  permission check to ptrace.c

  The removal of tracehook.h is quite significant as it has been a major
  source of confusion in recent years. Much of that confusion was around
  task_work and TIF_NOTIFY_SIGNAL (which I have now decoupled making the
  semantics clearer).

  For people who don't know tracehook.h is a vestiage of an attempt to
  implement uprobes like functionality that was never fully merged, and
  was later superseeded by uprobes when uprobes was merged. For many
  years now we have been removing what tracehook functionaly a little
  bit at a time. To the point where anything left in tracehook.h was
  some weird strange thing that was difficult to understand"

* tag 'ptrace-cleanups-for-v5.18' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
  ptrace: Remove duplicated include in ptrace.c
  ptrace: Check PTRACE_O_SUSPEND_SECCOMP permission on PTRACE_SEIZE
  ptrace: Return the signal to continue with from ptrace_stop
  ptrace: Move setting/clearing ptrace_message into ptrace_stop
  tracehook: Remove tracehook.h
  resume_user_mode: Move to resume_user_mode.h
  resume_user_mode: Remove #ifdef TIF_NOTIFY_RESUME in set_notify_resume
  signal: Move set_notify_signal and clear_notify_signal into sched/signal.h
  task_work: Decouple TIF_NOTIFY_SIGNAL and task_work
  task_work: Call tracehook_notify_signal from get_signal on all architectures
  task_work: Introduce task_work_pending
  task_work: Remove unnecessary include from posix_timers.h
  ptrace: Remove tracehook_signal_handler
  ptrace: Remove arch_syscall_{enter,exit}_tracehook
  ptrace: Create ptrace_report_syscall_{entry,exit} in ptrace.h
  ptrace/arm: Rename tracehook_report_syscall report_syscall
  ptrace: Move ptrace_report_syscall into ptrace.h
2022-03-28 17:29:53 -07:00
Anshuman Khandual
4cc79b3303 mm/migration: add trace events for base page and HugeTLB migrations
This adds two trace events for base page and HugeTLB page migrations.
These events, closely follow the implementation details like setting and
removing of PTE migration entries, which are essential operations for
migration.  The new CREATE_TRACE_POINTS in <mm/rmap.c> covers both
<events/migration.h> and <events/tlb.h> based trace events.  Hence drop
redundant CREATE_TRACE_POINTS from other places which could have otherwise
conflicted during build.

Link: https://lkml.kernel.org/r/1643368182-9588-3-git-send-email-anshuman.khandual@arm.com
Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
Reported-by: kernel test robot <lkp@intel.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-03-24 19:06:45 -07:00
Linus Torvalds
3bf03b9a08 Merge branch 'akpm' (patches from Andrew)
Merge updates from Andrew Morton:

 - A few misc subsystems: kthread, scripts, ntfs, ocfs2, block, and vfs

 - Most the MM patches which precede the patches in Willy's tree: kasan,
   pagecache, gup, swap, shmem, memcg, selftests, pagemap, mremap,
   sparsemem, vmalloc, pagealloc, memory-failure, mlock, hugetlb,
   userfaultfd, vmscan, compaction, mempolicy, oom-kill, migration, thp,
   cma, autonuma, psi, ksm, page-poison, madvise, memory-hotplug, rmap,
   zswap, uaccess, ioremap, highmem, cleanups, kfence, hmm, and damon.

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (227 commits)
  mm/damon/sysfs: remove repeat container_of() in damon_sysfs_kdamond_release()
  Docs/ABI/testing: add DAMON sysfs interface ABI document
  Docs/admin-guide/mm/damon/usage: document DAMON sysfs interface
  selftests/damon: add a test for DAMON sysfs interface
  mm/damon/sysfs: support DAMOS stats
  mm/damon/sysfs: support DAMOS watermarks
  mm/damon/sysfs: support schemes prioritization
  mm/damon/sysfs: support DAMOS quotas
  mm/damon/sysfs: support DAMON-based Operation Schemes
  mm/damon/sysfs: support the physical address space monitoring
  mm/damon/sysfs: link DAMON for virtual address spaces monitoring
  mm/damon: implement a minimal stub for sysfs-based DAMON interface
  mm/damon/core: add number of each enum type values
  mm/damon/core: allow non-exclusive DAMON start/stop
  Docs/damon: update outdated term 'regions update interval'
  Docs/vm/damon/design: update DAMON-Idle Page Tracking interference handling
  Docs/vm/damon: call low level monitoring primitives the operations
  mm/damon: remove unnecessary CONFIG_DAMON option
  mm/damon/paddr,vaddr: remove damon_{p,v}a_{target_valid,set_operations}()
  mm/damon/dbgfs-test: fix is_target_id() change
  ...
2022-03-22 16:11:53 -07:00
Oscar Salvador
1ca75fa7f1 arch/x86/mm/numa: Do not initialize nodes twice
On x86, prior to ("mm: handle uninitialized numa nodes gracecully"), NUMA
nodes could be allocated at three different places.

 - numa_register_memblks
 - init_cpu_to_node
 - init_gi_nodes

All these calls happen at setup_arch, and have the following order:

setup_arch
  ...
  x86_numa_init
   numa_init
    numa_register_memblks
  ...
  init_cpu_to_node
   init_memory_less_node
    alloc_node_data
    free_area_init_memoryless_node
  init_gi_nodes
   init_memory_less_node
    alloc_node_data
    free_area_init_memoryless_node

numa_register_memblks() is only interested in those nodes which have
memory, so it skips over any memoryless node it founds.  Later on, when
we have read ACPI's SRAT table, we call init_cpu_to_node() and
init_gi_nodes(), which initialize any memoryless node we might have that
have either CPU or Initiator affinity, meaning we allocate pg_data_t
struct for them and we mark them as ONLINE.

So far so good, but the thing is that after ("mm: handle uninitialized
numa nodes gracefully"), we allocate all possible NUMA nodes in
free_area_init(), meaning we have a picture like the following:

setup_arch
  x86_numa_init
   numa_init
    numa_register_memblks  <-- allocate non-memoryless node
  x86_init.paging.pagetable_init
   ...
    free_area_init
     free_area_init_memoryless <-- allocate memoryless node
  init_cpu_to_node
   alloc_node_data             <-- allocate memoryless node with CPU
   free_area_init_memoryless_node
  init_gi_nodes
   alloc_node_data             <-- allocate memoryless node with Initiator
   free_area_init_memoryless_node

free_area_init() already allocates all possible NUMA nodes, but
init_cpu_to_node() and init_gi_nodes() are clueless about that, so they
go ahead and allocate a new pg_data_t struct without checking anything,
meaning we end up allocating twice.

It should be mad clear that this only happens in the case where
memoryless NUMA node happens to have a CPU/Initiator affinity.

So get rid of init_memory_less_node() and just set the node online.

Note that setting the node online is needed, otherwise we choke down the
chain when bringup_nonboot_cpus() ends up calling
__try_online_node()->register_one_node()->...  and we blow up in
bus_add_device().  As can be seen here:

  BUG: kernel NULL pointer dereference, address: 0000000000000060
  #PF: supervisor read access in kernel mode
  #PF: error_code(0x0000) - not-present page
  PGD 0 P4D 0
  Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC PTI
  CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.17.0-rc4-1-default+ #45
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.0.0-prebuilt.qemu-project.org 04/4
  RIP: 0010:bus_add_device+0x5a/0x140
  Code: 8b 74 24 20 48 89 df e8 84 96 ff ff 85 c0 89 c5 75 38 48 8b 53 50 48 85 d2 0f 84 bb 00 004
  RSP: 0000:ffffc9000022bd10 EFLAGS: 00010246
  RAX: 0000000000000000 RBX: ffff888100987400 RCX: ffff8881003e4e19
  RDX: ffff8881009a5e00 RSI: ffff888100987400 RDI: ffff888100987400
  RBP: 0000000000000000 R08: ffff8881003e4e18 R09: ffff8881003e4c98
  R10: 0000000000000000 R11: ffff888100402bc0 R12: ffffffff822ceba0
  R13: 0000000000000000 R14: ffff888100987400 R15: 0000000000000000
  FS:  0000000000000000(0000) GS:ffff88853fc00000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 0000000000000060 CR3: 000000000200a001 CR4: 00000000001706b0
  Call Trace:
   device_add+0x4c0/0x910
   __register_one_node+0x97/0x2d0
   __try_online_node+0x85/0xc0
   try_online_node+0x25/0x40
   cpu_up+0x4f/0x100
   bringup_nonboot_cpus+0x4f/0x60
   smp_init+0x26/0x79
   kernel_init_freeable+0x130/0x2f1
   kernel_init+0x17/0x150
   ret_from_fork+0x22/0x30

The reason is simple, by the time bringup_nonboot_cpus() gets called, we
did not register the node_subsys bus yet, so we crash when
bus_add_device() tries to dereference bus()->p.

The following shows the order of the calls:

kernel_init_freeable
 smp_init
  bringup_nonboot_cpus
   ...
     bus_add_device()      <- we did not register node_subsys yet
 do_basic_setup
  do_initcalls
   postcore_initcall(register_node_type);
    register_node_type
     subsys_system_register
      subsys_register
       bus_register         <- register node_subsys bus

Why setting the node online saves us then? Well, simply because
__try_online_node() backs off when the node is online, meaning we do not
end up calling register_one_node() in the first place.

This is subtle, broken and deserves a deep analysis and thought about
how to put this into shape, but for now let us have this easy fix for
the leaking memory issue.

[osalvador@suse.de: add comments]
  Link: https://lkml.kernel.org/r/20220221142649.3457-1-osalvador@suse.de

Link: https://lkml.kernel.org/r/20220218224302.5282-2-osalvador@suse.de
Fixes: da4490c958ad ("mm: handle uninitialized numa nodes gracefully")
Signed-off-by: Oscar Salvador <osalvador@suse.de>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Rafael Aquini <raquini@redhat.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: Dennis Zhou <dennis@kernel.org>
Cc: Alexey Makhalov <amakhalov@vmware.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-03-22 15:57:06 -07:00
Linus Torvalds
95ab0e8768 Changes for this cycle were:
- Fix address filtering for Intel/PT,ARM/CoreSight
  - Enable Intel/PEBS format 5
  - Allow more fixed-function counters for x86
  - Intel/PT: Enable not recording Taken-Not-Taken packets
  - Add a few branch-types
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmI4WdIRHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1jdTA/7BADTYzFCbdwPzHt2mR8osv7k+pDvYxs9
 wxNjyi1X7N8cPkhqgIg9CfdhdyDOqo7+J4fG17f2qbwjNK7b2Fb1/U6ZoZaf+f8F
 W0e2LX5KZTXUhkA+TEjrXvYD9FmJaCPM/l2RQg8U7okBs2kb0H6QT2Yn21wd1roC
 WwI5KFiWSVS1IzpVLaXjDh+FJfJHd75ReMqJeus+QoVQ9NHeuI+t4DglSB1IBi54
 d/zeVXE/Y4dFTQOrU06S2HxcOEptvXZsPmVLvKab/veeGGyWiGPxQpvu6bXm6u3x
 0sV+dn67zut2m2pQlUZUucgGTSYIZTpOe+rNukTB9hJ4XeN4/1ohOOCrOuYM+63P
 lGFbN1v+LD7Wc6C2eEhw8G5GEL0qbwzFNQ06O3EOFi7C7GKn7WS/ET6XuuMOERFk
 uxEPb4pFtbBlJ0SriCprFJSd5NL3PORZlLIhv4hGH5hilLR1TFeKDuwZaM4noQxU
 dL3rKGLi9H+P46Eni9H28+0gDISbv1xL+WivHOFQNmhBqAZO52ZcF3J+dgBaR1B5
 pBxVTycFpZMjxSZnqTE0gMsFaLIpVGc+75Chns1rajR0mEtRtJUQUbYz4tK4zb0E
 dZR1p+VF6+DYmSRhiqeaTi9uz9oE8kMa8o/EcbFIg/9BgEnUwJXU20bjnar30xQ7
 9OIn7r9hjHI=
 =XPuo
 -----END PGP SIGNATURE-----

Merge tag 'perf-core-2022-03-21' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 perf event updates from Ingo Molnar:

 - Fix address filtering for Intel/PT,ARM/CoreSight

 - Enable Intel/PEBS format 5

 - Allow more fixed-function counters for x86

 - Intel/PT: Enable not recording Taken-Not-Taken packets

 - Add a few branch-types

* tag 'perf-core-2022-03-21' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf/x86/intel/uncore: Fix the build on !CONFIG_PHYS_ADDR_T_64BIT
  perf: Add irq and exception return branch types
  perf/x86/intel/uncore: Make uncore_discovery clean for 64 bit addresses
  perf/x86/intel/pt: Add a capability and config bit for disabling TNTs
  perf/x86/intel/pt: Add a capability and config bit for event tracing
  perf/x86/intel: Increase max number of the fixed counters
  KVM: x86: use the KVM side max supported fixed counter
  perf/x86/intel: Enable PEBS format 5
  perf/core: Allow kernel address filter when not filtering the kernel
  perf/x86/intel/pt: Fix address filter config for 32-bit kernel
  perf/core: Fix address filter parser for multiple filters
  x86: Share definition of __is_canonical_address()
  perf/x86/intel/pt: Relax address filter validation
2022-03-22 13:06:49 -07:00
Linus Torvalds
eaa54b1458 - Remove a misleading message and an unused function
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmI4l6UACgkQEsHwGGHe
 VUonRw/7BQUx2A8O6mMtQDGzDVJlKobKb/6t+VhmdrC7YHUbMjgl8+Q9EjZIGkE4
 EF3Wb/5gUmnVhWzB2y5veE1mX4wlcXs0HyGL5iSn7X4UYxpPLdCSTcAriyj6I/0F
 fId9aTcb9dbiXGnK4F+tBRVphoe2uLCJUAdbNoprOBWUsmIjKaMewl+sLg8m1+f5
 mtTY83koAtblSCtkUP/sLx/1RM+LO2uep11W20m44eRXkc8CJ3mOiMoLwCZGyyaM
 66y/6i2QqPCWEUE8VUklrEM+gI/vGn1yb6DJo465aauHFabteYaiTa0kAiZ+xyOn
 UlybAhL2nJlkCJ+mt2+FNujdvJu2z8MGNTuQPgI0CeYlRllGuvvNfkpP6PLhqE+c
 HUfNbmJgB163c3w8QOlppkmCImiup+wtm8r3w4h0brD68sqnRv1AkXp83hC4MBNP
 k1/S/3GCLOFjq6LHvZJycq8r1NpbNPKGNq81kjNKobfWZHX3fEclVGJGjiDkJQhC
 VA4hCtIUnpagpMHwPHZ9fdHROHWCDJjLaEY5L/qiGnrBJPfwVpbmRv86k8kE+hJu
 IgoqRF1DSWMlhg3lNKGnoodTvgrWJM/HZgp/exrY0/N83AMatcWmaAlwWkrGQpGY
 HdnSKzXSHIwLFlf7WiVCoUDpRU4zRZzUUFm3mMqgdAJV8mwufsA=
 =EF36
 -----END PGP SIGNATURE-----

Merge tag 'x86_cleanups_for_v5.18_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 cleanups from Borislav Petkov:

 - Remove a misleading message and an unused function

* tag 'x86_cleanups_for_v5.18_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/nmi: Remove the 'strange power saving mode' hint from unknown NMI handler
  x86/pat: Remove the unused set_pages_array_wt() function
2022-03-21 11:49:16 -07:00
Linus Torvalds
6b9bfb1365 - Add shared confidential computing code which will be used by both
vendors instead of proliferating home-grown solutions for technologies
 which are pretty similar
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmI4kiQACgkQEsHwGGHe
 VUojvA//QD5VxsqPq+RAQWAFiWGHCpFed2szc2Q5eAZj6CEmXcqBOdTqaoHpJpVl
 L1uvB6oLq8WTOea0V3xGu1kfiLRuq1fo0mqZeTxe3iZ3kUk/SU0wGfTLDECB58mI
 P5A+CZFiAk4XJ/kRqJWNxmd5kIDjhlCx4ysVbPl1vm/qfS6FEGb5HUr317kbOYwK
 zw5cEajnYu2KA6bI8nGuy30vmvn97gpy98vCiCzKrcBPggO8WHiJ+kqD72BhP5em
 z7mh4aFrAPVbIMqd/Xb5La3zvP7Vii4Tz9mSUsKy/Ige+ghFZQ18LPk2yANvmWeN
 hIFDqSsESR2go0tKvSrzPln8h93hKx/TPbiF9jVMISBZFdWCGQvzCrYDHqzHFQJ1
 zHw0lxdFQimfhs5YlEumZCqq2Dc7w3OGCVfP22+t7pNhnixPT3Dlie0Ya6z/aXV3
 VNcqckDDZLijQlf0iPhbw2fBs9ErTcB3OXHKmX78Zxb4hP4WJx8QK4lMPzFkPd9H
 bTEquYQWIPsjdRTlMl50nCpNHtAzo56H01G6ZPPx/5Y7Lt38UXJERfdqBhQjNF6F
 ILPMrOn/BHU9snlqSCh7SxhRiRdafThIJHsi5zQrDC4rPvlwi5kinIzGnPyOuDbO
 qwwnPOzx855/Zw0swKrQRXaxU7lwGKo529yKZWt7r8WB12tSOao=
 =zWVD
 -----END PGP SIGNATURE-----

Merge tag 'x86_cc_for_v5.18_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 confidential computing updates from Borislav Petkov:

 - Add shared confidential computing code which will be used by both
   vendors instead of proliferating home-grown solutions for
   technologies (SEV/SNP and TDX) which are pretty similar

* tag 'x86_cc_for_v5.18_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/mm/cpa: Generalize __set_memory_enc_pgtable()
  x86/coco: Add API to handle encryption mask
  x86/coco: Explicitly declare type of confidential computing platform
  x86/cc: Move arch/x86/{kernel/cc_platform.c => coco/core.c}
2022-03-21 11:38:53 -07:00
Eric W. Biederman
8ca07e17c9 task_work: Remove unnecessary include from posix_timers.h
Break a header file circular dependency by removing the unnecessary
include of task_work.h from posix_timers.h.

sched.h -> posix-timers.h
posix-timers.h -> task_work.h
task_work.h -> sched.h

Add missing includes of task_work.h to:
arch/x86/mm/tlb.c
kernel/time/posix-cpu-timers.c

Reviewed-by: Kees Cook <keescook@chromium.org>
Link: https://lkml.kernel.org/r/20220309162454.123006-6-ebiederm@xmission.com
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2022-03-10 13:38:01 -06:00
Ross Philipson
445c1470b6 x86/boot: Add setup_indirect support in early_memremap_is_setup_data()
The x86 boot documentation describes the setup_indirect structures and
how they are used. Only one of the two functions in ioremap.c that needed
to be modified to be aware of the introduction of setup_indirect
functionality was updated. Adds comparable support to the other function
where it was missing.

Fixes: b3c72fc9a7 ("x86/boot: Introduce setup_indirect")
Signed-off-by: Ross Philipson <ross.philipson@oracle.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Daniel Kiper <daniel.kiper@oracle.com>
Cc: <stable@vger.kernel.org>
Link: https://lore.kernel.org/r/1645668456-22036-3-git-send-email-ross.philipson@oracle.com
2022-03-09 12:49:46 +01:00
Ross Philipson
7228918b34 x86/boot: Fix memremap of setup_indirect structures
As documented, the setup_indirect structure is nested inside
the setup_data structures in the setup_data list. The code currently
accesses the fields inside the setup_indirect structure but only
the sizeof(struct setup_data) is being memremapped. No crash
occurred but this is just due to how the area is remapped under the
covers.

Properly memremap both the setup_data and setup_indirect structures
in these cases before accessing them.

Fixes: b3c72fc9a7 ("x86/boot: Introduce setup_indirect")
Signed-off-by: Ross Philipson <ross.philipson@oracle.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Daniel Kiper <daniel.kiper@oracle.com>
Cc: <stable@vger.kernel.org>
Link: https://lore.kernel.org/r/1645668456-22036-2-git-send-email-ross.philipson@oracle.com
2022-03-09 12:49:44 +01:00
Brijesh Singh
1e8c5971c2 x86/mm/cpa: Generalize __set_memory_enc_pgtable()
The kernel provides infrastructure to set or clear the encryption mask
from the pages for AMD SEV, but TDX requires few tweaks.

- TDX and SEV have different requirements to the cache and TLB
  flushing.

- TDX has own routine to notify VMM about page encryption status change.

Modify __set_memory_enc_pgtable() and make it flexible enough to cover
both AMD SEV and Intel TDX. The AMD-specific behavior is isolated in the
callbacks under x86_platform.guest. TDX will provide own version of said
callbacks.

  [ bp: Beat into submission. ]

Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Link: https://lore.kernel.org/r/20220223043528.2093214-1-brijesh.singh@amd.com
2022-02-23 19:14:29 +01:00
Kirill A. Shutemov
b577f542f9 x86/coco: Add API to handle encryption mask
AMD SME/SEV uses a bit in the page table entries to indicate that the
page is encrypted and not accessible to the VMM.

TDX uses a similar approach, but the polarity of the mask is opposite to
AMD: if the bit is set the page is accessible to VMM.

Provide vendor-neutral API to deal with the mask: cc_mkenc() and
cc_mkdec() modify given address to make it encrypted/decrypted. It can
be applied to phys_addr_t, pgprotval_t or page table entry value.

pgprot_encrypted() and pgprot_decrypted() reimplemented using new
helpers.

The implementation will be extended to cover TDX.

pgprot_decrypted() is used by drivers (i915, virtio_gpu, vfio).
cc_mkdec() called by pgprot_decrypted(). Export cc_mkdec().

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Link: https://lore.kernel.org/r/20220222185740.26228-5-kirill.shutemov@linux.intel.com
2022-02-23 19:14:29 +01:00
Kirill A. Shutemov
655a0fa34b x86/coco: Explicitly declare type of confidential computing platform
The kernel derives the confidential computing platform
type it is running as from sme_me_mask on AMD or by using
hv_is_isolation_supported() on HyperV isolation VMs. This detection
process will be more complicated as more platforms get added.

Declare a confidential computing vendor variable explicitly and set it
via cc_set_vendor() on the respective platform.

  [ bp: Massage commit message, fixup HyperV check. ]

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Link: https://lore.kernel.org/r/20220222185740.26228-4-kirill.shutemov@linux.intel.com
2022-02-23 19:14:16 +01:00
Christoph Hellwig
4509d950a6 x86/pat: Remove the unused set_pages_array_wt() function
Commit

  623dffb2a2 ("x86/mm/pat: Add set_memory_wt() for Write-Through type")

added it but there were no users.

  [ bp: Add a commit message. ]

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20220223072852.616143-1-hch@lst.de
2022-02-23 13:34:08 +01:00
Adrian Hunter
1fb85d06ad x86: Share definition of __is_canonical_address()
Reduce code duplication by moving canonical address code to a common header
file.

Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20220131072453.2839535-3-adrian.hunter@intel.com
2022-02-02 13:11:42 +01:00
Linus Torvalds
f56caedaf9 Merge branch 'akpm' (patches from Andrew)
Merge misc updates from Andrew Morton:
 "146 patches.

  Subsystems affected by this patch series: kthread, ia64, scripts,
  ntfs, squashfs, ocfs2, vfs, and mm (slab-generic, slab, kmemleak,
  dax, kasan, debug, pagecache, gup, shmem, frontswap, memremap,
  memcg, selftests, pagemap, dma, vmalloc, memory-failure, hugetlb,
  userfaultfd, vmscan, mempolicy, oom-kill, hugetlbfs, migration, thp,
  ksm, page-poison, percpu, rmap, zswap, zram, cleanups, hmm, and
  damon)"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (146 commits)
  mm/damon: hide kernel pointer from tracepoint event
  mm/damon/vaddr: hide kernel pointer from damon_va_three_regions() failure log
  mm/damon/vaddr: use pr_debug() for damon_va_three_regions() failure logging
  mm/damon/dbgfs: remove an unnecessary variable
  mm/damon: move the implementation of damon_insert_region to damon.h
  mm/damon: add access checking for hugetlb pages
  Docs/admin-guide/mm/damon/usage: update for schemes statistics
  mm/damon/dbgfs: support all DAMOS stats
  Docs/admin-guide/mm/damon/reclaim: document statistics parameters
  mm/damon/reclaim: provide reclamation statistics
  mm/damon/schemes: account how many times quota limit has exceeded
  mm/damon/schemes: account scheme actions that successfully applied
  mm/damon: remove a mistakenly added comment for a future feature
  Docs/admin-guide/mm/damon/usage: update for kdamond_pid and (mk|rm)_contexts
  Docs/admin-guide/mm/damon/usage: mention tracepoint at the beginning
  Docs/admin-guide/mm/damon/usage: remove redundant information
  Docs/admin-guide/mm/damon/usage: update for scheme quotas and watermarks
  mm/damon: convert macro functions to static inline functions
  mm/damon: modify damon_rand() macro to static inline function
  mm/damon: move damon_rand() definition into damon.h
  ...
2022-01-15 20:37:06 +02:00
Qi Zheng
36ef159f44 mm: remove redundant check about FAULT_FLAG_ALLOW_RETRY bit
Since commit 4064b98270 ("mm: allow VM_FAULT_RETRY for multiple
times") allowed VM_FAULT_RETRY for multiple times, the
FAULT_FLAG_ALLOW_RETRY bit of fault_flag will not be changed in the page
fault path, so the following check is no longer needed:

	flags & FAULT_FLAG_ALLOW_RETRY

So just remove it.

[akpm@linux-foundation.org: coding style fixes]

Link: https://lkml.kernel.org/r/20211110123358.36511-1-zhengqi.arch@bytedance.com
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Kirill Shutemov <kirill@shutemov.name>
Cc: Peter Xu <peterx@redhat.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Chengming Zhou <zhouchengming@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-01-15 16:30:27 +02:00
Linus Torvalds
64ad946152 - Get rid of all the .fixup sections because this generates
misleading/wrong stacktraces and confuse RELIABLE_STACKTRACE and
 LIVEPATCH as the backtrace misses the function which is being fixed up.
 
 - Add Straight Light Speculation mitigation support which uses a new
 compiler switch -mharden-sls= which sticks an INT3 after a RET or an
 indirect branch in order to block speculation after them. Reportedly,
 CPUs do speculate behind such insns.
 
 - The usual set of cleanups and improvements
 -----BEGIN PGP SIGNATURE-----
 
 iQIyBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmHfKA0ACgkQEsHwGGHe
 VUqLJg/2I2X2xXr5filJVaK+sQgmvDzk67DKnbxRBW2xcPF+B5sSW5yhe3G5UPW7
 SJVdhQ3gHcTiliGGlBf/VE7KXbqxFN0vO4/VFHZm78r43g7OrXTxz6WXXQRJ1n67
 U3YwRH3b6cqXZNFMs+X4bJt6qsGJM1kdTTZ2as4aERnaFr5AOAfQvfKbyhxLe/XA
 3SakfYISVKCBQ2RkTfpMpwmqlsatGFhTC5IrvuDQ83dDsM7O+Dx1J6Gu3fwjKmie
 iVzPOjCh+xTpZQp/SIZmt7MzoduZvpSym4YVyHvEnMiexQT4AmyaRthWqrhnEXY/
 qOvj8/XIqxmix8EaooGqRIK0Y2ZegxkPckNFzaeC3lsWohwMIGIhNXwHNEeuhNyH
 yvNGAW9Cq6NeDRgz5MRUXcimYw4P4oQKYLObS1WqFZhNMqm4sNtoEAYpai/lPYfs
 zUDckgXF2AoPOsSqy3hFAVaGovAgzfDaJVzkt0Lk4kzzjX2WQiNLhmiior460w+K
 0l2Iej58IajSp3MkWmFH368Jo8YfUVmkjbbpsmjsBppA08e1xamJB7RmswI/Ezj6
 s5re6UioCD+UYdjWx41kgbvYdvIkkZ2RLrktoZd/hqHrOLWEIiwEbyFO2nRFJIAh
 YjvPkB1p7iNuAeYcP1x9Ft9GNYVIsUlJ+hK86wtFCqy+abV+zQ==
 =R52z
 -----END PGP SIGNATURE-----

Merge tag 'x86_core_for_v5.17_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 core updates from Borislav Petkov:

 - Get rid of all the .fixup sections because this generates
   misleading/wrong stacktraces and confuse RELIABLE_STACKTRACE and
   LIVEPATCH as the backtrace misses the function which is being fixed
   up.

 - Add Straight Line Speculation mitigation support which uses a new
   compiler switch -mharden-sls= which sticks an INT3 after a RET or an
   indirect branch in order to block speculation after them. Reportedly,
   CPUs do speculate behind such insns.

 - The usual set of cleanups and improvements

* tag 'x86_core_for_v5.17_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (32 commits)
  x86/entry_32: Fix segment exceptions
  objtool: Remove .fixup handling
  x86: Remove .fixup section
  x86/word-at-a-time: Remove .fixup usage
  x86/usercopy: Remove .fixup usage
  x86/usercopy_32: Simplify __copy_user_intel_nocache()
  x86/sgx: Remove .fixup usage
  x86/checksum_32: Remove .fixup usage
  x86/vmx: Remove .fixup usage
  x86/kvm: Remove .fixup usage
  x86/segment: Remove .fixup usage
  x86/fpu: Remove .fixup usage
  x86/xen: Remove .fixup usage
  x86/uaccess: Remove .fixup usage
  x86/futex: Remove .fixup usage
  x86/msr: Remove .fixup usage
  x86/extable: Extend extable functionality
  x86/entry_32: Remove .fixup usage
  x86/entry_64: Remove .fixup usage
  x86/copy_mc_64: Remove .fixup usage
  ...
2022-01-12 16:31:19 -08:00
Peter Zijlstra
9cdbeec409 x86/entry_32: Fix segment exceptions
The LKP robot reported that commit in Fixes: caused a failure. Turns out
the ldt_gdt_32 selftest turns into an infinite loop trying to clear the
segment.

As discovered by Sean, what happens is that PARANOID_EXIT_TO_KERNEL_MODE
in the handle_exception_return path overwrites the entry stack data with
the task stack data, restoring the "bad" segment value.

Instead of having the exception retry the instruction, have it emulate
the full instruction. Replace EX_TYPE_POP_ZERO with EX_TYPE_POP_REG
which will do the equivalent of: POP %reg; MOV $imm, %reg.

In order to encode the segment registers, add them as registers 8-11 for
32-bit.

By setting regs->[defg]s the (nested) RESTORE_REGS will pop this value
at the end of the exception handler and by increasing regs->sp, it will
have skipped the stack slot.

This was debugged by Sean Christopherson <seanjc@google.com>.

 [ bp: Add EX_REG_GS too. ]

Fixes: aa93e2ad74 ("x86/entry_32: Remove .fixup usage")
Reported-by: kernel test robot <oliver.sang@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/Yd1l0gInc4zRcnt/@hirez.programming.kicks-ass.net
2022-01-12 16:38:25 +01:00
Linus Torvalds
ca1a46d6f5 slab changes for 5.17
-----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEjUuTAak14xi+SF7M4CHKc/GJqRAFAmHYFIIACgkQ4CHKc/GJ
 qRBXqwf+JrWc3PCRF4xKeYmi367RgSX9D8kFCcAry1F+iuq1ssqlDBy/vEp1KtXE
 t2Xyn6PILgzGcYdK1/CVNigwAom2NRcb8fHamjjopqYk8wor9m46I564Z6ItVg2I
 SCcWhHEuD7M66tmBS+oex3n+LOZ4jPUPhkn5KH04/LSTrR5dzn1op6CnFbpOUZn1
 Uy9qB6EbjuyhsONHnO/CdoRUU07K+KqEkzolXFCqpI2Vqf+VBvAwi+RpDLfKkr6l
 Vp4PT03ixVsOWhGaJcf7hijKCRyfhsLp7Zyg33pzwpXyngqrowwUPVDMKPyqBy6O
 ktehRk+cOQiAi7KnpECljof+NR15Qg==
 =/Nyj
 -----END PGP SIGNATURE-----

Merge tag 'slab-for-5.17' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab

Pull slab updates from Vlastimil Babka:

 - Separate struct slab from struct page - an offshot of the page folio
   work.

   Struct page fields used by slab allocators are moved from struct page
   to a new struct slab, that uses the same physical storage. Similar to
   struct folio, it always is a head page. This brings better type
   safety, separation of large kmalloc allocations from true slabs, and
   cleanup of related objcg code.

 - A SLAB_MERGE_DEFAULT config optimization.

* tag 'slab-for-5.17' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab: (33 commits)
  mm/slob: Remove unnecessary page_mapcount_reset() function call
  bootmem: Use page->index instead of page->freelist
  zsmalloc: Stop using slab fields in struct page
  mm/slub: Define struct slab fields for CONFIG_SLUB_CPU_PARTIAL only when enabled
  mm/slub: Simplify struct slab slabs field definition
  mm/sl*b: Differentiate struct slab fields by sl*b implementations
  mm/kfence: Convert kfence_guarded_alloc() to struct slab
  mm/kasan: Convert to struct folio and struct slab
  mm/slob: Convert SLOB to use struct slab and struct folio
  mm/memcg: Convert slab objcgs from struct page to struct slab
  mm: Convert struct page to struct slab in functions used by other subsystems
  mm/slab: Finish struct page to struct slab conversion
  mm/slab: Convert most struct page to struct slab by spatch
  mm/slab: Convert kmem_getpages() and kmem_freepages() to struct slab
  mm/slub: Finish struct page to struct slab conversion
  mm/slub: Convert most struct page to struct slab by spatch
  mm/slub: Convert pfmemalloc_match() to take a struct slab
  mm/slub: Convert __free_slab() to use struct slab
  mm/slub: Convert alloc_slab_page() to return a struct slab
  mm/slub: Convert print_page_info() to print_slab_info()
  ...
2022-01-10 11:58:12 -08:00
Linus Torvalds
48a60bdb2b - Add a set of thread_info.flags accessors which snapshot it before
accesing it in order to prevent any potential data races, and convert
 all users to those new accessors
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmHcgFoACgkQEsHwGGHe
 VUqXeRAAvcNEfFw6BvXeGfFTxKmOrsRtu2WCkAkjvamyhXMCrjBqqHlygLJFCH5i
 2mc6HBohzo4vBFcgi3R5tVkGazqlthY1KUM9Jpk7rUuUzi0phTH7n/MafZOm9Es/
 BHYcAAyT/NwZRbCN0geccIzBtbc4xr8kxtec7vkRfGDx8B9/uFN86xm7cKAaL62G
 UDs0IquDPKEns3A7uKNuvKztILtuZWD1WcSkbOULJzXgLkb+cYKO1Lm9JK9rx8Ds
 8tjezrJgOYGLQyyv0i3pWelm3jCZOKUChPslft0opvVUbrNd8piehvOm9CWopHcB
 QsYOWchnULTE9o4ZAs/1PkxC0LlFEWZH8bOLxBMTDVEY+xvmDuj1PdBUpncgJbOh
 dunHzsvaWproBSYUXA9nKhZWTVGl+CM8Ks7jXjl3IPynLd6cpYZ/5gyBVWEX7q3e
 8htG95NzdPPo7doxMiNSKGSmSm0Np1TJ/i89vsYeGfefsvsq53Fyjhu7dIuTWHmU
 2YUe6qHs6dF9x1bkHAAZz6T9Hs4BoGQBcXUnooT9JbzVdv2RfTPsrawdu8dOnzV1
 RhwCFdFcll0AIEl0T9fCYzUI/Ga8ZS0roXs5NZ4wl0lwr0BGFwiU8WC1FUdGsZo9
 0duaa0Tpv0OWt6rIMMB/E9QsqCDsQ4CMHuQpVVw+GOO5ux9kMms=
 =v6Xn
 -----END PGP SIGNATURE-----

Merge tag 'core_entry_for_v5.17_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull thread_info flag accessor helper updates from Borislav Petkov:
 "Add a set of thread_info.flags accessors which snapshot it before
  accesing it in order to prevent any potential data races, and convert
  all users to those new accessors"

* tag 'core_entry_for_v5.17_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  powerpc: Snapshot thread flags
  powerpc: Avoid discarding flags in system_call_exception()
  openrisc: Snapshot thread flags
  microblaze: Snapshot thread flags
  arm64: Snapshot thread flags
  ARM: Snapshot thread flags
  alpha: Snapshot thread flags
  sched: Snapshot thread flags
  entry: Snapshot thread flags
  x86: Snapshot thread flags
  thread_info: Add helpers to snapshot thread flags
2022-01-10 11:34:10 -08:00
Linus Torvalds
4a692ae360 - Flush *all* mappings from the TLB after switching to the trampoline
pagetable to prevent any stale entries' presence
 
 - Flush global mappings from the TLB, in addition to the CR3-write,
 after switching off of the trampoline_pgd during boot to clear the
 identity mappings
 
 - Prevent instrumentation issues resulting from the above changes
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmHcEGIACgkQEsHwGGHe
 VUp14RAAgo6BbW9J82Pyl55egIhcQDdGsa16Gdm9S/AFIIW/NhwYo9ydrgtzr/70
 3XKpJYX7nH7PUKYRmoca/m3NnzUU+wnjSGS1XMyB3bJvn2/8S1qeuwBty2VP2dYM
 iS2eGRLjVjbMWwQUSK7tPJa5wi11zUqLIyCe3t0YiWso6TK7xKaVJTQ3/19Xc+/a
 zVQ5VpmzglUTxA6xGCvTDn5IUViUb8QmIuw7Ty6QtQEoI6T3qQvPkdJNXOxDcHNy
 9gDGf4O+5YlPCxYsNEkWDDa02zSZ2aWFSq76b98VyMiOK0xts+ktnAwq6oes+as9
 ZLIipOu5aIkj8te7he0FelyvPhZAVzrFvvmMf1U+EV3PqbyVkabhk5SBeP5v8CZy
 bM4eYNuJ2FLvFpUCC9zQ/MNVQ6ZtxN15rrrsTqk46KLPBHmHp/Aj9W/DP4zpCcNg
 Wwh4xbnGNIN8jZBiBJG6R6q7oM/lZt/loEicxm2QFZHtAIYMsiUmE99HnIREjUHd
 +0mwo2rHniie9zh6GoybX8OcbZCLYGdfe3iPvlO9fQpyDTn8IUIlnruDlUiTBMDM
 fX4J2dynh7xXRH1WW+MwxDv4n400+C08SG9zTD0qPCbGhYwNscMlZhA2JN6mlPep
 spuRPOzzwUUxqjXkDloeDDJNUQ8r032OB2LMhWSbLApJrJM9/QA=
 =cM+z
 -----END PGP SIGNATURE-----

Merge tag 'x86_mm_for_v5.17_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 mm updates from Borislav Petkov:

 - Flush *all* mappings from the TLB after switching to the trampoline
   pagetable to prevent any stale entries' presence

 - Flush global mappings from the TLB, in addition to the CR3-write,
   after switching off of the trampoline_pgd during boot to clear the
   identity mappings

 - Prevent instrumentation issues resulting from the above changes

* tag 'x86_mm_for_v5.17_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/mm: Prevent early boot triple-faults with instrumentation
  x86/mm: Include spinlock_t definition in pgtable.
  x86/mm: Flush global TLB when switching to trampoline page-table
  x86/mm/64: Flush global TLB on boot and AP bringup
  x86/realmode: Add comment for Global bit usage in trampoline_pgd
  x86/mm: Add missing <asm/cpufeatures.h> dependency to <asm/page_64.h>
2022-01-10 09:51:38 -08:00
Matthew Wilcox (Oracle)
c5e97ed154 bootmem: Use page->index instead of page->freelist
page->freelist is for the use of slab.  Using page->index is the same
set of bits as page->freelist, and by using an integer instead of a
pointer, we can avoid casts.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: <x86@kernel.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
2022-01-06 12:27:03 +01:00
Peter Zijlstra
d5d797dcbd x86/usercopy: Remove .fixup usage
Typically usercopy does whole word copies followed by a number of byte
copies to finish the tail. This means that on exception it needs to
compute the remaining length as: words*sizeof(long) + bytes.

Create a new extable handler to do just this.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
Link: https://lore.kernel.org/r/20211110101326.081701085@infradead.org
2021-12-11 09:09:50 +01:00
Peter Zijlstra
5ce8e39f55 x86/sgx: Remove .fixup usage
Create EX_TYPE_FAULT_SGX which does as EX_TYPE_FAULT does, except adds
this extra bit that SGX really fancies having.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
Link: https://lore.kernel.org/r/20211110101325.961246679@infradead.org
2021-12-11 09:09:49 +01:00
Peter Zijlstra
d52a7344bd x86/msr: Remove .fixup usage
Rework the MSR accessors to remove .fixup usage. Add two new extable
types (to the 4 already existing msr ones) using the new register
infrastructure to record which register should get the error value.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
Link: https://lore.kernel.org/r/20211110101325.364084212@infradead.org
2021-12-11 09:09:47 +01:00
Peter Zijlstra
4b5305decc x86/extable: Extend extable functionality
In order to remove further .fixup usage, extend the extable
infrastructure to take additional information from the extable entry
sites.

Specifically add _ASM_EXTABLE_TYPE_REG() and EX_TYPE_IMM_REG that
extend the existing _ASM_EXTABLE_TYPE() by taking an additional
register argument and encoding that and an s16 immediate into the
existing s32 type field. This limits the actual types to the first
byte, 255 seem plenty.

Also add a few flags into the type word, specifically CLEAR_AX and
CLEAR_DX which clear the return and extended return register.

Notes:
 - due to the % in our register names it's hard to make it more
   generally usable as arm64 did.
 - the s16 is far larger than used in these patches, future extentions
   can easily shrink this to get more bits.
 - without the bitfield fix this will not compile, because: 0xFF > -1
   and we can't even extract the TYPE field.

[nathanchance: Build fix for clang-lto builds:
 https://lkml.kernel.org/r/20211210234953.3420108-1-nathan@kernel.org
]

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
Tested-by: Nick Desaulniers <ndesaulniers@google.com>
Link: https://lore.kernel.org/r/20211110101325.303890153@infradead.org
2021-12-11 09:09:46 +01:00
Peter Zijlstra
aa93e2ad74 x86/entry_32: Remove .fixup usage
Where possible, push the .fixup into code, at the tail of functions.

This is hard for macros since they're used in multiple functions,
therefore introduce a new extable handler to pop zeros.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
Link: https://lore.kernel.org/r/20211110101325.245184699@infradead.org
2021-12-11 09:09:46 +01:00
Kirill A. Shutemov
20f07a044a x86/sev: Move common memory encryption code to mem_encrypt.c
SEV and TDX both protect guest memory from host accesses. They both use
guest physical address bits to communicate to the hardware which pages
receive protection or not. SEV and TDX both assume that all I/O (real
devices and virtio) must be performed to pages *without* protection.

To add this support, AMD SEV code forces force_dma_unencrypted() to
decrypt DMA pages when DMA pages were allocated for I/O. It also uses
swiotlb_update_mem_attributes() to update decryption bits in SWIOTLB DMA
buffers.

Since TDX also uses a similar memory sharing design, all the above
mentioned changes can be reused. So move force_dma_unencrypted(),
SWIOTLB update code and virtio changes out of mem_encrypt_amd.c to
mem_encrypt.c.

Introduce a new config option X86_MEM_ENCRYPT that can be selected by
platforms which use x86 memory encryption features (needed in both AMD
SEV and Intel TDX guest platforms).

Since the code is moved from mem_encrypt_amd.c, inherit the same make
flags.

This is preparation for enabling TDX memory encryption support and it
has no functional changes.

Co-developed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Tested-by: Tom Lendacky <thomas.lendacky@amd.com>
Link: https://lore.kernel.org/r/20211206135505.75045-4-kirill.shutemov@linux.intel.com
2021-12-08 16:49:53 +01:00
Kuppuswamy Sathyanarayanan
dbca5e1a04 x86/sev: Rename mem_encrypt.c to mem_encrypt_amd.c
Both Intel TDX and AMD SEV implement memory encryption features. But the
bulk of the code in mem_encrypt.c is AMD-specific. Rename the file to
mem_encrypt_amd.c. A subsequent patch will extract the parts that can be
shared by both TDX and AMD SEV/SME into a generic file.

No functional changes.

Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Tested-by: Tom Lendacky <thomas.lendacky@amd.com>
Link: https://lore.kernel.org/r/20211206135505.75045-3-kirill.shutemov@linux.intel.com
2021-12-08 16:49:47 +01:00
Kuppuswamy Sathyanarayanan
8260b9820f x86/sev: Use CC_ATTR attribute to generalize string I/O unroll
INS/OUTS are not supported in TDX guests and cause #UD. Kernel has to
avoid them when running in TDX guest. To support existing usage, string
I/O operations are unrolled using IN/OUT instructions.

AMD SEV platform implements this support by adding unroll
logic in ins#bwl()/outs#bwl() macros with SEV-specific checks.
Since TDX VM guests will also need similar support, use
CC_ATTR_GUEST_UNROLL_STRING_IO and generic cc_platform_has() API to
implement it.

String I/O helpers were the last users of sev_key_active() interface and
sev_enable_key static key. Remove them.

 [ bp: Move comment too and do not delete it. ]

Suggested-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Tested-by: Tom Lendacky <thomas.lendacky@amd.com>
Link: https://lkml.kernel.org/r/20211206135505.75045-2-kirill.shutemov@linux.intel.com
2021-12-08 16:49:42 +01:00
Peter Zijlstra
f94909ceb1 x86: Prepare asm files for straight-line-speculation
Replace all ret/retq instructions with RET in preparation of making
RET a macro. Since AS is case insensitive it's a big no-op without
RET defined.

  find arch/x86/ -name \*.S | while read file
  do
	sed -i 's/\<ret[q]*\>/RET/' $file
  done

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20211204134907.905503893@infradead.org
2021-12-08 12:25:37 +01:00
Joerg Roedel
f154f29085 x86/mm/64: Flush global TLB on boot and AP bringup
The AP bringup code uses the trampoline_pgd page-table which
establishes global mappings in the user range of the address space.
Flush the global TLB entries after the indentity mappings are removed so
no stale entries remain in the TLB.

Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20211202153226.22946-3-joro@8bytes.org
2021-12-06 09:38:48 +01:00
Joerg Roedel
9de4999050 x86/realmode: Add comment for Global bit usage in trampoline_pgd
Document the fact that using the trampoline_pgd will result in the
creation of global TLB entries in the user range of the address
space.

Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20211202153226.22946-2-joro@8bytes.org
2021-12-04 13:50:08 +01:00
Mark Rutland
dca99fb643 x86: Snapshot thread flags
Some thread flags can be set remotely, and so even when IRQs are disabled,
the flags can change under our feet. Generally this is unlikely to cause a
problem in practice, but it is somewhat unsound, and KCSAN will
legitimately warn that there is a data race.

To avoid such issues, a snapshot of the flags has to be taken prior to
using them. Some places already use READ_ONCE() for that, others do not.

Convert them all to the new flag accessor helpers.

Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Paul E. McKenney <paulmck@kernel.org>
Link: https://lore.kernel.org/r/20211129130653.2037928-12-mark.rutland@arm.com
2021-12-01 00:06:43 +01:00
Paolo Bonzini
b9ecb9a997 Merge branch 'kvm-guest-sev-migration' into kvm-master
Add guest api and guest kernel support for SEV live migration.

Introduces a new hypercall to notify the host of changes to the page
encryption status.  If the page is encrypted then it must be migrated
through the SEV firmware or a helper VM sharing the key.  If page is
not encrypted then it can be migrated normally by userspace.  This new
hypercall is invoked using paravirt_ops.

Conflicts: sev_active() replaced by cc_platform_has(CC_ATTR_GUEST_MEM_ENCRYPT).
2021-11-11 07:40:26 -05:00
Ashish Kalra
f4495615d7 x86/kvm: Add guest support for detecting and enabling SEV Live Migration feature.
The guest support for detecting and enabling SEV Live migration
feature uses the following logic :

 - kvm_init_plaform() checks if its booted under the EFI

   - If not EFI,

     i) if kvm_para_has_feature(KVM_FEATURE_MIGRATION_CONTROL), issue a wrmsrl()
         to enable the SEV live migration support

   - If EFI,

     i) If kvm_para_has_feature(KVM_FEATURE_MIGRATION_CONTROL), read
        the UEFI variable which indicates OVMF support for live migration

     ii) the variable indicates live migration is supported, issue a wrmsrl() to
          enable the SEV live migration support

The EFI live migration check is done using a late_initcall() callback.

Also, ensure that _bss_decrypted section is marked as decrypted in the
hypervisor's guest page encryption status tracking.

Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
Reviewed-by: Steve Rutherford <srutherford@google.com>
Message-Id: <b4453e4c87103ebef12217d2505ea99a1c3e0f0f.1629726117.git.ashish.kalra@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-11-11 07:37:36 -05:00
Brijesh Singh
064ce6c550 mm: x86: Invoke hypercall when page encryption status is changed
Invoke a hypercall when a memory region is changed from encrypted ->
decrypted and vice versa. Hypervisor needs to know the page encryption
status during the guest migration.

Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Borislav Petkov <bp@suse.de>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: x86@kernel.org
Cc: kvm@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Reviewed-by: Steve Rutherford <srutherford@google.com>
Reviewed-by: Venu Busireddy <venu.busireddy@oracle.com>
Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
Reviewed-by: Borislav Petkov <bp@suse.de>
Message-Id: <0a237d5bb08793916c7790a3e653a2cbe7485761.1629726117.git.ashish.kalra@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-11-11 07:37:24 -05:00
Linus Torvalds
59a2ceeef6 Merge branch 'akpm' (patches from Andrew)
Merge more updates from Andrew Morton:
 "87 patches.

  Subsystems affected by this patch series: mm (pagecache and hugetlb),
  procfs, misc, MAINTAINERS, lib, checkpatch, binfmt, kallsyms, ramfs,
  init, codafs, nilfs2, hfs, crash_dump, signals, seq_file, fork,
  sysvfs, kcov, gdb, resource, selftests, and ipc"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (87 commits)
  ipc/ipc_sysctl.c: remove fallback for !CONFIG_PROC_SYSCTL
  ipc: check checkpoint_restore_ns_capable() to modify C/R proc files
  selftests/kselftest/runner/run_one(): allow running non-executable files
  virtio-mem: disallow mapping virtio-mem memory via /dev/mem
  kernel/resource: disallow access to exclusive system RAM regions
  kernel/resource: clean up and optimize iomem_is_exclusive()
  scripts/gdb: handle split debug for vmlinux
  kcov: replace local_irq_save() with a local_lock_t
  kcov: avoid enable+disable interrupts if !in_task()
  kcov: allocate per-CPU memory on the relevant node
  Documentation/kcov: define `ip' in the example
  Documentation/kcov: include types.h in the example
  sysv: use BUILD_BUG_ON instead of runtime check
  kernel/fork.c: unshare(): use swap() to make code cleaner
  seq_file: fix passing wrong private data
  seq_file: move seq_escape() to a header
  signal: remove duplicate include in signal.h
  crash_dump: remove duplicate include in crash_dump.h
  crash_dump: fix boolreturn.cocci warning
  hfs/hfsplus: use WARN_ON for sanity check
  ...
2021-11-09 10:11:53 -08:00
Kefeng Wang
0a96c902d4 x86: mm: rename __is_kernel_text() to is_x86_32_kernel_text()
Commit b56cd05c55 ("x86/mm: Rename is_kernel_text to __is_kernel_text"),
add '__' prefix not to get in conflict with existing is_kernel_text() in
<linux/kallsyms.h>.

We will add __is_kernel_text() for the basic kernel text range check in
the next patch, so use private is_x86_32_kernel_text() naming for x86
special check.

Link: https://lkml.kernel.org/r/20210930071143.63410-6-wangkefeng.wang@huawei.com
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Alexander Potapenko <glider@google.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-11-09 10:02:51 -08:00
Linus Torvalds
512b7931ad Merge branch 'akpm' (patches from Andrew)
Merge misc updates from Andrew Morton:
 "257 patches.

  Subsystems affected by this patch series: scripts, ocfs2, vfs, and
  mm (slab-generic, slab, slub, kconfig, dax, kasan, debug, pagecache,
  gup, swap, memcg, pagemap, mprotect, mremap, iomap, tracing, vmalloc,
  pagealloc, memory-failure, hugetlb, userfaultfd, vmscan, tools,
  memblock, oom-kill, hugetlbfs, migration, thp, readahead, nommu, ksm,
  vmstat, madvise, memory-hotplug, rmap, zsmalloc, highmem, zram,
  cleanups, kfence, and damon)"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (257 commits)
  mm/damon: remove return value from before_terminate callback
  mm/damon: fix a few spelling mistakes in comments and a pr_debug message
  mm/damon: simplify stop mechanism
  Docs/admin-guide/mm/pagemap: wordsmith page flags descriptions
  Docs/admin-guide/mm/damon/start: simplify the content
  Docs/admin-guide/mm/damon/start: fix a wrong link
  Docs/admin-guide/mm/damon/start: fix wrong example commands
  mm/damon/dbgfs: add adaptive_targets list check before enable monitor_on
  mm/damon: remove unnecessary variable initialization
  Documentation/admin-guide/mm/damon: add a document for DAMON_RECLAIM
  mm/damon: introduce DAMON-based Reclamation (DAMON_RECLAIM)
  selftests/damon: support watermarks
  mm/damon/dbgfs: support watermarks
  mm/damon/schemes: activate schemes based on a watermarks mechanism
  tools/selftests/damon: update for regions prioritization of schemes
  mm/damon/dbgfs: support prioritization weights
  mm/damon/vaddr,paddr: support pageout prioritization
  mm/damon/schemes: prioritize regions within the quotas
  mm/damon/selftests: support schemes quotas
  mm/damon/dbgfs: support quotas of schemes
  ...
2021-11-06 14:08:17 -07:00
David Hildenbrand
5c11f00b09 x86: remove memory hotplug support on X86_32
CONFIG_MEMORY_HOTPLUG was marked BROKEN over one year and we just
restricted it to 64 bit.  Let's remove the unused x86 32bit
implementation and simplify the Kconfig.

Link: https://lkml.kernel.org/r/20210929143600.49379-7-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Alex Shi <alexs@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-11-06 13:30:42 -07:00
Mike Rapoport
4421cca0a3 memblock: use memblock_free for freeing virtual pointers
Rename memblock_free_ptr() to memblock_free() and use memblock_free()
when freeing a virtual pointer so that memblock_free() will be a
counterpart of memblock_alloc()

The callers are updated with the below semantic patch and manual
addition of (void *) casting to pointers that are represented by
unsigned long variables.

    @@
    identifier vaddr;
    expression size;
    @@
    (
    - memblock_phys_free(__pa(vaddr), size);
    + memblock_free(vaddr, size);
    |
    - memblock_free_ptr(vaddr, size);
    + memblock_free(vaddr, size);
    )

[sfr@canb.auug.org.au: fixup]
  Link: https://lkml.kernel.org/r/20211018192940.3d1d532f@canb.auug.org.au

Link: https://lkml.kernel.org/r/20210930185031.18648-7-rppt@kernel.org
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Juergen Gross <jgross@suse.com>
Cc: Shahab Vahedi <Shahab.Vahedi@synopsys.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-11-06 13:30:41 -07:00
Mike Rapoport
3ecc68349b memblock: rename memblock_free to memblock_phys_free
Since memblock_free() operates on a physical range, make its name
reflect it and rename it to memblock_phys_free(), so it will be a
logical counterpart to memblock_phys_alloc().

The callers are updated with the below semantic patch:

    @@
    expression addr;
    expression size;
    @@
    - memblock_free(addr, size);
    + memblock_phys_free(addr, size);

Link: https://lkml.kernel.org/r/20210930185031.18648-6-rppt@kernel.org
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Juergen Gross <jgross@suse.com>
Cc: Shahab Vahedi <Shahab.Vahedi@synopsys.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-11-06 13:30:41 -07:00
Linus Torvalds
44261f8e28 hyperv-next for 5.16
-----BEGIN PGP SIGNATURE-----
 
 iQFHBAABCAAxFiEEIbPD0id6easf0xsudhRwX5BBoF4FAmGBMQUTHHdlaS5saXVA
 a2VybmVsLm9yZwAKCRB2FHBfkEGgXmE5B/9MK3Ju+tc6C8eyR3Ic4XBYHJ3voEKO
 M+R90gggBriDOgkz4B8vF+k0aD8wevXAUtmCSXonDzCh5H7GoyfrVZmJEVkwlioH
 ZMSMlFHcjGhCPIXhLbNtfo/NsAYEtT/lRM2lLGCSbdGuKabylXKujVdhuSIcRPdj
 Rj5innUgcAywOoxG6WzFt3JBzM33UQErCGfUF2b7Rvp9E+Zii4vIMxkMzUpnkEHH
 F8WMEdL0DqH5ThOs0MslNgy03pUC9wk1d5DNd9ytYHqiSQtcQZhFHw/P6dxzUFlW
 OptWv31PXUIsiJf4Zi9hmfjgUl+KZHeacZ2hXtidAo86VPcIjVs25OQW
 =40fn
 -----END PGP SIGNATURE-----

Merge tag 'hyperv-next-signed-20211102' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux

Pull hyperv updates from Wei Liu:

 - Initial patch set for Hyper-V isolation VM support (Tianyu Lan)

 - Fix a warning on preemption (Vitaly Kuznetsov)

 - A bunch of misc cleanup patches

* tag 'hyperv-next-signed-20211102' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux:
  x86/hyperv: Protect set_hv_tscchange_cb() against getting preempted
  Drivers: hv : vmbus: Adding NULL pointer check
  x86/hyperv: Remove duplicate include
  x86/hyperv: Remove duplicated include in hv_init
  Drivers: hv: vmbus: Remove unused code to check for subchannels
  Drivers: hv: vmbus: Initialize VMbus ring buffer for Isolation VM
  Drivers: hv: vmbus: Add SNP support for VMbus channel initiate message
  x86/hyperv: Add ghcb hvcall support for SNP VM
  x86/hyperv: Add Write/Read MSR registers via ghcb page
  Drivers: hv: vmbus: Mark vmbus ring buffer visible to host in Isolation VM
  x86/hyperv: Add new hvcall guest address host visibility support
  x86/hyperv: Initialize shared memory boundary in the Isolation VM.
  x86/hyperv: Initialize GHCB page in Isolation VM
2021-11-02 10:56:49 -07:00
Linus Torvalds
cc0356d6a0 - Do not #GP on userspace use of CLI/STI but pretend it was a NOP to
keep old userspace from breaking. Adjust the corresponding iopl selftest
 to that.
 
 - Improve stack overflow warnings to say which stack got overflowed and
 raise the exception stack sizes to 2 pages since overflowing the single
 page of exception stack is very easy to do nowadays with all the tracing
 machinery enabled. With that, rip out the custom mapping of AMD SEV's
 too.
 
 - A bunch of changes in preparation for FGKASLR like supporting more
 than 64K section headers in the relocs tool, correct ORC lookup table
 size to cover the whole kernel .text and other adjustments.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmF/uugACgkQEsHwGGHe
 VUroKw//e8BJ3Aun8bg00FHxfiMGbPYcozjLGDkaoMtMDZ8WlfCUrvtqYICEr8eB
 UU0eRyygAPI167dre1O9JvAcbilkNTKntaU6qbu/ZVyUwS3+Jkjwsotbqn3xKtkd
 QDDTDNiCU+beCJ2ZbspbrPgEh13+H0MwMHUfRxZB9Scpmo6aGSEaU3g295f6GX57
 VFGJ/LNov5MV1dTD7Pp/h6/Nb+R6WmflKcBzJmQxYuKyKX+g1xsSv0VSga+t+uf3
 M9pUkizqTiUxzC2eLgtcEZTqqBHu810E8M76FmhKBUMilsFJT5YAJTiqyahwHXds
 HYarOFRgcnFuJPd29vn8UHjqeeoi6ru8GtcZYzccEc7U3ku/gXPaDJ9ffmvhs7vU
 pJX5Um3GiiFm0w/ZZOKDqh78wRAsCKLN+jIoyszuhkkNchZSj/jKfOgdd3EmcZst
 6L6rxBA4oRHwNOgM7uVMp+jFeRe1/prR280OWWH0D4QmmuqybThOdO23Iuh/Deth
 W3qPUH3UQtfSWxGy2yODzJ1ciuGAr/AzJZ9zjg04e3Vl0DkEpyWtLKJiG3ClXZag
 Nj+3xc4xYH2Aw+M0HRaONk5XVKLpqVjuAfgU5iLQa0YSUbtrR+wCWvY8KgQNbAqK
 xZmzYzQ89stwVCuGKx10gPsL3jSJ3VCylMfqdHD2Ajmld1yApr0=
 =DOZU
 -----END PGP SIGNATURE-----

Merge tag 'x86_core_for_v5.16_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 core updates from Borislav Petkov:

 - Do not #GP on userspace use of CLI/STI but pretend it was a NOP to
   keep old userspace from breaking. Adjust the corresponding iopl
   selftest to that.

 - Improve stack overflow warnings to say which stack got overflowed and
   raise the exception stack sizes to 2 pages since overflowing the
   single page of exception stack is very easy to do nowadays with all
   the tracing machinery enabled. With that, rip out the custom mapping
   of AMD SEV's too.

 - A bunch of changes in preparation for FGKASLR like supporting more
   than 64K section headers in the relocs tool, correct ORC lookup table
   size to cover the whole kernel .text and other adjustments.

* tag 'x86_core_for_v5.16_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  selftests/x86/iopl: Adjust to the faked iopl CLI/STI usage
  vmlinux.lds.h: Have ORC lookup cover entire _etext - _stext
  x86/boot/compressed: Avoid duplicate malloc() implementations
  x86/boot: Allow a "silent" kaslr random byte fetch
  x86/tools/relocs: Support >64K section headers
  x86/sev: Make the #VC exception stacks part of the default stacks storage
  x86: Increase exception stack sizes
  x86/mm/64: Improve stack overflow warnings
  x86/iopl: Fake iopl(3) CLI/STI usage
2021-11-02 07:56:47 -07:00
Linus Torvalds
20273d2588 - Export sev_es_ghcb_hv_call() so that HyperV Isolation VMs can use it too
- Non-urgent fixes and cleanups
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmF/xXMACgkQEsHwGGHe
 VUpFohAAn1FcRfgUh4a7SZQudhWaYPye0Yaf9c9acJIDYfls4Qg3ZLvSNGS0QChW
 pcjNQzr42UymxZKq1t6JGaUlD0vkfW0p+w5wueeIxMltWG0oZXgUPhqWrFTLwBtR
 g5Gio3Jum1CULCMokS6W4MjJSkTtX5NyYPg+m5Siowy10cbBdYA4wJaKnwGslPT7
 4pCDQP5159cjmG9WthKppxUdFy/vql0NJhjxmUkha39eVJ7yLoWvJoubQqqGnqXF
 XHwFolZGBxm4Ed4XoUjtz4HgI0VD1JOImUBPqnaE/uyrU7bqqywe5/PpZP051xtF
 anpWBm8KbZFsh220bSRJdFQxQBiXaIA41tfBiqVQhrgPy6TKgq7glhD4/ZjvUAdu
 DDg2HYEnK3dBAOCa7zIj/+uTijD1nvvuhQblGB2PnvnD2RWWgl+0vZ9Wqspo0EyW
 ry5V7hGCMC3mgFexTtvwd1hvMJVYrKfyn2XcP9B+zdgpUJ9DprB+g1O1J6NkGe1r
 SKS6itMokVRd+I+16iFQh0PuywqldbNv9dby6bd+dtvxAcVER2vUA0C7wmjqX4Mx
 bpftPrNhdNmgQAYlN/tRIfh2t2cFTJnWegVBBErdEfafiqKL9lU8gQlMVgwY10o+
 a1ALQ5cUI9Y0xS4cJtfVBVIekqIwEbmniS66iMlMiEJx+Ar6T8g=
 =Gql9
 -----END PGP SIGNATURE-----

Merge tag 'x86_sev_for_v5.16_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 SEV updates from Borislav Petkov:

 - Export sev_es_ghcb_hv_call() so that HyperV Isolation VMs can use it
   too

 - Non-urgent fixes and cleanups

* tag 'x86_sev_for_v5.16_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/sev: Expose sev_es_ghcb_hv_call() for use by HyperV
  x86/sev: Allow #VC exceptions on the VC2 stack
  x86/sev: Fix stack type check in vc_switch_off_ist()
  x86/sme: Use #define USE_EARLY_PGTABLE_L5 in mem_encrypt_identity.c
  x86/sev: Carve out HV call's return value verification
2021-11-01 15:52:26 -07:00
Linus Torvalds
6e5772c8d9 Add an interface called cc_platform_has() which is supposed to be used
by confidential computing solutions to query different aspects of the
 system. The intent behind it is to unify testing of such aspects instead
 of having each confidential computing solution add its own set of tests
 to code paths in the kernel, leading to an unwieldy mess.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmF/uLUACgkQEsHwGGHe
 VUqGbQ/+LOmz8hmL5vtbXw/lVonCSBRKI2KVefnN2VtQ3rjtCq8HlNoq/hAdi15O
 WntABFV8u4daNAcssp+H/p+c8Mt/NzQa60TRooC5ZIynSOCj4oZQxTWjcnR4Qxrf
 oABy4sp09zNW31qExtTVTwPC/Ejzv4hA0Vqt9TLQOSxp7oYVYKeDJNp79VJK64Yz
 Ky7epgg8Pauk0tAT76ATR4kyy9PLGe4/Ry0bOtAptO4NShL1RyRgI0ywUmptJHSw
 FV/MnoexdAs4V8+4zPwyOkf8YMDnhbJcvFcr7Yd9AEz2q9Z1wKCgi1M3aZIoW8lV
 YMXECMGe9DfxmEJbnP5zbnL6eF32x+tbq+fK8Ye4V2fBucpWd27zkcTXjoP+Y+zH
 NLg+9QykR9QCH75YCOXcAg1Q5hSmc4DaWuJymKjT+W7MKs89ywjq+ybIBpLBHbQe
 uN9FM/CEKXx8nQwpNQc7mdUE5sZeCQ875028RaLbLx3/b6uwT6rBlNJfxl/uxmcZ
 iF1kG7Cx4uO+7G1a9EWgxtWiJQ8GiZO7PMCqEdwIymLIrlNksAk7nX2SXTuH5jIZ
 YDuBj/Xz2UUVWYFm88fV5c4ogiFlm9Jeo140Zua/BPdDJd2VOP013rYxzFE/rVSF
 SM2riJxCxkva8Fb+8TNiH42AMhPMSpUt1Nmd1H2rcEABRiT83Ow=
 =Na0U
 -----END PGP SIGNATURE-----

Merge tag 'x86_cc_for_v5.16_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull generic confidential computing updates from Borislav Petkov:
 "Add an interface called cc_platform_has() which is supposed to be used
  by confidential computing solutions to query different aspects of the
  system.

  The intent behind it is to unify testing of such aspects instead of
  having each confidential computing solution add its own set of tests
  to code paths in the kernel, leading to an unwieldy mess"

* tag 'x86_cc_for_v5.16_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  treewide: Replace the use of mem_encrypt_active() with cc_platform_has()
  x86/sev: Replace occurrences of sev_es_active() with cc_platform_has()
  x86/sev: Replace occurrences of sev_active() with cc_platform_has()
  x86/sme: Replace occurrences of sme_active() with cc_platform_has()
  powerpc/pseries/svm: Add a powerpc version of cc_platform_has()
  x86/sev: Add an x86 version of cc_platform_has()
  arch/cc: Introduce a function to check for confidential computing features
  x86/ioremap: Selectively build arch override encryption functions
2021-11-01 15:16:52 -07:00
Tianyu Lan
810a521265 x86/hyperv: Add new hvcall guest address host visibility support
Add new hvcall guest address host visibility support to mark
memory visible to host. Call it inside set_memory_decrypted
/encrypted(). Add HYPERVISOR feature check in the
hv_is_isolation_supported() to optimize in non-virtualization
environment.

Acked-by: Dave Hansen <dave.hansen@intel.com>
Reviewed-by: Michael Kelley <mikelley@microsoft.com>
Signed-off-by: Tianyu Lan <Tianyu.Lan@microsoft.com>
Link: https://lore.kernel.org/r/20211025122116.264793-4-ltykernel@gmail.com
[ wei: fix conflicts with tip ]
Signed-off-by: Wei Liu <wei.liu@kernel.org>
2021-10-28 11:21:33 +00:00
Wei Liu
e82f2069b5 Merge remote-tracking branch 'tip/x86/cc' into hyperv-next 2021-10-28 10:46:03 +00:00
Thomas Gleixner
079ec41b22 x86/fpu: Provide a proper function for ex_handler_fprestore()
To make upcoming changes for support of dynamically enabled features
simpler, provide a proper function for the exception handler which removes
exposure of FPU internals.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20211015011540.053515012@linutronix.de
2021-10-20 15:27:29 +02:00
Thomas Gleixner
0ae67cc34f x86/fpu: Remove internal.h dependency from fpu/signal.h
In order to remove internal.h make signal.h independent of it.

Include asm/fpu/xstate.h to fix a missing update_regset_xstate_info()
prototype, which is
Reported-by: kernel test robot <lkp@intel.com>

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20211015011539.844565975@linutronix.de
2021-10-20 15:27:29 +02:00
Thomas Gleixner
a0ff0611c2 x86/fpu: Move KVMs FPU swapping to FPU core
Swapping the host/guest FPU is directly fiddling with FPU internals which
requires 5 exports. The upcoming support of dynamically enabled states
would even need more.

Implement a swap function in the FPU core code and export that instead.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Cc: kvm@vger.kernel.org
Link: https://lkml.kernel.org/r/20211015011539.076072399@linutronix.de
2021-10-20 15:27:27 +02:00
Tom Lendacky
e7d445ab26 x86/sme: Use #define USE_EARLY_PGTABLE_L5 in mem_encrypt_identity.c
When runtime support for converting between 4-level and 5-level pagetables
was added to the kernel, the SME code that built pagetables was updated
to use the pagetable functions, e.g. p4d_offset(), etc., in order to
simplify the code. However, the use of the pagetable functions in early
boot code requires the use of the USE_EARLY_PGTABLE_L5 #define in order to
ensure that the proper definition of pgtable_l5_enabled() is used.

Without the #define, pgtable_l5_enabled() is #defined as
cpu_feature_enabled(X86_FEATURE_LA57). In early boot, the CPU features
have not yet been discovered and populated, so pgtable_l5_enabled() will
return false even when 5-level paging is enabled. This causes the SME code
to always build 4-level pagetables to perform the in-place encryption.
If 5-level paging is enabled, switching to the SME pagetables results in
a page-fault that kills the boot.

Adding the #define results in pgtable_l5_enabled() using the
__pgtable_l5_enabled variable set in early boot and the SME code building
pagetables for the proper paging level.

Fixes: aad983913d ("x86/mm/encrypt: Simplify sme_populate_pgd() and sme_populate_pgd_large()")
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: <stable@vger.kernel.org> # 4.18.x
Link: https://lkml.kernel.org/r/2cb8329655f5c753905812d951e212022a480475.1634318656.git.thomas.lendacky@amd.com
2021-10-19 14:07:17 +02:00