Commit Graph

22327 Commits

Author SHA1 Message Date
Linus Torvalds
47ec5303d7 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next
Pull networking updates from David Miller:

 1) Support 6Ghz band in ath11k driver, from Rajkumar Manoharan.

 2) Support UDP segmentation in code TSO code, from Eric Dumazet.

 3) Allow flashing different flash images in cxgb4 driver, from Vishal
    Kulkarni.

 4) Add drop frames counter and flow status to tc flower offloading,
    from Po Liu.

 5) Support n-tuple filters in cxgb4, from Vishal Kulkarni.

 6) Various new indirect call avoidance, from Eric Dumazet and Brian
    Vazquez.

 7) Fix BPF verifier failures on 32-bit pointer arithmetic, from
    Yonghong Song.

 8) Support querying and setting hardware address of a port function via
    devlink, use this in mlx5, from Parav Pandit.

 9) Support hw ipsec offload on bonding slaves, from Jarod Wilson.

10) Switch qca8k driver over to phylink, from Jonathan McDowell.

11) In bpftool, show list of processes holding BPF FD references to
    maps, programs, links, and btf objects. From Andrii Nakryiko.

12) Several conversions over to generic power management, from Vaibhav
    Gupta.

13) Add support for SO_KEEPALIVE et al. to bpf_setsockopt(), from Dmitry
    Yakunin.

14) Various https url conversions, from Alexander A. Klimov.

15) Timestamping and PHC support for mscc PHY driver, from Antoine
    Tenart.

16) Support bpf iterating over tcp and udp sockets, from Yonghong Song.

17) Support 5GBASE-T i40e NICs, from Aleksandr Loktionov.

18) Add kTLS RX HW offload support to mlx5e, from Tariq Toukan.

19) Fix the ->ndo_start_xmit() return type to be netdev_tx_t in several
    drivers. From Luc Van Oostenryck.

20) XDP support for xen-netfront, from Denis Kirjanov.

21) Support receive buffer autotuning in MPTCP, from Florian Westphal.

22) Support EF100 chip in sfc driver, from Edward Cree.

23) Add XDP support to mvpp2 driver, from Matteo Croce.

24) Support MPTCP in sock_diag, from Paolo Abeni.

25) Commonize UDP tunnel offloading code by creating udp_tunnel_nic
    infrastructure, from Jakub Kicinski.

26) Several pci_ --> dma_ API conversions, from Christophe JAILLET.

27) Add FLOW_ACTION_POLICE support to mlxsw, from Ido Schimmel.

28) Add SK_LOOKUP bpf program type, from Jakub Sitnicki.

29) Refactor a lot of networking socket option handling code in order to
    avoid set_fs() calls, from Christoph Hellwig.

30) Add rfc4884 support to icmp code, from Willem de Bruijn.

31) Support TBF offload in dpaa2-eth driver, from Ioana Ciornei.

32) Support XDP_REDIRECT in qede driver, from Alexander Lobakin.

33) Support PCI relaxed ordering in mlx5 driver, from Aya Levin.

34) Support TCP syncookies in MPTCP, from Flowian Westphal.

35) Fix several tricky cases of PMTU handling wrt. briding, from Stefano
    Brivio.

* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (2056 commits)
  net: thunderx: initialize VF's mailbox mutex before first usage
  usb: hso: remove bogus check for EINPROGRESS
  usb: hso: no complaint about kmalloc failure
  hso: fix bailout in error case of probe
  ip_tunnel_core: Fix build for archs without _HAVE_ARCH_IPV6_CSUM
  selftests/net: relax cpu affinity requirement in msg_zerocopy test
  mptcp: be careful on subflow creation
  selftests: rtnetlink: make kci_test_encap() return sub-test result
  selftests: rtnetlink: correct the final return value for the test
  net: dsa: sja1105: use detected device id instead of DT one on mismatch
  tipc: set ub->ifindex for local ipv6 address
  ipv6: add ipv6_dev_find()
  net: openvswitch: silence suspicious RCU usage warning
  Revert "vxlan: fix tos value before xmit"
  ptp: only allow phase values lower than 1 period
  farsync: switch from 'pci_' to 'dma_' API
  wan: wanxl: switch from 'pci_' to 'dma_' API
  hv_netvsc: do not use VF device if link is down
  dpaa2-eth: Fix passing zero to 'PTR_ERR' warning
  net: macb: Properly handle phylink on at91sam9x
  ...
2020-08-05 20:13:21 -07:00
Linus Torvalds
fffe3ae0ee hmm related patches for 5.9
This series adds reporting of the page table order from hmm_range_fault()
 and some optimization of migrate_vma():
 
 - Report the size of the page table mapping out of hmm_range_fault(). This
   makes it easier to establish a large/huge/etc mapping in the device's
   page table.
 
 - Allow devices to ignore the invalidations during migration in cases
   where the migration is not going to change pages. For instance migrating
   pages to a device does not require the device to invalidate pages
   already in the device.
 
 - Update nouveau and hmm_tests to use the above
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEfB7FMLh+8QxL+6i3OG33FX4gmxoFAl8oocYACgkQOG33FX4g
 mxqd3Q/+OClUADmrI+EGJAPI7VD3EYfyZdnMCcp39AYNfySQPN9+fCMF5hVD5U7x
 KZVflR/zKUIZJVvdD8yAdrynZ1sHBG/HEzDyoaKcGzfCKq5LEAEnP5FG3xsiDjkO
 QX7w6qIGDz59gaeanQKNzqaR3DMpBwO/0D5/80DWXv+WgmxsAphanJYlo4eWyq4D
 EGq8EndCxairkTLpPlDHvFottL5kAKDXEinSAwWGQeZJkRY93vj+HZAQaeltmB1K
 SDdZr7lsEg2RhtRjzT7CkA2bkCERKL3xEc4VWaCAZw+qm8aeswADVOSo5E5F7DMI
 NUsB/p4GZ2CvIog/y3g/aSGluevdYJHTH8ip1BnNr2qCcXSEqHKsmyKpVNZztSUl
 uljyT17ZzTsdR4xj50tM27fzgDaavWrwFZTsJxUifuvAO9rHvGDVpaN8ZIU9iZei
 PTsGQvfoHDmWBWKX1dkIUGq+UoGwEAYRGk+XU0OYZCK97xmjRnGVoH0FTOk4DNQs
 +A0250oTOrvdSGiv0fNT5qpWpFsQ/84h8Lz6ubAD3okVo1bk9cFMe2argQl+E2qI
 TGM9ZHS8rphJNWwiPm8xrgf9eQ9bNp3ilCsIzBBpqZq8elwaL6a3ySieDPE734Ar
 FZEeEYTvj5Z/gXtyo/gxVKhltCc4U8kPqye9uexTInz4zBUUZOM=
 =omAU
 -----END PGP SIGNATURE-----

Merge tag 'for-linus-hmm' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma

Pull hmm updates from Jason Gunthorpe:
 "Ralph has been working on nouveau's use of hmm_range_fault() and
  migrate_vma() which resulted in this small series. It adds reporting
  of the page table order from hmm_range_fault() and some optimization
  of migrate_vma():

   - Report the size of the page table mapping out of hmm_range_fault().

     This makes it easier to establish a large/huge/etc mapping in the
     device's page table.

   - Allow devices to ignore the invalidations during migration in cases
     where the migration is not going to change pages.

     For instance migrating pages to a device does not require the
     device to invalidate pages already in the device.

   - Update nouveau and hmm_tests to use the above"

* tag 'for-linus-hmm' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma:
  mm/hmm/test: use the new migration invalidation
  nouveau/svm: use the new migration invalidation
  mm/notifier: add migration invalidation type
  mm/migrate: add a flags parameter to migrate_vma
  nouveau: fix storing invalid ptes
  nouveau/hmm: support mapping large sysmem pages
  nouveau: fix mapping 2MB sysmem pages
  nouveau/hmm: fault one page at a time
  mm/hmm: add tests for hmm_pfn_to_map_order()
  mm/hmm: provide the page mapping order in hmm_range_fault()
2020-08-05 13:28:50 -07:00
Linus Torvalds
2ed90dbbf7 dma-mapping updates for 5.9
- make support for dma_ops optional
  - move more code out of line
  - add generic support for a dma_ops bypass mode
  - misc cleanups
 -----BEGIN PGP SIGNATURE-----
 
 iQI/BAABCgApFiEEgdbnc3r/njty3Iq9D55TZVIEUYMFAl8oGscLHGhjaEBsc3Qu
 ZGUACgkQD55TZVIEUYNfEhAAmFwd6BBHGwAhXUchoIue5vdNnuY3GiBFRzUdz67W
 zRYYgZYiPjl+MwflRmwPcoWEnGzmweRa2s6OnyDostiCRauioa8BuQfGqJasf1yZ
 D36dFNVHGW0o6pRDUQkd688k/4A6szwuwpq83qi4e8X2I9QzAITHtW8izjfPM923
 FlJzxEFggbB2TvwfUXOZhmpuG4Dog8S7VZ1Uz4QAg0Z/5FDqIKAAG2aZMqCXBbiX
 01E8tr0AqU/jn2xpc8O+DJGFiYIRhqhyNxQbH6qz1Q3xGFSokcLYm3YqkqVOgpn1
 DLs2UFDxWkly/F+wGnYtju7OD9VGPywzOcW125/LIsApYN5R/rYrtQzK41eq7Mp5
 HY3tqgNTIMdnl4so7QXeU4Vxj+lUdPlI26NZGszcM5AVftdTX8KjGdS+0+PBza6i
 i7trwG7J5/DnwiBCvEKoul7Ul1psUMTSvYwINTXRqsU4mZXhhx/mwyXbtruELnkj
 3agM98u6hoalLNjd2aueh+NjMZi1r+MchTrfRvTcxJ+yQ5BoR5kF+iz7eT/LtZ72
 AqWwimsPGNkLHUa0TrqWql5tv90cdDkBZzWXVbixwxRfgynWYLE6jugeIy8hwjFf
 GjO5XKbBwnWPjdSzFsVMPeuNpmr7ZjVHHewy2Q/jWQAIOyeof0VztEl23LN5yUkx
 pc8=
 =90UK
 -----END PGP SIGNATURE-----

Merge tag 'dma-mapping-5.9' of git://git.infradead.org/users/hch/dma-mapping

Pull dma-mapping updates from Christoph Hellwig:

 - make support for dma_ops optional

 - move more code out of line

 - add generic support for a dma_ops bypass mode

 - misc cleanups

* tag 'dma-mapping-5.9' of git://git.infradead.org/users/hch/dma-mapping:
  dma-contiguous: cleanup dma_alloc_contiguous
  dma-debug: use named initializers for dir2name
  powerpc: use the generic dma_ops_bypass mode
  dma-mapping: add a dma_ops_bypass flag to struct device
  dma-mapping: make support for dma ops optional
  dma-mapping: inline the fast path dma-direct calls
  dma-mapping: move the remaining DMA API calls out of line
2020-08-04 17:29:57 -07:00
Linus Torvalds
4f30a60aa7 close-range-v5.9
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCXygcpgAKCRCRxhvAZXjc
 ogPeAQDv1ncqtNroFAC4pJ4tQhH7JSjW0OltiMk/AocY/J2SdQD9GJ15luYJ0/om
 697q/Z68sndRynhdoZlMuf3oYuBlHQw=
 =3ZhE
 -----END PGP SIGNATURE-----

Merge tag 'close-range-v5.9' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux

Pull close_range() implementation from Christian Brauner:
 "This adds the close_range() syscall. It allows to efficiently close a
  range of file descriptors up to all file descriptors of a calling
  task.

  This is coordinated with the FreeBSD folks which have copied our
  version of this syscall and in the meantime have already merged it in
  April 2019:

    https://reviews.freebsd.org/D21627
    https://svnweb.freebsd.org/base?view=revision&revision=359836

  The syscall originally came up in a discussion around the new mount
  API and making new file descriptor types cloexec by default. During
  this discussion, Al suggested the close_range() syscall.

  First, it helps to close all file descriptors of an exec()ing task.
  This can be done safely via (quoting Al's example from [1] verbatim):

        /* that exec is sensitive */
        unshare(CLONE_FILES);
        /* we don't want anything past stderr here */
        close_range(3, ~0U);
        execve(....);

  The code snippet above is one way of working around the problem that
  file descriptors are not cloexec by default. This is aggravated by the
  fact that we can't just switch them over without massively regressing
  userspace. For a whole class of programs having an in-kernel method of
  closing all file descriptors is very helpful (e.g. demons, service
  managers, programming language standard libraries, container managers
  etc.).

  Second, it allows userspace to avoid implementing closing all file
  descriptors by parsing through /proc/<pid>/fd/* and calling close() on
  each file descriptor and other hacks. From looking at various
  large(ish) userspace code bases this or similar patterns are very
  common in service managers, container runtimes, and programming
  language runtimes/standard libraries such as Python or Rust.

  In addition, the syscall will also work for tasks that do not have
  procfs mounted and on kernels that do not have procfs support compiled
  in. In such situations the only way to make sure that all file
  descriptors are closed is to call close() on each file descriptor up
  to UINT_MAX or RLIMIT_NOFILE, OPEN_MAX trickery.

  Based on Linus' suggestion close_range() also comes with a new flag
  CLOSE_RANGE_UNSHARE to more elegantly handle file descriptor dropping
  right before exec. This would usually be expressed in the sequence:

        unshare(CLONE_FILES);
        close_range(3, ~0U);

  as pointed out by Linus it might be desirable to have this be a part
  of close_range() itself under a new flag CLOSE_RANGE_UNSHARE which
  gets especially handy when we're closing all file descriptors above a
  certain threshold.

  Test-suite as always included"

* tag 'close-range-v5.9' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
  tests: add CLOSE_RANGE_UNSHARE tests
  close_range: add CLOSE_RANGE_UNSHARE
  tests: add close_range() tests
  arch: wire-up close_range()
  open: add close_range()
2020-08-04 15:12:02 -07:00
Linus Torvalds
9ba27414f2 fork-v5.9
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCXyge/QAKCRCRxhvAZXjc
 oildAQCCWpnTeXm6hrIE3VZ36X5npFtbaEthdBVAUJM7mo0FYwEA8+Wbnubg6jCw
 mztkXCnTfU7tApUdhKtQzcpEws45/Qk=
 =REE/
 -----END PGP SIGNATURE-----

Merge tag 'fork-v5.9' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux

Pull fork cleanups from Christian Brauner:
 "This is cleanup series from when we reworked a chunk of the process
  creation paths in the kernel and switched to struct
  {kernel_}clone_args.

  High-level this does two main things:

   - Remove the double export of both do_fork() and _do_fork() where
     do_fork() used the incosistent legacy clone calling convention.

     Now we only export _do_fork() which is based on struct
     kernel_clone_args.

   - Remove the copy_thread_tls()/copy_thread() split making the
     architecture specific HAVE_COYP_THREAD_TLS config option obsolete.

  This switches all remaining architectures to select
  HAVE_COPY_THREAD_TLS and thus to the copy_thread_tls() calling
  convention. The current split makes the process creation codepaths
  more convoluted than they need to be. Each architecture has their own
  copy_thread() function unless it selects HAVE_COPY_THREAD_TLS then it
  has a copy_thread_tls() function.

  The split is not needed anymore nowadays, all architectures support
  CLONE_SETTLS but quite a few of them never bothered to select
  HAVE_COPY_THREAD_TLS and instead simply continued to use copy_thread()
  and use the old calling convention. Removing this split cleans up the
  process creation codepaths and paves the way for implementing clone3()
  on such architectures since it requires the copy_thread_tls() calling
  convention.

  After having made each architectures support copy_thread_tls() this
  series simply renames that function back to copy_thread(). It also
  switches all architectures that call do_fork() directly over to
  _do_fork() and the struct kernel_clone_args calling convention. This
  is a corollary of switching the architectures that did not yet support
  it over to copy_thread_tls() since do_fork() is conditional on not
  supporting copy_thread_tls() (Mostly because it lacks a separate
  argument for tls which is trivial to fix but there's no need for this
  function to exist.).

  The do_fork() removal is in itself already useful as it allows to to
  remove the export of both do_fork() and _do_fork() we currently have
  in favor of only _do_fork(). This has already been discussed back when
  we added clone3(). The legacy clone() calling convention is - as is
  probably well-known - somewhat odd:

    #
    # ABI hall of shame
    #
    config CLONE_BACKWARDS
    config CLONE_BACKWARDS2
    config CLONE_BACKWARDS3

  that is aggravated by the fact that some architectures such as sparc
  follow the CLONE_BACKWARDSx calling convention but don't really select
  the corresponding config option since they call do_fork() directly.

  So do_fork() enforces a somewhat arbitrary calling convention in the
  first place that doesn't really help the individual architectures that
  deviate from it. They can thus simply be switched to _do_fork()
  enforcing a single calling convention. (I really hope that any new
  architectures will __not__ try to implement their own calling
  conventions...)

  Most architectures already have made a similar switch (m68k comes to
  mind).

  Overall this removes more code than it adds even with a good portion
  of added comments. It simplifies a chunk of arch specific assembly
  either by moving the code into C or by simply rewriting the assembly.

  Architectures that have been touched in non-trivial ways have all been
  actually boot and stress tested: sparc and ia64 have been tested with
  Debian 9 images. They are the two architectures which have been
  touched the most. All non-trivial changes to architectures have seen
  acks from the relevant maintainers. nios2 with a custom built
  buildroot image. h8300 I couldn't get something bootable to test on
  but the changes have been fairly automatic and I'm sure we'll hear
  people yell if I broke something there.

  All other architectures that have been touched in trivial ways have
  been compile tested for each single patch of the series via git rebase
  -x "make ..." v5.8-rc2. arm{64} and x86{_64} have been boot tested
  even though they have just been trivially touched (removal of the
  HAVE_COPY_THREAD_TLS macro from their Kconfig) because well they are
  basically "core architectures" and since it is trivial to get your
  hands on a useable image"

* tag 'fork-v5.9' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
  arch: rename copy_thread_tls() back to copy_thread()
  arch: remove HAVE_COPY_THREAD_TLS
  unicore: switch to copy_thread_tls()
  sh: switch to copy_thread_tls()
  nds32: switch to copy_thread_tls()
  microblaze: switch to copy_thread_tls()
  hexagon: switch to copy_thread_tls()
  c6x: switch to copy_thread_tls()
  alpha: switch to copy_thread_tls()
  fork: remove do_fork()
  h8300: select HAVE_COPY_THREAD_TLS, switch to kernel_clone_args
  nios2: enable HAVE_COPY_THREAD_TLS, switch to kernel_clone_args
  ia64: enable HAVE_COPY_THREAD_TLS, switch to kernel_clone_args
  sparc: unconditionally enable HAVE_COPY_THREAD_TLS
  sparc: share process creation helpers between sparc and sparc64
  sparc64: enable HAVE_COPY_THREAD_TLS
  fork: fold legacy_clone_args_valid() into _do_fork()
2020-08-04 14:47:45 -07:00
Linus Torvalds
99ea1521a0 Remove uninitialized_var() macro for v5.9-rc1
- Clean up non-trivial uses of uninitialized_var()
 - Update documentation and checkpatch for uninitialized_var() removal
 - Treewide removal of uninitialized_var()
 -----BEGIN PGP SIGNATURE-----
 
 iQJKBAABCgA0FiEEpcP2jyKd1g9yPm4TiXL039xtwCYFAl8oYLQWHGtlZXNjb29r
 QGNocm9taXVtLm9yZwAKCRCJcvTf3G3AJsfjEACvf0D3WL3H7sLHtZ2HeMwOgAzq
 il08t6vUscINQwiIIK3Be43ok3uQ1Q+bj8sr2gSYTwunV2IYHFferzgzhyMMno3o
 XBIGd1E+v1E4DGBOiRXJvacBivKrfvrdZ7AWiGlVBKfg2E0fL1aQbe9AYJ6eJSbp
 UGqkBkE207dugS5SQcwrlk1tWKUL089lhDAPd7iy/5RK76OsLRCJFzIerLHF2ZK2
 BwvA+NWXVQI6pNZ0aRtEtbbxwEU4X+2J/uaXH5kJDszMwRrgBT2qoedVu5LXFPi8
 +B84IzM2lii1HAFbrFlRyL/EMueVFzieN40EOB6O8wt60Y4iCy5wOUzAdZwFuSTI
 h0xT3JI8BWtpB3W+ryas9cl9GoOHHtPA8dShuV+Y+Q2bWe1Fs6kTl2Z4m4zKq56z
 63wQCdveFOkqiCLZb8s6FhnS11wKtAX4czvXRXaUPgdVQS1Ibyba851CRHIEY+9I
 AbtogoPN8FXzLsJn7pIxHR4ADz+eZ0dQ18f2hhQpP6/co65bYizNP5H3h+t9hGHG
 k3r2k8T+jpFPaddpZMvRvIVD8O2HvJZQTyY6Vvneuv6pnQWtr2DqPFn2YooRnzoa
 dbBMtpon+vYz6OWokC5QNWLqHWqvY9TmMfcVFUXE4AFse8vh4wJ8jJCNOFVp8On+
 drhmmImUr1YylrtVOw==
 =xHmk
 -----END PGP SIGNATURE-----

Merge tag 'uninit-macro-v5.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux

Pull uninitialized_var() macro removal from Kees Cook:
 "This is long overdue, and has hidden too many bugs over the years. The
  series has several "by hand" fixes, and then a trivial treewide
  replacement.

   - Clean up non-trivial uses of uninitialized_var()

   - Update documentation and checkpatch for uninitialized_var() removal

   - Treewide removal of uninitialized_var()"

* tag 'uninit-macro-v5.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
  compiler: Remove uninitialized_var() macro
  treewide: Remove uninitialized_var() usage
  checkpatch: Remove awareness of uninitialized_var() macro
  mm/debug_vm_pgtable: Remove uninitialized_var() usage
  f2fs: Eliminate usage of uninitialized_var() macro
  media: sur40: Remove uninitialized_var() usage
  KVM: PPC: Book3S PR: Remove uninitialized_var() usage
  clk: spear: Remove uninitialized_var() usage
  clk: st: Remove uninitialized_var() usage
  spi: davinci: Remove uninitialized_var() usage
  ide: Remove uninitialized_var() usage
  rtlwifi: rtl8192cu: Remove uninitialized_var() usage
  b43: Remove uninitialized_var() usage
  drbd: Remove uninitialized_var() usage
  x86/mm/numa: Remove uninitialized_var() usage
  docs: deprecated.rst: Add uninitialized_var()
2020-08-04 13:49:43 -07:00
Michael Ellerman
0c83b277ad powerpc: Fix circular dependency between percpu.h and mmu.h
Recently random.h started including percpu.h (see commit
f227e3ec3b ("random32: update the net random state on interrupt and
activity")), which broke corenet64_smp_defconfig:

  In file included from /linux/arch/powerpc/include/asm/paca.h:18,
                   from /linux/arch/powerpc/include/asm/percpu.h:13,
                   from /linux/include/linux/random.h:14,
                   from /linux/lib/uuid.c:14:
  /linux/arch/powerpc/include/asm/mmu.h:139:22: error: unknown type name 'next_tlbcam_idx'
    139 | DECLARE_PER_CPU(int, next_tlbcam_idx);

This is due to a circular header dependency:
  asm/mmu.h includes asm/percpu.h, which includes asm/paca.h, which
  includes asm/mmu.h

Which means DECLARE_PER_CPU() isn't defined when mmu.h needs it.

We can fix it by moving the include of paca.h below the include of
asm-generic/percpu.h.

This moves the include of paca.h out of the #ifdef __powerpc64__, but
that is OK because paca.h is almost entirely inside #ifdef
CONFIG_PPC64 anyway.

It also moves the include of paca.h out of the #ifdef CONFIG_SMP,
which could possibly break something, but seems to have no ill
effects.

Fixes: f227e3ec3b ("random32: update the net random state on interrupt and activity")
Cc: stable@vger.kernel.org # v5.8
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200804130558.292328-1-mpe@ellerman.id.au
2020-08-04 23:15:59 +10:00
Linus Torvalds
0408497800 Power management updates for 5.9-rc1
- Make the Energy Model cover non-CPU devices (Lukasz Luba).
 
  - Add Ice Lake server idle states table to the intel_idle driver
    and eliminate a redundant static variable from it (Chen Yu,
    Rafael Wysocki).
 
  - Eliminate all W=1 build warnings from cpufreq (Lee Jones).
 
  - Add support for Sapphire Rapids and for Power Limit 4 to the
    Intel RAPL power capping driver (Sumeet Pawnikar, Zhang Rui).
 
  - Fix function name in kerneldoc comments in the idle_inject power
    capping driver (Yangtao Li).
 
  - Fix locking issues with cpufreq governors and drop a redundant
    "weak" function definition from cpufreq (Viresh Kumar).
 
  - Rearrange cpufreq to register non-modular governors at the
    core_initcall level and allow the default cpufreq governor to
    be specified in the kernel command line (Quentin Perret).
 
  - Extend, fix and clean up the intel_pstate driver (Srinivas
    Pandruvada, Rafael Wysocki):
 
    * Add a new sysfs attribute for disabling/enabling CPU
      energy-efficiency optimizations in the processor.
 
    * Make the driver avoid enabling HWP if EPP is not supported.
 
    * Allow the driver to handle numeric EPP values in the sysfs
      interface and fix the setting of EPP via sysfs in the active
      mode.
 
    * Eliminate a static checker warning and clean up a kerneldoc
      comment.
 
  - Clean up some variable declarations in the powernv cpufreq
    driver (Wei Yongjun).
 
  - Fix up the ->enter_s2idle callback definition to cover the case
    when it points to the same function as ->idle correctly (Neal
    Liu).
 
  - Rearrange and clean up the PSCI cpuidle driver (Ulf Hansson).
 
  - Make the PM core emit "changed" uevent when adding/removing the
    "wakeup" sysfs attribute of devices (Abhishek Pandit-Subedi).
 
  - Add a helper macro for declaring PM callbacks and use it in the
    MMC jz4740 driver (Paul Cercueil).
 
  - Fix white space in some places in the hibernate code and make the
    system-wide PM code use "const char *" where appropriate (Xiang
    Chen, Alexey Dobriyan).
 
  - Add one more "unsafe" helper macro to the freezer to cover the NFS
    use case (He Zhe).
 
  - Change the language in the generic PM domains framework to use
    parent/child terminology and clean up a typo and some comment
    fromatting in that code (Kees Cook, Geert Uytterhoeven).
 
  - Update the operating performance points OPP framework (Lukasz
    Luba, Andrew-sh.Cheng, Valdis Kletnieks):
 
    * Refactor dev_pm_opp_of_register_em() and update related drivers.
 
    * Add a missing function export.
 
    * Allow disabled OPPs in dev_pm_opp_get_freq().
 
  - Update devfreq core and drivers (Chanwoo Choi, Lukasz Luba, Enric
    Balletbo i Serra, Dmitry Osipenko, Kieran Bingham, Marc Zyngier):
 
    * Add support for delayed timers to the devfreq core and make the
      Samsung exynos5422-dmc driver use it.
 
    * Unify sysfs interface to use "df-" as a prefix in instance names
      consistently.
 
    * Fix devfreq_summary debugfs node indentation.
 
    * Add the rockchip,pmu phandle to the rk3399_dmc driver DT
      bindings.
 
    * List Dmitry Osipenko as the Tegra devfreq driver maintainer.
 
    * Fix typos in the core devfreq code.
 
  - Update the pm-graph utility to version 5.7 including a number of
    fixes related to suspend-to-idle (Todd Brandt).
 
  - Fix coccicheck errors and warnings in the cpupower utility (Shuah
    Khan).
 
  - Replace HTTP links with HTTPs ones in multiple places (Alexander
    A. Klimov).
 -----BEGIN PGP SIGNATURE-----
 
 iQJGBAABCAAwFiEE4fcc61cGeeHD/fCwgsRv/nhiVHEFAl8oO24SHHJqd0Byand5
 c29ja2kubmV0AAoJEILEb/54YlRx7ZQP/0lQ0yABnASnwomdOH6+K/m7rvc+e9FE
 zx5pTDQswhU5tM7SQAIKqe0uSI+okF2UrBrT5onA16F+JUbnrbexJLazBPfVTTGF
 AKpKEQ7Wh69Wz+Y6cQZjm1dTuRL+dlBJuBrzR2tLSnONPMMHuFcO3xd7lgE9UAxC
 oGEf393taA6OqcUNRQIa2gqbq+k1qhKjeDucGkbOaoJ6CL0ZyWI+Tfw1WWaBBGv0
 /2wBd6V513OH8WtQCW6H3YpHmhYW6OwL8w19KyGcjPRGJaeaIP4W/Ng7mkvgL5ZB
 vZqg3XiufFV9uTe8W1NQaVv/NjlN256OteuK809aosTVjD0dhFkhBYg5TLu6HbQq
 C/NciZ+78oLedWLT73EUfw3NyS+V0jk6X2EIlBUwNi0Qw1B1pCifGOCKzWFFe5cr
 ci4xr4FG7dBkxScOxwFAU2s5TdPHLOkGkQtg4jZr0OYDrzkyLEdsnZEUjLPORo+0
 6EBXGfTOSy2CBHcYswRtzJr/1pUTzj7oejhTAMCCuYW2r3VyQtnYcVjlehtp20if
 6BfmGisk8nmtxlSm+/Y2FqKa4bNnSTMmr0UJQ+Rjp0tHs47QeucI0ORfZ5nPaBac
 +ptvIjWmn3xejT/+oAehpH9066Iuy66vzHdnj7x5+WAsmYS8n8OFtlBFkYELmLJB
 3xI5hIl7WtGo
 =8cUO
 -----END PGP SIGNATURE-----

Merge tag 'pm-5.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull power management updates from Rafael Wysocki:
 "The most significant change here is the extension of the Energy Model
  to cover non-CPU devices (as well as CPUs) from Lukasz Luba.

  There is also some new hardware support (Ice Lake server idle states
  table for intel_idle, Sapphire Rapids and Power Limit 4 support in the
  RAPL driver), some new functionality in the existing drivers (eg. a
  new switch to disable/enable CPU energy-efficiency optimizations in
  intel_pstate, delayed timers in devfreq), some assorted fixes (cpufreq
  core, intel_pstate, intel_idle) and cleanups (eg. cpuidle-psci,
  devfreq), including the elimination of W=1 build warnings from cpufreq
  done by Lee Jones.

  Specifics:

   - Make the Energy Model cover non-CPU devices (Lukasz Luba).

   - Add Ice Lake server idle states table to the intel_idle driver and
     eliminate a redundant static variable from it (Chen Yu, Rafael
     Wysocki).

   - Eliminate all W=1 build warnings from cpufreq (Lee Jones).

   - Add support for Sapphire Rapids and for Power Limit 4 to the Intel
     RAPL power capping driver (Sumeet Pawnikar, Zhang Rui).

   - Fix function name in kerneldoc comments in the idle_inject power
     capping driver (Yangtao Li).

   - Fix locking issues with cpufreq governors and drop a redundant
     "weak" function definition from cpufreq (Viresh Kumar).

   - Rearrange cpufreq to register non-modular governors at the
     core_initcall level and allow the default cpufreq governor to be
     specified in the kernel command line (Quentin Perret).

   - Extend, fix and clean up the intel_pstate driver (Srinivas
     Pandruvada, Rafael Wysocki):

       * Add a new sysfs attribute for disabling/enabling CPU
         energy-efficiency optimizations in the processor.

       * Make the driver avoid enabling HWP if EPP is not supported.

       * Allow the driver to handle numeric EPP values in the sysfs
         interface and fix the setting of EPP via sysfs in the active
         mode.

       * Eliminate a static checker warning and clean up a kerneldoc
         comment.

   - Clean up some variable declarations in the powernv cpufreq driver
     (Wei Yongjun).

   - Fix up the ->enter_s2idle callback definition to cover the case
     when it points to the same function as ->idle correctly (Neal Liu).

   - Rearrange and clean up the PSCI cpuidle driver (Ulf Hansson).

   - Make the PM core emit "changed" uevent when adding/removing the
     "wakeup" sysfs attribute of devices (Abhishek Pandit-Subedi).

   - Add a helper macro for declaring PM callbacks and use it in the MMC
     jz4740 driver (Paul Cercueil).

   - Fix white space in some places in the hibernate code and make the
     system-wide PM code use "const char *" where appropriate (Xiang
     Chen, Alexey Dobriyan).

   - Add one more "unsafe" helper macro to the freezer to cover the NFS
     use case (He Zhe).

   - Change the language in the generic PM domains framework to use
     parent/child terminology and clean up a typo and some comment
     fromatting in that code (Kees Cook, Geert Uytterhoeven).

   - Update the operating performance points OPP framework (Lukasz Luba,
     Andrew-sh.Cheng, Valdis Kletnieks):

       * Refactor dev_pm_opp_of_register_em() and update related drivers.

       * Add a missing function export.

       * Allow disabled OPPs in dev_pm_opp_get_freq().

   - Update devfreq core and drivers (Chanwoo Choi, Lukasz Luba, Enric
     Balletbo i Serra, Dmitry Osipenko, Kieran Bingham, Marc Zyngier):

       * Add support for delayed timers to the devfreq core and make the
         Samsung exynos5422-dmc driver use it.

       * Unify sysfs interface to use "df-" as a prefix in instance
         names consistently.

       * Fix devfreq_summary debugfs node indentation.

       * Add the rockchip,pmu phandle to the rk3399_dmc driver DT
         bindings.

       * List Dmitry Osipenko as the Tegra devfreq driver maintainer.

       * Fix typos in the core devfreq code.

   - Update the pm-graph utility to version 5.7 including a number of
     fixes related to suspend-to-idle (Todd Brandt).

   - Fix coccicheck errors and warnings in the cpupower utility (Shuah
     Khan).

   - Replace HTTP links with HTTPs ones in multiple places (Alexander A.
     Klimov)"

* tag 'pm-5.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (71 commits)
  cpuidle: ACPI: fix 'return' with no value build warning
  cpufreq: intel_pstate: Fix EPP setting via sysfs in active mode
  cpufreq: intel_pstate: Rearrange the storing of new EPP values
  intel_idle: Customize IceLake server support
  PM / devfreq: Fix the wrong end with semicolon
  PM / devfreq: Fix indentaion of devfreq_summary debugfs node
  PM / devfreq: Clean up the devfreq instance name in sysfs attr
  memory: samsung: exynos5422-dmc: Add module param to control IRQ mode
  memory: samsung: exynos5422-dmc: Adjust polling interval and uptreshold
  memory: samsung: exynos5422-dmc: Use delayed timer as default
  PM / devfreq: Add support delayed timer for polling mode
  dt-bindings: devfreq: rk3399_dmc: Add rockchip,pmu phandle
  PM / devfreq: tegra: Add Dmitry as a maintainer
  PM / devfreq: event: Fix trivial spelling
  PM / devfreq: rk3399_dmc: Fix kernel oops when rockchip,pmu is absent
  cpuidle: change enter_s2idle() prototype
  cpuidle: psci: Prevent domain idlestates until consumers are ready
  cpuidle: psci: Convert PM domain to platform driver
  cpuidle: psci: Fix error path via converting to a platform driver
  cpuidle: psci: Fail cpuidle registration if set OSI mode failed
  ...
2020-08-03 20:28:08 -07:00
Linus Torvalds
9ba19ccd2d These were the main changes in this cycle:
- LKMM updates: mostly documentation changes, but also some new litmus tests for atomic ops.
 
  - KCSAN updates: the most important change is that GCC 11 now has all fixes in place
                   to support KCSAN, so GCC support can be enabled again. Also more annotations.
 
  - futex updates: minor cleanups and simplifications
 
  - seqlock updates: merge preparatory changes/cleanups for the 'associated locks' facilities.
 
  - lockdep updates:
     - simplify IRQ trace event handling
     - add various new debug checks
     - simplify header dependencies, split out <linux/lockdep_types.h>, decouple
       lockdep from other low level headers some more
     - fix NMI handling
 
  - misc cleanups and smaller fixes
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAl8n9/wRHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1hZFQ//dD+AKw9Nym+WbylovmeD0qxWxPyeN/jG
 vBVDTOJIJLtZTkZf6YHcYOJlPwaMDYUQluqTPQhsaQZy/NoEb5NM2cFAj2R9gjyT
 O8665T1dvhW9Sh353mBpuwviqdrnvCeHTBEcglSlFY7hxToYAflUN0+DXGVtNys8
 PFNf3L9SHT0GLVC8+di/eJzQaRqxiB0Pq7kvh2RvPJM/dcQNA9Ho3CCNO5j6qGoY
 u7OnMT8xJXkgbdjjUO4RO0v9VjMuNthZ2JiONDgvgKtJfIL2wt5YXIv1EYX0GuWp
 WZgIzE4o1G7GJOOzKpFfZFyK8grHu2fWgK1plvodWjlLkBmltJZ1qyOM+wngd/m2
 TgtPo73/YFbxFUbbBpkb0eiIaH2t99kMvfCWd05+GiPCtzn9UL9GfFRWd42vonwc
 sQWjFrHKlnuzifUfNcLmKg7R2nUtF3Dm/SydiTJ+9NtH/QA17YJKWnlE1moulNtQ
 p7H7+8UdcvSQ7F38A74v2IYNIyDsv5qcE8ar4QHdaanBBX/LCyD0UlfgsgxEReXf
 GDKkpx7LFQlI6Y2YB+dZgkCwhNBl3/OQ3v6hC95B37fA67dAIQyPIWHiHbaM+029
 gghqU4GcUcbjSnHPzl9PPL+hi9MyXrMjpb7CBXytg4NI4EE1waHR+0kX14V8ndRj
 MkWQOKPUgB0=
 =3MTT
 -----END PGP SIGNATURE-----

Merge tag 'locking-core-2020-08-03' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull locking updates from Ingo Molnar:

 - LKMM updates: mostly documentation changes, but also some new litmus
   tests for atomic ops.

 - KCSAN updates: the most important change is that GCC 11 now has all
   fixes in place to support KCSAN, so GCC support can be enabled again.
   Also more annotations.

 - futex updates: minor cleanups and simplifications

 - seqlock updates: merge preparatory changes/cleanups for the
   'associated locks' facilities.

 - lockdep updates:
    - simplify IRQ trace event handling
    - add various new debug checks
    - simplify header dependencies, split out <linux/lockdep_types.h>,
      decouple lockdep from other low level headers some more
    - fix NMI handling

 - misc cleanups and smaller fixes

* tag 'locking-core-2020-08-03' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (60 commits)
  kcsan: Improve IRQ state trace reporting
  lockdep: Refactor IRQ trace events fields into struct
  seqlock: lockdep assert non-preemptibility on seqcount_t write
  lockdep: Add preemption enabled/disabled assertion APIs
  seqlock: Implement raw_seqcount_begin() in terms of raw_read_seqcount()
  seqlock: Add kernel-doc for seqcount_t and seqlock_t APIs
  seqlock: Reorder seqcount_t and seqlock_t API definitions
  seqlock: seqcount_t latch: End read sections with read_seqcount_retry()
  seqlock: Properly format kernel-doc code samples
  Documentation: locking: Describe seqlock design and usage
  locking/qspinlock: Do not include atomic.h from qspinlock_types.h
  locking/atomic: Move ATOMIC_INIT into linux/types.h
  lockdep: Move list.h inclusion into lockdep.h
  locking/lockdep: Fix TRACE_IRQFLAGS vs. NMIs
  futex: Remove unused or redundant includes
  futex: Consistently use fshared as boolean
  futex: Remove needless goto's
  futex: Remove put_futex_key()
  rwsem: fix commas in initialisation
  docs: locking: Replace HTTP links with HTTPS ones
  ...
2020-08-03 14:39:35 -07:00
Oliver O'Halloran
2075ec9896 powerpc/powernv/sriov: Fix use of uninitialised variable
Initialising the value before using it is generally regarded as a good
idea so do that.

Fixes: 4c51f3e1e8 ("powerpc/powernv/sriov: Make single PE mode a per-BAR setting")
Reported-by: Nathan Chancellor <natechancellor@gmail.com>
Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200803075408.132601-1-oohall@gmail.com
2020-08-03 22:13:13 +10:00
Michael Ellerman
8d8a629d00 powerpc/40x: Fix assembler warning about r0
The assembler says:
  arch/powerpc/kernel/head_40x.S:623: Warning: invalid register expression

It's objecting to the use of r0 as the RA argument. That's because
when RA = 0 the literal value 0 is used, rather than the content of
r0, making the use of r0 in the source potentially confusing.

Fix it to use a literal 0, the generated code is identical.

Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200722022422.825197-1-mpe@ellerman.id.au
2020-08-03 22:12:41 +10:00
Rafael J. Wysocki
c81b30c895 Merge branch 'pm-cpufreq'
* pm-cpufreq: (24 commits)
  cpufreq: intel_pstate: Fix EPP setting via sysfs in active mode
  cpufreq: intel_pstate: Rearrange the storing of new EPP values
  cpufreq: intel_pstate: Avoid enabling HWP if EPP is not supported
  cpufreq: intel_pstate: Clean up aperf_mperf_shift description
  cpufreq: powernv: Make some symbols static
  cpufreq: amd_freq_sensitivity: Mark sometimes used ID structs as __maybe_unused
  cpufreq: intel_pstate: Supply struct attribute description for get_aperf_mperf_shift()
  cpufreq: pcc-cpufreq: Mark sometimes used ID structs as __maybe_unused
  cpufreq: powernow-k8: Mark 'hi' and 'lo' dummy variables as __always_unused
  cpufreq: acpi-cpufreq: Mark sometimes used ID structs as __maybe_unused
  cpufreq: acpi-cpufreq: Mark 'dummy' variable as __always_unused
  cpufreq: powernv-cpufreq: Fix a bunch of kerneldoc related issues
  cpufreq: pasemi: Include header file for {check,restore}_astate prototypes
  cpufreq: cpufreq_governor: Demote store_sampling_rate() header to standard comment block
  cpufreq: cpufreq: Demote lots of function headers unworthy of kerneldoc status
  cpufreq: freq_table: Demote obvious misuse of kerneldoc to standard comment blocks
  cpufreq: Replace HTTP links with HTTPS ones
  cpufreq: intel_pstate: Fix static checker warning for epp variable
  cpufreq: Remove the weakly defined cpufreq_default_governor()
  cpufreq: Specify default governor on command line
  ...
2020-08-03 13:12:36 +02:00
David S. Miller
bd0b33b248 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Resolved kernel/bpf/btf.c using instructions from merge commit
69138b34a7

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-08-02 01:02:12 -07:00
Vaibhav Jain
af0870c4e7 powerpc/papr_scm: Add support for fetching nvdimm 'fuel-gauge' metric
We add support for reporting 'fuel-gauge' NVDIMM metric via
PAPR_PDSM_HEALTH pdsm payload. 'fuel-gauge' metric indicates the usage
life remaining of a papr-scm compatible NVDIMM. PHYP exposes this
metric via the H_SCM_PERFORMANCE_STATS.

The metric value is returned from the pdsm by extending the return
payload 'struct nd_papr_pdsm_health' without breaking the ABI. A new
field 'dimm_fuel_gauge' to hold the metric value is introduced at the
end of the payload struct and its presence is indicated by by
extension flag PDSM_DIMM_HEALTH_RUN_GAUGE_VALID.

The patch introduces a new function papr_pdsm_fuel_gauge() that is
called from papr_pdsm_health(). If fetching NVDIMM performance stats
is supported then 'papr_pdsm_fuel_gauge()' allocated an output buffer
large enough to hold the performance stat and passes it to
drc_pmem_query_stats() that issues the HCALL to PHYP. The return value
of the stat is then populated in the 'struct
nd_papr_pdsm_health.dimm_fuel_gauge' field with extension flag
'PDSM_DIMM_HEALTH_RUN_GAUGE_VALID' set in 'struct
nd_papr_pdsm_health.extension_flags'

Signed-off-by: Vaibhav Jain <vaibhav@linux.ibm.com>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200731064153.182203-3-vaibhav@linux.ibm.com
2020-07-31 22:55:28 +10:00
Vaibhav Jain
2d02bf835e powerpc/papr_scm: Fetch nvdimm performance stats from PHYP
Update papr_scm.c to query dimm performance statistics from PHYP via
H_SCM_PERFORMANCE_STATS hcall and export them to user-space as PAPR
specific NVDIMM attribute 'perf_stats' in sysfs. The patch also
provide a sysfs ABI documentation for the stats being reported and
their meanings.

During NVDIMM probe time in papr_scm_nvdimm_init() a special variant
of H_SCM_PERFORMANCE_STATS hcall is issued to check if collection of
performance statistics is supported or not. If successful then a PHYP
returns a maximum possible buffer length needed to read all
performance stats. This returned value is stored in a per-nvdimm
attribute 'stat_buffer_len'.

The layout of request buffer for reading NVDIMM performance stats from
PHYP is defined in 'struct papr_scm_perf_stats' and 'struct
papr_scm_perf_stat'. These structs are used in newly introduced
drc_pmem_query_stats() that issues the H_SCM_PERFORMANCE_STATS hcall.

The sysfs access function perf_stats_show() uses value
'stat_buffer_len' to allocate a buffer large enough to hold all
possible NVDIMM performance stats and passes it to
drc_pmem_query_stats() to populate. Finally statistics reported in the
buffer are formatted into the sysfs access function output buffer.

Signed-off-by: Vaibhav Jain <vaibhav@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200731064153.182203-2-vaibhav@linux.ibm.com
2020-07-31 22:55:27 +10:00
Michal Suchanek
d3a133aa0e powerpc/perf: Consolidate perf_callchain_user_[64|32]()
perf_callchain_user_64() and perf_callchain_user_32() are nearly
identical. Consolidate into one function with thin wrappers.

Suggested-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michal Suchanek <msuchanek@suse.de>
[mpe: Adapt to copy_from_user_nofault(), minor formatting]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200406210022.32265-1-msuchanek@suse.de
2020-07-30 22:53:49 +10:00
Nathan Lynch
a0ff72f9f5 powerpc/pseries/hotplug-cpu: Remove double free in error path
In the unlikely event that the device tree lacks a /cpus node,
find_dlpar_cpus_to_add() oddly frees the cpu_drcs buffer it has been
passed before returning an error. Its only caller also frees the
buffer on error.

Remove the less conventional kfree() of a caller-supplied buffer from
find_dlpar_cpus_to_add().

Fixes: 90edf184b9 ("powerpc/pseries: Add CPU dlpar add functionality")
Signed-off-by: Nathan Lynch <nathanl@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190919231633.1344-1-nathanl@linux.ibm.com
2020-07-30 22:53:49 +10:00
Nathan Lynch
5d8b1f9dea powerpc/pseries/mobility: Add pr_debug() for device tree changes
When investigating issues with partition migration or resource
reassignments it is helpful to have a log of which nodes and
properties in the device tree have changed. Use pr_debug() so it's
easy to enable these at runtime with the dynamic debug facility.

Signed-off-by: Nathan Lynch <nathanl@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190627053044.9238-3-nathanl@linux.ibm.com
2020-07-30 22:53:49 +10:00
Nathan Lynch
494a66f34e powerpc/pseries/mobility: Set pr_fmt()
The pr_err() callsites in mobility.c already manually include a
"mobility:" prefix, let's make it official for the benefit of messages
to be added later.

Signed-off-by: Nathan Lynch <nathanl@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190627053044.9238-2-nathanl@linux.ibm.com
2020-07-30 22:53:48 +10:00
Nathan Lynch
6ec54363f1 powerpc/cacheinfo: Warn if cache object chain becomes unordered
This can catch cases where the device tree has gotten mishandled into
an inconsistent state at runtime, e.g. the cache nodes for both the
source and the destination are present after a migration.

Signed-off-by: Nathan Lynch <nathanl@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190627051537.7298-5-nathanl@linux.ibm.com
2020-07-30 22:53:48 +10:00
Nathan Lynch
1b3da8ffaa powerpc/cacheinfo: Improve diagnostics about malformed cache lists
If we have a bug which causes us to start with the wrong kind of OF
node when linking up the cache tree, it's helpful for debugging to
print information about what we found vs what we expected. So replace
uses of WARN_ON_ONCE with WARN_ONCE, which lets us include an
informative message instead of a contentless backtrace.

Signed-off-by: Nathan Lynch <nathanl@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190627051537.7298-4-nathanl@linux.ibm.com
2020-07-30 22:53:48 +10:00
Nathan Lynch
be6f885e97 powerpc/cacheinfo: Use name@unit instead of full DT path in debug messages
We know that every OF node we deal with in this code is under /cpus,
so we can make the debug messages a little less verbose without losing
information.

E.g.

cacheinfo: creating L1 dcache and icache for /cpus/PowerPC,POWER8@0
cacheinfo: creating L2 ucache for /cpus/l2-cache@2006
cacheinfo: creating L3 ucache for /cpus/l3-cache@3106

becomes

cacheinfo: creating L1 dcache and icache for PowerPC,POWER8@0
cacheinfo: creating L2 ucache for l2-cache@2006
cacheinfo: creating L3 ucache for l3-cache@3106

Replace all '%pOF' specifiers with '%pOFP'.

Signed-off-by: Nathan Lynch <nathanl@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190627051537.7298-3-nathanl@linux.ibm.com
2020-07-30 22:53:44 +10:00
Nathan Lynch
e2b3c165f2 powerpc/cacheinfo: Set pr_fmt()
Set pr_fmt() so we get a nice prefix on messages.

Signed-off-by: Nathan Lynch <nathanl@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190627051537.7298-2-nathanl@linux.ibm.com
2020-07-30 11:36:45 +10:00
Vladis Dronov
aff779515a powerpc: fix function annotations to avoid section mismatch warnings with gcc-10
Certain warnings are emitted for powerpc code when building with a gcc-10
toolset:

    WARNING: modpost: vmlinux.o(.text.unlikely+0x377c): Section mismatch in
    reference from the function remove_pmd_table() to the function
    .meminit.text:split_kernel_mapping()
    The function remove_pmd_table() references
    the function __meminit split_kernel_mapping().
    This is often because remove_pmd_table lacks a __meminit
    annotation or the annotation of split_kernel_mapping is wrong.

Add the appropriate __init and __meminit annotations to make modpost not
complain. In all the cases there are just a single callsite from another
__init or __meminit function:

__meminit remove_pagetable() -> remove_pud_table() -> remove_pmd_table()
__init prom_init() -> setup_secure_guest()
__init xive_spapr_init() -> xive_spapr_disabled()

Signed-off-by: Vladis Dronov <vdronov@redhat.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200729133741.62789-1-vdronov@redhat.com
2020-07-30 10:50:07 +10:00
Peter Zijlstra
f05d67179d Merge branch 'locking/header' 2020-07-29 16:14:21 +02:00
Herbert Xu
7ca8cf5347 locking/atomic: Move ATOMIC_INIT into linux/types.h
This patch moves ATOMIC_INIT from asm/atomic.h into linux/types.h.
This allows users of atomic_t to use ATOMIC_INIT without having to
include atomic.h as that way may lead to header loops.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Waiman Long <longman@redhat.com>
Link: https://lkml.kernel.org/r/20200729123105.GB7047@gondor.apana.org.au
2020-07-29 16:14:18 +02:00
Hari Bathini
2e6bd221d9 powerpc/kexec_file: Enable early kernel OPAL calls
Kernels built with CONFIG_PPC_EARLY_DEBUG_OPAL enabled expects r8 & r9
to be filled with OPAL base & entry addresses respectively. Setting
these registers allows the kernel to perform OPAL calls before the
device tree is parsed.

Signed-off-by: Hari Bathini <hbathini@linux.ibm.com>
Reviewed-by: Thiago Jung Bauermann <bauerman@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/159602303975.575379.5032301944162937479.stgit@hbathini
2020-07-29 23:47:55 +10:00
Hari Bathini
b5667d13be powerpc/kexec_file: Fix kexec load failure with lack of memory hole
The kexec purgatory has to run in real mode. Only the first memory
block maybe accessible in real mode. And, unlike the case with panic
kernel, no memory is set aside for regular kexec load. Another thing
to note is, the memory for crashkernel is reserved at an offset of
128MB. So, when crashkernel memory is reserved, the memory ranges to
load kexec segments shrink further as the generic code only looks for
memblock free memory ranges and in all likelihood only a tiny bit of
memory from 0 to 128MB would be available to load kexec segments.

With kdump being used by default in general, kexec file load is likely
to fail almost always. This can be fixed by changing the memory hole
lookup logic for regular kexec to use the same method as kdump. This
would mean that most kexec segments will overlap with crashkernel
memory region. That should still be ok as the pages, whose destination
address isn't available while loading, are placed in an intermediate
location till a flush to the actual destination address happens during
kexec boot sequence.

Signed-off-by: Hari Bathini <hbathini@linux.ibm.com>
Tested-by: Pingfan Liu <piliu@redhat.com>
Reviewed-by: Thiago Jung Bauermann <bauerman@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/159602302326.575379.14038896654942043093.stgit@hbathini
2020-07-29 23:47:54 +10:00
Hari Bathini
6ecd0163d3 powerpc/kexec_file: Add appropriate regions for memory reserve map
While initrd, elfcorehdr and backup regions are already added to the
reserve map, there are a few missing regions that need to be added to
the memory reserve map. Add them here. And now that all the changes to
load panic kernel are in place, claim likewise.

Signed-off-by: Hari Bathini <hbathini@linux.ibm.com>
Tested-by: Pingfan Liu <piliu@redhat.com>
Reviewed-by: Thiago Jung Bauermann <bauerman@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/159602300473.575379.4218568032039284448.stgit@hbathini
2020-07-29 23:47:54 +10:00
Hari Bathini
cb350c1f1f powerpc/kexec_file: Prepare elfcore header for crashing kernel
Prepare elf headers for the crashing kernel's core file using
crash_prepare_elf64_headers() and pass on this info to kdump kernel by
updating its command line with elfcorehdr parameter. Also, add
elfcorehdr location to reserve map to avoid it from being stomped on
while booting.

Signed-off-by: Hari Bathini <hbathini@linux.ibm.com>
Tested-by: Pingfan Liu <piliu@redhat.com>
Reviewed-by: Thiago Jung Bauermann <bauerman@linux.ibm.com>
[mpe: Ensure cmdline is nul terminated]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/159602298855.575379.15819225623219909517.stgit@hbathini
2020-07-29 23:47:54 +10:00
Hari Bathini
1a1cf93c20 powerpc/kexec_file: Setup backup region for kdump kernel
Though kdump kernel boots from loaded address, the first 64KB of it is
copied down to real 0. So, setup a backup region and let purgatory
copy the first 64KB of crashed kernel into this backup region before
booting into kdump kernel. Update reserve map with backup region and
crashed kernel's memory to avoid kdump kernel from accidentially using
that memory.

Signed-off-by: Hari Bathini <hbathini@linux.ibm.com>
Reviewed-by: Thiago Jung Bauermann <bauerman@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/159602294718.575379.16216507537038008623.stgit@hbathini
2020-07-29 23:47:54 +10:00
Hari Bathini
7c64e21a1c powerpc/kexec_file: Restrict memory usage of kdump kernel
Kdump kernel, used for capturing the kernel core image, is supposed
to use only specific memory regions to avoid corrupting the image to
be captured. The regions are crashkernel range - the memory reserved
explicitly for kdump kernel, memory used for the tce-table, the OPAL
region and RTAS region as applicable. Restrict kdump kernel memory
to use only these regions by setting up usable-memory DT property.
Also, tell the kdump kernel to run at the loaded address by setting
the magic word at 0x5c.

Signed-off-by: Hari Bathini <hbathini@linux.ibm.com>
Tested-by: Pingfan Liu <piliu@redhat.com>
Reviewed-by: Thiago Jung Bauermann <bauerman@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/159602284284.575379.6962016255404325493.stgit@hbathini
2020-07-29 23:47:54 +10:00
Hari Bathini
adfefc609e powerpc/drmem: Make LMB walk a bit more flexible
Currently, numa & prom are the only users of drmem LMB walk code.
Loading kdump with kexec_file also needs to walk the drmem LMBs to
setup the usable memory ranges for kdump kernel. But there are couple
of issues in using the code as is. One, walk_drmem_lmb() code is built
into the .init section currently, while kexec_file needs it later.
Two, there is no scope to pass data to the callback function for
processing and/or erroring out on certain conditions.

Fix that by, moving drmem LMB walk code out of .init section, adding
scope to pass data to the callback function and bailing out when an
error is encountered in the callback function.

Signed-off-by: Hari Bathini <hbathini@linux.ibm.com>
Tested-by: Pingfan Liu <piliu@redhat.com>
Reviewed-by: Thiago Jung Bauermann <bauerman@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/159602282727.575379.3979857013827701828.stgit@hbathini
2020-07-29 23:47:54 +10:00
Hari Bathini
b8e55a3e5c powerpc/kexec_file: Avoid stomping memory used by special regions
crashkernel region could have an overlap with special memory regions
like OPAL, RTAS, TCE table & such. These regions are referred to as
excluded memory ranges. Setup these ranges during image probe in order
to avoid them while finding the buffer for different kdump segments.
Override arch_kexec_locate_mem_hole() to locate a memory hole taking
these ranges into account.

Signed-off-by: Hari Bathini <hbathini@linux.ibm.com>
Reviewed-by: Thiago Jung Bauermann <bauerman@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/159602281047.575379.6636807148335160795.stgit@hbathini
2020-07-29 23:47:53 +10:00
Hari Bathini
180adfc532 powerpc/kexec_file: Add helper functions for getting memory ranges
In kexec case, the kernel to be loaded uses the same memory layout as
the running kernel. So, passing on the DT of the running kernel would
be good enough.

But in case of kdump, different memory ranges are needed to manage
loading the kdump kernel, booting into it and exporting the elfcore of
the crashing kernel. The ranges are exclude memory ranges, usable
memory ranges, reserved memory ranges and crash memory ranges.

Exclude memory ranges specify the list of memory ranges to avoid while
loading kdump segments. Usable memory ranges list the memory ranges
that could be used for booting kdump kernel. Reserved memory ranges
list the memory regions for the loading kernel's reserve map. Crash
memory ranges list the memory ranges to be exported as the crashing
kernel's elfcore.

Add helper functions for setting up the above mentioned memory ranges.
This helpers facilitate in understanding the subsequent changes better
and make it easy to setup the different memory ranges listed above, as
and when appropriate.

Signed-off-by: Hari Bathini <hbathini@linux.ibm.com>
Tested-by: Pingfan Liu <piliu@redhat.com>
Reviewed-by: Thiago Jung Bauermann <bauerman@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/159602279194.575379.8526552316948643550.stgit@hbathini
2020-07-29 23:47:53 +10:00
Hari Bathini
19031275a5 powerpc/kexec_file: Mark PPC64 specific code
Some of the kexec_file_load code isn't PPC64 specific. Move PPC64
specific code from kexec/file_load.c to kexec/file_load_64.c. Also,
rename purgatory/trampoline.S to purgatory/trampoline_64.S in the same
spirit. No functional changes.

Signed-off-by: Hari Bathini <hbathini@linux.ibm.com>
Tested-by: Pingfan Liu <piliu@redhat.com>
Reviewed-by: Laurent Dufour <ldufour@linux.ibm.com>
Reviewed-by: Thiago Jung Bauermann <bauerman@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/159602276920.575379.10390965946438306388.stgit@hbathini
2020-07-29 23:47:53 +10:00
Anton Blanchard
fdaa7ce201 powerpc/configs: Add BLK_DEV_NVME to pseries_defconfig
I've forgotten to manually enable NVME when building pseries kernels
for machines with NVME adapters. Since it's a reasonably common
configuration, enable it by default.

Signed-off-by: Anton Blanchard <anton@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200729040828.2312966-1-anton@ozlabs.org
2020-07-29 23:47:53 +10:00
Mahesh Salgaonkar
ada68a66b7 powerpc/64s: Move HMI IRQ stat from percpu variable to paca.
With the proposed change in percpu bootmem allocator to use page
mapping [1], the percpu first chunk memory area can come from vmalloc
ranges. This makes the HMI (Hypervisor Maintenance Interrupt) handler
crash the kernel whenever percpu variable is accessed in real mode.
This patch fixes this issue by moving the HMI IRQ stat inside paca for
safe access in realmode.

[1] https://lore.kernel.org/linuxppc-dev/20200608070904.387440-1-aneesh.kumar@linux.ibm.com/

Suggested-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Mahesh Salgaonkar <mahesh@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/159290806973.3642154.5244613424529764050.stgit@jupiter
2020-07-29 23:47:53 +10:00
David Lamparter
d3c61954fc powerpc/fsl/dts: add missing P4080DS I2C devices
This just adds the zl2006 voltage regulators / power monitors and the
onboard I2C eeproms.  The ICS9FG108 clock chip doesn't seem to have a
driver, so it is left in the DTS as a comment.  And for good measure,
the SPD eeproms are tagged as such.

Signed-off-by: David Lamparter <equinox@diac24.net>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20180920230422.GK487685@eidolon.nox.tf
2020-07-29 23:47:52 +10:00
Alastair D'Silva
c75d42e4c7 ocxl: Remove unnecessary externs
Function declarations don't need externs, remove the existing ones
so they are consistent with newer code

Signed-off-by: Alastair D'Silva <alastair@d-silva.org>
Acked-by: Andrew Donnellan <ajd@linux.ibm.com>
Acked-by: Frederic Barrat <fbarrat@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200415012343.919255-2-alastair@d-silva.org
2020-07-29 23:47:52 +10:00
Balamuruhan S
b859c95cf4 powerpc/test_emulate_step: Add testcases for divde[.] and divdeu[.] instructions
Add testcases for divde, divde., divdeu, divdeu. emulated instructions
to cover few scenarios,
  - with same dividend and divisor to have undefine RT
    for divdeu[.]
  - with divide by zero to have undefine RT for both
    divde[.] and divdeu[.]
  - with negative dividend to cover -|divisor| < r <= 0 if
    the dividend is negative for divde[.]
  - normal case with proper dividend and divisor for both
    divde[.] and divdeu[.]

Signed-off-by: Balamuruhan S <bala24@linux.ibm.com>
Reviewed-by: Sandipan Das <sandipan@linux.ibm.com>
Acked-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200728130308.1790982-4-bala24@linux.ibm.com
2020-07-29 23:47:52 +10:00
Balamuruhan S
151c32bf5e powerpc/sstep: Add support for divde[.] and divdeu[.] instructions
This patch adds emulation support for divde, divdeu instructions,
  - Divide Doubleword Extended (divde[.])
  - Divide Doubleword Extended Unsigned (divdeu[.])

Signed-off-by: Balamuruhan S <bala24@linux.ibm.com>
Reviewed-by: Sandipan Das <sandipan@linux.ibm.com>
Acked-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200728130308.1790982-3-bala24@linux.ibm.com
2020-07-29 23:47:52 +10:00
Balamuruhan S
8902c6f963 powerpc/ppc-opcode: Add divde and divdeu opcodes
Include instruction opcodes for divde and divdeu as macros.

Signed-off-by: Balamuruhan S <bala24@linux.ibm.com>
Reviewed-by: Sandipan Das <sandipan@linux.ibm.com>
Acked-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200728130308.1790982-2-bala24@linux.ibm.com
2020-07-29 23:47:52 +10:00
Joerg Roedel
56fbacc9bf Merge branches 'arm/renesas', 'arm/qcom', 'arm/mediatek', 'arm/omap', 'arm/exynos', 'arm/smmu', 'ppc/pamu', 'x86/vt-d', 'x86/amd' and 'core' into next 2020-07-29 14:42:00 +02:00
Wei Yongjun
cf1ae052e0 powerpc/powernv/sriov: Remove unused but set variable 'phb'
Gcc report warning as follows:

arch/powerpc/platforms/powernv/pci-sriov.c:602:25: warning:
 variable 'phb' set but not used [-Wunused-but-set-variable]
  602 |  struct pnv_phb        *phb;
      |                         ^~~

This variable is not used, so this commit removing it.

Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
Acked-by: Oliver O'Halloran <oohall@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200727171112.2781-1-weiyongjun1@huawei.com
2020-07-29 22:30:34 +10:00
Qinglang Miao
b6ac59d39a powerpc: use for_each_child_of_node() macro
Use for_each_child_of_node() macro instead of open coding it.

Signed-off-by: Qinglang Miao <miaoqinglang@huawei.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200728022807.87815-1-miaoqinglang@huawei.com
2020-07-29 22:30:33 +10:00
Nicholas Piggin
f2af201002 powerpc/build: vdso linker warning for orphan sections
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200303012748.4190929-1-npiggin@gmail.com
2020-07-29 21:09:37 +10:00
Gustavo A. R. Silva
5e66a0cb5f powerpc: Use fallthrough pseudo-keyword
Replace the existing /* fall through */ comments and its variants with
the new pseudo-keyword macro fallthrough[1]. Also, remove unnecessary
fall-through markings when it is the case.

[1] https://www.kernel.org/doc/html/v5.7/process/deprecated.html?highlight=fallthrough#implicit-switch-case-fall-through

Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200727224201.GA10133@embeddedor
2020-07-29 21:09:37 +10:00
Aneesh Kumar K.V
bf6b7661f4 powerpc/book3s64/radix: Add kernel command line option to disable radix GTSE
This adds a kernel command line option that can be used to disable GTSE support.
Disabling GTSE implies kernel will make hcalls to invalidate TLB entries.

This was done so that we can do VM migration between configs that enable/disable
GTSE support via hypervisor. To migrate a VM from a system that supports
GTSE to a system that doesn't, we can boot the guest with
radix_hcall_invalidate=on, thereby forcing the guest to use hcalls for TLB
invalidates.

The check for hcall availability is done in pSeries_setup_arch so that
the panic message appears on the console. This should only happen on
a hypervisor that doesn't force the guest to hash translation even
though it can't handle the radix GTSE=0 request via CAS. With
radix_hcall_invalidate=on if the hypervisor doesn't support hcall_rpt_invalidate
hcall it should force the LPAR to hash translation.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Tested-by: Bharata B Rao <bharata@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200727085908.420806-1-aneesh.kumar@linux.ibm.com
2020-07-29 21:09:37 +10:00
Aneesh Kumar K.V
a5a8b258da powerpc/kvm/cma: Improve kernel log during boot
Current kernel gives:

[    0.000000] cma: Reserved 26224 MiB at 0x0000007959000000
[    0.000000] hugetlb_cma: reserve 65536 MiB, up to 16384 MiB per node
[    0.000000] cma: Reserved 16384 MiB at 0x0000001800000000

With the fix

[    0.000000] kvm_cma_reserve: reserving 26214 MiB for global area
[    0.000000] cma: Reserved 26224 MiB at 0x0000007959000000
[    0.000000] hugetlb_cma: reserve 65536 MiB, up to 16384 MiB per node
[    0.000000] cma: Reserved 16384 MiB at 0x0000001800000000

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200713150749.25245-2-aneesh.kumar@linux.ibm.com
2020-07-29 21:09:37 +10:00
Aneesh Kumar K.V
ef26b76d1a powerpc/hugetlb/cma: Allocate gigantic hugetlb pages using CMA
commit: cf11e85fc0 ("mm: hugetlb: optionally allocate gigantic hugepages using cma")
added support for allocating gigantic hugepages using CMA. This patch
enables the same for powerpc

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200713150749.25245-1-aneesh.kumar@linux.ibm.com
2020-07-29 21:09:37 +10:00
Balamuruhan S
81a413259a powerpc/xmon: Use dcbf inplace of dcbi instruction for 64bit Book3S
Data Cache Block Invalidate (dcbi) instruction implemented back in
PowerPC architecture version 2.03. But as per Power Processor Users Manual
it is obsolete and not supported by POWER8/POWER9 core. Attempt to use of
this illegal instruction results in a hypervisor emulation assistance
interrupt. So, ifdef it out the option `i` in xmon for 64bit Book3S.

  0:mon> fi
  cpu 0x0: Vector: 700 (Program Check) at [c000000003be74a0]
      pc: c000000000102030: cacheflush+0x180/0x1a0
      lr: c000000000101f3c: cacheflush+0x8c/0x1a0
      sp: c000000003be7730
     msr: 8000000000081033
    current = 0xc0000000035e5c00
    paca    = 0xc000000001910000   irqmask: 0x03   irq_happened: 0x01
      pid   = 1025, comm = bash
  Linux version 5.6.0-rc5-g5aa19adac (root@ltc-wspoon6) (gcc version 7.4.0
  (Ubuntu 7.4.0-1ubuntu1~18.04.1)) #1 SMP Tue Mar 10 04:38:41 CDT 2020
  cpu 0x0: Exception 700 (Program Check) in xmon, returning to main loop
  [c000000003be7c50] c00000000084abb0 __handle_sysrq+0xf0/0x2a0
  [c000000003be7d00] c00000000084b3c0 write_sysrq_trigger+0xb0/0xe0
  [c000000003be7d30] c0000000004d1edc proc_reg_write+0x8c/0x130
  [c000000003be7d60] c00000000040dc7c __vfs_write+0x3c/0x70
  [c000000003be7d80] c000000000410e70 vfs_write+0xd0/0x210
  [c000000003be7dd0] c00000000041126c ksys_write+0xdc/0x130
  [c000000003be7e20] c00000000000b9d0 system_call+0x5c/0x68
  --- Exception: c01 (System Call) at 00007fffa345e420
  SP (7ffff0b08ab0) is in userspace

Signed-off-by: Balamuruhan S <bala24@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200330075954.538773-1-bala24@linux.ibm.com
2020-07-29 21:08:30 +10:00
Michael Ellerman
ee36d867b2 powerpc: Drop old comment about CONFIG_POWER
There's a comment in time.h referring to CONFIG_POWER, which doesn't
exist. That confuses scripts/checkkconfigsymbols.py.

Presumably the comment was referring to a CONFIG_POWER vs CONFIG_PPC,
in which case for CONFIG_POWER we would #define __USE_RTC to 1. But
instead we have CONFIG_PPC_BOOK3S_601, and these days we have
IS_ENABLED().

So the comment is no longer relevant, drop it.

Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200724131728.1643966-9-mpe@ellerman.id.au
2020-07-29 21:08:27 +10:00
Michael Ellerman
157dad8678 powerpc/kvm: Use correct CONFIG symbol in comment
This comment refers to the non-existent CONFIG_PPC_BOOK3S_XX, which
confuses scripts/checkkconfigsymbols.py.

Change it to use the correct symbol.

Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200724131728.1643966-8-mpe@ellerman.id.au
2020-07-29 21:08:25 +10:00
Michael Ellerman
e5eff89657 powerpc/boot: Fix CONFIG_PPC_MPC52XX references
Commit 866bfc75f4 ("powerpc: conditionally compile platform-specific
serial drivers") made some code depend on CONFIG_PPC_MPC52XX, which
doesn't exist.

Fix it to use CONFIG_PPC_MPC52xx.

Fixes: 866bfc75f4 ("powerpc: conditionally compile platform-specific serial drivers")
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200724131728.1643966-7-mpe@ellerman.id.au
2020-07-29 21:08:22 +10:00
Michael Ellerman
69eeff0224 powerpc/32s: Remove TAUException wart in traps.c
All 32 and 64-bit builds that don't have CONFIG_TAU_INT enabled (all
of them), get a definition of TAUException() in traps.c.

On 64-bit it's completely useless, and just wastes ~120 bytes of text.
On 32-bit it allows the kernel to link because head_32.S calls it
unconditionally.

Instead follow the example of altivec_assist_exception(), and if
CONFIG_TAU_INT is not enabled just point it at unknown_exception using
the preprocessor.

Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200724131728.1643966-6-mpe@ellerman.id.au
2020-07-29 21:08:18 +10:00
Michael Ellerman
df4d4ef224 powerpc/32s: Fix CONFIG_BOOK3S_601 uses
We have two uses of CONFIG_BOOK3S_601, which doesn't exist. Fix them
to use CONFIG_PPC_BOOK3S_601 which is the correct symbol.

Fixes: 12c3f1fd87 ("powerpc/32s: get rid of CPU_FTR_601 feature")
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200724131728.1643966-5-mpe@ellerman.id.au
2020-07-29 21:08:15 +10:00
Michael Ellerman
07e571ea59 powerpc/64e: Drop dead BOOK3E_MMU_TLB_STATS code
This code was merged 11 years ago in commit 13363ab9b9 ("powerpc:
Add definitions used by exception handling on 64-bit Book3E") but was
never able to be built because CONFIG_BOOK3E_MMU_TLB_STATS never
existed. Remove it.

Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200724131728.1643966-4-mpe@ellerman.id.au
2020-07-29 21:08:12 +10:00
Michael Ellerman
8cdcde5f76 powerpc/52xx: Fix comment about CONFIG_BDI*
There's a comment in lite5200_sleep.S that refers to "CONFIG_BDI*".

This confuses scripts/checkkconfigsymbols.py, which thinks it should
be able to find CONFIG_BDI.

Change the comment to refer to CONFIG_BDI_SWITCH which is presumably
roughly what it was referring to. AFAICS there never has been a
CONFIG_BDI.

Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200724131728.1643966-3-mpe@ellerman.id.au
2020-07-29 21:08:09 +10:00
Michael Ellerman
0fcce25b77 powerpc/configs: Remove dead symbols
Remove references to symbols that no longer exist as reported by
scripts/checkkconfigsymbols.py.

Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200724131728.1643966-2-mpe@ellerman.id.au
2020-07-29 21:08:06 +10:00
Michael Ellerman
fbb44c9a08 powerpc/configs: Drop old symbols from ppc6xx_defconfig
ppc6xx_defconfig refers to quite a few symbols that no longer exist,
as reported by scripts/checkkconfigsymbols.py, remove them.

Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200724131728.1643966-1-mpe@ellerman.id.au
2020-07-29 21:08:01 +10:00
Bharata B Rao
55548a86eb powerpc/mm: Limit resize_hpt_for_hotplug() call to hash guests only
During memory hotplug and unplug, resize_hpt_for_hotplug() gets called
for both hash and radix guests but it should be called only for hash
guests. Though the call does nothing in the radix guest case, it is
cleaner to push this call into hash specific memory hotplug routines.

Reported-by: Nathan Lynch <nathanl@linux.ibm.com>
Signed-off-by: Bharata B Rao <bharata@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200727095704.1432916-1-bharata@linux.ibm.com
2020-07-29 21:02:12 +10:00
Michael Ellerman
773b3e53df powerpc/mm: Remove custom stack expansion checking
We have powerpc specific logic in our page fault handling to decide if
an access to an unmapped address below the stack pointer should expand
the stack VMA.

The logic aims to prevent userspace from doing bad accesses below the
stack pointer. However as long as the stack is < 1MB in size, we allow
all accesses without further checks. Adding some debug I see that I
can do a full kernel build and LTP run, and not a single process has
used more than 1MB of stack. So for the majority of processes the
logic never even fires.

We also recently found a nasty bug in this code which could cause
userspace programs to be killed during signal delivery. It went
unnoticed presumably because most processes use < 1MB of stack.

The generic mm code has also grown support for stack guard pages since
this code was originally written, so the most heinous case of the
stack expanding into other mappings is now handled for us.

Finally although some other arches have special logic in this path,
from what I can tell none of x86, arm64, arm and s390 impose any extra
checks other than those in expand_stack().

So drop our complicated logic and like other architectures just let
the stack expand as long as its within the rlimit.

Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Tested-by: Daniel Axtens <dja@axtens.net>
Link: https://lore.kernel.org/r/20200724092528.1578671-4-mpe@ellerman.id.au
2020-07-29 21:02:12 +10:00
Michael Ellerman
63dee5df43 powerpc: Allow 4224 bytes of stack expansion for the signal frame
We have powerpc specific logic in our page fault handling to decide if
an access to an unmapped address below the stack pointer should expand
the stack VMA.

The code was originally added in 2004 "ported from 2.4". The rough
logic is that the stack is allowed to grow to 1MB with no extra
checking. Over 1MB the access must be within 2048 bytes of the stack
pointer, or be from a user instruction that updates the stack pointer.

The 2048 byte allowance below the stack pointer is there to cover the
288 byte "red zone" as well as the "about 1.5kB" needed by the signal
delivery code.

Unfortunately since then the signal frame has expanded, and is now
4224 bytes on 64-bit kernels with transactional memory enabled. This
means if a process has consumed more than 1MB of stack, and its stack
pointer lies less than 4224 bytes from the next page boundary, signal
delivery will fault when trying to expand the stack and the process
will see a SEGV.

The total size of the signal frame is the size of struct rt_sigframe
(which includes the red zone) plus __SIGNAL_FRAMESIZE (128 bytes on
64-bit).

The 2048 byte allowance was correct until 2008 as the signal frame
was:

struct rt_sigframe {
        struct ucontext    uc;                           /*     0  1440 */
        /* --- cacheline 11 boundary (1408 bytes) was 32 bytes ago --- */
        long unsigned int          _unused[2];           /*  1440    16 */
        unsigned int               tramp[6];             /*  1456    24 */
        struct siginfo *           pinfo;                /*  1480     8 */
        void *                     puc;                  /*  1488     8 */
        struct siginfo     info;                         /*  1496   128 */
        /* --- cacheline 12 boundary (1536 bytes) was 88 bytes ago --- */
        char                       abigap[288];          /*  1624   288 */

        /* size: 1920, cachelines: 15, members: 7 */
        /* padding: 8 */
};

1920 + 128 = 2048

Then in commit ce48b21007 ("powerpc: Add VSX context save/restore,
ptrace and signal support") (Jul 2008) the signal frame expanded to
2304 bytes:

struct rt_sigframe {
        struct ucontext    uc;                           /*     0  1696 */	<--
        /* --- cacheline 13 boundary (1664 bytes) was 32 bytes ago --- */
        long unsigned int          _unused[2];           /*  1696    16 */
        unsigned int               tramp[6];             /*  1712    24 */
        struct siginfo *           pinfo;                /*  1736     8 */
        void *                     puc;                  /*  1744     8 */
        struct siginfo     info;                         /*  1752   128 */
        /* --- cacheline 14 boundary (1792 bytes) was 88 bytes ago --- */
        char                       abigap[288];          /*  1880   288 */

        /* size: 2176, cachelines: 17, members: 7 */
        /* padding: 8 */
};

2176 + 128 = 2304

At this point we should have been exposed to the bug, though as far as
I know it was never reported. I no longer have a system old enough to
easily test on.

Then in 2010 commit 320b2b8de1 ("mm: keep a guard page below a
grow-down stack segment") caused our stack expansion code to never
trigger, as there was always a VMA found for a write up to PAGE_SIZE
below r1.

That meant the bug was hidden as we continued to expand the signal
frame in commit 2b0a576d15 ("powerpc: Add new transactional memory
state to the signal context") (Feb 2013):

struct rt_sigframe {
        struct ucontext    uc;                           /*     0  1696 */
        /* --- cacheline 13 boundary (1664 bytes) was 32 bytes ago --- */
        struct ucontext    uc_transact;                  /*  1696  1696 */	<--
        /* --- cacheline 26 boundary (3328 bytes) was 64 bytes ago --- */
        long unsigned int          _unused[2];           /*  3392    16 */
        unsigned int               tramp[6];             /*  3408    24 */
        struct siginfo *           pinfo;                /*  3432     8 */
        void *                     puc;                  /*  3440     8 */
        struct siginfo     info;                         /*  3448   128 */
        /* --- cacheline 27 boundary (3456 bytes) was 120 bytes ago --- */
        char                       abigap[288];          /*  3576   288 */

        /* size: 3872, cachelines: 31, members: 8 */
        /* padding: 8 */
        /* last cacheline: 32 bytes */
};

3872 + 128 = 4000

And commit 573ebfa660 ("powerpc: Increase stack redzone for 64-bit
userspace to 512 bytes") (Feb 2014):

struct rt_sigframe {
        struct ucontext    uc;                           /*     0  1696 */
        /* --- cacheline 13 boundary (1664 bytes) was 32 bytes ago --- */
        struct ucontext    uc_transact;                  /*  1696  1696 */
        /* --- cacheline 26 boundary (3328 bytes) was 64 bytes ago --- */
        long unsigned int          _unused[2];           /*  3392    16 */
        unsigned int               tramp[6];             /*  3408    24 */
        struct siginfo *           pinfo;                /*  3432     8 */
        void *                     puc;                  /*  3440     8 */
        struct siginfo     info;                         /*  3448   128 */
        /* --- cacheline 27 boundary (3456 bytes) was 120 bytes ago --- */
        char                       abigap[512];          /*  3576   512 */	<--

        /* size: 4096, cachelines: 32, members: 8 */
        /* padding: 8 */
};

4096 + 128 = 4224

Then finally in 2017, commit 1be7107fbe ("mm: larger stack guard
gap, between vmas") exposed us to the existing bug, because it changed
the stack VMA to be the correct/real size, meaning our stack expansion
code is now triggered.

Fix it by increasing the allowance to 4224 bytes.

Hard-coding 4224 is obviously unsafe against future expansions of the
signal frame in the same way as the existing code. We can't easily use
sizeof() because the signal frame structure is not in a header. We
will either fix that, or rip out all the custom stack expansion
checking logic entirely.

Fixes: ce48b21007 ("powerpc: Add VSX context save/restore, ptrace and signal support")
Cc: stable@vger.kernel.org # v2.6.27+
Reported-by: Tom Lane <tgl@sss.pgh.pa.us>
Tested-by: Daniel Axtens <dja@axtens.net>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200724092528.1578671-2-mpe@ellerman.id.au
2020-07-29 21:02:12 +10:00
Nicholas Piggin
107c55005f powerpc/pseries: Add KVM guest doorbell restrictions
KVM guests have certain restrictions and performance quirks when using
doorbells. This patch moves the EPAPR KVM guest test so it can be shared
with PSERIES, and uses that in doorbell setup code to apply the KVM
guest quirks and  improves IPI performance for two cases:

 - PowerVM guests may now use doorbells even if they are secure.

 - KVM guests no longer use doorbells if XIVE is available.

There is a valid complaint that "KVM guest" is not a very reasonable
thing to test for, it's preferable for the hypervisor to advertise
particular behaviours to the guest so they could change if the
hypervisor implementation or configuration changes. However in this case
we were already assuming a KVM guest worst case, so this patch is about
containing those quirks. If KVM later advertises fast doorbells, we
should test for that and override the quirks.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Tested-by: Cédric Le Goater <clg@kaod.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200726035155.1424103-4-npiggin@gmail.com
2020-07-29 21:02:10 +10:00
Nicholas Piggin
5b06d1679f powerpc/pseries: Use doorbells even if XIVE is available
KVM supports msgsndp in guests by trapping and emulating the
instruction, so it was decided to always use XIVE for IPIs if it is
available. However on PowerVM systems, msgsndp can be used and gives
better performance. On large systems, high XIVE interrupt rates can
have sub-linear scaling, and using msgsndp can reduce the load on
the interrupt controller.

So switch to using core local doorbells even if XIVE is available.
This reduces performance for KVM guests with an SMT topology by
about 50% for ping-pong context switching between SMT vCPUs. An
option vector (or dt-cpu-ftrs) could be defined to disable msgsndp
to get KVM performance back.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Tested-by: Cédric Le Goater <clg@kaod.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200726035155.1424103-3-npiggin@gmail.com
2020-07-29 21:02:09 +10:00
Nicholas Piggin
1f0ce49743 powerpc: Inline doorbell sending functions
These are only called in one place for a given platform, so inline
them for performance.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Tested-by: Cédric Le Goater <clg@kaod.org>
[mpe: Fix build errors related to KVM]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200726035155.1424103-2-npiggin@gmail.com
2020-07-29 21:02:09 +10:00
Athira Rajeev
443359aebc powerpc/perf: Fix MMCRA_BHRB_DISABLE define for binutils < 2.28
Commit 9908c826d5 ("powerpc/perf: Add Power10 PMU feature to DT CPU
features") defines MMCRA_BHRB_DISABLE as `0x2000000000UL`. Binutils
version less than 2.28 doesn't support UL suffix.

  arch/powerpc/kernel/cpu_setup_power.S: Assembler messages:
  arch/powerpc/kernel/cpu_setup_power.S:250: Error: found 'L', expected: ')'
  arch/powerpc/kernel/cpu_setup_power.S:250: Error: junk at end of line, first unrecognized character is `L'
  arch/powerpc/kernel/cpu_setup_power.S:250: Error: found 'L', expected: ')'
  arch/powerpc/kernel/cpu_setup_power.S:250: Error: found 'L', expected: ')'
  arch/powerpc/kernel/cpu_setup_power.S:250: Error: junk at end of line, first unrecognized character is `L'
  arch/powerpc/kernel/cpu_setup_power.S:250: Error: found 'L', expected: ')'
  arch/powerpc/kernel/cpu_setup_power.S:250: Error: found 'L', expected: ')'
  arch/powerpc/kernel/cpu_setup_power.S:250: Error: operand out of range (0x0000002000000000 is not between 0xffffffffffff8000 and 0x000000000000ffff)

Fix this by wrapping it with the `_UL` macro.

Fixes: 9908c826d5 ("Add Power10 PMU feature to DT CPU features")
Suggested-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Athira Rajeev <atrajeev@linux.vnet.ibm.com>
Reviewed-by: Madhavan Srinivasan <maddy@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/1595996214-5833-1-git-send-email-atrajeev@linux.vnet.ibm.com
2020-07-29 21:02:09 +10:00
Ralph Campbell
5143192cd4 mm/migrate: add a flags parameter to migrate_vma
The src_owner field in struct migrate_vma is being used for two purposes,
it acts as a selection filter for which types of pages are to be migrated
and it identifies device private pages owned by the caller.

Split this into separate parameters so the src_owner field can be used
just to identify device private pages owned by the caller of
migrate_vma_setup().

Rename the src_owner field to pgmap_owner to reflect it is now used only
to identify which device private pages to migrate.

Link: https://lore.kernel.org/r/20200723223004.9586-3-rcampbell@nvidia.com
Signed-off-by: Ralph Campbell <rcampbell@nvidia.com>
Reviewed-by: Bharata B Rao <bharata@linux.ibm.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2020-07-28 16:20:33 -03:00
Laurent Dufour
81ab595ddd KVM: PPC: Book3S HV: Rework secure mem slot dropping
When a secure memslot is dropped, all the pages backed in the secure
device (aka really backed by secure memory by the Ultravisor)
should be paged out to a normal page. Previously, this was
achieved by triggering the page fault mechanism which is calling
kvmppc_svm_page_out() on each pages.

This can't work when hot unplugging a memory slot because the memory
slot is flagged as invalid and gfn_to_pfn() is then not trying to access
the page, so the page fault mechanism is not triggered.

Since the final goal is to make a call to kvmppc_svm_page_out() it seems
simpler to call directly instead of triggering such a mechanism. This
way kvmppc_uvmem_drop_pages() can be called even when hot unplugging a
memslot.

Since kvmppc_uvmem_drop_pages() is already holding kvm->arch.uvmem_lock,
the call to __kvmppc_svm_page_out() is made.  As
__kvmppc_svm_page_out needs the vma pointer to migrate the pages,
the VMA is fetched in a lazy way, to not trigger find_vma() all
the time. In addition, the mmap_sem is held in read mode during
that time, not in write mode since the virual memory layout is not
impacted, and kvm->arch.uvmem_lock prevents concurrent operation
on the secure device.

Reviewed-by: Bharata B Rao <bharata@linux.ibm.com>
Signed-off-by: Laurent Dufour <ldufour@linux.ibm.com>
        [modified check on the VMA in kvmppc_uvmem_drop_pages]
Signed-off-by: Ram Pai <linuxram@us.ibm.com>
	[modified the changelog description]
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2020-07-28 12:34:52 +10:00
Laurent Dufour
f1b87ea878 KVM: PPC: Book3S HV: Move kvmppc_svm_page_out up
kvmppc_svm_page_out() will need to be called by kvmppc_uvmem_drop_pages()
so move it up earlier in this file.

Furthermore it will be interesting to call this function when already
holding the kvm->arch.uvmem_lock, so prefix the original function with __
and remove the locking in it, and introduce a wrapper which call that
function with the lock held.

There is no functional change.

Reviewed-by: Bharata B Rao <bharata@linux.ibm.com>
Signed-off-by: Laurent Dufour <ldufour@linux.ibm.com>
Signed-off-by: Ram Pai <linuxram@us.ibm.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2020-07-28 12:34:52 +10:00
Laurent Dufour
a2ce720038 KVM: PPC: Book3S HV: Migrate hot plugged memory
When a memory slot is hot plugged to a SVM, PFNs associated with the
GFNs in that slot must be migrated to the secure-PFNs, aka device-PFNs.

Call kvmppc_uv_migrate_mem_slot() to accomplish this.
Disable page-merge for all pages in the memory slot.

Reviewed-by: Bharata B Rao <bharata@linux.ibm.com>
Signed-off-by: Ram Pai <linuxram@us.ibm.com>
[rearranged the code, and modified the commit log]
Signed-off-by: Laurent Dufour <ldufour@linux.ibm.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2020-07-28 12:34:52 +10:00
Ram Pai
dfaa973ae9 KVM: PPC: Book3S HV: In H_SVM_INIT_DONE, migrate remaining normal-GFNs to secure-GFNs
The Ultravisor is expected to explicitly call H_SVM_PAGE_IN for all the
pages of the SVM before calling H_SVM_INIT_DONE. This causes a huge
delay in tranistioning the VM to SVM. The Ultravisor is only interested
in the pages that contain the kernel, initrd and other important data
structures. The rest contain throw-away content.

However if not all pages are requested by the Ultravisor, the Hypervisor
continues to consider the GFNs corresponding to the non-requested pages
as normal GFNs. This can lead to data-corruption and undefined behavior.

In H_SVM_INIT_DONE handler, move all the PFNs associated with the SVM's
GFNs to secure-PFNs. Skip the GFNs that are already Paged-in or Shared
or Paged-in followed by a Paged-out.

Reviewed-by: Bharata B Rao <bharata@linux.ibm.com>
Signed-off-by: Ram Pai <linuxram@us.ibm.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2020-07-28 12:34:52 +10:00
Ram Pai
651a631011 KVM: PPC: Book3S HV: Track the state GFNs associated with secure VMs
During the life of SVM, its GFNs transition through normal, secure and
shared states. Since the kernel does not track GFNs that are shared, it
is not possible to disambiguate a shared GFN from a GFN whose PFN has
not yet been migrated to a secure-PFN. Also it is not possible to
disambiguate a secure-GFN from a GFN whose GFN has been pagedout from
the ultravisor.

The ability to identify the state of a GFN is needed to skip migration
of its PFN to secure-PFN during ESM transition.

The code is re-organized to track the states of a GFN as explained
below.

************************************************************************
 1. States of a GFN
    ---------------
 The GFN can be in one of the following states.

 (a) Secure - The GFN is secure. The GFN is associated with
 	a Secure VM, the contents of the GFN is not accessible
 	to the Hypervisor.  This GFN can be backed by a secure-PFN,
 	or can be backed by a normal-PFN with contents encrypted.
 	The former is true when the GFN is paged-in into the
 	ultravisor. The latter is true when the GFN is paged-out
 	of the ultravisor.

 (b) Shared - The GFN is shared. The GFN is associated with a
 	a secure VM. The contents of the GFN is accessible to
 	Hypervisor. This GFN is backed by a normal-PFN and its
 	content is un-encrypted.

 (c) Normal - The GFN is a normal. The GFN is associated with
 	a normal VM. The contents of the GFN is accesible to
 	the Hypervisor. Its content is never encrypted.

 2. States of a VM.
    ---------------

 (a) Normal VM:  A VM whose contents are always accessible to
 	the hypervisor.  All its GFNs are normal-GFNs.

 (b) Secure VM: A VM whose contents are not accessible to the
 	hypervisor without the VM's consent.  Its GFNs are
 	either Shared-GFN or Secure-GFNs.

 (c) Transient VM: A Normal VM that is transitioning to secure VM.
 	The transition starts on successful return of
 	H_SVM_INIT_START, and ends on successful return
 	of H_SVM_INIT_DONE. This transient VM, can have GFNs
 	in any of the three states; i.e Secure-GFN, Shared-GFN,
 	and Normal-GFN.	The VM never executes in this state
 	in supervisor-mode.

 3. Memory slot State.
    ------------------
  	The state of a memory slot mirrors the state of the
  	VM the memory slot is associated with.

 4. VM State transition.
    --------------------

  A VM always starts in Normal Mode.

  H_SVM_INIT_START moves the VM into transient state. During this
  time the Ultravisor may request some of its GFNs to be shared or
  secured. So its GFNs can be in one of the three GFN states.

  H_SVM_INIT_DONE moves the VM entirely from transient state to
  secure-state. At this point any left-over normal-GFNs are
  transitioned to Secure-GFN.

  H_SVM_INIT_ABORT moves the transient VM back to normal VM.
  All its GFNs are moved to Normal-GFNs.

  UV_TERMINATE transitions the secure-VM back to normal-VM. All
  the secure-GFN and shared-GFNs are tranistioned to normal-GFN
  Note: The contents of the normal-GFN is undefined at this point.

 5. GFN state implementation:
    -------------------------

 Secure GFN is associated with a secure-PFN; also called uvmem_pfn,
 when the GFN is paged-in. Its pfn[] has KVMPPC_GFN_UVMEM_PFN flag
 set, and contains the value of the secure-PFN.
 It is associated with a normal-PFN; also called mem_pfn, when
 the GFN is pagedout. Its pfn[] has KVMPPC_GFN_MEM_PFN flag set.
 The value of the normal-PFN is not tracked.

 Shared GFN is associated with a normal-PFN. Its pfn[] has
 KVMPPC_UVMEM_SHARED_PFN flag set. The value of the normal-PFN
 is not tracked.

 Normal GFN is associated with normal-PFN. Its pfn[] has
 no flag set. The value of the normal-PFN is not tracked.

 6. Life cycle of a GFN
    --------------------
 --------------------------------------------------------------
 |        |     Share  |  Unshare | SVM       |H_SVM_INIT_DONE|
 |        |operation   |operation | abort/    |               |
 |        |            |          | terminate |               |
 -------------------------------------------------------------
 |        |            |          |           |               |
 | Secure |     Shared | Secure   |Normal     |Secure         |
 |        |            |          |           |               |
 | Shared |     Shared | Secure   |Normal     |Shared         |
 |        |            |          |           |               |
 | Normal |     Shared | Secure   |Normal     |Secure         |
 --------------------------------------------------------------

 7. Life cycle of a VM
    --------------------
 --------------------------------------------------------------------
 |         |  start    |  H_SVM_  |H_SVM_   |H_SVM_     |UV_SVM_    |
 |         |  VM       |INIT_START|INIT_DONE|INIT_ABORT |TERMINATE  |
 |         |           |          |         |           |           |
 --------- ----------------------------------------------------------
 |         |           |          |         |           |           |
 | Normal  | Normal    | Transient|Error    |Error      |Normal     |
 |         |           |          |         |           |           |
 | Secure  |   Error   | Error    |Error    |Error      |Normal     |
 |         |           |          |         |           |           |
 |Transient|   N/A     | Error    |Secure   |Normal     |Normal     |
 --------------------------------------------------------------------

************************************************************************

Reviewed-by: Bharata B Rao <bharata@linux.ibm.com>
Reviewed-by: Thiago Jung Bauermann <bauerman@linux.ibm.com>
Signed-off-by: Ram Pai <linuxram@us.ibm.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2020-07-28 12:34:52 +10:00
Ram Pai
2027a24a75 KVM: PPC: Book3S HV: Disable page merging in H_SVM_INIT_START
Page-merging of pages in memory-slots associated with a Secure VM
is disabled in H_SVM_PAGE_IN handler.

This operation should have been done the much earlier; the moment the VM
is initiated for secure-transition. Delaying this operation increases
the probability for those pages to acquire new references, making it
impossible to migrate those pages in H_SVM_PAGE_IN handler.

Disable page-migration in H_SVM_INIT_START handling.

Reviewed-by: Bharata B Rao <bharata@linux.ibm.com>
Signed-off-by: Ram Pai <linuxram@us.ibm.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2020-07-28 12:34:52 +10:00
Ram Pai
48908a3833 KVM: PPC: Book3S HV: Fix function definition in book3s_hv_uvmem.c
Without this fix, git is confused. It generates wrong
function context for code changes in subsequent patches.
Weird, but true.

Signed-off-by: Ram Pai <linuxram@us.ibm.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2020-07-28 12:34:52 +10:00
Al Viro
47e12855a9 powerpc: switch to ->regset_get()
Note: compat variant of REGSET_TM_CGPR is almost certainly wrong;
it claims to be 48*64bit, but just as compat REGSET_GPR it stores
44*32bit of (truncated) registers + 4 32bit zeros... followed by
48 more 32bit zeroes.  Might be too late to change - it's a userland
ABI, after all ;-/

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-07-27 14:31:07 -04:00
Al Viro
7a896028ad kill elf_fpxregs_t
all uses are conditional upon ELF_CORE_COPY_XFPREGS, which has not
been defined on any architecture since 2010

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-07-27 14:29:23 -04:00
Rafael J. Wysocki
80e3036866 Merge back cpufreq material for v5.9. 2020-07-27 12:34:55 +02:00
Michael Ellerman
5f987caec5 powerpc/fadump: Fix build error with CONFIG_PRESERVE_FA_DUMP=y
skiroot_defconfig fails:

arch/powerpc/kernel/fadump.c:48:17: error: ‘cpus_in_fadump’ defined but not used
   48 | static atomic_t cpus_in_fadump;

Fix it by moving the definition into the #ifdef where it's used.

Fixes: ba608c4fa1 ("powerpc/fadump: fix race between pstore write and fadump crash trigger")
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200727070341.595634-1-mpe@ellerman.id.au
2020-07-27 17:04:54 +10:00
Nicholas Piggin
909adfc66b powerpc/64s/hash: Fix hash_preload running with interrupts enabled
Commit 2f92447f9f ("powerpc/book3s64/hash: Use the pte_t address from the
caller") removed the local_irq_disable from hash_preload, but it was
required for more than just the page table walk: the hash pte busy bit is
effectively a lock which may be taken in interrupt context, and the local
update flag test must not be preempted before it's used.

This solves apparent lockups with perf interrupting __hash_page_64K. If
get_perf_callchain then also takes a hash fault on the same page while it
is already locked, it will loop forever taking hash faults, which looks like
this:

  cpu 0x49e: Vector: 100 (System Reset) at [c00000001a4f7d70]
      pc: c000000000072dc8: hash_page_mm+0x8/0x800
      lr: c00000000000c5a4: do_hash_page+0x24/0x38
      sp: c0002ac1cc69ac70
     msr: 8000000000081033
    current = 0xc0002ac1cc602e00
    paca    = 0xc00000001de1f280   irqmask: 0x03   irq_happened: 0x01
      pid   = 20118, comm = pread2_processe
  Linux version 5.8.0-rc6-00345-g1fad14f18bc6
  49e:mon> t
  [c0002ac1cc69ac70] c00000000000c5a4 do_hash_page+0x24/0x38 (unreliable)
  --- Exception: 300 (Data Access) at c00000000008fa60 __copy_tofrom_user_power7+0x20c/0x7ac
  [link register   ] c000000000335d10 copy_from_user_nofault+0xf0/0x150
  [c0002ac1cc69af70] c00032bf9fa3c880 (unreliable)
  [c0002ac1cc69afa0] c000000000109df0 read_user_stack_64+0x70/0xf0
  [c0002ac1cc69afd0] c000000000109fcc perf_callchain_user_64+0x15c/0x410
  [c0002ac1cc69b060] c000000000109c00 perf_callchain_user+0x20/0x40
  [c0002ac1cc69b080] c00000000031c6cc get_perf_callchain+0x25c/0x360
  [c0002ac1cc69b120] c000000000316b50 perf_callchain+0x70/0xa0
  [c0002ac1cc69b140] c000000000316ddc perf_prepare_sample+0x25c/0x790
  [c0002ac1cc69b1a0] c000000000317350 perf_event_output_forward+0x40/0xb0
  [c0002ac1cc69b220] c000000000306138 __perf_event_overflow+0x88/0x1a0
  [c0002ac1cc69b270] c00000000010cf70 record_and_restart+0x230/0x750
  [c0002ac1cc69b620] c00000000010d69c perf_event_interrupt+0x20c/0x510
  [c0002ac1cc69b730] c000000000027d9c performance_monitor_exception+0x4c/0x60
  [c0002ac1cc69b750] c00000000000b2f8 performance_monitor_common_virt+0x1b8/0x1c0
  --- Exception: f00 (Performance Monitor) at c0000000000cb5b0 pSeries_lpar_hpte_insert+0x0/0x160
  [link register   ] c0000000000846f0 __hash_page_64K+0x210/0x540
  [c0002ac1cc69ba50] 0000000000000000 (unreliable)
  [c0002ac1cc69bb00] c000000000073ae0 update_mmu_cache+0x390/0x3a0
  [c0002ac1cc69bb70] c00000000037f024 wp_page_copy+0x364/0xce0
  [c0002ac1cc69bc20] c00000000038272c do_wp_page+0xdc/0xa60
  [c0002ac1cc69bc70] c0000000003857bc handle_mm_fault+0xb9c/0x1b60
  [c0002ac1cc69bd50] c00000000006c434 __do_page_fault+0x314/0xc90
  [c0002ac1cc69be20] c00000000000c5c8 handle_page_fault+0x10/0x2c
  --- Exception: 300 (Data Access) at 00007fff8c861fe8
  SP (7ffff6b19660) is in userspace

Fixes: 2f92447f9f ("powerpc/book3s64/hash: Use the pte_t address from the caller")
Reported-by: Athira Rajeev <atrajeev@linux.vnet.ibm.com>
Reported-by: Anton Blanchard <anton@ozlabs.org>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200727060947.10060-1-npiggin@gmail.com
2020-07-27 17:02:09 +10:00
Randy Dunlap
86052e407e powerpc/powernv/pci.h: delete duplicated word
Drop the repeated word "for".

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200726003809.20454-10-rdunlap@infradead.org
2020-07-27 00:01:32 +10:00
Randy Dunlap
3b56ed4b46 powerpc/smu.h: delete duplicated word
Drop the repeated word "the".

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200726003809.20454-9-rdunlap@infradead.org
2020-07-27 00:01:32 +10:00
Randy Dunlap
850659392a powerpc/reg.h: delete duplicated word
Drop the repeated word "a".

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200726003809.20454-8-rdunlap@infradead.org
2020-07-27 00:01:32 +10:00
Randy Dunlap
db10f55000 powerpc/ppc_asm.h: delete duplicated word
Drop the repeated word "in".

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200726003809.20454-7-rdunlap@infradead.org
2020-07-27 00:01:32 +10:00
Randy Dunlap
028cc22d29 powerpc/hw_breakpoint.h: delete duplicated word
Drop the repeated word "the".

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200726003809.20454-6-rdunlap@infradead.org
2020-07-27 00:01:31 +10:00
Randy Dunlap
8965aa4b68 powerpc/epapr_hcalls.h: delete duplicated words
Drop the repeated words "file" and "the".

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200726003809.20454-5-rdunlap@infradead.org
2020-07-27 00:01:31 +10:00
Randy Dunlap
dc9bf323d6 powerpc/cputime.h: delete duplicated word
Drop the repeated word "use".

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200726003809.20454-4-rdunlap@infradead.org
2020-07-27 00:01:31 +10:00
Randy Dunlap
92be1fca08 powerpc/book3s/radix-4k.h: delete duplicated word
Drop the repeated word "per".

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200726003809.20454-3-rdunlap@infradead.org
2020-07-27 00:01:31 +10:00
Randy Dunlap
10a4a016d6 powerpc/book3s/mmu-hash.h: delete duplicated word
Drop the repeated word "below".

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200726003809.20454-2-rdunlap@infradead.org
2020-07-27 00:01:31 +10:00
Li RongQing
e280261897 powerpc/lib: remove memcpy_flushcache redundant return
Align it with other architectures and none of the callers has
been interested its return

Signed-off-by: Li RongQing <lirongqing@baidu.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/1556278590-14727-1-git-send-email-lirongqing@baidu.com
2020-07-27 00:01:31 +10:00
Christophe Leroy
e54e30bca4 powerpc/ptdump: Refactor update of pg_state
In note_page(), the pg_state is updated the same way in two places.

Add note_page_update_state() to do it.

Also include the display of boundary markers there as it is missing
"no level" leg, leading to a mismatch when the first two markers
are at the same address and the first displayed area uses that
address.

Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/a284a809f01c705bbaab303b06fda216f147a99a.1593429426.git.christophe.leroy@csgroup.eu
2020-07-27 00:01:31 +10:00
Christophe Leroy
846feeace5 powerpc/ptdump: Refactor update of st->last_pa
st->last_pa is always updated in note_page() so it can
be done outside the if/elseif/else block.

Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/610d6b1a60ad0bedef865a90153c1110cfaa507e.1593429426.git.christophe.leroy@csgroup.eu
2020-07-27 00:01:30 +10:00
Christophe Leroy
6ca055322d powerpc/32s: Use dedicated segment for modules with STRICT_KERNEL_RWX
When STRICT_KERNEL_RWX is set, we want to set NX bit on vmalloc
segments. But modules require exec.

Use a dedicated segment for modules. There is not much space
above kernel, and we don't waste vmalloc space to do alignment.
Therefore, we take the segment before PAGE_OFFSET for modules.

Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/eb8faba9148b6cf17c696ba776b4e8ee2f6313bf.1593428200.git.christophe.leroy@csgroup.eu
2020-07-27 00:01:30 +10:00
Christophe Leroy
f1a1f7a15e powerpc/32s: Kernel space starts at TASK_SIZE
Kernel space starts at TASK_SIZE. Select kernel page table
when address is over TASK_SIZE.

Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/893425e32cd0a003539573b2d115e0ffa98bc26c.1593428200.git.christophe.leroy@csgroup.eu
2020-07-27 00:01:30 +10:00
Christophe Leroy
b6be1bb7f7 powerpc/32: Set user/kernel boundary at TASK_SIZE instead of PAGE_OFFSET
User space stops at TASK_SIZE. At the moment, kernel space starts
at PAGE_OFFSET.

In order to use space between TASK_SIZE and PAGE_OFFSET for modules,
make TASK_SIZE the limit between user and kernel space.

Note that fault.c already considers TASK_SIZE as the boundary between
user and kernel space.

Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/b38b52cd8dabbb56fbd6f9219d6f3cdccbb43b44.1593428200.git.christophe.leroy@csgroup.eu
2020-07-27 00:01:30 +10:00
Christophe Leroy
c496433197 powerpc/32s: Only leave NX unset on segments used for modules
Instead of leaving NX unset on all segments above the start
of vmalloc space, only leave NX unset on segments used for
modules.

Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/7172c0f5253419315e434a1816ee3d6ed6505bc0.1593428200.git.christophe.leroy@csgroup.eu
2020-07-27 00:01:30 +10:00
Christophe Leroy
7fbc22ce29 powerpc: Use MODULES_VADDR if defined
In order to allow allocation of modules outside of vmalloc space,
use MODULES_VADDR and MODULES_END when MODULES_VADDR is defined.

Redefine module_alloc() when MODULES_VADDR defined.
Unmap corresponding KASAN shadow memory.

Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/7ecf5fff1eef67d450e73fc412b6ec3818483d75.1593428200.git.christophe.leroy@csgroup.eu
2020-07-27 00:01:30 +10:00
Christophe Leroy
ccc8fcf72a powerpc/lib: Prepare code-patching for modules allocated outside vmalloc space
Use is_vmalloc_or_module_addr() instead of is_vmalloc_addr()

Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/7d884db0e5a6f521331639d8c0f13e520d5a4fef.1593428200.git.christophe.leroy@csgroup.eu
2020-07-27 00:01:30 +10:00
Wei Yongjun
19a551b254 powerpc/papr_scm: Make some symbols static
The sparse tool complains as follows:

arch/powerpc/platforms/pseries/papr_scm.c:97:1: warning:
 symbol 'papr_nd_regions' was not declared. Should it be static?
arch/powerpc/platforms/pseries/papr_scm.c:98:1: warning:
 symbol 'papr_ndr_lock' was not declared. Should it be static?

Those variables are not used outside of papr_scm.c, so this
commit marks them static.

Fixes: 85343a8da2 ("powerpc/papr/scm: Add bad memory ranges to nvdimm bad ranges")
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200725091949.75234-1-weiyongjun1@huawei.com
2020-07-27 00:01:30 +10:00
Bill Wendling
faedc38012 powerpc/64s: allow for clang's objdump differences
Clang's objdump emits slightly different output from GNU's objdump,
causing a list of warnings to be emitted during relocatable builds.
E.g., clang's objdump emits this:

   c000000000000004: 2c 00 00 48  b  0xc000000000000030
   ...
   c000000000005c6c: 10 00 82 40  bf 2, 0xc000000000005c7c

while GNU objdump emits:

   c000000000000004: 2c 00 00 48  b    c000000000000030 <__start+0x30>
   ...
   c000000000005c6c: 10 00 82 40  bne  c000000000005c7c <masked_interrupt+0x3c>

Adjust llvm-objdump's output to remove the extraneous '0x' and convert
'bf' and 'bt' to 'bne' and 'beq' resp. to more closely match GNU
objdump's output.

Note that clang's objdump doesn't yet output the relocation symbols on
PPC.

Signed-off-by: Bill Wendling <morbo@google.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/191c67db31264b69cf6b566fd69851beb3dd0abb.1595630874.git.morbo@google.com
2020-07-27 00:01:29 +10:00
Nicholas Piggin
49a7d46a06 powerpc: Implement smp_cond_load_relaxed()
This implements smp_cond_load_relaxed() with the slowpath busy loop
using the preferred SMT priority pattern.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Acked-by: Waiman Long <longman@redhat.com>
[mpe: Make it 64-bit only to fix build errors on 32-bit]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200724131423.1362108-7-npiggin@gmail.com
2020-07-27 00:01:29 +10:00
Nicholas Piggin
2f6560e652 powerpc/qspinlock: Optimised atomic_try_cmpxchg_lock() that adds the lock hint
This brings the behaviour of the uncontended fast path back to roughly
equivalent to simple spinlocks -- a single atomic op with lock hint.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Acked-by: Waiman Long <longman@redhat.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200724131423.1362108-6-npiggin@gmail.com
2020-07-27 00:01:29 +10:00
Nicholas Piggin
20c0e8269e powerpc/pseries: Implement paravirt qspinlocks for SPLPAR
This implements the generic paravirt qspinlocks using H_PROD and
H_CONFER to kick and wait.

This uses an un-directed yield to any CPU rather than the directed
yield to a pre-empted lock holder that paravirtualised simple
spinlocks use, that requires no kick hcall. This is something that
could be investigated and improved in future.

Performance results can be found in the commit which added queued
spinlocks.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Waiman Long <longman@redhat.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200724131423.1362108-5-npiggin@gmail.com
2020-07-27 00:01:29 +10:00
Nicholas Piggin
aa65ff6b18 powerpc/64s: Implement queued spinlocks and rwlocks
These have shown significantly improved performance and fairness when
spinlock contention is moderate to high on very large systems.

With this series including subsequent patches, on a 16 socket 1536
thread POWER9, a stress test such as same-file open/close from all
CPUs gets big speedups, 11620op/s aggregate with simple spinlocks vs
384158op/s (33x faster), where the difference in throughput between
the fastest and slowest thread goes from 7x to 1.4x.

Thanks to the fast path being identical in terms of atomics and
barriers (after a subsequent optimisation patch), single threaded
performance is not changed (no measurable difference).

On smaller systems, performance and fairness seems to be generally
improved. Using dbench on tmpfs as a test (that starts to run into
kernel spinlock contention), a 2-socket OpenPOWER POWER9 system was
tested with bare metal and KVM guest configurations. Results can be
found here:

https://github.com/linuxppc/issues/issues/305#issuecomment-663487453

Observations are:

- Queued spinlocks are equal when contention is insignificant, as
  expected and as measured with microbenchmarks.

- When there is contention, on bare metal queued spinlocks have better
  throughput and max latency at all points.

- When virtualised, queued spinlocks are slightly worse approaching
  peak throughput, but significantly better throughput and max latency
  at all points beyond peak, until queued spinlock maximum latency
  rises when clients are 2x vCPUs.

The regressions haven't been analysed very well yet, there are a lot
of things that can be tuned, particularly the paravirtualised locking,
but the numbers already look like a good net win even on relatively
small systems.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Waiman Long <longman@redhat.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200724131423.1362108-4-npiggin@gmail.com
2020-07-27 00:01:23 +10:00
Nicholas Piggin
12d0b9d6c8 powerpc: Move spinlock implementation to simple_spinlock
To prepare for queued spinlocks. This is a simple rename except to
update preprocessor guard name and a file reference.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Acked-by: Waiman Long <longman@redhat.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200724131423.1362108-3-npiggin@gmail.com
2020-07-26 23:34:26 +10:00
Nicholas Piggin
20d444d06f powerpc/pseries: Move some PAPR paravirt functions to their own file
These functions will be used by the queued spinlock implementation,
and may be useful elsewhere too, so move them out of spinlock.h.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Acked-by: Waiman Long <longman@redhat.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200724131423.1362108-2-npiggin@gmail.com
2020-07-26 23:34:26 +10:00
Srikar Dronamraju
dbce456280 powerpc/numa: Limit possible nodes to within num_possible_nodes
MAX_NUMNODES is a theoretical maximum number of nodes thats is
supported by the kernel. Device tree properties exposes the number of
possible nodes on the current platform. The kernel would detected this
and would use it for most of its resource allocations. If the platform
now increases the nodes to over what was already exposed, then it may
lead to inconsistencies. Hence limit it to the already exposed nodes.

Suggested-by: Nathan Lynch <nathanl@linux.ibm.com>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200724105809.24733-1-srikar@linux.vnet.ibm.com
2020-07-26 23:34:25 +10:00
Athira Rajeev
65156f2b1d powerpc/perf: Initialize power10 PMU registers in cpu setup routine
Initialize Monitor Mode Control Register 3 (MMCR3)
SPR which is new in power10. For PowerISA v3.1, BHRB disable
is controlled via Monitor Mode Control Register A (MMCRA) bit,
namely "BHRB Recording Disable (BHRBRD)". This patch also initializes
MMCRA BHRBRD to disable BHRB feature at boot for power10.

Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Athira Rajeev <atrajeev@linux.vnet.ibm.com>
Reviewed-by: Jordan Niethe <jniethe5@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/1595489557-2047-1-git-send-email-atrajeev@linux.vnet.ibm.com
2020-07-26 23:34:23 +10:00
Oliver O'Halloran
84d8505ed1 powerpc/powernv/sriov: Remove vfs_expanded
Previously iov->vfs_expanded was used for two purposes.

1) To work out how much we need to multiple the per-VF BAR size to figure
   out the total space required for the IOV BAR.

2) To indicate that IOV is not usable with this device (vfs_expanded == 0).

We don't really need the field for either since the multiple in 1) is
always the number PEs supported by the PHB. Similarly, we don't really need
it in 2) either since the IOV data field will be NULL if we can't use IOV
with the device.

Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200722065715.1432738-16-oohall@gmail.com
2020-07-26 23:34:23 +10:00
Oliver O'Halloran
4c51f3e1e8 powerpc/powernv/sriov: Make single PE mode a per-BAR setting
Using single PE BARs to map an SR-IOV BAR is really a choice about what
strategy to use when mapping a BAR. It doesn't make much sense for this to
be a global setting since a device might have one large BAR which needs to
be mapped with single PE windows and another smaller BAR that can be mapped
with a regular segmented window. Make the segmented vs single decision a
per-BAR setting and clean up the logic that decides which mode to use.

Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200722065715.1432738-15-oohall@gmail.com
2020-07-26 23:34:23 +10:00
Oliver O'Halloran
a0be516f81 powerpc/powernv/sriov: Refactor M64 BAR setup
Split up the logic so that we have one branch that handles setting up a
segmented window and another that handles setting up single PE windows for
each VF.

Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200722065715.1432738-14-oohall@gmail.com
2020-07-26 23:34:23 +10:00
Oliver O'Halloran
39efc03e3e powerpc/powernv/sriov: Move M64 BAR allocation into a helper
I want to refactor the loop this code is currently inside of. Hoist it on
out.

Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200722065715.1432738-13-oohall@gmail.com
2020-07-26 23:34:23 +10:00
Oliver O'Halloran
052da31d45 powerpc/powernv/sriov: De-indent setup and teardown
Remove the IODA2 PHB checks. We already assume IODA2 in several places so
there's not much point in wrapping most of the setup and teardown process
in an if block.

Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200722065715.1432738-12-oohall@gmail.com
2020-07-26 23:34:23 +10:00
Oliver O'Halloran
d29a2488d2 powerpc/powernv/sriov: Drop iov->pe_num_map[]
Currently the iov->pe_num_map[] does one of two things depending on
whether single PE mode is being used or not. When it is, this contains an
array which maps a vf_index to the corresponding PE number. When single PE
mode is not being used this contains a scalar which is the base PE for the
set of enabled VFs (for for VFn is base + n).

The array was necessary because when calling pnv_ioda_alloc_pe() there is
no guarantee that the allocated PEs would be contigious. We can now
allocate contigious blocks of PEs so this is no longer an issue. This
allows us to drop the if (single_mode) {} .. else {} block scattered
through the SR-IOV code which is a nice clean up.

This also fixes a bug in pnv_pci_sriov_disable() which is the non-atomic
bitmap_clear() to manipulate the PE allocation map. Other users of the map
assume it will be accessed with atomic ops.

Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200722065715.1432738-11-oohall@gmail.com
2020-07-26 23:34:23 +10:00
Oliver O'Halloran
a4bc676ed5 powerpc/powernv/pci: Refactor pnv_ioda_alloc_pe()
Rework the PE allocation logic to allow allocating blocks of PEs rather
than individually. We'll use this to allocate contigious blocks of PEs for
the SR-IOVs.

This patch also adds code to pnv_ioda_alloc_pe() and pnv_ioda_reserve_pe() to
use the existing, but unused, phb->pe_alloc_mutex. Currently these functions
use atomic bit ops to release a currently allocated PE number. However,
the pnv_ioda_alloc_pe() wants to have exclusive access to the bit map while
scanning for hole large enough to accomodate the allocation size.

Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200722065715.1432738-10-oohall@gmail.com
2020-07-26 23:34:22 +10:00
Oliver O'Halloran
a610d35cc8 powerpc/powernv/sriov: Factor out M64 BAR setup
The sequence required to use the single PE BAR mode is kinda janky and
requires a little explanation. The API was designed with P7-IOC style
windows where the setup process is something like:

1. Configure the window start / end address
2. Enable the window
3. Map the segments of each window to the PE

For Single PE BARs the process is:

1. Set the PE for segment zero on a disabled window
2. Set the range
3. Enable the window

Move the OPAL calls into their own helper functions where the quirks can be
contained.

Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200722065715.1432738-9-oohall@gmail.com
2020-07-26 23:34:22 +10:00
Oliver O'Halloran
ad9add529d powerpc/powernv/sriov: Simplify used window tracking
No need for the multi-dimensional arrays, just use a bitmap.

Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200722065715.1432738-8-oohall@gmail.com
2020-07-26 23:34:22 +10:00
Oliver O'Halloran
fac248f811 powerpc/powernv/sriov: Rename truncate_iov
This prevents SR-IOV being used by making the SR-IOV BAR resources
unallocatable. Rename it to reflect what it actually does.

Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200722065715.1432738-7-oohall@gmail.com
2020-07-26 23:34:22 +10:00
Oliver O'Halloran
ff79e11af0 powerpc/powernv/sriov: Explain how SR-IOV works on PowerNV
SR-IOV support on PowerNV is a byzantine maze of hooks. I have no idea
how anyone is supposed to know how it works except through a lot of
stuffering. Write up some docs about the overall story to help out
the next sucker^Wperson who needs to tinker with it.

Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200722065715.1432738-6-oohall@gmail.com
2020-07-26 23:34:22 +10:00
Oliver O'Halloran
37b59ef08c powerpc/powernv/sriov: Move SR-IOV into a separate file
pci-ioda.c is getting a bit unwieldly due to the amount of stuff jammed in
there. The SR-IOV support can be extracted easily enough and is mostly
standalone, so move it into a separate file.

This patch also moves the PowerNV SR-IOV specific fields from pci_dn and
moves them into a platform specific structure. I'm not sure how they ended
up in there in the first place, but leaking platform specifics into common
code has proven to be a terrible idea so far so lets stop doing that.

Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200722065715.1432738-5-oohall@gmail.com
2020-07-26 23:34:22 +10:00
Oliver O'Halloran
369633654f powerpc/powernv/pci: Initialise M64 for IODA1 as a 1-1 window
We pre-configure the m64 window for IODA1 as a 1-1 segment-PE mapping,
similar to PHB3. Currently the actual mapping of segments occurs in
pnv_ioda_pick_m64_pe(), but we can move it into pnv_ioda1_init_m64() and
drop the IODA1 specific code paths in the PE setup / teardown.

Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200722065715.1432738-4-oohall@gmail.com
2020-07-26 23:34:22 +10:00
Oliver O'Halloran
01e12629af powerpc/powernv/pci: Add explicit tracking of the DMA setup state
There's an optimisation in the PE setup which skips performing DMA
setup for a PE if we only have bridges in a PE. The assumption being
that only "real" devices will DMA to system memory, which is probably
fair. However, if we start off with only bridge devices in a PE then
add a non-bridge device the new device won't be able to use DMA because
we never configured it.

Fix this (admittedly pretty weird) edge case by tracking whether we've done
the DMA setup for the PE or not. If a non-bridge device is added to the PE
(via rescan or hotplug, or whatever) we can set up DMA on demand.

This also means the only remaining user of the old "DMA Weight" code is
the IODA1 DMA setup code that it was originally added for, which is good.

Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200722065715.1432738-3-oohall@gmail.com
2020-07-26 23:34:22 +10:00
Oliver O'Halloran
7a52ffabe8 powerpc/powernv/pci: Always tear down DMA windows on PE release
Currently we have these two functions:

	pnv_pci_ioda2_release_dma_pe(), and
	pnv_pci_ioda2_release_pe_dma()

The first is used when tearing down VF PEs and the other is used for normal
devices. There's very little difference between the two though. The latter
(non-VF) will skip a call to pnv_pci_ioda2_unset_window() unless
CONFIG_IOMMU_API=y is set. There's no real point in doing this so fold the
two together.

Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200722065715.1432738-2-oohall@gmail.com
2020-07-26 23:34:21 +10:00
Oliver O'Halloran
5609ffddd1 powerpc/powernv/pci: Add pci_bus_to_pnvhb() helper
Add a helper to go from a pci_bus structure to the pnv_phb that hosts
that bus. There's a lot of instances of the following pattern:

	struct pci_controller *hose = pci_bus_to_host(pdev->bus);
	struct pnv_phb *phb = hose->private_data;

Without any other uses of the pci_controller inside the function. This
is hard to read since it requires you to memorise the contents of the
private data fields and kind of error prone since it involves blindly
assigning a void pointer. Add a helper to make it more concise and
explicit.

Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200722065715.1432738-1-oohall@gmail.com
2020-07-26 23:34:21 +10:00
Oliver O'Halloran
a131bfc69b powerpc/eeh: Move PE tree setup into the platform
The EEH core has a concept of a "PE tree" to support PowerNV. The PE tree
follows the PCI bus structures because a reset asserted on an upstream
bridge will be propagated to the downstream bridges. On pseries there's a
1-1 correspondence between what the guest sees are a PHB and a PE so the
"tree" is really just a single node.

Current the EEH core is reponsible for setting up this PE tree which it
does by traversing the pci_dn tree. The structure of the pci_dn tree
matches the bus tree on PowerNV which leads to the PE tree being "correct"
this setup method doesn't make a whole lot of sense and it's actively
confusing for the pseries case where it doesn't really do anything.

We want to remove the dependence on pci_dn anyway so this patch move
choosing where to insert a new PE into the platform code rather than
being part of the generic EEH code. For PowerNV this simplifies the
tree building logic and removes the use of pci_dn. For pseries we
keep the existing logic. I'm not really convinced it does anything
due to the 1-1 PE-to-PHB correspondence so every device under that
PHB should be in the same PE, but I'd rather not remove it entirely
until we've had a chance to look at it more deeply.

Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200725081231.39076-14-oohall@gmail.com
2020-07-26 23:34:21 +10:00
Oliver O'Halloran
31595ae5ae powerpc/eeh: Drop pdn use in eeh_pe_tree_insert()
This is mostly just to make the subsequent diffs less noisy. No functional
changes.

One thing that needs calling out is the removal of the "config_addr"
variable and replacing it with edev->bdfn. The contents of edev->bdfn are
the same, however it's worth pointing out that what RTAS calls a
"config_addr" isn't the same as the bdfn. The config_addr is supposed to
be: <bus><devfn><reg> with each field being an 8 bit number. Various parts
of the EEH code use BDFN and "config_addr" as interchangeable quantities
even though they aren't really.

Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200725081231.39076-13-oohall@gmail.com
2020-07-26 23:34:21 +10:00
Oliver O'Halloran
d923ab7a96 powerpc/eeh: Rename eeh_{add_to|remove_from}_parent_pe()
The naming of eeh_{add_to|remove_from}_parent_pe() doesn't really reflect
what they actually do. If the PE referred to be edev->pe_config_addr
already exists under that PHB then the edev is added to that PE. However,
if the PE doesn't exist the a new one is created for the edev.

The bulk of the implementation of eeh_add_to_parent_pe() covers that
second case. Similarly, most of eeh_remove_from_parent_pe() is
determining when it's safe to delete a PE.

Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200725081231.39076-12-oohall@gmail.com
2020-07-26 23:34:21 +10:00
Oliver O'Halloran
768a42845b powerpc/eeh: Remove class code field from edev
The edev->class_code field is never referenced anywhere except for the
platform specific probe functions. The same information is available in
the pci_dev for PowerNV and in the pci_dn on pseries so we can remove
the field.

Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200725081231.39076-11-oohall@gmail.com
2020-07-26 23:34:21 +10:00
Oliver O'Halloran
1a303d8844 powerpc/eeh: Remove spurious use of pci_dn in eeh_dump_dev_log
Retrieve the domain, bus, device, and function numbers from the edev.

Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200725081231.39076-10-oohall@gmail.com
2020-07-26 23:34:21 +10:00
Oliver O'Halloran
17d2a48704 powerpc/eeh: Pass eeh_dev to eeh_ops->{read|write}_config()
Mechanical conversion of the eeh_ops interfaces to use eeh_dev to reference
a specific device rather than pci_dn. No functional changes.

Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200725081231.39076-9-oohall@gmail.com
2020-07-26 23:34:21 +10:00
Oliver O'Halloran
8225d543dc powerpc/eeh: Pass eeh_dev to eeh_ops->resume_notify()
Mechanical conversion of the eeh_ops interfaces to use eeh_dev to reference
a specific device rather than pci_dn. No functional changes.

Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200725081231.39076-8-oohall@gmail.com
2020-07-26 23:34:20 +10:00
Oliver O'Halloran
0c2c76523c powerpc/eeh: Pass eeh_dev to eeh_ops->restore_config()
Mechanical conversion of the eeh_ops interfaces to use eeh_dev to reference
a specific device rather than pci_dn. No functional changes.

Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200725081231.39076-7-oohall@gmail.com
2020-07-26 23:34:20 +10:00
Oliver O'Halloran
21b43bd59c powerpc/eeh: Remove VF config space restoration
There's a bunch of strange things about this code. First up is that none of
the fields being written to are functional for a VF. The SR-IOV
specification lists then as "Reserved, but OS should preserve" so writing
new values to them doesn't do anything and is clearly wrong from a
correctness perspective.

However, since VFs are designed to be managed by the OS there is an
argument to be made that we should be saving and restoring some parts of
config space. We already sort of do that by saving the first 64 bytes of
config space in the eeh_dev (see eeh_dev->config_space[]). This is
inadequate since it doesn't even consider saving and restoring the PCI
capability structures. However, this is a problem with EEH in general and
that needs to be fixed for non-VF devices too.

There's no real reason to keep around this around so delete it.

Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200725081231.39076-6-oohall@gmail.com
2020-07-26 23:34:20 +10:00
Oliver O'Halloran
a40db93431 powerpc/eeh: Kill off eeh_ops->get_pe_addr()
This is used in precisely one place which is in pseries specific platform
code.  There's no need to have the callback in eeh_ops since the platform
chooses the EEH PE addresses anyway. The PowerNV implementation has always
been a stub too so remove it.

Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200725081231.39076-5-oohall@gmail.com
2020-07-26 23:34:20 +10:00
Oliver O'Halloran
c408ce9075 powerpc/pseries: Stop using pdn->pe_number
The pci_dn->pe_number field is mainly used to track the IODA PE number of a
device on PowerNV. At some point it grew a user in the pseries SR-IOV
support which muddies the waters a bit, so remove it.

Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200725081231.39076-4-oohall@gmail.com
2020-07-26 23:34:20 +10:00
Oliver O'Halloran
dffa91539e powerpc/eeh: Move vf_index out of pci_dn and into eeh_dev
Drivers that do not support the PCI error handling callbacks are handled by
tearing down the device and re-probing them. If the device being removed is
a virtual function then we need to know the VF index so it can be removed
using the pci_iov_{add|remove}_virtfn() API.

Currently this is handled by looking up the pci_dn, and using the vf_index
that was stashed there when the pci_dn for the VF was created in
pcibios_sriov_enable(). We would like to eliminate the use of pci_dn
outside of pseries though so we need to provide the generic EEH code with
some other way to find the vf_index.

The easiest thing to do here is move the vf_index field out of pci_dn and
into eeh_dev.  Currently pci_dn and eeh_dev are allocated and initialized
together so this is a fairly minimal change in preparation for splitting
pci_dn and eeh_dev in the future.

Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200725081231.39076-3-oohall@gmail.com
2020-07-26 23:34:20 +10:00
Oliver O'Halloran
d74ee8e9d1 powerpc/eeh: Remove eeh_dev.c
The only thing in this file is eeh_dev_init() which is allocates and
initialises an eeh_dev based on a pci_dn. This is only ever called from
pci_dn.c so move it into there and remove the file.

Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200725081231.39076-2-oohall@gmail.com
2020-07-26 23:34:20 +10:00
Oliver O'Halloran
475028efc7 powerpc/eeh: Remove eeh_dev_phb_init_dynamic()
This function is a one line wrapper around eeh_phb_pe_create() and despite
the name it doesn't create any eeh_dev structures. Replace it with direct
calls to eeh_phb_pe_create() since that does what it says on the tin
and removes a layer of indirection.

Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200725081231.39076-1-oohall@gmail.com
2020-07-26 23:34:19 +10:00
Ravi Bangoria
3f31e49dc4 powerpc/watchpoint: Remove 512 byte boundary
Power10 has removed 512 bytes boundary from match criteria i.e. the watch
range can cross 512 bytes boundary.

Note: ISA 3.1 Book III 9.4 match criteria includes 512 byte limit but that
is a documentation mistake and hopefully will be fixed in the next version
of ISA. Though, ISA 3.1 change log mentions about removal of 512B boundary:

  Multiple DEAW:
  Added a second Data Address Watchpoint. [H]DAR is
  set to the first byte of overlap. 512B boundary is
  removed.

Signed-off-by: Ravi Bangoria <ravi.bangoria@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200723090813.303838-11-ravi.bangoria@linux.ibm.com
2020-07-26 23:34:19 +10:00
Ravi Bangoria
deb2bd9bcc powerpc/watchpoint: Return available watchpoints dynamically
So far Book3S Powerpc supported only one watchpoint. Power10 is
introducing 2nd DAWR. Enable 2nd DAWR support for Power10.
Availability of 2nd DAWR will depend on CPU_FTR_DAWR1.

Signed-off-by: Ravi Bangoria <ravi.bangoria@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200723090813.303838-10-ravi.bangoria@linux.ibm.com
2020-07-26 23:34:19 +10:00
Ravi Bangoria
03f3e54abd powerpc/watchpoint: Guest support for 2nd DAWR hcall
2nd DAWR can be set/unset using H_SET_MODE hcall with resource value 5.
Enable powervm guest support with that. This has no effect on kvm guest
because kvm will return error if guest does hcall with resource value 5.

Signed-off-by: Ravi Bangoria <ravi.bangoria@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200723090813.303838-9-ravi.bangoria@linux.ibm.com
2020-07-26 23:34:19 +10:00
Ravi Bangoria
6f3fe297f9 powerpc/watchpoint: Rename current H_SET_MODE DAWR macro
Current H_SET_MODE hcall macro name for setting/resetting DAWR0 is
H_SET_MODE_RESOURCE_SET_DAWR. Add suffix 0 to macro name as well.

Signed-off-by: Ravi Bangoria <ravi.bangoria@linux.ibm.com>
Reviewed-by: Jordan Niethe <jniethe5@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200723090813.303838-8-ravi.bangoria@linux.ibm.com
2020-07-26 23:34:19 +10:00
Ravi Bangoria
8f45ca3f8b powerpc/watchpoint: Set CPU_FTR_DAWR1 based on pa-features bit
As per the PAPR, bit 0 of byte 64 in pa-features property indicates
availability of 2nd DAWR registers. i.e. If this bit is set, 2nd
DAWR is present, otherwise not. Host generally uses "cpu-features",
which masks "pa-features". But "cpu-features" are still not used for
guests and thus this change is mostly applicable for guests only.

Signed-off-by: Ravi Bangoria <ravi.bangoria@linux.ibm.com>
Tested-by: Jordan Niethe <jniethe5@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200723090813.303838-7-ravi.bangoria@linux.ibm.com
2020-07-26 23:34:19 +10:00
Ravi Bangoria
dc1cedca54 powerpc/dt_cpu_ftrs: Add feature for 2nd DAWR
Add new device-tree feature for 2nd DAWR. If this feature is present,
2nd DAWR is supported, otherwise not.

Signed-off-by: Ravi Bangoria <ravi.bangoria@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200723090813.303838-6-ravi.bangoria@linux.ibm.com
2020-07-26 23:34:19 +10:00
Ravi Bangoria
8f460a8175 powerpc/watchpoint: Enable watchpoint functionality on power10 guest
CPU_FTR_DAWR is by default enabled for host via CPU_FTRS_DT_CPU_BASE
(controlled by CONFIG_PPC_DT_CPU_FTRS). But cpu-features device-tree
node is not PAPR compatible and thus not yet used by kvm or pHyp
guests. Enable watchpoint functionality on power10 guest (both kvm
and powervm) by adding CPU_FTR_DAWR to CPU_FTRS_POWER10. Note that
this change does not enable 2nd DAWR support.

Signed-off-by: Ravi Bangoria <ravi.bangoria@linux.ibm.com>
Tested-by: Jordan Niethe <jniethe5@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200723090813.303838-5-ravi.bangoria@linux.ibm.com
2020-07-26 23:34:19 +10:00
Ravi Bangoria
f3c832f135 powerpc/watchpoint: Fix DAWR exception for CACHEOP
'ea' returned by analyse_instr() needs to be aligned down to cache
block size for CACHEOP instructions. analyse_instr() does not set
size for CACHEOP, thus size also needs to be calculated manually.

Fixes: 27985b2a64 ("powerpc/watchpoint: Don't ignore extraneous exceptions blindly")
Fixes: 74c6881019 ("powerpc/watchpoint: Prepare handler to handle more than one watchpoint")
Signed-off-by: Ravi Bangoria <ravi.bangoria@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200723090813.303838-4-ravi.bangoria@linux.ibm.com
2020-07-26 23:34:18 +10:00
Ravi Bangoria
f6780ce619 powerpc/watchpoint: Fix DAWR exception constraint
Pedro Miraglia Franco de Carvalho noticed that on p8/p9, DAR value is
inconsistent with different type of load/store. Like for byte,word
etc. load/stores, DAR is set to the address of the first byte of
overlap between watch range and real access. But for quadword load/
store it's sometime set to the address of the first byte of real
access whereas sometime set to the address of the first byte of
overlap. This issue has been fixed in p10. In p10(ISA 3.1), DAR is
always set to the address of the first byte of overlap. Commit 27985b2a64
("powerpc/watchpoint: Don't ignore extraneous exceptions blindly")
wrongly assumes that DAR is set to the address of the first byte of
overlap for all load/stores on p8/p9 as well. Fix that. With the fix,
we now rely on 'ea' provided by analyse_instr(). If analyse_instr()
fails, generate event unconditionally on p8/p9, and on p10 generate
event only if DAR is within a DAWR range.

Note: 8xx is not affected.

Fixes: 27985b2a64 ("powerpc/watchpoint: Don't ignore extraneous exceptions blindly")
Fixes: 74c6881019 ("powerpc/watchpoint: Prepare handler to handle more than one watchpoint")
Reported-by: Pedro Miraglia Franco de Carvalho <pedromfc@br.ibm.com>
Signed-off-by: Ravi Bangoria <ravi.bangoria@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200723090813.303838-3-ravi.bangoria@linux.ibm.com
2020-07-26 23:34:18 +10:00
Ravi Bangoria
3190ecbfee powerpc/watchpoint: Fix 512 byte boundary limit
Milton Miller reported that we are aligning start and end address to
wrong size SZ_512M. It should be SZ_512. Fix that.

While doing this change I also found a case where ALIGN() comparison
fails. Within a given aligned range, ALIGN() of two addresses does not
match when start address is pointing to the first byte and end address
is pointing to any other byte except the first one. But that's not true
for ALIGN_DOWN(). ALIGN_DOWN() of any two addresses within that range
will always point to the first byte. So use ALIGN_DOWN() instead of
ALIGN().

Fixes: e68ef121c1 ("powerpc/watchpoint: Use builtin ALIGN*() macros")
Reported-by: Milton Miller <miltonm@us.ibm.com>
Signed-off-by: Ravi Bangoria <ravi.bangoria@linux.ibm.com>
Tested-by: Jordan Niethe <jniethe5@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200723090813.303838-2-ravi.bangoria@linux.ibm.com
2020-07-26 23:34:18 +10:00
Aneesh Kumar K.V
269e829f48 powerpc/book3s64/pkey: Disable pkey on POWER6 and before
POWER6 only supports AMR update via privileged mode (MSR[PR] = 0,
SPRN_AMR=29) The PR=1 (userspace) alias for that SPR (SPRN_AMR=13) was
only supported from POWER7. Since we don't allow userspace modifying
of AMR value we should disable pkey support on P6 and before.

The hypervisor will still report pkey support via
"ibm,processor-storage-keys". Hence also check for P7 CPU_FTR bit to
decide on pkey support.

Fixes: f491fe3fb4 ("powerpc/book3s64/pkeys: Simplify the key initialization")
Reported-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200726132517.399076-1-aneesh.kumar@linux.ibm.com
2020-07-26 23:34:18 +10:00
David S. Miller
a57066b1a0 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
The UDP reuseport conflict was a little bit tricky.

The net-next code, via bpf-next, extracted the reuseport handling
into a helper so that the BPF sk lookup code could invoke it.

At the same time, the logic for reuseport handling of unconnected
sockets changed via commit efc6b6f6c3
which changed the logic to carry on the reuseport result into the
rest of the lookup loop if we do not return immediately.

This requires moving the reuseport_has_conns() logic into the callers.

While we are here, get rid of inline directives as they do not belong
in foo.c files.

The other changes were cases of more straightforward overlapping
modifications.

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-25 17:49:04 -07:00
Ingo Molnar
c84d53051f Linux 5.8-rc6
-----BEGIN PGP SIGNATURE-----
 
 iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAl8UzA4eHHRvcnZhbGRz
 QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGQ7cH/3v+Gv+SmHJCvaT2
 CSu0+7okVnYbY3UTb3hykk7/aOqb6284KjxR03r0CWFzsEsZVhC5pvvruASSiMQg
 Pi04sLqv6CsGLHd1n+pl4AUYEaxq6k4KS3uU3HHSWxrahDDApQoRUx2F8lpOxyj8
 RiwnoO60IMPA7IFJqzcZuFqsgdxqiiYvnzT461KX8Mrw6fyMXeR2KAj2NwMX8dZN
 At21Sf8+LSoh6q2HnugfiUd/jR10XbfxIIx2lXgIinb15GXgWydEQVrDJ7cUV7ix
 Jd0S+dtOtp+lWtFHDoyjjqqsMV7+G8i/rFNZoxSkyZqsUTaKzaR6JD3moSyoYZgG
 0+eXO4A=
 =9EpR
 -----END PGP SIGNATURE-----

Merge tag 'v5.8-rc6' into locking/core, to pick up fixes

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2020-07-25 21:49:36 +02:00
Michael Ellerman
826b07b190 powerpc/sstep: Fix incorrect CONFIG symbol in scv handling
When I "fixed" the ppc64e build in Nick's recent patch, I typoed the
CONFIG symbol, resulting in one that doesn't exist. Fix it to use the
correct symbol.

Reported-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Fixes: 7fa95f9ada ("powerpc/64s: system call support for scv/rfscv instructions")
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200724131609.1640533-1-mpe@ellerman.id.au
2020-07-24 23:16:49 +10:00
Michael Ellerman
70cc062c47 powerpc/test_emulate_sstep: Fix build error
ppc64_book3e_allmodconfig fails with:

  arch/powerpc/lib/test_emulate_step.c: In function 'test_pld':
  arch/powerpc/lib/test_emulate_step.c:113:7: error: implicit declaration of function 'cpu_has_feature'
    113 |  if (!cpu_has_feature(CPU_FTR_ARCH_31)) {
        |       ^~~~~~~~~~~~~~~

Add an include of cpu_has_feature.h to fix it.

Fixes: b6b54b4272 ("powerpc/sstep: Add tests for prefixed integer load/stores")
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200724004109.1461709-1-mpe@ellerman.id.au
2020-07-24 10:41:19 +10:00
Michael Ellerman
335aca5f65 Merge branch 'scv' support into next
From Nick's cover letter:

Linux powerpc new system call instruction and ABI

System Call Vectored (scv) ABI
==============================

The scv instruction is introduced with POWER9 / ISA3, it comes with an
rfscv counter-part. The benefit of these instructions is
performance (trading slower SRR0/1 with faster LR/CTR registers, and
entering the kernel with MSR[EE] and MSR[RI] left enabled, which can
reduce MSR updates. The scv instruction has 128 levels (not enough to
cover the Linux system call space).

Assignment and advertisement
----------------------------
The proposal is to assign scv levels conservatively, and advertise
them with HWCAP feature bits as we add support for more.

Linux has not enabled FSCR[SCV] yet, so executing the scv instruction
will cause the kernel to log a "SCV facility unavilable" message, and
deliver a SIGILL with ILL_ILLOPC to the process. Linux has defined a
HWCAP2 bit PPC_FEATURE2_SCV for SCV support, but does not set it.

This change allocates the zero level ('scv 0'), advertised with
PPC_FEATURE2_SCV, which will be used to provide normal Linux system
calls (equivalent to 'sc').

Attempting to execute scv with other levels will cause a SIGILL to be
delivered the same as before, but will not log a "SCV facility
unavailable" message (because the processor facility is enabled).

Calling convention
------------------
The proposal is for scv 0 to provide the standard Linux system call
ABI with the following differences from sc convention[1]:

- LR is to be volatile across scv calls. This is necessary because the
  scv instruction clobbers LR. From previous discussion, this should
  be possible to deal with in GCC clobbers and CFI.

- cr1 and cr5-cr7 are volatile. This matches the C ABI and would allow
  the kernel system call exit to avoid restoring the volatile cr
  registers (although we probably still would anyway to avoid
  information leaks).

- Error handling: The consensus among kernel, glibc, and musl is to
  move to using negative return values in r3 rather than CR0[SO]=1 to
  indicate error, which matches most other architectures, and is
  closer to a function call.

Notes
-----
- r0,r4-r8 are documented as volatile in the ABI, but the kernel patch
  as submitted currently preserves them. This is to leave room for
  deciding which way to go with these. Some small benefit was found by
  preserving them[1] but I'm not convinced it's worth deviating from
  the C function call ABI just for this. Release code should follow
  the ABI.

Previous discussions:
https://lists.ozlabs.org/pipermail/linuxppc-dev/2020-April/208691.html
https://lists.ozlabs.org/pipermail/linuxppc-dev/2020-April/209268.html

[1] https://github.com/torvalds/linux/blob/master/Documentation/powerpc/syscall64-abi.rst
[2] https://lists.ozlabs.org/pipermail/linuxppc-dev/2020-April/209263.html
2020-07-23 17:43:44 +10:00
Pratik Rajesh Sampat
5c92fb1b46 powerpc/powernv/idle: Exclude mfspr on HID1, 4, 5 on P9 and above
POWER9 onwards the support for the registers HID1, HID4, HID5 has been
receded.
Although mfspr on the above registers worked in Power9, In Power10
simulator is unrecognized. Moving their assignment under the
check for machines lower than Power9

Signed-off-by: Pratik Rajesh Sampat <psampat@linux.ibm.com>
Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200721153708.89057-4-psampat@linux.ibm.com
2020-07-23 17:43:36 +10:00
Pratik Rajesh Sampat
dcbbfa6b05 powerpc/powernv/idle: Rename pnv_first_spr_loss_level variable
Replace the variable name from using "pnv_first_spr_loss_level" to
"deep_spr_loss_state".

pnv_first_spr_loss_level is supposed to be the earliest state that
has OPAL_PM_LOSE_FULL_CONTEXT set, in other places the kernel uses the
"deep" states as terminology. Hence renaming the variable to be coherent
to its semantics.

Signed-off-by: Pratik Rajesh Sampat <psampat@linux.ibm.com>
Acked-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200721153708.89057-3-psampat@linux.ibm.com
2020-07-23 17:43:36 +10:00
Pratik Rajesh Sampat
8747bf36f3 powerpc/powernv/idle: Replace CPU feature check with PVR check
The POWER9 idle driver contains implementation-specific details that
means it is not suitable to run on any processor that implements ISA
v3.0 (e.g., POWER10), so only init the driver when running on a
POWER9.

Signed-off-by: Pratik Rajesh Sampat <psampat@linux.ibm.com>
[mpe: Use updated change log from Nick]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200721153708.89057-2-psampat@linux.ibm.com
2020-07-23 17:43:36 +10:00
Santosh Sivaraj
69507b984d powerpc/mm/hash64: Remove comment that is no longer valid
hash_low_64.S was removed in commit a43c0eb836 ("powerpc/mm: Convert
4k insert from asm to C") and flush_hash_page() is no longer called
from any assembly routine.

Signed-off-by: Santosh Sivaraj <santosh@fossix.org>
[mpe: Tweak comment wording]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200721091915.205006-1-santosh@fossix.org
2020-07-23 17:43:35 +10:00
Leonardo Bras
0f10228c6f KVM: PPC: Fix typo on H_DISABLE_AND_GET hcall
On PAPR+ the hcall() on 0x1B0 is called H_DISABLE_AND_GET, but got
defined as H_DISABLE_AND_GETC instead.

This define was introduced with a typo in commit <b13a96cfb055>
("[PATCH] powerpc: Extends HCALL interface for InfiniBand usage"), and was
later used without having the typo noticed.

Signed-off-by: Leonardo Bras <leobras.c@gmail.com>
Acked-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200707004812.190765-1-leobras.c@gmail.com
2020-07-23 17:43:35 +10:00
Christoph Hellwig
7c7ff885c7 powerpc/spufs: Fix the type of ret in spufs_arch_write_note
Both the ->dump method and snprintf return an int.  So switch to an
int and properly handle errors from ->dump.

Fixes: 5456ffdee6 ("powerpc/spufs: simplify spufs core dumping")
Reported-by: kbuild test robot <lkp@intel.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200610085554.5647-1-hch@lst.de
2020-07-23 17:43:31 +10:00
Nicholas Piggin
201220bb0e powerpc/powernv: Machine check handler for POWER10
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200702233343.1128026-1-npiggin@gmail.com
2020-07-23 17:43:30 +10:00
Wen Xiong
5a090f7c36 powerpc/pseries: PCIE PHB reset
Several device drivers hit EEH(Extended Error handling) when
triggering kdump on Pseries PowerVM. This patch implemented a reset of
the PHBs in pci general code when triggering kdump. PHB reset stop all
PCI transactions from normal kernel. We have tested the patch in
several enviroments:
  - direct slot adapters
  - adapters under the switch
  - a VF adapter in PowerVM
  - a VF adapter/adapter in KVM guest.

Signed-off-by: Wen Xiong <wenxiong@linux.vnet.ibm.com>
[mpe: Fix broken whitespace, subject & SOB formatting]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/1594651173-32166-1-git-send-email-wenxiong@linux.vnet.ibm.com
2020-07-23 17:43:30 +10:00
Nicholas Piggin
2384b36f91 powerpc: Select ARCH_HAS_MEMBARRIER_SYNC_CORE
powerpc return from interrupt and return from system call sequences
are context synchronising.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200716013522.338318-1-npiggin@gmail.com
2020-07-23 17:43:23 +10:00
Palmer Dabbelt
147c13413c powerpc/64: Fix an out of date comment about MMIO ordering
This primitive has been renamed, but because it was spelled incorrectly in the
first place it must have escaped the fixup patch.  As far as I can tell this
logic is still correct: smp_mb__after_spinlock() uses the default smp_mb()
implementation, which is "sync" rather than "hwsync" but those are the same
(though I'm not that familiar with PowerPC).

Signed-off-by: Palmer Dabbelt <palmerdabbelt@google.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200716193820.1141936-1-palmer@dabbelt.com
2020-07-23 17:43:23 +10:00
Balamuruhan S
e93ad65e36 powerpc/test_emulate_step: Move extern declaration to sstep.h
fix checkpatch.pl warnings by moving extern declaration from source
file to headerfile.

Signed-off-by: Balamuruhan S <bala24@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200626095158.1031507-5-bala24@linux.ibm.com
2020-07-23 17:43:14 +10:00
Balamuruhan S
68a180a44c powerpc/sstep: Introduce macros to retrieve Prefix instruction operands
retrieve prefix instruction operands RA and pc relative bit R values
using macros and adopt it in sstep.c and test_emulate_step.c.

Signed-off-by: Balamuruhan S <bala24@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200626095158.1031507-4-bala24@linux.ibm.com
2020-07-23 17:43:11 +10:00
Balamuruhan S
7e67c73b93 powerpc/test_emulate_step: Add negative tests for prefixed addi
testcases for `paddi` instruction to cover the negative case,
if R is equal to 1 and RA is not equal to 0, the instruction
form is invalid.

Signed-off-by: Balamuruhan S <bala24@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200626095158.1031507-3-bala24@linux.ibm.com
2020-07-23 17:43:07 +10:00
Balamuruhan S
93c3a0ba2a powerpc/test_emulate_step: Enhancement to test negative scenarios
add provision to declare test is a negative scenario, verify
whether emulation fails and avoid executing it.

Signed-off-by: Balamuruhan S <bala24@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200626095158.1031507-2-bala24@linux.ibm.com
2020-07-23 17:43:01 +10:00
Jordan Niethe
8b98afc117 powerpc/xmon: Improve dumping prefixed instructions
Currently prefixed instructions are dumped as two separate word
instructions. Use mread_instr() so that prefixed instructions are read
as such and update the incrementor in the loop to take this into
account.

'dump_func' is print_insn_powerpc() which comes from ppc-dis.c which is
taken from binutils. When this is updated prefixed instructions will be
disassembled.

Currently dumping prefixed instructions looks like this:
0:mon> di c000000000094168
c000000000094168  0x06000000    .long 0x6000000
c00000000009416c  0x392a0003    addi    r9,r10,3
c000000000094170  0x913f0028    stw     r9,40(r31)
c000000000094174  0xe93f002a    lwa     r9,40(r31)
c000000000094178  0x7d234b78    mr      r3,r9
c00000000009417c  0x383f0040    addi    r1,r31,64
c000000000094180  0xebe1fff8    ld      r31,-8(r1)
c000000000094184  0x4e800020    blr
c000000000094188  0x60000000    nop
 ...
c000000000094190  0x3c4c0121    addis   r2,r12,289
c000000000094194  0x38429670    addi    r2,r2,-27024
c000000000094198  0x7c0802a6    mflr    r0
c00000000009419c  0x60000000    nop
c0000000000941a0  0xe9240100    ld      r9,256(r4)
c0000000000941a4  0x39400001    li      r10,1

After this it looks like:
0:mon> di c000000000094168
c000000000094168  0x06000000 0x392a0003 .long 0x392a000306000000
c000000000094170  0x913f0028    stw     r9,40(r31)
c000000000094174  0xe93f002a    lwa     r9,40(r31)
c000000000094178  0x7d234b78    mr      r3,r9
c00000000009417c  0x383f0040    addi    r1,r31,64
c000000000094180  0xebe1fff8    ld      r31,-8(r1)
c000000000094184  0x4e800020    blr
c000000000094188  0x60000000    nop
 ...
c000000000094190  0x3c4c0121    addis   r2,r12,289
c000000000094194  0x38429570    addi    r2,r2,-27280
c000000000094198  0x7c0802a6    mflr    r0
c00000000009419c  0x60000000    nop
c0000000000941a0  0xe9240100    ld      r9,256(r4)
c0000000000941a4  0x39400001    li      r10,1
c0000000000941a8  0x3d02000b    addis   r8,r2,11

Signed-off-by: Jordan Niethe <jniethe5@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200602052728.18227-2-jniethe5@gmail.com
2020-07-23 17:41:59 +10:00
Jordan Niethe
50428fdc53 powerpc: Add a ppc_inst_as_str() helper
There are quite a few places where instructions are printed, this is
done using a '%x' format specifier. With the introduction of prefixed
instructions, this does not work well. Currently in these places,
ppc_inst_val() is used for the value for %x so only the first word of
prefixed instructions are printed.

When the instructions are word instructions, only a single word should
be printed. For prefixed instructions both the prefix and suffix should
be printed. To accommodate both of these situations, instead of a '%x'
specifier use '%s' and introduce a helper, __ppc_inst_as_str() which
returns a char *. The char * __ppc_inst_as_str() returns is buffer that
is passed to it by the caller.

It is cumbersome to require every caller of __ppc_inst_as_str() to now
declare a buffer. To make it more convenient to use __ppc_inst_as_str(),
wrap it in a macro that uses a compound statement to allocate a buffer
on the caller's stack before calling it.

Signed-off-by: Jordan Niethe <jniethe5@gmail.com>
Reviewed-by: Joel Stanley <joel@jms.id.au>
Acked-by: Segher Boessenkool <segher@kernel.crashing.org>
[mpe: Drop 0x prefix to match most existings uses, especially xmon]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200602052728.18227-1-jniethe5@gmail.com
2020-07-23 17:41:36 +10:00
Jordan Niethe
4f82590078 powerpc/sstep: Add tests for Prefixed Add Immediate
Use the existing support for testing compute type instructions to test
Prefixed Add Immediate (paddi). The R bit of the paddi instruction
controls whether current instruction address is used. Add test cases
for when R=1 and for R=0. paddi has a 34 bit immediate field formed by
concatenating si0 and si1. Add tests for the extreme values of this
field.

Skip the paddi tests if ISA v3.1 is unsupported.

Some of these test cases were added by Balamuruhan S.

Signed-off-by: Jordan Niethe <jniethe5@gmail.com>
[mpe: Fix conflicts with ppc-opcode.h changes, squash in .balign]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200525025923.19843-5-jniethe5@gmail.com
2020-07-23 17:25:21 +10:00
Jordan Niethe
301ebf7d69 powerpc/sstep: Let compute tests specify a required cpu feature
An a array of struct compute_test's are used to declare tests for
compute instructions. Add a cpu_feature field to struct compute_test as
an optional way to specify a cpu feature that must be present. If not
present then skip the test.

Signed-off-by: Jordan Niethe <jniethe5@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200525025923.19843-4-jniethe5@gmail.com
2020-07-23 17:25:18 +10:00
Jordan Niethe
1c89cf7fbe powerpc/sstep: Set NIP in instruction emulation tests
The tests for emulation of compute instructions execute and
emulate an instruction and then compare the results to verify the
emulation. In ISA v3.1 there are instructions that operate relative to
the NIP. Therefore set the NIP in the regs used for the emulated
instruction to the location of the executed instruction so they will
give the same result.

This is a rework of a patch by Balamuruhan S.

Signed-off-by: Jordan Niethe <jniethe5@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200525025923.19843-3-jniethe5@gmail.com
2020-07-23 17:25:15 +10:00
Jordan Niethe
0396de6d85 powerpc/sstep: Add tests for prefixed floating-point load/stores
Add tests for the prefixed versions of the floating-point load/stores
that are currently tested. This includes the following instructions:
  * Prefixed Load Floating-Point Single (plfs)
  * Prefixed Load Floating-Point Double (plfd)
  * Prefixed Store Floating-Point Single (pstfs)
  * Prefixed Store Floating-Point Double (pstfd)

Skip the new tests if ISA v3.10 is unsupported.

Signed-off-by: Jordan Niethe <jniethe5@gmail.com>
[mpe: Fix conflicts with ppc-opcode.h changes]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200525025923.19843-2-jniethe5@gmail.com
2020-07-23 17:25:12 +10:00
Jordan Niethe
b6b54b4272 powerpc/sstep: Add tests for prefixed integer load/stores
Add tests for the prefixed versions of the integer load/stores that
are currently tested. This includes the following instructions:
  * Prefixed Load Doubleword (pld)
  * Prefixed Load Word and Zero (plwz)
  * Prefixed Store Doubleword (pstd)

Skip the new tests if ISA v3.1 is unsupported.

Signed-off-by: Jordan Niethe <jniethe5@gmail.com>
[mpe: Fix conflicts with ppc-opcode.h changes]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200525025923.19843-1-jniethe5@gmail.com
2020-07-23 17:25:06 +10:00
Tianjia Zhang
7ec21d9da5 KVM: PPC: Clean up redundant kvm_run parameters in assembly
In the current kvm version, 'kvm_run' has been included in the 'kvm_vcpu'
structure. For historical reasons, many kvm-related function parameters
retain the 'kvm_run' and 'kvm_vcpu' parameters at the same time. This
patch does a unified cleanup of these remaining redundant parameters.

[paulus@ozlabs.org - Fixed places that were missed in book3s_interrupts.S]

Signed-off-by: Tianjia Zhang <tianjia.zhang@linux.alibaba.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2020-07-23 15:50:01 +10:00
Nicholas Piggin
7fa95f9ada powerpc/64s: system call support for scv/rfscv instructions
Add support for the scv instruction on POWER9 and later CPUs.

For now this implements the zeroth scv vector 'scv 0', as identical to
'sc' system calls, with the exception that LR is not preserved, nor
are volatile CR registers, and error is not indicated with CR0[SO],
but by returning a negative errno.

rfscv is implemented to return from scv type system calls. It can not
be used to return from sc system calls because those are defined to
preserve LR.

getpid syscall throughput on POWER9 is improved by 26% (428 to 318
cycles), largely due to reducing mtmsr and mtspr.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
[mpe: Fix ppc64e build]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200611081203.995112-3-npiggin@gmail.com
2020-07-22 23:00:27 +10:00
Nicholas Piggin
b2dc2977cb powerpc/64s/exception: treat NIA below __end_interrupts as soft-masked
The scv instruction causes an interrupt which can enter the kernel with
MSR[EE]=1, thus allowing interrupts to hit at any time. These must not
be taken as normal interrupts, because they come from MSR[PR]=0 context,
and yet the kernel stack is not yet set up and r13 is not set to the
PACA).

Treat this as a soft-masked interrupt regardless of the soft masked
state. This does not affect behaviour yet, because currently all
interrupts are taken with MSR[EE]=0.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200611081203.995112-2-npiggin@gmail.com
2020-07-22 23:00:23 +10:00
Athira Rajeev
1cade527f6 powerpc/perf: BHRB control to disable BHRB logic when not used
PowerISA v3.1 has few updates for the Branch History Rolling
Buffer(BHRB).

BHRB disable is controlled via Monitor Mode Control Register A (MMCRA)
bit, namely "BHRB Recording Disable (BHRBRD)". This field controls
whether BHRB entries are written when BHRB recording is enabled by
other bits. This patch implements support for this BHRB disable bit.
By setting 0b1 to this bit will disable the BHRB and by setting 0b0 to
this bit will have BHRB enabled. This addresses backward
compatibility (for older OS), since this bit will be cleared and
hardware will be writing to BHRB by default.

This patch addresses changes to set MMCRA (BHRBRD) at boot for
power10 (there by the core will run faster) and enable this feature
only on runtime ie, on explicit need from user. Also save/restore
MMCRA in the restore path of state-loss idle state to make sure we
keep BHRB disabled if it was not enabled on request at runtime.

Signed-off-by: Athira Rajeev <atrajeev@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/1594996707-3727-12-git-send-email-atrajeev@linux.vnet.ibm.com
2020-07-22 21:56:42 +10:00
Athira Rajeev
80350a4bac powerpc/perf: Add Power10 BHRB filter support for PERF_SAMPLE_BRANCH_IND_CALL/COND
PowerISA v3.1 introduce filtering support for
PERF_SAMPLE_BRANCH_IND_CALL/COND. The patch adds BHRB filter
support for "ind_call" and "cond" in power10_bhrb_filter_map().

Signed-off-by: Athira Rajeev <atrajeev@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/1594996707-3727-11-git-send-email-atrajeev@linux.vnet.ibm.com
2020-07-22 21:56:42 +10:00
Athira Rajeev
bfe3b1945d powerpc/perf: Ignore the BHRB kernel address filtering for P10
Commit bb19af8160 ("powerpc/perf: Prevent kernel address leak to
userspace via BHRB buffer") added a check in bhrb_read() to filter
the kernel address from BHRB buffer. This patch modified it to avoid
that check for PowerISA v3.1 based processors, since PowerISA v3.1
allows only MSR[PR]=1 address to be written to BHRB buffer.

Signed-off-by: Athira Rajeev <atrajeev@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/1594996707-3727-10-git-send-email-atrajeev@linux.vnet.ibm.com
2020-07-22 21:56:41 +10:00
Athira Rajeev
a64e697cef powerpc/perf: power10 Performance Monitoring support
Base enablement patch to register performance monitoring hardware
support for power10. Patch introduce the raw event encoding format,
defines the supported list of events, config fields for the event
attributes and their corresponding bit values which are exported via
sysfs.

Patch also enhances the support function in isa207_common.c to include
power10 pmu hardware.

Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Madhavan Srinivasan <maddy@linux.ibm.com>
Signed-off-by: Athira Rajeev <atrajeev@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/1594996707-3727-9-git-send-email-atrajeev@linux.vnet.ibm.com
2020-07-22 21:56:41 +10:00
Madhavan Srinivasan
9908c826d5 powerpc/perf: Add Power10 PMU feature to DT CPU features
Add Power10 feature function to DT CPU features, along with a Power10
specific init() to initialize PMU SPRs, sets the oprofile_cpu_type and
cpu_features. This will enable performance monitoring unit (PMU) for
Power10 in CPU features with "performance-monitor-power10".

For Power ISA v3.1, BHRB disable is controlled via Monitor Mode
Control Register A (MMCRA) bit, namely "BHRB Recording
Disable (BHRBRD)". This patch initializes MMCRA BHRBRD to disable BHRB
feature at boot for Power10.

Signed-off-by: Madhavan Srinivasan <maddy@linux.ibm.com>
[mpe: Move MMCRA_BHRB_DISABLE as noted by jpn, drop CPU setup changes]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/1594996707-3727-8-git-send-email-atrajeev@linux.vnet.ibm.com
2020-07-22 21:56:41 +10:00
Madhavan Srinivasan
1979ae8c72 powerpc/xmon: Add PowerISA v3.1 PMU SPRs
PowerISA v3.1 added three new perfromance
monitoring unit (PMU) speical purpose register (SPR).
They are Monitor Mode Control Register 3 (MMCR3),
Sampled Instruction Event Register 2 (SIER2),
Sampled Instruction Event Register 3 (SIER3).

Patch here adds a new dump function dump_310_sprs
to print these SPR values.

Signed-off-by: Madhavan Srinivasan <maddy@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/1594996707-3727-7-git-send-email-atrajeev@linux.vnet.ibm.com
2020-07-22 21:56:41 +10:00
Athira Rajeev
5752fe0b81 KVM: PPC: Book3S HV: Save/restore new PMU registers
Power ISA v3.1 has added new performance monitoring unit (PMU) special
purpose registers (SPRs). They are:

Monitor Mode Control Register 3 (MMCR3)
Sampled Instruction Event Register A (SIER2)
Sampled Instruction Event Register B (SIER3)

Add support to save/restore these new SPRs while entering/exiting
guest. Also include changes to support KVM_REG_PPC_MMCR3/SIER2/SIER3.
Add new SPRs to KVM API documentation.

Signed-off-by: Athira Rajeev <atrajeev@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/1594996707-3727-6-git-send-email-atrajeev@linux.vnet.ibm.com
2020-07-22 21:56:41 +10:00
Madhavan Srinivasan
c718547e4a powerpc/perf: Add support for ISA3.1 PMU SPRs
PowerISA v3.1 includes new performance monitoring unit(PMU)
special purpose registers (SPRs). They are

Monitor Mode Control Register 3 (MMCR3)
Sampled Instruction Event Register 2 (SIER2)
Sampled Instruction Event Register 3 (SIER3)

MMCR3 is added for further sampling related configuration
control. SIER2/SIER3 are added to provide additional
information about the sampled instruction.

Patch adds new PPMU flag called "PPMU_ARCH_31" to support handling of
these new SPRs, updates the struct thread_struct to include these new
SPRs, include MMCR3 in struct mmcr_regs. This is needed to support
programming of MMCR3 SPR during event_enable/disable. Patch also adds
the sysfs support for the MMCR3 SPR along with SPRN_ macros for these
new pmu SPRs.

Signed-off-by: Madhavan Srinivasan <maddy@linux.ibm.com>
[mpe: Rename to PPMU_ARCH_31 as noted by jpn]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/1594996707-3727-5-git-send-email-atrajeev@linux.vnet.ibm.com
2020-07-22 21:56:41 +10:00
Athira Rajeev
9d4fc86dcd powerpc/perf: Update Power PMU cache_events to u64 type
Events of type PERF_TYPE_HW_CACHE was described for Power PMU
as: int (*cache_events)[type][op][result];

where type, op, result values unpacked from the event attribute config
value is used to generate the raw event code at runtime.

So far the event code values which used to create these cache-related
events were within 32 bit and `int` type worked. In power10,
some of the event codes are of 64-bit value and hence update the
Power PMU cache_events to `u64` type in `power_pmu` struct.
Also propagate this change to existing all PMU driver code paths
which are using ppmu->cache_events.

Signed-off-by: Athira Rajeev <atrajeev@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/1594996707-3727-4-git-send-email-atrajeev@linux.vnet.ibm.com
2020-07-22 21:56:40 +10:00
Athira Rajeev
7e4a145e5b KVM: PPC: Book3S HV: Cleanup updates for kvm vcpu MMCR
Currently `kvm_vcpu_arch` stores all Monitor Mode Control registers
in a flat array in order: mmcr0, mmcr1, mmcra, mmcr2, mmcrs
Split this to give mmcra and mmcrs its own entries in vcpu and
use a flat array for mmcr0 to mmcr2. This patch implements this
cleanup to make code easier to read.

Signed-off-by: Athira Rajeev <atrajeev@linux.vnet.ibm.com>
[mpe: Fix MMCRA/MMCR2 uapi breakage as noted by paulus]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/1594996707-3727-3-git-send-email-atrajeev@linux.vnet.ibm.com
2020-07-22 21:56:01 +10:00
Masahiro Yamada
1bca544050 powerpc/boot: add DTB to 'targets'
PowerPC always re-builds DTB even if nothing has been changed.

As for other architectures, arch/*/boot/dts/Makefile builds DTB by
using the dtb-y syntax.

In contrast, arch/powerpc/boot/dts/(fsl/)Makefile does nothing unless
CONFIG_OF_ALL_DTBS is defined. Instead, arch/powerpc/boot/Makefile
builds DTB on demand. You need to add DTB to 'targets' explicitly
so .*.cmd files are included.

Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
Acked-by: Michael Ellerman <mpe@ellerman.id.au>
2020-07-22 09:19:12 +09:00
Athira Rajeev
78d76819e6 powerpc/perf: Update cpu_hw_event to use struct for storing MMCR registers
core-book3s currently uses array to store the MMCR registers as part
of per-cpu `cpu_hw_events`. This patch does a clean up to use `struct`
to store mmcr regs instead of array. This will make code easier to read
and reduces chance of any subtle bug that may come in the future, say
when new registers are added. Patch updates all relevant code that was
using MMCR array ( cpuhw->mmcr[x]) to use newly introduced `struct`.
This includes the PMU driver code for supported platforms (power5
to power9) and ISA macros for counter support functions.

Signed-off-by: Athira Rajeev <atrajeev@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/1594996707-3727-2-git-send-email-atrajeev@linux.vnet.ibm.com
2020-07-22 00:03:17 +10:00
Madhavan Srinivasan
3c9450c053 powerpc/perf: Fix missing is_sier_aviable() during build
Compilation error:
  arch/powerpc/perf/perf_regs.c:80:undefined reference to `.is_sier_available'

Currently is_sier_available() is part of core-book3s.c, which is added
to build based on CONFIG_PPC_PERF_CTRS.

A config with CONFIG_PERF_EVENTS and without CONFIG_PPC_PERF_CTRS will
have a build break because of missing is_sier_available().

In practice it only breaks when CONFIG_FSL_EMB_PERF_EVENT=n because
that also guards the usage of is_sier_available(). That only happens
with CONFIG_PPC_BOOK3E_64=y and CONFIG_FSL_SOC_BOOKE=n.

Patch adds is_sier_available() in asm/perf_event.h to fix the build
break for configs missing CONFIG_PPC_PERF_CTRS.

Fixes: 333804dc3b ("powerpc/perf: Update perf_regs structure to include SIER")
Reported-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Madhavan Srinivasan <maddy@linux.ibm.com>
[mpe: Add detail about CONFIG_FSL_SOC_BOOKE]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200614083604.302611-1-maddy@linux.ibm.com
2020-07-22 00:03:06 +10:00
Nicholas Piggin
63396ada80 powerpc/64s/hash: Disable subpage_prot syscall by default
The subpage_prot syscall was added for specialised system software
(Lx86) that has been discontinued for about 7 years, and is not thought
to be used elsewhere, so disable it by default.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200703011958.1166620-4-npiggin@gmail.com
2020-07-22 00:01:25 +10:00
Nicholas Piggin
5c9fa16e8a powerpc/64s: Remove PROT_SAO support
ISA v3.1 does not support the SAO storage control attribute required to
implement PROT_SAO. PROT_SAO was used by specialised system software
(Lx86) that has been discontinued for about 7 years, and is not thought
to be used elsewhere, so removal should not cause problems.

We rather remove it than keep support for older processors, because
live migrating guest partitions to newer processors may not be possible
if SAO is in use (or worse allowed with silent races).

- PROT_SAO stays in the uapi header so code using it would still build.
- arch_validate_prot() is removed, the generic version rejects PROT_SAO
  so applications would get a failure at mmap() time.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
[mpe: Drop KVM change for the time being]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200703011958.1166620-3-npiggin@gmail.com
2020-07-22 00:01:25 +10:00
Nicholas Piggin
f4ac1774f2 powerpc: Remove stale calc_vm_prot_bits() comment
This comment is wrong, we wouldn't use calc_vm_prot_bits() here
because we are being called by calc_vm_prot_bits() to modify its
behaviour.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200703011958.1166620-2-npiggin@gmail.com
2020-07-22 00:01:24 +10:00
YueHaibing
a3f3f8aa1f powerpc: Remove unneeded inline functions
Both of those functions are only called from 64-bit only code, so the
stubs should not be needed at all.

Suggested-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200717112714.19304-1-yuehaibing@huawei.com
2020-07-22 00:01:24 +10:00
Alexander A. Klimov
c8ed9fc9d2 powerpc: Replace HTTP links with HTTPS ones
Rationale:
Reduces attack surface on kernel devs opening the links for MITM
as HTTPS traffic is much harder to manipulate.

Deterministic algorithm:
For each file:
  If not .svg:
    For each line:
      If doesn't contain `\bxmlns\b`:
        For each link, `\bhttp://[^# \t\r\n]*(?:\w|/)`:
	  If neither `\bgnu\.org/license`, nor `\bmozilla\.org/MPL\b`:
            If both the HTTP and HTTPS versions
            return 200 OK and serve the same content:
              Replace HTTP with HTTPS.

Signed-off-by: Alexander A. Klimov <grandmaster@al2klimov.de>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200718103958.5455-1-grandmaster@al2klimov.de
2020-07-22 00:01:23 +10:00
Michael Ellerman
38b407be17 powerpc/spufs: Rework fcheck() usage
Currently the spu coredump code triggers an RCU warning:

  =============================
  WARNING: suspicious RCU usage
  5.7.0-rc3-01755-g7cd49f0b7ec7 #1 Not tainted
  -----------------------------
  include/linux/fdtable.h:95 suspicious rcu_dereference_check() usage!

  other info that might help us debug this:

  rcu_scheduler_active = 2, debug_locks = 1
  1 lock held by spu-coredump/1343:
   #0: c0000007fa22f430 (sb_writers#2){.+.+}-{0:0}, at: .do_coredump+0x1010/0x13c8

  stack backtrace:
  CPU: 0 PID: 1343 Comm: spu-coredump Not tainted 5.7.0-rc3-01755-g7cd49f0b7ec7 #1
  Call Trace:
    .dump_stack+0xec/0x15c (unreliable)
    .lockdep_rcu_suspicious+0x120/0x144
    .coredump_next_context+0x148/0x158
    .spufs_coredump_extra_notes_size+0x54/0x190
    .elf_coredump_extra_notes_size+0x34/0x50
    .elf_core_dump+0xe48/0x19d0
    .do_coredump+0xe50/0x13c8
    .get_signal+0x864/0xd88
    .do_notify_resume+0x158/0x3c8
    .interrupt_exit_user_prepare+0x19c/0x208
    interrupt_return+0x14/0x1c0

This comes from fcheck_files() via fcheck().

It's pretty clearly documented that fcheck() must be wrapped with
rcu_read_lock(), adding that fixes the RCU warning.

hch points out that once we've released the RCU read lock the file may
be closed and freed, which would leave us with a pointer to a freed
spu_context.

To avoid that, take a reference to the spu_context while we hold the
RCU read lock, and drop that reference later once we're done with the
context.

Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200508130633.2532759-1-mpe@ellerman.id.au
2020-07-22 00:01:23 +10:00
Aneesh Kumar K.V
482b9b3948 powerpc/book3s64/pkeys: Remove is_pkey_enabled()
There is only one caller to this function and the function is wrongly
named. Avoid further confusion w.r.t name and open code this at the
only call site. Also remove read_uamor(). There are no users for
the same after this.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200709032946.881753-24-aneesh.kumar@linux.ibm.com
2020-07-22 00:01:22 +10:00
Alexey Kardashevskiy
1508c22f11 KVM: PPC: Protect kvm_vcpu_read_guest with srcu locks
The kvm_vcpu_read_guest/kvm_vcpu_write_guest used for nested guests
eventually call srcu_dereference_check to dereference a memslot and
lockdep produces a warning as neither kvm->slots_lock nor
kvm->srcu lock is held and kvm->users_count is above zero (>100 in fact).

This wraps mentioned VCPU read/write helpers in srcu read lock/unlock as
it is done in other places. This uses vcpu->srcu_idx when possible.

These helpers are only used for nested KVM so this may explain why
we did not see these before.

Here is an example of a warning:

=============================
WARNING: suspicious RCU usage
5.7.0-rc3-le_dma-bypass.3.2_a+fstn1 #897 Not tainted
-----------------------------
include/linux/kvm_host.h:633 suspicious rcu_dereference_check() usage!

other info that might help us debug this:

rcu_scheduler_active = 2, debug_locks = 1
1 lock held by qemu-system-ppc/2752:
 #0: c000200359016be0 (&vcpu->mutex){+.+.}-{3:3}, at: kvm_vcpu_ioctl+0x144/0xd80 [kvm]

stack backtrace:
CPU: 80 PID: 2752 Comm: qemu-system-ppc Not tainted 5.7.0-rc3-le_dma-bypass.3.2_a+fstn1 #897
Call Trace:
[c0002003591ab240] [c000000000b23ab4] dump_stack+0x190/0x25c (unreliable)
[c0002003591ab2b0] [c00000000023f954] lockdep_rcu_suspicious+0x140/0x164
[c0002003591ab330] [c008000004a445f8] kvm_vcpu_gfn_to_memslot+0x4c0/0x510 [kvm]
[c0002003591ab3a0] [c008000004a44c18] kvm_vcpu_read_guest+0xa0/0x180 [kvm]
[c0002003591ab410] [c008000004ff9bd8] kvmhv_enter_nested_guest+0x90/0xb80 [kvm_hv]
[c0002003591ab980] [c008000004fe07bc] kvmppc_pseries_do_hcall+0x7b4/0x1c30 [kvm_hv]
[c0002003591aba10] [c008000004fe5d30] kvmppc_vcpu_run_hv+0x10a8/0x1a30 [kvm_hv]
[c0002003591abae0] [c008000004a5d954] kvmppc_vcpu_run+0x4c/0x70 [kvm]
[c0002003591abb10] [c008000004a56e54] kvm_arch_vcpu_ioctl_run+0x56c/0x7c0 [kvm]
[c0002003591abba0] [c008000004a3ddc4] kvm_vcpu_ioctl+0x4ac/0xd80 [kvm]
[c0002003591abd20] [c0000000006ebb58] ksys_ioctl+0x188/0x210
[c0002003591abd70] [c0000000006ebc28] sys_ioctl+0x48/0xb0
[c0002003591abdb0] [c000000000042764] system_call_exception+0x1d4/0x2e0
[c0002003591abe20] [c00000000000cce8] system_call_common+0xe8/0x214

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2020-07-21 15:38:03 +10:00