Now that we have switched to explicit page table routines,
get rid of the obsolete kvm_* wrappers.
Also, kvm_tlb_flush_vmid_by_ipa is now called only on stage2
page tables, hence get rid of the redundant check.
Cc: Marc Zyngier <marc.zyngier@arm.com>
Acked-by: Christoffer Dall <christoffer.dall@linaro.org>
Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com>
Now that the hyp page table is handled by different set of
routines, rename the original shared routines to stage2 handlers.
Also make explicit use of the stage2 page table helpers.
unmap_range has been merged to existing unmap_stage2_range.
Cc: Marc Zyngier <marc.zyngier@arm.com>
Acked-by: Christoffer Dall <christoffer.dall@linaro.org>
Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com>
We have common routines to modify hyp and stage2 page tables
based on the 'kvm' parameter. For a smoother transition to
using separate routines for each, duplicate the routines
and modify the copy to work on hyp.
Marks the forked routines with _hyp_ and gets rid of the
kvm parameter which is no longer needed and is NULL for hyp.
Also, gets rid of calls to kvm_tlb_flush_by_vmid_ipa() calls
from the hyp versions. Uses explicit host page table accessors
instead of the kvm_* page table helpers.
Suggested-by: Christoffer Dall <christoffer.dall@linaro.org>
Cc: Marc Zyngier <marc.zyngier@arm.com>
Reviewed-by: Christoffer Dall <christoffer.dall@linaro.org>
Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com>
We have stage2 page table helpers for both arm and arm64. Switch to
the stage2 helpers for routines that only deal with stage2 page table.
Cc: Marc Zyngier <marc.zyngier@arm.com>
Acked-by: Christoffer Dall <christoffer.dall@linaro.org>
Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com>
Introduce hyp_pxx_table_empty helpers for checking whether
a given table entry is empty. This will be used explicitly
once we switch to explicit routines for hyp page table walk.
Acked-by: Marc Zyngier <marc.zyngier@arm.com>
Acked-by: Christoffer Dall <christoffer.dall@linaro.org>
Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com>
Introduce stage2 page table helpers for arm64. With the fake
page table level still in place, the stage2 table has the same
number of levels as that of the host (and hyp), so they all
fallback to the host version.
Acked-by: Marc Zyngier <marc.zyngier@arm.com>
Reviewed-by: Christoffer Dall <christoffer.dall@linaro.org>
Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com>
Introduce hyp_pxx_table_empty helpers for checking whether
a given table entry is empty. This will be used explicitly
once we switch to explicit routines for hyp page table walk.
Acked-by: Marc Zyngier <marc.zyngier@arm.com>
Reviewed-by: Christoffer Dall <christoffer.dall@linaro.org>
Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com>
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
Define the page table helpers for walking the stage2 pagetable
for arm. Since both hyp and stage2 have the same number of levels,
as that of the host we reuse the host helpers.
The exceptions are the p.d_addr_end routines which have to deal
with IPA > 32bit, hence we use the open coded version of their host helpers
which supports 64bit.
Acked-by: Marc Zyngier <marc.zyngier@arm.com>
Reviewed-by: Christoffer Dall <christoffer.dall@linaro.org>
Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com>
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
Get rid of kvm_pud_huge() which falls back to pud_huge. Use
pud_huge instead.
Acked-by: Christoffer Dall <christoffer.dall@linaro.org>
Acked-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com>
Both arm and arm64 now provides a helper, pmd_thp_or_huge()
to check if the given pmd represents a huge page. Use that
instead of our own custom check.
Suggested-by: Mark Rutland <mark.rutland@arm.com>
Cc: Marc Zyngier <marc.zyngier@arm.com>
Acked-by: Marc Zyngier <marc.zyngier@arm.com>
Acked-by: Christoffer Dall <christoffer.dall@linaro.org>
Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com>
Add a helper to determine if a given pmd represents a huge page
either by hugetlb or thp, as we have for arm. This will be used
by KVM MMU code.
Suggested-by: Mark Rutland <mark.rutland@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Steve Capper <steve.capper@linaro.org>
Cc: Will Deacon <will.deacon@arm.com>
Acked-by: Marc Zyngier <marc.zyngier@arm.com>
Acked-by: Will Deacon <will.deacon@arm.com>
Acked-by: Christoffer Dall <christoffer.dall@linaro.org>
Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com>
Rearrange the code for fake pgd handling, which is applicable
only for arm64. This will later be removed once we introduce
the stage2 page table walker macros.
Reviewed-by: Marc Zyngier <marc.zyngier@arm.com>
Reviewed-by: Christoffer Dall <christoffer.dall@linaro.org>
Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com>
We share most of the bits for VTCR_EL2 for different page sizes,
except for the TG0 value and the entry level value. This patch
makes the definitions a bit more cleaner to reflect this fact.
Also cleans up the VTTBR_X calculation. No functional changes.
Cc: Marc Zyngier <marc.zyngier@arm.com>
Acked-by: Marc Zyngier <marc.zyngier@arm.com>
Reviewed-by: Christoffer Dall <christoffer.dall@linaro.org>
Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com>
TCR_EL1, TCR_EL2 and VTCR_EL2, all share some field positions
(TG0, ORGN0, IRGN0 and SH0) and their corresponding value definitions.
This patch makes the TCR_EL1 definitions reusable and uses them for TCR_EL2
and VTCR_EL2 fields.
This also fixes a bug where we assume TG0 in {V}TCR_EL2 is 1bit field.
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Acked-by: Marc Zyngier <marc.zyngier@arm.com>
Acked-by: Will Deacon <will.deacon@arm.com>
Reviewed-by: Christoffer Dall <christoffer.dall@linaro.org>
Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com>
Commit 1e947bad0b ("arm64: KVM: Skip HYP setup when already running
in HYP") re-organized the hyp init code and ended up leaving the CPU
hotplug and PM notifier even if hyp mode initialization fails.
Since KVM is not yet supported with ACPI, the above mentioned commit
breaks CPU hotplug in ACPI boot.
This patch fixes teardown_hyp_mode to properly unregister both CPU
hotplug and PM notifiers in the teardown path.
Fixes: 1e947bad0b ("arm64: KVM: Skip HYP setup when already running in HYP")
Cc: Christoffer Dall <christoffer.dall@linaro.org>
Cc: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Sudeep Holla <sudeep.holla@arm.com>
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
We always thought that 40bits of PA range would be the minimum people
would actually build. Anything less is terrifyingly small.
Turns out that we were both right and wrong. Nobody has ever built
such a system, but the ARM Foundation Model has a PARange set to 36bits.
Just because we can. Oh well. Now, the KVM API explicitely says that
we offer a 40bit PA space to the VM, so we shouldn't run KVM on
the Foundation Model at all.
That being said, this patch offers a less agressive alternative, and
loudly warns about the configuration being unsupported. You'll still
be able to run VMs (at your own risks, though).
This is just a workaround until we have a proper userspace API where
we report the PARange to userspace.
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
On a host that runs NTP, corrections can have a direct impact on
the background timer that we program on the behalf of a vcpu.
In particular, NTP performing a forward correction will result in
a timer expiring sooner than expected from a guest point of view.
Not a big deal, we kick the vcpu anyway.
But on wake-up, the vcpu thread is going to perform a check to
find out whether or not it should block. And at that point, the
timer check is going to say "timer has not expired yet, go back
to sleep". This results in the timer event being lost forever.
There are multiple ways to handle this. One would be record that
the timer has expired and let kvm_cpu_has_pending_timer return
true in that case, but that would be fairly invasive. Another is
to check for the "short sleep" condition in the hrtimer callback,
and restart the timer for the remaining time when the condition
is detected.
This patch implements the latter, with a bit of refactoring in
order to avoid too much code duplication.
Cc: <stable@vger.kernel.org>
Reported-by: Alexander Graf <agraf@suse.de>
Reviewed-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
The kernel is written in C, not python, so we need braces around
multi-line if statements. GCC 6 actually warns about this, thanks to the
fantastic new "-Wmisleading-indentation" flag:
| virt/kvm/arm/pmu.c: In function ‘kvm_pmu_overflow_status’:
| virt/kvm/arm/pmu.c:198:3: warning: statement is indented as if it were guarded by... [-Wmisleading-indentation]
| reg &= vcpu_sys_reg(vcpu, PMCNTENSET_EL0);
| ^~~
| arch/arm64/kvm/../../../virt/kvm/arm/pmu.c:196:2: note: ...this ‘if’ clause, but it is not
| if ((vcpu_sys_reg(vcpu, PMCR_EL0) & ARMV8_PMU_PMCR_E))
| ^~
As it turns out, this particular case is harmless (we just do some &=
operations with 0), but worth fixing nonetheless.
Signed-off-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
When the kernel is running at EL2, it doesn't need init_hyp_mode() to
configure page tables for HYP. This function also registers the CPU
hotplug and lower power notifiers that cause HYP to be re-initialised
after the CPU has been reset.
To avoid losing the register state that controls stage2 translation, move
the registering of these notifiers into init_subsystems(), and add a
is_kernel_in_hyp_mode() path to each callback.
Acked-by: Marc Zyngier <marc.zyngier@arm.com>
Acked-by: Christoffer Dall <christoffer.dall@linaro.org>
Fixes: 1e947bad0b ("arm64: KVM: Skip HYP setup when already running in HYP")
Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
When we detect support for 16bit VMID in ID_AA64MMFR1, we set the
VTCR_EL2_VS field to 1 to make use of 16bit vmids. But, with
commit 3a3604bc5e ("arm64: KVM: Switch to C-based stage2 init")
this is broken and we corrupt VTCR_EL2:T0SZ instead of updating the VS
field. VTCR_EL2_VS was actually defined to the field shift (19) and
not the real value for VS. This patch fixes the issue.
Fixes: commit 3a3604bc5e ("arm64: KVM: Switch to C-based stage2 init")
Cc: Christoffer Dall <christoffer.dall@linaro.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Acked-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com>
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
Pull Ceph updates from Sage Weil:
"There is quite a bit here, including some overdue refactoring and
cleanup on the mon_client and osd_client code from Ilya, scattered
writeback support for CephFS and a pile of bug fixes from Zheng, and a
few random cleanups and fixes from others"
[ I already decided not to pull this because of it having been rebased
recently, but ended up changing my mind after all. Next time I'll
really hold people to it. Oh well. - Linus ]
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (34 commits)
libceph: use KMEM_CACHE macro
ceph: use kmem_cache_zalloc
rbd: use KMEM_CACHE macro
ceph: use lookup request to revalidate dentry
ceph: kill ceph_get_dentry_parent_inode()
ceph: fix security xattr deadlock
ceph: don't request vxattrs from MDS
ceph: fix mounting same fs multiple times
ceph: remove unnecessary NULL check
ceph: avoid updating directory inode's i_size accidentally
ceph: fix race during filling readdir cache
libceph: use sizeof_footer() more
ceph: kill ceph_empty_snapc
ceph: fix a wrong comparison
ceph: replace CURRENT_TIME by current_fs_time()
ceph: scattered page writeback
libceph: add helper that duplicates last extent operation
libceph: enable large, variable-sized OSD requests
libceph: osdc->req_mempool should be backed by a slab pool
libceph: make r_request msg_size calculation clearer
...
Pull orangefs filesystem from Mike Marshall.
This finally merges the long-pending orangefs filesystem, which has been
much cleaned up with input from Al Viro over the last six months. From
the documentation file:
"OrangeFS is an LGPL userspace scale-out parallel storage system. It
is ideal for large storage problems faced by HPC, BigData, Streaming
Video, Genomics, Bioinformatics.
Orangefs, originally called PVFS, was first developed in 1993 by Walt
Ligon and Eric Blumer as a parallel file system for Parallel Virtual
Machine (PVM) as part of a NASA grant to study the I/O patterns of
parallel programs.
Orangefs features include:
- Distributes file data among multiple file servers
- Supports simultaneous access by multiple clients
- Stores file data and metadata on servers using local file system
and access methods
- Userspace implementation is easy to install and maintain
- Direct MPI support
- Stateless"
see Documentation/filesystems/orangefs.txt for more in-depth details.
* tag 'ofs-pull-tag-1' of git://git.kernel.org/pub/scm/linux/kernel/git/hubcap/linux: (174 commits)
orangefs: fix orangefs_superblock locking
orangefs: fix do_readv_writev() handling of error halfway through
orangefs: have ->kill_sb() evict the VFS side of things first
orangefs: sanitize ->llseek()
orangefs-bufmap.h: trim unused junk
orangefs: saner calling conventions for getting a slot
orangefs_copy_{to,from}_bufmap(): don't pass bufmap pointer
orangefs: get rid of readdir_handle_s
ornagefs: ensure that truncate has an up to date inode size
orangefs: move code which sets i_link to orangefs_inode_getattr
orangefs: remove needless wrapper around GFP_KERNEL
orangefs: remove wrapper around mutex_lock(&inode->i_mutex)
orangefs: refactor inode type or link_target change detection
orangefs: use new getattr for revalidate and remove old getattr
orangefs: use new getattr in inode getattr and permission
orangefs: use new orangefs_inode_getattr to get size in write and llseek
orangefs: use new orangefs_inode_getattr to create new inodes
orangefs: rename orangefs_inode_getattr to orangefs_inode_old_getattr
orangefs: remove inode->i_lock wrapper
orangefs: put register_chrdev immediately before register_filesystem
...
translation window setup, NULL ptr dereference, and ntb-perf errors.
Also, a modification to the driver API that makes _addr functions
optional.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJW9rVgAAoJEG5mS6x6i9IjutEP/11ufWLOGHOOD4zAZ/Bb62MS
YQXr0I/CUViEiV0Jaj254FmlYYafDAb5LfYVneVGalW3HvLMED069M7WEQyP7UqG
4dTNWyktS6GkjrlejGzY2xrPXr8hG4GTrpkMgBo4uZol1n95VEe3IlOn4VheEwPz
+RNioK1oBDV6f6JXjD5IH0FP+/Gn1VwxVPq/lXpFIZfw4FX1+pbTfjbNB2kkNOXM
0adUEZdyritg3cqrecH1P+ewGYT5//S/ai/AgxghddJAwh2IHlf46ZdGiVH37Awd
r6+EgqzdPiG5Qxlo2CTnu1gb4zsqXJYViWsM+vNIqJYCweD7ZoMuYy89mAf+l0kb
a4kBi657qmEsalw4nqeMoZHuHdW2G9gA8UHmNyENRWmtHG7odVXGnR4PA0nE7kyw
JPrNSHQ7mGgo+9wsvLYT6BAatpAjBIhE2wjrR8svS8qnP8jU9mjT3nTKAWKl1XO+
YXbUwlqbiMaFzBNd327iyEBqoGU3j1ba+AiG3IBbiOxNJz1WnfFNvnuYMu4zuruZ
KmdZlQTO5s2YPBIV+BgzPK6oRBCTxQrVBsRN2jn/i+02gAgB3uZeT3IhLCTLZhtf
mT5/3yyvk6O8Hcs5VoSzyyzbAHIL0y/NEUR0jnOqZowwTx8ajg+owHAUgggUloHj
wXALu1jc80GsOglBKU0Z
=7DcD
-----END PGP SIGNATURE-----
Merge tag 'ntb-4.6' of git://github.com/jonmason/ntb
Pull NTB bug fixes from Jon Mason:
"NTB bug fixes for tasklet from spinning forever, link errors,
translation window setup, NULL ptr dereference, and ntb-perf errors.
Also, a modification to the driver API that makes _addr functions
optional"
* tag 'ntb-4.6' of git://github.com/jonmason/ntb:
NTB: Remove _addr functions from ntb_hw_amd
NTB: Make _addr functions optional in the API
NTB: Fix incorrect clean up routine in ntb_perf
NTB: Fix incorrect return check in ntb_perf
ntb: fix possible NULL dereference
ntb: add missing setup of translation window
ntb: stop link work when we do not have memory
ntb: stop tasklet from spinning forever during shutdown.
ntb: perf test: fix address space confusion
The only new stuff which missed the first pull request is an update to
the UFS driver. The rest is an assortment of bug fixes and minor
tweaks which appeared recently (some are fixes for recent code and
some are stuff spotted recently by the checkers or the new gcc-6
compiler [most of Arnd's stuff]).
Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQEcBAABAgAGBQJW9nPcAAoJEDeqqVYsXL0M4CIH/1EkSjCyLzm5yDGzPKyD8LuS
r8mNmXEgKxKuCVenmsydDEa4YmEH/94ysMevwXCogDvUz0ms/qRHJnF3cy7MT7fe
TlcuKQdshl3k5gRP33K3AkK1aNtzyWGwiP+5+e+uO3rzJgujJa+IcpvYYk/e46GE
yTfi6uEdNRFD6xGxqfttvO9I+YKj5XtRpNZQe/YAS6bcyLm0R62031b8OcmKxWYT
m/F9AlxKeIDmutH5GK5siePQ1KNmn1LZOGYO8RKA4jcyzhxJ8qv8HdRpFGAGuyg/
f7V0OGqhLGlzZ5pNRfsYpAhoScmwbm+rYJv1W0vjRdeoAgmNkR8S5LDeXHpvZ5A=
=2sEy
-----END PGP SIGNATURE-----
Merge tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
Pull more SCSI updates from James Bottomley:
"The only new stuff which missed the first pull request is an update to
the UFS driver.
The rest is an assortment of bug fixes and minor tweaks which appeared
recently (some are fixes for recent code and some are stuff spotted
recently by the checkers or the new gcc-6 compiler [most of Arnd's
stuff])"
* tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (32 commits)
scsi_common: do not clobber fixed sense information
scsi: ufs: select CONFIG_NLS
scsi: fc: use get/put_unaligned64 for wwn access
fnic: move printk()s outside of the critical code section.
qla2xxx: avoid maybe_uninitialized warning
megaraid_sas: add missing curly braces in ioctl handler
lpfc: fix misleading indentation
scsi_transport_sas: add 'scsi_target_id' sysfs attribute
scsi_dh_alua: uninitialized variable in alua_check_vpd()
scsi: ufs-qcom: add printouts of testbus debug registers
scsi: ufs-qcom: enable/disable the device ref clock
scsi: ufs-qcom: set PA_Local_TX_LCC_Enable before link startup
scsi: ufs: add device quirk delay before putting UFS rails in LPM
scsi: ufs: fix leakage during link off state
scsi: ufs: tune UniPro parameters to optimize hibern8 exit time
scsi: ufs: handle non spec compliant bkops behaviour by device
scsi: ufs: add retry for query descriptors
scsi: ufs: add error recovery after DL NAC error
scsi: ufs: make error handling bit faster
scsi: ufs: disable vccq if it's not needed by UFS device
...
Commit 0b81d07790 ("fs crypto: move per-file encryption from f2fs
tree to fs/crypto") moved the f2fs crypto files to fs/crypto/ and
renamed the symbol prefixes from "f2fs_" to "fscrypt_" (and from "F2FS_"
to just "FS" for preprocessor symbols).
Because of the symbol renaming, it's a bit hard to see it as a file
move: use
git show -M30 0b81d07790
to lower the rename detection to just 30% similarity and make git show
the files as renamed (the header file won't be shown as a rename even
then - since all it contains is symbol definitions, it looks almost
completely different).
Even with the renames showing as renames, the diffs are not all that
easy to read, since so much is just the renames. But Eric Biggers
noticed that it's not just all renames: the initialization of the
xts_tweak had been broken too, using the inode number rather than the
page offset.
That's not right - it makes the xfs_tweak the same for all pages of each
inode. It _might_ make sense to make the xfs_tweak contain both the
offset _and_ the inode number, but not just the inode number.
Reported-by: Eric Biggers <ebiggers3@gmail.com>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Kernel zero day testing warned about address space confusion. A virtual
iomem address was used where a physical address is expected. The
offending functions implement an optional part of the api, so they are
removed. They can be added later, after testing.
Fixes: a1b3695820
Signed-off-by: Allen Hubbe <Allen.Hubbe@emc.com>
Acked-by: Xiangliang Yu <Xiangliang.Yu@amd.com>
Signed-off-by: Jon Mason <jdmason@kudzu.us>
* switch orangefs_remount() to taking ORANGEFS_SB(sb) instead of sb
* remove from the list _before_ orangefs_unmount() - request_mutex
in the latter will make sure that nothing observed in the loop in
ORANGEFS_DEV_REMOUNT_ALL handling will get freed until the end
of loop
* on removal, keep the forward pointer and zero the back one. That
way we can drop and regain the spinlock in the loop body (again,
ORANGEFS_DEV_REMOUNT_ALL one) and still be able to get to the
rest of the list.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
Error should only be returned if nothing had been read/written.
Otherwise we need to report a short read/write instead.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
a) open files can't have NULL inodes
b) it's SEEK_END, not ORANGEFS_SEEK_END; no need to get cute.
c) make_bad_inode() on lseek()?
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
just have it return the slot number or -E... - the caller checks
the sign anyway
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
no point, really - we couldn't keep those across the calls of
getdents(); it would be too easy to DoS, having all slots exhausted.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
Merge fourth patch-bomb from Andrew Morton:
"A lot more stuff than expected, sorry. A bunch of ocfs2 reviewing was
finished off.
- mhocko's oom-reaper out-of-memory-handler changes
- ocfs2 fixes and features
- KASAN feature work
- various fixes"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (42 commits)
thp: fix typo in khugepaged_scan_pmd()
MAINTAINERS: fill entries for KASAN
mm/filemap: generic_file_read_iter(): check for zero reads unconditionally
kasan: test fix: warn if the UAF could not be detected in kmalloc_uaf2
mm, kasan: stackdepot implementation. Enable stackdepot for SLAB
arch, ftrace: for KASAN put hard/soft IRQ entries into separate sections
mm, kasan: add GFP flags to KASAN API
mm, kasan: SLAB support
kasan: modify kmalloc_large_oob_right(), add kmalloc_pagealloc_oob_right()
include/linux/oom.h: remove undefined oom_kills_count()/note_oom_kill()
mm/page_alloc: prevent merging between isolated and other pageblocks
drivers/memstick/host/r592.c: avoid gcc-6 warning
ocfs2: extend enough credits for freeing one truncate record while replaying truncate records
ocfs2: extend transaction for ocfs2_remove_rightmost_path() and ocfs2_update_edge_lengths() before to avoid inconsistency between inode and et
ocfs2/dlm: move lock to the tail of grant queue while doing in-place convert
ocfs2: solve a problem of crossing the boundary in updating backups
ocfs2: fix occurring deadlock by changing ocfs2_wq from global to local
ocfs2/dlm: fix BUG in dlm_move_lockres_to_recovery_list
ocfs2/dlm: fix race between convert and recovery
ocfs2: fix a deadlock issue in ocfs2_dio_end_io_write()
...
- Fix for an intel_pstate driver issue related to the handling of
MSR updates uncovered by the recent cpufreq rework (Rafael Wysocki).
- cpufreq core cleanups related to starting governors and frequency
synchronization during resume from system suspend and a locking
fix for cpufreq_quick_get() (Rafael Wysocki, Richard Cochran).
- acpi-cpufreq and powernv cpufreq driver updates (Jisheng Zhang,
Michael Neuling, Richard Cochran, Shilpasri Bhat).
- intel_idle driver update preventing some Skylake-H systems
from hanging during initialization by disabling deep C-states
mishandled by the platform in the problematic configurations (Len
Brown).
- Intel Xeon Phi Processor x200 support for intel_idle (Dasaratharaman
Chandramouli).
- cpuidle menu governor updates to make it always honor PM QoS
latency constraints (and prevent C1 from being used as the
fallback C-state on x86 when they are set below its exit latency)
and to restore the previous behavior to fall back to C1 if the next
timer event is set far enough in the future that was changed in 4.4
which led to an energy consumption regression (Rik van Riel, Rafael
Wysocki).
- New device ID for a future AMD UART controller in the ACPI driver
for AMD SoCs (Wang Hongcheng).
- Rockchip rk3399 support for the rockchip-io-domain adaptive voltage
scaling (AVS) driver (David Wu).
- ACPI PCI resources management fix for the handling of IO space
resources on architectures where the IO space is memory mapped
(IA64 and ARM64) broken by the introduction of common ACPI
resources parsing for PCI host bridges in 4.4 (Lorenzo Pieralisi).
- Fix for the ACPI backend of the generic device properties API
to make it parse non-device (data node only) children of an
ACPI device correctly (Irina Tirdea).
- Fixes for the handling of global suspend flags (introduced in 4.4)
during hibernation and resume from it (Lukas Wunner).
- Support for obtaining configuration information from Device Trees
in the PM clocks framework (Jon Hunter).
- ACPI _DSM helper code and devfreq framework cleanups (Colin Ian
King, Geert Uytterhoeven).
/
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)
iQIcBAABCAAGBQJW9bXRAAoJEILEb/54YlRxjX8P/38haQ1cs7aNyHjv+eAAXGDq
kr+oNG+cE5D8/X6wT7pWliIRkLzZM2+D/ec2QdA9kFnB/8DNoKdeJ2vi/K3cfbVO
Jz3W97GgwdxSjzSxF2MjHSP/AGAZSvipzH9aL4ofxSFdPNWnget/58bUMo/HdRPH
+vtAfTcfYxYCiJAKJMntvCjWuMZqDTM+YUcTkfUp5jDqvNStqzHvhZCFqo4lpci6
pJAUIkaSXo6lmazIfyPgYQLVEXN1ljbXceJFP84Uk+XfaAEKmtzi5aI11MADqUwj
7TXCR9p6wb678Rbb7FCTVBkOFvQ607+qASG2lMe8IxGa0l7rmyNpVKuQ1uKHLCwp
ozMV3oLVaG/HyZTHpUN6nYXF7QgHWmNk+YZcpun0JTk/ehwGQTOt3B2Zheianyq/
I0lFnBqTFI4e0cuYTDv6N7CKAK7rsBHvoNB5t/oPbtAzdGbeDpceoI1R8Mj7hbSj
zOf+Q46AVyC7neWWbY5QJvKnWp8fVMzlj1p3BqzWD5XWWyaYE2f/xijFM+jU34lE
jx+X/C0N7vZ7cL2x5Zd4BD9E80D1MxqzW1lZ763lMg8bmpQaPNDFWH8rmq/r7CUv
uf0HC91ndTaJ8yoV8gUNXPWQrev8w5Gcrse1LDXXGDfXq7MupLgE+MKxRm8oW+hr
uEAOfwkU9eeswv+jWPJx
=KY4f
-----END PGP SIGNATURE-----
Merge tag 'pm+acpi-4.6-rc1-3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management fixlet from Rafael Wysocki:
"One of commits in my previous pull request changed the permissions of
drivers/power/avs/rockchip-io-domain.c to executable by mistake"
* tag 'pm+acpi-4.6-rc1-3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
Fix permissions of drivers/power/avs/rockchip-io-domain.c
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJW9bT9AAoJEKurIx+X31iBiiQP/3Ebwh+ZahGsQiiLExa5LNTt
SgXB2dBy7qtJz3RSeaaKXsgwB+FqNpviLrUiF2pT9o88tG/d7CiBQpoloYhzR3oM
vnuXNHnolUuhcKvUTZQDsgSa0Y2RiaTJbO8/TGdahjLUBmeZZEWeDnOF15TVU30S
K4FrSryVLGdt7hk86zsQCUyMONOLODiQlwGUyzqZAI01ndDEVWusfBbxmeCVDCsk
K6ycx2TAfGgKYfFJXJIAEY053xsFna/R0f+DeMsr+xLgoN4fY/fZLaLyncjm5ioj
smcHHurN54nB0Fvn2cD2Mjqn/0KHx/gC3Yfa2C3DL6e79RRXAFH0RhRZWHR55Itr
MXd5eCQnnRHYgy2LR06fTQtLCzh5ZgEcroWVx/nFdRfXqdNgGr5s/23lLhOrouTA
2/HD6ZZnWXdL+c1r5duEhsC5qyUzesg2ZNtGRmShxJm9roMpo0LQkENv9BoZ30Tz
KQcXLeWokDDDPNRA5OBiZ053WQkOQ4bY0/qyHhxiWY2ORdVUMiT/l8qWv9JjNisN
plFJ93pvl7Q8AIupj9Pl4wXNJAoY367hevSTvpc+HkRU7YPfQiZbhHQIcfqbeaOa
PSd84vRgUqGeuTdiW/jxVgs7UAH0CM9emOr4bgI90xTD588R2foNWiDik/wFDZn7
TrgwUrsVCaFFbsSe3WQC
=Zn50
-----END PGP SIGNATURE-----
Merge tag 'please-pull-preadv2' of git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux
Pull ia64 update from Tony Luck:
"Wire up new system calls p{read,write}v2 for ia64"
* tag 'please-pull-preadv2' of git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux:
[IA64] Enable preadv2 and pwritev2 syscalls for ia64
Pull more input updates from Dmitry Torokhov:
"Second round of updates for the input subsystem.
The BYD PS/2 protocol driver now uses absolute reporting mode and
should behave more like other touchpads; Synaptics driver needed to
extend one of its quirks to a newer firmware version, and a few USB
drivers got tightened up checks for the contents of their descriptors"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input:
Input: sur40 - fix DMA on stack
Input: ati_remote2 - fix crashes on detecting device with invalid descriptor
Input: synaptics - handle spurious release of trackstick buttons, again
Input: synaptics-rmi4 - remove check of Non-NULL array
Input: byd - enable absolute mode
Input: ims-pcu - sanity check against missing interfaces
Input: melfas_mip4 - add hw_version sysfs attribute
!PageLRU should lead to SCAN_PAGE_LRU, not SCAN_SCAN_ABORT result.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Ebru Akagunduz <ebru.akagunduz@gmail.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If
- generic_file_read_iter() gets called with a zero read length,
- the read offset is at a page boundary,
- IOCB_DIRECT is not set
- and the page in question hasn't made it into the page cache yet,
then do_generic_file_read() will trigger a readahead with a req_size hint
of zero.
Since roundup_pow_of_two(0) is undefined, UBSAN reports
UBSAN: Undefined behaviour in include/linux/log2.h:63:13
shift exponent 64 is too large for 64-bit type 'long unsigned int'
CPU: 3 PID: 1017 Comm: sa1 Tainted: G L 4.5.0-next-20160318+ #14
[...]
Call Trace:
[...]
[<ffffffff813ef61a>] ondemand_readahead+0x3aa/0x3d0
[<ffffffff813ef61a>] ? ondemand_readahead+0x3aa/0x3d0
[<ffffffff813c73bd>] ? find_get_entry+0x2d/0x210
[<ffffffff813ef9c3>] page_cache_sync_readahead+0x63/0xa0
[<ffffffff813cc04d>] do_generic_file_read+0x80d/0xf90
[<ffffffff813cc955>] generic_file_read_iter+0x185/0x420
[...]
[<ffffffff81510b06>] __vfs_read+0x256/0x3d0
[...]
when get_init_ra_size() gets called from ondemand_readahead().
The net effect is that the initial readahead size is arch dependent for
requested read lengths of zero: for example, since
1UL << (sizeof(unsigned long) * 8)
evaluates to 1 on x86 while its result is 0 on ARMv7, the initial readahead
size becomes 4 on the former and 0 on the latter.
What's more, whether or not the file access timestamp is updated for zero
length reads is decided differently for the two cases of IOCB_DIRECT
being set or cleared: in the first case, generic_file_read_iter()
explicitly skips updating that timestamp while in the latter case, it is
always updated through the call to do_generic_file_read().
According to POSIX, zero length reads "do not modify the last data access
timestamp" and thus, the IOCB_DIRECT behaviour is POSIXly correct.
Let generic_file_read_iter() unconditionally check the requested read
length at its entry and return immediately with success if it is zero.
Signed-off-by: Nicolai Stange <nicstange@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Implement the stack depot and provide CONFIG_STACKDEPOT. Stack depot
will allow KASAN store allocation/deallocation stack traces for memory
chunks. The stack traces are stored in a hash table and referenced by
handles which reside in the kasan_alloc_meta and kasan_free_meta
structures in the allocated memory chunks.
IRQ stack traces are cut below the IRQ entry point to avoid unnecessary
duplication.
Right now stackdepot support is only enabled in SLAB allocator. Once
KASAN features in SLAB are on par with those in SLUB we can switch SLUB
to stackdepot as well, thus removing the dependency on SLUB stack
bookkeeping, which wastes a lot of memory.
This patch is based on the "mm: kasan: stack depots" patch originally
prepared by Dmitry Chernenkov.
Joonsoo has said that he plans to reuse the stackdepot code for the
mm/page_owner.c debugging facility.
[akpm@linux-foundation.org: s/depot_stack_handle/depot_stack_handle_t]
[aryabinin@virtuozzo.com: comment style fixes]
Signed-off-by: Alexander Potapenko <glider@google.com>
Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Andrey Konovalov <adech.fo@gmail.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Konstantin Serebryany <kcc@google.com>
Cc: Dmitry Chernenkov <dmitryc@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
KASAN needs to know whether the allocation happens in an IRQ handler.
This lets us strip everything below the IRQ entry point to reduce the
number of unique stack traces needed to be stored.
Move the definition of __irq_entry to <linux/interrupt.h> so that the
users don't need to pull in <linux/ftrace.h>. Also introduce the
__softirq_entry macro which is similar to __irq_entry, but puts the
corresponding functions to the .softirqentry.text section.
Signed-off-by: Alexander Potapenko <glider@google.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Andrey Konovalov <adech.fo@gmail.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Konstantin Serebryany <kcc@google.com>
Cc: Dmitry Chernenkov <dmitryc@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add GFP flags to KASAN hooks for future patches to use.
This patch is based on the "mm: kasan: unified support for SLUB and SLAB
allocators" patch originally prepared by Dmitry Chernenkov.
Signed-off-by: Alexander Potapenko <glider@google.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Andrey Konovalov <adech.fo@gmail.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Konstantin Serebryany <kcc@google.com>
Cc: Dmitry Chernenkov <dmitryc@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add KASAN hooks to SLAB allocator.
This patch is based on the "mm: kasan: unified support for SLUB and SLAB
allocators" patch originally prepared by Dmitry Chernenkov.
Signed-off-by: Alexander Potapenko <glider@google.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Andrey Konovalov <adech.fo@gmail.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Konstantin Serebryany <kcc@google.com>
Cc: Dmitry Chernenkov <dmitryc@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patchset implements SLAB support for KASAN
Unlike SLUB, SLAB doesn't store allocation/deallocation stacks for heap
objects, therefore we reimplement this feature in mm/kasan/stackdepot.c.
The intention is to ultimately switch SLUB to use this implementation as
well, which will save a lot of memory (right now SLUB bloats each object
by 256 bytes to store the allocation/deallocation stacks).
Also neither SLUB nor SLAB delay the reuse of freed memory chunks, which
is necessary for better detection of use-after-free errors. We
introduce memory quarantine (mm/kasan/quarantine.c), which allows
delayed reuse of deallocated memory.
This patch (of 7):
Rename kmalloc_large_oob_right() to kmalloc_pagealloc_oob_right(), as
the test only checks the page allocator functionality. Also reimplement
kmalloc_large_oob_right() so that the test allocates a large enough
chunk of memory that still does not trigger the page allocator fallback.
Signed-off-by: Alexander Potapenko <glider@google.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Andrey Konovalov <adech.fo@gmail.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Konstantin Serebryany <kcc@google.com>
Cc: Dmitry Chernenkov <dmitryc@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
A leftover from commit c32b3cbe0d ("oom, PM: make OOM detection in the
freezer path raceless").
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Hanjun Guo has reported that a CMA stress test causes broken accounting of
CMA and free pages:
> Before the test, I got:
> -bash-4.3# cat /proc/meminfo | grep Cma
> CmaTotal: 204800 kB
> CmaFree: 195044 kB
>
>
> After running the test:
> -bash-4.3# cat /proc/meminfo | grep Cma
> CmaTotal: 204800 kB
> CmaFree: 6602584 kB
>
> So the freed CMA memory is more than total..
>
> Also the the MemFree is more than mem total:
>
> -bash-4.3# cat /proc/meminfo
> MemTotal: 16342016 kB
> MemFree: 22367268 kB
> MemAvailable: 22370528 kB
Laura Abbott has confirmed the issue and suspected the freepage accounting
rewrite around 3.18/4.0 by Joonsoo Kim. Joonsoo had a theory that this is
caused by unexpected merging between MIGRATE_ISOLATE and MIGRATE_CMA
pageblocks:
> CMA isolates MAX_ORDER aligned blocks, but, during the process,
> partialy isolated block exists. If MAX_ORDER is 11 and
> pageblock_order is 9, two pageblocks make up MAX_ORDER
> aligned block and I can think following scenario because pageblock
> (un)isolation would be done one by one.
>
> (each character means one pageblock. 'C', 'I' means MIGRATE_CMA,
> MIGRATE_ISOLATE, respectively.
>
> CC -> IC -> II (Isolation)
> II -> CI -> CC (Un-isolation)
>
> If some pages are freed at this intermediate state such as IC or CI,
> that page could be merged to the other page that is resident on
> different type of pageblock and it will cause wrong freepage count.
This was supposed to be prevented by CMA operating on MAX_ORDER blocks,
but since it doesn't hold the zone->lock between pageblocks, a race
window does exist.
It's also likely that unexpected merging can occur between
MIGRATE_ISOLATE and non-CMA pageblocks. This should be prevented in
__free_one_page() since commit 3c605096d3 ("mm/page_alloc: restrict
max order of merging on isolated pageblock"). However, we only check
the migratetype of the pageblock where buddy merging has been initiated,
not the migratetype of the buddy pageblock (or group of pageblocks)
which can be MIGRATE_ISOLATE.
Joonsoo has suggested checking for buddy migratetype as part of
page_is_buddy(), but that would add extra checks in allocator hotpath
and bloat-o-meter has shown significant code bloat (the function is
inline).
This patch reduces the bloat at some expense of more complicated code.
The buddy-merging while-loop in __free_one_page() is initially bounded
to pageblock_border and without any migratetype checks. The checks are
placed outside, bumping the max_order if merging is allowed, and
returning to the while-loop with a statement which can't be possibly
considered harmful.
This fixes the accounting bug and also removes the arguably weird state
in the original commit 3c605096d3 where buddies could be left
unmerged.
Fixes: 3c605096d3 ("mm/page_alloc: restrict max order of merging on isolated pageblock")
Link: https://lkml.org/lkml/2016/3/2/280
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reported-by: Hanjun Guo <guohanjun@huawei.com>
Tested-by: Hanjun Guo <guohanjun@huawei.com>
Acked-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Debugged-by: Laura Abbott <labbott@redhat.com>
Debugged-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Cc: Michal Nazarewicz <mina86@mina86.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: <stable@vger.kernel.org> [3.18+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>