As part of the cleanup of some remaining y2038 issues, I came to
fs/compat_ioctl.c, which still has a couple of commands that need support
for time64_t.
In completely unrelated work, I spent time on cleaning up parts of this
file in the past, moving things out into drivers instead.
After Al Viro reviewed an earlier version of this series and did a lot
more of that cleanup, I decided to try to completely eliminate the rest
of it and move it all into drivers.
This series incorporates some of Al's work and many patches of my own,
but in the end stops short of actually removing the last part, which is
the scsi ioctl handlers. I have patches for those as well, but they need
more testing or possibly a rewrite.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQIcBAABCAAGBQJdsHCdAAoJEJpsee/mABjZtYkP/1JGl3jFv3Iq/5BCdPkaePP1
RtMJRNfURgK3GeuHUui330PvVjI/pLWXU/VXMK2MPTASpJLzYz3uCaZrpVWEMpDZ
+ImzGmgJkITlW1uWU3zOcQhOxTyb1hCZ0Ci+2xn9QAmyOL7prXoXCXDWv3h6iyiF
lwG+nW+HNtyx41YG+9bRfKNoG0ZJ+nkJ70BV6u0acQHXWn7Xuupa9YUmBL87hxAL
6dlJfLTJg6q8QSv/Q6LxslfWk2Ti8OOJZOwtFM5R8Bgl0iUcvshiRCKfv/3t9jXD
dJNvF1uq8z+gracWK49Qsfq5dnZ2ZxHFUo9u0NjbCrxNvWH/sdvhbaUBuJI75seH
VIznCkdxFhrqitJJ8KmxANxG08u+9zSKjSlxG2SmlA4qFx/AoStoHwQXcogJscNb
YIXYKmWBvwPzYu09QFAXdHFPmZvp/3HhMWU6o92lvDhsDwzkSGt3XKhCJea4DCaT
m+oCcoACqSWhMwdbJOEFofSub4bY43s5iaYuKes+c8O261/Dwg6v/pgIVez9mxXm
TBnvCsotq5m8wbwzv99eFqGeJH8zpDHrXxEtRR5KQqMqjLq/OQVaEzmpHZTEuK7n
e/V/PAKo2/V63g4k6GApQXDxnjwT+m0aWToWoeEzPYXS6KmtWC91r4bWtslu3rdl
bN65armTm7bFFR32Avnu
=lgCl
-----END PGP SIGNATURE-----
Merge tag 'compat-ioctl-5.5' of git://git.kernel.org:/pub/scm/linux/kernel/git/arnd/playground
Pull removal of most of fs/compat_ioctl.c from Arnd Bergmann:
"As part of the cleanup of some remaining y2038 issues, I came to
fs/compat_ioctl.c, which still has a couple of commands that need
support for time64_t.
In completely unrelated work, I spent time on cleaning up parts of
this file in the past, moving things out into drivers instead.
After Al Viro reviewed an earlier version of this series and did a lot
more of that cleanup, I decided to try to completely eliminate the
rest of it and move it all into drivers.
This series incorporates some of Al's work and many patches of my own,
but in the end stops short of actually removing the last part, which
is the scsi ioctl handlers. I have patches for those as well, but they
need more testing or possibly a rewrite"
* tag 'compat-ioctl-5.5' of git://git.kernel.org:/pub/scm/linux/kernel/git/arnd/playground: (42 commits)
scsi: sd: enable compat ioctls for sed-opal
pktcdvd: add compat_ioctl handler
compat_ioctl: move SG_GET_REQUEST_TABLE handling
compat_ioctl: ppp: move simple commands into ppp_generic.c
compat_ioctl: handle PPPIOCGIDLE for 64-bit time_t
compat_ioctl: move PPPIOCSCOMPRESS to ppp_generic
compat_ioctl: unify copy-in of ppp filters
tty: handle compat PPP ioctls
compat_ioctl: move SIOCOUTQ out of compat_ioctl.c
compat_ioctl: handle SIOCOUTQNSD
af_unix: add compat_ioctl support
compat_ioctl: reimplement SG_IO handling
compat_ioctl: move WDIOC handling into wdt drivers
fs: compat_ioctl: move FITRIM emulation into file systems
gfs2: add compat_ioctl support
compat_ioctl: remove unused convert_in_user macro
compat_ioctl: remove last RAID handling code
compat_ioctl: remove /dev/raw ioctl translation
compat_ioctl: remove PCI ioctl translation
compat_ioctl: remove joystick ioctl translation
...
Pull sysctl system call removal from Eric Biederman:
"As far as I can tell we have reached the point where no one enables
the sysctl system call anymore. It still is enabled in a few
defconfigs but they are mostly the rarely used one and in asking
people about that it was more cut & paste enabled than anything else.
This is single commit that just deletes code. Leaving just enough code
so that the deprecated sysctl warning continues to be printed. If my
analysis turns out to be wrong and someone actually cares it will be
easy to revert this commit and have the system call again.
There was one new xtensa defconfig in linux-next that enabled the
system call this cycle and when asked about it the maintainer of the
code replied that it was not enabled on purpose. As of today's
linux-next tree that defconfig no longer enables the system call.
What we saw in the review discussion was that if we go a step farther
than my patch and mess with uapi headers there are pieces of code that
won't compile, but nothing minds the system call actually disappearing
from the kernel"
Link: https://lore.kernel.org/lkml/201910011140.EA0181F13@keescook/
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
sysctl: Remove the sysctl system call
Paolo Abeni says:
====================
openvswitch: remove a couple of BUG_ON()
The openvswitch kernel datapath includes some BUG_ON() statements to check
for exceptional/unexpected failures. These patches drop a couple of them,
where we can do that without introducing other side effects.
v1 -> v2:
- avoid memory leaks on error path
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
If we can't build the flow del notification, we can simply delete
the flow, no need to crash the kernel. Still keep a WARN_ON to
preserve debuggability.
Note: the BUG_ON() predates the Fixes tag, but this change
can be applied only after the mentioned commit.
v1 -> v2:
- do not leak an skb on error
Fixes: aed067783e ("openvswitch: Minimize ovs_flow_cmd_del critical section.")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
All the callers of ovs_flow_cmd_build_info() already deal with
error return code correctly, so we can handle the error condition
in a more gracefull way. Still dump a warning to preserve
debuggability.
v1 -> v2:
- clarify the commit message
- clean the skb and report the error (DaveM)
Fixes: ccb1352e76 ("net: Add Open vSwitch kernel components.")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
It was reported [0] that since the referenced commit a warning is
triggered in phylib that complains about paged operations being used
with a PHY driver that doesn't support this. The commit isn't wrong,
just for one chip version (RTL8105e) no dedicated PHY driver exists
yet. So add the missing PHY driver.
[0] https://bugzilla.kernel.org/show_bug.cgi?id=202103
Fixes: 3a129e3f9a ("r8169: switch to phylib functions in more places")
Reported-by: jhdskag3 <jhdskag3@tutanota.com>
Tested-by: jhdskag3 <jhdskag3@tutanota.com>
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
It was reported [0] that network doesn't wake up on cable plug-in with
certain chip versions. Reason is that on these chip versions the PHY
doesn't detect cable plug-in when being in power-down mode. So prevent
the PHY from powering down if WoL is enabled.
[0] https://bugzilla.kernel.org/show_bug.cgi?id=202103
Fixes: 95fb8bb318 ("net: phy: force phy suspend when calling phy_stop")
Reported-by: jhdskag3 <jhdskag3@tutanota.com>
Tested-by: jhdskag3 <jhdskag3@tutanota.com>
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Alan reported [0] that network is broken since the referenced commit
when using jumbo frames. This commit isn't wrong, it just revealed
another issue that has been existing before. According to the vendor
driver the RTL8168e-specific jumbo config doesn't apply for RTL8168evl.
[0] https://lkml.org/lkml/2019/11/30/119
Fixes: 4ebcb113ed ("r8169: fix jumbo packet handling on resume from suspend")
Reported-by: Alan J. Wylie <alan@wylie.me.uk>
Tested-by: Alan J. Wylie <alan@wylie.me.uk>
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If a block device supports rw_page operation, it doesn't submit bios so
the annotation in submit_bio() for refault stall doesn't work. It
happens with zram in android, especially swap read path which could
consume CPU cycle for decompress. It is also a problem for zswap which
uses frontswap.
Annotate swap_readpage() to account the synchronous IO overhead to
prevent underreport memory pressure.
[akpm@linux-foundation.org: add comment, per Johannes]
Link: http://lkml.kernel.org/r/20191010152134.38545-1-minchan@kernel.org
Signed-off-by: Minchan Kim <minchan@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Cc: Seth Jennings <sjenning@redhat.com>
Cc: Dan Streetman <ddstreet@ieee.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
End a Kconfig help text sentence with a period (aka full stop).
Link: http://lkml.kernel.org/r/c17f2c75-dc2a-42a4-2229-bb6b489addf2@infradead.org
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There are several places emphasise the effect of __SetPageUptodate(),
while the comment seems to have a typo in two places.
Link: http://lkml.kernel.org/r/20190926023705.7226-1-richardw.yang@linux.intel.com
Signed-off-by: Wei Yang <richardw.yang@linux.intel.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The member in struct zonelist is _zonerefs instead of zones.
Link: http://lkml.kernel.org/r/20190927144049.GA29622@haolee.github.io
Signed-off-by: Hao Lee <haolee.swjtu@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Wei Yang <richardw.yang@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In 64bit system. sb->s_maxbytes of shmem filesystem is MAX_LFS_FILESIZE,
which equal LLONG_MAX.
If offset > LLONG_MAX - PAGE_SIZE, offset + len < LLONG_MAX in
shmem_fallocate, which will pass the checking in vfs_fallocate.
/* Check for wrap through zero too */
if (((offset + len) > inode->i_sb->s_maxbytes) || ((offset + len) < 0))
return -EFBIG;
loff_t unmap_start = round_up(offset, PAGE_SIZE) in shmem_fallocate
causes a overflow.
Syzkaller reports a overflow problem in mm/shmem:
UBSAN: Undefined behaviour in mm/shmem.c:2014:10
signed integer overflow: '9223372036854775807 + 1' cannot be represented in type 'long long int'
CPU: 0 PID:17076 Comm: syz-executor0 Not tainted 4.1.46+ #1
Hardware name: linux, dummy-virt (DT)
Call trace:
dump_backtrace+0x0/0x2c8 arch/arm64/kernel/traps.c:100
show_stack+0x20/0x30 arch/arm64/kernel/traps.c:238
__dump_stack lib/dump_stack.c:15 [inline]
ubsan_epilogue+0x18/0x70 lib/ubsan.c:164
handle_overflow+0x158/0x1b0 lib/ubsan.c:195
shmem_fallocate+0x6d0/0x820 mm/shmem.c:2104
vfs_fallocate+0x238/0x428 fs/open.c:312
SYSC_fallocate fs/open.c:335 [inline]
SyS_fallocate+0x54/0xc8 fs/open.c:239
The highest bit of unmap_start will be appended with sign bit 1
(overflow) when calculate shmem_falloc.start:
shmem_falloc.start = unmap_start >> PAGE_SHIFT.
Fix it by casting the type of unmap_start to u64, when right shifted.
This bug is found in LTS Linux 4.1. It also seems to exist in mainline.
Link: http://lkml.kernel.org/r/1573867464-5107-1-git-send-email-chenjun102@huawei.com
Signed-off-by: Chen Jun <chenjun102@huawei.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Qian Cai <cai@lca.pw>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The shmem_writepage() uses GFP_ATOMIC to allocate swap cache. GFP_ATOMIC
used to mean __GFP_HIGH, but now it means __GFP_HIGH | __GFP_ATOMIC |
__GFP_KSWAPD_RECLAIM. However, shmem_writepage() should write out to swap
only in response to memory pressure, so __GFP_KSWAPD_RECLAIM looks useless
since the caller may be kswapd itself or in direct reclaim already.
In addition, XArray node allocations from PF_MEMALLOC contexts could
completely exhaust the page allocator, __GFP_NOMEMALLOC stops emergency
reserves from being allocated.
Here just copy the gfp flags used by add_to_swap().
Hugh:
"a cleanup to make the two calls look the same when they don't need to
be different (whereas the call from __read_swap_cache_async() rightly
uses a lower priority gfp)".
Link: http://lkml.kernel.org/r/1572991351-86061-1-git-send-email-yang.shi@linux.alibaba.com
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Don't populate the array 'values' on the stack but instead make it static
const. Makes the object code smaller by 111 bytes.
Before:
text data bss dec hex filename
108612 11169 512 120293 1d5e5 mm/shmem.o
After:
text data bss dec hex filename
108437 11233 512 120182 1d576 mm/shmem.o
(gcc version 9.2.1, amd64)
Link: http://lkml.kernel.org/r/20190906143012.28698-1-colin.king@canonical.com
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
A while ago Andy noticed
(http://lkml.kernel.org/r/CALCETrWY+5ynDct7eU_nDUqx=okQvjm=Y5wJvA4ahBja=CQXGw@mail.gmail.com)
that UFFD_FEATURE_EVENT_FORK used by an unprivileged user may have
security implications.
As the first step of the solution the following patch limits the availably
of UFFD_FEATURE_EVENT_FORK only for those having CAP_SYS_PTRACE.
The usage of CAP_SYS_PTRACE ensures compatibility with CRIU.
Yet, if there are other users of non-cooperative userfaultfd that run
without CAP_SYS_PTRACE, they would be broken :(
Current implementation of UFFD_FEATURE_EVENT_FORK modifies the file
descriptor table from the read() implementation of uffd, which may have
security implications for unprivileged use of the userfaultfd.
Limit availability of UFFD_FEATURE_EVENT_FORK only for callers that have
CAP_SYS_PTRACE.
Link: http://lkml.kernel.org/r/1572967777-8812-2-git-send-email-rppt@linux.ibm.com
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: Daniel Colascione <dancol@google.com>
Cc: Jann Horn <jannh@google.com>
Cc: Lokesh Gidra <lokeshgidra@google.com>
Cc: Nick Kralevich <nnk@google.com>
Cc: Nosh Minwalla <nosh@google.com>
Cc: Pavel Emelyanov <ovzxemul@gmail.com>
Cc: Tim Murray <timmurray@google.com>
Cc: Aleksa Sarai <cyphar@cyphar.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If the registration is repeated without VM_UFFD_MISSING or VM_UFFD_WP they
need to be cleared. Currently setting UFFDIO_REGISTER_MODE_WP returns
-EINVAL, so this patch is a noop until the UFFDIO_REGISTER_MODE_WP support
is applied.
Link: http://lkml.kernel.org/r/20191004232834.GP13922@redhat.com
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Reported-by: Wei Yang <richardw.yang@linux.intel.com>
Reviewed-by: Wei Yang <richardw.yang@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When doing UFFDIO_COPY, it is necessary to find the correct destination
vma and make sure fault range is in it.
Since there are two places need to do the same task, just wrap those
common check into an inlined function.
Link: http://lkml.kernel.org/r/20190927070032.2129-3-richardw.yang@linux.intel.com
Signed-off-by: Wei Yang <richardw.yang@linux.intel.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
These warning here is to make sure address(dst_addr) and length(len -
copied) are huge page size aligned.
While this is ensured by:
dst_start and len is huge page size aligned
dst_addr equals to dst_start and increase huge page size each time
copied increase huge page size each time
This means these warnings will never be triggered.
Link: http://lkml.kernel.org/r/20190927070032.2129-2-richardw.yang@linux.intel.com
Signed-off-by: Wei Yang <richardw.yang@linux.intel.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In __mcopy_atomic_hugetlb() we use two variables to deal with huge page
size: vma_hpagesize and huge_page_size.
Since they are the same, it is not necessary to use two different
mechanism. This patch makes it consistent by all using vma_hpagesize.
Link: http://lkml.kernel.org/r/20190927070032.2129-1-richardw.yang@linux.intel.com
Signed-off-by: Wei Yang <richardw.yang@linux.intel.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
page_size() is supported after the commit a50b854e07 ("mm: introduce
page_size()").
Use page_size() in madvise_inject_error() for readability.
[akpm@linux-foundation.org: use ulong for `size', per David]
Link: http://lkml.kernel.org/r/29dce60c-38d6-0220-f292-e298f0c78c4d@huawei.com
Signed-off-by: Yunfeng Ye <yeyunfeng@huawei.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Hu Shiyuan <hushiyuan@huawei.com>
Cc: Feilong Lin <linfeilong@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Case 1/6, 2/7 and 3/8 have the same pattern and we handle them in the
same logic.
Rearrange the comment to make it a little easy for audience to
understand.
Link: http://lkml.kernel.org/r/20191030012445.16944-1-richardw.yang@linux.intel.com
Signed-off-by: Wei Yang <richardw.yang@linux.intel.com>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Jann Horn <jannh@google.com>
Cc: Darrick J. Wong <darrick.wong@oracle.com>
Cc: Steve Capper <steve.capper@arm.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Yangtao Li <tiny.windzz@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
It is more clear to use DEFINE_DEBUGFS_ATTRIBUTE to define debugfs file
operation rather than DEFINE_SIMPLE_ATTRIBUTE.
Link: http://lkml.kernel.org/r/1572403660-44718-1-git-send-email-zhongjiang@huawei.com
Signed-off-by: zhong jiang <zhongjiang@huawei.com>
Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In auto NUMA balancing page table scanning, if the pte_protnone() is
true, the PTE needs not to be changed because it's in target state
already. So other checking on corresponding struct page is unnecessary
too.
So, if we check pte_protnone() firstly for each PTE, we can avoid
unnecessary struct page accessing, so that reduce the cache footprint of
NUMA balancing page table scanning.
In the performance test of pmbench memory accessing benchmark with 80:20
read/write ratio and normal access address distribution on a 2 socket
Intel server with Optance DC Persistent Memory, perf profiling shows
that the autonuma page table scanning time reduces from 1.23% to 0.97%
(that is, reduced 21%) with the patch.
Link: http://lkml.kernel.org/r/20191101075727.26683-3-ying.huang@intel.com
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When zone_watermark_ok() is called in migrate_balanced_pgdat() to check
migration target node, the parameter classzone_idx (for requested zone)
is specified as 0 (ZONE_DMA). But when allocating memory for autonuma
in alloc_misplaced_dst_page(), the requested zone from GFP flags is
ZONE_MOVABLE. That is, the requested zone is different. The size of
lowmem_reserve for the different requested zone is different. And this
may cause some issues.
For example, in the zoneinfo of a test machine as below,
Node 0, zone DMA32
pages free 61592
min 29
low 454
high 879
spanned 1044480
present 442306
managed 425921
protection: (0, 0, 62457, 62457, 62457)
The free page number of ZONE_DMA32 is greater than "high watermark +
lowmem_reserve[ZONE_DMA]", but less than "high watermark +
lowmem_reserve[ZONE_MOVABLE]". And because __alloc_pages_node() in
alloc_misplaced_dst_page() requests ZONE_MOVABLE, the
zone_watermark_ok() on ZONE_DMA32 in migrate_balanced_pgdat() may always
return true. So, autonuma may not stop even when memory pressure in
node 0 is heavy.
To fix the issue, ZONE_MOVABLE is used as parameter to call
zone_watermark_ok() in migrate_balanced_pgdat(). This makes it same as
requested zone in alloc_misplaced_dst_page(). So that
migrate_balanced_pgdat() returns false when memory pressure is heavy.
Link: http://lkml.kernel.org/r/20191101075727.26683-2-ying.huang@intel.com
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
It is more clear to use DEFINE_DEBUGFS_ATTRIBUTE to define debugfs file
operation rather than DEFINE_SIMPLE_ATTRIBUTE.
Link: http://lkml.kernel.org/r/1572348687-9951-1-git-send-email-zhongjiang@huawei.com
Signed-off-by: zhong jiang <zhongjiang@huawei.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Yue Hu <huyue2@yulong.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
kzalloc() is used for cma bitmap allocation in cma_activate_area(),
switch to bitmap_zalloc() for clarity.
Link: http://lkml.kernel.org/r/895d4627-f115-c77a-d454-c0a196116426@huawei.com
Signed-off-by: Yunfeng Ye <yeyunfeng@huawei.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Yue Hu <huyue2@yulong.com>
Cc: Peng Fan <peng.fan@nxp.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Ryohei Suzuki <ryh.szk.cmnty@gmail.com>
Cc: Andrey Konovalov <andreyknvl@google.com>
Cc: Doug Berger <opendmb@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
For non-shmem file THPs, khugepaged only collapses read only .text
mapping (VM_DENYWRITE). These pages should not be dirty except the case
where the file hasn't been flushed since first write.
Call filemap_flush() in collapse_file() to accelerate the write back in
such cases.
Link: http://lkml.kernel.org/r/20191106060930.2571389-3-songliubraving@fb.com
Signed-off-by: Song Liu <songliubraving@fb.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: William Kucharski <william.kucharski@oracle.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Adding fully unmapped pages into deferred split queue is not productive:
these pages are about to be freed or they are pinned and cannot be split
anyway.
Link: http://lkml.kernel.org/r/20190913091849.11151-1-kirill.shutemov@linux.intel.com
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When doing migration if the freed page is met, we just return without
migrating it since it is pointless to migrate a freed page. But, the
current code allocates target page unconditionally before handling freed
page, if the page is freed, the newly allocated will be just freed. It
doesn't make too much sense and is just a waste of time although
migrating freed page is rare.
So, handle freed page at the before that to avoid unnecessary page
allocation and free.
Link: http://lkml.kernel.org/r/1573755869-106954-1-git-send-email-yang.shi@linux.alibaba.com
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
split_huge_pages_fops is used for debugfs file. hence, it is more clear
to use DEFINE_DEBUGFS_ATTRIBUTE.
Link: http://lkml.kernel.org/r/1572347674-8111-1-git-send-email-zhongjiang@huawei.com
Signed-off-by: zhong jiang <zhongjiang@huawei.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When mmapping an existing hugetlbfs file with MAP_POPULATE, we find it
is very time consuming. For example, mmapping a 128GB file takes about
50 milliseconds. Sampling with perfevent shows it spends 99% time in
the same_page loop in follow_hugetlb_page().
samples: 205 of event 'cycles', Event count (approx.): 136686374
- 99.04% test_mmap_huget [kernel.kallsyms] [k] follow_hugetlb_page
follow_hugetlb_page
__get_user_pages
__mlock_vma_pages_range
__mm_populate
vm_mmap_pgoff
sys_mmap_pgoff
sys_mmap
system_call_fastpath
__mmap64
follow_hugetlb_page() is called with pages=NULL and vmas=NULL, so for
each hugepage, we run into the same_page loop for pages_per_huge_page()
times, but doing nothing. With this change, it takes less then 1
millisecond to mmap a 128GB file in hugetlbfs.
Link: http://lkml.kernel.org/r/1567581712-5992-1-git-send-email-totty.lu@gmail.com
Signed-off-by: Zhigang Lu <tonnylu@tencent.com>
Reviewed-by: Haozhong Zhang <hzhongzhang@tencent.com>
Reviewed-by: Zongming Zhang <knightzhang@tencent.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Acked-by: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The first parameter hstate in function hugetlb_fault_mutex_hash() is not
used anymore.
This patch removes it.
[akpm@linux-foundation.org: various build fixes]
[cai@lca.pw: fix a GCC compilation warning]
Link: http://lkml.kernel.org/r/1570544108-32331-1-git-send-email-cai@lca.pw
Link: http://lkml.kernel.org/r/20191005003302.785-1-richardw.yang@linux.intel.com
Signed-off-by: Wei Yang <richardw.yang@linux.intel.com>
Signed-off-by: Qian Cai <cai@lca.pw>
Suggested-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Remove duplicated code between region_chg and region_add, and refactor
it into a common function, add_reservation_in_range. This is mostly
done because there is a follow up change in another series that disables
region coalescing in region_add, and I want to make that change in one
place only. It should improve maintainability anyway on its own.
[akpm@linux-foundation.org: coding style fixes]
Link: http://lkml.kernel.org/r/20190919200428.188797-3-almasrymina@google.com
Signed-off-by: Mina Almasry <almasrymina@google.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Greg Thelen <gthelen@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Current behavior is that region_chg provides both a cache entry in
resv->region_cache, AND a placeholder entry in resv->regions.
region_add first tries to use the placeholder, and if it finds that the
placeholder has been deleted by a racing region_del call, it uses the
cache entry.
This behavior is completely unnecessary and is removed in this patch for
a couple of reasons:
1. region_add needs to either find a cached file_region entry in
resv->region_cache, or find an entry in resv->regions to expand. It
does not need both.
2. region_chg adding a placeholder entry in resv->regions opens up
a possible race with region_del, where region_chg adds a placeholder
region in resv->regions, and this region is deleted by a racing call
to region_del during region_chg execution or before region_add is
called. Removing the race makes the code easier to reason about and
maintain.
In addition, a follow up patch in another series that disables region
coalescing, which would be further complicated if the race with
region_del exists.
Link: http://lkml.kernel.org/r/20190919200428.188797-2-almasrymina@google.com
Signed-off-by: Mina Almasry <almasrymina@google.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Greg Thelen <gthelen@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
A customer with large SMP systems (up to 16 sockets) with application
that uses large amount of static hugepages (~500-1500GB) are
experiencing random multisecond delays. These delays were caused by the
long time it took to scan the VMA interval tree with mmap_sem held.
The sharing of huge PMD does not require changes to the i_mmap at all.
Therefore, we can just take the read lock and let other threads
searching for the right VMA share it in parallel. Once the right VMA is
found, either the PMD lock (2M huge page for x86-64) or the
mm->page_table_lock will be acquired to perform the actual PMD sharing.
Lock contention, if present, will happen in the spinlock. That is much
better than contention in the rwsem where the time needed to scan the
the interval tree is indeterminate.
With this patch applied, the customer is seeing significant performance
improvement over the unpatched kernel.
Link: http://lkml.kernel.org/r/20191107211809.9539-1-longman@redhat.com
Signed-off-by: Waiman Long <longman@redhat.com>
Suggested-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
With hugetlbfs, a common pattern for mapping anonymous huge pages is to
create a temporary file first. Currently libraries like libhugetlbfs
and seastar create these with a standard mkstemp+unlink trick, but it
would be more robust to be able to simply pass the O_TMPFILE flag to
open(). O_TMPFILE is already supported by several file systems like
ext4 and xfs. The implementation simply uses the existi= ng d_tmpfile
utility function to instantiate the dcache entry for the file.
Tested manually by successfully creating a temporary file by opening it
with (O_TMPFILE|O_RDWR) on mounted hugetlbfs and successfully mapping 2M
huge pages with it. Without the patch, trying to open a file with
O_TMPFILE results in -ENOSUP.
Link: http://lkml.kernel.org/r/bc9383eff6e1374d79f3a92257ae829ba1e6ae60.1573285189.git.p.sarna@tlen.pl
Signed-off-by: Piotr Sarna <p.sarna@tlen.pl>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
huge_pte_offset() produced a sparse warning due to an improper return
type when the kernel was built with !CONFIG_HUGETLB_PAGE. Fix the bad
type and also convert all the macros in this block to static inline
wrappers. Two existing wrappers in this block had lines in excess of 80
columns so clean those up as well.
No functional change.
Link: http://lkml.kernel.org/r/20191112194558.139389-3-mike.kravetz@oracle.com
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reported-by: Ben Dooks <ben.dooks@codethink.co.uk>
Suggested-by: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "hugetlbfs: convert macros to static inline, fix sparse
warning".
The definition for huge_pte_offset() in <linux/hugetlb.h> causes a
sparse warning in the !CONFIG_HUGETLB_PAGE. Fix this as well as
converting all macros in this block of definitions to static inlines for
better type checking.
When making the above changes, build errors were found in powerpc due to
duplicate definitions. A separate powerpc specific patch is included as
a requisite to remove the definitions and get them from
<linux/hugetlb.h>.
This patch (of 2):
This removes the power specific stubs created by commit aad71e3928
("powerpc/mm: Fix build break with RADIX=y & HUGETLBFS=n") used when
!CONFIG_HUGETLB_PAGE. Instead, it addresses the build break by getting
the definitions from <linux/hugetlb.h>. This allows the macros in
<linux/hugetlb.h> to be replaced with static inlines.
Link: http://lkml.kernel.org/r/20191112194558.139389-2-mike.kravetz@oracle.com
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Acked-by: Michael Ellerman <mpe@ellerman.id.au>
Cc: Ben Dooks <ben.dooks@codethink.co.uk>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
It is assumed that the hugetlbfs_vfsmount[] array will contain either a
valid vfsmount pointer or NULL for each hstate after initialization.
Changes made while converting to use fs_context broke this assumption.
While fixing the hugetlbfs_vfsmount issue, it was discovered that
init_hugetlbfs_fs never did correctly clean up when encountering a vfs
mount error.
It was found during code inspection. A small memory allocation failure
would be the most likely cause of taking a error path with the bug.
This is unlikely to happen as this is early init code.
Link: http://lkml.kernel.org/r/94b6244d-2c24-e269-b12c-e3ba694b242d@oracle.com
Reported-by: Chengguang Xu <cgxu519@mykernel.net>
Fixes: 32021982a3 ("hugetlbfs: Convert to fs_context")
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
A new clang diagnostic (-Wsizeof-array-div) warns about the calculation
to determine the number of u32's in an array of unsigned longs.
Suppress warning by adding parentheses.
While looking at the above issue, noticed that the 'address' parameter
to hugetlb_fault_mutex_hash is no longer used. So, remove it from the
definition and all callers.
No functional change.
Link: http://lkml.kernel.org/r/20190919011847.18400-1-mike.kravetz@oracle.com
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reported-by: Nathan Chancellor <natechancellor@gmail.com>
Reviewed-by: Nathan Chancellor <natechancellor@gmail.com>
Reviewed-by: Davidlohr Bueso <dbueso@suse.de>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Nick Desaulniers <ndesaulniers@google.com>
Cc: Ilie Halip <ilie.halip@gmail.com>
Cc: David Bolvansky <david.bolvansky@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
sparse_buffer_init() use memblock_alloc_try_nid_raw() to allocate memory
for page management structure, if memory allocation fails from specified
node, it will fall back to allocate from other nodes.
Normally, the page management structure will not exceed 2% of the total
memory, but a large continuous block of allocation is needed. In most
cases, memory allocation from the specified node will succeed, but a
node memory become highly fragmented will fail. we expect to allocate
memory base section rather than by allocating a large block of memory
from other NUMA nodes
Add memblock_alloc_exact_nid_raw() for this situation, which allocate
boot memory block on the exact node. If a large contiguous block memory
allocate fail in sparse_buffer_init(), it will fall back to allocate
small block memory base section.
Link: http://lkml.kernel.org/r/66755ea7-ab10-8882-36fd-3e02b03775d5@huawei.com
Signed-off-by: Yunfeng Ye <yeyunfeng@huawei.com>
Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
Cc: Wei Yang <richardw.yang@linux.intel.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Qian Cai <cai@lca.pw>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Change "max_addr" to "end" for less confusion in
memblock_alloc_range_nid comments.
Link: http://lkml.kernel.org/r/20191113051822.3296-1-ruansy.fnst@cn.fujitsu.com
Signed-off-by: Cao jin <caoj.fnst@cn.fujitsu.com>
Signed-off-by: Shiyang Ruan <ruansy.fnst@cn.fujitsu.com>
Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
fix typos for:
elaboarte -> elaborate
architecure -> architecture
compltes -> completes
And, convert the markup :c:func:`foo` to foo() as kernel documentation
toolchain can recognize foo() as a function.
Link: http://lkml.kernel.org/r/20190912123127.8694-1-caoj.fnst@cn.fujitsu.com
Signed-off-by: Cao jin <caoj.fnst@cn.fujitsu.com>
Suggested-by: Mike Rapoport <rppt@linux.ibm.com>
Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
mbind() is required to report EFAULT if range, specified by addr and
len, contains unmapped holes. In current implementation, below rules
are applied for this checking:
1: Unmapped holes at any part of the specified range should be reported
as EFAULT if mbind() for none MPOL_DEFAULT cases;
2: Unmapped holes at any part of the specified range should be ignored
(do not reprot EFAULT) if mbind() for MPOL_DEFAULT case;
3: The whole range in an unmapped hole should be reported as EFAULT;
Note that rule 2 does not fullfill the mbind() API definition, but since
that behavior has existed for long days (the internal flag
MPOL_MF_DISCONTIG_OK is for this purpose), this patch does not plan to
change it.
In current code, application observed inconsistent behavior on rule 1
and rule 2 respectively. That inconsistency is fixed as below details.
Cases of rule 1:
- Hole at head side of range. Current code reprot EFAULT, no change by
this patch.
[ vma ][ hole ][ vma ]
[ range ]
- Hole at middle of range. Current code report EFAULT, no change by
this patch.
[ vma ][ hole ][ vma ]
[ range ]
- Hole at tail side of range. Current code do not report EFAULT, this
patch fixes it.
[ vma ][ hole ][ vma ]
[ range ]
Cases of rule 2:
- Hole at head side of range. Current code reports EFAULT, this patch
fixes it.
[ vma ][ hole ][ vma ]
[ range ]
- Hole at middle of range. Current code does not report EFAULT, no
change by this patch.
[ vma ][ hole ][ vma]
[ range ]
- Hole at tail side of range. Current code does not report EFAULT, no
change by this patch.
[ vma ][ hole ][ vma]
[ range ]
This patch has no changes to rule 3.
The unmapped hole checking can also be handled by using .pte_hole(),
instead of .test_walk(). But .pte_hole() is called for holes inside and
outside vma, which causes more cost, so this patch keeps the original
design with .test_walk().
Link: http://lkml.kernel.org/r/1573218104-11021-3-git-send-email-lixinhai.lxh@gmail.com
Fixes: 6f4576e368 ("mempolicy: apply page table walker on queue_pages_range()")
Signed-off-by: Li Xinhai <lixinhai.lxh@gmail.com>
Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Hugh Dickins <hughd@google.com>
Cc: linux-man <linux-man@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "mm: Fix checking unmapped holes for mbind", v4.
This patchset fix checking unmapped holes for mbind().
First patch makes sure the vma been correctly tracked in .test_walk(),
so each time when .test_walk() is called, the neighborhood of two vma
is correct.
Current problem is that the !vma_migratable() check could cause return
immediately without update tracking to vma.
Second patch fix the inconsistent report of EFAULT when mbind() is
called for MPOL_DEFAULT and non MPOL_DEFAULT cases, so application do
not need to have workaround code to handle this special behavior.
Currently there are two problems, one is that the .test_walk() can not
know there is hole at tail side of range, because .test_walk() only
call for vma not for hole. The other one is that mbind_range() checks
for hole at head side of range but do not consider the
MPOL_MF_DISCONTIG_OK flag as done in .test_walk().
This patch (of 2):
Checking unmapped hole and updating the previous vma must be handled
first, otherwise the unmapped hole could be calculated from a wrong
previous vma.
Several commits were relevant to this error:
- commit 6f4576e368 ("mempolicy: apply page table walker on
queue_pages_range()")
This commit was correct, the VM_PFNMAP check was after updating
previous vma
- commit 48684a65b4 ("mm: pagewalk: fix misbehavior of
walk_page_range for vma(VM_PFNMAP)")
This commit added VM_PFNMAP check before updating previous vma. Then,
there were two VM_PFNMAP check did same thing twice.
- commit acda0c3340 ("mm/mempolicy.c: get rid of duplicated check for
vma(VM_PFNMAP) in queue_page s_range()")
This commit tried to fix the duplicated VM_PFNMAP check, but it
wrongly removed the one which was after updating vma.
Link: http://lkml.kernel.org/r/1573218104-11021-2-git-send-email-lixinhai.lxh@gmail.com
Fixes: acda0c3340 (mm/mempolicy.c: get rid of duplicated check for vma(VM_PFNMAP) in queue_pages_range())
Signed-off-by: Li Xinhai <lixinhai.lxh@gmail.com>
Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Hugh Dickins <hughd@google.com>
Cc: linux-man <linux-man@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
For each page scheduled for compaction (e. g. by z3fold_free()), try to
apply inter-page compaction before running the traditional/ existing
intra-page compaction. That means, if the page has only one buddy, we
treat that buddy as a new object that we aim to place into an existing
z3fold page. If such a page is found, that object is transferred and the
old page is freed completely. The transferred object is named "foreign"
and treated slightly differently thereafter.
Namely, we increase "foreign handle" counter for the new page. Pages with
non-zero "foreign handle" count become unmovable. This patch implements
"foreign handle" detection when a handle is freed to decrement the foreign
handle counter accordingly, so a page may as well become movable again as
the time goes by.
As a result, we almost always have exactly 3 objects per page and
significantly better average compression ratio.
[cai@lca.pw: fix -Wunused-but-set-variable warnings]
Link: http://lkml.kernel.org/r/1570542062-29144-1-git-send-email-cai@lca.pw
[vitalywool@gmail.com: avoid subtle race when freeing slots]
Link: http://lkml.kernel.org/r/20191127152118.6314b99074b0626d4c5a8835@gmail.com
[vitalywool@gmail.com: compact objects more accurately]
Link: http://lkml.kernel.org/r/20191127152216.6ad33745a21ba71c53606acb@gmail.com
[vitalywool@gmail.com: protect handle reads]
Link: http://lkml.kernel.org/r/20191127152345.8059852f60947686674d726d@gmail.com
Link: http://lkml.kernel.org/r/20191006041457.24113-1-vitalywool@gmail.com
Signed-off-by: Vitaly Wool <vitaly.vul@sony.com>
Cc: Dan Streetman <ddstreet@ieee.org>
Cc: Henry Burns <henrywolfeburns@gmail.com>
Cc: Shakeel Butt <shakeelb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>