Commit Graph

857110 Commits

Author SHA1 Message Date
Dave Chinner
9df243a1a9 xfs: consolidate attribute value copying
The same code is used to copy do the attribute copying in three
different places. Consolidate them into a single function in
preparation from on-demand buffer allocation.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-08-30 22:43:57 -07:00
Dave Chinner
e3cc4554ce xfs: move remote attr retrieval into xfs_attr3_leaf_getvalue
Because we repeat exactly the same code to get the remote attribute
value after both calls to xfs_attr3_leaf_getvalue() if it's a remote
attr. Just do it in xfs_attr3_leaf_getvalue() so the callers don't
have to care about it.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-08-30 22:43:57 -07:00
Dave Chinner
a0e959d3c9 xfs: remove unnecessary indenting from xfs_attr3_leaf_getvalue
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-08-30 22:43:57 -07:00
Dave Chinner
728bcaa3e0 xfs: make attr lookup returns consistent
Shortform, leaf and remote value attr value retrieval return
different values for success. This makes it more complex to handle
actual errors xfs_attr_get() as some errors mean success and some
mean failure. Make the return values consistent for success and
failure consistent for all attribute formats.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-08-30 22:43:57 -07:00
Dave Chinner
756c6f0f7e xfs: reverse search directory freespace indexes
When a directory is growing rapidly, new blocks tend to get added at
the end of the directory. These end up at the end of the freespace
index, and when the directory gets large finding these new
freespaces gets expensive. The code does a linear search across the
frespace index from the first block in the directory to the last,
hence meaning the newly added space is the last index searched.

Instead, do a reverse order index search, starting from the last
block and index in the freespace index. This makes most lookups for
free space on rapidly growing directories O(1) instead of O(N), but
should not have any impact on random insert workloads because the
average search length is the same regardless of which end of the
array we start at.

The result is a major improvement in large directory grow rates:

		create time(sec) / rate (files/s)
 File count     vanilla             Prev commit		Patched
  10k	      0.41 / 24.3k	   0.42 / 23.8k       0.41 / 24.3k
  20k	      0.74 / 27.0k	   0.76 / 26.3k       0.75 / 26.7k
 100k	      3.81 / 26.4k	   3.47 / 28.8k       3.27 / 30.6k
 200k	      8.58 / 23.3k	   7.19 / 27.8k       6.71 / 29.8k
   1M	     85.69 / 11.7k	  48.53 / 20.6k      37.67 / 26.5k
   2M	    280.31 /  7.1k	 130.14 / 15.3k      79.55 / 25.2k
  10M	   3913.26 /  2.5k                          552.89 / 18.1k

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-08-30 22:43:57 -07:00
Dave Chinner
610125ab1e xfs: speed up directory bestfree block scanning
When running a "create millions inodes in a directory" test
recently, I noticed we were spending a huge amount of time
converting freespace block headers from disk format to in-memory
format:

 31.47%  [kernel]  [k] xfs_dir2_node_addname
 17.86%  [kernel]  [k] xfs_dir3_free_hdr_from_disk
  3.55%  [kernel]  [k] xfs_dir3_free_bests_p

We shouldn't be hitting the best free block scanning code so hard
when doing sequential directory creates, and it turns out there's
a highly suboptimal loop searching the the best free array in
the freespace block - it decodes the block header before checking
each entry inside a loop, instead of decoding the header once before
running the entry search loop.

This makes a massive difference to create rates. Profile now looks
like this:

  13.15%  [kernel]  [k] xfs_dir2_node_addname
   3.52%  [kernel]  [k] xfs_dir3_leaf_check_int
   3.11%  [kernel]  [k] xfs_log_commit_cil

And the wall time/average file create rate differences are
just as stark:

		create time(sec) / rate (files/s)
File count	     vanilla		    patched
  10k		   0.41 / 24.3k		   0.42 / 23.8k
  20k		   0.74	/ 27.0k		   0.76 / 26.3k
 100k		   3.81	/ 26.4k		   3.47 / 28.8k
 200k		   8.58	/ 23.3k		   7.19 / 27.8k
   1M		  85.69	/ 11.7k		  48.53 / 20.6k
   2M		 280.31	/  7.1k		 130.14 / 15.3k

The larger the directory, the bigger the performance improvement.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-08-30 22:43:57 -07:00
Dave Chinner
0e822255f9 xfs: factor free block index lookup from xfs_dir2_node_addname_int()
Simplify the logic in xfs_dir2_node_addname_int() by factoring out
the free block index lookup code that finds a block with enough free
space for the entry to be added. The code that is moved gets a major
cleanup at the same time, but there is no algorithm change here.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-08-30 22:43:57 -07:00
Dave Chinner
a07258a695 xfs: factor data block addition from xfs_dir2_node_addname_int()
Factor out the code that adds a data block to a directory from
xfs_dir2_node_addname_int(). This makes the code flow cleaner and
more obvious and provides clear isolation of upcoming optimsations.

Signed-off-By: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-08-30 22:43:57 -07:00
Dave Chinner
aee7754bbe xfs: move xfs_dir2_addname()
This gets rid of the need for a forward  declaration of the static
function xfs_dir2_addname_int() and readies the code for factoring
of xfs_dir2_addname_int().

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-08-30 22:43:56 -07:00
Darrick J. Wong
39ee2239a5 xfs: remove all *_ITER_CONTINUE values
Iterator functions already use 0 to signal "continue iterating", so get
rid of the #defines and just do it directly.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2019-08-30 22:43:56 -07:00
Darrick J. Wong
e7ee96dfb8 xfs: remove all *_ITER_ABORT values
Use -ECANCELED to signal "stop iterating" instead of these magical
*_ITER_ABORT values, since it's duplicative.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2019-08-29 21:22:41 -07:00
Eric Sandeen
7f313eda8f xfs: log proper length of btree block in scrub/repair
xfs_trans_log_buf() takes a final argument of the last byte to
log in the buffer; b_length is in basic blocks, so this isn't
the correct last byte.  Fix it.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-08-28 08:31:02 -07:00
Darrick J. Wong
ffb5696f75 xfs: reinitialize rm_flags when unpacking an offset into an rmap irec
In xfs_rmap_irec_offset_unpack, we should always clear the contents of
rm_flags before we begin unpacking the encoded (ondisk) offset into the
incore rm_offset and incore rm_flags fields.  Remove the open-coded
field zeroing as this encourages api misuse.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2019-08-28 08:31:02 -07:00
Darrick J. Wong
3e08f42ae7 xfs: remove unnecessary int returns from deferred bmap functions
Remove the return value from the functions that schedule deferred bmap
operations since they never fail and do not return status.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2019-08-28 08:31:02 -07:00
Darrick J. Wong
74b4c5d4a9 xfs: remove unnecessary int returns from deferred refcount functions
Remove the return value from the functions that schedule deferred
refcount operations since they never fail and do not return status.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2019-08-28 08:31:02 -07:00
Darrick J. Wong
bc46ac6471 xfs: remove unnecessary int returns from deferred rmap functions
Remove the return value from the functions that schedule deferred rmap
operations since they never fail and do not return status.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2019-08-28 08:31:01 -07:00
Darrick J. Wong
2ca09177ab xfs: remove unnecessary parameter from xfs_iext_inc_seq
This function doesn't use the @state parameter, so get rid of it.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2019-08-28 08:31:01 -07:00
Darrick J. Wong
b521c89027 xfs: fix sign handling problem in xfs_bmbt_diff_two_keys
In xfs_bmbt_diff_two_keys, we perform a signed int64_t subtraction with
two unsigned 64-bit quantities.  If the second quantity is actually the
"maximum" key (all ones) as used in _query_all, the subtraction
effectively becomes addition of two positive numbers and the function
returns incorrect results.  Fix this with explicit comparisons of the
unsigned values.  Nobody needs this now, but the online repair patches
will need this to work properly.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2019-08-28 08:31:01 -07:00
Darrick J. Wong
7380e8fec1 xfs: don't return _QUERY_ABORT from xfs_rmap_has_other_keys
The xfs_rmap_has_other_keys helper aborts the iteration as soon as it
has an answer.  Don't let this abort leak out to callers.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2019-08-28 08:31:01 -07:00
Darrick J. Wong
c94613feef xfs: fix maxicount division by zero error
In xfs_ialloc_setup_geometry, it's possible for a malicious/corrupt fs
image to set an unreasonably large value for sb_inopblog which will
cause ialloc_blks to be zero.  If sb_imax_pct is also set, this results
in a division by zero error in the second do_div call.  Therefore, force
maxicount to zero if ialloc_blks is zero.

Note that the kernel metadata verifiers will catch the garbage inopblog
value and abort the fs mount long before it tries to set up the inode
geometry; this is needed to avoid a crash in xfs_db while setting up the
xfs_mount structure.

Found by fuzzing sb_inopblog to 122 in xfs/350.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
2019-08-28 08:31:01 -07:00
Darrick J. Wong
519e5869d5 xfs: bmap scrub should only scrub records once
The inode block mapping scrub function does more work for btree format
extent maps than is absolutely necessary -- first it will walk the bmbt
and check all the entries, and then it will load the incore tree and
check every entry in that tree, possibly for a second time.

Simplify the code and decrease check runtime by separating the two
responsibilities.  The bmbt walk will make sure the incore extent
mappings are loaded, check the shape of the bmap btree (via xchk_btree)
and check that every bmbt record has a corresponding incore extent map;
and the incore extent map walk takes all the responsibility for checking
the mapping records and cross referencing them with other AG metadata.

This enables us to clean up some messy parameter handling and reduce
redundant code.  Rename a few functions to make the split of
responsibilities clearer.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2019-08-26 17:43:15 -07:00
zhengbin
71912e08e0 xfs: remove excess function parameter description in 'xfs_btree_sblock_v5hdr_verify'
Fixes gcc warning:

fs/xfs/libxfs/xfs_btree.c:4475: warning: Excess function parameter 'max_recs' description in 'xfs_btree_sblock_v5hdr_verify'
fs/xfs/libxfs/xfs_btree.c:4475: warning: Excess function parameter 'pag_max_level' description in 'xfs_btree_sblock_v5hdr_verify'

Fixes: c5ab131ba0 ("libxfs: refactor short btree block verification")
Signed-off-by: zhengbin <zhengbin13@huawei.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-08-26 17:43:15 -07:00
Dave Chinner
f8f9ee4794 xfs: add kmem_alloc_io()
Memory we use to submit for IO needs strict alignment to the
underlying driver contraints. Worst case, this is 512 bytes. Given
that all allocations for IO are always a power of 2 multiple of 512
bytes, the kernel heap provides natural alignment for objects of
these sizes and that suffices.

Until, of course, memory debugging of some kind is turned on (e.g.
red zones, poisoning, KASAN) and then the alignment of the heap
objects is thrown out the window. Then we get weird IO errors and
data corruption problems because drivers don't validate alignment
and do the wrong thing when passed unaligned memory buffers in bios.

TO fix this, introduce kmem_alloc_io(), which will guaranteeat least
512 byte alignment of buffers for IO, even if memory debugging
options are turned on. It is assumed that the minimum allocation
size will be 512 bytes, and that sizes will be power of 2 mulitples
of 512 bytes.

Use this everywhere we allocate buffers for IO.

This no longer fails with log recovery errors when KASAN is enabled
due to the brd driver not handling unaligned memory buffers:

# mkfs.xfs -f /dev/ram0 ; mount /dev/ram0 /mnt/test

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-08-26 17:43:15 -07:00
Dave Chinner
d916275aa4 xfs: get allocation alignment from the buftarg
Needed to feed into the allocation routine to guarantee the memory
buffers we add to bios are correctly aligned to the underlying
device.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-08-26 17:43:14 -07:00
Dave Chinner
0ad95687c3 xfs: add kmem allocation trace points
When trying to correlate XFS kernel allocations to memory reclaim
behaviour, it is useful to know what allocations XFS is actually
attempting. This information is not directly available from
tracepoints in the generic memory allocation and reclaim
tracepoints, so these new trace points provide a high level
indication of what the XFS memory demand actually is.

There is no per-filesystem context in this code, so we just trace
the type of allocation, the size and the allocation constraints.
The kmem code also doesn't include much of the common XFS headers,
so there are a few definitions that need to be added to the trace
headers and a couple of types that need to be made common to avoid
needing to include the whole world in the kmem code.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-08-26 17:43:14 -07:00
Tetsuo Handa
707e0ddaf6 fs: xfs: Remove KM_NOSLEEP and KM_SLEEP.
Since no caller is using KM_NOSLEEP and no callee branches on KM_SLEEP,
we can remove KM_NOSLEEP and replace KM_SLEEP with 0.

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-08-26 12:06:22 -07:00
Linus Torvalds
a55aa89aab Linux 5.3-rc6 2019-08-25 12:01:23 -07:00
Linus Torvalds
c749088f25 A minor auxdisplay improvement:
- ht16k33: Make ht16k33_fb_fix and ht16k33_fb_var constant (Nishka Dasgupta)
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEPjU5OPd5QIZ9jqqOGXyLc2htIW0FAl1ixpEACgkQGXyLc2ht
 IW3Alg//f7+tOooILnDsxByF6T3bD5ObZFuMAW01jnHER8q93sBAuddY28OjiSrI
 MZrwZLbz43Ek9zF+Q2A8RIVYD79vFUZbD33ZbQHJ1CJmD/urapVE13rmQMo+EsiB
 PsCgIKjRByj/WfUexRdJTZ7gbKb+l6l/gvLO9tqLbb0rD/CMEny7rLEzmC5uLwYE
 koM6A74AhXBEQMYR2Vn7HpLF9U3vzo7O0QuDLUlvaSv5TJgpdpZuLDJHXbBOcnRU
 qrD7ruPOXxwo6b218TaIeCP6IDIEOdHz/4XxcZ0rFjiTxF0nLx4OjDHlCYfsxlEw
 6kujamc8kJmdUwHk3xQM2kxUlR/mMSmvpW5bRdUEBk2+Cqe4S5c2OFSxYoHMBiI/
 SpmUJbkLgzQSo33k0rNKiZL49arlrsNN94EV9+QHSHbmTq/HlPWuPleUUfA0Ep46
 mN7wbQkE1FAniwoOu3Tx4T1Kw+L2gTqAmqxNCFf6HoihnkFjf/RAYEGPLBP9mKAN
 o2W9icMSREeM9pKy4NYr0Fcq7eD1vcYGkSY1gpFfNDDEt7TTH7M3L85ty0ky+JvU
 jHRayXNRg/SGtx3CBhDw3iiq4Dj5t2YJ0NTNF2XyzHTkass4dGfE8duHoDDnaFP5
 GphEAjf3kV+f+j9f7Kj8Y5cCeCMxctWp3bv6eZGK+LMiPUzPqC0=
 =qxac
 -----END PGP SIGNATURE-----

Merge tag 'auxdisplay-for-linus-v5.3-rc7' of git://github.com/ojeda/linux

Pull auxdisplay cleanup from Miguel Ojeda:
 "Make ht16k33_fb_fix and ht16k33_fb_var constant (Nishka Dasgupta)"

* tag 'auxdisplay-for-linus-v5.3-rc7' of git://github.com/ojeda/linux:
  auxdisplay: ht16k33: Make ht16k33_fb_fix and ht16k33_fb_var constant
2019-08-25 11:43:17 -07:00
Linus Torvalds
32ae83ffec This pull request contains a single bug fix for UML:
- Fix time travel mode
 -----BEGIN PGP SIGNATURE-----
 
 iQJKBAABCAA0FiEEdgfidid8lnn52cLTZvlZhesYu8EFAl1ikVwWHHJpY2hhcmRA
 c2lnbWEtc3Rhci5hdAAKCRBm+VmF6xi7waNYEADaqyJu+2UAp2hZGkwric9dqh4l
 IibXY0bPKokDIAt/gGmh5CX8cqBWKjWJSny91mqrINm1SBv4iTm0GLrSq7ZmQmYH
 1JRZSk3QtxRfVMVKizp2L/K22lPSMIViYoAsTGYTbRAmNyjBGJNSZrgCs3BBi/1F
 mxINtpyg2MyWOg9aNIzil6ZfwcPEazt9US6XM/2Tcs3z9wDO5bfRIgD3ILoWcT7D
 RPwLbtMi242Uak+Eyi44QCfwB5UjC1UvDdKjgr3paHiTVm7LS0dCEnBhaDhtGeb8
 bqEnSVH9oHA0XQhUAYdFNMQN0n1+bEDbqnbz9JLg4iJt6jXpvY8oL9xi7k/FglSu
 zXlhRRE4G7AYpBoCvQp/Anh85aCAcsZ9nP4aSN8GXLi7IqyaZ7KRTBHrAFxYi/WP
 dXVaqR984w5bEBDLRUsGosKHlHXHMnAwPDthQhuRrCqqmE/YyzpOaCsG46Wzpriy
 Jg302QmlTOMfx0uUoCVsiEq6rwar6LGTP7raihaR8j9g0EzFr7f4FpzmWxQpvJqG
 YpE3jVwp3OOKJjOETIW6ko2lzai3GOP9rPqoPfOhtqeALHLtORlg7XAhBj7n3Tji
 rLHKmVIxiiAmkfQItMdRjJbu9gFAiW+ZR7nEnDnhMjer1iPkJX+DtCLEZFpui7Me
 WrrQx4ypeO4RFemQCQ==
 =bDrL
 -----END PGP SIGNATURE-----

Merge tag 'for-linus-5.3-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/uml

Pull UML fix from Richard Weinberger:
 "Fix time travel mode"

* tag 'for-linus-5.3-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/uml:
  um: fix time travel mode
2019-08-25 11:40:24 -07:00
Linus Torvalds
94a76d9b52 This pull request contains the following fixes for UBIFS and JFFS2:
UBIFS:
 
 - Don't block too long in writeback_inodes_sb()
 - Fix for a possible overrun of the log head
 - Fix double unlock in orphan_delete()
 
 JFFS2:
 
 - Remove C++ style from UAPI header and unbreak picky toolchains
 -----BEGIN PGP SIGNATURE-----
 
 iQJKBAABCAA0FiEEdgfidid8lnn52cLTZvlZhesYu8EFAl1ik14WHHJpY2hhcmRA
 c2lnbWEtc3Rhci5hdAAKCRBm+VmF6xi7wbP2D/4xVW7YP5Yyt6YrABJuclfoib30
 2LI6eOz0+5OojQKUbOzXCN9N7Dv4TLJKrCjRc9qKYTIB1DiQXuBDqtYKg6CTBhHb
 MjiftEDiBQ6j3jVmRxkQRXZEB9I3Uu9CkA8s65+UmL8peJfgNElpH34omsU1fzup
 y0NhZhj77P5jsAG6r7yXvuaofCOTlZIZVPya9FX17J0Ra+3rMOCtVEqnaHk2E5RB
 EQPAEByqXUIx7+9mOi1Krw7B7fesB7oOVbCykE5knX1pZQCTURP64yNr35WxN+7Z
 crcpdEQtf54qWMCKf4ClIBHiPmmsDIHYJy3JXjgJKOwIYvrB3dZ5E170qPr3JixY
 nS+l8x69IYZhWUzHg8gxDizk92iFYKbO1h5vBwI7NUFHkHLzylsgonBK0KdaUnol
 OvI5oCO/rdJEMBPr5LEFpOjZJIEptPtXpDvQCpm5tWd5tuW+8edNpI38lDO9LThC
 O0diZZUQfsuzD1XrvKRORPU+4lskzGV5b1UA0DWXdGKALqM5VrQZo1XftvA74Zkv
 oZQcHNK5wdecQX81Oadfb/0a5SN7FGGtTUCKTpOyBIu0adarGIasC6TQr2aDiiNh
 7jLjBoV2XEGhXZQrK2lm8G+6rJ7Mp11B6aoTFgDELzt+SB7htp6dARR2+4aGWXh9
 iXgme0n9HXDDeuosag==
 =Bsgx
 -----END PGP SIGNATURE-----

Merge tag 'for-linus-5.3-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/ubifs

Pull UBIFS and JFFS2 fixes from Richard Weinberger:
 "UBIFS:
   - Don't block too long in writeback_inodes_sb()
   - Fix for a possible overrun of the log head
   - Fix double unlock in orphan_delete()

  JFFS2:
   - Remove C++ style from UAPI header and unbreak picky toolchains"

* tag 'for-linus-5.3-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/ubifs:
  ubifs: Limit the number of pages in shrink_liability
  ubifs: Correctly initialize c->min_log_bytes
  ubifs: Fix double unlock around orphan_delete()
  jffs2: Remove C++ style comments from uapi header
2019-08-25 11:29:27 -07:00
Linus Torvalds
146c3d3220 Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Thomas Gleixner:
 "A few fixes for x86:

   - Fix a boot regression caused by the recent bootparam sanitizing
     change, which escaped the attention of all people who reviewed that
     code.

   - Address a boot problem on machines with broken E820 tables caused
     by an underflow which ended up placing the trampoline start at
     physical address 0.

   - Handle machines which do not advertise a legacy timer of any form,
     but need calibration of the local APIC timer gracefully by making
     the calibration routine independent from the tick interrupt. Marked
     for stable as well as there seems to be quite some new laptops
     rolled out which expose this.

   - Clear the RDRAND CPUID bit on AMD family 15h and 16h CPUs which are
     affected by broken firmware which does not initialize RDRAND
     correctly after resume. Add a command line parameter to override
     this for machine which either do not use suspend/resume or have a
     fixed BIOS. Unfortunately there is no way to detect this on boot,
     so the only safe decision is to turn it off by default.

   - Prevent RFLAGS from being clobbers in CALL_NOSPEC on 32bit which
     caused fast KVM instruction emulation to break.

   - Explain the Intel CPU model naming convention so that the repeating
     discussions come to an end"

* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/retpoline: Don't clobber RFLAGS during CALL_NOSPEC on i386
  x86/boot: Fix boot regression caused by bootparam sanitizing
  x86/CPU/AMD: Clear RDRAND CPUID bit on AMD family 15h/16h
  x86/boot/compressed/64: Fix boot on machines with broken E820 table
  x86/apic: Handle missing global clockevent gracefully
  x86/cpu: Explain Intel model naming convention
2019-08-25 10:10:15 -07:00
Linus Torvalds
5a13fc3d8b Merge branch 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timekeeping fix from Thomas Gleixner:
 "A single fix for a regression caused by the generic VDSO
  implementation where a math overflow causes CLOCK_BOOTTIME to become a
  random number generator"

* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  timekeeping/vsyscall: Prevent math overflow in BOOTTIME update
2019-08-25 10:08:01 -07:00
Linus Torvalds
8a04c2ee62 Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fix from Thomas Gleixner:
 "Handle the worker management in situations where a task is scheduled
  out on a PI lock contention correctly and schedule a new worker if
  possible"

* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/core: Schedule new worker even if PI-blocked
2019-08-25 10:06:12 -07:00
Linus Torvalds
05bbb9360a Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fixes from Thomas Gleixner:
 "Two small fixes for kprobes and perf:

   - Prevent a deadlock in kprobe_optimizer() causes by reverse lock
     ordering

   - Fix a comment typo"

* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  kprobes: Fix potential deadlock in kprobe_optimizer()
  perf/x86: Fix typo in comment
2019-08-25 10:03:32 -07:00
Linus Torvalds
44c471e436 Merge branch 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irq fix from Thomas Gleixner:
 "A single fix for a imbalanced kobject operation in the irq decriptor
  code which was unearthed by the new warnings in the kobject code"

* 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  genirq: Properly pair kobject_del() with kobject_add()
2019-08-25 10:00:21 -07:00
Linus Torvalds
f47edb59bb Merge branch 'akpm' (patches from Andrew)
Mergr misc fixes from Andrew Morton:
 "11 fixes"

Mostly VM fixes, one psi polling fix, and one parisc build fix.

* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
  mm/kasan: fix false positive invalid-free reports with CONFIG_KASAN_SW_TAGS=y
  mm/zsmalloc.c: fix race condition in zs_destroy_pool
  mm/zsmalloc.c: migration can leave pages in ZS_EMPTY indefinitely
  mm, page_owner: handle THP splits correctly
  userfaultfd_release: always remove uffd flags and clear vm_userfaultfd_ctx
  psi: get poll_work to run when calling poll syscall next time
  mm: memcontrol: flush percpu vmevents before releasing memcg
  mm: memcontrol: flush percpu vmstats before releasing memcg
  parisc: fix compilation errrors
  mm, page_alloc: move_freepages should not examine struct page of reserved memory
  mm/z3fold.c: fix race between migration and destruction
2019-08-25 09:56:27 -07:00
Linus Torvalds
e67095fd2f dma-mapping fixes for 5.3-rc
Two fixes for regressions in this merge window:
 
  - select the Kconfig symbols for the noncoherent dma arch helpers
    on arm if swiotlb is selected, not just for LPAE to not break then
    Xen build, that uses swiotlb indirectly through swiotlb-xen
  - fix the page allocator fallback in dma_alloc_contiguous if the CMA
    allocation fails
 -----BEGIN PGP SIGNATURE-----
 
 iQI/BAABCgApFiEEgdbnc3r/njty3Iq9D55TZVIEUYMFAl1hvn4LHGhjaEBsc3Qu
 ZGUACgkQD55TZVIEUYON4w//Recfoy5T2Q4Gfjp1xVKGbr2sP7J93Vs7VCyQNZmX
 PrtzhmNKs4gxCEXVgHm+GVA+IJwQFqDtSFaPb8q3GQ+qM9NUDF4ScMFpfrLZsFr1
 dorm5kC1xcwrQtWjS1CQS/Gj0VBtWiMQOoUcAESMqgBIUo4ssj3Ny+vnh8hWgAOs
 oVDgOM4wt35bW0Pv/iY44uQzOq7xcYJUUYtPIiP9vMDrhPsxe6D1DgFQ4HZKJWix
 uS3BjZnsZDnLltXM/0CKdRV9wLF+jHYP/wJTztksRlr/A5V3FJ8lJIvgphxG1v3J
 tDfQs4BNuGWBjqdg+Qo6qOPEL9krvVYYVVql93DXwtPK/cJW1Z+0glgC2rbbHmIy
 ew35DFnYm9v0sFLZnbpuoHd6sQ9G59nTZstkqt/Z/hldBvKotwBpeuILAcMC9Nlw
 3iYW6Sz5L7cmkifC8OvopKKJWVoW5rVtMrVQw5niBiZVERtWbY825r/7ju2xYhZC
 iSAaUHT5wNtXsXQOTrFQ5LzTDBtgGyXRXgvNagEHhBf120jBQfOhvOCVT2HHOxdy
 5vx7xeeRS0M2HpxIsmd3XQjIUQEY9x1to4FKiYczGM1kcKeyWWBMFOXfLxe2Rmhg
 h14lbfsAxIEWdFkJAVFhjyjzC6IzxyVGtHCxw1iw0VgGzYATO/K6Oo8T2hG3HagR
 abQ=
 =DXk9
 -----END PGP SIGNATURE-----

Merge tag 'dma-mapping-5.3-5' of git://git.infradead.org/users/hch/dma-mapping

Pull dma-mapping fixes from Christoph Hellwig:
 "Two fixes for regressions in this merge window:

   - select the Kconfig symbols for the noncoherent dma arch helpers on
     arm if swiotlb is selected, not just for LPAE to not break then Xen
     build, that uses swiotlb indirectly through swiotlb-xen

   - fix the page allocator fallback in dma_alloc_contiguous if the CMA
     allocation fails"

* tag 'dma-mapping-5.3-5' of git://git.infradead.org/users/hch/dma-mapping:
  dma-direct: fix zone selection after an unaddressable CMA allocation
  arm: select the dma-noncoherent symbols for all swiotlb builds
2019-08-24 20:00:11 -07:00
Andrey Ryabinin
00fb24a42a mm/kasan: fix false positive invalid-free reports with CONFIG_KASAN_SW_TAGS=y
The code like this:

	ptr = kmalloc(size, GFP_KERNEL);
	page = virt_to_page(ptr);
	offset = offset_in_page(ptr);
	kfree(page_address(page) + offset);

may produce false-positive invalid-free reports on the kernel with
CONFIG_KASAN_SW_TAGS=y.

In the example above we lose the original tag assigned to 'ptr', so
kfree() gets the pointer with 0xFF tag.  In kfree() we check that 0xFF
tag is different from the tag in shadow hence print false report.

Instead of just comparing tags, do the following:

1) Check that shadow doesn't contain KASAN_TAG_INVALID.  Otherwise it's
   double-free and it doesn't matter what tag the pointer have.

2) If pointer tag is different from 0xFF, make sure that tag in the
   shadow is the same as in the pointer.

Link: http://lkml.kernel.org/r/20190819172540.19581-1-aryabinin@virtuozzo.com
Fixes: 7f94ffbc4c ("kasan: add hooks implementation for tag-based mode")
Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Reported-by: Walter Wu <walter-zh.wu@mediatek.com>
Reported-by: Mark Rutland <mark.rutland@arm.com>
Reviewed-by: Andrey Konovalov <andreyknvl@google.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-08-24 19:48:42 -07:00
Henry Burns
701d678599 mm/zsmalloc.c: fix race condition in zs_destroy_pool
In zs_destroy_pool() we call flush_work(&pool->free_work).  However, we
have no guarantee that migration isn't happening in the background at
that time.

Since migration can't directly free pages, it relies on free_work being
scheduled to free the pages.  But there's nothing preventing an
in-progress migrate from queuing the work *after*
zs_unregister_migration() has called flush_work().  Which would mean
pages still pointing at the inode when we free it.

Since we know at destroy time all objects should be free, no new
migrations can come in (since zs_page_isolate() fails for fully-free
zspages).  This means it is sufficient to track a "# isolated zspages"
count by class, and have the destroy logic ensure all such pages have
drained before proceeding.  Keeping that state under the class spinlock
keeps the logic straightforward.

In this case a memory leak could lead to an eventual crash if compaction
hits the leaked page.  This crash would only occur if people are
changing their zswap backend at runtime (which eventually starts
destruction).

Link: http://lkml.kernel.org/r/20190809181751.219326-2-henryburns@google.com
Fixes: 48b4800a1c ("zsmalloc: page migration support")
Signed-off-by: Henry Burns <henryburns@google.com>
Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Henry Burns <henrywolfeburns@gmail.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Jonathan Adams <jwadams@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-08-24 19:48:42 -07:00
Henry Burns
1a87aa0359 mm/zsmalloc.c: migration can leave pages in ZS_EMPTY indefinitely
In zs_page_migrate() we call putback_zspage() after we have finished
migrating all pages in this zspage.  However, the return value is
ignored.  If a zs_free() races in between zs_page_isolate() and
zs_page_migrate(), freeing the last object in the zspage,
putback_zspage() will leave the page in ZS_EMPTY for potentially an
unbounded amount of time.

To fix this, we need to do the same thing as zs_page_putback() does:
schedule free_work to occur.

To avoid duplicated code, move the sequence to a new
putback_zspage_deferred() function which both zs_page_migrate() and
zs_page_putback() call.

Link: http://lkml.kernel.org/r/20190809181751.219326-1-henryburns@google.com
Fixes: 48b4800a1c ("zsmalloc: page migration support")
Signed-off-by: Henry Burns <henryburns@google.com>
Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Henry Burns <henrywolfeburns@gmail.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Jonathan Adams <jwadams@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-08-24 19:48:42 -07:00
Vlastimil Babka
f7da677bc6 mm, page_owner: handle THP splits correctly
THP splitting path is missing the split_page_owner() call that
split_page() has.

As a result, split THP pages are wrongly reported in the page_owner file
as order-9 pages.  Furthermore when the former head page is freed, the
remaining former tail pages are not listed in the page_owner file at
all.  This patch fixes that by adding the split_page_owner() call into
__split_huge_page().

Link: http://lkml.kernel.org/r/20190820131828.22684-2-vbabka@suse.cz
Fixes: a9627bc5e3 ("mm/page_owner: introduce split_page_owner and replace manual handling")
Reported-by: Kirill A. Shutemov <kirill@shutemov.name>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-08-24 19:48:42 -07:00
Oleg Nesterov
46d0b24c5e userfaultfd_release: always remove uffd flags and clear vm_userfaultfd_ctx
userfaultfd_release() should clear vm_flags/vm_userfaultfd_ctx even if
mm->core_state != NULL.

Otherwise a page fault can see userfaultfd_missing() == T and use an
already freed userfaultfd_ctx.

Link: http://lkml.kernel.org/r/20190820160237.GB4983@redhat.com
Fixes: 04f5866e41 ("coredump: fix race condition between mmget_not_zero()/get_task_mm() and core dumping")
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reported-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Tested-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Jann Horn <jannh@google.com>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-08-24 19:48:42 -07:00
Jason Xing
7b2b55da1d psi: get poll_work to run when calling poll syscall next time
Only when calling the poll syscall the first time can user receive
POLLPRI correctly.  After that, user always fails to acquire the event
signal.

Reproduce case:
 1. Get the monitor code in Documentation/accounting/psi.txt
 2. Run it, and wait for the event triggered.
 3. Kill and restart the process.

The question is why we can end up with poll_scheduled = 1 but the work
not running (which would reset it to 0).  And the answer is because the
scheduling side sees group->poll_kworker under RCU protection and then
schedules it, but here we cancel the work and destroy the worker.  The
cancel needs to pair with resetting the poll_scheduled flag.

Link: http://lkml.kernel.org/r/1566357985-97781-1-git-send-email-joseph.qi@linux.alibaba.com
Signed-off-by: Jason Xing <kerneljasonxing@linux.alibaba.com>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Caspar Zhang <caspar@linux.alibaba.com>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-08-24 19:48:42 -07:00
Roman Gushchin
bb65f89b7d mm: memcontrol: flush percpu vmevents before releasing memcg
Similar to vmstats, percpu caching of local vmevents leads to an
accumulation of errors on non-leaf levels.  This happens because some
leftovers may remain in percpu caches, so that they are never propagated
up by the cgroup tree and just disappear into nonexistence with on
releasing of the memory cgroup.

To fix this issue let's accumulate and propagate percpu vmevents values
before releasing the memory cgroup similar to what we're doing with
vmstats.

Since on cpu hotplug we do flush percpu vmstats anyway, we can iterate
only over online cpus.

Link: http://lkml.kernel.org/r/20190819202338.363363-4-guro@fb.com
Fixes: 42a3003535 ("mm: memcontrol: fix recursive statistics correctness & scalabilty")
Signed-off-by: Roman Gushchin <guro@fb.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-08-24 19:48:42 -07:00
Roman Gushchin
c350a99ea2 mm: memcontrol: flush percpu vmstats before releasing memcg
Percpu caching of local vmstats with the conditional propagation by the
cgroup tree leads to an accumulation of errors on non-leaf levels.

Let's imagine two nested memory cgroups A and A/B.  Say, a process
belonging to A/B allocates 100 pagecache pages on the CPU 0.  The percpu
cache will spill 3 times, so that 32*3=96 pages will be accounted to A/B
and A atomic vmstat counters, 4 pages will remain in the percpu cache.

Imagine A/B is nearby memory.max, so that every following allocation
triggers a direct reclaim on the local CPU.  Say, each such attempt will
free 16 pages on a new cpu.  That means every percpu cache will have -16
pages, except the first one, which will have 4 - 16 = -12.  A/B and A
atomic counters will not be touched at all.

Now a user removes A/B.  All percpu caches are freed and corresponding
vmstat numbers are forgotten.  A has 96 pages more than expected.

As memory cgroups are created and destroyed, errors do accumulate.  Even
1-2 pages differences can accumulate into large numbers.

To fix this issue let's accumulate and propagate percpu vmstat values
before releasing the memory cgroup.  At this point these numbers are
stable and cannot be changed.

Since on cpu hotplug we do flush percpu vmstats anyway, we can iterate
only over online cpus.

Link: http://lkml.kernel.org/r/20190819202338.363363-2-guro@fb.com
Fixes: 42a3003535 ("mm: memcontrol: fix recursive statistics correctness & scalabilty")
Signed-off-by: Roman Gushchin <guro@fb.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-08-24 19:48:42 -07:00
Qian Cai
bbcb03a97f parisc: fix compilation errrors
Commit 0cfaee2af3 ("include/asm-generic/5level-fixup.h: fix variable
'p4d' set but not used") converted a few functions from macros to static
inline, which causes parisc to complain,

  In file included from include/asm-generic/4level-fixup.h:38:0,
                   from arch/parisc/include/asm/pgtable.h:5,
                   from arch/parisc/include/asm/io.h:6,
                   from include/linux/io.h:13,
                   from sound/core/memory.c:9:
  include/asm-generic/5level-fixup.h:14:18: error: unknown type name 'pgd_t'; did you mean 'pid_t'?
   #define p4d_t    pgd_t
                    ^
  include/asm-generic/5level-fixup.h:24:28: note: in expansion of macro 'p4d_t'
   static inline int p4d_none(p4d_t p4d)
                              ^~~~~

It is because "4level-fixup.h" is included before "asm/page.h" where
"pgd_t" is defined.

Link: http://lkml.kernel.org/r/20190815205305.1382-1-cai@lca.pw
Fixes: 0cfaee2af3 ("include/asm-generic/5level-fixup.h: fix variable 'p4d' set but not used")
Signed-off-by: Qian Cai <cai@lca.pw>
Reported-by: Guenter Roeck <linux@roeck-us.net>
Tested-by: Guenter Roeck <linux@roeck-us.net>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-08-24 19:48:42 -07:00
David Rientjes
cd96103838 mm, page_alloc: move_freepages should not examine struct page of reserved memory
After commit 907ec5fca3 ("mm: zero remaining unavailable struct
pages"), struct page of reserved memory is zeroed.  This causes
page->flags to be 0 and fixes issues related to reading
/proc/kpageflags, for example, of reserved memory.

The VM_BUG_ON() in move_freepages_block(), however, assumes that
page_zone() is meaningful even for reserved memory.  That assumption is
no longer true after the aforementioned commit.

There's no reason why move_freepages_block() should be testing the
legitimacy of page_zone() for reserved memory; its scope is limited only
to pages on the zone's freelist.

Note that pfn_valid() can be true for reserved memory: there is a
backing struct page.  The check for page_to_nid(page) is also buggy but
reserved memory normally only appears on node 0 so the zeroing doesn't
affect this.

Move the debug checks to after verifying PageBuddy is true.  This
isolates the scope of the checks to only be for buddy pages which are on
the zone's freelist which move_freepages_block() is operating on.  In
this case, an incorrect node or zone is a bug worthy of being warned
about (and the examination of struct page is acceptable bcause this
memory is not reserved).

Why does move_freepages_block() gets called on reserved memory? It's
simply math after finding a valid free page from the per-zone free area
to use as fallback.  We find the beginning and end of the pageblock of
the valid page and that can bring us into memory that was reserved per
the e820.  pfn_valid() is still true (it's backed by a struct page), but
since it's zero'd we shouldn't make any inferences here about comparing
its node or zone.  The current node check just happens to succeed most
of the time by luck because reserved memory typically appears on node 0.

The fix here is to validate that we actually have buddy pages before
testing if there's any type of zone or node strangeness going on.

We noticed it almost immediately after bringing 907ec5fca3 in on
CONFIG_DEBUG_VM builds.  It depends on finding specific free pages in
the per-zone free area where the math in move_freepages() will bring the
start or end pfn into reserved memory and wanting to claim that entire
pageblock as a new migratetype.  So the path will be rare, require
CONFIG_DEBUG_VM, and require fallback to a different migratetype.

Some struct pages were already zeroed from reserve pages before
907ec5fca3c so it theoretically could trigger before this commit.  I
think it's rare enough under a config option that most people don't run
that others may not have noticed.  I wouldn't argue against a stable tag
and the backport should be easy enough, but probably wouldn't single out
a commit that this is fixing.

Mel said:

: The overhead of the debugging check is higher with this patch although
: it'll only affect debug builds and the path is not particularly hot.
: If this was a concern, I think it would be reasonable to simply remove
: the debugging check as the zone boundaries are checked in
: move_freepages_block and we never expect a zone/node to be smaller than
: a pageblock and stuck in the middle of another zone.

Link: http://lkml.kernel.org/r/alpine.DEB.2.21.1908122036560.10779@chino.kir.corp.google.com
Signed-off-by: David Rientjes <rientjes@google.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Pavel Tatashin <pavel.tatashin@microsoft.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-08-24 19:48:42 -07:00
Henry Burns
d776aaa989 mm/z3fold.c: fix race between migration and destruction
In z3fold_destroy_pool() we call destroy_workqueue(&pool->compact_wq).
However, we have no guarantee that migration isn't happening in the
background at that time.

Migration directly calls queue_work_on(pool->compact_wq), if destruction
wins that race we are using a destroyed workqueue.

Link: http://lkml.kernel.org/r/20190809213828.202833-1-henryburns@google.com
Signed-off-by: Henry Burns <henryburns@google.com>
Cc: Vitaly Wool <vitalywool@gmail.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Jonathan Adams <jwadams@google.com>
Cc: Henry Burns <henrywolfeburns@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-08-24 19:48:42 -07:00
Linus Torvalds
083f0f2cd4 GPIO fixes for the v5.3 kernel:
- Fix not reporting open drain/source lines to userspace as "input"
 - Fix a minor build error found in randconfigs
 - Fix a chip select quirk on the Freescale SPI
 - Fix the irqchip initialization semantic order to reflect what
   it was using the old API
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEElDRnuGcz/wPCXQWMQRCzN7AZXXMFAl1hoSwACgkQQRCzN7AZ
 XXP7Gg/9HeB5NKtBeMjLh0iRjKfHuIs0dFdJVxV8hYI/BTx9nJN38dW5CXkE5rBP
 0sqyO5/HqQJpArrc5KfgcXDay3rRvYgBvNwaB0ZFaOmGhqDUjRL3ZzKcP7SzKCVg
 imNUDuwJgyUsop3mAWe2MkNrc+YKr5GIDb8pyxOZDP+JDudajoUnZRfEDZYjUx2t
 X2LjfkqcUpFH5IjPLgND2A43oS4dSrgZ1XTW5R0Dp36a8ZpfJFUYtkXFGdiu9tIa
 J0Koc4npHgUFMieKedDTbFXob+y6YekXkD2uI0LMByqLwz7cZ+l+YLYZJJSTGuSM
 sr2etxthfatnmmObbmv+A9dixzcA7CgG8jycqYSyhu88QJyDqlmJBF/cDgTYQss4
 bJRVsPapLSBjsQ5+DNCMWPiYLNRvdsI7AFAmZxdpF/iOzBhc331EqgAYqU7nG+4t
 XnmvCig5SQwwiILEApHAq5lqgbj36yx900UL8OSXpLFX4SfJFFs0rSVDRGesyhvv
 WBaUtsJYDvH7mh24SdC8CX7C/V+8EEx9Dht+NT459DvhaCF7xu+xUixloSVpbb1p
 GkMGIvxV2Sy6I7TzKciAoKgWxI2sCigmqSKhN5oWK5cAWzgTbD/9hJsycKJ47sL6
 tBbI/Xc0GHC5lPfTolVfvjefFWIOblhixiEY2Qafw2HSm+3cjB0=
 =yuPv
 -----END PGP SIGNATURE-----

Merge tag 'gpio-v5.3-4' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-gpio

Pull GPIO fixes from Linus Walleij:
 "Here is a (hopefully last) set of GPIO fixes for the v5.3 kernel
  cycle. Two are pretty core:

   - Fix not reporting open drain/source lines to userspace as "input"

   - Fix a minor build error found in randconfigs

   - Fix a chip select quirk on the Freescale SPI

   - Fix the irqchip initialization semantic order to reflect what it
     was using the old API"

* tag 'gpio-v5.3-4' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-gpio:
  gpio: Fix irqchip initialization order
  gpio: of: fix Freescale SPI CS quirk handling
  gpio: Fix build error of function redefinition
  gpiolib: never report open-drain/source lines as 'input' to user-space
2019-08-24 14:45:33 -07:00
Linus Torvalds
361469211f - Fix for panics and network failures on PAE guests by Dexuan Cui.
- Fix of a memory leak (and related cleanups) in the hyper-v keyboard
 driver by Dexuan Cui.
 - Code cleanups for hyper-v clocksource driver during the merge window
 by Dexuan Cui.
 - Fix for a false positive warning in the userspace hyper-v KVP store by
 Vitaly Kuznetsov.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE4n5dijQDou9mhzu83qZv95d3LNwFAl1hT7EACgkQ3qZv95d3
 LNy4PQ/+J8FDpxqBlg7gh5iv0FMM6z/knU2z6RNuwsXs5xPCF/rVvKb1MlW0V4pO
 apGWAqCZbpslQfxN8rNKgOtJQ9FRSXVKzLapfeVI9loVvkoSZCGxrLU1fS0NBEL2
 Eg1qzj9qAHMA6UyClNhQipKuK83ZBJJCPsupD2IjLzb0UcPKXgFS3FTJP7UkcREh
 jV8djoyzmwQJDFBWsbp1IY1ZAhDhnqeEYJ8GZXziU58TfoT9zS0zrlO5/Pf+0mPT
 5QNmLpK/e/l1gDra84DEY9zQUUYepHbCd54XGTMaMGPnUWXCCgDQcEAC4n1/i3ZP
 lUP5AG994F0Z5csN3DKtTnX2Mz7DSXmx9ro4jnHDWVqhHKwrFqd9Ml+ukDkICf/S
 6dgVcV9pUF1zvO235Dujo2Ht+95eHg+Dbh9aW21khzwKGUd/FyxzirpuTGQZol7N
 CjnKnWkVyz8lGjrE6UQwa8oguqyIOjVPAXQv1xI1+jrVnXlZ+EbpMwnVBy39A1SU
 BpimckYZwznmKVpmTU6L9MrIfM/cU1KTXt6mMLQfArdMef1a3O6AWsFg/mawFQPq
 7C5ixC2BwW9yA0pjO1igmNDgnOqqIjqsOxl7BtQOxgnu5QPkLIY/FVXHNkhygsU3
 CsWyo3C7QQK21MmIAFU0mFlrdMLIKuldNf/wkUVBcOTSMbX72CM=
 =GJPQ
 -----END PGP SIGNATURE-----

Merge tag 'hyperv-fixes-signed' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux

Pull Hyper-V fixes from Sasha Levin:

 - Fix for panics and network failures on PAE guests by Dexuan Cui.

 - Fix of a memory leak (and related cleanups) in the hyper-v keyboard
   driver by Dexuan Cui.

 - Code cleanups for hyper-v clocksource driver during the merge window
   by Dexuan Cui.

 - Fix for a false positive warning in the userspace hyper-v KVP store
   by Vitaly Kuznetsov.

* tag 'hyperv-fixes-signed' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux:
  Drivers: hv: vmbus: Fix virt_to_hvpfn() for X86_PAE
  Tools: hv: kvp: eliminate 'may be used uninitialized' warning
  Input: hyperv-keyboard: Use in-place iterator API in the channel callback
  Drivers: hv: vmbus: Remove the unused "tsc_page" from struct hv_context
2019-08-24 11:42:06 -07:00