linux

mirror of https://github.com/torvalds/linux.git synced 2024-12-12 14:12:51 +00:00

Author	SHA1	Message	Date
Linus Torvalds	7da4221b53	Pull request for inclusion in 4.20 * Finish removing the custom 9p request cache mechanism * Embed part of the fcall in the request to have better slab performance (msize usually is power of two aligned) * syzkaller fixes: - add a refcount to 9p requests to avoid use after free - a few double free issues * A few coverity fixes * Some old patches that were in the bugzilla: - do not trust pdu content for size header - mount option for lock retry interval ---------------------------------------------------------------- Dan Carpenter (1): 9p: potential NULL dereference Dinu-Razvan Chis-Serban (1): 9p locks: add mount option for lock retry interval Dominique Martinet (12): 9p/xen: fix check for xenbus_read error in front_probe v9fs_dir_readdir: fix double-free on p9stat_read error 9p: clear dangling pointers in p9stat_free 9p: embed fcall in req to round down buffer allocs 9p: add a per-client fcall kmem_cache 9p/rdma: do not disconnect on down_interruptible EAGAIN 9p: acl: fix uninitialized iattr access 9p/rdma: remove useless check in cm_event_handler 9p: p9dirent_read: check network-provided name length 9p locks: fix glock.client_id leak in do_lock 9p/trans_fd: abort p9_read_work if req status changed 9p/trans_fd: put worker reqs on destroy Gertjan Halkes (1): 9p: do not trust pdu content for stat item size Gustavo A. R. Silva (1): 9p: fix spelling mistake in fall-through annotation Matthew Wilcox (2): 9p: Use a slab for allocating requests 9p: Remove p9_idpool Tomas Bortoli (3): 9p: rename p9_free_req() function 9p: Add refcount to p9_req_t 9p: Rename req to rreq in trans_fd fs/9p/acl.c \| 2 +- fs/9p/v9fs.c \| 21 +++++ fs/9p/v9fs.h \| 1 + fs/9p/vfs_dir.c \| 19 +--- fs/9p/vfs_file.c \| 24 +++++- include/net/9p/9p.h \| 12 +-- include/net/9p/client.h \| 71 ++++++--------- net/9p/Makefile \| 1 - net/9p/client.c \| 551 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++--------------------------------------------------------- net/9p/mod.c \| 9 +- net/9p/protocol.c \| 20 ++++- net/9p/trans_fd.c \| 64 +++++++++----- net/9p/trans_rdma.c \| 37 ++++---- net/9p/trans_virtio.c \| 44 +++++++--- net/9p/trans_xen.c \| 17 ++-- net/9p/util.c \| 140 ------------------------------ 16 files changed, 482 insertions(+), 551 deletions(-) delete mode 100644 net/9p/util.c -----BEGIN PGP SIGNATURE----- iQIzBAABCAAdFiEE/IPbcYBuWt0zoYhOq06b7GqY5nAFAlvWYE0ACgkQq06b7GqY 5nBBDxAAkz4rGRsnRpx9D1Bt+81eARTqsWcoPM138zEbwiJehe5aSF8x/CGyirDW +PoES/efho50sDfaUL9eBMgyy95UMN7LepMTcjNE1tywA+tRsH3dOL4757+lBwZc FGsHvfse4b5ZzmMxSo2L5KbzpNiUEXCJGEvNlBSbeWcSSUtCfOShMVx9IC6r7JhX pbsoczQ6W+384V4WgBxBZZf99EryxSeYJ9KMU+9b13ndzKICJqYeaEdItt28uYUC d05iXDEDQZD3B8R/1Y08ktMqmHLXP9kZdJ/w8HhCMRAxWKUD1pfDwJ45gIwhCb5R KxdxZ2tPvc8ihGngNJ4VaJoQQFTVzDgbyWXjwAFYkFUuPFgmuAgx3qApOhUFHCdM 8vgJd2u0attNwfOM3VsJApLQu09wmRw1LcmJ9re5OJ/YLiAUF3vw2oTpGhGlKRRI bMBI9rwXHzeoVI7ank5mh1NMVRpMbzL9A8uqux85gcpXL36V3XmZtKRDe1AAa5jB JR6dciAUABm9m+Ilt1Pl/faSjmysUOb1wZ0XU/cvVy5EBMUjbHv3x0TuDmrVwDas puUXEM8soVf9e/NUYqTOozwFIle6KFcyPmae/OopATLgkGES4cjDy9cLh5SgnERM vMcN/Z+DQO98zM/fPRy2B4qklflgTKI6VI/gSLRCWGpn49Iwhhg= =g1OF -----END PGP SIGNATURE----- Merge tag '9p-for-4.20' of git://github.com/martinetd/linux Pull 9p updates from Dominique Martinet: "Highlights this time around are the end of Matthew's work to remove the custom 9p request cache and use a slab directly for requests, with some extra patches on my end to not degrade performance, but it's a very good cleanup. Tomas and I fixed a few more syzkaller bugs (refcount is the big one), and I had a go at the coverity bugs and at some of the bugzilla reports we had open for a while. I'm a bit disappointed that I couldn't get much reviews for a few of my own patches, but the big ones got some and it's all been soaking in linux-next for quite a while so I think it should be OK. Summary: - Finish removing the custom 9p request cache mechanism - Embed part of the fcall in the request to have better slab performance (msize usually is power of two aligned) - syzkaller fixes: * add a refcount to 9p requests to avoid use after free * a few double free issues - A few coverity fixes - Some old patches that were in the bugzilla: * do not trust pdu content for size header * mount option for lock retry interval" * tag '9p-for-4.20' of git://github.com/martinetd/linux: (21 commits) 9p/trans_fd: put worker reqs on destroy 9p/trans_fd: abort p9_read_work if req status changed 9p: potential NULL dereference 9p locks: fix glock.client_id leak in do_lock 9p: p9dirent_read: check network-provided name length 9p/rdma: remove useless check in cm_event_handler 9p: acl: fix uninitialized iattr access 9p locks: add mount option for lock retry interval 9p: do not trust pdu content for stat item size 9p: Rename req to rreq in trans_fd 9p: fix spelling mistake in fall-through annotation 9p/rdma: do not disconnect on down_interruptible EAGAIN 9p: Add refcount to p9_req_t 9p: rename p9_free_req() function 9p: add a per-client fcall kmem_cache 9p: embed fcall in req to round down buffer allocs 9p: Remove p9_idpool 9p: Use a slab for allocating requests 9p: clear dangling pointers in p9stat_free v9fs_dir_readdir: fix double-free on p9stat_read error ...	2018-10-29 09:09:47 -07:00
Linus Torvalds	673c790e72	Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/gerg/m68knommu Pull m68k nommu fix from Greg Ungerer: "Only a single change to fix an out of bounds array access when parsing boot command line" * 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/gerg/m68knommu: m68k: fix command-line parsing when passed from u-boot	2018-10-29 09:04:26 -07:00
Linus Torvalds	83e7e5b544	m68k updates for v4.20 - Just two small cleanups. -----BEGIN PGP SIGNATURE----- iIsEABYIADMWIQQ9qaHoIs/1I4cXmEiKwlD9ZEnxcAUCW9WEMhUcZ2VlcnRAbGlu dXgtbTY4ay5vcmcACgkQisJQ/WRJ8XCbswD+NNpRY9sDL686TicLAB8nnIIi7wzA eCgPpsXprxrRjEsA/3Wpsgnu32y6zDqh3UCFdPBnYUTj3ev1oxGdmzI3gJMG =y1Lq -----END PGP SIGNATURE----- Merge tag 'm68k-for-v4.20-tag1' of git://git.kernel.org/pub/scm/linux/kernel/git/geert/linux-m68k Pull m68k updates from Geert Uytterhoeven: "Just two small cleanups" * tag 'm68k-for-v4.20-tag1' of git://git.kernel.org/pub/scm/linux/kernel/git/geert/linux-m68k: m68k/sun3: Remove is_medusa and m68k_pgtable_cachemode m68k/atari: ARAnyM - Remove reference to long-deprecated MODULE_PARM	2018-10-29 09:02:15 -07:00
Linus Torvalds	ac43507589	This tag contains the Linux port for C-SKY(csky) based on linux-4.19 Release, which has been through 10 rounds of review on mailing list. We almost got the Acked-by/Reviewed-by of all patches except "Process management and Signal", but all've been tested. Here is the LTP-20180118 test report: ----------------------------------------------- Total Tests: 1298 Total Skipped Tests: 281 Total Failures: 10 Kernel Version: 4.19.0+ Machine Architecture: csky Hostname: buildroot ----------------------------------------------- This patchset adds architecture support to Linux for C-SKY's 32-bit embedded There are two ABI versions with several CPU cores in this patchset: ABIv1: 610 (16-bit instruction, 32-bit data path, VIPT Cache ...) ABIv2: 807 810 860 (16/32-bit variable length instruction, PIPT Cache, SMP ...) More information: http://en.c-sky.com The development repo: https://github.com/c-sky/csky-linux ABI Documentation: https://github.com/c-sky/csky-doc Here is the pre-built cross compiler for fast test from our CI: https://gitlab.com/c-sky/buildroot/-/jobs/101608095/artifacts/file/output/images/csky_toolchain_qemu_csky_ck807f_4.18_glibc_defconfig_482b221e52908be1c9b2ccb444255e1562bb7025.tar.xz We use buildroot as our CI-test enviornment. "LTP, Lmbench ..." will be tested for every commit. See here for more details: https://gitlab.com/c-sky/buildroot/pipelines We'll continouslly improve csky subsystem in future. Changes in v10: - Remove duplicated headers in asm/Kbuild and uapi/asm/Kbuild. - Change to (__NR_arch_specific_syscall + 1) in unistd.h. - Drop dword access for get_user_size patch. - Involve the interrupt controller drivers after got Reviewed-by. Changes in v9: - Remove unused code in smp.c and use per_cpu for ipi_data. - Fixup r15 register access in abiv1/alignment.c. - Improve the changelog comment in commit-msg. Changes in v8: - Pass make allmodconfig. - Implement abiv1 get_user_dword(). - Remove set_irq_mapping() used by driver in smp.c. Changes in v7: - Use checkpatch.pl to check all patches and fixup as possible. - Remove github.com/c-sky print in bootup. - Give a return in DMA_ATTR_NON_CONSISTENT in csky_dma_alloc_atomic(). - Remove the NSIGXXX in fpu.c and use force_sig_fault() in fpu.c. - Remove irq.h and add it in asm/Kbuild. - Use byteswap helpers in abiv1/bswapXi.c. - Fixup arch_sync_dma() only with one page problem. Changes in v6: - use asm-generic/bitops/atomic.h for all in asm/bitops.h - fix flush_cache_range and tlb_start_vma - fix compile error with include linux/bug.h in cmpxchg.h - improve the comment Changes in v5: - remove redundant smp_mb operations in spinlock.h - add commit message for dt-bindings docs - add CPUHP_AP_CSKY_TIMER_STARTING in hotplug.h for csky_mptimer - add COMPILE_TEST for timer-gx6605s Kconfig - seperate csky two interrupt controllers with 2 patches - add MAINTAINERS patch for csky - move IPI_IRQ into csky_mptimer, fixup irq_mapping problem - coding convension Changes in v4: - cleanup defconfig - use ksys_ in syscall.c - remove wrong comment in vdso.c - Use GENERIC_IRQ_MULTI_HANDLER - optimize the memset.c - fixup dts warnings - remove big-endian in byteorder.h Changes in v3: dc560f1 csky: change to EM_CSKY 252 for elf.h 2ac3ddf csky: remove gx6605s.dts af00b8c csky: add defconfig and qemu.dts 6c87efb csky: remove the deprecate name. f6dda39 csky: add dt-bindings doc. d9f02a8 csky: remove KERNEL_VERSION in upstream branch 7bd663c csky: Use kernel/dma/noncoherent.c 1544c09 csky: bugfix emmc hang up LINS-976 e963271 csky: cleanup include/asm/Kbuild cd267ba csky: remove CSKY_DEBUG_INFO 78950da csky: remove dcache invalid. 13fe51d csky: remove csum_ipv6_magic(), use generic one. a7372db csky: bugfix CK810 access twice error. 1bb7c69 csky: bugfix add gcc asm memory for barrier. 5ea3257 csky: add -msoft-float instead of -mfloat-abi=soft. 38b037d csky: bugfix losing cache flush range. ab5e8c4 csky: Add ticket-spinlock and qrwlock support. c9aaec5 csky: rename cskyksyms.c to libgcc_ksyms.c 28c5e48 csky: avoid the MB on failure: trylock f929c97 csky: bugfix idly4 may cause exception. 09dc496 csky: Use GENERIC_ASHLDI3/ASHRDI3 etc 6ecc99d csky: optimize smp boot code. 16f50df csky: asm/bug.h simple implement. 0ba532a csky: csky asm/atomic.h added. df66947 csky: asm/compat.h added `275a06f` csky: String operations optimization 4c021dd csky: ck860 SMP memory barrier optimize fc39c66 csky: Add wait/doze/stop d005144 csky: add GENERIC_ALLOCATOR 4a10074 csky: bugfix cma failed for highmem. 9f2ca70 csky: CMA supported :) 53791f4 csky: optimize csky_dma_alloc_nonatomic 974676e csky: optimize the cpuinfo printf. 2538669 csky: bugfix make headers_install error. 1158d0c csky: prevent hard-float and vdsp instructions. dc3c856 csky: increase Normal Memory to 1GB 6ee5932 csky: bugfix qemu mmu couldn't support 0xffffe000 1d7dfb8 csky: csky_dma_alloc_atomic added. caf6610 csky: restruct the fixmap memory layout. 5a17eaa csky: use -Wa,-mcpu=ckxxxfv to the as. 4d51829 csky: use Kconfig.hz. f3f88fa csky: BUGFIX add -mcpu=ck860f support 6192fd1 csky: support ck860 fpu. 7aa5e01 csky: BUGFIX add smp_mb before ldex. 15758e2 csky: BUGFIX tlbi couldn't handle ASID in another CPU core. `d69640d` csky: enable tlbi.vas to flush one tlb entry Changes in v2: a29bfc8 csky: add pre_mmu_init, move misc mmu setup to mm/init.c 4eab702 csky: no need kmap for !VM_EXEC. 6770eec csky: Use TEE as the name of CPU Trusted Execution Enviornment. a56c8c7 csky: update the cache flush api. 1a48a95 csky: add C-SKY Trust Zone. b7a0a44 csky: use CONFIG_RAM_BASE as the same in memory of dts. 15adf81 csky: remove unused code. 35c0d97 csky: bugfix lost a cacheline flush when start isn't cacheline-aligned. 4e82c8d csky: use tlbi.alls for ck860 smp temporary. ae7149e csky: bugfix use kmap_atomic() to prevent no mapped addr. 5538795 csky: bugfix user access in kernel space. a7aa591 csky: add 16bit user space bkpt. 0de70ec csky: add sync.is for cmpxchg in SMP. c5c08a1 csky: seperate sync.is and sync for SMP and Non-SMP. dbbf4dc csky: use sync.is for ck860 mb(). f33f8da csky: rewrite the alignment implement. 68152c7 csky: bugfix alignment pt_regs error. d618d43 csky: support set_affinity for irq balance in SMP ebf86c9 csky: bugfix compile error without CONFIG_SMP. 8537eea csky: remove debug code. 4ebc051 csky: bugfix compile error with linux-4.9.56 75a938e csky: C-SKY SMP supported. 0eebc07 csky: use internal function for map_sg. 3d29751 csky: bugfix can't support highmem b545d2a csky: bugfix r26 is the link reg for jsri_to_jsr. 9e3313a csky: bugfix sync tls for abiv1 in ptrace. 587a0d2 csky: use __NR_rt_sigreturn in asm-generic. f562b46 csky: bugfix gpr_set & fpr_set f57266f csky: bugfix fpu_fpe_helper excute mtcr mfcr. c676669 csky: bugfix ave is default enable on reset. `d40d34d` csky: remove unused sc_mask in sigcontext.h. 274b7a2 csky: redesign the signal's api 7501771 csky: bugfix forget restore usp. 923e2ca csky: re-struct the pt_regs for regset. 2a1e499 csky: fixup config. ada81ec csky: bugfix abiv1 compile error. e34acb9 csky: bugfix abiv1 couldn't support -mno-stack-size. ec53560 csky: change irq map, reserve soft_irq&private_irq space. c7576f7 csky: bugfix modpost warning with -mno-stack-size c8ff9d4 csky: support csky mp timer alpha version. deabaaf csky: update .gitignore. 574815c csky: bugfix compile error with abiv1 in 4.15 0b426a7 csky: bugfix format of cpu verion id. 083435f csky: irq-csky-v2 alpha init. 21209e5 csky: add .gitignore 73e19b4 csky: remove FMFS_FPU_REGS/FMTS_FPU_REGS 07e8fac csky: add fpu regset in ptrace.c cac779d csky: add CSKY_VECIRQ_LEGENCY for SOC bug. 54bab1d csky: move usp into pt_regs. b167422 csky: support regset for ptrace. a098d4c csky: remove ARCH_WANT_IPC_PARSE_VERSION fe61a84 csky: add timer-of support. 27702e2 csky: bugfix boot error. ebe3edb csky: bugfix gx6605s boot failed - add __HEAD to head.section for head.S - move INIT_SECTION together to fix compile warning. 7138cae csky: coding convension for timer-nationalchip.c fa7f9bb csky: use ffs instead of fls. ddc9e81 csky: change to generic irq chip for irq-csky.c e9be8b9 irqchip: add generic irq chip for irq-nationalchip 2ee83fe csky: add set_handle_irq(), ref from openrisc & arm. 74181d6 csky: use irq_domain_add_linear instead of leagcy. fa45ae4 csky: bugfix setup stroge order for uncached. eb8030f csky: add HIGHMEM config in Kconfig 4f983d4 csky: remove "default n" in Kconfig 2467575 csky: use asm-generic/signal.h 77438e5 csky: coding conventions for irq.c 2e4a2b4 csky: optimize the cache flush ops. 96e1c58 csky: add CONFIG_CPU_ASID_BITS. 9339666 csky: add cprcr() cpwcr() for abiv1 ff05be4 csky: add THREAD_SHIFT define in asm/page.h 52ab022 csky: add mfcr() mtcr() in asm/reg_ops.h bdcd8f3 csky: revert back Kconfig select. 590c7e6 csky: bugfix compile error with CONFIG_AUDIT 1989292 csky: revert some back with cleanup unistd.h f1454fe csky: cleanup unistd.h 5d2985f csky: cleanup Kconfig and Makefile. 423d97e csky: cancel subdirectories cae2af4 csky: use asm-generic/fcntl.h -----BEGIN PGP SIGNATURE----- iQJGBAABCAAwFiEE2KAv+isbWR/viAKHAXH1GYaIxXsFAlvT4KESHHJlbl9ndW9A Yy1za3kuY29tAAoJEAFx9RmGiMV7xvoP/RpZO14XK0LeKjP9DllhvxSF49iL2MxC GGqPpRzYvzjl6KNd8VxwUbi7VCMJd8et3FF7lIfRaLxeHgj/kqk6X3JIzFON6rom 3dRzy5XDV1bzsaZoqRK2lysGJwVmkh+CQVpmGvdSrziFPqoHyzALF0T2ggX97lft HVpkg/ugiamb7a3p8U2oI5p8V05+beXgTFAqb5QszJwmPVFNaqs6CvnjmFJxJ4L8 EAyXZ6MXjgLPE0z+0ACCkBcK66ejneTVrcGaxgNaMT5kzcE5qCF6bTZxRx7YxwnA tbk8q+/dd5W1pbfBff3NifXeSHvkWqn7vmTq+PqGnid/92cJTzjspjaakZgEfbTP F8kB5jND54hlchdKfqrfoOpi3W8UYEfqdf1BHWgqsptWpq6GoTMJJE6y6rfvy1E7 r/F8rSKJlc8SxVa8/VIQqO4LbSlxm/wnO7sGoeYi8TjO3ZMCBIm4Had0f5uJTgdH WENC5WP1xyle5BsL3BJiTLQl1qappWptX44YaVod2A9HtdIOsVWCFNdhFPCis1jX RaoP7pHBxQ5Q6r8Nue9bBa2F3gK7IOh1EKEJ821AgA9Pz2Gm8wZ1XzsPqFCiAkkS PcIuZQl7l/eVoiXXEqUTQFlzL4K8WnvKCYAdRZWklGKfE0Ek5RD+KVXhqFE8Ujz4 xa+OgEzn5cQA =kHer -----END PGP SIGNATURE----- Merge tag 'csky-for-linus-4.20' of https://github.com/c-sky/csky-linux Pull C-SKY architecture port from Guo Ren: "This contains the Linux port for C-SKY(csky) based on linux-4.19 Release, which has been through 10 rounds of review on mailing list. More information: http://en.c-sky.com The development repo: https://github.com/c-sky/csky-linux ABI Documentation: https://github.com/c-sky/csky-doc Here is the pre-built cross compiler for fast test from our CI: https://gitlab.com/c-sky/buildroot/-/jobs/101608095/artifacts/file/output/images/csky_toolchain_qemu_csky_ck807f_4.18_glibc_defconfig_482b221e52908be1c9b2ccb444255e1562bb7025.tar.xz We use buildroot as our CI-test enviornment. "LTP, Lmbench ..." will be tested for every commit. See here for more details: https://gitlab.com/c-sky/buildroot/pipelines We'll continouslly improve csky subsystem in future" Arnd acks, and adds the following notes: "I did a thorough review of the ABI, which as usual mainly consists of spotting any files that don't use the asm-generic ABI itself, and having it changed to it matches exactly what we do on other new architectures. I also looked at every other patch and commented on maybe half of them where I saw something that did not quite seem right. Others have reviewed specific patches in greater depth. I'm sure that one could fine more of the minor details, but as long as they are not ABI relevant, they can be fixed later. The only patch that is part of the ABI and that nobody reviewed is the signal handling. This is one of the areas I never worked on in much detail. I did not see anything wrong with it, but I also don't know what the problems with the other architectures are here, and we seem to be hitting issues occasionally, and we never managed to generalize this enough for new architectures to have a trivial implementation. I was originally hoping that we could have the 64-bit time_t interfaces ready in time to completely drop the 32-bit ones, but that did not happen. We might still remove them in the next merge window depending on whether the libc upstream people prefer to keep them or not. One more general comment: I think this may well be the last new CPU architecture we ever add to the kernel. Both nds32 and c-sky are made by companies that also work on risc-v, and generally speaking risc-v seems to be killing off any of the minor licensable instruction set projects, just like ARM has mostly killed off the custom vendor-specific instruction sets already. If we add another architecture in the future, it may instead be something like the LLVM bitcode or WebAssembly, who knows?" To which Geert Uytterhoeven pipes in about another architecture still in the pipeline: Kalray MPPA. * tag 'csky-for-linus-4.20' of https://github.com/c-sky/csky-linux: (24 commits) dt-bindings: interrupt-controller: C-SKY APB intc irqchip: add C-SKY APB bus interrupt controller dt-bindings: interrupt-controller: C-SKY SMP intc irqchip: add C-SKY SMP interrupt controller MAINTAINERS: Add csky dt-bindings: Add vendor prefix for csky dt-bindings: csky CPU Bindings csky: Misc headers csky: SMP support csky: Debug and Ptrace GDB csky: User access csky: Library functions csky: ELF and module probe csky: Atomic operations csky: IRQ handling csky: VDSO and rt_sigreturn csky: Process management and Signal csky: MMU and page table management csky: Cache and TLB routines csky: System Call ...	2018-10-29 08:25:00 -07:00
Linus Torvalds	9f51ae62c8	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net Pull networking fixes from David Miller: 1) GRO overflow entries are not unlinked properly, resulting in list poison pointers being dereferenced. 2) Fix bridge build with ipv6 disabled, from Nikolay Aleksandrov. 3) Direct packet access and other fixes in BPF from Daniel Borkmann. 4) gred_change_table_def() gets passed the wrong pointer, a pointer to a set of unparsed attributes instead of the attribute itself. From Jakub Kicinski. 5) Allow macsec device to be brought up even if it's lowerdev is down, from Sabrina Dubroca. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: net: diag: document swapped src/dst in udp_dump_one. macsec: let the administrator set UP state even if lowerdev is down macsec: update operstate when lower device changes net: sched: gred: pass the right attribute to gred_change_table_def() ptp: drop redundant kasprintf() to create worker name net: bridge: remove ipv6 zero address check in mcast queries net: Properly unlink GRO packets on overflow. bpf: fix wrong helper enablement in cgroup local storage bpf: add bpf_jit_limit knob to restrict unpriv allocations bpf: make direct packet write unclone more robust bpf: fix leaking uninitialized memory on pop/peek helpers bpf: fix direct packet write into pop/peek helpers bpf: fix cg_skb types to hint access type in may_access_direct_pkt_data bpf: fix direct packet access for flow dissector progs bpf: disallow direct packet access for unpriv in cg_skb bpf: fix test suite to enable all unpriv program types bpf, btf: fix a missing check bug in btf_parse selftests/bpf: add config fragments BPF_STREAM_PARSER and XDP_SOCKETS bpf: devmap: fix wrong interface selection in notifier_call	2018-10-28 20:17:49 -07:00
Lorenzo Colitti	747569b0a7	net: diag: document swapped src/dst in udp_dump_one. Since its inception, udp_dump_one has had a bug where userspace needs to swap src and dst addresses and ports in order to find the socket it wants. This is because it passes the socket source address to __udp[46]_lib_lookup's saddr argument, but those functions are intended to find local sockets matching received packets, so saddr is the remote address, not the local address. This can no longer be fixed for backwards compatibility reasons, so add a brief comment explaining that this is the case. This will avoid confusion and help ensure SOCK_DIAG implementations of new protocols don't have the same problem. Fixes: `a925aa00a5` ("udp_diag: Implement the get_exact dumping functionality") Signed-off-by: Lorenzo Colitti <lorenzo@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-10-28 19:27:21 -07:00
David S. Miller	3bdf6bac58	Merge branch 'macsec-fixes' Sabrina Dubroca says: ==================== macsec: linkstate fixes This series fixes issues with handling administrative and operstate of macsec devices. Radu Rendec proposed another version of the first patch [0] but I'd rather not follow the behavior of vlan devices, going with macvlan does instead. Patrick Talbert also reported the same issue to me. The second patch is a follow-up. The restriction on setting the device up is a bit unreasonable, and operstate provides the information we need in this case. [0] https://patchwork.ozlabs.org/patch/971374/ ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2018-10-28 19:26:43 -07:00
Sabrina Dubroca	07bddef983	macsec: let the administrator set UP state even if lowerdev is down Currently, the kernel doesn't let the administrator set a macsec device up unless its lower device is currently up. This is inconsistent, as a macsec device that is up won't automatically go down when its lower device goes down. Now that linkstate propagation works, there's really no reason for this limitation, so let's remove it. Fixes: `c09440f7dc` ("macsec: introduce IEEE 802.1AE driver") Reported-by: Radu Rendec <radu.rendec@gmail.com> Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-10-28 19:26:42 -07:00
Sabrina Dubroca	e6ac075882	macsec: update operstate when lower device changes Like all other virtual devices (macvlan, vlan), the operstate of a macsec device should match the state of its lower device. This is done by calling netif_stacked_transfer_operstate from its netdevice notifier. We also need to call netif_stacked_transfer_operstate when a new macsec device is created, so that its operstate is set properly. This is only relevant when we try to bring the device up directly when we create it. Radu Rendec proposed a similar patch, inspired from the 802.1q driver, that included changing the administrative state of the macsec device, instead of just the operstate. This version is similar to what the macvlan driver does, and updates only the operstate. Fixes: `c09440f7dc` ("macsec: introduce IEEE 802.1AE driver") Reported-by: Radu Rendec <radu.rendec@gmail.com> Reported-by: Patrick Talbert <ptalbert@redhat.com> Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-10-28 19:26:42 -07:00
Jakub Kicinski	38b4f18d56	net: sched: gred: pass the right attribute to gred_change_table_def() gred_change_table_def() takes a pointer to TCA_GRED_DPS attribute, and expects it will be able to interpret its contents as struct tc_gred_sopt. Pass the correct gred attribute, instead of TCA_OPTIONS. This bug meant the table definition could never be changed after Qdisc was initialized (unless whatever TCA_OPTIONS contained both passed netlink validation and was a valid struct tc_gred_sopt...). Old behaviour: $ ip link add type dummy $ tc qdisc replace dev dummy0 parent root handle 7: \ gred setup vqs 4 default 0 $ tc qdisc replace dev dummy0 parent root handle 7: \ gred setup vqs 4 default 0 RTNETLINK answers: Invalid argument Now: $ ip link add type dummy $ tc qdisc replace dev dummy0 parent root handle 7: \ gred setup vqs 4 default 0 $ tc qdisc replace dev dummy0 parent root handle 7: \ gred setup vqs 4 default 0 $ tc qdisc replace dev dummy0 parent root handle 7: \ gred setup vqs 4 default 0 Fixes: `f62d6b936d` ("[PKT_SCHED]: GRED: Use central VQ change procedure") Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-10-28 19:23:26 -07:00
Rasmus Villemoes	822c5f7341	ptp: drop redundant kasprintf() to create worker name Building with -Wformat-nonliteral, gcc complains drivers/ptp/ptp_clock.c: In function ‘ptp_clock_register’: drivers/ptp/ptp_clock.c:239:26: warning: format not a string literal and no format arguments [-Wformat-nonliteral] worker_name : info->name); kthread_create_worker takes fmt+varargs to set the name of the worker, and that happens with a vsnprintf() to a stack buffer (that is then copied into task_comm). So there's no reason not to just pass "ptp%d", ptp->index to kthread_create_worker() and avoid the intermediate worker_name variable. Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk> Acked-by: Richard Cochran <richardcochran@gmail.com> Reviewed-by: Kees Cook <keescook@chromium.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-10-28 19:20:06 -07:00
Nikolay Aleksandrov	0fe5119e26	net: bridge: remove ipv6 zero address check in mcast queries Recently a check was added which prevents marking of routers with zero source address, but for IPv6 that cannot happen as the relevant RFCs actually forbid such packets: RFC 2710 (MLDv1): "To be valid, the Query message MUST come from a link-local IPv6 Source Address, be at least 24 octets long, and have a correct MLD checksum." Same goes for RFC 3810. And also it can be seen as a requirement in ipv6_mc_check_mld_query() which is used by the bridge to validate the message before processing it. Thus any queries with :: source address won't be processed anyway. So just remove the check for zero IPv6 source address from the query processing function. Fixes: `5a2de63fd1` ("bridge: do not add port to router list when receives query with source 0.0.0.0") Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-10-28 19:18:09 -07:00
Linus Torvalds	53b3b6bbfd	drm pull for 4.20-rc1 -----BEGIN PGP SIGNATURE----- iQIcBAABAgAGBQJbz8FfAAoJEAx081l5xIa+t8AQAJ6KaMM7rYlRDzIr1Vuh0++2 kFmfEQnSZnOWxO+zpyQpCJQr+/1aAHQ6+QzWq+/fX/bwH01H1Q4+z6t+4QoPqYlw rUYXZYRxnqGVPYx8hwNLmRbJcsXMVKly1SemBXIoabKkEtPNX5AZ4FXCR2WEbsV3 /YLxsYQv2KMd8aoeC9Hupa07Jj9GfOtEo0a9B1hKmo+XiF9HqPadxaofqOpQ6MCh 54itBgP7Kj4mwwr8KxG2JCNJagG5aG8q3yiEwaU5b0KzWya4o1wrOAJcBaEAIxpj JAgPde+f2L6w9dQbBBWeVFYKNn0jJqJdmbPs5Ek3i/NNFyx01Mn/3vlTZoRUqJN7 TGXwOI/BWz1iTaHyFPqVH6RPQAoUUDeCwgHkXonogFxvQLpiFG+dRNqxue0XVUMX 9tDSdZefWPoH3n9J/gDhwbV2Qbw/2n6yzCRYCb8HkqX1Y1JTmdYVgKvcnOwyYwsJ QzcVkWUJ31UAaZcTLCEW6SVqcUR0mso3LJAPSKp2NJiVLL8mSd/ViUTUbxRNkkXf H0abVGDjWAAZaT5uqNVqg4kV1Vc4Kj+/9QtspW4ktGezOz9DsctwJtfhTgOmT8Fx zlEwWmAbf1iJP9UgqI7r4+Nq24saqUYmIX0bowEasLIRO+l14Pf9mQJjgKRcMs/j SK4W5EreSFosKsxtQU4H =3yxi -----END PGP SIGNATURE----- Merge tag 'drm-next-2018-10-24' of git://anongit.freedesktop.org/drm/drm Pull drm updates from Dave Airlie: "This is going to rebuild more than drm as it adds a new helper to list.h for doing bulk updates. Seemed like a reasonable addition to me. Otherwise the usual merge window stuff lots of i915 and amdgpu, not so much nouveau, and piles of everything else. Core: - Adds a new list.h helper for doing bulk list updates for TTM. - Don't leak fb address in smem_start to userspace (comes with EXPORT workaround for people using mali out of tree hacks) - udmabuf device to turn memfd regions into dma-buf - Per-plane blend mode property - ref/unref replacements with get/put - fbdev conflicting framebuffers code cleaned up - host-endian format variants - panel orientation quirk for Acer One 10 bridge: - TI SN65DSI86 chip support vkms: - GEM support. - Cursor support amdgpu: - Merge amdkfd and amdgpu into one module - CEC over DP AUX support - Picasso APU support + VCN dynamic powergating - Raven2 APU support - Vega20 enablement + kfd support - ACP powergating improvements - ABGR/XBGR display support - VCN jpeg support - xGMI support - DC i2c/aux cleanup - Ycbcr 4:2:0 support - GPUVM improvements - Powerplay and powerplay endian fixes - Display underflow fixes vmwgfx: - Move vmwgfx specific TTM code to vmwgfx - Split out vmwgfx buffer/resource validation code - Atomic operation rework bochs: - use more helpers - format/byteorder improvements qxl: - use more helpers i915: - GGTT coherency getparam - Turn off resource streamer API - More Icelake enablement + DMC firmware - Full PPGTT for Ivybridge, Haswell and Valleyview - DDB distribution based on resolution - Limited range DP display support nouveau: - CEC over DP AUX support - Initial HDMI 2.0 support virtio-gpu: - vmap support for PRIME objects tegra: - Initial Tegra194 support - DMA/IOMMU integration fixes msm: - a6xx perf improvements + clock prefix - GPU preemption optimisations - a6xx devfreq support - cursor support rockchip: - PX30 support - rgb output interface support mediatek: - HDMI output support on mt2701 and mt7623 rcar-du: - Interlaced modes on Gen3 - LVDS on R8A77980 - D3 and E3 SoC support hisilicon: - misc fixes mxsfb: - runtime pm support sun4i: - R40 TCON support - Allwinner A64 support - R40 HDMI support omapdrm: - Driver rework changing display pipeline ordering to use common code - DMM memory barrier and irq fixes - Errata workarounds exynos: - out-bridge support for LVDS bridge driver - Samsung 16x16 tiled format support - Plane alpha and pixel blend mode support tilcdc: - suspend/resume update mali-dp: - misc updates" * tag 'drm-next-2018-10-24' of git://anongit.freedesktop.org/drm/drm: (1382 commits) firmware/dmc/icl: Add missing MODULE_FIRMWARE() for Icelake. drm/i915/icl: Fix signal_levels drm/i915/icl: Fix DDI/TC port clk_off bits drm/i915/icl: create function to identify combophy port drm/i915/gen9+: Fix initial readout for Y tiled framebuffers drm/i915: Large page offsets for pread/pwrite drm/i915/selftests: Disable shrinker across mmap-exhaustion drm/i915/dp: Link train Fallback on eDP only if fallback link BW can fit panel's native mode drm/i915: Fix intel_dp_mst_best_encoder() drm/i915: Skip vcpi allocation for MSTB ports that are gone drm/i915: Don't unset intel_connector->mst_port drm/i915: Only reset seqno if actually idle drm/i915: Use the correct crtc when sanitizing plane mapping drm/i915: Restore vblank interrupts earlier drm/i915: Check fb stride against plane max stride drm/amdgpu/vcn:Fix uninitialized symbol error drm: panel-orientation-quirks: Add quirk for Acer One 10 (S1003) drm/amd/amdgpu: Fix debugfs error handling drm/amdgpu: Update gc_9_0 golden settings. drm/amd/powerplay: update PPtable with DC BTC and Tvr SocLimit fields ...	2018-10-28 17:49:53 -07:00
Linus Torvalds	746bb4ed6d	Globally warn on VLA use - Remove unused fallback for BUILD_BUG_ON (which technically contains a VLA) - Lift -Wvla to the top-level Makefile -----BEGIN PGP SIGNATURE----- Comment: Kees Cook <kees@outflux.net> iQJKBAABCgA0FiEEpcP2jyKd1g9yPm4TiXL039xtwCYFAlvV7jMWHGtlZXNjb29r QGNocm9taXVtLm9yZwAKCRCJcvTf3G3AJkUwD/46aPTmVXQqzVr1QxRC087Aou5H hCMaUSG0mSuinhIpB398xh58imTqz48n44gf8yrBgittecV+g8cQ3TJZBp8fSbj4 zyuSy0xlghxqNYhsirMPdN61A8qOS/F7i60XFBXSKpdzKorUUsqlP6paDg1CWslB KWIOr2aHxvQk93pHFsWjOeM7CGqQIq1brKKDPAL+R4zj8EzXfi0s1sOR4tHCXRZP sTsuHysAjsBlaw54tvbCA5SIyABzZK5xsQoeChSKMoCDQb8TOQK4j8f78470/nmk lFWZWGKFr2sPUPcuf1casL5Cp57ycjwi4qTzKX2Qa1hhEhrYTIvcOwzOWdc0AY+6 Fttbopla1QmrGndLtm8FOJRGWiCAzhiSpV9vk1VDaP2jeCc6MEvTC0shsAgxSfsr JRIHqq37w3TBr78qeNuxOaSEkoqtjTVYug2aq7kefG66DGGChzCTVNQrLVNei3Qg ZdamzUZz7FVV6WmXlWsBfbm14sIRd02r7XORm0cJdIVvIwqJ9QIGJigR/Sfc4Qdi pXuuE3TNSfArACXlCkaBfqMYAhWO35qy41TerRlRDkri89DNHPY8RAVV0GpNSp7q kPaPBHZRKXAjHPnnypXz3A/zQoqJ7uWRG5msethAWtEXJBQ4qQWVhjTNmV7tkOkr HIaJFTb03LLIcuv23Q== =Vnw8 -----END PGP SIGNATURE----- Merge tag 'vla-v4.20-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux Pull VLA removal from Kees Cook: "Globally warn on VLA use. This turns on "-Wvla" globally now that the last few trees with their VLA removals have landed (crypto, block, net, and powerpc). Arnd mentioned that there may be a couple more VLAs hiding in hard-to-find randconfigs, but nothing big has shaken out in the last month or so in linux-next. We should be basically VLA-free now! Wheee. :) Summary: - Remove unused fallback for BUILD_BUG_ON (which technically contains a VLA) - Lift -Wvla to the top-level Makefile" * tag 'vla-v4.20-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: Makefile: Globally enable VLA warning compiler.h: give up __compiletime_assert_fallback()	2018-10-28 13:26:45 -07:00
Linus Torvalds	ac747c0715	Kbuild updates for v4.20 - optimize kallsyms slightly - remove check for old CFLAGS usage - add some compiler flags unconditionally instead of evaluating $(call cc-option,...) - fix variable shadowing in host tools - refactor scripts/mkmakefile - refactor various makefiles -----BEGIN PGP SIGNATURE----- Version: GnuPG v1 iQIcBAABAgAGBQJb1eyEAAoJED2LAQed4NsGUjwP/0/W0nLP+nKCJ0NpsSD151Ea Vrbm+RMxl8uPuQ0+Bh59rdO2yWR8v7jOX8CPbX1DqW0XwW4vTNRZm9j6A83rJzHv dPpinUObq6vXGJCYvMLoumhOkM1lRZie1AeBeLgKF0G0jprxUspGvakSyM/5SOo6 3gIhpPcczijC980KHIJbmwiGqRFVs3/zwqcKjQaRD3C6f4HdJL3i9Zr7kuZF0g1x dxCdN4Shv5x0igggp58z8646bbKlN7hItYaPq2csDop55/jWKzUkBkFkeaDjiUdB B6ysQfwkuTb6sbKXE6euYLUTyc6epx9v9pey0FpOx5tjXT+QmgK1ddLQEwdFH2+/ Fd5VR70h8uKTNvmqFJ+iebpR/aC71sUqYAzsoiTVZibR/F6+QAuhHJPZJDlSSwLC QH7j0fwJ3X5p9PYe3JBWyWzOd9BDvtV+HGE67Kx17iRNW0FDyKSvn7tezRVtqvHr KENrMT4n+lNQzB+c4lfjOE9ANpOf+PP4+ODhbrtvKItyb5QfF4F+5CDVif3wv3Kt 6dvh13hvR06gQCpbuxcFgD17mT1T/lVieuzOnWjtEh4HiMI1H0abYfkRWbQshrti u7TfYxwAyWKtisiPiT/1THYsW3ux7QmTSIo2iOznQxDlbmzENpQx+f22HsLaC0XU AUgvKGBaN0WOar2f2gyq =jM4E -----END PGP SIGNATURE----- Merge tag 'kbuild-v4.20' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild Pull Kbuild updates from Masahiro Yamada: - optimize kallsyms slightly - remove check for old CFLAGS usage - add some compiler flags unconditionally instead of evaluating $(call cc-option,...) - fix variable shadowing in host tools - refactor scripts/mkmakefile - refactor various makefiles * tag 'kbuild-v4.20' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild: modpost: Create macro to avoid variable shadowing ASN.1: Remove unnecessary shadowed local variable kbuild: use 'else ifeq' for checksrc to improve readability kbuild: remove unneeded link_multi_deps kbuild: add -Wno-unused-but-set-variable flag unconditionally kbuild: add -Wdeclaration-after-statement flag unconditionally kbuild: add -Wno-pointer-sign flag unconditionally modpost: remove leftover symbol prefix handling for module device table kbuild: simplify command line creation in scripts/mkmakefile kbuild: do not pass $(objtree) to scripts/mkmakefile kbuild: remove user ID check in scripts/mkmakefile kbuild: remove VERSION and PATCHLEVEL from $(objtree)/Makefile kbuild: add --include-dir flag only for out-of-tree build kbuild: remove dead code in cmd_files calculation in top Makefile kbuild: hide most of targets when running config or mixed targets kbuild: remove old check for CFLAGS use kbuild: prefix Makefile.dtbinst path with $(srctree) unconditionally kallsyms: remove left-over Blackfin code kallsyms: reduce size a little on 64-bit	2018-10-28 13:22:35 -07:00
Linus Torvalds	f8cab69be0	linux-kselftest-4.20-rc1 This Kselftest update for Linux 4.20-rc1 consists of: - Improvements to ftrace test suite from Masami Hiramatsu. - Color coded ftrace PASS / FAIL results from Steven Rostedt (VMware) to improve readability of reports. - watchdog Fixes and enhancement to add gettimeout and get\|set pretimeout options from Jerry Hoemann. - Several fixes to warnings and spelling etc. -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEPZKym/RZuOCGeA/kCwJExA0NQxwFAlvSA/oACgkQCwJExA0N QxxQgA//UINds/f39g3nPf6QMhHcNQmNxdT/EYBSmKkorUTl3ojRFmcjerSt1DQ2 evgfFqU2oxyeqEvgApb1b39s7FGLVwzU67fALxpnn8xgFCvcHu1hWjbGLN34W1OY bnHnrTAhzgL6FV2EpJDAvN/1M1msq4Gp74GjZ/47aGwYQxkGXtm/WXMbQw+bJTFd bPC103URY56ihRlLLzxn+cNpbueYcbhI/OpYlYajvwBCFpIhTbdluscZpOmPt5nl Ej/TJ6DjgzlXqf8ZdvbAKblvTmjp2ik2OmtNybbr7CuvgKvxSfhsfP7itRL4UKhg /uHs2Ki5KmGczYxv6BYpNpjpyslWDz0WMcEBAsH+017IFsB+jrkEH7ys9KEkUhWc U+/LqPld/B9C1FpLah92SDuRck0MSUfOvqKdJTABdSIWO8REDiPhiIDlvH9tYrPm XgDvf/yldU3Sxemrqsxe9ooHp4kqr+vmVuXBfFd24CJTaGBPtcM25uTYEb06IyyL Lkd07YoHMRXzKxCGV9Lpfs4DQpP8WGhkMwFoFhbTXYkrf8yLk5GVpj3UVUluTAk4 q0/L9wKcaR5r1CqwCIE+ZsioRGlFBcpsAYD6zPc8HaV50X1zj65boK7Fj/HHzmTs Oyi9epp3T4Ml5NAbJeiA66GCpy1RRlEX3r/EVp2dW+FvBlyidUE= =cpOg -----END PGP SIGNATURE----- Merge tag 'linux-kselftest-4.20-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest Pull kselftest updates from Shuah Khan: "This Kselftest update for Linux 4.20-rc1 consists of: - Improvements to ftrace test suite from Masami Hiramatsu. - Color coded ftrace PASS / FAIL results from Steven Rostedt (VMware) to improve readability of reports. - watchdog Fixes and enhancement to add gettimeout and get\|set pretimeout options from Jerry Hoemann. - Several fixes to warnings and spelling etc" * tag 'linux-kselftest-4.20-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest: (40 commits) selftests/ftrace: Strip escape sequences for log file selftests/ftrace: Use colored output when available selftests: fix warning: "_GNU_SOURCE" redefined selftests: kvm: Fix -Wformat warnings selftests/ftrace: Add color to the PASS / FAIL results kvm: selftests: fix spelling mistake "Insufficent" -> "Insufficient" selftests: gpio: Fix OUTPUT directory in Makefile selftests: gpio: restructure Makefile selftests: watchdog: Fix ioctl SET* error paths to take oneshot exit path selftests: watchdog: Add gettimeout and get\|set pretimeout selftests: watchdog: Fix error message. selftests: watchdog: fix message when /dev/watchdog open fails selftests/ftrace: Add ftrace cpumask testcase selftests/ftrace: Add wakeup_rt tracer testcase selftests/ftrace: Add wakeup tracer testcase selftests/ftrace: Add stacktrace ftrace filter command testcase selftests/ftrace: Add trace_pipe testcase selftests/ftrace: Add function filter on module testcase selftests/ftrace: Add max stack tracer testcase selftests/ftrace: Add function profiling stat testcase ...	2018-10-28 12:58:42 -07:00
Linus Torvalds	dad4f140ed	Merge branch 'xarray' of git://git.infradead.org/users/willy/linux-dax Pull XArray conversion from Matthew Wilcox: "The XArray provides an improved interface to the radix tree data structure, providing locking as part of the API, specifying GFP flags at allocation time, eliminating preloading, less re-walking the tree, more efficient iterations and not exposing RCU-protected pointers to its users. This patch set 1. Introduces the XArray implementation 2. Converts the pagecache to use it 3. Converts memremap to use it The page cache is the most complex and important user of the radix tree, so converting it was most important. Converting the memremap code removes the only other user of the multiorder code, which allows us to remove the radix tree code that supported it. I have 40+ followup patches to convert many other users of the radix tree over to the XArray, but I'd like to get this part in first. The other conversions haven't been in linux-next and aren't suitable for applying yet, but you can see them in the xarray-conv branch if you're interested" * 'xarray' of git://git.infradead.org/users/willy/linux-dax: (90 commits) radix tree: Remove multiorder support radix tree test: Convert multiorder tests to XArray radix tree tests: Convert item_delete_rcu to XArray radix tree tests: Convert item_kill_tree to XArray radix tree tests: Move item_insert_order radix tree test suite: Remove multiorder benchmarking radix tree test suite: Remove __item_insert memremap: Convert to XArray xarray: Add range store functionality xarray: Move multiorder_check to in-kernel tests xarray: Move multiorder_shrink to kernel tests xarray: Move multiorder account test in-kernel radix tree test suite: Convert iteration test to XArray radix tree test suite: Convert tag_tagged_items to XArray radix tree: Remove radix_tree_clear_tags radix tree: Remove radix_tree_maybe_preload_order radix tree: Remove split/join code radix tree: Remove radix_tree_update_node_t page cache: Finish XArray conversion dax: Convert page fault handlers to XArray ...	2018-10-28 11:35:40 -07:00
David S. Miller	ece23711dd	net: Properly unlink GRO packets on overflow. Just like with normal GRO processing, we have to initialize skb->next to NULL when we unlink overflow packets from the GRO hash lists. Fixes: `d4546c2509` ("net: Convert GRO SKB handling to list_head.") Reported-by: Oleksandr Natalenko <oleksandr@natalenko.name> Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-10-28 10:35:12 -07:00
Leonardo Bras	c2b1a9226f	modpost: Create macro to avoid variable shadowing Create DEF_FIELD_ADDR_VAR as a more generic version of the DEF_FIELD_ADD macro, allowing usage of a variable name other than the struct element name. Also, sets DEF_FIELD_ADDR as a specific usage of DEF_FILD_ADDR_VAR in which the var name is the same as the struct element name. Then, makes use of DEF_FIELD_ADDR_VAR to create a variable of another name, in order to avoid variable shadowing. Signed-off-by: Leonardo Bras <leobras.c@gmail.com> Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>	2018-10-29 00:41:41 +09:00
Leonardo Bras	9e1e819433	ASN.1: Remove unnecessary shadowed local variable Remove an unnecessary shadowed local variable (start). It was used only once, with the same value it was started before the if block. Signed-off-by: Leonardo Bras <leobras.c@gmail.com> Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>	2018-10-29 00:19:41 +09:00
John David Anglin	c9fa406f62	parisc: Fix A500 boot crash I believe the following change will fix the cache/TLB inconsistency observed by Meelis. After changing the page table entries, we need to flush the cache and TLB to ensure that we don't have any stale PTE values in the cache or TLB. The alternative patching is done after all CPUs are running. Thus, we need to flush the whole cache and TLB. I included the init section in the range modified by map_pages as suggested by Helge. Some routines in the init section may require patching. Signed-off-by: John David Anglin <dave.anglin@bell.net> Signed-off-by: Helge Deller <deller@gmx.de>	2018-10-28 10:51:07 +01:00
Linus Torvalds	69d5b97c59	HID: we do not randomly make new drivers 'default y' .. even when that "default y" is hidden syntactically as a default !EXPERT it's wrong. The only reason something should be 'default y' is if it used to be built-in, and it was made configurable, and the 'default y' is just retaining the status quo. Altheratively, the hardware for the driver has become _so_ common that it really makes sense for everybody to build it. Finally, one possible reason for 'default y' is because the option is not enabling any new code at all, but is just enabling other options (the networking people do this for vendor options, for example, so that you can disable whole vendors at a time). Clearly, none of these cases hold for the BigBen Interactive Kids' gamepad, and HID_BIGBEN_FF should thus most definitely not default to on for everybody. Cc: Hanno Zulla <kontakt@hanno.de> Cc: Jiri Kosina <jkosina@suse.cz> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2018-10-27 11:03:27 -07:00
Linus Torvalds	5ecf3e110c	linux-watchdog 4.20-rc1 tag -----BEGIN PGP SIGNATURE----- Version: GnuPG v2.0.14 (GNU/Linux) iEYEABECAAYFAlvURsoACgkQ+iyteGJfRspw9gCgtLHtFIm++UnS48IhP+ZlioGI KqYAoLMhbd9IzpgPd1F8cLT/N29zAUcU =bvZ1 -----END PGP SIGNATURE----- Merge tag 'linux-watchdog-4.20-rc1' of git://www.linux-watchdog.org/linux-watchdog Pull watchdog updates from Wim Van Sebroeck: - Add Armada 37xx CPU watchdog - w83627hf_wdt: Add Support for NCT6796D, NCT6797D, NCT6798D - hpwdt: several improvements - renesas_wdt: SPDX identifiers, stop when unregistering, support for R7S9210 - rza_wdt: SPDX identifiers, support longer timeouts - core: fix null pointer dereference when releasing cdev - iTCO_wdt: Drop option vendorsupport=2 - sama5d4: fix timeout-sec usage - lantiq_wdt: convert to watchdog framework - several small fixes * tag 'linux-watchdog-4.20-rc1' of git://www.linux-watchdog.org/linux-watchdog: (30 commits) watchdog: ts4800: release syscon device node in ts4800_wdt_probe() watchdog: armada_37xx_wdt: use do_div for u64 division documentation: watchdog: add documentation for armada-37xx-wdt dt-bindings: watchdog: Document armada-37xx-wdt binding watchdog: Add support for Armada 37xx CPU watchdog dt-bindings: watchdog: add mpc8xxx-wdt support watchdog: mpc8xxx: provide boot status MAINTAINERS: Fix file pattern for MEN Z069 watchdog driver dt-bindings: watchdog: renesas-wdt: Add support for R7S9210 watchdog: rza_wdt: Support longer timeouts watchdog: hpwdt: Disable PreTimeout when Timeout is smaller watchdog: w83627hf_wdt: Support NCT6796D, NCT6797D, NCT6798D watchdog: mpc8xxx: use dev_xxxx() instead of pr_xxxx() watchdog: lantiq: add get_timeleft callback watchdog: lantiq: Convert to watchdog_device watchdog: lantiq: update register names to better match spec watchdog: sama5d4: fix timeout-sec usage watchdog: fix a small number of "watchog" typos in comments watchdog: rza_wdt: convert to SPDX identifiers watchdog: iTCO_wdt: Remove unused hooks ...	2018-10-27 10:25:22 -07:00
Linus Torvalds	ed3f4e2398	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input Pull input updates from Dmitry Torokhov: "Just random driver fixups, nothing exiting" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input: Input: synaptics - avoid using uninitialized variable when probing Input: xen-kbdfront - mark expected switch fall-through Input: atmel_mxt_ts - mark expected switch fall-through Input: cyapa - mark expected switch fall-throughs Input: wm97xx-ts - fix exit path Input: of_touchscreen - add support for touchscreen-min-x\|y Input: Fix DIR-685 touchkeys MAINTAINERS entry Input: elants_i2c - use DMA safe i2c when possible Input: silead - try firmware reload after unsuccessful resume Input: st1232 - set INPUT_PROP_DIRECT property Input: xilinx_ps2 - convert to using %pOFn instead of device_node.name Input: atmel_mxt_ts - fix multiple <linux/property.h> includes Input: sun4i-lradc - convert to using %pOFn instead of device_node.name Input: pwm-vibrator - correct pwms in DT binding example	2018-10-27 10:20:39 -07:00
Linus Torvalds	c7b7eefa57	RTC for 4.20 Subsystem: - non devm managed registration is now removed from the driver API. - all the unnecessary rtc_valid_tm() calls have been removed Drivers: - abx80X: watchdog support - cmos: fix non ACPI support - sc27xx: fix alarm support - Remove a possible sysfs race condition for ab8500, ds1307, ds1685, isl1208 - Fix a possible race condition where an irq handler may be called before the rtc_device struct is allocated for mt6397, pl030, menelaus, armada38x -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEXx9Viay1+e7J/aM4AyWl4gNJNJIFAlvTekMACgkQAyWl4gNJ NJIJaA/+ONa/L5xvbMXRAAx164WX1f/8kZUZzku5rDAdnwrfj/4N13LRj3rtQlpD 0tv7/JZW3f1P8Ccfd3AkIZa1t7i3ruFpsCvQ9ModzB+gAR9n9ic+k8ugExEXXi4d wrVwEiPHgoYyX/GwGKLUrbTJTqOBcp3PeSW5jK2WhKdnCnYVOQNL38MVmBxSCZ/m 3p+lOBsR4ND2bIf3QjJpck6K+6thaX0JQUIuM14oK2rmNA5dfrhaUw2benlBG/dv Cd/H7TO5qRuUaY+HwwUB/M78wYhaLyVjPU0JbfHwVf04dZFyWQYuZxVZJh9boowO XwkwKBPgti43Opa9i6VIcOR6j2/ym2U8JJcU5eL4scFxjwoTUbbE2WAvNJ9/sF5J z434yfn9Iy0dVEmp/K5Dl0YQAhNHVQChi0RsQ0UnqxDzuxEU8J26JQvp8MHNKYsp FxeyCDjEMf8u+xqQpCQ24chzyIKUGSyFurYGVDhDOW2/Rh08Z17GMhiLfOO73pFU B9GAaEpsJwfZbmtlqJAm3lE521kCo4JBoVM7tY+75p3b9epPW5L6tGQIDdj6Q2A7 VJkCo+O/2cVO3NRXWopTe6/vMm8mvJALvaQVccrLqx/oBpN6GBDFPXfz4a6AB+GF o2XBNEkUOrEAkSONfGKEKAkKU//tIssr9Yx6XmpjFOcrZxgfAL0= =5JM5 -----END PGP SIGNATURE----- Merge tag 'rtc-4.20' of git://git.kernel.org/pub/scm/linux/kernel/git/abelloni/linux Pull RTC updates from Alexandre Belloni: "This cycle, there were mostly non urgent fixes in drivers. I also finally unexported the non managed registration. Subsystem: - non devm managed registration is now removed from the driver API - all the unnecessary rtc_valid_tm() calls have been removed Drivers: - abx80X: watchdog support - cmos: fix non ACPI support - sc27xx: fix alarm support - Remove a possible sysfs race condition for ab8500, ds1307, ds1685, isl1208 - Fix a possible race condition where an irq handler may be called before the rtc_device struct is allocated for mt6397, pl030, menelaus, armada38x" * tag 'rtc-4.20' of git://git.kernel.org/pub/scm/linux/kernel/git/abelloni/linux: (54 commits) rtc: sc27xx: Always read normal alarm when registering RTC device rtc: sc27xx: Add check to see if need to enable the alarm interrupt rtc: sc27xx: Remove interrupts disable and clear in probe() rtc: sc27xx: Clear SPG value update interrupt status rtc: sc27xx: Set wakeup capability before registering rtc device rtc: s35390a: Change buf's type to u8 in s35390a_init rtc: ds1307: fix ds1339 wakealarm support rtc: ds1685: simplify getting .driver_data rtc: m41t80: mark expected switch fall-through rtc: tegra: Propagate errors from platform_get_irq() rtc: cmos: Remove the `use_acpi_alarm' module parameter for !ACPI rtc: cmos: Fix non-ACPI undefined reference to `hpet_rtc_interrupt' rtc: mv: let the core handle invalid alarms rtc: vr41xx: switch to rtc_time64_to_tm/rtc_tm_to_time64 rtc: ab8500: remove useless check rtc: ab8500: let the core handle range rtc: ab8500: use rtc_add_group rtc: rs5c348: report error when time is invalid rtc: rs5c348: remove forward declaration rtc: rs5c348: remove useless label ...	2018-10-27 09:24:24 -07:00
Linus Torvalds	e558545349	LED fix for 4-20-rc1 -----BEGIN PGP SIGNATURE----- iJAEABYIADkWIQQUwxxKyE5l/npt8ARiEGxRG/Sl2wUCW9NpRRscamFjZWsuYW5h c3pld3NraUBnbWFpbC5jb20ACgkQYhBsURv0pdvr0AD4zq6QtPbvLI7+s7S3812N BOfYaA/gTA1eMZr/4QZPPAEA6h2sMjEA+jhf5lhA8QsCMGIeuWu/npcYkpFgnN9Q 8gM= =MecC -----END PGP SIGNATURE----- Merge tag 'led-fix-for-4.20-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/j.anaszewski/linux-leds Pull LED fix from Jacek Anaszewski. * tag 'led-fix-for-4.20-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/j.anaszewski/linux-leds: leds: gpio: set led_dat->gpiod pointer for OF defined GPIO leds	2018-10-27 09:20:42 -07:00
Linus Torvalds	b59dfdaef1	i2c-hid: properly terminate i2c_hid_dmi_desc_override_table[] array Commit `9ee3e06610` ("HID: i2c-hid: override HID descriptors for certain devices") added a new dmi_system_id quirk table to override certain HID report descriptors for some systems that lack them. But the table wasn't properly terminated, causing the dmi matching to walk off into la-la-land, and starting to treat random data as dmi descriptor pointers, causing boot-time oopses if you were at all unlucky. Terminate the array. We really should have some way to just statically check that arrays that should be terminated by an empty entry actually are so. But the HID people really should have caught this themselves, rather than have me deal with an oops during the merge window. Tssk, tssk. Cc: Julian Sax <jsbc@gmx.de> Cc: Benjamin Tissoires <benjamin.tissoires@redhat.com> Cc: Jiri Kosina <jkosina@suse.cz> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2018-10-27 09:10:48 -07:00
David S. Miller	6788fac820	Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf Daniel Borkmann says: ==================== pull-request: bpf 2018-10-27 The following pull-request contains BPF updates for your net tree. The main changes are: 1) Fix toctou race in BTF header validation, from Martin and Wenwen. 2) Fix devmap interface comparison in notifier call which was neglecting netns, from Taehee. 3) Several fixes in various places, for example, correcting direct packet access and helper function availability, from Daniel. 4) Fix BPF kselftest config fragment to include af_xdp and sockmap, from Naresh. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2018-10-26 21:41:49 -07:00
Linus Torvalds	345671ea0f	Merge branch 'akpm' (patches from Andrew) Merge updates from Andrew Morton: - a few misc things - ocfs2 updates - most of MM * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (132 commits) hugetlbfs: dirty pages as they are added to pagecache mm: export add_swap_extent() mm: split SWP_FILE into SWP_ACTIVATED and SWP_FS tools/testing/selftests/vm/map_fixed_noreplace.c: add test for MAP_FIXED_NOREPLACE mm: thp: relocate flush_cache_range() in migrate_misplaced_transhuge_page() mm: thp: fix mmu_notifier in migrate_misplaced_transhuge_page() mm: thp: fix MADV_DONTNEED vs migrate_misplaced_transhuge_page race condition mm/kasan/quarantine.c: make quarantine_lock a raw_spinlock_t mm/gup: cache dev_pagemap while pinning pages Revert "x86/e820: put !E820_TYPE_RAM regions into memblock.reserved" mm: return zero_resv_unavail optimization mm: zero remaining unavailable struct pages tools/testing/selftests/vm/gup_benchmark.c: add MAP_HUGETLB option tools/testing/selftests/vm/gup_benchmark.c: add MAP_SHARED option tools/testing/selftests/vm/gup_benchmark.c: allow user specified file tools/testing/selftests/vm/gup_benchmark.c: fix 'write' flag usage mm/gup_benchmark.c: add additional pinning methods mm/gup_benchmark.c: time put_page() mm: don't raise MEMCG_OOM event due to failed high-order allocation mm/page-writeback.c: fix range_cyclic writeback vs writepages deadlock ...	2018-10-26 19:33:41 -07:00
Linus Torvalds	4904008165	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net Pull networking fixes from David Miller: "What better way to start off a weekend than with some networking bug fixes: 1) net namespace leak in dump filtering code of ipv4 and ipv6, fixed by David Ahern and Bjørn Mork. 2) Handle bad checksums from hardware when using CHECKSUM_COMPLETE properly in UDP, from Sean Tranchetti. 3) Remove TCA_OPTIONS from policy validation, it turns out we don't consistently use nested attributes for this across all packet schedulers. From David Ahern. 4) Fix SKB corruption in cadence driver, from Tristram Ha. 5) Fix broken WoL handling in r8169 driver, from Heiner Kallweit. 6) Fix OOPS in pneigh_dump_table(), from Eric Dumazet" * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (28 commits) net/neigh: fix NULL deref in pneigh_dump_table() net: allow traceroute with a specified interface in a vrf bridge: do not add port to router list when receives query with source 0.0.0.0 net/smc: fix smc_buf_unuse to use the lgr pointer ipv6/ndisc: Preserve IPv6 control buffer if protocol error handlers are called net/{ipv4,ipv6}: Do not put target net if input nsid is invalid lan743x: Remove SPI dependency from Microchip group. drivers: net: remove <net/busy_poll.h> inclusion when not needed net: phy: genphy_10g_driver: Avoid NULL pointer dereference r8169: fix broken Wake-on-LAN from S5 (poweroff) octeontx2-af: Use GFP_ATOMIC under spin lock net: ethernet: cadence: fix socket buffer corruption problem net/ipv6: Allow onlink routes to have a device mismatch if it is the default route net: sched: Remove TCA_OPTIONS from policy ice: Poll for link status change ice: Allocate VF interrupts and set queue map ice: Introduce ice_dev_onetime_setup net: hns3: Fix for warning uninitialized symbol hw_err_lst3 octeontx2-af: Copy the right amount of memory net: udp: fix handling of CHECKSUM_COMPLETE packets ...	2018-10-26 19:25:07 -07:00
Linus Torvalds	a45dcff748	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc Pull sparc fixes from David Miller: "Some more sparc fixups, mostly aimed at getting the allmodconfig build up and clean again" * git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc: sparc64: Rework xchg() definition to avoid warnings. sparc64: Export __node_distance. sparc64: Make corrupted user stacks more debuggable.	2018-10-26 17:15:49 -07:00
Mike Kravetz	22146c3ce9	hugetlbfs: dirty pages as they are added to pagecache Some test systems were experiencing negative huge page reserve counts and incorrect file block counts. This was traced to /proc/sys/vm/drop_caches removing clean pages from hugetlbfs file pagecaches. When non-hugetlbfs explicit code removes the pages, the appropriate accounting is not performed. This can be recreated as follows: fallocate -l 2M /dev/hugepages/foo echo 1 > /proc/sys/vm/drop_caches fallocate -l 2M /dev/hugepages/foo grep -i huge /proc/meminfo AnonHugePages: 0 kB ShmemHugePages: 0 kB HugePages_Total: 2048 HugePages_Free: 2047 HugePages_Rsvd: 18446744073709551615 HugePages_Surp: 0 Hugepagesize: 2048 kB Hugetlb: 4194304 kB ls -lsh /dev/hugepages/foo 4.0M -rw-r--r--. 1 root root 2.0M Oct 17 20:05 /dev/hugepages/foo To address this issue, dirty pages as they are added to pagecache. This can easily be reproduced with fallocate as shown above. Read faulted pages will eventually end up being marked dirty. But there is a window where they are clean and could be impacted by code such as drop_caches. So, just dirty them all as they are added to the pagecache. Link: http://lkml.kernel.org/r/b5be45b8-5afe-56cd-9482-28384699a049@oracle.com Fixes: `6bda666a03` ("hugepages: fold find_or_alloc_pages into huge_no_page()") Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Acked-by: Mihcla Hocko <mhocko@suse.com> Reviewed-by: Khalid Aziz <khalid.aziz@oracle.com> Cc: Hugh Dickins <hughd@google.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.vnet.ibm.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2018-10-26 16:38:16 -07:00
Omar Sandoval	aa8aa8a331	mm: export add_swap_extent() Btrfs currently does not support swap files because swap's use of bmap does not work with copy-on-write and multiple devices. See `35054394c4` ("Btrfs: stop providing a bmap operation to avoid swapfile corruptions"). However, the swap code has a mechanism for the filesystem to manually add swap extents using add_swap_extent() from the ->swap_activate() aop. iomap has done this since `67482129cd` ("iomap: add a swapfile activation function"). Btrfs will do the same in a later patch, so export add_swap_extent(). Link: http://lkml.kernel.org/r/bb1208575e02829aae51b538709476964f97b1ea.1536704650.git.osandov@fb.com Signed-off-by: Omar Sandoval <osandov@fb.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: David Sterba <dsterba@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2018-10-26 16:38:15 -07:00
Omar Sandoval	bc4ae27d81	mm: split SWP_FILE into SWP_ACTIVATED and SWP_FS The SWP_FILE flag serves two purposes: to make swap_{read,write}page() go through the filesystem, and to make swapoff() call ->swap_deactivate(). For Btrfs, we want the latter but not the former, so split this flag into two. This makes us always call ->swap_deactivate() if ->swap_activate() succeeded, not just if it didn't add any swap extents itself. This also resolves the issue of the very misleading name of SWP_FILE, which is only used for swap files over NFS. Link: http://lkml.kernel.org/r/6d63d8668c4287a4f6d203d65696e96f80abdfc7.1536704650.git.osandov@fb.com Signed-off-by: Omar Sandoval <osandov@fb.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: David Sterba <dsterba@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2018-10-26 16:38:15 -07:00
Michael Ellerman	91cbacc345	tools/testing/selftests/vm/map_fixed_noreplace.c: add test for MAP_FIXED_NOREPLACE Add a test for MAP_FIXED_NOREPLACE, based on some code originally by Jann Horn. This would have caught the overlap bug reported by Daniel Micay. I originally suggested to Michal that we create MAP_FIXED_NOREPLACE, but instead of writing a selftest I spent my time bike-shedding whether it should be called MAP_FIXED_SAFE/NOCLOBBER/WEAK/NEW .. mea culpa. Link: http://lkml.kernel.org/r/20181013133929.28653-1-mpe@ellerman.id.au Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: Khalid Aziz <khalid.aziz@oracle.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Jann Horn <jannh@google.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Florian Weimer <fweimer@redhat.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Abdul Haleem <abdhalee@linux.vnet.ibm.com> Cc: Joel Stanley <joel@jms.id.au> Cc: Jason Evans <jasone@google.com> Cc: David Goldblatt <davidtgoldblatt@gmail.com> Cc: Daniel Micay <danielmicay@gmail.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2018-10-26 16:38:15 -07:00
Andrea Arcangeli	7eef5f97c1	mm: thp: relocate flush_cache_range() in migrate_misplaced_transhuge_page() There should be no cache left by the time we overwrite the old transhuge pmd with the new one. It's already too late to flush through the virtual address because we already copied the page data to the new physical address. So flush the cache before the data copy. Also delete the "end" variable to shutoff a "unused variable" warning on x86 where flush_cache_range() is a noop. Link: http://lkml.kernel.org/r/20181015202311.7209-1-aarcange@redhat.com Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Aaron Tomlin <atomlin@redhat.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2018-10-26 16:38:15 -07:00
Andrea Arcangeli	7066f0f933	mm: thp: fix mmu_notifier in migrate_misplaced_transhuge_page() change_huge_pmd() after arming the numa/protnone pmd doesn't flush the TLB right away. do_huge_pmd_numa_page() flushes the TLB before calling migrate_misplaced_transhuge_page(). By the time do_huge_pmd_numa_page() runs some CPU could still access the page through the TLB. change_huge_pmd() before arming the numa/protnone transhuge pmd calls mmu_notifier_invalidate_range_start(). So there's no need of mmu_notifier_invalidate_range_start()/mmu_notifier_invalidate_range_only_end() sequence in migrate_misplaced_transhuge_page() too, because by the time migrate_misplaced_transhuge_page() runs, the pmd mapping has already been invalidated in the secondary MMUs. It has to or if a secondary MMU can still write to the page, the migrate_page_copy() would lose data. However an explicit mmu_notifier_invalidate_range() is needed before migrate_misplaced_transhuge_page() starts copying the data of the transhuge page or the below can happen for MMU notifier users sharing the primary MMU pagetables and only implementing ->invalidate_range: CPU0 CPU1 GPU sharing linux pagetables using only ->invalidate_range ----------- ------------ --------- GPU secondary MMU writes to the page mapped by the transhuge pmd change_pmd_range() mmu..._range_start() ->invalidate_range_start() noop change_huge_pmd() set_pmd_at(numa/protnone) pmd_unlock() do_huge_pmd_numa_page() CPU TLB flush globally (1) CPU cannot write to page migrate_misplaced_transhuge_page() GPU writes to the page... migrate_page_copy() ...GPU stops writing to the page CPU TLB flush (2) mmu..._range_end() (3) ->invalidate_range_stop() noop ->invalidate_range() GPU secondary MMU is invalidated and cannot write to the page anymore (too late) Just like we need a CPU TLB flush (1) because the TLB flush (2) arrives too late, we also need a mmu_notifier_invalidate_range() before calling migrate_misplaced_transhuge_page(), because the ->invalidate_range() in (3) also arrives too late. This requirement is the result of the lazy optimization in change_huge_pmd() that releases the pmd_lock without first flushing the TLB and without first calling mmu_notifier_invalidate_range(). Even converting the removed mmu_notifier_invalidate_range_only_end() into a mmu_notifier_invalidate_range_end() would not have been enough to fix this, because it run after migrate_page_copy(). After the hugepage data copy is done migrate_misplaced_transhuge_page() can proceed and call set_pmd_at without having to flush the TLB nor any secondary MMUs because the secondary MMU invalidate, just like the CPU TLB flush, has to happen before the migrate_page_copy() is called or it would be a bug in the first place (and it was for drivers using ->invalidate_range()). KVM is unaffected because it doesn't implement ->invalidate_range(). The standard PAGE_SIZEd migrate_misplaced_page is less accelerated and uses the generic migrate_pages which transitions the pte from numa/protnone to a migration entry in try_to_unmap_one() and flushes TLBs and all mmu notifiers there before copying the page. Link: http://lkml.kernel.org/r/20181013002430.698-3-aarcange@redhat.com Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Acked-by: Mel Gorman <mgorman@suse.de> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reviewed-by: Aaron Tomlin <atomlin@redhat.com> Cc: Jerome Glisse <jglisse@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2018-10-26 16:38:15 -07:00
Andrea Arcangeli	d7c3393413	mm: thp: fix MADV_DONTNEED vs migrate_misplaced_transhuge_page race condition Patch series "migrate_misplaced_transhuge_page race conditions". Aaron found a new instance of the THP MADV_DONTNEED race against pmdp_clear_flush* variants, that was apparently left unfixed. While looking into the race found by Aaron, I may have found two more issues in migrate_misplaced_transhuge_page. These race conditions would not cause kernel instability, but they'd corrupt userland data or leave data non zero after MADV_DONTNEED. I did only minor testing, and I don't expect to be able to reproduce this (especially the lack of ->invalidate_range before migrate_page_copy, requires the latest iommu hardware or infiniband to reproduce). The last patch is noop for x86 and it needs further review from maintainers of archs that implement flush_cache_range() (not in CC yet). To avoid confusion, it's not the first patch that introduces the bug fixed in the second patch, even before removing the pmdp_huge_clear_flush_notify, that _notify suffix was called after migrate_page_copy already run. This patch (of 3): This is a corollary of `ced108037c` ("thp: fix MADV_DONTNEED vs. numa balancing race"), `58ceeb6bec` ("thp: fix MADV_DONTNEED vs. MADV_FREE race") and `5b7abeae3a` ("thp: fix MADV_DONTNEED vs clear soft dirty race). When the above three fixes where posted Dave asked https://lkml.kernel.org/r/929b3844-aec2-0111-fef7-8002f9d4e2b9@intel.com but apparently this was missed. The pmdp_clear_flush* in migrate_misplaced_transhuge_page() was introduced in `a54a407fbf` ("mm: Close races between THP migration and PMD numa clearing"). The important part of such commit is only the part where the page lock is not released until the first do_huge_pmd_numa_page() finished disarming the pagenuma/protnone. The addition of pmdp_clear_flush() wasn't beneficial to such commit and there's no commentary about such an addition either. I guess the pmdp_clear_flush() in such commit was added just in case for safety, but it ended up introducing the MADV_DONTNEED race condition found by Aaron. At that point in time nobody thought of such kind of MADV_DONTNEED race conditions yet (they were fixed later) so the code may have looked more robust by adding the pmdp_clear_flush(). This specific race condition won't destabilize the kernel, but it can confuse userland because after MADV_DONTNEED the memory won't be zeroed out. This also optimizes the code and removes a superfluous TLB flush. [akpm@linux-foundation.org: reflow comment to 80 cols, fix grammar and typo (beacuse)] Link: http://lkml.kernel.org/r/20181013002430.698-2-aarcange@redhat.com Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Reported-by: Aaron Tomlin <atomlin@redhat.com> Acked-by: Mel Gorman <mgorman@suse.de> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Jerome Glisse <jglisse@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2018-10-26 16:38:15 -07:00
Clark Williams	026d1eaf5e	mm/kasan/quarantine.c: make quarantine_lock a raw_spinlock_t The static lock quarantine_lock is used in quarantine.c to protect the quarantine queue datastructures. It is taken inside quarantine queue manipulation routines (quarantine_put(), quarantine_reduce() and quarantine_remove_cache()), with IRQs disabled. This is not a problem on a stock kernel but is problematic on an RT kernel where spin locks are sleeping spinlocks, which can sleep and can not be acquired with disabled interrupts. Convert the quarantine_lock to a raw spinlock_t. The usage of quarantine_lock is confined to quarantine.c and the work performed while the lock is held is used for debug purpose. [bigeasy@linutronix.de: slightly altered the commit message] Link: http://lkml.kernel.org/r/20181010214945.5owshc3mlrh74z4b@linutronix.de Signed-off-by: Clark Williams <williams@redhat.com> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Acked-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Acked-by: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2018-10-26 16:38:15 -07:00
Keith Busch	df06b37ffe	mm/gup: cache dev_pagemap while pinning pages Getting pages from ZONE_DEVICE memory needs to check the backing device's live-ness, which is tracked in the device's dev_pagemap metadata. This metadata is stored in a radix tree and looking it up adds measurable software overhead. This patch avoids repeating this relatively costly operation when dev_pagemap is used by caching the last dev_pagemap while getting user pages. The gup_benchmark kernel self test reports this reduces time to get user pages to as low as 1/3 of the previous time. Link: http://lkml.kernel.org/r/20181012173040.15669-1-keith.busch@intel.com Signed-off-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Dave Hansen <dave.hansen@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2018-10-26 16:38:15 -07:00
Masayoshi Mizuma	9fd61bc951	Revert "x86/e820: put !E820_TYPE_RAM regions into memblock.reserved" commit `124049decb` ("x86/e820: put !E820_TYPE_RAM regions into memblock.reserved") breaks movable_node kernel option because it changed the memory gap range to reserved memblock. So, the node is marked as Normal zone even if the SRAT has Hot pluggable affinity. ===================================================================== kernel: BIOS-e820: [mem 0x0000180000000000-0x0000180fffffffff] usable kernel: BIOS-e820: [mem 0x00001c0000000000-0x00001c0fffffffff] usable ... kernel: reserved[0x12]#011[0x0000181000000000-0x00001bffffffffff], 0x000003f000000000 bytes flags: 0x0 ... kernel: ACPI: SRAT: Node 2 PXM 6 [mem 0x180000000000-0x1bffffffffff] hotplug kernel: ACPI: SRAT: Node 3 PXM 7 [mem 0x1c0000000000-0x1fffffffffff] hotplug ... kernel: Movable zone start for each node kernel: Node 3: 0x00001c0000000000 kernel: Early memory node ranges ... ===================================================================== The original issue is fixed by the former patches, so let's revert commit `124049decb` ("x86/e820: put !E820_TYPE_RAM regions into memblock.reserved"). Link: http://lkml.kernel.org/r/20181002143821.5112-4-msys.mizuma@gmail.com Signed-off-by: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com> Reviewed-by: Pavel Tatashin <pavel.tatashin@microsoft.com> Acked-by: Ingo Molnar <mingo@kernel.org> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Oscar Salvador <osalvador@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2018-10-26 16:38:15 -07:00
Pavel Tatashin	ec393a0f01	mm: return zero_resv_unavail optimization When checking for valid pfns in zero_resv_unavail(), it is not necessary to verify that pfns within pageblock_nr_pages ranges are valid, only the first one needs to be checked. This is because memory for pages are allocated in contiguous chunks that contain pageblock_nr_pages struct pages. Link: http://lkml.kernel.org/r/20181002143821.5112-3-msys.mizuma@gmail.com Signed-off-by: Pavel Tatashin <pavel.tatashin@microsoft.com> Signed-off-by: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com> Reviewed-by: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com> Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: Ingo Molnar <mingo@kernel.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2018-10-26 16:38:15 -07:00
Naoya Horiguchi	907ec5fca3	mm: zero remaining unavailable struct pages Patch series "mm: Fix for movable_node boot option", v3. This patch series contains a fix for the movable_node boot option issue which was introduced by commit `124049decb` ("x86/e820: put !E820_TYPE_RAM regions into memblock.reserved"). The commit breaks the option because it changed the memory gap range to reserved memblock. So, the node is marked as Normal zone even if the SRAT has Hot pluggable affinity. First and second patch fix the original issue which the commit tried to fix, then revert the commit. This patch (of 3): There is a kernel panic that is triggered when reading /proc/kpageflags on the kernel booted with kernel parameter 'memmap=nn[KMG]!ss[KMG]': BUG: unable to handle kernel paging request at fffffffffffffffe PGD 9b20e067 P4D 9b20e067 PUD 9b210067 PMD 0 Oops: 0000 [#1] SMP PTI CPU: 2 PID: 1728 Comm: page-types Not tainted 4.17.0-rc6-mm1-v4.17-rc6-180605-0816-00236-g2dfb086ef02c+ #160 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-2.fc28 04/01/2014 RIP: 0010:stable_page_flags+0x27/0x3c0 Code: 00 00 00 0f 1f 44 00 00 48 85 ff 0f 84 a0 03 00 00 41 54 55 49 89 fc 53 48 8b 57 08 48 8b 2f 48 8d 42 ff 83 e2 01 48 0f 44 c7 <48> 8b 00 f6 c4 01 0f 84 10 03 00 00 31 db 49 8b 54 24 08 4c 89 e7 RSP: 0018:ffffbbd44111fde0 EFLAGS: 00010202 RAX: fffffffffffffffe RBX: 00007fffffffeff9 RCX: 0000000000000000 RDX: 0000000000000001 RSI: 0000000000000202 RDI: ffffed1182fff5c0 RBP: ffffffffffffffff R08: 0000000000000001 R09: 0000000000000001 R10: ffffbbd44111fed8 R11: 0000000000000000 R12: ffffed1182fff5c0 R13: 00000000000bffd7 R14: 0000000002fff5c0 R15: ffffbbd44111ff10 FS: 00007efc4335a500(0000) GS:ffff93a5bfc00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: fffffffffffffffe CR3: 00000000b2a58000 CR4: 00000000001406e0 Call Trace: kpageflags_read+0xc7/0x120 proc_reg_read+0x3c/0x60 __vfs_read+0x36/0x170 vfs_read+0x89/0x130 ksys_pread64+0x71/0x90 do_syscall_64+0x5b/0x160 entry_SYSCALL_64_after_hwframe+0x44/0xa9 RIP: 0033:0x7efc42e75e23 Code: 09 00 ba 9f 01 00 00 e8 ab 81 f4 ff 66 2e 0f 1f 84 00 00 00 00 00 90 83 3d 29 0a 2d 00 00 75 13 49 89 ca b8 11 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 34 c3 48 83 ec 08 e8 db d3 01 00 48 89 04 24 According to kernel bisection, this problem became visible due to commit `f7f99100d8` which changes how struct pages are initialized. Memblock layout affects the pfn ranges covered by node/zone. Consider that we have a VM with 2 NUMA nodes and each node has 4GB memory, and the default (no memmap= given) memblock layout is like below: MEMBLOCK configuration: memory size = 0x00000001fff75c00 reserved size = 0x000000000300c000 memory.cnt = 0x4 memory[0x0] [0x0000000000001000-0x000000000009efff], 0x000000000009e000 bytes on node 0 flags: 0x0 memory[0x1] [0x0000000000100000-0x00000000bffd6fff], 0x00000000bfed7000 bytes on node 0 flags: 0x0 memory[0x2] [0x0000000100000000-0x000000013fffffff], 0x0000000040000000 bytes on node 0 flags: 0x0 memory[0x3] [0x0000000140000000-0x000000023fffffff], 0x0000000100000000 bytes on node 1 flags: 0x0 ... If you give memmap=1G!4G (so it just covers memory[0x2]), the range [0x100000000-0x13fffffff] is gone: MEMBLOCK configuration: memory size = 0x00000001bff75c00 reserved size = 0x000000000300c000 memory.cnt = 0x3 memory[0x0] [0x0000000000001000-0x000000000009efff], 0x000000000009e000 bytes on node 0 flags: 0x0 memory[0x1] [0x0000000000100000-0x00000000bffd6fff], 0x00000000bfed7000 bytes on node 0 flags: 0x0 memory[0x2] [0x0000000140000000-0x000000023fffffff], 0x0000000100000000 bytes on node 1 flags: 0x0 ... This causes shrinking node 0's pfn range because it is calculated by the address range of memblock.memory. So some of struct pages in the gap range are left uninitialized. We have a function zero_resv_unavail() which does zeroing the struct pages outside memblock.memory, but currently it covers only the reserved unavailable range (i.e. memblock.memory && !memblock.reserved). This patch extends it to cover all unavailable range, which fixes the reported issue. Link: http://lkml.kernel.org/r/20181002143821.5112-2-msys.mizuma@gmail.com Fixes: `f7f99100d8` ("mm: stop zeroing memory during allocation in vmemmap") Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Signed-off-by-by: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com> Tested-by: Oscar Salvador <osalvador@suse.de> Tested-by: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com> Reviewed-by: Pavel Tatashin <pavel.tatashin@microsoft.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2018-10-26 16:38:15 -07:00
Keith Busch	3821b76c3c	tools/testing/selftests/vm/gup_benchmark.c: add MAP_HUGETLB option Add a new option '-H' to the gup benchmark to help understand how hugetlb mapping pages compare with the default. Link: http://lkml.kernel.org/r/20181010195605.10689-6-keith.busch@intel.com Signed-off-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Kirill Shutemov <kirill.shutemov@linux.intel.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2018-10-26 16:38:15 -07:00
Keith Busch	0dd8666afb	tools/testing/selftests/vm/gup_benchmark.c: add MAP_SHARED option Add a new benchmark option, -S, to request MAP_SHARED. This can be used to compare with MAP_PRIVATE, or for files that require this option, like dax. Link: http://lkml.kernel.org/r/20181010195605.10689-5-keith.busch@intel.com Signed-off-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Kirill Shutemov <kirill.shutemov@linux.intel.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2018-10-26 16:38:15 -07:00
Keith Busch	aeb85ed4f4	tools/testing/selftests/vm/gup_benchmark.c: allow user specified file Allow a user to specify a file to map by adding a new option, '-f', providing a means to test various file backings. If not specified, the benchmark will use a private mapping of /dev/zero, which produces an anonymous mapping as before. [akpm@linux-foundation.org: avoid using comma operator] Link: http://lkml.kernel.org/r/20181010195605.10689-4-keith.busch@intel.com Signed-off-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Kirill Shutemov <kirill.shutemov@linux.intel.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2018-10-26 16:38:15 -07:00
Keith Busch	319e0bec1a	tools/testing/selftests/vm/gup_benchmark.c: fix 'write' flag usage If the '-w' parameter was provided, the benchmark would exit due to a mssing 'break'. Link: http://lkml.kernel.org/r/20181010195605.10689-3-keith.busch@intel.com Signed-off-by: Keith Busch <keith.busch@intel.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2018-10-26 16:38:15 -07:00
Keith Busch	714a3a1eba	mm/gup_benchmark.c: add additional pinning methods Provide new gup benchmark ioctl commands to run different user page pinning methods, get_user_pages_longterm() and get_user_pages(), in addition to the existing get_user_pages_fast(). Link: http://lkml.kernel.org/r/20181010195605.10689-2-keith.busch@intel.com Signed-off-by: Keith Busch <keith.busch@intel.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2018-10-26 16:38:15 -07:00
Keith Busch	26db3d09d9	mm/gup_benchmark.c: time put_page() We'd like to measure time to unpin user pages, so this adds a second benchmark timer on put_page, separate from get_page. Adding the field breaks this ioctl ABI, but should be okay since this an in-tree kernel selftest. [akpm@linux-foundation.org: add expansion to struct gup_benchmark for future use] Link: http://lkml.kernel.org/r/20181010195605.10689-1-keith.busch@intel.com Signed-off-by: Keith Busch <keith.busch@intel.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2018-10-26 16:38:15 -07:00
Roman Gushchin	7a1adfddaf	mm: don't raise MEMCG_OOM event due to failed high-order allocation It was reported that on some of our machines containers were restarted with OOM symptoms without an obvious reason. Despite there were almost no memory pressure and plenty of page cache, MEMCG_OOM event was raised occasionally, causing the container management software to think, that OOM has happened. However, no tasks have been killed. The following investigation showed that the problem is caused by a failing attempt to charge a high-order page. In such case, the OOM killer is never invoked. As shown below, it can happen under conditions, which are very far from a real OOM: e.g. there is plenty of clean page cache and no memory pressure. There is no sense in raising an OOM event in this case, as it might confuse a user and lead to wrong and excessive actions (e.g. restart the workload, as in my case). Let's look at the charging path in try_charge(). If the memory usage is about memory.max, which is absolutely natural for most memory cgroups, we try to reclaim some pages. Even if we were able to reclaim enough memory for the allocation, the following check can fail due to a race with another concurrent allocation: if (mem_cgroup_margin(mem_over_limit) >= nr_pages) goto retry; For regular pages the following condition will save us from triggering the OOM: if (nr_reclaimed && nr_pages <= (1 << PAGE_ALLOC_COSTLY_ORDER)) goto retry; But for high-order allocation this condition will intentionally fail. The reason behind is that we'll likely fall to regular pages anyway, so it's ok and even preferred to return ENOMEM. In this case the idea of raising MEMCG_OOM looks dubious. Fix this by moving MEMCG_OOM raising to mem_cgroup_oom() after allocation order check, so that the event won't be raised for high order allocations. This change doesn't affect regular pages allocation and charging. Link: http://lkml.kernel.org/r/20181004214050.7417-1-guro@fb.com Signed-off-by: Roman Gushchin <guro@fb.com> Acked-by: David Rientjes <rientjes@google.com> Acked-by: Michal Hocko <mhocko@kernel.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2018-10-26 16:38:14 -07:00

1 2 3 4 5 ...

795418 Commits