Commit Graph

750228 Commits

Author SHA1 Message Date
Alexey Kodanev
7d6850f7c6 ipv6: add a wrapper for ip6_dst_store() with flowi6 checks
Move commonly used pattern of ip6_dst_store() usage to a separate
function - ip6_sk_dst_store_flow(), which will check the addresses
for equality using the flow information, before saving them.

There is no functional changes in this patch. In addition, it will
be used in the next patch, in ip6_sk_dst_lookup_flow().

Signed-off-by: Alexey Kodanev <alexey.kodanev@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-04 11:31:57 -04:00
Russell King
0d3ad8549c net: phy: marvell10g: add thermal hwmon device
Add a thermal monitoring device for the Marvell 88x3310, which updates
once a second.  We also need to hook into the suspend/resume mechanism
to ensure that the thermal monitoring is reconfigured when we resume.

Suggested-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-04 11:24:54 -04:00
Eric Dumazet
bfacfb457b pptp: remove a buggy dst release in pptp_connect()
Once dst has been cached in socket via sk_setup_caps(),
it is illegal to call ip_rt_put() (or dst_release()),
since sk_setup_caps() did not change dst refcount.

We can still dereference it since we hold socket lock.

Caugth by syzbot :

BUG: KASAN: use-after-free in atomic_dec_return include/asm-generic/atomic-instrumented.h:198 [inline]
BUG: KASAN: use-after-free in dst_release+0x27/0xa0 net/core/dst.c:185
Write of size 4 at addr ffff8801c54dc040 by task syz-executor4/20088

CPU: 1 PID: 20088 Comm: syz-executor4 Not tainted 4.16.0+ #376
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
 __dump_stack lib/dump_stack.c:17 [inline]
 dump_stack+0x1a7/0x27d lib/dump_stack.c:53
 print_address_description+0x73/0x250 mm/kasan/report.c:256
 kasan_report_error mm/kasan/report.c:354 [inline]
 kasan_report+0x23c/0x360 mm/kasan/report.c:412
 check_memory_region_inline mm/kasan/kasan.c:260 [inline]
 check_memory_region+0x137/0x190 mm/kasan/kasan.c:267
 kasan_check_write+0x14/0x20 mm/kasan/kasan.c:278
 atomic_dec_return include/asm-generic/atomic-instrumented.h:198 [inline]
 dst_release+0x27/0xa0 net/core/dst.c:185
 sk_dst_set include/net/sock.h:1812 [inline]
 sk_dst_reset include/net/sock.h:1824 [inline]
 sock_setbindtodevice net/core/sock.c:610 [inline]
 sock_setsockopt+0x431/0x1b20 net/core/sock.c:707
 SYSC_setsockopt net/socket.c:1845 [inline]
 SyS_setsockopt+0x2ff/0x360 net/socket.c:1828
 do_syscall_64+0x281/0x940 arch/x86/entry/common.c:287
 entry_SYSCALL_64_after_hwframe+0x42/0xb7
RIP: 0033:0x4552d9
RSP: 002b:00007f4878126c68 EFLAGS: 00000246 ORIG_RAX: 0000000000000036
RAX: ffffffffffffffda RBX: 00007f48781276d4 RCX: 00000000004552d9
RDX: 0000000000000019 RSI: 0000000000000001 RDI: 0000000000000013
RBP: 000000000072bea0 R08: 0000000000000010 R09: 0000000000000000
R10: 00000000200010c0 R11: 0000000000000246 R12: 00000000ffffffff
R13: 0000000000000526 R14: 00000000006fac30 R15: 0000000000000000

Allocated by task 20088:
 save_stack+0x43/0xd0 mm/kasan/kasan.c:447
 set_track mm/kasan/kasan.c:459 [inline]
 kasan_kmalloc+0xad/0xe0 mm/kasan/kasan.c:552
 kasan_slab_alloc+0x12/0x20 mm/kasan/kasan.c:489
 kmem_cache_alloc+0x12e/0x760 mm/slab.c:3542
 dst_alloc+0x11f/0x1a0 net/core/dst.c:104
 rt_dst_alloc+0xe9/0x540 net/ipv4/route.c:1520
 __mkroute_output net/ipv4/route.c:2265 [inline]
 ip_route_output_key_hash_rcu+0xa49/0x2c60 net/ipv4/route.c:2493
 ip_route_output_key_hash+0x20b/0x370 net/ipv4/route.c:2322
 __ip_route_output_key include/net/route.h:126 [inline]
 ip_route_output_flow+0x26/0xa0 net/ipv4/route.c:2577
 ip_route_output_ports include/net/route.h:163 [inline]
 pptp_connect+0xa84/0x1170 drivers/net/ppp/pptp.c:453
 SYSC_connect+0x213/0x4a0 net/socket.c:1639
 SyS_connect+0x24/0x30 net/socket.c:1620
 do_syscall_64+0x281/0x940 arch/x86/entry/common.c:287
 entry_SYSCALL_64_after_hwframe+0x42/0xb7

Freed by task 20082:
 save_stack+0x43/0xd0 mm/kasan/kasan.c:447
 set_track mm/kasan/kasan.c:459 [inline]
 __kasan_slab_free+0x11a/0x170 mm/kasan/kasan.c:520
 kasan_slab_free+0xe/0x10 mm/kasan/kasan.c:527
 __cache_free mm/slab.c:3486 [inline]
 kmem_cache_free+0x83/0x2a0 mm/slab.c:3744
 dst_destroy+0x266/0x380 net/core/dst.c:140
 dst_destroy_rcu+0x16/0x20 net/core/dst.c:153
 __rcu_reclaim kernel/rcu/rcu.h:178 [inline]
 rcu_do_batch kernel/rcu/tree.c:2675 [inline]
 invoke_rcu_callbacks kernel/rcu/tree.c:2930 [inline]
 __rcu_process_callbacks kernel/rcu/tree.c:2897 [inline]
 rcu_process_callbacks+0xd6c/0x17b0 kernel/rcu/tree.c:2914
 __do_softirq+0x2d7/0xb85 kernel/softirq.c:285

The buggy address belongs to the object at ffff8801c54dc000
 which belongs to the cache ip_dst_cache of size 168
The buggy address is located 64 bytes inside of
 168-byte region [ffff8801c54dc000, ffff8801c54dc0a8)
The buggy address belongs to the page:
page:ffffea0007153700 count:1 mapcount:0 mapping:ffff8801c54dc000 index:0x0
flags: 0x2fffc0000000100(slab)
raw: 02fffc0000000100 ffff8801c54dc000 0000000000000000 0000000100000010
raw: ffffea0006b34b20 ffffea0006b6c1e0 ffff8801d674a1c0 0000000000000000
page dumped because: kasan: bad access detected

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-04 11:17:55 -04:00
Florian Fainelli
18bd5949e5 net: dsa: mt7530: Use NULL instead of plain integer
We would be passing 0 instead of NULL as the rsp argument to
mt7530_fdb_cmd(), fix that.

Fixes: b8f126a8d5 ("net-next: dsa: add dsa support for Mediatek MT7530 switch")
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Acked-by: Sean Wang <sean.wang@mediatek.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-04 11:15:49 -04:00
Florian Fainelli
861690d054 net: dsa: b53: Fix sparse warnings in b53_mmap.c
sparse complains about the following warnings:

drivers/net/dsa/b53/b53_mmap.c:33:31: warning: incorrect type in
initializer (different address spaces)
drivers/net/dsa/b53/b53_mmap.c:33:31:    expected unsigned char
[noderef] [usertype] <asn:2>*regs
drivers/net/dsa/b53/b53_mmap.c:33:31:    got void *priv

and indeed, while what we are doing is functional, we are dereferencing
a void * pointer into a void __iomem * which is not great. Just use the
defined b53_mmap_priv structure which holds our register base and use
that.

Fixes: 967dd82ffc ("net: dsa: b53: Add support for Broadcom RoboSwitch")
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-04 11:15:27 -04:00
Cong Wang
3848ec5dc8 af_unix: remove redundant lockdep class
After commit 581319c586 ("net/socket: use per af lockdep classes for sk queues")
sock queue locks now have per-af lockdep classes, including unix socket.
It is no longer necessary to workaround it.

I noticed this while looking at a syzbot deadlock report, this patch
itself doesn't fix it (this is why I don't add Reported-by).

Fixes: 581319c586 ("net/socket: use per af lockdep classes for sk queues")
Cc: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-04 11:13:40 -04:00
David S. Miller
51508179ec Merge branch 'net-Broadcom-drivers-sparse-fixes'
Florian Fainelli says:

====================
net: Broadcom drivers sparse fixes

This patch series fixes the same warning reported by sparse in bcmsysport and
bcmgenet in the code that deals with inserting the TX checksum pointers:

drivers/net/ethernet/broadcom/bcmsysport.c:1155:26: warning: cast from restricted __be16
drivers/net/ethernet/broadcom/bcmsysport.c:1155:26: warning: incorrect type in argument 1 (different base types)
drivers/net/ethernet/broadcom/bcmsysport.c:1155:26:    expected unsigned short [unsigned] [usertype] val
drivers/net/ethernet/broadcom/bcmsysport.c:1155:26:    got restricted __be16 [usertype] protocol

This patch fixes both issues by using the same construct and not swapping
skb->protocol but instead the values we are checking against.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-04 11:07:21 -04:00
Florian Fainelli
c0eb05585d net: systemport: Fix sparse warnings in bcm_sysport_insert_tsb()
skb->protocol is a __be16 which we would be calling htons() against,
while this is not wrong per-se as it correctly results in swapping the
value on LE hosts, this still upsets sparse. Adopt a similar pattern to
what other drivers do and just assign ip_ver to skb->protocol, and then
use htons() against the different constants such that the compiler can
resolve the values at build time.

Fixes: 80105befdb ("net: systemport: add Broadcom SYSTEMPORT Ethernet MAC driver")
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-04 11:07:21 -04:00
Florian Fainelli
6f89421180 net: bcmgenet: Fix sparse warnings in bcmgenet_put_tx_csum()
skb->protocol is a __be16 which we would be calling htons() against,
while this is not wrong per-se as it correctly results in swapping the
value on LE hosts, this still upsets sparse. Adopt a similar pattern to
what other drivers do and just assign ip_ver to skb->protocol, and then
use htons() against the different constants such that the compiler can
resolve the values at build time.

Fixes: 1c1008c793 ("net: bcmgenet: add main driver file")
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-04 11:07:21 -04:00
David Howells
b41d7cfef5 rxrpc: Fix undefined packet handling
By analogy with other Rx implementations, RxRPC packet types 9, 10 and 11
should just be discarded rather than being aborted like other undefined
packet types.

Reported-by: Jeffrey Altman <jaltman@auristor.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-04 11:04:08 -04:00
John Garry
6183d9b3ce MAINTAINERS: Add John Garry as maintainer for HiSilicon LPC driver
Add John Garry as maintainer for drivers/bus/hisi_lpc.c, the HiSilicon LPC
driver.

Signed-off-by: John Garry <john.garry@huawei.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
2018-04-04 08:42:52 -05:00
John Garry
e0aa1563f8 HISI LPC: Add ACPI support
Based on the previous patches, this patch supports the LPC host on
Hip06/Hip07 for ACPI FW.

It is the responsibility of the LPC host driver to enumerate the child
devices, as the ACPI scan code will not enumerate children of "indirect IO"
hosts.

The ACPI table for the LPC host controller and the child devices is in the
following format:

  Device (LPC0) {
    Name (_HID, "HISI0191")  // HiSi LPC
    Name (_CRS, ResourceTemplate () {
      Memory32Fixed (ReadWrite, 0xa01b0000, 0x1000)
    })
  }

  Device (LPC0.IPMI) {
    Name (_HID, "IPI0001")
    Name (LORS, ResourceTemplate() {
      QWordIO (
        ResourceConsumer,
        MinNotFixed,     // _MIF
        MaxNotFixed,     // _MAF
        PosDecode,
        EntireRange,
        0x0,             // _GRA
        0xe4,            // _MIN
        0x3fff,          // _MAX
        0x0,             // _TRA
        0x04,            // _LEN
        , ,
        BTIO
      )
    })

Since the IO resources of the child devices need to be translated from LPC
bus addresses to logical PIO addresses, and we shouldn't modify the
resources of the devices generated in the FW scan, a per-child MFD is
created as a substitute.  The MFD IO resources will be the translated bus
addresses of the ACPI child.

Tested-by: dann frazier <dann.frazier@canonical.com>
Signed-off-by: John Garry <john.garry@huawei.com>
Signed-off-by: Zhichang Yuan <yuanzhichang@hisilicon.com>
Signed-off-by: Gabriele Paoloni <gabriele.paoloni@huawei.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Andy Shevchenko <andy.shevchenko@gmail.com>
2018-04-04 08:42:51 -05:00
John Garry
dfda449232 ACPI / scan: Do not enumerate Indirect IO host children
Through the logical PIO framework, systems which otherwise have no IO space
access to legacy ISA/LPC devices may access these devices through so-called
"indirect IO" method.  In this, IO space accesses for non-PCI hosts are
redirected to a host LLDD to manually generate the IO space (bus) accesses.
Hosts are able to register a region in logical PIO space to map to its bus
address range.

Indirect IO child devices have an associated host-specific bus address.
Special translation is required to map between a logical PIO address for a
device and its host bus address.

Since in the ACPI tables the child device IO resources would be the
host-specific values, it is required the ACPI scan code should not
enumerate these devices, and that this should be the responsibility of the
host driver so that it can "fixup" the resources so that they map to the
appropriate logical PIO addresses.

To avoid enumerating these child devices, add a check from
acpi_device_enumeration_by_parent() as to whether the parent for a device
is a member of a known list of "indirect IO" hosts.  For now, the HiSilicon
LPC host controller ID is added.

Tested-by: dann frazier <dann.frazier@canonical.com>
Signed-off-by: John Garry <john.garry@huawei.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Andy Shevchenko <andy.shevchenko@gmail.com>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2018-04-04 08:42:50 -05:00
John Garry
d87fb0917a ACPI / scan: Rename acpi_is_serial_bus_slave() for more general use
Currently the ACPI scan has special handling for serial bus slaves, in that
it makes it the responsibility of the slave device's parent to enumerate
the device.

To support other types of slave devices which require the same special
handling but where the bus is not strictly a serial bus, such as devices on
the HiSilicon LPC controller bus, rename acpi_is_serial_bus_slave() to
acpi_device_enumeration_by_parent(), so that the name can fit the wider
purpose.

Also rename the associated device flag acpi_device_flags.serial_bus_slave
to .enumeration_by_parent.

Signed-off-by: John Garry <john.garry@huawei.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2018-04-04 08:42:49 -05:00
Zhichang Yuan
adf38bb0b5 HISI LPC: Support the LPC host on Hip06/Hip07 with DT bindings
The low-pin-count (LPC) interface of Hip06/Hip07 accesses I/O port space of
peripherals.

Implement the LPC host controller driver which performs the I/O operations
on the underlying hardware.  We don't want to touch existing drivers such
as ipmi-bt, so this driver applies the indirect-IO introduced in the
previous patch after registering an indirect-IO node to the indirect-IO
devices list which will be searched by the I/O accessors to retrieve the
host-local I/O port.

The driver config is set as a bool instead of a tristate.  The reason here
is that, by the very nature of the driver providing a logical PIO range, it
does not make sense to have this driver as a loadable module.  Another more
specific reason is that the Huawei D03 board which includes Hip06 SoC
requires the LPC bus for UART console, so should be built in.

Tested-by: dann frazier <dann.frazier@canonical.com>
Signed-off-by: Zou Rongrong <zourongrong@huawei.com>
Signed-off-by: Zhichang Yuan <yuanzhichang@hisilicon.com>
Signed-off-by: John Garry <john.garry@huawei.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Andy Shevchenko <andy.shevchenko@gmail.com>
Acked-by: Rob Herring <robh@kernel.org>	# dts part
2018-04-04 08:42:48 -05:00
Zhichang Yuan
65af618d2c of: Add missing I/O range exception for indirect-IO devices
There are some special ISA/LPC devices that work on a specific I/O range
where it is not correct to specify a 'ranges' property in the DTS parent
node as CPU addresses translated from DTS node are only for memory space on
some architectures, such as ARM64.  Without the parent 'ranges' property,
of_translate_address() returns an error.

Here we add special handling for this case.

During the OF address translation, some checking will be performed to
identify whether the device node is registered as indirect-IO.  If it is,
the I/O translation will be done in a different way from that one of PCI
MMIO.  In this way, the I/O 'reg' property of the special ISA/LPC devices
will be parsed correctly.

Tested-by: dann frazier <dann.frazier@canonical.com>
Signed-off-by: Zhichang Yuan <yuanzhichang@hisilicon.com>
Signed-off-by: Gabriele Paoloni <gabriele.paoloni@huawei.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>    # earlier draft
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Andy Shevchenko <andy.shevchenko@gmail.com>
Acked-by: Rob Herring <robh@kernel.org>
2018-04-04 08:42:47 -05:00
Zhichang Yuan
5745392e0c PCI: Apply the new generic I/O management on PCI IO hosts
After introducing the new generic I/O space management (Logical PIO), the
original PCI MMIO relevant helpers need to be updated based on the new
interfaces defined in logical PIO.

Adapt the corresponding code to match the changes introduced by logical
PIO.

Tested-by: dann frazier <dann.frazier@canonical.com>
Signed-off-by: Zhichang Yuan <yuanzhichang@hisilicon.com>
Signed-off-by: Gabriele Paoloni <gabriele.paoloni@huawei.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>        # earlier draft
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Andy Shevchenko <andy.shevchenko@gmail.com>
2018-04-04 08:42:46 -05:00
Gabriele Paoloni
fcfaab3093 PCI: Add fwnode handler as input param of pci_register_io_range()
In preparation for having the PCI MMIO helpers use the new generic I/O
space management (logical PIO) we need to add the fwnode handler as an
extra input parameter.

Changes the signature of pci_register_io_range() and its callers as
needed.

Tested-by: dann frazier <dann.frazier@canonical.com>
Signed-off-by: Gabriele Paoloni <gabriele.paoloni@huawei.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Andy Shevchenko <andy.shevchenko@gmail.com>
Acked-by: Rob Herring <robh@kernel.org>
2018-04-04 08:42:45 -05:00
Gabriele Paoloni
e2515476ab PCI: Remove __weak tag from pci_register_io_range()
pci_register_io_range() has only one definition, so there is no need for
the __weak attribute.  Remove it.

Tested-by: dann frazier <dann.frazier@canonical.com>
Signed-off-by: Gabriele Paoloni <gabriele.paoloni@huawei.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Andy Shevchenko <andy.shevchenko@gmail.com>
2018-04-04 08:42:44 -05:00
David Howells
402cb8dda9 fscache: Attach the index key and aux data to the cookie
Attach copies of the index key and auxiliary data to the fscache cookie so
that:

 (1) The callbacks to the netfs for this stuff can be eliminated.  This
     can simplify things in the cache as the information is still
     available, even after the cache has relinquished the cookie.

 (2) Simplifies the locking requirements of accessing the information as we
     don't have to worry about the netfs object going away on us.

 (3) The cache can do lazy updating of the coherency information on disk.
     As long as the cache is flushed before reboot/poweroff, there's no
     need to update the coherency info on disk every time it changes.

 (4) Cookies can be hashed or put in a tree as the index key is easily
     available.  This allows:

     (a) Checks for duplicate cookies can be made at the top fscache layer
     	 rather than down in the bowels of the cache backend.

     (b) Caching can be added to a netfs object that has a cookie if the
     	 cache is brought online after the netfs object is allocated.

A certain amount of space is made in the cookie for inline copies of the
data, but if it won't fit there, extra memory will be allocated for it.

The downside of this is that live cache operation requires more memory.

Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Anna Schumaker <anna.schumaker@netapp.com>
Tested-by: Steve Dickson <steved@redhat.com>
2018-04-04 13:41:28 +01:00
David Howells
08c2e3d087 fscache: Add more tracepoints
Add more tracepoints to fscache, including:

 (*) fscache_page - Tracks netfs pages known to fscache.

 (*) fscache_check_page - Tracks the netfs querying whether a page is
     pending storage.

 (*) fscache_wake_cookie - Tracks cookies being woken up after a page
     completes/aborts storage in the cache.

 (*) fscache_op - Tracks operations being initialised.

 (*) fscache_wrote_page - Tracks return of the backend write_page op.

 (*) fscache_gang_lookup - Tracks lookup of pages to be stored in the write
     operation.

Signed-off-by: David Howells <dhowells@redhat.com>
2018-04-04 13:41:27 +01:00
David Howells
a18feb5576 fscache: Add tracepoints
Add some tracepoints to fscache:

 (*) fscache_cookie - Tracks a cookie's usage count.

 (*) fscache_netfs - Logs registration of a network filesystem, including
     the pointer to the cookie allocated.

 (*) fscache_acquire - Logs cookie acquisition.

 (*) fscache_relinquish - Logs cookie relinquishment.

 (*) fscache_enable - Logs enablement of a cookie.

 (*) fscache_disable - Logs disablement of a cookie.

 (*) fscache_osm - Tracks execution of states in the object state machine.

and cachefiles:

 (*) cachefiles_ref - Tracks a cachefiles object's usage count.

 (*) cachefiles_lookup - Logs result of lookup_one_len().

 (*) cachefiles_mkdir - Logs result of vfs_mkdir().

 (*) cachefiles_create - Logs result of vfs_create().

 (*) cachefiles_unlink - Logs calls to vfs_unlink().

 (*) cachefiles_rename - Logs calls to vfs_rename().

 (*) cachefiles_mark_active - Logs an object becoming active.

 (*) cachefiles_wait_active - Logs a wait for an old object to be
     destroyed.

 (*) cachefiles_mark_inactive - Logs an object becoming inactive.

 (*) cachefiles_mark_buried - Logs the burial of an object.

Signed-off-by: David Howells <dhowells@redhat.com>
2018-04-04 13:41:27 +01:00
David Howells
2c98425720 fscache: Fix hanging wait on page discarded by writeback
If the fscache asynchronous write operation elects to discard a page that's
pending storage to the cache because the page would be over the store limit
then it needs to wake the page as someone may be waiting on completion of
the write.

The problem is that the store limit may be updated by a different
asynchronous operation - and so may miss the write - and that the store
limit may not even get updated until later by the netfs.

Fix the kernel hang by making fscache_write_op() mark as written any pages
that are over the limit.

Signed-off-by: David Howells <dhowells@redhat.com>
2018-04-04 13:41:26 +01:00
David Howells
d0fb31ecda fscache: Detect multiple relinquishment of a cookie
Report if an fscache cookie is relinquished multiple times by the netfs.

Signed-off-by: David <dhowells@redhat.com>
2018-04-04 13:41:26 +01:00
David Howells
b27ddd4624 fscache: Pass the correct cancelled indications to fscache_op_complete()
The last parameter to fscache_op_complete() is a bool indicating whether or
not the operation was cancelled.  A lot of the time the inverse value is
given or no differentiation is made.  Fix this.

Signed-off-by: David Howells <dhowells@redhat.com>
2018-04-04 13:41:26 +01:00
David Howells
bfa3837ec3 fscache, cachefiles: Fix checker warnings
Fix a couple of checker warnings in fscache and cachefiles:

 (1) fscache_n_op_requeue is never used, so get rid of it.

 (2) cachefiles_uncache_page() is passed in a lock that it releases, so
     this needs annotating.

Signed-off-by: David Howells <dhowells@redhat.com>
2018-04-04 13:41:26 +01:00
David Howells
678edd09c2 afs: Be more aggressive in retiring cached vnodes
When relinquishing cookies, either due to iget failure or to inode
eviction, retire a cookie if we think the corresponding vnode got deleted
on the server rather than just letting it lie in the cache.

Signed-off-by: David Howells <dhowells@redhat.com>
2018-04-04 13:41:26 +01:00
David Howells
27a3ee3a04 afs: Use the vnode ID uniquifier in the cache key not the aux data
AFS vnodes (files) are referenced by a triplet of { volume ID, vnode ID,
uniquifier }.  Currently, kafs is only using the vnode ID as the file key
in the volume fscache index and checking the uniquifier on cookie
acquisition against the contents of the auxiliary data stored in the cache.

Unfortunately, this is subject to a race in which an FS.RemoveFile or
FS.RemoveDir op is issued against the server but the local afs inode isn't
torn down and disposed off before another thread issues something like
FS.CreateFile.  The latter then gets given the vnode ID that just got
removed, but with a new uniquifier and a cookie collision occurs in the
cache because the cookie is only keyed on the vnode ID whereas the inode is
keyed on the vnode ID plus the uniquifier.

Fix this by keying the cookie on the uniquifier in addition to the vnode ID
and dropping the uniquifier from the auxiliary data supplied.

Signed-off-by: David Howells <dhowells@redhat.com>
2018-04-04 13:41:25 +01:00
David Howells
c1515999bd afs: Invalidate cache on server data change
Invalidate any data stored in fscache for a vnode that changes on the
server so that we don't end up with the cache in a bad state locally.

Signed-off-by: David Howells <dhowells@redhat.com>
2018-04-04 13:41:25 +01:00
Frederic Barrat
ad7b4e8022 cxl: Fix possible deadlock when processing page faults from cxllib
cxllib_handle_fault() is called by an external driver when it needs to
have the host resolve page faults for a buffer. The buffer can cover
several pages and VMAs. The function iterates over all the pages used
by the buffer, based on the page size of the VMA.

To ensure some stability while processing the faults, the thread T1
grabs the mm->mmap_sem semaphore with read access (R1). However, when
processing a page fault for a single page, one of the underlying
functions, copro_handle_mm_fault(), also grabs the same semaphore with
read access (R2). So the thread T1 takes the semaphore twice.

If another thread T2 tries to access the semaphore in write mode W1
(say, because it wants to allocate memory and calls 'brk'), then that
thread T2 will have to wait because there's a reader (R1). If the
thread T1 is processing a new page at that time, it won't get an
automatic grant at R2, because there's now a writer thread
waiting (T2). And we have a deadlock.

The timeline is:
1. thread T1 owns the semaphore with read access R1
2. thread T2 requests write access W1 and waits
3. thread T1 requests read access R2 and waits

The fix is for the thread T1 to release the semaphore R1 once it got
the information it needs from the current VMA. The address space/VMAs
could evolve while T1 iterates over the full buffer, but in the
unlikely case where T1 misses a page, the external driver will raise a
new page fault when retrying the memory access.

Fixes: 3ced8d7300 ("cxl: Export library to support IBM XSL")
Cc: stable@vger.kernel.org # 4.13+
Signed-off-by: Frederic Barrat <fbarrat@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-04-04 22:09:33 +10:00
Naveen N. Rao
5d6a03ebc8 powerpc/hw_breakpoint: Only disable hw breakpoint if cpu supports it
We get the below warning if we try to use kexec on P9:
   kexec_core: Starting new kernel
   WARNING: CPU: 0 PID: 1223 at arch/powerpc/kernel/process.c:826 __set_breakpoint+0xb4/0x140
   [snip]
   NIP __set_breakpoint+0xb4/0x140
   LR  kexec_prepare_cpus_wait+0x58/0x150
   Call Trace:
     0xc0000000ee70fb20 (unreliable)
     0xc0000000ee70fb20
     default_machine_kexec+0x234/0x2c0
     machine_kexec+0x84/0x90
     kernel_kexec+0xd8/0xe0
     SyS_reboot+0x214/0x2c0
     system_call+0x58/0x6c

This happens since we are trying to clear hw breakpoint on POWER9,
though we don't have CPU_FTR_DAWR enabled. Guard __set_breakpoint()
within hw_breakpoint_disable() with ppc_breakpoint_available() to
address this.

Fixes: 9654153158 ("powerpc: Disable DAWR in the base POWER9 CPU features")
Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-04-04 21:54:02 +10:00
Palmer Dabbelt
83fbdf1c05 openrisc: Set CONFIG_MULTI_IRQ_HANDLER
arm has an optional MULTI_IRQ_HANDLER, which openrisc copied but didn't
make optional.  The multi irq handler infrastructure has been copied to
generic code selectable with a new config symbol. That symbol can be
selected by randconfig builds and can cause build breakage.

Introduce CONFIG_MULTI_IRQ_HANDLER as an intermediate step which prevents
the core config symbol from being selected. The openrisc local config
symbol will be removed once openrisc gets converted to the generic code.

Signed-off-by: Palmer Dabbelt <palmer@sifive.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Link: https://lkml.kernel.org/r/20180404043130.31277-3-palmer@sifive.com
2018-04-04 12:04:28 +02:00
Palmer Dabbelt
667b24d049 arm64: Set CONFIG_MULTI_IRQ_HANDLER
arm has an optional MULTI_IRQ_HANDLER, which arm64 copied but didn't make
optional.  The multi irq handler infrastructure has been copied to generic
code selectable with a new config symbol. That symbol can be selected by
randconfig builds and can cause build breakage.

Introduce CONFIG_MULTI_IRQ_HANDLER as an intermediate step which prevents
the core config symbol from being selected. The arm64 local config symbol
will be removed once arm64 gets converted to the generic code.

Signed-off-by: Palmer Dabbelt <palmer@sifive.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Link: https://lkml.kernel.org/r/20180404043130.31277-2-palmer@sifive.com
2018-04-04 12:04:28 +02:00
Palmer Dabbelt
d6f73825dc genirq: Make GENERIC_IRQ_MULTI_HANDLER depend on !MULTI_IRQ_HANDLER
These config switches enable the same code in the core and the not yet
converted architecture code. They can be selected both by randconfig builds
and cause linker error because the same symbols are defined twice.

Make the new GENERIC_IRQ_MULTI_HANDLER depend on !MULTI_IRQ_HANDLER to
prevent that. The dependency will be removed once all architectures are
converted over.

Signed-off-by: Palmer Dabbelt <palmer@sifive.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Link: https://lkml.kernel.org/r/20180404043130.31277-4-palmer@sifive.com
2018-04-04 12:04:28 +02:00
Aneesh Kumar K.V
7a22d6321c powerpc/mm/radix: Update command line parsing for disable_radix
kernel parameter disable_radix takes different options
disable_radix=yes|no|1|0  or just disable_radix.

prom_init parsing is not supporting these options.

Fixes: 1fd6c02207 ("powerpc/mm: Add a CONFIG option to choose if radix is used by default")
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-04-04 16:59:50 +10:00
Aneesh Kumar K.V
cec4e9b28f powerpc/mm/radix: Parse disable_radix commandline correctly.
kernel parameter disable_radix takes different options
disable_radix=yes|no|1|0 or just disable_radix. When using the later
format we get below error.

 `Malformed early option 'disable_radix'`

Fixes: 1fd6c02207 ("powerpc/mm: Add a CONFIG option to choose if radix is used by default")
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-04-04 16:59:36 +10:00
Aneesh Kumar K.V
6fa504835d powerpc/mm/hugetlb: initialize the pagetable cache correctly for hugetlb
With 64k page size, we have hugetlb pte entries at the pmd and pud level for
book3s64. We don't need to create a separate page table cache for that. With 4k
we need to make sure hugepd page table cache for 16M is placed at PUD level
and 16G at the PGD level.

Simplify all these by not using HUGEPD_PD_SHIFT which is confusing for book3s64.

Without this patch, with 64k page size we create pagetable caches with shift
value 10 and 7 which are not used at all.

Fixes: 419df06eea ("powerpc: Reduce the PTE_INDEX_SIZE")

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-04-04 16:58:53 +10:00
Aneesh Kumar K.V
fb4e5dbd44 powerpc/mm/radix: Update pte fragment count from 16 to 256 on radix
With split PTL (page table lock) config, we allocate the level
4 (leaf) page table using pte fragment framework instead of slab cache
like other levels. This was done to enable us to have split page table
lock at the level 4 of the page table. We use page->plt backing the
all the level 4 pte fragment for the lock.

Currently with Radix, we use only 16 fragments out of the allocated
page. In radix each fragment is 256 bytes which means we use only 4k
out of the allocated 64K page wasting 60k of the allocated memory.
This was done earlier to keep it closer to hash.

This patch update the pte fragment count to 256, thereby using the
full 64K page and reducing the memory usage. Performance tests shows
really low impact even with THP disabled. With THP disabled we will be
contenting further less on level 4 ptl and hence the impact should be
further low.

  256 threads:
    without patch (10 runs of ./ebizzy  -m -n 1000 -s 131072 -S 100)
      median = 15678.5
      stdev = 42.1209

    with patch:
      median = 15354
      stdev = 194.743

This is with THP disabled. With THP enabled the impact of the patch
will be less.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-04-04 16:58:06 +10:00
Aneesh Kumar K.V
f2ed480fa4 powerpc/mm/keys: Update documentation and remove unnecessary check
Adds more code comments. We also remove an unnecessary pkey check
after we check for pkey error in this patch.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-04-04 15:23:09 +10:00
Al Viro
04bbc9795d Merge branch 'old.dcache' into work.dcache 2018-04-04 00:40:19 -04:00
Parav Pandit
414448d249 RDMA: Use ib_gid_attr during GID modification
Now that ib_gid_attr contains device, port and index, simplify the
provider APIs add_gid() and del_gid() to use device, port and index
fields from the ib_gid_attr attributes structure.

Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-04-03 21:34:16 -06:00
Parav Pandit
3e44e0ee08 IB/providers: Avoid null netdev check for RoCE
Now that IB core GID cache ensures that all RoCE entries have an
associated netdev remove null checks from the provider drivers for
clarity.

Reviewed-by: Mark Bloch <markb@mellanox.com>
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-04-03 21:33:51 -06:00
Parav Pandit
14169e333e IB/providers: Avoid zero GID check for RoCE
Now that the IB core GID cache ensures that a zero GID doesn't exist in
the GID table remove zero GID checks from the provider drivers for
clarity.

Reviewed-by: Mark Bloch <markb@mellanox.com>
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-04-03 21:33:51 -06:00
Parav Pandit
598ff6bae6 IB/core: Refactor GID modify code for RoCE
Code is refactored to prepare separate functions for RoCE which can do more
complex operations related to reference counting, while still
maintainining code readability. This includes
(a) Simplification to not perform netdevice checks and modifications
for IB link layer.
(b) Do not add RoCE GID entry which has NULL netdevice; instead return
an error.
(c) If GID addition fails at provider level add_gid(), do not add the
entry in the cache and keep the entry marked as INVALID.
(d) Simplify and reuse the ib_cache_gid_add()/del() routines so that they
can be used even for modifying default GIDs. This avoid some code
duplication in modifying default GIDs.
(e) find_gid() routine refers to the data entry flags to qualify a GID
as valid or invalid GID rather than depending on attributes and zeroness
of the GID content.
(f) gid_table_reserve_default() sets the GID default attribute at
beginning while setting up the GID table. There is no need to use
default_gid flag in low level functions such as write_gid(), add_gid(),
del_gid(), as they never need to update the DEFAULT property of the GID
entry while during GID table update.

As as result of this refactor, reserved GID 0:0:0:0:0:0:0:0 is no longer
searchable as described below.

A unicast GID entry of 0:0:0:0:0:0:0:0 is Reserved GID as per the IB
spec version 1.3 section 4.1.1, point (6) whose snippet is below.

"The unicast GID address 0:0:0:0:0:0:0:0 is reserved - referred to as
the Reserved GID. It shall never be assigned to any endport. It shall
not be used as a destination address or in a global routing header
(GRH)."

GID table cache now only stores valid GID entries. Before this patch,
Reserved GID 0:0:0:0:0:0:0:0 was searchable in the GID table using
ib_find_cached_gid_by_port() and other similar find routines.

Zero GID is no longer searchable as it shall not to be present in GRH or
path recored entry as described in IB spec version 1.3 section 4.1.1,
point (6), section 12.7.10 and section 12.7.20.

ib_cache_update() is simplified to check link layer once, use unified
locking scheme for all link layers, removed temporary gid table
allocation/free logic.

Additionally,
(a) Expand ib_gid_attr to store port and index so that GID query
routines can get port and index information from the attribute structure.
(b) Expand ib_gid_attr to store device as well so that in future code when
GID reference counting is done, device is used to reach back to the GID
table entry.

Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-04-03 21:33:50 -06:00
Parav Pandit
f35faa4ba9 IB/core: Simplify ib_query_gid to always refer to cache
Currently following inconsistencies exist.
1. ib_query_gid() returns GID from the software cache for a RoCE port
and returns GID from the HCA for an IB port.
This is incorrect because software GID cache is maintained regardless
of HCA port type.

2. GID is queries from the HCA via ib_query_gid and updated in the
software cache for IB link layer. Both of them might not be in sync.

ULPs such as SRP initiator, SRP target, IPoIB driver have historically
used ib_query_gid() API to query the GID. However CM used cached version
during CM processing, When software cache was introduced, this
inconsitency remained.

In order to simplify, improve readability and avoid link layer
specific above inconsistencies, this patch brings following changes.

1. ib_query_gid() always refers to the cache layer regardless of link
layer.

2. cache module who reads the GID entry from HCA and builds the cache,
directly invokes the HCA provider verb's query_gid() callback function.

3. ib_query_port() is being called in early stage where GID cache is not
yet build while reading port immutable property. Therefore it needs to
read the default GID from the HCA for IB link layer to publish the
subnet prefix.

Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-04-03 21:33:50 -06:00
Parav Pandit
0e1f9b9244 RDMA/providers: Simplify query_gid callback of RoCE providers
ib_query_gid() fetches the GID from the software cache maintained in
ib_core for RoCE ports.

Therefore, simplify the provider drivers for RoCE to treat query_gid()
callback as never called for RoCE, and only require non-RoCE devices to
implement it.

Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-04-03 21:33:47 -06:00
Parav Pandit
72e1ff0fb7 RDMA/core: Update query_gid documentation for HCA drivers
query_gid() should return right GID value for iWarp and IB link layers.
It is a no-op for RoCE link layer.  Update the documentation to reflect
this.

Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-04-03 21:09:01 -06:00
Roland Dreier
8435168d50 RDMA/ucma: Don't allow setting RDMA_OPTION_IB_PATH without an RDMA device
Check to make sure that ctx->cm_id->device is set before we use it.
Otherwise userspace can trigger a NULL dereference by doing
RDMA_USER_CM_CMD_SET_OPTION on an ID that is not bound to a device.

Cc: <stable@vger.kernel.org>
Reported-by: <syzbot+a67bc93e14682d92fc2f@syzkaller.appspotmail.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-04-03 21:06:14 -06:00
Linus Torvalds
17dec0a949 Merge branch 'userns-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
Pull namespace updates from Eric Biederman:
 "There was a lot of work this cycle fixing bugs that were discovered
  after the merge window and getting everything ready where we can
  reasonably support fully unprivileged fuse. The bug fixes you already
  have and much of the unprivileged fuse work is coming in via other
  trees.

  Still left for fully unprivileged fuse is figuring out how to cleanly
  handle .set_acl and .get_acl in the legacy case, and properly handling
  of evm xattrs on unprivileged mounts.

  Included in the tree is a cleanup from Alexely that replaced a linked
  list with a statically allocated fix sized array for the pid caches,
  which simplifies and speeds things up.

  Then there is are some cleanups and fixes for the ipc namespace. The
  motivation was that in reviewing other code it was discovered that
  access ipc objects from different pid namespaces recorded pids in such
  a way that when asked the wrong pids were returned. In the worst case
  there has been a measured 30% performance impact for sysvipc
  semaphores. Other test cases showed no measurable performance impact.
  Manfred Spraul and Davidlohr Bueso who tend to work on sysvipc
  performance both gave the nod that this is good enough.

  Casey Schaufler and James Morris have given their approval to the LSM
  side of the changes.

  I simplified the types and the code dealing with sysvipc to pass just
  kern_ipc_perm for all three types of ipc. Which reduced the header
  dependencies throughout the kernel and simplified the lsm code.

  Which let me work on the pid fixes without having to worry about
  trivial changes causing complete kernel recompiles"

* 'userns-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
  ipc/shm: Fix pid freeing.
  ipc/shm: fix up for struct file no longer being available in shm.h
  ipc/smack: Tidy up from the change in type of the ipc security hooks
  ipc: Directly call the security hook in ipc_ops.associate
  ipc/sem: Fix semctl(..., GETPID, ...) between pid namespaces
  ipc/msg: Fix msgctl(..., IPC_STAT, ...) between pid namespaces
  ipc/shm: Fix shmctl(..., IPC_STAT, ...) between pid namespaces.
  ipc/util: Helpers for making the sysvipc operations pid namespace aware
  ipc: Move IPCMNI from include/ipc.h into ipc/util.h
  msg: Move struct msg_queue into ipc/msg.c
  shm: Move struct shmid_kernel into ipc/shm.c
  sem: Move struct sem and struct sem_array into ipc/sem.c
  msg/security: Pass kern_ipc_perm not msg_queue into the msg_queue security hooks
  shm/security: Pass kern_ipc_perm not shmid_kernel into the shm security hooks
  sem/security: Pass kern_ipc_perm not sem_array into the sem security hooks
  pidns: simpler allocation of pid_* caches
2018-04-03 19:15:32 -07:00
Martin Brandenburg
dd0980226f orangefs: document package install and xfstests procedure
Unless one is working on the userspace code, there's no need to compile
OrangeFS.  The package works just fine.

(But leave the documentation for building from source since not everyone
uses distributions which include the package.)

Also document the process to run xfstests against OrangeFS.

Signed-off-by: Martin Brandenburg <martin@omnibond.com>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
2018-04-03 21:55:28 -04:00