Commit Graph

704772 Commits

Author SHA1 Message Date
Shaohua Li
8031c3ddc7 md/bitmap: copy correct data for bitmap super
raid5 cache could write bitmap superblock before bitmap superblock is
initialized. The bitmap superblock is less than 512B. The current code will
only copy the superblock to a new page and write the whole 512B, which will
zero the the data after the superblock. Unfortunately the data could include
bitmap, which we should preserve. The patch will make superblock read do 4k
chunk and we always copy the 4k data to new page, so the superblock write will
old data to disk and we don't change the bitmap.

Reported-by: Song Liu <songliubraving@fb.com>
Reviewed-by: Song Liu <songliubraving@fb.com>
Cc: stable@vger.kernel.org (4.10+)
Signed-off-by: Shaohua Li <shli@fb.com>
2017-08-24 10:04:54 -07:00
Wang Shilong
901ed070df ext4: reduce lock contention in __ext4_new_inode
While running number of creating file threads concurrently,
we found heavy lock contention on group spinlock:

FUNC                           TOTAL_TIME(us)       COUNT        AVG(us)
ext4_create                    1707443399           1440000      1185.72
_raw_spin_lock                 1317641501           180899929    7.28
jbd2__journal_start            287821030            1453950      197.96
jbd2_journal_get_write_access  33441470             73077185     0.46
ext4_add_nondir                29435963             1440000      20.44
ext4_add_entry                 26015166             1440049      18.07
ext4_dx_add_entry              25729337             1432814      17.96
ext4_mark_inode_dirty          12302433             5774407      2.13

most of cpu time blames to _raw_spin_lock, here is some testing
numbers with/without patch.

Test environment:
Server : SuperMicro Sever (2 x E5-2690 v3@2.60GHz, 128GB 2133MHz
         DDR4 Memory, 8GbFC)
Storage : 2 x RAID1 (DDN SFA7700X, 4 x Toshiba PX02SMU020 200GB
          Read Intensive SSD)

format command:
        mkfs.ext4 -J size=4096

test command:
        mpirun -np 48 mdtest -n 30000 -d /ext4/mdtest.out -F -C \
                -r -i 1 -v -p 10 -u #first run to load inode

        mpirun -np 48 mdtest -n 30000 -d /ext4/mdtest.out -F -C \
                -r -i 3 -v -p 10 -u

Kernel version: 4.13.0-rc3

Test  1,440,000 files with 48 directories by 48 processes:

Without patch:

File Creation   File removal
79,033          289,569 ops/per second
81,463          285,359
79,875          288,475

With patch:
File Creation   File removal
810669		301694
812805		302711
813965		297670

Creation performance is improved more than 10X with large
journal size. The main problem here is we test bitmap
and do some check and journal operations which could be
slept, then we test and set with lock hold, this could
be racy, and make 'inode' steal by other process.

However, after first try, we could confirm handle has
been started and inode bitmap journaled too, then
we could find and set bit with lock hold directly, this
will mostly gurateee success with second try.

Tested-by: Shuichi Ihara <sihara@ddn.com>
Signed-off-by: Wang Shilong <wshilong@ddn.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
2017-08-24 12:56:35 -04:00
Colin Ian King
c816c2558e netfilter: ebtables: fix indent on if statements
The returns on some if statements are not indented correctly,
add in the missing tab.

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-08-24 18:56:17 +02:00
Bjorn Andersson
8c2a75e568 of/device: Fix of_device_get_modalias() buffer handling
of_device_request_module() calls of_device_get_modalias() with "len" 0,
to calculate the size of the buffer needed to store the result, but due
to integer promotion the ssize_t "len" will be compared as unsigned with
strlen(compat) and the loop will generally never break. This results in
a call to snprintf() with a negative len, which triggers below warning,
followed by a dereference of a invalid pointer:

  [    3.060067] WARNING: CPU: 0 PID: 51 at lib/vsprintf.c:2122 vsnprintf+0x348/0x6d8
  ...
  [    3.060301] [<ffffff800891ede8>] vsnprintf+0x348/0x6d8
  [    3.060308] [<ffffff800891f248>] snprintf+0x48/0x50
  [    3.060316] [<ffffff80086a7c80>] of_device_get_modalias+0x108/0x160
  [    3.060322] [<ffffff80086a7cf8>] of_device_request_module+0x20/0x88
  ...

Further more of_device_get_modalias() is supposed to return the number
of bytes needed to store the entire modalias, so the loop needs to
continue accumulate the total size even though the buffer is full.

Finally the function is not expected to ensure space for the NUL, nor
include it in the returned size, so only 1 should be added to the length
of "compat" in the loop (to account for the character 'C').

Fixes: bc575064d6 ("of/device: use of_property_for_each_string to parse compatible strings")
Cc: Rob Herring <robh@kernel.org>
Signed-off-by: Bjorn Andersson <bjorn.andersson@linaro.org>
Signed-off-by: Rob Herring <robh@kernel.org>
2017-08-24 11:52:51 -05:00
Bjorn Andersson
08ab58d9de of/device: Prevent buffer overflow in of_device_modalias()
As of_device_get_modalias() returns the number of bytes that would have
been written to the target string, regardless of how much did fit in the
buffer, it's possible that the returned index points beyond the buffer
passed to of_device_modalias() - causing memory beyond the buffer to be
null terminated.

Fixes: 0634c29589 ("of: Add function for generating a DT modalias with a newline")
Cc: Rob Herring <robh@kernel.org>
Cc: stable@vger.kernel.org
Signed-off-by: Bjorn Andersson <bjorn.andersson@linaro.org>
Signed-off-by: Rob Herring <robh@kernel.org>
2017-08-24 11:52:38 -05:00
Florian Westphal
b3480fe059 netfilter: conntrack: make protocol tracker pointers const
Doesn't change generated code, but will make it easier to eventually
make the actual trackers themselvers const.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-08-24 18:52:33 +02:00
Florian Westphal
ea48cc83cf netfilter: conntrack: print_conntrack only needed if CONFIG_NF_CONNTRACK_PROCFS
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-08-24 18:52:33 +02:00
Florian Westphal
91950833dd netfilter: conntrack: place print_tuple in procfs part
CONFIG_NF_CONNTRACK_PROCFS is deprecated, no need to use a function
pointer in the trackers for this. Place the printf formatting in
the one place that uses it.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-08-24 18:52:32 +02:00
Florian Westphal
036c69400a netfilter: conntrack: reduce size of l4protocol trackers
can use u16 for both, shrinks size by another 8 bytes.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-08-24 18:52:32 +02:00
Florian Westphal
09ec82f5af netfilter: conntrack: remove protocol name from l4proto struct
no need to waste storage for something that is only needed
in one place and can be deduced from protocol number.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-08-24 18:52:32 +02:00
Florian Westphal
a3134d537f netfilter: conntrack: remove protocol name from l3proto struct
no need to waste storage for something that is only needed
in one place and can be deduced from protocol number.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-08-24 18:52:32 +02:00
Florian Westphal
0d03510038 netfilter: conntrack: compute l3proto nla size at compile time
avoids a pointer and allows struct to be const later on.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-08-24 18:52:32 +02:00
Helge Deller
fd46cd55fb printk-formats.txt: Add examples for %pF and %pS usage
Signed-off-by: Helge Deller <deller@gmx.de>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-08-24 18:49:52 +02:00
Nick Desaulniers
eee6ebbac1 netfilter: nf_nat_h323: fix logical-not-parentheses warning
Clang produces the following warning:

net/ipv4/netfilter/nf_nat_h323.c:553:6: error:
logical not is only applied to the left hand side of this comparison
  [-Werror,-Wlogical-not-parentheses]
if (!set_h225_addr(skb, protoff, data, dataoff, taddr,
    ^
add parentheses after the '!' to evaluate the comparison first
add parentheses around left hand side expression to silence this warning

There's not necessarily a bug here, but it's cleaner to return early,
ex:

if (x)
  return
...

rather than:

if (x == 0)
  ...
else
  return

Also added a return code check that seemed to be missing in one
instance.

Signed-off-by: Nick Desaulniers <ndesaulniers@google.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-08-24 18:48:05 +02:00
Helge Deller
d81f734462 parisc: Fix up devices below a PCI-PCI MegaRAID controller bridge
A MegaRAID PCI card in my rp5470 acts as PCI-PCI bridge.
Resource allocation for PCI devices behind such a bridge is quite incomplete,
so that syslog reports those warnings:

 LBA 0:10: PCI host bridge to bus 0000:50
 pci_bus 0000:50: root bus resource [io  0x80000-0x8ffff] (bus address [0x0000-0xffff])
 pci_bus 0000:50: root bus resource [mem 0xffffffff94000000-0xffffffff95ffffff] (bus address [0x94000000-0x95ffffff])
 pci_bus 0000:50: root bus resource [bus 50-57]
 pci 0000:50:00.0: [8086:0964] type 01 class 0x060400
 pci 0000:50:00.1: [8086:1960] type 00 class 0x0e0001
 pci 0000:50:00.1: reg 0x10: [mem 0x00000000-0x003fffff pref]
 pci 0000:50:00.1: reg 0x30: [mem 0x00000000-0x00007fff pref]
 pci 0000:50:00.0: Changing bridge control from 0x00000000 to 0x00000023
 pci_bus 0000:51: busn_res: can not insert [bus 51-ff] under [bus 50-57] (conflicts with (null) [bus 50-57])
 pci 0000:50:00.0: PCI bridge to [bus 51-ff]
 pci 0000:50:00.0:   bridge window [io  0x80000-0x80fff]
 pci 0000:50:00.0:   bridge window [mem 0x00000000-0x000fffff]
 pci 0000:50:00.0:   bridge window [mem 0x00000000-0x000fffff pref]
 pci 0000:50:00.0: can't claim BAR 14 [mem 0x00000000-0x000fffff]: no compatible bridge window
 pci 0000:50:00.0: can't claim BAR 15 [mem 0x00000000-0x000fffff pref]: no compatible bridge window
 pci 0000:50:00.0: can't claim BAR 16 [??? 0x00000000 flags 0x0]: no compatible bridge window
 pci_bus 0000:51: busn_res: [bus 51-ff] end is updated to 51
 pci 0000:50:00.0: BAR 16: [??? 0x00000000 flags 0x20000000] has bogus alignment
 pci 0000:50:00.1: BAR 0: assigned [mem 0xffffffff94000000-0xffffffff943fffff pref]
 pci 0000:50:00.0: BAR 14: assigned [mem 0xffffffff94400000-0xffffffff944fffff]
 pci 0000:50:00.0: BAR 15: assigned [mem 0xffffffff94500000-0xffffffff945fffff pref]
 pci 0000:50:00.1: BAR 6: assigned [mem 0xffffffff94600000-0xffffffff94607fff pref]
 pci 0000:50:00.0: PCI bridge to [bus 51]
 pci 0000:50:00.0:   bridge window [io  0x80000-0x80fff]
 pci 0000:50:00.0:   bridge window [mem 0xffffffff94400000-0xffffffff944fffff]
 pci 0000:50:00.0:   bridge window [mem 0xffffffff94500000-0xffffffff945fffff pref]

The patch below tries to improve the resource allocation.
Output is now:

 LBA 0:10: PCI host bridge to bus 0000:50
 pci_bus 0000:50: root bus resource [io  0x80000-0x8ffff] (bus address [0x0000-0xffff])
 pci_bus 0000:50: root bus resource [mem 0xffffffff94000000-0xffffffff95ffffff] (bus address [0x94000000-0x95ffffff])
 pci_bus 0000:50: root bus resource [bus 50-57]
 pci 0000:50:00.0: Changing bridge control from 0x00000000 to 0x00000023
 pci 0000:50:00.0: PCI bridge to [bus 51-ff]
 pci 0000:50:00.1: BAR 0: assigned [mem 0xffffffff94000000-0xffffffff943fffff pref]
 pci 0000:50:00.1: BAR 6: assigned [mem 0xffffffff94400000-0xffffffff94407fff pref]
 pci 0000:50:00.0: PCI bridge to [bus 51]
 pci 0000:50:00.0:   bridge window [io  0x80000-0x80fff]

Signed-off-by: Helge Deller <deller@gmx.de>
2017-08-24 18:46:44 +02:00
Waiman Long
1c08c22c87 cpuset: Fix incorrect memory_pressure control file mapping
The memory_pressure control file was incorrectly set up without
a private value (0, by default). As a result, this control
file was treated like memory_migrate on read. By adding back the
FILE_MEMORY_PRESSURE private value, the correct memory pressure value
will be returned.

Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Fixes: 7dbdb199d3 ("cgroup: replace cftype->mode with CFTYPE_WORLD_WRITABLE")
Cc: stable@vger.kernel.org # v4.4+
2017-08-24 09:42:28 -07:00
David S. Miller
fb3bbbda5f Merge branch 'mlxsw-ipv4-host-dpipe-table'
Jiri Pirko says:

====================
mlxsw: Add IPv4 host dpipe table

Arkadi says:

This patchset adds IPv4 host dpipe table support. This will provide the
ability to observe the hardware offloaded IPv4 neighbors.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-24 09:33:17 -07:00
Arkadi Sharshevsky
a481d71323 mlxsw: spectrum_dpipe: Add support for controlling neighbor counters
Add support for controlling neighbor counters via dpipe.

Signed-off-by: Arkadi Sharshevsky <arkadis@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-24 09:33:16 -07:00
Arkadi Sharshevsky
a86f030915 mlxsw: spectrum_dpipe: Add support for IPv4 host table dump
Add support for IPv4 host table dump.

Signed-off-by: Arkadi Sharshevsky <arkadis@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-24 09:33:16 -07:00
Arkadi Sharshevsky
7cfcbc7591 mlxsw: spectrum_router: Add support for setting counters on neighbors
Add support for setting counters on neighbors based on dpipe's host table
counter status. This patch also adds the ability for getting the counter
value, which will be used by the dpipe host table implementation in the
next patches.

Signed-off-by: Arkadi Sharshevsky <arkadis@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-24 09:33:16 -07:00
Arkadi Sharshevsky
6bba7e20da mlxsw: reg: Make flow counter set type enum to be shared
This is done as a preparation before introducing support for neighbor
counters. The flow counter's type enum is used by many registers, yet,
until now it was used only by mgpc and thus it was private. This patch
updates the namespace for more generic usage.

Signed-off-by: Arkadi Sharshevsky <arkadis@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-24 09:33:16 -07:00
Arkadi Sharshevsky
6aecb36bc0 mlxsw: spectrum_dpipe: Add IPv4 host table initial support
Add IPv4 host table initial support.

Signed-off-by: Arkadi Sharshevsky <arkadis@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-24 09:33:16 -07:00
Arkadi Sharshevsky
7e57ae9fc5 mlxsw: spectrum_dpipe: Fix label name
Change label name for case of erif table init failure.

Signed-off-by: Arkadi Sharshevsky <arkadis@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-24 09:33:16 -07:00
Arkadi Sharshevsky
f17cc84d1c mlxsw: spectrum_router: Add helpers for neighbor access
This is done as a preparation before introducing the ability to dump the
host table via dpipe, and to count the table size. The mlxsw's neighbor
representative struct stays private to the router module.

Signed-off-by: Arkadi Sharshevsky <arkadis@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-24 09:33:16 -07:00
Arkadi Sharshevsky
3580732448 devlink: Move dpipe entry clear function into devlink
The entry clear routine can be shared between the drivers, thus it is
moved inside devlink.

Signed-off-by: Arkadi Sharshevsky <arkadis@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-24 09:33:16 -07:00
Arkadi Sharshevsky
ffd3cdccf2 devlink: Add support for dynamic table size
Up until now the dpipe table's size was static and known at registration
time. The host table does not have constant size and it is resized in
dynamic manner. In order to support this behavior the size is changed
to be obtained dynamically via an op.

This patch also adjust the current dpipe table for the new API.

Signed-off-by: Arkadi Sharshevsky <arkadis@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-24 09:33:16 -07:00
Arkadi Sharshevsky
23ca5ec3af mlxsw: spectrum_dpipe: Fix erif table op name space
Fix ERIF's table operations name space.

Signed-off-by: Arkadi Sharshevsky <arkadis@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-24 09:33:16 -07:00
Arkadi Sharshevsky
3fb886ecea devlink: Add IPv4 header for dpipe
This will be used by the IPv4 host table which will be introduced in the
following patches. This header is global and can be reused by many
drivers.

Signed-off-by: Arkadi Sharshevsky <arkadis@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-24 09:33:16 -07:00
Arkadi Sharshevsky
1177009131 devlink: Add Ethernet header for dpipe
This will be used by the IPv4 host table which will be introduced in the
following patches. This header is global and can be reused by many
drivers.

Signed-off-by: Arkadi Sharshevsky <arkadis@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-24 09:33:15 -07:00
Dongdong Liu
9e16b8d68a PCI/DPC: Add local struct device pointers
Use a local "struct device *dev" for brevity and consistency in DPC driver.
No functional change intended.

Signed-off-by: Dongdong Liu <liudongdong3@huawei.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Keith Busch <keith.busch@intel.com>
2017-08-24 11:30:02 -05:00
Dongdong Liu
f20c4ea49e PCI/DPC: Add eDPC support
Add eDPC support. Get and print the RP PIO error information when the
trigger condition is RP PIO error.

For more information on eDPC, please see PCI Express Base Specification
Revision 3.1, section 6.2.10.3, or view the PCI-SIG eDPC ECN here:
https://pcisig.com/sites/default/files/specification_documents/ECN_Enhanced_DPC_2012-11-19_final.pdf

Signed-off-by: Dongdong Liu <liudongdong3@huawei.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Keith Busch <keith.busch@intel.com>
2017-08-24 11:28:44 -05:00
Alex Williamson
ea5311c7e7 PCI: Fix PCIe capability sizes
PCI_CAP_EXP_ENDPOINT_SIZEOF_V1 defines the size of the PCIe capability
structure for v1 devices with link, but we also have a need in the vfio
code for sizing the capability for devices without link, such as Root
Complex Integrated Endpoints.  Create a separate define for this ending the
structure before the link fields.

Additionally, this reveals that PCI_CAP_EXP_ENDPOINT_SIZEOF_V2 is currently
incorrect, ending the capability length before the v2 link fields.  Rename
this to specify an RC Integrated Endpoint (no link) capability length and
move PCI_CAP_EXP_ENDPOINT_SIZEOF_V2 to include the link fields as we have
for the v1 version.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
[bhelgaas: add "_" in "PCI_CAP_EXP_RC ENDPOINT_SIZEOF_V2 44"]
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
2017-08-24 11:24:59 -05:00
Rob Herring
b63773a801 PCI: Convert to using %pOF instead of full_name()
Now that we have a custom printf format specifier, convert users of
full_name() to use %pOF instead.  This is preparation for removing storing
of the full path string for each node.

Signed-off-by: Rob Herring <robh@kernel.org>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Tyrel Datwyler <tyreld@linux.vnet.ibm.com>
Cc: Thomas Petazzoni <thomas.petazzoni@free-electrons.com>
Cc: Jason Cooper <jason@lakedaemon.net>
Cc: Thierry Reding <thierry.reding@gmail.com>
Cc: Jonathan Hunter <jonathanh@nvidia.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
2017-08-24 11:24:59 -05:00
Bhumika Goyal
36b8518950 PCI: Constify endpoint pci_epf_type device_type
Make this const as it is only stored in the type field of a device
structure, which is const.  Done using Coccinelle.

Signed-off-by: Bhumika Goyal <bhumirks@gmail.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
2017-08-24 11:22:54 -05:00
Wanpeng Li
bfcf83b144 KVM: nVMX: Fix trying to cancel vmlauch/vmresume
------------[ cut here ]------------
WARNING: CPU: 7 PID: 3861 at /home/kernel/ssd/kvm/arch/x86/kvm//vmx.c:11299 nested_vmx_vmexit+0x176e/0x1980 [kvm_intel]
CPU: 7 PID: 3861 Comm: qemu-system-x86 Tainted: G        W  OE   4.13.0-rc4+ #11
RIP: 0010:nested_vmx_vmexit+0x176e/0x1980 [kvm_intel]
Call Trace:
 ? kvm_multiple_exception+0x149/0x170 [kvm]
 ? handle_emulation_failure+0x79/0x230 [kvm]
 ? load_vmcs12_host_state+0xa80/0xa80 [kvm_intel]
 ? check_chain_key+0x137/0x1e0
 ? reexecute_instruction.part.168+0x130/0x130 [kvm]
 nested_vmx_inject_exception_vmexit+0xb7/0x100 [kvm_intel]
 ? nested_vmx_inject_exception_vmexit+0xb7/0x100 [kvm_intel]
 vmx_queue_exception+0x197/0x300 [kvm_intel]
 kvm_arch_vcpu_ioctl_run+0x1b0c/0x2c90 [kvm]
 ? kvm_arch_vcpu_runnable+0x220/0x220 [kvm]
 ? preempt_count_sub+0x18/0xc0
 ? restart_apic_timer+0x17d/0x300 [kvm]
 ? kvm_lapic_restart_hv_timer+0x37/0x50 [kvm]
 ? kvm_arch_vcpu_load+0x1d8/0x350 [kvm]
 kvm_vcpu_ioctl+0x4e4/0x910 [kvm]
 ? kvm_vcpu_ioctl+0x4e4/0x910 [kvm]
 ? kvm_dev_ioctl+0xbe0/0xbe0 [kvm]

The flag "nested_run_pending", which can override the decision of which should run
next, L1 or L2. nested_run_pending=1 means that we *must* run L2 next, not L1. This
is necessary in particular when L1 did a VMLAUNCH of L2 and therefore expects L2 to
be run (and perhaps be injected with an event it specified, etc.). Nested_run_pending
is especially intended to avoid switching  to L1 in the injection decision-point.

This can be handled just like the other cases in vmx_check_nested_events, instead of
having a special case in vmx_queue_exception.

Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-08-24 18:22:21 +02:00
Varadarajan Narayanan
5d76117f07 PCI: qcom: Add support for IPQ8074 PCIe controller
Add support for the IPQ8074 PCIe controller.  IPQ8074 supports Gen 1/2, one
lane, two PCIe root complex with support for MSI and legacy interrupts, and
it conforms to PCI Express Base 2.1 specification.

The core init is the similar to the existing SoC, however the clocks and
reset lines differ.

Signed-off-by: smuthayy <smuthayy@codeaurora.org>
Signed-off-by: Varadarajan Narayanan <varada@codeaurora.org>
[bhelgaas: fix capitalization and "dev" usage to match existing style]
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Acked-by: Stanimir Varbanov <svarbanov@mm-sol.com>
2017-08-24 11:11:23 -05:00
Varadarajan Narayanan
8baf0151cd dt-bindings: PCI: qcom: Add support for IPQ8074
Add support for the IPQ8074 PCIe controller.  IPQ8074 supports Gen 1/2, one
lane, two PCIe root complex with support for MSI and legacy interrupts, and
it conforms to PCI Express Base 2.1 specification.

Signed-off-by: Varadarajan Narayanan <varada@codeaurora.org>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Acked-by: Rob Herring <robh@kernel.org>
2017-08-24 11:10:19 -05:00
Varadarajan Narayanan
deff11f884 PCI: qcom: Use block IP version for operations
Presently, when support for a new SoC is added, the driver ops structures
and functions are versioned with plain 1, 2, 3 etc.  Instead use the block
IP version number.

Signed-off-by: Varadarajan Narayanan <varada@codeaurora.org>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Acked-by: Stanimir Varbanov <svarbanov@mm-sol.com>
2017-08-24 11:09:33 -05:00
Philipp Zabel
244e00071f PCI: qcom: Explicitly request exclusive reset control
Commit a53e35db70 ("reset: Ensure drivers are explicit when requesting
reset lines") started to transition the reset control request API calls to
explicitly state whether the driver needs exclusive or shared reset control
behavior. Convert all drivers requesting exclusive resets to the explicit
API call so the temporary transition helpers can be removed.

No functional changes.

Signed-off-by: Philipp Zabel <p.zabel@pengutronix.de>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Cc: Stanimir Varbanov <svarbanov@mm-sol.com>
2017-08-24 11:09:22 -05:00
Wanpeng Li
664f8e26b0 KVM: X86: Fix loss of exception which has not yet been injected
vmx_complete_interrupts() assumes that the exception is always injected,
so it can be dropped by kvm_clear_exception_queue().  However,
an exception cannot be injected immediately if it is: 1) originally
destined to a nested guest; 2) trapped to cause a vmexit; 3) happening
right after VMLAUNCH/VMRESUME, i.e. when nested_run_pending is true.

This patch applies to exceptions the same algorithm that is used for
NMIs, replacing exception.reinject with "exception.injected" (equivalent
to nmi_injected).

exception.pending now represents an exception that is queued and whose
side effects (e.g., update RFLAGS.RF or DR7) have not been applied yet.
If exception.pending is true, the exception might result in a nested
vmexit instead, too (in which case the side effects must not be applied).

exception.injected instead represents an exception that is going to be
injected into the guest at the next vmentry.

Reported-by: Radim Krčmář <rkrcmar@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-08-24 18:09:19 +02:00
Wanpeng Li
274bba52a0 KVM: VMX: use kvm_event_needs_reinjection
Use kvm_event_needs_reinjection() encapsulation.

Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-08-24 18:09:19 +02:00
Paolo Bonzini
09f037aa48 KVM: MMU: speedup update_permission_bitmask
update_permission_bitmask currently does a 128-iteration loop to,
essentially, compute a constant array.  Computing the 8 bits in parallel
reduces it to 16 iterations, and is enough to speed it up substantially
because many boolean operations in the inner loop become constants or
simplify noticeably.

Because update_permission_bitmask is actually the top item in the profile
for nested vmexits, this speeds up an L2->L1 vmexit by about ten thousand
clock cycles, or up to 30%:

                                         before     after
   cpuid                                 35173      25954
   vmcall                                35122      27079
   inl_from_pmtimer                      52635      42675
   inl_from_qemu                         53604      44599
   inl_from_kernel                       38498      30798
   outl_to_kernel                        34508      28816
   wr_tsc_adjust_msr                     34185      26818
   rd_tsc_adjust_msr                     37409      27049
   mmio-no-eventfd:pci-mem               50563      45276
   mmio-wildcard-eventfd:pci-mem         34495      30823
   mmio-datamatch-eventfd:pci-mem        35612      31071
   portio-no-eventfd:pci-io              44925      40661
   portio-wildcard-eventfd:pci-io        29708      27269
   portio-datamatch-eventfd:pci-io       31135      27164

(I wrote a small C program to compare the tables for all values of CR0.WP,
CR4.SMAP and CR4.SMEP, and they match).

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-08-24 18:09:18 +02:00
Fabio Estevam
a8c2038f61 PCI: qcom: Use gpiod_set_value_cansleep() to allow reset via expanders
The reset GPIO can be connected to a I2C or SPI IO expander, which may
sleep, so it is safer to use the gpiod_set_value_cansleep() variant
instead.

Signed-off-by: Fabio Estevam <fabio.estevam@nxp.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Acked-by: Stanimir Varbanov <svarbanov@mm-sol.com>
2017-08-24 11:09:17 -05:00
Yu Zhang
fd8cb43373 KVM: MMU: Expose the LA57 feature to VM.
This patch exposes 5 level page table feature to the VM.
At the same time, the canonical virtual address checking is
extended to support both 48-bits and 57-bits address width.

Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-08-24 18:09:17 +02:00
Yu Zhang
855feb6736 KVM: MMU: Add 5 level EPT & Shadow page table support.
Extends the shadow paging code, so that 5 level shadow page
table can be constructed if VM is running in 5 level paging
mode.

Also extends the ept code, so that 5 level ept table can be
constructed if maxphysaddr of VM exceeds 48 bits. Unlike the
shadow logic, KVM should still use 4 level ept table for a VM
whose physical address width is less than 48 bits, even when
the VM is running in 5 level paging mode.

Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
[Unconditionally reset the MMU context in kvm_cpuid_update.
 Changing MAXPHYADDR invalidates the reserved bit bitmasks.
 - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-08-24 18:09:17 +02:00
Yu Zhang
2a7266a8f9 KVM: MMU: Rename PT64_ROOT_LEVEL to PT64_ROOT_4LEVEL.
Now we have 4 level page table and 5 level page table in 64 bits
long mode, let's rename the PT64_ROOT_LEVEL to PT64_ROOT_4LEVEL,
then we can use PT64_ROOT_5LEVEL for 5 level page table, it's
helpful to make the code more clear.

Also PT64_ROOT_MAX_LEVEL is defined as 4, so that we can just
redefine it to 5 whenever a replacement is needed for 5 level
paging.

Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-08-24 18:09:16 +02:00
Yu Zhang
d1cd3ce900 KVM: MMU: check guest CR3 reserved bits based on its physical address width.
Currently, KVM uses CR3_L_MODE_RESERVED_BITS to check the
reserved bits in CR3. Yet the length of reserved bits in
guest CR3 should be based on the physical address width
exposed to the VM. This patch changes CR3 check logic to
calculate the reserved bits at runtime.

Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-08-24 18:09:16 +02:00
Yu Zhang
e911eb3b34 KVM: x86: Add return value to kvm_cpuid().
Return false in kvm_cpuid() when it fails to find the cpuid
entry. Also, this routine(and its caller) is optimized with
a new argument - check_limit, so that the check_cpuid_limit()
fall back can be avoided.

Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-08-24 18:09:15 +02:00
Paolo Bonzini
3db134805c kvm: vmx: Raise #UD on unsupported XSAVES/XRSTORS
A guest may not be configured to support XSAVES/XRSTORS, even when the host
does. If the guest does not support XSAVES/XRSTORS, clear the secondary
execution control so that the processor will raise #UD.

Also clear the "allowed-1" bit for XSAVES/XRSTORS exiting in the
IA32_VMX_PROCBASED_CTLS2 MSR, and pass through VMCS12's control in
the VMCS02.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-08-24 18:09:13 +02:00
Tom Rini
9ce76511b6 ASoC: rt5677: Reintroduce I2C device IDs
Not all devices with ACPI and this combination of sound devices will
have the required information provided via ACPI.  Reintroduce the I2C
device ID to restore sound functionality on on the Chromebook 'Samus'
model.

[ More background note:
 the commit a36afb0ab6 ("ASoC: rt5677: Introduce proper table...")
 moved the i2c ID probed via ACPI ("RT5677CE:00") to a proper
 acpi_device_id table.  Although the action itself is correct per se,
 the overseen issue is the reference id->driver_data at
 rt5677_i2c_probe() for retrieving the corresponding chip model for
 the given id.  Since id=NULL is passed for ACPI matching case, we get
 an Oops now.

 We already have queued more fixes for 4.14 and they already address
 the issue, but they are bigger changes that aren't preferable for the
 late 4.13-rc stage.  So, this patch just papers over the bug as a
 once-off quick fix for a particular ACPI matching.  -- tiwai ]

Fixes: a36afb0ab6 ("ASoC: rt5677: Introduce proper table for ACPI enumeration")
Signed-off-by: Tom Rini <trini@konsulko.com>
Acked-by: Mark Brown <broonie@kernel.org>
Signed-off-by: Takashi Iwai <tiwai@suse.de>
2017-08-24 18:04:29 +02:00