ping localhost may default of IPv6 on modern systems, but
samples are trying to only parse IPv4. Force IPv4.
samples/bpf/tracex1_user.c doesn't interpret the packet so
we don't care which IP version will be used there.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Older GCC (<4.8) isn't smart enough to optimize !__builtin_constant_p()
branch in bpf_htons.
I recently fixed it for pkt_v4 and pkt_v6 in commit a0517a0f7e
("selftests/bpf: use __bpf_constant_htons in test_prog.c"), but
later added another bunch of bpf_htons in commit bf0f0fd939
("selftests/bpf: add simple BPF_PROG_TEST_RUN examples for flow
dissector").
Fixes: bf0f0fd939 ("selftests/bpf: add simple BPF_PROG_TEST_RUN examples for flow dissector")
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
This header defines the BPF functions enumerated in uapi/linux.bpf.h
in a callable format. Expand to include all registered functions.
Signed-off-by: Willem de Bruijn <willemb@google.com>
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
wrap bpf_stats_enabled sysctl with #ifdef
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Fixes: 492ecee892 ("bpf: enable program stats")
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
- Fix 16b cmpxchg() operations which could erroneously fail if bits 15:8
of the old value are non-zero. In practice I'm not aware of any actual
users of 16b cmpxchg() on MIPS, but this fixes the support for it was
was introduced in v4.13.
- Provide a struct device to dma_alloc_coherent for Lantiq XWAY systems
with a "Voice MIPS Macro Core" (VMMC) device.
- Provide DMA masks for BCM63xx ethernet devices, fixing a regression
introduced in v4.19.
- Fix memblock reservation for the kernel when the system has a non-zero
PHYS_OFFSET, correcting the memblock conversion performed in v4.20.
-----BEGIN PGP SIGNATURE-----
iIsEABYIADMWIQRgLjeFAZEXQzy86/s+p5+stXUA3QUCXHhqjBUccGF1bC5idXJ0
b25AbWlwcy5jb20ACgkQPqefrLV1AN3ZaAD/SFgi3dS9bSWhDhiy83llLaWiCGPb
i09uzo3rpWoKSwQBAIwLEfmaHz/sYdliKRlE13uaxYWzwaN+VHXIPjlzbYMB
=cDlO
-----END PGP SIGNATURE-----
Merge tag 'mips_fixes_5.0_4' of git://git.kernel.org/pub/scm/linux/kernel/git/mips/linux
Pull MIPS fixes from Paul Burton:
"A few more MIPS fixes:
- Fix 16b cmpxchg() operations which could erroneously fail if bits
15:8 of the old value are non-zero. In practice I'm not aware of
any actual users of 16b cmpxchg() on MIPS, but this fixes the
support for it was was introduced in v4.13.
- Provide a struct device to dma_alloc_coherent for Lantiq XWAY
systems with a "Voice MIPS Macro Core" (VMMC) device.
- Provide DMA masks for BCM63xx ethernet devices, fixing a regression
introduced in v4.19.
- Fix memblock reservation for the kernel when the system has a
non-zero PHYS_OFFSET, correcting the memblock conversion performed
in v4.20"
* tag 'mips_fixes_5.0_4' of git://git.kernel.org/pub/scm/linux/kernel/git/mips/linux:
MIPS: fix memory setup for platforms with PHYS_OFFSET != 0
MIPS: BCM63XX: provide DMA masks for ethernet devices
MIPS: lantiq: pass struct device to DMA API functions
MIPS: fix truncation in __cmpxchg_small for short values
Upon setting the cmode on 6390 and 6390X, the associated serdes
interfaces must be powered off/on.
Both 6390X and 6390 share code to do so, but it currently uses the 6390
specific helper mv88e6390_serdes_power() to disable and enable the
serdes interface.
This call will fail silently on 6390X when trying so set a 10G interface
such as XAUI or RXAUI, since mv88e6390_serdes_power() internally grabs
the lane number based on modes supported by the 6390, and returns 0 when
getting -ENODEV as a lane number.
Using mv88e6390x_serdes_power() should be safe here, since we explicitly
rule-out all ports but the 9 and 10, and because modes supported by 6390
ports 9 and 10 are a subset of those supported on 6390X.
This was tested on 6390X using RXAUI mode.
Fixes: 364e9d7776 ("net: dsa: mv88e6xxx: Power on/off SERDES on cmode change")
Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add an entry to the quirks-table to for usb-audio to recognize the
Microbook II (although it only exposes vendor interfaces). A simple boot
quirk is also implemented to set up the sample rate and make sure that
no audio urbs are sent before the device is ready.
This patch only provides audio playback and capture at 96kHz sample
rate. Notice the following shortcomings:
- The sample rate is currently hardcoded to 96k although the device also
supports 48k and 44.1k.
- The various mixer controls of the MicroBook are not made available.
- The keep-iface control should be on by default because the device
shuts down whenever the altsetting is reset which is usually unwanted.
(I don't know the best way to do this)
- The communication format used by the MicroBook for sample rate setting
and also other setup has been reverse engineered by looking at the
usbmon output while running the windows driver through virtualbox. In
this patch the first byte of every message is set to \0 while in the
observed communications the first byte acts as a "message-counter"
increasing its value with every message sent. Leaving it at \0 does
not seem to affect the device.
Signed-off-by: Manuel Reinhardt <manuel.rhdt@gmail.com>
Signed-off-by: Takashi Iwai <tiwai@suse.de>
Replace hard coded function name register_blkdev with __func__, to
improve robustness and to conform to the Linux kernel coding
style. Issue found using checkpatch.
Signed-off-by: Keyur Patel <iamkeyur96@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
For an AHCI controller with AHCI_HFLAG_MULTI_MSI flag set, we may get the
following log, regardless of whether a custom irq handler was implemented
or not:
[ 14.700238] ahci 0000:74:03.0: both AHCI_HFLAG_MULTI_MSI flag set and custom irq handler implemented
This is because we can set hpriv->irq_handler to
ahci_single_level_irq_intr() if not already set, in
ahci_init_one()->ahci_pci_save_initial_config()->ahci_save_initial_config().
Stop having this warn being misleading by adding a check for
hpriv->irq_handler != ahci_single_level_irq_intr.
Signed-off-by: John Garry <john.garry@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Fixes gcc '-Wunused-but-set-variable' warning:
drivers/block/floppy.c: In function 'request_done':
drivers/block/floppy.c:2233:24: warning:
variable 'q' set but not used [-Wunused-but-set-variable]
It's never used and can be removed.
Acked-by: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
null_handle_bio() erroneously uses the bio_op macro
which masks respective request flag bits including REQ_FUA
out thus failing the check.
Fix by checking bio->bi_opf directly.
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
'ip' can switch network namespaces internally and then run a given
command relative to that namespace without the need to fork and exec
another ip instance. Update all references of the form:
ip netns exec "$testns" ip ...
to
ip -netns "$testns" ...
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If __device_add_disk-->bdi_register_owner-->bdi_register-->
bdi_register_va-->device_create_vargs fails, bdi->dev is still
NULL, __device_add_disk-->register_disk will visit bdi->dev->kobj.
This patch fixes that.
Signed-off-by: zhengbin <zhengbin13@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
guard_bio_eod() can truncate a segment in bio to allow it to do IO on
odd last sectors of a device.
It already checks if the IO starts past EOD, but it does not consider
the possibility of an IO request starting within device boundaries can
contain more than one segment past EOD.
In such cases, truncated_bytes can be bigger than PAGE_SIZE, and will
underflow bvec->bv_len.
Fix this by checking if truncated_bytes is lower than PAGE_SIZE.
This situation has been found on filesystems such as isofs and vfat,
which doesn't check the device size before mount, if the device is
smaller than the filesystem itself, a readahead on such filesystem,
which spans EOD, can trigger this situation, leading a call to
zero_user() with a wrong size possibly corrupting memory.
I didn't see any crash, or didn't let the system run long enough to
check if memory corruption will be hit somewhere, but adding
instrumentation to guard_bio_end() to check truncated_bytes size, was
enough to see the error.
The following script can trigger the error.
MNT=/mnt
IMG=./DISK.img
DEV=/dev/loop0
mkfs.vfat $IMG
mount $IMG $MNT
cp -R /etc $MNT &> /dev/null
umount $MNT
losetup -D
losetup --find --show --sizelimit 16247280 $IMG
mount $DEV $MNT
find $MNT -type f -exec cat {} + >/dev/null
Kudos to Eric Sandeen for coming up with the reproducer above
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Julian Wiedmann says:
====================
s390/qeth: updates 2019-02-28
please apply one more qeth patch series for net-next. This eliminates
some of the quirks in our reset code, and slims down the internal
state machine.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Now that qeth always uses dev_close() to shutdown the interface, we can
trust the locking and remove some custom state checks.
qeth_l?_stop_card() is no longer called for a card in UP state, so remove
the checks there too. This basically makes the UP state obsolete, so rip
out the whole thing (except for the sysfs-visible string).
Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
It makes no difference whether we
1. manually disarm the HW trap and call the offline code with
recovery_mode == 1, or
2. call the offline code with recovery_mode == 0, and let it disarm the
HW trap for us.
So consolidate the two code paths in the suspend callback.
Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The qeth-wide workqueue is now only used by a single caller to schedule
close_dev work. Just put it on a system queue instead.
Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The recovery code already runs in a kthread, we don't have to defer the
offlining further.
Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
smatch complains that __qeth_l3_set_offline() first accesses card->dev,
and then later checks whether the pointer is valid.
Since commit d3d1b205e8 ("s390/qeth: allocate netdevice early"), the
pointer is _always_ valid - that patch merely missed to remove this one
check.
Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When resetting an interface ("recovery"), qeth currently attempts to
elide the call to dev_close(). We initially only call .ndo_close to
quiesce the data path, and then offline & online the ccwgroup device.
If the reset succeeded, a call to .ndo_open then resumes the data path
along with some internal setup (dev_addr validation, RX modeset) that
dev_open() would have usually triggered.
dev_close() only gets called (via the close_dev worker) if the reset
action fails.
It's unclear whether this was initially done due to locking concerns, or
rather to execute the reset transparently. Either way, temporarily
closing the interface without dev_close() is fragile, and means we're
susceptible to various races and unexpected behaviour. For instance:
- Bypassing dev_deactivate_many() means that the qdiscs are not set to
__QDISC_STATE_DEACTIVATED. Consequently any intermittent TX completion
can wake up the txq, resulting in calls to .ndo_start_xmit while the
data path is down. We have custom state checking to detect this case
and drop such packets.
- Because the IFF_UP flag doesn't reflect the interface's actual state
during a reset, we have custom state checking in .ndo_open and
.ndo_close to guard against invalid calls.
- Considering that the reset might take a considerable amount of time
(in particular if an IO fails and we end up waiting for its timeout), we
_do_ want NETDEV_GOING_DOWN and NETDEV_DOWN events so that components
like bonding, team, bridge, macvlan, vlan, ... can take appropriate
action.
Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In its attempt to run only the minimal amount of tear down steps,
qeth_l2_stop_card() fails to reset the "is dev_addr registered?" flag
in some rare scenarios. But a future change to the tear down sequence
would cause us to _always_ hit this issue, so patch it up before that
code lands.
Fix it by unconditionally clearing the flag bit. This also allows us to
remove the additional cleanup step in qeth_dev_layer2_store().
Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When setting a L2 qeth device online, enable the HW trap as soon as the
control plane is available. This allows us to catch any error that
occurs during the very first commands.
In the same spirit, the offline code should disable the HW trap as the
very first step of its processing.
Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The offline code uses a specific RECOVER state to indicate that the
interface should be brought up when a qeth device is set online again.
Rather than having a specific card-state for this, just put it in an
internal flag bit and set the state to DOWN. When working with the
card's state transitions, this reduces the complexity quite a bit.
Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The switch maintains u64 counters for the number of octets sent and
received. These are kept as two u32's which need to be combined. Fix
the combing, which wrongly worked on u16's.
Fixes: 80c4627b27 ("dsa: mv88x6xxx: Refactor getting a single statistic")
Reported-by: Chris Healy <Chris.Healy@zii.aero>
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Occasionally, during the disconnection procedure on XenBus which
includes hash cache deinitialization there might be some packets
still in-flight on other processors. Handling of these packets includes
hashing and hash cache population that finally results in hash cache
data structure corruption.
In order to avoid this we prevent hashing of those packets if there
are no queues initialized. In that case RCU protection of queues guards
the hash cache as well.
Signed-off-by: Igor Druzhinin <igor.druzhinin@citrix.com>
Reviewed-by: Paul Durrant <paul.durrant@citrix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Without hardware pnetid support there must currently be a pnet
table configured to determine the IB device port to be used for SMC
RDMA traffic. This patch enables a setup without pnet table, if
the used handshake interface belongs already to a RoCE port.
Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There is no need to only iterate in chunks of PAGE_SIZE or less in
bvec_iter_advance, given that the callers pass in the chunk length that
they are operating on - either that already is less than PAGE_SIZE
because they do classic page-based iteration, or it is larger because
the caller operates on multi-page bvecs.
This should help shaving off a few cycles of the I/O hot path.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Dongdong reported a deadlock triggered by a hotplug event during a sysfs
"remove" operation:
pciehp 0000:00:0c.0:pcie004: Slot(0-1): Link Up
# echo 1 > 0000:00:0c.0/remove
PME and hotplug share an MSI/MSI-X vector. The sysfs "remove" side is:
remove_store
pci_stop_and_remove_bus_device_locked
pci_lock_rescan_remove
pci_stop_and_remove_bus_device
...
pcie_pme_remove
pcie_pme_suspend
synchronize_irq # wait for hotplug IRQ handler
pci_unlock_rescan_remove
The hotplug side is:
pciehp_ist
pciehp_handle_presence_or_link_change
pciehp_configure_device
pci_lock_rescan_remove # wait for pci_unlock_rescan_remove()
INFO: task bash:10913 blocked for more than 120 seconds.
# ps -ax |grep D
PID TTY STAT TIME COMMAND
10913 ttyAMA0 Ds+ 0:00 -bash
14022 ? D 0:00 [irq/745-pciehp]
# cat /proc/14022/stack
__switch_to+0x94/0xd8
pci_lock_rescan_remove+0x20/0x28
pciehp_configure_device+0x30/0x140
pciehp_handle_presence_or_link_change+0x324/0x458
pciehp_ist+0x1dc/0x1e0
# cat /proc/10913/stack
__switch_to+0x94/0xd8
synchronize_irq+0x8c/0xc0
pcie_pme_suspend+0xa4/0x118
pcie_pme_remove+0x20/0x40
pcie_port_remove_service+0x3c/0x58
...
pcie_port_device_remove+0x2c/0x48
pcie_portdrv_remove+0x68/0x78
pci_device_remove+0x48/0x120
...
pci_stop_bus_device+0x84/0xc0
pci_stop_and_remove_bus_device_locked+0x24/0x40
remove_store+0xa4/0xb8
dev_attr_store+0x44/0x60
sysfs_kf_write+0x58/0x80
It is incorrect to call pcie_pme_suspend() from pcie_pme_remove() for two
reasons.
First, pcie_pme_suspend() calls synchronize_irq(), which will wait for the
native hotplug interrupt handler as well as for the PME one, because they
share one IRQ (as per the spec). That may deadlock if hotplug is signaled
while pcie_pme_remove() is running and the latter calls
pci_lock_rescan_remove() before the former.
Second, if pcie_pme_suspend() figures out that wakeup needs to be enabled
for the port, it will return without disabling the interrupt as expected by
pcie_pme_remove() which was overlooked by commit c7b5a4e6e8 ("PCI / PM:
Fix native PME handling during system suspend/resume").
To fix that, rework pcie_pme_remove() to disable the PME interrupt, clear
its status and prevent the PME worker function from re-enabling it before
calling free_irq() on it, which should be sufficient.
Fixes: c7b5a4e6e8 ("PCI / PM: Fix native PME handling during system suspend/resume")
Link: https://lore.kernel.org/linux-pci/c7697e7c-e1af-13e4-8491-0a3996e6ab5d@huawei.com
Reported-by: Dongdong Liu <liudongdong3@huawei.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
[bhelgaas: add URL and deadlock details from Dongdong]
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Fix buffer overflow observed when running perf test.
The overflow is when trying to evaluate "1ULL << (64 - 1)" which is
resulting in -9223372036854775808 which overflows the 20 character
buffer.
If is possible this bug has been reported before but I still don't see
any fix checked in:
See: https://www.spinics.net/lists/linux-perf-users/msg07714.html
Reported-by: Michael Sartain <mikesart@fastmail.com>
Reported-by: Mathias Krause <minipli@googlemail.com>
Signed-off-by: Tony Jones <tonyj@suse.de>
Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Fixes: f7d82350e5 ("tools/events: Add files to create libtraceevent.a")
Link: http://lkml.kernel.org/r/20190228015532.8941-1-tonyj@suse.de
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
This is intended for use with NVDIMMs that are physically persistent
(physically like flash) so that they can be used as a cost-effective
RAM replacement. Intel Optane DC persistent memory is one
implementation of this kind of NVDIMM.
Currently, a persistent memory region is "owned" by a device driver,
either the "Direct DAX" or "Filesystem DAX" drivers. These drivers
allow applications to explicitly use persistent memory, generally
by being modified to use special, new libraries. (DIMM-based
persistent memory hardware/software is described in great detail
here: Documentation/nvdimm/nvdimm.txt).
However, this limits persistent memory use to applications which
*have* been modified. To make it more broadly usable, this driver
"hotplugs" memory into the kernel, to be managed and used just like
normal RAM would be.
To make this work, management software must remove the device from
being controlled by the "Device DAX" infrastructure:
echo dax0.0 > /sys/bus/dax/drivers/device_dax/unbind
and then tell the new driver that it can bind to the device:
echo dax0.0 > /sys/bus/dax/drivers/kmem/new_id
After this, there will be a number of new memory sections visible
in sysfs that can be onlined, or that may get onlined by existing
udev-initiated memory hotplug rules.
This rebinding procedure is currently a one-way trip. Once memory
is bound to "kmem", it's there permanently and can not be
unbound and assigned back to device_dax.
The kmem driver will never bind to a dax device unless the device
is *explicitly* bound to the driver. There are two reasons for
this: One, since it is a one-way trip, it can not be undone if
bound incorrectly. Two, the kmem driver destroys data on the
device. Think of if you had good data on a pmem device. It
would be catastrophic if you compile-in "kmem", but leave out
the "device_dax" driver. kmem would take over the device and
write volatile data all over your good data.
This inherits any existing NUMA information for the newly-added
memory from the persistent memory device that came from the
firmware. On Intel platforms, the firmware has guarantees that
require each socket's persistent memory to be in a separate
memory-only NUMA node. That means that this patch is not expected
to create NUMA nodes, but will simply hotplug memory into existing
nodes.
Because NUMA nodes are created, the existing NUMA APIs and tools
are sufficient to create policies for applications or memory areas
to have affinity for or an aversion to using this memory.
There is currently some metadata at the beginning of pmem regions.
The section-size memory hotplug restrictions, plus this small
reserved area can cause the "loss" of a section or two of capacity.
This should be fixable in follow-on patches. But, as a first step,
losing 256MB of memory (worst case) out of hundreds of gigabytes
is a good tradeoff vs. the required code to fix this up precisely.
This calculation is also the reason we export
memory_block_size_bytes().
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Ross Zwisler <zwisler@kernel.org>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: linux-nvdimm@lists.01.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: Huang Ying <ying.huang@intel.com>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Cc: Borislav Petkov <bp@suse.de>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Yaowei Bai <baiyaowei@cmss.chinamobile.com>
Cc: Takashi Iwai <tiwai@suse.de>
Cc: Jerome Glisse <jglisse@redhat.com>
Reviewed-by: Vishal Verma <vishal.l.verma@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
In the process of onlining memory, we use walk_system_ram_range()
to find the actual RAM areas inside of the area being onlined.
However, it currently only finds memory resources which are
"top-level" iomem_resources. Children are not currently
searched which causes it to skip System RAM in areas like this
(in the format of /proc/iomem):
a0000000-bfffffff : Persistent Memory (legacy)
a0000000-afffffff : System RAM
Changing the true->false here allows children to be searched
as well. We need this because we add a new "System RAM"
resource underneath the "persistent memory" resource when
we use persistent memory in a volatile mode.
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Keith Busch <keith.busch@intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Ross Zwisler <zwisler@kernel.org>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: linux-nvdimm@lists.01.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: Huang Ying <ying.huang@intel.com>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Cc: Borislav Petkov <bp@suse.de>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Yaowei Bai <baiyaowei@cmss.chinamobile.com>
Cc: Takashi Iwai <tiwai@suse.de>
Cc: Jerome Glisse <jglisse@redhat.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
The mm/resource.c code is used to manage the physical address
space. The current resource configuration can be viewed in
/proc/iomem. An example of this is at the bottom of this
description.
The nvdimm subsystem "owns" the physical address resources which
map to persistent memory and has resources inserted for them as
"Persistent Memory". The best way to repurpose this for volatile
use is to leave the existing resource in place, but add a "System
RAM" resource underneath it. This clearly communicates the
ownership relationship of this memory.
The request_resource_conflict() API only deals with the
top-level resources. Replace it with __request_region() which
will search for !IORESOURCE_BUSY areas lower in the resource
tree than the top level.
We *could* also simply truncate the existing top-level
"Persistent Memory" resource and take over the released address
space. But, this means that if we ever decide to hot-unplug the
"RAM" and give it back, we need to recreate the original setup,
which may mean going back to the BIOS tables.
This should have no real effect on the existing collision
detection because the areas that truly conflict should be marked
IORESOURCE_BUSY.
00000000-00000fff : Reserved
00001000-0009fbff : System RAM
0009fc00-0009ffff : Reserved
000a0000-000bffff : PCI Bus 0000:00
000c0000-000c97ff : Video ROM
000c9800-000ca5ff : Adapter ROM
000f0000-000fffff : Reserved
000f0000-000fffff : System ROM
00100000-9fffffff : System RAM
01000000-01e071d0 : Kernel code
01e071d1-027dfdff : Kernel data
02dc6000-0305dfff : Kernel bss
a0000000-afffffff : Persistent Memory (legacy)
a0000000-a7ffffff : System RAM
b0000000-bffdffff : System RAM
bffe0000-bfffffff : Reserved
c0000000-febfffff : PCI Bus 0000:00
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Vishal Verma <vishal.l.verma@intel.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Ross Zwisler <zwisler@kernel.org>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: linux-nvdimm@lists.01.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: Huang Ying <ying.huang@intel.com>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Cc: Borislav Petkov <bp@suse.de>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Yaowei Bai <baiyaowei@cmss.chinamobile.com>
Cc: Takashi Iwai <tiwai@suse.de>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Keith Busch <keith.busch@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
HMM consumes physical address space for its own use, even
though nothing is mapped or accessible there. It uses a
special resource description (IORES_DESC_DEVICE_PRIVATE_MEMORY)
to uniquely identify these areas.
When HMM consumes address space, it makes a best guess about
what to consume. However, it is possible that a future memory
or device hotplug can collide with the reserved area. In the
case of these conflicts, there is an error message in
register_memory_resource().
Later patches in this series move register_memory_resource()
from using request_resource_conflict() to __request_region().
Unfortunately, __request_region() does not return the conflict
like the previous function did, which makes it impossible to
check for IORES_DESC_DEVICE_PRIVATE_MEMORY in a conflicting
resource.
Instead of warning in register_memory_resource(), move the
check into the core resource code itself (__request_region())
where the conflicting resource _is_ available. This has the
added bonus of producing a warning in case of HMM conflicts
with devices *or* RAM address space, as opposed to the RAM-
only warnings that were there previously.
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Jerome Glisse <jglisse@redhat.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Ross Zwisler <zwisler@kernel.org>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: linux-nvdimm@lists.01.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: Huang Ying <ying.huang@intel.com>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Cc: Keith Busch <keith.busch@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
walk_system_ram_range() can return an error code either becuase
*it* failed, or because the 'func' that it calls returned an
error. The memory hotplug does the following:
ret = walk_system_ram_range(..., func);
if (ret)
return ret;
and 'ret' makes it out to userspace, eventually. The problem
s, walk_system_ram_range() failues that result from *it* failing
(as opposed to 'func') return -1. That leads to a very odd
-EPERM (-1) return code out to userspace.
Make walk_system_ram_range() return -EINVAL for internal
failures to keep userspace less confused.
This return code is compatible with all the callers that I
audited.
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Bjorn Helgaas <bhelgaas@google.com>
Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc)
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Ross Zwisler <zwisler@kernel.org>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: linux-nvdimm@lists.01.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: Huang Ying <ying.huang@intel.com>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Cc: Borislav Petkov <bp@suse.de>
Cc: Yaowei Bai <baiyaowei@cmss.chinamobile.com>
Cc: Takashi Iwai <tiwai@suse.de>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: linuxppc-dev@lists.ozlabs.org
Cc: Keith Busch <keith.busch@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Zero-copy callback flag is not yet set on frag list skb at the moment
xenvif_handle_frag_list() returns -ENOMEM. This eventually results in
leaking grant ref mappings since xenvif_zerocopy_callback() is never
called for these fragments. Those eventually build up and cause Xen
to kill Dom0 as the slots get reused for new mappings:
"d0v0 Attempt to implicitly unmap a granted PTE c010000329fce005"
That behavior is observed under certain workloads where sudden spikes
of page cache writes coexist with active atomic skb allocations from
network traffic. Additionally, rework the logic to deal with frag_list
deallocation in a single place.
Signed-off-by: Paul Durrant <paul.durrant@citrix.com>
Signed-off-by: Igor Druzhinin <igor.druzhinin@citrix.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
As per RFC 8033, it is sufficient for the drop probability
decay factor to have a value of (1 - 1/64) instead of 98%.
This avoids the need to do slow division.
Suggested-by: David Laight <David.Laight@aculab.com>
Signed-off-by: Leslie Monis <lesliemonis@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
According to Documentation/core-api/printk-formats.rst, size_t should be
printed with %zu, rather than %Zu.
In addition, using %Zu triggers a warning on clang (-Wformat-extra-args):
net/sctp/chunk.c:196:25: warning: data argument not used by format string [-Wformat-extra-args]
__func__, asoc, max_data);
~~~~~~~~~~~~~~~~^~~~~~~~~
./include/linux/printk.h:440:49: note: expanded from macro 'pr_warn_ratelimited'
printk_ratelimited(KERN_WARNING pr_fmt(fmt), ##__VA_ARGS__)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~
./include/linux/printk.h:424:17: note: expanded from macro 'printk_ratelimited'
printk(fmt, ##__VA_ARGS__); \
~~~ ^
Fixes: 5b5e0928f7 ("lib/vsprintf.c: remove %Z support")
Link: https://github.com/ClangBuiltLinux/linux/issues/378
Signed-off-by: Matthias Maennich <maennich@google.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
It can be reproduced by following steps:
1. virtio_net NIC is configured with gso/tso on
2. configure nginx as http server with an index file bigger than 1M bytes
3. use tc netem to produce duplicate packets and delay:
tc qdisc add dev eth0 root netem delay 100ms 10ms 30% duplicate 90%
4. continually curl the nginx http server to get index file on client
5. BUG_ON is seen quickly
[10258690.371129] kernel BUG at net/core/skbuff.c:4028!
[10258690.371748] invalid opcode: 0000 [#1] SMP PTI
[10258690.372094] CPU: 5 PID: 0 Comm: swapper/5 Tainted: G W 5.0.0-rc6 #2
[10258690.372094] RSP: 0018:ffffa05797b43da0 EFLAGS: 00010202
[10258690.372094] RBP: 00000000000005ea R08: 0000000000000000 R09: 00000000000005ea
[10258690.372094] R10: ffffa0579334d800 R11: 00000000000002c0 R12: 0000000000000002
[10258690.372094] R13: 0000000000000000 R14: ffffa05793122900 R15: ffffa0578f7cb028
[10258690.372094] FS: 0000000000000000(0000) GS:ffffa05797b40000(0000) knlGS:0000000000000000
[10258690.372094] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[10258690.372094] CR2: 00007f1a6dc00868 CR3: 000000001000e000 CR4: 00000000000006e0
[10258690.372094] Call Trace:
[10258690.372094] <IRQ>
[10258690.372094] skb_to_sgvec+0x11/0x40
[10258690.372094] start_xmit+0x38c/0x520 [virtio_net]
[10258690.372094] dev_hard_start_xmit+0x9b/0x200
[10258690.372094] sch_direct_xmit+0xff/0x260
[10258690.372094] __qdisc_run+0x15e/0x4e0
[10258690.372094] net_tx_action+0x137/0x210
[10258690.372094] __do_softirq+0xd6/0x2a9
[10258690.372094] irq_exit+0xde/0xf0
[10258690.372094] smp_apic_timer_interrupt+0x74/0x140
[10258690.372094] apic_timer_interrupt+0xf/0x20
[10258690.372094] </IRQ>
In __skb_to_sgvec(), the skb->len is not equal to the sum of the skb's
linear data size and nonlinear data size, thus BUG_ON triggered.
Because the skb is cloned and a part of nonlinear data is split off.
Duplicate packet is cloned in netem_enqueue() and may be delayed
some time in qdisc. When qdisc len reached the limit and returns
NET_XMIT_DROP, the skb will be retransmit later in write queue.
the skb will be fragmented by tso_fragment(), the limit size
that depends on cwnd and mss decrease, the skb's nonlinear
data will be split off. The length of the skb cloned by netem
will not be updated. When we use virtio_net NIC and invoke skb_to_sgvec(),
the BUG_ON trigger.
To fix it, netem returns NET_XMIT_SUCCESS to upper stack
when it clones a duplicate packet.
Fixes: 35d889d1 ("sch_netem: fix skb leak in netem_enqueue()")
Signed-off-by: Sheng Lan <lansheng@huawei.com>
Reported-by: Qin Ji <jiqin.ji@huawei.com>
Suggested-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
i.MX8MQ has clock gate for each GPIO bank, add them
into clock tree for GPIO driver to manage.
Signed-off-by: Anson Huang <Anson.Huang@nxp.com>
Reviewed-by: Fabio Estevam <festevam@gmail.com>
Signed-off-by: Stephen Boyd <sboyd@kernel.org>
If we are not able to reach firmware, enter debugging mode that will
help us to get adapter logs.
Signed-off-by: Arjun Vynipadath <arjun@chelsio.com>
Signed-off-by: Vishal Kulkarni <vishal@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
T6 adapters support outer UDP checksum offload for
encapsulated packets, hence enabling netdev feature flag
NETIF_F_GSO_UDP_TUNNEL_CSUM.
Signed-off-by: Arjun Vynipadath <arjun@chelsio.com>
Signed-off-by: Vishal Kulkarni <vishal@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
GRO is done by cxgb4/cxgb4vf. Hence set NETIF_F_GRO flag for
both cxgb4/cxgb4vf.
Cleaned up VLAN netdev features in cxgb4vf. Also fixed
NETIF_F_HIGHDMA being set unconditionally for vlan netdev
features.
Signed-off-by: Arjun Vynipadath <arjun@chelsio.com>
Signed-off-by: Vishal Kulkarni <vishal@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Building a preprocessed source file for arm64 now always produces
a warning with clang because of the page_to_virt() macro assigning
a variable to itself.
Adding a new temporary variable avoids this issue.
Fixes: 2813b9c029 ("kasan, mm, arm64: tag non slab memory allocated via pagealloc")
Reviewed-by: Andrey Konovalov <andreyknvl@google.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Changed the driver to use the kernel's own implementation.
Signed-off-by: Jeremy Sowden <jeremy@azazel.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
These changes fixed a checkpatch error for space required before the
open brace '{' as well as a warning for suspect code indent for
conditional statements.
Signed-off-by: Oscar Gomez Fuente <oscargomezf@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
When ARCH_MXC get enabled, ARM64_ERRATUM_845719 will be selected and
this warning will happen when COMPAT isn't set.
WARNING: unmet direct dependencies detected for ARM64_ERRATUM_845719
Depends on [n]: COMPAT [=n]
Selected by [y]:
- ARCH_MXC [=y]
Rework to add 'if COMPAT' before ARM64_ERRATUM_845719 gets selected,
since ARM64_ERRATUM_845719 depends on COMPAT.
Acked-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Anders Roxell <anders.roxell@linaro.org>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>