The BCM6358 SoC has only 38 available GPIOs. Fix it.
Signed-off-by: Daniel González Cabanelas <dgcbueu@gmail.com>
Signed-off-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
It is possible for a KOBJ_REMOVE uevent to be sent to userspace way
after the files are actually gone from sysfs, due to how reference
counting for kobjects work. This should not be a problem, but it would
be good to properly send the information when things are going away, not
at some later point in time in the future.
Before this move, if a kobject's parent was torn down before the child,
when the call to kobject_uevent() happened, the parent walk to try to
reconstruct the full path of the kobject could be a total mess and cause
crashes. It's not good to try to tear down a kobject tree from top
down, but let's at least try to not to crash if a user does so.
Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Link: https://lore.kernel.org/r/20200524153041.2361-2-gregkh@linuxfoundation.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
In order to be valid to dereference during the i915_fence_release, after
retiring the fence and releasing its refererences, we assume that
rq->engine can only be a real engine (that stay intact until the device
is shutdown after all fences have been flushed). However, due to a quirk
of preempt-to-busy, we may retire a request that still belongs to a
virtual engine and so eventually free it with rq->engine being invalid.
To avoid dereferencing that invalid engine, we look at the
execution_mask which if it indicates it may be executed on more than one
engine, we know it originated on a virtual engine and may still be on
one.
Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/1906
Fixes: 43acd6516c ("drm/i915: Keep a per-engine request pool")
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20200521140617.30015-2-chris@chris-wilson.co.uk
(cherry picked from commit 32a4605b38)
Signed-off-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Since the removal of the no-semaphore boosting, we rely on timeslicing to
reorder passed inter-dependency hogs across the engines. However, we
require preemption to support timeslicing into user payloads, and not all
machine support preemption so we do not universally enable timeslicing,
even when it would correctly preempt our own inter-engine semaphores.
Since timeslicing and semaphore priority deboosting is now disabled on
Broadwell/Braswell, we have to follow suite and not use semaphores.
Testcase: igt/gem_exec_schedule/semaphore-codependency # bdw/bsw
Fixes: 18e4af04d2 ("drm/i915: Drop no-semaphore boosting")
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20200521140617.30015-1-chris@chris-wilson.co.uk
(cherry picked from commit 0eb670aac2)
Signed-off-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
It is not valid to cache/short out selection of the mux.
mux_control_select() only locks the mux until mux_control_deselect()
is called. mux_control_deselect() may put the mux in some low power
state or some other user of the mux might select it for other purposes.
These things are probably not happening in the original setting where
this driver was developed, but it is said to be a generic SPI mux.
Also, the mux framework will short out the actual low level muxing
operation when/if that is possible.
Fixes: e9e40543ad ("spi: Add generic SPI multiplexer")
Signed-off-by: Peter Rosin <peda@axentia.se>
Link: https://lore.kernel.org/r/20200525104352.26807-1-peda@axentia.se
Signed-off-by: Mark Brown <broonie@kernel.org>
Pointers should be casted to unsigned long to avoid "cast from pointer
to integer of different size" warnings.
drivers/iommu/intel-pasid.c:818:6: warning:
cast from pointer to integer of different size [-Wpointer-to-int-cast]
drivers/iommu/intel-pasid.c:821:9: warning:
cast from pointer to integer of different size [-Wpointer-to-int-cast]
drivers/iommu/intel-pasid.c:824:23: warning:
cast from pointer to integer of different size [-Wpointer-to-int-cast]
drivers/iommu/intel-svm.c:343:45: warning:
cast to pointer from integer of different size [-Wint-to-pointer-cast]
Fixes: b0d1f8741b ("iommu/vt-d: Add nested translation helper function")
Fixes: 56722a4398 ("iommu/vt-d: Add bind guest PASID support")
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Link: https://lore.kernel.org/r/20200519013423.11971-1-baolu.lu@linux.intel.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
The .probe_finalize() call-back of some IOMMU drivers calls into
arm_iommu_attach_device(). This function will call back into the
IOMMU core code, where it tries to take group->mutex again, resulting
in a deadlock.
As there is no reason why .probe_finalize() needs to be called under
that mutex, move it after the lock has been released to fix the
deadlock.
Fixes: deac0b3bed ("iommu: Split off default domain allocation from group assignment")
Reported-by: Yong Wu <yong.wu@mediatek.com>
Tested-by: Yong Wu <yong.wu@mediatek.com>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Cc: Yong Wu <yong.wu@mediatek.com>
Link: https://lore.kernel.org/r/20200519132824.15163-1-joro@8bytes.org
Commit 17539f2f4f ("usb: musb: fix enumeration after resume") replaced
musb_start() in musb_resume() to not override softconnect bit, but it
doesn't restart the session for host port which was done in musb_start().
The session could be disabled in musb_suspend(), which leads the host
port doesn't stay in host mode.
So let's start the session specifically for host port in musb_resume().
Fixes: 17539f2f4f ("usb: musb: fix enumeration after resume")
Cc: stable@vger.kernel.org
Signed-off-by: Bin Liu <b-liu@ti.com>
Link: https://lore.kernel.org/r/20200525025049.3400-3-b-liu@ti.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
When a USB device attached to a hub got disconnected, MUSB controller
generates RXCSR_RX_ERROR interrupt for the 3-strikes-out error.
Currently the MUSB host driver returns -EPROTO in current URB, then the
USB device driver could immediately resubmit the URB which causes MUSB
generate RXCSR_RX_ERROR interrupt again. This circle causes interrupt
storm then the hub never got a chance to report the USB device detach.
To fix the interrupt storm, change the URB return code to -ESHUTDOWN for
MUSB_RXCSR_H_ERROR interrupt, so that the USB device driver will not
immediately resubmit the URB.
Signed-off-by: Bin Liu <b-liu@ti.com>
Link: https://lore.kernel.org/r/20200525025049.3400-2-b-liu@ti.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Felipe writes:
USB: changes for v5.8 merge window
Rather busy cycle. We have a total 99 non-merge commits going into v5.8
merge window. The majority of the changes are in dwc3 this around (31.7%
of all changes). It's composed mostly Thinh's recent updates to get dwc3
to behave correctly with stream transfers. We have also have Roger's for
Keystone platforms and Neil's updates for the meson glue layer.
Apart from those, we have the usual set of non-critical fixes, new
device IDs, spelling fixes all over the place.
Signed-off-by: Felipe Balbi <balbi@kernel.org>
* tag 'usb-for-v5.8' of git://git.kernel.org/pub/scm/linux/kernel/git/balbi/usb: (99 commits)
usb: dwc3: keystone: Turn on USB3 PHY before controller
dt-bindings: usb: ti,keystone-dwc3.yaml: Add USB3.0 PHY property
dt-bindings: usb: convert keystone-usb.txt to YAML
usb: dwc3: gadget: Check for prepared TRBs
usb: gadget: Fix issue with config_ep_by_speed function
usb: cdns3: ep0: delete the redundant status stage
usb: dwc2: Update Core Reset programming flow.
usb: gadget: fsl: Fix a wrong judgment in fsl_udc_probe()
usb: gadget: fix potential double-free in m66592_probe.
usb: cdns3: Fix runtime PM imbalance on error
usb: gadget: lpc32xx_udc: don't dereference ep pointer before null check
usb: dwc3: Increase timeout for CmdAct cleared by device controller
USB: dummy-hcd: use configurable endpoint naming scheme
usb: cdns3: gadget: assign interrupt number to USB gadget structure
usb: gadget: core: sync interrupt before unbind the udc
arm64: dts: qcom: sc7180: Add interconnect properties for USB
arm64: dts: qcom: sdm845: Add interconnect properties for USB
dt-bindings: usb: qcom,dwc3: Introduce interconnect properties for Qualcomm DWC3 driver
ARM: dts: at91: Remove the USB EP child node
dt-bindings: usb: atmel: Mark EP child node as deprecated
...
Filesystems such as btrfs can perform direct I/O without holding the
inode->i_rwsem in some of the cases like writing within i_size. So,
remove the check for lockdep_assert_held() in iomap_dio_rw().
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This helps filesystems to perform tasks on the bio while submitting for
I/O. This could be post-write operations such as data CRC or data
replication for fs-handled RAID.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Export generic_file_buffered_read() to be used to supplement incomplete
direct reads.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This commit adds support for AWINIC AW2013 3-channel LED driver.
The chip supports 3 PWM channels and is controlled with I2C.
Signed-off-by: Nikita Travkin <nikitos.tr@gmail.com>
Signed-off-by: Pavel Machek <pavel@ucw.cz>
Since acpi_s2idle_wake() knows the category of wakeup causing the
system to resume from suspend-to-idle, make it print a unique message
for each of them to help diagnose wakeup issues.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
This driver adds support for Dynamic Platform and Thermal Framework
battery participant device support.
These attributes are presented via sysfs interface under the platform
device for the battery participant:
$ls /sys/bus/platform/devices/INT3532:00/dptf_battery
current_discharge_capbility_ma
max_platform_power_mw
no_load_voltage_mv
high_freq_impedance_mohm
max_steady_state_power_mw
Refer to the documentation at
Documentation/ABI/testing/sysfs-platform-dptf
for details.
Here the implementation reuses existing dptf-power.c as the motivation and
processing is same. It also shares one ACPI method. Here this change is
using participant type, "PTYP" method to identify and do different
processing. By using participant type, create/delete either "dptf_power"
or "dptf_battery" attribute group and send notifications.
The particpant type for for the battery participant is 0x0C.
ACPI methods description:
PMAX (Intel(R) Dynamic Tuning Platform Max Power Supplied by Battery):
This object evaluates to the maximum platform power that can be supported
by the battery in milli watts.
PBSS (Intel(R) Dynamic Tuning Power Battery Steady State):
This object returns the max sustained power for battery in milli watts.
RBHF (Intel(R) Dynamic Tuning High Frequency Impedance):
This object returns high frequency impedance value that can be obtained
from battery fuel gauge.
VBNL (Intel(R) Dynamic Tuning No-Load Voltage)
This object returns battery instantaneous no-load voltage that can be
obtained from battery fuel gauge in milli volts
CMPP (Intel(R) Dynamic Tuning Current Discharge Capability)
This object returns battery discharge current capability obtained from
battery fuel gauge milli amps.
Notifications:
0x80: PMAX change. Used to notify Intel(R)Dynamic Tuning Battery
participant driver when the PMAX has changed by 250mw.
0x83: PBSS change. Used to notify Intel(R) Dynamic Tuning Battery
participant driver when the power source has changed.
0x85: RBHF change. Used to notify Intel(R)Dynamic Tuning Battery
participant driver when the RBHF has changed over a threshold by
5mOhm.
0x86: Battery Capability change. Used to notify Intel(R)Dynamic Tuning
Battery participant driver when the battery capability has changed.
Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
[ rjw: Subject ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Add two additional attributes to the existing power participant driver:
rest_of_platform_power_mw: (RO) Shows the rest of worst case platform
power in mW outside of S0C. This will help in power distribution to SoC
and rest of the system. For example on a test system, this value is 2.5W
with a 15W TDP SoC. Based on the adapter rating (adapter_rating_mw), user
space software can decide on proper power allocation to SoC to improve
short term performance via powercap/RAPL interface.
prochot_confirm: (WO) Confirm EC about a prochot notification.
Also userspace is notified via sysfs_notify(), whenever power source or
rest of the platform power is changed. So user space can use poll()
system call on those attributes.
The ACPI methods used in this patch are as follows:
PROP
This object evaluates to the rest of worst case platform power in mW.
Bits:
23:0 Worst case rest of platform power in mW.
PBOK
PBOK is a method designed to provide a mechanism for OSPM to change power
setting before EC can de-assert a PROCHOT from a device. The EC may
receive several PROCHOTs, so it has a sequence number attached to PSRC
(read via existing attribute "platform_power_source"). Once OSPM takes
action for a PSRC change notification, it can call PBOK method to confirm
with the sequence number.
Bits:
3:0 Power Delivery State Change Sequence number
30 Reserved
31 0 – Not OK to de-assert PROCHOT
1 – OK to de-assert PROCHOT
PSRC (Platform Power Source): Not new in this patch but for
documentation for new bits
This object evaluates to an integer that represents the system power
source as well as the power delivery state change sequence number.
Bits:
3:0 The current power source as an integer for AC, DC, USB, Wireless.
0 = DC, 1 = AC, 2 = USB, 3 = Wireless Charging
7:4 Power Delivery State Change Sequence Number. Default value is 0
Notifications:
0x81: (Power State Change) Used to notify when the power source has
changed.
0x84: (PROP change) Used to notify when the platform rest of power has
changed.
Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
[ rjw: Subject, minor ABI documentation edit ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Currently, changing the brightness of the internal display of the Acer
TravelMate 5735Z does not work. Pressing the function keys or changing the
slider, GNOME Shell 3.36.2 displays the OSD (five steps), but the
brightness does not change.
The Acer TravelMate 5735Z shipped with Windows 7 and as such does not
trigger our "win8 ready" heuristic for preferring the native backlight
interface.
Still ACPI backlight control doesn't work on this model, where as the
native (intel_video) backlight interface does work by adding
`acpi_backlight=native` or `acpi_backlight=none` to Linux’ command line.
So, add a quirk to force using native backlight control on this model.
Link: https://bugzilla.kernel.org/show_bug.cgi?id=207835
Reviewed-by: Hans de Goede <hdegoede@redhat.com>
Signed-off-by: Paul Menzel <pmenzel@molgen.mpg.de>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Perhaps by some historical reasons the IRQ support has been allowed
only for built-in driver. However, there is nothing prevents us
to build it as module an use as IRQ chip.
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Bartosz Golaszewski <bgolaszewski@baylibre.com>
For multiple PLIC instances, the plic_init() is called once for each
PLIC instance. Due to this we have two issues:
1. cpuhp_setup_state() is called multiple times
2. plic_starting_cpu() can crash for boot CPU if cpuhp_setup_state()
is called before boot CPU PLIC handler is available.
Address both issues by only initializing the HP notifiers when
the boot CPU setup is complete.
Fixes: f1ad1133b1 ("irqchip/sifive-plic: Add support for multiple PLICs")
Signed-off-by: Anup Patel <anup.patel@wdc.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Reviewed-by: Palmer Dabbelt <palmerdabbelt@google.com>
Acked-by: Palmer Dabbelt <palmerdabbelt@google.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20200518091441.94843-3-anup.patel@wdc.com
Add support for Dell Wireless 5816e Download Mode (AKA boot & hold mode /
QDL download mode) to drivers/usb/serial/qcserial.c
This is required to update device firmware.
Signed-off-by: Matt Jolly <Kangie@footclan.ninja>
Cc: stable@vger.kernel.org
Signed-off-by: Johan Hovold <johan@kernel.org>
For multiple PLIC instances, each PLIC can only target a subset of
CPUs which is represented by "lmask" in the "struct plic_priv".
Currently, the default irq affinity for each PLIC interrupt is all
online CPUs which is illegal value for default irq affinity when we
have multiple PLIC instances. To fix this, we now set "lmask" as the
default irq affinity in for each interrupt in plic_irqdomain_map().
Fixes: f1ad1133b1 ("irqchip/sifive-plic: Add support for multiple PLICs")
Signed-off-by: Anup Patel <anup.patel@wdc.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Reviewed-by: Palmer Dabbelt <palmerdabbelt@google.com>
Acked-by: Palmer Dabbelt <palmerdabbelt@google.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20200518091441.94843-2-anup.patel@wdc.com
(E)PPIs are per-CPU interrupts, so we want each CPU to go and enable them
via enable_percpu_irq(); this also means we want IRQ_NOAUTOEN for them as
the autoenable would lead to calling irq_enable() instead of the more
appropriate irq_percpu_enable().
Calling irq_set_percpu_devid() is enough to get just that since it trickles
down to irq_set_percpu_devid_flags(), which gives us IRQ_NOAUTOEN (and a
few others). Setting IRQ_NOAUTOEN *again* right after this call is just
redundant, so don't do it.
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20200521223500.834-1-valentin.schneider@arm.com
Since commit 1afb648e94 ("btrfs: use standard debug config option to
enable free-space-cache debug prints"), we started to log error messages
that were never logged before since there was no DEBUG macro defined
anywhere. This started to make test case btrfs/187 to fail very often,
as it greps for any btrfs error messages in dmesg/syslog and fails if
any is found:
(...)
btrfs/186 1s ... 2s
btrfs/187 - output mismatch (see .../results//btrfs/187.out.bad)
\--- tests/btrfs/187.out 2019-05-17 12:48:32.537340749 +0100
\+++ /home/fdmanana/git/hub/xfstests/results//btrfs/187.out.bad ...
\@@ -1,3 +1,8 @@
QA output created by 187
Create a readonly snapshot of 'SCRATCH_MNT' in 'SCRATCH_MNT/snap1'
Create a readonly snapshot of 'SCRATCH_MNT' in 'SCRATCH_MNT/snap2'
+[268364.139958] BTRFS error (device sdc): failed to write free space cache for block group 30408704
+[268380.156503] BTRFS error (device sdc): failed to write free space cache for block group 30408704
+[268380.161703] BTRFS error (device sdc): failed to write free space cache for block group 30408704
+[268380.253180] BTRFS error (device sdc): failed to write free space cache for block group 30408704
...
(Run 'diff -u /home/fdmanana/git/hub/xfstests/tests/btrfs/187.out ...
btrfs/188 4s ... 2s
(...)
The space cache write failures happen due to ENOSPC when attempting to
update the free space cache items in the root tree. This happens because
when starting or joining a transaction we don't know how many block
groups we will end up changing (due to extent allocation or release) and
therefore never reserve space for updating free space cache items.
More often than not, the free space cache writeout succeeds since the
metadata space info is not yet full nor very close to being full, but
when it is, the space cache writeout fails with ENOSPC.
Occasional failures to write space caches are not considered critical
since they can be rebuilt when mounting the filesystem or the next
attempt to write a free space cache in the next transaction commit might
succeed, so we used to hide those error messages with a preprocessor
check for the existence of the DEBUG macro that was never enabled
anywhere.
A few other generic test cases also trigger the error messages due to
ENOSPC failure when writing free space caches as well, however they don't
fail since they don't grep dmesg/syslog for any btrfs specific error
messages.
So change the messages from 'error' level to 'debug' level, as it doesn't
make much sense to have error messages triggered only if the debug macro
is enabled plus, more importantly, the error is not serious nor highly
unexpected.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently the error messages logged when we fail to write a free space
cache or an inode cache are not very useful as they don't mention what
was the error. So include the error number in the messages.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The label 'fail_unlock' is pointless, all it does is to jump to the label
'out', so just remove it.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We are currently treating any non-zero return value from btrfs_next_leaf()
the same way, by going to the code that inserts a new checksum item in the
tree. However if btrfs_next_leaf() returns an error (a value < 0), we
should just stop and return the error, and not behave as if nothing has
happened, since in that case we do not have a way to know if there is a
next leaf or we are currently at the last leaf already.
So fix that by returning the error from btrfs_next_leaf().
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When we want to add checksums into the checksums tree, or a log tree, we
try whenever possible to extend existing checksum items, as this helps
reduce amount of metadata space used, since adding a new item uses extra
metadata space for a btrfs_item structure (25 bytes).
However we have two inefficiencies in the current approach:
1) After finding a checksum item that covers a range with an end offset
that matches the start offset of the checksum range we want to insert,
we release the search path populated by btrfs_lookup_csum() and then
do another COW search on tree with the goal of getting additional
space for at least one checksum. Doing this path release and then
searching again is a waste of time because very often the leaf already
has enough free space for at least one more checksum;
2) After the COW search that guarantees we get free space in the leaf for
at least one more checksum, we end up not doing the extension of the
previous checksum item, and fallback to insertion of a new checksum
item, if the leaf doesn't have an amount of free space larger then the
space required for 2 checksums plus one btrfs_item structure - this is
pointless for two reasons:
a) We want to extend an existing item, so we don't need to account for
a btrfs_item structure (25 bytes);
b) We made the COW search with an insertion size for 1 single checksum,
so if the leaf ends up with a free space amount smaller then 2
checksums plus the size of a btrfs_item structure, we give up on the
extension of the existing item and jump to the 'insert' label, where
we end up releasing the path and then doing yet another search to
insert a new checksum item for a single checksum.
Fix these inefficiencies by doing the following:
- For case 1), before releasing the path just check if the leaf already
has enough space for at least 1 more checksum, and if it does, jump
directly to the item extension code, with releasing our current path,
which was already COWed by btrfs_lookup_csum();
- For case 2), fix the logic so that for item extension we require only
that the leaf has enough free space for 1 checksum, and not a minimum
of 2 checksums plus space for a btrfs_item structure.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When we have extents shared amongst different inodes in the same subvolume,
if we fsync them in parallel we can end up with checksum items in the log
tree that represent ranges which overlap.
For example, consider we have inodes A and B, both sharing an extent that
covers the logical range from X to X + 64KiB:
1) Task A starts an fsync on inode A;
2) Task B starts an fsync on inode B;
3) Task A calls btrfs_csum_file_blocks(), and the first search in the
log tree, through btrfs_lookup_csum(), returns -EFBIG because it
finds an existing checksum item that covers the range from X - 64KiB
to X;
4) Task A checks that the checksum item has not reached the maximum
possible size (MAX_CSUM_ITEMS) and then releases the search path
before it does another path search for insertion (through a direct
call to btrfs_search_slot());
5) As soon as task A releases the path and before it does the search
for insertion, task B calls btrfs_csum_file_blocks() and gets -EFBIG
too, because there is an existing checksum item that has an end
offset that matches the start offset (X) of the checksum range we want
to log;
6) Task B releases the path;
7) Task A does the path search for insertion (through btrfs_search_slot())
and then verifies that the checksum item that ends at offset X still
exists and extends its size to insert the checksums for the range from
X to X + 64KiB;
8) Task A releases the path and returns from btrfs_csum_file_blocks(),
having inserted the checksums into an existing checksum item that got
its size extended. At this point we have one checksum item in the log
tree that covers the logical range from X - 64KiB to X + 64KiB;
9) Task B now does a search for insertion using btrfs_search_slot() too,
but it finds that the previous checksum item no longer ends at the
offset X, it now ends at an of offset X + 64KiB, so it leaves that item
untouched.
Then it releases the path and calls btrfs_insert_empty_item()
that inserts a checksum item with a key offset corresponding to X and
a size for inserting a single checksum (4 bytes in case of crc32c).
Subsequent iterations end up extending this new checksum item so that
it contains the checksums for the range from X to X + 64KiB.
So after task B returns from btrfs_csum_file_blocks() we end up with
two checksum items in the log tree that have overlapping ranges, one
for the range from X - 64KiB to X + 64KiB, and another for the range
from X to X + 64KiB.
Having checksum items that represent ranges which overlap, regardless of
being in the log tree or in the chekcsums tree, can lead to problems where
checksums for a file range end up not being found. This type of problem
has happened a few times in the past and the following commits fixed them
and explain in detail why having checksum items with overlapping ranges is
problematic:
27b9a8122f "Btrfs: fix csum tree corruption, duplicate and outdated checksums"
b84b8390d6 "Btrfs: fix file read corruption after extent cloning and fsync"
40e046acbd "Btrfs: fix missing data checksums after replaying a log tree"
Since this specific instance of the problem can only happen when logging
inodes, because it is the only case where concurrent attempts to insert
checksums for the same range can happen, fix the issue by using an extent
io tree as a range lock to serialize checksum insertion during inode
logging.
This issue could often be reproduced by the test case generic/457 from
fstests. When it happens it produces the following trace:
BTRFS critical (device dm-0): corrupt leaf: root=18446744073709551610 block=30625792 slot=42, csum end range (15020032) goes beyond the start range (15015936) of the next csum item
BTRFS info (device dm-0): leaf 30625792 gen 7 total ptrs 49 free space 2402 owner 18446744073709551610
BTRFS info (device dm-0): refs 1 lock (w:0 r:0 bw:0 br:0 sw:0 sr:0) lock_owner 0 current 15884
item 0 key (18446744073709551606 128 13979648) itemoff 3991 itemsize 4
item 1 key (18446744073709551606 128 13983744) itemoff 3987 itemsize 4
item 2 key (18446744073709551606 128 13987840) itemoff 3983 itemsize 4
item 3 key (18446744073709551606 128 13991936) itemoff 3979 itemsize 4
item 4 key (18446744073709551606 128 13996032) itemoff 3975 itemsize 4
item 5 key (18446744073709551606 128 14000128) itemoff 3971 itemsize 4
(...)
BTRFS error (device dm-0): block=30625792 write time tree block corruption detected
------------[ cut here ]------------
WARNING: CPU: 1 PID: 15884 at fs/btrfs/disk-io.c:539 btree_csum_one_bio+0x268/0x2d0 [btrfs]
Modules linked in: btrfs dm_thin_pool ...
CPU: 1 PID: 15884 Comm: fsx Tainted: G W 5.6.0-rc7-btrfs-next-58 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014
RIP: 0010:btree_csum_one_bio+0x268/0x2d0 [btrfs]
Code: c7 c7 ...
RSP: 0018:ffffbb0109e6f8e0 EFLAGS: 00010296
RAX: 0000000000000000 RBX: ffffe1c0847b6080 RCX: 0000000000000000
RDX: 0000000000000000 RSI: ffffffffaa963988 RDI: 0000000000000001
RBP: ffff956a4f4d2000 R08: 0000000000000000 R09: 0000000000000001
R10: 0000000000000526 R11: 0000000000000000 R12: ffff956a5cd28bb0
R13: 0000000000000000 R14: ffff956a649c9388 R15: 000000011ed82000
FS: 00007fb419959e80(0000) GS:ffff956a7aa00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000fe6d54 CR3: 0000000138696005 CR4: 00000000003606e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
btree_submit_bio_hook+0x67/0xc0 [btrfs]
submit_one_bio+0x31/0x50 [btrfs]
btree_write_cache_pages+0x2db/0x4b0 [btrfs]
? __filemap_fdatawrite_range+0xb1/0x110
do_writepages+0x23/0x80
__filemap_fdatawrite_range+0xd2/0x110
btrfs_write_marked_extents+0x15e/0x180 [btrfs]
btrfs_sync_log+0x206/0x10a0 [btrfs]
? kmem_cache_free+0x315/0x3b0
? btrfs_log_inode+0x1e8/0xf90 [btrfs]
? __mutex_unlock_slowpath+0x45/0x2a0
? lockref_put_or_lock+0x9/0x30
? dput+0x2d/0x580
? dput+0xb5/0x580
? btrfs_sync_file+0x464/0x4d0 [btrfs]
btrfs_sync_file+0x464/0x4d0 [btrfs]
do_fsync+0x38/0x60
__x64_sys_fsync+0x10/0x20
do_syscall_64+0x5c/0x280
entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x7fb41953a6d0
Code: 48 3d ...
RSP: 002b:00007ffcc86bd218 EFLAGS: 00000246 ORIG_RAX: 000000000000004a
RAX: ffffffffffffffda RBX: 000000000000000d RCX: 00007fb41953a6d0
RDX: 0000000000000009 RSI: 0000000000040000 RDI: 0000000000000003
RBP: 0000000000040000 R08: 0000000000000001 R09: 0000000000000009
R10: 0000000000000064 R11: 0000000000000246 R12: 0000556cf4b2c060
R13: 0000000000000100 R14: 0000000000000000 R15: 0000556cf322b420
irq event stamp: 0
hardirqs last enabled at (0): [<0000000000000000>] 0x0
hardirqs last disabled at (0): [<ffffffffa96bdedf>] copy_process+0x74f/0x2020
softirqs last enabled at (0): [<ffffffffa96bdedf>] copy_process+0x74f/0x2020
softirqs last disabled at (0): [<0000000000000000>] 0x0
---[ end trace d543fc76f5ad7fd8 ]---
In that trace the tree checker detected the overlapping checksum items at
the time when we triggered writeback for the log tree when syncing the
log.
Another trace that can happen is due to BUG_ON() when deleting checksum
items while logging an inode:
BTRFS critical (device dm-0): slot 81 key (18446744073709551606 128 13635584) new key (18446744073709551606 128 13635584)
BTRFS info (device dm-0): leaf 30949376 gen 7 total ptrs 98 free space 8527 owner 18446744073709551610
BTRFS info (device dm-0): refs 4 lock (w:1 r:0 bw:0 br:0 sw:1 sr:0) lock_owner 13473 current 13473
item 0 key (257 1 0) itemoff 16123 itemsize 160
inode generation 7 size 262144 mode 100600
item 1 key (257 12 256) itemoff 16103 itemsize 20
item 2 key (257 108 0) itemoff 16050 itemsize 53
extent data disk bytenr 13631488 nr 4096
extent data offset 0 nr 131072 ram 131072
(...)
------------[ cut here ]------------
kernel BUG at fs/btrfs/ctree.c:3153!
invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC PTI
CPU: 1 PID: 13473 Comm: fsx Not tainted 5.6.0-rc7-btrfs-next-58 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014
RIP: 0010:btrfs_set_item_key_safe+0x1ea/0x270 [btrfs]
Code: 0f b6 ...
RSP: 0018:ffff95e3889179d0 EFLAGS: 00010282
RAX: 0000000000000000 RBX: 0000000000000051 RCX: 0000000000000000
RDX: 0000000000000000 RSI: ffffffffb7763988 RDI: 0000000000000001
RBP: fffffffffffffff6 R08: 0000000000000000 R09: 0000000000000001
R10: 00000000000009ef R11: 0000000000000000 R12: ffff8912a8ba5a08
R13: ffff95e388917a06 R14: ffff89138dcf68c8 R15: ffff95e388917ace
FS: 00007fe587084e80(0000) GS:ffff8913baa00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fe587091000 CR3: 0000000126dac005 CR4: 00000000003606e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
btrfs_del_csums+0x2f4/0x540 [btrfs]
copy_items+0x4b5/0x560 [btrfs]
btrfs_log_inode+0x910/0xf90 [btrfs]
btrfs_log_inode_parent+0x2a0/0xe40 [btrfs]
? dget_parent+0x5/0x370
btrfs_log_dentry_safe+0x4a/0x70 [btrfs]
btrfs_sync_file+0x42b/0x4d0 [btrfs]
__x64_sys_msync+0x199/0x200
do_syscall_64+0x5c/0x280
entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x7fe586c65760
Code: 00 f7 ...
RSP: 002b:00007ffe250f98b8 EFLAGS: 00000246 ORIG_RAX: 000000000000001a
RAX: ffffffffffffffda RBX: 00000000000040e1 RCX: 00007fe586c65760
RDX: 0000000000000004 RSI: 0000000000006b51 RDI: 00007fe58708b000
RBP: 0000000000006a70 R08: 0000000000000003 R09: 00007fe58700cb61
R10: 0000000000000100 R11: 0000000000000246 R12: 00000000000000e1
R13: 00007fe58708b000 R14: 0000000000006b51 R15: 0000558de021a420
Modules linked in: dm_log_writes ...
---[ end trace c92a7f447a8515f5 ]---
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The inode lookup starting at btrfs_iget takes the full location key,
while only the objectid is used to match the inode, because the lookup
happens inside the given root thus the inode number is unique.
The entire location key is properly set up in btrfs_init_locked_inode.
Simplify the helpers and pass only inode number, renaming it to 'ino'
instead of 'objectid'. This allows to remove temporary variables key,
saving some stack space.
Signed-off-by: David Sterba <dsterba@suse.com>
The main function to lookup a root by its id btrfs_get_fs_root takes the
whole key, while only using the objectid. The value of offset is preset
to (u64)-1 but not actually used until btrfs_find_root that does the
actual search.
Switch btrfs_get_fs_root to use only objectid and remove all local
variables that existed just for the lookup. The actual key for search is
set up in btrfs_get_fs_root, reusing another key variable.
Signed-off-by: David Sterba <dsterba@suse.com>