Kernel logs are very important for the forensic investigations of the
issues in general make it easy to use it. This patch adds 'balance:'
prefix so that it can be easily searched.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
* The simple 'flags' refer to the btrfs inode
* ... that's in 'binode
* the FS_*_FL variables are 'fsflags'
* the old copies of the variable are prefixed by 'old_'
* Struct inode flags contain 'i_flags'.
Signed-off-by: David Sterba <dsterba@suse.com>
The new ioctl is an extension to the FS_IOC_SETFLAGS and adds new
flags and is extensible. Don't get fooled by the XATTR in the name, it
does not have anything in common with the extended attributes,
incidentally also abbreviated as XATTRs.
This patch allows to set the xflags portion of the fsxattr structure,
other items have no meaning and non-zero values will result in
EOPNOTSUPP.
Currently supported xflags:
- APPEND
- IMMUTABLE
- NOATIME
- NODUMP
- SYNC
The structure of btrfs_ioctl_fssetxattr copies btrfs_ioctl_setflags but
is simpler on the flag setting side.
The original patch was written by Chandan Jay Sharma but was incomplete
and no further revision has been sent.
Based-on-patches-by: Chandan Jay Sharma <chandansbg@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The new ioctl is an extension to the FS_IOC_GETFLAGS and adds new
flags and is extensible. This patch allows to return the xflags portion
of the fsxattr structure, other items have no meaning for btrfs or can
be added later.
The original patch was written by Chandan Jay Sharma but was incomplete
and no further revision has been sent. Several cleanups were necessary
to avoid confusion with other ioctls, as we have another flavor of
flags.
Based-on-patches-by: Chandan Jay Sharma <chandansbg@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The FS_*_FL flags cannot be easily identified by a prefix but we still
need to recognize them so the 'fsflags' should be closer to the naming
scheme but again the 'fs' part sounds like it's a filesystem flag. I
don't have a better idea for now.
Signed-off-by: David Sterba <dsterba@suse.com>
The FS_*_FL flags cannot be easily identified by a variable name prefix
but we still need to recognize them so the 'fsflags' should be closer to
the naming scheme but again the 'fs' part sounds like it's a filesystem
flag. I don't have a better idea for now.
Signed-off-by: David Sterba <dsterba@suse.com>
Use a local btrfs_fs_devices variable to access the structure.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Delete the uuid_mutex lock here as this thread accesses the
btrfs_fs_devices::devices only (counters or called functions do a list
traversal). And the device_list_mutex lock is already taken.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_dev_replace_finishing updates devices (soruce and target) which
are within the btrfs_fs_devices::devices or withint the cloned seed
devices (btrfs_fs_devices::seed::devices), so we don't need the global
uuid_mutex.
The device replace context is also locked by its own locks.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_open_devices() is using the uuid_mutex, but as btrfs_open_devices
is just limited to openning all the devices under for given fsid, so we
don't need uuid_mutex.
Instead it should hold the device_list_mutex as it updates the members
of the btrfs_fs_devices and btrfs_device and not the whole fs_devs list.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
read_chunk_tree() calls read_one_dev(), but for seed device we have
to search the fs_uuids list, so we need the uuid_mutex. Add a comment
comment, so that we can improve this part.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Instead of de-referencing the device->fs_devices use cur_devices
which points to the same fs_devices and does not change.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The generic block device lookup or cleanup does not need the uuid mutex,
that's only for the device_list_add.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
This function is no longer used outside of inode.c so just make it
static. At the same time give a more becoming name, since it's not
really invalidating the inodes but just calling d_prune_alias. Last,
but not least - move the function above the sole caller to avoid
introducing yet-another-pointless forward declaration.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Use the wrappers and reduce the amount of low-level details about the
waitqueue management.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently the code assumes that there's an implied barrier by the
sequence of code preceding the wakeup, namely the mutex unlock.
As Nikolay pointed out:
I think this is wrong (not your code) but the original assumption that
the RELEASE semantics provided by mutex_unlock is sufficient.
According to memory-barriers.txt:
Section 'LOCK ACQUISITION FUNCTIONS' states:
(2) RELEASE operation implication:
Memory operations issued before the RELEASE will be completed before the
RELEASE operation has completed.
Memory operations issued after the RELEASE *may* be completed before the
RELEASE operation has completed.
(I've bolded the may portion)
The example given there:
As an example, consider the following:
*A = a;
*B = b;
ACQUIRE
*C = c;
*D = d;
RELEASE
*E = e;
*F = f;
The following sequence of events is acceptable:
ACQUIRE, {*F,*A}, *E, {*C,*D}, *B, RELEASE
So if we assume that *C is modifying the flag which the waitqueue is checking,
and *E is the actual wakeup, then those accesses can be re-ordered...
IMHO this code should be considered broken...
---
To be on the safe side, add the barriers. The synchronization logic
around log using the mutexes and several other threads does not make it
easy to reason for/against the barrier.
CC: Nikolay Borisov <nborisov@suse.com>
Link: https://lkml.kernel.org/r/6ee068d8-1a69-3728-00d1-d86293d43c9f@suse.com
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Add convenience wrappers for the waitqueue management that involves
memory barriers to prevent deadlocks. The helpers will let us remove
barriers and the necessary comments in several places.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Under the following case, qgroup rescan can double account cowed tree
blocks:
In this case, extent tree only has one tree block.
-
| transid=5 last committed=4
| btrfs_qgroup_rescan_worker()
| |- btrfs_start_transaction()
| | transid = 5
| |- qgroup_rescan_leaf()
| |- btrfs_search_slot_for_read() on extent tree
| Get the only extent tree block from commit root (transid = 4).
| Scan it, set qgroup_rescan_progress to the last
| EXTENT/META_ITEM + 1
| now qgroup_rescan_progress = A + 1.
|
| fs tree get CoWed, new tree block is at A + 16K
| transid 5 get committed
-
| transid=6 last committed=5
| btrfs_qgroup_rescan_worker()
| btrfs_qgroup_rescan_worker()
| |- btrfs_start_transaction()
| | transid = 5
| |- qgroup_rescan_leaf()
| |- btrfs_search_slot_for_read() on extent tree
| Get the only extent tree block from commit root (transid = 5).
| scan it using qgroup_rescan_progress (A + 1).
| found new tree block beyong A, and it's fs tree block,
| account it to increase qgroup numbers.
-
In above case, tree block A, and tree block A + 16K get accounted twice,
while qgroup rescan should stop when it already reach the last leaf,
other than continue using its qgroup_rescan_progress.
Such case could happen by just looping btrfs/017 and with some
possibility it can hit such double qgroup accounting problem.
Fix it by checking the path to determine if we should finish qgroup
rescan, other than relying on next loop to exit.
Reported-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When doing qgroup rescan using the following script (modified from
btrfs/017 test case), we can sometimes hit qgroup corruption.
------
umount $dev &> /dev/null
umount $mnt &> /dev/null
mkfs.btrfs -f -n 64k $dev
mount $dev $mnt
extent_size=8192
xfs_io -f -d -c "pwrite 0 $extent_size" $mnt/foo > /dev/null
btrfs subvolume snapshot $mnt $mnt/snap
xfs_io -f -c "reflink $mnt/foo" $mnt/foo-reflink > /dev/null
xfs_io -f -c "reflink $mnt/foo" $mnt/snap/foo-reflink > /dev/null
xfs_io -f -c "reflink $mnt/foo" $mnt/snap/foo-reflink2 > /dev/unll
btrfs quota enable $mnt
# -W is the new option to only wait rescan while not starting new one
btrfs quota rescan -W $mnt
btrfs qgroup show -prce $mnt
umount $mnt
# Need to patch btrfs-progs to report qgroup mismatch as error
btrfs check $dev || _fail
------
For fast machine, we can hit some corruption which missed accounting
tree blocks:
------
qgroupid rfer excl max_rfer max_excl parent child
-------- ---- ---- -------- -------- ------ -----
0/5 8.00KiB 0.00B none none --- ---
0/257 8.00KiB 0.00B none none --- ---
------
This is due to the fact that we're always searching commit root for
btrfs_find_all_roots() at qgroup_rescan_leaf(), but the leaf we get is
from current transaction, not commit root.
And if our tree blocks get modified in current transaction, we won't
find any owner in commit root, thus causing the corruption.
Fix it by searching commit root for extent tree for
qgroup_rescan_leaf().
Reported-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[spotted while going through ->d_fsdata handling around d_splice_alias();
don't really care which tree that goes through]
The only thing even looking at ->d_fsdata in there (since 2012)
had been kfree(dentry->d_fsdata) in btrfs_dentry_delete(). Which,
incidentally, is all btrfs_dentry_delete() does.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There are already 2 reports about strangely corrupted super blocks,
where csum still matches but extra garbage gets slipped into super block.
The corruption would looks like:
------
superblock: bytenr=65536, device=/dev/sdc1
---------------------------------------------------------
csum_type 41700 (INVALID)
csum 0x3b252d3a [match]
bytenr 65536
flags 0x1
( WRITTEN )
magic _BHRfS_M [match]
...
incompat_flags 0x5b22400000000169
( MIXED_BACKREF |
COMPRESS_LZO |
BIG_METADATA |
EXTENDED_IREF |
SKINNY_METADATA |
unknown flag: 0x5b22400000000000 )
...
------
Or
------
superblock: bytenr=65536, device=/dev/mapper/x
---------------------------------------------------------
csum_type 35355 (INVALID)
csum_size 32
csum 0xf0dbeddd [match]
bytenr 65536
flags 0x1
( WRITTEN )
magic _BHRfS_M [match]
...
incompat_flags 0x176d200000000169
( MIXED_BACKREF |
COMPRESS_LZO |
BIG_METADATA |
EXTENDED_IREF |
SKINNY_METADATA |
unknown flag: 0x176d200000000000 )
------
Obviously, csum_type and incompat_flags get some garbage, but its csum
still matches, which means kernel calculates the csum based on corrupted
super block memory.
And after manually fixing these values, the filesystem is completely
healthy without any problem exposed by btrfs check.
Although the cause is still unknown, at least detect it and prevent further
corruption.
Both reports have same symptoms, there's an overwrite on offset 192 of
the superblock, by 4 bytes. The superblock structure is not allocated or
freed and stays in the memory for the whole filesystem lifetime, so it's
not a use-after-free kind of error on someone else's leaked page.
As a vague point for the problable cause is mentioning of other system
freezing related to graphic card drivers.
Reported-by: Ken Swenson <flat@imo.uto.moe>
Reported-by: Ben Parsons <9parsonsb@gmail.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add brief analysis of the reports ]
Signed-off-by: David Sterba <dsterba@suse.com>
Refactor btrfs_check_super_valid:
1) Rename it to btrfs_validate_mount_super()
Now it's more obvious when the function should be called.
2) Extract core check routine into validate_super()
Later write time check can reuse it, and if needed, we could also
use validate_super() to check each super block.
3) Add more comments about btrfs_validate_mount_super()
Mostly about what it doesn't check and when it should be called.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ rename to validate_super ]
Signed-off-by: David Sterba <dsterba@suse.com>
Move btrfs_check_super_valid() before its single caller to avoid forward
declaration.
Though such code motion is not recommended as it pollutes git history,
in this case the following patches would need to add new forward
declarations for static functions that we want to avoid.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This function always takes a transaction handle which contains a
reference to the fs_info. Use that and remove the extra argument.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This function takes a transaction handle which already contains a
reference to the fs_info. So use it and remove the extra function
argument.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This function alreay takes a transaction handle which holds a reference
to the fs_info. Use that and remove the extra argument.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This function takes a transaction handle which holds a reference to
fs_info. So use that and remove the extra argument.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This function takes a transaction handle which already has a reference
to the fs_info. Use it and remove the extra argument.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This function always takes a transaction handle which references the
fs_info structure. So use that and remove the extra argument.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This function already takes a transaction which has a reference to the
fs_info. So use that and remove the extra argument.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This function already takes a transaction handle which has a reference
to the fs_info. So use that and remove the extra argument.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This function always takes a transaction handle which contains a
reference to fs_info. So use that and kill the extra argument.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This function already takes a transaction handle which contains a
reference to fs_info. So use that and remove the extra argument.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This function always takes a trans handle which contains a reference to
the fs_info. Use that and remove the extra argument.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This function also takes a btrfs_block_group_cache which contains a
referene to the fs_info. So use that and remove the extra argument.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This function already takes trans handle from where fs_info can be
referenced. Remove the redundant parameter.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This function already takes a transaction handle which contains a
reference to fs_info.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This function already takes a transaction handle which has a reference
to the fs_info.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We also pass in a transaction handle which has a reference to the
fs_info. Just remove the extraneous argument.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This will be necessary for future cleanups which remove the fs_info
argument from some freespace tree functions.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The invariant is that when nr_delalloc_inodes is 0 then the root
mustn't have any inodes on its delalloc inodes list.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently when checking if a directory can be deleted, we always check
if all its children have been processed.
Example: A directory with 2,000,000 files was deleted
original: 1994m57.071s
patch: 1m38.554s
[FIX]
Instead of checking all children on all calls to can_rmdir(), we keep
track of the directory index offset of the child last checked in the
last call to can_rmdir(), and then use it as the starting point for
future calls to can_rmdir().
Signed-off-by: Robbie Ko <robbieko@synology.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
Move the allocation after the search when it's clear that the new entry
will be added.
Signed-off-by: Robbie Ko <robbieko@synology.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
add_delayed_ref_head really performed 2 independent operations -
initialisting the ref head and adding it to a list. Now that the init
part is in a separate function let's complete the separation between
both operations. This results in a lot simpler interface for
add_delayed_ref_head since the function now deals solely with either
adding the newly initialised delayed ref head or merging it into an
existing delayed ref head. This results in vastly simplified function
signature since 5 arguments are dropped. The only other thing worth
mentioning is that due to this split the WARN_ON catching reinit of
existing. In this patch the condition is extended such that:
qrecord && head_ref->qgroup_ref_root && head_ref->qgroup_reserved
is added. This is done because the two qgroup_* prefixed member are
set only if both ref_root and reserved are passed. So functionally
it's equivalent to the old WARN_ON and allows to remove the two args
from add_delayed_ref_head.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Use the newly introduced function when initialising the head_ref in
add_delayed_ref_head. No functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
add_delayed_ref_head implements the logic to both initialize a head_ref
structure as well as perform the necessary operations to add it to the
delayed ref machinery. This has resulted in a very cumebrsome interface
with loads of parameters and code, which at first glance, looks very
unwieldy. Begin untangling it by first extracting the initialization
only code in its own function. It's more or less verbatim copy of the
first part of add_delayed_ref_head.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Now that the initialization part and the critical section code have been
split it's a lot easier to open code add_delayed_data_ref. Do so in the
following manner:
1. The common init function is put immediately after memory-to-be-initialized
is allocated, followed by the specific data ref initialization.
2. The only piece of code that remains in the critical section is
insert_delayed_ref call.
3. Tracing and memory freeing code is moved outside of the critical
section.
No functional changes, just an overall shorter critical section.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Now that the initialization part and the critical section code have been
split it's a lot easier to open code add_delayed_tree_ref. Do so in the
following manner:
1. The comming init code is put immediately after memory-to-be-initialized
is allocated, followed by the ref-specific member initialization.
2. The only piece of code that remains in the critical section is
insert_delayed_ref call.
3. Tracing and memory freeing code is put outside of the critical
section as well.
The only real change here is an overall shorter critical section when
dealing with delayed tree refs. From functional point of view - the code
is unchanged.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Use the newly introduced helper and remove the duplicate code. No
functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Use the newly introduced common helper. No functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
THe majority of the init code for struct btrfs_delayed_ref_node is
duplicated in add_delayed_data_ref and add_delayed_tree_ref. Factor out
the common bits in init_delayed_ref_common. This function is going to be
used in future patches to clean that up. No functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It's not good to overwrite -ENOMEM using -EINVAL when failing from mount
option parsing, so just return original error code.
Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The fs_info is always available from the context so we don't need to
store it in the structure.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When debugging quota rescan race, some times btrfs rescan could account
some old (committed) leaf and then re-account newly committed leaf
in next generation.
This race needs extra transid to locate, so add @transid for
trace_btrfs_qgroup_account_extent() for such debug.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Trivial fix to spelling mistake of function name in btrfs_err message
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This function is used in only one place and devid argument is always
passed 0. So just remove it, similarly to how it was removed in the
userspace code.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Origin trace_qgroup_update_counters() only records qgroup id and its
reference count change.
It's good enough to debug qgroup accounting change, but when rescan race
is involved, it's pretty hard to distinguish which modification belongs
to which rescan.
So add old_rfer and old_excl trace output to help distinguishing
different rescan instance.
(Different rescan instance should reset its qgroup->rfer to 0)
For trace event parameter, it just changes from u64 qgroup_id to struct
btrfs_qgroup *qgroup, so number of parameters is not changed at all.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It's used only in inode.c so makes no sense to have it exported. Also
move the definition of btrfs_delalloc_work to inode.c since it's used
only this file.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When allocating a delalloc work we are always setting the delayed_iput
to 0. So remove the delay_iput member of btrfs_delalloc_work, as a
result also remove it as a parameter from btrfs_alloc_delalloc_work
since it's not used anymore.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It's always set to 0 so remove it.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
[ rename to start_delalloc_inodes ]
Signed-off-by: David Sterba <dsterba@suse.com>
It's always set to 0, so just remove it and collapse the constant value
to the only function we are passing it.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This parameter was introduced alongside the function in
eb73c1b7ce ("Btrfs: introduce per-subvolume delalloc inode list") to
avoid deadlocks since this function was used in the transaction commit
path. However, commit 8d875f95da ("btrfs: disable strict file flushes
for renames and truncates") removed that usage, rendering the parameter
obsolete.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In btrfs_shrink_device, before btrfs_search_slot, path->reada is set to
READA_FORWARD. But I think READA_BACK is correct.
Since:
1. key.offset is set to (u64)-1
2. after btrfs_search_slot, btrfs_previous_item is called
So, for readahead previous items, READA_BACK is the correct one.
Signed-off-by: Gu Jinxiang <gujx@cn.fujitsu.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This patch will add the following trace events:
1) btrfs_remove_block_group
For btrfs_remove_block_group() function.
Triggered when a block group is really removed.
2) btrfs_add_unused_block_group
Triggered which block group is added to unused_bgs list.
3) btrfs_skip_unused_block_group
Triggered which unused block group is not deleted.
These trace events is pretty handy to debug case related to block group
auto remove.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
fs_info can be extracted from btrfs_block_group_cache, and all
btrfs_block_group_cache is created by btrfs_create_block_group_cache()
with fs_info initialized, no need to worry about NULL pointer
dereference.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Since the commit c6100a4b4e ("Btrfs: replace tree->mapping with
tree->private_data"), parameter fs_info in alloc_reloc_control is
not used. So remove it.
Signed-off-by: Gu Jinxiang <gujx@cn.fujitsu.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Add a new member struct btrfs_raid_attr::mindev_error so that
btrfs_raid_array can maintain the error code to return if the minimum
number of devices condition is not met while trying to delete a device
in the given raid. And so we can drop btrfs_raid_mindev_error.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Add a new member struct btrfs_raid_attr::bg_flag so that
btrfs_raid_array can maintain the bit map flag of the raid type, and
so we can drop btrfs_raid_group.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Add a new member struct btrfs_raid_attr::raid_name so that
btrfs_raid_array can maintain the name of the raid type, and so we can
drop btrfs_raid_type_names.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It's pretty handy if we can get the debug output for locking status of
an extent buffer, specially for race condition related debugging.
So add the following output for btrfs_print_tree() and
btrfs_print_leaf():
- refs
- write_locks (as w:%d)
- read_locks (as r:%d)
- blocking_writers (as bw:%d)
- blocking_readers (as br:%d)
- spinning_writers (as sw:%d)
- spinning_readers (as sr:%d)
- lock_owner
- current->pid
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update comment ]
Signed-off-by: David Sterba <dsterba@suse.com>
The helper is quite simple and I'd like to see the locking in the
caller.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
While the spinlock does not cause problems, using the mutex is more
correct and consistent with others. The global status of balance is eg.
checked from btrfs_pause_balance or btrfs_cancel_balance with mutex.
Resuming balance happens during mount or ro->rw remount. In the former
case, no other user of the balance_ctl exists, in the latter, balance
cannot run until the ro/rw transition is finished.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The parameter controls locking of the stats part but we can lock it
unconditionally, as this only happens once when balance starts. This is
not performance critical.
Add the prefix for an exported function.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Balance cannot be started on a read-only filesystem and will have to
finish/exit before eg. going to read-only via remount.
In case the filesystem is forcibly set to read-only after an error,
balance will finish anyway and if the cancel call is too fast it will
just wait for that to happen.
The last case is when the balance is paused after mount but it's
read-only and cancelling would want to delete the item. The test is
moved after the check if balance is running at all, as it looks more
logical to report "no balance running" instead of "read-only
filesystem".
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently fs_info::balance_running is 0 or 1 and does not use the
semantics of atomics. The pause and cancel check for 0, that can happen
only after __btrfs_balance exits for whatever reason.
Parallel calls to balance ioctl may enter btrfs_ioctl_balance multiple
times but will block on the balance_mutex that protects the
fs_info::flags bit.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Mutual exclusion of device add/rm and balance was done by the volume
mutex up to version 3.7. The commit 5ac00addc7 ("Btrfs: disallow
mutually exclusive admin operations from user mode") added a bit that
essentially tracked the same information.
The status bit has an advantage over a mutex that it can be set without
restrictions of function context, so it started to be used in the
mount-time resuming of balance or device replace.
But we don't really need to track the same information in two ways.
1) After the previous cleanups, the main ioctl handlers for
add/del/resize copy the EXCL_OP bit next to the volume mutex, here
it's clearly safe.
2) Resuming balance during mount or after rw remount will set only the
EXCL_OP bit and the volume_mutex is held in the kernel thread that
calls btrfs_balance.
3) Resuming device replace during mount or after rw remount is done
after balance and is excluded by the EXCL_OP bit. It does not take
the volume_mutex at all and completely relies on the EXCL_OP bit.
4) The resuming of balance and dev-replace cannot hapen at the same time
as the ioctls cannot be started in parallel. Nevertheless, a crafted
image could trigger that and a warning is printed.
5) Balance is normally excluded by EXCL_OP and also uses own mutex to
protect against concurrent access to its status data. There's some
trickery to maintain the right lock nesting in case we need to
reexamine the status in btrfs_ioctl_balance. The volume_mutex is
removed and the unlock/lock sequence is left in place as we might
expect other waiters to proceed.
6) Similar to 5, the unlock/lock sequence is kept in
btrfs_cancel_balance to allow waiters to continue.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The volume mutex does not protect against anything in this case, the
comment about scrub is right but not related to locking and looks
confusing. The comment in btrfs_find_device_missing_or_by_path is wrong
and confusing too.
The device_list_mutex is not held here to protect device lookup, but in
this case device replace cannot run in parallel with device removal (due
to exclusive op protection), so we don't need further locking here.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The function __cancel_balance name is confusing with the cancel
operation of balance and it really resets the state of balance back to
zero. The unset_balance_control helper is called only from one place and
simple enough to be inlined.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Replace a WARN_ON with a proper check and message in case something goes
really wrong and resumed balance cannot set up its exclusive status.
The check is a user friendly assertion, I don't expect to ever happen
under normal circumstances.
Also document that the paused balance starts here and owns the exclusive
op status.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The device replace is paused by unmount or read only remount, and
resumed on next mount or write remount.
The exclusive status should be checked properly as it's a global
invariant and we must not allow 2 operations run. In this case, the
balance can be also paused and resumed under same conditions. It's
always checked first so dev-replace could see the EXCL_OP already taken,
BUT, the ioctl would never let start both at the same time.
Replace the WARN_ON with message and return 0, indicating no error as
this is purely theoretical and the user will be informed. Resolving that
manually should be possible by waiting for the other operation to finish
or cancel the paused state.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Move locking and unlocking next to the BTRFS_FS_EXCL_OP bit manipulation
so it's obvious that the two happen at the same time.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The function logically belongs there and there's only a single caller,
no need to export it. No code changes.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The function will be used outside of volumes.c, the allocation
btrfs_alloc_device is also exported.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This is a preparatory cleanup that will make clear that the only
successful way out of btrfs_init_dev_replace_tgtdev will also set the
device_out to a valid pointer. With this guarantee, the callers can be
simplified.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The function is called once and is fairly small, we can merge it with
the caller.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This function uses fs_info::fs_devices number of time, however we
declare and use it only at the end, instead do it in the beginning of
the function and use it.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
find_device() declares struct list_head *head pointer and used only once,
instead just use it directly.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
__btrfs_open_devices() is un-exported drop __ prefix and rename it to
open_fs_devices().
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
__btrfs_close_devices() is un-exported, drop the __ prefix and rename it
to close_fs_devices().
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
__btrfs_open_devices() declares struct list_head *head, however head is
used only once, instead use btrfs_fs_devices::devices directly.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_fs_devices::list is the list of BTRFS fsid in the kernel, a generic
name 'list' makes it's search very difficult, rename it to fs_list.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This function is called from only 1 place and is effectively a wrapper
over wait_completion/kfree. It doesn't really bring any value having
those two calls in a separate function. Just open code it and remove it.
No functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It can be directly referenced from the passed address_space so do that.
No functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
list_empty_careful usually is a signal of something tricky going on. Its
usage in btrfs is actually not needed since both lists it's used on are
local to a function and cannot be modified concurrently. So switch to
plain list_empty. No functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This function is called only from btrfs_readpage and is already passed
the mapping. Simplify its signature by moving the code obtaining
reference to the extent tree in the function. No functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It's not used in the function so just remove it. No functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This function already gets the page from which the two extent trees
are referenced. Simplify its signature by moving the code getting the
trees inside the function. No functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Change the behavior of rmdir(2) and allow it to delete an empty
subvolume by using btrfs_delete_subvolume() which is used by
btrfs_ioctl_snap_destroy().
This is a change in behaviour and has been requested by users. Deleting
the subvolume by ioctl requires root permissions while the rmdir way
does works with standard tools and syscalls for all users that can
access the subvolume.
The main usecase is to allow 'rm -rf /path/with/subvols' to simply work.
We were not able to find any nasty usability surprises, the intention is
to do the destructive rm. Without allowing rmdir, this would have to be
followed by the ioctl subvolume deletion, which is more of an annoyance.
Implementation details:
The required lock for @dir and inode of @dentry is already acquired in
vfs layer.
We need some check before deleting a subvolume. Permission check is done
in vfs layer, emptiness check is in btrfs_rmdir() and additional check
(i.e. neither the subvolume is a default subvolume nor send is in progress)
is in btrfs_delete_subvolume().
Note that in btrfs_ioctl_snap_destroy(), d_delete() is called after
btrfs_delete_subvolume(). For rmdir(2), d_delete() is called in vfs
layer later.
Tested-by: Goffredo Baroncelli <kreijack@inwind.it>
Signed-off-by: Tomohiro Misono <misono.tomohiro@jp.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ enhance changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
Factor out the second half of btrfs_ioctl_snap_destroy() as
btrfs_delete_subvolume(), which performs some subvolume specific checks
before deletion:
1. send is not in progress
2. the subvolume is not the default subvolume
3. the subvolume does not contain other subvolumes
and actual deletion process. btrfs_delete_subvolume() requires
inode_lock for both @dir and inode of @dentry. The remaining part of
btrfs_ioctl_snap_destroy() is mainly permission checks.
Note that call of d_delete() is not included in btrfs_delete_subvolume()
as this function will also be used by btrfs_rmdir() to delete an empty
subvolume and in that case d_delete() is called in VFS layer.
As a result, btrfs_unlink_subvol() and may_destroy_subvol()
become static functions. No functional changes.
Signed-off-by: Tomohiro Misono <misono.tomohiro@jp.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ minor comment updates ]
Signed-off-by: David Sterba <dsterba@suse.com>
This is a preparation work to refactor btrfs_ioctl_snap_destroy()
and to allow rmdir(2) to delete an empty subvolume.
Signed-off-by: Tomohiro Misono <misono.tomohiro@jp.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ minor update of the function comment ]
Signed-off-by: David Sterba <dsterba@suse.com>
With commit b18253ec57c0 ("btrfs: optimize free space tree bitmap
conversion"), there are no more callers to le_test_bit(). This patch
removes le_test_bit().
Signed-off-by: Howard McLauchlan <hmclauchlan@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Presently, convert_free_space_to_extents() does a linear scan of the
bitmap. We can speed this up with find_next_{bit,zero_bit}_le().
This patch replaces the linear scan with find_next_{bit,zero_bit}_le().
Testing shows a 20-33% decrease in execution time for
convert_free_space_to_extents().
Since we change bitmap to be unsigned long, we have to do some casting
for the bitmap cursor. In le_bitmap_set() it makes sense to use u8, as
we are doing bit operations. Everywhere else, we're just using it for
pointer arithmetic and not directly accessing it, so char seems more
appropriate.
Suggested-by: Omar Sandoval <osandov@osandov.com>
Signed-off-by: Howard McLauchlan <hmclauchlan@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
le_bitmap_set() is only used by free-space-tree, so move it there and
make it static. le_bitmap_clear() is not used, so remove it.
Signed-off-by: Howard McLauchlan <hmclauchlan@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We really want to know to which filesystem the extent map events belong,
but as it cannot be reached from the extent_map pointers, we need to
pass it down the callchain.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Preparatory work to pass fs_info to btrfs_add_extent_mapping so we can
get a better tracepoint message. Extent maps do not need fs_info for
anything so we only add a dummy one without any other initialization.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The second if is really a subcase of ret being less than 0. So
introduce a generic if (ret < 0) check, and inside have another if
which explicitly handles the -ENOSPC and any other errors. No
functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Locks should generally be released in the oppposite order they are
acquired. Generally lock acquisiton ordering is used to ensure
deadlocks don't happen. However, as becomes more complicated it's
best to also maintain proper unlock order so as to avoid possible dead
locks. This was found by code inspection and doesn't necessarily lead
to a deadlock scenario.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently __endio_write_update_ordered uses labels to implement
what is essentially a simple while loop. This makes the code more
cumbersome to follow than it actually has to be. No functional
changes. No xfstest regressions were found during testing.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Adds comments about BTRFS_FS_EXCL_OP to existing comments
about the device locks.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ minor updates ]
Signed-off-by: David Sterba <dsterba@suse.com>
It's used to print its pointer in a debug statement but doesn't really
bring any useful information to the error message.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The function btrfs_get_block_group_info() was introduced by the
commit 5af3e8cce8 ("Btrfs: make filesystem read-only when submitting
barrier fails") which used it in disk-io.c.
However, the function is only called in ioctl.c now.
Its parameter type btrfs_ioctl_space_info* is only for ioctl.
So, make it static and rename it to be original name
get_block_group_info.
No functional change.
Signed-off-by: Su Yue <suy.fnst@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
add_pinned_bytes really cares whether the bytes being pinned are either
data or metadata. To that effect it checks whether the 'owner' argument
is less than BTRFS_FIRST_FREE_OBJECTID (256). This works because
owner can really have 2 types of values:
a) For metadata extents it holds the level at which the parent is in
the btree. This amounts to owner having the values 0-7
b) In case of modifying data extents, owner is the inode number
to which those extents belongs.
Let's make this more explicit byt converting the owner parameter to a
boolean value and either pass it directly when we know the type of
extents we are working with (i.e. in btrfs_free_tree_block). In cases
when the parent function can be called on both metadata/data extents
perform the check in the caller. This hopefully makes the interface
of add_pinned_bytes more intuitive.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When specifying option 'prefix' multiple times, current option parsing
will cause memory leak. Hence, call kfree for previous one in this
case.
Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Once upon a time ->rmdir() instances used to check if victim inode
had more than one (in-core) reference and failed with -EBUSY if it
had. The reason was race avoidance - emptiness check is worthless
if somebody could just go and create new objects in the victim
directory afterwards.
With introduction of dcache the checks had been replaced with
checking the refcount of dentry. However, since a cached negative
lookup leaves a negative child dentry, such check had lead to false
positives - with empty foo/ doing stat foo/bar before rmdir foo
ended up with -EBUSY unless the negative dentry of foo/bar happened
to be evicted by the time of rmdir(2). That had been fixed by
doing shrink_dcache_parent() just before the refcount check.
At the same time, ext2_rmdir() has grown a private solution that
eliminated those -EBUSY - it did something (setting ->i_size to 0)
which made any subsequent ext2_add_entry() fail.
Unfortunately, even with shrink_dcache_parent() the check had been
racy - after all, the victim itself could be found by dcache lookup
just after we'd checked its refcount. That got fixed by a new
helper (dentry_unhash()) that did shrink_dcache_parent() and unhashed
the sucker if its refcount ended up equal to 1. That got called before
->rmdir(), turning the checks in ->rmdir() instances into "if not
unhashed fail with -EBUSY". Which reduced the boilerplate nicely, but
had an unpleasant side effect - now shrink_dcache_parent() had been
done before the emptiness checks, leading to easily triggerable calls
of shrink_dcache_parent() on arbitrary large subtrees, quite possibly
nested into each other.
Several years later the ext2-private trick had been generalized -
(in-core) inodes of dead directories are flagged and calls of
lookup, readdir and all directory-modifying methods were prevented
in so marked directories. Remaining boilerplate in ->rmdir() instances
became redundant and some instances got rid of it.
In 2011 the call of dentry_unhash() got shifted into ->rmdir() instances
and then killed off in all of them. That has lead to another problem,
though - in case of successful rmdir we *want* any (negative) child
dentries dropped and the victim itself made negative. There's no point
keeping cached negative lookups in foo when we can get the negative
lookup of foo itself cached. So shrink_dcache_parent() call had been
restored; unfortunately, it went into the place where dentry_unhash()
used to be, i.e. before the ->rmdir() call. Note that we don't unhash
anymore, so any "is it busy" checks would be racy; fortunately, all of
them are gone.
We should've done that call right *after* successful ->rmdir(). That
reduces contention caused by tree-walking in shrink_dcache_parent()
and, especially, contention caused by evictions in two nested subtrees
going on in parallel. The same goes for directory-overwriting rename() -
the story there had been parallel to that of rmdir().
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
what we want it for is actually updating inode metadata;
take _that_ into a separate helper and use it.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
If we can acquire ctx_lock without spinning we can just remove our
iocb from the active_reqs list, and thus complete the iocbs from the
wakeup context.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Simple one-shot poll through the io_submit() interface. To poll for
a file descriptor the application should submit an iocb of type
IOCB_CMD_POLL. It will poll the fd for the events specified in the
the first 32 bits of the aio_buf field of the iocb.
Unlike poll or epoll without EPOLLONESHOT this interface always works
in one shot mode, that is once the iocb is completed, it will have to be
resubmitted.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
With the current aio code there is no need for the magic KIOCB_CANCELLED
value, as a cancelation just kicks the driver to queue the completion
ASAP, with all actual completion handling done in another thread. Given
that both the completion path and cancelation take the context lock there
is no need for magic cmpxchg loops either. If we remove iocbs from the
active list after calling ->ki_cancel (but with ctx_lock still held), we
can also rely on the invariant thay anything found on the list has a
->ki_cancel callback and can be cancelled, further simplifing the code.
Signed-off-by: Christoph Hellwig <hch@lst.de>
No need to pass the key field to lookup_iocb to compare it with KIOCB_KEY,
as we can do that right after retrieving it from userspace. Also move the
KIOCB_KEY definition to aio.c as it is an internal value not used by any
other place in the kernel.
Signed-off-by: Christoph Hellwig <hch@lst.de>
->get_poll_head returns the waitqueue that the poll operation is going
to sleep on. Note that this means we can only use a single waitqueue
for the poll, unlike some current drivers that use two waitqueues for
different events. But now that we have keyed wakeups and heavily use
those for poll there aren't that many good reason left to keep the
multiple waitqueues, and if there are any ->poll is still around, the
driver just won't support aio poll.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
These abstract out calls to the poll method in preparation for changes
in how we poll.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Use straightline code with failure handling gotos instead of a lot
of nested conditionals.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
No users outside of select.c.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
The 4.17-rc /proc/meminfo and /proc/<pid>/smaps look ugly: single-digit
numbers (commonly 0) are misaligned.
Remove seq_put_decimal_ull_width()'s leftover optimization for single
digits: it's wrong now that num_to_str() takes care of the width.
Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1805241554210.1326@eggly.anvils
Fixes: d1be35cb6f ("proc: add seq_put_decimal_ull_width to speed up /proc/pid/smaps")
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Andrei Vagin <avagin@openvz.org>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
These two functions now trigger a warning when CONFIG_PROC_FS is disabled:
fs/xfs/xfs_stats.c:128:12: error: 'xqmstat_proc_show' defined but not used [-Werror=unused-function]
static int xqmstat_proc_show(struct seq_file *m, void *v)
^~~~~~~~~~~~~~~~~
fs/xfs/xfs_stats.c:118:12: error: 'xqm_proc_show' defined but not used [-Werror=unused-function]
static int xqm_proc_show(struct seq_file *m, void *v)
^~~~~~~~~~~~~
Previously, they were referenced from an unused 'static const' structure,
which is silently dropped by gcc.
We can address the warning by adding the same #ifdef around them that
hides the reference.
Fixes: 3f3942aca6 ("proc: introduce proc_create_single{,_data}")
Cc: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
- Remove bouncing addresses from the MAINTAINERS file
- Kernel oops and bad error handling fixes for hfi, i40iw, cxgb4, and hns drivers
- Various small LOC behavioral/operational bugs in mlx5, hns, qedr and i40iw drivers
- Two fixes for patches already sent during the merge window
- A long standing bug related to not decreasing the pinned pages count in the right
MM was found and fixed
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQIcBAABCgAGBQJbByPQAAoJEDht9xV+IJsa164P/AihB/vbn9MBdK3pe1OSUGTm
tKZJ/Y6nY/Q/XTJSeM2wNECk8fOrZbKuLBz2XlPRsB2djp4ugC5WWfK9YbwWMGXG
I5B/lB8VTorQr8E5i9lqqMDQc8aF8VcGJtdqVE3nD4JsVTrQSGiSnw45/BARDUm3
OycJJMDOWhDj2wnNSa+JfjPemIMDM1jse7DnsJfDsGfTMS/G+6nyzjKIlEnnFZ8/
PBxhq0q7C5viNDwwn2GsAVUrATTlW48SY0WYhkgMdSl20d2th9wMZqNMqtniz8NP
lg87SrhzsAPOTlbSWlYYkAnzE7nEhfJyIfYUp2piNJeYuOohYPtO6w99Tqjl/GmU
uLIYIXtZCxAK1Zb/znc49HkRVL5YFDsQGXdtYy7tvRZPwwR32kowUtpKIWaZFz8O
BA/x+Zgqu9AlwqSWwQwxmMbUX42RRwhNJDVyTYlXQSSzhfgFaLIZARqb4K6HxeNN
vZN0BK+x6pX6FI7hpdsqNRtH1oo4SNUBxiuUsrZ7cy7GqYNdUJ6piygDgmERaJxU
svIUJof/+OoU1QyErQ0JgUEK/3jOHbjxSPb/rjQeqxAnCqhaGOuNGMtdfsGqgvBU
x/u3eDcbfi/LBErXR46gYtxnOQ8I2BB+m8erUc/GVvCzWrX+R7ELZYpBrP5Pcu/6
mr2D7hDqgZHbeU8aB8+D
=uFZh
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma
Pull rdma fixes from Jason Gunthorpe:
"This is pretty much just the usual array of smallish driver bugs.
- remove bouncing addresses from the MAINTAINERS file
- kernel oops and bad error handling fixes for hfi, i40iw, cxgb4, and
hns drivers
- various small LOC behavioral/operational bugs in mlx5, hns, qedr
and i40iw drivers
- two fixes for patches already sent during the merge window
- a long-standing bug related to not decreasing the pinned pages
count in the right MM was found and fixed"
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (28 commits)
RDMA/hns: Move the location for initializing tmp_len
RDMA/hns: Bugfix for cq record db for kernel
IB/uverbs: Fix uverbs_attr_get_obj
RDMA/qedr: Fix doorbell bar mapping for dpi > 1
IB/umem: Use the correct mm during ib_umem_release
iw_cxgb4: Fix an error handling path in 'c4iw_get_dma_mr()'
RDMA/i40iw: Avoid panic when reading back the IRQ affinity hint
RDMA/i40iw: Avoid reference leaks when processing the AEQ
RDMA/i40iw: Avoid panic when objects are being created and destroyed
RDMA/hns: Fix the bug with NULL pointer
RDMA/hns: Set NULL for __internal_mr
RDMA/hns: Enable inner_pa_vld filed of mpt
RDMA/hns: Set desc_dma_addr for zero when free cmq desc
RDMA/hns: Fix the bug with rq sge
RDMA/hns: Not support qp transition from reset to reset for hip06
RDMA/hns: Add return operation when configured global param fail
RDMA/hns: Update convert function of endian format
RDMA/hns: Load the RoCE dirver automatically
RDMA/hns: Bugfix for rq record db for kernel
RDMA/hns: Add rq inline flags judgement
...
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAlsG/ecACgkQxWXV+ddt
WDvr3w/8D12pwR9sPcEwxD4pvoLv7LP1VRQy2u+ivSifdBD7MueKh3y0igUMyARR
LERsK0zUsTQGkkC6c7ZYd4cT9PikPpXtO1P9iATFAKqR/YMDIV/haSqT8DwbI/qb
7F+ZMeTy1LzL01YlYBrGVDxP8AWVO2Dml6JolYxzplILSLvdPH6G8xOSjei/p9sm
RK5ERHJENEI0l/cThpiLoAEWjzciPtR39T5Hq45onHyCs3bjJCcx51/QE8sBsl8x
+BKvCmL40UKd30YKudJZYDM6NgMgWENhfTtIZQIInv99sMNCxIgTEUdX8ExdyjRZ
24rst/BuQz4d8r/8zqE/hdFsHRGWwnEiYmGWylanPY5KdQ41ULfXC06xuoNOLoW8
KQwD8SWv+W5vEJW0UQz5cb3vUgv5RnUzPvcmMfSztLeo2K4zj6zCK5L6XJwIJNbM
1AJR7R4TRkQdf5QEeziFl738Yv1AgsPQuKSiiFa9YwXMLU8dYXlx14ioUzBL8MLe
1wZPJ03x/N7eKJ0g6OIAAVfUTFFejv4Z2B2IDoObuLLsPwTdK6tS+9tJ5mos7ngG
Vf1ZVmhmeJdw1qwK8ROzAJHkK807KgGO7LWmA7tIVLwWuZX14F7xLQIg3Ux3MhIh
NhoBTFy2AGmdE0hFYv/4FA5dnUOU4VTVYVw3QUV4DMc0XIodZrE=
=iYyx
-----END PGP SIGNATURE-----
Merge tag 'for-4.17-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fix from David Sterba:
"A one-liner that prevents leaking an internal error value 1 out of the
ftruncate syscall.
This has been observed in practice. The steps to reproduce make a
common pattern (open/write/fync/ftruncate) but also need the
application to not check only for negative values and happens only for
compressed inlined files.
The conditions are narrow but as this could break userspace I think
it's better to merge it now and not wait for the merge window"
* tag 'for-4.17-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
Btrfs: fix error handling in btrfs_truncate()
Jun Wu at Facebook reported that an internal service was seeing a return
value of 1 from ftruncate() on Btrfs in some cases. This is coming from
the NEED_TRUNCATE_BLOCK return value from btrfs_truncate_inode_items().
btrfs_truncate() uses two variables for error handling, ret and err.
When btrfs_truncate_inode_items() returns non-zero, we set err to the
return value. However, NEED_TRUNCATE_BLOCK is not an error. Make sure we
only set err if ret is an error (i.e., negative).
To reproduce the issue: mount a filesystem with -o compress-force=zstd
and the following program will encounter return value of 1 from
ftruncate:
int main(void) {
char buf[256] = { 0 };
int ret;
int fd;
fd = open("test", O_CREAT | O_WRONLY | O_TRUNC, 0666);
if (fd == -1) {
perror("open");
return EXIT_FAILURE;
}
if (write(fd, buf, sizeof(buf)) != sizeof(buf)) {
perror("write");
close(fd);
return EXIT_FAILURE;
}
if (fsync(fd) == -1) {
perror("fsync");
close(fd);
return EXIT_FAILURE;
}
ret = ftruncate(fd, 128);
if (ret) {
printf("ftruncate() returned %d\n", ret);
close(fd);
return EXIT_FAILURE;
}
close(fd);
return EXIT_SUCCESS;
}
Fixes: ddfae63cc8 ("btrfs: move btrfs_truncate_block out of trans handle")
CC: stable@vger.kernel.org # 4.15+
Reported-by: Jun Wu <quark@fb.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If io_destroy() gets to cancelling everything that can be cancelled and
gets to kiocb_cancel() calling the function driver has left in ->ki_cancel,
it becomes vulnerable to a race with IO completion. At that point req
is already taken off the list and aio_complete() does *NOT* spin until
we (in free_ioctx_users()) releases ->ctx_lock. As the result, it proceeds
to kiocb_free(), freing req just it gets passed to ->ki_cancel().
Fix is simple - remove from the list after the call of kiocb_cancel(). All
instances of ->ki_cancel() already have to cope with the being called with
iocb still on list - that's what happens in io_cancel(2).
Cc: stable@kernel.org
Fixes: 0460fef2a9 "aio: use cancellation list lazily"
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
... and take the "check if file is open, pick ->f_mode" into a helper;
tid_fd_revalidate() can use it.
The next patch will get rid of tid_fd_revalidate() calls in instantiate
callbacks.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
First of all, calling pid_revalidate() in the end of <pid>/* lookups
is *not* about closing any kind of races; that used to be true once
upon a time, but these days those comments are actively misleading.
Especially since pid_revalidate() doesn't even do d_drop() on
failure anymore. It doesn't matter, anyway, since once
pid_revalidate() starts returning false, ->d_delete() of those
dentries starts saying "don't keep"; they won't get stuck in
dcache any longer than they are pinned.
These calls cannot be just removed, though - the side effect of
pid_revalidate() (updating i_uid/i_gid/etc.) is what we are calling
it for here.
Let's separate the "update ownership" into a new helper (pid_update_inode())
and use it, both in lookups and in pid_revalidate() itself.
The comments in pid_revalidate() are also out of date - they refer to
the time when pid_revalidate() used to call d_drop() directly...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
That's one case when unlink() destroys a subtree, thanks to "resource
fork" idiocy. We might forcibly evict that shit on unlink(2), but
for now let's just disallow overmounting; as it is, anything that
plays games with those would leak mounts.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>