Introduce new function set_record_extent_bits(), which will not only set
given bits, but also record how many bytes are changed, and detailed
range info.
This is quite important for later qgroup reserve framework.
The number of bytes will be used to do qgroup reserve, and detailed
range info will be used to cleanup for EQUOT case.
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
Add a new structure, extent_change_set, to record how many bytes are
changed in one set/clear_extent_bits() operation, with detailed changed
ranges info.
This provides the needed facilities for later qgroup reserve framework.
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
reada is using -1 instead of the -ENOMEM defined macro to specify that
a buffer allocation failed. Since the error number is propagated, the
caller will get a -EPERM which is the wrong error condition.
Also, updating the caller to return the exact value from
reada_add_block.
Smatch tool warning:
reada_add_block() warn: returning -1 instead of -ENOMEM is sloppy
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: Luis de Bethencourt <luisbg@osg.samsung.com>
Signed-off-by: David Sterba <dsterba@suse.com>
check-integrity is using -1 instead of the -ENOMEM defined macro to
specify that a buffer allocation failed. Since the error number is
propagated, the caller will get a -EPERM which is the wrong error
condition.
Also, the smatch tool complains with the following warnings:
btrfsic_process_superblock() warn: returning -1 instead of -ENOMEM is sloppy
btrfsic_read_block() warn: returning -1 instead of -ENOMEM is sloppy
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: Luis de Bethencourt <luisbg@osg.samsung.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Below variables are defined per compress type.
- struct list_head comp_idle_workspace[BTRFS_COMPRESS_TYPES]
- spinlock_t comp_workspace_lock[BTRFS_COMPRESS_TYPES]
- int comp_num_workspace[BTRFS_COMPRESS_TYPES]
- atomic_t comp_alloc_workspace[BTRFS_COMPRESS_TYPES]
- wait_queue_head_t comp_workspace_wait[BTRFS_COMPRESS_TYPES]
BTW, while accessing one compress type of these variables, the next or
before address is other compress types of it.
So this patch puts these variables in a struct to make cache friendly.
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: Byongho Lee <bhlee.kernel@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This patch eliminates the last item of prop_handlers array which is used
to check end of array and instead uses ARRAY_SIZE macro.
Though this is a very tiny optimization, using ARRAY_SIZE macro is a
good practice to iterate array.
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: Byongho Lee <bhlee.kernel@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Just fix a typo in the code comment.
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: Geliang Tang <geliangtang@163.com>
Signed-off-by: David Sterba <dsterba@suse.com>
rsv_count ultimately gets passed to start_transaction() which
now takes an unsigned int as its num_items parameter.
The value of rsv_count should always be positive so declare it
as being unsigned.
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: Alexandru Moise <00moses.alexander00@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The value of num_items that start_transaction() ultimately
always takes is a small one, so a 64 bit integer is overkill.
Also change num_items for btrfs_start_transaction() and
btrfs_start_transaction_lflush() as well.
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: Alexandru Moise <00moses.alexander00@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Improve readability by generalizing the profile validity checks.
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: Alexandru Moise <00moses.alexander00@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The commit b37392ea86 ("Btrfs: cleanup unnecessary parameter
and variant of prepare_pages()") makes it redundant.
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: Shan Hai <haishan.bai@hotmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_raid_array[] holds attributes of all raid types.
Use btrfs_raid_array[].devs_min is best way for request
in btrfs_reduce_alloc_profile(), instead of use complex
condition of each raid types.
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_raid_array[] is used to define all raid attributes, use it
to get tolerated_failures in btrfs_get_num_tolerated_disk_barrier_failures(),
instead of complex condition in function.
It can make code simple and auto-support other possible raid-type in
future.
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This array is used to record attributes of each raid type,
make it public, and many functions will benifit with this array.
For example, num_tolerated_disk_barrier_failures(), we can
avoid complex conditions in this function, and get raid attribute
simply by accessing above array.
It can also make code logic simple, reduce duplication code, and
increase maintainability.
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Rather than have three separate if() statements for the same outcome
we should just OR them together in the same if() statement.
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: Alexandru Moise <00moses.alexander00@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Use memset() to null out the btrfs_delayed_ref_root of
btrfs_transaction instead of setting all the members to 0 by hand.
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: Alexandru Moise <00moses.alexander00@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We can safely iterate whole list items, without using list_del macro.
So remove the list_del call.
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: Byongho Lee <bhlee.kernel@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There is no removing list element while iterating over list.
So, replace list_for_each_entry_safe to list_for_each_entry.
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: Byongho Lee <bhlee.kernel@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Just call kmem_cache_zalloc() instead of calling kmem_cache_alloc().
We're just initializing most fields to 0, false and NULL later on
_anyway_, so to make the code mode readable and potentially gain
a bit of performance (completely untested claim), we should fill our
btrfs_trans_handle with zeros on allocation then just initialize
those five remaining fields (not counting the list_heads) as normal.
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: Alexandru Moise <00moses.alexander00@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
old_len is used to store the return value of btrfs_item_size_nr().
The return value of btrfs_item_size_nr() is of type u32.
To improve code correctness and avoid mixing signed and unsigned
integers I've changed old_len to be of type u32 as well.
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: Alexandru Moise <00moses.alexander00@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The return values of btrfs_item_offset_nr and btrfs_item_size_nr are of
type u32. To avoid mixing signed and unsigned integers we should also
declare dsize and last_off to be of type u32.
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: Alexandru Moise <00moses.alexander00@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When truncating a file to a smaller size which consists of an inline
extent that is compressed, we did not discard (or made unusable) the
data between the new file size and the old file size, wasting metadata
space and allowing for the truncated data to be leaked and the data
corruption/loss mentioned below.
We were also not correctly decrementing the number of bytes used by the
inode, we were setting it to zero, giving a wrong report for callers of
the stat(2) syscall. The fsck tool also reported an error about a mismatch
between the nbytes of the file versus the real space used by the file.
Now because we weren't discarding the truncated region of the file, it
was possible for a caller of the clone ioctl to actually read the data
that was truncated, allowing for a security breach without requiring root
access to the system, using only standard filesystem operations. The
scenario is the following:
1) User A creates a file which consists of an inline and compressed
extent with a size of 2000 bytes - the file is not accessible to
any other users (no read, write or execution permission for anyone
else);
2) The user truncates the file to a size of 1000 bytes;
3) User A makes the file world readable;
4) User B creates a file consisting of an inline extent of 2000 bytes;
5) User B issues a clone operation from user A's file into its own
file (using a length argument of 0, clone the whole range);
6) User B now gets to see the 1000 bytes that user A truncated from
its file before it made its file world readbale. User B also lost
the bytes in the range [1000, 2000[ bytes from its own file, but
that might be ok if his/her intention was reading stale data from
user A that was never supposed to be public.
Note that this contrasts with the case where we truncate a file from 2000
bytes to 1000 bytes and then truncate it back from 1000 to 2000 bytes. In
this case reading any byte from the range [1000, 2000[ will return a value
of 0x00, instead of the original data.
This problem exists since the clone ioctl was added and happens both with
and without my recent data loss and file corruption fixes for the clone
ioctl (patch "Btrfs: fix file corruption and data loss after cloning
inline extents").
So fix this by truncating the compressed inline extents as we do for the
non-compressed case, which involves decompressing, if the data isn't already
in the page cache, compressing the truncated version of the extent, writing
the compressed content into the inline extent and then truncate it.
The following test case for fstests reproduces the problem. In order for
the test to pass both this fix and my previous fix for the clone ioctl
that forbids cloning a smaller inline extent into a larger one,
which is titled "Btrfs: fix file corruption and data loss after cloning
inline extents", are needed. Without that other fix the test fails in a
different way that does not leak the truncated data, instead part of
destination file gets replaced with zeroes (because the destination file
has a larger inline extent than the source).
seq=`basename $0`
seqres=$RESULT_DIR/$seq
echo "QA output created by $seq"
tmp=/tmp/$$
status=1 # failure is the default!
trap "_cleanup; exit \$status" 0 1 2 3 15
_cleanup()
{
rm -f $tmp.*
}
# get standard environment, filters and checks
. ./common/rc
. ./common/filter
# real QA test starts here
_need_to_be_root
_supported_fs btrfs
_supported_os Linux
_require_scratch
_require_cloner
rm -f $seqres.full
_scratch_mkfs >>$seqres.full 2>&1
_scratch_mount "-o compress"
# Create our test files. File foo is going to be the source of a clone operation
# and consists of a single inline extent with an uncompressed size of 512 bytes,
# while file bar consists of a single inline extent with an uncompressed size of
# 256 bytes. For our test's purpose, it's important that file bar has an inline
# extent with a size smaller than foo's inline extent.
$XFS_IO_PROG -f -c "pwrite -S 0xa1 0 128" \
-c "pwrite -S 0x2a 128 384" \
$SCRATCH_MNT/foo | _filter_xfs_io
$XFS_IO_PROG -f -c "pwrite -S 0xbb 0 256" $SCRATCH_MNT/bar | _filter_xfs_io
# Now durably persist all metadata and data. We do this to make sure that we get
# on disk an inline extent with a size of 512 bytes for file foo.
sync
# Now truncate our file foo to a smaller size. Because it consists of a
# compressed and inline extent, btrfs did not shrink the inline extent to the
# new size (if the extent was not compressed, btrfs would shrink it to 128
# bytes), it only updates the inode's i_size to 128 bytes.
$XFS_IO_PROG -c "truncate 128" $SCRATCH_MNT/foo
# Now clone foo's inline extent into bar.
# This clone operation should fail with errno EOPNOTSUPP because the source
# file consists only of an inline extent and the file's size is smaller than
# the inline extent of the destination (128 bytes < 256 bytes). However the
# clone ioctl was not prepared to deal with a file that has a size smaller
# than the size of its inline extent (something that happens only for compressed
# inline extents), resulting in copying the full inline extent from the source
# file into the destination file.
#
# Note that btrfs' clone operation for inline extents consists of removing the
# inline extent from the destination inode and copy the inline extent from the
# source inode into the destination inode, meaning that if the destination
# inode's inline extent is larger (N bytes) than the source inode's inline
# extent (M bytes), some bytes (N - M bytes) will be lost from the destination
# file. Btrfs could copy the source inline extent's data into the destination's
# inline extent so that we would not lose any data, but that's currently not
# done due to the complexity that would be needed to deal with such cases
# (specially when one or both extents are compressed), returning EOPNOTSUPP, as
# it's normally not a very common case to clone very small files (only case
# where we get inline extents) and copying inline extents does not save any
# space (unlike for normal, non-inlined extents).
$CLONER_PROG -s 0 -d 0 -l 0 $SCRATCH_MNT/foo $SCRATCH_MNT/bar
# Now because the above clone operation used to succeed, and due to foo's inline
# extent not being shinked by the truncate operation, our file bar got the whole
# inline extent copied from foo, making us lose the last 128 bytes from bar
# which got replaced by the bytes in range [128, 256[ from foo before foo was
# truncated - in other words, data loss from bar and being able to read old and
# stale data from foo that should not be possible to read anymore through normal
# filesystem operations. Contrast with the case where we truncate a file from a
# size N to a smaller size M, truncate it back to size N and then read the range
# [M, N[, we should always get the value 0x00 for all the bytes in that range.
# We expected the clone operation to fail with errno EOPNOTSUPP and therefore
# not modify our file's bar data/metadata. So its content should be 256 bytes
# long with all bytes having the value 0xbb.
#
# Without the btrfs bug fix, the clone operation succeeded and resulted in
# leaking truncated data from foo, the bytes that belonged to its range
# [128, 256[, and losing data from bar in that same range. So reading the
# file gave us the following content:
#
# 0000000 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1
# *
# 0000200 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a
# *
# 0000400
echo "File bar's content after the clone operation:"
od -t x1 $SCRATCH_MNT/bar
# Also because the foo's inline extent was not shrunk by the truncate
# operation, btrfs' fsck, which is run by the fstests framework everytime a
# test completes, failed reporting the following error:
#
# root 5 inode 257 errors 400, nbytes wrong
status=0
exit
Cc: stable@vger.kernel.org
Signed-off-by: Filipe Manana <fdmanana@suse.com>
If when reading a page we find a hole and our caller had already locked
the range (bio flags has the bit EXTENT_BIO_PARENT_LOCKED set), we end
up unlocking the hole's range and then later our caller unlocks it
again, which might have already been locked by some other task once
the first unlock happened.
Currently this can only happen during a call to the extent_same ioctl,
as it's the only caller of __do_readpage() that sets the bit
EXTENT_BIO_PARENT_LOCKED for bio flags.
Fix this by leaving the unlock exclusively to the caller.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Currently the clone ioctl allows to clone an inline extent from one file
to another that already has other (non-inlined) extents. This is a problem
because btrfs is not designed to deal with files having inline and regular
extents, if a file has an inline extent then it must be the only extent
in the file and must start at file offset 0. Having a file with an inline
extent followed by regular extents results in EIO errors when doing reads
or writes against the first 4K of the file.
Also, the clone ioctl allows one to lose data if the source file consists
of a single inline extent, with a size of N bytes, and the destination
file consists of a single inline extent with a size of M bytes, where we
have M > N. In this case the clone operation removes the inline extent
from the destination file and then copies the inline extent from the
source file into the destination file - we lose the M - N bytes from the
destination file, a read operation will get the value 0x00 for any bytes
in the the range [N, M] (the destination inode's i_size remained as M,
that's why we can read past N bytes).
So fix this by not allowing such destructive operations to happen and
return errno EOPNOTSUPP to user space.
Currently the fstest btrfs/035 tests the data loss case but it totally
ignores this - i.e. expects the operation to succeed and does not check
the we got data loss.
The following test case for fstests exercises all these cases that result
in file corruption and data loss:
seq=`basename $0`
seqres=$RESULT_DIR/$seq
echo "QA output created by $seq"
tmp=/tmp/$$
status=1 # failure is the default!
trap "_cleanup; exit \$status" 0 1 2 3 15
_cleanup()
{
rm -f $tmp.*
}
# get standard environment, filters and checks
. ./common/rc
. ./common/filter
# real QA test starts here
_need_to_be_root
_supported_fs btrfs
_supported_os Linux
_require_scratch
_require_cloner
_require_btrfs_fs_feature "no_holes"
_require_btrfs_mkfs_feature "no-holes"
rm -f $seqres.full
test_cloning_inline_extents()
{
local mkfs_opts=$1
local mount_opts=$2
_scratch_mkfs $mkfs_opts >>$seqres.full 2>&1
_scratch_mount $mount_opts
# File bar, the source for all the following clone operations, consists
# of a single inline extent (50 bytes).
$XFS_IO_PROG -f -c "pwrite -S 0xbb 0 50" $SCRATCH_MNT/bar \
| _filter_xfs_io
# Test cloning into a file with an extent (non-inlined) where the
# destination offset overlaps that extent. It should not be possible to
# clone the inline extent from file bar into this file.
$XFS_IO_PROG -f -c "pwrite -S 0xaa 0K 16K" $SCRATCH_MNT/foo \
| _filter_xfs_io
$CLONER_PROG -s 0 -d 0 -l 0 $SCRATCH_MNT/bar $SCRATCH_MNT/foo
# Doing IO against any range in the first 4K of the file should work.
# Due to a past clone ioctl bug which allowed cloning the inline extent,
# these operations resulted in EIO errors.
echo "File foo data after clone operation:"
# All bytes should have the value 0xaa (clone operation failed and did
# not modify our file).
od -t x1 $SCRATCH_MNT/foo
$XFS_IO_PROG -c "pwrite -S 0xcc 0 100" $SCRATCH_MNT/foo | _filter_xfs_io
# Test cloning the inline extent against a file which has a hole in its
# first 4K followed by a non-inlined extent. It should not be possible
# as well to clone the inline extent from file bar into this file.
$XFS_IO_PROG -f -c "pwrite -S 0xdd 4K 12K" $SCRATCH_MNT/foo2 \
| _filter_xfs_io
$CLONER_PROG -s 0 -d 0 -l 0 $SCRATCH_MNT/bar $SCRATCH_MNT/foo2
# Doing IO against any range in the first 4K of the file should work.
# Due to a past clone ioctl bug which allowed cloning the inline extent,
# these operations resulted in EIO errors.
echo "File foo2 data after clone operation:"
# All bytes should have the value 0x00 (clone operation failed and did
# not modify our file).
od -t x1 $SCRATCH_MNT/foo2
$XFS_IO_PROG -c "pwrite -S 0xee 0 90" $SCRATCH_MNT/foo2 | _filter_xfs_io
# Test cloning the inline extent against a file which has a size of zero
# but has a prealloc extent. It should not be possible as well to clone
# the inline extent from file bar into this file.
$XFS_IO_PROG -f -c "falloc -k 0 1M" $SCRATCH_MNT/foo3 | _filter_xfs_io
$CLONER_PROG -s 0 -d 0 -l 0 $SCRATCH_MNT/bar $SCRATCH_MNT/foo3
# Doing IO against any range in the first 4K of the file should work.
# Due to a past clone ioctl bug which allowed cloning the inline extent,
# these operations resulted in EIO errors.
echo "First 50 bytes of foo3 after clone operation:"
# Should not be able to read any bytes, file has 0 bytes i_size (the
# clone operation failed and did not modify our file).
od -t x1 $SCRATCH_MNT/foo3
$XFS_IO_PROG -c "pwrite -S 0xff 0 90" $SCRATCH_MNT/foo3 | _filter_xfs_io
# Test cloning the inline extent against a file which consists of a
# single inline extent that has a size not greater than the size of
# bar's inline extent (40 < 50).
# It should be possible to do the extent cloning from bar to this file.
$XFS_IO_PROG -f -c "pwrite -S 0x01 0 40" $SCRATCH_MNT/foo4 \
| _filter_xfs_io
$CLONER_PROG -s 0 -d 0 -l 0 $SCRATCH_MNT/bar $SCRATCH_MNT/foo4
# Doing IO against any range in the first 4K of the file should work.
echo "File foo4 data after clone operation:"
# Must match file bar's content.
od -t x1 $SCRATCH_MNT/foo4
$XFS_IO_PROG -c "pwrite -S 0x02 0 90" $SCRATCH_MNT/foo4 | _filter_xfs_io
# Test cloning the inline extent against a file which consists of a
# single inline extent that has a size greater than the size of bar's
# inline extent (60 > 50).
# It should not be possible to clone the inline extent from file bar
# into this file.
$XFS_IO_PROG -f -c "pwrite -S 0x03 0 60" $SCRATCH_MNT/foo5 \
| _filter_xfs_io
$CLONER_PROG -s 0 -d 0 -l 0 $SCRATCH_MNT/bar $SCRATCH_MNT/foo5
# Reading the file should not fail.
echo "File foo5 data after clone operation:"
# Must have a size of 60 bytes, with all bytes having a value of 0x03
# (the clone operation failed and did not modify our file).
od -t x1 $SCRATCH_MNT/foo5
# Test cloning the inline extent against a file which has no extents but
# has a size greater than bar's inline extent (16K > 50).
# It should not be possible to clone the inline extent from file bar
# into this file.
$XFS_IO_PROG -f -c "truncate 16K" $SCRATCH_MNT/foo6 | _filter_xfs_io
$CLONER_PROG -s 0 -d 0 -l 0 $SCRATCH_MNT/bar $SCRATCH_MNT/foo6
# Reading the file should not fail.
echo "File foo6 data after clone operation:"
# Must have a size of 16K, with all bytes having a value of 0x00 (the
# clone operation failed and did not modify our file).
od -t x1 $SCRATCH_MNT/foo6
# Test cloning the inline extent against a file which has no extents but
# has a size not greater than bar's inline extent (30 < 50).
# It should be possible to clone the inline extent from file bar into
# this file.
$XFS_IO_PROG -f -c "truncate 30" $SCRATCH_MNT/foo7 | _filter_xfs_io
$CLONER_PROG -s 0 -d 0 -l 0 $SCRATCH_MNT/bar $SCRATCH_MNT/foo7
# Reading the file should not fail.
echo "File foo7 data after clone operation:"
# Must have a size of 50 bytes, with all bytes having a value of 0xbb.
od -t x1 $SCRATCH_MNT/foo7
# Test cloning the inline extent against a file which has a size not
# greater than the size of bar's inline extent (20 < 50) but has
# a prealloc extent that goes beyond the file's size. It should not be
# possible to clone the inline extent from bar into this file.
$XFS_IO_PROG -f -c "falloc -k 0 1M" \
-c "pwrite -S 0x88 0 20" \
$SCRATCH_MNT/foo8 | _filter_xfs_io
$CLONER_PROG -s 0 -d 0 -l 0 $SCRATCH_MNT/bar $SCRATCH_MNT/foo8
echo "File foo8 data after clone operation:"
# Must have a size of 20 bytes, with all bytes having a value of 0x88
# (the clone operation did not modify our file).
od -t x1 $SCRATCH_MNT/foo8
_scratch_unmount
}
echo -e "\nTesting without compression and without the no-holes feature...\n"
test_cloning_inline_extents
echo -e "\nTesting with compression and without the no-holes feature...\n"
test_cloning_inline_extents "" "-o compress"
echo -e "\nTesting without compression and with the no-holes feature...\n"
test_cloning_inline_extents "-O no-holes" ""
echo -e "\nTesting with compression and with the no-holes feature...\n"
test_cloning_inline_extents "-O no-holes" "-o compress"
status=0
exit
Cc: stable@vger.kernel.org
Signed-off-by: Filipe Manana <fdmanana@suse.com>
This fixes a regression introduced by 37b8d27d between v4.1 and v4.2.
When a snapshot is received, its received_uuid is set to the original
uuid of the subvolume. When that snapshot is then resent to a third
filesystem, it's received_uuid is set to the second uuid
instead of the original one. The same was true for the parent_uuid.
This behaviour was partially changed in 37b8d27d, but in that patch
only the parent_uuid was taken from the real original,
not the uuid itself, causing the search for the parent to fail in
the case below.
This happens for example when trying to send a series of linked
snapshots (e.g. created by snapper) from the backup file system back
to the original one.
The following commands reproduce the issue in v4.2.1
(no error in 4.1.6)
# setup three test file systems
for i in 1 2 3; do
truncate -s 50M fs$i
mkfs.btrfs fs$i
mkdir $i
mount fs$i $i
done
echo "content" > 1/testfile
btrfs su snapshot -r 1/ 1/snap1
echo "changed content" > 1/testfile
btrfs su snapshot -r 1/ 1/snap2
# works fine:
btrfs send 1/snap1 | btrfs receive 2/
btrfs send -p 1/snap1 1/snap2 | btrfs receive 2/
# ERROR: could not find parent subvolume
btrfs send 2/snap1 | btrfs receive 3/
btrfs send -p 2/snap1 2/snap2 | btrfs receive 3/
Signed-off-by: Robin Ruede <rruede+git@gmail.com>
Fixes: 37b8d27de5 ("Btrfs: use received_uuid of parent during send")
Cc: stable@vger.kernel.org # v4.2+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Tested-by: Ed Tomlinson <edt@aei.ca>
If we have a file that shares an extent with other files, when processing
the extent item relative to a shared extent, we blindly issue a clone
operation that will target a length matching the length in the extent item
and uses as a source some other file the receiver already has and points
to the same extent. However that range in the other file might not
exclusively point only to the shared extent, and so using that length
will result in the receiver getting a file with different data from the
one in the send snapshot. This issue happened both for incremental and
full send operations.
So fix this by issuing clone operations with lengths that don't cover
regions of the source file that point to different extents (or have holes).
The following test case for fstests reproduces the problem.
seq=`basename $0`
seqres=$RESULT_DIR/$seq
echo "QA output created by $seq"
tmp=/tmp/$$
status=1 # failure is the default!
trap "_cleanup; exit \$status" 0 1 2 3 15
_cleanup()
{
rm -fr $send_files_dir
rm -f $tmp.*
}
# get standard environment, filters and checks
. ./common/rc
. ./common/filter
# real QA test starts here
_supported_fs btrfs
_supported_os Linux
_require_scratch
_need_to_be_root
_require_cp_reflink
_require_xfs_io_command "fpunch"
send_files_dir=$TEST_DIR/btrfs-test-$seq
rm -f $seqres.full
rm -fr $send_files_dir
mkdir $send_files_dir
_scratch_mkfs >>$seqres.full 2>&1
_scratch_mount
# Create our test file with a single 100K extent.
$XFS_IO_PROG -f -c "pwrite -S 0xaa 0K 100K" \
$SCRATCH_MNT/foo | _filter_xfs_io
# Clone our file into a new file named bar.
cp --reflink=always $SCRATCH_MNT/foo $SCRATCH_MNT/bar
# Now overwrite parts of our foo file.
$XFS_IO_PROG -c "pwrite -S 0xbb 50K 10K" \
-c "pwrite -S 0xcc 90K 10K" \
-c "fpunch 70K 10k" \
$SCRATCH_MNT/foo | _filter_xfs_io
_run_btrfs_util_prog subvolume snapshot -r $SCRATCH_MNT \
$SCRATCH_MNT/snap
echo "File digests in the original filesystem:"
md5sum $SCRATCH_MNT/snap/foo | _filter_scratch
md5sum $SCRATCH_MNT/snap/bar | _filter_scratch
_run_btrfs_util_prog send $SCRATCH_MNT/snap -f $send_files_dir/1.snap
# Now recreate the filesystem by receiving the send stream and verify
# we get the same file contents that the original filesystem had.
_scratch_unmount
_scratch_mkfs >>$seqres.full 2>&1
_scratch_mount
_run_btrfs_util_prog receive $SCRATCH_MNT -f $send_files_dir/1.snap
# We expect the destination filesystem to have exactly the same file
# data as the original filesystem.
# The btrfs send implementation had a bug where it sent a clone
# operation from file foo into file bar covering the whole [0, 100K[
# range after creating and writing the file foo. This was incorrect
# because the file bar now included the updates done to file foo after
# we cloned foo to bar, breaking the COW nature of reflink copies
# (cloned extents).
echo "File digests in the new filesystem:"
md5sum $SCRATCH_MNT/snap/foo | _filter_scratch
md5sum $SCRATCH_MNT/snap/bar | _filter_scratch
status=0
exit
Another test case that reproduces the problem when we have compressed
extents:
seq=`basename $0`
seqres=$RESULT_DIR/$seq
echo "QA output created by $seq"
tmp=/tmp/$$
status=1 # failure is the default!
trap "_cleanup; exit \$status" 0 1 2 3 15
_cleanup()
{
rm -fr $send_files_dir
rm -f $tmp.*
}
# get standard environment, filters and checks
. ./common/rc
. ./common/filter
# real QA test starts here
_supported_fs btrfs
_supported_os Linux
_require_scratch
_need_to_be_root
_require_cp_reflink
send_files_dir=$TEST_DIR/btrfs-test-$seq
rm -f $seqres.full
rm -fr $send_files_dir
mkdir $send_files_dir
_scratch_mkfs >>$seqres.full 2>&1
_scratch_mount "-o compress"
# Create our file with an extent of 100K starting at file offset 0K.
$XFS_IO_PROG -f -c "pwrite -S 0xaa 0K 100K" \
-c "fsync" \
$SCRATCH_MNT/foo | _filter_xfs_io
# Rewrite part of the previous extent (its first 40K) and write a new
# 100K extent starting at file offset 100K.
$XFS_IO_PROG -c "pwrite -S 0xbb 0K 40K" \
-c "pwrite -S 0xcc 100K 100K" \
$SCRATCH_MNT/foo | _filter_xfs_io
# Our file foo now has 3 file extent items in its metadata:
#
# 1) One covering the file range 0 to 40K;
# 2) One covering the file range 40K to 100K, which points to the first
# extent we wrote to the file and has a data offset field with value
# 40K (our file no longer uses the first 40K of data from that
# extent);
# 3) One covering the file range 100K to 200K.
# Now clone our file foo into file bar.
cp --reflink=always $SCRATCH_MNT/foo $SCRATCH_MNT/bar
# Create our snapshot for the send operation.
_run_btrfs_util_prog subvolume snapshot -r $SCRATCH_MNT \
$SCRATCH_MNT/snap
echo "File digests in the original filesystem:"
md5sum $SCRATCH_MNT/snap/foo | _filter_scratch
md5sum $SCRATCH_MNT/snap/bar | _filter_scratch
_run_btrfs_util_prog send $SCRATCH_MNT/snap -f $send_files_dir/1.snap
# Now recreate the filesystem by receiving the send stream and verify we
# get the same file contents that the original filesystem had.
# Btrfs send used to issue a clone operation from foo's range
# [80K, 140K[ to bar's range [40K, 100K[ when cloning the extent pointed
# to by foo's second file extent item, this was incorrect because of bad
# accounting of the file extent item's data offset field. The correct
# range to clone from should have been [40K, 100K[.
_scratch_unmount
_scratch_mkfs >>$seqres.full 2>&1
_scratch_mount "-o compress"
_run_btrfs_util_prog receive $SCRATCH_MNT -f $send_files_dir/1.snap
echo "File digests in the new filesystem:"
# Must match the digests we got in the original filesystem.
md5sum $SCRATCH_MNT/snap/foo | _filter_scratch
md5sum $SCRATCH_MNT/snap/bar | _filter_scratch
status=0
exit
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Pull scheduler fix from Thomas Gleixner:
"Fix a long standing state race in finish_task_switch()"
* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/core: Fix TASK_DEAD race in finish_task_switch()
Pull perf fix from Thomas Glexiner:
"Fix build breakage on powerpc in perf tools"
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf tools: Fix build break on powerpc due to sample_reg_masks
Pull maintainer email update from Thomas Gleixner:
"Change Matt Fleming's email address in the maintainers file"
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
MAINTAINERS: Change Matt Fleming's email address
Pull irq fixes from Thomas Gleixner:
"Three trivial commits:
- Fix a kerneldoc regression
- Export handle_bad_irq to unbreak a driver in next
- Add an accessor for the of_node field so refactoring in next does
not depend on merge ordering"
* 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
irqdomain: Add an accessor for the of_node field
genirq: Fix handle_bad_irq kerneldoc comment
genirq: Export handle_bad_irq
This is a set of three bug fixes, two of which are regressions from recent
updates (the 3ware one from 4.1 and the device handler fixes from 4.2).
Signed-off-by: James Bottomley <JBottomley@Odin.com>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQEbBAABAgAGBQJWGgjcAAoJEDeqqVYsXL0MDjIH+N1poa/RB6jU163q1bMnJlp6
6ygEqLtrJH7FNjadiRuYjsZaSWfuabbEyj0zV8t6S7Z2wWQmXG7Qfkic3W0CZa9u
qCa3aPmRB06Al9nwmEDlgilBfFP5hWkWldiPF7uEBqcCsBtzT3cxxUnyoCS1fy28
USMHSQ4fnQDFdaTafXqDCRrMjXIJLeRY1Gg7YuiG7l6h4YK5qPC+0cCpiIeDyDyI
WjTr/SbFzIyDg0r/SNwjZqbhY2+s4a2/4GcAmjpBMWvg2GnXDGt6vxibRvrwZfHf
PtUvsVS7eJhAOAKs67KOJavP6kvheXkj/QTZWq5Y/DqDrd/14qIgjOPpIHYyig==
=20ji
-----END PGP SIGNATURE-----
Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
Pull SCSI fixes from James Bottomley:
"This is a set of three bug fixes, two of which are regressions from
recent updates (the 3ware one from 4.1 and the device handler fixes
from 4.2)"
* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
3w-9xxx: don't unmap bounce buffered commands
scsi_dh: Use the correct module name when loading device handler
libiscsi: Fix iscsi_check_transport_timeouts possible infinite loop
Very careless bug earler in 4.3-rc, now fixed :-)
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQIcBAABCAAGBQJWGYrdAAoJEDnsnt1WYoG56vcP/03E7DoycDQ7uH46i2szf3M/
isy0dk+2S9D87iBz/xf4RXvAY7FTHy9/vIG/o6UKYkhSRzm6T6xCbrwd+duS2TSc
yQM33BJ1VdM+trGj5ywrdF8guwRMjW4NFPnez16moVSVZDbNK2pUdZiw8kGSi39n
hpjftyefojISG6rbDGBGK2JiVTNOqDjMH2Ny8MhX2J5ryQQOsd6+9ojgri3nfTbP
6PmP08QyVxdYA3ZUlTZaKUNZ8AQHgoydhiEyGbdCewcE8pYaeEUqvcBi4DrDOil8
9BGHnf755Wl3k26P8uBsvri9zp+SZl0LEZLhSpyFpRmCTaFGn0pnSKJ0intnRTPc
JZ7gTY6q1Bt5DXToZw7hHVWgxjos8aweS2JLzSJloB6FFlDCckypkvSR4GQL7R9N
jIYntfwaQaJIgUSzVo/Aw6vjBTWbqyLHf3DP8ImsPSe/z0gjtRiyPkjoZgthuYp2
ErLoVe/JgKstR0gmobbdRhShIfXMFVAIwasXOXfq4Ye4LRwvfAwP2UDHrC25mb0O
IJi6fMqf3bWxmLIzFUcTe8Z2nzuKolAgP2rcd6kb0bbLxE4Y5xtzCV8fgnhk2obw
HvP4zZnacLKx8Nvet+YGUKjVJU3wx4RTgyGLU4WqC13fwZREeJLWwxgK859ZJ8yl
k+TQud5fKgfkX20+eTA0
=qdcM
-----END PGP SIGNATURE-----
Merge tag 'md/4.3-rc4-fix' of git://neil.brown.name/md
Pull md bugfix from Neil Brown:
"One bug fix for raid1/raid10.
Very careless bug earler in 4.3-rc, now fixed :-)"
* tag 'md/4.3-rc4-fix' of git://neil.brown.name/md:
crash in md-raid1 and md-raid10 due to incorrect list manipulation
My Intel email address will soon expire. Replace it with my
personal address so people still know where to send patches.
Signed-off-by: Matt Fleming <matt.fleming@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1444494136-10333-1-git-send-email-matt@codeblueprint.co.uk
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Here are some small USB and PHY fixes and quirk updates for 4.3-rc5.
Nothing major here, full details in the shortlog, and all of these have
been in linux-next for a while.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iEYEABECAAYFAlYZOFUACgkQMUfUDdst+ymLSACeLNl7IWSxq2acJ5rhUl5+LRxp
KtsAn3lMXJryk4xw2WpfJg30TXpWXnNM
=n9ei
-----END PGP SIGNATURE-----
Merge tag 'usb-4.3-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb
Pull USB fixes from Greg KH:
"Here are some small USB and PHY fixes and quirk updates for 4.3-rc5.
Nothing major here, full details in the shortlog, and all of these
have been in linux-next for a while"
* tag 'usb-4.3-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb:
usb: Add device quirk for Logitech PTZ cameras
USB: chaoskey read offset bug
USB: Add reset-resume quirk for two Plantronics usb headphones.
usb: renesas_usbhs: Add support for R-Car H3
usb: renesas_usbhs: fix build warning if 64-bit architecture
usb: gadget: bdc: fix memory leak
phy: berlin-sata: Fix module autoload for OF platform driver
phy: rockchip-usb: power down phy when rockchip phy probe
phy: qcom-ufs: fix build error when the component is built as a module
Here are a few bug fixes for the tty core that resolve reported issues,
and some serial driver fixes as well (including the much-reported imx
driver problem.)
All of these have been in linux-next with no reported problems.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iEYEABECAAYFAlYZOMsACgkQMUfUDdst+ykHVQCdEiZhPNcK5Y6E80pzvhGSSkHa
s+kAoJLT4hqzehZkP1hVImxKmWQpq1d1
=8zsH
-----END PGP SIGNATURE-----
Merge tag 'tty-4.3-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty
Pull tty/serial driver fixes from Greg KH:
"Here are a few bug fixes for the tty core that resolve reported
issues, and some serial driver fixes as well (including the
much-reported imx driver problem)
All of these have been in linux-next with no reported problems"
* tag 'tty-4.3-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty:
drivers/tty: require read access for controlling terminal
serial: 8250: add uart_config entry for PORT_RT2880
tty: fix data race on tty_buffer.commit
tty: fix data race in tty_buffer_flush
tty: fix data race in flush_to_ldisc
tty: fix stall caused by missing memory barrier in drivers/tty/n_tty.c
serial: atmel: fix error path of probe function
tty: don't leak cdev in tty_cdev_add()
Revert "serial: imx: remove unbalanced clk_prepare"
Here are two tiny staging tree fixes for 4.3-rc5.
One fixes the broken speakup subsystem as reported by a user, and the
other removes an entry in the MAINTAINERS file for a developer that
doesn't want to be listed anymore.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iEYEABECAAYFAlYZOUkACgkQMUfUDdst+yk/wACgiCc+ChkU/WzC5eH+yACHOYwy
ouQAoNZgUZvmXQs9+X3VqTo/hAGa0/ru
=5fAh
-----END PGP SIGNATURE-----
Merge tag 'staging-4.3-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging
Pull staging fixes from Greg KH:
"Here are two tiny staging tree fixes for 4.3-rc5.
One fixes the broken speakup subsystem as reported by a user, and the
other removes an entry in the MAINTAINERS file for a developer that
doesn't want to be listed anymore"
* tag 'staging-4.3-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging:
staging: speakup: fix speakup-r regression
MAINTAINERS: Remove myself as nvec co-maintainer
Here are some small fixes for some misc drivers that resolve some
reported issues. All of these have been linux-next for a while.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iEYEABECAAYFAlYZOecACgkQMUfUDdst+ykwmwCgpzrjwT0xxZRsDxk+ZeANvmlM
F+MAoJ0e+QZ0AeoADE9PjDgk5LHIUffH
=yAgG
-----END PGP SIGNATURE-----
Merge tag 'char-misc-4.3-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc
Pull char/misc driver fixes from Greg KH:
"Here are some small fixes for some misc drivers that resolve some
reported issues. All of these have been linux-next for a while"
* tag 'char-misc-4.3-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc:
mcb: Fix error handling in mcb_pci_probe()
mei: hbm: fix error in state check logic
nvmem: sunxi: Check for memory allocation failure
nvmem: core: Fix memory leak in nvmem_cell_write
nvmem: core: Handle shift bits in-place if cell->nbits is non-zero
nvmem: core: fix the out-of-range leak in read/write()
Pull MIPS fixes from Ralf Baechle:
- MIPS didn't define the new ioremap_uc. Defined it as an alias for
ioremap_uncached.
- Replace workaround for MIPS16 build issue with a correct one.
* git://git.linux-mips.org/pub/scm/ralf/upstream-linus:
MIPS: Define ioremap_uc
MIPS: UAPI: Ignore __arch_swab{16,32,64} when using MIPS16
Revert "MIPS: UAPI: Fix unrecognized opcode WSBH/DSBH/DSHD when using MIPS16."
Pull swiotlb fixlet from Konrad Rzeszutek Wilk:
"Enable the SWIOTLB under 32-bit PAE kernels.
Nowadays most distros enable this due to CONFIG_HYPERVISOR|XEN=y which
select SWIOTLB. But for those that are not interested in
virtualization and wanting to use 32-bit PAE kernels and wanting to
have working DMA operations - this configures it for them"
* 'stable/for-linus-4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/swiotlb:
swiotlb: Enable it under x86 PAE
Leandro Awa writes:
"After switching to version 4.1.6, our parallelized and distributed
workflows now fail consistently with errors of the form:
T34: ./regex.c:39:22: error: config.h: No such file or directory
From our 'git bisect' testing, the following commit appears to be the
possible cause of the behavior we've been seeing: commit 766c4cbfacd8"
Al Viro says:
"What happens is that 766c4cbfac got the things subtly wrong.
We used to treat d_is_negative() after lookup_fast() as "fall with
ENOENT". That was wrong - checking ->d_flags outside of ->d_seq
protection is unreliable and failing with hard error on what should've
fallen back to non-RCU pathname resolution is a bug.
Unfortunately, we'd pulled the test too far up and ran afoul of
another kind of staleness. The dentry might have been absolutely
stable from the RCU point of view (and we might be on UP, etc), but
stale from the remote fs point of view. If ->d_revalidate() returns
"it's actually stale", dentry gets thrown away and the original code
wouldn't even have looked at its ->d_flags.
What we need is to check ->d_flags where 766c4cbfac does (prior to
->d_seq validation) but only use the result in cases where we do not
discard this dentry outright"
Reported-by: Leandro Awa <lawa@nvidia.com>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=104911
Fixes: 766c4cbfac ("namei: d_is_negative() should be checked...")
Tested-by: Leandro Awa <lawa@nvidia.com>
Cc: stable@vger.kernel.org # v4.1+
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Removing barriers is scary, but a call to atomic_dec_and_test implies
a barrier, so we don't need to issue another one.
Signed-off-by: David Sterba <dsterba@suse.com>