When we mount in RAID degraded mode without adding a new device to
replace the failed one, we can end up using the wrong RAID flags for
allocations.
This results in strange combinations of block groups (raid1 in a raid10
filesystem) and corruptions when we try to allocate blocks from single
spindle chunks on drives that are actually missing.
The first device has two small 4MB chunks in it that mkfs creates and
these are usually unused in a raid1 or raid10 setup. But, in -o degraded,
the allocator will fall back to these because the mask of desired raid groups
isn't correct.
The fix here is to count the missing devices as we build up the list
of devices in the system. This count is used when picking the
raid level to make sure we continue using the same levels that were
in place before we lost a drive.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
After recent blkdev_get() modifications, open_by_devnum() and
open_bdev_exclusive() are simple wrappers around blkdev_get().
Replace them with blkdev_get_by_dev() and blkdev_get_by_path().
blkdev_get_by_dev() is identical to open_by_devnum().
blkdev_get_by_path() is slightly different in that it doesn't
automatically add %FMODE_EXCL to @mode.
All users are converted. Most conversions are mechanical and don't
introduce any behavior difference. There are several exceptions.
* btrfs now sets FMODE_EXCL in btrfs_device->mode, so there's no
reason to OR it explicitly on blkdev_put().
* gfs2, nilfs2 and the generic mount_bdev() now set FMODE_EXCL in
sb->s_mode.
* With the above changes, sb->s_mode now always should contain
FMODE_EXCL. WARN_ON_ONCE() added to kill_block_super() to detect
errors.
The new blkdev_get_*() functions are with proper docbook comments.
While at it, add function description to blkdev_get() too.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Philipp Reisner <philipp.reisner@linbit.com>
Cc: Neil Brown <neilb@suse.de>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Joern Engel <joern@lazybastard.org>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Jan Kara <jack@suse.cz>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: KONISHI Ryusuke <konishi.ryusuke@lab.ntt.co.jp>
Cc: reiserfs-devel@vger.kernel.org
Cc: xfs-masters@oss.sgi.com
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Switch to the WRITE_FLUSH_FUA flag for log writes, remove the EOPNOTSUPP
detection for barriers and stop setting the barrier flag for discards.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Chris Mason <chris.mason@oracle.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Currently, we can panic the box if the first block group we go to move is of a
type where there is no space left to move those extents. For example, if we
fill the disk up with data, and then we try to balance and we have no room to
move the data nor room to allocate new chunks, we will panic. Change this by
checking to see if we have room to move this chunk around, and if not, return
-ENOSPC and move on to the next chunk. This will make sure we remove block
groups that are moveable, like if we have alot of empty metadata block groups,
and then that way we make room to be able to balance our data chunks as well.
Tested this with an fs that would panic on btrfs-vol -b normally, but no longer
panics with this patch.
V1->V2:
-actually search for a free extent on the device to make sure we can allocate a
chunk if need be.
-fix btrfs_shrink_device to make sure we actually try to relocate all the
chunks, and then if we can't return -ENOSPC so if we are doing a btrfs-vol -r
we don't remove the device with data still on it.
-check to make sure the block group we are going to relocate isn't the last one
in that particular space
-fix a bug in btrfs_shrink_device where we would change the device's size and
not fix it if we fail to do our relocate
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
On multi-device filesystems, btrfs writes supers to all of the devices
before considering a sync complete. There wasn't any additional
locking between super writeout and the device list management code
because device management was done inside a transaction and
super writeout only happened with no transation writers running.
With the btrfs fsync log and other async transaction updates, this
has been racey for some time. This adds a mutex to protect
the device list. The existing volume mutex could not be reused due to
transaction lock ordering requirements.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
During mount, btrfs will check the queue nonrot flag
for all the devices found in the FS. If they are all
non-rotating, SSD mode is enabled by default.
If the FS was mounted with -o nossd, the non-rotating
flag is ignored.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Previously, we updated a device's size prior to attempting a shrink
operation. This patch moves the device resizing logic to only happen if
the shrink completes successfully. In the process, it introduces a new
field to btrfs_device -- disk_total_bytes -- to track the on-disk size.
Signed-off-by: Chris Ball <cjb@laptop.org>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Part of reducing fsync/O_SYNC/O_DIRECT latencies is using WRITE_SYNC for
writes we plan on waiting on in the near future. This patch
mirrors recent changes in other filesystems and the generic code to
use WRITE_SYNC when WB_SYNC_ALL is passed and to use WRITE_SYNC for
other latency critical writes.
Btrfs uses async worker threads for checksumming before the write is done,
and then again to actually submit the bios. The bio submission code just
runs a per-device list of bios that need to be sent down the pipe.
This list is split into low priority and high priority lists so the
WRITE_SYNC IO happens first.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This patch makes seed device possible to be shared by
multiple mounted file systems. The sharing is achieved
by cloning seed device's btrfs_fs_devices structure.
Thanks you,
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
This patch implements superblock duplication. Superblocks
are stored at offset 16K, 64M and 256G on every devices.
Spaces used by superblocks are preserved by the allocator,
which uses a reverse mapping function to find the logical
addresses that correspond to superblocks. Thank you,
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
* open/close_bdev_excl -> open/close_bdev_exclusive
* blkdev_issue_discard takes a GFP mask now
* Fix blkdev_issue_discard usage now that it is enabled
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Seed device is a special btrfs with SEEDING super flag
set and can only be mounted in read-only mode. Seed
devices allow people to create new btrfs on top of it.
The new FS contains the same contents as the seed device,
but it can be mounted in read-write mode.
This patch does the following:
1) split code in btrfs_alloc_chunk into two parts. The first part does makes
the newly allocated chunk usable, but does not do any operation that modifies
the chunk tree. The second part does the the chunk tree modifications. This
division is for the bootstrap step of adding storage to the seed device.
2) Update device management code to handle seed device.
The basic idea is: For an FS grown from seed devices, its
seed devices are put into a list. Seed devices are
opened on demand at mounting time. If any seed device is
missing or has been changed, btrfs kernel module will
refuse to mount the FS.
3) make btrfs_find_block_group not return NULL when all
block groups are read-only.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
The multi-bio code is responsible for duplicating blocks in raid1 and
single spindle duplication. It has counters to make sure all of
the locations for a given extent are properly written before io completion
is returned to the higher layers.
But, it didn't always complete the same bio it was given, sometimes a
clone was completed instead. This lead to problems with the async
work queues because they saved a pointer to the bio in a struct off
bi_private.
The fix is to remember the original bio and only complete that one.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Btrfs has been using workqueues to spread the checksumming load across
other CPUs in the system. But, workqueues only schedule work on the
same CPU that queued the work, giving them a limited benefit for systems with
higher CPU counts.
This code adds a generic facility to schedule work with pools of kthreads,
and changes the bio submission code to queue bios up. The queueing is
important to make sure large numbers of procs on the system don't
turn streaming workloads into random workloads by sending IO down
concurrently.
The end result of all of this is much higher performance (and CPU usage) when
doing checksumming on large machines. Two worker pools are created,
one for writes and one for endio processing. The two could deadlock if
we tried to service both from a single pool.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Devices can change after the scan ioctls are done, and btrfs_open_devices
needs to be able to verify them as they are opened and used by the FS.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This required a few structural changes to the code that manages bdev pointers:
The VFS super block now gets an anon-bdev instead of a pointer to the
lowest bdev. This allows us to avoid swapping the super block bdev pointer
around at run time.
The code to read in the super block no longer goes through the extent
buffer interface. Things got ugly keeping the mapping constant.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This allows other code that needs to walk every device in the FS to do so
without locking against allocations.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Block headers now store the chunk tree uuid
Chunk items records the device uuid for each stripes
Device extent items record better back refs to the chunk tree
Block groups record better back refs to the chunk tree
The chunk tree format has also changed. The objectid of BTRFS_CHUNK_ITEM_KEY
used to be the logical offset of the chunk. Now it is a chunk tree id,
with the logical offset being stored in the offset field of the key.
This allows a single chunk tree to record multiple logical address spaces,
upping the number of bytes indexed by a chunk tree from 2^64 to
2^128.
Signed-off-by: Chris Mason <chris.mason@oracle.com>