This will be used when we need to re-hash a directory tree when setting
flags.
It is not possible to have concurrent btree_trans on a thread.
Signed-off-by: Joshua Ashton <joshua@froggi.es>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This fixes a bug in bch2_evict_subvolume_inodes(): d_mark_dontcache()
doesn't handle the case where i_count is already 0, we need to grab and
put the inode in order for it to be dropped.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This adds support for nocow mode, where we do writes in-place when
possible. Patch components:
- New boolean filesystem and inode option, nocow: note that when nocow
is enabled, data checksumming and compression are implicitly disabled
- To prevent in-place writes from racing with data moves
(data_update.c) or bucket reuse (i.e. a bucket being reused and
re-allocated while a nocow write is in flight, we have a new locking
mechanism.
Buckets can be locked for either data update or data move, using a
fixed size hash table of two_state_shared locks. We don't have any
chaining, meaning updates and moves to different buckets that hash to
the same lock will wait unnecessarily - we'll want to watch for this
becoming an issue.
- The allocator path also needs to check for in-place writes in flight
to a given bucket before giving it out: thus we add another counter
to bucket_alloc_state so we can track this.
- Fsync now may need to issue cache flushes to block devices instead of
flushing the journal. We add a device bitmask to bch_inode_info,
ei_devs_need_flush, which tracks devices that need to have flushes
issued - note that this will lead to unnecessary flushes when other
codepaths have already issued flushes, we may want to replace this with
a sequence number.
- New nocow write path: look up extents, and if they're writable write
to them - otherwise fall back to the normal COW write path.
XXX: switch to sequence numbers instead of bitmask for devs needing
journal flush
XXX: ei_quota_lock being a mutex means bch2_nocow_write_done() needs to
run in process context - see if we can improve this
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We have a unique lock used for controlling adding to the pagecache: the
lock has two states, where both states are shared - the lock may be held
multiple times for either state - but not both states at the same time.
This is exactly what we need for nocow mode locking, so this patch pulls
it out of fs.c into its own file.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Quota support was disabled when snapshots were released, because of some
tricky interactions with snpashots. We're sidestepping that for now -
we're simply disabling quota accounting on snapshot subvolumes.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
This adds a new assertion to be used by bch2_inode_update_after_write(),
which updates the VFS inode based on the update to the btree inode we
just did - we require that the btree inode still be locked when we do
that update.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Now that we're recording in each inode the journal sequence number of
the most recent update, fsync becomes a lot simpler and we can delete
all the plumbing for ei_journal_seq.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
This fixes building in userspace - code that's coupled to the kernel VFS
interface should live in fs.c
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Snapshot deletion needs to become a multi step process, where we unlink,
then tear down the page cache, then delete the subvolume - the deleting
flag is equivalent to an inode with i_nlink = 0.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
This is the final patch in the patch series implementing snapshots.
This patch implements two new ioctls that work like creation and
deletion of directories, but fancier.
- BCH_IOCTL_SUBVOLUME_CREATE, for creating new subvolumes and snaphots
- BCH_IOCTL_SUBVOLUME_DESTROY, for deleting subvolumes and snapshots
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
To implement snapshots, we need every filesystem btree operation (every
btree operation without a subvolume) to start by looking up the
subvolume and getting the current snapshot ID, with
bch2_subvolume_get_snapshot() - then, that snapshot ID is used for doing
btree lookups in BTREE_ITER_FILTER_SNAPSHOTS mode.
This patch adds those bch2_subvolume_get_snapshot() calls, and also
switches to passing around a subvol_inum instead of just an inode
number.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Filesystem operations generally operate within a subvolume: at the start
of every btree transaction we'll be looking up (and locking) the
subvolume to get the current snapshot ID, which we then use for our
other btree lookups in BTREE_ITER_FILTER_SNAPSHOTS mode.
But inodes don't record what subvolume they're in - they can't, because
if they did we'd have to update every single inode within a subvolume
when taking a snapshot in order to keep that field up to date. So it
needs to be tracked in memory, based on how we got to that inode.
Hence this patch adds a subvolume field to ei_inode_info, and switches
to iget5() so we can index by it in the inode hash table.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
After the v5.12 rebase, we started oopsing when truncate was passed
ATTR_MODE, due to not passing mnt_userns to setattr_copy(). This
refactors things so that truncate/extend finish by using
bch2_setattr_nonsize(), which solves the problem.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
On write error, the vfs inode's i_size may be inconsistent with the
btree inode's i_size - flag this so we don't have spurious assertions.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
In the dio write path, when get_user_pages() invokes the fault handler
we have a recursive locking situation - we have to handle the lock
ordering ourselves or we have a deadlock: this patch addresses that by
checking for locking ordering violations and doing the unlock/relock
dance if necessary.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This refactoring makes the code easier to understand by separating the
bcachefs btree transactional code from the linux VFS code - but more
importantly, it's also to share code with the fuse port.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Initially forked from drivers/md/bcache, bcachefs is a new copy-on-write
filesystem with every feature you could possibly want.
Website: https://bcachefs.org
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>