linux/fs/xfs
Dave Chinner 32baa63d82 xfs: logging the on disk inode LSN can make it go backwards
When we log an inode, we format the "log inode" core and set an LSN
in that inode core. We do that via xfs_inode_item_format_core(),
which calls:

	xfs_inode_to_log_dinode(ip, dic, ip->i_itemp->ili_item.li_lsn);

to format the log inode. It writes the LSN from the inode item into
the log inode, and if recovery decides the inode item needs to be
replayed, it recovers the log inode LSN field and writes it into the
on disk inode LSN field.

Now this might seem like a reasonable thing to do, but it is wrong
on multiple levels. Firstly, if the item is not yet in the AIL,
item->li_lsn is zero. i.e. the first time the inode it is logged and
formatted, the LSN we write into the log inode will be zero. If we
only log it once, recovery will run and can write this zero LSN into
the inode.

This means that the next time the inode is logged and log recovery
runs, it will *always* replay changes to the inode regardless of
whether the inode is newer on disk than the version in the log and
that violates the entire purpose of recording the LSN in the inode
at writeback time (i.e. to stop it going backwards in time on disk
during recovery).

Secondly, if we commit the CIL to the journal so the inode item
moves to the AIL, and then relog the inode, the LSN that gets
stamped into the log inode will be the LSN of the inode's current
location in the AIL, not it's age on disk. And it's not the LSN that
will be associated with the current change. That means when log
recovery replays this inode item, the LSN that ends up on disk is
the LSN for the previous changes in the log, not the current
changes being replayed. IOWs, after recovery the LSN on disk is not
in sync with the LSN of the modifications that were replayed into
the inode. This, again, violates the recovery ordering semantics
that on-disk writeback LSNs provide.

Hence the inode LSN in the log dinode is -always- invalid.

Thirdly, recovery actually has the LSN of the log transaction it is
replaying right at hand - it uses it to determine if it should
replay the inode by comparing it to the on-disk inode's LSN. But it
doesn't use that LSN to stamp the LSN into the inode which will be
written back when the transaction is fully replayed. It uses the one
in the log dinode, which we know is always going to be incorrect.

Looking back at the change history, the inode logging was broken by
commit 93f958f9c4 ("xfs: cull unnecessary icdinode fields") way
back in 2016 by a stupid idiot who thought he knew how this code
worked. i.e. me. That commit replaced an in memory di_lsn field that
was updated only at inode writeback time from the inode item.li_lsn
value - and hence always contained the same LSN that appeared in the
on-disk inode - with a read of the inode item LSN at inode format
time. CLearly these are not the same thing.

Before 93f958f9c4, the log recovery behaviour was irrelevant,
because the LSN in the log inode always matched the on-disk LSN at
the time the inode was logged, hence recovery of the transaction
would never make the on-disk LSN in the inode go backwards or get
out of sync.

A symptom of the problem is this, caught from a failure of
generic/482. Before log recovery, the inode has been allocated but
never used:

xfs_db> inode 393388
xfs_db> p
core.magic = 0x494e
core.mode = 0
....
v3.crc = 0x99126961 (correct)
v3.change_count = 0
v3.lsn = 0
v3.flags2 = 0
v3.cowextsize = 0
v3.crtime.sec = Thu Jan  1 10:00:00 1970
v3.crtime.nsec = 0

After log recovery:

xfs_db> p
core.magic = 0x494e
core.mode = 020444
....
v3.crc = 0x23e68f23 (correct)
v3.change_count = 2
v3.lsn = 0
v3.flags2 = 0
v3.cowextsize = 0
v3.crtime.sec = Thu Jul 22 17:03:03 2021
v3.crtime.nsec = 751000000
...

You can see that the LSN of the on-disk inode is 0, even though it
clearly has been written to disk. I point out this inode, because
the generic/482 failure occurred because several adjacent inodes in
this specific inode cluster were not replayed correctly and still
appeared to be zero on disk when all the other metadata (inobt,
finobt, directories, etc) indicated they should be allocated and
written back.

The fix for this is two-fold. The first is that we need to either
revert the LSN changes in 93f958f9c4 or stop logging the inode LSN
altogether. If we do the former, log recovery does not need to
change but we add 8 bytes of memory per inode to store what is
largely a write-only inode field. If we do the latter, log recovery
needs to stamp the on-disk inode in the same manner that inode
writeback does.

I prefer the latter, because we shouldn't really be trying to log
and replay changes to the on disk LSN as the on-disk value is the
canonical source of the on-disk version of the inode. It also
matches the way we recover buffer items - we create a buf_log_item
that carries the current recovery transaction LSN that gets stamped
into the buffer by the write verifier when it gets written back
when the transaction is fully recovered.

However, this might break log recovery on older kernels even more,
so I'm going to simply ignore the logged value in recovery and stamp
the on-disk inode with the LSN of the transaction being recovered
that will trigger writeback on transaction recovery completion. This
will ensure that the on-disk inode LSN always reflects the LSN of
the last change that was written to disk, regardless of whether it
comes from log recovery or runtime writeback.

Fixes: 93f958f9c4 ("xfs: cull unnecessary icdinode fields")
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-07-29 09:27:29 -07:00
..
libxfs xfs: logging the on disk inode LSN can make it go backwards 2021-07-29 09:27:29 -07:00
scrub xfs: detect misaligned rtinherit directory extent size hints 2021-07-15 09:58:42 -07:00
Kconfig xfs: fix Kconfig asking about XFS_SUPPORT_V4 when XFS_FS=n 2020-10-16 15:34:28 -07:00
kmem.c xfs: remove kmem_realloc() 2020-09-06 18:05:51 -07:00
kmem.h xfs: Remove kmem_zalloc_large() 2020-09-15 20:52:41 -07:00
Makefile xfs: refactor log recovery item sorting into a generic dispatch structure 2020-05-08 08:49:58 -07:00
mrlock.h
xfs_acl.c xfs: support idmapped mounts 2021-01-24 14:43:46 +01:00
xfs_acl.h fs: make helpers idmap mount aware 2021-01-24 14:27:20 +01:00
xfs_aops.c fs: remove noop_set_page_dirty() 2021-06-29 10:53:48 -07:00
xfs_aops.h
xfs_attr_inactive.c xfs: Add delay ready attr remove routines 2021-06-01 10:49:47 -07:00
xfs_attr_list.c xfs: rename and simplify xfs_bmap_one_block 2021-04-15 09:35:50 -07:00
xfs_bio_io.c xfs: async blkdev cache flush 2021-06-21 10:05:51 -07:00
xfs_bmap_item.c treewide: Change list_sort to use const pointers 2021-04-08 16:04:22 -07:00
xfs_bmap_item.h xfs: refactor intent item RECOVERED flag into the log item 2020-05-08 08:50:01 -07:00
xfs_bmap_util.c New code for 5.14: 2021-07-02 14:30:27 -07:00
xfs_bmap_util.h
xfs_buf_item_recover.c xfs: fix finobt btree block recovery ordering 2020-09-30 07:28:52 -07:00
xfs_buf_item.c xfs: remove dead stale buf unpin handling code 2021-06-21 10:14:24 -07:00
xfs_buf_item.h xfs: move the buffer retry logic to xfs_buf.c 2020-09-15 20:52:38 -07:00
xfs_buf.c xfs: remove xfs_blkdev_issue_flush 2021-06-21 10:05:46 -07:00
xfs_buf.h xfs: remove ->b_offset handling for page backed buffers 2021-06-07 11:49:50 +10:00
xfs_dir2_readdir.c xfs: remove XFS_IFINLINE 2021-04-15 09:35:51 -07:00
xfs_discard.c xfs: convert allocbt cursors to use perags 2021-06-02 10:48:24 +10:00
xfs_discard.h
xfs_dquot_item_recover.c xfs: rename the ondisk dquot d_flags to d_type 2020-07-28 20:24:14 -07:00
xfs_dquot_item.c xfs: xfs_log_force_lsn isn't passed a LSN 2021-06-21 10:12:33 -07:00
xfs_dquot_item.h xfs: factor out quotaoff intent AIL removal and memory free 2020-03-18 08:12:23 -07:00
xfs_dquot.c xfs: move the XFS_IFEXTENTS check into xfs_iread_extents 2021-04-15 09:35:50 -07:00
xfs_dquot.h xfs: refactor default quota grace period setting code 2020-09-15 20:52:40 -07:00
xfs_error.c xfs: add error injection for per-AG resv failure 2021-03-25 16:47:53 -07:00
xfs_error.h xfs: xfs_buf_corruption_error should take __this_address 2020-03-12 07:58:12 -07:00
xfs_export.c xfs: Fix fall-through warnings for Clang 2021-05-26 14:51:26 -05:00
xfs_export.h
xfs_extent_busy.c xfs: pass perags through to the busy extent code 2021-06-02 10:48:24 +10:00
xfs_extent_busy.h xfs: pass perags through to the busy extent code 2021-06-02 10:48:24 +10:00
xfs_extfree_item.c treewide: Change list_sort to use const pointers 2021-04-08 16:04:22 -07:00
xfs_extfree_item.h xfs: refactor intent item RECOVERED flag into the log item 2020-05-08 08:50:01 -07:00
xfs_file.c New code for 5.14: 2021-07-02 14:30:27 -07:00
xfs_filestream.c xfs: move xfs_perag_get/put to xfs_ag.[ch] 2021-06-02 10:48:24 +10:00
xfs_filestream.h xfs: move the di_flags field to struct xfs_inode 2021-04-07 14:37:05 -07:00
xfs_fsmap.c xfs: remove agno from btree cursor 2021-06-02 10:48:24 +10:00
xfs_fsmap.h xfs: fix deadlock and streamline xfs_getfsmap performance 2020-10-07 08:40:29 -07:00
xfs_fsops.c xfs: shorten the shutdown messages to a single line 2021-06-21 10:14:13 -07:00
xfs_fsops.h xfs: get rid of xfs_growfs_{data,log}_t 2021-02-03 09:18:50 -08:00
xfs_globals.c xfs: consolidate the eofblocks and cowblocks workers 2021-02-03 09:18:49 -08:00
xfs_health.c xfs: drop IDONTCACHE on inodes when we mark them sick 2021-06-08 09:30:20 -07:00
xfs_icache.c xfs: fix type mismatches in the inode reclaim functions 2021-06-21 10:12:46 -07:00
xfs_icache.h xfs: fix type mismatches in the inode reclaim functions 2021-06-21 10:12:46 -07:00
xfs_icreate_item.c xfs: Remove kmem_zone_zalloc() usage 2020-07-28 20:24:14 -07:00
xfs_icreate_item.h
xfs_inode_item_recover.c xfs: logging the on disk inode LSN can make it go backwards 2021-07-29 09:27:29 -07:00
xfs_inode_item.c xfs: xfs_log_force_lsn isn't passed a LSN 2021-06-21 10:12:33 -07:00
xfs_inode_item.h xfs: xfs_log_force_lsn isn't passed a LSN 2021-06-21 10:12:33 -07:00
xfs_inode.c xfs: reset child dir '..' entry when unlinking child 2021-07-15 09:58:42 -07:00
xfs_inode.h xfs: get rid of xfs_dir_ialloc() 2021-06-02 10:48:24 +10:00
xfs_ioctl32.c xfs: convert to fileattr 2021-04-12 15:04:29 +02:00
xfs_ioctl32.h xfs: convert to fileattr 2021-04-12 15:04:29 +02:00
xfs_ioctl.c xfs: don't expose misaligned extszinherit hints to userspace 2021-07-15 09:58:42 -07:00
xfs_ioctl.h xfs: convert to fileattr 2021-04-12 15:04:29 +02:00
xfs_iomap.c xfs: Fix fall-through warnings for Clang 2021-05-26 14:51:26 -05:00
xfs_iomap.h
xfs_iops.c xfs: clean up open-coded fs block unit conversions 2021-06-01 12:53:59 -07:00
xfs_iops.h xfs: support idmapped mounts 2021-01-24 14:43:46 +01:00
xfs_itable.c xfs: move the di_crtime field to struct xfs_inode 2021-04-07 14:37:05 -07:00
xfs_itable.h xfs: support idmapped mounts 2021-01-24 14:43:46 +01:00
xfs_iwalk.c xfs: use perag for ialloc btree cursors 2021-06-02 10:48:24 +10:00
xfs_iwalk.h
xfs_linux.h xfs: async blkdev cache flush 2021-06-21 10:05:51 -07:00
xfs_log_cil.c xfs: fix ordering violation between cache flushes and tail updates 2021-07-29 09:27:28 -07:00
xfs_log_priv.h xfs: fix ordering violation between cache flushes and tail updates 2021-07-29 09:27:28 -07:00
xfs_log_recover.c xfs: force the log offline when log intent item recovery fails 2021-06-21 10:14:24 -07:00
xfs_log.c xfs: avoid unnecessary waits in xfs_log_force_lsn() 2021-07-29 09:27:28 -07:00
xfs_log.h xfs: xfs_log_force_lsn isn't passed a LSN 2021-06-21 10:12:33 -07:00
xfs_message.c xfs: refactor ratelimited buffer error messages into helper 2020-05-07 08:27:46 -07:00
xfs_message.h once: implement DO_ONCE_LITE for non-fast-path "do once" functionality 2021-06-28 15:54:57 -07:00
xfs_mount.c xfs: fix log intent recovery ENOSPC shutdowns when inactivating inodes 2021-06-21 10:14:24 -07:00
xfs_mount.h xfs: move perag structure and setup to libxfs/xfs_ag.[ch] 2021-06-02 10:48:24 +10:00
xfs_mru_cache.c xfs: set WQ_SYSFS on all workqueues in debug mode 2021-02-03 09:18:49 -08:00
xfs_mru_cache.h
xfs_ondisk.h xfs: rename struct xfs_legacy_ictimestamp 2021-04-22 18:29:25 -07:00
xfs_pnfs.c xfs: move the di_size field to struct xfs_inode 2021-04-07 14:37:03 -07:00
xfs_pnfs.h
xfs_pwork.c xfs: increase the default parallelism levels of pwork clients 2021-02-03 09:18:49 -08:00
xfs_pwork.h xfs: increase the default parallelism levels of pwork clients 2021-02-03 09:18:49 -08:00
xfs_qm_bhv.c xfs: move the di_projid field to struct xfs_inode 2021-04-07 14:37:03 -07:00
xfs_qm_syscalls.c xfs: move the quotaoff dqrele inode walk into xfs_icache.c 2021-06-03 15:56:02 -07:00
xfs_qm.c xfs: get rid of xfs_dir_ialloc() 2021-06-02 10:48:24 +10:00
xfs_qm.h xfs: move the quotaoff dqrele inode walk into xfs_icache.c 2021-06-03 15:56:02 -07:00
xfs_quota.h xfs: remove xfs_qm_vop_chown_reserve 2021-02-03 09:18:49 -08:00
xfs_quotaops.c xfs: move the di_nblocks field to struct xfs_inode 2021-04-07 14:37:03 -07:00
xfs_refcount_item.c treewide: Change list_sort to use const pointers 2021-04-08 16:04:22 -07:00
xfs_refcount_item.h xfs: refactor intent item RECOVERED flag into the log item 2020-05-08 08:50:01 -07:00
xfs_reflink.c xfs: convert refcount btree cursor to use perags 2021-06-02 10:48:24 +10:00
xfs_reflink.h xfs: move helpers that lock and unlock two inodes against userspace IO 2020-07-06 10:46:57 -07:00
xfs_rmap_item.c treewide: Change list_sort to use const pointers 2021-04-08 16:04:22 -07:00
xfs_rmap_item.h xfs: refactor intent item RECOVERED flag into the log item 2020-05-08 08:50:01 -07:00
xfs_rtalloc.c xfs: fix an integer overflow error in xfs_growfs_rt 2021-07-15 09:58:42 -07:00
xfs_rtalloc.h xfs: remove xfs_buf_t typedef 2020-12-16 16:07:34 -08:00
xfs_stats.c xfs: periodically relog deferred intent items 2020-10-07 08:40:28 -07:00
xfs_stats.h xfs: periodically relog deferred intent items 2020-10-07 08:40:28 -07:00
xfs_super.c xfs: remove xfs_blkdev_issue_flush 2021-06-21 10:05:46 -07:00
xfs_super.h xfs: remove xfs_blkdev_issue_flush 2021-06-21 10:05:46 -07:00
xfs_symlink.c xfs: get rid of xfs_dir_ialloc() 2021-06-02 10:48:24 +10:00
xfs_symlink.h xfs: support idmapped mounts 2021-01-24 14:43:46 +01:00
xfs_sysctl.c xfs: restore speculative_cow_prealloc_lifetime sysctl 2021-02-24 10:16:08 -08:00
xfs_sysctl.h xfs: consolidate the eofblocks and cowblocks workers 2021-02-03 09:18:49 -08:00
xfs_sysfs.c
xfs_sysfs.h xfs: Fix UBSAN null-ptr-deref in xfs_sysfs_init 2020-08-07 11:50:17 -07:00
xfs_trace.c xfs: move perag structure and setup to libxfs/xfs_ag.[ch] 2021-06-02 10:48:24 +10:00
xfs_trace.h xfs: fix type mismatches in the inode reclaim functions 2021-06-21 10:12:46 -07:00
xfs_trans_ail.c xfs: delete duplicated words + other fixes 2020-08-05 08:49:58 -07:00
xfs_trans_buf.c xfs: Fix fall-through warnings for Clang 2021-05-26 14:51:26 -05:00
xfs_trans_dquot.c xfs: shut down the filesystem if we screw up quota reservation 2021-02-03 09:18:49 -08:00
xfs_trans_priv.h xfs: refactor adding recovered intent items to the log 2020-05-08 08:50:00 -07:00
xfs_trans.c xfs: xfs_log_force_lsn isn't passed a LSN 2021-06-21 10:12:33 -07:00
xfs_trans.h xfs: xfs_log_force_lsn isn't passed a LSN 2021-06-21 10:12:33 -07:00
xfs_xattr.c xfs: prevent metadata files from being inactivated 2021-03-25 16:47:50 -07:00
xfs.h