bianbu-linux-6.6/fs/btrfs
Josef Bacik 907717eea1 btrfs: drop the backref cache during relocation if we commit
[ Upstream commit db7e68b522c01eb666cfe1f31637775f18997811 ]

Since the inception of relocation we have maintained the backref cache
across transaction commits, updating the backref cache with the new
bytenr whenever we COWed blocks that were in the cache, and then
updating their bytenr once we detected a transaction id change.

This works as long as we're only ever modifying blocks, not changing the
structure of the tree.

However relocation does in fact change the structure of the tree.  For
example, if we are relocating a data extent, we will look up all the
leaves that point to this data extent.  We will then call
do_relocation() on each of these leaves, which will COW down to the leaf
and then update the file extent location.

But, a key feature of do_relocation() is the pending list.  This is all
the pending nodes that we modified when we updated the file extent item.
We will then process all of these blocks via finish_pending_nodes, which
calls do_relocation() on all of the nodes that led up to that leaf.

The purpose of this is to make sure we don't break sharing unless we
absolutely have to.  Consider the case that we have 3 snapshots that all
point to this leaf through the same nodes, the initial COW would have
created a whole new path.  If we did this for all 3 snapshots we would
end up with 3x the number of nodes we had originally.  To avoid this we
will cycle through each of the snapshots that point to each of these
nodes and update their pointers to point at the new nodes.

Once we update the pointer to the new node we will drop the node we
removed the link for and all of its children via btrfs_drop_subtree().
This is essentially just btrfs_drop_snapshot(), but for an arbitrary
point in the snapshot.

The problem with this is that we will never reflect this in the backref
cache.  If we do this btrfs_drop_snapshot() for a node that is in the
backref tree, we will leave the node in the backref tree.  This becomes
a problem when we change the transid, as now the backref cache has
entire subtrees that no longer exist, but exist as if they still are
pointed to by the same roots.

In the best case scenario you end up with "adding refs to an existing
tree ref" errors from insert_inline_extent_backref(), where we attempt
to link in nodes on roots that are no longer valid.

Worst case you will double free some random block and re-use it when
there's still references to the block.

This is extremely subtle, and the consequences are quite bad.  There
isn't a way to make sure our backref cache is consistent between
transid's.

In order to fix this we need to simply evict the entire backref cache
anytime we cross transid's.  This reduces performance in that we have to
rebuild this backref cache every time we change transid's, but fixes the
bug.

This has existed since relocation was added, and is a pretty critical
bug.  There's a lot more cleanup that can be done now that this
functionality is going away, but this patch is as small as possible in
order to fix the problem and make it easy for us to backport it to all
the kernels it needs to be backported to.

Followup series will dismantle more of this code and simplify relocation
drastically to remove this functionality.

We have a reproducer that reproduced the corruption within a few minutes
of running.  With this patch it survives several iterations/hours of
running the reproducer.

Fixes: 3fd0a5585e ("Btrfs: Metadata ENOSPC handling for balance")
CC: stable@vger.kernel.org
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-10-10 11:58:06 +02:00
..
tests btrfs: tests: allocate dummy fs_info and root in test_find_delalloc() 2024-08-29 17:33:37 +02:00
accessors.c
accessors.h btrfs: use helper sizeof_field in struct accessors 2023-08-21 14:52:13 +02:00
acl.c
acl.h
async-thread.c btrfs: use alloc_ordered_workqueue() to create ordered workqueues 2023-06-19 13:59:30 +02:00
async-thread.h btrfs: use alloc_ordered_workqueue() to create ordered workqueues 2023-06-19 13:59:30 +02:00
backref.c btrfs: drop the backref cache during relocation if we commit 2024-10-10 11:58:06 +02:00
backref.h btrfs: fix unwritten extent buffer after snapshotting a new subvolume 2023-10-23 17:17:30 +02:00
bio.c btrfs: fix a use-after-free when hitting errors inside btrfs_submit_chunk() 2024-09-04 13:28:18 +02:00
bio.h btrfs: add an ordered_extent pointer to struct btrfs_bio 2023-06-19 13:59:36 +02:00
block-group.c btrfs: zoned: fix zone_unusable accounting on making block group read-write again 2024-08-11 12:47:24 +02:00
block-group.h btrfs: add and use helper to check if block group is used 2024-02-23 09:24:47 +01:00
block-rsv.c btrfs: fix data race at btrfs_use_block_rsv() when accessing block reserve 2024-03-26 18:19:13 -04:00
block-rsv.h btrfs: fix data race at btrfs_use_block_rsv() when accessing block reserve 2024-03-26 18:19:13 -04:00
btrfs_inode.h btrfs: fix race setting file private on concurrent lseek using same fd 2024-10-04 16:29:59 +02:00
check-integrity.c btrfs: rename __btrfs_map_block to btrfs_map_block 2023-06-19 13:59:34 +02:00
check-integrity.h
compression.c btrfs: zlib: fix and simplify the inline extent decompression 2024-08-29 17:33:32 +02:00
compression.h btrfs: zlib: fix and simplify the inline extent decompression 2024-08-29 17:33:32 +02:00
ctree.c btrfs: replace BUG_ON() with error handling at update_ref_for_cow() 2024-09-12 11:11:37 +02:00
ctree.h btrfs: fix race setting file private on concurrent lseek using same fd 2024-10-04 16:29:59 +02:00
defrag.c btrfs: defrag: change BUG_ON to assertion in btrfs_defrag_leaves() 2024-08-29 17:33:37 +02:00
defrag.h
delalloc-space.c btrfs: don't reserve space for checksums when writing to nocow files 2024-02-23 09:24:48 +01:00
delalloc-space.h
delayed-inode.c btrfs: change BUG_ON to assertion when checking for delayed_node root 2024-08-29 17:33:37 +02:00
delayed-inode.h btrfs: add __counted_by for struct btrfs_delayed_item and use struct_size() 2023-10-11 11:37:19 +02:00
delayed-ref.c btrfs: prevent transaction block reserve underflow when starting transaction 2023-09-20 20:42:18 +02:00
delayed-ref.h btrfs: prevent transaction block reserve underflow when starting transaction 2023-09-20 20:42:18 +02:00
dev-replace.c btrfs: dev-replace: properly validate device names 2024-03-06 14:48:40 +00:00
dev-replace.h
dir-item.c btrfs: abort transaction on generation mismatch when marking eb as dirty 2023-11-28 17:19:35 +00:00
dir-item.h
discard.c btrfs: unexport btrfs_run_discard_work and make it static 2023-06-19 13:59:25 +02:00
discard.h btrfs: unexport btrfs_run_discard_work and make it static 2023-06-19 13:59:25 +02:00
disk-io.c btrfs: wait for fixup workers before stopping cleaner kthread during umount 2024-10-10 11:57:58 +02:00
disk-io.h btrfs: fix double free of anonymous device after snapshot creation failure 2024-03-06 14:48:40 +00:00
export.c btrfs: export: handle invalid inode or root reference in btrfs_get_parent() 2024-04-13 13:07:32 +02:00
export.h
extent-io-tree.c btrfs: make find_first_extent_bit() return a boolean 2023-08-21 14:52:12 +02:00
extent-io-tree.h btrfs: make find_first_extent_bit() return a boolean 2023-08-21 14:52:12 +02:00
extent-tree.c btrfs: always update fstrim_range on failure in FITRIM ioctl 2024-10-04 16:29:54 +02:00
extent-tree.h btrfs: wait on uncached block groups on every allocation loop 2023-08-21 14:54:47 +02:00
extent_io.c btrfs: replace sb::s_blocksize by fs_info::sectorsize 2024-08-29 17:33:43 +02:00
extent_io.h btrfs: zoned: introduce block group context to btrfs_eb_write_context 2023-08-21 14:52:19 +02:00
extent_map.c btrfs: fix wrong block_start calculation for btrfs_drop_extent_map_range() 2024-05-02 16:32:44 +02:00
extent_map.h btrfs: pass the new logical address to split_extent_map 2023-06-19 13:59:33 +02:00
file-item.c btrfs: abort transaction on generation mismatch when marking eb as dirty 2023-11-28 17:19:35 +00:00
file-item.h btrfs: scrub: avoid unnecessary csum tree search preparing stripes 2023-08-21 14:54:48 +02:00
file.c btrfs: fix race setting file private on concurrent lseek using same fd 2024-10-04 16:29:59 +02:00
file.h
free-space-cache.c btrfs: zoned: properly take lock to read/update block group's zoned variables 2024-08-29 17:33:16 +02:00
free-space-cache.h btrfs: move btrfs_check_trunc_cache_free_space into block-rsv.c 2023-06-19 13:59:24 +02:00
free-space-tree.c btrfs: abort transaction on generation mismatch when marking eb as dirty 2023-11-28 17:19:35 +00:00
free-space-tree.h btrfs: make clear_cache mount option to rebuild FST without disabling it 2023-05-10 14:51:27 +02:00
fs.c
fs.h btrfs: zoned: activate metadata block group on write time 2023-08-21 14:52:19 +02:00
inode-item.c btrfs: abort transaction on generation mismatch when marking eb as dirty 2023-11-28 17:19:35 +00:00
inode-item.h btrfs: move split_flags/combine_flags helpers to inode-item.h 2023-06-19 13:59:25 +02:00
inode.c btrfs: update target inode's ctime on unlink 2024-09-18 19:24:05 +02:00
ioctl.c btrfs: always update fstrim_range on failure in FITRIM ioctl 2024-10-04 16:29:54 +02:00
ioctl.h
Kconfig MAINTAINERS: remove links to obsolete btrfs.wiki.kernel.org 2023-09-08 14:21:27 +02:00
locking.c btrfs: add block-group tree to lockdep classes 2023-06-19 13:59:35 +02:00
locking.h btrfs: do not block starts waiting on previous transaction commit 2023-09-08 14:10:49 +02:00
lru_cache.c
lru_cache.h btrfs: remove btrfs_lru_cache_is_full() inline function 2023-04-17 18:01:18 +02:00
lzo.c btrfs: disable allocation warnings for compression workspaces 2023-06-19 13:59:34 +02:00
Makefile
messages.c btrfs: remove v0 extent handling 2023-08-21 14:54:48 +02:00
messages.h btrfs: remove v0 extent handling 2023-08-21 14:54:48 +02:00
misc.h minmax: add in_range() macro 2023-08-24 16:20:18 -07:00
ordered-data.c btrfs: set correct ram_bytes when splitting ordered extent 2024-05-17 12:02:30 +02:00
ordered-data.h btrfs: add a btrfs_finish_ordered_extent helper 2023-06-19 13:59:37 +02:00
orphan.c
orphan.h
print-tree.c btrfs: avoid using fixed char array size for tree names 2024-08-14 13:58:59 +02:00
print-tree.h btrfs: print-tree: pass const extent buffer pointer 2023-06-19 13:59:22 +02:00
props.c
props.h
qgroup.c btrfs: run delayed iputs when flushing delalloc 2024-09-04 13:28:18 +02:00
qgroup.h btrfs: qgroup: iterate qgroups without memory allocation for qgroup_reserve() 2024-01-01 12:42:24 +00:00
raid56.c btrfs: scrub: avoid unnecessary csum tree search preparing stripes 2023-08-21 14:54:48 +02:00
raid56.h btrfs: raid56: remove unused BTRFS_RBIO_REBUILD_MISSING 2023-08-21 14:52:12 +02:00
rcu-string.h
ref-verify.c btrfs: ref-verify: free ref cache before clearing mount opt 2024-01-31 16:19:06 -08:00
ref-verify.h
reflink.c btrfs: replace sb::s_blocksize by fs_info::sectorsize 2024-08-29 17:33:43 +02:00
reflink.h
relocation.c btrfs: drop the backref cache during relocation if we commit 2024-10-10 11:58:06 +02:00
relocation.h btrfs: relocation: constify parameters where possible 2024-10-10 11:58:06 +02:00
root-tree.c btrfs: qgroup: fix qgroup prealloc rsv leak in subvolume operations 2024-04-17 11:19:33 +02:00
root-tree.h btrfs: qgroup: fix qgroup prealloc rsv leak in subvolume operations 2024-04-17 11:19:33 +02:00
scrub.c btrfs: scrub: initialize ret in scrub_simple_mirror() to fix compilation warning 2024-07-11 12:49:09 +02:00
scrub.h btrfs: scrub: remove scrub_bio structure 2023-04-17 18:01:24 +02:00
send.c btrfs: send: fix invalid clone operation for file that got its size decreased 2024-10-10 11:57:58 +02:00
send.h
space-info.c btrfs: do not subtract delalloc from avail bytes 2024-08-11 12:47:24 +02:00
space-info.h btrfs: zoned: fix zone_unusable accounting on making block group read-write again 2024-08-11 12:47:24 +02:00
subpage.c btrfs: subpage: fix the bitmap dump which can cause bitmap corruption 2024-10-04 16:29:59 +02:00
subpage.h btrfs: stop setting PageError in the data I/O path 2023-06-19 13:59:35 +02:00
super.c btrfs: replace sb::s_blocksize by fs_info::sectorsize 2024-08-29 17:33:43 +02:00
super.h
sysfs.c btrfs: sysfs: validate scrub_speed_max value 2024-01-31 16:18:49 -08:00
sysfs.h
transaction.c btrfs: always clear PERTRANS metadata during commit 2024-05-17 12:02:14 +02:00
transaction.h btrfs: fix race between direct IO write and fsync when using same fd 2024-09-12 11:11:45 +02:00
tree-checker.c btrfs: tree-checker: fix the wrong output of data backref objectid 2024-10-04 16:29:53 +02:00
tree-checker.h btrfs: move btrfs_verify_level_key into tree-checker.c 2023-06-19 13:59:25 +02:00
tree-log.c btrfs: use NOFS context when getting inodes during logging and log replay 2024-07-05 09:33:48 +02:00
tree-log.h btrfs: change for_rename argument of btrfs_record_unlink_dir() to bool 2023-06-19 13:59:26 +02:00
tree-mod-log.c btrfs: avoid tree mod log ENOMEM failures when we don't need to log 2023-06-19 13:59:38 +02:00
tree-mod-log.h
ulist.c
ulist.h
uuid-tree.c btrfs: abort transaction on generation mismatch when marking eb as dirty 2023-11-28 17:19:35 +00:00
uuid-tree.h
verity.c btrfs: convert btrfs_read_merkle_tree_page() to use a folio 2023-09-13 18:40:54 +02:00
verity.h
volumes.c btrfs: add missing mutex_unlock in btrfs_relocate_sys_chunks() 2024-05-17 12:02:30 +02:00
volumes.h btrfs: add a helper to read the superblock metadata_uuid 2023-08-21 14:54:48 +02:00
xattr.c btrfs: abort transaction on generation mismatch when marking eb as dirty 2023-11-28 17:19:35 +00:00
xattr.h
zlib.c btrfs: zlib: fix and simplify the inline extent decompression 2024-08-29 17:33:32 +02:00
zoned.c btrfs: zoned: fix use-after-free due to race with dev replace 2024-06-21 14:38:44 +02:00
zoned.h btrfs: zoned: reserve zones for an active metadata/system block group 2023-08-21 14:52:19 +02:00
zstd.c btrfs: disable allocation warnings for compression workspaces 2023-06-19 13:59:34 +02:00