OpenZFS #15526 was fixed by improving the dnode_is_dirty
check in #15571. This note deep-dives into the root cause, triggers, history, and remaining problems with that fix.
As described in @rincebrain's explanation, tools like cp
use llseek(..., SEEK_DATA or SEEK_HOLE)
to preserve sparseness in files. Due to a race checking if a file is dirty, llseek
could return incorrect offsets that skip over data in newly-written files. When this happens, cp
records holes where there should be data and the resulting file is incorrect. #15571 updates the dirty-object check dnode_is_dirty
to fix this problem.
Update: #15615 is a new PR by @robn that uses dn_dirty_txg
to detect dirty dnodes.
Copying tools that preserve sparseness can use llseek
in a loop alternating between SEEK_HOLE
and SEEK_DATA
to find each hole or data block in the file. When llseek
returns an ENXIO
error for SEEK_DATA
, this signals that there is no further data in the file after the given offset. If SEEK_DATA
returns an incorrect offset or ENXIO
error before the end of data is reached, a hole will be reported where there is data. The copying tool will treat the skipped part of the file as a hole. (The opposite case also exists, but treating a hole as data has fewer consequences.)
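To make that concrete, here is a minimal sketch of the alternating SEEK_DATA/SEEK_HOLE loop (illustrative only, not cp or coreutils source); a copy tool would copy each reported data segment and leave the ranges in between as holes, so any segment that SEEK_DATA skips is silently lost:

```c
/*
 * Minimal sketch of a sparseness-aware scan (illustrative, not coreutils
 * code).  Prints the data segments of a file; a copier would copy each
 * segment and leave the gaps between them as holes in the destination.
 */
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int
main(int argc, char **argv)
{
	if (argc != 2) {
		fprintf(stderr, "usage: %s <file>\n", argv[0]);
		return (1);
	}
	int fd = open(argv[1], O_RDONLY);
	if (fd < 0) {
		perror("open");
		return (1);
	}
	off_t off = 0;
	for (;;) {
		off_t data = lseek(fd, off, SEEK_DATA);
		if (data < 0) {
			if (errno != ENXIO)	/* ENXIO: no data after off */
				perror("lseek(SEEK_DATA)");
			break;
		}
		/* Every file ends with an implicit hole, so this succeeds. */
		off_t hole = lseek(fd, data, SEEK_HOLE);
		if (hole < 0) {
			perror("lseek(SEEK_HOLE)");
			break;
		}
		printf("data: [%lld, %lld)\n", (long long)data, (long long)hole);
		off = hole;	/* continue the search after this segment */
	}
	close(fd);
	return (0);
}
```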
zpl_llseek
implements llseek
by calling dmu_offset_next
via zfs_holey
and zfs_holey_common
. dmu_offset_next
is itself a wrapper around dnode_next_offset
which locates the next data block (or hole) in an object.
dnode_next_offset
works by inspecting the block pointer tree to find the next pointer at each level of the tree that points to data or a hole (possibly indirectly). It calls dnode_next_offset_level
to scan block pointers at each level, stopping at the first non-hole pointer (if looking for data) or the first pointer with blk_fill
less than the maximum number of leaf blocks for that level (if looking for holes). The search starts at the L1 pointer of the leaf block containing the starting offset
and proceeds as an in-order traversal over the tree. The search visits each level of the tree at most twice since block pointers at each level are sufficient to tell if there are L0 leaf blocks (or holes) reachable via those pointers.
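As a toy illustration of that per-level scan (my own model with invented names toy_blkptr_t and toy_next_offset_level, not OpenZFS source), the search at a single level amounts to scanning an array of block pointers for the first candidate:

```c
/*
 * Toy model of the per-level scan described above: given the block pointers
 * at one level, find the first one that could contain data (non-hole) or a
 * hole (fill count below the maximum number of leaves per pointer).
 */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef struct {
	bool		is_hole;	/* models BP_IS_HOLE() */
	uint64_t	fill;		/* models blk_fill: filled leaves below */
} toy_blkptr_t;

/* Scan bps[start..nbps) and return the index of the first candidate, or -1. */
static int
toy_next_offset_level(const toy_blkptr_t *bps, int nbps, int start,
    uint64_t max_fill, bool find_hole)
{
	for (int i = start; i < nbps; i++) {
		if (find_hole) {
			if (bps[i].is_hole || bps[i].fill < max_fill)
				return (i);	/* some leaf below is a hole */
		} else {
			if (!bps[i].is_hole)
				return (i);	/* some leaf below has data */
		}
	}
	return (-1);	/* caller ascends a level or reports ENXIO/EOF */
}

int
main(void)
{
	/* L1 pointers, each covering 4 leaf blocks. */
	toy_blkptr_t l1[] = {
		{ .is_hole = true,  .fill = 0 },	/* all holes */
		{ .is_hole = false, .fill = 4 },	/* fully filled */
		{ .is_hole = false, .fill = 2 },	/* partially filled */
	};
	printf("next data at L1 index %d\n",
	    toy_next_offset_level(l1, 3, 0, 4, false));	/* -> 1 */
	printf("next hole at L1 index %d\n",
	    toy_next_offset_level(l1, 3, 1, 4, true));	/* -> 2 */
	return (0);
}
```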
Block pointers are only updated from sync context while writing out in-memory data. Thus dnode_next_offset
only returns correct results if the on-disk tree matches the file structure including number of levels, recordsize, etc. The results will be incorrect if there are dirty buffers that add (or free) blocks that have not been synced out.
dmu_offset_next
avoids calling dnode_next_offset
while dirty data exists by forcing a pool sync whenever dnode_is_dirty
detects the object is dirty and zfs_dmu_offset_next_sync=1
. Otherwise dmu_offset_next
returns EBUSY
to give up on hole detection, in which case llseek
returns offset
for SEEK_DATA
or the file size for SEEK_HOLE
indicating that the remaining file is entirely data. Additionally, dmu_offset_next
cannot tolerate concurrent writes to the object that would re-dirty it, so zfs_holey_common
also locks the entire file with a range lock.
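Putting this together, the control flow looks roughly like the following paraphrase, reconstructed from the description above rather than copied from OpenZFS; the helper calls (dnode_hold/dnode_rele, the DNODE_FIND_* flags, the dnode_next_offset argument list) are from memory and may differ in detail:

```c
/*
 * Simplified paraphrase of dmu_offset_next as described above; not verbatim
 * OpenZFS source.  Error handling and retry details are abbreviated.
 */
static int
dmu_offset_next_sketch(objset_t *os, uint64_t object, boolean_t hole,
    uint64_t *off)
{
	dnode_t *dn;
	int restarted = 0, err;

restart:
	err = dnode_hold(os, object, FTAG, &dn);
	if (err != 0)
		return (err);

	rw_enter(&dn->dn_struct_rwlock, RW_READER);

	if (dnode_is_dirty(dn)) {
		/*
		 * The object has (or appears to have) unsynced state: either
		 * force a pool sync and re-check once, or give up with EBUSY
		 * so the caller reports the rest of the file as data.
		 */
		if (zfs_dmu_offset_next_sync && !restarted) {
			rw_exit(&dn->dn_struct_rwlock);
			dnode_rele(dn, FTAG);
			txg_wait_synced(dmu_objset_pool(os), 0);
			restarted = 1;
			goto restart;
		}
		err = SET_ERROR(EBUSY);
	} else {
		/* Only valid if the on-disk block tree is fully up to date. */
		err = dnode_next_offset(dn,
		    DNODE_FIND_HAVELOCK | (hole ? DNODE_FIND_HOLE : 0),
		    off, 1, 1, 0);
	}

	rw_exit(&dn->dn_struct_rwlock);
	dnode_rele(dn, FTAG);
	return (err);
}
```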
Before #15571, dnode_is_dirty
would in rare cases return FALSE
even when the object was not fully synced out. This leads to dnode_next_offset
computing offsets using on-disk block pointers before (or while) they are updated during sync. Depending on the state of the on-disk pointers, the results can skip over data.
If dnode_next_offset
is called on an object before it is synced, the results depend on whether or not the object has been synced before, whether any blocks have been added or freed, and if so whether or not sync has updated the block pointers.
- If the object is new and never synced, then the object will appear to contain no data.
- If called after an object has been synced once, then dnode_next_offset will reflect the on-disk blocks, ignoring in-memory dirty writes or frees.
  - If no blocks have been added (or freed) vs. on-disk state, then the results will be correct even if some blocks are dirty.
  - If blocks have been added (or freed) but the block pointers have not been updated, dnode_next_offset will skip over those blocks (or holes).
  - If blocks have been added (or freed) and the sync has reached the point of updating all block pointers, then the results will be correct even if sync has not completed.
If an object is dirtied in multiple txgs, the above applies for each txg. If any txg adds or frees blocks, the results of llseek
will be incorrect until the most-recently dirtied txg reaches sync and updates the block pointers.
Block pointers get updated in the write pipeline before vdev writes, so the window of time that results are incorrect is not dependent on storage performance.
Note that block pointer reads in dnode_next_offset
are atomic for indirect blocks because they are guarded by db_rwlock
. For block pointers in the dnode itself, dnode_next_offset
does not hold locks for dn_phys->dn_blkptr
and these are racy reads. On weakly ordered machines, it is theoretically possible that dnode_next_offset_level
observes partially written block pointers even if sync is already completed.
In open context, dirtied dnodes in each object set are added to os_dirty_dnodes
. Dirtied data and indirect blocks are added to trees of dirty records in each dnode rooted by dn_dirty_records
. These trees follow the same structure as the to-be-written block trees, with the dnode pointing at dirty records for top-level blocks and dirty indirect blocks pointing recursively to their dirty children via the corresponding indirect buffer's dr_children
list.
In sync context, dirty dnodes are written by dmu_objset_sync_dnodes
. This first removes each dnode from os_dirty_dnodes
and adds it to os_synced_dnodes
using the same link (dn_dirty_link
). Then it calls dnode_sync
to write the object including updating dn_phys
, growing indirection, freeing leaf ranges, and writing dirty buffers.
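A sketch of that hand-off (reconstructed from this description and the reproducer diff further below, not verbatim source) makes the two list operations explicit:

```c
/*
 * Sketch of the per-sublist loop in dmu_objset_sync_dnodes, reconstructed
 * from the description above and the diff shown later in this note.
 */
static void
dmu_objset_sync_dnodes_sketch(multilist_sublist_t *list, dmu_tx_t *tx)
{
	dnode_t *dn;

	while ((dn = multilist_sublist_head(list)) != NULL) {
		/* 1. Remove the dnode from os_dirty_dnodes. */
		multilist_sublist_remove(list, dn);

		/*
		 * 2. Insert it into os_synced_dnodes, reusing dn_dirty_link.
		 * Between steps 1 and 2 the link is on neither list, and
		 * multilist_insert takes a sublist mutex that can sleep on
		 * preemptible kernels -- the window discussed below.
		 */
		multilist_t *newlist = &dn->dn_objset->os_synced_dnodes;
		(void) dnode_add_ref(dn, newlist);
		multilist_insert(newlist, dn);

		/* 3. Write out the dnode and its dirty buffers. */
		dnode_sync(dn, tx);
	}
}
```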
Dirty buffers are issued by dnode_sync
calling dbuf_sync_list
for each dn_dirty_records
recursively issuing writes for all blocks of the tree. dbuf_sync_list
removes each dirty record from the parent dnode or indirect child list then issues the leaf write or recursive indirect block writes. Sync context does not hold any locks for the dirty record tree traversal because all updates are completed in quiescing context before sync starts.
Note that the dnode is removed from os_dirty_dnodes
before syncing the dnode itself and dirty buffers are removed from dn_dirty_records
before starting writes for blocks pointed to by the dnode. Block pointers in the dnode or indirect blocks are not updated in memory until the ZIO pipeline asynchronously compresses the block, allocates space, and invokes the dbuf_write_ready
callback.
All write ZIOs created for a dnode are children of the dnode buffer's ZIO, previously created when dmu_objset_sync
calls dnode_sync
on the meta-dnode. These ZIOs are in turn children of the logical root ZIO created in dsl_pool_sync
. After issuing writes for all datasets, dsl_pool_sync
waits on the root ZIO which in turn waits for all child writes including all dnode, data, and indirect block writes. After this wait, all dataset block pointers are guaranteed to be updated and dsl_pool_sync
calls dmu_objset_sync_done
for each dataset which updates userquota (if enabled) and removes each synced dnode from the os_synced_dnodes
list.
The root cause of #15526 is a race in dnode_is_dirty
between checking if a dnode is dirty and scanning its block tree in dnode_next_offset
.
Before #15571, dnode_is_dirty
checks only that dn_dirty_link
is active to determine if the dnode has dirty buffers. This link is used in both the os_dirty_dnodes
and os_synced_dnodes
lists.
Checking dn_dirty_link
is almost correct. This link is set when the dnode is dirtied in dnode_setdirty
by adding the node to os_dirty_dnodes
. During dmu_objset_sync_dnodes
the link is reused for os_synced_dnodes
. It is cleared by dmu_objset_sync_done
removing the dnode from os_synced_dnodes
after waiting on the root ZIO. This guarantees the block pointers are updated.
However, there is a small window of time during dmu_objset_sync_dnodes
after the dnode has been removed from os_dirty_dnodes
but before it has been added to os_synced_dnodes
. This is the root cause of the issue.
In this window of time dnode_is_dirty
may observe dn_dirty_link
as inactive and thus return that the object is clean when it is not. The move between lists is not protected by any mutex held by dnode_is_dirty
or dmu_offset_next
: dmu_objset_sync_dnodes
only holds the sublist mutex for os_dirty_dnodes
at this point and dnode_is_dirty
holds dn_struct_rwlock
and dn_mtx
.
The time period where dn_dirty_link
is inactive while moving between lists appears to be only a few instruction cycles long, but os_synced_dnodes
is a multilist with its own child list mutexes that must be acquired for multilist_insert
. List insertion is never contended as each multilist child list of os_dirty_dnodes
is synced in a separate task which in turn writes to the same child list index of os_synced_dnodes
as both use dnode_hash
. However, on Linux kernels compiled with preemption support, the acquisition of the child list mutex in multilist_insert
is a preemption point even when uncontended and may sleep. Under sufficient load the kernel may preempt the sync thread for one or more scheduling quanta. This leaves a much longer time window in which dn_dirty_link
is not on either list, making the window significantly easier to hit.
The mutex sleep is a confounding cause and explains why reproducing the race is non-deterministic. The kernel's scheduler plays a role in preempting dmu_objset_sync_dnodes
unpredictably, and this preemption is triggered by other load on the system. This explains why multiple files or threads are necessary to reproduce: one thread triggers sync and another calls llseek
while the sync task handling the corresponding file's object is preempted.
Beyond this, the plain reads of dn_dirty_link
in dnode_is_dirty
are also racy. dnode_is_dirty
holds dn_mtx
while dnode_setdirty
holds the os_dirty_dnodes
sublist lock for insertion, dmu_objset_sync_dnodes
holds the os_dirty_dnodes
sublist lock for removal before taking the os_synced_dnodes
sublist lock for reinsertion, and dmu_objset_sync_done
holds the os_synced_dnodes
sublist lock for final removal. The file-level rangelock synchronizes the call to dnode_is_dirty
vs. dnode_setdirty
, but no locks guarantee the ordering of observing an inactive dn_dirty_link
vs. writes to dnode or indirect block pointers.
PR #15571 mitigates the problem by checking both dn_dirty_link
and dn_dirty_records
in dnode_is_dirty
. This is effective because dn_dirty_records
is always non-empty during the window of time in dmu_objset_sync_dnodes
where dn_dirty_link
is inactive if there are any leaf blocks being added. Thus the case where dn_dirty_link
is temporarily inactive while preempted is covered by the dirty record check.
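For reference, the check after #15571 looks roughly like the following, reconstructed from this description and the diff shown later in this note (not copied verbatim from the commit):

```c
/*
 * Sketch of dnode_is_dirty after #15571 (reconstructed, not verbatim): the
 * dnode is treated as dirty if, for any txg, it is linked on a dirty/synced
 * list or still has pending dirty records.
 */
static boolean_t
dnode_is_dirty_sketch(dnode_t *dn)
{
	mutex_enter(&dn->dn_mtx);

	for (int i = 0; i < TXG_SIZE; i++) {
		if (multilist_link_active(&dn->dn_dirty_link[i]) ||
		    !list_is_empty(&dn->dn_dirty_records[i])) {
			mutex_exit(&dn->dn_mtx);
			return (B_TRUE);
		}
	}

	mutex_exit(&dn->dn_mtx);
	return (B_FALSE);
}
```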
Note that dn_dirty_records
still becomes empty before the sync updates all indirect pointers as the ZIO pipeline updates them asynchronously in callbacks. The fix is effective only because dn_dirty_records
is non-empty in the window where dn_dirty_link
is temporarily inactive, and dn_dirty_link
is providing the main synchronization with zio_wait
in dsl_pool_sync
.
Also note that while dn_mtx
protects dn_dirty_records
and is held by both dbuf_dirty
and dnode_is_dirty
, dnode_sync
does not hold dn_mtx
while dbuf_sync_list
removes records from dn_dirty_records
. So the read in dnode_is_dirty
is still racy with respect to sync context clearing it despite holding dn_mtx
.
This means that even after the fix, dnode_is_dirty
still contains data races. While the file-level rangelock and dn_mtx
guarantee that dnode_is_dirty
will observe dbuf_dirty
writes to both dn_dirty_link
and dn_dirty_records
, there is no guarantee that the thread calling dnode_is_dirty
will not observe the later write clearing dn_dirty_records
while still observing dn_dirty_link
in its cleared state from the window between list removal and reinsertion, even on strongly-ordered architectures.
However, while the data races remain, the large window of opportunity caused by mutex preemption is closed. The remaining data races are theoretically problematic but much less likely to occur in practice, especially given the number of stores (and locks) that occur between setting dn_dirty_link
for os_synced_dnodes
and clearing dn_dirty_records
even in the trivial case of a 1-block file.
The details surrounding dnode_is_dirty
have gone through many iterations in OpenZFS history:
- 34dc7c2f25 c. 2008: Initial Linux ZFS GIT Repo
  - dmu_offset_next checks dn_dirty_link without holding any locks.
  - dmu_objset_sync_dnodes clears dn_dirty_link before calling dnode_sync, then releases the sync txg hold on the dnode.
  - In this version, the window for the race is quite wide, and dmu_offset_next will return incorrect results anytime after sync starts until all block pointers are updated.
  - But in this version llseek isn't implemented - only _FIO_SEEK_HOLE.
- 9babb37438 c. 2009: Rebase to b117
  - Adds userquota support.
  - With userquota enabled, dmu_objset_sync_dnodes moves the dnode to a new list reusing dn_dirty_link until after zio_wait. Otherwise dmu_objset_sync_dnodes clears dn_dirty_link before dnode_sync.
  - Reduces the race for userquota datasets to a few instructions wide. (There is no mutex preemption point yet.)
- 3558fd73b5 c. 2011: Prototype/structure update for Linux
  - Removes _FIO_SEEK_HOLE and related zfs_holey code.
- 802e7b5feb c. 2013: Add SEEK_DATA/SEEK_HOLE to lseek()/llseek()
  - Adds llseek support for seeking data/holes.
  - Prior commit cf91b2b6b2 brings back zfs_holey.
- 64fc776208 2017-03-20: OpenZFS 7968 - multi-threaded spa_sync()
  - Introduces multilists and parallelizes dnode_sync during sync.
  - After this os_dirty_dnodes is protected by a multilist lock instead of os_lock. os_synced_dnodes becomes a multilist, exposing the mutex preemption opportunity.
  - Widens the race window for userquota datasets on kernels with preemptible mutex_enter.
  - This version also forgets to update dn_dirty_link checks to multilist_link_active (but list_link_active happens to be equivalent).
- 66aca24730 2017-04-13: SEEK_HOLE should not block on txg_wait_synced()
  - Changes the dmu_offset_next dirty check to dn_dirtyctx and dn_dirty_records.
  - Makes txg_wait_synced conditional on zfs_dmu_offset_next_sync.
  - Both reads are racy (should hold dn_mtx), but sync does not hold dn_mtx anyway.
  - dn_dirtyctx is not a reliable signal of dirtiness as it stays set until the in-memory dnode object is destroyed. (Its purpose is for assertion checks that objects are dirtied in exactly one of open or sync context.)
  - Greatly widens the race for userquota datasets: dn_dirty_records is cleared immediately after issuing writes, vs. dn_dirty_link which is cleared after the userquota updates (after the root zio_wait).
  - The race exists even with zfs_dmu_offset_next_sync=0 as the dirty check only bypasses the sync when required, but the test is still faulty.
- 454365bbaa 2017-11-15: Fix dirty check in dmu_offset_next()
  - First attempt at fixing reports of files filled with \0s.
  - Removes the dn_dirtyctx check and moves back to the dn_dirty_link test.
  - Narrows the race for userquota datasets back to the mutex preemption window present in 64fc776208.
- edc1e713c2 2018-04-10: Fix race in dnode_check_slots_free()
  - Part of a chain of fixes for a similar sync-vs.-open context race reallocating freed dnodes.
  - Interestingly, while dnode reallocation is fundamentally the same (open context needs to synchronize with sync context), none of the fixes were applied to dmu_offset_next. These fixes ultimately introduced dn_dirty_txg and the matching DNODE_IS_DIRTY macro comparing dn_dirty_txg to spa_synced_txg in edc1e713c2. (It should have been using dn_holds == 0, which was added back in 58769a4ebd and is unfortunately not suitable for dnode_is_dirty.)
  - Incidentally fixes the misuse of list_link_active for the multilist link dn_dirty_link in dmu_offset_next (but multilist_link_active is equivalent).
- ec4f9b8f30 2019-03-21: Report holes when there are only metadata changes
  - Attempts to implement hole reporting when only the dnode metadata (atime) changes.
  - Switches to checking dn_dirty_link and that dn_dirty_records contains a record other than the spill and bonus buffers.
  - Adds the dn_mtx lock to dmu_offset_next for the first time.
  - Adds a check in dmu_offset_next for non-empty dn_free_ranges (only for L0 buffers referenced by the dnode - indirect blocks are dirtied in dnode_free_range).
  - This makes the race trigger if either dn_dirty_records is cleared or if the mutex preemption case happens for dn_dirty_link.
  - Widens the race window again for userquota datasets to happen anytime after sync starts (like for non-userquota datasets).
- 2531ce3720 2019-05-30: Revert "Report holes when there are only metadata changes"
  - Reverted ec4f9b8f30 after #8816 reports more \0-filled files.
- ba67d82142 2020-12-11: Improve zfs receive performance with lightweight write
  - Lightweight writes are not related, but this changes dmu_objset_sync_dnodes to unconditionally use os_synced_dnodes.
  - This makes the race the same for userquota and non-userquota datasets (narrow mutex preemption window).
- de198f2d9 2021-11-07: Fix lseek(SEEK_DATA/SEEK_HOLE) mmap consistency
  - This primarily fixes missing mmap support in zfs_holey_common by flushing before dmu_offset_next.
  - Introduces dnode_is_dirty, factored out of dmu_offset_next, but changes the implementation back to using dn_dirty_records.
  - Updates dmu_offset_next to re-check if the object is dirty again after txg_wait_synced. (Rangelocks should make this unnecessary?)
  - Re-widens the race for all datasets to any time after dnode_sync clears dn_dirty_records.
- c23803be84 2021-11-10: Restore dirty dnode detection logic
  - Updates dnode_is_dirty to go back to dn_dirty_link.
  - Narrows the race back to the mutex preemption window.
- 05b3eb6d23 2021-11-30: Default to zfs_dmu_offset_next_sync=1
  - Changes to prefer blocking on sync over reporting no holes.
  - Motivated by confusion over the expected behavior.
  - No effect on the race except perhaps forcing more syncing in the presence of SEEK_HOLE-heavy workloads.
- 519851122b 2023-03-14: ZFS_IOC_COUNT_FILLED does unnecessary txg_wait_synced
  - Adds back special-casing for the bonus and spill blocks like ec4f9b8f30 attempted.
  - Switches to the dn_dirty_records check, re-widening the race.
- accfdeb948 2023-04-18: Revert "ZFS_IOC_COUNT_FILLED does unnecessary txg_wait_synced"
  - #14753 reports data corruption.
  - Reverts back to the dn_dirty_link check, restoring the narrower race condition.
- 30d581121b 2023-11-28: dnode_is_dirty: check dnode and its data for dirtiness
  - The fix for #15526.
And prior to the OpenZFS fork, see also c. 2006 illumos c543ec060d, which introduced dn_dirty_link
. Before that, dmu_offset_next
checked dn_dirtyblksz
without locks, suggesting that the race existed in that version as well.
Adding a delay to dmu_objset_sync_dnodes
between removing the dnode from os_dirty_dnodes
and inserting into os_synced_dnodes
will reliably recreate the race using @tonyhutter's reproducer.sh modified to write 10 files instead of 1000 (because the delay makes syncing very slow). This proves that mutex preemption is the trigger for the race.
diff --git a/module/zfs/dmu_objset.c b/module/zfs/dmu_objset.c
index f098e1daa..8f560bb13 100644
--- a/module/zfs/dmu_objset.c
+++ b/module/zfs/dmu_objset.c
@@ -1582,6 +1582,8 @@ dmu_objset_sync_dnodes(multilist_sublist_t *list, dmu_tx_t *tx)
*/
multilist_t *newlist = &dn->dn_objset->os_synced_dnodes;
(void) dnode_add_ref(dn, newlist);
+
+ delay(MSEC_TO_TICK(100));
multilist_insert(newlist, dn);
dnode_sync(dn, tx);
Adding a delay during the async issue ZIO pipeline stage will reliably reproduce the race if dnode_is_dirty
or dmu_offset_next
checks only dn_dirty_records
. This simulates the race in older versions and proves that the race is dependent on the ZIO pipeline computing block pointers.
diff --git a/module/zfs/dnode.c b/module/zfs/dnode.c
index 029d9df8a..f2efe9c2c 100644
--- a/module/zfs/dnode.c
+++ b/module/zfs/dnode.c
@@ -1786,7 +1786,7 @@ dnode_is_dirty(dnode_t *dn)
mutex_enter(&dn->dn_mtx);
for (int i = 0; i < TXG_SIZE; i++) {
- if (multilist_link_active(&dn->dn_dirty_link[i])) {
+ if (!list_is_empty(&dn->dn_dirty_records[i])) {
mutex_exit(&dn->dn_mtx);
return (B_TRUE);
}
diff --git a/module/zfs/spa.c b/module/zfs/spa.c
index 5161326c0..9c7d02e3d 100644
--- a/module/zfs/spa.c
+++ b/module/zfs/spa.c
@@ -1241,12 +1241,41 @@ spa_taskq_dispatch_select(spa_t *spa, zio_type_t t, zio_taskq_type_t q,
return (tq);
}
+typedef struct issue_delay {
+ taskq_t *tq;
+ task_func_t *func;
+ void *arg;
+ uint_t flags;
+ taskq_ent_t *ent;
+} issue_delay_t;
+
+static void
+spa_taskq_dispatch_delay(void *arg)
+{
+ issue_delay_t *d = arg;
+ taskq_dispatch_ent(d->tq, d->func, d->arg, d->flags, d->ent);
+ kmem_free(d, sizeof(*d));
+}
+
void
spa_taskq_dispatch_ent(spa_t *spa, zio_type_t t, zio_taskq_type_t q,
task_func_t *func, void *arg, uint_t flags, taskq_ent_t *ent,
zio_t *zio)
{
taskq_t *tq = spa_taskq_dispatch_select(spa, t, q, zio);
+ if (q == ZIO_TASKQ_ISSUE) {
+ issue_delay_t *d = kmem_alloc(sizeof (issue_delay_t), KM_NOSLEEP);
+ if (d != NULL) {
+ d->tq = tq;
+ d->func = func;
+ d->arg = arg;
+ d->flags = flags;
+ d->ent = ent;
+ clock_t delay = ddi_get_lbolt() + MSEC_TO_TICK(100);
+ taskq_dispatch_delay(tq, spa_taskq_dispatch_delay, d, flags, delay);
+ return;
+ }
+ }
taskq_dispatch_ent(tq, func, arg, flags, ent);
}
dnode_next_offset
is used at other call sites, and it's unclear whether those have similar issues:
- dmu_object_alloc_impl could miss already-allocated pages leading to allocation collisions
- dmu_objset_space_upgrade could miss dirty objects in the meta-dnode
- dmu_tx_count_free will not count blocks dirtied in previous txgs
- dmu_free_long_range_impl will size chunks without considering blocks dirtied in previous txgs
dnode_next_offset
forcing txg syncs remains a surprising performance issue. coreutils-9.2 defaults to sparse file detection, which exacerbates this. Furthermore, ZFS reports only the on-disk size in stat
, so all newly-written and compressed files will appear sparse even if they contain no holes.
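The "appears sparse" heuristic is just a comparison of allocated blocks to logical size, which can be checked directly (my illustration, not coreutils code):

```c
/*
 * Illustration of the sparseness heuristic above: a file "looks sparse" to
 * userspace when its allocated bytes (st_blocks * 512) are less than
 * st_size.  On ZFS, a just-written or well-compressed file can satisfy this
 * even though it contains no holes.
 */
#include <stdio.h>
#include <sys/stat.h>

int
main(int argc, char **argv)
{
	if (argc != 2) {
		fprintf(stderr, "usage: %s <file>\n", argv[0]);
		return (1);
	}
	struct stat st;
	if (stat(argv[1], &st) != 0) {
		perror("stat");
		return (1);
	}
	long long allocated = (long long)st.st_blocks * 512;
	printf("st_size=%lld allocated=%lld -> %s\n",
	    (long long)st.st_size, allocated,
	    allocated < st.st_size ? "looks sparse" : "not sparse");
	return (0);
}
```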
- The underlying race has existed in some form for the entire history of OpenZFS.
- The window of opportunity for the race has varied, but was approximately one of:
  - a very small number of instruction clock cycles
  - possible mutex preemption during dmu_objset_sync_dnodes
  - anytime after dnode_sync until all block pointers are updated
- Before ba67d82142, the window was smaller for userquota datasets vs. non-userquota.
- The race is dependent on scheduling delays and compute performance (compression).
- The race is not dependent on I/O performance (block pointers update before vdev issue).
- dnode_next_offset does not return correct results until all block pointers are updated.
- dnode_is_dirty and dnode_next_offset_level still have data races (in theory).