Locking and refcounts....
Currently, when a block has a refcount it does not appear on the LRU.
This is bad as it means that async checkpoint won't find it and so will not

Also datablocks with a refcount are treated differently if checkpoint_lock,

And lots of things aren't locked that should be.

This is for tracking blocks that need to be written at some stage.
All data blocks are tracked through the page cache, so we don't need to
track them explicitly.
Index blocks are always pinned and it only makes sense to write a block
with no pinned children (as pinned children will very likely cause updates,
and the block cannot be freed while it has children anyway).

There are several lists that a block can be 'lru' on. We need a common lock
for all of these, so it must be a lock in 'fs': 'fs->lock'.

  These are pinned blocks in phase X which have no pinned
  children in the same phase. We need them for checkpointing.

  These are pinned blocks which are being reallocated for cleaning.
  They have no pinned children. Phase is not an issue here as we
  abort all cleaning during a phase change.

  These are blocks that have been submitted for allocation
  to a cluster. lafs_refile doesn't touch these.

  These are blocks that have been written, and we are waiting for
  the write to complete. lafs_refile doesn't touch these either.

  These index blocks are clean, not pinned, and can be freed
  if their refcount is zero.
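
As an illustration of "several lists, one common lock", here is a minimal
userspace sketch. It is not the LaFS code: the enum labels, struct names and
helper functions are all hypothetical, and a C11 atomic_flag spinlock stands
in for 'fs->lock'. The only point it makes is that every list move happens
under the same lock, so a block is never seen on two lists at once.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical labels for the lists described above. */
enum lru_list {
    LRU_NONE = -1,      /* on no list, e.g. a parent with pinned children */
    LRU_PHASE_LEAFS_0,  /* pinned in phase 0, no pinned children there */
    LRU_PHASE_LEAFS_1,  /* pinned in phase 1, no pinned children there */
    LRU_CLEAN_LEAFS,    /* being reallocated for cleaning */
    LRU_CLUSTER,        /* submitted for allocation to a cluster */
    LRU_PENDING,        /* written, waiting for the write to complete */
    LRU_FREELIST,       /* clean, not pinned, may be freed */
    LRU_COUNT
};

struct lru_node { struct lru_node *prev, *next; };

struct lru_block {
    struct lru_node lru;    /* linkage on at most one list */
    enum lru_list on;       /* which list, or LRU_NONE */
};

struct lru_fs {
    atomic_flag lock;                   /* stand-in for fs->lock */
    struct lru_node heads[LRU_COUNT];   /* circular list heads */
};

void lru_fs_init(struct lru_fs *fs)
{
    atomic_flag_clear(&fs->lock);
    for (int i = 0; i < LRU_COUNT; i++)
        fs->heads[i].prev = fs->heads[i].next = &fs->heads[i];
}

void lru_block_init(struct lru_block *b)
{
    b->lru.prev = b->lru.next = &b->lru;    /* self-linked: on no list */
    b->on = LRU_NONE;
}

/* Move a block to another list (or to none) under the common lock. */
void lru_move(struct lru_fs *fs, struct lru_block *b, enum lru_list to)
{
    while (atomic_flag_test_and_set(&fs->lock))
        ;   /* spin until the common lock is ours */

    /* Unlink from wherever it is; harmless if self-linked. */
    b->lru.prev->next = b->lru.next;
    b->lru.next->prev = b->lru.prev;
    b->lru.prev = b->lru.next = &b->lru;

    if (to != LRU_NONE) {
        struct lru_node *h = &fs->heads[to];
        b->lru.prev = h->prev;      /* append at the tail */
        b->lru.next = h;
        h->prev->next = &b->lru;
        h->prev = &b->lru;
    }
    b->on = to;
    atomic_flag_clear(&fs->lock);
}

/* Count the entries on one list, also under the common lock. */
size_t lru_len(struct lru_fs *fs, enum lru_list l)
{
    size_t n = 0;
    while (atomic_flag_test_and_set(&fs->lock))
        ;
    for (struct lru_node *p = fs->heads[l].next; p != &fs->heads[l]; p = p->next)
        n++;
    atomic_flag_clear(&fs->lock);
    return n;
}
```

A single lock is the simple choice here because a move touches two lists at
once; per-list locks would need an ordering rule to avoid deadlock.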

When an index block is created, it is on freelist.lru.
When a block is first pinned, it is placed on phase_leafs (or clean_leafs).
When a child gets pinned, the parent is removed from any list. This could be

When we want to write out, either for a checkpoint or to free memory,
we take blocks from phase_leafs (or clean_leafs). If clean they can be
freed (freelist.lru) or moved to the next phase (phase_leafs[1-X]).
Otherwise we complete any incorporation and allocate to a cluster, which

When clusterflush happens, they are moved to pending_blocks. When the write
completes they go back to wherever is appropriate.
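
The placement rules above can be read as a pure function from block state to
list. The sketch below is one possible reading, not the actual lafs_refile
logic: the struct fields, enum labels and the function name are invented for
illustration, and the "allocated" and "writing" fields simply model "on the
cluster list" and "on pending_blocks".

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical labels for where a block should sit. */
enum blk_where {
    ON_NONE,            /* has pinned children: on no list */
    ON_PHASE_LEAFS_0,
    ON_PHASE_LEAFS_1,
    ON_CLEAN_LEAFS,
    ON_CLUSTER,         /* left alone by refile */
    ON_PENDING,         /* left alone by refile */
    ON_FREELIST
};

/* Hypothetical summary of the state that decides placement. */
struct blk_state {
    bool pinned;            /* dirty, or has pinned children */
    bool cleaning;          /* being reallocated for cleaning */
    bool allocated;         /* submitted to a cluster, not yet flushed */
    bool writing;           /* flushed, write not yet complete */
    int  phase;             /* 0 or 1 */
    int  pinned_children;   /* pinned children in the same phase */
};

enum blk_where blk_choose_list(const struct blk_state *b)
{
    if (b->writing)
        return ON_PENDING;      /* stays until the write completes */
    if (b->allocated)
        return ON_CLUSTER;      /* stays until clusterflush */
    if (!b->pinned)
        return ON_FREELIST;     /* clean and unpinned: may be freed */
    if (b->pinned_children > 0)
        return ON_NONE;         /* not worth writing yet */
    if (b->cleaning)
        return ON_CLEAN_LEAFS;
    return b->phase ? ON_PHASE_LEAFS_1 : ON_PHASE_LEAFS_0;
}
```

Writing the rule as a function of state, rather than as a set of explicit
moves, means the same call can be made after any event (pin, unpin, phase
flip, write completion) and still land the block in the right place.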

Some of these transitions need further locking to make sure they don't happen

We could use inode->i_sem but that is a little heavy handed where not needed.
We could have a lock bit which ensures single use:

    if (! test_and_set(lock, flags))
        do stuff single threaded

    check if more stuff might be needed.
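
A runnable sketch of the lock-bit idea, using C11 atomics in place of a
kernel bit operation. Everything here is an assumption made for illustration:
the struct, its fields and the function names are hypothetical, and "stuff"
is reduced to counting requests. Whoever wins the bit does the work; a loser
just leaves its request behind, and the winner re-checks for more after
dropping the bit.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical work item guarded by a single lock bit. */
struct bit_work {
    atomic_flag busy;       /* the lock bit */
    atomic_int  pending;    /* requests not yet handled */
    int         done;       /* only touched while holding the bit */
};

void bit_work_init(struct bit_work *w)
{
    atomic_flag_clear(&w->busy);
    atomic_init(&w->pending, 0);
    w->done = 0;
}

/* Request one unit of work and try to perform it single threaded. */
void bit_work_kick(struct bit_work *w)
{
    atomic_fetch_add(&w->pending, 1);

    while (atomic_load(&w->pending) > 0) {
        if (atomic_flag_test_and_set(&w->busy))
            return;     /* current holder will see our request */

        /* Single threaded section: take everything queued so far. */
        w->done += atomic_exchange(&w->pending, 0);

        atomic_flag_clear(&w->busy);
        /* Loop: check if more work arrived while the bit was held. */
    }
}
```

The re-check after clearing the bit is what makes it safe for a loser to
return immediately: its request was queued before it tested the bit, so the
holder is guaranteed to see it on the next pass.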

'stuff' would include:

  cluster allocation - that is locked in cluster_allocate

  adding an address to an uninc_table is awkward as it needs to be done
  quickly, but cannot be done while incorporation is happening, which can

So really it is just address incorporation, and sometimes we need to wait if
we cannot do it immediately, so taking i_sem is probably a good thing.

Index blocks always get a pincnt before being pinned. Datablocks don't

They need to get pinned before they can be dirtied, but it is largely the 'dirty'
state that keeps them pinned.
We only have a block in this pinned-not-dirty state while holding i_sem,
so there can only be one of them. So pincnt can be a single bit.

Note that inode datablocks have slightly different exclusivity rules,
but still only one thread can set pin_pending at a time.

The checkpoint lock has a similar effect for index blocks.

So: the state transition for blocks and flags and ->lru is:

First it gets 'Pinned' when either marked dirty or children are pinned.
As children get written out they get allocated and added to the inc list.
When there are no more pinned children (in this phase) we land up on
the leafs list. This happens by refile being called on a parent whenever
a block flips phase or becomes unpinned.
Once on the leafs list we are likely to have 'incorporate' called, which
is likely to make us dirty.
If dirty we get allocated, which sets B_Alloc and moves us to the cluster list.
This also sets B_WritePhase1 as appropriate.
Once the cluster is flushed, we get moved to the write-pending list.
Once the write is complete, we get refiled.

It seems like we need a flag to say it is in the being-written stage.
This should be B_Alloc.

A block can be updated while writes are pending so it can move between the
cluster list and the write-pending list.

So a block is first marked B_Dirty. It eventually gets passed to
lafs_cluster_allocate, which sets B_Alloc.
Once a cluster is full it is flushed, which clears B_Dirty but leaves B_Alloc set.
When the write completes, B_Alloc is cleared.
Alternatively, the block can be marked B_Dirty again, in which case B_Alloc is
cleared and it is removed from the lru list.
When we mark it clean we copy B_Phase1 to B_WritePhase1. We cannot then
mark it dirty unless B_Alloc is clear, or B_Phase1 matches B_WritePhase1.
Note that the phase can change as soon as B_Dirty is cleared.
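
The flag rules in this last paragraph can be checked with a small model. This
is only a sketch of the rules as stated here, not the LaFS implementation:
the flag names come from the text, but the bit values, the struct and every
function name are hypothetical, and list handling is left out.

```c
#include <assert.h>
#include <stdbool.h>

/* Flag names from the notes; the bit values are arbitrary. */
enum {
    B_Dirty       = 1 << 0,
    B_Alloc       = 1 << 1,     /* allocated to a cluster / being written */
    B_Phase1      = 1 << 2,     /* current phase of the block */
    B_WritePhase1 = 1 << 3      /* phase the pending write belongs to */
};

struct flag_block { unsigned flags; };

/* Try to mark the block dirty; refuse while a write from a different
 * phase is still pending. */
bool fb_mark_dirty(struct flag_block *b)
{
    bool phase  = b->flags & B_Phase1;
    bool wphase = b->flags & B_WritePhase1;

    if (b->flags & B_Dirty)
        return true;                    /* already dirty, nothing to do */
    if ((b->flags & B_Alloc) && phase != wphase)
        return false;                   /* must wait for the write */
    b->flags |= B_Dirty;
    b->flags &= ~B_Alloc;               /* leaves the lru list as well */
    return true;
}

/* Model of allocation to a cluster: a dirty block gains B_Alloc. */
void fb_allocate(struct flag_block *b)
{
    if (b->flags & B_Dirty)
        b->flags |= B_Alloc;
}

/* Model of a cluster flush: the block becomes clean, B_Alloc stays,
 * and the phase of the write is recorded. */
void fb_flush(struct flag_block *b)
{
    b->flags &= ~B_Dirty;
    if (b->flags & B_Phase1)
        b->flags |= B_WritePhase1;
    else
        b->flags &= ~B_WritePhase1;
}

/* Model of write completion. */
void fb_write_done(struct flag_block *b)
{
    b->flags &= ~B_Alloc;
}

/* The phase may change once the block is no longer dirty. */
void fb_flip_phase(struct flag_block *b)
{
    if (!(b->flags & B_Dirty))
        b->flags ^= B_Phase1;
}
```

For example, a block that is flushed in phase 0 and then flips to phase 1
cannot be dirtied again until fb_write_done() has cleared B_Alloc, whereas a
block that stays in the same phase can be re-dirtied at once.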