The 'cleaner' thread runs whenever the array is read-write.
1/ It runs a checkpoint whenever requested. This request comes:
   - when sys_sync is called
   - when cluster_flush notices that more than some configurable
     amount of data has been written since the last checkpoint

2/ ?? Call cluster_flush if a cluster has been pending for too long ??
   This shouldn't happen. We need a more direct flushing of clusters.
   ext3 has this trick where the first thread that enters the 'flush it out'
   code calls 'schedule' in a loop until nothing more has been added to
   the journal. I wonder how this scales to multi-processors with
   multiple free cores...
   Maybe if we knew when bdflush had finished with us, we could sync then.
   Alternately we could be proactive: when a page is flushed, we look for
   nearby pages and flush them too, maybe doing bdflush's work for it.
3/ Run any pending orphan handlers.

4/ Make progress with any needed cleaning.

5/ Decay the youth file as needed.

6/ Scan the segment usage files for likely segments to reuse or clean.

All of this needs to be asynchronous - we never wait for IO.
The work-flow of the cleaner is:
 a/ Identify a segment to be cleaned.
 b/ Read the first/next cluster header.
 c/ For any blocks that could be live, walk the index tree to find them
    and mark anything that lives in the target segment as needing
    relocation.
 d/ Repeat for all clusters in the segment.
 e/ Flush out the cleanup cluster.
The places where this can block are:
 b - reading cluster headers
 c - walking index trees
For b, we read into our own buffers and are notified of completion, so that
should be easy to handle.
For c, we can flag the block as 'async' to get a general wakeup. We need to
hold a reference on the block to make sure it doesn't disappear.
So the index-walking code must be able to work in 'async' mode where,
if it finds that it needs to wait for an index block, it sets the async flag
and returns a counted reference to it.
The cleaner stores some maximum number of such counted references.
When some (all?) have completed IO, we retry from the top.

This will allow truncation to be fully async too.
If we find a block that is in a cleaning segment and is clean, we
mark it B_Realloc so it gets written to the cleaning cluster.
If the block is already dirty, we just leave it alone.
If we want to dirty a block that is B_Realloc, we need to clear
B_Realloc. But if the write has already been scheduled, we need
to wait for it to complete.
When the refcount hits zero, the presence of B_Realloc will move us to
clean_leafs, where we can be found for cluster allocation.
------------------------
Youth decay and segment scanning happen in the same task.
We allocate a 'block' of temporary storage which records the
max usage for each of a collection of segments.
We read a block from the youth file and the snapshot-0 usage block.
When they arrive, the youth is decayed if needed; then we check which
snapshots could be using these segments based on youth, and schedule reads
of their usage blocks.
Then we copy the ss-0 usage into the temp block.
As other usage arrives, we merge it with the temp block and keep the max
usage for each. We also schedule more reads if necessary.
When all have arrived, we calculate the weight for each segment
and merge with our table.
The table is unsorted. When it gets full, we sort it and discard half.
How fast should we scan? Certainly when a youth-decay is pending,
we might go a bit faster. And as the table gets empty we might
speed up as well.

At mount time we really need to read everything. Then we process one
block for every few segments allocated??
-------------------------------------
A segment cannot be reused until it is known to be clean in a previous checkpoint.
So we need to record when we first notice a segment to be clean.
Note that we cannot do this when we set the usage of a segment to zero, because
it may still be non-zero in another snapshot - we need to check all snapshots as
well.

So when we notice that the segment is clean, we schedule an update of
the youth number to 0 after the next checkpoint, much as we schedule updates
to segment usage tables during a checkpoint.
This update is stored in the segment list alongside the cleanable segments.

When we purge half of the cleanable list, we remember the largest
score kept and don't add anything with a larger score. If the list drops
to one quarter of its max size, we zero that largest score and add
anything that comes along. When we sort the list, we record its new
size. As we remove from the head, we decrement the count. If the
count drops below half the active size of the list, we re-sort.