The 'cleaner' thread runs whenever the array is read-write.
1/ It runs a checkpoint whenever requested. This request comes:
   - when sys_sync is called
   - when cluster_flush notices that more than some configurable
     amount of data has been written since the last checkpoint

2/ ?? Call cluster_flush if a cluster has been pending for too long ??
   This shouldn't happen. We need a more direct flushing of clusters.
   ext3 has this trick where the first thread that enters the 'flush it out'
   code calls 'schedule' in a loop until nothing more has been added to
   the journal. I wonder how this scales to multi-processors with
   multiple free cores...
   Maybe if we knew when bdflush had finished with us, we could sync then.
   Alternately we could be proactive: when a page is flushed, we look for
   nearby pages and flush them too, maybe doing bdflush's work for it.
3/ Run any pending orphan handlers.

4/ Make progress with any needed cleaning.

5/ Decay the youth file as needed.

6/ Scan the segment usage files for likely segments to reuse or clean.

All of this needs to be asynchronous - we never wait for IO.
The work-flow of the cleaner is:
 a/ Identify a segment to be cleaned.
 b/ Read the first/next cluster header.
 c/ For any blocks that could be live, walk the index tree to find them
    and mark anything that lives in the target segment as needing
    relocation.
 d/ Repeat for all clusters in the segment.
 e/ Flush out the cleanup cluster.
The places where this can block are:
 b - reading cluster headers
 c - walking index trees
For b, we read into our own buffers and are notified of completion, so that
should be easy to handle.
For c, we can flag the block as 'async' to get a general wakeup. We need to
hold a reference on the block to make sure it doesn't disappear.
So the index-walking code must be able to work in 'async' mode where,
if it finds that it needs to wait for an index block, it sets the async flag
and returns a counted reference to it.
The cleaner stores some maximum number of such counted references.
When some (all?) have completed IO, we retry from the top.

This will allow truncation to be fully async too.
If we find a block that is in a cleaning segment and is clean, we
mark it B_Realloc so it gets written to the cleaning cluster.
If the block is already dirty, we just leave it alone.
If we want to dirty a block that is B_Realloc, we need to clear
B_Realloc. But if the write has already been scheduled, we need
to wait for it to complete.
When the refcount hits zero, the presence of B_Realloc will move us to
clean_leafs, where we can be found for cluster allocation.
------------------------
Youth decay and segment scanning happen in the same task.
We allocate a 'block' of temporary storage which records the
max usage for each of a collection of segments.
We read a block from the youth file and the snapshot-0 usage block.
When they arrive, the youth is decayed if needed; then we check which
snapshots could be using these segments based on youth, and schedule reads
of their usage blocks.
Then we copy the ss-0 usage into the temp block.
As other usage arrives, we merge it with the temp block and keep the max
usage for each. We also schedule more reads if necessary.
When all have arrived, we calculate the weight for each segment
and merge with our table.
The table is unsorted. When it gets full, we sort it and discard half.
How fast should we scan? Certainly when a youth-decay is pending,
we might go a bit faster. And as the table gets empty we might
speed up as well.

At mount time we really need to read everything. Then we process one
block for every few segments allocated??
-------------------------------------
A segment cannot be reused until it is known to be clean in a previous checkpoint.
So we need to record when we first notice a segment to be clean.
Note that we cannot do this when we set the usage of a segment to zero, because
it may still be non-zero in another snapshot - we need to check all snapshots as
well.

So when we notice that the segment is clean, we schedule an update of
the youth number to 0 after the next checkpoint, much as we schedule updates
to segment usage tables during a checkpoint.
This update is stored in the segment list alongside the cleanable segments.

When we purge half of the cleanable list, we remember the largest
score kept and don't add anything with a larger score. If the list drops
to one quarter of its max size, we zero that largest score and add
anything that comes along. When we sort the list, we record its new
size. As we remove from the head, we decrement the count. If the
count drops below half the active size of the list, we re-sort.