locking.doc

   1 Locking and refcounts....
   2
   3
   4 currently, when a block has a refcount it does not appear on LRU.
   5 This is bad as it means that async checkpoint won't find it and so will not
   6 complete properly.
   7
   8 Also datablocks with a refcount are treated differently if checkpoint_lock,
   9 and that is ugly.
  10
  11 And lots of things aren't locked that should be.
  12
  13 So. lru.
  14 This is for tracking blocks that need to be written at some stage.
  15 All data blocks are tracked through the page cache, so we don't need to
  16 track them explicitly.
  17 Index blocks are always pinned and it only makes sense to write a block
  18 with no pinned children (as pinned children will very likely cause updates,
  19 and the block cannot be freed while it has children anyway).
  20
  21 There are several lists that a block can be 'lru' on.  We need a common lock
  22 for all of these, so it must be a lock in 'fs'.  'fs->lock'.
  23 The lists are:
  24    phase_leafs[X=0,1]
  25           These are pinned blocks in phase X which have no pinned
  26           children in the same phase.  We need them for checkpointing.
  27    clean_leafs
  28           These are pinned blocks which are being reallocated for cleaning.
  29           They have no pinned children.  Phase is not an issue here as we
  30           abort all cleaning during a phase change.
  31    wc[]->clhead
  32           These are blocks that have been submitted for allocation
  33           to a cluster.  lafs_refile doesn't touch these.
  34    wc[]->pending_blocks
  35           These are blocks that have been written, and we are waiting for
  36           the write to complete. lafs_refile doesn't touch these either.
  37    freelist.lru
  38           These index blocks are clean, not pinned, and can be freed
  39           if their refcount is zero
  40
  41 When an index block is created, it is on freelist.lru.
  42 When a block is first pinned, it is placed on phase_leafs (or clean leafs).
  43 When a child gets pinned, the parent is removed from any list.  This could be
  44    done lazily.
  45 When we want to write out, either for a check point or to free memory,
  46 we take blocks from phase_leafs (or clean_leafs).  If clean they can be
  47 freed (freelist.lru) or moved to next phase (phase_leafs[1-X]).
  48 Otherwise we complete any incorporation and allocate to a cluster, which
  49 moves them to hlhead.
  50 When clusterflush happens, they are moved to pending_blocks.  When write
  51 completes they go back to whereever is appropriate.
  52
  53
  54 Some of these transitions need further locking to make sure they don't happen
  55 twice at once.
  56  We could use inode->i_sem  but that is a little heavy handed where not needed.
  57  We could have a lock bit which ensures single use:
  58     change state
  59     barrier
  60     if (! test_and_set(lock, flags))
  61             do stuff single threaded
  62             clear lock bit
  63             barrier
  64             check is more stuff might be needed.
  65
  66  'stuff' would include:
  67        address incorporation
  68        cluster allocation - that is locked in cluster_allocate
  69
  70        adding an address to an uninc_table is awkward as it needs to be done
  71        quickly, but cannot be done while incorporation is happening, which can
  72        be slow.
  73
  74  So really it is just address incorporation, and sometimes we need to wait if
  75  we cannot do it immediately, so taking i_sem is probably a good thing.
  76
  77
  78 Index blocks always get a pincnt before being pinned. Datablocks don't
  79 have a pincnt.
  80 They need to get pinned before they can be dirtied, but it is largely the 'dirty'
  81 state that keeps them pinned.
  82 We only have a block in this pinned-not-dirty state while holding i_sem,
  83 so there can only be one of them.  So pincnt can be a single bit.
  84 i.e.
  85    down(i_sem)
  86    set pin_pending
  87    pin block if needed
  88    maybe set dirty
  89    clear pin_pending
  90    up(i_sem)
  91
  92 Note that inode datablocks have slightly different exclusivity rules,
  93 but still only one thread can set pin_pending at a time.
  94
  95 check_point lock has a similar effect for index blocks.
  96
  97 So: the state transition for blocks and flags and ->lru is:
  98
  99 Index block:
 100
 101    First it gets 'Pinned' when either marked dirty or children are pinned.
 102    As children get written out they get allocated and added to the inc list
 103    When there are no more pinned children (in this phase) we land up on
 104    the leafs list.  This happens by refile being called on a parent whenever
 105          a block flips phase or becomes unpinned
 106    Once on the leafs list we are likely to have 'incorporate' called which
 107          is likely to make us dirty.
 108    If dirty we get allocated which sets B_Alloc and moves us to the cluster list
 109      This also sets B_WritePhase1 as appropriate.
 110    Once the cluster is flushed, we get moved to the write-pending list.
 111    Once the write is complete, we get refiled.
 112
 113    It seems like we need a flag to say it is in the being-written stage.
 114    This should be B_Alloc.
 115
 116    A block can be updated while writes are pending so it can move among the
 117    cluster list and write-pending list.
 118
 119    So A block is first marked B_Dirty.  It eventually get passed to
 120    lafs_cluster_allocate which set B_Alloc.
 121    Once a cluster is full it is flushed which clears B_Dirty but leave B_Alloc.
 122    When write completes, B_Alloc is cleared.
 123    Alternately, the block can be marked B_Dirty again in which case B_Alloc is
 124    cleared and it is removed from the lru list.
 125    When we mark it clean we copy B_Phase1 to B_WritePhase1.  We can not then
 126    mark it dirty unless B_Alloc is clear, or B_Phase1 matches B_WritePhase1.
 127    Note that the phase can change as soon as B_Dirty is cleared