README update

diff --git a/README b/README

index cb8952000529bcb64ea73d1b4cd8fc6d52aa94fc..1be7c53279754e70e37889eee32827d108ad2e50 100644 (file)
--- a/README
+++ b/README
@@ -56,11 +56,6 @@ FIXME
    index pages never get put on an LRU - how is this supposed to work?
  
  
-
-
-
-
-
  --------------------------
  Thoughts:
    Inodes live in an address-space, much like a file.  To load the
@@ -834,7 +829,7 @@ DONE  l/ cluster_done needs to call refile, but is called in interrupt context.
        ->waiting access with fs->lock after changing it to ->lru
  DONE  m/ Need to know which blocks in a page are in writeback so we can clear writeback
          only when *all* have finished.
-  n/ on phase change, uninc_next blocks need to be shared out.
+DONE  n/ on phase change, uninc_next blocks need to be shared out.
  NO 3o/ Make sure lafs_refile can be called from irq context.
   3p/   lock all lru accesses.
   3q/ Lock those index blocks!!!
@@ -876,9 +871,9 @@ DONE     - use s_blocksize / s_blocksize_bits rather than fs->
  DONE 17/ Make sure create inherits uid etc from process.
   18/ consider ranges of holes in pending_addr.
  
- 20/ Implement rest of "incorporate"
- 21/ Implement staged truncate
-         use for setattr and delete_inode
+DONE 20/ Implement rest of "incorporate"
+DONE 21/ Implement staged truncate
+DONE         use for setattr and delete_inode
  DONE 22/ block usage counts.
   23/ review segment usage /youth handling and make a todo list.
        a/ Understand ref counting on segments and get it right.
@@ -1456,3 +1451,6086 @@ Later:
   FIXME make use a failed (-EAGAIN) pinning triggers a checkpoint (eventually)
  
   FIXME make sure empty files have depth of 1.
+
+ FIXME Truncate proceeds lazily. All data blocks need to be gone
+
+26aug2008
+ If I call lafs_erase_dblock while a write is underway, we have a problem.
+  We need to wait potentially for a checkpoint to let go of the block and
+   a write to complete.
+    This should be done with waiting for PG_writeback on the page to disappear.
+  Check this out.
+
+  When end_page_writeback is called, we must have dropped all references to the
+   page.
+  When we commit to writing a block, we have to set PG_writeback on the page
+   so that truncate et al can wait for it.  Before we have committed, truncate
+   can just remove the page.  Internally we differentiate by B_Alloc.
+  So before setting B_Allocated we need to test_set_page_writeback(page).
+  Be careful of races.
+  I don't think we can ensure all references are dropped.  After all, that is
+  the point of refcounts.  So dblock array must exist without page!
+  But we need to ensure that we don't start a writeout after truncate
+  has done wait_on_page_writeback.
+  This is done with the page locked so when we want to write a page
+    in a checkpoint, we need to lock the page first.  Once we have the lock,
+    we check if the page is still dirty.  If it has been truncated it 
+    will be clean.
+   But how do we safely reference the page if b->page can be cleared?
+    How about:
+      When we clear PagePrivate, we take a counted reference to the page
+      for db->page.  This is dropped when the page is freed by lafs_refile.
+      But while it is held, it is still safe for db->page to be dereferenced.
+    So before we commence writeout we have to lock the page and set
+     PG_writeback.  After locking, we need to test if writeback is still 
+     appropriate. 
+
+  Maybe not.  I think we can submit blocks for writeout without setting the
+  page to writeback.  If we do, then we need to be sure those writes
+  finish before invalidatepage calls releasepage (block_invalidatepage
+  calls discard_buffer which calls lock_buffer which waits).
+  In our case invalidatepage needs to make sure that no new write commences.
+  Maybe we should lafs_iolock_block before we allocate to a cluster and check
+  again if the block is dirty.
+
+  So:
+    lafs_cluster_allocate does:
+       lafs_iolock_block
+       check if still dirty.  If not, unlock and return
+       set allocate flag
+       allocate and write
+       when write completes, allocate is cleared.
+                    unlock block
+
+    invalidatepage does
+       lafs_iolock_block
+       clear Valid,Dirty,Realloc
+       lafs_iounlock_block
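The two sequences above can be sketched as a small user-space model. This is only an illustration of the locking order described in the notes, not the kernel code: the `Block` class, the `written` list and the `write_complete` helper are hypothetical names, and a `threading.Lock` stands in for `lafs_iolock_block` / `lafs_iounlock_block`.

```python
import threading


class Block:
    """Toy stand-in for a LaFS data block (hypothetical, not the kernel struct)."""

    def __init__(self):
        self.iolock = threading.Lock()  # models lafs_iolock_block/lafs_iounlock_block
        self.valid = True
        self.dirty = True
        self.realloc = False
        self.allocated = False          # the "allocate flag" from the notes


def cluster_allocate(block, written):
    """Take the iolock, re-check Dirty, then set the allocate flag and 'write'."""
    block.iolock.acquire()
    if not block.dirty:
        # invalidatepage got there first: unlock and return without writing.
        block.iolock.release()
        return False
    block.allocated = True
    written.append(block)               # stands in for "allocate and write"
    return True                         # iolock stays held until the write completes


def write_complete(block):
    """Completion path: clear the allocate flag, then unlock the block."""
    block.allocated = False
    block.iolock.release()


def invalidatepage(block):
    """Blocks on the iolock, so it cannot overlap an in-flight write."""
    with block.iolock:
        block.valid = False
        block.dirty = False
        block.realloc = False
```

In this model a block that has been through `invalidatepage` is never submitted again, because `cluster_allocate` sees it is no longer dirty once it holds the lock.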
+
+
+
+2008 aug 28 - happy birthday.
+FIXME segsum_find calls lafs_reserve_block without a checkpoint lock.
+lafs_prealloc complains.
+
+  mark_cleaning does too, but cleaning only happens well away from a checkpoint
+  lock.
+segsum_find is being called to reference a new segment when we flush a cluster.
+ segment usage blocks are special.  Their index information doesn't
+need to be written out in the current checkpoint.  We can do that, but
+the backstop is to write just the data block in the tail of the
+checkpoint and write indexing information later.
+
+2008sep10
+ unlink is getting "No space left on device".  This is when trying to
+ pin the directory block, the physaddr is 0, so it looks like we want
+ NewSpace.  But we shouldn't even be trying to prealloc in that case because
+ there should already be a prealloc on the block.  i.e. there should be
+ credits.
+ Hmmm. after multiple 'syncs' how can the block not be written out.
+ Maybe it is embedded in the inode?
+ When we pin a block that was embedded in the inode it isn't clear what to
+ do.  If we might grow the file so it doesn't fit any more, we need to
+ allocate NewSpace.  If we know it won't grow, we use Release.
+  This still needs a proper fix.
+
+ Cleaning seems to be working nicely.  However we don't get all the space
+ back that we should because lots of blocks still have credits that
+ aren't being returned.
+
+ So when should credits be returned?
+ They are set when a block is pinned.  It then gets dirtied which
+ consumes a credit.  Then gets unpinned.  I guess if it isn't pinned,
+ then it doesn't need any credits.
+
+
+ It seems that cluster_flush is not always writing things in the correct
+  order.  Root gets written before some other things below it.
+   Maybe they are temporarily out of the loop??
+ No.  There are dirty blocks which one checkpoint doesn't pick up, but
+  they aren't holding the index block pinned. so they lose allocation.
+
+ But they must hold the indexblock pinned, even though they aren't pinned
+ themselves.  We maybe do this just with the refcnt... maybe.  That will cause
+ it to phase-flip rather than drop pinning, which I think is right.
+
+ So: too many credits remain allocated.  Where are they?  There are 1464
+   outstanding credits. 290 are in the tree so 1200 or so are elsewhere??
+   But things removed from the tree have credits removed.
+
+
+
+FIXME roll forward ignores inodes.  But what about an inode that contains
+   data.  Should that be ignored?  I think not.
+FIXME delete adir/big2 then delete adir and it cannot release:
+  Cannot release [cee29000]74/0(0)r1:Pinned,Phase0,Valid,Dirty,SegRef,UninCredit,PhysValid,Prealloc
+ presumably there is orphan processing or something to complete???
+FIXME when files are deleted, the space isn't returned!
+   This seems to be mostly fixed - need to test.
+FIXME when I "rm [b-z]*" it waits for writeback on something???
+   zfile again!!!  OK, I think that is fixed.
+
+
+12sep2008
+  Current problem:
+    seg_apply_all dirties dblocks.  When should they be reserved?
+    They originally get reserved by a lafs_reserve_block call in
+    segsum_find called from e.g. lafs_seg_ref which is called by lafs_reserve_block.
+    However: that block might get written before *and* after a checkpoint.
+    So we need N* Credits.  These are usually only used for Index blocks.
+    We can set these easily enough if inode type is TypeSegmentMap.
+    We move them across to Credit in seg_apply_all.
+    But when do we clear them if they aren't needed?  I guess
+     when we drop the last segref.  Yes, we already do that.
+    FIXME need to make sure these get flushed on next checkpoint
+     if we cannot allocate new credits after a checkpoint.
+
+  New Problem.  The 'cleanable' table reports a size of 3, but it is empty!
+    Think that is fixed.
+
+  Some problems.
+    1/ see above:  rm x/y; rmdir x -> BUG - FIXED
+    2/ Spins on 'CURRENT=1' ??
+    3/ if alloc_space gives EAGAIN while deleting, we don't survive.
+    4/ When I create/delete a file, ablocks_used increments by one.
+        The inode hasn't been allocated yet, so it seems the deallocation
+         isn't adjusting ablocks_used??
+    5/ open_namei (for dd) got caught on a mutex_lock.
+    6/ When a large file is shrunk we don't reduce the level of the InoIdx block
+       I'm not sure where we should and am not thinking very clearly.
+       Will fudge something in flush_data_to_inode for now, but it MUST be fixed.
+    7/ unlink (at least) can get stuck in iolock_block.  Who could be holding
+       the lock?  Writeout that hasn't completed?
+       Yes.  writepage calls lafs_allocated_block without calling flush.
+       So the block could be sitting waiting for a flush.  How long do we
+       wait??
+    8/ It seems that some datablock can need NCredits.  Make sure these
+       are handled properly re flush-or-refill after checkpoint and
+       flip_phase rather than unpin.
+    9/ Maybe after lafs_writepage cluster_flush isn't getting called soon
+       enough, and we lock up (see 7).  Need to flush the first block
+       straight away, and the next one as soon as the first finishes, etc.
+       Or something like that.  Then remove the comment from lafs_writepage.
+
+8th December 2008
+
+  I seem to be getting only 4 blocks to a cluster at the moment.
+   This is good as it motivates the code to handle block splitting in
+   the Btree.   But it shouldn't happen.
+
+  ....
+  Block splitting might work - it doesn't crash at least.
+  But
+  After deleting all files, the tree is full of stuff.
+  Lots of inode data/InoIdx blocks.
+  Many but not all are Pinned.  The others are OnFree
+  The Pinned ones have outstanding references.
+  Others
+
+  ....
+  Problem with the block splitting, when adding an index block.
+  The index block is initially empty - we need to find things by looking
+  at children.  But we don't.  We BUG_ON the iphys==0.
+  In general, when we add a block below an index block and before we incorporate,
+  the block must be found by finding the first indexed block and looking to
+  see if there is a 'next' block that contains the address we need.
+  FIXED
+
+  But if we truncate a file while an index block is pinned and dirty,
+  we spin on trying to incorporate it, which should make it empty.
+
+11th December 2008
+  deadlock.
+  sync is trying to get lock in lafs_cluster_flush
+  pdflush holds the lock and is stuck in cluster_flush_0xa40
+    some wait_event I expect.
+    Maybe we need an unplug ??
+
+ - checkpoint/seg_apply_all/dirty_dblock doesn't have the credits.
+   This is in clean_free.  We try to update the 'youth' to mark
+   the segment as free, and we don't have a reservation to do it.
+   Maybe just reserve it there and then.
+
+
+12th December 2008
+  When doing a lookup in an index block, we need to check the unincorp
+  address list.  It isn't enough to look for unincorp blocks as they
+  might have disappeared.
+  For INDIRECT and EXTENT this is easy enough as full information is in
+  'uninc'.
+  For INDEX it is a little tricky as we need to look at the full set of
+   addresses to know where a particular address fits.
+   We could force an incorporate first, but that has awkward implications
+    if it requires a split.
+   Maybe if we get from the lookup "start+range"....
+     That is not enough as the 'start' might get zeroed by an update.
+
+
+   rm adir/* doesn't work as readdir doesn't get all the entries
+    for some reason.
+   Reason is that they are being put in the wrong block.
+   lafs_find_next doesn't correctly find the 'next' block if it 
+   hasn't been incorporated yet.
+   Block can be:
+     in index tree -- easy to find
+     in uninc_table -- not too hard
+     only in the ->children list, or attached to a page.
+   It would be nice to use find_get_pages but that isn't exported so try
+    something else for now.
+   For index blocks
+        Look in index block for 'next
+
+15th December 2008
+   FIXME when we split an index block, we need to hold a reference to
+   the original so it doesn't disappear until the split-off copy is
+   written.  This is because we search from an index block to find
+   split-off copies.
+   [ note from Feb09.  This should be OK now. Both will need
+   incorporation, and we now hold on to blocks until they are
+   incorporated.]
+   
+
+
+23rd February 2009
+  - index block.  What changes are allowed exactly.
+     - splitting certainly makes sense.
+     - merging two adjacent blocks is fine, of which a special case
+       is finding that a block is empty and so removing it.
+     - What about a 2->3 split which would require removing a block
+        and adding another at the same time?
+       or noticing that the first blocks addressed are all missing, so
+       moving the index forward?
+       In each case, searching down by indexes will find a block that
+       has been replaced by a later address.  We could manage that as
+       long as the new block is attached after the replaced block.
+       So we cannot move a block.  We must delete and replace.
+
+  - unincorporated index blocks..
+    unincorporated data blocks are not pinned in memory.  Once they have
+    been written out, they can be freed.  Their address is stored in the
+    uninc-table.  This means we can delay incorporation while many
+    extents are written out and freed.  When we come to incorporate, we
+    may have many hundreds of addresses in a few extents that can be incorporated
+    efficiently without holding all that data pinned in memory.
+    The same scale doesn't apply to index blocks.  An index block can
+    reference only 102 blocks (for 1K block size).  And the uninc table can
+    hold far fewer so we will naturally incorporate more often.
+    So keeping index/indirect/extent blocks pinned until they are incorporated
+    is reasonable.  And it makes lookup a lot easier, as we have
+    guarantees about ordering of blocks in the children list that we
+    don't have in the uninc table.
+
+    Incorporation could have some atomicity issues.  There is no
+    concern about bad stuff appearing on disk as the phase-change
+    process handles that.  In memory it might be awkward if we split
+    an index block before incorporating a block that would span them.
+    That could conceivably happen if we only incorporate 8 blocks
+    (size of uninc table) at a time.
+    So maybe we should incorporate a full uninc list (not table) at
+    a time.
+    This means quite different code paths for incorporating leaf
+    and internal index blocks....
+
+
+  - uninc_table lists are a real problem.
+    They can only be created during roll-forward so they hardly ever
+    happen.
+    But if the block is split while processing earlier things on the
+    list, then splitting an uninc table would be very messy.
+    Is there any way around this?
+    Why not just do incorporation during roll-forward?
+    We only need to incorporate leafs, not internal blocks because we
+    don't use uninc_table for internal blocks any more.
+    So during roll forward, all index blocks that are touched need to
+    be held in cache...
+    I think we live with that.  If it ever becomes a problem, we will
+    need to perform the roll-forward twice.  The first time collects
+    the usage information so that we know where we can start writing,
+    then the second just applies all the changes to the rest of the
+    filesystem.
+
+
+   So:
+     uninc table only used for leaves, and has no linked list
+     unincorporated index blocks are stored on a list, which we
+     sort before applying.
+     All uninc index blocks are therefore kept in the index tree.
+     Their order on the children list allows us to find the correct
+     index. Each block for which the fileaddr is in the parent is
+     followed by any blocks that have been split off and end after
+     this one starts.  Blocks that have been emptied are Hole and are
+     skipped over when looking for a block.
+
+     When we split an internal block, the remaining uninc blocks
+     must not start with a Hole.
+
+   FIXME: what locking do I need around lafs_incorporate?
+      i_mutex?? i_alloc_sem??
+      i_alloc_sem is imposed by truncate (inode_setattr) and
+         direct_io possibly.  So it is really about adding/removing
+        blocks.  Not updating internals.
+        Maybe our own mutex.  Could even be per-index-block !!
+      Whatever it is, we need to protect walking ->children too.
+
+
+24th February 2009
+  "rm -r" problem from 12/dec/2008 fixed now.
+  incorporate code got a make-over and is probably much better.
+
+  New problems:  After test runs, cannot create files due to no space
+     on devices!!  But directory tree is empty.
+  I can see:
+
+    free_blocks=3256 allocated=1425 max_seg=512 clean_reserved=0
+
+  The problem is that we think 1425 has been allocated to data that
+  might still need to be written, leaving not enough room for more.
+  Index Dump shows
+  ====================414 credits ==============================
+  which doesn't explain everything, but does explain a lot.  There
+  really should be nothing in the Index tree (except fs-root and
+  tree-root)
+  There is also:
+  Some inodes which are OnFree and hold no credits.
+    0 DATA (1)  52 [0]ESegRef,Claimed,PhysValid
+    52    1 (0)   0 [2564]{0,00000000}L on free Index(1),InoIdx,OnFree,PhysValid
+
+  Some other inodes which are pinned with lots of credits and are
+    on the phase_leaf list
+    0 DATA (1) 299 [0]ESegRef,C,CI,Claimed,PhysValid
+   299    1 (1)   0 [0]{0,00000000} [0, 0]L Leaf1(40) Index(1),Pinned,Phase1,InoIdx,Valid,SegRef,C,CI,CN,CNI,PhysValid,Prealloc
+
+  And that is about it.  some are not Valid, some are...
+  checkpoint just wants to 'flip' them.
+  They mostly have a refcnt of 1... I wonder who is holding that....
+  The reference on the dblock is held by the iblock.
+  But what is the iblock remaining?  Who holds that reference?
+
+  I restored some code to clean iblock, and now:
+  free_blocks=3229 allocated=1277 max_seg=512 clean_reserved=0
+  ====================244 credits ==============================
+  which saved 130 credits.  That helps.
+  There seem to be many fewer of the many-credits blocks
+  Lots of index blocks in tree are 'OnFree' and have a
+  0 refcnt, but haven't been removed.  Why?
+  It seems that they have ->parent == NULL, so lafs_refile never
+  bothers to remove them.  I guess it should...
+  OK, lots of InoIdx blocks have gone now with their DATA blocks.
+
+  So, remaining blocks are pinned to their phase with lots of Credits,
+    have no pincnt, mostly have physaddr==0.
+   It is just the stray refcnt that keeps them there..
+   inums are 40, 56, 62-73, 275-278, 280
+    40 is f22
+    56 is first adir
+    63-69 are directories 2/3/4/5/6/7/8/9
+    70-73 are looooong symlinks
+    275 is cfile
+    276 is dfile - same as cfile but truncated.
+      Then some nbfile-X that were big enough.
+
+   So: what do they have in common:
+     Several only use the in-inode data block, but
+       probably not all
+
+    Can it be that it is refcounted on the Leaf list, and so
+    cannot get off??  Yes, I think so!
+    We only unpin things that have a zero refcount.
+
+    So: what to do?
+      checkpoint takes it off the list, then flips the phase and puts it
+      on the other list with refile.  During that time it has a refcount
+      so it doesn't lose the pinning.
+      Do we want to:
+        1/ Not have it on the list despite being pinned.
+       2/ Drop the PIN despite the refcnt.
+        3/ have refile do the phase_flip so it has a chance to
+           notice the refcount has hit zero.
+
+      2 isn't really an option.  We need PIN to persist whenever we have
+       a reference.  We could possibly use PinPending for index blocks too,
+       but that would require a lot of thinking.
+      1 requires another criterion for being on the list.  I suspect that would
+       get messy fast.
+      3 we used to do I think... But refile is in a big lock, and we
+        cannot really do a phase_flip under that.. and phase flip calls
+         refile anyway so we would get recursion.
+      So:4 - get lafs_phase_flip to notice and de-pin rather than flip.
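Option 4 can be sketched as a toy model. Everything here is an assumption made for illustration: the `IndexBlock` class and its fields are hypothetical, the rule that the _leaf list accounts for exactly one reference follows the "refcnt-!listempty" idea in the notes, and requiring the block to be clean before de-pinning is an extra guess rather than something this entry states.

```python
class IndexBlock:
    """Hypothetical model of a pinned index block; not the kernel structure."""

    def __init__(self, refcnt, on_leaf_list, dirty, phase=0):
        self.refcnt = refcnt              # includes the _leaf list's own reference
        self.on_leaf_list = on_leaf_list
        self.dirty = dirty
        self.phase = phase
        self.pinned = True


def phase_flip(block):
    """De-pin when only the list holds a reference, otherwise flip the phase."""
    # References held by anyone other than the _leaf list.
    others = block.refcnt - (1 if block.on_leaf_list else 0)
    if others == 0 and not block.dirty:
        # Nobody else needs the block pinned: break the list/refcount loop here.
        block.pinned = False
        return "depinned"
    block.phase ^= 1                      # normal case: move to the other phase
    return "flipped"
```

A block whose only reference is the list's is released by the flip itself, so refile never has to do the flip under its big lock.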
+
+      FIXME use kzalloc where appropriate.
+
+      FIXME Maybe test refcnt-!listempty in refile and de-pin if that is zero.
+
+25th February 2009
+  Good progress.
+  Only 54 credits in Index Tree now.
+  Inodes 1 2 8 16 are present. (fsroot, dirroot, inodemap, segusage)
+  plus '74', which seems to be scheduled for deletion - root has uninc_table.
+   ... and 'sync' got rid of that and left 44 credits.
+  Also have data blocks for inode 50 55 72 73 74 with 2 credits of 74.
+    50  link
+    55  zfile
+    72  long84
+    73  long85
+    74  adir
+  These seem to be the files that used data-in-the-inode
+  They still have a refcnt of 1 (or 2 for adir).
+  ... OK, that's gone now.  I found a refcount leak.
+
+  So now:  42 Credits in Index Dump.   No stray files.
+
+  df: tot=4608 free=4597 avail=3045(4130-1085) cb=8 pb=0 ab=3
+  So we still seem to have 1085 blocks allocated.  42 are accounted
+  for, so 1043 still missing... either we lost the count, or lost the tree.
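The arithmetic above can be checked mechanically against any of the df lines in these notes. This is a small sketch under assumptions: the regular expression only matches the exact line layout shown here, and the key name `space` for the first number inside the parentheses is made up, since the notes only identify the second number as the allocated count.

```python
import re

# Matches lines such as:
#   df: tot=4608 free=4597 avail=3045(4130-1085) cb=8 pb=0 ab=3
DF_RE = re.compile(
    r"df: tot=(\d+) free=(\d+) avail=(\d+)\((\d+)-(\d+)\) cb=(\d+) pb=(\d+) ab=(\d+)"
)


def parse_df(line):
    """Split one df debug line into named integer fields."""
    m = DF_RE.search(line)
    if m is None:
        raise ValueError("not a df line: %r" % line)
    keys = ("tot", "free", "avail", "space", "allocated", "cb", "pb", "ab")
    return dict(zip(keys, map(int, m.groups())))


def unaccounted(line, credits_in_tree):
    """Allocated blocks that the Index Dump credit count does not explain."""
    d = parse_df(line)
    if d["avail"] != d["space"] - d["allocated"]:
        raise ValueError("avail does not equal the bracketed difference")
    return d["allocated"] - credits_in_tree


line = "df: tot=4608 free=4597 avail=3045(4130-1085) cb=8 pb=0 ab=3"
print(unaccounted(line, 42))  # prints 1043
```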
+
+  create a tiny file, remove, and sync, now
+  df: tot=4608 free=4597 avail=3018(4118-1100) cb=8 pb=0 ab=3
+
+  so I lost 15, but now 48 are in tree.  Let's try again...
+  df: tot=4608 free=4597 avail=3006(4108-1102) cb=8 pb=0 ab=3
+  and 44 in tree
+  and again:
+  df: tot=4608 free=4597 avail=2986(4098-1112) cb=8 pb=0 ab=3
+
+  Definitely losing more than the difference in the tree.
+
+  Try creating empty files...
+df: tot=4608 free=4597 avail=2986(4098-1112) cb=8 pb=0 ab=3
+df: tot=4608 free=4597 avail=2974(4088-1114) cb=8 pb=0 ab=3
+df: tot=4608 free=4597 avail=2954(4078-1124) cb=8 pb=0 ab=3
+df: tot=4608 free=4597 avail=2942(4068-1126) cb=8 pb=0 ab=3
+df: tot=4608 free=4597 avail=2922(4058-1136) cb=8 pb=0 ab=3
+df: tot=4608 free=4597 avail=2910(4048-1138) cb=8 pb=0 ab=3
+df: tot=4608 free=4597 avail=2890(4038-1148) cb=8 pb=0 ab=3
+
+ very strong pattern there.
+ What about 2 files at a time.
+df: tot=4608 free=4597 avail=2879(4028-1149) cb=8 pb=0 ab=3
+df: tot=4608 free=4597 avail=2860(4018-1158) cb=8 pb=0 ab=3
+df: tot=4608 free=4597 avail=2849(4008-1159) cb=8 pb=0 ab=3
+df: tot=4608 free=4597 avail=2830(3998-1168) cb=8 pb=0 ab=3
+df: tot=4608 free=4597 avail=2819(3988-1169) cb=8 pb=0 ab=3
+
+  Slightly different pattern - not as bad.
+  Have to try 4 now.
+df: tot=4608 free=4597 avail=2802(3978-1176) cb=8 pb=0 ab=3
+df: tot=4608 free=4597 avail=2793(3968-1175) cb=8 pb=0 ab=3
+df: tot=4608 free=4597 avail=2776(3958-1182) cb=8 pb=0 ab=3
+df: tot=4608 free=4597 avail=2767(3948-1181) cb=8 pb=0 ab=3
+
+  Strange, isn't it....
+
+  Making sure we clear UnincCredit... result looks worse.
+
+26th February 2009
+  I fixed up the credit accounting 'incorporate' and then fixed a couple
+  more little bugs.  And now:
+
+
+
+====================48 credits ==============================
+df: tot=4608 free=4597 avail=3172(3940-768) cb=10 pb=0 ab=1
+
+So we still have 720 allocated credits that aren't accounted for.
+But we are nicely under 100...
+
+.... and now
+
+
+====================76 credits ==============================
+df: tot=4608 free=4256 avail=2160(2402-242) cb=350 pb=0 ab=2
+
+That is different.  The count of missing blocks is way down,
+but there is some extra cruft in the index tree.
+Quite a few like
+    0 DATA (1) 303 [0]L Leaf1(13) SegRef,Claimed,PhysValid
+    0 DATA (1) 302 [0]L Leaf1(14) SegRef,Claimed,PhysValid
+and even one
+    0 DATA (2) 330 [0]L Leaf1(1) SegRef,C,CI,Claimed,PhysValid
+   330    1 (1)   0 [0]{0,00000000} [0, 0]L Leaf1(0) Index(1),Pinned,Phase1,InoIdx,SegRef,C,CI,CN,CNI,PhysValid,Prealloc
+Time for a commit though....
+
+and now
+====================46 credits ==============================
+df: tot=4608 free=4257 avail=4253(4458-205) cb=350 pb=0 ab=1
+
+so the strays in the index tree are gone, but still have 159 outstanding
+credits.
+No change but now
+====================36 credits ==============================
+df: tot=4608 free=4256 avail=3787(3885-98) cb=350 pb=0 ab=2
+
+
+That is a little weird...
+Hmmm. back to 
+====================48 credits ==============================
+df: tot=4608 free=4257 avail=4247(4458-211) cb=350 pb=0 ab=1
+
+Oh well.
+====================34 credits ==============================
+df: tot=4608 free=4257 avail=3176(3373-197) cb=350 pb=0 ab=1
+
+It seems that the unaccounted blocks are (or can be) created by
+writing to a file then removing the file without a sync.
+..but why is cb (cblocks_used) so high?
+
+27th February 2009
+
+ Got onto a bit of a tangent...
+ What happens if we truncate a block while it is on a list to
+ be cleaned?  Clearly we want the cleaner to drop it ASAP.
+ But what if invalidate_page wants to drop it *now*?
+ Hopefully it is either still on clean_leafs and we can remove it,
+ or it is now iolocked and we can wait for it.  So should be OK.
+
+ I keep getting caught in "looping on..."
+ We are truncating an inode and some index block which is now empty
+ is not getting removed from the tree because there is an outstanding
+ reference.... 327/0 depth=1.  I guess I turn on the tracing.
+
+ ... and it seems that it is in the process of checkpointing.
+ I guess I need to lock against that ... maybe with the iolock.
+
+Credits = -1, rv=2
+ib = [ce814e40]328/0(2552)r3:Index(1),Pinned,Phase1,Valid,Dirty,CI,CN,CNI,UninCredit,PhysValid,Uninc{0,0}[0]
+------------[ cut here ]------------
+kernel BUG at /home/neilb/work/nfsbrick/fs/module/segments.c:371!
+
+ -------
+ Every time I create/delete a file, I get an extra 'ab' which disappears
+ on 'sync'.
+   ablocks_used is:
+     decremented when +ve summary_update on non-index
+     increased on lafs_summary_allocate... should not be done for index blocks.
+
+ OK:  after test run, filesystem is empty, but cblocks_used is around 360.
+  cblocks_used:
+        is loaded at mount time
+        collects pblocks_used on a phase flip
+        is updated in lafs_summary_update (unless pblocks is)
+   So we must be missing a lafs_summary_update when phys->0
+
+
+ Lots of problems:
+   truncating big (multi-level index) seems to be bad
+     Leaves 'pb-338 !!! and cb+689, even after sync.
+   still 'looping on' occasionally
+   Haven't found cblocks_used leak yet.
+   Occasionally non-B_Valid blocks are acted on.
+     I think I need to improve io locking.
+
+---------------
+1st March 2009
+  Need some improvements to iolock locking.
+  We use this lock to wait for a block to be written out (if that is happening)
+   before we allow lafs_invalidate_page to complete.
+   It is also used in lafs_erase_{d,i}block (Similar purpose)
+  We take the lock in lafs_cluster_allocate, and then make sure the block is
+   still dirty.
+
+  Also lock in lafs_new_inode as initing the inode is a form of IO ??
+  load_block takes the lock
+  We only clear_bit(B_Valid, ) under this lock.
+
+  So the issue is this:
+    A block that is going to be written is passed to lafs_cluster_allocate.
+    This happens either after taking it off a _leafs list, or when 
+    lafs_writepage requests the write.
+
+    lafs_invalidate_page needs to be able to release the page, so there needs to
+    be no transient references.  In particular, once the block has been
+    removed from a _leafs list it must already be iolocked.
+    Invalidate_page can then either remove from that list and erase the block,
+    or use io_lock_block to wait for the IO to complete.
+  So when a datablock comes out of get_flushable it must be iolocked, and must
+  remain iolocked until after Dirty and Alloc are clear
+  Index blocks belong entirely to the fs, so we can be more relaxed with them.
+  If get_flushable finds the block already iolocked, it is either being invalidated
+  or already has IO pending, so it can be dropped.
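The get_flushable rule can be sketched with a try-lock. This is a single-threaded illustration only: `DataBlock` and the `leafs` deque are hypothetical stand-ins, a `threading.Lock` models the iolock, and the real code would presumably hold whichever lock protects the _leafs list while doing this.

```python
import threading
from collections import deque


class DataBlock:
    """Hypothetical stand-in for a data block sitting on a _leafs list."""

    def __init__(self, name):
        self.name = name
        self.iolock = threading.Lock()  # models the block's iolock
        self.dirty = True


def get_flushable(leafs):
    """Return the next block with its iolock held, or None if there is none."""
    while leafs:
        block = leafs[0]
        # Try the iolock before the block leaves the list, so a block that
        # is off the list and handed to the caller is always iolocked.
        locked = block.iolock.acquire(blocking=False)
        leafs.popleft()
        if locked:
            return block
        # Already iolocked: it is being invalidated or already has IO
        # pending, so it is simply dropped from the list.
    return None
```

The caller owns the iolock on the returned block and would release it only after Dirty and Alloc are clear.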
+
+
+16th March 2009
+
+  FIXME  When we sync a small file, we just write out the inode.
+     rollforward currently ignores data in inodes I think.
+     This needs to be fixed to ensure this data is safe.
+
+ - stop iblock from disappearing so much.
+
+ - I think...
+    While cleaning a file, I truncate it.  This makes it appear
+    to fit in the inode but it is very big and we get confused.
+   We cannot allocate block 0 until all the others have been
+   allocated to 0 and forgotten.
+   But what if we truncate a file to 10 bytes, then fsync?
+    We need to write the data promptly, but we like doing truncate
+    in the background.
+   When we extend a file we already need to wait for truncation
+    to complete (FIXME do we do that?)  We could wait on fsync too.
+   We cannot just delay block0 as it might be part of a checkpoint
+    that has to complete promptly while truncation can take a long time.
+   i.e. we have a very large file.  We update the first byte, then
+    truncate to 2 bytes.... we don't need to write until fsync which will wait...
+    Directory?? delete lots of entries so it shrinks to one block?
+       There is no delayed truncate there.
+   ?? Never clean an I_Trunc file.  
+   If we try to allocate a file with other indexes:
+     clear Realloc
+     if Dirty and Pinned, just do normal alloc
+     if Dirty and not pinned, skip.
+
+
+  Sometimes I run out of credits while truncating a file.
+  I need credits - maybe only briefly - to dirty the index blocks.
+     -- FIXED I think.
+
+  An indexblock remains pinned while the refcount is non-zero.
+  A pinned index block can be on a _leaf lru
+  The _leaf lru holds a refcount.
+  This is an awkward referential loop.
+  We break it at checkpoint time with special code in phase-flip.
+  But there are other awkward times such as truncate.
+
+  We cannot use PinPending like we do with data blocks because there
+  could be multiple pending Pins (from different children).
+
+  We could possibly treat checkpoint_lock like pinpending, but that 
+  might be racy.
+
+  We could not count the _leaf lru, but that might just make the race
+  harder to find.
+
+  I think we want to explicitly drop the pin when we truncate a block.
+  Normally, once we Pin an index block it will become dirty so we don't
+  want to de-pin before a checkpoint anyway...
+
+  Just to clarify: an index block gets dePinned:
+   - during checkpoint on a phase_flip if it is no longer dirty etc
+   - on truncation when we erase it
+   - during pre-emptive write-out which is a bit like an early phase_flip
+           not sure that we implement that one yet.
+
+17th March 2009
+ Deadlock?
+   - checkpoint calls incorporate calls erase_iblock calls iolock_block
+   - rm calls orphan_pin calls phase_wait
+ The problem is in lafs_incorporate.  It expects the block to be iolocked,
+  but can call erase_iblock which tries to get an iolock itself...
+ ...fixed that and it still happens.
+ checkpoint calls phase_flip calls allocated_block (on uninc list) calls
+    iolock_block before calling incorporate
+ Maybe all of these should assume an IO lock.
+
+ FIXME truncate assumes truncate-to-zero.  We need proper ftruncate support.
+
+ It nearly works....
+  Things to do:
+    - sort out individual patches and review DONE
+    - allow compilation without refcount tracking DONE
+    - don't hold a 'leaf' reference. NO
+    - clean up *ref calls - differentiate those that can be called when zero DONE
+    - use enum for B_* DONE
+    - support truncate to non-zero offset DONE
+    - "looping on" found an 'OnFree' block!
+    - clean out a lot of debugging
+
+ Hmmm.... deadlock.
+  rmdir is holding i_mutex and waiting for a phase change to pin a dblock.
+  checkpoint is also holding i_mutex.. or is trying to get one in lafs_cluster_allocate.
+  Not cool.  i_mutex must not be taken by checkpoint
+ Fixed that, though it is a bit of a hack....
+
+ New deadlock:  checkpoint calls phase_flip which calls allocated_block,
+    to move the uninc_next across, and that tries to iolock the parent to
+    perform a partial incorporation.  But that seems to be iolocked.
+    Generally that is ugly as ->uninc_next might be very long and require
+    multiple splits, and direct-driving that from phase_flip is bad.
+    I should just move the list across
+
+
+19th March 2009
+  Spent too long trying to remove refcount held by *_leaf lists.
+  This leaves InoIdx block with zero refcount so Data block can get
+  lost and bad things happen.
+  I might be able to fix it up, but it is probably better to try the
+  checkpoint_lock approach if I can only remember what that is.
+
+Locking:
+  Available locks:
+
+   Spin:
+
+    lafs_hash_lock
+        Used in:
+          lafs_shrinker
+          lafs_refile ???
+        Protects:
+          ib->hash
+          ->lru when on freelist 
+
+    i_data.private_lock
+        Used in:
+          lafs_shrinker
+       Protects:
+          ->iblock / refcnt
+          ->dblock / my_inode
+           ->children / ->parent within an inode
+          setting ->private
+
+    fs->alloc_lock
+        fs->allocate_blocks
+
+    fs->stable_lock
+        segsum hash table
+        segsummary counters (in blocks)
+
+    fs->lock
+        _leafs lru
+        ->pending_blocks lru - should this be wc->lock ??.. not in 'bh'
+        Pinned consistent with lru
+        ->checkpointing / ->phase_locked
+        fs->pending_orphans
+        ->uninc and ->chain ??  Should use parent->B_IOLock ??
+       uninc_table - should use B_IOLock
+       free list / clean list segtrack
+
+   Mutex:
+
+    fs->wc->lock 
+      wc[0] .. something in prepare_checkpoint
+       ->remaining etc
+      cluster_flush
+      mini blocks
+
+    i_mutex
+      inode_map
+      orphans
+
+   Other:
+
+    B_IOLock
+       erase_block
+       incorporate
+       cluster_allocate
+       allocated_block
+       IO
+       Phase flip
+       Initialising new inode
+    B_IOLockLock
+         IOLock across a page
+
+
+--------------------
+This is a list from 18 months ago, with updates
+
+ - Understand how superblock 'version' should be used.
+
+ -  Review and fix up all locking/refcounts.  See locking.doc
+       Also lock inode when copying in block 0 and probably
+       when calling lafs_inode_fillblock (??)
+ -  lafs_incorporate must take a copy of the table under a lock so
+         more allocations can come in at any time.
+
+ - We don't want _allocated to block during cluster flush.  So have
+   a no-block version and queue blocks on ->uninc if we cannot
+   allocate quickly.  Find some way to process those ->uninc blocks.
+
+ - Use above for phase_flip so that we don't need to _allocated there.
+
+ - Utilise WritePhase bit, to be cleared when write completes.
+     In particular, find when to wait for Alloc to be cleared if
+      WritePhase doesn't match Phase.
+       - when about to perform an incorporation.
+ - make sure we don't re-cluster_allocate until old-phase address has
+     been recorded for incorporation.
+
+ - allocate multiple WAIT_QUEUE_HEADS for 'block_wait'
+
+ - Can inode data block be on leafs while index isn't, what happens if we
+       try to write it out...
+
+ -  If InoIdx doesn't exist, then write_inode must write the data block.
+
+ - document and review all guards against dirtying a block from a previous phase
+    that is not yet safe on storage.
+          See lafs_dirty_dblock.
+ - check for proper handling of error conditions
+     b/ if lafs_seg_dup fails in do_checkpoint, we need to abort the snapshot.
+ - review checkpoint loop.
+       Should anything be explicit, or will refile do whatever is needed?
+ - Waiting.
+       What should checkpoint_unlock_wait wait for?
+       When do we need to wait for blocks to change state. And how?
+
+ - load/dirty block0 before dirtying any other block in depth=0 file
+
+ - use kmem_cache for 'struct datablock'
+ - indexblock allocation.
+        use kmem_cache
+       allocate the 'data' buffer late for InoIdx block.
+       trigger flushing when space is tight
+       Understand exactly when make_iblock should be called, and make it so.
+ - use a mempool for skippoints in cluster.c
+ - Review seg addressing code in cluster.c and make sure comments are good.
+ - consider ranges of holes in pending_addr.
+
+ - review correct placement of state block given issues with stripes.
+
+ - review segment usage /youth handling and make a todo list.
+      a/ Understand ref counting on segments and get it right.
+ - Choose when to use VerifyNull and when to use VerifyNext2.
+ - implement non-logged files
+ - Store accesstime in separate (non-logged) file.
+ - quotas.
+        make sure files are released on unmount.
+
+ - cleaner.
+       Support 'peer' lists and peer_find. etc
+ - subordinate filesystems:
+     a/ ss[]->rootdir needs to be an array or list.
+     b/ lafs_iget_fs need to understand these.
+ - review snapshots.
+      How to create
+      how they can fail / how to abort
+      How to destroy
+ - review unmount
+      - need to clean up checkpoint thread cleanly - be sure it has fully exited.
+ - review roll-forward
+      - make sure files with nlink=0 are handled well.
+      - sanity check various values before trusting clusters.
+
+ - Configure index block hash_table at run time based on memory size??
+ - striped layout.
+         Review everything that needs to handle laying out at cluster
+         aligned for striping.
+
+ - consider how to handle IO errors in detail, and implement it.
+ - consider how to handle data corruption in indexing and directories and
+     other metadata and guard against problems (lot of -EIO I suspect).
+
+ - check all uninc_table accesses are locked if needed.
+
+ - If a datablock is memory mapped writeable, then when we write it out,
+     we need to either fill up its credits again, or unmap it.
+ - Need to handle orphans asynchronously.
+
+ - support 'remount'
+ - implement 'write_super' ??
+
+ - pin_all_children has horrible gotos - remove them.
+
+ - perform consistency check on all metadata blocks read from disk
+   e.g. don't assume index blocks are type 1 or 2.
+
+23rd March 2009
+ + looking at cleanup for unmount.
+ - various more refcounts fixed up
+ - B_SegRef is never dropped!  and we take a ref on a segment when
+   we start a cluster on it, but never drop that reference.
+  THIS is next thing - review all setting and clearing of B_SegRef.
+
+30th March 2009
+ - SegRef and lafs_reserve_block...
+   There is room for recursion here, I need to be careful.
+   To dirty a data block, all parent index blocks must be Pinned and must
+   be able to be written.  That means their segusage blocks must be
+   available for update.  And Pinning a segusage block for update requires
+   all its parents.  So the segment for the block, the indexes, and the
+   segusage and indexes and so-on must all be pinned.
+   When we pin a block, we do it from the root down to avoid recursion.
+   We probably want whatever reserve_block calls, to return an unreserved
+   block rather than call reserve_block itself.
+
+  When do we clear SegRef?? We set it when Pinning, so I guess we
+    clear it when unpinning.
+   pin_dblock, mark_cleaning, prepare_write, truncate
+   seg_move clean_free
+  Well, it is really when Pinning, or Dirtying or Reallocing.
+  So we clear when unpinning, or when a dblock gets written...
+  Maybe just when we lose ->parent
+
+6th April 2009
+ - sometimes segsum counter goes zero for random data block
+     Something is going wrong in roll-forward.  The block looks transiently valid
+     so doesn't get read, but has no good data in it.
+ - After deleting a directory, the block might still have incorporation
+   to happen, but is not marked dirty
+ - at unmount, there are various blocks that are still dirty.
+ - sometimes hit BUG_ON(credits==0) line 1196 in cluster.c(cluster_flush)
+
+12th April 2009
+ - that rollforward problem above:
+    When rolling the checkpoint, if we find segusage blocks we want to include
+    them directly into file.  But by pinning the block we might preread a
+    segusage block.. but we must be sure not to update it.
+    So during the early stages of rollforward while still in the checkpoint,
+    seg_inc must be called with in_phase == 0.
+    so seg_move is called with phase != qphase.
+    ditto for summary update.
+    So the block must be pinned to the previous phase...
+    Normally 'phase' changes at checkpoint-start,
+             qphase changes at checkpoint-end
+    So we probably want to start with qphase being 0 and phase being 1.
+    When we reach the end of the checkpoint, we flip qphase to 1.
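+
+    A toy model of the two phase bits, just to check the reasoning above.
+    The class and method names are made up for illustration and are not
+    LaFS symbols; the only facts taken from these notes are that 'phase'
+    flips at checkpoint-start, 'qphase' catches up at checkpoint-end, and
+    roll-forward wants to start with phase=1, qphase=0.

```python
# Toy model of phase/qphase; illustrative names only, not LaFS code.
class PhaseState:
    def __init__(self, phase=0, qphase=0):
        self.phase = phase
        self.qphase = qphase

    def checkpoint_start(self):
        # 'phase' flips when a checkpoint starts.
        self.phase ^= 1

    def checkpoint_end(self):
        # 'qphase' catches up with 'phase' when the checkpoint ends.
        self.qphase = self.phase

    def in_checkpoint(self):
        # While the two differ, updates belong to the previous phase.
        return self.phase != self.qphase


# Roll-forward starts as if it were already inside a checkpoint.
rf = PhaseState(phase=1, qphase=0)
```

+    With this starting state in_checkpoint() is true until the end of the
+    rolled checkpoint, which is the window where seg_move must see
+    phase != qphase.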
+
+ - blocks still in phase_leafs at unmount:
+    After we force a final checkpoint we still have Pinned:
+        root InoIdx
+        ino==8 InoIdx due to Dirty block0
+        ino=16 InoIdx due to dirty block0
+     and dirty:
+        inode block 1,  inode usage map
+                    2,  root directory
+                    8,  orphan
+                   16   seg usage
+     Problems:
+        inode blocks dirty but not pinned?  No InoIdx...
+        Segusage dirty - probably by seg_apply_all - disable that at umount
+        orphan dirty ??... but not pinned!
+           This is possible - we don't pin for clearing entries, just for setting.
+        The inode problem stems from the datablock being dirty while the
+         InoIdx block isn't.  That is, at best, confusing.
+
+13th April 2009
+   segusage blocks aren't being pinned
+   They need to be pinned  whenever dirty.
+   and youth blocks aren't even made dirty sometimes.  They need to be
+    pre-pinned in many cases.
+
+   So: segusage gets changed when we write out a cluster, and when we
+      delete/relocate blocks.
+      In the first case we pin the block when it becomes part of the free list,
+      and need to keep it pinned across checkpoint changes.
+      In the second, we pin when the block is dirtied and again must keep it pinned.
+      Youth gets changed when a segment becomes free and again when we allocate
+      a segment to it.
+
+      Keeping a datablock pinned across checkpoints is awkward - we currently need
+      to repin for each dirty... I guess we can re-pin for each checkpoint
+      in lafs_seg_apply_all.  That might work for segusage, but not for youth!
+      If segsum for ssnum==0 held a reference to the youth block, that might
+      help.  Segstat on 'clean' or 'free' would imply a reference to that segsum.
+
+      Is it OK to keep all youth/usage blocks for free/clean blocks
+      pinned?  We can currently have 810 entries.  Only half will be clean/free.
+      For each entry there can be two blocks, youth and usage.  So that could be
+      810 blocks. 1Meg?  Normally much less.  If it became a problem we could
+      reduce the number dynamically I guess.
+
+      maybe segusage blocks need to get phase_flipped, as other blocks do
+      depend on them,   pin_all_children wouldn't be able to find them though..
+
+    1/ Any address on 'clean' or 'free' segtrack implies a refcount on the
+      Youth block.
+      
+14th April 2009
+   I think I want to link dirty block to the space in free segments that we
+   actually know about.  Each of those segments has youth and usage blocks
+   pinned (at least parent pointer is active).  So we have everything we need
+   to write everything that is dirty.  So 'free' or 'clean' implies
+   a segsum reference which holds youth block.
+
+   When we get low on space, we wait for cleaning/finding to progress.
+   This would limit us to  400 segments, say 16Meg each, so 6Gig of dirty
+   memory.  I guess that we need to scale the 'free' list based on available
+   memory (FIXME).
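+
+   Checking that bound, and one possible way to scale it.  The scaling
+   policy (a fixed fraction of available memory, capped at 400 entries) is
+   purely an assumption for illustration; nothing like it exists in the
+   code yet.

```python
MIB = 1024 * 1024
GIB = 1024 * MIB


def dirty_bound_bytes(free_segments, segment_bytes):
    # Worst case: every segment on the free list is backed by dirty memory.
    return free_segments * segment_bytes


def scaled_free_entries(avail_bytes, segment_bytes, fraction=0.5, max_entries=400):
    # Hypothetical policy: let the free list cover at most `fraction`
    # of available memory, never more than `max_entries` segments.
    wanted = int(avail_bytes * fraction) // segment_bytes
    return max(1, min(max_entries, wanted))


bound = dirty_bound_bytes(400, 16 * MIB)   # 6400 MiB, i.e. 6.25 GiB
```

+   So 400 segments of 16Meg is 6.25Gig, matching the "6Gig" estimate.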
+
+   When cleaning needs a segment, it needs to load the usage blocks for other
+   snapshots too.
+
+   When cleaning in the presence of snapshot we need to be careful never to
+   duplicate a block that is shared.  To allow for v.many snapshots, we don't
+   even want to duplicate in memory.
+   So we need to choose a 'primary' copy - probably first one found - and
+   follow the peers link when possible...
+
+18th April 2009
+   (continuing).
+
+   So clean and free segments in the list carry a SegRef.  But it could be
+   excessive if all of them did - we shouldn't be required to pin more
+   data than we need.
+   So for segments with a usage of 0, we use the score to record if a
+   segref is held.  0 means 'no', 1 means 'yes'.
+   When space_alloc wants more space we need to find an entry and
+   segref it.  Maybe we want free lists - reffed and not-reffed.
+
+   Then again, SegRefs are fairly cheap as they are heavily shared.
+   maybe 512 to a block.  If we hold 400 refs they could easily all be
+   in one block.  We could possibly encourage this by sorting the list
+   and discarding from one end if it is too full.
+   Sorting is a good idea definitely.  It keeps youth/usage updates
+   together.
+
+   Just check the numbers.
+   a 1TB device with 1K blocks might have 32MB segments, of which there
+   would be 32768.  512 per block means 64 blocks or 16 pages (64K).
+   So the total for the segusage files is 128K plus snapshots.  Not worth worrying
+   about surely.
+   For 16TB, that is 2Meg plus snapshots.
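+
+   The same numbers as a quick calculation.  The 2-byte entry size is an
+   assumption inferred from "512 per block" with 1K blocks, and the
+   doubling assumes two tables of that size (youth and usage).

```python
KIB = 1024
MIB = 1024 * KIB
TIB = 1024 * 1024 * MIB


def table_size(device_bytes, segment_bytes=32 * MIB,
               block_bytes=1 * KIB, entry_bytes=2):
    """Return (segments, blocks, bytes) for one per-segment table."""
    segments = device_bytes // segment_bytes
    per_block = block_bytes // entry_bytes      # 512 with the defaults
    blocks = -(-segments // per_block)          # ceiling division
    return segments, blocks, blocks * block_bytes


one_tb = table_size(1 * TIB)      # (32768, 64, 65536)
sixteen_tb = table_size(16 * TIB)
```

+   Two such tables come to 128K at 1TB and 2Meg at 16TB.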
+
+   So
+    - keep a SegRef for all free and clean blocks.
+      This must include a youthblk reference.
+    - sort the free list when 'clean' is merged or when a pass
+          finishes.
+       sort clean list
+       fix youth value
+       merge as many as fit into free
+       sort 
+
+   How is the code flow...
+      add_cleanable is called during the periodic scan.  It could hold
+               a SegRef easily.
+      add_cleanable calls add_clean as does lafs_get_cleanable during
+          clean.  That might block getting a segref, might even
+          deadlock?
+      add_free is also called by seg_scan
+
+      So seg_scan should get a segref and leave it with everything!
+
+    BUT.....
+    A SegRef implies a 'struct segsum' for each segment.  We don't
+    want to allocate one of these for every segment in the table.
+    We only want a reference to the youth and segusage block, which
+    are heavily shared.
+
+    But these blocks need to be Pinned and SegReffed etc so we can
+    write them at any time.
+
+20th July 2009
+  The refcount held by the 'leaf' lru is a problem.
+  While it holds a count we do not unpin an index block, so it cannot
+  be removed from the list.
+  Thus we can only remove from the leaf lru on a phase change.....
+  Or when doing lru based flushing... Maybe we can remove from the
+  lru while holding the checkpoint lock.
+  This happens when truncating..
+
+  No, that is just too messy as it is too easy to get put back on the list.
+
+  Maybe the leaf lru should not imply a reference count ... or maybe
+  we need to split the refcount:  'inuse' and 'active'.....
+  How about we test refcnt against list_empty(->lru)...
+
+  ....
+
+  During truncate, we need each index block to get unpinned so they can
+  all be cleaned up.
+  But the InoIdx block is held pinned by the inode block being dirty.
+  In this particular case, the InoIdx block is Invalid as the file is empty.
+  But.... InoIdx should always be valid until after Inode is destroyed??
+
+
+ umount
+ I need to stop the cleaner and flush everything before trying to
+ clean up.
+
+ This is awkward though.
+ The 'sync' of umount is done by kill_block_super, but I call
+ that rather late, after checking that the tree is empty.
+ There are pinned/dirty bits left after sync that we want to magically
+  clean.
+ We have:
+   - segusage/youth blocks.  Maybe if we don't seg_apply_all...
+   - orphan block.  Maybe don't mark it dirty when we remove things?
+   - inode map?? why is that dirty
+
+   - root directory is dirty still??  But it has been erased.
+     InoIdx is valid-but-empty.  Inode Data is dirty
+        Data block 0 is Dirty at block 0.
+
+  ......
+ Ahh... need to mark page dirty when block is marked dirty !!
+
+ The seg usage blocks are now flushed out but not incorporated.
+ I feel that might be correct - we don't want to care about
+ incorporation as we will never use it.
+ For this, segusage and quota are very special cases.
+
+ Inode map is no longer dirty, but is pinned
+ Orphan does have a dirty block still
+    The orphan table contains the root directory.
+ root is now clean and gone
+
+ Segusage doesn't get incorporated after last checkpoint now
+ so that is better.
+ But now we have a circular reference for SegRef.  This should not
+ be surprising given the circular problems we had setting SegRef.
+ I guess we just erase the references in the segsum table...
+
+22nd July 2009
+ Hurray!!! I can unmount without crashing!
+ Now I need to sort through all the fixes required to achieve that
+ and make discrete patches, and be sure it is all OK.
+
+DONE - (block.c) lafs_get_block should not have to lock that page just to do a lookup.
+DONE - (block.c) Mark page dirty when block becomes dirty
+DONE - (checkpoint.c) print orphan_slot with Orphan flag
+DONE - Don't incorporate segcount etc after final checkpoint
+DONE - Don't apply seg changes after final checkpoint.
+DONE - Don't start opportunistic checkpoint after final.
+DONE - (checkpoint) if InoIdx isn't dirty but InodeData is, then still allocate
+DONE - (checkpoint) when waiting, wait for checkpointneeded to get cleared
+DONE - (cluster) be more flexible about credit usage when flushing InoIdx
+DONE - (dir) do add_orphan when we abort as well as on success
+DONE - use inode_dec_link_count, not i_nlink--
+DONE - (file.c) lafs_writepage: remove from leafs when we cluster_allocate
+DONE - change %d/%d to strblk
+DONE - (index.c) refile: IF B_IOLOCK, then it isn't on LRU
+DONE - (index) refile: when unpinning, remove from lru
+ - lafs_refile: ->iblock can be non-null for inode 0.
+DONE - Make sure I_Deleting gets cleared when deleting finished.
+DONE - phase_flip should have something separate to call, not lafs_allocated_block
+ - inode.c: lafs_dirty_inode: getref_lock used to get dblock
+NONO - ?? getref_locked allowed if PagePrivate
+DONE - segment: lafs_seg_put_all needed at unmount
+DONE - segdelete_all: need to put intable references
+DONE - lafs_free_get: put the intable references
+DONE - lafs_get_cleanable: put the intable references
+DONE - fix sort splitting in add_cleanable
+DONE - add lafs_empty_segment_table for unmount
+DONE - lafs_release: flush all dirty blocks
+DONE - lafs_release: force a final checkpoint
+DONE - lafs_release: move kill_block_super before final check
+DONE - lafs_put_super: release orphans and segsum files.
+DONE - lafs_destroy_inode: putref should be 'iblock'
+ - lafs_destroy_inode: allow for iblock to be present but no ref held....
+DONE - can roll forward call lafs_allocated_block without dirty???
+
+27th July 2009.
+ - I've re-arranged lafs_release so that the flush is all done in
+   generic_shutdown_super.  However it calls invalidate_inodes, and that has
+   problems with pinned inodes.  So we need for fsync_super to checkpoint
+   out all inodes that we don't hold our own reference to.  
+   If we do hold a reference, then invalidate_inodes will skip them,
+   and ->put_super can be used to drop the references and perform the final
+   checkpoint.
+   fsync_super calls ->sync_fs. after syncing all files.  Maybe I can
+   do some sort of checkpoint there...
+   There almost is a checkpoint in there.... But only when called without
+   'wait'....
+   I need to understand 's_dirt'.
+   This is controlled entirely by the filesystem, common code only examines it.
+   If it is set:
+          file_fsync (the generic 'fsync' method) will call ->write_super
+          fsync_super will call write_super
+          generic_shutdown_super will call write_super
+          sync_supers will call write_super
+          sync_filesystems(0) will call ->sync_fs
+   sync_fs is called:
+        twice from 'sync', once with '0', once with '1' for 'wait'.
+             (though in emergency_sync, both are '0').
+        once from unmount and remount with 'wait' set to '1'.
+        We don't want two checkpoints for a 'sync', but we want to start
+        on 'wait=0'.
+        Maybe if we get called with '0', we set a flag and treat the '1'
+        differently..  There is no locking to make this really safe, but
+        it will probably be OK...  I could take a process_id, but then
+        parallel 'sync's could race.
+        write_super is called before the syncs.  So it could start the checkpoint,
+        and sync could wait for it.
+        write_super is called multiple times at shutdown.  We really need 
+        to utilise s_dirt to avoid some of these.
+        We set s_dirt to 0 when we set CheckpointNeeded, and set it to 1:
+            - when we pin a dblock or dirty a this-phase iblock.
+
+29jul2009
+  at unmount, we iput the root inode which de-references the dblock
+  before clearing ->iblock, which fails an assertion ... why?
+   Apart from the shrinker, ->iblock is only set to NULL in refile
+   when we find an I_Destroyed inode... I guess the root block isn't
+   getting Destroyed...
+ The protocol for freeing iblocks is bad.  Should be:
+   - it only gets freed by the shrinker
+   - when inode dies, set ->inode to NULL
+   - when InoIdx iblock dies, set ->iblock to NULL
+   ...???
+30Jul2009
+  So, what exactly is the protocol?
+    - index blocks live either in the parent/sibling tree, or
+      on the inode's free_index list
+    - when refcnt is 0, they live on 'freelist.lru'.  When refcount
+      is elevated they stay on lru until they need to be 
+      added to some other lru (leafs or cluster)
+    - when shrinker finds block on freelist.lru with non-zero refcnt,
+      it just removes from lru
+    - when shrinker finds free block, it removes from free_index and discards
+      the block FIXME can refcnt=0 still have Pinned,Uninc,Realloc,Dirty ??
+        I think not as such would either have children or be on an lru
+    - When we destroy an inode, all index blocks get disconnected from the
+      inode and freed.  This must include the ->iblock
+    - When an index block becomes free due to index tree shrinkage,
+      we set the ->depth to -1 so that it cannot be found by mistake,
+      and leave it for shrinker or inode destruction.
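+
+    The shrinker part of that protocol as a small model.  Plain lists
+    stand in for freelist.lru and the inode's free_index list, and the
+    names are illustrative only.  It assumes a block with refcnt 0 that is
+    not on free_index is simply left alone.

```python
from dataclasses import dataclass


@dataclass(eq=False)          # compare blocks by identity, like pointers
class IndexBlock:
    name: str
    refcnt: int = 0


def shrink(freelist_lru, free_index):
    """Walk the free lru once; return the blocks that were discarded."""
    discarded = []
    for blk in list(freelist_lru):
        # Either way the block leaves freelist.lru.
        freelist_lru.remove(blk)
        if blk.refcnt == 0 and blk in free_index:
            # Genuinely free: unhook from the inode and discard.
            free_index.remove(blk)
            discarded.append(blk)
        # Non-zero refcnt: someone picked it up again, so only the
        # lru removal happens.
    return discarded


idle = IndexBlock("idle", refcnt=0)
busy = IndexBlock("busy", refcnt=2)
lru, free_index = [idle, busy], [idle]
gone = shrink(lru, free_index)
```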
+
+   Confused about inode<->dblock dependence.
+   We don't want the inode to refcnt the dblock as that wastes space.
+   We don't want the dblock to refcnt the inode as that stops it from being freed.
+   So each must disconnect from other when freed.
+   What locking?
+   inode takes private_lock, then checks dblock
+   dblock cannot take private_lock before checking ->my_inode..
+   Maybe: destroy_inode takes ref on dblock, then sets I_Destroyed, then
+     drops ref
+
+1Aug2009.
+  Tracking down the 'credit' count and making sure it stays correct.
+  It seems that I have a Dirty InoIdx block which is not pinned.
+  Due to this it has no refcount and so the data block disappears so
+  the InoIdx block is not visible in the tree.  This isn't a definite bug
+  but it means I cannot count credits properly.
+  And surely Dirty index blocks must always be pinned!!??
+
+  When a small file is flushed to the inode we were dirtying the
+  iblock.  That seems wrong - should dirty the dblock?  Need to 
+  check that is valid
+
+  I got a hang in 'rm adir/4'.
+  rm is in lafs_cluster_update_commit_both
+       getting a mutex.
+  cleaner is in lafs_do_checkpoint+0xe4
+  pdflush is in writepage/lafs_cluster_flush waiting on a lock
+  so I guess cleaner is holding a mutex and waiting for something
+   that won't happen?
+
+
+  Hang again at 'seq 1 200' in 'cd /mnt/1/adir'.
+   cleaner is at some point, holding a mutex to stop 'sh'.
+  0e4 == 228
+
+  ahh.. prepare checkpoint holds wc[0].lock while waiting for checkpoint
+   to be allowed.
+  So when something locks the checkpoint and needs to flush, we have problems....
+
+
+  I seem to have fixed the above.  Now:
+    Free space is a real problem.  When I remount after the successful unmount,
+    we find a usage pattern like:
+CLEANABLE: 0/0 y=10 u=34179
+CLEANABLE: 0/1 y=0 u=65144
+CLEANABLE: 0/2 y=0 u=65535
+CLEANABLE: 0/3 y=32773 u=32910
+CLEANABLE: 0/4 y=32772 u=149
+CLEANABLE: 0/5 y=0 u=0
+CLEANABLE: 0/6 y=32770 u=16529
+CLEANABLE: 0/7 y=32769 u=35084
+CLEANABLE: 0/8 y=32768 u=31877
+
+    Which is ridiculous. 
+   Better fix up what I have first...
+
+ ...
+ In rm /mnt/1/nbfile* we hang.. 
+   rm is in lafs_phase_wait from pin_dblock in unlink
+wait for [ce5c2d20]277/0(0)r2F:Index(1),Pinned,Phase0,InoIdx,Valid,Dirty,CN,CNI,UninCredit,PhysValid{0,0}[8] pindb(1) leaf(1)
+
+   cleaner is in lafs_iolock_block from add_block_address in phase_flip
+iowait for [ce5c33b0]286/0(0)r6E:Index(1),Pinned,Phase1,WPhase0,InoIdx,Valid,Dirty,CN,CNI,UninCredit,IOLock,PhysValid{0,0}[1] child(5) leaf(1)
+
+ So cleaner is probably deadlocking against itself via iolock_block.
+  This is taken:
+    - in lafs_invalidate_page just to wait for any io - it isn't held long
+    - in lafs_erase_dblock while we erase and 'allocated_block'
+    - in lafs_get_flushable to protect blocks being checkpointed
+    - in lafs_writepage to call cluster_allocate (which releases), both for
+             data block or for inode when data was flushed there.
+    - lafs_add_block_address to process pending incorporations to make room.
+         This is what is trapping the cleaner.
+    - lafs_inode_handle_orphan when truncate finishes to erase_iblock
+    - lafs_inode_handle_orphan again to incorporate all removal
+    - and again to erase_iblock
+    - and for partial truncate to incorporate some removals
+    - and again....
+    - lafs_new_inode to keep it from being cleaned while being created
+    - roll_block to add addresses
+    - lafs_load_block during IO
+
+  So: who holds it?.... let's use the code to find out...
+  And the answer is : lafs_get_flushable.
+   So get_flushable iolocks the block then calls phase_flip which tries to
+   incorporate other-phase children which try to iolock the block.  Deadlock.
+   Do we need to hold iolock during phase_flip ??.  Not for all of it..
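+
+   The self-deadlock in miniature.  This models B_IOLock as a
+   non-recursive lock with an owner and raises instead of hanging; the
+   functions only borrow the names from the call chain above.  The
+   drop_lock_first variant is one guess at "not for all of it", not a
+   tested fix.

```python
class WouldDeadlock(Exception):
    pass


class Block:
    def __init__(self, name):
        self.name = name
        self.iolock_owner = None

    def iolock(self, who):
        if self.iolock_owner == who:
            # The real code would sleep forever waiting on itself.
            raise WouldDeadlock(f"{who} already holds iolock on {self.name}")
        if self.iolock_owner is not None:
            raise RuntimeError("contended; the real code would sleep here")
        self.iolock_owner = who

    def iounlock(self, who):
        assert self.iolock_owner == who
        self.iolock_owner = None


def add_block_address(block, who):
    block.iolock(who)          # process pending incorporations
    block.iounlock(who)


def phase_flip(block, who):
    add_block_address(block, who)


def get_flushable(block, who, drop_lock_first=False):
    block.iolock(who)
    if drop_lock_first:
        block.iounlock(who)    # hypothetical: release before flipping
        phase_flip(block, who)
        return
    try:
        phase_flip(block, who)
    finally:
        block.iounlock(who)
```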
+
+02August2009
+   FIXME When erasing a block, do I need an uninc credit?  I usually don't
+    have one and the need certainly isn't as great...
+
+  Now... let's try to get free space accounting right.
+   Observed problems:
+     - unlink sometimes failed with ENOSPC
+     - usage scan shows segments with enormous usage - 23039!!
+
+  no credits: [ce9a55cc]16/1(2651)r11E:Pinned,Phase1,WPhase1,Valid,Dirty,SegRef,C,CI,CN,CNI,UninCredit,IOLock,PhysValid intable(7) ss(3) cluster(1)
+  no credits: [cfb695cc]16/1(1840)r12E:Pinned,Phase0,WPhase0,Valid,Dirty,SegRef,C,CI,CN,CNI,UninCredit,IOLock,PhysValid intable(7) ss(4) cluster(1)
+
+  no uninc credit [cfb695cc]16/1(2019)r13F:Pinned,Phase0,WPhase0,Valid,Dirty,SegRef,C,CI,CN,CNI,IOLock,PhysValid intable(7) ss(5) cluster(1)
+
+
+  after umount/remount df says "4608 7 1544" but cannot
+   create anything.
+df: tot=4608 free=4601 avail=1544(1564-20) cb=7 pb=0 ab=0
+============= Cleanable table (7) =================
+pos: dev/seg  usage score
+  0:   0/0        1 0
+  1:   0/5        1 64
+  2:   0/6        6 384
+  3:   0/7        2 128
+  4:   0/8        3 192
+  5:   0/3        1 64
+  6:   0/2        2 128
+...sorted....
+  0:   0/0        1 0
+  1:   0/3        1 64
+  2:   0/5        1 64
+  3:   0/2        2 128
+  4:   0/7        2 128
+  5:   0/8        3 192
+  6:   0/6        6 384
+--------------- Free table (1) ---------------
+12290:   0/4        0 0
+--------------- Clean table (0) ---------------
+CLEANABLE: 0/0 y=10 u=1
+CLEANABLE: 0/1 y=32775 u=3
+CLEANABLE: 0/2 y=32774 u=2
+CLEANABLE: 0/3 y=32773 u=1
+CLEANABLE: 0/4 y=0 u=0
+CLEANABLE: 0/5 y=32771 u=1
+CLEANABLE: 0/6 y=32770 u=6
+CLEANABLE: 0/7 y=32769 u=2
+CLEANABLE: 0/8 y=32768 u=3
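+
+  The "...sorted...." order in the Cleanable table above is consistent
+  with sorting on (score, dev, seg).  That key is inferred from this one
+  dump, not read out of add_cleanable, so treat it as an assumption.

```python
# (dev, seg), usage, score -- copied from the unsorted Cleanable table.
table = [
    ((0, 0), 1, 0),
    ((0, 5), 1, 64),
    ((0, 6), 6, 384),
    ((0, 7), 2, 128),
    ((0, 8), 3, 192),
    ((0, 3), 1, 64),
    ((0, 2), 2, 128),
]


def sort_cleanable(entries):
    # Lowest score first; ties broken by device then segment number.
    return sorted(entries, key=lambda e: (e[2], e[0]))


order = [seg for (_dev, seg), _usage, _score in sort_cleanable(table)]
```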
+
+
+03Aug2009
+ Current issues:
+FIXED 1/ Data 16/1 is being Reallocated, but is Dirty, not Realloc
+Gone,presume FIXED 2/ Data 16/1 has no uninc credit in cluster_flush
+ 3/ in cleaner, ->dblock is uninitialised.... actually inode has been freed.
+ 4/ invalidate_page finds Realloc set, even after iolock ..
+     This is during umount  in generic_shutdown/lafs_put_super/iput
+ 5/ 
+
+
+ Thoughts:
+   If we flag a block for Realloc then Dirty before it is allocated,
+     then all is fine.
+   But if we have already allocated to a cleaning cluster... what happens?
+    We need to treat this like it was dirtied after being written, so
+    it gets written to a regular cluster as well.
+    As we only have one uninc bit for both Dirty and Realloc, we need
+    to *not* incorporate the Realloc update if the block is still dirty.
+   So:
+        - block gets chosen for cleaning and allocated to a clean-cluster
+        - block gets marked dirty.  This must not clear Realloc
+        - cluster is flushed, block is dirty, so don't call lafs_allocated_block
+        - Return the Realloc credit, but keep dirty and Uninc.
+     Is there a race if Dirty is set after we enter lafs_allocated_block?
+      As long as the index block gets marked Dirty, not Realloc we might
+       be safe... though it gets awkward if the Dirty writeout falls in to
+       the next phase.  But reserve_block will have provided NCredits for that.
+     So:
+        1/ don't clear Realloc when setting Dirty
+        2/ do clear Realloc if cleaner finds the block is Dirty
+        3/ avoid calling lafs_allocated_block when cleaning a dirty block.
+                   This is an optimisation.
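+
+     The three rules as a flag model.  Block state is just a set of flag
+     names; the function names are made up.  What happens for a block that
+     is not Dirty (clear Realloc, call lafs_allocated_block) is my
+     assumption about the normal path, and credits are not modelled.

```python
def mark_dirty(flags):
    # Rule 1: setting Dirty leaves Realloc alone.
    flags.add("Dirty")


def cleaner_flush(flags):
    """Flush of a cleaning cluster.

    Returns True if lafs_allocated_block would be called for this block.
    """
    if "Dirty" in flags:
        # Rule 2: the cleaner drops Realloc when it finds the block Dirty.
        flags.discard("Realloc")
        # Rule 3: skip the incorporation; the regular write will do it.
        return False
    flags.discard("Realloc")
    return True


racing = {"Realloc"}
mark_dirty(racing)                     # dirtied after being chosen for cleaning
called_for_racing = cleaner_flush(racing)

quiet = {"Realloc"}
called_for_quiet = cleaner_flush(quiet)
```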
+
+    Almost...  A B_Realloc block no longer has B_Credit so B_Dirty cannot be
+       set.
+
+
+  Thoughts3.
+     When cleaning blocks we hold no reference to the inode and it can disappear.
+     We don't want to hold the inode active, but need a reference much like
+      the truncate code has.
+     I think we need a subordinate refcount for both cleaning and truncate.
+      These hold inode present but not active.
+     Maybe every block->inode should be counted like this.
+     And this might simplify the my_inode->dblock inter-relationship.
+     For later..
+       We need to ensure that if a new iget is called on an inode that still
+       exists, we don't allocate a new one but just reuse the old.
+       But that won't work as we cannot add an inode back into the hash table.
+     So I think when cleaning a block we need to ref the inode.
+      i.e. B_Realloc implies an i_grab
+
+05aug2009
+ So I have a problem with the cleaner wanting to hold an inode that
+ the VFS is destroying.
+ I don't want the cleaner to hold i_count as that delays truncate etc.
+ So we need a second counter subordinate to i_count.
+ This is held by the cleaner and by delayed truncate, and by i_count.
+ Possibly ->my_inode holds this, which means it can be a single bit...
+
+ When a lookup wants an inode, we need to load the inode data block and
+ see if it has my_inode.  If it does, we insert that inode in to the
+ hash table.  If not we fall back to regular inode creation....
+
+ On reflection, that is too complicated and hard and error prone.
+ When relocating a file we need the data so it had best be in the page
+ cache so the filesystem really needs to know that the inode is still
+ active.
+ So cleaning needs to keep a reference to the inode.
+ The cost of this is that if an inode is being deleted while it is
+ being cleaned the truncate cannot happen until the cleaning
+ completes.  This means that space usage will be wrong.
+ When nlink becomes zero we can drop the cleaner reference.  When
+ the inode is dropped/destroyed we can tie the cleaning in with the
+ delayed truncate so that the final destruction doesn't happen until
+ the cleaner has let go.
+
+ So: how to track that the cleaner has a reference to the inode?
+ Maybe every B_Realloc block owns a ref on the inode.... but dropping
+ those references when i_nlink hits zero would be difficult.
+ They could hold a secondary refcount which, if non-zero, implies a
+ ref on the inode.
+
+ So:
+  - Set B_Cleaning when we look at a block for cleaning, and clear
+    it when we find Realloc clear and ....????
+  - Whenever a block has B_Cleaning set, it holds a counted reference
+    on LAFSI(b->inode)->cleaner_ref
+  - When cleaner_ref is non-zero and I_Deleting is not set, we hold
+    a reference on the inode (i_grab).
+  - when i_nlink hits zero, set I_Deleting and drop any reference
+    held by the cleaner.
+ DONE - cleaner must be careful not to process any block that has been
+    truncated, or file that is dead.
+ DONE - Make sure the cleaner doesn't start up after the FinalCheckpoint.
+  - What about filesystem inodes... how do they fit in??
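+
+  A minimal sketch of the cleaner_ref rules listed above, as a plain
+  Python model rather than kernel code.  The field cleaner_holds and
+  the helper names are hypothetical; i_count stands in for the
+  igrab/iput reference.

```python
from dataclasses import dataclass


@dataclass
class Inode:
    i_count: int = 1             # stand-in for the VFS reference count
    i_nlink: int = 1
    cleaner_ref: int = 0         # number of blocks with B_Cleaning set
    deleting: bool = False       # stand-in for I_Deleting
    cleaner_holds: bool = False  # hypothetical: cleaner_ref currently owns an inode ref


def _revalidate(ino):
    """Hold one inode ref exactly while cleaner_ref > 0 and not deleting."""
    want = ino.cleaner_ref > 0 and not ino.deleting
    if want and not ino.cleaner_holds:
        ino.i_count += 1         # models igrab
        ino.cleaner_holds = True
    elif not want and ino.cleaner_holds:
        ino.i_count -= 1         # models iput
        ino.cleaner_holds = False


def set_cleaning(ino):
    ino.cleaner_ref += 1
    _revalidate(ino)


def clear_cleaning(ino):
    ino.cleaner_ref -= 1
    _revalidate(ino)


def nlink_hits_zero(ino):
    ino.i_nlink = 0
    ino.deleting = True          # the cleaner's inode ref is dropped here
    _revalidate(ino)
```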
+
+
+  Question. When are the index blocks for an inode flushed?
+  We need to have them gone when the inode disappears.
+  For deleted inodes, this happens in background truncate.
+  For memory-pressure inodes it will hopefully happen well in advance,
+  but we need to make sure in destroy_inode that everything is
+  written. - FIXME
+
+
+  Thinking again about B_Cleaning, any B_Realloc block will hold a
+  reference through to InoIdx and so dblock will be present and the
+  inode won't be freed.  So we only need an extra reference during
+  the first little phase of cleaning when we are collecting blocks.
+  After that a reference can be useful as it will delay flushing so it
+  can be more efficient...
+
+  Maybe this is all much simpler than I thought.
+  If we hold a ref on the inode whenever the InoIdx block is Pinned
+  and i_nlink is non-zero, then we won't be forgotten until all
+  index blocks are written.  We may still be deleted, but as that
+  is one-way we can hold on to the inode at little cost.
+
+  getting/putting that ref at exactly those times turns out to be
+  messy.
+  It might be best to have a flag to say "We hold an extra ref".
+  Then we occasionally call a function that validates the setting.
+  It is most important to drop the count at the right time, so
+  after unlink/rmdir/rename and when B_Pinned is dropped.
+
+  B_Pinned is set in:
+     set_phase which is called from:
+          lafs_cluster_allocated when moving 'pin' across to data block
+              so don't need checkpin
+          lafs_pin_block_ph
+              only need check_pin if dropping spinlock
+          pin_all_children
+              only pins data blocks (Index are already pinned if relevant).
+          grow_index_tree
+              where "inoidx block pinning" doesn't change
+          do_incorporate_leaf
+              No InoIdx involved
+          do_incorporate_internal
+              ditto
+   So only need check in lafs_pin_block_ph and maybe pin_all_children...
+
+08Aug2009
+  - credits get out of sync from
+      lafs_incorporate->refile->space_return from checkpoint.
+      counter is one more than we can find.
+      returning space on 
+         i [cfb9aaf0]327/0(2261)r1E:Index(1),Valid,PhysValid[0] NP
+       Note it is an Index but not InoIdx.  The parent is still in the tree.
+     That is FIXED
+
+  - and out by 8! at
+      delete_inode -> truncate -> invalidate_page->erase_dblock->space_return
+    FIXED that.
+
+  - BUG credits<0 in space_return from lafs_incorporate from add_block_address
+     from phase_flip
+Just Grew [cfbb5c70]331/0(NoPhysAddr)r2E:Index(2),Pinned,Phase1,InoIdx,Valid,Dirty,UninCredit{0,1}[0] child(1) inc(1)
+     from [cfbb5a40]331/0(0)r205F:Index(1),Pinned,Phase1,WPhase0,Valid,Dirty,CN,CNI,UninCredit,IOLock,PhysValid{0,0}[0] child(203) leaf(2)
+msg: (1,3,1)(1,1,-1)
+Credits = -1, rv=1
+ib = [cfbb5a40]331/0(0)r205F:Index(1),Pinned,Phase1,WPhase0,Valid,Dirty,CN,CNI,UninCredit,IOLock,PhysValid{0,0}[0] child(203) leaf(2)
+
+    This is a predicted but not handled problem.
+    The answer is that not all blocks need ICredit/UnincCredit.
+    The purpose of this credit is to allow for a split in the parent.
+    pre-existing index blocks can never split the parent themselves
+    If an index block becomes full, it will split and this might split
+    the parent.
+    If an index block has free space, then it will only overflow if it
+    gets multiple child updates and this will provide multiple credits.
+    So an index block with space for 3 or more new addresses does not need
+    an ICredit/UnincCredit.  So when we split we don't need to provide an
+    uninc credit.
+    In particular.
+    When we have a full InoIdx block and a single new child with 1 UnincCredit,
+    each block already is either 'Dirty' or has a 'Credit', and the InoIdx has 
+    an ICredit, then create a new intermediate such that
+        InoIdx is Dirty and has an ICredit
+        New Index is Dirty with no ICredit - it used the UnincCredit
+        New child loses its UnincCredit
+    When another block in the new index arrives, its unincredit is used to
+    provide an ICredit
+
+    When a leaf block cannot fit a single address it will have ICredit.
+    The block is split so that each has 3 spaces and so do not need ICredit,
+    but as soon as ICredit is available, they take it.
+
+    Worst case is that every ancestor is full and the leaf is split
+    We then get two full branches, each block half empty so not needing ICredit.
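+
+    A small sketch of that rule, as a Python model.  The threshold of 3
+    comes from the note above; the capacity of 8 in the usage is an
+    assumed number for illustration only, and the even split is an
+    assumption about how a full block divides.

```python
MIN_FREE_WITHOUT_CREDIT = 3  # "space for 3 or more new addresses"


def needs_icredit(capacity, used):
    """A block needs ICredit/UnincCredit only when it is nearly full."""
    return capacity - used < MIN_FREE_WITHOUT_CREDIT


def split(used):
    """Divide the entries of a full block between two blocks (assumed even split)."""
    left = used // 2
    return left, used - left
```

+    With a capacity of 8, a full block needs the credit, and after
+    split(8) each half holds 4 entries and needs none.  For very small
+    capacities the halves would still be below the threshold.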
+
+
+  Then...
+    free data being used in lafs_refile from cleaner.
+    b->inode->i_sb is 0x6b6b6b6b, so inode has been freed before cleaner frees it.
+    Answer: lafs_refile was dereferencing ->inode when it wasn't safe.
+     Need to at least have a parent before it is safe.
+
+  Hang:
+     soft lockup cleaner->lafs_iget->ifind_fast ....
+    Then (may be caused)
+Oh dear: [cfb63670]284/0(0)r1E:IOLock,PhysValid cleaning(1)
+.......: [cfb2ad20]284/0(2198)r2F:Index(1),Pinned,Phase1,InoIdx,Valid,PhysValid{0,0}[0] child(1) leaf(1)
+Why have I no credits? [cfb2ad20]284/0(2198)r2F:Index(1),Pinned,Phase1,InoIdx,Valid,Dirty,PhysValid{0,0}[0] child(1) leaf(1)
+------------[ cut here ]------------
+kernel BUG at /home/neilb/work/nfsbrick/fs/module/block.c:656!
+
+    It seems the cleaner gets confused and goes spinning.
+
+
+  So: space problems:
+    After the run, we have -14 used and 2055 available (of 4608), and
+    cannot create anything.
+    4 segments are free, one is cleanable.
+   free_blocks=2103 allocated=56 max_seg=512 clean_reserved=0
+or
+   free_blocks=1722 allocated=64 max_seg=512 clean_reserved=0
+or
+   df: tot=4608 free=4630 avail=977(1033-56) cb=10 pb=0 ab=-32
+   free_blocks=1033 allocated=56 max_seg=512 clean_reserved=0
+   and very little free
+
+  ablocks_used is going negative - why?
+   Probably we erase a dblock without clearing Prealloc.
+   Then when Prealloc later gets cleared, ablocks_used is
+   wrongly decremented.... no...
+
+
+10aug2009  (don't forget above problems)
+  Another problem.
+   read->touch_atime->dirty_inode->inode_fillblock->dirty_dblock
+     getiref_lock triggers BUG.
+   This is presumably because I have just fixed it to get the correct
+     iblock and not the iblock of the filesystem.
+
+  FIXME I hacked around this but I'm not sure the result is right.
+    The question is about when the InoIdx should be dirty and when
+    the inode data block should be dirty.
+   In this particular case we are writing a page of a small file.
+     cluster_allocate calls flush_data_to_inode which tried to dirty
+     the inode dblock but finds that iblock is not pinned...
+     When we dirty a data page we aren't pinning the parent!
+   That might be OK - we only need to count and reserve the parent.
+    We don't need to pin it until it becomes dirty.
+
+   Still need to resolve when which block gets to be dirty, and also
+    exactly when an index block needs to be pinned.  And how does that
+    relate to holding a ref on the inode when the inoidx is pinned.
+    Maybe it should be when the inoidx is referenced.
+   FIXME
+
+11aug2009
+   Another problem. unlink->handle_orphans->erase_dblock->allocated_block
+    and get a zero from lafs_add_block_address but parent is not pinned.
+  And... On unmount, orphan file still has pinned blocks so the inode
+    isn't free.
+  And ... root still old phase after lots of 'rm' then sync.
+    Inode 244 has pinned inode block held by writepage0 and writepage
+         this is adir/170
+
+13aug2009
+  - lots of bugs introduced by change to marking inode blocks dirty:
+     writepage/cluster_allocate wants to Dirty inode data block with no credits.
+         because I put credit in iblock!
+
+  - ohhh.... The phase contour is broken.  When a block is added to a
+    cluster for allocation it isn't in the phaseleafs any more, but prevents
+    its parent from joining.  So we cannot assume that if dblock is on
+    list then iblock or a child will be too.
+    So when we find dblock we do need to remove it.... done that.
+
+  - root not changing because Data 1/0 is Pinned and IOPending
+     and held by writepage!!
+     Problem is that IOPending blocks aren't put back on lru.
+     But that should only be blocks on the cluster list.....
+     But that is where I am putting it.
+     Maybe I need exclusion between checkpointing and any other
+       code that writes to checkpoint so checkpoint can wait
+       for that ... can we use wc->lock??  That doesn't lock
+       against cleaner, but that isn't a problem...
+   But now 0/228 is still pinned and in writepage and IOPending
+    So there is more to it than that.
+    When checkpoint finds an IOLocked block, it might be about to
+     join a cluster, in which case we don't really want to wait, or it
+     might be undergoing incorporation in which case we want to wait.
+     or it could be being erased, so wait..
+     Maybe I wait until it appears on some list.... yes.
+
+14aug2009
+    At unmount Index 8/0 with child and leaf is still pinned
+  This was pinned: [cfb29810]8/0(9)r3F:Index(1),Pinned,Phase0,InoIdx,Valid,SegRef,C,CI,CN,CNI,PhysValid{0,0}[0] release(1) child(1) leaf(1)
+
+  and..
+
+  A problem is that something goes wrong in the erase process.
+  We find new children after we erase the inoidx block!
+
+  This was pinned: [cfb3d810]8/0(9)r284016F:Index(1),Pinned,Phase0,InoIdx,Valid,SegRef,C,CI,CN,CNI,PhysValid{0,0}[0] release(1) child(1) leaf(284014)
+
+  When/how do we erase indexblock and particularly inoidx blocks?
+  Does an inValid InoIdx simply mean there is no indexing and does not
+  reflect on the Data block?
+
+.xlooping on [cfbe28c0]331/0(0)r2F:Index(1),Pinned,Phase1,InoIdx,Valid,Dirty,SegRef,C,CI,CN,CNI,UninCredit,PhysValid{0,0}[0] inode_handle_orphan(1) leaf(1)
+
+ Orphan problem:
+nextfree = 0
+reserved = 0
+VFS: Busy inodes after unmount of hdb. Self-destruct in 5 seconds.  Have a nice day...
+This was pinned: [cfb2ad20]0/0(13)r5F:Index(1),Pinned,Phase1,InoIdx,Valid,Root,C,CI,CN,CNI,PhysValid{0,0}[0] NP release(1) child(3) leaf(1)
+[cfb2ad20]0/0(13)r5F:Index(1),Pinned,Phase1,InoIdx,Valid,Root,C,CI,CN,CNI,PhysValid{0,0}[0] NP release(1) child(3) leaf(1) Leaf1(0) 
+  [cfa5374c]0/1(772)r0E:Valid,Dirty,UninCredit,PhysValid
+  [cfb54430]0/8(775)r0E:Valid,Dirty,SegRef,UninCredit,PhysValid
+  [cfb54c90]0/16(777)r0E:Valid,Dirty,SegRef,C,CI,UninCredit,PhysValid
+
+nextfree = 1
+reserved = 0
+  0: 1 0 0 304
+VFS: Busy inodes after unmount of hdb. Self-destruct in 5 seconds.  Have a nice day...
+This was pinned: [cfb273b0]8/0(9)r3F:Index(1),Pinned,Phase1,InoIdx,Valid,SegRef,C,CI,CN,CNI,PhysValid{0,0}[0] release(1) child(1) leaf(1)
+[cfb273b0]8/0(9)r3F:Index(1),Pinned,Phase1,InoIdx,Valid,SegRef,C,CI,CN,CNI,PhysValid{0,0}[0] release(1) child(1) leaf(1) Leaf1(0) 
+  [cfb74c90]8/0(2404)r1E:Valid,SegRef,CN,PhysValid orphan(1)
+badcnt 0 0 [cfb268c0]0/0(13)r4E:Index(1),Pinned,Phase1,InoIdx,Valid,Root,C,CI,CN,CNI,PhysValid{0,1}[0] NP child(4)
+
+
+erase Inoidx [ce5ab3b0]172/0(402)r1F:Index(1),InoIdx,Valid,IOLock,OnFree,PhysValid[0] inode_handle_orphan(1)
+erase Inoidx [ce5ab5e0]74/0(0)r2F:Index(1),Pinned,Phase0,WPhase0,InoIdx,Valid,Dirty,SegRef,C,CI,CN,CNI,UninCredit,IOLock,PhysValid{0,0}[0] child(1) leaf(1)
+------------[ cut here ]------------
+WARNING: at /home/neilb/work/nfsbrick/fs/module/block.c:579 lafs_erase_iblock+0x
+unlink/orphan/erase_dblock_allocated_block
+---[ end trace 61b8bd59512ea4da ]---
+zz [ce50d6a8]74/1059005010(0)r1E:SegRef,C,CI,UninCredit,IOLock,PhysValid,Orphan(0) orphan(1)
+   [ce5ab5e0]74/0(0)r1E:Index(1),InoIdx,SegRef,C,CI,CN,CNI,UninCredit,PhysValid[0] child(1)
+   [ce5ab5e0]74/0(0)r1E:Index(1),InoIdx,SegRef,C,CI,CN,CNI,UninCredit,PhysValid[0] child(1)
+------------[ cut here ]------------
+kernel BUG at /home/neilb/work/nfsbrick/fs/module/index.c:1955!
+
+BINGO.  When we remove last entry from directory we erase the InoIdx block,
+ then when we add entries, we hit problems.
+
+
+nextfree = 3
+reserved = 0
+  0: 1 0 0 306
+  1: 1 0 0 307
+  2: 1 0 0 74
+VFS: Busy inodes after unmount of hdb. Self-destruct in 5 seconds.  Have a nice day...
+
+This was pinned: [cfb2caf0]16/0(10)r3F:Index(1),Pinned,Phase1,InoIdx,Valid,SegRef,C,CI,CN,CNI,PhysValid{0,0}[0] release(1) child(1) leaf(1)
+[cfb2caf0]16/0(10)r3F:Index(1),Pinned,Phase1,InoIdx,Valid,SegRef,C,CI,CN,CNI,PhysValid{0,0}[0] release(1) child(1) leaf(1) Leaf1(0) 
+  [ce9794f0]16/0(2200)r1E:Valid,SegRef,CN,CNI,PhysValid cleaning(1)
+
+This was pinned: [cfb2d3b0]8/0(9)r3F:Index(1),Pinned,Phase1,InoIdx,Valid,SegRef,C,CI,CN,CNI,PhysValid{0,0}[0] release(1) child(1) leaf(1)
+[cfb2d3b0]8/0(9)r3F:Index(1),Pinned,Phase1,InoIdx,Valid,SegRef,C,CI,CN,CNI,PhysValid{0,0}[0] release(1) child(1) leaf(1) Leaf1(0) 
+  [ce968c90]8/0(2175)r3E:Valid,SegRef,C,CI,CN,CNI,PhysValid orphan(3)
+
+This was pinned: [cfb2d180]1/0(8)r3F:Index(1),Pinned,Phase1,InoIdx,Valid,SegRef,C,CI,CN,CNI,PhysValid{0,0}[0] release(1) child(1) leaf(1)
+[cfb2d180]1/0(8)r3F:Index(1),Pinned,Phase1,InoIdx,Valid,SegRef,C,CI,CN,CNI,PhysValid{0,0}[0] release(1) child(1) leaf(1) Leaf1(0) 
+  [ce968430]1/0(2174)r1E:Valid,SegRef,C,CI,PhysValid cleaning(1)
+
+We have stray 'cleaning' references.
+It is taken -
+   on a data block that was in a to-clean segment
+     at which point we igrab the inode
+     the block is put on the ->cleaning list.
+It is put:
+   when we get an error finding the block
+   when we find that it isn't in the segment
+   when an error occurs loading the block-to-be-relocated
+   and when we mark that block for cleaning.
+  i.e. always unless we got EAGAIN or some space error.
+   If we still hold some blocks, try_clean returns 0.
+
+VFS: Busy inodes after unmount of hdb. Self-destruct in 5 seconds.  Have a nice day...
+This was pinned: [cfb253b0]0/0(13)r5F:Index(1),Pinned,Phase1,InoIdx,Valid,Root,C,CI,CN,CNI,PhysValid{0,0}[0] NP release(1) child(3) leaf(1)
+[cfb253b0]0/0(13)r5F:Index(1),Pinned,Phase1,InoIdx,Valid,Root,C,CI,CN,CNI,PhysValid{0,0}[0] NP release(1) child(3) leaf(1) Leaf1(0) 
+  [cfa57b7c]0/1(2228)r0E:Valid,Dirty,UninCredit,PhysValid
+  [ce5a4430]0/8(2231)r0E:Valid,Dirty,UninCredit,PhysValid
+  [ce5a4c90]0/16(1028)r0E:Valid,Dirty,SegRef,C,CI,UninCredit,PhysValid
+
+NOTE these inode data blocks are not pinned and so did not get written!!
+
+FIXME I should wait for the checkpoint to finish
+nextfree = 1
+reserved = 0
+  0: 1 0 0 301
+VFS: Busy inodes after unmount of hdb. Self-destruct in 5 seconds.  Have a nice day...
+This was pinned: [cfb27810]8/0(9)r3F:Index(1),Pinned,Phase0,InoIdx,Valid,SegRef,C,CI,CN,CNI,PhysValid{0,0}[0] release(1) child(1) leaf(1)
+[cfb27810]8/0(9)r3F:Index(1),Pinned,Phase0,InoIdx,Valid,SegRef,C,CI,CN,CNI,PhysValid{0,0}[0] release(1) child(1) leaf(1) Leaf0(0) 
+  [ce564c90]8/0(3983)r1E:Valid,SegRef,PhysValid orphan(1)
+
+16Aug2009
+  When I clean and find an inode that is already deleted, I need to be
+  very careful not to resurrect anything.. I wonder if I am.... Yes, I seem
+  to be.  lafs_delete_inode gets called a lot, but mostly for dead inodes.
+
+  BUGS:
+FIXED orphans don't get cleaned up.  It seems a 'create' fails and leaves
+      an orphan block un-released.
+   - sometimes inodes 1,8,16 didn't get written out as they were dirty but not pinned
+   - Not sure that we handle complete truncation, then adding blocks properly.
+     - what should the state of the InoIdx block be?
+   - On remount, the filesystem contains rubbish.
+   - create fails even when there should be free space.
+   - sometimes BUG in checkpoint.c - not finishing checkpoint properly...
+   - iblock not valid for inode 327 under cluster_flush/lafs_allocated_block
+          and 74 has similar issue
+     327 = adir/big1   74=adir
+
+
+17Aug2009
+  Segusage blocks aren't always Pinned when we make them dirty.
+  Yes. That is correct.  They are not forced out by phase change but by
+  lafs_seg_flush_all at the end of a checkpoint.  So they need to be
+  preallocated, but not Pinned.
+  But, once we have finished the last checkpoint we don't want to
+  dirty Segusage blocks any more.. I wonder if we are.
+  No, but we were Pinning inodes without PinPending and they
+  lost the pinning straight away!
+
+  OK, other annoyance.
+   InoIdx block and similar are getting erased at the wrong
+   time.
+   We can only safely erase them when they have no children.
+   I guess what we really want is that incorporation leaves them
+   existing but empty, and when we go to write them out, if they
+   are empty we register an address of 0.
+   When we drop the ->parent pointer of an Index block it 
+   just goes away...
+   So:
+    When incorporate or truncate produces an empty index block
+     it simply clears B_Valid.
+    When incorporate wants to add to an index block, we set B_Valid
+    When cluster_allocate gets a non-Valid index block it calls
+    block_allocated with phys of 0.
+
+    Yes, that seems to work.  Mostly
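+
+    The B_Valid rule above, sketched as a Python model.  The class and
+    function names are hypothetical stand-ins, not the real LaFS
+    interfaces; an address of 0 is used for "remove" as in the notes.

```python
class IndexBlock:
    def __init__(self):
        self.addrs = {}     # file offset -> physical address
        self.valid = False  # stand-in for B_Valid


def incorporate(ib, offset, phys):
    """Add an address (phys != 0) or remove one (phys == 0)."""
    if phys:
        ib.addrs[offset] = phys
        ib.valid = True             # adding to the block sets B_Valid
    else:
        ib.addrs.pop(offset, None)
        if not ib.addrs:
            ib.valid = False        # an empty block just clears B_Valid


def phys_to_register(ib, new_phys):
    """Address reported at write-out time: 0 for a non-Valid index block."""
    return new_phys if ib.valid else 0
```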
+
+18Aug2009
+  On remount, check_credits dies: 16/20-0
+    In pin_dblock/reserve/seg_ref/prealloc/space_return ?? in lafs_mount.
+
+19Aug2009
+  OK, this index block clearing is a mess.  There must be a neat model I can
+  follow that will make it "just work".
+  The key seems to be children.  If an index block has children, then it
+  really must exist.  If it has no children and no content, then it can
+  be discarded, in which case it needs to be unlinked from its sibling list.
+  What locking do we use here?  Probably IOLock on the parent index block.
+  So we need iolock while looking in a parent for children, and we take
+  IOLock while incorporating or pruning.
+  Once the empty index block has dropped out it will never be found again.
+  When we incorporate the zero address, the index block becomes invisible
+  unless it is shortly after its predecessor in the sibling list.  But
+  that is hard to ensure, especially if the first child is the one that
+  is being erased.  So if an index block is erased, then it must be
+  discarded quickly and any children need to be relocated...
+  Or maybe not.... maybe if there are children, we just write an empty block?
+
+22Aug2009
+  We need better locking of the index information.
+  It seems best to use IOLock as that is already held during incorporation.
+  So any code that accesses or updates an index block must hold IOLock.
+  This might be a bit of a restriction if we try to do a lookup while
+  writeout is happening.... Maybe we need a separate writeback flag for that.
+  But I think it is good to use IOLock for now.
+  Places we need this are:
+     flush_data_to_inode needs to lock the InoIdx block
+       - DONE
+     lafs_leaf_find as it recurses down.  This should return a locked leaf.
+       - DONE
+     callers of clear_index
+         erase_dblock for depth=0??
+       - DONE
+     incorporate should lock new blocks for consistency 
+       - DONE
+
+   Locking dependency rule is that if we hold a lock, we are allowed to
+   lock a child index block, but not a parent.  If we hold a data block,
+   we are allowed to lock an index block.
+
+
+  The read/write completion seems all wrong.  It unlocks if the page was locked,
+   and that isn't really safe, because it might not have been locked for read..
+   We need to flag block0 to say if lock or writeback need to be cleared.
+   Given that, I don't need IOPending any more:
+    Read: We submit all reads, then set 'do_unlock', then check if we should unlock.
+    Write: We queue all writes, then set 'do_clear_writeback', then check.
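+
+   A sketch of the read side of that scheme, as a single-threaded
+   Python model.  The names are hypothetical, and real code would need
+   atomic updates of the counter and flag.  The point is that an early
+   completion cannot unlock the page before all reads are submitted.

```python
class PageIO:
    def __init__(self):
        self.pending = 0        # reads submitted but not yet completed
        self.do_unlock = False  # set only after every read is submitted
        self.locked = True      # the page lock taken for the read

    def _maybe_unlock(self):
        # Unlock only when submission is finished and nothing is in flight.
        if self.do_unlock and self.pending == 0 and self.locked:
            self.locked = False

    def submit(self):
        self.pending += 1

    def complete(self):
        self.pending -= 1
        self._maybe_unlock()

    def all_submitted(self):
        self.do_unlock = True
        self._maybe_unlock()
```

+   The write side would be the same shape with 'do_clear_writeback' in
+   place of 'do_unlock'.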
+
+  Now... can we use a writeback flag to avoid waiting to read while writeout
+  is happening?  We would need:
+     set writeback in cluster_allocate
+     wait_writeback after some lock_block
+     clear_writeback when writeout finishes.
+     Extra checks where we already check for IOLock
+
+
+24aug2009
+ Lots of progress but....
+   cluster_flush calls cluster_done calls refile calls iput calls
+    drop_inode calls write_inode_now calls writepage calls cluster_flush
+  and we get a locking loop.
+   I think we need to run that cluster_done from a different thread.
+
+
+ We seem to have a refcnt problem with segsum.
+
+25aug2009
+ Lots more progress but.....
+
+  orphan_release is finding that the orphan block has no credits.
+  We can allocate credits and simply not do the update if they
+  are not available:  having an extra entry in the orphan file isn't
+  a problem.  However we need some mechanism to clean up other than
+  waiting for a remount..
+  I think we leave that until we redo orphan handling.
+
+ and: adir sometimes loses one block so it and the contents don't get
+   deleted.
+
+ and: it seems we sometimes try to clean the segment being written
+   to.  We must avoid that.
+
+ (long ago I wrote::
+  FIXME When pin fails, we need to remove PinPending from everything!!!
+ and never followed up ... I wonder?
+ )
+
+25Aug2009
+ Orphan handling.
+  Every orphan block goes on a per-fs list and gets removed only
+  if the B_Orphan bit is clear.
+  There are two times when we want to expedite orphan handling.
+  1/ on rmdir we need to know if the directory is really empty.
+     This requires that we expedite the orphan handling of all
+     blocks.  As soon as we find a non-orphan, we can give up.
+     Then we need to make sure the index tree has collapsed.  We
+     can borrow that code from truncate.
+
+  2/ When writing past Trunc_next.  We just pass the block to
+     special orphan handling.
+
+  This requires that orphan handling is re-entrant.
+  For dir, that is protected by i_mutex, but rmdir needs to come
+   in under the radar.
+  For trunc, the iolock on the index blocks should be enough.
+  I wonder if IOLock can be used on dir as well... allowing
+  parallel orphan handling in the one dir even!!.
+
+  We need to ensure exclusion of orphan handling, including:
+      - only one orphan handler at a time
+      - don't run orphan handler while still processing action
+        that makes it an orphan.
+  Maybe if we just use IOLock for that?  Does that work?  Maybe
+  but it gets messy for directories (on first attempt anyway).
+  For directories we can just use i_mutex.
+  Maybe i_mutex for files as well?
+
+27Aug2009
+  Orphan handling is going well... but not perfect.
+  I'm using IOLock to ensure exclusion for orphan handling.
+  However:
+    I'm not really implementing that on directories
+    Inodes go bad because lafs_erase_dblock needs the lock too.
+    The call from rmdir will always fail because we hold i_mutex.
+
+  Bigger problem.  I'm IOLocking inodes across checkpoints to preserve
+   Orphan status.  But that might stop the checkpoint proceeding.
+   .. so use i_mutex, not IOLock - fine.
+
+  Now... it seems I've confused myself.  Orphans don't get handled
+  immediately.  In particular, inodes should not be handled until
+  the final delete_inode.  So setting the B_Orphan flag and putting
+  on the list are two separate events.  The flag must come first,
+  but the list may come much later.  So some of that mucking around
+  with i_mutex is pointless.
+  So:
+    make_orphan makes sure it is in orphan file, sets bit, and removes
+      from list (if present).
+    add_orphan puts it on the list for handling.
+ 
+    For inodes: lafs_new_inode sets the bit and delete_inode puts on queue,
+        as does any unlink/rmdir/rename that fails.
+
+    For directories: put it on list in commit/abort.
+
+
+  And...
+    I hit the BUG where find_leaf wants an address of 0.
+      If an index block gets cleaned out it doesn't disappear
+      immediately.. there is no leaf to find in that direction.
+      We probably need to avoid non-Valid blocks or something...
+  And...
+    Orphans 0/299 to 0/329 and  0/280 are still on the list
+     but are not orphans.
+     Maybe I need to catch mutex_unlock to run the orphans??
+  And...
+    We underflow a segment through orphans at unmount.
+      We are cleaning and truncating at the same time.
+      The same block gets allocated to 0 and to 1225
+      in quick succession.
+      Problem is that we apply new address while in writeback
+      so a new lafs_allocated_block
+
+29Aug2009
+
+  Review of inodes in orphan list:
+    lafs_new_inode makes an orphan for a non-existent inode.
+    If the inode cannot be created, orphan_release is called.
+    If it can, a 'struct inode' is filled in with valid type
+    and nlink==1 (!!) and attached.  The inode will only be
+    detached when the refcnt hits 0, and the orphan list implies
+    a refcount, so if we ever find something on the orphan list
+    with a NULL my_inode, it must be very new and can be ignored.
+
+    When we find an inode block with a my_inode there are a few options:
+      if I_Trunc is set, we must progress truncation providing we can
+            get the i_mutex
+      else if I_Deleting we must delete the inode
+      else if nlink is 0, we remove from the list
+      else nlink > 0 and we must remove orphan status.
+    This means that if nlink is elevated, we need to be holding the mutex...
+    So don't elevate nlink any more...
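+
+    The options above, sketched as a Python decision function.  The
+    returned strings and the "retry" outcome for a failed i_mutex
+    attempt are hypothetical labels, not names from the code.

```python
def orphan_action(has_my_inode, i_trunc, i_deleting, nlink, have_mutex=True):
    """Pick what the orphan handler does with an inode block on the list."""
    if not has_my_inode:
        return "ignore"            # very new inode, nothing to do yet
    if i_trunc:
        # Truncation only progresses if i_mutex could be taken.
        return "truncate" if have_mutex else "retry"
    if i_deleting:
        return "delete"
    if nlink == 0:
        return "remove-from-list"  # stays an orphan, just not queued
    return "clear-orphan"          # nlink > 0: no longer an orphan
```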
+
+    When nlink becomes non-zero the block needs to be put back on the
+    orphan list (it must already be an orphan).  Also when we set
+    I_Deleting or I_Trunc it must go on the list.
+   .. OK, I think I have all of that.
+
+
+30Aug2009.
+   I have some weirdness that seems to be caused by the orphan stuff,
+   probably due to it all being async now.
+   - A deleted inode clears I_Trunc and then sets it again.  The only
+     explanation seems to be that delete_inode is being called again,
+     so I must be igrabbing it again, maybe from cleaning.
+   - bits of directories aren't getting deleted.  Sometimes single
+     blocks, though the referred files are deleted.  Sometimes
+     the whole directory... More interestingly, those blocks then
+     don't get cleaned, so something about them means that they
+     don't get deleted and don't get cleaned either.
+
+   Even weirder... I just had a case where file 331 had a different
+   index block for every 4 data blocks...
+
+
+   FIXME:
+    - What stops pinned blocks from being flushed by bdflush in middle
+      of operation and so losing allocation?  Must make sure to set
+      them dirty very late.
+    - orphan_release can fail, so must make sure we can always call
+      it, even if my_inode is NULL.... but how?
+
+
+    - make_orphan could fail due to lack of space, which is not OK.
+      I made it loop, but I'm not 100% sure that is right... it isn't.
+      I need to pass down the 'I'm freeing space' flag, and I need to
+      not require that Credit or Dirty is set, etc.
+
+
+    - I seem to have a deadlock at unmount.
+       umount is waiting for lafs_checkpoint_lock_wait in
+          lafs_put_super
+       pdflush is in down_read in sync_supers
+       lafs_cleaner is iget_locked/ifind_fast/inode_wait
+                This is waiting for I_LOCK to be clear.
+      
+
+31Aug2009
+  - When a file shrinks and becomes level-0, make sure
+    old addresses get deallocated.  I seem to have
+    a directory where they didn't.
+
+  - Due to the fact that we over-preallocate, we really shouldn't
+    return ENOSPC until we have flushed dirty data and performed
+    a checkpoint??
+
+
+  - When I removed the last index from an inode
+    (Indirect type) it seems that I didn't write
+    out the corrected block..??
+
+1sep2009
+ I ran my simple test run repeatedly overnight.
+ It ran 208 times before I stopped it.
+ There are 3 possible failure modes:
+   1/ didn't complete within 500 seconds
+   2/ triggered a BUG
+   3/ appeared to complete, but the number of blocks
+      in use was not the correct '7'.
+
+ 74 (35%) did not fail!
+ 31 () did not complete
+ 40 () triggered a BUG
+ 2 did not complete but did not trigger a bug
+
+ 94 of those that failed did not have a BUG
+ 92 actually completed.  Of these:
+      1 final blocks 1
+      1 final blocks 110
+      1 final blocks 23
+      2 final blocks 12
+      5 final blocks 0
+      6 final blocks 10
+     11 final blocks 8
+     21 final blocks 11
+     44 final blocks 9
+
+ of the BUGs,
+       1 BUG: sleeping function called from invalid context at kernel/nsproxy.c:217
+      1 BUG: spinlock lockup on CPU#0, rm/1330, cfb2dae4
+      1 kernel BUG at /home/neilb/work/nfsbrick/fs/module/block.c:485!
+      1 kernel BUG at /home/neilb/work/nfsbrick/fs/module/cluster.c:1219!
+      1 kernel BUG at /home/neilb/work/nfsbrick/fs/module/inode.c:821!
+      2 BUG: soft lockup - CPU#0 stuck for 61s! [lafs_cleaner:1177]
+      3 kernel BUG at /home/neilb/work/nfsbrick/fs/module/segments.c:1028!
+      3 kernel BUG at /home/neilb/work/nfsbrick/fs/module/segments.c:351!
+      5 kernel BUG at /home/neilb/work/nfsbrick/fs/module/lafs.h:276!
+      6 kernel BUG at /home/neilb/work/nfsbrick/fs/module/block.c:529!
+      7 BUG: unable to handle kernel paging request at 6b6b6bfb
+     11 kernel BUG at /home/neilb/work/nfsbrick/fs/module/super.c:655!
+
+
+ super.c:655 is "block is still pinned" at unmount time.
+  The block was always an InoIdx with a child.
+  Either inode 0 or 16.
+  child is held by various things:
+      [cfb555cc]16/1(2098)r131E:Valid,Async,SegRef,CN,CNI,UninCredit,PhysValid async(1) clean2(130)
+      [cfb554f0]16/0(1050)r25E:Valid,SegRef,CN,CNI,PhysValid clean2(25)
+      [cfa57c58]0/2(3676)r0E:Valid,Dirty,UninCredit,PhysValid
+      [cfa5bc58]0/2(3110)r0E:Valid,Dirty,UninCredit,PhysValid
+      [ce5b94f0]16/0(519)r1E:Valid,Async,SegRef,CN,CNI,PhysValid async(1)
+      [cfb4d4f0]16/0(4249)r1E:Valid,Async,SegRef,CN,CNI,PhysValid async(1)
+      [ce5ad4f0]16/0(612)r1E:Valid,Async,SegRef,CN,CNI,PhysValid async(1)
+      [ce5c2fc8]0/74(0)r129E:SegRef,C,Claimed,PhysValid clean2(129)
+      [cfa57c58]0/2(1895)r0E:Valid,Dirty,UninCredit,PhysValid
+      [cfb4d5cc]16/1(4543)r105E:Valid,SegRef,CN,CNI,UninCredit,PhysValid clean2(105)
+      [ce5754f0]16/0(1290)r178E:Valid,SegRef,CN,CNI,PhysValid clean2(178)
+
+ The "unable to handle kernel paging request" is always in
+ umount.
+     invalidate_inode_buffers(26/46)/lock_acquire
+
+
+ block.c:529
+    This is iblock valid when erasing a block
+    The block we are erasing is always 0/327 or 0/328.  It is
+    an orphan we are handling, iolocked but not always pinned
+
+ lafs.h:276
+    Map an iblock which is not IOLocked
+       always in lafs_clear_index for the InoIdx block for a directory
+       which is in Writeback.
+       Call is in lafs_allocated_block from cluster_flush.
+
+ segments.c:351
+    seg_inc reduces seg usage below 0
+      - lots of blocks (inode 327) that were cleaned, were then erased twice.
+      - 2 blocks (inode 328) were erased twice, both from prune
+      - ditto
+
+ segments.c: 1028
+     The free list is empty.... odd as only first segment is currently
+     in use.
+
+ soft lockup:
+     Still orphan: 0/328  Index(1) is in Writeback and Dirty
+       again inode_handle_orphan2 is in Writeback
+
+ inode.c:821
+     inode_handle_orphan at end, child list is not empty.
+       The children seem to be in Realloc - cleaner needs to let go.
+
+ cluster.c:1219
+     my_inode is null while cluster_flush is flushing an inode and wants to set
+        WritePhase.
+
+
+ block.c:485
+     no ICredit for unincredit in dirty_dblock from dir_delete_commit
+     from lafs_unlock.
+
+
+ spinlock lockup is subsequent to real bug
+ ditto for sleeping function.
+
+ Of the '44' which claimed final blocks of 9, 14 really had 7, and 4
+ appear to have other strange values....
+
+ A selected '9' has two extra blocks for the directory '74'.
+ But that directory is long gone.
+ These dir blocks are currently fully populated with numbers.
+ This seems to be the pattern with all non-7 blocks.
+
+
+ 02Sep2009
+  Found a problem, possibly related to the dir blocks not being
+  cleaned up.
+  When lafs_incorporate sets ->depth to 1 it doesn't dirty the inode,
+  so that fact is never copied in to the datablock.
+  On further exploration, the I_Dirty bit is set but never used, which
+  isn't good.
+  So: exactly when do we copy inode into datablock, and what do we do
+  when dirty_inode is called (if anything).
+  We could just set I_Dirty when dirty_inode is called, checking that
+  the block is Pinned which it usually will be.
+  Then we copy inode to data just before writing data block.
+  However that defeats transactional properties.  We need to copy in the
+  same transaction, and that means either straight away, or when
+  the data block's phase changes.
+  So dirty_inode either copies to the block, or sets I_Dirty.
+  When lafs_refile unpins an inode data block, it needs to check
+  I_Dirty and possibly re-dirty it.
+
+  To redirty it we must steal the NCredits.  Any further dirty attempt
+  will have to allocate more.
+  The stealing is done automatically by dirty_dblock, so we just flip
+  the phase and call dirty_inode ... making sure it doesn't try to
+  prealloc too hard.
+
+  Need to review when inodes get dirtied.
+    - commit_write only sets I_Dirty !
+
+    We call lafs_dirty_inode:
+      dir_create_commit - a child of inode is PinPending
+      lafs_create - ditto
+      lafs_link - before dir_create_commit
+      lafs_unlink, lafs_rmdir - data block is pinned
+      lafs_symlink - before create_commit
+      lafs_mkdir - before create_commit, or block pinned
+      lafs_mknod - before create_commit
+      lafs_rename - (moved to) before create_commit/update_commit
+                     or data block is pinned
+      lafs_dir_handle_orphan - (assured that) child is pinned.
+      choose_free_inum - child is pinned
+      lafs_incorporate - block is pinned
+
+    So either the data block is pinned, or the index block is pinned.
+    In either case it is OK to set something to Dirty.
+
+    (the new) lafs_dirty_vfs_inode gets called by mark_inode_dirty{,_sync}
+    this is called from:
+        inode_inc_link_count
+       inode_dec_link_count
+       ..various quota ops...
+       inode_setattr
+       __set_page_dirty (Which we don't use)
+       other buffer stuff
+       other quota stuff we won't use
+       touch_atime
+       file_update_time
+       page_symlink
+
+    only the time updates are interesting.  Others we have locking
+    for.
+    file_update_time is called from generic_file_aio_write_nolock etc
+    before ->prepare_write/->commit_write.  So they can pick up the
+    change.
+    Similarly before set_page_dirty is called.
+    touch_atime is called from do_follow_link and readlink and
+    file_accessed which is called all over the place.
+
+    So what to do?
+    If block is pinned, then dirty it to ensure writeout.
+    If not, don't.  But copy data in any case.
+
+
+4sep2009
+
+    OK, I've decided that I don't like clearing B_Valid when an index
+    block contains no indexes.  The final straw was that I seemed
+    to need to initialise the index block when I didn't hold IOLock.
+    That was probably fixable, but I'm sure more problems were coming.
+
+    So: what to do instead?
+    One issue that must be resolved is that an index block can still
+    have valid children even when it becomes empty.
+    This can happen if we erase blocks from a file, then add them back
+    after a checkpoint, and so in the next phase.
+    The checkpoint writeout could need to show an empty index block,
+    but the next phase will see real addresses.
+    We cannot easily avoid this, so we must handle it.
+    This interacts badly with the index lookup algorithm that finds
+    the best index block currently in the parent, and then scans
+    the children.  If there is no index block in the parent, we
+    cannot find any children.
+    This could be handled by responding to an empty index block by
+    scanning all children.  But that isn't a full solution as if
+    just one index block got erased, its unincorporated siblings
+    would still be lost.
+    We could treat empty index blocks like orphans.  i.e. don't
+    discard them immediately but leave them with possibly real
+    addresses.  Then when they have no children we allocate the
+    0.
+    But we still need to ensure that index blocks off which siblings
+    have been split but not yet incorporated remain present in the
+    tree to mark the place for their siblings.
+    There is another problem.  A horizontal split could leave the
+    new block with no addresses and everything in the uninc list.
+    Nothing can be found in there.
+
+    So maybe we need to revise the lookup mechanism.
+    The goal is to find an index block that starts at or before
+    the target and contains an address at or after the target.
+    Then our search can stop.
+    In rare cases.....
+
+7sep2009
+    I thought about this more over the weekend and think I have an answer.
+    We need to treat internal and leaf index blocks somewhat differently.
+
+    An internal index block must never be empty (while unlocked).
+    Any child block which has not had its address incorporated must be
+    attached (simply in the sibling list) to a block which has been
+    incorporated.  This will be the block that it was split off from.
+    The uninc block needs to hold a reference so that the primary isn't
+    released.
+    When a 'primary' becomes empty it cannot be discarded, so the
+    addresses in the first dependent index block must be copied
+    across.  This is awkward for indirect blocks so they might be
+    allowed to be empty (they aren't internal so don't violate the
+    above).
+    When a horizontal split breaks a sequence of dependent blocks
+    between two parents, the second parent must be incorporated
+    immediately so that the first block in the second half of the
+    sequence is incorporated.
+    If an internal index block does become empty and it has no
+    dependent blocks to fill from, it must be invalidated immediately.
+    It cannot have any children - even in next phase - as at least one
+    would have to be incorporated and so the block would not be empty.
+    Invalidating involves allocating to address 0.
+    If index lookup finds a block with PhysValid address of 0, it
+    must look to the previous index block.  If there was none .... it
+    gets a bit complex.
+
+    Leaf index blocks can become empty, but we try to avoid it.
+    If a leaf has blocks which have been created in the next phase,
+    and others which have been deleted in this phase, it can be empty
+    but still have children.  In this case we just treat it as a real
+    index block that doesn't actually have any addresses.  We still
+    write it out even though that is a waste of space.
+
+    We have been working on the assumption that every address always
+    has a corresponding leaf index block.  It is the leaf with the
+    highest index at or below the target address.
+    However this requires that every internal index block has a child
+    with the same address as the parent.
+    Preserving this requirement when the first child of an internal
+    becomes empty requires either:
+       - loading the 'next' child and reassigning this to the start
+       - changing the address of the parent to match the first child.
+    The former requires possibly reading a block from storage.
+    The latter only involves modifying blocks that are due to be
+    written out anyway, but makes block look up slightly interesting.
+    When lookup finds an invalid block that is 'first', it needs to
+    start again from the top.
+    When incorporation creates an invalid block that is first, it
+    needs to walk down from the top and any index block at the same
+    address needs to be relocated/rehashed.  If the block is
+    incorporated, the incorporated address needs to be updated.
+    So:
+     - flag for unincorporated index blocks which implies a reference
+       on primary
+     - after split, immediately incorporate second block
+     - change lookup to retry when finding invalid block
+     - When internal block becomes empty, either merge with
+       first dependent or invalidate.  If first in parent,
+       update address and parent and recurse.
+       Need some 'clever' locking here.
+       Before unlocking the invalidated block, we take i_alloc_sem,
+       then walk up the ->parent tree locking blocks as
+       required.
+       The index lookup, when it finds an invalid block will take
+       i_alloc_sem, then drop it, then start again.
+       Or maybe some other lock than i_alloc_sem...
+     - When leaf becomes empty, invalidate only if it has no children.
+       When internal leaf becomes unpinned, check if empty.
+
+21sep2009
+   That locking doesn't look like it will work, and we can never 'merge
+   with first dependent' as it is not valid to have an index block
+   where the first child is at a different address.
+   And we cannot always change the parent address, particularly if it
+   is zero - increasing it then cannot work.
+   And there is no need to load a block if we are just going to change
+   its start address (not internal index blocks anyway).
+   Let's drop the idea of relocating the parent.
+   If an internal index block becomes empty:
+     If it is last in parent, no loss, just discard
+       If parent would be empty, need to recurse up.
+     If it is not last relocate the next sibling to this location,
+      rehashing it and updating the parent.
+   If a leaf index block becomes empty we cannot just delegate to
+      next as it might be indirect... not a problem if address is
+      stored.  But that requires a format change... now might be a
+      good time!
+   
+   
+   So:
+     If we hold an index block locked and it becomes empty and we choose
+     to invalidate it, we need to ensure that doing so does not
+     break any indexing paths.
+     So we take a separate lock (i_alloc_sem??) and flag the block as invalid
+     by setting physaddr to 0 while PhysValid is set, and unlock the block. 
+     Any lookup that finds such a block must take and release i_alloc_sem,
+     and then restart from the top.
+     - If the block was not incorporated, we just remove from sibling list
+          and all is done - the space is implicitly included in
+          previous block.
+     - If the block has a different fileaddr than the parent then update
+          the parent directly, either removing the entry, or changing it to
+          point to the first unincorporated sibling (if there is one).
+          This requires taking the lock on the parent of course.  That is 
+         why we dropped the lock on the child.
+          Then all done.
+     - If the block has the same address as the parent we need to find
+          a 'next block' to relocate to the start of the parent.
+          It is either the first unincorporated sibling, or the next
+          block in the index block, or nothing, meaning the parent is
+          about to become empty. 
+        We lock the parent (still holding i_alloc_sem), and rehash the
+         chosen child.  If it doesn't exist, or is not dirty, we need
+         to update the phys address directly in the 
+          accordingly, erasing or replacing the first address.
+          Then we need to rehash the index block, but we need to lock
+          the parent for that.
+          So set a 'busy' flag on the block, unlock it, lock parent,
+          rehash, clear busy flag, and repeat.
+      - We can never relocate a block with fileaddr of zero, as the
+          InoIdx block cannot be relocated.  So leaf index block 0
+         must never be erased unless the file is empty.  So 
+
+28sep2009
+  New idea.
+  We store the start address of an indirect block in the block.
+  This means that the meaning of any index block is completely
+  independent of the location of the block, so we can change the location
+  easily and without touching the block.
+  So if a block becomes empty, we simply move the next block back to
+  fill the gap.
+  i.e. when an index block becomes truly empty (i.e. no children)
+   - if it wasn't incorporated, simply remove it
+   - if it was,
+       - if there is a dependent block, rehash it to take my address
+       - if there is a next block that is dirty, rehash it
+       - if there is a next block that is not dirty,
+          update parent to merge my entry with next, and rehash next
+          if it exists
+       - if there is no next block but we are not first, just update
+          parent
+       - if no next block and we are first, parent becomes empty,
+          recurse upwards.
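+
+  The case analysis above can be written as one decision function.  This
+  is only a sketch in Python, not the LaFS implementation: the function
+  name, its parameters and the returned action strings are all invented
+  here to mirror the list.

```python
def empty_index_action(incorporated, has_dependent=False,
                       next_state=None, is_first=False):
    """Pick the action for an index block that has become truly empty.

    next_state is None (no next block), "dirty" or "clean"; the names and
    return strings are hypothetical labels for the cases in the notes.
    """
    if not incorporated:
        return "remove"
    if has_dependent:
        return "rehash dependent to my address"
    if next_state == "dirty":
        return "rehash next"
    if next_state == "clean":
        return "merge my entry with next in parent, rehash next if loaded"
    if next_state is not None:
        raise ValueError(next_state)
    if not is_first:
        return "update parent"
    return "parent becomes empty, recurse upwards"
```

+  The order of the tests matters: a dependent block is preferred over a
+  next block, and recursion upwards is the last resort.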
+
+12Oct2009
+ - too long, I've forgotten what I was up to..
+   + I've changed the format of indirect blocks to store an address.
+   + I've handled incorporation of an empty block
+   So now internal index blocks can never be empty - they get immediately
+   unlinked if they are.
+   Leaf index blocks can be empty while they have children.  We don't
+   flag them as empty, but rather wait until another child gets incorporated.
+   But I don't think I really like that.  It is an external ugliness based
+   entirely on internal implementation details.  Empty index blocks should
+   not get written out.  We need some way to reliably find an empty index
+   block.  The address won't appear in the parent so a lookup will find the
+   previous block which we cannot link to now as it may not exist yet.
+   Worse - if first index block goes empty, we can only unlink it by moving
+   the parent to start at the next block.  That would make this index block
+   totally unfindable.
+   So I think we have to stick with writing out empty index blocks very
+   rarely.  So we need to be sure they disappear properly.
+   The difficult case is if an index block becomes empty while it has some
+   children which don't end up getting dirtied. e.g. an update aborts.
+   We need to leave the block with enough credits to be written out.
+   I guess the Ncredit should be enough...
+   Maybe worry about that later.
+
+ - what about InoIdx blocks when they become empty?  It would be helpful
+   to flag them so that inode deletion can check....
+   Maybe just set depth to 0..
+
+ ARRGGG... I've completely lost it.  I need another ITO week.
+  I just got a bug in summary.c:71!!
+
+7 Jun 2010
+ - summary.c:71.
+   ablocks_used has hit zero too soon.
+   This should be the count of blocks for which space has been allocated
+   (B_Prealloc is set) but have not been given a phys address yet - at which
+   point the usage count is moved to cblocks_used or pblocks_used.
+   The last block (which may not be the cause of the problem) does not have
+   B_Prealloc set, yet physaddr == 0.
+   The block is 0/1, so the inode for the inode usage map.  This should have
+   physaddr 8 !!
+   We did find 8, then change to 73, but then changed to 0!
+  Ahhh... recent fix exposed a subtle bug ... fixed.
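+
+   A toy model of that accounting, in Python, to make the invariant
+   explicit.  It is a sketch under assumptions: the class and method
+   names are invented, B_Prealloc is modelled as a dict key, and the
+   choice between cblocks_used and pblocks_used is left to the caller
+   because the notes do not say how it is made.

```python
class UsageCounts:
    """Toy model of the three usage counters named in the notes."""

    def __init__(self):
        self.ablocks_used = 0
        self.cblocks_used = 0
        self.pblocks_used = 0

    def prealloc(self, block):
        # Space is reserved at most once per block.
        if not block.get("prealloc"):
            block["prealloc"] = True
            self.ablocks_used += 1

    def assign_phys(self, block, physaddr, target):
        # `target` names the counter that receives the usage count.
        if target not in ("cblocks_used", "pblocks_used"):
            raise ValueError(target)
        # Moving a count that was never reserved is the "zero too soon" case.
        if not block.get("prealloc") or self.ablocks_used == 0:
            raise RuntimeError("ablocks_used would hit zero too soon")
        block["prealloc"] = False
        block["physaddr"] = physaddr
        self.ablocks_used -= 1
        setattr(self, target, getattr(self, target) + 1)
```

+   In this model a block without the prealloc flag can never be given a
+   phys address, which is the state the failing block 0/1 was found in.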
+
+ Now cluster.c:619: [ce9233f8]0/282(0)r2F:Pinned,Phase1,PinPending,CN,CNI,Claimed,PhysValid iblock(1) leaf(1)
+     cluster.c:619: [ce570a18]0/286(0)r2F:Pinned,Phase1,PinPending,SegRef,CN,CNI,Claimed,PhysValid iblock(1) leaf(1)
+     cluster.c:619: [ce588d6c]0/17(0)r2F:Pinned,Phase0,PinPending,CN,CNI,Claimed,PhysValid iblock(1) leaf(1)
+     cluster.c:619: [ce51dfe4]0/283(0)r2F:Pinned,Phase0,PinPending,CN,CNI,Claimed,PhysValid iblock(1) leaf(1)
+     cluster.c:619: [cfbb8430]0/328(0)r2F:Pinned,Phase1,PinPending,CN,CNI,Claimed,PhysValid iblock(1) leaf(1)
+   We are allocating an InoIdx block, but data block is not valid??
+
+ That isn't very reproducible so I'll have to leave it for now...
+    erasedblock had been called on the data block .. inode 17??
+
+  Problem is that I keep changing the rules.
+   I don't erase the InoIdx block any more.
+   I used to, then changed it to iolock_block/cluster_allocate->0
+
+ Problem: When all files are removed, usage is still quite high, two
+   segments have over 400 blocks (out of 512).  Cleaning keeps running and
+   not making much progress.
+  segment 6 has usage of 484.
+  'cluster 3072' shows: cluster 3072, 3085, 3086 3092
+    Inode 0:  blocks 267 272 276 
+    Inode 277: blocks 0/4 6/2
+    Inode 0: blocks 0/2 8 16
+    Inode 0: block 16 70/2 131/3 135/4 140/9 150/2 ... 296/7
+    Inode 16: 1/1
+    Inode 17: 0/28
+    Inode 283: 12/18
+          etc.
+
+  All 'old', so must be the product of cleaning, as you would expect.
+  All (most) of this has been deleted though, but count didn't drop.
+   'Count' adds to 508, plus the 4 cluster heads makes 512 - good.
+  lafs_seg_move definitely isn't being called on these blocks.
+  it is only called from lafs_summary_update
+  cblocks_used "exactly" matches the number of un-removed blocks.
+
+
+  Another problem
+bad [ce5bcf50]301/0(0)r1E:Index(0),Pinned,Phase1,WPhase0,InoIdx,Valid,SegRef,C,CN,CNI,UninCredit,IOLock,PhysValid{0,0}[0] leaf(1)
+/home/neilb/work/nfsbrick/fs/module/modify.c:1652: [ce5bcf50]301/0(0)r1E:Index(0),Pinned,Phase1,WPhase0,InoIdx,Valid,SegRef,C,CN,CNI,UninCredit,IOLock,PhysValid{0,0}[0] leaf(1)
+bad [cfbf6000]327/0(0)r1E:Index(0),Pinned,Phase1,WPhase0,InoIdx,Valid,SegRef,C,CN,CNI,UninCredit,IOLock,PhysValid{0,0}[0] leaf(1)
+/home/neilb/work/nfsbrick/fs/module/modify.c:1656: [cfbf6000]327/0(0)r1E:Index(0),Pinned,Phase1,WPhase0,InoIdx,Valid,SegRef,C,CN,CNI,UninCredit,IOLock,PhysValid{0,0}[0] leaf(1)
+bad [cfb62d20]291/0(0)r1E:Index(0),Pinned,Phase1,WPhase1,InoIdx,Valid,SegRef,C,CN,CNI,UninCredit,IOLock,PhysValid{0,0}[0] leaf(1)
+/home/neilb/work/nfsbrick/fs/module/modify.c:1656: [cfb62d20]291/0(0)r1E:Index(0),Pinned,Phase1,WPhase1,InoIdx,Valid,SegRef,C,CN,CNI,UninCredit,IOLock,PhysValid{0,0}[0] leaf(1)
+
+ and
+free_blocks=1842 allocated=449 max_seg=512 clean_reserved=0
+Want dump of usage
+
+------------[ cut here ]------------
+kernel BUG at /home/neilb/work/nfsbrick/fs/module/segments.c:1028!
+ free list is empty - that should not be.
+
+and another...
+/home/neilb/work/nfsbrick/fs/module/modify.c:1219: [ce9893b0]74/0(0)r1E:Index(1),Pinned,Phase0,WPhase1,InoIdx,Valid,Dirty,CN,CNI,UninCredit,IOLock,PhysValid{0,0}[0] leaf(1)
+/home/neilb/work/nfsbrick/fs/module/modify.c:1219: [ce5ba690]74/0(0)r1E:Index(1),Pinned,Phase1,WPhase0,InoIdx,Valid,Dirty,CN,CNI,UninCredit,IOLock,PhysValid{0,0}[0] leaf(1)
+ [<d0a57bc8>] ? lafs_get_flushable+0x131/0x191 [lafs]
+ [<d0a5856d>] ? lafs_do_checkpoint+0x1b3/0x3a2 [lafs]
+ [<d0a5fe7e>] ? cleaner+0x105/0x1426 [lafs]
+ [<c02256bf>] ? autoremove_wake_function+0x0/0x33
+ [<d0a5fd79>] ? cleaner+0x0/0x1426 [lafs]
+
+
+08Jun2010
+ Weirdness with truncating.
+ The cleaner relocates a file resulting in the InoIdx block being
+ Maybe-dirty and phys_addr == 0.
+ Then truncate doesn't prune but just incorporates, finding
+  something weird there..
+  file 278, blocks around 4100
+  seem to find 1949 instead??
+
+ Note: When a non-InoIdx block is erased we set PhysValid
+  and physaddr == 0 to record the fact because it will not be stored...
+
+modify.c:1654: [ce5b4460]327/336(16)r4F:Index(1),Pinned,Phase0,WPhase1,Valid,Async,SegRef,C,CI,CN,CNI,IOLock,PhysValid{0,0}[0] leaf(1) inode_handle_orphan2(1) async(1) inode_handle_orphan3(1)
+Async ??
+modify.c:1657: [cfb90690]327/340(787)r4F:Index(1),Pinned,Phase1,WPhase0,Valid,Async,SegRef,C,CI,CN,CNI,IOLock,PhysValid{0,0}[0] leaf(1) inode_handle_orphan2(1) async(1) inode_handle_orphan3(1)
+Still Async ... wonder what it means.
+
+- directory block got corrupted.  Maybe conversion to indexed??
+
+
+Getting bug in remove_from_index because the addr isn't
+there, possibly block is empty.  But incorporation is
+??? instant?  No it isn't.
+If an index block hasn't been incorporated it has B_PrimaryRef
+set as it holds a ref to some earlier index.
+But what if nothing is incorporated?
+
+
+Allocated [ce402230]328/340(0)r5F:Index(1),Pinned,Phase0,WPhase0,Valid,Dirty,Async,SegRef,CN,CNI,UninCredit,IOLock,PhysValid,Uninc{0,0}[0] inode_handle_orphan2(1) leaf(1) uninc(1) async(1) inode_handle_orphan3(1) -> 0
+looping on [ce402230]328/340(0)r5F:Index(1),Pinned,Phase0,Valid,Dirty,Writeback,Async,SegRef,CN,CNI,UninCredit,PhysValid,Uninc{0,0}[0] inode_handle_orphan2(1) cluster(1) uninc(1) async(1) inode_handle_orphan3(1)
+
+Then spin in a soft-lockup in lafs_inode_handle_orphan
+
+
+-----------
+ - grow_index_tree needs to do initial incorporation so things can be found.
+    just like end of do_incorporate_internal.
+   NO - cannot incorp yet as do not have phys addr.  Don't need to as
+   lafs_leaf_find explicitly handles this.
+   For truncate case we don't use the stored address, but ensure all
+   leaf indexes must be dirty (or gone) so whole tree must be
+   accessible for walking around.
+ - do_incorporate_internal needs to set B_PrimaryRef and take the ref
+ - when we remove a B_PrimaryRef without incorporating it, we need to
+   drop a ref if the *next* in the list is B_PrimaryRef
+ - need to use a constant to identify 'async' calls etc.
+ - maybe I need other iolock_block in truncate ?? to ensure it is Valid so
+   it isn't found as async....
+
+09Jun2010
+ STILL struggling with incorporation.
+ We have a premise that any file address is covered by precisely
+ one leaf index block.  Every leaf index has an implicit address
+ and it covers all addresses from there to the next leaf.  The last
+ leaf covers to EOF.
+ So there must always be a leaf at address 0.
+ This applies within the tree from an internal index block too.
+ Beneath an internal index block there must be a leaf covering every
+ address up to the next internal index block.  So there must be
+ a first.  So storing the first address is pointless.  And harmful.
+ When an index block becomes empty and disappears its coverage is
+ included in the previous block unless there is none, in which case
+ the next index block must be re-addressed.  If there is no 'next',
+ this index block must be empty and so must disappear.
+
+ BUT if we re-address an index block, we implicitly re-address the
+ first child - recursively - so we need to move/rehash them all
+ or lose them... or record where they are.  Or do lookup not by
+ addr....
+ I think just rehashing them all - with an iolock - is simple
+ and safe.  So just do that.
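+
+ The coverage premise can be modelled with a sorted list of leaf start
+ addresses.  This Python sketch is illustrative only: LeafMap and its
+ methods are invented names, and it models a single level starting at
+ address 0, not the real index tree or its hashing.

```python
import bisect


class LeafMap:
    """Each leaf covers from its start address up to the next leaf (or EOF)."""

    def __init__(self, starts):
        self.starts = sorted(starts)
        # There must always be a leaf at address 0.
        assert self.starts and self.starts[0] == 0

    def find(self, addr):
        # The covering leaf has the highest start at or below addr.
        i = bisect.bisect_right(self.starts, addr) - 1
        return self.starts[i]

    def remove(self, start):
        """Drop an empty leaf; return (old, new) if the next leaf is re-addressed."""
        i = self.starts.index(start)
        if i > 0:
            # Coverage is implicitly included in the previous leaf.
            del self.starts[i]
            return None
        if len(self.starts) == 1:
            raise ValueError("no next leaf: the whole index is empty")
        # No previous leaf, so the next one is re-addressed to fill the gap.
        old = self.starts[1]
        del self.starts[0]
        self.starts[0] = start
        return (old, start)
```

+ A returned (old, new) pair marks exactly the case where the children of
+ the re-addressed block would need to be rehashed.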
+
+
+ So:  I cleaned up index handling at truncation somewhat.
+  Now running looptest to see what patterns emerge:
+
+  block.c:197 (*9+1) During umount, the Root datablock is
+        Dirty+Realloc
+        Maybe just need for cleaner to become inactive
+        during umount - hope that doesn't deadlock
+        didn't even work...
+  block.c:529 (*4+1)  erase dblock while iblock depth > 0
+        When pruning InoIdx we want to set depth to 0.
+        FIXME is this really what I want, or is depth=0
+        only for data-inode ... FIXME
+  cluster.c:533 (*2) cluster_allocate on invalid block
+          Block is 8/0 in writepage from sync_inodes
+          This is the orphan file.
+                   blocks aren't dirty
+          I guess the file gets truncated while we wait for it.
+          Just need to re-test.
+  index.c:1936 (*2).   An index block is Root - FIXED??
+  modify.c:1056 - secondary bug, ignore for now.
+  modify.c:1650 update_index fails to find target.
+              second call, phys==0
+              Code was bad ... may not be the cause though.
+  modify.c:1696 (*4) lafs_incorporate gets non-dirty Index(1) block
+                   from orphan handler.
+                Maybe just change the do/while back to 'do'.
+  modify.c:1704: (*2) lafs_inc gets leaf with uninc list???
+               Index(0)/InoIdx
+               in do_checkpoint
+               uninc list gets set in lafs_add_block_address (parent of iblk),
+               do_incorporate_internal,
+               Maybe the InoIdx still had children.
+  segments.c:1028.  (*4) The free list becomes empty.
+  super.c:655 (*3)   Busy inodes after umount, and root InoIdx block
+         is still pinned as inode 16 data block was still dirty.
+         segusage slow.  Maybe same as block.c:197 ??
+  invalid address 6b6b6bfb: invalidate_inode_buffers in shutdown
+          finds invalid lock.
+          presumably the inode was freed before being invalidated.
+  spin on writeback during truncate (r3a) 8 times. now 10
+        Probably because writeback cannot proceed while
+        orphan processing keeps looping.
+  kmalloc-1024 problems - (*2)
+          A block - should be start of page - isn't what it appears...
+
+ Others complete with 'cb' ranging from 202 to 715
+
+
+10 June 2010
+
+ Looking at segments.c:1028
+  We run a seg_scan every checkpoint, so that should keep free segments
+  in the list.....
+  Ahh.. do_checkpoint is looping because root isn't changing phase.
+
+  Lowest block pinned to old phase is 
+  [cfb7df08]0/74(4253)r0E:Pinned,Phase1,WPhase0,Valid,Dirty,CN,CNI,UninCredit,IOLock,Claimed,PhysValid
+  which is not on leaf list because it has IOLock
+  With more debugging:
+  [ce5c5f08]0/74(4250)r0E:Pinned,Phase1,WPhase0,Valid,Dirty,Realloc,SegRef,CN,CNI,UninCredit,IOLock,Claimed,PhysValid</io.c:368>
+  or better (that was in lafs_iolock_written)
+  [ce5c05e8]0/74(4257)r0E:Pinned,Phase0,WPhase0,Valid,Realloc,SegRef,C,CN,CNI,UninCredit,IOLock,Claimed,PhysValid</file.c:247>
+  FIXED - I didn't unlock if it wasn't dirty any more.
+  Well almost - it occurs much less now.
+  Out of 48 runs:
+      8 BUG: soft lockup - CPU#0 stuck for 61s! [lafs_cleaner:1180]
+      1 BUG: unable to handle kernel NULL pointer dereference at 000001b4
+      2 BUG: unable to handle kernel paging request at 6b6b6bfb
+      3 kernel BUG at /home/neilb/work/nfsbrick/fs/module/block.c:197!
+      2 kernel BUG at /home/neilb/work/nfsbrick/fs/module/block.c:529!6
+      1 kernel BUG at /home/neilb/work/nfsbrick/fs/module/modify.c:1650!
+      2 kernel BUG at /home/neilb/work/nfsbrick/fs/module/modify.c:1696!8
+      4 kernel BUG at /home/neilb/work/nfsbrick/fs/module/segments.c:1028!
+      1 kernel BUG at /home/neilb/work/nfsbrick/fs/module/segments.c:332!!
+
+  So we now have 1/12 rather than 2/3.
+  a/ pinned by IOLock from file.c:220 - FIXED
+  b/ as above
+  c/  Root is pinned by 4 children
+      328/0  with 196 of data blocks in writeback/realloc, in a cluster
+      0/1, 74/0, 0/8   all in a cluster waiting writeout.
+     Don't understand this.
+  d/ as a,b
+
+  Of the 48, 11 ran to completion leaving blocks from 286 to 899
+  
+
+  Looking at the loss of blocks when truncating.
+   tracing shows a small number of files with remaining blocks at delete.
+     sum is 26+22+14+272+11+2 == 347 cf df shows cb=457
+   next attempt: 14+24+26*11 =324 cf cb=1124
+   next attempt 26+6+15+68+29 == 144 cf cb=383
+   26+18+14+19+284 = 361 cf 379
+    files are (in order)
+   49    bfile       - 30K
+   325   nbfile-49   - 30K
+   320   nbfile-44   - 30K
+   296   nbfile-20   - 30K
+      ??331??
+
+11 June 2010
+
+ Thinking about truncate and index blocks becoming empty while
+ they still have children.
+ For leaf indexes, we need to leave the block in place in case
+ the children get written.  We need to find a time to ultimately
+ delete it...
+ For internal indexes,.... uhm, it just works, OK??
+
+ When I drop an uninc block, I need to remove it from the
+  uninc list, and from phase_leafs
+  clearing dirty and refiling should remove from leafs.
+
+ When we recurse to a parent, we need to remove
+ *this* block from the uninc list for said parent.
+ It should be the only thing in the list.
+ But even when we don't recurse, the fact that we have
+ incorporated means that we should tidy up the ->uninc
+ list.
+
+
+
+12 June 2010
+  unmount hung after lafs_run_orphans from lafs_put_super
+  There are two orphans in Writeback which cannot progress
+  until the current cluster is written...
+  But they keep getting re-written!
+  Another time, one orphan, index block is Dirty on a leaf ???
+  
+orph=[cfbdcf24]0/331(3780)r2E:Valid,SegRef,C,CI,CN,CNI,Claimed,PhysValid,Orphan(0) orphan_list(1) iblock(1)
+[cfb8e460]331/0(NoPhysAddr)r1F:Index(1),Pinned,Phase0,InoIdx,Valid,Dirty,SegRef,CN,CNI,UninCredit{0,0}[0] leaf(1) Leaf0(1) 
+LAFS_cluster_flush 1
+
+
+orph=[ce5c9bb4]0/327(3317)r2E:Valid,SegRef,C,CI,CN,CNI,Claimed,PhysValid,Orphan(0) iblock(1) orphan_list(1)
+[cfbe3a40]327/0(NoPhysAddr)r1F:Index(1),Pinned,Phase0,InoIdx,Valid,Dirty,SegRef,CN,CNI,UninCredit{0,0}[0] leaf(1) Leaf0(0) 
+
+ OK, problem is that when we truncate and remove an index block, the
+ next index block expands backwards to fill the space.
+ Then we apply prune_some, but don't check if anything was done.
+ We always mark it dirty, so it has to be written and then
+ we loop through again...
+ So need to check if prune_some did anything.
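+
+ A sketch of the intended fix, as a toy Python loop rather than the real
+ orphan handler.  The function names echo the notes, but the bodies, the
+ dict-based block and the `limit` parameter are assumptions made for
+ illustration.

```python
def prune_some(block, limit=2):
    """Remove up to `limit` addresses; return how many were removed."""
    n = min(limit, len(block["addrs"]))
    del block["addrs"][:n]
    return n


def handle_orphan_step(block):
    """One pass of orphan handling; returns True if more work may remain."""
    removed = prune_some(block)
    # Only dirty the block when something changed, so an already-empty
    # block is not rewritten and looped over again.
    if removed:
        block["dirty"] = True
    return removed > 0
```

+ Once prune_some reports zero removals the block stays clean and the
+ handler can stop instead of forcing another write.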
+
+TODO:
+ - prune_some needs to get more done at a time
+ - let cleaner finish up before umount
+ - use early segments first ??
+ - look at write-clusters and check OK
+ - check that df:cb= drops properly.
+
+Bugs:
+      1 BUG: spinlock lockup on CPU#0, sh/1168, c0441170  - SECONDARY BUG
+      1 BUG: unable to handle kernel NULL pointer dereference at 000001b4
+      3 BUG: unable to handle kernel paging request at 00100104
+      5 BUG: unable to handle kernel paging request at 6b6b6bfb 
+      1 BUG: unable to handle kernel paging request at 7fffffff
+      7 kernel BUG at /home/neilb/work/nfsbrick/fs/module/block.c:197!
+      9 kernel BUG at /home/neilb/work/nfsbrick/fs/module/block.c:479!
+      2 kernel BUG at /home/neilb/work/nfsbrick/fs/module/block.c:529!
+      2 kernel BUG at /home/neilb/work/nfsbrick/fs/module/block.c:67!!
+      2 kernel BUG at /home/neilb/work/nfsbrick/fs/module/inode.c:828! 
+      4 kernel BUG at /home/neilb/work/nfsbrick/fs/module/inode.c:843!
+      1 kernel BUG at /home/neilb/work/nfsbrick/fs/module/modify.c:1708! 
+      7 kernel BUG at /home/neilb/work/nfsbrick/fs/module/segments.c:1028!
+      2 kernel BUG at /home/neilb/work/nfsbrick/fs/module/segments.c:332!
+     30 kernel BUG at /home/neilb/work/nfsbrick/fs/module/super.c:655!
+
+Quite a haul there!
+
+super.c:655
+    Pinned block in lafs_release:
+         0/2 is Dirty with plenty of credits, so it is a child
+         0/16 is Dirty/Realloc, or once Async
+     Dirty, but not on a leaf list, not pinned
+
+segments.c:332
+    seg_deref with refcnt , 2 in lafs_seg_put_all
+
+segments.c:1028
+     No free segments - no real pattern.
+
+modify.c:1708
+     lafs_incorporate on non-dirty/realloc block
+       328/0 Index(1).  1 in uninc_table - probably during truncate.
+     Either we add uninc while not dirty
+     Or we clear Dirty while uninc present
+     or there is a race between the two.
+
+     Don't know:  add a bugon
+     Bugon in get_flushable didn't fire.
+
+inode.c:843
+     children present in truncate after final incorp...
+       328/0.  64 children, no uninc list.  Maybe we ran the orphans too early??
+      or invalidate_page isn't removing the children.
+      Might want print_tree here?- added that.
+     Answer: all the children are in Realloc on Clean_leafs
+       Maybe erase_page needs to disconnect from cleaner too??
+
+inode.c:828
+     Orphan handling - uninc but not dirty: is Realloc (sometimes)
+     Maybe like  mod:1708
+
+block.c:67 *
+      delref 'primary' from modify.c:2063 in the q2 branch.
+      nxt has PrimaryRef... Maybe  move earlier, but that shouldn't make a diff.
+      ditto at modify.c:2035  nxt is primary as was I, so drop mine.
+      Don't know - looks like sibling list got broken.
+      Tidied up a bit and added a print-tree.
+      v.interesting result.  Lots of consecutive index blocks all holding primary-ref
+            on single primary - which is wrong.
+      1/ When setting PrimaryRef, if next holds PrimaryRef, then must take reference
+            on self, as are being inserted into chain
+      2/ When splitting, new block must be addressed as first block which cannot
+           fit, not first block which doesn't fit.  Else incorping in reverse order
+           can make lots of tiny index blocks.
+
+block.c:529 *
+        erase with index depth > 1.
+        0/328 in orphan handling.  Still have 8 or 15 blocks registered!
+       Maybe caused by index block errors.  Added some printks.
+
+block.c:479 *
+        not enough credits to dirty block 2/0 in dir_delete_commit for unlink.
+        74/xxxx in unlink
+        16/1 in seg_inc/seg_move...allocated_block/cluster_flush
+
+        - writepage wrote the page??
+        - checkpoint wrote it and didn't replenish the credits?
+
+block.c:197 XX
+        invalidated pages finds dirty block after EOF, after iolock_written
+         0/0 Dirty/Realloc in unmount - all Realloc!
+       Need to wait for cleaner etc to finish at unmount time.
+
+NULL deref in 1b4  YY
+    cleaner->cluster_flush->count_credits->lock??
+    Trying to get a lock on an inode that has since been freed??
+       spin_lock(&dblk(b)->my_inode->i_data.private_lock);
+
+
+001001 YY
+     generic_drop_inode -- extra iput??  in lafs_inode_checkpin from refile
+6b6b6b YY
+      invalidate_inode_buffers!! in kill.  use-after-free
+
+7fffff
+    seginsert from scan_seg
+     MAX/number-elements confusion.  Worked around for now.
+
+
+18  June 2010
+After a couple of fixes:
+      1 BUG: unable to handle kernel NULL pointer dereference at 000001b4
+      1 BUG: unable to handle kernel paging request at 00100104
+      5 BUG: unable to handle kernel paging request at 6b6b6bfb
+      4 kernel BUG at /home/neilb/work/nfsbrick/fs/module/block.c:209!
+      4 kernel BUG at /home/neilb/work/nfsbrick/fs/module/block.c:496!
+      3 kernel BUG at /home/neilb/work/nfsbrick/fs/module/block.c:67!
+      2 kernel BUG at /home/neilb/work/nfsbrick/fs/module/cluster.c:531!
+     16 kernel BUG at /home/neilb/work/nfsbrick/fs/module/inode.c:601!
+      1 kernel BUG at /home/neilb/work/nfsbrick/fs/module/inode.c:852!
+                Realloc blocks confusing truncate
+      2 kernel BUG at /home/neilb/work/nfsbrick/fs/module/modify.c:118!
+      1 kernel BUG at /home/neilb/work/nfsbrick/fs/module/modify.c:1699!
+      7 kernel BUG at /home/neilb/work/nfsbrick/fs/module/segments.c:1033!
+     19 kernel BUG at /home/neilb/work/nfsbrick/fs/module/super.c:655!
+
+
+TODO:
+ - truncate gets confused by blocks being cleaned.
+   Need to flush cleaner, or just remove the blocks.
+ - when add PrimaryRef in middle of list, take the right ref.
+ - fix up wait-for-cleaner at unmount time.
+
+19 Jun 2010
+
+      3 BUG: unable to handle kernel paging request at 6b6b6bfb.
+      5 kernel BUG at /home/neilb/work/nfsbrick/fs/module/block.c:209!
+      5 kernel BUG at /home/neilb/work/nfsbrick/fs/module/index.c:1890!
+     22 kernel BUG at /home/neilb/work/nfsbrick/fs/module/inode.c:601!
+      2 kernel BUG at /home/neilb/work/nfsbrick/fs/module/inode.c:835!
+      3 kernel BUG at /home/neilb/work/nfsbrick/fs/module/inode.c:852!
+      9 kernel BUG at /home/neilb/work/nfsbrick/fs/module/segments.c:1033!
+     17 kernel BUG at /home/neilb/work/nfsbrick/fs/module/super.c:656!
+    251 SysRq : Resetting
+      3 SysRq : Show State
+
+ - We can erase a dblock while it is in the uninc_pending or 
+   uninc_next - need to be careful
+ - At umount, 0/2 is Dirty but not Pinned, so not written out
+   ditto from 0/16
+   16/0 sometimes is Async
+      16/0 Async might be from the segment scan - so wait for that.
+   Dirty but not pinned can happen when InoIdx is pinned.
+
+ - I think the uninc_next list (At least) should be sorted before
+    being allocated.
+
+ - root block dirty/realloc/leaf in final iput
+   Could be it was changed during last checkpoint so
+   pushed in to next phase?  But why Realloc?
+   Maybe still issue with losing inode data block.
+
+20 June 2010 Happy Birthday Dad!!
+
+420 runs.
+      4 BUG: unable to handle kernel paging request at 6b6b6bfb.
+     26 kernel BUG at /home/neilb/work/nfsbrick/fs/module/block.c:209!
+     87 kernel BUG at /home/neilb/work/nfsbrick/fs/module/inode.c:601!
+      1 kernel BUG at /home/neilb/work/nfsbrick/fs/module/inode.c:839!
+      4 kernel BUG at /home/neilb/work/nfsbrick/fs/module/inode.c:856!
+      1 kernel BUG at /home/neilb/work/nfsbrick/fs/module/modify.c:1719!
+     12 kernel BUG at /home/neilb/work/nfsbrick/fs/module/segments.c:1033!
+      2 kernel BUG at /home/neilb/work/nfsbrick/fs/module/super.c:656!
+
+ Problems:
+  - inode in i_sb_list has been freed.
+  - block 0/0 is dirty/realloc/leaf after final iput
+  - not all blocks freed by truncate
+  - Index block with uninc is not dirty - not FIXED: more iolock in phase_flip
+  - still children when truncate should have finished.
+    all are Realloc
+        Maybe inode has become unhashed and we re-load it??
+        it is invalid after all!!
+  - Index block not dirty when incorp - has uninc. ??
+  - didn't wait for free segments
+  - Data 16/0 is dirty but not pinned after final checkpoint - FIXED
+
+
+watch -d 'awk -f checkseg /tmp/log; echo ====== ; grep -h -E "(blocked for more|BUG|SysRq : )" /var/tmp/lafs-logs/log* | sort | uniq -c ; echo === ; ls /var/tmp/lafs-logs/log* | wc -l'
+watch -d 'echo ====== ; grep -h -E "(blocked for more|BUG|Busy inodes after|SysRq : )" /var/tmp/lafs-logs/log* | sort | uniq -c ; echo === ; ls /var/tmp/lafs-logs/log* | wc -l'
+
+
+ Unclear on dirtying index blocks.
+   We normally mark it dirty first, then add the address to the uninc list.
+   Note that this is the reverse of data blocks which are changed first, then
+   dirtied.  So maybe we should mark dirty afterwards.  We then need to
+   avoid incorporation while we are adding addresses else we might find it
+   has addresses but is not dirty.  Only try if dirty?
+   Maybe we should iolock the parent.  We need to do that anyway to flush
+   incorporations when the table is full.   Yes, that fits the VM model
+   better.  Always lock while updating and preparing to write.  Set
+   writeback once write has started, then unlock.  Cool.
+   Only a block is iolocked when we allocate (to 0), so we cannot lock the parent..
+
+21June2010
+  Apart from tracking down the remaining bugs, I need to:
+  1/ Decide on locking for incorporation and attaching new address to a block
+    and implement it.
+    In particular we need to not lose the Dirty flag before the update is done.
+  2/ Resolve handling of pinned inode data/index blocks
+  3/ Correct handling of empty index blocks, particularly when parent is in
+    different phase.  Make lookup be more careful?
+  4/ Wait for there to be enough free segments before allowing allocation.
+
+  2:  Problem is that we cannot handle a pinned inode-data block while the
+     InoIdx block is pinned in the same phase.
+     We currently unpin it so it drops off the leaf list.  But then we
+     need to re-pin it when the InoIdx is unpinned or phase_flipped, and that
+     gets ugly.  Possible though.
+     An alternate is to treat it like a parent and keep it off the list
+     while the InoIdx is pinned/same-phase.  So we would need to
+     re-assess it after unpinning or flipping the InoIdx.  That is probably
+     a lot easier than re-pinning it.
+
+  1: We would normally set 'dirty' after changing the block.  But we need
+     to differentiate Dirty from Realloc, so we set before adding addresses.
+     This requires that we are careful not to write an index block while there
+     are pending changes.  The fact that pinned children stop any writing,
+     as do pending addresses in a list should ensure this.
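+
+     A minimal sketch of rule 1, assuming a much simplified block.  The
+     struct and both helpers are hypothetical names for illustration only,
+     not the real LaFS structures: Dirty is set before an address is queued,
+     and a write is only considered once no pinned children or pending
+     addresses remain.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified view of an index block; not the LaFS struct. */
struct iblock_sketch {
    bool dirty;           /* set before any address is queued */
    int  pinned_children; /* children still pinned against this block */
    int  uninc_pending;   /* addresses queued but not yet incorporated */
};

/* A write is allowed only when the block is dirty and nothing is pending. */
static bool index_block_writable(const struct iblock_sketch *b)
{
    return b->dirty && b->pinned_children == 0 && b->uninc_pending == 0;
}

/* Queue one address: mark Dirty first, then add the address. */
static void add_address(struct iblock_sketch *b)
{
    b->dirty = true;
    b->uninc_pending++;
    /* Invariant from the notes: never "has addresses but is not dirty". */
    assert(b->dirty || b->uninc_pending == 0);
}
```

+     With this ordering a reader that sees a pending address can rely on
+     Dirty already being set, which is what lets Dirty and Realloc be told
+     apart.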
+
+  3: When an index block becomes empty we need to make sure that
+     future lookup doesn't get confused by it.  Specifically future
+     index lookup must avoid the block so nothing new gets added.
+     Possibly a previous block will split again, but this block must remain
+     unused.
+     However we cannot update the parent block immediately as it might
+     be in a different phase.
+     So we must record both "don't touch this" and "where to look instead"
+     elsewhere - in children.
+     If the block being deleted is *not* the first child in the parent,
+     then we direct index lookup to the earlier block.
+     If the block being deleted *is* the first child in the parent,
+     then redirect to the second child if there is one and we weren't just there.
+     If there is no other block we flag the parent as empty and retry
+     from the top.
+     We flag a parent as empty with B_EmptyIndex.
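+
+     The redirect rule above could be sketched roughly as follows.  Children
+     are modelled as positions 0..n-1 in the parent, and all names here
+     (parent_sketch, redirect_lookup, RETRY_FROM_TOP) are hypothetical.
+     Treating "we were just at the second child" as "no other block" is my
+     assumption about the intent.

```c
#include <assert.h>
#include <stdbool.h>

#define RETRY_FROM_TOP (-1)

/* Hypothetical parent: children are identified by position 0..nchildren-1. */
struct parent_sketch {
    int  nchildren;
    bool empty_index;   /* stands in for B_EmptyIndex on the parent */
};

/*
 * idx:       position of the child that became empty
 * came_from: position the lookup visited just before, or -1 if none
 * Returns the position to look at instead, or RETRY_FROM_TOP.
 */
static int redirect_lookup(struct parent_sketch *p, int idx, int came_from)
{
    assert(idx >= 0 && idx < p->nchildren);
    if (idx > 0)
        return idx - 1;             /* not first: use the earlier block */
    if (p->nchildren > 1 && came_from != 1)
        return 1;                   /* first: try the second child */
    p->empty_index = true;          /* nothing usable: flag the parent */
    return RETRY_FROM_TOP;
}
```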
+
+     What locks do we need to walk around the sibling list?
+     the inode private_lock is minimal, but we cannot hold that to take an
+     iolock - just to get a reference.
+     I guess we
+        - iolock the parent
+        - try to find a good block using private_lock
+        - get a ref and wait for it.
+        - check if it is still a good block.  If not, start again
+
+     If we find an EmptyIndex block, it must be directly addressed by parent.
+     It will never be followed by a PrimaryRef block because if there were
+     such a block, we would have readdressed it back and hidden the EmptyIndex.
+     So we need to look around for an address in the parent that leads to
+     a non-EmptyIndex block.
+
+     If all children are empty, we need to make the parent empty.  But
+     what if it is InoIdx?
+     Maybe I am making this too hard.  I could just use i_alloc_sem to
+     block lookups while truncate is happening.  That doesn't address
+     single block removal e.g. from directories.
+     So I need to be able to wait for incorporation to happen on an
+     empty index block.  We hold iolock on the parent.  If there are blocks
+     on ->uninc, we just process them immediately.  If there are blocks on
+     ->uninc_next, we wait for the checkpoint to complete
+
+     What does lafs_incorporate actually do with EmptyIndex blocks?
+     Provided they match currently incorporated addresses, they just cause
+     those addresses to disappear.
+
+     If a block is in the uninc list for its parent, then is phase_flipped
+     and changed and written out it could get a new physaddr before
+     it is incorporated.
+     I guess we never allocate a B_Uninc block which is in a different phase
+     to the parent.  Currently we wouldn't do that anyway except in truncate
+     though memory pressure on index blocks might one day??
+     Truncate?  We cannot allocate directly in lafs_incorporate.
+     We should get lafs_cluster_allocate to notice and DTRT.
+
+     Only hash index blocks when they are incorporated.  Not needed before then.
+     When processing an uninc list, if an address appears twice, prefer the one
+     that isn't EmptyIndex...
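+
+     A sketch of the "prefer the one that isn't EmptyIndex" rule, assuming an
+     uninc list can be treated as a flat array of (fileaddr, physaddr) pairs
+     with physaddr == 0 standing for EmptyIndex.  The struct and function
+     names are hypothetical.  Sorting phys==0 first within a fileaddr means
+     that keeping the last entry of each run picks a real address when one
+     exists.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical uninc entry; physaddr == 0 stands for an EmptyIndex block. */
struct uninc_addr {
    uint32_t fileaddr;
    uint64_t physaddr;
};

/* Order by fileaddr; within one fileaddr, phys==0 sorts first. */
static int cmp_uninc(const void *a, const void *b)
{
    const struct uninc_addr *x = a, *y = b;

    if (x->fileaddr != y->fileaddr)
        return x->fileaddr < y->fileaddr ? -1 : 1;
    if ((x->physaddr == 0) != (y->physaddr == 0))
        return x->physaddr == 0 ? -1 : 1;
    return 0;
}

/* Sort, then keep only the last entry of each fileaddr run.
 * Returns the new number of entries. */
static size_t uninc_dedupe(struct uninc_addr *v, size_t n)
{
    size_t out = 0;

    qsort(v, n, sizeof(*v), cmp_uninc);
    for (size_t i = 0; i < n; i++) {
        if (i + 1 < n && v[i + 1].fileaddr == v[i].fileaddr)
            continue;       /* a later duplicate exists and wins */
        v[out++] = v[i];
    }
    return out;
}
```

+     If two entries share a fileaddr and both have phys!=0, which one
+     survives is unspecified in this sketch (qsort is not stable).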
+
+22June2010
+    I need a clear picture of the "Steady state" for an internal index block
+    with its children.
+    The internal index block contains 1 or more addresses.  For each address there
+    may be a child index block.  If there is, it may be the head of a list of
+    blocks with B_PrimaryRef set thus holding the whole list in place until
+    incorporation happens.
+    Each of these children can be on either ->uninc_list or ->uninc_next,
+    or possibly neither if they haven't been queued for writing yet.  Any
+    PrimaryRef block will be Pinned.
+
+    When a child is incorporated and found to be Empty it is flagged as such
+    and then must never be returned by index lookup.  Index lookup will either
+    add a block to a leaf index so it doesn't appear empty, or will hit an EmptyIndex
+    block and so have to start again from the top.
+
+    When a PrimaryRef block becomes empty it is simply removed from the
+    PrimaryRef chain so it cannot be found.  The space now belongs to the
+    previous block.
+    When a non-PrimaryRef block which isn't the first becomes empty it is
+    flagged and left in place so that following blocks can be found.  The
+    address space now belongs to the previous block.
+    When the first child (fileaddr matches parent) becomes empty - what?
+      We could re-address first child but that forces early address change - 
+          old might not be incorp yet
+      We could re-address the parent, but that doesn't work for InoIdx
+      We could leave it there with physaddr == 0
+
+    Last sounds promising.  So we never re-address an index block.
+
+   So: From the top.
+
+    Index blocks, Indirect blocks, extent blocks each have an address
+    that never changes.
+    When a block becomes over-full it splits - a new block appears with
+    a new address thus implicitly limiting the address space covered
+    by the original.
+
+    When an index block becomes empty and has no pinned children it is
+    marked as EmptyIndex (under IOLock).
+    When an EmptyIndex is allocated it goes to phys==0
+    An EmptyIndex which is not first (->fileaddr != ->parent->fileaddr)
+    is never used again.  Its address space is ceded to the previous
+    index block - which could split several times...
+    An EmptyIndex which is first can be re-used.  Once it gets pinned
+    children the EmptyIndex is cleared.
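+
+    The EmptyIndex rules in this paragraph, sketched with hypothetical names
+    (idx_sketch and the helpers are not LaFS code, and the IOLock that the
+    real marking happens under is left out):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical index block; only the fields these rules mention. */
struct idx_sketch {
    uint32_t fileaddr;
    const struct idx_sketch *parent;
    bool empty_index;
    int  pinned_children;
};

/* "First" means ->fileaddr == ->parent->fileaddr. */
static bool is_first_child(const struct idx_sketch *b)
{
    return b->parent && b->fileaddr == b->parent->fileaddr;
}

/* Mark EmptyIndex once there are no addresses and no pinned children. */
static void mark_if_empty(struct idx_sketch *b, int naddrs)
{
    if (naddrs == 0 && b->pinned_children == 0)
        b->empty_index = true;
}

/* A non-first EmptyIndex is never used again; a first one can be re-used. */
static bool lookup_may_use(const struct idx_sketch *b)
{
    return !b->empty_index || is_first_child(b);
}

/* Gaining a pinned child clears EmptyIndex. */
static void pin_child(struct idx_sketch *b)
{
    assert(lookup_may_use(b));
    b->pinned_children++;
    b->empty_index = false;
}
```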
+
+    An Index block always has an entry for the first address.  It might
+    be implicit to phys==0.  Loading such a block creates an empty
+    block.
+
+    InoIdx doesn't get EmptyIndex, rather it gets ->depth=1
+
+    Indirect *doesn't* store the first address any more.
+
+    Changes:
+DONE     - remove forcestart from layoutinfo
+DONE     - remove start-address from Indirect blocks
+DONE     - only hash index blocks when they are known to be incorporated.
+DONE     - when incorporating an uninc list, ignore phys==0 if also a block with
+       same fileaddr and phys!=0.  so sort phys==0 first
+DONE     - Create EmptyIndex flag
+DONE     - Clear the flag when adding child pin to index block
+DONE     - avoid EmptyIndex non-start blocks during index lookup
+DONE     - allow index blocks to be loaded with ->phys==0
+DONE     - allow EmptyIndex index block to be "written" to phys 0
+DONE     - ensure index lookup finds implicit start address, possibly 0
+
+So now after 36 runs
+      3 kernel BUG at /home/neilb/work/nfsbrick/fs/module/index.c:1939!
+      1 kernel BUG at /home/neilb/work/nfsbrick/fs/module/index.c:403!
+     10 kernel BUG at /home/neilb/work/nfsbrick/fs/module/inode.c:605!
+     14 kernel BUG at /home/neilb/work/nfsbrick/fs/module/segments.c:1034!
+      4 kernel BUG at /home/neilb/work/nfsbrick/fs/module/segments.c:624!
+      1 kernel BUG at /home/neilb/work/nfsbrick/fs/module/super.c:657!
+      3 SysRq : Resetting
+
+
+index.c:1939
+   block 0/2 is Realloc and being allocated from cluster_flush while
+   parent is not Realloc or dirty
+   That is bad as Realloc gets set in lafs_allocated_block ... except
+    that the code was bad.  FIXED.
+
+index.c:403
+  cleaner is pinning a block (299/25) which is not Realloc,
+    and phase isn't locked.  We are only meant to pin data blocks
+    for updates while holding a phase lock.
+    Ahhh - bad code again. FIXED
+
+inode.c:605
+   Truncate doesn't clean up properly. 
+    327 has 60+1
+    331 has 108+1
+    327 has 34+1
+    327 has 60+1
+   No sign of any children.
+
+   Very weird.  Signs of incorporation going wrong.
+     Added more debugging.
+
+Found 4084 4 12 at 890
+Added 4084 4 12
+Found 4089 4 16 at 878
+Added 4089 4 16
+Found 4094 2 20 at 866
+Added 4094 2 20
+Found 2561 2 22 at 854
+Added 514 2 22
+Found 2564 4 24 at 842
+Found 2569 2 28 at 830
+Found 0 0 0 at 818
+
+Why are 2564 etc lost?  No sign of alloc-to-0
+
+segments.c:1034
+   no free segments - need to wait somewhere.
+
+segments.c:624
+   allocated_blocks has gone over free_blocks!
+   in lafs_prealloc/reserve_block/free_get/ss_put/new_segment.../checkpoint.
+   Wanted CleanSpace to reserve the youthblk
+   Maybe related to not waiting - ignore for now.
+
+super.c:657
+  block 0/2 was dirty but not pinned.  Should not happen to inodes.
+  block 0/0 was Pinned because it had a child - as above.
+
+  Maybe we don't carry the pin across when we collapse dir
+  into inode??... looks quite likely
+
+
+23 June 2010
+
+116 runs.
+      1 BUG: unable to handle kernel paging request at 6b6b6bfb
+      1 kernel BUG at /home/neilb/work/nfsbrick/fs/module/block.c:497!
+      3 kernel BUG at /home/neilb/work/nfsbrick/fs/module/dir.c:710!
+      7 kernel BUG at /home/neilb/work/nfsbrick/fs/module/inode.c:606!
+     61 kernel BUG at /home/neilb/work/nfsbrick/fs/module/segments.c:1034!
+      1 kernel BUG at /home/neilb/work/nfsbrick/fs/module/super.c:657!
+     42 SysRq : Resetting
+
+
+6b6b6bfb:
+  invalidate_inode_buffers called at shutdown.
+  Still weird
+
+block.c:497  FIXED??
+  block 16/1 is not dirty with no credits.
+  Maybe writepage got to it?
+
+dir.c:710
+  ouch! dir lookup failed in unlink.
+   No real hints.  Must be hash based - some off-by-one probably.
+   Need to stare at the code.
+
+inode.c:606  FIXED
+  Blocks still present after truncate.
+  typically about 60, but in 1 case '4'.  No index blocks.
+  So probably content of second index block.
+  Yes, lafs_leaf_next was doing the wrong thing for addresses
+   before start of block.
+
+segments.c:1034
+  same old
+
+super.c:657  FIXED
+  dir inode 0/2 is still Dirty but not pinned.
+  Maybe lafs_dirty_inode should be pinning the block
+ 
+  But now this triggers for 16/X still dirty.
+
+
+How and when to write blocks in a SegmentMap file?
+ - We don't want normal write-back to write them unless they have
+   no references
+ - We need to write them in tail of checkpoint, and index info must
+   follow in the next checkpoint.
+
+lafs_space_alloc is called from
+  - mark_cleaning:  always CleanSpace, failure is OK
+  - lafs_cluster_update_pin: ReleaseSpace.  -EAGAIN is OK (CHECK THIS) but failure
+              is not - or shouldn't be.
+  - lafs_allocated_block: CleanSpace, checking if parent of Realloc block 
+        can be saved separately from any Dirty version.  Failure OK, blocking not.
+  - lafs_prealloc - general space allocation.
+  - 
+lafs_cluster_update_pin is called from:
+  - lafs_create, lafs_link, lafs_unlink, lafs_rmdir, lafs_symlink, lafs_mkdir
+    lafs_mknod, lafs_rename,
+  - lafs_write_inode
+     So best to return -EAGAIN, and it should be handled adequately.
+
+lafs_prealloc is called from:
+  - lafs_reserve_block, after modifying the alloc_type extensively.
+  - lafs_phase_flip to re-fill the 'next' credits.  If they aren't available
+      we simply pin all children so they aren't needed.
+      So failure is OK
+  - lafs_seg_ref_block: getting CleanSpace to save segusage blocks.
+       If this fails .. what?? lafs_reserve_block fails. so...
+
+lafs_reserve_block is called from
+  - mark_cleaning - CleanSpace
+  - lafs_pin_dblock - type is passed in...
+  - lafs_prepare_write - on failure write will fail or retry after checkpoint
+  - lafs_inode_handle_orphan - to help with delete. On failure we allow
+         cleaning to happen
+  - lafs_seg_move - should be elsewhere.  Failure BAD !
+  - lafs_free_get - as above, failure BAD
+  - clean_free - update youth for new clean blocks - Failure BAD
+
+lafs_pin_dblock is called from
+  - dir_create_pin - fail or again handled
+  - dir_delete_pin
+  - dir_update_pin
+  - lafs_create etc
+  - lafs_dir_handle_orphan
+  - choose_free_inum
+  - inode_map_new_pin
+  - lafs_new_inode
+    ...
+  - lafs_orphan_release !! cannot handle failure
+  - roll_block should use AccountSpace
+
+So:  It seems we need a new allocation class that will never fail.
+  Maybe it is allowed to BUG though?
+   AccountSpace - i.e. space needed to account for the use of space.
+     Must never ever fail.
+
+Then we must ask where blocking should happen on -EAGAIN.
+  dir.c does "lafs_checkpoint_unlock_wait", then tries again.
+  prepare_write does too.
+
+For that to work we must start a checkpoint on returned EAGAIN.... Don't
+we want to wait for some cleaning to happen first though?  Maybe an extra
+flag, and a count of the number of empty (but not clean) blocks.
+
+- Should I skip orphan handling when tight on space?  Probably not.  It will
+  just keep failing while we keep cleaning...
+- roll_block should use account_space .. or not
+
+- lafs_space_alloc simply allocates space, or fails.  'why' is used to
+   guide watermark choice.
+- lafs_prealloc allocates space to a block and all its parents based on
+  'why' for watermarks.  It either succeeds or fails.
+
+- lafs_cluster_update_pin and lafs_reserve_block decide whether to respond
+  to failure as -ENOSPC or -EAGAIN based on 'why'.
+
+- lafs_pin_dblock simply passes on the failure, which must be handled.
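+
+  A rough sketch of that layering.  The class names come from the notes, but
+  everything else is an assumption made for illustration: the watermark
+  numbers, the struct, the function names, and in particular the choice that
+  only ReleaseSpace maps to -EAGAIN while other failures map to -ENOSPC (the
+  notes leave the NewSpace case open).

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

enum why_sketch { NewSpace, ReleaseSpace, CleanSpace, AccountSpace };

/* Hypothetical counters; not the real fs structure. */
struct space_sketch {
    long free_blocks;
    long allocated_blocks;
};

/* Invented watermarks: blocks that must stay unallocated for each class. */
static long watermark(enum why_sketch why)
{
    switch (why) {
    case NewSpace:     return 64;
    case ReleaseSpace: return 32;
    case CleanSpace:   return 8;
    case AccountSpace: return 0;
    }
    return 0;
}

/* Simply allocates or fails; 'why' only selects the watermark. */
static bool space_alloc(struct space_sketch *s, long blocks,
                        enum why_sketch why)
{
    if (s->allocated_blocks + blocks + watermark(why) > s->free_blocks)
        return false;
    s->allocated_blocks += blocks;
    return true;
}

/* Decides how a failure is reported, based on 'why'. */
static int reserve_sketch(struct space_sketch *s, long blocks,
                          enum why_sketch why)
{
    if (space_alloc(s, blocks, why))
        return 0;
    assert(why != AccountSpace);    /* must never fail; allowed to BUG */
    return why == ReleaseSpace ? -EAGAIN : -ENOSPC;
}
```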
+
+So: What to do when we return -EAGAIN?
+ We need to wait until there are *enough* clean segments, then cause a checkpoint
+ so they become free.
+ So a flag that says 'waiting for free space' and a count of segments
+ required.
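+
+ One possible shape for that flag and count, as a single-threaded sketch
+ with hypothetical names.  It assumes the cleaner reports each segment as it
+ becomes clean and that the checkpoint itself is just counted here rather
+ than run.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical state; the real fs would need locking around this. */
struct fs_sketch {
    bool waiting_for_space; /* someone got -EAGAIN and is waiting */
    int  segs_required;     /* clean segments wanted before a checkpoint */
    int  clean_segs;        /* clean but not yet free */
    int  checkpoints;       /* checkpoints triggered so far */
};

/* Called when an allocation returns -EAGAIN. */
static void note_eagain(struct fs_sketch *fs, int segs)
{
    fs->waiting_for_space = true;
    if (segs > fs->segs_required)
        fs->segs_required = segs;
}

/* Called by the cleaner for each newly clean segment.
 * Returns true when a checkpoint is triggered. */
static bool segment_cleaned(struct fs_sketch *fs)
{
    fs->clean_segs++;
    if (!fs->waiting_for_space || fs->clean_segs < fs->segs_required)
        return false;
    fs->checkpoints++;      /* checkpoint turns clean segments into free */
    fs->clean_segs = 0;
    fs->waiting_for_space = false;
    fs->segs_required = 0;
    return true;
}
```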
+
+ But how do we differentiate ENOSPC and EAGAIN for NewSpace requests?
+ Maybe we don't ??  Or do it later.
+
+Still to do:
+- Audit all AccountSpace and justify them
+ + lafs_seg_move is probably wrong.  Should have allocated when the
+   free segment was allocated
+- lafs_orphan_release calls lafs_pin_dblock but cannot handle failure
+- Need to wait not just for "enough space" but for "enough clean segments".
+
+- how is 'free_blocks' set - what does this tell us??
+
+   free_blocks is the sum of known-clean segments.
+   We probably want:
+         clean segments
+         remainder for each active segment
+   then reserve some segments for cleaning.
+   And separate 'allocated_block' for each ?
+
+Notes:
+ segments.c:647 fired: AccountSpace had no space available.
+   Reserving space to write the segusage of youth block for a newly
+   allocated segment.
+ super.c:657 STILL
+    0/2 is Dirty but not Pinned  Maybe we need PinPending
+ soft lockup
+    in the cleaner!
+    Maybe I need cond_resched??
+
+Maybe I want two separate 'free_blocks' counters.
+ One that includes all free blocks for use in 'df' etc.
+ One that only includes completely free segments for use in allocation...
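+
+ A sketch of the two counters, assuming a flat array of segments.  The
+ struct, its fields and both function names are hypothetical.  The 'df'
+ figure counts clean segments plus the remainder of each active segment.
+ The allocation figure counts only whole clean segments and holds some
+ back for cleaning.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-segment summary. */
struct seg_sketch {
    long size;      /* blocks in the segment */
    long used;      /* blocks in use */
    int  is_clean;  /* whole segment is known clean */
    int  is_active; /* segment is currently being written */
};

/* For 'df': clean segments plus the remainder of active segments. */
static long df_free_blocks(const struct seg_sketch *s, size_t n)
{
    long total = 0;

    for (size_t i = 0; i < n; i++) {
        if (s[i].is_clean)
            total += s[i].size;
        else if (s[i].is_active)
            total += s[i].size - s[i].used;
    }
    return total;
}

/* For allocation: whole clean segments only, after skipping
 * 'reserve_segs' of them, which are held back for cleaning. */
static long alloc_free_blocks(const struct seg_sketch *s, size_t n,
                              size_t reserve_segs)
{
    long total = 0;

    for (size_t i = 0; i < n; i++) {
        if (!s[i].is_clean)
            continue;
        if (reserve_segs > 0) {
            reserve_segs--;
            continue;
        }
        total += s[i].size;
    }
    return total;
}
```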
+
+
+24 June 2010
+
+ Something is wrong with cleaning and segment tracking
+ We have 5 free segments and we get them all without writing
+ anything!  We consume them all with cluster_flush!
+ It seems that the root inode is not changing phase!
+ Nothing is on the phase leafs.
+ Most children are in Writeback on cluster. and are Realloc
+ Others have pinned children.
+ They are all in 'cluster', but 'flush' doesn't flush them,
+ so they must be in a different cluster???  Is the cleaner still
+ cleaning?  Yes, they are on the cleaner 'wc' list so they are
+ queued but not flushed for the cleaner.
+
+25 June 2010
+ At last it looks like I nearly have a working FS. Out of 361 test
+ runs, 9 triggered BUGS and one hung at umount.
+
+ I need a new TODO list, starting with 6 jul 2007(!) and adding any
+ FIXMEs etc.
+
+DONE 0/ start TODO list
+DONE 1/ document new bugs
+DONE 2/ Tidy up all recent changes as individual commits.
+DONE 3/ clean up the various 'scratch' patches discarding any tracing that
+    I don't think I need, and making the rest 'dprintk' etc.
+DONE 4/ check in this README file
+DONE 5/ Write rest of the TODO list
+
+DONE 5a/ index.c:1982. Data block with Phys and no UnincCredit
+    It is Dirty but only has *N credits.
+    16/1 ...
+
+DONE 5b/ phase_flip/pin_all_children/lafs_refile finds refcnt == 0;
+   I guess we should getref/putref.
+
+DONE 5c/ dirty_inode might find InoIdx is allocated but datablock not
+    and doesn't cope well.
+
+DONE 5d/ At unmount, 16/1 is still pinned.
+
+ 6/ soft lockup in unlink call.
+    EIP is at lafs_hash_name+0xa5/0x10f [lafs]
+ [<d0a56283>] hash_piece+0x18/0x65 [lafs]
+ [<d0a564c3>] lafs_dir_del_ent+0x4e/0x404 [lafs]
+ [<d0a56256>] ? lafs_hash_name+0xfa/0x10f [lafs]
+ [<d0a4b35c>] dir_delete_commit+0xdb/0x187 [lafs]
+ [<d0a4be3f>] lafs_unlink+0x144/0x1f4 [lafs]
+ [<c02602c1>] vfs_unlink+0x4e/0x92
+
+  Don't know. Looks like cleaning up a chain in dir_delete_commit.
+  Added a BUG_ON.
+ 
+  Would we be spinning on -EAGAIN ?? 4 empty segment are present.
+
+ 6a/ index.c:1947 - lafs_add_block_address of index block where parent
+          has depth of 1.
+looping on [cfbd4690]327/336(0)r3F:Index(1),Pinned,Phase0,Valid,SegRef,CI,CN,CNI,UninCredit,PhysValid,PrimaryRef,EmptyIndex,Uninc{0,0}[0] uninc(1) inode_handle_orphan2(1) leaf(1)
+/home/neilb/work/nfsbrick/fs/module/index.c:1947: [cfbd5c70]327/0(0)r2F:Index(1),Pinned,Phase0,Valid,Dirty,Writeback,SegRef,CI,CN,CNI,UninCredit,PhysValid,EmptyIndex,Uninc{0,0}[0] inode_handle_orphan2(1) leaf(1)
+
+ 6b/  check_seg_cnt seems to be spinning on the 3rd section
+    the clean list has no end!
+    we were in seg scan 
+CLEANABLE: 0/0 y=0 u=0 cpy=32773
+CLEANABLE: 0/1 y=0 u=0 cpy=32773
+CLEANABLE: 0/2 y=0 u=0 cpy=32773
+CLEANABLE: 0/3 y=32773 u=6 cpy=32773
+CLEANABLE: 0/4 y=32772 u=124 cpy=32773
+CLEANABLE: 0/5 y=32771 u=273 cpy=32773
+CLEANABLE: 0/6 y=32770 u=0 cpy=32773
+
+of 
+0 0
+1
+2  
+3 6
+4 124
+5 273
+6 0
+7 496
+8 0
+
+
+ 6c/ at shut down, some simple orphans remain
+    missing wakeup ???
+
+DONE 7/ block.c:624 in lafs_dirty_iblock - no pin, no credits
+   truncate -> lafs_invalidate_page -> lafs_erase_dblock -> lafs_allocated_block / lafs_dirty_iblock
+Allocated [ce44f240]327/144(1499)r2E:Writeback,PhysValid clean2(1) cleaning(1) -> 0
+SEGMOVE 1499 0
+Oh dear: [ce44f240]327/144(0)r2E:Writeback,PhysValid clean2(1) cleaning(1)
+.......: [cfb69180]327/0(349)r2F:Index(1),Pinned,Phase0,Valid,PhysValid{0,0}[0] child(1) leaf(1)
+Why have I no credits?
+/home/neilb/work/nfsbrick/fs/module/block.c:624: [cfb69180]327/0(349)r2F:Index(1),Pinned,Phase0,Valid,Dirty,PhysValid{0,0}[0] child(1) leaf(1)
+      
+   Cleaning is racing with truncate, and that cannot happen!!
+   Actually it could - if i_size changed at the wrong time.
+
+DONE 7a/ block.c:507 in lafs_dirty_dblock - no credits for 0/2
+   block.c:507: [cfa63c58]0/2(4348)r2F:Valid,Dirty,Writeback,PhysValid cluster(1) iblock(1)
+   in touch_atime.  I think I know this one.
+
+ 7b/ soft lockup in cleaner between 0x5e6, then 0x799-7f6 then 0x990 of 0x1502
+               i.e. 1510, 1945-2038, 2448 of 5378
+    Appear to be looping in first loop of try_clean, maybe
+     group_size_words == 0 ??
+    Add BUGON and wait.
+
+DONE 7c/ NULL pointer deref - 000001b4
+     Could be cluster_flush finds inode dblock without inode.
+     Have a BUG_ON of this now.
+
+DONE 7d/ paging request at 6b6b6bfb. 
+    invalidate_inode_buffers called, so inode_has_buffers,
+    so private_list is not empty.  So presumably use-after-free.
+    But is on s_inodes list.
+     Probably cleaner is still active (if this is first call to
+     invalidate_inodes in generic_shutdown_super) so list gets broken.  
+     We need locking or earlier flush.
+
+DONE 7e/ Remove BUG block.c:273 as cleaner can cause this.
+     Check for Realloc too.
+
+PRESUME-FIXED 7f/ index.c:2024 no uninc credit 
+        [ce532338]0/306(2996)r1F:Pinned,Phase0,Valid,Dirty,Writeback,SegRef,Claimed,PhysValid cluster(1)
+      found during checkpoint.  Maybe inode credit problem.
+
+PRESUME-FIXED 7g/  inode.c:831 InoIdx 283/0 is Realloc, not dirty, and has
+      ->uninc blocks.  This is during truncate.  Need some
+      interlock with cleaner maybe?
+      Probably the same race between cleaner and truncate.
+
+DONE 7h/ inode.c:845 truncate finds children - Realloc on clean-leafs
+
+NOLONGERRELEVANT 7j/ resolve space allocation issues.
+    Understand why CleanSpace can be tried and failed 1000
+    times before there is any change.
+
+DONE  7k/ use B_Async for all async waits, don't depend on B_Orphan to do
+     a wakeup.
+     write lafs_iolock_written_async.
+
+DONE 7l/ make sure i_blocks is correct.
+          set on 'import_inode'
+          decreased when lafs_summary_update assigned block to '0'
+          changed when lafs_summary_allocate changes e.g. quota.
+
+      lafs_summary_update is called when a block is assigned to a location,
+        or to zero.  It is real usage.
+      lafs_summary_allocate is called when we set Prealloc on phys==0 or
+         clear Prealloc on phys==0
+      So allocate must be followed exactly.
+       update is already counted for setting !=0, so only dec on ==0.
+      So all is good.
+     What about quota? - hidden in quota_allocate / qcommit
+
+7m/ delete inode could not progress through inode_map_free, so
+   ORPH: [ce545f08]0/290(1663)r4E:Valid,Dirty,Async,UninCredit,Claimed,PhysValid,Orphan(0) async(1) delete_inode(1) orphan_list(1) drop(1)
+   was permanently an orphan.
+
+DONE 8/ looping in do_checkpoint
+   root is still in Phase1 because 0/2 is in Phase 1
+  [cfa57c58]0/2(2078)r1E:Pinned,Phase1,WPhase0,Valid,Dirty,C,CI,CN,CNI,UninCredit,IOLock,PhysValid</file.c:269> writepageflush(1)
+   Seems to be waiting for writeback, but writeback is clear.
+     Need to call lafs_io_wake in lafs_iocheck_writeback for when
+     it is called by lafs_writepage
+
+DONE 9/ cluster.c:478
+    flush_data_To_inode finds Realloc (not dirty) block
+    and InoIdx block is not Valid.
+  [cfb5ef50]2/0(3)r1F:Index(0),Pinned,Phase1,InoIdx,SegRef,C,CI,CN,CNI,IOLock,OnFree,PhysValid{0,1}[0]</cluster.c:435> child(1)
+  I wonder if it was PinPending, or where it was IOLocked (or if).
+
+   I guess we truncated, then added data, then tried to clean.
+   Probably just a bad 'bug' given recent changes.
+   No, I think it is the race between truncate and clean which is now fixed.
+
+SEEMS TO BE GONE 10/ inode.c:606
+    Deleting inode 328: 2+0+0 1+0
+
+    2 level index.
+    first index at level 1 was full and pruned properly.
+    Nothing else found empty.
+    Somehow the second index block and contents were lost.
+
+ASSUME_DONE 11/ super.c:657
+    Root still pinned at unmount.
+     0/2 is Dirty:  [cfa53c58]0/2(1750)r0E:Valid,Dirty,CN,CNI,UninCredit,PhysValid
+                    [cfa5fc58]0/2(2852)r0E:Valid,Dirty,SegRef,CN,CNI,UninCredit,PhysValid
+                    [cfa53c58]0/2(3570)r0E:Valid,Dirty,CN,CNI,UninCredit,PhysValid
+                    [cfa53828]0/2(2969)r0E:Valid,Dirty,CN,CNI,UninCredit,PhysValid
+                    [cfa75c58]0/2(579)r0E:Valid,Dirty,UninCredit,PhysValid
+    maybe dir-orphan handling stuffed up
+    Or maybe it is the I_Dirty issue.  Assume fixed.
+
+
+ASSUME_DONE 12/ timeout/showstate in unmount
+    umount is in sync_inodes / do_writepages / lafs_writepage / lafs_iolock_written
+    That looks similar to 8
+
+DONE 13/ delete_inode should wait for pending truncate to complete.
+    Document I_Trunc somewhere - including that i_mutex is needed to set it.
+    Verify that assertion.
+    Actually it requires i_alloc_sem, or the inode to be deleted.
+
+
+DONE 14/ Review writepage and flush and make sure we flush often enough but
+    not too often.
+    Probably just remove the cluster_flush from write-page as lafs_flush
+    will do that.
+    But leave for now as it encourages heavy indexing.
+
+DONE 14a/ use bio_add_page to write clusters.
+
+DONE 14b/ Figure out what backing_dev to present for the filesystem.
+
+DONE 15/ The inode map file lost some credits.  I think it loses a PinPending because
+    it isn't locked properly.  Don't clear PinPending if someone else might
+    have set it.
+
+DONE 15a/ Find all FIXMEs and add them here.
+    
+
+DONE 15b/ Report directory size less confusingly
+
+DONE 15c/ roll-forward should not update index if physaddr hasn't changed (roll_block)
+
+DONE 15d/ What does I_Dirty mean - and implement it.
+
+FIXED 15e/ setattr should queue an update for the inode metadata.
+     and clean up lafs_write_inode at the same time (it shouldn't do an update).
+     and confirm when s_dirt should be set.  It causes fsync to run a
+     checkpoint.
+
+15f/ include timestamp in cluster_head to set mtime/ctime properly on roll-forward?
+## Items from 6 jul 2007.  
+
+15g/ test directories with non-random sequential hash.
+
+DONE 15h/ orphan deadlock
+    lafs_run_orphans- lafs_orphan_release can block waiting for written
+     in erase_dblock, but that won't complete until cleaner gets to run,
+     but this is the cleaner blocked on orphans.
+    
+
+DONE 15i/ separate thread management from 'cleaner' name.
+
+DONE 15j/ review rules in getref_locked - and document them
+
+DONE  - fix accesses to iblock
+
+DONE 15k/ newblocks should probably be a count of segments.  Review that.
+
+DONE 15l/ make sure checkpoint_youth is decayed properly.  Review youth decay.
+
+DONE 15m/ consider combining .orphans and .cleaning lists.  If something is an
+    orphan, we probably don't want to clean it just now(?).
+
+DONE 15n/ consider if lafs_pin_dblock should check for iolock.  Maybe 
+     iolock or PinPending (which must be set under iolock).
+     Just require PinPending and always get iolock_written for that
+     except in special cases.
+
+DONE 15o/ Can there be async blocks when checkpoint starts?  Could they
+     pin blocks in old phase?  Do I need to check for them?
+
+DONE 15p/ Review and remove the 'if cleaner is active then don't checkpoint just
+     yet' thing - or somehow avoid the yuckiness.
+
+DONE 15q/ check checksums when reading cluster_header for cleaner
+       This is already done!
+
+DONE 15r/ consider further optimisation in cleaner to avoid lookups.
+
+DONE 15s/ memory barrier for i_size check in cleaner???
+
+DONE 15t/ review usable-space calculations in clean.
+
+DONE 15u/ Do I need a SegRef when pin-dblock-by-hand in flush_data_to_inode
+
+DONE 15v/ tidy up all code that fiddles bits and credits - maybe make some
+     common helpers.
+
+DONE 15w/ review cluster updates and make sure space used is accounted properly.
+
+DONT BOTHER 15x/ Consider caching result of a failed dir lookup in case we immediately
+     try to create it.  Would this actually save anything significant?
+
+DONE 15y/ Don't make dir blocks into orphans if it cannot be needed?
+
+DONE 15z/ make sure symlink creation is safe - do I need to log the body??
+
+DONE 15aa/ lafs_rename should flush orphans just like lafs_rmdir does.
+
+DONE 15ab/ Does writepage need to recheck if my_inode and/or iblock have appeared
+     after lock is taken on block?
+
+DONE 15ac/ if lafs_shrinker cannot reclaim enough index blocks, trigger some
+      writeout.
+
+DONE 15ad/ review lafs_phase_flip's call to lafs_add_block_address and wonder
+        if more is needed.
+
+DONE 15ae/ refile wonders about a race with cluster_allocate which gets IOLock
+    before removing from lru.
+
+DONE 15af/ Review all locking in lafs_refile
+
+DONE 15ag/ Don't allocate data part of InoIdx block.
+
+DONE 15ah/ Is there a problem with lafs_allocated_block putting an 
+    about-to-be-truncated block on an uninc list?
+
+DONE 15ai/ When allocating a new segment during checkpoint, delay the
+    youth-block update until after the checkpoint
+
+DONE 15aj/ When roll-forward finds a new segment, make sure youth number is
+    updated.
+
+DONE 15ak/ Load orphan file during roll-forward and make every block an
+    orphan.
+
+DONE 15al/ set filesystem update_time somewhere.
+
+DONE 15am/ filesystem 'name' needs to be handled uniformly.
+
+DONE 15an/ can we be sure 'b' will be non-null in delete_inode?
+
+DONE 15ao/ determine what locking is needed to walk the children list
+    in lafs_inode_handle_orphan.  Probably the address_space private lock.
+
+15ap/ Make sure write_inode has been cleaned up.  See if this applies to
+    rollforward of a symlink (see FIXME)
+
+DONE 15aq/ change inode map to be little-endian, not host-endian
+
+DONE 15ar/ understand what to do about errors in lafs_truncate
+
+15as/ handle errors from lafs_write_super ???
+
+DONE 15at/ More wait_queues to wait for different blocks.
+    just use wait_on_bit / wake_bit
+
+DONE 15au/ How should iocheck_block set the page error?
+       and block_loaded <- this gets it right.
+
+15av/ ditto for write errors?
+
+DONE 15aw/ when lafs_incorporate makes a new block where the
+      old is Realloc, the new should be Realloc too.
+
+15aw2 / When a block is a snapshot block it can never be dirty
+    so we only need credits for realloc...
+
+DONE 15ax/ Think about what happens when we relocate a block
+    in the orphan list (lafs_orphan_release), particularly
+    if the block isn't actually loaded.
+    FIXME still need to make sure errors while loading the orphan
+    file are handled correctly - I guess we mark all bad orphans as
+    type==0 and when we find those during release, reduce the size
+    of the orphan file.
+
+DONE 15ay/ Wonder if there is any way for run_orphans to get a wakeup 
+    when an inode or dir mutex is released.
+    No, there isn't.
+
+DONE 15az/ Sanity check all values in cluster head during roll-forward
+      i.e. in roll_valid.  If the head isn't complete, we can still
+      use this to commit some previous checkpoints.
+
+DONE 15ba/ roll forward should not BUG on bad data like inodefile in
+    non-primary filesystem.
+
+DONE 15bb/ Do I need to sync something before copying an update over part
+    of an inode, then reloading the inode.
+
+DONE 15bc/ Handle DescHole in roll forward.
+
+DONE 15bd/ Call lafs_add_block_address from writeback rather than iolock
+    in roll forward, just for consistency.
+
+DONE 15be/ Confirm various files loaded at mount time (segusage, orphan ...)
+    are actually the correct type.
+
+DONE 15bf/ Avoid quadratics in lafs_seg_put_all - nothing else should be doing
+   a lookup - or at least we can test for that.
+   lafs_seg_apply_all has similar problems and needs a good solution.
+
+DONE 15bg/ lafs_seg_ref_block is worried about losing implicit ref on parent
+    if parent splits.  See what to do about that.
+
+DONE 15bh/ after roll-forward, check that free_blocks hasn't gone negative.
+  or handle if it has.
+
+DONE 15bi/ Set EmergencyClean a bit later - need at least one checkpoint first.
+  to twostage.
+
+DONE 15bj/ Make sure .last link in segtracker is kept uptodate, particularly in
+   segdelete.
+
+DONE 15bk/ make sure get_cleanable doesn't lose a race before calling add_clean
+
+DONE 15bl/ better checks for 'valid state block address' in valid_devblock
+    include that segment_count is credible
+    also in valid_stateblock
+
+15bm/ make sure everything gets free properly on error during mount / lafs_load
+
+15bn/ How does refcounting of 'struct fs' work with multiple filesets?
+
+DONE 15bo/ use put_super to drop last reference to superblocks
+
+DONE 15bp/ review all superblocks - maybe use more anon??
+
+15bq/ check readonly status in lafs_get_sb
+
+DONE 15br/ sync_fs should probably wait for something if 'wait'.
+
+DONE 15bs/ set f_fsid properly in lafs_statfs
+
+DONE  - use new write_begin / write_end
+
+15bt/    - review how we ensure that credits remain with block.
+
+15ca/ When pin inode data block, pin it as well as index block I think
+    It is still kept off the leaf list until the index block is done with
+    I think.
+
+15cb/ Layout issues:
+     DONE - subset filesys still needs a parent pointer
+     DONE - cluster head needs mtime/ctime to log these.
+     - need better tracking of which devices are in this array??
+            Need to be able to have read-only devices that are shared
+             among arrays.
+     DONE - need multiple parallel write-clusters to allow parallel writes.
+     - record tuning in state block:
+           - max_segs
+     DONE - use crc or something, not toy checksum (e.g. cluster - state already has)
+     - flags for inconsistencies found, at layout/fileset/file levels(?) (see 60)
+     - policies of whether old or new data is allowed on each device
+     - policies of how much duplication of metadata is required
+     DONE - inode map - not host-endian
+     DONE - segments > 16bit:
+        segusage file - what about youth?
+        cluster_head Clength
+
+15cc/ free any stray B_ASync block found in destroy_inode
+
+15cd/ Some code assumes a cluster header does not exceed 1 page.
+     Is this safe?  Is it true? Is it enforced?
+     roll-forward now handles large cluster_head.
+     Need cleaner to handle it, and need to possibly write large
+     cluster head when making new clusters.
+
+15ce/ classify BUGs as
+        - internal logic errors
+        - IO errors
+        - unusual conditions I want a warning of
+        - data corruption errors
+
+DONE 15cf/ lafs_iget_fs needs to sometimes do in-kernel mounts for subset filesystems
+     This is needed for the cleaner - the cleaner needs to hold a ref somehow.
+
+15cg/ lafs_sync_inode is weird - why the lafs_checkpoint_start and update_cluster
+      stuff??
+
+15ch/ Review values of youth and checkpoint_youth and think about off-by-one
+     issues.
+
+15da/ Replace directory updates!!!!!
+
+15db/ Decide how version string will be used.
+
+15dc/ resolve table_size - it should be stored in the segusage file and validated
+      based on device geometry.
+
+15ea/ rollforward should recognise VerifyDevNext{,2} to allow next
+      cluster on same device to verify previous.
+
+15eb/ When multiple devices and lots to do and plenty of free space,
+       allow multiple segments, one per device, to be open at once,
+       and possibly be writing multiple clusters at once using
+       VerifyDevNext2
+
+15ec/ Implement i_version tracking.  This should be a 64bit number
+       that appears to change every time the file changes.  We only
+       need a new number when someone looks at the value with
+       getattr.
+       We could simply use mtime with the sub-millisecond part being
+       a counter of times that getattr sees a change in the same
+       millisecond.
+       However as mtime can go backwards we might get i_version going
+       backwards, which is awkward.  I wonder if I care.
+       Otherwise, leave for an inode extension later.
+
+16/ Update locking.doc
+
+17/ cluster_flush calls lafs_cluster_allocate calls lafs_add_block_address
+    calls  lafs_iolock_written.  How do we know that won't block on cluster_flush?
+
+18/ See if per-fs shrinker is available yet and consider it for index blocks.
+
+19/ Review WritePhase and make sure it is used properly.
+
+20/ Review places where we update blocks and be sure they are not in writeout
+    or in a different phase.
+
+21/ Review and document all lru uses (locking.doc) and make sure they are
+    all locked properly.
+
+22/ Check possible failures:
+    - thread allocation
+    - memory allocation
+    - reading critical metadata
+    ...
+
+23/ Rebase on 2.6.latest.  Done for .38
+
+24/ load/dirty block0 before dirtying any other block in depth=0 file,
+    else we might lose block0
+
+25/ use kmem_cache for
+        datablock
+        indexblock - probably a mempool because we cannot allow failure when
+                     splitting an index block.
+        skippoint (mempool?)
+        segsum - mempool??
+        others?
+
+26/ Review seg addressing code for 2-D geometries.
+
+27/ Allow ranges of holes in pending_addr so partial truncate can be more efficient.
+
+28/ Make sure youth blocks are always referenced properly.
+
+29/ Make sure new segments are referenced properly.  I think there might be
+    some double referencing.
+
+30/ Decide when to use VerifyNULL or VerifyNext2
+
+31/ Implement non-logged files
+
+DONE 32a/ Store access time in a file
+32b/ Make it a non-logged file
+32c/ Avoid writing out dirty atime file blocks when not necessary.
+      i.e. keep the page clean and active, and trigger 'write'
+     on release_page.
+
+33/ Support quota : group / user / tree
+
+34/ handle subordinate filesystems:
+     ss[]->rootdir needs to be array or list
+     lafs_iget_fs needs to understand this
+
+35/ review snapshots:
+      - peer lists and cleaning
+      - how to create
+      - failure modes
+      - how to destroy
+
+36/ review roll-forward
+
+DONE 36a/  make sure files with nlink == 0 are handled well
+DONE 36b/  sanity check before trusting clusters
+DONE 36c/ handle miniblocks which create new inodes.
+DONE 36d/ Handle DescHole in roll_block
+DONE 36e/ When dirtying a block in roll_block, maybe use writeback rather
+     than just iolock, for consistency...
+DONE 36f/ What to do if table becomes full when add_block_address in
+     roll_block ??
+DONE 36g/ Write roll_mini for directories.
+DONE 36h/ In roll_one, use the cluster counting code to find block number and
+     make sure we don't exceed the segment.
+DONE 36i/ add more general error checking to lafs_mount - 
+            lafs_iget orphans and segsum.  Check type is correct.
+         errors from lafs_count_orphans or lafs_add_orphans.
+         alloc_page failure for chead - maybe allocate something bigger??
+
+37/ Configure index block hash_table at run time based on mem size??
+
+38/ striped layout
+        review everything needed for safe RAID5
+
+39/ How to handle all different IO errors
+
+40/ Guard against data corruption at every level.
+
+41/ Add checksums on index blocks and dir blocks and Inodes and ???
+
+42/ Store duplicates of some blocks.  At least index and inode.
+
+43/ Handle writepage on mem-mapped page, adding new credits or unmapping.
+    Make sure ->page_mkwrite sets up credits properly
+
+44/ Examine created filesystem and make sure everything looks good.
+
+DONE 45/ mkfs.lafs
+
+46/ fsck.lafs
+
+47/ Write good documentation
+
+48/ Review all code, improve all comments, remove all bugs.
+
+49/ measure performance
+
+50/ Support O_DIRECT
+
+51/ Check support for multiple devices
+    - add a device to a live array
+    - remove a device from a live array
+
+DONE 52/ NFS export
+
+53/ 'overlay' support
+        So I mount one device read-only and another device
+        writable which gets all the updates.  metadata on first
+        device not updated.
+
+54/ cluster support - is this possible?
+
+55/ is any useful variant of reflink  possible?
+
+56/ Review roll-forward completely.
+
+57/ learn about FS_HAS_SUBTYPE and document it.
+    This is for fuse in particular so users can know the real type
+
+58/ Consider embedding symlinks and device files in directory.
+    Need owner/group/perm for device file, but not for symlink.
+    Can we create unique inode numbers?
+    hard links for dev-files would be problematic.
+    What do we gain?  Maybe something for short symlinks.
+    40 seems a good length to get 70% of symlinks.
+
+59/ Fix NeedFlush handling so we don't drop-then-retake
+    a mutex as that isn't sensible.
+
+60/ Introduce some fs state recording that fsck is needed and possibly
+    identifying what sort of fsck.
+
+61/ Try to make the inode struct smaller - maybe move some of the
+    fs metadata into a separately-allocated struct.
+
+62/ System/trusted extended attributes:
+         fileset max size
+         directory hash/seed
+         
+63/ user extended attributes.
+
+64/ wonder if index blocks can be flushed out by memory pressure somehow.
+   e.g. if a data block is written by reclaim, flag the index block.
+   When a flagged index block has no children, it is incorporated and written.
+    ??
+
+65/ review why lafs_allocated_block needs the new_parent label.  Should not
+   lafs_incorporate leave all parents dirty? Maybe it is just the need for
+   B_Realloc - so maybe lafs_incorporate should leave the new block either
+   realloc or dirty rather than lafs_allocated_block doing it.?
+   See also 15ad below.
+
+66/ Delay writeout of directory updates until an fsync.  If a checkpoint happens
+   first, discard the updates (and fsync waits for checkpoint to complete).
+   If a cross-directory rename happens care is needed:  either flush updates
+   first or ensure that a flush does happen before the cross-directory
+   update is flushed.
+   Note that if the target of a rename is a directory, it must also be fully
+   flushed before the rename can proceed.
+
+26June2010
+ Investigating 5a
+
+   Normal sequence is to surrender UnincCredit, then to clear Dirty,
+    then to write.  If anyone re-dirties after Dirty is clear, they
+    will naturally have to add an UnincCredit having reserved space first.
+   However it seems that the Cleaner gets in the way as the block in question
+   has just previously been cleaned, which consumed the UnincCredit
+   Do we need ReallocUnincCredit?? I hope not.
+   We generally need a way to say "I might want to write to this" so cleaner
+   doesn't write it early.
+   For index blocks that is pincnt.  For data it is 'PinPending'.
+   This keeps index blocks off clean_leafs until they are ready, but
+   not data blocks.
+   And in any case, TypeSegmentMap blocks don't get PinPending as they
+   get written *after* the checkpoint.  That is a rather ugly exception.
+   Maybe we make their different handling more explicit.  We put them on
+   a separate list unpinned so the rest of the checkpoint can complete.
+   Then we flush that list?
+   Then PinPending keeps them off the clean_leafs list.
+
+   So to clarify the plan:  If a block is already Pinned to this phase,
+   we can "clean" it by marking it Dirty rather than Realloc.  This is
+   appropriate for blocks that are likely to change soon (as blocks written
+   to the cleaner segment are not likely to change soon).
+   For data blocks we take "PinPending" to say "might change soon".  For
+   index blocks ... we don't know if it is pinned by Realloc or Dirty or
+   PinPending children.  So we set Realloc and wait for any children to
+   be unpinned for whatever reason.  If it is only pinned by Realloc blocks,
+   it will end up on clean_leafs and be processed to the cleaner segment.
+   If it is pinned by anything else it will be found by the checkpoint and
+   processed to the new-data segment.
+
+   So Index blocks always get Realloc, PinPending blocks get Dirty,
+   Other data blocks get Realloc.  Good.
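+
+   A minimal sketch of that rule in C.  The flag names and clean_mark() are
+   made up for illustration and are not the real LaFS block flags:

```c
/* Hypothetical flag bits, for illustration only. */
enum blk_flag {
    BLK_INDEX       = 1 << 0,  /* block is an index block */
    BLK_PIN_PENDING = 1 << 1,  /* data block that "might change soon" */
    BLK_DIRTY       = 1 << 2,  /* goes out with the checkpoint, to the new-data segment */
    BLK_REALLOC     = 1 << 3,  /* may go to the cleaner segment */
};

/* Return the flag the cleaner should set on a block it wants to move. */
enum blk_flag clean_mark(unsigned flags)
{
    if (flags & BLK_INDEX)
        return BLK_REALLOC;     /* index blocks always get Realloc */
    if (flags & BLK_PIN_PENDING)
        return BLK_DIRTY;       /* likely to change soon: keep with new data */
    return BLK_REALLOC;         /* other data blocks */
}
```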
+
+   Must review PinPending usage... always set, then maybe-dirty inside
+   checkpoint lock.  In cases of unlocked usage (inode map) we don't clear
+   PinPending until checkpoint so it has longer exposure to Realloc->Dirty.
+   It is likely to be changing though, so not a big cost.  Even good.
+
+   Could make the distinction later.  PinPending blocks don't go on
+   clean_leafs.  So if they are still realloc at the checkpoint, we Realloc
+   to the new-data segment.  This has the same net effect but is arguably
+   cleaner.  It means that if a realloc block gets pinpending set, it
+   immediately stops being a clean leaf and so is safe.
+   So: just keep PinPending blocks off clean_leafs.  Keep them on phase_leafs.
+   However there is no mechanism for moving things from phase_leafs to clean_leafs.
+   So maybe they stay on clean_leafs, but when the cleaner gets to them, it
+   dirties them and drops them.... that would work.
+
+   So; if cleaner finds a block (on clean_leafs during cleaner-flush) which is
+   Dirty or PinPending, it makes sure it is Dirty and drops it for phase_leafs
+   to pick up.
+
+   BUT:  Does this work for TypeSegmentMap blocks?  They aren't PinPending.
+
+   We could treat them specially in the cleaner.  Or we could set PinPending
+   and pin them to the phase, but treat them differently in checkpoint.
+   If we gathered them onto a separate list, then flush the list after
+   the phase had changed, it might be quite neat.  No more getting writepages
+   to do our work for us.
+   They would need to be re-pinned to the next phase, then written out.
+   Or just unpinned, and let seg_inc re-pin as appropriate... except that
+   seg_inc is too late to pin.  It dirties.  We need to pin when we get
+   SegRef.  We currently reserve but we don't pin.
+   We really do need to phase_flip these segmentmap blocks.  But that requires
+   getting extra credits, and Pinning everything if new credits are not available.
+   And we don't really have a good list of 'everything' that depends on a segment.
+   But seeing the space_alloc never fails for these...
+   So Pin them, and flip them with AccountSpace
+
+   So:
+    - split out common 'flip' code
+    - add 'flip' for data blocks
+    - create list of accounting blocks and flip accounting file blocks onto
+      that list during checkpoint
+      Flush should write that list,  not the files.
+    - Get cleaner to ignore pinpending blocks, marking them dirty.
+    - pin segusage blocks while ref on them is held.
+    - writepage no longer needs special case for TypeSegmentMap, just PinPending
+    - lafs_prealloc just tests PinPending
+
+
+   [[aside: quota files seem to be handled like segmentmap files.  Is that 
+     right??
+     We only track usage of data blocks based on various 'owners' of the file.
+     We need to know if a block was written in one phase or the next, and
+     only count blocks written/allocated in the one.
+     Data blocks can slip into 'this' phase quite late - any time before the
+     parent is finally incorporated.  So we don't write quota blocks
+     until checkpoint is done.  So yes, they are like SegmentMap
+   ]]
+
+
+  segsums....
+   If there are hundreds of snapshots, then a block being cleaned (whether to
+   cleaner segment or new-data segment) could affect hundreds of segment
+   usage counters.  That would be clumsy to work with.  Every block in the
+   free table would need to hold references to hundreds of blocks.  This
+   is do-able and might not be a big waste of space, but is still clumsy.
+   I could change the arrangement for accounting per-snapshot usage by having
+   a limited number of snapshots and having all the counters for one segment
+   in the one block. So 1024byte block could hold 512 counters (youth plus
+   base plus 510 snapshots).  Half that if I go to 4byte counters.
+   In more common case of 32 snapshots, could fit counters for 8 segments in
+   a block.  This means using space/io for all possible snapshots rather than
+   all active snapshots.  It would also mean having a fairly fixed upper limit.
+   I wonder what NILFS does....
+   Worry about this later.
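+
+   The arithmetic above, as a small C sketch.  The function names are
+   invented here.  Note the "8 segments" figure comes out exact for 4byte
+   counters only if 32 is the total counters per segment (youth, base and
+   30 snapshots); with youth and base on top of 32 snapshots it is 7.

```c
#include <stddef.h>

/* Number of usage counters that fit in one block. */
size_t counters_per_block(size_t block_size, size_t counter_size)
{
    return block_size / counter_size;
}

/* Whole segments whose counters fit in one block, where each segment
 * needs a youth counter, a base counter and one counter per snapshot. */
size_t segments_per_block(size_t block_size, size_t counter_size,
                          size_t snapshots)
{
    size_t per_segment = 2 + snapshots;   /* youth + base + snapshots */

    return counters_per_block(block_size, counter_size) / per_segment;
}
```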
+
+  Still trying to get pinning of SegmentMap blocks right.
+  Normally we need a phase-lock when pinning a data block so that we
+  don't lose the pinning before we dirty.  But as we phase_flip
+  these it doesn't matter... So just add that to the test??
+
+28June2010
+ Reflecting on 5c - dirty_inode might find InoIdx pre-allocated but
+  datablock not, and doesn't cope.
+  We either prealloc both, which seems clumsy, or always defer
+  to InoIdx if it is present and pinned.
+  lafs_prealloc does both Index and Data blocks for inode.
+  But Data could lose at writeout while index will replenish at
+  phase_flip, so maybe not a good idea.
+  If lafs_allocate_cluster finds a Dirty InoIdx it will copy the Dirty
+  credits across to the data block (on non-cleaning segments) so the
+  Data block doesn't need to have credits.
+
+  dirty_inode gets called:
+     {__,}mark_inode_dirty{,_sync}
+     inode_{inc,dec}_link_count
+     [[various quota ops]]
+    inode_setattr
+    touch_atime
+      file_accessed
+    file_update_time
+      generic_file_...write
+      do_wp_page
+
+  updates through inode_setattr go to lafs_setattr so the
+  data block will be pinpending and the checkpoint lock will be held.
+
+  updates through inode_*_link_count happen in filesystem and the inode data
+   block is PinPending, or a block in the file is pinned and will be
+   dirty, so it will get written.
+
+  updates through touch_atime or file_update_time are unexpected and
+  cannot be prepared for.  file_update_time changes will be caught by
+  normal file writeout.  atime changes will be lost until we get the
+  atime file working.
+
+  So:
+    dirty_inode cannot change the block as it might be in writeout, and
+    it cannot lock anything as it might be in touch_atime which shouldn't
+    block and cannot fail.
+    So just set I_Dirty and use that to flush inode to db at writeout.
+    Any changes which must be in the next phase will come via setattr and
+    so will wait for incompatible changes to be written out.
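+
+    Roughly, as a userspace sketch.  struct toy_inode and both functions
+    are stand-ins invented for illustration, not the LaFS code; the only
+    point is that dirty_inode sets a flag and the copy into the data block
+    happens at writeout:

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the in-memory inode. */
struct toy_inode {
    atomic_bool i_dirty;   /* models the I_Dirty flag */
    long mtime;            /* in-memory value, may change at any time */
    long db_mtime;         /* copy held in the inode's data block */
};

/* dirty_inode: must not block, must not fail, must not touch the
 * data block (it might be in writeout).  So only set the flag. */
void toy_dirty_inode(struct toy_inode *ino)
{
    atomic_store(&ino->i_dirty, true);
}

/* Writeout: flush the inode to its data block only if I_Dirty was set.
 * Returns true if a copy was made. */
bool toy_writeout(struct toy_inode *ino)
{
    if (!atomic_exchange(&ino->i_dirty, false))
        return false;
    ino->db_mtime = ino->mtime;
    return true;
}
```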
+
+ Reflecting on 7c - cluster_flush might find ->my_inode is NULL.
+  my_inode is set
+     lafs_import_inode
+         iget and mount-time stuff
+     lafs_inode_dblock
+
+  my_inode is cleared
+    When I_Destroyed is set and the last ref on the block is dropped
+    When inode_map_new_prepare claims an inodeblock
+
+  So we could easily not have a my_inode - e.g. just cleaning the data block.
+  ->my_inode cannot disappear while we hold the block, so a test is safe.
+
+
+ ----------------------------------------------
+ Space reservation and file-system-full conditions.
+
+  Space is needed for everything we write.
+  Some things we can reject if the fs is too full
+  Some things we can delay when space is tight
+  Some things we need to write in order to free up space.
+  Others absolutely must be written so we need to always have
+  a reserve.
+
+  The things that must be written are
+       - cluster header  - which we never allocate
+       - some seg-usage and youth blocks - and quota blocks
+         These continually have credit attached - it is a bug if there
+          are not enough. (We hit this bug)
+
+  Things that we need to write to free up space are
+   any block - data or index - that the cleaner finds.
+
+  Things that we can delay, but not fail, are any change to a block that
+   has already been written or allocated.
+
+  When space is needed it can come from one of three places.
+     - the remainder of the current main segment
+     - the remainder of the current cleaner segment
+     - a new segment.
+
+  Only Realloc blocks can go to the cleaner segment, so the
+  'must write' blocks cannot go there, so unused + main must have enough
+  space for all those.
+  Realloc blocks can go anywhere - we don't need a cleaner segment if things
+  are too tight.
+
+  When we run out of space there are several things we can do to get more:
+   - incorporate index blocks.  This tends to free up uninc-credits which
+     are normally over-allocated for safety.
+   - cluster_allocate/cluster_flush so more blocks get allocated and so
+     more can be incorporated.  See above.  This is probably most helpful
+     for data blocks.
+   - clean several segments into whole cleaner segments or into the main segment.
+  Much of this happens by triggering a snapshot, however we should only do that
+  when we have full cleaner-segments (or zero cleaner segments).
+
+  When cleaning we don't want to over-clean.  i.e. we don't want to commit
+  any blocks from a second segment if that will stop us from committing blocks
+  from the first segment.  Otherwise we might use one cleaning segment up by
+  making 4 half-clean.  This doesn't help.
+
+
+  So: we reserve multiple segments for the cleaner, possibly zero.
+
+  We clean up to that many segments at a time, though if that many is zero,
+  we clean one segment at a time.
+  lafs_cluster_allocate only succeeds if there was room in an allocated segment.
+  If allocating a new segment fails, the cluster_allocate must fail.  This
+  will push extra cleaning into the main segment where allocations must not
+  fail.
+
+  The last 3(?) [adjusted for number of snapshots] segments can only be allocated
+  to the main segment, and this space can only be used for cleaning.
+  Once the "free_space - allocated_space"  drops below one segment, we 
+  force a checkpoint.  This should free up at least one segment.
+
+  We need some point at which we stop cleaning because the chance of finding
+  something to clean is too low. At that point all 'new' requests definitely
+  become failures.  They might do earlier too.
+  Possibly at some point we start discounting youth from new usage scores so
+  that the list becomes sorted by usage.
+
+
+  Need:
+    cut-off point for free_seg where we don't allow cleaner to use segments
+      3? 4?
+
+    event when we start using fixed '0x8000' youth for new segment scores.
+       Maybe when we clean a segment with usage gap below 16 or 1/128
+    event when we stop doing that.
+       Maybe when free_segs cross some number - 8?
+
+    point when alloc failure for NewSpace becomes ENOSPC
+       same as above?
+
+    point when we don't bother cleaning
+      no cleaner segments can be allocated, and checkpoint did not increase
+      number of clean segments (used as many as freed).
+      Clear this state when something is deleted.
+
+
+   Allocations come out of free_blocks which does not include those
+   segments that have been promised to the cleaner.
+   CleanSpace and AccountSpace cannot fail.
+     We *know* not to ask for too many - cleaner knows when to stop.
+   ReleaseSpace fails (to be retried) if available is below a threshold,
+     providing the cleaner hasn't been stopped.
+   NewSpace fails if below a somewhat higher threshold.  If we haven't entered
+     emergency cleaning mode, these requests fail -EAGAIN, else -ENOSPC.
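+
+   A sketch of those rules in C.  struct space_state, its fields and the
+   way 'available' is computed are assumptions made for illustration, and
+   the watermark values are left to the caller.  Which mode yields -ENOSPC
+   follows the work item that has emergency cleaning mode cause ENOSPC;
+   that reading is also an assumption.

```c
#include <errno.h>
#include <stdbool.h>

enum space_why { CleanSpace, AccountSpace, ReleaseSpace, NewSpace };

/* Hypothetical bookkeeping, all in blocks. */
struct space_state {
    long free_blocks;        /* excludes segments promised to the cleaner */
    long allocated_blocks;   /* credits already handed out */
    long release_watermark;  /* ReleaseSpace fails below this */
    long new_watermark;      /* NewSpace fails below this (somewhat higher) */
    bool cleaner_stopped;
    bool emergency_clean;
};

/* Returns 0 and records the credits on success, or a negative errno. */
int space_alloc(struct space_state *s, long credits, enum space_why why)
{
    long available = s->free_blocks - s->allocated_blocks - credits;

    switch (why) {
    case CleanSpace:
    case AccountSpace:
        break;                              /* cannot fail */
    case ReleaseSpace:
        if (available < s->release_watermark && !s->cleaner_stopped)
            return -EAGAIN;                 /* to be retried */
        break;
    case NewSpace:
        if (available < s->new_watermark)
            return s->emergency_clean ? -ENOSPC : -EAGAIN;
        break;
    }
    s->allocated_blocks += credits;
    return 0;
}
```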
+
+   
+   Possibly limit some 'cleaner' segments to data only??
+
+
+  So: work items.
+    - change CleanSpace to never fail, but cluster_allocate new_segment
+      can fail for a cleaner segment.  This is propagated through lafs_cluster_alloc
+    - cleaner pre-allocates cleaner segments (for new_segment to use)
+      and only cleans that many segments at a time.
+    - introduce emergency cleaning mode which causes ENOSPC to be returned
+      and ignores 'youth' on score.
+    - pause cleaner when we are so short of space that there is no point
+      trying until something is deleted.
+
+30june2010
+  notes on current issue with checkpoint misbehaving and running out of
+  segments.
+
+  1/ don't want to cluster-flush too early.  Ideally wait until segment is
+   full, but we currently hold writeback on everything so we cannot delay
+   indefinitely.
+  2/ row goes negative!!  let's see...
+
+    seg_remainder doesn't change the set, but just returns
+        the remaining rows times the width
+
+    seg_step  move nxt_* to *, stepping to the next ... row?
+             save current as 'st_*
+
+    seg_setsize - allocate space in the segment for 'size' blocks plus
+         a bit to round off to a whole number of table/rows
+               nxt_table nxt_row
+
+    seg_setpos initialises the seg to a location and makes it empty,
+       st_ and nxt_ are the same
+
+    seg_next reports address of next block, and moves forward.
+
+    seg_addr  simply reports address of next block
+
+   So the sequence should be:
+
+     seg_setpos  to initialise
+     seg_remainder as much as you want
+     seg_setsize when we start a cluster
+     seg_next  up to seg_remainder times
+     seg_step  to go to next cluster (when not seg_setpos).
+            or maybe just before seg_setpos
+
+     Need cluster_reset to be called after new_segment, or after we
+     flush a cluster but don't need a new_segment.
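+
+   To check the intended sequence, here is a toy single-table model in C.
+   The struct layout, the return value of seg_setsize and the reading of
+   seg_step (st_ ends up equal to the new current row) are assumptions, not
+   the real code.  It does make the "row goes negative" class of bug easy
+   to test for, since seg_setsize refuses a cluster that does not fit.

```c
/* Toy segment cursor: one table of 'rows' rows, 'width' blocks each. */
struct segpos {
    int width, rows;   /* geometry */
    int st_row;        /* row where the current cluster starts */
    int row, col;      /* next block to hand out */
    int nxt_row;       /* first row after the current cluster */
};

/* Initialise to a location and make it empty: st_ and nxt_ are the same. */
void seg_setpos(struct segpos *sp, int row)
{
    sp->st_row = sp->row = sp->nxt_row = row;
    sp->col = 0;
}

/* Remaining rows times the width; changes nothing. */
int seg_remainder(const struct segpos *sp)
{
    return (sp->rows - sp->nxt_row) * sp->width;
}

/* Reserve 'size' blocks rounded up to whole rows; -1 if they do not fit. */
int seg_setsize(struct segpos *sp, int size)
{
    int need = (size + sp->width - 1) / sp->width;

    if (need > sp->rows - sp->row)
        return -1;
    sp->nxt_row = sp->row + need;
    return 0;
}

/* Address of the next block, without moving. */
int seg_addr(const struct segpos *sp)
{
    return sp->row * sp->width + sp->col;
}

/* Address of the next block, then move forward. */
int seg_next(struct segpos *sp)
{
    int addr = seg_addr(sp);

    if (++sp->col == sp->width) {
        sp->col = 0;
        sp->row++;
    }
    return addr;
}

/* Move nxt_* to current, ready for the next cluster. */
void seg_step(struct segpos *sp)
{
    sp->row = sp->nxt_row;
    sp->col = 0;
    sp->st_row = sp->row;
}
```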
+
+   I think I'm cleaning too early ...  I am even cleaning
+   the current main segment!!!!
+
+   OK, I got rid of the worst bugs.  Now it just keeps cleaning
+   the same blocks in the current segment over and over.
+   2 problems I see
+      1/ it cleans a segment that it should not touch
+           We need to  avoid cleaner segment increasing the
+             checkpoint youth number.
+      2/ it has 6 free segments and doesn't use them
+
+   clean_reserved is 3 segments, < 4, so free_block <= allocated+ watermark
+   watermark is 4 segs, so free < 4.  So we have 3 allocated to cleaner,
+   3 in reserve and so nothing much to clean!
+
+   The heuristic for returning ENOSPC is not working.  Need something more
+   directly related to what is happening.
+   Maybe if cleaning doesn't actually increase free space.
+
+   !Need to leave segments in the table until we have finished
+   writing to them, so they cannot be cleanable. - DONE
+
+   WAIT - problem.  If cleaner segment is part-used, the alloc_cleaner_segs
+   doesn't count that.  Bad?
+
+   When nearly full we keep checkpointing even though it cannot help.
+   Need clearer rules on when there is any point pushing forward.
+   Need to know when to fail requests.
+
+02 july 2010
+
+  I am wasting lots of space creating snapshots that don't serve any
+  purpose.
+  The reasons for creating a snapshot are:
+    - turn clean segments into free segments
+    - reduce size of required roll-forward
+    - possibly flush all inode updates for 'sync'.
+
+  We currently force one when
+       newblocks > max_newblocks
+          max is 1000 , newblocks is never reset!
+          probably make that a number of segments.
+       lafs_checkpoint_start is called
+          when cleaner blocks, and space is available
+          at shutdown
+          on write_super if s_dirt
+             __fsync_super before ->sync_fs
+               freeze_bdev
+               fsync_super
+                 fsync_bdev
+                 do_remount_sb
+             generic_shutdown_super before put_super if s_dirt
+             sync_supers if s_dirt
+               do_sync
+             file_sync !!! if s_dirt
+
+      I think I should move checkpoint_start to
+            ->sync_fs
+
+
+ After testing
+  - blocks remaining after truncate - one index and 1-4 data
+  - truncate finds blocks being cleaned
+         FIXED - move setting of I_Trunc
+  - orphans aren't being cleaned up sometimes.
+        Hacked by forcing the thread to run.
+  - parent of index block has depth==1
+        Don't reduce depth while dirty children.
+        Probably don't want uninc either?
+
+  - some sort of deadlock? lafs_cluster_update_commit_both
+     has got the wc lock and wants to flush
+    writepage also is flushed.
+   Not sure what the blockage is.
+   I think the writepage is the one in cluster_flush, and it
+    is blocking
+
+  - Async is keeping 16/0 pinned during shutdown
+03July2010
+
+  Testing overnight with 250 runs produced:
+ - blocked for more than 120 seconds
+      Cleaner tries to get an inode that is being deleted
+      and blocks, so inode_map_free is blocked waiting for
+      checkpoint to finish - deadlock.
+     Need to create a ->drop_inode which provides interlock with
+     cleaner/iget
+ 
+    But this is hard to get right.
+    generic_forget_inode needs to write_inode_now and flush all changes
+    out and then truncate the pages off so the inode will be
+    empty and can be freed.  But flushing needs the cleaner thread
+    which can block on the inode lookup.
+    Ahh.... I can abuse iget5_locked.
+    If test sees I_WILL_FREE or similar, it fails and sets a flag.
+    if the flag was set, then 'set' fails
+
+
+ - block.c:504 DONE (I think).
+    unlink/delete_commit dirties a block without credits
+    It could have been just cleaned..
+    It looks like it was in Writeback for the cleaner when
+    unlink pinned and allocated it....
+    or maybe it was on a cluster (due to writepage) when
+    it was pinned.  Then cluster_flush cleared dirty ... but
+    it should still have a Credit.
+    Maybe I should iolock the block ??
+
+    On reflection it wasn't cleaning, just tiny clusters
+    of recent changes which were originally written as tiny
+    checkpoints. Maybe lots of directory updates triggered the clusters.
+    I guess writepage is being called to sync the directory???
+    Or maybe the checkpoint was pushed by s_dirt being set.
+
+    So use PinPending and iolock to protect dir blocks from writepage.
+    
+ - dir.c:1266 DONE
+    dir handle orphan finds a block (74/0) which is not
+    valid
+    This can happen if orphan_release failed to reserve a block.
+    We need to retry the release.
+ - inode.c:615
+    index block and some data blocks still accounted to deleted file.
+
+    No theory on this yet.  Always one index block and a small number
+    of data blocks.  Maybe the index block looked dirty, but was then
+    incorporated with something that was missed from the children list...
+    Or maybe I_Trunc is cleared a bit early...
+    Or trunc_next advanced too far?? or too soon
+    ??
+
+ - segments.c:640 DONE
+     prealloc in the cleaner finds all 2315 free blocks allocated.
+     no clean reserved.
+    Need to be able to fail CleanSpace requests when cleaner_reserve
+    is all gone.??
+
+    or just slow down the cleaner to one segment per checkpoint when
+    we are tight..  Hope that works.
+ - super.c:699
+     async flag on 16/0 keeping block pinned
+   Maybe clear Async flag during checkpoint.  Cleaner won't need it
+   No, just ensure to clear Async on all successful async calls.
+   
+     orphan file 8/0 has orphan reference keeping parent pinned
+      [cfb64c90]8/0(1782)r1E:Valid,SegRef,PhysValid orphan(1)
+   Orphan handling is failing to get a reservation to write out the
+   orphan file block?  Not convincing as there should be lots of space
+   at unmount, and 'orphan sleeping' has become empty.
+
+ - Show State
+     orphan inode blocked by leaf index stuck in writeback:
+   [cfb68460]331/0(NoPhysAddr)r2F:Index(1),Pinned,Phase1,Valid,Dirty,SegRef,CI,CN,CNI,UninCredit,EmptyIndex{0,0}[0] primary(1) leaf(1) Leaf1(5) 
+   [cfb28d20]331/336(NoPhysAddr)r2F:Index(1),Pinned,Phase1,Valid,Dirty,Writeback,Async,UninCredit,PrimaryRef{0,0}[0] async(1) cluster(1) wc[0][0] 
+
+    This is in the write-cluster waiting to be flushed
+
+
+9July2010
+  Review B_Async.
+    If a thread wants async something, it 
+         - sets B_Async
+         - checks if it can have what it wants.
+           + if not, fail
+           + if so, clear B_Async and succeed
+
+    If a thread releases something that might be requested Async,
+         it doesn't clear Async, but wakes up *the*thread*.
+
+    This applies to
+        IOLock      - iolock_block
+        Writeback   - writeback_done, iolock_written
+        Valid        - erase_dblock, wait_block
+        inode I_*   - iget / drop_inode
+
+     orphan handler, cleaner, segscan - all in the cleaner thread.
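+
+    A single-threaded sketch of that B_Async rule, applied to IOLock.  The
+    flag values, struct and function names are made up for illustration, the
+    bit operations are not atomic as the real ones would need to be, and the
+    'wakeups' counter only stands in for waking the cleaner thread.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical flag bits; the real block flags are laid out differently. */
enum { B_IOLock = 1 << 0, B_Async = 1 << 1 };

struct block {
    unsigned flags;
    int wakeups;    /* stand-in for "wake up *the* thread" */
};

/* Async request: 0 on success, -EAGAIN if the caller must retry after a wakeup. */
static int iolock_async(struct block *b)
{
    b->flags |= B_Async;          /* announce interest before checking */
    if (b->flags & B_IOLock)
        return -EAGAIN;           /* cannot have it: fail, B_Async stays set */
    b->flags |= B_IOLock;
    b->flags &= ~B_Async;         /* got it: clear B_Async and succeed */
    return 0;
}

/* Release: never clears B_Async, only wakes the thread that may want the block. */
static void iounlock(struct block *b)
{
    b->flags &= ~B_IOLock;
    if (b->flags & B_Async)
        b->wakeups++;
}
```

+    The point of leaving B_Async set on release is that the retry, not the
+    releaser, decides when the request has been satisfied.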
+
+  107 runs,
+   2 hit 'Show State' with a blocked orphan inode.
+    Two children, one EmptyIndex, one PrimaryRef, Async,Writeback
+    Both NoPhysAddr
+
+   Several runs blocked in cluster_flush or waiting for writeback.
+
+   - first case: looks like cluster flush should run but doesn't.
+        cluster_flush runs:
+           checkpoint, cleaner, cluster_allocate when full, update,
+           writepage, sync_page
+        So we have no timeout or other flush.
+      I guess if we are waiting for writeback, we need to trigger a
+      cluster_flush.
+
+   - other case - cluster_flush was called but is waiting for pending count
+       to go down.
+       Looks like cluster_reset shouldn't be changing pending_next
+
+   New hang.  Orphans not being processed:
+        inode, because InoIdx is on leaf and checkpoint isn't pushing
+        it along.
+        dir block 0 is Dirty leaf
+
+     Maybe we failed to get a mutex, and mutex_unlock doesn't wake us.
+     
+10July2010
+  Over night it looks *very* good.
+  Have one infinite loop with 31770 repeats of 
+  ORPH: [cfbe0000]0/328(2326)r4E:Valid,Dirty,Async,UninCredit,Claimed,PhysValid,
+                   Orphan(0) async(1) delete_inode(1) orphan_list(1) drop(1)
+
+  So either stuck in truncate_inode_pages, lafs_add_orphan, or inode_map_free
+    lafs_add_orphan too short.
+    tracing shows after truncate_inode_pages.
+    must be blocked in inode_map_free - maybe use AccountSpace??
+   But why isn't the truncate progressing?
+   Probably same reason:  No ReleaseSpace available.
+   Maybe we aren't cleaning because there is a free segment, and
+   we aren't checkpointing because there aren't enough yet...
+
+   Probably the cleaner has halted while CleanerBlocks - fix that.
+
+  - 0/74 is a stuck orphan because 74/0 is a dirty leaf going nowhere..
+        Need a checkpoint to release the orphan?
+   ditto for 0/331 - 331/0
+    XX/0 is InoID
+
+VFS: Busy inodes after unmount of hdb. Self-destruct in 5 seconds.  Have a nice day...
+This was pinned: [ce5914f0]16/0(2)r8F:Pinned,Phase0,PinPending,Valid,C,CI,CN,CNI,UninCredit,PhysValid leaf(1) intable(6) release(1)
+ [ce5914f0]16/0(2)r8F:Pinned,Phase0,PinPending,Valid,C,CI,CN,CNI,UninCredit,PhysValid leaf(1) intable(6) release(1) Leaf0(0) 
+------------[ cut here ]------------
+kernel BUG at /home/neilb/work/nfsbrick/fs/module/super.c:698!
+
+Forgetting 0 0
+724 != 7  (st->free.cnt after segdelete, close_segment, close_all)
+------------[ cut here ]------------
+WARNING: at /home/neilb/work/nfsbrick/fs/module/segments.c:844 lafs_check_seg_cn
+
+we called segdelete on something that was on the freelist.
+This happens when the final cluster starts a new segment.
+Need to improve the fix though.
+
+
+ lafs_inode_handle_orphan can make progress without leaving
+ anything async.  Maybe we need a return status:
+  -EAGAIN - try after async
+  -ENOMEM - try some time soon - hope memory will be better
+  0 we called orphan_release
+  anything else loops.
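+
+ That status convention could be sketched as below.  The enum and function
+ names are hypothetical; only the mapping from return value to action comes
+ from the list above.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical names for what the orphan runner does next. */
enum orphan_action {
    ORPHAN_DONE,         /* 0: orphan_release was called */
    ORPHAN_WAIT_ASYNC,   /* -EAGAIN: try again after the async event */
    ORPHAN_RETRY_SOON,   /* -ENOMEM: try some time soon */
    ORPHAN_LOOP          /* anything else: call the handler again now */
};

/* Map a handler return status onto the runner's next action. */
static enum orphan_action orphan_next_action(int status)
{
    switch (status) {
    case 0:
        return ORPHAN_DONE;
    case -EAGAIN:
        return ORPHAN_WAIT_ASYNC;
    case -ENOMEM:
        return ORPHAN_RETRY_SOON;
    default:
        return ORPHAN_LOOP;
    }
}
```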
+
+
+ - when we allocate a segment in the last checkpoint we don't
+   take references properly.
+
+ - orphan handle spinning on: 
+
+  ORPH: [ce545f08]0/290(1663)r4E:Valid,Dirty,Async,UninCredit,Claimed,PhysValid,Orphan(0) async(1) delete_inode(1) orphan_list(1) drop(1)
+   26402 calls.
+   stuck in delete_inode?? ?
+
+
+  never-ending cleaning? Maybe just computer slow ??
+
+11July2010  - on plane to Prague.
+  How can we safely access ->iblock?
+   normally iolock, but how do we get iolock?
+   - flush data to inode
+   - cluster flush takes private_lock
+   - private_lock is used to set to null.
+  I guess we use private_lock to get a reference
+  then iolock and revalidate
+  but I can probably test for NULL at any time? though that can change under private_lock
+  If we own a reference to a child with a parent, then we can use
+   rcu_dereference to get a ref which might change
+
+12july2010
+
+ ->write_inode is called by write_inode() called by __sync_single_inode
+  to handle I_DIRTY_SYNC|I_DIRTY_DATASYNC after do_writepages
+ Do we care?
+
+ change to address we already handle with checkpoints
+ change due to setattr we can handle directly if we want
+ that just cleans mtime/ctime and atime.
+   mtime/ctime calls ->dirty_inode
+   as does atime
+
+ So:
+  getattr changes set I_Dirty so that when cluster_allocate
+  happens all the changes get saved.
+  
+  when dirty_inode is called, we set I_Dirty but don't dirty
+  the inode block.
+  If anything happened to justify an inode write, it will
+  be dirty anyway.  If it isn't, this is just atime
+
+  So on dirty_inode we check if atime has changed and if so
+  we schedule change to atime file
+
+  sync_inode should write an update for the inode if I_Dirty
+  but sync_filesystems should not
+
+  Simple.  fsync calls ->fsync.  We get that to write an
+  inode update, but nothing else does.
+
+  Possibly all directory updates could be chained onto a
+  directory and only written when fsync is requested before
+  a checkpoint.
+  both sides of a rename ??
+  leave that for later.
+
+WritePhase - what is that all about?
+  We must not change a block while it is being written to previous
+    phase, else we corrupt causality.
+  But we probably don't want to change it any way as that would
+  mess up any checksum or duplication.
+
+ So we want to ignore WritePhase - scrap it.
+ Before changing a block, we must iolock_written
+  - all dir updates
+  - inode update in fsync
+  - orphan file
+  - segusage?
+  - quotas?
+
+ But what about regular data.  If prepare_write finds a block in
+ writeback, do I need to wait, or can I just mark it dirty in 
+ commit_write?  If no checksum and no duplication applies, this should
+ be fine.
+
+16July2010
+ BUT e.g. dir operations are in particular phases.  If the dirblock
+ is pinned to the old phase, we need to flush it, then wait for io
+ to complete.  So we need lafs_phase_wait as well as iolock_written.
+ This is already done by pin_dblock.
+ I wonder if we need a way to accelerate pinned blocks that are being
+ waited for - probably not, they should be done early.
+
+ So we probably want to iolock after phase_wait in pin_dblock.
+ Though dir.c pins early.
+ I need to review all of this and get it right.
+
+ So:
+  - we aren't allowed to block much holding  checkpoint_lock as
+    checkpoint_start waits for that.  However phase_wait will only
+    block if a new checkpoint has started already, so there is no
+    chance of phase_wait ever blocking checkpoint_start.
+    So it is safe to call phase_wait in checkpoint_lock.
+    phase_wait will wait until block is written, added back to
+    the lru clean, then found and flipped... I wonder if that is 
+    good - it keeps parent from being a leaf, and so written, until
+    child write has completed.
+    We want to phase-flip a block as soon as it is allocated by cluster_flush.
+
+    With directory blocks, i_mutex stops other changes, so an early iolock_written
+    will leave the block clean and phase won't be an issue.
+
+    With inode-map blocks.. we:
+      set B_Pinned to ensure no-one writes except for phase change
+        do that after lock_written so it starts safe.
+      once we have checkpointlock, wait for phase if needed.
+      then lock_written again which should be instant but ensures
+      that block is locked while we change it...
+
+  I think I want
+    - refile to call phase flip if index is not dirty and is in wrong phase
+       and has no pinned children in that phase.
+    - Only clear PinPending if we have i_mutex or refcnt == 0
+    - before transaction:
+          lock_written / set PinPending / unlock
+      then inside cluster_lock
+          lock_written pin / change / dirty / unlock
+      it will only wait for writeout if phase changed.
+      so don't need phase_wait
+     but want pre-pin then pindblock
+     Transactions are:
+        dir create/delete/update - DONE
+        inode allocate/deallocate - on inode map DONE
+        setattr  DONE
+       orphan set/change/discard
+
+     Orphans are a little different as when we compact the
+     file, the orphan file block 'owned' by the orphan block
+     can change.  As long as we keep them all PinPending it
+     should be fine though.
+     I think that every block in the orphan file will always be
+     PinPending ???
+
+    OK - done most of that.
+    Early phase_flip is awkward.  We need an iolock to phase_flip,
+    and we don't have one.  The phase_flip could cause incorporation
+    which cannot happen until the write completes.  So I guess
+    we leave it as it is.
+
+
+   FIXME what about inode data block - cluster_allocate is removing
+    PinPending after making them dirty from the index block..
+
+  If all free inode numbers are B_Claimed, I don't think we allocate
+  a new block... yes we do, as 'restarted' is local to caller.
+
+ Also
+  each device has a number of flags
+   - new metadata can go here
+   - new data can go here
+   - clean data can go here
+   - clean metadata can go here
+   - non-logged segments allowed
+   - priority clean - any segment can be cleaned
+   - dev is shared and read-only - no state-block updates
+
+  state block needs a uuid for an ro-filesystem that this is
+  layered on.
+
+  Is metadata an issue?
+    We might want it on a faster device, but ditto for directories
+    and for some data.  So probably skip that.
+
+  Have separate segment tables for:
+    - can have new data
+    - can have clean data but not new. (this often empty)
+
+  Clean data can go to new-not-clean if nothing else
+  new data can go to clean-not-new ?? if not sync??
+  Maybe call them 'prefer clean' and 'prefer new'
+
+  I think we want:
+    'no sync new' - don't write new data, unless it is in big chunks and
+           can wait for checkpoint to be 'synced'
+    'no write' - never write anything - this is readonly.
+               used for removing a device from the fs.
+
+  A 'no sync new' device can have single-block segments.
+  This doesn't allow compression, but avoids any need to clean
+  In this case we don't store youth and the segusage is 32 bits per segment.
+  That means  - for 1K block size - 0.5% of device used for segusage.  That
+  feels high.  For 4K, 1/1024 so a gigabyte per terabyte.
+  Then limited to 29 snapshots plus base fs, and 2 bits to record bad blocks.
+
+  Other segusage for 29 snaps is 1/million of space used.
+  So we 'waste' 0.1% of device for no secondary cleaning.
+  Can still do defrag though.
+
+  clearing a snapshot on a 1TB device writes 1GB of data!! potentially.
+  as does creating a snapshot.
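+
+  The arithmetic above, as a small sketch.  The function name is made up and
+  it assumes one 32-bit (4-byte) segusage entry per single-block segment.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Bytes of segusage needed when every segment is a single block.
 * dev_bytes:   size of the device
 * block_size:  block (and so segment) size in bytes
 * entry_bytes: size of one segusage entry, 4 for the 32-bit case
 */
static uint64_t segusage_bytes(uint64_t dev_bytes, uint32_t block_size,
                               uint32_t entry_bytes)
{
    return dev_bytes / block_size * entry_bytes;
}
```

+  For a 1TB device this gives 1GB at 4K blocks (1/1024 of the device) and
+  4GB at 1K blocks (about 0.4%).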
+
+18jul2010
+ If lafs were cluster enabled we would want multiple checkpoint clusters,
+ one for each node. When a node crashes some node would need to find and
+ roll-forward.  For single node failure, it is enough to broadcast cluster
+ address to all others.  For whole-cluster failure, need to either list all
+ in superblock or link from main write cluster.
+
+ When writing to multiple devices we may want multiple write clusters
+ active for new data.  These all need to be findable from checkpoint cluster
+ so linking sounds good.
+ Having a single 'fork' link in cluster head might work but doesn't scale to large
+ cluster.  It doesn't need to be committed to other, nor does checkpoint end, so
+ that should be ok.
+ Could have a special group_head to list other clusters for roll forward.
+ If we put fsnum first, a large value - 0xffffffff - could easily mean
+ something else
+ 
+ Or every  cluster head could point to an alternate stream, and if we want many
+ quickly, each simply points to another, so we create a chain across all writers.
+
+
+ Another issue...
+  When we 'sync' we don't wait for blocks until after the checkpoint is started,
+  and we know that will be driven through to CheckpointEnd which will commit and
+  release everything.
+  However 'fsync' doesn't have the same guarantee.  The sync_page call will ensure
+  the data has been written, but we don't know it is safe until the next
+  header is written.  So we need to push out the next cluster promptly.
+
+  So if sync_page is called on a page in writeback, then we mark the cluster as
+  synchronous.  When a sync cluster completes, the next (or even next+1) clusters
+  are flushed out promptly.  Hopefully they won't be empty on a reasonably busy system,
+  but it is OK if they are.
+
+  If a block is writeback for the cleaner.. then as the cluster is VerifyNone, as soon
+  as the write completes the block will be released.
+
+  So: to clarify sync_page:
+    This can be called when page is in writeback or locked.
+    If locked there is nothing we can do except maybe unplug the read queue.
+    If page is in writeback and block is dirty, then it is probably in
+    a cluster queue and we should flush the cluster and the next.
+    If page is in writeback and block is not dirty, but is writeback,
+    just flush one cluster.
+    But we don't want these cluster flushes to start while the previous is
+    still outstanding else we stop new requests from being added.
+    So as soon as the cluster can be flushed we flush, but no sooner.
+    I guess we use FlushNeeded and make that be less hasty.
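+
+    The sync_page cases above, reduced to a decision function.  This is a
+    sketch: the name and the boolean parameters are invented, and it only
+    says how many clusters to ask for, not when the flush may start.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * How many write clusters sync_page should ask to have flushed.
 * Returns 0 when there is nothing useful to flush (e.g. page only locked).
 */
static int sync_page_clusters_to_flush(bool page_writeback, bool block_dirty,
                                       bool block_writeback)
{
    if (!page_writeback)
        return 0;   /* locked for read: at most unplug the read queue */
    if (block_dirty)
        return 2;   /* probably queued in a cluster: flush it and the next */
    if (block_writeback)
        return 1;   /* already written: just need the next header out */
    return 0;
}
```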
+
+19July2010
+
+  superblocks....
+   We currently have a superblock for each device.
+   I cannot see a good reason for that.
+   We can just bdev_claim for 'this' filesystem.
+   Rather we should have a number of anon superblocks,
+    one for each fileset, then one for each snapshot.
+   Do we use different fs types? probably yes
+       lafs - main filesystem made from devices
+       lafs_subset - subordinate fileset, given a path to  fileset object
+                 can have 'create' option when given an empty directory.
+       lafs_snap - snapshot - given a path to filesys and textname.
+
+    Cannot create a snap of a subset, only of the whole filesystem
+    Is it OK to mount either snap of subset or subset of snap?
+    It probably does, so need to use the same filesystem type for both.
+    Maybe lafs_sub or sublafs. Needs path to directory.
+    can be given 'snap=foo'.
+    No: a given filesystem may not exist in a snapshot.  You need to
+    mount the snapshot first, then the subset of the snapshot.
+    So we have three types as above.  All subsets as 'lafs_subset',
+    whether they are subset of main or of snapshot.
+
+    Should we be able to create a snapshot or subset without mounting it?
+    It doesn't really seem necessary but might be elegant..
+
+    remount doesn't seem the right way to edit a filesystem as it forces
+     some cache flushing.
+    What do we want to edit?
+          - add device,  remove device
+          - add/remove snapshot by name
+          - add/remove subset?  Not needed, just mkdir/rmdir and mount to convert
+                     empty dir to subset.
+          - change cleaner settings??
+    Could have remount as an option. If problem find other option.
+
+    While cleaning (which is always) we potentially need all superblocks
+    available as we might need to load blocks in those filesystems to
+    relocate them.
+    Unfortunately each super needs to be in a global list so there is a cost
+    in having them appear and disappear. I guess that is not a big deal.  They
+    are refcounted and will disappear cleanly when the count hits zero.
+
+    So:
+     DONE - change all prime_sb->blocksize refs to fs->blocksize
+     DONE - create an anon sb for the main filesystem
+     DONE - discard the device sbs, just bd_claim the devices and add to list
+     - use lafs_subset for creating/mounting subsets.
+
+  Changed s_fs_info to point to the TypeInodeFile for the super, but
+   for root/snapshot that doesn't exist early enough to differentiate the
+   super in sget.
+   So we make an inode before the super exists and attach it after.
+   Need to do all that get_new_inode does.
+        inode_stat.nr_inodes++   - just don't generic_forget the inode
+        add to inode_in_use -   seems pointless - just set i_list to something
+        add to sb->s_inodes - if we don't it won't flush - maybe that is good?
+        add to hash - don't want
+        i_state == lock|new - only really needed if hashed.
+    but there is lots of initialisation in alloc_inode that we cannot access!!
+
+   Problem is that we need s_fs_info to uniquely identify the fs with something
+   that can be set in the spinlock, so allocating an inode is out.
+   And also to get to the filesystem metadata which is in the inode.
+   I guess we allocate a little something that stores identifier and later inode.
+     for lafs  we use uuid
+     for subset we use just the inode
+     for snapshot we use fs and number
+
+
+25July2010
+  superblocks:
+   - sget gives us an active super_block.  We need to attach to a vfsmnt
+     using simple_set_mnt, or call deactivate_locked_super.
+   - sget's set should call set_anon_super
+   - kill_sb (called by deactivate_super) should then call kill_anon_super
+
+  If we have a vfsmnt, we have an active reference, so we can atomic_inc
+  s_active safely.  So use this to allow snapshots and subsets to hold a
+  ref on the prime_sb and thence on the 'fs'.
+
+26July2010
+ - DONE  need to set MS_ACTIVE somewhere!!
+ - FIXME if an inode is being dropped when iget comes in, it gets confused
+    and the inode appears to be deleted.
+
+   We cannot really break the dblock <-> inode link until after write_inode_now,
+   but there is no call-back before generic_detach_inode is complete.
+   The last is write_inode which is only called if I_DIRTY_something.
+   Maybe when writeback completes on an inode dblock, we should check if
+   the inode is I_WILL_FREE and if so, we break the link...
+
+   Or maybe when we find my_inode set we can check the block and if it isn't
+   dirty or being deleted we break the link directly... That makes more sense.
+
+   So... what is the deal with freeing inodes???
+     ->iblock is like a hashtable reference.  It is not refcounted
+             It gets set under private_lock
+      iblock is freed by memory pressure or lafs_release_index from
+             destroy_inode
+     when refcount of iblock is non-zero, ->dblock ref is counted,
+     else it is not.
+     dblock is set to NULL if I_Destroyed, or when dblock is discarded,
+       (under lafs_hash_lock)
+       and set to 'b' in lafs_iget and lafs_inode_dblock
+
+     We can drop the dblock link as soon as iblock has no reference
+
+    probably get clear_inode to break the link if possible, which it should
+    be on 'forget_inode'.  Then lafs_iget can wait on the bit_waitqueue.
+    or maybe do clear_inode itself
+
+   FIXME when we drop dblock we must clear iblock! as getiref iblock assumes
+      dblock is not NULL.
+
+28July2010
+  So: ->dblock and ->my_inode need to be clarified.
+
+  Neither is a counted reference - the idea is that either can be freed and
+  will destroy the pointer at the time so if the pointer is there, the
+  object must be ... but we need locking for that.
+  ->dblock is reasonably protected by private_lock, though if ->iblock exists
+  we hold a ref of ->dblock so we can access it more safely.
+
+  Need to check getiref_locked knows ->dblock exists when called on iblock
+  and lafs_inode_fillblock
+   yes, both safe!
+
+ But ->my_inode needs locking too so the inode can safely disappear without
+ having to wait for the data block to go.  After all data blocks come in sets,
+ and one shouldn't keep others with inodes.
+ So something light-weight like rcu might work.
+ We use call_rcu to free the inode and rcu_readlock to access ->my_inode
+
+ Yes, that will work.  Occasionally we will want an igrab too, but not
+ often.
+ Should look into rcu for index hash table and ->iblock as well.
+ Current ->iblock is only cleared when the block is freed .. I guess that is fine...
+
+
+31Jul2010
+  rcu protection of ->my_inode
+  A/ orphan inodes - are they protected?  
+  B/ orphan blocks - are the inodes of those protected? Probably...
+
+  inodes are 'orphan' for two reasons
+    1/ a truncate is in progress
+    2/ there are no remaining links, so inode should be truncated/deleted
+       on restart.
+
+  The second precludes us from holding a refcount on any orphan inode,
+  else it would never get deleted.
+  So we must assert that an inode with I_Deleting or I_Trunc has an implied
+  reference and so delete must be delayed... not quite.
+  If we set I_Trunc but not I_Deleting, then we igrab the inode until
+  I_Trunc is cleared.  While we hold the igrab, I_Deleting cannot possibly
+  be set as that is set when last ref is dropped.
+
+01Aug2010
+  FIXME lafs_pin_dblock in lafs_dir_handle_orphan needed to be ASYNC.
+    .. and in lafs_orphan_release
+  Well... only iolock_written can be a problem, and our rules require that
+  only phase-change writeout can set writeback.  So the cleaner can never
+  wait for writeout here.  Maybe it can wait for a lock, and maybe we don't
+  really need a lock, just 'wait_writeback'.
+08Aug2010
+  So cleaner is in run_orphans, dir_handle_orphan pin_dblock iolock_written
+   It is writeback waiting on 74/BIGNUM from file.c:329.  So writepage
+   tried to write a block in a directory .. but it is PinPending so that
+   must have been set after writepage got it...
+   lafs_dir_handle_orphan gets an async lock, then sets PinPending.
+   If write_page is before that, it will have the lock and dir_handle will try later.
+   If write_page is after it will block on the lock, or see PinPending and
+   release the lock.
+   So someone else must be clearing PinPending!
+     - checkpoint clears and re-sets under the lock, so that is safe
+     - dir.c clears under i_mutex
+         dir_handle_orphans always hold i_mutex ... or does it.
+     - refile drops when the last non-lru reference goes.
+     - inode_map_new_abort clears for inode
+   No, not that - just bad test on result of iolock_written_async ;-(
+
+  Now have an interesting deadlock.
+    rm in lafs_delete_inode in inode_map_free is waiting for the block to
+    flush which requires the cleaner.
+    The cleaner thread in inode_handle_orphan is calling erase_dblock
+     on the same inode which blocks while inode_map_free has it locked....
+     no, not same block - just waiting for writeout which requires cleaner.
+     lafs_erase_dblock from inode_map_free must be async!
+   pin_dblock in lafs_orphan_release must too.... no - only the setting of
+   PinPending needs to be async or outside of cleaner, which it is.
+
+  Ok, got that fixed.  All seems happy again, time for a commit.
+
+
+09Aug2010
+   14b/  What backing-dev to show the filesystem.
+     backing-dev holds:
+         congested state
+         unplug function
+         read-ahead info
+         throughput measurements
+
+    Much of that is for generic code to use.  We need to:
+     - provide an unplug function that unplugs all devices
+     - provide a congested function which checks all devices,
+       or for 'write' - at least the device we are writing to.
+
+    How do we set the backing device?
+    The 'struct address_space' points to one, as does struct super_block.
+    set_anon_super establishes a null bdi, set_bdev_super gets it from the
+    bdev->queue
+
+    We need to bdi_init and bdi_register (if no error) our bdi.
+    bdi_destroy calls unregister and reverses bdi_init
+    or just bdi_setup_and_register
+    but bdi_register_dev gives a better name - isn't this sick!!!
+
+    Partly done ... but I'm hitting more bugs :-(
+
+  -Checkpoint cannot complete because...
+   Lots of dirty inodes that are orphans are not pinned!! I
+   guess the InoIdx is ??
+   Most of them don't have InoIdx(?)  Only '8' does.
+   8/0 is also an orphan and is on wc[0]
+
+   It seems that this block keeps getting re-written and stays in
+   Phase0.
+   Is that because it is a data block with PinPending.. No, that works
+   as long as it becomes un-dirty: we drop pinpending, refile, and set again
+
+   It is being dirtied again during writeout for the checkpoint
+   so it doesn't get to change phase when we lift PinPending.
+   I guess we mustn't dirty it if it is in the old phase.
+
+  -And twice inode 17 is deleted without B_Orphan being set!
+   That is the only file that exists before we mount.
+     Problem was orphan_release instead of orphan_forget
+     I wonder why it only affected 17...
+
+  -at shutdown we drop an inode and try to invalidate pages, but
+   root inode is still dirty - I wonder why.
+     The dblock is in a different phase to the iblock.
+     In checkpoint we wait until root iblock changes phase, but
+     not root dblock!
+
+
+  UP TO:
+    I'm testing subordinate filesystems, which don't work yet.
+    I need to create the root directory and inode map.
+    Obviously I cannot record the inode map file in the inode map....
+      inode_map should ignore everything less than 16? 8? 2?
+    Need to make sure creating with a given inode number works.
+    Need to make sure auto-allocate inum is never less than 16.
+
+11Aug2010
+ How to map from filesys inode to superblock?
+  Need in
+    lafs_iget_fs
+    choose_free_inum - to get inode-1
+    ditto in inode_map_free
+    lafs_put_super has something odd with i_sb
+
+  Could do an sget search..
+  Or could just store it in the inode (but not in i_sb!!)
+  inode already a bit large though.
+  Do it for now, but make a note to trim the fs_md part of inode 
+  into a separate allocation.
+
+  lafs_new_inode should take an 'sb' not a 'filesys'.
+  In fact, get rid of filesys.  It is
+    MAP(i->i_sb->s_fs_info)->root.
+
+ 15f - timestamps for roll-forward.
+    The writeout can be much later, but logging the mtime is fairly
+    boring ... we could log mtime in the group head, which might be cheap
+    enough.  How much precision is needed, and against what base?
+    probably mtime of last checkpoint from superblock.  That should
+    be not more than 2048 seconds ago, so 16 bits gets us 30msec...
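+    The arithmetic: 2048 seconds spread over 2^16 values is 1/32 of a
+    second (31.25msec) per unit.  A minimal sketch of such an encoding,
+    relative to the last checkpoint time.  The names and the choice of
+    1/32-second units are assumptions, not what lafs does:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical constants: the notes only say "2048 seconds" and "16 bits". */
#define MTIME_WINDOW_SEC    2048u
#define MTIME_UNITS_PER_SEC (65536u / MTIME_WINDOW_SEC)   /* 32 units per second */

/* Encode (mtime - checkpoint time) in 1/32-second units, rounding down. */
static uint16_t mtime_delta_encode(uint32_t delta_sec, uint32_t delta_nsec)
{
    uint32_t frac;

    assert(delta_sec < MTIME_WINDOW_SEC && delta_nsec < 1000000000u);
    frac = (uint32_t)(((uint64_t)delta_nsec * MTIME_UNITS_PER_SEC) / 1000000000u);
    return (uint16_t)(delta_sec * MTIME_UNITS_PER_SEC + frac);
}

/* Decode back to microseconds since the checkpoint. */
static uint64_t mtime_delta_decode_usec(uint16_t v)
{
    return (uint64_t)v * 1000000u / MTIME_UNITS_PER_SEC;
}
```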
+
+14Aug2010
+ 15l - decay youth info.
+    Need to decay:
+         youth_next and checkpoint_youth in 'struct fs'
+         all blocks in youth files on storage
+         all scores in seg-tracker.
+           - not needed, they'll get updated in normal progress
+            and being wrong for a while is no cost.
+    ensure correct youth is stored in lafs_free_get
+    check little-endian conversion of all youth accesses
+
+    checkpoint_youth only used by thread, so no locking needed
+    youth_next protected by fs->lock
+
+ 15m - share orphans and cleaning list_heads in datablock
+   It certainly is possible to clean an orphan but it is very unlikely
+   as it will have changed recently, or be changing soon.
+   The cleaner could just dirty any B_Orphan it finds.
+   But if orphan finds a block on the list, it must be careful...
+   I guess when cleaner drops a cleaning ref, it should check if the block
+   is an orphan, and re-queue if it is.
+
+ 15o - async blocks just have an extra refcount.
+   This could:
+     - keep PinPending set
+     - keep an index block pinned - will phase-flip
+     - keep ->parent link
+   but not get in the way of a checkpoint.
+
+   Should we clear any that we find though?
+   Normally async is only used by cleaner, orphan processing, or segscan
+   So it should all be finished when we do a checkpoint.
+
+   So if checkpoint, or release_page, finds an async block, drop it.
+
+ 15r - further optimisations in cleaner to avoid lookups.
+   We have fsnum,inum,blocknum and cluster seq number and trunc num.
+
+   I want to introduce more async though.  Currently it only loads
+   one inode at a time.
+   To do more, I need to mark inodes as 'done' when they are and always
+   restart from the start of the cluster (only do one cluster at a time
+   for now).
+   So if we get all the way through a cluster with no 'EAGAIN' we finish
+   with the cluster.
+
+ 15y - when could a directory block become an orphan?
+    - when deleting the last entry - we don't know if it can be fully
+      deleted until we look in next block
+    - when deleting an entry follows a chain back to the first block
+    - when deleting the last entry in the block.
+
+    So it could be an orphan if the entry found:
+        - is at end of block
+       - is first entry
+       - is only entry
+     or first entry is already deleted.
+
+15Aug2010
+  looking at flushing etc when run out of space.
+  We often force a checkpoint when it won't do any good as
+  nothing has been cleaned.
+  In fact we write lots of dead checkpoints to 0/0 until it is full,
+  then move on, clean 0/0 and suddenly have space.
+  We shouldn't do that.  sync should be what pushes us forwards.
+  Maybe that is fixed..
+
+  InoIdx blocks still cause confusion.  Should they ever have credits?
+  or do only the data blocks have those?  Certainly they cannot have
+  SegRef.
+  And there is confusion in my mind whether data blocks can be pinned 
+  while the InoIdx block is - need to clarify that.
+
+
+13Sep2010 - now, where was I...
+ - I've just been dropping the use of SegRef on InoIdx blocks, where it makes no sense.
+ - test run: block.c:660 - no credits available while dirtying an InoIdx block during
+   orphan handling.  lafs_reserve_block (under checkpoint lock) should have set credit.
+   Only I just changed reserve_block to do that dblock instead - I wonder why.
+   OK, I think I cleaned that up...
+
+
+ - make_orphan is hanging in checkpoint_unlock_wait. So orphan_pin returned -EAGAIN
+   so pin_dblock did too.  So reserve_block did too, so prealloc or summary_alloc or seg_ref_block
+   returned error.
+   Problem is that we don't push a checkpoint when cleaner runs out of things to do.
+   But we don't want to go back to pushing a checkpoint too often.
+   Maybe the problem is that we only force the checkpoint when we have enough space to do
+   new allocations, but we need to force it earlier if nothing new can be cleaned.
+
+   Once we set EmergencyClean, lafs_reserve_block will stop returning EAGAIN for newspace, so
+   we need to wake 'checkpoint_wait' then.
+   But for ReleaseSpace we want to wake on every checkpoint... we probably do anyway.
+   ...anyway, that is sorted now at commit  95b6b05e460
+
+
+  So: InoIdx blocks.
+    - These never get SegRef as that is meaningless - done.
+    - These can have credits.  It possibly isn't necessary but it makes things
+      easier.  They are 'written' by transferring the credits to the data block, or discarding them.
+    - I think dblock and iblock can both be pinned
+      The problem this caused was that the dblock might get processed as a leaf before iblock.
+      We now have lafs_is_leaf which causes dblock not to be a leaf even if it is pinned, if the iblock
+      is pinned to the same phase.
+      lafs_phase_flip refiles the dblock so that it goes back on the leaf list as does lafs_refile when
+      it unpins an iblock
+      So lafs_pin_dblock doesn't need to pin the inode instead.
+   OK, that is fixed. - commit f1c05293bfd Mon Sep 13 15:07:27 2010 +1000
+
+ 15u - I don't need to get a segref there, but I need to have one from the original dirty block,
+       so fix that up - commit Mon Sep 13 15:28:08 2010 +100
+
+ 15v - What do we have?
+       lafs_dirty_dblock:  set Dirty, clear Credit clear NCredit
+                           set Uninc, clear Icredit clear NICredit
+       lafs_dirty_iblock:  set dirty, clear credit
+                           test uninc, clear ICredit, set Unincredit - not essential
+       mark_cleaning:      test realloc, / alloc / set realloc
+                           test dirty / clear realloc/ set credit
+                           set uninc clear icredit
+       cleaner_flush:      set dirty, clear realloc, clear credit
+                           test dirty, clear realloc set credit
+       flush_data_to_inode:
+       lafs_cluster_allocate - there is some odd code there!!
+       flip_phase
+       lafs_allocated_block
+
+       all rather different really.
+       Just do some tiny tidyup in lafs_cluster_allocate when dirtying dblock
+
+ 15w/ Space used by cluster updates??
+       It is all fine - just some confusion of function names.
+
+ 15z/ logging symlink creation.
+      Do I need to log the content? It needs to be safe on a dir sync, and you cannot sync the
+      symlink itself.  So I guess we queue the block for writeout so it will go with the
+      dir update.
+      Yes, that works: Mon Sep 13 17:33:54 2010 +100
+
+ 15ab/ already did that in commit f90959e6f492b6
+
+
+ 15ac/ How can we trigger write-out of dirty index blocks which have no pin-count, thus allowing them to
+    be freed after the write completes?  A checkpoint could do it, but that would write out index blocks
+    that cannot be freed too.  A checkpoint would only be good after lots of data pages had been written.
+    We could just wait and let other processes kick in..
+
+    I don't think we need to do anything.  lafs_shrinker doesn't really know how tight memory
+    is, and periodic checkpoint will free up any memory that we are pinning.
+
+    .... but something is needed.  We need some trigger to write dirty index blocks
+    Maybe:
+       - a timeout on checkpoints - every dirty_expire_interval - but that isn't exported.
+        DONE THAT.
+
+    Not sure this is a complete solution.  I might want to incorp/flush index blocks when they
+    have no dirty children, but I'm not sure about that.
+
+14sep2010
+  15ad - lafs_add_block_address call from lafs_phase_flip - do I handle failure correctly?
+     failure happens when b2 is data block and uninc table is full so we called incorporate on the parent.
+    This could split the parent which means the block could have been re-parented - it would have been in the
+    child list and so found and fixed.
+    lafs_allocated_block, when this happens, checks that the parent is dirty/realloc as appropriate.
+    In this case, realloc isn't an issue, only dirty.  lafs_incorporate must have made it dirty and
+    it won't get written while it has these in-phase children, so all is happy.
+
+ 15ae - refile race?  Someone might set B_IOLock before removing from lru, so
+          onlru is 0 and refcnt is elevated so it doesn't seem to be unused.
+          But then whoever has the ref will refile again when dropping it and
+          so the right thing will be done.
+        But more generally, do we really want the lru etc to own a counted reference?
+        If it didn't:
+          - we would need to refile when removing from any list
+          - we would need to get a ref when removing from list.
+          uhmmm..
+
+    lafs_refile does:
+            clear PinPending  if refcnt is low
+            unpin   if not PinPending, or dirty etc and data or refcnt is low
+            place on leaf list - if pinned etc - this can be earlier
+            drop parent link if refcnt is low, and not pinned etc
+            handle dblock issues
+
+        if lru was not refcounted, then the only things we might do when refcnt isn't zero are:
+            unpin a dblock once it is not dirty
+            add to lru
+
+       But if we don't count lru, then we can lose the refcount on dblock
+
+     Hmmm - we cannot leave things on the leaf list forever as they thus hold a reference and
+       don't get freed.
+
+   I think I want things on 'leafs' list to not hold a counted reference.
+   Things *only* get removed while walking the list.
+   InoIdx blocks hold a ref on the dblock both when counted and some other time.  Possibly
+    when pinned.  This ensures they are held while the InoIdx is a real leaf.
+   But: When we take that first ref, how do we know the dblock even exists?
+
+   What is the lifetime of ->dblock?
+         removed when page is released
+         set by lafs_import_inode
+         set by lafs_inode_dblock
+         removed by clear_inode
+   So if I don't hold a ref, I always need to be ready to call lafs_inode_dblock
+   This is currently callers of getiref_locked
+          - erase_dblock_locked ?? shouldn't need a lock
+          - ihash_lookup - never on InoIdx
+          - lafs_make_iblock - already have dblock
+    So none of those really need lafs_inode_dblock
+    What about when we set Pinned
+         only really from set_phase ... messy.
+    What about when we set ->parent
+           grow index tree - not relevant
+           ditto do_incorporate_*
+           block_adopt
+              Can be called on InoIdx from:
+                lafs_make_iblock  only!!
+
+15sep2010
+
+  I have tidied lafs_refile up a lot but I need to make locking a lot cleaner.
+  In particular I want a single lock I can take when the refcnt hits zero which will ensure no ref
+  is taken until I have finished my cleanup.  I suspect the inode private_lock is the one to use.
+  I also need to clean up getiref_locked and getref_locked - having both is awkward.
+
+  So: when are they called?
+
+   getref_locked:
+     lafs_get_flushable - hold fs->lock
+     first_in_seg       - holds private_lock, but shouldn't need _locked as hold a ref through child.
+     (getiref_locked)
+     pin_all_children   - hold private_lock
+     find_better        - private_lock
+   getdref_locked
+     lafs_invalidate_page - to get a ref on each block to either erase or invalidate it
+                          presumably page is locked
+     lafs_get_block     - holds private_lock - plus once with only page_lock
+     lafs_release_page  - holds private_lock
+     (getiref_locked on dblock) - no locking
+     lafs_inode_dblock  - private_lock of my_inode...
+     lafs_delete_inode  - private_lock of my_inode
+     lafs_destroy_inode - ditto
+     lafs_drop_inode    - ditto
+  getiref_locked
+     erase_dblock_locked - private_lock
+     lafs_get_flushable - fs->lock
+     ihash_lookup       - lafs_hash_lock
+     lafs_make_iblock   - private_lock
+
+  So private_lock looks like a good choice.  Issues are:
+       - what is the story with dblock on my_inode->private_lock
+       - what is the lock ordering
+       - what can refile negate that we need to be careful of.
+         i.e. we want to keep things stable while refile does its tests, but what do we need to keep
+           stable for others?
+            + we break the parent link?? and so the siblings link
+            + move things to freelist
+            + can put_page
+            + free dblock if not page_private
+
+   Lock_ordering.  private_lock, then fs->lock, then lafs_hash_lock
+   So if we have to hold lafs_hash_lock, we increment refcnt, drop the lock, get/drop private_lock
+
+   This is getting messy - I need something nice and clear.
+   So:
+     Index Blocks.
+        If Pinned, either has references or is on a leaf list - possibly both
+        If no references and not pinned then not on leaf list, so can be on free list
+
+        Pinned can only be set when there are references, and can only be cleared under private_lock
+                  This is violated by phase_flip, which badly reads refcnt
+        If refcnt is zero and not pinned, then can be moved to free_list
+        If on freelist and refcnt is zero under hash_lock, can be freed
+
+        So if lafs_get_flushable finds a block that is not pinned, then we can delete and ignore.
+            Someone else must hold a ref and will put it and it will refile.  but that is pointless as
+            it could immediately be cleared after we test Pinned.
+
+        lafs_get_flushable should get a reference before deleting from list.  This ensures it won't be freed
+         by lafs_shrinker, though it could be on the free list.  If it is, then it isn't pinned so it is not
+         interesting to us.
+
+
+       Data Blocks:
+         These are removed from lru when freed - we just need the extra refcnt check after removing from list.
+         No we don't - these are only pinned while refcnt or dirty and can only lose dirty while refcnt
+         so they cannot disappear
+
+    What is the story with my_inode->private_lock though?  This is used to protect ->dblock accesses.
+    I guess we need to get or hold the other lock .... look at what the race is - what else is checked when dblock is cleared?
+              dblock is cleared in refile for the dblock,
+              or in clear_inode under the inode private lock.
+
+    So:
+     There are various places that hold a non-counted reference to a block.
+     These include
+            - index hash table            lafs_hash_lock
+            - index free list             lafs_hash_lock
+            - phase_leafs / clean_leafs   fs->lock                        only if pinned
+            - inode->iblock               lafs_hash_lock
+            - inode->dblock               inode->i_data.private_lock
+
+     Each of these is protected by its own lock, but not all the same lock.
+     When we turn one of these into a counted reference, we increment refcnt under the local lock,
+     then after dropping that lock we take and drop b->inode->i_data.private_lock to ensure refile has
+     finished.  This must be done before changing/using the block in any way.
+     To free an index block it must first be removed from _leafs list.  Then if the refcount is still
+     zero it can be freed - or put on freelist and subsequently freed.
+     An InoIdx block - we need to hold hash_lock as well as private_lock to take a reference.
+     To free a data block we similarly need to recheck refcnt after removing from leaf list.
+     If it is in an inode file we also take that inode's private_lock to clear dblock.
+         We use rcu to get the inode, then lock it, then clear dblock if refcnt is still zero.
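+     A userspace sketch of the rule above: bump refcnt under the local
+     lock, drop it, then take and drop private_lock so any refile that is
+     part-way through has finished before the block is used.  The toy_*
+     names and the spinlock are stand-ins invented for illustration; this
+     is not the lafs code.

```c
#include <assert.h>
#include <stdatomic.h>

/* Toy spinlock standing in for the kernel's spinlock_t. */
struct toy_lock { atomic_flag held; };

static void toy_acquire(struct toy_lock *l)
{
    while (atomic_flag_test_and_set(&l->held))
        ;   /* spin */
}

static void toy_release(struct toy_lock *l)
{
    atomic_flag_clear(&l->held);
}

struct toy_block {
    atomic_int refcnt;
    struct toy_lock *private_lock;   /* stands in for b->inode->i_data.private_lock */
};

/* Turn an uncounted pointer (found under 'local') into a counted reference. */
static struct toy_block *toy_get_counted(struct toy_lock *local, struct toy_block *b)
{
    toy_acquire(local);
    atomic_fetch_add(&b->refcnt, 1);
    toy_release(local);

    /* Lock order is private_lock before the local lock, so we may only
     * take private_lock now.  Taking and dropping it waits out any refile. */
    toy_acquire(b->private_lock);
    toy_release(b->private_lock);
    return b;
}

/* Refile side: recheck refcnt under private_lock before freeing anything. */
static int toy_refile_may_free(struct toy_block *b)
{
    int unused;

    toy_acquire(b->private_lock);
    unused = atomic_load(&b->refcnt) == 0;
    toy_release(b->private_lock);
    return unused;
}

/* Usage example: returns 1 if the block is only freeable with no counted ref. */
static int toy_ref_demo(void)
{
    struct toy_lock hash_lock = { ATOMIC_FLAG_INIT };
    struct toy_lock private_lock = { ATOMIC_FLAG_INIT };
    struct toy_block b;
    int free_before, free_during;

    atomic_init(&b.refcnt, 0);
    b.private_lock = &private_lock;

    free_before = toy_refile_may_free(&b);
    toy_get_counted(&hash_lock, &b);
    free_during = toy_refile_may_free(&b);
    atomic_fetch_sub(&b.refcnt, 1);
    return free_before && !free_during && toy_refile_may_free(&b);
}
```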
+
+17sep2010
+   review lafs_refile - are some of those tests redundant? - yes, one is gone.
+
+ So:
+  15ah - What about truncated blocks sitting on an uninc chain?
+       I don't see the problem.  It will eventually get incorporated and do the right thing...
+
+  15ai - We don't want to touch the youth block during a checkpoint else it is awkward to write it out in
+      a stable way.....
+    No, I don't think that is really a problem.  It only gets written out in the tail of the checkpoint after
+    the root.  I guess it could then get a youth number for a segment that it has no count for, if the root is
+    written at the end of one segment and the segusage/youth written at the start of the next.
+
+    But I think roll-forward is missing something.  Blocks in the next phase need to be counted into segusage.
+    Are they?  oh, yes - they are. - cleaned and index blocks are ignored so there might be some wasted space,
+    but the important blocks picked up by the roll-forward are handled.
+
+    So....
+
+     A checkpoint could cover multiple segments.  We need to be sure these each get a valid youth number.
+     Probably most of them will, but we need a consistent approach to be sure.
+     They don't need to be added to the segtracker, except the last needs to be active, and it already is.
+     So as we find a new segment we want to do much like what lafs_free_get does - youth_update.
+     But the data block - isn't that youthblk?  When is that set?
+        segsum_find sets it if ssnum == 0
+
+19sep2010
+   15ak - run the orphan file at mount time.
+     After roll-forward when we have a working filesystem, we need to read the orphan file, load each block
+     mentioned, and register each as an orphan.
+     This involves:
+            - setting the orphan_slot
+            - setting B_Orphan
+            - lafs_add_orphan
+         Just like at the start of orphan_commit
+     We also need to initialise nextfree and possibly 'reserved'.
+     But: can orphans be created during roll-forward?  They certainly can.  We currently hide that in a re-use of
+     the orphan list..  But directory updates are possible too, and not handled.
+
+     I guess we should examine the file as soon as root is loaded and before roll-forward, as roll-forward cannot
+     change the orphan file.  Then after roll-forward, we read the original part of the file and set up
+     any orphans that aren't yet.
+     So we want to read once to get the size.  Then read again to process content up to that size.
+
+   15am - filesystem name.
+       This is only used for identifying snapshots
+
+01oct2010
+  - mkfs is done to an initial version of lafs-utils. !!!
+
+ So: 15am - filesystem name - used to identify snapshots
+   So the name is pointless in subordinate filesets.  So I could just shrink
+    the metadata.  The primary metadata needs to be big enough to get a name
+    easily though.
+
+ 15aw..
+    When cleaning we have a separate credit bit 'B_Realloc' from 'B_Dirty'.
+    But we have the same B_UnincCredit bit for both.  Is that safe?
+    Processing the cleaner could absorb the UnincCredit while the block is
+    reserved but not dirty.  Then when it gets dirtied, there may be not
+    enough credits to split.
+    We set Dirty from Credit, and use ICredit for UnincCredit.
+    But when only Realloc (not dirty) we don't use those bits.  We allocate
+    fresh credits or set Dirty if that fails.
+
+03Oct2010
+   Need lafs_iget_fs to work on other filesystems.  And other snapshots?
+   We use it:
+     in cleaner when parsing cluster head
+     in orphan handler when loading orphan file or when rearranging it.
+     in roll forward
+
+   Each of these might need to kern-mount the fs - so we need to hold the ref
+   somewhere.
+   Cleaner also needs to explore snapshots.
+
+   Don't want kern_mount - that is too heavy weight and includes a vfsmnt.
+   Just split up lafs_get_subset and use sget etc. so we get an 'sb' that we need
+   to hold.
+   Similarly for snapshots.  Cleaner needs to consider all snapshots, so they
+   all need to be mounted.
+
+   So snapshot 'sb's are referenced by cleaner, and de-reffed when cleaner stops.
+   Subset 'sb's can be attached to the parent inode and then only dropped when
+   the inode goes... only sb currently references inode.
+   So maybe the first ref to an sb doesn't ref the inode but others do - is that
+   possible? No, as we don't see them being dropped.
+   Every inode in the subset could ref the filesys inode.  That would keep it active
+   the right amount of time, but release/destroy could still be racy.
+
+   I guess cleaner/orphan/roll need to explicitly ref the fs.
+     cleaner already refs inode when B_Cleaning, so hold fs too.
+     B_Orphan seems to own an inode ref too.
+     
+   So:
+       lafs_iget_fs gets a ref on the inode and the sb.
+       need lafs_iput_fs to drop both references
+       B_Cleaning, B_Orphan, I_Pinned and I_Trunc all hold this double ref.
+
+    cleaner holds refs on all snapshots
+
+    FIXME I probably need to hold inode/fs for B_Async too.
+       No.  Async only refs the block, not the inode or fs.
+        Something else would normally ref the inode - e.g. cleaner.
+        When the inode is free, the page invalidation will notice the
+         B_Async flag and release it.
+
+    So that is all done now, except I don't hold refs on snapshots in the cleaner
+    yet.
+
+11oct2010
+ DescHole
+   - When is this used? directory etc don't need it.
+   - a regular file might, but there is no API to punch
+     a hole.... yet I guess.
+   - So we just want to allocate these blocks to 0.
+
+15oct2010 - happy birthday Daniel...
+ Looking at 36:
+  a/ files with nlink==0;
+        If we happen to find them, we hold a reference until all roll-forward
+        is done, in case a name is found - it is important not to start deletion
+        early.
+
+18oct2010
+  36g - write roll_mini for directories.
+   We get a name, an inode number, and one of:
+      LINK UNLINK REN_SOURCE REN_NEW_TARGET REN_OLD_TARGET
+
+   The REN_SOURCE is linked with a REN_*_TARGET which could be in a
+   different directory, so we need to stash the SOURCE until the TARGET
+   arrives.
+   We simply impose the implied change on the directory and update the
+   link count in the target inode.
+   So:
+     load the inode
+     possibly record REN_SOURCE for later
+
+     calls prepare/pin/commit as appropriate.
+     Put the inode on orphan list if appropriate - needs care
+        as we retarget orphan list.
+     update inode link count.
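+   A sketch of the "stash the SOURCE until the TARGET arrives" part only.
+   Names are made up, and the link-count change for an inode replaced by
+   REN_OLD_TARGET is deliberately left out as the notes don't spell it out:

```c
#include <assert.h>

/* Hypothetical names for the five update types listed above. */
enum dir_op {
    DIR_LINK, DIR_UNLINK, DIR_REN_SOURCE, DIR_REN_NEW_TARGET, DIR_REN_OLD_TARGET
};

struct roll_mini {
    int have_source;             /* a REN_SOURCE is waiting for its TARGET */
    unsigned long source_inum;   /* inode named by the stashed REN_SOURCE */
};

/* Replay one update.  Returns 0 on success, -1 if the sequence makes no
 * sense (TARGET without SOURCE, or two SOURCEs in a row).
 * *nlink_delta is the change to the link count of 'inum'. */
static int roll_mini_dir(struct roll_mini *st, enum dir_op op,
                         unsigned long inum, int *nlink_delta)
{
    *nlink_delta = 0;
    switch (op) {
    case DIR_LINK:
        *nlink_delta = 1;
        return 0;
    case DIR_UNLINK:
        *nlink_delta = -1;
        return 0;
    case DIR_REN_SOURCE:
        if (st->have_source)
            return -1;
        st->have_source = 1;
        st->source_inum = inum;
        return 0;
    case DIR_REN_NEW_TARGET:
    case DIR_REN_OLD_TARGET:
        if (!st->have_source)
            return -1;
        st->have_source = 0;     /* rename complete; name moved, nlink unchanged */
        return 0;
    }
    return -1;
}

/* Usage example: link, rename, unlink.  Returns the final link count of a
 * file that started with nlink == 1, or -1 if any step was rejected. */
static int roll_mini_demo(void)
{
    static const enum dir_op ops[] = {
        DIR_LINK, DIR_REN_SOURCE, DIR_REN_NEW_TARGET, DIR_UNLINK
    };
    struct roll_mini st = { 0, 0 };
    int nlink = 1, delta;
    unsigned i;

    for (i = 0; i < sizeof(ops) / sizeof(ops[0]); i++) {
        if (roll_mini_dir(&st, ops[i], 20, &delta) < 0)
            return -1;
        nlink += delta;
    }
    return st.have_source ? -1 : nlink;
}

/* Usage example: a TARGET with no stashed SOURCE must be rejected. */
static int roll_mini_orphan_target(void)
{
    struct roll_mini st = { 0, 0 };
    int delta;

    return roll_mini_dir(&st, DIR_REN_OLD_TARGET, 20, &delta);
}
```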
+
+   (28Feb2011)
+   Just a refresh on the purpose of these updates.
+   1/ They allow us to fsync a directory without performing a full checkpoint.
+     As directory blocks are not processed in roll-forward we need the update
+     for data to be safe.  As fsync of directories are rare in some common
+     situations we could avoid actually writing these.  Simply queue them
+     internally and discard them on a checkpoint.  If an fsync comes before the
+     checkpoint, only then do we write them out.  If there are any cross-directory
+     renames then the preceding updates in both directories need to be flushed
+     before the cross-directory rename.  It might be easier to always flush on
+     a cross-directory rename.
+   2/ They ensure consistency of inode link-count wrt to names in the filesystem,
+     but as link count is only updated by these (or a checkpoint) there is no
+     problem with delaying.
+
+   So: when replaying these we must update the directory content and the inode
+   link count.
+   It is OK to delay the write-out of these until an fsync, and not bother
+   if a checkpoint happens.
+   So add that to the TODO list - item 66.
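+   A toy model of that delayed write-out, using one global queue and the
+   "always flush on a cross-directory rename" simplification.  All names
+   here are hypothetical and only counters are kept:

```c
#include <assert.h>

struct dir_log {
    unsigned queued;    /* updates held in memory, not yet written */
    unsigned written;   /* updates actually written to the log */
};

static void dir_log_flush(struct dir_log *l)
{
    l->written += l->queued;
    l->queued = 0;
}

/* Queue one update.  A cross-directory rename first flushes everything
 * queued before it, so earlier updates reach the log ahead of the rename. */
static void dir_log_add(struct dir_log *l, int cross_directory)
{
    if (cross_directory)
        dir_log_flush(l);
    l->queued++;
}

/* fsync of a directory: only now do queued updates get written. */
static void dir_log_fsync(struct dir_log *l)
{
    dir_log_flush(l);
}

/* A checkpoint makes the directory blocks themselves safe, so the
 * queued updates can simply be dropped. */
static void dir_log_checkpoint(struct dir_log *l)
{
    l->queued = 0;
}

/* Usage example: returns the number of updates that reached the log. */
static unsigned dir_log_demo(void)
{
    struct dir_log l = { 0, 0 };

    dir_log_add(&l, 0);
    dir_log_add(&l, 0);
    dir_log_checkpoint(&l);   /* both discarded, nothing written */
    dir_log_add(&l, 0);
    dir_log_add(&l, 1);       /* flushes the one update queued before it */
    dir_log_fsync(&l);        /* writes the rename itself */
    return l.written;
}
```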
+
+28feb2011
+  - roll forward directory updates ... I wonder if I got it right :-)(untested).
+
+
+  I don't seem to have easy-access notes about the various meaning of
+  'width' and 'stride'
+
+  width:  The number of independent devices across which the (virtual) device
+    is placed.  The normal goal is to write 'width' blocks on every single write.
+    On a RAID4/5/6 this will avoid the need to pre-read for parity calculations,
+    and it will keep all devices equally busy with writes.
+    The 'width' blocks probably aren't consecutive.
+
+    There are two different layouts - one with width*stride <= segment_size
+    and one with width*stride > segment_size.
+
+  width*stride <= segment_size
+     This is a traditional striped layout like RAID0/4/5/6.
+     The 'stride' is the chunk size, so 'width*stride' is the stripe size,
+     and segment_size must be a multiple of this.
+     In this case all addresses in a single segment are contiguous.   We don't
+     necessarily write them in order if we want to write less than one stripe.
+     segment_offset will normally be a multiple  of width*stride though this isn't
+     enforced as one could have a partition with a non-aligned start.
+
+  width*stride > segment_size
+     This implies a concatenated layout.  If parity-redundancy is in use then the
+     blocks which combine to form a stripe are 'stride' blocks apart.
+     The benefit of this layout is that an extra drive can be added by simply
+     zeroing it and joining it to the array - no re-stripe needed.
+     This will make all stripes slightly larger so at first the space will not
+     be available.  As cleaning happens the space will gradually become
+     available.  This still requires restriping, but unlike a normal
+     raid5 restripe, the space becomes available in small amounts immediately, and
+     when there is no demand for more space, the re-striping (cleaning) can happen
+     at a very low priority with no cost.
+
+     In this case the blocks in a segment are not contiguous.  
+      'segment_size/width' are, then there is a large gap (in virtual address 
+      space) to the next chunk.
+
+     The segment_offset is an amount of space which is free at the start of
+     each device.  0..segment_offset and stride..stride+segment_offset etc
+     do not contain data and can be used for metadata.
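+  A sketch of how a (segment, offset) pair might map to a virtual block
+  address in the two layouts.  The struct, the function names and the
+  device-major numbering inside a concatenated segment are assumptions
+  made to match the description above, not the real lafs mapping:

```c
#include <assert.h>
#include <stdint.h>

/* All fields are in blocks.  Hypothetical struct for illustration. */
struct dev_layout {
    uint32_t width;
    uint32_t stride;
    uint32_t segment_size;
    uint32_t segment_offset;
};

/* width*stride <= segment_size means the traditional striped layout. */
static int layout_is_striped(const struct dev_layout *l)
{
    return (uint64_t)l->width * l->stride <= l->segment_size;
}

/* Virtual address of block 'off' within segment 'seg'. */
static uint64_t layout_block_addr(const struct dev_layout *l,
                                  uint32_t seg, uint32_t off)
{
    uint32_t per_dev, dev, row;

    assert(off < l->segment_size);
    if (layout_is_striped(l))
        /* striped: a segment is one contiguous run of addresses */
        return l->segment_offset + (uint64_t)seg * l->segment_size + off;

    /* concatenated: segment_size/width contiguous blocks on each device,
     * and device 'dev' starts 'stride' blocks after the previous one */
    per_dev = l->segment_size / l->width;
    dev = off / per_dev;
    row = off % per_dev;
    return (uint64_t)dev * l->stride + l->segment_offset +
           (uint64_t)seg * per_dev + row;
}
```

+  With width 4, stride 1000 and segment_size 64, segment 2 occupies 16
+  blocks starting at segment_offset+32 on each device, so offsets 15 and
+  16 are 985 blocks apart in the virtual address space.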
+
+  When width > 1 it makes sense to replicate each state block across
+     every device - as we want to write the whole stripe anyway.
+  For now we only write and read the first two copies at the beginning, and
+  the last two at the end...
+
+  Question:  what do we want to do about metadata on flash devices?  We really
+   don't want a small number of locations to store the metadata, but a large
+   number that we search through - possibly a binary search. 
+   These could be all at start/end or scattered throughout the device.
+   The latter would make it impossible to find efficiently - there is no way to
+   create useful linkage without writing something else at start or end.
+   As many devices optimise for random writes where the FAT table would be,
+   it makes sense to just put the metadata there and not at the end.
+   We should allow one 'page' for each metadatum, which probably means
+   32K.
+   So we should allow all state blocks to be near the start.
+
+01mar2011 - Autumn arrives.
+
+  Time to add handling of 'atime' and non-logged files.
+
+  The idea is to have a separate file for storing only 'atime'
+  This is separate from the inode file because the volatility of the data
+  is very different and one of the principles of log-structured-fs is that
+  differently volatile data should be kept separate.
+
+  This does mean that an inode lookup requires getting data from two files,
+  but it is hoped that the 'atime' file will mostly be in cache as each
+  block contains the atime for lots of different inodes.
+
+  The atime file contains 2 bytes for each inode, so with a block size of 4K,
+  each block would hold info for 2048 inodes.  1 million inodes would require
+  2 megabytes.
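+  Locating an inode's entry is then simple arithmetic.  A sketch, assuming
+  the file is indexed directly by inode number (the notes don't say so
+  explicitly) and using made-up names:

```c
#include <assert.h>
#include <stdint.h>

struct atime_slot {
    uint32_t block;    /* block number within the atime file */
    uint32_t offset;   /* byte offset of the 16-bit entry in that block */
};

/* 2 bytes per inode, so blocksize/2 entries per block. */
static struct atime_slot atime_slot_for(uint32_t inum, uint32_t blocksize)
{
    uint32_t per_block = blocksize / 2;
    struct atime_slot s;

    assert(per_block > 0);
    s.block = inum / per_block;
    s.offset = (inum % per_block) * 2;
    return s;
}
```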
+
+  The 16bits are treated as a positive floating point number which
+  gets added to the atime stored in the inode.  The lower 5 bits are
+  the exponent, the remaining 11 bits are mantissa.  Though there is a
+  little complexity in interpreting the exponent.
+     If the exponent is 0, the mantissa is used as milliseconds -
+       so shift right 5 and multiply by 1000000 for nanoseconds.
+       The smallest change that can be recorded is 1 millisecond,
+       and values up to (2^11-1) milliseconds - or 2 seconds - can be stored.
+     If the exponent is 1 to 10, the mantissa has a '1' appended as a
+       new msb, and is shifted by the exponent-1 and then treated as milliseconds.
+       This ranges up to 2^(12+9) milliseconds or 30 minutes, where
+       the granularity will be 2^9 millisecs or 0.5 seconds
+
+     For exponents from 11 up to 31 we add the 1 msb and treat
+       the number as seconds after shifting (e-11).  So at e==31,
+       we shift a number that is
+       up to 4095 by 20 to get nearly 2^32 seconds or 136 years.
+       At this point the granularity is 2^20 seconds or 12 days.
+
+
+   So overall we can update the atime for 136 years without needing to
+   update the inode, and can record differences of 1msec for the first
+   couple of seconds, then gradually less granularity until we are
+   down to one second an hour after the last change, and 4 hours a
+   year later.
+
+   To convert a number of seconds to this format:
+
+   If >= 2048 seconds, we shift down until less than 4096 seconds
+   counting the shift.  We add 11 to that number to form exponent,
+   and shift the resulting mantissa up 5, or with exponent, and mask
+   out bit 16.
+
+   Otherwise we convert to milliseconds (divide nano by 1000000 and
+   multiply seconds by 1000, and add). Then if < 2048, we shift up by
+   5 leaving a zero exponent and use that.
+
+   Otherwise we shift down until < 4096 counting shifts, add 1 to the
+   shift to form an exponent, and combine with mantissa as above.
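+   The format as described, written out as a sketch.  Function names are
+   hypothetical, and returning all 1s for anything of 2^32 seconds or more
+   is an assumption borrowed from the "store all 1s" fallback:

```c
#include <assert.h>
#include <stdint.h>

/* Layout: low 5 bits exponent, high 11 bits mantissa. */

/* Decode a 16-bit atime offset into nanoseconds. */
static uint64_t atime_decode_ns(uint16_t v)
{
    unsigned e = v & 0x1f;
    uint64_t m = v >> 5;                     /* 11-bit mantissa */

    if (e == 0)
        return m * 1000000u;                 /* plain milliseconds */
    m |= 2048;                               /* prepend the implied 1 as msb */
    if (e <= 10)
        return (m << (e - 1)) * 1000000u;    /* scaled milliseconds */
    return (m << (e - 11)) * 1000000000u;    /* scaled seconds */
}

/* Shift m down until it fits in 12 bits, then pack it with the exponent.
 * Masking to 16 bits drops the leading 1, which decode puts back. */
static uint16_t atime_pack(uint64_t m, unsigned base_exp)
{
    unsigned shift = 0;

    while (m >= 4096) {
        m >>= 1;
        shift++;
    }
    return (uint16_t)(((m << 5) | (shift + base_exp)) & 0xffff);
}

/* Encode an offset of sec seconds plus nsec nanoseconds, rounding down. */
static uint16_t atime_encode(uint64_t sec, uint32_t nsec)
{
    uint64_t ms;

    if (sec >= ((uint64_t)1 << 32))
        return 0xffff;                       /* assumption: saturate */
    if (sec >= 2048)
        return atime_pack(sec, 11);          /* exponents 11..31 */
    ms = sec * 1000 + nsec / 1000000;
    if (ms < 2048)
        return (uint16_t)(ms << 5);          /* exponent 0 */
    return atime_pack(ms, 1);                /* exponents 1..10 */
}
```

+   Values round down to the granularity of their range, so 10001 seconds
+   comes back as 10000 seconds.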
+
+   So that is the format - how do we implement it?
+
+   We don't want to expose to user-space numbers that we cannot store.
+   So any 'utimes' call that updates the inode directly can clear the
+   value in the atime file.  Only updates due to accesses go to the atimes
+   file.
+   We define a 'getattr' function which looks at the atime stored in
+   the vfs inode and if it has changed we need to deal with it.
+    - if the inode is still dirty we simply update the lafs inode
+      and use the number as-is, clearing the atimes entry
+    - else we subtract the stored atime from the new atime.  If this
+      is negative or exceeds 136 years we mark the inode dirty and
+      store it there.  If we cannot mark the inode dirty for some
+      reason we just store all 1s in the atime file.
+
+    The same operation is needed when dirty_inode is called to make
+    sure atime updates get saved even when no getattr is called.
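+    The decision above, reduced to a pure function.  The enum and the
+    parameter names are invented for this sketch, and the limit is taken
+    as the largest value the 16-bit format can hold:

```c
#include <assert.h>
#include <stdint.h>

/* Largest offset the 16-bit format can hold: 4095 << 20 seconds (~136 years). */
#define ATIME_MAX_DELTA_SEC ((int64_t)4095 << 20)

enum atime_target {
    ATIME_TO_INODE,           /* store atime in the inode, clear the atime entry */
    ATIME_TO_FILE,            /* store (new - stored) in the atime file */
    ATIME_TO_FILE_ALL_ONES    /* cannot dirty the inode: store all 1s */
};

static enum atime_target atime_choose(int inode_dirty, int can_dirty_inode,
                                      int64_t stored_sec, int64_t new_sec)
{
    int64_t delta = new_sec - stored_sec;

    if (inode_dirty)
        return ATIME_TO_INODE;            /* inode will be written anyway */
    if (delta >= 0 && delta <= ATIME_MAX_DELTA_SEC)
        return ATIME_TO_FILE;             /* representable offset */
    return can_dirty_inode ? ATIME_TO_INODE : ATIME_TO_FILE_ALL_ONES;
}
```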
+
+    As we always need to be able to update the atime file, it needs to
+    be permanently pinned whenever an inode is read in.  For
+    non-logged files this should be cheap but we must do it anyway as
+    the file might not be non-logged.
+    So we need to keep a permanent reference to each block while the
+    inode is loaded.  That can keep it pinned.
+
+
+    We don't want updates to the atime file to be flushed in any great
+    hurry, especially if it is a logged file.  We would be quite happy
+    to only write at 'unmount' and probably 'sync'.
+    So we want to stop the pages from appearing dirty in the page
+    cache (PAGECACHE_TAG_DIRTY), and the inode from appearing dirty
+    (I_DIRTY).
+    We can still keep them dirty in lafs metadata so if release_page
+    is called we can schedule a write out then.
+
+
+   So some steps:
+
+    1/ load atime file at mount time - there is one for each
+      filesystem.  It has inum of 3 and type of TypeAccesstime (6).
+      Also release it on unmount.
+
+    2/ loading an inode must take a ref to the block in the atime file
+      if it exists.  A new inode flag records if this has happened.
+      Unless mounted noatime, we pin the block and reserve space.
+
+    3/ getattr and dirty_inode must resolve any issues with the
+       atime.  So lafs_inode probably needs an extra field to be able
+       to check for changes
+
+
+
+  Hmm.. this is getting confusing...
+  When atime is changed the only way we find out is by ->dirty_inode
+  being called.  But that is called when anything is changed.
+  Filtering out whether or not we need to update the inode itself
+  is awkward... maybe there is some context we can use.
+  ->dirty_inode is called by mark_inode_dirty which is called:
+   - by touch_atime, if something changed
+   - file_update_time  - at which time we also update iversion
+   - setattr ... which has changed recently (2.3.37ish)
+   - page_symlink
+   - generic_file_direct_write - which increases size of inode
+   - set_page_dirty_nobuffers
+
+  So either the inode is pinned, or it isn't.
+  If it isn't, then this *must* be an atime-only update.
+  If it is, then it could be anything, but in any case we update the
+  atime directly.
+  So: dirty_inode should try to get dblock and check if it is pinned.
+   If it is pinned, then update the atime immediately and the offset
+   in the atime file too.
+   If not, just update the offset
+
+
+03mar2011
+  ARGggg... checkpin is interfering with unmount - it keeps an
+    s_active count so unmount 'works' but doesn't release anything.
+
+  checkpin is needed to ensure that inodes remain safe while
+  we are cleaning.  Particularly, while the inode index block is
+  pinned, we keep the inode and fs referenced as well.  I guess the
+  theory is that they won't stay pinned for long - but they do.
+  e.g. segusage blocks are permanently pinned.
+
+
+  We could have a rule about the prime filesystem always being mounted.
+  Then we don't need refcounts, but kill off the cleaner before
+  unmount...  which we sort-of do..
+
+  All subordinate filesystems have references on the prime_sb so the
+  prime_sb must be the last one to go.  When it goes it kills
+  everything off...
+  So we don't need checkpin to take a ref on the prime_sb.
+
+  There might still be an issue with files in subset filesystems
+  being permanently pinned so they stay around longer than they
+  should... need to check on that somehow.
+  The idea is that a quota file block is permanently pinned so it
+  will keep the fs pinned.  That in turn will keep everything else
+  pinned... Worry about that when we implement quotas FIXME
+
+04mar2011
+  I really need to sort this out, and it isn't easy...
+  We really want to know when "all" filesystems have been unmounted
+  so the block device(s) can be released and the cleaner stopped.
+  But we don't have a count for that.  We could if that was all
+  we counted - but that would mean that we only have a single
+  struct super_block for all filesystems.
+
+  So that is what I have to do.  A single super_block for all parts
+  of the filesystem.  I probably still need to allocate other
+  dev numbers for stat->dev, but I don't need to use them internally.
+  Maybe I even allocate superblocks... Yes - we need to use
+  set_anon_super and kill_anon_super to allocate the numbers.
+  lafs_inode will need a pointer to the filesystem - we use that
+  instead of the sb.
+
+  -------
+
+  Testing...
+   bug at block.c:658.  Block not B_Valid in lafs_dirty_iblock from
+   lafs_allocate_block  from cluster_flush.
+   Block is 74/0: InoIdx block of a newly created file I think.
+    '74' was /f23, then  /mnt/1/adir.  We are creating file in that
+   dir.
+   This is a depth=0 InoIdx block - i.e. the data is in the
+   dblock, so there is no index info, so it kind-a makes sense for the
+   index block to not be Valid.
+     yes- commit d268a566605bf006cf33c confirms that.
+
+   So why are we trying to dirty it?..
+
+   Maybe:
+     We create a couple of directory entries, then flush and end up
+     with an in-line data block.
+     Then we add more, flush again and so try to dirty parent...
+   Where do we turn depth=0 inodes to depth=1??
+      - erase_dblock_locked - don't want that
+      - lafs_incorporate
+   So I guess the 'bug' is in error - it is OK to mark that invalid
+   block as dirty.
+
+04mar2011
+  So - back to the super_block reworking.  We want only one
+  superblock.
+  So we use the TypeInodeFile inodes a bit more to hold the details
+  of different filesystems.  We need to store a unique 'dev' number in
+  there: use set_anon_super/kill_anon_super on a local 'struct
+  super_block' and copy s_dev in/out.
+
+  As we only have one sb, we can only have one fstype, so we cannot
+  use the fstype to choose what to do.
+    - if dev_name is a block device we try a normal mount
+    - if dev_name is an Inode file, we perform a subset mount
+    - if dev_name is a lafs dir and '-o snapshot=name', we mount that
+      snapshot
+    - if dev_name is a lafs dir in root with perm zero and
+      '-o subset=MAXSIZE', create a subset filesystem.
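+
+  The dispatch on dev_name could be sketched like this.  The enums,
+  struct mount_req and choose_mount are hypothetical names, and
+  checking the snapshot option before the subset option is an
+  assumption; the notes do not say which wins if both are given.

```c
#include <stdbool.h>

enum dev_kind { DEV_BLOCK, DEV_INODE_FILE, DEV_LAFS_DIR, DEV_OTHER };

enum mount_kind {
    MOUNT_NORMAL,       /* block device: normal mount */
    MOUNT_SUBSET,       /* Inode file: subset mount */
    MOUNT_SNAPSHOT,     /* lafs dir with '-o snapshot=name' */
    MOUNT_NEW_SUBSET,   /* empty-perm dir in root with '-o subset=MAXSIZE' */
    MOUNT_REJECT,
};

struct mount_req {
    enum dev_kind kind;
    bool opt_snapshot;  /* '-o snapshot=name' was given */
    bool opt_subset;    /* '-o subset=MAXSIZE' was given */
    bool dir_in_root;   /* the directory lives in the root of the fs */
    unsigned int perm;  /* permission bits of the directory */
};

static enum mount_kind choose_mount(const struct mount_req *r)
{
    switch (r->kind) {
    case DEV_BLOCK:
        return MOUNT_NORMAL;
    case DEV_INODE_FILE:
        return MOUNT_SUBSET;
    case DEV_LAFS_DIR:
        if (r->opt_snapshot)
            return MOUNT_SNAPSHOT;
        if (r->opt_subset && r->dir_in_root && r->perm == 0)
            return MOUNT_NEW_SUBSET;
        return MOUNT_REJECT;
    default:
        return MOUNT_REJECT;
    }
}
```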
+
+  - lafs_iget needs an inode rather than a superblock
+    ditto for lafs_new_inode, lafs_inode_inuse, inode_map_free,
+    choose_free_inum, inode_map_new_prepare
+  - lafs_iput_fs,lafs_igrab_fs, ino_from_sb
+
+  - NFS filehandles need careful thought
+     They are 'per-super-block', not 'per-vfsmnt' which might be
+     better.
+     We could change that but.....
+     For non-snapshot files it is easy - just record two inodes, the
+     fs and the target.
+     For snapshots there is nothing that is really stable.
+     Maybe we could have different superblocks for snapshots.
+     The snapshot doesn't need the cleaner as it is read-only, though
+     the cleaner can need the snapshot...
+
+     So the cleaner might automagically mount a snapshot, but a
+     snapshot will never invoke the cleaner or any other thread stuff.
+
+  So I guess we want one superblock for the fs and one for each
+  snapshot.
+  The filehandle is then either inum+gen or inum+inum+gen where first
+  inum must be TypeInodeFile
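+
+  A sketch of the two filehandle shapes, using the word count to tell
+  them apart.  struct fh_sketch, fh_encode and fh_decode are
+  hypothetical; a real handle would also have to check that the first
+  inum really is a TypeInodeFile.

```c
#include <stdint.h>
#include <stdbool.h>

struct fh_sketch {
    bool has_fs;        /* true for the inum+inum+gen form */
    uint32_t fs_inum;   /* inode of the containing fs (TypeInodeFile) */
    uint32_t inum;
    uint32_t gen;
};

/* Writes 2 or 3 words to out and returns how many were written. */
static int fh_encode(const struct fh_sketch *fh, uint32_t *out)
{
    int n = 0;

    if (fh->has_fs)
        out[n++] = fh->fs_inum;
    out[n++] = fh->inum;
    out[n++] = fh->gen;
    return n;
}

/* Returns 0 on success, -1 if the length matches neither form. */
static int fh_decode(const uint32_t *in, int words, struct fh_sketch *fh)
{
    if (words == 2) {
        fh->has_fs = false;
        fh->fs_inum = 0;
        fh->inum = in[0];
        fh->gen = in[1];
        return 0;
    }
    if (words == 3) {
        fh->has_fs = true;
        fh->fs_inum = in[0];
        fh->inum = in[1];
        fh->gen = in[2];
        return 0;
    }
    return -1;
}
```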
+
+07mar2011
+  ... though I could just put a snapshot number and partial timestamp
+      in..
+
+
+08mar2011
+ This isn't a new to-do list, it is a list of the main features that are
+ still not implemented:
+   - full 2D layout
+        + at very least I don't pad with zeros yet
+        + if stripe size were multiple of 3*3*5*7*2^N, then changing
+          width might be manageable.
+          e.g. stripe size: 40320 blocks.. But with megabyte chunksizes,
+          we really want 32bit segsizes and 322560 block segments.
+   - non-logged files - with interface to request access-time file
+   - quotas
+   - snapshots:  particularly cleaning
+   - error handling
+   - metadata (inode/directory/etc) CRCs and duplication
+   - fsck / debugfs
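+
+   The stripe-size arithmetic in the 2D layout item can be checked
+   with a small helper.  first_bad_width is a hypothetical name; it
+   reports the smallest device width (up to 64) that does not divide
+   a stripe evenly.

```c
/* Smallest width in 1..64 that does not divide stripe_blocks,
 * or 0 if every width in that range divides it. */
static unsigned int first_bad_width(unsigned long stripe_blocks)
{
    unsigned int w;

    for (w = 1; w <= 64; w++)
        if (stripe_blocks % w != 0)
            return w;
    return 0;
}
```

+   40320 = 3*3*5*7*2^7 divides evenly across any width from 1 to 10,
+   and 322560 = 3*3*5*7*2^10 is too big for a 16bit segment size.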
+
+
+  What would fsck do?
+   - locate and validate device and state blocks.
+   - locate and validate checkpoint cluster.
+   - locate and validate filesystem root
+   - roll forward to collect segusage and quota blocks.
+   - load inode map, read inode file, validate each inode and make sure
+     map is correct.
+   - explore each file, following all indexing, count segusage for each
+     segment and make sure segusage file is consistent.
+   - check no block is allocated twice.  This might require multiple passes,
+     each time we examine a different collection of segments.
+
+   - checking a file requires:
+          - checking inode is consistent
+          - checking index blocks are consistent with depth
+          - checking index/extent blocks are sorted with no overlaps
+          - checking block/iblock counts are correct.
+   - checking all cluster headers in the current segment to ensure they
+     look consistent and agree with file information. i.e. if cluster_header
+     identifies a block, the block must live there, or later in the segment.
+
+   - scan all directories looking for consistency of hash etc.  Count links
+     for all inodes.  This might need to be multi-pass too.
+     Could use a bitmap for single-link files, and table for others.
+
+   How to fix errors.
+     - First must find segments which are not in use according to segusage file
+       or according to block search.
+       If there are none, require a new device be provided.
+     - If anything looks incorrect, write corrected version to new segment
+       Then write out new segusage files
+
+   In some cases we might need to search all write-clusters for missing blocks??
+   That could take a very long time!
+
+
+   What do I really want to do about CRCs and hashes.
+    It might be nice to store a hash for each block in the index block.
+    But that wastes precious index-block space.
+    If I store a CRC together with address info in the block, then I could
+    be fairly sure it is the right block.  So e.g. inodes store the inode number,
+    Index blocks could hold inode+depth+address.
+    Last 8 bytes of each block could be a 4byte CRC and a 4byte identity.
+    identity is XOR of fsinum inum blocknum generation - or a CRC of these.
+
+    Actually, we don't need to store the identity info - we just need to
+    include it in the CRC.  That either saves space, or allows more bits to
+    be used for the CRC, which is probably the best use of bits for detecting
+    errors.
+    Though it might be nice to store phys-addr in the CRC too, we cannot as
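+
+    A sketch of folding the identity into the CRC without storing it.
+    A plain bitwise CRC-32 (reflected polynomial 0xEDB88320) stands in
+    for whatever CRC lafs ends up using.  struct block_id and block_crc
+    are hypothetical, and the identity words are fed in host byte
+    order only to keep the example short.

```c
#include <stdint.h>
#include <stddef.h>

/* Bitwise CRC-32; pass 0 to start, or a previous result to continue. */
static uint32_t crc32_update(uint32_t crc, const void *buf, size_t len)
{
    const unsigned char *p = buf;
    int k;

    crc = ~crc;
    while (len--) {
        crc ^= *p++;
        for (k = 0; k < 8; k++)
            crc = (crc >> 1) ^ (0xEDB88320u & -(crc & 1));
    }
    return ~crc;
}

struct block_id {
    uint32_t fsinum, inum, blocknum, generation;
};

/* The identity is checksummed first but never written to the block,
 * so a block read back under the wrong identity fails the check. */
static uint32_t block_crc(const void *data, size_t len,
                          const struct block_id *id)
{
    uint32_t ident[4] = { id->fsinum, id->inum,
                          id->blocknum, id->generation };
    uint32_t crc = crc32_update(0, ident, sizeof(ident));

    return crc32_update(crc, data, len);
}
```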
+
+21mar2011
+  My short-term todo list is:
+DONE  - get 'lafs' to the stage where I can create an fs requiring roll-forward
+DONE  - use 'lafs' to create images for testing, so I don't need 'fred.safe' any more.
+DONE  - Make lots of 'layout' changes - see 15cb
+
+02may2011
+  - 'run' goes to completion, but segusage isn't updated in the final cluster
+       and the number left over from before looks wrong.
+DONE  - 'ls -l' on a subset file gets confused.
+  - fs created by 'lafs' has wrong Blocks and Inodes counts
+  - we lose a ref to a segsum and sometimes put it too often.
+REFCNT 1 [ce0ffc48]0/182(2535)r0E:Valid,Claimed,PhysValid NP
+REFCNT 1 [ce055b9c]0/187(2535)r0E:Valid,Claimed,PhysValid NP
+REFCNT 1 [ce0445d8]0/182(2535)r0E:Valid,Claimed,PhysValid NP
+
+
+03may2011
+  Once I have these bugs sorted out I want to make some format changes.
+
+   DONE - fs_metadata need a 'parent' link
+        rename needs to be careful about what is updated!
+        so does roll_mini
+        lafs_get_parent needs some thought.
+
+   DONE - roll-forward should get exact mtime stamps, and ctime.
+     So each data block must have an exact timestamp
+     of when the change actually happened.   Or the group_head
+     has a timestamp for the most recent update to the file
+     As we use nanosecond timestamps (pointless though they are)
+     we need 30 bits for the nanoseconds and at least 11 for the seconds.
+     So 48 bits (6 bytes) is plenty.
+     So include a 64bit timestamp in the cluster_head and 48bit
+     number to subtract in the group_head
+     But saving 2 bytes per file isn't really worth it, and we may
+     well lose it in padding.  So just store a 64bit timestamp in
+     the group_head.
+
+   DONE - use CRC in place of all checksums - lafs_calc_cluster_csum
+
+   DONE - state block flags for inconsistencies found
+       If any inconsistency found, fsck is advised.
+       For some it may be imperative.
+       Things that can be wrong include:
+       - generic read error
+       - segusage negative
+       - index block incoherent
+       - dir block incoherent
+       - link count negative
+       - cluster header incoherent
+       -
+       64 bits should be adequate and simple for this.
+       Any unknown bit requires a full fsck.
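+
+       A sketch of how such a flag word might be tested.  The bit
+       assignments and names are made up for illustration; only the
+       rule that an unknown bit forces a full fsck comes from above.

```c
#include <stdint.h>

/* Hypothetical bit assignments for the inconsistencies listed above. */
#define INCON_READ_ERROR      (1ULL << 0)
#define INCON_SEGUSAGE_NEG    (1ULL << 1)
#define INCON_INDEX_BLOCK     (1ULL << 2)
#define INCON_DIR_BLOCK       (1ULL << 3)
#define INCON_LINK_COUNT_NEG  (1ULL << 4)
#define INCON_CLUSTER_HEADER  (1ULL << 5)
#define INCON_KNOWN           ((1ULL << 6) - 1)

enum fsck_advice { FSCK_NONE, FSCK_ADVISED, FSCK_FULL };

static enum fsck_advice fsck_needed(uint64_t flags)
{
    if (flags & ~INCON_KNOWN)
        return FSCK_FULL;       /* a bit we do not understand */
    return flags ? FSCK_ADVISED : FSCK_NONE;
}
```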
+
+   DONE - 32bit segment size
+        With 16bit at 4K blocks we are limited to 256Meg segments.
+       64Meg with 1k blocks.  This takes about 1 second to write on
+       a modern drive.  On an array it will take even less time.
+       24bits gives 16 to 64 gigabytes which is plenty.
+       However 24bits is awkward to access. a 1K block holds 341 1/3.
+       A 4K block holds 1365 1/3.
+       But this wastes less space than 256 or 1024 and so causes less IO.
+       But then we probably want to size segments to be very big.
+       A few thousand segments should be OK, which is tens of blocks.
+       I don't think the savings with 24bits are worth it, and I do
+       think v.big segments could be useful, so let's go with 32bit segments.
+
+       Youth is currently tuned to 16bits.  Let's leave it there and
+       maybe waste some space.
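+
+       The numbers above can be reproduced with two small helpers
+       (hypothetical names, here only to check the arithmetic).

```c
#include <stdint.h>

/* Largest segment, in bytes, that a 'bits'-wide block count can describe. */
static uint64_t max_segment_bytes(unsigned int bits, uint32_t block_size)
{
    return ((uint64_t)1 << bits) * block_size;
}

/* Whole entries of entry_bytes each that fit in one block. */
static uint32_t entries_per_block(uint32_t block_size, uint32_t entry_bytes)
{
    return block_size / entry_bytes;
}

/* Bytes left over in a block after the whole entries. */
static uint32_t wasted_bytes(uint32_t block_size, uint32_t entry_bytes)
{
    return block_size % entry_bytes;
}
```

+       16 bits gives 256Meg at 4K blocks and 64Meg at 1K blocks; 24
+       bits gives 16 to 64 gigabytes; 3-byte entries pack 341 or 1365
+       to a block with one byte left over.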
+
+
+   - parallel new-data write clusters.
+       I think it is sufficient to include a second 'next_addr' in the
+       cluster_head - or maybe two.  alt_next_addr[2].
+       When a thread wants to start a new stream of clusters it allocates
+       the segments then attaches to the next outgoing write cluster.
+       Once that is written everything in the new cluster is safe.
+       On a checkpoint every stream writes at least one checkpoint cluster
+       and these are linked together through alt_next_addr.
+       The 'next' cluster for each must be the checkpoint cluster and must
+       carry linkage but unlike with first-link, there is no need to wait
+       The data is already safe as long as the state block isn't updated
+       until every cluster_end block is written.
+       So really, one is enough.  I had thought 2 would enable quick fan-out
+       but there is no real need for that.
+
+       As 0 is a valid write-cluster address we use 'this_address' to signify
+       that there is no alt-next.
+
+       It is possible that a block of a file could be written to two
+       different streams at different points in time between two checkpoints.
+       We need to ensure that roll-forward gets these in the right order.
+       'seq' can be the same in two different streams so we cannot use that.
+       timestamp could possibly be used, but as times can go backwards it
+       is not ideal.
+
+       NEW IDEA.  Just use one stream of clusters.  However it can
+       bounce from one device to another easily.  So two different
+       threads can be building up two different write clusters at the
+       same time as long as they synchronise at some point to pass
+       addresses around.  They also need some other Verify mode as
+       VerifyNext or VerifyNext2 will destroy any parallelism.
+       As the point of this is to write to multiple devices in
+       parallel, maybe VerifyDevNext{,2} meaning the next header on
+       the same device serves to verify this.
+
+   - policies.
+       This includes
+               maximum number of segments written between checkpoints
+               whether data can be cleaned to a particular device
+               whether a device can receive new data
+               whether metadata duplication is needed
+               whether an RO device from a different array is allowed.
+       Some of these are per-device policies.  Some are per-array.
+
+       The 'RO Device' thing is special.  I think I want an alt_uuid.
+       It works like this:  You assemble the RO array when you
+       mount a new filesystem identifying the old as a component.
+       So that 'state' block on the new devices must identify the alt_uuid
+       and state seq number.
+
+       Do we want to record more info about which devices are in the
+       array?  Currently we just record how many.  If we find enough
+       with the right UUID/seq, they must be it.. what else would we
+       want?
+
+       For all the other policy statements it is probably simplest to
+       allow a set of simple strings. e.g. "noclean", "nonew",
+       "dup=2" "maxseg=5"
+       devblock currently uses 146 bytes, so room for 878
+       stateblock uses 112 plus some for snapshots, so much the same.
+       We currently don't use 'version' and have no concrete plans.
+       The vague idea is to allow lafs to *know* that it cannot mount
+       the array, so any incompatible feature gets set.
+       We could keep those in the policy sets.  From that perspective
+       there are 3 types of things.
+        - if you don't understand, don't worry
+        - if you don't understand, don't try to write
+        - if you don't understand, you cannot even read.
+
+       That last is really best avoided.  We have version info
+       elsewhere in the tree so that a new index style will simply
+       make that block unreadable.
+       So I think make the dev and state blocks a simple incrementing
+       version number which apply to that block, and have "don't
+       worry" and "don't write" policies distinguished by first
+       letter.
+       Capital is "If you don't understand, don't write"
+       Lower is "if you don't understand, don't worry".
+
+       These are space separated strings
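+
+       A sketch of the first-letter rule.  The list of known policies
+       reuses the example strings above; the function names, and
+       treating "dup=" and "maxseg=" as prefixes, are assumptions.

```c
#include <ctype.h>
#include <string.h>

enum policy_result { POLICY_RW, POLICY_RO };

/* Returns 1 if the token of length len is a policy we understand. */
static int policy_known(const char *tok, size_t len)
{
    static const char *const known[] = {
        "noclean", "nonew", "dup=", "maxseg="
    };
    size_t i;

    for (i = 0; i < sizeof(known) / sizeof(known[0]); i++) {
        size_t klen = strlen(known[i]);

        if (known[i][klen - 1] == '=') {
            /* name=value form: match the prefix, require a value */
            if (len > klen && strncmp(tok, known[i], klen) == 0)
                return 1;
        } else if (len == klen && strncmp(tok, known[i], klen) == 0) {
            return 1;
        }
    }
    return 0;
}

/* Unknown lower-case policies are ignored; an unknown policy that
 * starts with a capital letter means we must not write. */
static enum policy_result policy_check(const char *policies)
{
    enum policy_result res = POLICY_RW;
    const char *p = policies;

    while (*p) {
        size_t len;

        while (*p == ' ')
            p++;
        len = strcspn(p, " ");
        if (len && !policy_known(p, len) && isupper((unsigned char)p[0]))
            res = POLICY_RO;
        p += len;
    }
    return res;
}
```

+       So "noclean frobnicate" still mounts read-write, while
+       "Encrypt=aes nonew" must be mounted read-only by code that
+       does not know "Encrypt" (both unknown names are invented).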
+
+   - etc.
+
+   - what about i_version?  Include in timestamp?