So, let's try to write a kernel module that implements this filesystem.
It would be good to have a plan.

- Mount filesystem, providing empty root directory
   o parse mount options - DONE
   o find/load superblocks and stateblocks - DONE
   o present empty directory - DONE
   o Compile external module - DONE
   o test DONE

- Mount filesystem read-only with no roll-forward
   o IO address mapping
          sync_page_io or bread? - not bread I think
   o Index blocks management
   o search cluster-header for root inode
   o file read
   o Directory lookup/read
   o test

- Support roll-forward for blocks, orphans, whatever
   o manage segusage files
   o manage quota files

- Support writing
   o inode bitmap
   o cluster creation / block sorting

- Support Cleaning

- Interface for snapshots and other admin


------------------------
FIXME
 If a device is removed from the filesystem, we cannot reliably
 tell from the other devices or state that this is so.
 Maybe we need to update all devblocks with a new 'seq' number...
FIXME
 How do we specify mounting subordinate filesets?
 What superblock do they have?
 I suspect we do a -F lafs-sub mount from the original filesystem.

FIXME
 If mount fails, we seem to be leaving a super lying around,
 and sync_supers dies on it. - DONE

FIXME
  Umount appears to work, but a sync_supers dies. - DONE

FIXME
  subordinate supers aren't being locked as much - is that a problem?

FIXME
  index pages never get put on an LRU - how is this supposed to work?


--------------------------
Thoughts:
  Inodes live in an address-space, much like a file.  To load the
  first inode, we need an address-space, so may as well have an
  'struct inode' as we may want to expose it to user-space.

 Loading an inode, need
   fs (lafs filesystem structure)
   which subfs (maybe a lafs inode)
   which snapshot - this is implied by the subfs inode.
   and fs can be obtained from inode, so just inode, inum


UPTO
03nov2005
  review block_leaf_find and make_iblock
  need to do setparent and block_adopt next

10nov2005
  need to resolve locking for ->siblings list

24nov2005
  peer_find
  lock_phase
  lafs_refile
  
  I can read a file.....!!!!!
  Code review / tidy up.
     resolve locking buffer vs page

  Export on a web page somewhere??

16feb2006
 (I spent a while getting large-directories to work again in prototype..
  and some holidays).
 - Priority: clean mount and unmount
 - large directories
 - multiple devices.

  FIXME how do we record and handle write errors???

 The iput in lafs_release - which is needed - is oopsing
  at iput+0xe!

23feb2006
 Ok, I finally have a clean mount/unmount.
 .. not quite.  blocks being freed at unmount still have a refcnt, which is bad.

 Next:
  - make sure we can handle 'large' directories.
  - make sure we can handle files with indexes
  - handle filesystems that span devices.

02mar2006
  Hurray - clean unmounts!!!
  There is a nasty circular reference of the root inode which is stored in 
  a block that it manages.  Maybe this should not happen, rather than having to be
  explicitly broken - the root-block can live elsewhere, not in the inode.

  Next multi-level index blocks.

  But first, need to understand memory pressure and pageout.
   How are dirty pages found to be cleaned?
   How is pressure put on a filesystem to clean up?
   How are clean pages reaped?

  - call pagevec_lru_add{,_active}(pvec)  to put the page on an LRU
      lru_cache_add{,_active}(page) might be easier, but isn't exported.
  - call mark_page_accessed(page)  to keep the page 'active'.

09mar2006
  - make sure indexes work...


 lafs_load_block+0xf
  eax,bx,cx,dx,s1 all zero
from block_leaf_find 203

  ... OK, indexes seem to work.
   But 'lafs' has problems creating some large files.
   Try 'tt'

   This is due to not handling errors properly.. fix it later FIXME

16mar2006

  Must make sure the index address-space gets cleared up... I wonder
  how we find all the pages to free.  This might be one reason to keep them
  in a radix tree.  Though we should be able to walk our own data structures.


  Then work on mounting a 2-device filesystem.


  FIXME dir_next_ent always starts from the beginning rather than 
   remembering where it is up to... can this be fixed??


18mar2006 (Wedding anniversary, and Saturday ... during commonwealth games)

  Mounting snapshot needs a way to identify that it is a snapshotmount
  and which snapshot, and which filesystem.
  We could use a different filesystem type, but that isn't really needed

    mount -t lafs -o snapshot=name /original/mount/point /new

  This grabs the named snapshot of /original/mount/point and places it at 
      /new
  The 'snapshot=' option is the trigger.

  For a control FS, we
        mount -t lafs -o control /original/mount/point /new

  To grow a filesystem, we initialise a device (super/state blocks) and
        mount -t lafs -o remount,new=/dev/name whatever /original/mount/point

  as the dev_name isn't passed to remount

  So, mount options are:
        snapshot=name
        dev=/dev/device
        new=/dev/device
        control
    and various
          name=value 
    pairs matching what is exposed in the control filesystem
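
    A rough user-space sketch of splitting such an option string, just to
    pin down the shape.  'struct lafs_opts', take_value() and parse_opts()
    are made-up names for illustration, not anything in the module, and
    the 64-byte buffers are an arbitrary assumption.  Anything not in the
    list above is only counted, since the other name=value pairs are not
    decided yet.

```c
#include <string.h>

/* Hypothetical holder for the options listed above. */
struct lafs_opts {
	char snapshot[64];	/* snapshot=name */
	char dev[64];		/* dev=/dev/device */
	char newdev[64];	/* new=/dev/device */
	int control;		/* bare "control" */
	int other;		/* count of tokens not recognised here */
};

/* If tok (len bytes) is "key=value", copy value to dst and return 1. */
static int take_value(const char *tok, size_t len, const char *key,
		      char *dst, size_t dstlen)
{
	size_t klen = strlen(key);

	if (len <= klen || strncmp(tok, key, klen) != 0 || tok[klen] != '=')
		return 0;
	len -= klen + 1;
	if (len >= dstlen)
		len = dstlen - 1;	/* truncate over-long values */
	memcpy(dst, tok + klen + 1, len);
	dst[len] = '\0';
	return 1;
}

/* Split a comma-separated option string; returns o->other. */
static int parse_opts(const char *s, struct lafs_opts *o)
{
	memset(o, 0, sizeof(*o));
	while (*s) {
		size_t len = strcspn(s, ",");

		if (len == 7 && strncmp(s, "control", 7) == 0)
			o->control = 1;
		else if (len != 0 &&
			 !take_value(s, len, "snapshot", o->snapshot, sizeof(o->snapshot)) &&
			 !take_value(s, len, "dev", o->dev, sizeof(o->dev)) &&
			 !take_value(s, len, "new", o->newdev, sizeof(o->newdev)))
			o->other++;
		s += len;
		if (*s == ',')
			s++;
	}
	return o->other;
}
```

    e.g. parse_opts("snapshot=monday,control", &o) fills o.snapshot and
    sets o.control, leaving o.dev and o.newdev empty.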

23mar2006
 - factored out super-block finding preparatory to finding snapshots.

 Thoughts:
    superblocks for snapshots and sub-ordinate filesystems do
    not get stored in the 'state'.  There is, however, a usage count so that
    the prime filesystem cannot be unmounted until all snaps and subs are gone.
    This should just refcount the prime_sb I suspect.

    So: a snapshot sb points to the 'struct fs' but doesn't .... what???

30mar2006
 - remove the super-block finding code by changing the layout to store
    superblock locations explicitly :-)

 - teach 'mount' to mount snapshots.

 - need to audit for bad use of ss[0]
 - need to find better way to map 'sb' to snapshot number.
 - need to make unmount work.

01apr2006 (no, really!!)
 - rewrite index to kmalloc index blocks and use a shrinker to free them.
   This means that indexblock no longer has a 'page', which makes sense.
   It also means they cannot live in highmem, which is sad, but could
   be fixed.

  Notes: superblocks and refcounts.
   Each device holding the filesystem gets a superblock.
    One of these (arbitrarily) is the 'prime' superblock and gets to
      manage the whole filesystem.
    Each snapshot also gets a superblock, as does each
      subordinate filesystem.  These are anon sbs - using anon dev.
    Each anon sb takes a reference to the 'struct fs', and also to the
      prime sb.... how about the reference relationship between fs and prime_sb???

    Need to ponder this,

   - problem with getting parent superblock due to semaphores...
   - when unmount, put_super isn't being called, so inode 0 isn't released!

13apr2006
  (Took a week off to play with rt2500 wireless cards)
  - Use different filesystem type for snapshots and subordinate filesystems.
    This removes the semaphore problem
  + OK, mount and unmount works for snapshots... what next?
     - review index block - worry about himem?
     - review ss[0] usage - OK
     - general code review

  FIXME - what should leaf_lookup/index_lookup return on format error?
      They currently return '0' which will quietly make an empty block.
      Maybe '-1' would be better to make an error block.
  FIXME check how other filesystems lock the setting of PagePrivate
     Maybe just need to lock_page
  FIXME combine find/load/wait into one operation
  Review dir, super, roll, link

  FIXME module refcount increases on failed mount!

18may2006
  I've been sick for too long, and not much has happened... However I think more than
  the above comment says.  I started looking at roll-forward and have the 
  basic block parsing in place so that it reports what it sees in the roll.
  Also, the format has been changed a little: the address in the state block
  is the CheckpointStart cluster, and we simply roll forward to the
  CheckpointEnd, and then keep going beyond there - there is no longer any
  walking back to find the start.

  Next step is to start incorporating rolled elements into the filesystem

   - data blocks: shouldn't be too hard.  Don't need to update the
           index pages just yet
   - inode updates: should be straight forward enough, but care is needed
           as the data might be in multiple places
   - directory updates: these are probably most interesting..


  Question: how are symlinks created?
    Currently we:
      log the inode creation
      commit the new inode
      log the directory update.
    This allows the 'value' stored in the inode to appear after the directory
    update.
    That might be OK for files (Which are created empty and then extended)
    but is bad for symlinks (which are created atomically).
    So, options include:
     - ensure inode is in a previous cluster to directory updates.
       This slows things down too much I think
     - log the content as well.  This is awkward if it is big, certainly if more
       than a block, which is possible.
     - directory updates could be dependent on the inode being valid.
       This is ugly.
     - log content if it is small, else write inode, flush, then create link.

    So the fast option is:
      log inode create, log content, log filename
    and the slow/safe option is
      log inode create, sync file, log filename

    So on roll-forward if we see the inode we just save the data.
    Saving the whole inode seems attractive, but we want minimal order
    dependence: an inode update in the same cluster as the new inode should
    still over-ride, even though it is earlier.

  Ok, rollforward is proceeding slowly.  I think I am now incorporating
  new blocks into the tree properly, though the code probably won't compile.
  It will be nice to test this and see the file have the right data.

  Next step would be to include the index incorporation code.
  Then
    - directory updates
    - segusage summary
    - quota
    - stuff..

08jun2006
 - what exactly should happen when rollforward finds a file with a linkcount of 0?
   Currently all updates get lost - I wonder if they are lost safely?
 - rollforward is getting the size right, but not the content
 - do I need to flag a block that ->phys is valid?

 : Ok, roll-forward picks up new blocks in a file OK,
  but umount has stopped working.
    Presumably because there are pages attached to the inode which aren't
    getting released.  What do we want to do here?
    Normally those pages, or their addresses need to be recorded before
    they are lost.  But on a read-only mount we don't care so much.

22jun2006    continuing above thought..

   When we roll-forward and pick up the pieces of a file, we don't
   want to allocate pages to hold those pieces (and definitely don't
   want to read them all).  We just want to attach the addresses
   to the parent for incorporation.  Similarly after writing
   dirty blocks in a file we want to be able to release them
   immediately rather than waiting for the addresses to be
   incorporated (as incorporation can be more efficient when delayed).

   We could just allow the page associated with a block to be released,
   except that the page provides the indexing to find a block.  We might
   be able to live without the indexing, and hunt down the indexblock tree,
   but living without the mutual-exclusion provided by block indexing would
   be more awkward.
   And the 'struct datablock' still contains a lot more than is needed.

   So maybe we should just have a completely separate structure attached to
   the indexblock which lists fileaddr/physaddr.  This could include
   extent information.  The trick would be guaranteeing allocation.
   We could either allocate-late with a fallback of attaching the 'struct block'
   or performing an immediate incorporation, or allocate-early and block
   the dirtying of a page until there is space to record the new address.
   This last is bound to be easiest.

   So: what exactly do we use to store addresses?
    Probably a linked list of tables.
    Each table contains a link pointer and an array of
        fileaddr/physaddr/extentlen
    But we would need to allocate lots of these if there are hundreds of
      dirty pages, but possibly only end up using a few if they made
      extents very nicely.  That might be wasteful.

    Or we could allocate just one.  When it is full we perform an
     incorporation.  But if that causes a page split we are in trouble.
       We could have a spare page, split to it, write out one
        and wait for the spare page to be written and free.
        But we cannot just release the index page as it might still have
        children.

    (I think I've been here before).
    A worst-case scenario involves writing one block and that requires
      splitting every index up the tree to the inode.  This requires
      arbitrarily many pages to be allocated.  To accommodate this we either
      pre-allocate a spare page at every level of the tree down to the data
      block (a bit like storage space allocation) which seems very wasteful,
      or we make sure we can release one of the split pages, which seems impossible.

    I could decide not to worry about it.  Have a pool of index pages and hope
     it always works.  After all, most pages are data pages, and they can be
     freed successfully.  We would only have a deadlock if all dirty memory were
     index pages, and that seems unbelievably unlikely.  If we trigger a 
     checkpoint when the count of locked-pages hits some limit we should be
     safe.

    So: Keep one table per index block.  Use simple append and sequential search.
     When table gets full, force an incorporation

     Do we allocate the table separately, or embed it in the indexblock??

     Probably embed it.  indexblocks that don't need it can be freed at any
     time so that space waste hopefully isn't significant.

     How big?
      If the file is written sequentially, then everything should gather into
      extents, and so it doesn't need to be enormous.
      If the file is written randomly then the index block can be expected to
      be 'indirect', so incorporation will be cheap.
     So 'small' seems ok in both cases.

     Let's say 8.
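
     A minimal sketch of what "simple append and sequential search" with
     8 slots could look like.  The struct layout and the names uninc_ent,
     uninc_table, uninc_add and uninc_lookup are assumptions made up for
     this note, as is using phys 0 to mean "not found".  A full table
     returns -1 so the caller can force an incorporation.

```c
#include <stdint.h>

#define UNINC_SLOTS 8	/* "Let's say 8" */

/* One extent: len consecutive file blocks at consecutive phys addresses. */
struct uninc_ent {
	uint32_t fileaddr;
	uint64_t physaddr;
	uint32_t len;
};

struct uninc_table {
	int used;
	struct uninc_ent ent[UNINC_SLOTS];
};

/*
 * Append one block address.  A sequential write just extends the last
 * extent, so it uses no new slot.  Returns 0 on success, -1 when the
 * table is full and the caller must incorporate first.
 */
static int uninc_add(struct uninc_table *t, uint32_t faddr, uint64_t paddr)
{
	if (t->used > 0) {
		struct uninc_ent *last = &t->ent[t->used - 1];

		if (faddr == last->fileaddr + last->len &&
		    paddr == last->physaddr + last->len) {
			last->len++;
			return 0;
		}
	}
	if (t->used == UNINC_SLOTS)
		return -1;
	t->ent[t->used].fileaddr = faddr;
	t->ent[t->used].physaddr = paddr;
	t->ent[t->used].len = 1;
	t->used++;
	return 0;
}

/* Sequential search, newest entry first.  Returns 0 if not present. */
static uint64_t uninc_lookup(const struct uninc_table *t, uint32_t faddr)
{
	int i;

	for (i = t->used - 1; i >= 0; i--) {
		const struct uninc_ent *e = &t->ent[i];

		if (faddr >= e->fileaddr && faddr - e->fileaddr < e->len)
			return e->physaddr + (faddr - e->fileaddr);
	}
	return 0;
}
```

     Searching newest-first means a block written twice resolves to its
     latest address without any extra bookkeeping at add time.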

     But wait a minute.....
     On a checkpoint we can be getting phys updates for prev and next phases.
     next-phase updates cannot be incorporated until the indexblock has passed
     on to the next phase.  So in that case, I think we still keep a linked
     list of unincorporated blocks and live with the fact that we cannot
     free them until the phase change passes.  That shouldn't be a big problem
     as it is a limited time frame - especially for data blocks..

     But does this solve our initial problem??
     During roll-forward we want to keep the addresses but not the blocks,
     and we don't want to force incorporation. That means an arbitrary list
     of addresses attached to an index block.
     I guess we could possibly allow incorporation, but I would rather not
     as I want the fs to be able to be read-only nicely.
     So that means we need to have a list of address tables.
     Maybe the normal approach is 'add a table if possible, else incorporate'?

     OUCH... we may write a block a second time before incorporating the
     new address, so when adding an address to the table we need to check
     if it already exists.  That could be expensive.
     For index blocks might it even be a different address?  I think
     not but the vague possibility (in the future?) does complicate
     things somewhat.  Maybe we just keep things in chronological order and
     don't worry about duplicates until incorporate time, when we have to
     sort anyway.
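
     Sketch of the "don't worry about duplicates until incorporate time"
     idea: keep entries in arrival order, then at incorporation do a
     stable sort by file address and keep only the newest entry for each
     block.  'struct addr_ent' and sort_dedup() are illustrative names,
     not module code, and the insertion sort assumes the table is small.

```c
#include <stddef.h>
#include <stdint.h>

struct addr_ent {
	uint32_t fileaddr;
	uint64_t physaddr;
};

/*
 * a[0..n-1] is in chronological order (oldest first).
 * Sort by fileaddr with a stable insertion sort, so entries for the
 * same block stay oldest-to-newest, then drop all but the newest.
 * Returns the number of entries left.
 */
static size_t sort_dedup(struct addr_ent *a, size_t n)
{
	size_t i, j, out = 0;

	for (i = 1; i < n; i++) {
		struct addr_ent key = a[i];

		/* strict '>' keeps equal keys in arrival order */
		for (j = i; j > 0 && a[j - 1].fileaddr > key.fileaddr; j--)
			a[j] = a[j - 1];
		a[j] = key;
	}
	for (i = 0; i < n; i++) {
		if (i + 1 < n && a[i + 1].fileaddr == a[i].fileaddr)
			continue;	/* a newer entry for this block follows */
		a[out++] = a[i];
	}
	return out;
}
```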


     todo:
        lafs_find_block  DONE
	free_block must free tables DONE


     Unmounting still doesn't work.
     Problem is that an index block is holding a reference on parent,
     and parent references aren't getting cleaned up.
     On read-only unmount I guess we need to walk the list of leafs,
     discard any address info, and unlock the blocks.
     So that should be the first task for next time.

27jul2006
  Leafs are locked blocks which have no locked children.
  So any locked data block (non-inode) is a leaf
  Any locked index block with lockcnt[phase] 0 is a leaf.
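
  The leaf rule as a predicate, as a sketch only.  'struct blk', the
  blk_kind enum and is_leaf() are made up here.  The rule above says
  nothing about inode data blocks, so this sketch simply answers
  "no" for them, which is an assumption.

```c
#include <stdbool.h>

enum blk_kind { BLK_DATA, BLK_INODE_DATA, BLK_INDEX };

struct blk {
	enum blk_kind kind;
	bool locked;
	int lockcnt[2];	/* locked children, per phase (index blocks) */
};

/* A leaf is a locked block with no locked children in this phase. */
static bool is_leaf(const struct blk *b, int phase)
{
	if (!b->locked)
		return false;
	switch (b->kind) {
	case BLK_DATA:
		return true;	/* plain data blocks have no children */
	case BLK_INDEX:
		return b->lockcnt[phase] == 0;
	default:
		return false;	/* inode data: not covered by the rule */
	}
}
```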

  OK - fixed numerous bugs, but I can unmount now!!
  I can even rmmod and insmod and all is cool.


TODO:
 - review refile and get all the code in there from prototype
       DONE (I hope)
 - write a combined find/load/wait function and use it
       DONE
 - allocate inodes in single memcache and avoid generic_ip
       HALF DONE. (still using kmalloc, not doing initonce well)
 - review recording of new block addresses
    + make sure we lookup there on index lookup - YES
    + make sure ->uninc_next gets transferred to table at phase change.
    + write incorporation code as it is tricky
 - review how directory updates can be incorporated into a RO filesystem.
    No, they cannot.  We need to update the directory.
 - write directory update code
 - write cluster construction code
 - make sure indexblocks with unincorporated addresses get on to inc_pending
    ?? or is locking them enough?


INCORPORATION - ARgggghhhhh.
 The current uninc_table doesn't really lend itself to building
  index block... though maybe....
 Question: what happens when an index block disappears? i.e. it has no
  addresses in it?
  We clearly need to remove it from the parent.  This should be trivial,
  a direct operation on the parent index block: set some number to 0.
  Then the next incorporation pass will simply lose that entry.

 OK, that might be all well and good, but how do we sort unincorporated
  addresses so we can merge them?
 A linked-list merge sort is nice and open-ended, but does waste
  quite a bit of space in pointers.

 Or maybe I should just always do small-table incorporations.
 Is there a way that a bad ordering of writes could force very bad
  index layout in this case? i.e. cause a table split every time,
  but new blocks go in the first (full) table.
 OK Decision: always do small-table incorporation.
  i.e. not a list of blocks: just a table of addresses.

 FIXME check validity of index type when it is first read in,
   and reject early if it cannot be recognised.

24aug2006
 Took a break from incorporation.
 Looking at directories.
 Wrote dir.doc in module to sum lots of stuff up.
 Issue:
   dir blocks have an info structure attached.
   This included a counted reference to the parent.
   How long does this need to hang around for??

   - when there is any orphan issue happening, it must stay, via
     the 'pinned' flag.
   - when actually performing a dir op, we need to create and
     maintain this info.

   When last ref of a dir block is dropped, should drop
   the parent reference.


 Status:
    free list management mostly done.
    Next:
      create/delete prepare/commit/abort
      orphan handling
      dirty_block lock_block


 FIXME should dir_new_block zero out the block?
   How will commit_create know what to do with this block?

 NOTE another type of directory orphan is a free leaf block which
   is on the part-free list.

-------------------------------------------------------------
09sep2006 - on the plane to Frankfurt
 Don't tell me I am rethinking preallocation again ???

 TODO 
   dirty_inode needs to record the phase it is dirty in
   inode_fillblock needs to check current phase and act accordingly.
     see inode.doc
   Make sure the B_Orphan flag is set and used - or discard it.

   How do we commit creating a symlink?
   If it is a full block in size we cannot make an update record.
    - maybe have two update records? We cannot guarantee they are in
      the same  cluster.
    ... but if we put the 'make dir entry' last it should work.

   Change 'struct descriptor' definition
   the 'block_type' aka 'length' 16 field becomes
      0x0000 -> 0x8000 -> datablock, possibly a hole - up to 32K.
      0x8001 -> 0xc000 -> miniblock up to 16K+
      0xffff           -> index block.
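
   How the three ranges might be decoded, as a sketch.  The enum and
   classify_desc() are hypothetical names.  Reading the miniblock length
   as (value - 0x8000) is a guess at what "16K+" means, and values in
   0xc001..0xfffe are not assigned above so they are reported as
   unknown.

```c
#include <stdint.h>

enum desc_kind { DESC_DATA, DESC_MINI, DESC_INDEX, DESC_UNKNOWN };

/*
 * Decode the 16-bit 'block_type'/'length' field.
 * *len receives the byte length for data and mini blocks, else 0.
 */
static enum desc_kind classify_desc(uint16_t v, unsigned *len)
{
	*len = 0;
	if (v <= 0x8000) {		/* datablock (possibly a hole), up to 32K */
		*len = v;
		return DESC_DATA;
	}
	if (v <= 0xc000) {		/* miniblock; length is an assumption */
		*len = v - 0x8000u;
		return DESC_MINI;
	}
	if (v == 0xffff)
		return DESC_INDEX;
	return DESC_UNKNOWN;		/* 0xc001..0xfffe: not defined above */
}
```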

   Need to write IO routines which decrease pending-block-count in
     'wc'.


   Thinks.  a 1TB filesystem with 1K blocks and 4096 blocks/seg
     gives 4Meg segments. That would be 256K segments which at 2 bytes per segment
     - 512 segments per block - is 512 blocks in each seg usage file
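
   The same sum as code, so other block/segment sizes can be tried.
   'struct seg_math' and seg_usage_size() are names invented for this
   note; the 2 bytes per segment is the figure used above.

```c
#include <stdint.h>

struct seg_math {
	uint64_t seg_bytes;	/* bytes in one segment */
	uint64_t segments;	/* segments in the filesystem */
	uint64_t per_block;	/* usage entries per block */
	uint64_t usage_blocks;	/* blocks in one seg usage file */
};

static struct seg_math seg_usage_size(uint64_t fs_bytes, uint32_t block_bytes,
				      uint32_t blocks_per_seg,
				      uint32_t bytes_per_entry)
{
	struct seg_math m;

	m.seg_bytes = (uint64_t)block_bytes * blocks_per_seg;
	m.segments = fs_bytes / m.seg_bytes;
	m.per_block = block_bytes / bytes_per_entry;
	/* round up so a partly filled last block is counted */
	m.usage_blocks = (m.segments + m.per_block - 1) / m.per_block;
	return m;
}
```

   With 1TB, 1K blocks, 4096 blocks/seg and 2 bytes per entry this gives
   4MB segments, 262144 (256K) segments, 512 entries per block and 512
   blocks per usage file, matching the figures above.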

12oct2006
 Need to write
 - lafs_lock_{d,}block  DONE
       Make sure the block has parents and allocation and set the locked
       flag and the phase.

 - lafs_flush
       Given a datablock, wait for it to be written out
       This is needed before updating a block that is still locked in the
       previous phase.
 - lafs_inode_init
       Used when creating a new object/inode
       Given a datablock which is to hold the inode
         and a type (Type*) and a mode,
       Fill in the data block with appropriate data so that
          when lafs_import_inode looks at it, the right stuff happens.
 - phase_flip
 - lafs_prealloc
 - lafs_seg_ref
 - lafs_lock_inode

lafs_dirty_dblock
lafs_cleaner_pause
lafs_dirty_inode
lafs_seg_flush_all
lafs_write_all_super
lafs_quota_flush
lafs_space_use
lafs_cluster_update_abort
lafs_cluster_update_commit_buf
lafs_cluster_update_commit
lafs_seg_apply_all
lafs_cluster_update_prepare
lafs_inode_phase_check
lafs_seg_dup
lafs_dirty_block
lafs_cluster_update_lock
lafs_checkpoint_unlock_wait
lafs_orphan_drop
lafs_free_get
lafs_find_next

2nov2006
 - I need to know if a block is undergoing write-io so that I can
   avoid modifying it in certain circumstances.  But I don't track
   this information.  Options:
    1/ track the info.  This means an extra field in the 'struct block'
        because I still need to know which wc has had a write.
    2/ For blocks that we care about copy the data on write...
        But we care about all inodes and directory blocks.  That is a waste.
   I think we put extra info in the block.
   We need to know which wc was used (0,1,2) and which pending cluster
   in there (0-3) which comes to 4 bits.
   But we only care about the block for wc=0. and we could include the
   which-pending in the b_end_io, or maybe put it all in low bits
   of the block pointer....  Need max 4 bits.  Can only be sure of 2...

   Maybe:
       'which' goes in bottom two bits of bi_private
       'wc' goes in ->flags
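
   Sketch of hiding 'which' (0-3) in the bottom two bits of a pointer.
   This only works if the pointed-to object is at least 4-byte aligned,
   which is the "can only be sure of 2" bits above.  The helper names
   are made up; nothing here touches a real bio.

```c
#include <stdint.h>

/* Store 'which' (0..3) in the two low bits of a 4-byte-aligned pointer. */
static void *pack_which(void *p, unsigned which)
{
	return (void *)((uintptr_t)p | (which & 3u));
}

/* Recover the original pointer by masking the tag bits off. */
static void *unpack_ptr(void *tagged)
{
	return (void *)((uintptr_t)tagged & ~(uintptr_t)3);
}

/* Recover 'which' from the tag bits. */
static unsigned unpack_which(void *tagged)
{
	return (unsigned)((uintptr_t)tagged & 3u);
}
```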


4apr2007  (What a long gap !!)

 - lafs_cluster_update_*
   How do we prepare for a cluster update?  How do we lock it.

   The important thing is that the update can be written.  That
   requires that there is space available.  So we need to preallocate
   space and then release it.
   It is possible that each update might go in a different cluster, so maybe
   we need to preallocate one block per update.  That sounds a little expensive.
   After all, we aren't preallocating a cluster block for every data block
   that is dirty.
   So: prepare does nothing
	lock preallocates the space - a full block.
	commit copies it in.
    For now at least.

24May2007

 - Can now create and delete lots of files.  This is cool.
  But:
    Orphan slots just grow and grow - never to be reclaimed - why?
    After rm f*, 7 files remain.  but rm f* again and they go.
         FIXED - readdir wasn't returning them
    Size of directory remains large.
    And sometimes, files become ghosts... (try just removing one after first rm f*).

  TODO - process those orphans to clean up the directory.

20June2007 (Happy Birthday Dad)

 - Creating lots of files and then deleting them leaves 5 orphan slots
   for the directory busy, and one for inode 0?? 

   Directory handling uses the following orphans:
    CREATE:
        A new index block is created by splitting.  This needs to be linked in.
    DELETE:
        The dirent block we are deleting from
           If it becomes empty, it needs to go on free list
        The index block we are deleting from
           If it has lots of free space it might need to be rebalanced.
     The inode that was deleted.

 
 - When a file is fully deleted, we need to drop any orphan info... DONE
 - Need to do orphan handling of free blocks in directory, and
   unmerged parents - but there doesn't seem much point as I am going to
   change the directory layout (again).

 So: writing to a file.
   We need prepare_write, commit_write, and writepage.
   Prepare loads and links the page and checks there is space.
   commit marks it as dirty so writeout is possible.
   writepage chooses a page to write out

25June2007 - HACK week, thanks Novell!!
 - write - DONE
 - sync
     Somewhat done.
     Need to revise the process whereby async completion
     clears PageWriteback,
     We need locking in there, and need to worry about
       'which' wrapping too soon.
     Need to not start IO before we set page writeback
 - chmod
     Maybe, but syncing to disk needs more thought.
 - 'df'
    Partly done, need actual content.
 - mkdir
    Can make directory, but creating first entry fails. - FIXED
 - symlink
 - readlink
 - new directory structure.

27Jun2007 - More HACK week :-)

 - new directory layout done - much easier!!
 - If I delete a file that was created, the blocks still have a ref-count
   and we crash.
 - mkdir doesn't increase link count on parent. - FIXED
 
 TODO:
   Orphan handling.
     Infrastructure to process orphans
     Handle specific cases
     flush orphans at key times.
     load orphans at roll-forward

   checkpoint
     Write out a checkpoint (when?)
     Make sure refcount goes back to zero on blocks I write.

  Check on inode_phase_check and checkpoint_unlock and inode_dirty
   in all directory operations.

 FIX: Writing a small file leaves something non-dirty but
    due to be written, and lafs_cluster_allocate complains.
  - seems to work now.

 FIX: dir_handle_orphan doesn't lock the orphan transaction required.

 FIX: rm of a file with (small) content hangs waiting in sync_page in truncate_inode_pages.

 FIX: lafs_allocate hasn't been written!!!

 FIX: before updating any block in a depth=0 file, we must first load
      and 'lock' block 0.

29Jun2007 - still HACK week.
  Summary of how incorporation works.

  Each index block has a small table for unincorporated changes, i.e.
  block numbers and their addresses.
  This supports efficient storage of extents, and is extensible by allocating
  more tables.  This last is done rarely.

  When a block gets a new address, this is added to the table or, if
  there is a phase mismatch, it is added to a list until a phase change
  happens (so the whole block is pinned pending the phase change).

  If the table is full then:
   - if the filesystem is read-only (including during roll-forward),
     a new table is allocated (else rollforward fails).
   - otherwise we incorporate the table into the block, then add the new
     address to the (now empty) table.
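
  The decision just described, written out as a sketch.  The enum and
  choose_add_action() are illustrative only; the real code would act on
  the index block rather than return a value.

```c
#include <stdbool.h>

enum add_action {
	ADD_TO_TABLE,			/* room in the table, same phase */
	QUEUE_UNTIL_PHASE_CHANGE,	/* phase mismatch: goes on the list */
	ALLOC_EXTRA_TABLE,		/* table full, filesystem read-only */
	INCORPORATE_THEN_ADD		/* table full, filesystem writable */
};

/* Decide what to do with a new address for a block under this index block. */
static enum add_action choose_add_action(bool phase_matches, bool table_full,
					 bool read_only)
{
	if (!phase_matches)
		return QUEUE_UNTIL_PHASE_CHANGE;
	if (!table_full)
		return ADD_TO_TABLE;
	if (read_only)
		return ALLOC_EXTRA_TABLE;	/* includes roll-forward */
	return INCORPORATE_THEN_ADD;
}
```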

  If incorporation requires that we split the index block we allocate one
   from a pool.  If there are none in the pool, we wait.

  As the table is much smaller than a block, the incorporation into
  two blocks will always succeed.
  The 'uninc_next' and 'children' lists will then need to be shared
  between the two blocks before the new address is added to whichever
  table is appropriate.

  When looking for a block address, we must always check the table and
  then children lists.  We do not need to check uninc_next as they will always
  be children.

  How to ensure that the pool always has sufficient index blocks and we don't
  deadlock?
  We have two halves of the table, one for each phase.  Before we allow
  a block to be dirty in a phase, we ensure that the pool has adequate 
  index blocks for that phase.  e.g. twice the depth of the block.  If it
  doesn't we block the dirtying until space becomes available.
  For syscall writes, this is easy as we catch in prepare_write.
  When we perform a phase change, we must be sure there are enough index
  blocks for the deepest block that will stay dirty.  If there aren't, we need
  to flush all dirty blocks, and unmap all writable mappings before
  starting the checkpoint.


 FIX: need to work out life time rules so that inodes hang around while they have blocks.
    currently have an igrab that is never put.

 FIX: Dirty isn't cleared until 'flush', but do_checkpoints requires 'alloc' to clear it.

3Jul2007
 Checkpoint flushing is getting close.
 Current problem.
   InoIdx blocks are not changing phase.
   Phase change should happen when all children have been incorporated, and
    then the write has been triggered marking us clean.
  For InoIdx blocks, we need to be marked clean when the data block
   completes.

5jul2007 - a week off
 Checkpoint flushing seems to work !!!!
 FIX: what should filesize of symlink be?  
     other filesystems use len, but still zero-terminate for vfs.

 Problem.  A chmod is followed immediately by an unlink then a checkpoint.
   The chmod update gets into the checkpoint cluster, but the unlink completes
   before the checkpoint is finished so the new superblock sees the file
   as gone.  Roll-forward finds the update and wants to update a missing file.

   This isn't a big problem, but with slightly different details, it could be.

   One option is to ignore updates that precede the updated block.  That might
   be awkward with e.g. directory updates and checkpoints that cross multiple
   segments.

   Another option might be to prohibit updates once a checkpoint has started
   unless they are known to be after the phase change.

 FIX: unlink isn't punching a hole in the inode file.
      Inode usage map isn't being updated. - FIXED (For create, not unlink).

 FIX: roll forward does not pick up inodes, only data blocks.
    But tiny files are synced to inode, so they might not be picked up.
    So we must process a level=0 inode like a data block.

6July2007
 Time for lots of clean up.

DONE 1/ Index blocks to fill with 0 - use phys=0 to imply invalid.
DONE 2/ rename 'lock' -> 'pin'
 3/ Review and fix up all locking/refcounts.  See locking.doc
DONE 3a/ Make sure cluster_allocate can be called concurrently. e.g. check
         B_Alloc inside the semaphore
       Also lock inode when copying in block 0 and probably
       when calling lafs_inode_fillblock (??)
DONE 3b/ lafs_incorporate must take a copy of the table under a lock so
         more allocations can come in at any time.
NotYet 3c/ cluster_flush should start all writes before calling _allocate
         as _allocate might block on incorporation/splitting.
       No.  We really want _allocate to not block, but to queue...
        I think this is too hard to get perfect just now, so I will leave it.
DONE  3d/ introduce PinPending for data blocks.  remove fs->phase_depth.
LATER 3e/ Index needs a clean-lru on each filesystem, and a list of filesystems
     so that locking of lru doesn't have to be too global
DONE 3f/ change wc[]->hlhead to be a regular listhead as it is part of the
       lru system.
DONE 3g/ revise refile lru handling based on new understanding
 3h/ Utilise WritePhase bit, to be cleared when write completes.
     In particular, find when to wait for Alloc to be cleared if
      WritePhase doesn't match Phase.
       - when about to perform an incorporation.
 3i/ make sure we don't re-cluster_allocate until old-phase address has
     been recorded for incorporation.
 3j/ Check that index blocks cannot race when getting locked....
  k/ Check what locking is needed to set PagePrivate exclusively.
DONE  l/ cluster_done needs to call refile, but is called in interrupt context.
     We need to get it done in process context I think and lock
      ->waiting access with fs->lock after changing it to ->lru
DONE  m/ Need to know which blocks in a page are in writeback so we can clear writeback
        only when *all* have finished.
DONE  n/ on phase change, uninc_next blocks need to be shared out.
NO 3o/ Make sure lafs_refile can be called from irq context.
 3p/   lock all lru accesses.
 3q/ Lock those index blocks!!!
 3r/ Can inode data block be on leafs while index isn't, what happens if we
       try to write it out...
 FIXED Why are extent entries only grouped in 4s? 
 If InoIdx doesn't exist, then write_inode must write the data block.
 4/ resolve length of symlink
   FIXED - long symlink followed by 'sync' crashes.
   FIXED - rollforward isn't calling 'allocated' on blocks, or something
   FIXED - I cannot find 'bfile'. (inode isn't written)
   SEEMS OK...- Must flush final segment of a cluster properly...
 5/ Review what does, and does not need to be initialised in a new datablock
 6/ document and review all guards against dirtying a block from a previous phase
    that is not yet safe on storage.
          See lafs_dirty_dblock.
 7/ check for proper handling of error conditions
     a/ checkpoint_start might fail to start a thread!
     b/ if lafs_seg_dup fails in do_checkpoint, we need to abort the snapshot.
 8/ review checkpoint loop.
       Should anything be explicit, or will refile do whatever is needed?
 9/ Waiting.
       What should checkpoint_unlock_wait wait for?
       When do we need to wait for blocks to change state. And how?
DONE 10/ rebase on 2.6.current
DONE     - use s_blocksize / s_blocksize_bits rather than fs->

 11/ load/dirty block0 before dirtying any other block in depth=0 file
 12/ Add writecluster flag for old-phase updates.
     Why is this needed?  updates should always go in the new phase???
 13/ use kmem_cache for 'struct datablock'
 14/ indexblock allocation.
        use kmem_cache
	allocate the 'data' buffer late for InoIdx block.
	trigger flushing when space is tight
	Understand exactly when make_iblock should be called, and make it so.
 15/ use a mempool for skippoints in cluster.c
 16/ Review seg addressing code in cluster.c and make sure comments are good.
DONE 17/ Make sure create inherits uid etc from process.
 18/ consider ranges of holes in pending_addr.

DONE 20/ Implement rest of "incorporate"
DONE 21/ Implement staged truncate
DONE         use for setattr and delete_inode
DONE 22/ block usage counts.
 23/ review segment usage /youth handling and make a todo list.
      a/ Understand ref counting on segments and get it right.
 24/ Choose when to use VerifyNull and when to use VerifyNext2.
 25/ Store accesstime in separate (non-logged) file.
 26/ quotas.
        make sure files are released on unmount.

 30/ cleaner.
       Support 'peer' lists and peer_find. etc
 31/ subordinate filesystems:
     a/ ss[]->rootdir needs to be an array or list.
     b/ lafs_iget_fs need to understand these.
 32/ review snapshots.
      How to create
      how they can fail / how to abort
      How to destroy
 33/ review unmount
      - need to clean up checkpoint thread cleanly - be sure it has fully exited.
 34/ review roll-forward
      - make sure files with nlink=0 are handled well.
      - sanity check various values before trusting clusters.

 34/ Configure index block hash_table at run time based on memory size??
 35/ striped layout.
         Review everything that needs to handle laying out at cluster
         aligned for striping.

 36/ consider how to handle IO errors in detail, and implement it.
 37/ consider how to handle data corruption in indexing and directories and
     other metadata and guard against problems (lot of -EIO I suspect).

 - check all uninc_table accesses are locked if needed.

And more:
  1/ fs->pending_orphans and inode->orphans are largely unused!
  2/ If a datablock is memory mapped writeable, then when we write it out,
     we need to either fill up its credits again, or unmap it.
  3/ Need to handle orphans asynchronously.

---------
22nov2007
Free index blocks are on two lists, both protected by the global
hash_lock.
  1/ The per-inode free_index, so they can be destroyed with the inode
  2/ The global freelist so they can be freed by memory pressure.
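
A minimal userspace sketch of the two-list arrangement above, assuming an
intrusive list in the style of the kernel's list_head.  All names here
(sk_inode, sk_iblock, sk_reclaim_one, ...) are hypothetical and are not the
real LaFS structures; the caller is assumed to hold the global hash_lock,
which is not modelled.

```c
#include <assert.h>
#include <stddef.h>

/* Tiny intrusive doubly-linked list, modelled on the kernel's list_head. */
struct sk_list { struct sk_list *prev, *next; };

static void sk_list_init(struct sk_list *h) { h->prev = h->next = h; }
static int sk_list_empty(const struct sk_list *h) { return h->next == h; }

static void sk_list_add_tail(struct sk_list *n, struct sk_list *h)
{
	n->prev = h->prev;
	n->next = h;
	h->prev->next = n;
	h->prev = n;
}

static void sk_list_del_init(struct sk_list *n)
{
	n->prev->next = n->next;
	n->next->prev = n->prev;
	sk_list_init(n);
}

/* Hypothetical stand-ins for the inode and the index block. */
struct sk_inode { struct sk_list free_index; };

struct sk_iblock {
	struct sk_list inode_link;	/* on owner->free_index */
	struct sk_list global_link;	/* on the global freelist */
	struct sk_inode *owner;
};

static struct sk_list sk_global_freelist = {
	&sk_global_freelist, &sk_global_freelist
};

/* A free index block goes on both lists.  Caller holds hash_lock (assumed). */
static void sk_iblock_mark_free(struct sk_iblock *ib)
{
	assert(ib->owner != NULL);
	sk_list_add_tail(&ib->inode_link, &ib->owner->free_index);
	sk_list_add_tail(&ib->global_link, &sk_global_freelist);
}

/* Memory pressure: take the oldest free block; it leaves both lists. */
static struct sk_iblock *sk_reclaim_one(void)
{
	struct sk_iblock *ib;

	if (sk_list_empty(&sk_global_freelist))
		return NULL;
	ib = (struct sk_iblock *)((char *)sk_global_freelist.next -
				  offsetof(struct sk_iblock, global_link));
	sk_list_del_init(&ib->inode_link);
	sk_list_del_init(&ib->global_link);
	return ib;
}
```

Destroying an inode would walk its free_index list the same way, removing
each block from the global list as it goes.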

11feb2008.   Where was I up to again?
   reviewing phase_flip and lafs_refile.

  UPTO
     Reading through modify.c, at 'add_indirect'.  Plan to fix all this code.
     Need to think about how index blocks really change.  How old blocks get
      dis-counted from segment usage, and what optimisations are really good
      for re-incorporating index blocks.
        Operations to consider are:
              i)Append new block, ii)truncate, iii)over-write, iv)fill-hole.
          i/ leaf block splits, index block gets new entry at end, and replacement
                  for other entry.  Easy to handle
         ii/ trailing entries are zeroed.  Should be easy, but isn't yet.
        iii/ probably caught in leafs.  May cause internal split so we add new
             index address, which is easily handled if there is space.
         iv/ same as iii, though split more likely.

       What about merging index blocks.  That just makes addresses disappear, which
        we handle the slow way.
       Do we ever re-target index blocks?  Would need to be careful about that.
       Make it look like a split where one block ends up empty as a hole.
     Need to write
           grow_index_tree (DONE - untested)
                  ib is a leaf inode that is getting full.  Copy addresses
 		  into 'new', and make 'ib' an index block pointing at new.

	   add_index/walk index (DONE - untested)
          
           end of do_incorporate (DONE - untested)
		new contains the early addresses.  Some remain in ib
                 and/or ui.
		the buffers must be swapped, so ib has the early address.
                ui needs to be attached to new
		return 2; - then new uninc needs to be split

           lafs_incorporate
                case 2 - horizontal split
                case 3 - vertical split
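
A rough sketch of the grow_index_tree step described above: the full block
'ib' hands its addresses to 'new' and becomes an index block pointing at
'new'.  The struct layout, the tiny SK_MAX_ADDRS, the depth convention and
the in-memory 'child' pointer are all assumptions for illustration, not the
real modify.c code.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SK_MAX_ADDRS 4	/* tiny on purpose, so the sketch is easy to follow */

/* Hypothetical in-memory index block. */
struct sk_iblock {
	int depth;			/* 1 = addresses of data blocks (assumed) */
	int naddr;			/* number of valid entries in addr[] */
	uint64_t addr[SK_MAX_ADDRS];	/* physical addresses; 0 = invalid */
	struct sk_iblock *child;	/* unwritten child held in memory */
};

/*
 * 'ib' is getting full.  Copy its addresses into 'new', then turn 'ib'
 * into an index block one level up whose only entry refers to 'new'.
 */
static void sk_grow_index_tree(struct sk_iblock *ib, struct sk_iblock *new)
{
	assert(ib->naddr <= SK_MAX_ADDRS);

	/* 'new' takes over the old contents at the old depth. */
	new->depth = ib->depth;
	new->naddr = ib->naddr;
	memcpy(new->addr, ib->addr, sizeof(ib->addr));
	new->child = ib->child;

	/*
	 * 'ib' now has a single entry.  'new' has not been written yet, so
	 * its physical address is still 0 (phys=0 implies invalid) and it
	 * is reachable only through the in-memory child pointer.
	 */
	ib->depth += 1;
	ib->naddr = 1;
	memset(ib->addr, 0, sizeof(ib->addr));
	ib->child = new;
}
```
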
  12feb2008
   Bother - uninc_table is a problem (again).
   We can currently add at any time with just a spinlock.
   So when we split a block horizontally, 


   Still need to
          share out children and uninc_table in do_incorporate
	  share out credits in do_incorporate

14feb2008
   Still need to do incorporate as above but took a break to...

   Counting allocated blocks now works - stat shows the right info, hopefully
     storage is correct too. - DONE

   next: truncate?  orphan thread? 
      Then segment usage and the cleaner.


   thoughts:
    truncate - removing blocks doesn't need to erase them...
    - nothing forces a cluster_flush promptly!!!  We need a timeout
         or at least we need a flush before truncate_inode_pages...

    - in lafs_truncate we need to make the block an orphan and pin it
      all in a checkpoint.

21Feb2008 (Research morning)
   Discard checkpoint thread created on demand in favour of a cleaner
   thread that runs all the time.  It cleans and checkpoints and
   orphans and scans.

     want to:
        do segment scan and get a real list of free segments and
	free-space info!

25Feb2008
 - segment usage scanning to count free blocks
 - fix up re-reading of erased blocks
 - FIX truncate can still block waiting for writeback to complete.
 - FIX allocations aren't failing when we run out of free space
 - FIX df doesn't agree with du.

 problem:
   Truncate when an index block has addresses in uninc_table.
     The summary for the new address has already been performed.
     We need to deallocate the new without disturbing the old.
     However a simple allocation may not be possible.
     I guess we can prune them all to zero, then incorporation
      can proceed.

 TOFIX: when truncating a recently created file, it is still depth=0 so
    nothing happens.
    We really need to increase the depth to 1 as soon as we dirty
    any block, then reset back to 0 if it fits.

26Feb2008
  We have a file that we have written to, and the data blocks have been
  written out and the addresses stuck in uninc_table.
  We then truncate the file.  Who releases the usage of those blocks?
  And who removes them from uninc_table?

  OK, 'rm' returns all the blocks back now so 'df' is almost the same as 'du'.
  I really should make sure that inodes are getting freed properly and the
  inode map is clean and everything.

  BIG QUESTION
    Do we reserve segment-usage blocks.
     We cannot do it naively as we get infinite recursion.
     But we need it to be allowed to dirty the segment block.
     But we cannot pin them to this phase as we want to write them out
     after this phase
     This still needs more thought.  I avoided the recursion by setting SegRef
     before getting the ref.  But that isn't safe.

28Feb2008
  The table of cleanable segments is not working out.  Each segment appears multiple
  times which wastes space and adds confusion.
  We really want to be able to lookup by dev/seg and also find the least.
  'Find least' sounds like we want a heap but then we cannot discard the bottom half.

  We could have a skiplist for dev/segment lookup and do a merge-sort on
  a different link when we want to find the best segment.
  We then remember the best number found since a sort, and re-sort if the top
  is worse than the best.

  We keep all this in a fixed size table.  Each entry has
   seg,dev,usage,weight,weight-sort-link,addr-sort-link and possibly some
      addr-sort-skip links.
   This is 32+32+16+16+16+16 bits, or 16 bytes or bigger.
   Say 16bytes, 24bytes, or 32 bytes. (depth 8, which is plenty).
   One page of 16byte entries (256 of them)
   2/3 page of 24byte entries, 1/3 of 32byte entries.
   Total 2 pages, and 256+113+43 = 412 entries.
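
The arithmetic above can be checked with a small sketch.  The field order
and widths follow the "32+32+16+16+16+16" line read left to right, which is
an assumption; the struct and function names are hypothetical.  The larger
entries add 4 or 8 sixteen-bit skip links to the basic 16-byte entry.

```c
#include <assert.h>
#include <stdint.h>

/* Basic entry: 32+32+16+16+16+16 bits = 16 bytes (field mapping assumed). */
struct sk_seg16 {
	uint32_t seg;
	uint32_t dev;
	uint16_t usage;
	uint16_t weight;
	uint16_t weight_link;	/* weight-sort-link */
	uint16_t addr_link;	/* addr-sort-link */
};

/* Entries that also carry addr-sort-skip links. */
struct sk_seg24 { struct sk_seg16 e; uint16_t skip[4]; };
struct sk_seg32 { struct sk_seg16 e; uint16_t skip[8]; };

_Static_assert(sizeof(struct sk_seg16) == 16, "16-byte entry");
_Static_assert(sizeof(struct sk_seg24) == 24, "24-byte entry");
_Static_assert(sizeof(struct sk_seg32) == 32, "32-byte entry");

/*
 * Two pages: one full of 16-byte entries, the other split roughly
 * 2/3 : 1/3 between 24-byte and 32-byte entries.
 */
static unsigned sk_table_entries(unsigned page_size)
{
	unsigned n16, n24, n32;

	assert(page_size >= 96);
	n16 = page_size / sizeof(struct sk_seg16);
	n24 = (page_size * 2 / 3) / sizeof(struct sk_seg24);
	n32 = (page_size - n24 * sizeof(struct sk_seg24)) /
	      sizeof(struct sk_seg32);
	return n16 + n24 + n32;
}
```

With a 4096-byte page this gives 256 + 113 + 43 entries, matching the total
in the note.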

  But deleting random elements is awkward... but not too awkward.  We can delete
  lots of entries by marking them as old, then performing a single pass of the skip
  list deleting them.

  We should keep free segments here too, on a separate list.

  So how about:
   2 pages of 16byte entries
   1 page of 24
   1 page of 32

  free list randomly threads through all.

  When using from 24 or 32, randomly choose height of 2-5 or 2-9
  Two lists run through the skiplist entries.  One for cleanable, one for free.
  Remember the nth element for some small n (10, but it decreases as we pull
  things off the front) and if we add something less than that, we trigger a
  mergesort on the next time we want to clean.... maybe.

  Remember end of free list and add to there.  Maybe merge-sort the free list
  by addr occasionally.

  Questions:
    When can we clean, when can we free wrt checkpoints?
      - we can clean a segment as soon as we have a checkpoint after it.
        So we record the youth of the segment holding the (start of the)
        checkpoint, and can clean any segment with a lower youth.
      - we can free a segment after the checkpoint after its usage has reached
        zero.  So if usage is zero and youth....
        We could offset the usage by one (say - for the first cluster header..)
        then when we find a segment with usage of '1', we schedule an update to
        0 in the next checkpoint...
    How about segments with different sizes - they get different weights.
       Need to divide by segment size:  usage * youth / size.
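
A small sketch of the two rules above: a segment may be cleaned only if its
youth is lower than that of the segment holding the start of the checkpoint,
and candidates are ranked by usage * youth / size with the lowest weight
cleaned first.  The struct, the function names and the linear scan are
assumptions for illustration; the notes plan a sorted table instead.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-segment summary. */
struct sk_segment {
	uint32_t usage;	/* blocks still in use */
	uint32_t youth;	/* larger = written more recently (assumed) */
	uint32_t size;	/* blocks in the segment */
};

/* Weight normalised by segment size: usage * youth / size. */
static uint64_t sk_weight(const struct sk_segment *s)
{
	assert(s->size > 0);
	return (uint64_t)s->usage * s->youth / s->size;
}

/*
 * Return the index of the lowest-weight segment whose youth is below
 * checkpoint_youth, or -1 if no segment may be cleaned yet.
 */
static int sk_pick_cleanable(const struct sk_segment *segs, size_t n,
			     uint32_t checkpoint_youth)
{
	int best = -1;
	uint64_t best_weight = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		uint64_t w;

		if (segs[i].youth >= checkpoint_youth)
			continue;	/* no checkpoint after it yet */
		w = sk_weight(&segs[i]);
		if (best < 0 || w < best_weight) {
			best = (int)i;
			best_weight = w;
		}
	}
	return best;
}
```

Integer division loses precision for small segments; a real implementation
might scale the product first.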

  TOFIX
   - It seems I sometimes fall off the end of the last segment !!! - FIXED (locking)
   - We seem to switch to a new segment when still 83 blocks remaining? - FIXED (delete did flush)

   - Lots of 'creates' makes lots of little clusters - need to optimise!
        Or it could be deletes as we currently cluster_flush for each
        delete.
         - I think this is fixed

29Feb2008
  Started looking at the cleaner.
  Need to understand how much to clean each checkpoint
  Need to track free-space-in-active-sectors while scanning.

3Mar2008
  TOFIX
    - the cluster head is currently limited to one page.  This is not good.

    - Should the cleaner start before the scan is complete after a checkpoint?
      Probably it can, but while the scan is still happening it might be best
      to be cautious ??

  STATE:
    try_clean is taking shape and has a few FIXMEs.
    need to write async find_block code and get it to watch for
       block in a cleaning segment.

28Mar2008
  - where can padding appear in a cluster? between miniblocks? at
    end of device blocks?
  - need to track phys block while parsing headers for cleaning.. why?
  - determine rules for avoiding block lookup during cleaning
    based on youth/snapshot age, and truncate generation.
     We need to load the inode from each snapshot
    Can we optimise based on snapshot age?
    only if we know the block is newer than the snapshot.
    So when we relocate blocks (cleaning) they must go in a segment
    that is marked as being old. We cannot really guarantee that.
    I guess blocks that are marked as 'new' can safely be skipped if
     segment is newer than snapshot. This 'age' is not the youth, but
    is the cluster_head->seq which is stored in creation_age.

 - Store the rootdir for a filesystem in the metadata for the root inode.
   Then 'struct snapshot' doesn't need rootdir.  It can have a root

30Jun2008
  Looking at lafs_find_block_async.
     Needs async flag to make_iblock. 
        Check that.  Can we block_adopt if there was an error?
             iblock will exist.
     setparent has async flag.
     lafs_leaf_find has async flag
     lafs_wait_block_async

  FIXME I wakeup the cleaner every time an IO completes. 
  Do I really want that?  Maybe only when number of async IOs hits
  half the recent maximum??

  FIXME need to ensure that lafs_pin_dblock flushed committed
    B_Realloc blocks.

  FIXME when we incorporate a dirty (non-realloc) address to an index block,
    we need to clear B_Realloc on the indexblock.

  FIXME in lafs_incorporate we lafs_dirty_iblock 'new' without 
    giving it any credits.  Where should they come from?

  We don't seem to scan for free/cleanable segments often enough.

  FIXME we shouldn't start a checkpoint while cleaning is happening.

  FIXME need to be careful when cleaning about finding inodes that
    don't exist any more.

  FIXME give credits to realloc blocks.

  FIXME think about/document transitions between realloc and dirty,
    and what locking is needed.

2Jul2008
  Allowing for the FIXMEs above, the cleaner is now identifying
  blocks that need to be cleaned and marking them B_Realloc (I think).
  We now need to gather these into a write cluster and write them.
  They will all be on the clean_leafs list, so we can iterate that
   allocating or incorporating as needed.  This will be similar to
   do_checkpoint.
  Important question is: when?
   Ideally we would have some auto-flush mechanism.  The cleaner just
   keeps finding blocks to clean and when we start running out of
   resources we flush the cleaning queue.
   However we will still want to flush the cleaner always before a 
   checkpoint, so for now we can implement that bit and wait for a
   need for the other to arise.


  FIXME: cleaner lookup of 0/0/0 has interesting consequences as we
      don't record that location the same way.. how to handle?
     Should check that 'adopt' doesn't do the wrong thing with this block.


  Realloc blocks need to be pinned.  That makes sense.  Only that way
    will they get onto the clean_leafs list.
  When checkpointing we should probably examine clean_leafs to be
   on the safe side.
  

  Realloc and Dirty:

     Both of these hold a Credit.
     Both can be set at the same time.
     Cleaner ignores Dirty and sets Realloc anytime the block is in
      the wrong segment.  It also Pins the block.
     When the cleaner is flushing to the cleaning segment, it
      ignores Dirty blocks.  They get their Realloc cleared, but
      they remain pinned.  So they will get moved at the next checkpoint.
     How do we know whether an indexblock should be Dirty or Realloc?
      The Dirty/Realloc bit is cleared before we get to incorporation.
      Maybe we lafs_dirty_iblock the parent of any block we write
       out.  Then after incorporation, we set Realloc if it is not
       dirty.

STATUS:
  I think I'm pinning cleaner blocks now.
  Need to make sure the dirty ones are dropped. DONE
  Need to make sure the usage is transferred
  Need to get free segments back into use
  Need some more 'dump' options.  Maybe youth/usage files.
      Maybe tree.
  Need to make sure scan etc are triggered often enough.

  FIXME lafs_prealloc walks up ->parent without locking
    I think we want i_mapping->private_lock like lafs_pin_iblock.

TODO:
  1/ a 'dump' option that triggers a scan and prints everything out.
  2/ scan must mark freeable as such, then subsequently free them.
  3/ Look at code that decreases usage of old segments.
  4/ Review lafs_cluster_wait_all and decide exactly how long we need
     to wait.
  5/ Review 'FIXME that is gross' HZ/10 thing.
  6/ Review 'wait for checkpoint to flush' msleep(500);
           Maybe remove that altogether.

  FIXME BUG_ON in grow_index_tree fires. sync - writepages - flush
  FIXME BUG in lafs_allocated_block fired.
            from lafs_erase_dblock from invalidate_page from .. vmtruncate
             from lafs_setattr

  Current problem:
    An inode data block is dirty and pinned, but the inoidx is no longer
       pinned.  Presumably it isn't dirty.
     Recheck what 'dirty' means on the two blocks and see how this can happen.

10july2008
  Tree gets very big!  Lots of 'Realloc' blocks that should
   be long gone.

  We are spinning in cleaner again, and not in try_clean.

  Is it a problem that 'Pinned' is used for Realloc and dirty blocks?
  In general it shouldn't be.  The flush_cleaner process will remove
   the Realloc bits so the blocks fall off clean_leafs.  They then either
   go onto phase_leafs or get unpinned.
  But I currently have a problem with InoIdx/data.
  The Pin is transferred to the Data block, but it doesn't go from the
   InoIdx block because it has a pincnt.  Now that is probably a bug, but
   what if it weren't?  What if, while we were cleaning, a block got dirtied.
   That would pin the whole tree. 
  I guess the rule about not allocating an inodedata block while the
   InoIdx is pinned needs to be revised.  If the inodedata block is 
   Realloc (and not Dirty) while the InoIdx is not Realloc, we
   can go ahead (in a cleaning segment).

 FIXME to check:
   adir/big1 is garbage.... big1 was removed, so why is it even there?
              FIXED.
   echo tre > dump  # still too much stuff.


 Put cond_sched in checkpoint loops!


 Thoughts about cleaning and pinning.

  When cleaning we need to know how many dependant blocks are being cleaned
  so that we know when *this* block can be written - i.e. when the count hits 0.
  We cannot use the pincnt for this phase because there may be dependant blocks
  which are dirty.  They, and therefore this, may get flushed at next checkpoint,
  but they may not.  If we could be certain they would, we could just write
  to the clean-segment blocks which can become unpinned.  However if there
  is an index block being cleaned, and no dependant is being cleaned, but some
  are dirty but not pinned, then the checkpoint can go past without the block
  being moved.... but maybe we can detect that.

  Try this:
    We set B_Realloc precisely on blocks found in segments being cleaned.
    We pin these blocks and leafs which are Realloc go in clean_leafs.
    If a block is both Realloc and Dirty we clear Realloc but leave pinned.
       That way it gets written at end of checkpoint, but to main cluster.
    When we incorporate Realloc blocks into an index block, it gets marked
       Realloc.  When we incorp dirty blocks, mark dirty.  Then see above.
    On a checkpoint, we process both phase_leafs and clean_leafs
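
The "Try this" rules can be written down as flag transitions.  This is a
sketch only: the flag values, the function names and the idea of passing the
child's flags to the incorporate step are assumptions, not the real B_*
handling.

```c
#include <assert.h>

enum { SK_PINNED = 1, SK_DIRTY = 2, SK_REALLOC = 4 };
enum sk_leaf_list { SK_NO_LIST, SK_PHASE_LEAFS, SK_CLEAN_LEAFS };

struct sk_block { unsigned flags; };

/* Cleaner found this block in a segment being cleaned. */
static void sk_found_in_cleaning_segment(struct sk_block *b)
{
	b->flags |= SK_PINNED;
	if (!(b->flags & SK_DIRTY))
		b->flags |= SK_REALLOC;
}

/* Block is dirtied: Realloc and Dirty together collapse to Dirty. */
static void sk_dirtied(struct sk_block *b)
{
	b->flags |= SK_DIRTY;
	b->flags &= ~SK_REALLOC;	/* pin is left alone */
}

/* Which leaf list a pinned leaf belongs on. */
static enum sk_leaf_list sk_leaf_list_for(const struct sk_block *b)
{
	if (!(b->flags & SK_PINNED))
		return SK_NO_LIST;
	return (b->flags & SK_REALLOC) ? SK_CLEAN_LEAFS : SK_PHASE_LEAFS;
}

/* Incorporating a child address marks the index block to match. */
static void sk_incorporate(struct sk_block *index, unsigned child_flags)
{
	assert(child_flags & (SK_DIRTY | SK_REALLOC));
	if (child_flags & SK_DIRTY) {
		index->flags |= SK_DIRTY;
		index->flags &= ~SK_REALLOC;
	} else if (!(index->flags & SK_DIRTY)) {
		index->flags |= SK_REALLOC;
	}
}
```

A block that was Realloc and then dirtied stays pinned, so it is written at
the end of the checkpoint to the main cluster rather than the cleaning
segment.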


 FIXME do inode reads async better when cleaning...

 FIXME if a realloc inode has been allocated to a cluster when we try
     to dirty it, confusion can ensue as the writeout won't mark it
     clean, but will use up the credits.
     Maybe we need something similar to phasewait to not set PinPending...
      But normal dirtying doesn't phasewait.   I think we just need to
      detect this case and wait for the clean-cluster to flush.
      Messy...

 FIXME make sure incorporate is doing the right thing with credits.

 FIXME lafs_write_inode. We need to be careful about clearing Dirty
           when making an update.  Need some sort of locking.
           Need to review all inode dirty stuff and make sure we do
           the right thing no matter when it is called.

 FIXME when blocks are attached to uninc_next, they don't have 'dirty'
        anymore so we don't know how to flag the index block.

2008jul13
 UPTO: unlink etc don't prealloc the inode that will be modified.
    And a warnon inode.c:579 is very noisy.

2008jul22
 FIXME: lafs_reserve_block uses CleanSpace if Realloc is set,
     but it doesn't get set until AFTER lafs_reserve_block is called.

 Here I am...
   Cleaning cleans an InoIdx block which schedules the data block.
    Subsequently the InoIdx block gets pinned again.
    Now when we go to write the data block, we cannot because InoIdx is pinned
     in same phase.
     Maybe given that data block is pinned, we write it anyway...

 FIXME: when we realloc a block embedded in the inode, don't pluck it out
        and put it back in again.  Just realloc the inode.

 FIXME: when cleaning a directory that has shrunk, we think we have
     blocks that don't exist any more. FIXED - we thought '0' was in 
     segment '0'.

2008jul23
  FIXME: lafs_dirty_iblock called from lafs_allocated_block in cluster
     flush finds no credit. for InoIdx block of 8501

  FIXME: do we do SEGREF on all the index blocks? do we need to?


2008jul24
  FIXME: seg usage for segment 0/5 isn't dropping to zero.
    Part of a file got moved off, but count is still there.
    FIXED - seg_move wasn't being called.
  FIXME: segusage file has inconsistent extents:
      Extent entries:
       0 -> 694 for 2
       1 -> 1291 for 1
       1 -> 15 for 1
   FIXED several bugs in walk_extent
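
The dump above is inconsistent because logical block 1 is covered three
times.  A check like the following would flag it.  It assumes each line
reads "logical -> physical for length" and that entries must be sorted and
non-overlapping; the struct and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One extent entry: 'len' blocks starting at 'logical' live at 'phys'. */
struct sk_extent {
	uint32_t logical;
	uint64_t phys;
	uint32_t len;
};

/*
 * Return the index of the first entry that starts before the previous
 * entry has ended, or -1 if the table is sorted and non-overlapping.
 */
static int sk_first_bad_extent(const struct sk_extent *e, size_t n)
{
	uint64_t next_free = 0;
	size_t i;

	assert(e != NULL || n == 0);
	for (i = 0; i < n; i++) {
		if (e[i].logical < next_free)
			return (int)i;
		next_free = (uint64_t)e[i].logical + e[i].len;
	}
	return -1;
}
```
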

  FIXME qphase:  any locking between that changing and lafs_seg_move??
    I don't think so.  Just that seg_apply_all must be called after qphase is set.

  FIXME make sure we don't try to clean the current segment!!

  FIXME 'Available' goes negative!
      Creating large file doesn't instantly reduce 'Used'.
      Deleting files plus sync doesn't increase Avail?
 
  FIXME a segment is in the table but doesn't print out!

  FIXME we don't cope with running out of free segments (not that we ever should).

  FIXME check all Credit usage and make sure credits are returned when
    ->parent is dropped.
    provide visibility into credit counts.
    Make sure we are keeping enough space for cleaning.  We should always
     have a few segments unallocatable.

2008jul25
  FIXME cannot do io completion in cleaner thread as it can block on
     a i_mutex which might be waiting for completion. FIXED (keventd).

  FIXME as ->iblock isn't refcounted we need to be careful accessing it.
            If we 'know' we have a reference, e.g. a child with a ->parent
            link, we can access it without locking.
       So:
           lafs_make_iblock should return a counted reference.

       If we own an (indirect?) reference to iblock, we can access
        both iblock and dblock for free... but iblock can change???
       If not, we need to get a reference to one or other under a lock.

  FIXME block->inode should be a counted reference?

lafs_make_iblock OK
  lafs_leaf_find OK
    lafs_inode_handle_orphan OK
      inode_handle_orphan_loop FIXED
    __lafs_find_next OK
    find_block FIXED
  __lafs_find_next OK
    lafs_find_next FIXED
      dir_lookup_blk
      dir_handle_orphan
      lafs_readdir
      lafs_inode_handle_orphan
      choose_free_inum
  find_block - FIXED

 FIXME root->iblock should always be refcounted.  Is it?
 FIXME walking siblings - what lock?

2008jul28
 FIXME several times we clear PinPending without refiling, in dir.c in particular.
    that looks wrong. FIXED

  Maybe  lafs_new_inode should return a reference to the dblock
    Or pin it. or something. FIXED  And pinned (when needed).

 FIXME lafs_inode_dblock might return a block without valid data...
   Need to get valid data, then load block 0 in find_block rather than
       load_block.  FIXED

 FIXME we really should own a reference to ->dblock before calling
    lafs_pin_inode.  We don't want IO during a pin request.
    FIXED

 FIXME review use of PhysValid FIXED

 lafs_orphan_abort - what if lafs_orphan_pin not called?
   or if 'b' is NULL.  FIXED

 Do I need to clear PinPending when retrying??
   Well, we need to be phase-locked when we set PinPending, so
    it must be Pinned to the current phase.
    So when we unpin a datablock, we must clear PinPending.
  FIXED we now clear PinPending in do_checkpoint.

 Does phase_wait do the right thing when pinning an inoidx block
   for an inode? FIXED


Pending
  Need to understand and document the lifetime of a page with datablocks.
    who hold what refcount, and when can it be freed?
   Then fix up locking in lafs_refile, __putref.

 FIXME who keeps what refcount on orphan blocks/inodes??
 FIXME should dirty/pinned/etc hold a refcount?  they don't.


Later:
 FIXME make sure a failed (-EAGAIN) pinning triggers a checkpoint (eventually)

 FIXME make sure empty files have depth of 1.

 FIXME Truncate proceeds lazily. All data blocks need to be gone

26aug2008
 If I call lafs_erase_dblock while a write is underway, we have a problem.
  We need to wait potentially for a checkpoint to let go of the block and
   a write to complete.
    This should be done with waiting for PG_writeback on the page to disappear.
  Check this out.

  When end_page_writeback is called, we must have dropped all references to the
   page.
  When we commit to writing a block, we have to set PG_writeback on the page
   so that truncate et al can wait for it.  Before we have committed, truncate
   can just remove the page.  Internally we differentiate by B_Alloc.
  So before setting B_Allocated we need to test_set_page_writeback(page).
  Be careful of races.
  I don't think we can ensure all references are dropped.  After all, that is
  the point of refcounts.  So dblock array must exist without page!
  But we need to ensure that we don't start a writeout after truncate
  has done wait_on_page_writeback.
  This is done with the page locked so when we want to write a page
    in a checkpoint, we need to lock the page first.  Once we have the lock,
    we check if the page is still dirty.  If it has been truncated it 
    will be clean.
   But how do we safely reference the page if b->page can be cleared?
    How about:
      When we clear PagePrivate, we take a counted reference to the page
      for db->page.  This is dropped when the page is freed by lafs_refile.
      But while it is held, it is still safe for db->page to be dereferenced.
    So before we commence writeout we have to lock the page and set
     PG_writeback.  After locking, we need to test if writeback is still 
     appropriate. 

  Maybe not.  I think we can submit blocks for writeout without setting the
  page to writeback.  If we do, then we need to be sure those writes
  finish before invalidatepage calls releasepage (block_invalidatepage
  calls discard_buffer which calls lock_buffer which waits).
  In our case invalidatepage needs to make sure that no new write commences.
  Maybe we should lafs_iolock_block before we allocate to a cluster and check
  again if the block is dirty.

  So:
    lafs_cluster_allocate does:
       lafs_iolock_block
       check if still dirty.  If not, unlock and return
       set allocate flag
       allocate and write
       when write completes, allocate is cleared.
                    unlock block

    invalidatepage does
       lafs_iolock_block
       clear Valid,Dirty,Realloc
       lafs_iounlock_block
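
The ordering above can be sketched in userspace.  Here the io lock is
modelled as a non-blocking try-lock on a C11 atomic_flag so the sequence can
be followed in a single thread; the real lafs_iolock_block would sleep
instead of failing.  All sk_* names are hypothetical, and treating "still
dirty" as "Dirty or Realloc" is an assumption.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

enum { SK_VALID = 1, SK_DIRTY = 2, SK_REALLOC = 4, SK_ALLOC = 8 };

struct sk_block {
	atomic_flag iolock;	/* stands in for the block io lock */
	unsigned flags;
};

static void sk_block_init(struct sk_block *b, unsigned flags)
{
	atomic_flag_clear(&b->iolock);
	b->flags = flags;
}

static bool sk_iolock_try(struct sk_block *b)
{
	return !atomic_flag_test_and_set(&b->iolock);
}

static void sk_iounlock(struct sk_block *b)
{
	atomic_flag_clear(&b->iolock);
}

/* Returns true if a write was started; the lock is then held until
 * sk_write_done() runs. */
static bool sk_cluster_allocate(struct sk_block *b)
{
	if (!sk_iolock_try(b))
		return false;		/* real code would wait here */
	if (!(b->flags & (SK_DIRTY | SK_REALLOC))) {
		sk_iounlock(b);		/* truncated meanwhile: nothing to do */
		return false;
	}
	b->flags |= SK_ALLOC;
	/* ... allocate space in the cluster and submit the write ... */
	return true;
}

/* Write completion: clear the allocate flag, then release the lock. */
static void sk_write_done(struct sk_block *b)
{
	assert(b->flags & SK_ALLOC);
	b->flags &= ~SK_ALLOC;
	sk_iounlock(b);
}

/* Returns false if a write is in flight and the caller must wait. */
static bool sk_invalidate(struct sk_block *b)
{
	if (!sk_iolock_try(b))
		return false;
	b->flags &= ~(SK_VALID | SK_DIRTY | SK_REALLOC);
	sk_iounlock(b);
	return true;
}
```

Because both paths take the same lock, a block invalidated first is seen as
clean by sk_cluster_allocate, and an invalidate that arrives during a write
cannot proceed until the write has completed.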


2008 aug 28 - happy birthday.
FIXME segsum_find calls lafs_reserve_block without a checkpoint lock.
lafs_prealloc complains.

  mark_cleaning does too, but cleaning only happens well away from a checkpoint
  lock.
segsum_find is being called to reference a new segment when we flush a cluster.
 segment usage blocks are special.  Their index information doesn't
need to be written out in the current checkpoint.  We can do that, but
the backstop is to write just the data block in the tail of the
checkpoint and write indexing information later.

2008sep10
 unlink is getting "No space left on device".  This is when trying to
 pin the directory block, the physaddr is 0, so it looks like we want
 NewSpace.  But we shouldn't even be trying to prealloc in that case because
 there should already be a prealloc on the block.  i.e. there should be
 credits.
 Hmmm. after multiple 'syncs' how can the block not be written out.
 Maybe it is embedded in the inode?
 When we pin a block that was embedded in the inode it isn't clear what to
 do.  If we might grow the file so it doesn't fit any more, we need to
 allocate NewSpace.  If we know it won't grow, we use Release.
  This still needs a proper fix.

 Cleaning seems to be working nicely.  However we don't get all the space
 back that we should because lots of blocks still have credits that
 aren't being returned.

 So when should credits be returned?
 They are set when a block is pinned.  It then gets dirtied which
 consumes a credit.  Then gets unpinned.  I guess if it isn't pinned,
 then it doesn't need any credits.
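
One way to read the credit life cycle just described, as a sketch: pinning
reserves credits from a filesystem-wide pool, dirtying consumes one of them,
and unpinning hands any unused credits back.  The counters and sk_* names
are assumptions for illustration; the real code tracks credits with
per-block flags.

```c
#include <assert.h>

struct sk_fs { long free_credits; };	/* space not yet promised to anyone */

struct sk_block {
	int pinned;
	int dirty;
	int credits;	/* reserved for this block but not yet consumed */
};

/* Pin: reserve 'want' credits, or fail (think -ENOSPC) if too few remain. */
static int sk_pin(struct sk_fs *fs, struct sk_block *b, int want)
{
	if (fs->free_credits < want)
		return -1;
	fs->free_credits -= want;
	b->credits += want;
	b->pinned = 1;
	return 0;
}

/* Dirty: the Dirty state holds one credit until the block is written. */
static void sk_dirty(struct sk_block *b)
{
	assert(b->pinned);
	if (b->dirty)
		return;
	assert(b->credits > 0);
	b->credits--;
	b->dirty = 1;
}

/* Writeout turns the credit held by Dirty into real block usage. */
static void sk_written(struct sk_block *b)
{
	b->dirty = 0;
}

/* Unpin: a block that is not pinned needs no credits, so return them. */
static void sk_unpin(struct sk_fs *fs, struct sk_block *b)
{
	fs->free_credits += b->credits;
	b->credits = 0;
	b->pinned = 0;
}
```

Under this reading, space that never comes back points at blocks that were
unpinned without passing through the equivalent of sk_unpin.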


 It seems that cluster_flush is not always writing things in the correct
  order.  Root gets written before some other things below it.
   Maybe they are temporarily out of the loop??
 No.  There are dirty blocks which one checkpoint doesn't pick up, but
  they aren't holding the index block pinned. so they lose allocation.

 But they must hold the indexblock pinned, even though they aren't pinned
 themselves.  We maybe do this just with the refcnt... maybe.  That will cause
 it to phase-flip rather than drop pinning, which I think is right.

 So: too many credits remain allocated.  Where are they?  There are 1464
   outstanding credits. 290 are in the tree so 1200 or so are elsewhere??
   But things removed from the tree have credits removed.


FIXME roll forward ignores inodes.  But what about an inode that contains
   data.  Should that be ignored?  I think not.
FIXME delete adir/big2 then delete adir and it cannot release:
  Cannot release [cee29000]74/0(0)r1:Pinned,Phase0,Valid,Dirty,SegRef,UninCredit,PhysValid,Prealloc
 presumably there is orphan processing or something to complete???
FIXME when files are deleted, the space isn't returned!
   This seems to be mostly fixed - need to test.
FIXME when I "rm [b-z]*" it waits for writeback on something???
   zfile again!!!  OK, I think that is fixed.


12sep2008
  Current problem:
    seg_apply_all dirties dblocks.  When should they be reserved?
    They originally get reserved by a lafs_reserve_block call in
    segsum_find called from e.g. lafs_seg_ref which is called by lafs_reserve_block.
    However: that block might get written before *and* after a checkpoint.
    So we need N* Credits.  These are usually only used for Index blocks.
    We can set these easily enough if inode type is TypeSegmentMap.
    We move them across to Credit in seg_apply_all.
    But when do we clear them if they aren't needed?  I guess
     when we drop the last segref.  Yes, we already do that.
    FIXME need to make sure these get flushed on next checkpoint
     if we cannot allocate new credits after a checkpoint.

  New Problem.  The 'cleanable' table reports a size of 3, but it is empty!
    Think that is fixed.

  Some problems.
    1/ see above:  rm x/y; rmdir x -> BUG - FIXED
    2/ Spins on 'CURRENT=1' ??
    3/ if alloc_space gives EAGAIN while deleting, we don't survive.
    4/ When I create/delete a file, ablocks_used increments by one.
        The inode hasn't been allocated yet, so it seems the deallocation
         isn't adjusting ablocks_used??
    5/ open_namei (for dd) got caught on a mutex_lock.
    6/ When a large file is shrunk we don't reduce the level of the InoIdx block
       I'm not sure where we should and am not thinking very clearly.
       Will fudge something in flush_data_to_inode for now, but it MUST be fixed.
    7/ unlink (at least) can get stuck in iolock_block.  Who could be holding
       the lock?  Writeout that hasn't completed?
       Yes.  writepage calls lafs_allocated_block without calling flush.
       So the block could be sitting waiting for a flush.  How long do we
       wait??
    8/ It seems that some datablocks can need NCredits.  Make sure these
       are handled properly re flush-or-refill after checkpoint and
       flip_phase rather than unpin.
    9/ Maybe after lafs_writepage cluster_flush isn't getting called soon
       enough, and we lock up (see 7).  Need to flush the first block
       straight away, and the next one as soon as the first finishes, etc.
       Or something like that.  Then remove the comment from lafs_writepage.

8th December 2008

  I seem to be getting only 4 blocks to a cluster at the moment.
   This is good as it motivates the code to handle block splitting in
   the Btree.   But it shouldn't happen.

  ....
  Block splitting might work - it doesn't crash at least.
  But
  After deleting all files, the tree is full of stuff.
  Lots of inode data/InoIdx blocks.
  Many but not all are Pinned.  The others are OnFree
  The Pinned ones have outstanding references.
  Others

  ....
  Problem with the block splitting, when adding an index block.
  The index block is initially empty - we need to find things by looking
  at children.  But we don't.  We BUG_ON the iphys==0.
  In general, when we add a block below an index block and before we incorporate,
  the block must be found by finding the first indexed block and looking to
  see if there is a 'next' block that contains the address we need.
  FIXED

  But if we truncate a file while an index block is pinned and dirty,
  we spin on trying to incorporate it, which should make it empty.

11th December 2008
  deadlock.
  sync is trying to get lock in lafs_cluster_flush
  pdflush holds the lock and is stuck in cluster_flush_0xa40
    some wait_event I expect.
    Maybe we need an unplug ??

 - checkpoint/seg_apply_all/dirty_dblock doesn't have the credits.
   This is in clean_free.  We try to update the 'youth' to mark
   the segment as free, and we don't have a reservation to do it.
   Maybe just reserve it there and then.


12th December 2008
  When doing a lookup in an index block, we need to check the unincorp
  address list.  It isn't enough to look for unincorp blocks as they
  might have disappeared.
  For INDIRECT and EXTENT this is easy enough as full information is in
  'uninc'.
  For INDEX it is a little tricky as we need to look at the full set of
   addresses to know where a particular address fits.
   We could force an incorporate first, but that has awkward implications
    if it requires a split.
   Maybe if we get from the lookup "start+range"....
     That is not enough as the 'start' might get zeroed by an update.


   rm adir/* doesn't work as readdir doesn't get all the entries
    for some reason.
   Reason is that they are being put in the wrong block.
   lafs_find_next doesn't correctly find the 'next' block if it 
   hasn't been incorporated yet.
   Block can be:
     in index tree -- easy to find
     in uninc_table -- not too hard
     only in the ->children list, or attached to a page.
   It would be nice to use find_get_pages but that isn't exported so try
    something else for now.
   For index blocks
        Look in index block for 'next

15th December 2008
   FIXME when we split an index block, we need to hold a reference to
   the original so it doesn't disappear until the split-off copy is
   written.  This is because we search from an index block to find
   split-off copies.
   [ note from Feb09.  This should be OK now. Both will need
   incorporation, and we now hold on to blocks until they are
   incorporated.]
   

23rd February 2009
  - index block.  What changes are allowed exactly.
     - splitting certainly makes sense.
     - merging two adjacent blocks is fine, of which a special case
       is finding that a block is empty and so removing it.
     - What about a 2->3 split which would require removing a block
        and adding another at the same time?
       or noticing that the first blocks addressed are all missing, so
       moving the index forward?
       In each case, searching down by indexes will find a block that
       has been replaced by a later address.  We could manage that as
       long as the new block is attached after the replaced block.
       So we cannot move a block.  We must delete and replace.

  - unincorporated index blocks..
    unincorporated data blocks are not pinned in memory.  Once they have
    been written out, they can be freed.  Their address is stored in the
    uninc-table.  This means we can delay incorporation while many
    extents are written out and freed.  When we come to incorporate, we
    may have many hundreds of addresses in a few extents that can be incorporated
    efficiently without holding all that data pinned in memory.
    The same scale doesn't apply to index blocks.  An index block can
    reference only 102 blocks (for 1K block size).  And the uninc table can
    hold far fewer so we will naturally incorporate more often.
    So keeping index/indirect/extent blocks pinned until they are incorporated
    is reasonable.  And it makes lookup a lot easier, as we have
    guarantees about ordering of blocks in the children list that we
    don't have in the uninc table.

    Incorporation could have some atomicity issues.  There is no
    concern about bad stuff appearing on disk as the phase-change
    process handles that.  In memory it might be awkward if we split
    an index block before incorporating a block that would span them.
    That could conceivably happen if we only incorporate 8 blocks
    (size of uninc table) at a time.
    So maybe we should incorporate a full uninc list (not table) at
    a time.
    This means quite different code paths for incorporating leaf
    and internal index blocks....


  - uninc_table lists are a real problem.
    They can only be created during roll-forward so they hardly ever
    happen.
    But if the block is split while processing earlier things on the
    list, then splitting an uninc table would be very messy.
    Is there any way around this?
    Why not just do incorporation during roll-forward?
    We only need to incorporate leafs, not internal blocks because we
    don't use uninc_table for internal blocks any more.
    So during roll forward, all index blocks that are touched need to
    be held in cache...
    I think we live with that.  If it ever becomes a problem, we will
    need to perform the roll-forward twice.  The first time collects
    the usage information so that we know where we can start writing,
    then the second just applies all the changes to the rest of the
    filesystem.


   So:
     uninc table only used for leaves, and has no linked list
     unincorporated index blocks are stored on a list, which we
     sort before applying.
     All uninc index blocks are therefore kept in the index tree.
     Their order on the children list allows us to find the correct
     index. Each block for which the fileaddr is in the parent is
     followed by any blocks that have been split off and end after
     this one starts.  Blocks that have been emptied are Hole and are
     skipped over when looking for a block.

     When we split an internal block, the remaining uninc blocks
     must not start with a Hole.
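   A minimal sketch of the lookup rule above, modelling the ->children
   list as an array in list order.  struct sk_child, its fields and
   sk_find_child are hypothetical names, and the sketch assumes a child's
   range is described well enough by its start address alone.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct sk_child {
    unsigned start;   /* first file address the block covers */
    bool in_parent;   /* its fileaddr is recorded in the parent index block */
    bool hole;        /* block has been emptied; skip it when searching */
};

/* Return the position of the child that should hold addr, or -1.
 * c[] is in ->children order: each indexed block is followed by the
 * blocks that were split off from it and not yet incorporated. */
static int sk_find_child(const struct sk_child *c, size_t n, unsigned addr)
{
    size_t base = n;   /* n means "nothing found yet" */
    int best;

    assert(c != NULL || n == 0);

    /* Step 1: the index gives the last recorded block starting at or before addr. */
    for (size_t k = 0; k < n; k++)
        if (c[k].in_parent && c[k].start <= addr)
            base = k;
    if (base == n)
        return -1;

    /* Step 2: walk the split-off blocks that follow it, skipping Holes. */
    best = c[base].hole ? -1 : (int)base;
    for (size_t k = base + 1; k < n && !c[k].in_parent; k++) {
        if (c[k].hole)
            continue;
        if (c[k].start > addr)
            break;
        best = (int)k;
    }
    return best;
}
```

   This only works because split-off blocks stay attached directly after
   the block they came from, which is the ordering guarantee that the
   uninc table does not give.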

   FIXME: what locking do I need around lafs_incorporate?
      i_mutex?? i_alloc_sem??
      i_alloc_sem is imposed by truncate (inode_setattr) and
         direct_io possibly.  So it is really about adding/removing
	 blocks.  Not updating internals.
	 Maybe our own mutex.  Could even be per-index-block !!
      Whatever it is, we need to protect walking ->children too.


24th February 2009
  "rm -r" problem from 12/dec/2008 fixed now.
  incorporate code got a make-over and is probably much better.

  New problems:  After test runs, cannot create files due to no space
     on devices!!  But directory tree is empty.
  I can see:

    free_blocks=3256 allocated=1425 max_seg=512 clean_reserved=0

  The problem is that we think 1425 has been allocated to data that
  might still need to be written, leaving not enough room for more.
  Index Dump shows
  ====================414 credits ==============================
  which doesn't explain everything, but does explain a lot.  There
  really should be nothing in the Index tree (except fs-root and
  tree-root)
  There is also:
  Some inodes which are OnFree and hold no credits.
    0 DATA (1)  52 [0]ESegRef,Claimed,PhysValid
    52    1 (0)   0 [2564]{0,00000000}L on free Index(1),InoIdx,OnFree,PhysValid

  Some other inodes which are pinned with lots of credits and are
    on the phase_leaf list
    0 DATA (1) 299 [0]ESegRef,C,CI,Claimed,PhysValid
   299    1 (1)   0 [0]{0,00000000} [0, 0]L Leaf1(40) Index(1),Pinned,Phase1,InoIdx,Valid,SegRef,C,CI,CN,CNI,PhysValid,Prealloc

  And that is about it.  some are not Valid, some are...
  checkpoint just wants to 'flip' them.
  They mostly have a refcnt of 1... I wonder who is holding that....
  The reference on the dblock is held by the iblock.
  But what is the iblock remaining?  Who holds that reference?

  I restored some code to clean iblock, and now:
  free_blocks=3229 allocated=1277 max_seg=512 clean_reserved=0
  ====================244 credits ==============================
  which saved 170 credits (414 down to 244).  That helps.
  There seem to be many fewer of the many-credits blocks
  Lot of index blocks in tree are 'OnFree' and have a
  0 refcnt, but haven't been removed.  Why?
  It seems that they have ->parent == NULL, so lafs_refile never
  bothers to remove them.  I guess it should...
  OK, lots of InoIdx blocks have gone now with their DATA blocks.

  So, remaining blocks are pinned to their phase with lots of Credits,
    have no pincnt, mostly have physaddr==0.
   It is just the stray refcnt that keeps them there..
   inums are 40, 56, 62-73, 275-278, 280
    40 is f22
    56 is first adir
    63-69 are directories 2/3/4/5/6/7/8/9
    70-73 are looooong symlinks
    275 is cfile
    276 is dfile - same as cfile but truncated.
      Then some nbfile-X that were big enough.

   So: what do they have in common:
     Several only use the in-inode data block, but
       probably not all

    Can it be that it is refcounted on the Leaf list, and so
    cannot get off??  Yes, I think so!
    We only unpin things that have a zero refcount.

    So: what to do?
      checkpoint takes it off the list, then flips the phase and puts it
      on the other list with refile.  During that time it has a refcount
      so it doesn't lose the pinning.
      Do we want to:
        1/ Not have it on the list despite being pinned.
	2/ Drop the PIN despite the refcnt.
        3/ have refile do the phase_flip so it has a chance to
           notice the refcount has hit zero.

      2 isn't really an option.  We need PIN to persist whenever we have
       a reference.  We could possibly use PinPending for index blocks too,
       but that would require a lot of thinking.
      1 requires another criterion for being on the list.  I suspect that would
       get messy fast.
      3 we used to do I think... But refile is in a big lock, and we
        cannot really do a phase_flip under that.. and phase flip calls
         refile anyway so we would get recursion.
      So:4 - get lafs_phase_flip to notice and de-pin rather than flip.
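      A minimal sketch of option 4 as a user-space model.  struct sk_iblock,
      its fields and sk_phase_flip are hypothetical; the sketch assumes that
      checkpoint has already taken the block off the leaf list and is still
      holding the reference that the list used to hold.

```c
#include <assert.h>
#include <stdbool.h>

struct sk_iblock {
    int refcnt;          /* includes the reference taken over from the leaf list */
    bool on_leaf_list;
    bool pinned;
    bool dirty;
    int phase;           /* 0 or 1 */
};

/* Called by checkpoint for a block it has just removed from the leaf list.
 * Returns true if the block was de-pinned instead of flipped. */
static bool sk_phase_flip(struct sk_iblock *ib)
{
    assert(ib->pinned && !ib->on_leaf_list && ib->refcnt >= 1);

    if (ib->refcnt == 1 && !ib->dirty) {
        /* Nobody but us holds it and there is nothing to write:
         * drop the pin so the block can finally be freed. */
        ib->pinned = false;
        ib->refcnt--;
        return true;
    }

    /* Still in use: move it to the other phase and put it back on a
     * leaf list, which keeps the reference we were holding. */
    ib->phase = !ib->phase;
    ib->on_leaf_list = true;
    return false;
}
```

      The check happens here rather than in refile, so nothing has to run a
      phase flip under refile's big lock and there is no recursion.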

      FIXME use kzalloc where appropriate.

      FIXME Maybe test refcnt-!listempty in refile and de-pin if that is zero.

25th February 2009
  Good progress.
  Only 54 credits in Index Tree now.
  Inodes 1 2 8 16 are present. (fsroot, dirroot, inodemap, segusage)
  plus '74', which seems to be scheduled for deletion - root has uninc_table.
   ... and 'sync' got rid of that and left 44 credits.
  Also have data blocks for inode 50 55 72 73 74 with 2 credits of 74.
    50  link
    55  zfile
    72  long84
    73  long85
    74  adir
  These seem to be the files that used data-in-the-inode
  They still have a refcnt of 1 (or 2 for adir).
  ... OK, that's gone now.  I found a refcount leak.

  So now:  42 Credits in Index Dump.   No stray files.

  df: tot=4608 free=4597 avail=3045(4130-1085) cb=8 pb=0 ab=3
  So we still seem to have 1085 blocks allocated.  42 are accounted
  for, so 1043 still missing... either we lost the count, or lost the tree.
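  A small helper for the arithmetic above, so the df lines can be checked
  mechanically.  This is a user-space sketch: struct sk_df and both
  functions are hypothetical, and the name 'usable' for the first number in
  the parentheses is a guess.  All it relies on is that the df lines have
  the form avail=A(B-C) with A = B - C.

```c
#include <assert.h>
#include <stdio.h>

struct sk_df {
    int tot, free_blocks, avail;
    int usable;      /* first number in the parentheses; field name is a guess */
    int allocated;   /* second number: credits allocated but not yet used */
    int cb, pb, ab;
};

/* Parse one "df:" debug line.  Returns 0 on success, -1 if it does not match. */
static int sk_parse_df(const char *line, struct sk_df *d)
{
    int n;

    assert(line && d);
    n = sscanf(line, " df: tot=%d free=%d avail=%d(%d-%d) cb=%d pb=%d ab=%d",
               &d->tot, &d->free_blocks, &d->avail, &d->usable,
               &d->allocated, &d->cb, &d->pb, &d->ab);
    return n == 8 ? 0 : -1;
}

/* Allocated credits that the Index Dump does not explain. */
static int sk_unaccounted(const struct sk_df *d, int tree_credits)
{
    return d->allocated - tree_credits;
}
```

  Feeding it the line above together with the 42 credits from the Index
  Dump gives the 1043 quoted.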

  create a tiny file, remove, and sync, now
  df: tot=4608 free=4597 avail=3018(4118-1100) cb=8 pb=0 ab=3

  so I lost 15, but now 48 are in tree.  Let's try again...
  df: tot=4608 free=4597 avail=3006(4108-1102) cb=8 pb=0 ab=3
  and 44 in tree
  and again:
  df: tot=4608 free=4597 avail=2986(4098-1112) cb=8 pb=0 ab=3

  Definitely losing more than the difference in the tree.

  Try creating empty files...
df: tot=4608 free=4597 avail=2986(4098-1112) cb=8 pb=0 ab=3
df: tot=4608 free=4597 avail=2974(4088-1114) cb=8 pb=0 ab=3
df: tot=4608 free=4597 avail=2954(4078-1124) cb=8 pb=0 ab=3
df: tot=4608 free=4597 avail=2942(4068-1126) cb=8 pb=0 ab=3
df: tot=4608 free=4597 avail=2922(4058-1136) cb=8 pb=0 ab=3
df: tot=4608 free=4597 avail=2910(4048-1138) cb=8 pb=0 ab=3
df: tot=4608 free=4597 avail=2890(4038-1148) cb=8 pb=0 ab=3

 very strong pattern there.
 What about 2 files at a time.
df: tot=4608 free=4597 avail=2879(4028-1149) cb=8 pb=0 ab=3
df: tot=4608 free=4597 avail=2860(4018-1158) cb=8 pb=0 ab=3
df: tot=4608 free=4597 avail=2849(4008-1159) cb=8 pb=0 ab=3
df: tot=4608 free=4597 avail=2830(3998-1168) cb=8 pb=0 ab=3
df: tot=4608 free=4597 avail=2819(3988-1169) cb=8 pb=0 ab=3

  Slightly different pattern - not as bad.
  Have to try 4 now.
df: tot=4608 free=4597 avail=2802(3978-1176) cb=8 pb=0 ab=3
df: tot=4608 free=4597 avail=2793(3968-1175) cb=8 pb=0 ab=3
df: tot=4608 free=4597 avail=2776(3958-1182) cb=8 pb=0 ab=3
df: tot=4608 free=4597 avail=2767(3948-1181) cb=8 pb=0 ab=3

  Strange, isn't it....

  Making sure we clear UnincCredit... result looks worse.

26th February 2009
  I fixed up the credit accounting in 'incorporate' and then fixed a couple
  more little bugs.  And now:


====================48 credits ==============================
df: tot=4608 free=4597 avail=3172(3940-768) cb=10 pb=0 ab=1

So we still have 720 allocated credits that aren't accounted for.
But we are nicely under 100...

.... and now


====================76 credits ==============================
df: tot=4608 free=4256 avail=2160(2402-242) cb=350 pb=0 ab=2

That is different.  The count of missing blocks is way down,
but there is some extra cruft in the index tree.
Quite a few like
    0 DATA (1) 303 [0]L Leaf1(13) SegRef,Claimed,PhysValid
    0 DATA (1) 302 [0]L Leaf1(14) SegRef,Claimed,PhysValid
and even one
    0 DATA (2) 330 [0]L Leaf1(1) SegRef,C,CI,Claimed,PhysValid
   330    1 (1)   0 [0]{0,00000000} [0, 0]L Leaf1(0) Index(1),Pinned,Phase1,InoIdx,SegRef,C,CI,CN,CNI,PhysValid,Prealloc
Time for a commit though....

and now
====================46 credits ==============================
df: tot=4608 free=4257 avail=4253(4458-205) cb=350 pb=0 ab=1

so the strays in the index tree are gone, but still have 159 outstanding
credits.
No change but now
====================36 credits ==============================
df: tot=4608 free=4256 avail=3787(3885-98) cb=350 pb=0 ab=2


That is a little weird...
Hmmm. back to 
====================48 credits ==============================
df: tot=4608 free=4257 avail=4247(4458-211) cb=350 pb=0 ab=1

Oh well.
====================34 credits ==============================
df: tot=4608 free=4257 avail=3176(3373-197) cb=350 pb=0 ab=1

It seems that the unaccounted blocks are (or can be) created by
writing to a file then removing the file without a sync.
..but why is cb (cblocks_used) so high?

27th February 2009

 Got onto a bit of a tangent...
 What happens if we truncate a block while it is on a list to
 be cleaned?  Clearly we want the cleaner to drop it ASAP.
 But what if invalidate_page wants to drop it *now*
 Hopefully it is either still on clean_leafs and we can remove it,
 or it is now iolocked and we can wait for it.  So should be OK.

 I keep getting caught in "looping on..."
 We are truncating an inode and some index block which is now empty
 is not getting removed from the tree because there is an outstanding
 reference.... 327/0 depth=1.  I guess I turn on the tracing.

 ... and it seems that it is in the process of checkpointing.
 I guess I need to lock against that ... maybe with the iolock.

Credits = -1, rv=2
ib = [ce814e40]328/0(2552)r3:Index(1),Pinned,Phase1,Valid,Dirty,CI,CN,CNI,UninCredit,PhysValid,Uninc{0,0}[0]
------------[ cut here ]------------
kernel BUG at /home/neilb/work/nfsbrick/fs/module/segments.c:371!

 -------
 Every time I create/delete a file, I get an extra 'ab' which disappears
 on 'sync'.
   ablocks_used is:
     decremented when +ve summary_update on non-index
     increased on lafs_summary_allocate... should not be done for index blocks.

 OK:  after test run, filesystem is empty, but cblocks_used is around 360.
  cblocks_used:
        is loaded at mount time
        collects pblocks_used on a phase flip
        is updated in lafs_summary_update (unless pblocks is)
   So we must be missing a lafs_summary_update when phys->0
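   A minimal sketch of how the two counters relate, as a user-space model.
   The field names cblocks_used and pblocks_used come from these notes;
   everything else (struct sk_summary, the 'pending' argument, the function
   names) is hypothetical, and the rule for when an update counts as
   pending is an assumption.

```c
#include <assert.h>
#include <stdbool.h>

struct sk_summary {
    long cblocks_used;   /* loaded at mount time */
    long pblocks_used;   /* changes not yet folded into cblocks_used */
};

/* Each update lands in exactly one of the two counters. */
static void sk_summary_update(struct sk_summary *s, long delta, bool pending)
{
    if (pending)
        s->pblocks_used += delta;
    else
        s->cblocks_used += delta;
}

/* On a phase flip, cblocks_used collects pblocks_used. */
static void sk_summary_flip(struct sk_summary *s)
{
    s->cblocks_used += s->pblocks_used;
    s->pblocks_used = 0;
}

/* Every change of a physical address goes through here, so that the
 * phys -> 0 case cannot forget its summary update. */
static void sk_set_phys(struct sk_summary *s, unsigned long *phys,
                        unsigned long newphys, bool pending)
{
    long delta;

    assert(s && phys);
    delta = (long)(newphys != 0) - (long)(*phys != 0);
    *phys = newphys;
    if (delta != 0)
        sk_summary_update(s, delta, pending);
}
```

   Funnelling all address changes through one helper like sk_set_phys is
   one way to make a missing update on phys->0 impossible by construction.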


 Lots of problems:
   truncating big (multi-level index) seems to be bad
     Leaves 'pb-338 !!! and cb+689, even after sync.
   still 'looping on' occasionally
   Haven't found cblocks_used leak yet.
   Occasionally non-B_Valid blocks are acted on.
     I think I need to improve io locking.

---------------
1st March 2009
  Need some improvements to iolock locking.
  We use this lock to wait for a block to be written out (if that is happening)
   before we allow lafs_invalidate_page to complete.
   It is also used in lafs_erase_{d,i}block (Similar purpose)
  We take the lock in lafs_cluster_allocate, and then make sure the block is
   still dirty.

  Also lock in lafs_new_inode as initing the inode is a form of IO ??
  load_block takes the lock
  We only clear_bit(B_Valid, ) under this lock.

  So the issue is this:
    A block that is going to be written is passed to lafs_cluster_allocate.
    This happens either after taking it off a _leafs list, or when
    lafs_writepage requests the write.

    lafs_invalidate_page needs to be able to release the page, so there needs to
    be no transient references.  In particular, once the block has been
    removed from a _leafs list it must already be iolocked.
    Invalidate_page can then either remove from that list and erase the block,
    or use io_lock_block to wait for the IO to complete.
  So when a datablock comes out of get_flushable it must be iolocked, and must
  remain iolocked until after Dirty and Alloc are clear
  Index blocks belong entirely to the fs, so we can be more relaxed with them.
  If get_flushable finds the block already iolocked, it is either being invalidated
  or already has IO pending, so it can be dropped.
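  A minimal sketch of that rule, single-threaded and in user space.  The
  names sk_blk, sk_try_iolock, sk_get_flushable and sk_write_done are
  hypothetical, a plain bool stands in for the B_IOLock bit, and the sketch
  reads "dropped" as "taken off the list but not handed to the caller".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct sk_blk {
    bool iolocked;
    bool dirty;
    bool on_leafs;
};

/* Stand-in for an atomic test-and-set of the IO lock bit. */
static bool sk_try_iolock(struct sk_blk *b)
{
    if (b->iolocked)
        return false;
    b->iolocked = true;
    return true;
}

/* Hand out the next block to write.  A block is only ever off the list
 * while iolocked, so invalidate can either unlink it or wait on the lock. */
static struct sk_blk *sk_get_flushable(struct sk_blk **leafs, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        struct sk_blk *b = leafs[i];

        if (b == NULL || !b->on_leafs)
            continue;
        if (!sk_try_iolock(b)) {
            /* Being invalidated, or IO already pending: drop it. */
            b->on_leafs = false;
            continue;
        }
        b->on_leafs = false;
        return b;
    }
    return NULL;
}

/* The lock is released only after Dirty has been cleared. */
static void sk_write_done(struct sk_blk *b)
{
    assert(b->iolocked);
    b->dirty = false;
    b->iolocked = false;
}
```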


16th March 2009

  FIXME  When we sync a small file, we just write out the inode.
     rollforward currently ignores data in inodes I think.
     That needs to be fixed to ensure this data is safe.

 - stop iblock from disappearing so much.

 - I think...
    While cleaning a file, I truncate it.  This makes it appear
    to fit in the inode but it is very big and we get confused.
   We cannot allocate block 0 until all the others have been
   allocated to 0 and forgotten.
   But what if we truncate a file to 10 bytes, then fsync?
    We need to write the data promptly, but we like doing truncate
    in the background.
   When we extend a file we already need to wait for truncation
    to complete (FIXME do we do that?)  We could wait on fsync too.
   We cannot just delay block0 as it might be part of a checkpoint
    that has to complete promptly while truncation can take a long time.
   i.e. we have a very large file.  We update the first byte, then
    truncate to 2 bytes.... we don't need to write until fsync which will wait...
    Directory?? delete lots of entries so it shrinks to one block?
       There is no delayed truncate there.
   ?? Never clean an I_Trunc file.  
   If we try to allocate a file with other indexes:
     clear Realloc
     if Dirty and Pinned, just do normal alloc
     if Dirty and not pinned, skip.


  Sometimes I run out of credits while truncating a file.
  I need credits - maybe only briefly - to dirty the index blocks.
     -- FIXED I think.

  An indexblock remains pinned while the refcount is non-zero.
  A pinned index block can be on a _leaf lru
  The _leaf lru holds a refcount.
  This is an awkward referential loop.
  We break it at checkpoint time with special code in phase-flip.
  But there are other awkward times such as truncate.

  We cannot use PinPending like we do with data blocks because there
  could be multiple pending Pins (from different children).

  We could possibly treat checkpoint_lock like pinpending, but that 
  might be racy.

  We could not count the _leaf lru, but that might just make the race
  harder to find.

  I think we want to explicitly drop the pin when we truncate a block.
  Normally, once we Pin an index block it will become dirty so we don't
  want to de-pin before a checkpoint anyway...

  Just to clarify: an index block gets dePinned:
   - during checkpoint on a phase_flip if it is no longer dirty etc
   - on truncation when we erase it
   - during pre-emptive write-out which is a bit like an early phase_flip
           not sure that we implement that one yet.

17th March 2009
 Deadlock?
   - checkpoint calls incorporate calls erase_iblock calls iolock_block
   - rm calls orphan_pin calls phase_wait
 The problem is in lafs_incorporate.  It expects the block to be iolocked,
  but can call erase_iblock which tries to get an iolock itself...
 ...fixed that and it still happens.
 checkpoint calls phase_flip calls allocated_block (on uninc list) calls
    iolock_block before calling incorporate
 Maybe all of these should assume an IO lock.

 FIXME truncate assumes truncate-to-zero.  We need proper ftruncate support.

 It nearly works....
  Things to do:
    - sort out individual patches and review DONE
    - allow compilation without refcount tracking DONE
    - don't hold a 'leaf' reference. NO
    - clean up *ref calls - differentiate those that can be called when zero DONE
    - use enum for B_* DONE
    - support truncate to non-zero offset DONE
    - "looping on" found an 'OnFree' block!
    - clean out lots of debugging

 Hmmm.... deadlock.
  rmdir is holding i_mutex and waiting for a phase change to pin a dblock.
  checkpoint is also holding i_mutex.. or is trying to get one in lafs_cluster_allocate.
  Not cool.  i_mutex must not be taken by checkpoint
 Fixed that, though it is a bit of a hack....

 New deadlock:  checkpoint calls phase_flip which calls allocate_block,
    to move the uninc_next across, and that tries to iolock the parent to
    perform a partial incorporation.  But that seems to be iolocked.
    Generally that is ugly as ->uninc_next might be very long and require
    multiple splits, and direct-driving that from phase_flip is bad.
    I should just move the list across


19th March 2009
  Spent too long trying to remove refcount held by *_leaf lists.
  This leaves InoIdx blocks with zero refcount so Data blocks can get
  lost and bad things happen.
  I might be able to fix it up, but it is probably better to try the
  checkpoint_lock approach if I can only remember what that is.

Locking:
  Available locks:

   Spin:

    lafs_hash_lock
        Used in:
	   lafs_shrinker
	   lafs_refile ???
        Protects:
	   ib->hash
	   ->lru when on freelist 

    i_data.private_lock
        Used in:
	   lafs_shrinker
	Protects:
	   ->iblock / refcnt
	   ->dblock / my_inode
           ->children / ->parent within an inode
	   setting ->private

    fs->alloc_lock
        fs->allocate_blocks

    fs->stable_lock
        segsum hash table
        segsummary counters (in blocks)

    fs->lock
        _leafs lru
        ->pending_blocks lru - should this be wc->lock ??.. not in 'bh'
        Pinned consistent with lru
        ->checkpointing / ->phase_locked
        fs->pending_orphans
        ->uninc and ->chain ??  Should use parent->B_IOLock ??
	uninc_table - should use B_IOLock
	free list / clean list segtrack

   Mutex:

    fs->wc->lock 
      wc[0] .. something in prepare_checkpoint
       ->remaining etc
      cluster_flush
      mini blocks

    i_mutex
      inode_map
      orphans

   Other:

    B_IOLock
       erase_block
       incorporate
       cluster_allocate
       allocated_block
       IO
       Phase flip
       Initialising new inode
    B_IOLockLock
         IOLock across a page


--------------------
This is a list from 18 months ago, with updates

 - Understand how superblock 'version' should be used.

 -  Review and fix up all locking/refcounts.  See locking.doc
       Also lock inode when copying in block 0 and probably
       when calling lafs_inode_fillblock (??)
 -  lafs_incorporate must take a copy of the table under a lock so
         more allocations can come in at any time.

 - We don't want _allocated to block during cluster flush.  So have
   a no-block version and queue blocks on ->uninc if we cannot
   allocate quickly.  Find some way to process those ->uninc blocks.

 - Use above for phase_flip so that we don't need to _allocated there.

 - Utilise WritePhase bit, to be cleared when write completes.
     In particular, find when to wait for Alloc to be cleared if
      WritePhase doesn't match Phase.
       - when about to perform an incorporation.
 - make sure we don't re-cluster_allocate until old-phase address has
     been recorded for incorporation.

 - allocate multiple WAIT_QUEUE_HEADS for 'block_wait'

 - Can inode data block be on leafs while index isn't, what happens if we
       try to write it out...

 -  If InoIdx doesn't exist, then write_inode must write the data block.

 - document and review all guards against dirtying a block from a previous phase
    that is not yet safe on storage.
          See lafs_dirty_dblock.
 - check for proper handling of error conditions
     b/ if lafs_seg_dup fails in do_checkpoint, we need to abort the snapshot.
 - review checkpoint loop.
       Should anything be explicit, or will refile do whatever is needed?
 - Waiting.
       What should checkpoint_unlock_wait wait for?
       When do we need to wait for blocks to change state. And how?

 - load/dirty block0 before dirtying any other block in depth=0 file

 - use kmem_cache for 'struct datablock'
 - indexblock allocation.
        use kmem_cache
	allocate the 'data' buffer late for InoIdx block.
	trigger flushing when space is tight
	Understand exactly when make_iblock should be called, and make it so.
 - use a mempool for skippoints in cluster.c
 - Review seg addressing code in cluster.c and make sure comments are good.
 - consider ranges of holes in pending_addr.

 - review correct placement of state block given issues with stripes.

 - review segment usage /youth handling and make a todo list.
      a/ Understand ref counting on segments and get it right.
 - Choose when to use VerifyNull and when to use VerifyNext2.
 - implement non-logged files
 - Store accesstime in separate (non-logged) file.
 - quotas.
        make sure files are released on unmount.

 - cleaner.
       Support 'peer' lists and peer_find. etc
 - subordinate filesystems:
     a/ ss[]->rootdir needs to be an array or list.
     b/ lafs_iget_fs need to understand these.
 - review snapshots.
      How to create
      how they can fail / how to abort
      How to destroy
 - review unmount
      - need to clean up checkpoint thread cleanly - be sure it has fully exited.
 - review roll-forward
      - make sure files with nlink=0 are handled well.
      - sanity check various values before trusting clusters.

 - Configure index block hash_table at run time based on memory size??
 - striped layout.
         Review everything that needs to handle laying out at cluster
         aligned for striping.

 - consider how to handle IO errors in detail, and implement it.
 - consider how to handle data corruption in indexing and directories and
     other metadata and guard against problems (lot of -EIO I suspect).

 - check all uninc_table accesses are locked if needed.

 - If a datablock is memory mapped writeable, then when we write it out,
     we need to either fill up its credits again, or unmap it.
 - Need to handle orphans asynchronously.

 - support 'remount'
 - implement 'write_super' ??

 - pin_all_children has horrible gotos - remove them.

 - perform consistency check on all metadata blocks read from disk
   e.g. don't assume index blocks are type 1 or 2.

23rd March 2009
 + looking at cleanup for unmount.
 - various more refcounts fixed up
 - B_SegRef is never dropped!  and we take a ref on a segment when
   we start a cluster on it, but never drop that reference.
  THIS is next thing - review all setting and clearing of B_SegRef.

30th March 2009
 - SegRef and lafs_reserve_block...
   There is room for recursion here, I need to be careful.
   To dirty a data block, all parent index blocks must be Pinned and must
   be able to be written.  That means their segusage blocks must be
   available for update.  And Pinning a segusage block for update requires
   all its parents.  So the segment for the block, the indexes, and the
   segusage and indexes and so-on must all be pinned.
   When we pin a block, we do it from the root down to avoid recursion.
   We probably want whatever reserve_block calls, to return an unreserved
   block rather than call reserve_block itself.
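   A minimal sketch of pinning from the root down without recursion.
   struct sk_node, SK_MAX_DEPTH, the pin_seq field and sk_pin_path are all
   hypothetical, and the sketch covers only one chain of parents, not the
   segusage blocks that each of those blocks would drag in as well.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SK_MAX_DEPTH 16   /* hypothetical bound on index tree depth */

struct sk_node {
    struct sk_node *parent;   /* NULL at the root */
    bool pinned;
    int pin_seq;              /* order in which this node was pinned, 0 if never */
};

/* Pin leaf and every ancestor, root first, using an explicit path array.
 * Returns the number of nodes newly pinned, or -1 if the path is too deep. */
static int sk_pin_path(struct sk_node *leaf, int *seq)
{
    struct sk_node *path[SK_MAX_DEPTH];
    int depth = 0, newly = 0;

    assert(leaf != NULL && seq != NULL);

    /* Walk up once, recording the path. */
    for (struct sk_node *n = leaf; n != NULL; n = n->parent) {
        if (depth == SK_MAX_DEPTH)
            return -1;
        path[depth++] = n;
    }

    /* Walk back down, so a parent is always pinned before its child. */
    for (int i = depth - 1; i >= 0; i--) {
        if (path[i]->pinned)
            continue;
        path[i]->pinned = true;
        path[i]->pin_seq = ++*seq;
        newly++;
    }
    return newly;
}
```

   The same shape would suit the segusage side: whatever reserve_block
   calls would hand back unreserved blocks to be appended to a work list,
   rather than calling reserve_block itself.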

  When do we clear SegRef?? We set it when Pinning, so I guess we
    clear it when unpinning.
   pin_dblock, mark_cleaning, prepare_write, truncate
   seg_move clean_free
  Well it is really when Pinning, or Dirtying or Reallocing.
  So we clear when unpinning, or when a dblock gets written...
  Maybe just when we lose ->parent

6th April 2009
 - sometimes segsum counter goes zero for random data block
     Something is going wrong in roll-forward.  The block looks transiently valid
     so doesn't get read, but has no good data in it.
 - After deleting a directory, the block might still have incorporation
   to happen, but is not marked dirty
 - at unmount, there are various blocks that are still dirty.
 - sometimes hit BUG_ON(credits==0) line 1196 in cluster.c(cluster_flush)

12th April 2009
 - that rollforward problem above:
    When rolling the checkpoint, if we find segusage blocks we want to include
    them directly into file.  But by pinning the block we might preread a
    segusage block.. but we must be sure not to update it.
    So during the early stages of rollforward while still in the checkpoint,
    seg_inc must be called with in_phase == 0.
    so seg_move is called with phase != qphase.
    ditto for summary update.
    So the block must be pinned to the previous phase...
    Normally 'phase' changes at checkpoint-start,
             qphase changes at checkpoint-end
    So we probably want to start with qphase being 0 and phase being 1.
    When we reach the end of the checkpoint, we flip qphase to 1.
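    A minimal sketch of the phase/qphase relationship described above.
    struct sk_fs and the function names are hypothetical; the only facts
    taken from these notes are that phase changes at checkpoint start,
    qphase changes at checkpoint end, and roll-forward wants to start with
    qphase 0 and phase 1.

```c
#include <assert.h>
#include <stdbool.h>

struct sk_fs {
    int phase;    /* flips when a checkpoint starts */
    int qphase;   /* flips when that checkpoint ends */
};

/* The two differ exactly while a checkpoint is in progress. */
static bool sk_in_checkpoint(const struct sk_fs *fs)
{
    return fs->phase != fs->qphase;
}

/* Roll-forward begins inside the checkpoint it is replaying. */
static void sk_rollforward_init(struct sk_fs *fs)
{
    fs->qphase = 0;
    fs->phase = 1;
}

static void sk_checkpoint_start(struct sk_fs *fs)
{
    assert(!sk_in_checkpoint(fs));
    fs->phase = !fs->phase;
}

static void sk_checkpoint_end(struct sk_fs *fs)
{
    assert(sk_in_checkpoint(fs));
    fs->qphase = !fs->qphase;
}
```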

 - blocks still in phase_leafs at unmount:
    After we force a final checkpoint we still have Pinned:
        root InoIdx
        ino==8 InoIdx due to Dirty block0
        ino=16 InoIdx due to dirty block0
     and dirty:
        inode block 1,  inode usage map
                    2,  root directory
                    8,  orphan
                   16   seg usage
     Problems:
        inode blocks dirty but not pinned?  No InoIdx...
        Segusage dirty - probably by seg_apply_all - disable that at umount
        orphan dirty ??... but not pinned!
           This is possible - we don't pin for clearing entries, just for setting.
        The inode problem stems from the datablock being dirty while the
         InoIdx block isn't.  That is, at best, confusing.

13th April 2009
   segusage blocks aren't being pinned
   They need to be pinned  whenever dirty.
   and youth blocks aren't even made dirty some times.  They need to be
    pre-pinned in many cases.

   So: segusage gets changed when we write out a cluster, and when we
      delete/relocate blocks.
      In the first case we pin the block when it becomes part of the free list,
      and need to keep it pinned across checkpoint changes.
      In the second, we pin when the block is dirtied and again must keep it pinned.
      Youth gets changed when a segment becomes free and again when we allocate
      a segment to it.

      Keeping a datablock pinned across checkpoints is awkward - we currently need
      to repin for each dirty... I guess we can re-pin for each checkpoint
      in lafs_seg_apply_all.  That might work for segusage, but not for youth!
      If segsum for ssnum==0 held a reference to the youth block, that might
      help.  Segstat on 'clean' or 'free' would imply a reference to that segsum.

      Is it OK to keep all youth/usage blocks for free/clean blocks
      pinned?  We can currently have 810 entries.  Only half will be clean/free.
      For each entry there can be two blocks, youth and usage.  So that could be
      810 blocks. 1Meg?  Normally much less.  If it became a problem we could
      reduce the number dynamically I guess.

      maybe segusage blocks need to get phase_flipped, as other blocks do
      depend on them,   pin_all_children wouldn't be able to find them though..

    1/ Any address on 'clean' or 'free' segtrack implies a refcount on the
      Youth block.
      
14th April 2009
   I think I want to link dirty block to the space in free segments that we
   actually know about.  Each of those segments has youth and usage blocks
   pinned (at least parent pointer is active).  So we have everything we need
   to write everything that is dirty.  So 'free' or 'clean' implies
   a segsum reference which holds youth block.

   When we get low on space, we wait for cleaning/finding to progress.
   This would limit us to  400 segments, say 16Meg each, so 6Gig of dirty
   memory.  I guess that we need to scale the 'free' list based on available
   memory (FIXME).

   When cleaning needs a segment, it needs to load the usage blocks for other
   snapshots too.

   When cleaning in the presence of snapshot we need to be careful never to
   duplicate a block that is shared.  To allow for v.many snapshots, we don't
   even want to duplicate in memory.
   So we need to choose a 'primary' copy - probably first one found - and
   follow the peers link when possible...

18th April 2009
   (continuing).

   So clean and free segments in the list carry a SegRef.  But it could be
   excessive if all of them did - we shouldn't be required to pin more
   data than we need.
   So for segments with a usage of 0, we use the score to record if a
   segref is held.  0 means 'no', 1 means 'yes'.
   When space_alloc wants more space we need to find an entry and
   segref it.  Maybe we want free lists - reffed and not-reffed.

   Then again, SegRefs are fairly cheap as they are heavily shared.
   maybe 512 to a block.  If we hold 400 refs they could easily all be
   in one block.  We could possibly encourage this by sorting the list
   and discarding from one end if it is too full.
   Sorting is a good idea definitely.  It keeps youth/usage updates
   together.

   Just check the numbers.
   a 1TB device with 1K blocks might have 32M segments of which there
   would be 32768.  512 per block means 64 blocks or 16 pages (64K).
   So total segusage files is 128K plus snapshots.  Not worth worrying
   about surely.
   For 16TB, that is 2Meg plus snapshots.
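
   The same numbers as a quick calculation.  This assumes 2-byte entries
   (which is what "512 per block" with 1K blocks implies) and 32Meg
   segments; the function names are only for this sketch.

```c
#include <assert.h>
#include <stdint.h>

/* Number of segments on a device, sizes in bytes. */
uint64_t seg_count(uint64_t dev_bytes, uint64_t seg_bytes)
{
	assert(seg_bytes > 0);
	return dev_bytes / seg_bytes;
}

/*
 * Bytes used by one per-segment table (youth or usage), rounded up to
 * whole blocks.  Assumes one 2-byte entry per segment.
 */
uint64_t table_bytes(uint64_t nsegs, uint64_t block_bytes)
{
	uint64_t per_block = block_bytes / 2;
	uint64_t blocks;

	assert(per_block > 0);
	blocks = (nsegs + per_block - 1) / per_block;
	return blocks * block_bytes;
}
```

   For 1TB this gives 32768 segments and 64K per table, so 128K for
   youth plus usage; for 16TB it gives 1Meg per table, 2Meg for both.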

   So
    - keep a SegRef for all free and clean blocks.
      This must include a youthblk reference.
    - sort the free list when 'clean' is merged or when a pass
          finishes.
	sort clean list
	fix youth value
	merge as many as fit into free
	sort 
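
   A sketch of those four steps on plain arrays.  The real tables are
   not arrays of this struct; 'struct seg_ent', the capacity argument and
   the 'new_youth' placeholder are assumptions made for illustration, and
   sorting by dev then segment number is a guess at the useful order.

```c
#include <assert.h>
#include <stdlib.h>

struct seg_ent {
	int dev;
	unsigned int seg;
	unsigned int youth;
};

/* Order by device, then segment, so neighbours share youth/usage blocks. */
int seg_cmp(const void *a, const void *b)
{
	const struct seg_ent *x = a, *y = b;

	if (x->dev != y->dev)
		return x->dev < y->dev ? -1 : 1;
	if (x->seg != y->seg)
		return x->seg < y->seg ? -1 : 1;
	return 0;
}

/*
 * Move entries from clean[] into free_list[] until free_list holds
 * free_max entries.  Returns how many were moved; entries that did not
 * fit are kept, still sorted, at the front of clean[].
 */
size_t merge_clean_into_free(struct seg_ent *clean, size_t *nclean,
			     struct seg_ent *free_list, size_t *nfree,
			     size_t free_max, unsigned int new_youth)
{
	size_t moved = 0, i;

	assert(*nfree <= free_max);
	qsort(clean, *nclean, sizeof(*clean), seg_cmp);	/* sort clean list */
	while (moved < *nclean && *nfree < free_max) {
		clean[moved].youth = new_youth;		/* fix youth value */
		free_list[(*nfree)++] = clean[moved++];	/* merge what fits */
	}
	for (i = moved; i < *nclean; i++)		/* keep the leftovers */
		clean[i - moved] = clean[i];
	*nclean -= moved;
	qsort(free_list, *nfree, sizeof(*free_list), seg_cmp);	/* sort */
	return moved;
}
```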

   How is the code flow...
      add_cleanable is called during the periodic scan.  It could hold
               a SegRef easily.
      add_cleanable calls add_clean as does lafs_get_cleanable during
          clean.  That might block getting a segref, might even
          deadlock?
      add_free is also called by seg_scan

      So seg_scan should get a segref and leave it with everything!

    BUT.....
    A SegRef implies a 'struct segsum' for each segment.  We don't
    want to allocate one of these for every segment in the table.
    We only want a reference to the youth and segusage block, which
    are heavily shared.

    But these blocks need to be Pinned and SegReffed etc so we can
    write them at any time.

20th July 2009
  The refcount held by the 'leaf' lru is a problem.
  While it holds a count we do not unpin an index block, so it cannot
  be removed from the list.
  Thus we can only remove from the leaf lru on a phase change.....
  Or when doing lru based flushing... Maybe we can remove from the
  lru while holding the checkpoint lock.
  This happens when truncating..

  No, that is just too messy as it is too easy to get put back on the list.

  Maybe the leaf lru should not imply a reference count ... or maybe
  we need to split the refcount:  'inuse' and 'active'.....
  How about we test refcnt against list_empty(->lru)...

  ....

  During truncate, we need each index block to get unpinned so they can
  all be cleaned up.
  But the InoIdx block is held pinned by the inode block being dirty.
  In this particular case, the InoIdx block is Invalid as the file is empty.
  But.... InoIdx should always be valid until after Inode is destroyed??


 umount
 I need to stop the cleaner and flush everything before trying to
 clean up.

 This is awkward though.
 The 'sync' of umount is done by kill_block_super, but I call
 that rather late, after checking that the tree is empty.
 There are pinned/dirty bits left after sync that we want to magically
  clean.
 We have:
   - segusage/youth blocks.  Maybe if we don't seg_apply_all...
   - orphan block.  Maybe don't mark it dirty when we remove things?
   - inode map?? why is that dirty

   - root directory is dirty still??  But it has been erased.
     InoIdx is valid-but-empty.  Inode Data is dirty
        Data block 0 is Dirty at block 0.

  ......
 Ahh... need to mark page dirty when block is marked dirty !!

 The seg usage blocks are now flushed out but not incorporated.
 I feel that might be correct - we don't want to care about
 incorporation as we will never use it.
 For this, segusage and quota are very special cases.

 Inode map is no longer dirty, but is pinned
 Orphan does have a dirty block still
    The orphan table contains the root directory.
 root is now clean and gone

 Segusage doesn't get incorporated after last checkpoint now
 so that is better.
 But now we have a circular reference for SegRef.  This should not
 be surprising given the circular problems we had setting SegRef.
 I guess we just erase the references in the segsum table...

22nd July 2009
 Hurray!!! I can unmount without crashing!
 Now I need to sort through all the fixes required to achieve that
 and make discrete patches, and be sure it is all OK.

DONE - (block.c) lafs_get_block should not have to lock that page just to do a lookup.
DONE - (block.c) Mark page dirty when block becomes dirty
DONE - (checkpoint.c) print orphan_slot with Orphan flag
DONE - Don't incorporate segcount etc after final checkpoint
DONE - Don't apply seg changes after final checkpoint.
DONE - Don't start opportunistic checkpoint after final.
DONE - (checkpoint) if InoIdx isn't dirty but InodeData is, then still allocate
DONE - (checkpoint) when waiting, wait for checkpointneeded to get cleared
DONE - (cluster) be more flexible about credit usage when flushing InoIdx
DONE - (dir) do add_orphan when we abort as well as on success
DONE - use inode_dec_link_count, not i_nlink--
DONE - (file.c) lafs_writepage: remove from leafs when we cluster_allocate
DONE - change %d/%d to strblk
DONE - (index.c) refile: IF B_IOLOCK, then it isn't on LRU
DONE - (index) refile: when unpinning, remove from lru
 - lafs_refile: ->iblock can be non-null for inode 0.
DONE - Make sure I_Deleting gets cleared when deleting finished.
DONE - phase_flip should have something separate to call, not lafs_allocated_block
 - inode.c: lafs_dirty_inode: getref_lock used to get dblock
NONO - ?? getref_locked allowed if PagePrivate
DONE - segment: lafs_seg_put_all needed at unmount
DONE - segdelete_all: need to put intable references
DONE - lafs_free_get: put the intable references
DONE - lafs_get_cleanable: put the intable references
DONE - fix sort splitting in add_cleanable
DONE - add lafs_empty_segment_table for unmount
DONE - lafs_release: flush all dirty blocks
DONE - lafs_release: force a final checkpoint
DONE - lafs_release: move kill_block_super before final check
DONE - lafs_put_super: release orphans and segsum files.
DONE - lafs_destroy_inode: putref should be 'iblock'
 - lafs_destroy_inode: allow for iblock to be present but no ref held....
DONE - can roll forward call lafs_allocated_block without dirty???

27th July 2009.
 - I've re-arranged lafs_release so that the flush is all done in
   generic_shutdown_super.  However it calls invalidate_inodes, and that has
   problems with pinned inodes.  So we need for fsync_super to checkpoint
   out all inodes that we don't hold our own reference to.  
   If we do hold a reference, then invalidate_inodes will skip them,
   and ->put_super can be used to drop the references and perform the final
   checkpoint.
   fsync_super calls ->sync_fs after syncing all files.  Maybe I can
   do some sort of checkpoint there...
   There almost is a checkpoint in there.... But only when called without
   'wait'....
   I need to understand 's_dirt'.
   This is controlled entirely by the filesystem, common code only examines it.
   If it is set:
          file_fsync (the generic 'fsync' method) will call ->write_super
          fsync_super will call write_super
          generic_shutdown_super will call write_super
          sync_supers will call write_super
          sync_filesystems(0) will call ->sync_fs
   sync_fs is called:
        twice from 'sync', once with '0', once with '1' for 'wait'.
             (though in emergency_sync, both are '0').
        once from unmount and remount with 'wait' set to '1'.
        We don't want two checkpoints for a 'sync', but we want to start
        on 'wait=0'.
        Maybe if we get called with '0', we set a flag and treat the '1'
        differently..  There is no locking to make this really safe, but
        it will probably be OK...  I could take a process_id, but then
        parallel 'sync's could race.
        write_super is called before the syncs.  So it could start the checkpoint,
        and sync could wait for it.
        write_super is called multiple times at shutdown.  We really need
        to utilise s_dirt to avoid some of these.
        We set s_dirt to 0 when we set CheckpointNeeded, and set it to 1:
            - when we pin a dblock or dirty a this-phase iblock.

29jul2009
  at unmount, we iput the root inode which de-references the dblock
  before clearing ->iblock, which fails an assertion ... why?
   Apart from the shrinker, ->iblock is only set to NULL in refile
   when we find an I_Destroyed inode... I guess the root block isn't
   getting Destroyed...
 The protocol for freeing iblocks is bad.  Should be:
   - it only gets freed by the shrinker
   - when inode dies, set ->inode to NULL
   - when InoIdx iblock dies, set ->iblock to NULL
   ...???
30Jul2009
  So, what exactly is the protocol?
    - index blocks live either in the parent/sibling tree, or
      on the inode's free_index list
    - when refcnt is 0, they live on 'freelist.lru'.  When refcount
      is elevated they stay on lru until they need to be 
      added to some other lru (leafs or cluster)
    - when shrinker finds block on freelist.lru with non-zero refcnt,
      it just removes from lru
    - when shrinker finds free block, it removes from free_index and discards
      the block FIXME can refcnt=0 still have Pinned,Uninc,Realloc,Dirty ??
        I think not as such would either have children or be on an lru
    - When we destroy an inode, all index blocks get disconnected from the
      inode and freed.  This must include the ->iblock
    - When an index block becomes free due to index tree shrinkage,
      we set the ->depth to -1 so that it cannot be found by mistake,
      and leave it for shrinker or inode destruction.
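
    The shrinker side of that protocol, as a toy model.  The struct is
    hypothetical (flags instead of real list membership) and only covers
    the two shrinker cases and the depth = -1 marking described above.

```c
#include <assert.h>

struct iblk_model {
	int refcnt;
	int on_freelist_lru;	/* 1 while on freelist.lru */
	int on_free_index;	/* 1 while on the inode's free_index list */
	int discarded;		/* 1 once the block has been freed */
	int depth;
};

/*
 * One shrinker visit to a block found on freelist.lru.
 * Returns 1 if the block was discarded, 0 if it was only unlinked.
 */
int shrinker_visit(struct iblk_model *ib)
{
	assert(ib->on_freelist_lru);
	ib->on_freelist_lru = 0;
	if (ib->refcnt != 0)
		return 0;	/* in use: just remove from the lru */
	ib->on_free_index = 0;	/* free: remove from free_index ... */
	ib->discarded = 1;	/* ... and discard the block */
	return 1;
}

/* Index tree shrinkage: hide the block, leave freeing to someone else. */
void tree_shrink_release(struct iblk_model *ib)
{
	ib->depth = -1;
}
```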

   Confused about inode<->dblock dependence.
   We don't want the inode to refcnt the dblock as that wastes space.
   We don't want the dblock to refcnt the inode as that stops it from being freed.
   So each must disconnect from other when freed.
   What locking?
   inode takes private_lock, then checks dblock
   dblock cannot take private_lock before checking ->my_inode..
   Maybe: destroy_inode takes ref on dblock, then sets I_Destroyed, then
     drops ref

1Aug2009.
  Tracking down the 'credit' count and making sure it stays correct.
  It seems that I have a Dirty InoIdx block which is not pinned.
  Due to this it has no refcount and so the data block disappears so
  the InoIdx block is not visible in the tree.  This isn't a definite bug
  but it means I cannot count credits properly.
  And surely Dirty index blocks must always be pinned!!??

  When a small file is flushed to the inode we were dirtying the
  iblock.  That seems wrong - should dirty the dblock?  Need to 
  check that is valid

  I got a hang in 'rm adir/4'.
  rm is in lafs_cluster_update_commit_both
       getting a mutex.
  cleaner is in lafs_do_checkpoint+0xe4
  pdflush is in writepage/lafs_cluster_flush waiting on a lock
  so I guess cleaner is holding a mutex and waiting for something
   that won't happen?


  Hang again at 'seq 1 200' in 'cd /mnt/1/adir'.
   cleaner is at some point, holding a mutex to stop 'sh'.
  0e4 == 228

  ahh.. prepare checkpoint holds wc[0].lock while waiting for checkpoint
   to be allowed.
  So when something locks the checkpoint and needs to flush, we have problems....


  I seem to have fixed the above.  Now:
    Free space is a real problem.  When I remount after the successful unmount,
    we find a usage pattern like:
CLEANABLE: 0/0 y=10 u=34179
CLEANABLE: 0/1 y=0 u=65144
CLEANABLE: 0/2 y=0 u=65535
CLEANABLE: 0/3 y=32773 u=32910
CLEANABLE: 0/4 y=32772 u=149
CLEANABLE: 0/5 y=0 u=0
CLEANABLE: 0/6 y=32770 u=16529
CLEANABLE: 0/7 y=32769 u=35084
CLEANABLE: 0/8 y=32768 u=31877

    Which is ridiculous. 
   Better fix up what I have first...

 ...
 In rm /mnt/1/nbfile* we hang.. 
   rm is in lafs_phase_Wait from pin_dblock in unlink
wait for [ce5c2d20]277/0(0)r2F:Index(1),Pinned,Phase0,InoIdx,Valid,Dirty,CN,CNI,UninCredit,PhysValid{0,0}[8] pindb(1) leaf(1)

   cleaner is in lafs_iolock_block from add_block_address in phase_flip
iowait for [ce5c33b0]286/0(0)r6E:Index(1),Pinned,Phase1,WPhase0,InoIdx,Valid,Dirty,CN,CNI,UninCredit,IOLock,PhysValid{0,0}[1] child(5) leaf(1)

 So cleaner is probably deadlocking against itself via iolock_block.
  This is taken:
    - in lafs_invalidate_page just to wait for any io - it isn't held long
    - in lafs_erase_dblock while we erase and 'allocated_block'
    - in lafs_get_flushable to protect blocks being checkpointed
    - in lafs_writepage to call cluster_allocate (which releases), both for
             data block or for inode when data was flushed there.
    - lafs_add_block_address to process pending incorporations to make room.
         This is what is trapping the cleaner.
    - lafs_inode_handle_orphan when truncate finishes to erase_iblock
    - lafs_inode_handle_orphan again to incorporate all removal
    - and again to erase_iblock
    - and for partial truncate to incorporate some removals
    - and again....
    - lafs_new_inode to keep it from being cleaned while being created
    - roll_block to add addresses
    - lafs_load_block during IO

  So: who holds it?.... let's use the code to find out...
  And the answer is : lafs_get_flushable.
   So get_flushable iolocks the block then calls phase_flip which tries to
   incorporate other-phase children which try to iolock the block.  Deadlock.
   Do we need to hold iolock during phase_flip ??.  Not for all of it..

02August2009
   FIXME When erasing a block, do I need an uninc credit?  I usually don't
    have one and the need certainly isn't as great...

  Now... let's try to get free space accounting right.
   Observed problems:
     - unlink sometimes failed with ENOSPC
     - usage scan shows segments with enormous usage - 23039!!

  no credits: [ce9a55cc]16/1(2651)r11E:Pinned,Phase1,WPhase1,Valid,Dirty,SegRef,C,CI,CN,CNI,UninCredit,IOLock,PhysValid intable(7) ss(3) cluster(1)
  no credits: [cfb695cc]16/1(1840)r12E:Pinned,Phase0,WPhase0,Valid,Dirty,SegRef,C,CI,CN,CNI,UninCredit,IOLock,PhysValid intable(7) ss(4) cluster(1)

  no uninc credit [cfb695cc]16/1(2019)r13F:Pinned,Phase0,WPhase0,Valid,Dirty,SegRef,C,CI,CN,CNI,IOLock,PhysValid intable(7) ss(5) cluster(1)


  after umount/remount df says "4608 7 1544" but cannot
   create anything.
df: tot=4608 free=4601 avail=1544(1564-20) cb=7 pb=0 ab=0
============= Cleanable table (7) =================
pos: dev/seg  usage score
  0:   0/0        1 0
  1:   0/5        1 64
  2:   0/6        6 384
  3:   0/7        2 128
  4:   0/8        3 192
  5:   0/3        1 64
  6:   0/2        2 128
...sorted....
  0:   0/0        1 0
  1:   0/3        1 64
  2:   0/5        1 64
  3:   0/2        2 128
  4:   0/7        2 128
  5:   0/8        3 192
  6:   0/6        6 384
--------------- Free table (1) ---------------
12290:   0/4        0 0
--------------- Clean table (0) ---------------
CLEANABLE: 0/0 y=10 u=1
CLEANABLE: 0/1 y=32775 u=3
CLEANABLE: 0/2 y=32774 u=2
CLEANABLE: 0/3 y=32773 u=1
CLEANABLE: 0/4 y=0 u=0
CLEANABLE: 0/5 y=32771 u=1
CLEANABLE: 0/6 y=32770 u=6
CLEANABLE: 0/7 y=32769 u=2
CLEANABLE: 0/8 y=32768 u=3


03Aug2009
 Current issues:
FIXED 1/ Data 16/1 is being Reallocated, but is Dirty, not Realloc
Gone,presume FIXED 2/ Data 16/1 has no uninc credit in cluster_flush
 3/ in cleaner, ->dblock is uninitialised.... actually inode has been freed.
 4/ invalidate_page finds Realloc set, even after iolock ..
     This is during umount  in generic_shutdown/lafs_put_super/iput
 5/ 


 Thoughts:
   If we flag a block for Realloc then Dirty before it is allocated,
     then all is fine.
   But if we have already allocated to a cleaning cluster... what happens?
    We need to treat this like it was dirtied after being written, so
    it gets written to a regular cluster as well.
    As we only have one uninc bit for both Dirty and Realloc, we need
    to *not* incorporate the Realloc update if the block is still dirty.
   So:
        - block gets chosen for cleaning and allocated to a clean-cluster
        - block gets marked dirty.  This must not clear Realloc
        - cluster is flushed, block is dirty, so don't call lafs_allocated_block
        - Return the Realloc credit, but keep dirty and Uninc.
     Is there a race if Dirty is set after we enter lafs_allocated_block?
      As long as the index block gets marked Dirty, not Realloc we might
       be safe... though it gets awkward if the Dirty writeout falls in to
       the next phase.  But reserve_block will have provided NCredits for that.
     So:
        1/ don't clear Realloc when setting Dirty
        2/ do clear Realloc if cleaner finds the block is Dirty
        3/ avoid calling lafs_allocated_block when cleaning a dirty block.
                   This is an optimisation.
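
     Those three rules written out on a bare flags word.  The bit values
     and function names are stand-ins, not the real B_Dirty/B_Realloc
     definitions, and credits are not modelled at all.  That the cleaner
     sets Realloc on a clean block is my reading of the sequence above.

```c
#include <assert.h>

enum { B_DIRTY = 1 << 0, B_REALLOC = 1 << 1 };	/* stand-in bit values */

/* Rule 1: setting Dirty leaves Realloc alone. */
unsigned int mark_dirty(unsigned int flags)
{
	return flags | B_DIRTY;
}

/*
 * Rule 2: when the cleaner looks at a block that is already Dirty it
 * clears Realloc; otherwise it flags the block for reallocation.
 */
unsigned int cleaner_visit(unsigned int flags)
{
	if (flags & B_DIRTY)
		return flags & ~(unsigned int)B_REALLOC;
	return flags | B_REALLOC;
}

/*
 * Rule 3: when a cleaning cluster is flushed, only record the new
 * address (the lafs_allocated_block step) if the block did not get
 * dirtied in the meantime.
 */
int cleaning_flush_should_allocate(unsigned int flags)
{
	assert((flags & ~(unsigned int)(B_DIRTY | B_REALLOC)) == 0);
	return (flags & B_REALLOC) && !(flags & B_DIRTY);
}
```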

    Almost...  A B_Realloc block no longer has B_Credit so B_Dirty cannot be
       set.


  Thoughts3.
     When cleaning blocks we hold no reference to the inode and it can disappear.
     We don't want to hold the inode active, but need a reference much like
      the truncate code has.
     I think we need a subordinate refcount for both cleaning and truncate.
      These hold inode present but not active.
     Maybe every block->inode should be counted like this.
     And this might simplify the my_inode->dblock inter-relationship.
     For later..
       We need to ensure that if a new iget is called on an inode that still
       exists, we don't allocate a new one but just reuse the old.
       But that won't work as we cannot add an inode back into the hash table.
     So I think when cleaning a block we need to ref the inode.
      i.e. B_Realloc implies an i_grab

05aug2009
 So I have a problem with the cleaner wanting to hold an inode that
 the VFS is destroying.
 I don't want the cleaner to hold i_count as that delays truncate etc.
 So we need a second counter subordinate to i_count.
 This is held by the cleaner and by delayed truncate, and by i_count.
 Possibly ->my_inode holds this, which means it can be a single bit...

 When a lookup wants an inode, we need to load the inode data block and
 see if it has my_inode.  If it does, we insert that inode in to the
 hash table.  If not we fall back to regular inode creation....

 On reflection, that is too complicated and hard and error prone.
 When relocating a file we need the data so it had best be in the page
 cache so the filesystem really needs to know that the inode is still
 active.
 So cleaning needs to keep a reference to the inode.
 The cost of this is that if an inode is being deleted while it is
 being cleaned the truncate cannot happen until the cleaning
 completes.  This means that space usage will be wrong.
 When nlink becomes zero we can drop the cleaner reference.  When
 the inode is dropped/destroyed we can tie the cleaning in with the
 delayed truncate so that the final destruction doesn't happen until
 the cleaner has let go.

 So: how to track that the cleaner has a reference to the inode?
 Maybe every B_Realloc block owns a ref on the inode.... but dropping
 those references when i_nlink hits zero would be difficult.
 They could hold a secondary refcount which, if non-zero, implies a
 ref on the inode.

 So:
  - Set B_Cleaning when we look at a block for cleaning, and clear
    it when we find Realloc clear and ....????
  - Whenever a block has B_Cleaning set, it holds a counted reference
    on LAFSI(b->inode)->cleaner_ref
  - When cleaner_ref is non-zero and I_Deleting is not set, we hold
    a reference on the inode (i_grab).
  - when i_nlink hits zero, set I_Deleting and drop any reference
    held by the cleaner.
 DONE - cleaner must be careful not to process any block that has been
    truncated, or file that is dead.
 DONE - Make sure the cleaner doesn't start up after the FinalCheckpoint.
  - What about filesystem inode... how do they fit in??
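
  A model of the cleaner_ref idea from the list above.  'inode_model'
  and its fields are invented for the sketch (i_count stands in for the
  VFS count, 'deleting' for I_Deleting); it only shows when the single
  inode reference would be taken and dropped.

```c
#include <assert.h>

struct inode_model {
	int i_count;		/* stand-in for the VFS reference count */
	int nlink;
	int cleaner_ref;	/* number of blocks with B_Cleaning set */
	int deleting;		/* I_Deleting */
	int cleaner_holds;	/* 1 while cleaner_ref pins the inode */
};

/* Take or drop the one inode reference so it matches the rule. */
void fix_cleaner_hold(struct inode_model *in)
{
	int want = in->cleaner_ref > 0 && !in->deleting;

	if (want && !in->cleaner_holds) {
		in->i_count++;
		in->cleaner_holds = 1;
	} else if (!want && in->cleaner_holds) {
		in->i_count--;
		in->cleaner_holds = 0;
	}
}

/* A block of this inode gets B_Cleaning set. */
void cleaning_get(struct inode_model *in)
{
	in->cleaner_ref++;
	fix_cleaner_hold(in);
}

/* A block of this inode gets B_Cleaning cleared. */
void cleaning_put(struct inode_model *in)
{
	assert(in->cleaner_ref > 0);
	in->cleaner_ref--;
	fix_cleaner_hold(in);
}

/* unlink/rmdir: when nlink hits zero the cleaner's hold is dropped. */
void drop_link(struct inode_model *in)
{
	assert(in->nlink > 0);
	if (--in->nlink == 0)
		in->deleting = 1;
	fix_cleaner_hold(in);
}
```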


  Question. When are the index blocks for an inode flushed?
  We need to have them gone when the inode disappears.
  For deleted inodes, this happens in background truncate.
  For memory-pressure inodes it will hopefully happen well in advance,
  but we need to make sure in destroy_inode that everything is
  written. - FIXME


  Thinking again about B_Cleaning, any B_Realloc block will hold a
  reference through to InoIdx and so dblock will be present and the
  inode won't be freed.  So we only need an extra reference during
  the first little phase of cleaning when we are collecting blocks.
  After that a reference can be useful as it will delay flushing so it
  can be more efficient...

  Maybe this is all much simpler than I thought.
  If we hold a ref on the inode whenever the InoIdx block is Pinned
  and i_nlink is non-zero, then we won't be forgotten until all
  index blocks are written.  We may still be deleted, but as that
  is one-way we can hold on to the inode at little cost.

  getting/putting that ref at exactly those times turns out to be
  messy.
  It might be best to have a flag to say "We hold an extra ref".
  Then we occasionally call a function that validates the setting.
  It is most important to drop the count at the right time, so
  after unlink/rmdir/rename and when B_Pinned is dropped.

  B_Pinned is set in:
     set_phase which is called from:
          lafs_cluster_allocated when moving 'pin' across to data block
              so don't need checkpin
          lafs_pin_block_ph
              only need check_pin if dropping spinlock
          pin_all_children
              only pins data blocks (Index are already pinned if relevant).
          grow_index_tree
              where "inoidx block pinning" doesn't change
          do_incorporate_leaf
              No InoIdx involved
          do_incorporate_internal
              ditto
   So only need check in lafs_pin_block_ph and maybe pin_all_children...

08Aug2009
  - credits get out of sync from
      lafs_incorporate->refile->space_return from checkpoint.
      counter is one more than we can find.
      returning space on 
         i [cfb9aaf0]327/0(2261)r1E:Index(1),Valid,PhysValid[0] NP
       Note it is an Index but not InoIdx.  The parent is still in the tree.
     That is FIXED

  - and out by 8! at
      delete_inode -> truncate -> invalidate_page->erase_dblock->space_return
    FIXED that.

  - BUG credits<0 in space_return from lafs_incorporate from add_block_address
     from phase_flip
Just Grew [cfbb5c70]331/0(NoPhysAddr)r2E:Index(2),Pinned,Phase1,InoIdx,Valid,Dirty,UninCredit{0,1}[0] child(1) inc(1)
     from [cfbb5a40]331/0(0)r205F:Index(1),Pinned,Phase1,WPhase0,Valid,Dirty,CN,CNI,UninCredit,IOLock,PhysValid{0,0}[0] child(203) leaf(2)
msg: (1,3,1)(1,1,-1)
Credits = -1, rv=1
ib = [cfbb5a40]331/0(0)r205F:Index(1),Pinned,Phase1,WPhase0,Valid,Dirty,CN,CNI,UninCredit,IOLock,PhysValid{0,0}[0] child(203) leaf(2)

    This is a predicted but not handled problem.
    The answer is that not all blocks need ICredit/UnincCredit.
    The purpose of this credit is to allow for a split in the parent.
    pre-existing index blocks can never split the parent themselves
    If an index block becomes full, it will split and this might split
    the parent.
    If an index block has free space, then it will only overflow if it
    gets multiple child updates and this will provide multiple credits.
    So an index block with space for 3 or more new addresses does not need
    an ICredit/UnincCredit.  So when we split we don't need to provide an
    uninc credit.
    In particular.
    When we have a full InoIdx block and a single new child with 1 UnincCredit,
    each block already is either 'Dirty' or has a 'Credit', and the InoIdx has 
    an ICredit, then create a new intermediate such that
        InoIdx is Dirty and has an ICredit
        New Index is Dirty with no ICredit - it used the UnincCredit
        New child loses its UnincCredit
    When another block in the new index arrives, its UnincCredit is used to
    provide an ICredit

    When a leaf block cannot fit a single address it will have ICredit.
    The block is split so that each has 3 spaces and so do not need ICredit,
    but as soon as ICredit is available, they take it.

    Worst case is that every ancestor is full and the leaf is split
    We then get two full branches, each block half empty so not needing ICredit.


  Then...
    free data being used in lafs_refile from cleaner.
    b->inode->i_sb is 0x6b6b6b6b, so inode has been freed before cleaner frees it.
    Answer: lafs_refile was dereferencing ->inode when it wasn't safe.
     Need to at least have a parent before it is safe.

  Hang:
     soft lockup cleaner->lafs_iget->ifind_fast ....
    Then (may be caused)
Oh dear: [cfb63670]284/0(0)r1E:IOLock,PhysValid cleaning(1)
.......: [cfb2ad20]284/0(2198)r2F:Index(1),Pinned,Phase1,InoIdx,Valid,PhysValid{0,0}[0] child(1) leaf(1)
Why have I no credits? [cfb2ad20]284/0(2198)r2F:Index(1),Pinned,Phase1,InoIdx,Valid,Dirty,PhysValid{0,0}[0] child(1) leaf(1)
------------[ cut here ]------------
kernel BUG at /home/neilb/work/nfsbrick/fs/module/block.c:656!

    It seems the cleaner gets confused and goes spinning.


  So: space problems:
    After the run, we have -14 used and 2055 available (of 4608), and
    cannot create anything.
    4 segments are free, one is cleanable.
   free_blocks=2103 allocated=56 max_seg=512 clean_reserved=0
or
   free_blocks=1722 allocated=64 max_seg=512 clean_reserved=0
or
   df: tot=4608 free=4630 avail=977(1033-56) cb=10 pb=0 ab=-32
   free_blocks=1033 allocated=56 max_seg=512 clean_reserved=0
   and very little free

  ablocks_used is going negative - why?
   Probably we erase a dblock without clearing Prealloc.
   Then when Prealloc later gets cleared, ablocks_used is
   wrongly decremented.... no...


10aug2009  (don't forget above problems)
  Another problem.
   read->touch_atime->dirty_inode->inode_fillblock->dirty_dblock
     getiref_lock triggers BUG.
   This is presumably because I have just fixed it to get the correct
     iblock and not the iblock of the filesystem.

  FIXME I hacked around this but I'm not sure the result is right.
    The question is about when the InoIdx should be dirty and when
    the inode data block should be dirty.
   In this particular case we are writing a page of a small file.
     cluster_allocate calls flush_data_to_inode which tried to dirty
     the inode dblock but finds that iblock is not pinned...
     When we dirty a data page we aren't pinning the parent!
   That might be OK - we only need to count and reserve the parent.
    We don't need to pin it until it becomes dirty.

   Still need to resolve when which block gets to be dirty, and also
    exactly when an index block needs to be pinned.  And how does that
    relate to holding a ref on the inode when the inoidx is pinned.
    Maybe it should be when the inoidx is referenced.
   FIXME

11aug2009
   Another problem. unlink->handle_orphans->erase_dblock->allocated_block
    and get a zero from lafs_add_block_address but parent is not pinned.
  And... On unmount, orphan file still has pinned blocks so the inode
    isn't free.
  And ... root still old phase after lots of 'rm' then sync.
    Inode 244 has pinned inode block held by writepage0 and writepage
         this is adir/170

13aug2009
  - lots of bugs introduced by change to marking inode blocks dirty:
     writepage/cluster_allocate wants to Dirty inode data block with no credits.
         because I put credit in iblock!

  - ohhh.... The phase contour is broken.  When a block is added to a
    cluster for allocation it isn't in the phaseleafs any more, but prevents
    its parent from joining.  So we cannot assume that if dblock is on
    list then iblock or a child will be too.
    So when we find dblock we do need to remove it.... done that.

  - root not changing because Data 1/0 is Pinned and IOPending
     and held by writepage!!
     Problem is that IOPending blocks aren't put back on lru.
     But that should only be blocks on the cluster list.....
     But that is where I am putting it.
     Maybe I need exclusion between checkpointing and any other
       code that writes to checkpoint so checkpoint can wait
       for that ... can we use wc->lock??  That doesn't lock
       against cleaner, but that isn't a problem...
   But now 0/228 is still pinned and in writepage and IOPending
    So there is more to it than that.
    When checkpoint finds an IOLocked block, it might be about to
     join a cluster, in which case we don't really want to wait, or it
     might be undergoing incorporation in which case we want to wait.
     or it could be being erased, so wait..
     Maybe I wait until it appears on some list.... yes.

14aug2009
    At unmount Index 8/0 with child and leaf is still pinned
  This was pinned: [cfb29810]8/0(9)r3F:Index(1),Pinned,Phase0,InoIdx,Valid,SegRef,C,CI,CN,CNI,PhysValid{0,0}[0] release(1) child(1) leaf(1)

  and..

  A problem is that something goes wrong in the erase process.
  We find new children after we erase the inoidx block!

  This was pinned: [cfb3d810]8/0(9)r284016F:Index(1),Pinned,Phase0,InoIdx,Valid,SegRef,C,CI,CN,CNI,PhysValid{0,0}[0] release(1) child(1) leaf(284014)

  When/how do we erase indexblock and particularly inoidx blocks?
  Does an inValid InoIdx simply mean there is no indexing and does not
  reflect on the Data block?

.xlooping on [cfbe28c0]331/0(0)r2F:Index(1),Pinned,Phase1,InoIdx,Valid,Dirty,SegRef,C,CI,CN,CNI,UninCredit,PhysValid{0,0}[0] inode_handle_orphan(1) leaf(1)

 Orphan problem:
nextfree = 0
reserved = 0
VFS: Busy inodes after unmount of hdb. Self-destruct in 5 seconds.  Have a nice day...
This was pinned: [cfb2ad20]0/0(13)r5F:Index(1),Pinned,Phase1,InoIdx,Valid,Root,C,CI,CN,CNI,PhysValid{0,0}[0] NP release(1) child(3) leaf(1)
[cfb2ad20]0/0(13)r5F:Index(1),Pinned,Phase1,InoIdx,Valid,Root,C,CI,CN,CNI,PhysValid{0,0}[0] NP release(1) child(3) leaf(1) Leaf1(0) 
  [cfa5374c]0/1(772)r0E:Valid,Dirty,UninCredit,PhysValid
  [cfb54430]0/8(775)r0E:Valid,Dirty,SegRef,UninCredit,PhysValid
  [cfb54c90]0/16(777)r0E:Valid,Dirty,SegRef,C,CI,UninCredit,PhysValid

nextfree = 1
reserved = 0
  0: 1 0 0 304
VFS: Busy inodes after unmount of hdb. Self-destruct in 5 seconds.  Have a nice day...
This was pinned: [cfb273b0]8/0(9)r3F:Index(1),Pinned,Phase1,InoIdx,Valid,SegRef,C,CI,CN,CNI,PhysValid{0,0}[0] release(1) child(1) leaf(1)
[cfb273b0]8/0(9)r3F:Index(1),Pinned,Phase1,InoIdx,Valid,SegRef,C,CI,CN,CNI,PhysValid{0,0}[0] release(1) child(1) leaf(1) Leaf1(0) 
  [cfb74c90]8/0(2404)r1E:Valid,SegRef,CN,PhysValid orphan(1)
badcnt 0 0 [cfb268c0]0/0(13)r4E:Index(1),Pinned,Phase1,InoIdx,Valid,Root,C,CI,CN,CNI,PhysValid{0,1}[0] NP child(4)


erase Inoidx [ce5ab3b0]172/0(402)r1F:Index(1),InoIdx,Valid,IOLock,OnFree,PhysValid[0] inode_handle_orphan(1)
erase Inoidx [ce5ab5e0]74/0(0)r2F:Index(1),Pinned,Phase0,WPhase0,InoIdx,Valid,Dirty,SegRef,C,CI,CN,CNI,UninCredit,IOLock,PhysValid{0,0}[0] child(1) leaf(1)
------------[ cut here ]------------
WARNING: at /home/neilb/work/nfsbrick/fs/module/block.c:579 lafs_erase_iblock+0x
unlink/orphan/erase_dblock_allocated_block
---[ end trace 61b8bd59512ea4da ]---
zz [ce50d6a8]74/1059005010(0)r1E:SegRef,C,CI,UninCredit,IOLock,PhysValid,Orphan(0) orphan(1)
   [ce5ab5e0]74/0(0)r1E:Index(1),InoIdx,SegRef,C,CI,CN,CNI,UninCredit,PhysValid[0] child(1)
   [ce5ab5e0]74/0(0)r1E:Index(1),InoIdx,SegRef,C,CI,CN,CNI,UninCredit,PhysValid[0] child(1)
------------[ cut here ]------------
kernel BUG at /home/neilb/work/nfsbrick/fs/module/index.c:1955!

BINGO.  When we remove last entry from directory we erase the InoIdx block,
 then when we add entries, we hit problems.


nextfree = 3
reserved = 0
  0: 1 0 0 306
  1: 1 0 0 307
  2: 1 0 0 74
VFS: Busy inodes after unmount of hdb. Self-destruct in 5 seconds.  Have a nice day...

This was pinned: [cfb2caf0]16/0(10)r3F:Index(1),Pinned,Phase1,InoIdx,Valid,SegRef,C,CI,CN,CNI,PhysValid{0,0}[0] release(1) child(1) leaf(1)
[cfb2caf0]16/0(10)r3F:Index(1),Pinned,Phase1,InoIdx,Valid,SegRef,C,CI,CN,CNI,PhysValid{0,0}[0] release(1) child(1) leaf(1) Leaf1(0) 
  [ce9794f0]16/0(2200)r1E:Valid,SegRef,CN,CNI,PhysValid cleaning(1)

This was pinned: [cfb2d3b0]8/0(9)r3F:Index(1),Pinned,Phase1,InoIdx,Valid,SegRef,C,CI,CN,CNI,PhysValid{0,0}[0] release(1) child(1) leaf(1)
[cfb2d3b0]8/0(9)r3F:Index(1),Pinned,Phase1,InoIdx,Valid,SegRef,C,CI,CN,CNI,PhysValid{0,0}[0] release(1) child(1) leaf(1) Leaf1(0) 
  [ce968c90]8/0(2175)r3E:Valid,SegRef,C,CI,CN,CNI,PhysValid orphan(3)

This was pinned: [cfb2d180]1/0(8)r3F:Index(1),Pinned,Phase1,InoIdx,Valid,SegRef,C,CI,CN,CNI,PhysValid{0,0}[0] release(1) child(1) leaf(1)
[cfb2d180]1/0(8)r3F:Index(1),Pinned,Phase1,InoIdx,Valid,SegRef,C,CI,CN,CNI,PhysValid{0,0}[0] release(1) child(1) leaf(1) Leaf1(0) 
  [ce968430]1/0(2174)r1E:Valid,SegRef,C,CI,PhysValid cleaning(1)

We have stray 'cleaning' references.
It is taken -
   on a data block that was in a to-clean segment
     at which point we igrab the inode
     the block is put on the ->cleaning list.
It is put:
   when we get an error finding the block
   when we find that it isn't in the segment
   when an error occurs loading the block-to-be-relocated
   and when we mark that block for cleaning.
  i.e. always unless we got EAGAIN or some space error.
   If we still hold some blocks, try_clean returns 0.

VFS: Busy inodes after unmount of hdb. Self-destruct in 5 seconds.  Have a nice day...
This was pinned: [cfb253b0]0/0(13)r5F:Index(1),Pinned,Phase1,InoIdx,Valid,Root,C,CI,CN,CNI,PhysValid{0,0}[0] NP release(1) child(3) leaf(1)
[cfb253b0]0/0(13)r5F:Index(1),Pinned,Phase1,InoIdx,Valid,Root,C,CI,CN,CNI,PhysValid{0,0}[0] NP release(1) child(3) leaf(1) Leaf1(0) 
  [cfa57b7c]0/1(2228)r0E:Valid,Dirty,UninCredit,PhysValid
  [ce5a4430]0/8(2231)r0E:Valid,Dirty,UninCredit,PhysValid
  [ce5a4c90]0/16(1028)r0E:Valid,Dirty,SegRef,C,CI,UninCredit,PhysValid

NOTE these inode data blocks are not pinned and so did not get written!!

FIXME I should wait for the checkpoint to finish
nextfree = 1
reserved = 0
  0: 1 0 0 301
VFS: Busy inodes after unmount of hdb. Self-destruct in 5 seconds.  Have a nice day...
This was pinned: [cfb27810]8/0(9)r3F:Index(1),Pinned,Phase0,InoIdx,Valid,SegRef,C,CI,CN,CNI,PhysValid{0,0}[0] release(1) child(1) leaf(1)
[cfb27810]8/0(9)r3F:Index(1),Pinned,Phase0,InoIdx,Valid,SegRef,C,CI,CN,CNI,PhysValid{0,0}[0] release(1) child(1) leaf(1) Leaf0(0) 
  [ce564c90]8/0(3983)r1E:Valid,SegRef,PhysValid orphan(1)

16Aug2009
  When I clean and find an inode that is already deleted, I need to be
  very careful not to resurrect anything.. I wonder if I am.... Yes, I seem
  to be.  lafs_delete_inode gets called a lot, but mostly for dead inodes.

  BUGS:
FIXED orphans don't get cleaned up.  It seems a 'create' fails and leaves
      an orphan block un-released.
   - sometimes inodes 1,8,16 didn't get written out as they were dirty but not pinned
   - Not sure that we handle complete truncation, then adding blocks properly.
     - what should the state of the InoIdx block be?
   - On remount, the filesystem contains rubbish.
   - create fails even when there should be free space.
   - sometimes BUG in checkpoint.c - not finishing checkpoint properly...
   - iblock not valid for in 327 under cluster_flush/lafs_allocated_block
          and 74 has similar issue
     327 = adir/big1   74=adir


17Aug2009
  Segusage blocks aren't always Pinned when we make them dirty.
  Yes. That is correct.  They are not forced out by phase change but by
  lafs_seg_flush_all at the end of a checkpoint.  So they need to be
  preallocated, but not Pinned.
  But, once we have finished the last checkpoint we don't want to
  dirty Segusage blocks any more.. I wonder if we are.
  No, but we were Pinning inodes without PinPending and they
  lost the pinning straight away!

  OK, other annoyance.
   InoIdx block and similar are getting erased at the wrong
   time.
   We can only safely erase them when they have no children.
   I guess what we really want is that incorporation leaves them
   existing but empty, and when we go to write them out, if they
   are empty we register an address of 0.
   When we drop the ->parent pointer of an Index block it 
   just goes away...
   So:
    When incorporate or truncate produces an empty index block
     it simply clears B_Valid.
    When incorporate wants to add to an index block, we set B_Valid
    When cluster_allocate gets a non-Valid index block it calls
    block_allocated with phys of 0.

    Yes, that seems to work.  Mostly
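
  A minimal user-space sketch of the three rules in the "So:" list above.
  The struct and the sketch_* helpers are hypothetical stand-ins, not the
  real lafs types; only the Valid/phys-0 behaviour is taken from the notes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for an index block (not the real lafs struct). */
struct iblk_sketch {
    bool     valid;     /* models B_Valid */
    int      nr_addrs;  /* addresses currently held in the block */
    uint64_t phys;      /* address recorded at writeout; 0 = no block */
};

/* incorporate/truncate removed entries: an empty block just loses Valid. */
static void sketch_remove(struct iblk_sketch *b, int n)
{
    assert(n >= 0 && n <= b->nr_addrs);
    b->nr_addrs -= n;
    if (b->nr_addrs == 0)
        b->valid = false;
}

/* incorporate wants to add entries: the block becomes Valid again. */
static void sketch_add(struct iblk_sketch *b, int n)
{
    assert(n >= 0);
    b->nr_addrs += n;
    if (n > 0)
        b->valid = true;
}

/* cluster_allocate: a non-Valid index block is recorded with phys 0,
 * otherwise it gets the next free address (assumed to be non-zero). */
static uint64_t sketch_writeout(struct iblk_sketch *b, uint64_t next_free)
{
    b->phys = b->valid ? next_free : 0;
    return b->phys;
}
```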

18Aug2009
  On remount, check_credits dies: 16/20-0
    In pin_dblock/reserve/seg_ref/prealloc/space_return ?? in lafs_mount.

19Aug2009
  OK, this index block clearing is a mess.  There must be a neat model I can
  follow that will make it "just work".
  The key seems to be children.  If an index block has children, then it
  really must exist.  If it has no children and no content, then it can
  be discarded, in which case it needs to be unlinked from its sibling list.
  What locking do we use here?  Probably IOLock on the parent index block.
  So we need iolock while looking in a parent for children, and we take
  IOLock while incorporating or pruning.
  Once the empty index block has dropped out it will never be found again.
  When we incorporate the zero address, the index block becomes invisible
  unless it is shortly after its predecessor in the sibling list.  But
  that is hard to ensure, especially if the first child is the one that
  is being erased.  So if an index block is erased, then it must be
  discarded quickly and any children need to be relocated...
  Or maybe not.... maybe if there are children, we just write an empty block?

22Aug2009
  We need better locking of the index information.
  It seems best to use IOLock as that is already held during incorporation.
  So any code that accesses or updates an index block must hold IOLock.
  This might be a bit of a restriction if we try to do a lookup while
  writeout is happening.... Maybe we need a separate writeback flag for that.
  But I think it is good to use IOLock for now.
  Places we need this are:
     flush_data_to_inode needs to lock the InoIdx block
       - DONE
     lafs_leaf_find as it recurses down.  This should return a locked leaf.
       - DONE
     callers of clear_index
         erase_dblock for depth=0??
       - DONE
     incorporate should lock new blocks for consistency 
       - DONE

   Locking dependency rule is that if we hold a lock, we are allowed to
   lock a child index block, but not a parent.  If we hold a data block,
   we are allowed to lock an index block.


  The read/write completion seems all wrong.  It unlocks if the page was locked,
   and that isn't really safe, because it might not have been locked for read..
   We need to flag block0 to say if lock or writeback need to be cleared.
   Given that, I don't need IOPending any more:
    Read: We submit all reads, then set 'do_unlock', then check if we should unlock.
    Write: We queue all writes, then set 'do_clear_writeback', then check.
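
   Sketch of the "submit everything, then set the flag, then check" idea
   for the read side.  All names here are hypothetical, and the sketch is
   single-threaded; real completion code would need atomics or a lock
   around the counter and flag.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-page state kept on block0. */
struct io_sketch {
    int  pending;    /* requests submitted but not yet completed */
    bool do_unlock;  /* set only after every request has been submitted */
    bool locked;     /* models the page lock */
};

/* Unlock only when submission has finished AND nothing is outstanding. */
static void maybe_unlock(struct io_sketch *io)
{
    if (io->do_unlock && io->pending == 0)
        io->locked = false;
}

static void submit_one(struct io_sketch *io)
{
    io->pending++;
}

static void complete_one(struct io_sketch *io)
{
    assert(io->pending > 0);
    io->pending--;
    maybe_unlock(io);
}

/* Called once after the last submit: "set 'do_unlock', then check". */
static void submit_done(struct io_sketch *io)
{
    io->do_unlock = true;
    maybe_unlock(io);
}
```

   The point of the flag is that completions which arrive before the last
   submit cannot unlock the page early.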

  Now... can we use a writeback flag to avoid waiting to read while writeout
  is happening?  We would need:
     set writeback in cluster_allocate
     wait_writeback after some lock_block
     clear_writeback when writeout finishes.
     Extra checks where we already check for IOLock


24aug2009
 Lots of progress but....
   cluster_flush calls cluster_done calls refile calls iput calls
    drop_inode calls write_inode_now calls writepage calls cluster_flush
  and we get a locking loop.
   I think we need to run that cluster_done from a different thread.


 We seem to have a refcnt problem with segsum.

25aug2009
 Lots more progress but.....

  orphan_release is finding that the orphan block has no credits.
  We can allocate credits and simply not do the update if they
  are not available:  having an extra entry in the orphan file isn't
  a problem.  However we need some mechanism to clean up other than
  waiting for a remount..
  I think we leave that until we redo orphan handling.

 and: adir sometimes loses one block so it and the contents don't get
   deleted.

 and: it seems we sometimes try to clean the segment being written
   to.  We must avoid that.

 (long ago I wrote::
  FIXME When pin fails, we need to remove PinPending from everything!!!
 and never followed up ... I wonder?
 )

25Aug2009
 Orphan handling.
  Every orphan block goes on a per-fs list and gets removed only
  if the B_Orphan bit is clear.
  There are two times when we want to expedite orphan handling.
  1/ on rmdir we need to know if the directory is really empty.
     This requires that we expedite the orphan handling of all
     blocks.  As soon as we find a non-orphan, we can give up.
     Then we need to make sure the index tree has collapsed.  We
     can borrow that code from truncate.

  2/ When writing past Trunc_next.  We just pass the block to
     special orphan handling.

  This requires that orphan handling is re-entrant.
  For dir, that is protected by i_mutex, but rmdir needs to come
   in under the radar.
  For trunc, the iolock on the index blocks should be enough.
  I wonder if IOLock can be used on dir as well... allowing
  parallel orphan handling in the one dir even!!.

  We need to ensure exclusion of orphan handling, including:
      - only one orphan handler at a time
      - don't run orphan handler while still processing action
        that makes it an orphan.
  Maybe if we just use IOLock for that?  Does that work?  Maybe
  but it gets messy for directories (on first attempt anyway).
  For directories we can just use i_mutex.
  Maybe i_mutex for files as well?

27Aug2009
  Orphan handling is going well... but not perfect.
  I'm using IOLock to ensure exclusion for orphan handling.
  However:
    I'm not really implementing that on directories
    Inodes go bad because lafs_erase_dblock needs the lock too.
    The call from rmdir will always fail because we hold i_mutex.

  Bigger problem.  I'm IOLocking inodes across checkpoints to preserve
   Orphan status.  But that might stop the checkpoint proceeding.
   .. so use i_mutex, not IOLock - fine.

  Now... it seems I've confused myself.  Orphans don't get handled
  immediately.  In particular, inodes should not be handled until
  their final delete_inode.  So setting the B_Orphan flag and putting
  on the list are two separate events.  The flag must come first,
  but the list may come much later.  So some of that mucking around
  with i_mutex is pointless.
  So:
    make_orphan makes sure it is in orphan file, sets bit, and removes
      from list (if present).
    add_orphan puts it on the list for handling.
 
    For inodes: lafs_new_inode sets the bit and delete_inode puts on queue,
        as does any unlink/rmdir/rename that fails.

    For directories: put it on list in commit/abort.
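
    The flag-first, list-later split might look like this.  A sketch only:
    struct orphan_sketch and the sketch_* functions are hypothetical and
    ignore the orphan file I/O and all locking.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical orphan state for one block. */
struct orphan_sketch {
    bool in_file;   /* recorded in the orphan file */
    bool b_orphan;  /* models the B_Orphan bit */
    bool on_list;   /* queued on the per-fs list for handling */
};

/* make_orphan: ensure it is in the orphan file, set the bit, and
 * remove it from the list if it was there. */
static void sketch_make_orphan(struct orphan_sketch *o)
{
    o->in_file = true;
    o->b_orphan = true;
    o->on_list = false;
}

/* add_orphan: queue for handling.  The flag must come first, so a
 * block that is not yet an orphan is refused. */
static bool sketch_add_orphan(struct orphan_sketch *o)
{
    if (!o->b_orphan)
        return false;
    assert(o->in_file);  /* make_orphan always sets both together */
    o->on_list = true;
    return true;
}
```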


  And...
    I hit the BUG where find_leaf wants an address of 0.
      If an index block gets cleaned out it doesn't disappear
      immediately.. there is no leaf to find in that direction.
      We probably need to avoid non-Valid blocks or something...
  And...
    Orphans 0/299 to 0/329 and  0/280 are still on the list
     but are not orphans.
     Maybe I need to catch mutex_unlock to run the orphans??
  And...
    We underflow a segment through orphans at unmount.
      We are cleaning and truncating at the same time.
      The same block gets allocated to 0 and to 1225
      in quick succession.
      Problem is that we apply new address while in writeback
      so a new lafs_allocated_block

29Aug2009

  Review of inodes in orphan list:
    lafs_new_inode makes an orphan for a non-existent inode.
    If the inode cannot be created, orphan_release is called.
    If it can, a 'struct inode' is filled in with valid type
    and nlink==1 (!!) and attached.  The inode will only be
    detached when the refcnt hits 0, and the orphan list implies
    a refcount, so if we ever find something on the orphan list
    with a NULL my_inode, it must be very new and can be ignored.

    When we find an inode block with a my_inode there are a few options:
      if I_Trunc is set, we must progress truncation providing we can
            get the i_mutex
      else if I_Deleting we must delete the inode
      else if nlink is 0, we remove from the list
      else nlink > 0 and we must remove orphan status.
    This means that if nlink is elevated, we need to be holding the mutex...
    So don't elevate nlink any more...
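
    The option list above as a decision function.  This is a sketch with
    hypothetical names; it leaves out the "providing we can get the
    i_mutex" condition on truncation, and a NULL pointer stands for a
    NULL my_inode.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum orphan_action {
    ORPHAN_IGNORE,    /* no my_inode yet: very new, leave alone */
    ORPHAN_TRUNCATE,  /* I_Trunc: progress the truncation */
    ORPHAN_DELETE,    /* I_Deleting: delete the inode */
    ORPHAN_UNLIST,    /* nlink == 0: remove from the list */
    ORPHAN_RELEASE    /* nlink > 0: remove orphan status */
};

/* Hypothetical view of the inode fields the decision depends on. */
struct inode_sketch {
    bool     i_trunc;
    bool     i_deleting;
    unsigned nlink;
};

/* Checks are made in the same order as the list in the notes. */
static enum orphan_action orphan_decide(const struct inode_sketch *ino)
{
    if (ino == NULL)
        return ORPHAN_IGNORE;
    if (ino->i_trunc)
        return ORPHAN_TRUNCATE;
    if (ino->i_deleting)
        return ORPHAN_DELETE;
    if (ino->nlink == 0)
        return ORPHAN_UNLIST;
    return ORPHAN_RELEASE;
}
```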

    When nlink becomes non-zero the block needs to be put back on the
    orphan list (it must already be an orphan).  Also when we set
    I_Deleting or I_Trunc it must go on the list.
   .. OK, I think I have all of that.


30Aug2009.
   I have some weirdness that seems to be caused by the orphan stuff,
   probably due to it all being async now.
   - A deleted inode clears I_Trunc and then sets it again.  The only
     explanation seems to be that delete_inode is being called again,
     so I must be igrabbing it again, maybe from cleaning.
   - bits of directories aren't getting deleted.  Sometimes single
     blocks, though the referred files are deleted.  Sometimes
     the whole directory... More interestingly, those blocks then
     don't get cleaned, so something about them means that they
     don't get deleted and don't get cleaned either.

   Even weirder... I just had a case where file 331 had a different
   index block for every 4 data blocks...


   FIXME:
    - What stops pinned blocks from being flushed by bdflush in middle
      of operation and so losing allocation?  Must make sure to set
      them dirty very late.
    - orphan_release can fail, so must make sure we can always call
      it, even if my_inode is NULL.... but how?


    - make_orphan could fail due to lack of space, which is not OK.
      I made it loop, but I'm not 100% sure that is right... it isn't.
      I need to pass down the 'I'm freeing space' flag, and I need to
      not require Credit of Dirty is set, etc.


    - I seem to have a deadlock at unmount.
       umount is waiting for lafs_checkpoint_lock_wait in
          lafs_put_super
       pdflush is in down_read in sync_supers
       lafs_cleaner is iget_locked/ifind_fast/inode_wait
                This is waiting for I_LOCK to be clear.
      

31Aug2009
  - When a file shrinks and becomes level-0, make sure
    old addresses get deallocated.  I seem to have
    a directory where they didn't.

  - Due to the fact that we over-preallocate, we really shouldn't
    return ENOSPC until we have flushed dirty data and performed
    a checkpoint??


  - When I removed the last index from an inode
    (Indirect type) it seems that I didn't write
    out the corrected block..??

1sep2009
 I ran my simple test run repeatedly overnight.
 It ran 208 times before I stopped it.
 There are 3 possible failure modes:
   1/ didn't complete within 500 seconds
   2/ triggered a BUG
   3/ appeared to complete, but the number of blocks
      in use was not the correct '7'.

 74 (35%) did not fail!
 31 () did not complete
 40 () triggered a BUG
 2 did not complete but did not trigger a bug

 94 of those that failed did not have a BUG
 92 actually completed.  Of these:
      1 final blocks 1
      1 final blocks 110
      1 final blocks 23
      2 final blocks 12
      5 final blocks 0
      6 final blocks 10
     11 final blocks 8
     21 final blocks 11
     44 final blocks 9

 of the BUGs,
       1 BUG: sleeping function called from invalid context at kernel/nsproxy.c:217
      1 BUG: spinlock lockup on CPU#0, rm/1330, cfb2dae4
      1 kernel BUG at /home/neilb/work/nfsbrick/fs/module/block.c:485!
      1 kernel BUG at /home/neilb/work/nfsbrick/fs/module/cluster.c:1219!
      1 kernel BUG at /home/neilb/work/nfsbrick/fs/module/inode.c:821!
      2 BUG: soft lockup - CPU#0 stuck for 61s! [lafs_cleaner:1177]
      3 kernel BUG at /home/neilb/work/nfsbrick/fs/module/segments.c:1028!
      3 kernel BUG at /home/neilb/work/nfsbrick/fs/module/segments.c:351!
      5 kernel BUG at /home/neilb/work/nfsbrick/fs/module/lafs.h:276!
      6 kernel BUG at /home/neilb/work/nfsbrick/fs/module/block.c:529!
      7 BUG: unable to handle kernel paging request at 6b6b6bfb
     11 kernel BUG at /home/neilb/work/nfsbrick/fs/module/super.c:655!


 super.c:655 is "block is still pinned" at unmount time.
  The block was always an InoIdx with a child.
  Either inode 0 or 16.
  child is held by various things:
      [cfb555cc]16/1(2098)r131E:Valid,Async,SegRef,CN,CNI,UninCredit,PhysValid async(1) clean2(130)
      [cfb554f0]16/0(1050)r25E:Valid,SegRef,CN,CNI,PhysValid clean2(25)
      [cfa57c58]0/2(3676)r0E:Valid,Dirty,UninCredit,PhysValid
      [cfa5bc58]0/2(3110)r0E:Valid,Dirty,UninCredit,PhysValid
      [ce5b94f0]16/0(519)r1E:Valid,Async,SegRef,CN,CNI,PhysValid async(1)
      [cfb4d4f0]16/0(4249)r1E:Valid,Async,SegRef,CN,CNI,PhysValid async(1)
      [ce5ad4f0]16/0(612)r1E:Valid,Async,SegRef,CN,CNI,PhysValid async(1)
      [ce5c2fc8]0/74(0)r129E:SegRef,C,Claimed,PhysValid clean2(129)
      [cfa57c58]0/2(1895)r0E:Valid,Dirty,UninCredit,PhysValid
      [cfb4d5cc]16/1(4543)r105E:Valid,SegRef,CN,CNI,UninCredit,PhysValid clean2(105)
      [ce5754f0]16/0(1290)r178E:Valid,SegRef,CN,CNI,PhysValid clean2(178)

 The "unable to handle kernel paging request" is always in
 umount.
     invalidate_inode_buffers(26/46)/lock_acquire


 block.c:529
    This is iblock valid when erasing a block
    The block we are erasing is always 0/327 or 0/328.  It is
    an orphan we are handling, iolocked but not always pinned

 lafs.h:276
    Map an iblock which is not IOLocked
       always in lafs_clear_index for the InoIdx block for a directory
       which is in Writeback.
       Call is in lafs_allocated_block from cluster_flush.

 segments.c:351
    seg_inc reduces seg usage below 0
      - lots of blocks (inode 327) that were cleaned, were then erased twice.
      - 2 blocks (inode 328) were erased twice, both from prune
      - ditto

 segments.c: 1028
     The free list is empty.... odd as only first segment is currently
     in use.

 soft lockup:
     Still orphan: 0/328  Index(1) is in Writeback and Dirty
       again inode_handle_orphan2 is in Writeback

 inode.c:821
     inode_handle_orphan at end, child list is not empty.
       The children seem to be in Realloc - cleaner needs to let go.

 cluster.c:1219
     my_inode is null while cluster_flush an inode and want to set
        WritePhase.


 block.c:485
     no ICredit for unincredit in dirty_dblock from dir_delete_commit
     from lafs_unlock.


 spinlock lockup is subsequent to real bug
 ditto for sleeping function.

 Of the '44' which claimed final blocks of 9, 14 really had 7, and 4
 appear to have other strange values....

 A selected '9' has two extra blocks for the directory '74'.
 But that directory is long gone.
 These dir blocks are currently fully populated with numbers.
 This seems to be the pattern with all non-7 blocks.


 02Sep2009
  Found a problem, possibly related to the dir blocks not being
  cleaned up.
  When lafs_incorporate sets ->depth to 1 it doesn't dirty the inode,
  so that fact is never copied in to the datablock.
  On further exploration, the I_Dirty bit is set but never used, which
  isn't good.
  So: exactly when do we copy inode into datablock, and what do we do
  when dirty_inode is called (if anything).
  We could just set I_Dirty when dirty_inode is called, checking that
  the block is Pinned which it usually will be.
  Then we copy inode to data just before writing data block.
  However that defeats transactional properties.  We need to copy in the
  same transaction, and that means either straight away, or when
  the data block's phase changes.
  So dirty_inode either copies to the block, or sets I_Dirty.
  When lafs_refile unpins an inode data block, it needs to check
  I_Dirty and possibly re-dirty it.

  To redirty it we must steal the NCredits.  Any further dirty attempt
  will have to allocate more.
  The stealing is done automatically by dirty_dblock, so we just flip
  the phase and call dirty_inode ... making sure it doesn't try to
  prealloc too hard.

  Need to review when inodes get dirtied.
    - commit_write only sets I_Dirty !

    We call lafs_dirty_inode:
      dir_create_commit - a child of inode is PinPending
      lafs_create - ditto
      lafs_link - before dir_create_commit
      lafs_unlink, lafs_rmdir - data block is pinned
      lafs_symlink - before create_commit
      lafs_mkdir - before create_commit, or block pinned
      lafs_mknod - before create_commit
      lafs_rename - (moved to) before create_commit/update_commit
                     or data block is pinned
      lafs_dir_handle_orphan - (assured that) child is pinned.
      choose_free_inum - child is pinned
      lafs_incorporate - block is pinned

    So either the data block is pinned, or the index block is pinned.
    In either case it is OK to set something to Dirty.

    (the new) lafs_dirty_vfs_inode gets called by mark_dirty_inode{,_sync}
    this is called from:
        inode_inc_link_count
	inode_dec_link_count
	..various quota ops...
	inode_setattr
	__set_page_dirty (Which we don't use)
	other buffer stuff
	other quota stuff we won't use
	touch_atime
	file_update_time
	page_symlink

    only the time updates are interesting.  Others we have locking
    for.
    file_update_time is called from generic_file_aio_write_nlock etc
    before ->prepare_write/->commit_write.  So they can pick up the
    change.
    Similarly before set_page_dirty is called.
    touch_atime is called from do_follow_link and readlink and
    file_accessed which is called all over the place.

    So what to do?
    If block is pinned, then dirty it to ensure writeout.
    If not, don't.  But copy data in any case.
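
    That policy in code form, as a sketch.  The structs are hypothetical
    and a single atime field stands in for all the inode fields that get
    copied into the data block.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical in-memory inode: only the field being updated. */
struct vfs_inode_sketch {
    long atime;
};

/* Hypothetical inode data block. */
struct dblock_sketch {
    bool pinned;
    bool dirty;
    long atime;  /* copy of the inode field held in the block */
};

/* Time update arrives via the VFS dirty hook: always copy the data,
 * but only mark the block dirty if it is already pinned. */
static void sketch_dirty_vfs_inode(const struct vfs_inode_sketch *ino,
                                   struct dblock_sketch *db)
{
    db->atime = ino->atime;  /* copy in any case */
    if (db->pinned)
        db->dirty = true;    /* pinned: ensure writeout */
}
```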


4sep2009

    OK, I've decided that I don't like clearing B_Valid when an index
    block contains no indexes.  The final straw was that I seemed
    to need to initialise the index block when I didn't hold IOLock.
    That was probably fixable, but I'm sure more problems were coming.

    So: what to do instead?
    One issue that must be resolved is that an index block can still
    have valid children even when it becomes empty.
    This can happen if we erase blocks from a file, then add them back
    after a checkpoint, and so in the next phase.
    The checkpoint writeout could need to show an empty index block,
    but the next phase will see real addresses.
    We cannot easily avoid this, so we must handle it.
    This interacts badly with the index lookup algorithm that finds
    the best index block currently in the parent, and then scans
    the children.  If there is no index block in the parent, we
    cannot find any children.
    This could be handled by responding to an empty index block by
    scanning all children.  But that isn't a full solution as if
    just one index block got erased, its unincorporated siblings
    would still be lost.
    We could treat empty index blocks like orphans.  i.e. don't
    discard them immediately but leave them with possibly real
    addresses.  Then when they have no children we allocate the
    0.
    But we still need to ensure that index blocks off which siblings
    have been split but not yet incorporated remain present in the
    tree to mark the place for their siblings.
    There is another problem.  A horizontal split could leave the
    new block with no addresses and everything in the uninc list.
    Nothing can be found in there.

    So maybe we need to revise the lookup mechanism.
    The goal is to find an index block that starts at or before
    the target and contains an address at or after the target.
    Then our search can stop.
    In rare cases.....
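
    The stated goal, as a sketch over a flat array of sibling index
    blocks.  struct ib_sketch and sketch_find() are hypothetical; 'last'
    is assumed to be the highest address the block currently holds, and
    'empty' models a block whose addresses are all still in the uninc
    list.  A return of -1 is the case the search cannot settle.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical summary of one index block. */
struct ib_sketch {
    uint32_t start;  /* file address the block starts at */
    uint32_t last;   /* highest address held (ignored if empty) */
    bool     empty;  /* no incorporated addresses at all */
};

/* Find a block that starts at or before 'target' and contains an
 * address at or after 'target'.  Returns its position, or -1. */
static int sketch_find(const struct ib_sketch *ib, int n, uint32_t target)
{
    assert(n >= 0);
    for (int i = 0; i < n; i++) {
        if (ib[i].empty)
            continue;  /* nothing can be found in there */
        if (ib[i].start <= target && ib[i].last >= target)
            return i;
    }
    return -1;
}
```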

7sep2009
    I thought about this more over the weekend and think I have an answer.
    We need to treat internal and leaf index blocks somewhat differently.

    An internal index block must never be empty (while unlocked).
    Any child block which has not had its address incorporated must be
    attached (simply in the sibling list) to a block which has been
    incorporated.  This will be the block that it was split off.
    The uninc block needs to hold a reference so that the primary isn't
    released.
    When a 'primary' becomes empty it cannot be discarded, so the
    addresses in the first dependent index block must be copied
    across.  This is awkward for indirect blocks so they might be
    allowed to be empty (they aren't internal so don't violate the
    above).
    When a horizontal split breaks a sequence of dependent blocks
    between two parents, the second parent must be incorporated
    immediately so that the first block in the second half of the
    sequence is incorporated.
    If an internal index block does become empty and it has no
    dependent blocks to fill from, it must be invalidated immediately.
    It cannot have any children - even in next phase - as at least one
    would have to be incorporated and so the block would not be empty.
    Invalidating involves allocating to address 0.
    If index lookup finds a block with PhysValid address of 0, it
    must look to the previous index block.  If there was none .... it
    gets a bit complex.

    Leaf index blocks can become empty, but we try to avoid it.
    If a leaf has blocks which have been created in the next phase,
    and others which have been deleted in this phase, it can be empty
    but still have children.  In this case we just treat it as a real
    index block that doesn't actually have any addresses.  We still
    write it out even though that is a waste of space.

    We have been working on the assumption that every address always
    has a corresponding leaf index block.  It is the leaf with the
    highest index at or below the target address.
    However this requires that every internal index block has a child
    with the same address as the parent.
    Preserving this requirement when the first child of an internal
    becomes empty requires either:
       - loading the 'next' child and reassigning this to the start
       - changing the address of the parent to match the first child.
    The former requires possibly reading a block from storage.
    The latter only involves modifying blocks that are due to be
    written out anyway, but makes block look up slightly interesting.
    When lookup finds an invalid block that is 'first', it needs to
    start again from the top.
    When incorporation creates an invalid block that is first, it
    needs to walk down from the top and any index block at the same
    address needs to be relocated/rehashed.  If the block is
    incorporated, the incorporated address needs to be updated.
    So:
     - flag for unincorporated index blocks which implies a reference
       on primary
     - after split, immediately incorporate second block
     - change lookup to retry when finding invalid block
     - When internal block becomes empty, either merge with
       first dependent or invalidate.  If first in parent,
       update address and parent and recurse.
       Need some 'clever' locking here.
       Before unlocking the invalidated block, we take i_alloc_sem,
       then walk up the ->parent tree locking blocks as
       required.
       The index lookup, when it finds an invalid block will take
       i_alloc_sem, then drop it, then start again.
       Or maybe some other lock than i_alloc_sem...
     - When leaf becomes empty, invalidate only if it has no children.
       When internal leaf becomes unpinned, check if empty.
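
    Sketch of "retry when finding an invalid block".  Everything here is
    hypothetical: a single-child chain stands in for the index tree, and
    wait_fn models "take i_alloc_sem, then drop it", i.e. waiting for the
    invalidating thread to finish before starting again from the top.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical index node: one child per level keeps the sketch short. */
struct node_sketch {
    struct node_sketch *child;  /* next level down, NULL at a leaf */
    bool     phys_valid;        /* models PhysValid */
    uint64_t phys;
};

/* An invalidated block has PhysValid set with an address of 0. */
static bool is_invalidated(const struct node_sketch *n)
{
    return n->phys_valid && n->phys == 0;
}

/* Walk down from the root; on meeting an invalidated block, wait and
 * restart from the top.  Gives up (NULL) after max_tries walks. */
static struct node_sketch *sketch_lookup(struct node_sketch *root,
                                         void (*wait_fn)(void *), void *arg,
                                         int max_tries)
{
    assert(root != NULL);
    for (int t = 0; t < max_tries; t++) {
        struct node_sketch *n = root;
        while (!is_invalidated(n) && n->child != NULL)
            n = n->child;
        if (!is_invalidated(n))
            return n;   /* reached a leaf */
        wait_fn(arg);   /* models take-then-drop of the semaphore */
    }
    return NULL;
}

/* Models the invalidator finishing: splice a dead child out of 'arg'. */
static void sketch_finish_invalidate(void *arg)
{
    struct node_sketch *parent = arg;
    if (parent->child != NULL && is_invalidated(parent->child))
        parent->child = parent->child->child;
}
```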

21sep2009
   That locking doesn't look like it will work, and we can never 'merge
   with first dependent' as it is not valid to have an index block
   where the first child is at a different address.
   And we cannot always change the parent address, particularly if it
   is zero - increasing it then cannot work.
   And there is no need to load a block if we are just going to change
   its start address (not internal index blocks anyway).
   Let's drop the idea of relocating the parent.
   If an internal index block becomes empty:
     If it is last in parent, no loss, just discard
       If parent would be empty, need to recurse up.
     If it is not last relocate the next sibling to this location,
      rehashing it and updating the parent.
   If a leaf index block becomes empty we cannot just delegate to
      next as it might be indirect... not a problem if address is
      stored.  But that requires a format change... now might be a
      good time!
   
   
   So:
     If we hold an index block locked and it becomes empty and we choose
     to invalidate it, we need to ensure that doing so does not
     break any indexing paths.
     So we take a separate lock (i_alloc_sem??) and flag the block as invalid
     by setting physaddr to 0 while PhysValid is set, and unlock the block. 
     Any lookup that finds such a block must take and release i_alloc_sem,
     and then restart from the top.
     - If the block was not incorporated, we just remove from sibling list
          and all is done - the space is implicitly included in
          previous block.
     - If the block has a different fileaddr than the parent then update
          the parent directly, either removing the entry, or changing it to
          point to the first unincorporated sibling (if there is one).
          This requires taking the lock on the parent of course.  That is 
	  why we dropped the lock on the child.
          Then all done.
     - If the block has the same address as the parent we need to find
          a 'next block' to relocate to the start of the parent.
          It is either the first unincorporated sibling, or the next
          block in the index block, or nothing, meaning the parent is
          about to become empty. 
        We lock the parent (still holding i_alloc_sem), and rehash the
	  chosen child.  If it doesn't exist, or is not dirty, we need
	  to update the phys address directly in the parent
          accordingly, erasing or replacing the first address.
          Then we need to rehash the index block, but we need to lock
          the parent for that.
          So set a 'busy' flag on the block, unlock it, lock parent,
          rehash, clear busy flag, and repeat.
      - We can never relocate a block with fileaddr of zero, as the
          InoIdx block cannot be relocated.  So leaf index block 0
	  must never be erased unless the file is empty.  So 

28sep2009
  New idea.
  We store the start address of an indirect block in the block.
  This means that the meaning of any index block is completely
  independent of the location of the block, so we can change the location
  easily and without touching the block.
  So if a block becomes empty, we simply move the next block back to
  fill the gap.
  i.e. when an index block becomes truly empty (i.e. no children)
   - if it wasn't incorporated, simply remove it
   - if it was,
       - if there is a dependent block, rehash it to take my address
       - if there is a next block that is dirty, rehash it
       - if there is a next block that is not dirty,
          update parent to merge my entry with next, and rehash next
          if it exists
       - if there is no next block but we are not first, just update
          parent
       - if no next block and we are first, parent becomes empty,
          recurse upwards.
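
  A minimal sketch of that decision list in C.  The enum, struct and
  function names are made up for illustration and are not taken from
  the module.  The order in which the cases are tested is an assumption
  read off the list above (dependent block first, then next block).

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical names: what to do when an index block has no children left. */
enum empty_action {
	EA_REMOVE,		/* never incorporated: just drop it */
	EA_REHASH_DEPENDENT,	/* dependent block takes over my address */
	EA_REHASH_NEXT,		/* dirty next block takes over my address */
	EA_MERGE_WITH_NEXT,	/* update parent to merge my entry with next, rehash next */
	EA_UPDATE_PARENT,	/* no next, not first: only the parent changes */
	EA_PARENT_EMPTY,	/* no next and first: parent is empty, recurse upwards */
};

struct empty_ctx {
	bool incorporated;	/* block's address is recorded in the parent */
	bool has_dependent;	/* an unincorporated sibling hangs off this block */
	bool has_next;		/* parent has an entry after this one */
	bool next_dirty;	/* only meaningful when has_next is set */
	bool is_first;		/* this is the first entry in the parent */
};

static enum empty_action on_index_empty(const struct empty_ctx *c)
{
	assert(c->has_next || !c->next_dirty);

	if (!c->incorporated)
		return EA_REMOVE;
	if (c->has_dependent)
		return EA_REHASH_DEPENDENT;
	if (c->has_next)
		return c->next_dirty ? EA_REHASH_NEXT : EA_MERGE_WITH_NEXT;
	if (!c->is_first)
		return EA_UPDATE_PARENT;
	return EA_PARENT_EMPTY;
}
```

  Keeping the choice in one pure function means the locking and
  rehashing code only has to act on the returned value.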

12Oct2009
 - too long, I've forgotten what I was up to..
   + I've changed the format of indirect blocks to store an address.
   + I've handled incorporation of an empty block
   So now internal index blocks can never be empty - they get immediately
   unlinked if they are.
   Leaf index blocks can be empty while they have children.  We don't
   flag them as empty, but rather wait until another child gets incorporated.
   But I don't think I really like that.  It is an external ugliness based
   entirely on internal implementation details.  Empty index blocks should
   not get written out.  We need some way to reliably find an empty index
   block.  The address won't appear in the parent so a lookup will find the
   previous block which we cannot link to now as it may not exist yet.
   Worse - if first index block goes empty, we can only unlink it by moving
   the parent to start at the next block.  That would make this index block
   totally unfindable.
   So I think we have to stick with writing out empty index blocks very
   rarely.  So we need to be sure they disappear properly.
   The difficult case is if an index block becomes empty while it has some
   children which don't end up getting dirtied. e.g. an update aborts.
   We need to leave the block with enough credits to be written out.
   I guess the Ncredit should be enough...
   Maybe worry about that later.

 - what about InoIdx blocks when they become empty?  It would be helpful
   to flag them so that inode deletion can check....
   Maybe just set depth to 0..

 ARRGGG... I've completely lost it.  I need another ITO week.
  I just got a bug in summary.c:71!!

7 Jun 2010
 - summary.c:71.
   ablocks_used has hit zero too soon.
   This should be the count of blocks for which space has been allocated
   (B_Prealloc is set) but have not been given a phys address yet - at which
   point the usage count is moved to cblocks_used or pblocks_used.
   The last block (which may not be the cause of the problem) does not have
   B_Prealloc set, yet physaddr == 0.
   The block is 0/1, so the inode for the inode usage map.  This should have
   physaddr 8 !!
   We did find 8, then changed to 73, but then changed to 0!
  Ahhh... recent fix exposed a subtle bug ... fixed.
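
   A minimal sketch of the accounting described above, in C.  The
   counter names come from the notes; the struct, the function names
   and the 'to_pblocks' flag are hypothetical, and the notes do not say
   how the choice between cblocks_used and pblocks_used is made.

```c
#include <assert.h>

/* Hypothetical container for the three counters named in the notes. */
struct usage {
	long ablocks_used;	/* space allocated, no phys address yet */
	long cblocks_used;
	long pblocks_used;
};

/* Space has been reserved for a block (B_Prealloc would be set). */
static void usage_prealloc(struct usage *u)
{
	u->ablocks_used++;
}

/*
 * The block is given a phys address: move its count out of
 * ablocks_used.  Returns -1 if ablocks_used would go negative, which
 * is the "hit zero too soon" situation.
 */
static int usage_assign_phys(struct usage *u, int to_pblocks)
{
	assert(u->ablocks_used >= 0);
	if (u->ablocks_used == 0)
		return -1;
	u->ablocks_used--;
	if (to_pblocks)
		u->pblocks_used++;
	else
		u->cblocks_used++;
	return 0;
}
```

   A block that reaches usage_assign_phys() without a matching
   usage_prealloc() is exactly the case where the count runs out early.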

 Now cluster.c:619: [ce9233f8]0/282(0)r2F:Pinned,Phase1,PinPending,CN,CNI,Claimed,PhysValid iblock(1) leaf(1)
     cluster.c:619: [ce570a18]0/286(0)r2F:Pinned,Phase1,PinPending,SegRef,CN,CNI,Claimed,PhysValid iblock(1) leaf(1)
     cluster.c:619: [ce588d6c]0/17(0)r2F:Pinned,Phase0,PinPending,CN,CNI,Claimed,PhysValid iblock(1) leaf(1)
     cluster.c:619: [ce51dfe4]0/283(0)r2F:Pinned,Phase0,PinPending,CN,CNI,Claimed,PhysValid iblock(1) leaf(1)
     cluster.c:619: [cfbb8430]0/328(0)r2F:Pinned,Phase1,PinPending,CN,CNI,Claimed,PhysValid iblock(1) leaf(1)
   We are allocating an InoIdx block, but data block is not valid??

 That isn't very reproducible so I'll have to leave it for now...
    erasedblock had been called on the data block .. inode 17??

  Problem is that I keep changing the rules.
   I don't erase the InoIdx block any more.
   I used to, then changed it to iolock_block/cluster_allocate->0

 Problem: When all files are removed, usage is still quite high, two
   segments have over 400 blocks (out of 512).  Cleaning keeps running and
   not making much progress.
  segment 6 has usage of 484.
  'cluster 3072' shows: cluster 3072, 3085, 3086 3092
    Inode 0:  blocks 267 272 276 
    Inode 277: blocks 0/4 6/2
    Inode 0: blocks 0/2 8 16
    Inode 0: block 16 70/2 131/3 135/4 140/9 150/2 ... 296/7
    Inode 16: 1/1
    Inode 17: 0/28
    Inode 283: 12/18
          etc.

  All 'old', so must be the product of cleaning, as you would expect.
  All (most) of this has been deleted though, but count didn't drop.
   'Count' adds up to 508, plus the 4 cluster heads makes 512 - good.
  lafs_seg_move definitely isn't being called on these blocks.
  it is only called from lafs_summary_update
  cblocks_used "exactly" matches the number of un-removed blocks.


  Another problem
bad [ce5bcf50]301/0(0)r1E:Index(0),Pinned,Phase1,WPhase0,InoIdx,Valid,SegRef,C,CN,CNI,UninCredit,IOLock,PhysValid{0,0}[0] leaf(1)
/home/neilb/work/nfsbrick/fs/module/modify.c:1652: [ce5bcf50]301/0(0)r1E:Index(0),Pinned,Phase1,WPhase0,InoIdx,Valid,SegRef,C,CN,CNI,UninCredit,IOLock,PhysValid{0,0}[0] leaf(1)
bad [cfbf6000]327/0(0)r1E:Index(0),Pinned,Phase1,WPhase0,InoIdx,Valid,SegRef,C,CN,CNI,UninCredit,IOLock,PhysValid{0,0}[0] leaf(1)
/home/neilb/work/nfsbrick/fs/module/modify.c:1656: [cfbf6000]327/0(0)r1E:Index(0),Pinned,Phase1,WPhase0,InoIdx,Valid,SegRef,C,CN,CNI,UninCredit,IOLock,PhysValid{0,0}[0] leaf(1)
bad [cfb62d20]291/0(0)r1E:Index(0),Pinned,Phase1,WPhase1,InoIdx,Valid,SegRef,C,CN,CNI,UninCredit,IOLock,PhysValid{0,0}[0] leaf(1)
/home/neilb/work/nfsbrick/fs/module/modify.c:1656: [cfb62d20]291/0(0)r1E:Index(0),Pinned,Phase1,WPhase1,InoIdx,Valid,SegRef,C,CN,CNI,UninCredit,IOLock,PhysValid{0,0}[0] leaf(1)

 and
free_blocks=1842 allocated=449 max_seg=512 clean_reserved=0
Want dump of usage

------------[ cut here ]------------
kernel BUG at /home/neilb/work/nfsbrick/fs/module/segments.c:1028!
 free list is empty - that should not be.

and another...
/home/neilb/work/nfsbrick/fs/module/modify.c:1219: [ce9893b0]74/0(0)r1E:Index(1),Pinned,Phase0,WPhase1,InoIdx,Valid,Dirty,CN,CNI,UninCredit,IOLock,PhysValid{0,0}[0] leaf(1)
/home/neilb/work/nfsbrick/fs/module/modify.c:1219: [ce5ba690]74/0(0)r1E:Index(1),Pinned,Phase1,WPhase0,InoIdx,Valid,Dirty,CN,CNI,UninCredit,IOLock,PhysValid{0,0}[0] leaf(1)
 [<d0a57bc8>] ? lafs_get_flushable+0x131/0x191 [lafs]
 [<d0a5856d>] ? lafs_do_checkpoint+0x1b3/0x3a2 [lafs]
 [<d0a5fe7e>] ? cleaner+0x105/0x1426 [lafs]
 [<c02256bf>] ? autoremove_wake_function+0x0/0x33
 [<d0a5fd79>] ? cleaner+0x0/0x1426 [lafs]


08Jun2010
 Weirdness with truncating.
 The cleaner relocates a file resulting in the InoIdx block being
 Maybe-dirty and phys_addr == 0.
 Then truncate doesn't prune but just incorporates, finding
  something weird there..
  file 278, blocks around 4100
  seem to find 1949 instead??

 Note: When a non-InoIdx block is erased we set PhysValid
  and physaddr == 0 to record the fact because it will not be stored...
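
  A minimal sketch of that convention in C.  'struct blk' and the
  helper names are hypothetical stand-ins; only the PhysValid /
  physaddr == 0 encoding comes from the note above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical cut-down block: just the two fields the note talks about. */
struct blk {
	bool phys_valid;	/* stands in for the PhysValid flag */
	uint64_t physaddr;
};

/* PhysValid with address 0 means "known to be erased". */
static bool blk_is_erased(const struct blk *b)
{
	return b->phys_valid && b->physaddr == 0;
}

/* Without PhysValid the address simply has not been looked up yet. */
static bool blk_addr_unknown(const struct blk *b)
{
	return !b->phys_valid;
}

static void blk_erase(struct blk *b)
{
	b->physaddr = 0;
	b->phys_valid = true;
	assert(blk_is_erased(b));
}
```

  The point of the two predicates is that physaddr == 0 on its own is
  ambiguous; it only means "erased" when PhysValid is also set.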

modify.c:1654: [ce5b4460]327/336(16)r4F:Index(1),Pinned,Phase0,WPhase1,Valid,Async,SegRef,C,CI,CN,CNI,IOLock,PhysValid{0,0}[0] leaf(1) inode_handle_orphan2(1) async(1) inode_handle_orphan3(1)
Async ??
modify.c:1657: [cfb90690]327/340(787)r4F:Index(1),Pinned,Phase1,WPhase0,Valid,Async,SegRef,C,CI,CN,CNI,IOLock,PhysValid{0,0}[0] leaf(1) inode_handle_orphan2(1) async(1) inode_handle_orphan3(1)
Still Async ... wonder what it means.

- directory block got corrupted.  Maybe conversion to indexed??


Getting bug in remove_from_index because the addr isn't
there, possibly block is empty.  But incorporation is
??? instant?  No it isn't.
If an index block hasn't been incorporated it has B_PrimaryRef
set as it holds a ref to some earlier index block.
But what if nothing is incorporated?


Allocated [ce402230]328/340(0)r5F:Index(1),Pinned,Phase0,WPhase0,Valid,Dirty,Async,SegRef,CN,CNI,UninCredit,IOLock,PhysValid,Uninc{0,0}[0] inode_handle_orphan2(1) leaf(1) uninc(1) async(1) inode_handle_orphan3(1) -> 0
looping on [ce402230]328/340(0)r5F:Index(1),Pinned,Phase0,Valid,Dirty,Writeback,Async,SegRef,CN,CNI,UninCredit,PhysValid,Uninc{0,0}[0] inode_handle_orphan2(1) cluster(1) uninc(1) async(1) inode_handle_orphan3(1)

Then spin in a soft-lockup in lafs_inode_handle_orphan


-----------
 - grow_index_tree needs to do initial incorporation so things can be found.
    just like end of do_incorporate_internal.
   NO - cannot incorp yet as do not have phys addr.  Don't need to as
   lafs_leaf_find explicitly handles this.
   For truncate case we don't use the stored address, but ensure all
   leaf indexes must be dirty (or gone) so whole tree must be
   accessible for walking around.
 - do_incorporate_internal needs to set B_PrimaryRef and take the ref
 - when we remove a B_PrimaryRef without incorporating it, we need to
   drop a ref if the *next* in the list is B_PrimaryRef
 - need to use a constant to identify 'async' calls etc.
 - maybe I need other iolock_block in truncate ?? to ensure it is Valid so
   it isn't found as async....

09Jun2010
 STILL struggling with incorporation.
 We have a premise that any file address is covered by precisely
 one leaf index block.  Every leaf index has an implicit address
 and it covers all addresses from there to the next leaf.  The last
 leaf covers to EOF.
 So there must always be a leaf at address 0.
 This applies within the tree from an internal index block too.
 Beneath an internal index block there must be a leaf covering every
 address up to the next internal index block.  So there must be
 a first.  So storing the first address is pointless.  And harmful.
 When an index block becomes empty and disappears its coverage is
 included in the previous block unless there is none, in which case
 the next index block must be re-addressed.  If there is no 'next',
 this index block must be empty and so must disappear.
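
 A minimal sketch of that premise in C, modelling the leaves under one
 parent as a sorted array of start addresses.  The array model and the
 function names are hypothetical and are not how the module stores
 index blocks.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * starts[] holds the implicit start address of each leaf, ascending.
 * Leaf i covers starts[i] .. starts[i+1]-1; the last leaf covers to EOF.
 * Returns the index of the one leaf that covers addr.
 */
static size_t covering_leaf(const uint32_t *starts, size_t n, uint32_t addr)
{
	size_t lo = 0, hi = n;

	assert(n > 0 && starts[0] <= addr);
	while (hi - lo > 1) {
		size_t mid = lo + (hi - lo) / 2;

		if (starts[mid] <= addr)
			lo = mid;
		else
			hi = mid;
	}
	return lo;
}

/*
 * Leaf i becomes empty and disappears.  Its range falls to the previous
 * leaf, unless it was the first, in which case the next leaf is
 * re-addressed to start where the removed one did.  Returns the new count.
 */
static size_t remove_leaf(uint32_t *starts, size_t n, size_t i)
{
	size_t j;

	assert(i < n);
	if (i == 0 && n > 1)
		starts[1] = starts[0];
	for (j = i; j + 1 < n; j++)
		starts[j] = starts[j + 1];
	return n - 1;
}
```

 With starts = {0, 100, 250}, address 150 is covered by leaf 1.
 Removing leaf 0 leaves {0, 250}, so 150 is then covered by leaf 0.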

 BUT if we re-address an index block, we implicitly re-address the
 first child - recursively - so we need to move/rehash them all
 or lose them... or record where they are.  Or do lookup not by
 addr....
 I think just rehashing them all - with an iolock - is simple
 and safe.  So just do that.


 So:  I cleaned up index handling at truncation somewhat.
  Now running looptest to see what patterns emerge:

  block.c:197 (*9+1) During umount, the Root datablock is
        Dirty+Realloc
        Maybe just need for cleaner to become inactive
        during umount - hope that doesn't deadlock
        didn't even work...
  block.c:529 (*4+1)  erase dblock while iblock depth > 0
        When pruning InoIdx we want to set depth to 0.
        FIXME is this really what I want, or is depth=0
        only for data-inode ... FIXME
  cluster.c:533 (*2) cluster_allocate on invalid block
          Block is 8/0 in writepage from sync_inodes
          This is the orphan file.
                   blocks aren't dirty
          I guess the file gets truncated while we wait for it.
          Just need to re-test.
  index.c:1936 (*2).   An index block is Root - FIXED??
  modify.c:1056 - secondary bug, ignore for now.
  modify.c:1650 update_index fails to find target.
              second call, phys==0
              Code was bad ... may not be the cause though.
  modify.c:1696 (*4) lafs_incorporate gets non-dirty Index(1) block
                   from orphan handler.
                Maybe just change the do/while back to 'do'.
  modify.c:1704: (*2) lafs_inc gets leaf with uninc list???
               Index(0)/InoIdx
               in do_checkpoint
               uninc list gets set in lafs_add_block_address (parent of iblk),
		do_incorporate_internal,
               Maybe the InoIdx still had children.
  segments.c:1028.  (*4) The free list becomes empty.
  super.c:655 (*3)   Busy inodes after umount, and root InoIdx block
         is still pinned as inode 16 data block was still dirty.
         segusage slow.  Maybe same as block.c:197 ??
  invalid address 6b6b6bfb: invalidate_inode_buffers in shutdown
          finds invalid lock.
          presumably the inode was freed before being invalidated.
  spin on writeback during truncate (r3a) 8 times. now 10
        Probably because writeback cannot proceed while
        orphan processing keeps looping.
  kmalloc-1024 problems - (*2)
          A block - should be start of page - isn't what it appears...

 Others complete with 'cb' ranging from 202 to 715


10 June 2010

 Looking at segments.c:1028
  We run a seg_scan every checkpoint, so that should keep free segments
  in the list.....
  Ahh.. do_checkpoint is looping because root isn't changing phase.

  Lowest block pinned to old phase is 
  [cfb7df08]0/74(4253)r0E:Pinned,Phase1,WPhase0,Valid,Dirty,CN,CNI,UninCredit,IOLock,Claimed,PhysValid
  which is not on leaf list because it has IOLock
  With more debugging:
  [ce5c5f08]0/74(4250)r0E:Pinned,Phase1,WPhase0,Valid,Dirty,Realloc,SegRef,CN,CNI,UninCredit,IOLock,Claimed,PhysValid</io.c:368>
  or better (that was in lafs_iolock_written)
  [ce5c05e8]0/74(4257)r0E:Pinned,Phase0,WPhase0,Valid,Realloc,SegRef,C,CN,CNI,UninCredit,IOLock,Claimed,PhysValid</file.c:247>
  FIXED - I didn't unlock if it wasn't dirty any more.
  Well almost - it occurs much less now.
  Out of 48 runs:
      8 BUG: soft lockup - CPU#0 stuck for 61s! [lafs_cleaner:1180]
      1 BUG: unable to handle kernel NULL pointer dereference at 000001b4
      2 BUG: unable to handle kernel paging request at 6b6b6bfb
      3 kernel BUG at /home/neilb/work/nfsbrick/fs/module/block.c:197!
      2 kernel BUG at /home/neilb/work/nfsbrick/fs/module/block.c:529!
      1 kernel BUG at /home/neilb/work/nfsbrick/fs/module/modify.c:1650!
      2 kernel BUG at /home/neilb/work/nfsbrick/fs/module/modify.c:1696!
      4 kernel BUG at /home/neilb/work/nfsbrick/fs/module/segments.c:1028!
      1 kernel BUG at /home/neilb/work/nfsbrick/fs/module/segments.c:332!

  So we now have 1/12 rather than 2/3.
  a/ pinned by IOLock from file.c:220 - FIXED
  b/ as above
  c/  Root is pinned by 4 children
      328/0  with 196 of data blocks in writeback/realloc, in a cluster
      0/1, 74/0, 0/8   all in a cluster waiting writeout.
     Don't understand this.
  d/ as a,b

  Of the 48, 11 ran to completion leaving blocks from 286 to 899
  

  Looking at the loss of blocks when truncating.
   tracing shows a small number of files with remaining blocks at delete.
     sum is 26+22+14+272+11+2 == 347 cf df shows cb=457
   next attempt: 14+24+26*11 =324 cf cb=1124
   next attempt 26+6+15+68+29 == 144 cf cb=383
   26+18+14+19+284 = 361 cf 379
    files are (in order)
   49    bfile       - 30K
   325   nbfile-49   - 30K
   320   nbfile-44   - 30K
   296   nbfile-20   - 30K
      ??331??

11 June 2010

 Thinking about truncate and index blocks becoming empty while
 they still have children.
 For leaf indexes, we need to leave the block in place in case
 the children get written.  We need to find a time to ultimately
 delete it...
 For internal indexes,.... uhm, it just works, OK??

 When I drop an uninc block, I need to remove it from the
  uninc list, and from phase_leafs
  clearing dirty and refiling should remove from leafs.

 When we recurse to a parent, we need to remove
 *this* block from the uninc list for said parent.
 It should be the only thing in the list.
 But even when we don't recurse, the fact that we have
 incorporated means that we should tidy up the ->uninc
 list.


12 June 2010
  unmount hung after lafs_run_orphans from lafs_put_super
  There are two orphans in Writeback which cannot progress
  until the current cluster is written...
  But they keep getting re-written!
  Other time, one orphan, index block is Dirty on a leaf ???
  
orph=[cfbdcf24]0/331(3780)r2E:Valid,SegRef,C,CI,CN,CNI,Claimed,PhysValid,Orphan(0) orphan_list(1) iblock(1)
[cfb8e460]331/0(NoPhysAddr)r1F:Index(1),Pinned,Phase0,InoIdx,Valid,Dirty,SegRef,CN,CNI,UninCredit{0,0}[0] leaf(1) Leaf0(1) 
LAFS_cluster_flush 1


orph=[ce5c9bb4]0/327(3317)r2E:Valid,SegRef,C,CI,CN,CNI,Claimed,PhysValid,Orphan(0) iblock(1) orphan_list(1)
[cfbe3a40]327/0(NoPhysAddr)r1F:Index(1),Pinned,Phase0,InoIdx,Valid,Dirty,SegRef,CN,CNI,UninCredit{0,0}[0] leaf(1) Leaf0(0) 

 OK, problem is that when we truncate and remove an index block, the
 next index block expands backwards to fill the space.
 Then we apply prune_some, but don't check if anything was done.
 We always mark it dirty, so it has to be written and then
 we loop through again...
 So need to check if prune_some did anything.
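
 A minimal sketch of that fix in C.  'struct idx', 'prune_some_sketch'
 and 'truncate_step' are hypothetical stand-ins, not the module's
 prune_some; the only point shown is that Dirty is set when, and only
 when, pruning actually removed something.

```c
#include <assert.h>
#include <stdbool.h>

#define IDX_MAX 8

/* Hypothetical index block: sorted file addresses plus a dirty flag. */
struct idx {
	int entries[IDX_MAX];
	int nentries;
	bool dirty;
};

/* Drop every entry at or beyond trunc_addr; report how many went. */
static int prune_some_sketch(struct idx *ib, int trunc_addr)
{
	int removed = 0;

	assert(ib->nentries >= 0 && ib->nentries <= IDX_MAX);
	while (ib->nentries > 0 &&
	       ib->entries[ib->nentries - 1] >= trunc_addr) {
		ib->nentries--;
		removed++;
	}
	return removed;
}

/*
 * One pass of truncate over an index block.  Returns true if there was
 * progress; only then is the block dirtied, so an untouched block is
 * not written out again and the orphan handler stops looping on it.
 */
static bool truncate_step(struct idx *ib, int trunc_addr)
{
	int removed = prune_some_sketch(ib, trunc_addr);

	if (removed > 0)
		ib->dirty = true;
	return removed > 0;
}
```

 A caller would stop rescheduling the block once truncate_step()
 returns false.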

TODO:
 - prune_some needs to get more done at a time
 - let cleaner finish up before umount
 - use early segments first ??
 - look at write-clusters and check OK
 - check that df:cb= drops properly.

Bugs:
      1 BUG: spinlock lockup on CPU#0, sh/1168, c0441170  - SECONDARY BUG
      1 BUG: unable to handle kernel NULL pointer dereference at 000001b4
      3 BUG: unable to handle kernel paging request at 00100104
      5 BUG: unable to handle kernel paging request at 6b6b6bfb 
      1 BUG: unable to handle kernel paging request at 7fffffff
      7 kernel BUG at /home/neilb/work/nfsbrick/fs/module/block.c:197!
      9 kernel BUG at /home/neilb/work/nfsbrick/fs/module/block.c:479!
      2 kernel BUG at /home/neilb/work/nfsbrick/fs/module/block.c:529!
      2 kernel BUG at /home/neilb/work/nfsbrick/fs/module/block.c:67!
      2 kernel BUG at /home/neilb/work/nfsbrick/fs/module/inode.c:828! 
      4 kernel BUG at /home/neilb/work/nfsbrick/fs/module/inode.c:843!
      1 kernel BUG at /home/neilb/work/nfsbrick/fs/module/modify.c:1708! 
      7 kernel BUG at /home/neilb/work/nfsbrick/fs/module/segments.c:1028!
      2 kernel BUG at /home/neilb/work/nfsbrick/fs/module/segments.c:332!
     30 kernel BUG at /home/neilb/work/nfsbrick/fs/module/super.c:655!

Quite a haul there!

super.c:655
    Pinned block in lafs_release:
         0/2 is Dirty with plenty of credits, so it is a child
         0/16 is Dirty/Realloc, or once Async
     Dirty, but not on a leaf list, not pinned

segments.c:332
    seg_deref with refcnt , 2 in lafs_seg_put_all

segments.c:1028
     No free segments - no real pattern.

modify.c:1708
     lafs_incorporate on non-dirty/realloc block
       328/0 Index(1).  1 in uninc_table - probably during truncate.
     Either we add uninc while not dirty
     Or we clear Dirty while uninc present
     or there is a race between the two.

     Don't know:  add a bugon
     Bugon in get_flushable didn't fire.

inode.c:843
     children present in truncate after final incorp...
       328/0.  64 children, no uninc list.  Maybe we ran the orphans too early??
      or invalidate_page isn't removing the children.
      Might want print_tree here?- added that.
     Answer: all the children are in Realloc on Clean_leafs
       Maybe erase_page needs to disconnect from cleaner too??

inode.c:828
     Orphan handling - uninc but not dirty: is Realloc (sometimes)
     Maybe like  mod:1708

block.c:67 *
      delref 'primary' from modify.c:2063 in the q2 branch.
      nxt has PrimaryRef... Maybe  move earlier, but that shouldn't make a diff.
      ditto at modify.c:2035  nxt is primary as was I, so drop mine.
      Don't know - looks like sibling list got broken.
      Tidied up a bit and added a print-tree.
      v.interesting result.  Lots of consecutive index blocks all holding primary-ref
            on single primary - which is wrong.
      1/ When setting PrimaryRef, if next holds PrimaryRef, then must take reference
            on self, as are being inserted into chain
      2/ When splitting, new block must be addressed as first block which cannot
           fit, not first block which doesn't fit.  Else incorping in reverse order
           can make lots of tiny index blocks.

block.c:529 *
        erase with index depth > 1.
        0/328 in orphan handling.  Still have 8 or 15 blocks registered!
       Maybe caused by index block errors.  Added some printks.

block.c:479 *
        not enough credits to dirty block 2/0 in dir_delete_commit for unlink.
        74/xxxx in unlink
        16/1 in seg_inc/seg_move...allocated_block/cluster_flush

        - writepage wrote the page??
        - checkpoint wrote it and didn't replenish the credits?

block.c:197 XX
        invalidated pages finds dirty block after EOF, after iolock_written
         0/0 Dirty/Realloc in unmount - all Realloc!
       Need to wait for cleaner etc to finish at unmount time.

NULL deref in 1b4  YY
    cleaner->cluster_flush->count_credits->lock??
    Trying to get a lock on an inode that has since been freed??
	spin_lock(&dblk(b)->my_inode->i_data.private_lock);


001001 YY
     generic_drop_inode -- extra iput??  in lafs_inode_checkpin from refile
6b6b6b YY
      invalidate_inode_buffers!! in kill.  use-after-free

7fffff
    seginsert from scan_seg
     MAX/number-elements confusion.  Worked around for now.


18  June 2010
After a couple of fixes:
      1 BUG: unable to handle kernel NULL pointer dereference at 000001b4
      1 BUG: unable to handle kernel paging request at 00100104
      5 BUG: unable to handle kernel paging request at 6b6b6bfb
      4 kernel BUG at /home/neilb/work/nfsbrick/fs/module/block.c:209!
      4 kernel BUG at /home/neilb/work/nfsbrick/fs/module/block.c:496!
      3 kernel BUG at /home/neilb/work/nfsbrick/fs/module/block.c:67!
      2 kernel BUG at /home/neilb/work/nfsbrick/fs/module/cluster.c:531!
     16 kernel BUG at /home/neilb/work/nfsbrick/fs/module/inode.c:601!
      1 kernel BUG at /home/neilb/work/nfsbrick/fs/module/inode.c:852!
                Realloc blocks confusing truncate
      2 kernel BUG at /home/neilb/work/nfsbrick/fs/module/modify.c:118!
      1 kernel BUG at /home/neilb/work/nfsbrick/fs/module/modify.c:1699!
      7 kernel BUG at /home/neilb/work/nfsbrick/fs/module/segments.c:1033!
     19 kernel BUG at /home/neilb/work/nfsbrick/fs/module/super.c:655!


TODO:
 - truncate gets confused by blocks being cleaned.
   Need to flush cleaner, or just remove the blocks.
 - when add PrimaryRef in middle of list, take the right ref.
 - fix up wait-for-cleaner at unmount time.

19 Jun 2010

      3 BUG: unable to handle kernel paging request at 6b6b6bfb.
      5 kernel BUG at /home/neilb/work/nfsbrick/fs/module/block.c:209!
      5 kernel BUG at /home/neilb/work/nfsbrick/fs/module/index.c:1890!
     22 kernel BUG at /home/neilb/work/nfsbrick/fs/module/inode.c:601!
      2 kernel BUG at /home/neilb/work/nfsbrick/fs/module/inode.c:835!
      3 kernel BUG at /home/neilb/work/nfsbrick/fs/module/inode.c:852!
      9 kernel BUG at /home/neilb/work/nfsbrick/fs/module/segments.c:1033!
     17 kernel BUG at /home/neilb/work/nfsbrick/fs/module/super.c:656!
    251 SysRq : Resetting
      3 SysRq : Show State

 - We can erase a dblock while it is in the uninc_pending or 
   uninc_next - need to be careful
 - At umount, 0/2 is Dirty but not Pinned, so not written out
   ditto from 0/16
   16/0 sometimes is Async
      16/0 Async might be from the segment scan - so wait for that.
   Dirty but not pinned can happen when InoIdx is pinned.

 - I think the uninc_next list (At least) should be sorted before
    being allocated.

 - root block dirty/realloc/leaf in final iput
   Could be it was changed during last checkpoint so
   pushed in to next phase?  But why Realloc?
   Maybe still issue with losing inode data block.

20 June 2010 Happy Birthday Dad!!

420 runs.
      4 BUG: unable to handle kernel paging request at 6b6b6bfb.
     26 kernel BUG at /home/neilb/work/nfsbrick/fs/module/block.c:209!
     87 kernel BUG at /home/neilb/work/nfsbrick/fs/module/inode.c:601!
      1 kernel BUG at /home/neilb/work/nfsbrick/fs/module/inode.c:839!
      4 kernel BUG at /home/neilb/work/nfsbrick/fs/module/inode.c:856!
      1 kernel BUG at /home/neilb/work/nfsbrick/fs/module/modify.c:1719!
     12 kernel BUG at /home/neilb/work/nfsbrick/fs/module/segments.c:1033!
      2 kernel BUG at /home/neilb/work/nfsbrick/fs/module/super.c:656!

 Problems:
  - inode in i_sb_list has been freed.
  - block 0/0 is dirty/realloc/leaf after final iput
  - not all blocks freed by truncate
  - Index block with uninc is not dirty - not FIXED: more iolock in phase_flip
  - still children when truncate should have finished.
    all are Realloc
        Maybe inode has become unhashed and we re-load it??
        it is invalid after all!!
  - Index block not dirty when incorp - has uninc. ??
  - didn't wait for free segments
  - Data 16/0 is dirty but not pinned after final checkpoint - FIXED


watch -d 'awk -f checkseg /tmp/log; echo ====== ; grep -h -E "(blocked for more|BUG|SysRq : )" /var/tmp/lafs-logs/log* | sort | uniq -c ; echo === ; ls /var/tmp/lafs-logs/log* | wc -l'
watch -d 'echo ====== ; grep -h -E "(blocked for more|BUG|Busy inodes after|SysRq : )" /var/tmp/lafs-logs/log* | sort | uniq -c ; echo === ; ls /var/tmp/lafs-logs/log* | wc -l'


 Unclear on dirtying index blocks.
   We normally mark it dirty first, then add the address to the uninc list.
   Note that this is the reverse of data blocks which are changed first, then
   dirtied.  So maybe we should mark dirty afterwards.  We then need to
   avoid incorporation while we are adding addresses else we might find it
   has addresses but is not dirty.  Only try if dirty?
   Maybe we should iolock the parent.  We need to do that anyway to flush
   incorporations when the table is full.   Yes, that fits the VM model
   better.  Always lock while updating and preparing to write.  Set
   writeback once write has started, then unlock.  Cool.
   Only a block is iolocked when we allocate (to 0), so we cannot lock the parent..

21June2010
  Apart from tracking down the remaining bugs, I need to:
  1/ Decide on locking for incorporation and attaching new address to a block
    and implement it.
    In particular we need to not lose the Dirty flag before the update is done.
  2/ Resolve handling of pinned inode data/index blocks
  3/ Correct handling of empty index blocks, particularly when parent is in
    different phase.  Make lookup be more careful?
  4/ Wait for there to be enough free segments before allowing allocation.

  2:  Problem is that we cannot handle a pinned inode-data block while the
     InoIdx block is pinned in the same phase.
     We currently unpin it so it drops off the leaf list.  But then we
     need to re-pin it when the InoIdx is unpinned or phase-flipped, and that
     gets ugly.  Possible though.
     An alternate is to treat it like a parent and keep it off the list
     while the InoIdx is pinned/same-phase.  So we would need to
     re-assess it after unpinning or flipping the InoIdx.  That is probably
     a lot easier than re-pinning it.

  1: We would normally set 'dirty' after changing the block.  But we need
     to differentiate Dirty from Realloc, so we set before adding addresses.
     This requires that we are careful not to write an index block while there
     are pending changes.  The fact that pinned children stop any writing,
     as do pending addresses in a list should ensure this.
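
     A minimal sketch of that rule in C.  'struct iblk' and the helper
     names are hypothetical; the assert only mirrors the "uninc but not
     dirty" condition that the BUGs above complain about.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical cut-down index block state. */
struct iblk {
	bool dirty;
	int pinned_children;	/* pinned children block any write */
	int uninc_pending;	/* addresses queued but not yet incorporated */
};

/* Dirty is set BEFORE the address is queued, so it is never lost. */
static void iblk_add_address(struct iblk *b)
{
	b->dirty = true;
	b->uninc_pending++;
}

/* Fold all queued addresses into the block. */
static void iblk_incorporate(struct iblk *b)
{
	assert(b->uninc_pending == 0 || b->dirty);
	b->uninc_pending = 0;
}

/* An index block may be written only when no changes are pending. */
static bool iblk_may_write(const struct iblk *b)
{
	return b->dirty && b->pinned_children == 0 && b->uninc_pending == 0;
}
```

     With this ordering, a block that has queued addresses is always
     Dirty, and it only becomes writable after incorporation.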

  3: When an index block becomes empty we need to make sure that
     future lookup doesn't get confused by it.  Specifically future
     index lookup must avoid the block so nothing new gets added.
     Possibly a previous block will split again, but this block must remain
     unused.
     However we cannot update the parent block immediately as it might
     be in a different phase.
     So we must record both "don't touch this" and "where to look instead"
     elsewhere - in children.
     If the block being deleted is *not* the first child in the parent,
     then we direct index lookup to the earlier block.
     If the block being deleted *is* the first child in the parent,
     then redirect to the second child if there is one and we weren't just there.
     If there is no other block we flag the parent as empty and retry
     from the top.
     We flag a parent as empty with B_EmptyIndex.
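
     A minimal sketch of those redirect rules in C.  The enum and
     function names are hypothetical; only B_EmptyIndex is a name from
     the notes, and here it is just mentioned in a comment.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical result: where index lookup should go instead. */
enum redirect {
	REDIR_PREVIOUS,		/* not first in parent: use the earlier block */
	REDIR_SECOND,		/* first in parent: try the second child */
	REDIR_PARENT_EMPTY,	/* flag parent (B_EmptyIndex), retry from top */
};

/*
 * index_in_parent: position of the empty block among the parent's children.
 * nchildren:       number of children the parent currently addresses.
 * came_from_second: lookup has already been bounced from the second child.
 */
static enum redirect empty_index_redirect(int index_in_parent, int nchildren,
					  bool came_from_second)
{
	assert(nchildren >= 1);
	assert(index_in_parent >= 0 && index_in_parent < nchildren);

	if (index_in_parent > 0)
		return REDIR_PREVIOUS;
	if (nchildren > 1 && !came_from_second)
		return REDIR_SECOND;
	return REDIR_PARENT_EMPTY;
}
```

     The came_from_second argument is what stops lookup bouncing
     between the first two children when both are empty.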

     What locks do we need to walk around the sibling list?
     the inode private_lock is minimal, but we cannot hold that to take a
     iolock - just to get a reference.
     I guess we
        - iolock the parent
        - try to find a good block using private_lock
        - get a ref and wait for it.
        - check if it is still a good block.  If not, start again

     If we find an EmptyIndex block, it must be directly addressed by parent.
     It will never be followed by a PrimaryRef block because if there were
     such a block, we would have readdressed it back and hidden the EmptyIndex.
     So we need to look around for an address in the parent that leads to
     a non-EmptyIndex block.

     If all children are empty, we need to make the parent empty.  But
     what if it is InoIdx?
     Maybe I am making this too hard.  I could just use i_alloc_sem to
     block lookups while truncate is happening.  That doesn't address
     single block removal e.g. from directories.
     So I need to be able to wait for incorporation to happen on an
     empty index block.  We hold iolock on the parent.  If there are blocks
     on ->uninc, we just process them immediately.  If there are blocks on
     ->uninc_next, we wait for the checkpoint to complete

     What does lafs_incorporate actually do with EmptyIndex blocks?
     Provided they match currently incorporated addresses, they just cause
     those addresses to disappear.

     If a block is in the uninc list for its parent, then is phase_flipped
     and changed and written out it could get a new physaddr before
     it is incorporated.
     I guess we never allocate a B_Uninc block which is in a different phase
     to the parent.  Currently we wouldn't do that anyway except in truncate
     though memory pressure on index blocks might one day??
     Truncate?  We cannot allocate directly in lafs_incorporate.
     We should get lafs_cluster_allocate to notice and DTRT.

     Only hash index blocks when they are incorporated.  Not needed before then.
     When processing an uninc list, if an address appears twice, prefer the one
     that isn't EmptyIndex...

22June2010
    I need a clear picture of the "Steady state" for an internal index block
    with its children.
    The internal index block contains 1 or more addresses.  For each address there
    may be a child index block.  If there is, it may be the head of a list of
    blocks with B_PrimaryRef set thus holding the whole list in place until
    incorporation happens.
    Each of these children can be on either ->uninc_list or ->uninc_next,
    or possibly neither if they haven't been queued for writing yet.  Any
    PrimaryRef block will be Pinned.

    When a child is incorporated and found to be Empty it is flagged as such
    and then must never be returned by index lookup.  Index lookup will either
    add a block to a leaf index so it doesn't appear empty, or will hit an EmptyIndex
    block and so have to start again from the top.

    When a PrimaryRef block becomes empty it is simply removed from the
    PrimaryRef chain so it cannot be found.  The space now belongs to the
    previous block.
    When a non-PrimaryRef block which isn't the first becomes empty it is
    flagged and left in place so that following blocks can be found.  The
    address space now belongs to the previous block.
    When the first child (fileaddr matches parent) becomes empty - what?
      We could re-address first child but that forces early address change - 
          old might not be incorp yet
      We could re-address the parent, but that doesn't work for InoIdx
      We could leave it there with physaddr == 0

    Last sounds promising.  So we never re-address an index block.

   So: From the top.

    Index blocks, Indirect blocks, extent blocks each have an address
    that never changes.
    When a block becomes over-full it splits - a new block appears with
    a new address thus implicitly limiting the address space covered
    by the original.

    When an index block becomes empty and has no pinned children it is
    marked as EmptyIndex (under IOLock).
    When an EmptyIndex is allocated it goes to phys==0
    An EmptyIndex which is not first (->fileaddr != ->parent->fileaddr)
    is never used again.  Its address space is ceded to the previous
    index block - which could split several times...
    An EmptyIndex which is first can be re-used.  Once it gets pinned
    children the EmptyIndex is cleared.
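
    The EmptyIndex rules above can be sketched as a small user-space
    model.  This is only an illustration, not the module's code:
    'struct iblk' and the iblk_* helpers are hypothetical names, and the
    real flag (B_EmptyIndex) lives in the block's flag word and is
    changed under IOLock.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for an index block. */
struct iblk {
	uint32_t fileaddr;		/* address of this index block */
	uint32_t parent_fileaddr;	/* address of its parent */
	bool empty_index;		/* models B_EmptyIndex */
	int pinned_children;
};

/* "First" means ->fileaddr == ->parent->fileaddr. */
static bool iblk_is_first(const struct iblk *b)
{
	return b->fileaddr == b->parent_fileaddr;
}

/*
 * May index lookup use this block?  A non-first EmptyIndex block is
 * never used again; a first one can be re-used.
 */
static bool iblk_usable(const struct iblk *b)
{
	return !b->empty_index || iblk_is_first(b);
}

/* Once a block gets a pinned child, EmptyIndex is cleared. */
static void iblk_pin_child(struct iblk *b)
{
	b->pinned_children++;
	b->empty_index = false;
}
```

    A lookup that meets a block for which iblk_usable() is false has to
    fall back to the previous block, which now owns that address space.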

    An Index block always has an entry for the first address.  It might
    be implicit to phys==0.  Loading such a block creates an empty
    block.
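
    A minimal sketch of a lookup that honours the implicit first
    address, assuming the index block is modelled as a sorted array of
    (fileaddr, phys) pairs.  'struct idx_entry' and idx_lookup() are
    hypothetical names for illustration; the on-disk index format is
    different.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical in-memory view of one index entry. */
struct idx_entry {
	uint32_t fileaddr;	/* first file address covered by the child */
	uint64_t phys;		/* 0 means no block on disk */
};

/*
 * Find the child covering 'target' (caller ensures target >= start).
 * 'e' is sorted by fileaddr; 'start' is the address of the index block
 * itself.  If no explicit entry covers 'target', the implicit first
 * entry (start, phys == 0) is used.  The child's address is stored in
 * *child_addr and its phys is returned.
 */
static uint64_t idx_lookup(const struct idx_entry *e, size_t n,
			   uint32_t start, uint32_t target,
			   uint32_t *child_addr)
{
	uint64_t phys = 0;	/* implicit first entry */
	size_t i;

	*child_addr = start;
	for (i = 0; i < n && e[i].fileaddr <= target; i++) {
		*child_addr = e[i].fileaddr;
		phys = e[i].phys;
	}
	return phys;
}
```

    Because each entry only covers addresses up to the next entry, a
    split (a new entry with a new address) limits the range of the
    original without the original ever being re-addressed.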

    InoIdx doesn't get EmptyIndex, rather it gets ->depth=1

    Indirect *doesn't* store the first address any more.

    Changes:
DONE     - remove forcestart from layoutinfo
DONE     - remove start-address from Indirect blocks
DONE     - only hash index blocks when they are known to be incorporated.
DONE     - when incorporating an uninc list, ignore phys==0 if also a block with
       same fileaddr and phys!=0.  so sort phys==0 first
DONE     - Create EmptyIndex flag
DONE     - Clear the flag when adding child pin to index block
DONE     - avoid EmptyIndex non-start blocks during index lookup
DONE     - allow index blocks to be loaded with ->phys==0
DONE     - allow EmptyIndex index block to be "written" to phys 0
DONE     - ensure index lookup finds implicit start address, possibly 0
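
    The "sort phys==0 first, then ignore it if the same fileaddr also
    has phys!=0" change can be sketched as below.  This is a user-space
    model under my own assumptions: 'struct uninc_addr' and
    uninc_prune() are hypothetical, and the real uninc list is a linked
    list of blocks rather than a flat array.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical flattened uninc entry. */
struct uninc_addr {
	uint32_t fileaddr;
	uint64_t phys;
};

/* Sort by fileaddr; within one fileaddr, phys == 0 sorts first. */
static int uninc_cmp(const void *pa, const void *pb)
{
	const struct uninc_addr *a = pa, *b = pb;

	if (a->fileaddr != b->fileaddr)
		return a->fileaddr < b->fileaddr ? -1 : 1;
	if ((a->phys == 0) != (b->phys == 0))
		return a->phys == 0 ? -1 : 1;
	return 0;
}

/*
 * Sort the list, then drop every phys == 0 entry that is shadowed by a
 * later entry with the same fileaddr and phys != 0.
 * Returns the new number of entries.
 */
static size_t uninc_prune(struct uninc_addr *v, size_t n)
{
	size_t in, j, out = 0;

	qsort(v, n, sizeof(*v), uninc_cmp);
	for (in = 0; in < n; in++) {
		int shadowed = 0;

		if (v[in].phys == 0)
			for (j = in + 1;
			     j < n && v[j].fileaddr == v[in].fileaddr; j++)
				if (v[j].phys != 0)
					shadowed = 1;
		if (!shadowed)
			v[out++] = v[in];
	}
	return out;
}
```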

So now after 36 runs
      3 kernel BUG at /home/neilb/work/nfsbrick/fs/module/index.c:1939!
      1 kernel BUG at /home/neilb/work/nfsbrick/fs/module/index.c:403!
     10 kernel BUG at /home/neilb/work/nfsbrick/fs/module/inode.c:605!
     14 kernel BUG at /home/neilb/work/nfsbrick/fs/module/segments.c:1034!
      4 kernel BUG at /home/neilb/work/nfsbrick/fs/module/segments.c:624!
      1 kernel BUG at /home/neilb/work/nfsbrick/fs/module/super.c:657!
      3 SysRq : Resetting


index.c:1939
   block 0/2 is Realloc and being allocated from cluster_flush while
   parent is not Realloc or dirty
   That is bad as Realloc gets set in lafs_allocated_block ... except
    that the code was bad.  FIXED.

index.c:403
  cleaner is pinning a block (299/25) which is not Realloc,
    and phase isn't locked.  We are only meant to pin data blocks
    for updates while holding a phase lock.
    Ahhh - bad code again. FIXED

inode.c:605
   Truncate doesn't clean up properly. 
    327 has 60+1
    331 has 108+1
    327 has 34+1
    327 has 60+1
   No sign of any children.

   Very weird.  Sign of incorporation going wrong.
     Added more debugging.

Found 4084 4 12 at 890
Added 4084 4 12
Found 4089 4 16 at 878
Added 4089 4 16
Found 4094 2 20 at 866
Added 4094 2 20
Found 2561 2 22 at 854
Added 514 2 22
Found 2564 4 24 at 842
Found 2569 2 28 at 830
Found 0 0 0 at 818

Why are 2564 etc lost?  No sign of alloc-to-0

segments.c:1034
   no free segments - need to wait somewhere.

segments.c:624
   allocated_blocks has gone over free_blocks!
   in lafs_prealloc/reserve_block/free_get/ss_put/new_segment.../checkpoint.
   Wanted CleanSpace to reserve the youthblk
   Maybe related to not waiting - ignore for now.

super.c:657
  block 0/2 was dirty but not pinned.  Should not happen to inodes.
  block 0/0 was Pinned because it had a child - as above.

  Maybe we don't carry the pin across when we collapse dir
  into inode??... looks quite likely


23 June 2010

116 runs.
      1 BUG: unable to handle kernel paging request at 6b6b6bfb
      1 kernel BUG at /home/neilb/work/nfsbrick/fs/module/block.c:497!
      3 kernel BUG at /home/neilb/work/nfsbrick/fs/module/dir.c:710!
      7 kernel BUG at /home/neilb/work/nfsbrick/fs/module/inode.c:606!
     61 kernel BUG at /home/neilb/work/nfsbrick/fs/module/segments.c:1034!
      1 kernel BUG at /home/neilb/work/nfsbrick/fs/module/super.c:657!
     42 SysRq : Resetting


6b6b6bfb:
  invalidate_inode_buffers called at shutdown.
  Still weird

block.c:497  FIXED??
  block 16/1 is not dirty with no credits.
  Maybe writepage got to it?

dir.c:710
  ouch! dir lookup failed in unlink.
   No real hints.  Must be hash based - some off-by-one probably.
   Need to stare at the code.

inode.c:606  FIXED
  Blocks still present after truncate.
  typically about 60, but in 1 case '4'.  No index blocks.
  So probably content of second index block.
  Yes, lafs_leaf_next was doing the wrong thing for addresses
   before start of block.

segments.c:1034
  same old

super.c:657  FIXED
  dir inode 0/2 is still Dirty but not pinned.
  Maybe lafs_dirty_inode should be pinning the block
 
  But now this triggers for 16/X still dirty.


How and when to write blocks in a SegmentMap file?
 - We don't want normal write-back to write them unless they have
   no references
 - We need to write them in tail of checkpoint, and index info must
   follow in the next checkpoint.

lafs_space_alloc is called from
  - mark_cleaning:  always CleanSpace, failure is OK
  - lafs_cluster_update_pin: ReleaseSpace.  -EAGAIN is OK (CHECK THIS) but failure
              is not - or shouldn't be.
  - lafs_allocated_block: CleanSpace, checking if parent of Realloc block 
        can be saved separately from any Dirty version.  Failure OK, blocking not.
  - lafs_prealloc - general space allocation.
  - 
lafs_cluster_update_pin is called from:
  - lafs_create, lafs_link, lafs_unlink, lafs_rmdir, lafs_symlink, lafs_mkdir
    lafs_mknod, lafs_rename,
  - lafs_write_inode
     So best to return -EAGAIN, and it should be handled adequately.

lafs_prealloc is called from:
  - lafs_reserve_block, after modifying the alloc_type extensively.
  - lafs_phase_flip to re-fill the 'next' credits.  If they aren't available
      we simply pin all children so they aren't needed.
      So failure is OK
  - lafs_seg_ref_block: getting CleanSpace to save segusage blocks.
       If this fails .. what?? lafs_reserve_block fails. so...

lafs_reserve_block is called from
  - mark_cleaning - CleanSpace
  - lafs_pin_dblock - type is passed in...
  - lafs_prepare_write - on failure write will fail or retry after checkpoint
  - lafs_inode_handle_orphan - to help with delete. On failure we allow
         cleaning to happen
  - lafs_seg_move - should be elsewhere.  Failure BAD !
  - lafs_free_get - as above, failure BAD
  - clean_free - update youth for new clean blocks - Failure BAD

lafs_pin_dblock is called from
  - dir_create_pin - fail or again handled
  - dir_delete_pin
  - dir_update_pin
  - lafs_create etc
  - lafs_dir_handle_orphan
  - choose_free_inum
  - inode_map_new_pin
  - lafs_new_inode
    ...
  - lafs_orphan_release !! cannot handle failure
  - roll_block should use AccountSpace

So:  It seems we need a new allocation class that will never fail.
  Maybe it is allowed to BUG though?
   AccountSpace - i.e. space needed to account for the use of space.
     Must never ever fail.

Then we must ask where blocking should happen on -EAGAIN.
  dir.c does "lafs_checkpoint_unlock_wait", then tries again.
  prepare_write does too.
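
  The "wait for a checkpoint, then try again" pattern used by dir.c and
  prepare_write can be sketched generically.  This is a user-space
  model only: pin_with_retry(), the callback types and the toy_*
  helpers are hypothetical; in the module the wait step is
  lafs_checkpoint_unlock_wait.

```c
#include <errno.h>

/* Hypothetical callback types standing in for the pin and wait calls. */
typedef int (*pin_fn)(void *ctx);
typedef void (*wait_fn)(void *ctx);

/*
 * Retry the pin attempt for as long as it reports -EAGAIN, waiting for
 * a checkpoint between attempts.  Any other result (0, -ENOSPC, ...)
 * is passed straight back to the caller.
 */
static int pin_with_retry(pin_fn try_pin, wait_fn wait_checkpoint, void *ctx)
{
	int err;

	while ((err = try_pin(ctx)) == -EAGAIN)
		wait_checkpoint(ctx);
	return err;
}

/* Toy state for a usage example: space appears after two checkpoints. */
struct toy_fs {
	int checkpoints;
	int attempts;
};

static int toy_pin(void *ctx)
{
	struct toy_fs *fs = ctx;

	fs->attempts++;
	return fs->checkpoints >= 2 ? 0 : -EAGAIN;
}

static void toy_wait(void *ctx)
{
	struct toy_fs *fs = ctx;

	fs->checkpoints++;
}
```

  With the toy helpers, pin_with_retry(toy_pin, toy_wait, &fs) succeeds
  on the third attempt.  The loop only terminates if each wait really
  can make space available, which is why waiting for cleaning matters.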

For that to work we must start a checkpoint on returned EAGAIN.... Don't
we want to wait for some cleaning to happen first though?  Maybe an extra
flag, and a count of the number of empty (but not clean) blocks.

- Should I skip orphan handling when tight on space?  Probably not.  It will
  just keep failing while we keep cleaning...
- roll_block should use account_space .. or not

- lafs_space_alloc simply allocates space, or fails.  'why' is used to
   guide watermark choice.
- lafs_prealloc allocates space to a block and all its parents based on
  'why' for watermarks.  It either succeeds or fails.

- lafs_cluster_update_pin and lafs_reserve_block decide whether to respond
  to failure as -ENOSPC or -EAGAIN based on 'why'.

- lafs_pin_dblock simply passes on the failure, which must be handled.

So: What to do when we return -EAGAIN?
 We need to wait until there are *enough* clean segments, then cause a checkpoint
 so they become free.
 So a flag that says 'waiting for free space' and a count of segments
 required.
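
 A minimal sketch of that flag and count, assuming the cleaner reports
 each segment it finishes and that a checkpoint turns clean segments
 into free ones.  All names here (struct space_wait and the
 space_wait_* helpers) are hypothetical and no locking is shown.

```c
#include <stdbool.h>

/* Hypothetical bookkeeping for threads that got -EAGAIN. */
struct space_wait {
	bool waiting;		/* 'waiting for free space' flag */
	unsigned int needed;	/* clean segments required */
	unsigned int clean;	/* segments clean but not yet free */
};

/* A caller that got -EAGAIN registers how many segments it needs. */
static void space_wait_request(struct space_wait *w, unsigned int segs)
{
	w->waiting = true;
	if (segs > w->needed)
		w->needed = segs;
}

/*
 * The cleaner finished one segment.  Returns true when there are
 * enough clean segments that a checkpoint should be started.
 */
static bool space_wait_cleaned(struct space_wait *w)
{
	w->clean++;
	return w->waiting && w->clean >= w->needed;
}

/* Checkpoint done: clean segments become free, waiters can retry. */
static unsigned int space_wait_checkpoint_done(struct space_wait *w)
{
	unsigned int freed = w->clean;

	w->clean = 0;
	w->needed = 0;
	w->waiting = false;
	return freed;
}
```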

 But how do we differentiate ENOSPC and EAGAIN for NewSpace requests?
 Maybe we don't ??  Or do it later.

Still to do:
- Audit all AccountSpace and justify them
 + lafs_seg_move is probably wrong.  Should have allocated when the
   free segment was allocated
- lafs_orphan_release calls lafs_pin_dblock but cannot handle failure
- Need to wait not just for "enough space" but for "enough clean segments".

- how is 'free_blocks' set - what does this tell us??

   free_blocks is the sum of known-clean segments.
   We probably want:
         clean segments
         remainder for each active segment
   then reserve some segments for cleaning.
   And separate 'allocated_block' for each ?

Notes:
 segments.c:647 fired: AccountSpace had no space available.
   Reserving space to write the segusage of youth block for a newly
   allocated segment.
 super.c:657 STILL
    0/2 is Dirty but not Pinned  Maybe we need PinPending
 soft lockup
    in the cleaner!
    Maybe I need cond_resched??

Maybe I want two separate 'free_blocks' counters.
 One that includes all free blocks for use in 'df' etc.
 One that only includes completely free segments for use in allocation...
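
 The two counters could be derived as below.  This is a sketch under
 assumptions: 'struct seg_info' with a per-segment size and usage count
 is hypothetical (loosely modelled on the segusage numbers), and
 reserving segments for the cleaner is left out.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-segment summary. */
struct seg_info {
	uint32_t size;		/* blocks in the segment */
	uint32_t usage;		/* blocks still in use */
};

struct free_counts {
	uint64_t df_free;	/* every unused block, for 'df' etc. */
	uint64_t alloc_free;	/* blocks in completely free segments only */
};

static struct free_counts count_free(const struct seg_info *s, size_t n)
{
	struct free_counts fc = { 0, 0 };
	size_t i;

	for (i = 0; i < n; i++) {
		fc.df_free += s[i].size - s[i].usage;
		if (s[i].usage == 0)
			fc.alloc_free += s[i].size;
	}
	return fc;
}
```

 The gap between the two numbers is space that only becomes usable
 after the cleaner has emptied the partly-used segments.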


24 June 2010

 Something is wrong with cleaning and segment tracking
 We have 5 free segments and we get them all without writing
 anything!  We consume them all with cluster_flush!
 It seems that the root inode is not changing phase!
 Nothing is on the phase leafs.
 Most children are in Writeback on cluster. and are Realloc
 Others have pinned children.
 They are all in 'cluster', but 'flush' doesn't flush them,
 so they must be in a different cluster???  Is the cleaner still
 cleaning?  Yes, they are on the cleaner 'wc' list so they are
 queued but not flushed for the cleaner.

25 June 2010
 At last it looks like I nearly have a working FS. Out of 361 test
 runs, 9 triggered BUGS and one hung at umount.

 I need a new TODO list, starting with 6 jul 2007(!) and adding any
 FIXMEs etc.

DONE 0/ start TODO list
DONE 1/ document new bugs
DONE 2/ Tidy up all recent changes as individual commits.
DONE 3/ clean up the various 'scratch' patches discarding any tracing that
    I don't think I need, and making the rest 'dprintk' etc.
DONE 4/ check in this README file
DONE 5/ Write rest of the TODO list

DONE 5a/ index.c:1982. Data block with Phys and no UnincCredit
    It is Dirty but only has *N credits.
    16/1 ...

DONE 5b/ phase_flip/pin_all_children/lafs_refile finds refcnt == 0;
   I guess we should getref/putref.

DONE 5c/ dirty_inode might find InoIdx is allocated but datablock not
    and doesn't cope well.

DONE 5d/ At unmount, 16/1 is still pinned.

 6/ soft lockup in unlink call.
    EIP is at lafs_hash_name+0xa5/0x10f [lafs]
 [<d0a56283>] hash_piece+0x18/0x65 [lafs]
 [<d0a564c3>] lafs_dir_del_ent+0x4e/0x404 [lafs]
 [<d0a56256>] ? lafs_hash_name+0xfa/0x10f [lafs]
 [<d0a4b35c>] dir_delete_commit+0xdb/0x187 [lafs]
 [<d0a4be3f>] lafs_unlink+0x144/0x1f4 [lafs]
 [<c02602c1>] vfs_unlink+0x4e/0x92

  Don't know. Looks like cleaning up a chain in dir_delete_commit.
  Added a BUG_ON.
 
  Would we be spinning on -EAGAIN ?? 4 empty segments are present.

 6a/ index.c:1947 - lafs_add_block_address of index block where parent
          has depth of 1.
looping on [cfbd4690]327/336(0)r3F:Index(1),Pinned,Phase0,Valid,SegRef,CI,CN,CNI,UninCredit,PhysValid,PrimaryRef,EmptyIndex,Uninc{0,0}[0] uninc(1) inode_handle_orphan2(1) leaf(1)
/home/neilb/work/nfsbrick/fs/module/index.c:1947: [cfbd5c70]327/0(0)r2F:Index(1),Pinned,Phase0,Valid,Dirty,Writeback,SegRef,CI,CN,CNI,UninCredit,PhysValid,EmptyIndex,Uninc{0,0}[0] inode_handle_orphan2(1) leaf(1)

 6b/  check_seg_cnt seems to be spinning on the 3rd section
    the clean list has no end!
    we were in seg scan 
CLEANABLE: 0/0 y=0 u=0 cpy=32773
CLEANABLE: 0/1 y=0 u=0 cpy=32773
CLEANABLE: 0/2 y=0 u=0 cpy=32773
CLEANABLE: 0/3 y=32773 u=6 cpy=32773
CLEANABLE: 0/4 y=32772 u=124 cpy=32773
CLEANABLE: 0/5 y=32771 u=273 cpy=32773
CLEANABLE: 0/6 y=32770 u=0 cpy=32773

of 
0 0
1
2  
3 6
4 124
5 273
6 0
7 496
8 0


 6c/ at shut down, some simple orphans remain
    missing wakeup ???

DONE 7/ block.c:624 in lafs_dirty_iblock - no pin, no credits
   truncate -> lafs_invalidate_page -> lafs_erase_dblock -> lafs_allocated_block / lafs_dirty_iblock
Allocated [ce44f240]327/144(1499)r2E:Writeback,PhysValid clean2(1) cleaning(1) -> 0
SEGMOVE 1499 0
Oh dear: [ce44f240]327/144(0)r2E:Writeback,PhysValid clean2(1) cleaning(1)
.......: [cfb69180]327/0(349)r2F:Index(1),Pinned,Phase0,Valid,PhysValid{0,0}[0] child(1) leaf(1)
Why have I no credits?
/home/neilb/work/nfsbrick/fs/module/block.c:624: [cfb69180]327/0(349)r2F:Index(1),Pinned,Phase0,Valid,Dirty,PhysValid{0,0}[0] child(1) leaf(1)
      
   Cleaning is racing with truncate, and that cannot happen!!
   Actually it could - if i_size changed at the wrong time.

DONE 7a/ block.c:507 in lafs_dirty_dblock - no credits for 0/2
   block.c:507: [cfa63c58]0/2(4348)r2F:Valid,Dirty,Writeback,PhysValid cluster(1) iblock(1)
   in touch_atime.  I think I know this one.

 7b/ soft lockup in cleaner between 0x5e6, then 0x799-7f6 then 0x990 of 0x1502
               i.e. 1510, 1945-2038, 2448 of 5378
    Appear to be looping in first loop of try_clean, maybe
     group_size_words == 0 ??
    Add BUGON and wait.

DONE 7c/ NULL pointer deref - 000001b4
     Could be cluster_flush finds inode dblock without inode.
     Have a BUG_ON of this now.

DONE 7d/ paging request at 6b6b6bfb. 
    invalidate_inode_buffers called, so inode_has_buffers,
    so private_list is not empty.  So presumably use-after-free.
    But is on s_inodes list.
     Probably cleaner is still active (if this is first call to
     invalidate_inodes in generic_shutdown_super) so list gets broken.  
     We need locking or earlier flush.

DONE 7e/ Remove BUG block.c:273 as cleaner can cause this.
     Check for Realloc too.

PRESUME-FIXED 7f/ index.c:2024 no uninc credit 
        [ce532338]0/306(2996)r1F:Pinned,Phase0,Valid,Dirty,Writeback,SegRef,Claimed,PhysValid cluster(1)
      found during checkpoint.  Maybe inode credit problem.

PRESUME-FIXED 7g/  inode.c:831 InoIdx 283/0 is Realloc, not dirty, and has
      ->uninc blocks.  This is during truncate.  Need some
      interlock with cleaner maybe?
      Probably the same race between cleaner and truncate.

DONE 7h/ inode.c:845 truncate finds children - Realloc on clean-leafs

NOLONGERRELEVANT 7j/ resolve space allocation issues.
    Understand why CleanSpace can be tried and failed 1000
    times before there is any change.

DONE  7k/ use B_Async for all async waits, don't depend on B_Orphan to do
     a wakeup.
     write lafs_iolock_written_async.

DONE 7l/ make sure i_blocks is correct.
          set on 'import_inode'
          decreased when lafs_summary_update assigned block to '0'
          changed when lafs_summary_allocate changes e.g. quota.

      lafs_summary_update is called when a block is assigned to a location,
        or to zero.  It is real usage.
      lafs_summary_allocate is called when we set Prealloc on phys==0 or
         clear Prealloc on phys==0
      So allocate must be followed exactly.
       update is already counted for setting !=0, so only dec on ==0.
      So all is good.
     What about quota? - hidden in quota_allocate / qcommit
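
      My reading of the i_blocks rules above, as a small user-space
      model.  'struct ino_acct' and the acct_* helpers are hypothetical
      stand-ins for the effect of lafs_summary_allocate and
      lafs_summary_update on i_blocks; the guard against a 0 -> 0
      update is my own assumption, and quota is ignored.

```c
#include <stdint.h>

/* Hypothetical per-inode block accounting. */
struct ino_acct {
	uint64_t i_blocks;
};

/*
 * Models lafs_summary_allocate: Prealloc was set (delta > 0) or
 * cleared (delta < 0) on a block with phys == 0.  Followed exactly.
 */
static void acct_allocate(struct ino_acct *a, int delta)
{
	if (delta > 0)
		a->i_blocks += (uint64_t)delta;
	else
		a->i_blocks -= (uint64_t)(-delta);
}

/*
 * Models lafs_summary_update: a block moves from oldphys to newphys.
 * Setting phys != 0 was already counted by the earlier allocate, so
 * only a move to phys == 0 decrements.
 */
static void acct_update(struct ino_acct *a, uint64_t oldphys,
			uint64_t newphys)
{
	if (oldphys != 0 && newphys == 0)
		a->i_blocks--;
}
```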

7m/ delete inode could not progress through inode_map_free, so
   ORPH: [ce545f08]0/290(1663)r4E:Valid,Dirty,Async,UninCredit,Claimed,PhysValid,Orphan(0) async(1) delete_inode(1) orphan_list(1) drop(1)
   was permanently an orphan.

DONE 8/ looping in do_checkpoint
   root is still in Phase1 because 0/2 is in Phase 1
  [cfa57c58]0/2(2078)r1E:Pinned,Phase1,WPhase0,Valid,Dirty,C,CI,CN,CNI,UninCredit,IOLock,PhysValid</file.c:269> writepageflush(1)
   Seems to be waiting for writeback, but writeback is clear.
     Need to call lafs_io_wake in lafs_iocheck_writeback for when
     it is called by lafs_writepage

DONE 9/ cluster.c:478
    flush_data_to_inode finds Realloc (not dirty) block
    and InoIdx block is not Valid.
  [cfb5ef50]2/0(3)r1F:Index(0),Pinned,Phase1,InoIdx,SegRef,C,CI,CN,CNI,IOLock,OnFree,PhysValid{0,1}[0]</cluster.c:435> child(1)
  I wonder if it was PinPending, or where it was IOLocked (or if).

   I guess we truncated, then added data, then tried to clean.
   Probably just a bad 'bug' given recent changes.
   No, I think it is the race between truncate and clean which is now fixed.

SEEMS TO BE GONE 10/ inode.c:606
    Deleting inode 328: 2+0+0 1+0

    2 level index.
    first index at level 1 was full and pruned properly.
    Nothing else found empty.
    Somehow the second index block and contents were lost.

ASSUME_DONE 11/ super.c:657
    Root still pinned at unmount.
     0/2 is Dirty:  [cfa53c58]0/2(1750)r0E:Valid,Dirty,CN,CNI,UninCredit,PhysValid
                    [cfa5fc58]0/2(2852)r0E:Valid,Dirty,SegRef,CN,CNI,UninCredit,PhysValid
                    [cfa53c58]0/2(3570)r0E:Valid,Dirty,CN,CNI,UninCredit,PhysValid
                    [cfa53828]0/2(2969)r0E:Valid,Dirty,CN,CNI,UninCredit,PhysValid
                    [cfa75c58]0/2(579)r0E:Valid,Dirty,UninCredit,PhysValid
    maybe dir-orphan handling stuffed up
    Or maybe it is the I_Dirty issue.  Assume fixed.


ASSUME_DONE 12/ timeout/showstate in unmount
    umount is in sync_inodes / do_writepages / lafs_writepage / lafs_iolock_written
    That looks similar to 8

DONE 13/ delete_inode should wait for pending truncate to complete.
    Document I_Trunc somewhere - including that i_mutex is needed to set it.
    Verify that assertion.
    Actually it requires i_alloc_sem, or the inode to be deleted.


DONE 14/ Review writepage and flush and make sure we flush often enough but
    not too often.
    Probably just remove the cluster_flush from write-page as lafs_flush
    will do that.
    But leave for now as it encourages heavy indexing.

DONE 14a/ use bio_add_page to write clusters.

DONE 14b/ Figure out what backing_dev to present for the filesystem.

DONE 15/ The inode map file lost some credits.  I think it lost a PinPending because
    it isn't locked properly.  Don't clear PinPending if someone else might
    have set it.

DONE 15a/ Find all FIXMEs and add them here.
    

DONE 15b/ Report directory size less confusingly

DONE 15c/ roll-forward should not update index if physaddr hasn't changed (roll_block)

DONE 15d/ What does I_Dirty mean - and implement it.

FIXED 15e/ setattr should queue an update for the inode metadata.
     and clean up lafs_write_inode at the same time (it shouldn't do an update).
     and confirm when s_dirt should be set.  It causes fsync to run a
     checkpoint.

15f/ include timestamp in cluster_head to set mtime/ctime properly on roll-forward?
## Items from 6 jul 2007.  

15g/ test directories with non-random sequential hash.

DONE 15h/ orphan deadlock
    lafs_run_orphans- lafs_orphan_release can block waiting for written
     in erase_dblock, but that won't complete until cleaner gets to run,
     but this is the cleaner blocked on orphans.
    

DONE 15i/ separate thread management from 'cleaner' name.

DONE 15j/ review rules in getref_locked - and document them

DONE  - fix accesses to iblock

DONE 15k/ newblocks should probably be a count of segments.  Review that.

DONE 15l/ make sure checkpoint_youth is decayed properly.  Review youth decay.

DONE 15m/ consider combining .orphans and .cleaning lists.  If something is an
    orphan, we probably don't want to clean it just now(?).

DONE 15n/ consider if lafs_pin_dblock should check for iolock.  Maybe 
     iolock or PinPending (which must be set under iolock).
     Just require PinPending and always get iolock_written for that
     except in special cases.

DONE 15o/ Can there be async blocks when checkpoint starts?  Could they
     pin blocks in old phase?  Do I need to check for them?

DONE 15p/ Review and remove the 'if cleaner is active then don't checkpoint just
     yet' thing - or somehow avoid the yuckiness.

DONE 15q/ check checksums when reading cluster_header for cleaner
       This is already done!

DONE 15r/ consider further optimisation in cleaner to avoid lookups.

DONE 15s/ memory barrier for i_size check in cleaner???

DONE 15t/ review usable-space calculations in clean.

DONE 15u/ Do I need a SegRef when pin-dblock-by-hand in flush_data_to_inode

DONE 15v/ tidy up all code that fiddles bits and credits - maybe make some
     common helpers.

DONE 15w/ review cluster updates and make sure space used is accounted properly.

DONT BOTHER 15x/ Consider caching result of a failed dir lookup in case we immediately
     try to create it.  Would this actually save anything significant?

DONE 15y/ Don't make dir blocks into orphans if it cannot be needed?

DONE 15z/ make sure symlink creation is safe - do I need to log the body??

DONE 15aa/ lafs_rename should flush orphans just like lafs_rmdir does.

DONE 15ab/ Does writepage need to recheck if my_inode and/or iblock have appeared
     after lock is taken on block?

DONE 15ac/ if lafs_shrinker cannot reclaim enough index blocks, trigger some
      writeout.

DONE 15ad/ review lafs_phase_flip's call to lafs_add_block_address and wonder
        if more is needed.

DONE 15ae/ refile wonders about a race with cluster_allocate which gets IOLock
    before removing from lru.

DONE 15af/ Review all locking in lafs_refile

DONE 15ag/ Don't allocate data part of InoIdx block.

DONE 15ah/ Is there a problem with lafs_allocated_block putting an 
    about-to-be-truncated block on an uninc list?

DONE 15ai/ When allocating a new segment during checkpoint, delay the
    youth-block update until after the checkpoint

DONE 15aj/ When roll-forward finds a new segment, make sure youth number is
    updated.

DONE 15ak/ Load orphan file during roll-forward and make every block an
    orphan.

DONE 15al/ set filesystem update_time somewhere.

DONE 15am/ filesystem 'name' needs to be handled uniformly.

DONE 15an/ can we be sure 'b' will be non-null in delete_inode?

DONE 15ao/ determine what locking is needed to walk the children list
    in lafs_inode_handle_orphan.  Probably the address_space private lock.

15ap/ Make sure write_inode has been cleaned up.  See if this applies to
    rollforward of a symlink (see FIXME)

DONE 15aq/ change inode map to be little-endian, not host-endian

DONE 15ar/ understand what to do about errors in lafs_truncate

15as/ handle errors from lafs_write_super ???

DONE 15at/ More wait_queues to wait for different blocks.
    just use wait_on_bit / wake_bit

DONE 15au/ How should iocheck_block set the page error?
       and block_loaded <- this gets it right.

15av/ ditto for write errors?

DONE 15aw/ when lafs_incorporate makes a new block where the
      old is Realloc, the new should be Realloc too.

15aw2 / When a block is a snapshot block it can never be dirty
    so we only need credits for realloc...

DONE 15ax/ Think about what happens when we relocate a block
    in the orphan list (lafs_orphan_release), particularly
    if the block isn't actually loaded.
    FIXME still need to make sure errors while loading the orphan
    file are handled correctly - I guess we mark all bad orphans as
    type==0 and when we find those during release, reduce the size
    of the orphan file.

DONE 15ay/ Wonder if there is any way for run_orphans to get a wakeup 
    when an inode or dir mutex is released.
    No, there isn't.

DONE 15az/ Sanity check all values in cluster head during roll-forward
      i.e. in roll_valid.  If the head isn't complete, we can still
      use this to commit some previous checkpoints.

DONE 15ba/ roll forward should not BUG on bad data like inodefile in
    non-primary filesystem.

DONE 15bb/ Do I need to sync something before copying an update over part
    of an inode, then reloading the inode.

DONE 15bc/ Handle DescHole in roll forward.

DONE 15bd/ Call lafs_add_block_address from writeback rather than iolock
    in roll forward, just for consistency.

DONE 15be/ Confirm various files loaded at mount time (segusage, orphan ...)
    are actually the correct type.

DONE 15bf/ Avoid quadratics in lafs_seg_put_all - nothing else should be doing
   a lookup - or at least we can test for that.
   lafs_seg_apply_all has similar problems and needs a good solution.

DONE 15bg/ lafs_seg_ref_block is worried about losing implicit ref on parent
    if parent splits.  See what to do about that.

DONE 15bh/ after roll-forward, check that free_blocks hasn't gone negative.
  or handle if it has.

DONE 15bi/ Set EmergencyClean a bit later - need at least one checkpoint first.
  to twostage.

DONE 15bj/ Make sure .last link in segtracker is kept uptodate, particularly in
   segdelete.

DONE 15bk/ make sure get_cleanable doesn't lose a race before calling add_clean

DONE 15bl/ better checks for 'valid state block address' in valid_devblock
    include that segment_count is credible
    also in valid_stateblock

15bm/ make sure everything gets freed properly on error during mount / lafs_load

15bn/ How does refcounting of 'struct fs' work with multiple filesets?

DONE 15bo/ use put_super to drop last reference to superblocks

DONE 15bp/ review all superblocks - maybe use more anon??

15bq/ check readonly status in lafs_get_sb

DONE 15br/ sync_fs should probably wait for something if 'wait'.

DONE 15bs/ set f_fsid properly in lafs_statfs

DONE  - use new write_begin / write_end

15bt/    - review how we ensure that credits remain with block.

15ca/ When pin inode data block, pin it as well as index block I think
    It is still kept of the leaf list until the index block is done with
    I think.

15cb/ Layout issues:
     DONE - subset filesys still needs a parent pointer
     DONE - cluster head needs mtime/ctime to log these.
     - need better tracking of which devices are in this array??
            Need to be able to have read-only devices that are shared
             among arrays.
     DONE - need multiple parallel write-clusters to allow parallel writes.
     - record tuning in state block:
           - max_segs
     DONE - use crc or something, not toy checksum (e.g. cluster - state already has)
     - flags for inconsistencies found, at layout/fileset/file levels(?) (see 60)
     - policies of whether old or new data is allowed on each device
     - policies of how much duplication of metadata is required
     DONE - inode map - not host-endian
     DONE - segments > 16bit:
        segusage file - what about youth?
        cluster_head Clength

15cc/ free any stray B_ASync block found in destroy_inode

15cd/ Some code assumes a cluster header does not exceed 1 page.
     Is this safe?  Is it true? Is it enforced?
     roll-forward now handles large cluster_head.
     Need cleaner to handle it, and need to possibly write large
     cluster head when making new clusters.

15ce/ classify BUGs as
        - internal logic errors
        - IO errors
        - unusual conditions I want a warning of
        - data corruption errors

DONE 15cf/ lafs_iget_fs needs to sometimes do in-kernel mounts for subset filesystems
     This is needed for the cleaner - the cleaner needs to hold a ref somehow.

15cg/ lafs_sync_inode is weird - why the lafs_checkpoint_start and update_cluster
      stuff??

15ch/ Review values of youth and checkpoint_youth and think about off-by-one
     issues.

15da/ Replace directory updates!!!!!

15db/ Decide how version string will be used.

15dc/ resolve table_size - it should be stored in the segusage file and validated
      based on device geometry.

15ea/ rollforward should recognise VerifyDevNext{,2} to allow next
      cluster on same device to verify previous.

15eb/ When multiple devices and lots to do and plenty of free space,
	allow multiple segments, one per device, to be open at once,
	and possibly be writing multiple clusters at once using
	VerifyDevNext2

15ec/ Implement i_version tracking.  This should be a 64bit number
	that appears to change every time the file changes.  We only
	need a new number when someone looks at the value with
	getattr.
	We could simply use mtime with the sub-millisecond part being
	a counter of times that getattr sees a change in the same
	millisecond.
	However as mtime can go backwards we might get i_version going
	backwards, which is awkward.  I wonder if I care.
	Otherwise, leave for an inode extension later.
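
	The mtime-plus-counter idea can be sketched as follows.  This is
	a user-space model, not a proposal for the on-disk format:
	'struct iv_state', iv_modify(), iv_getattr() and the choice of
	20 counter bits are all assumptions made for the illustration.

```c
#include <stdbool.h>
#include <stdint.h>

#define IV_COUNTER_BITS 20	/* assumed width of the counter part */

/* Hypothetical per-inode state. */
struct iv_state {
	uint64_t last_ms;	/* mtime (in ms) seen by the last getattr */
	uint32_t counter;	/* changes seen by getattr within last_ms */
	bool changed;		/* file modified since the last getattr */
};

/* Any change to the file just sets the flag; no new number yet. */
static void iv_modify(struct iv_state *s)
{
	s->changed = true;
}

/*
 * getattr: only now do we pick a new number.  A change seen in the
 * same millisecond as the previous one bumps the counter; a change in
 * a different millisecond resets it.
 */
static uint64_t iv_getattr(struct iv_state *s, uint64_t mtime_ms)
{
	if (s->changed) {
		if (mtime_ms == s->last_ms)
			s->counter++;
		else
			s->counter = 0;
		s->last_ms = mtime_ms;
		s->changed = false;
	}
	return (s->last_ms << IV_COUNTER_BITS) | s->counter;
}
```

	As noted above, if mtime goes backwards the value returned here
	goes backwards too, and the sketch does not guard against the
	counter overflowing its bits.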

16/ Update locking.doc

17/ cluster_flush calls lafs_cluster_allocate calls lafs_add_block_address
    calls  lafs_iolock_written.  How do we know that won't block on cluster_flush?

18/ See if per-fs shrinker is available yet and consider it for index blocks.

19/ Review WritePhase and make sure it is used properly.

20/ Review places where we update blocks and be sure they are not in writeout
    or in a different phase.

21/ Review and document all lru uses (locking.doc) and make sure they are
    all locked properly.

22/ Check possible failures:
    - thread allocation
    - memory allocation
    - reading critical metadata
    ...

23/ Rebase on 2.6.latest.  Done for .38

24/ load/dirty block0 before dirtying any other block in depth=0 file,
    else we might lose block0

25/ use kmem_cache for
        datablock
        indexblock - probably a mempool because we cannot allow failure when
                     splitting an index block.
        skippoint (mempool?)
        segsum - mempool??
        others?

26/ Review seg addressing code for 2-D geometries.

27/ Allow ranges of holes in pending_addr so partial truncate can be more efficient.

28/ Make sure youth blocks are always referenced properly.

29/ Make sure new segments are referenced properly.  I think there might be
    some double referencing.

30/ Decide when to use VerifyNULL or VerifyNext2

31/ Implement non-logged files

DONE 32a/ Store access time in a file
32b/ Make it a non-logged file
32c/ Avoid writing out dirty atime file blocks when not necessary.
      i.e. keep the page clean and active, and trigger 'write'
     on release_page.

33/ Support quota : group / user / tree

34/ handle subordinate filesystems:
     ss[]->rootdir needs to be array or list
     lafs_iget_fs needs to understand this

35/ review snapshots:
      - peer lists and cleaning
      - how to create
      - failure modes
      - how to destroy

36/ review roll-forward

DONE 36a/  make sure files with nlink == 0 are handled well
DONE 36b/  sanity check before trusting clusters
DONE 36c/ handle miniblocks which create new inodes.
DONE 36d/ Handle DescHole in roll_block
DONE 36e/ When dirtying a block in roll_block, maybe use writeback rather
     than just iolock, for consistency...
DONE 36f/ What to do if table becomes full when add_block_address in
     roll_block ??
DONE 36g/ Write roll_mini for directories.
DONE 36h/ In roll_one, use the cluster counting code to find block number and
     make sure we don't exceed the segment.
DONE 36i/ add more general error checking to lafs_mount - 
            lafs_iget orphans and segsum.  Check type is correct.
         errors from lafs_count_orphans or lafs_add_orphans.
         alloc_page failure for chead - maybe allocate something bigger??

37/ Configure index block hash_table at run time based on mem size??

38/ striped layout
        review everything needed for safe RAID5

39/ How to handle all different IO errors

40/ Guard against data corruption at every level.

41/ Add checksums on index blocks and dir blocks and Inodes and ???

42/ Store duplicates of some blocks.  At least index and inode.

43/ Handle writepage on mem-mapped page, adding new credits or unmapping.
    Make sure ->page_mkwrite sets up credits properly

44/ Examine created filesystem and make sure everything looks good.

DONE 45/ mkfs.lafs

46/ fsck.lafs

47/ Write good documentation

48/ Review all code, improve all comments, remove all bugs.

49/ measure performance

50/ Support O_DIRECT

51/ Check support for multiple devices
    - add a device to a live array
    - remove a device from a live array

DONE 52/ NFS export

53/ 'overlay' support
        So I mount one device read-only and another device
        writable which gets all the updates.  metadata on first
        device not updated.

54/ cluster support - is this possible?

55/ is any useful variant of reflink  possible?

56/ Review roll-forward completely.

57/ learn about FS_HAS_SUBTYPE and document it.
    This is for fuse in particular so users can know the real type

58/ Consider embedding symlinks and device files in directory.
    Need owner/group/perm for device file, but not for symlink.
    Can we create unique inode numbers?
    hard links for dev-files would be problematic.
    What do we gain?  Maybe something for short symlinks.
    40 seems a good length to get 70% of symlinks.

59/ Fix NeedFlush handling so we don't drop-then-retake
    a mutex as that isn't sensible.

60/ Introduce some fs state recording that fsck is needed and possibly
    identifying what sort of fsck.

61/ Try to make the inode struct smaller - maybe move some of the
    fs metadata into a separately-allocated struct.

62/ System/trusted extended attributes:
         fileset max size
         directory hash/seed
         
63/ user extended attributes.

64/ wonder if index blocks can be flushed out by memory pressure somehow.
   e.g. if a data block is written by reclaim, flag the index block.
   When a flagged index block has no children, it is incorporated and written.
    ??

65/ review why lafs_allocated_block needs the new_parent label.  Should not
   lafs_incorporate leave all parents dirty? Maybe it is just the need for
   B_Realloc - so maybe lafs_incorporate should leave the new block either
   realloc or dirty rather than lafs_allocated_block doing it.?
   See also 15ad below.

66/ Delay writeout of directory updates until an fsync.  If a checkpoint happens
   first, discard the updates (and fsync waits for checkpoint to complete).
   If a cross-directory rename happens care is needed:  either flush updates
   first or ensure that a flush does happen before the cross-directory
   update is flushed.
   Note that if the target of a rename is a directory, it must also be fully
   flushed before the rename can proceed.

26June2010
 Investigating 5a

   Normal sequence is to surrender UnincCredit, then to clear Dirty,
    then to write.  If anyone re-dirties after Dirty is clear, they
    will naturally have to add an UnincCredit having reserved space first.
   However it seems that the Cleaner gets in the way as the block in question
   has just previously been cleaned, which consumed the UnincCredit
   Do we need ReallocUnincCredit?? I hope not.
   We generally need a way to say "I might want to write to this" so cleaner
   doesn't write it early.
   For index blocks that is pincnt.  For data it is 'PinPending'.
   This keeps index blocks off clean_leafs until they are ready, but
   not data blocks.
   And in any case, TypeSegmentMap blocks don't get PinPending as they
   get written *after* the checkpoint.  That is a rather ugly exception.
   Maybe we make their different handling more explicit.  We put them on
   a separate list unpinned so the rest of the checkpoint can complete.
   Then we flush that list?
   Then PinPending keeps them off the clean_leafs list.

   So to clarify the plan:  If a block is already Pinned to this phase,
   we can "clean" it by marking it Dirty rather than Realloc.  This is
   appropriate for blocks that are likely to change soon (as blocks written
   to the cleaner segment are not likely to change soon).
   For data blocks we take "PinPending" to say "might change soon".  For
   index blocks ... we don't know if it is pinned by Realloc or Dirty or
   PinPending children.  So we set Realloc and wait for any children to
   be unpinned for whatever reason.  If it is only pinned by Realloc blocks,
   it will end up on clean_leafs and be processed to the cleaner segment.
   If it is pinned by anything else it will be found by the checkpoint and
   processed to the new-data segment.

   So Index blocks always get Realloc, PinPending blocks get Dirty,
   Other data blocks get Realloc.  Good.
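
   The rule just stated, as a user-space C sketch.  The struct and the
   enum are hypothetical stand-ins for the real block flags; only the
   three-way decision is taken from the text above.

```c
#include <assert.h>
#include <stdbool.h>

enum clean_mark { MARK_REALLOC, MARK_DIRTY };

struct toy_blk {
	bool is_index;		/* index block rather than data block */
	bool pin_pending;	/* data block that "might change soon" */
};

/* How the cleaner marks a block it wants to relocate. */
static enum clean_mark cleaner_mark(const struct toy_blk *b)
{
	assert(b);
	if (b->is_index)
		return MARK_REALLOC;	/* index blocks always get Realloc */
	if (b->pin_pending)
		return MARK_DIRTY;	/* goes to the new-data segment */
	return MARK_REALLOC;		/* goes to the cleaner segment */
}
```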

   Must review PinPending usage... always set, then maybe-dirty inside
   checkpoint lock.  In cases of unlocked usage (inode map) we don't clear
   PinPending until checkpoint so it has longer exposure to Realloc->Dirty.
   It is likely to be changing though, so not a big cost.  Even good.

   Could make the distinction later.  PinPending blocks don't go on
   clean_leafs.  So if they are still realloc at the checkpoint, we Realloc
   to the new-data segment.  This has the same net effect but is arguably
   cleaner.  It means that if a realloc block gets pinpending set, it
   immediately stops being a clean leaf and so is safe.
   So: just keep PinPending blocks off clean_leafs.  Keep them on phase_leafs.
   However there is no mechanism for moving things from phase_leafs to clean_leafs.
   So maybe they stay on clean_leafs, but when the cleaner gets to them, it
   dirties them and drops them.... that would work.

   So; if cleaner finds a block (on clean_leafs during cleaner-flush) which is
   Dirty or PinPending, it makes sure it is Dirty and drops it for phase_leafs
   to pick up.

   BUT:  Does this work for TypeSegmentMap blocks?  They aren't PinPending.

   We could treat them specially in the cleaner.  Or we could set PinPending
   and pin them to the phase, but treat them differently in checkpoint.
   If we gathered them onto a separate list, then flush the list after
   the phase had changed, it might be quite neat.  No more getting writepages
   to do our work for us.
   They would need to be re-pinned to the next phase, then written out.
   Or just unpinned, and let seg_inc re-pin as appropriate... except that
   seg_inc is too late to pin.  It dirties.  We need to pin when we get
   SegRef.  We currently reserve but we don't pin.
   We really do need to phase_flip these segmentmap blocks.  But that requires
   getting extra credits, and Pinning everything if new credits are not available.
   And we don't really have a good list of 'everything' that depends on a segment.
   But seeing the space_alloc never fails for these...
   So Pin them, and flip them with AccountSpace

   So:
    - split out common 'flip' code
    - add 'flip' for data blocks
    - create list of accounting blocks and flip accounting file blocks onto
      that list during checkpoint
      Flush should write that list,  not the files.
    - Get cleaner to ignore pinpending blocks, marking them dirty.
    - pin segusage blocks while ref on them is held.
    - writepage no longer needs special case for TypeSegmentMap, just PinPending
    - lafs_prealloc just tests PinPending


   [[aside: quota files seem to be handled like segmentmap files.  Is that 
     right??
     We only track usage of data blocks based on various 'owners' of the file.
     We need to know if a block was written in one phase or the next, and
     only count blocks written/allocated in the one.
     Data blocks can slip into 'this' phase quite late - any time before the
     parent is finally incorporated.  So we don't write quota blocks
     until checkpoint is done.  So yes, they are like SegmentMap
   ]]


  segsums....
   If there are hundreds of snapshots, then a block being cleaned (whether to
   cleaner segment or new-data segment) could affect hundreds of segment
   usage counters.  That would be clumsy to work with.  Every block in the
   free table would need to hold references to hundreds of blocks.  This
   is do-able and might not be a big waste of space, but is still clumsy.
   I could change the arrangement for accounting per-snapshot usage by having
   a limited number of snapshots and having all the counters for one segment
   in the one block. So a 1024byte block could hold 512 counters (youth plus
   base plus 510 snapshots).  Half that if I go to 4byte counters.
   In the more common case of 32 snapshots, could fit counters for 8 segments in
   a block.  This means using space/io for all possible snapshots rather than
   all active snapshots.  It would also mean having a fairly fixed upper limit.
   I wonder what NILFS does....
   Worry about this later.
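
   The arithmetic above as a small C sketch.  The helper names are made
   up.  The figure of 8 segments per block comes out with 4-byte counters
   and 32 counter slots per segment in total; whether youth and base sit
   inside or on top of the 32 is not settled here, so that reading is an
   assumption.

```c
#include <assert.h>

/* How many usage counters of counter_bytes each fit in one block. */
static unsigned counters_per_block(unsigned block_bytes,
				   unsigned counter_bytes)
{
	assert(counter_bytes > 0);
	return block_bytes / counter_bytes;
}

/* How many segments one block covers if each needs 'slots' counters. */
static unsigned segments_per_block(unsigned block_bytes,
				   unsigned counter_bytes, unsigned slots)
{
	assert(slots > 0);
	return counters_per_block(block_bytes, counter_bytes) / slots;
}
```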

  Still trying to get pinning of SegmentMap blocks right.
  Normally we need a phase-lock when pinning a data block so that we
  don't lose the pinning before we dirty.  But as we phase_flip
  these it doesn't matter... So just add that to the test??

28June2010
 Reflecting on 5c - dirty_inode might find InoIdx pre-allocated but
  datablock not, and doesn't cope.
  We either prealloc both, which seems clumsy, or always defer
  to InoIdx if it is present and pinned.
  lafs_prealloc does both Index and Data blocks for inode.
  But Data could lose at writeout while index will replenish at
  phase_flip, so maybe not a good idea.
  If lafs_allocate_cluster finds a Dirty InoIdx it will copy the Dirty
  credits across to the data block (on non-cleaning segments) so the
  Data block doesn't need to have credits.

  dirty_inode gets called:
     {__,}mark_inode_dirty{,_sync}
     inode_{inc,dec}_link_count
     [[various quota ops]]
    inode_setattr
    touch_atime
      file_accessed
    file_update_time
      generic_file_...write
      do_wp_page

  updates through inode_setattr go to lafs_setattr so the
  data block will be pinpending and the checkpoint lock will be held.

  updates through inode_*_link_count happen in filesystem and the inode data
   block is PinPending, or a block in the file is pinned and will be
   dirty, so it will get written.

  updates through touch_atime or file_update_time are unexpected and
  cannot be prepared for.  file_update_time changes will be caught by
  normal file writeout.  atime changes will be lost until we get the
  atime file working.

  So:
    dirty_inode cannot change the block as it might be in writeout, and
    it cannot lock anything as it might be in touch_atime which shouldn't
    block and cannot fail.
    So just set I_Dirty and use that to flush inode to db at writeout.
    Any changes which must be in the next phase will come via setattr and
    so will wait for incompatible changes to be written out.
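
  A user-space C sketch of that conclusion.  The toy_ types and names
  are invented for illustration and stand in for the real inode and
  datablock; the only point shown is that dirty_inode sets a flag and
  the copy into the data block is deferred to writeout.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct toy_dblock {
	uint64_t atime, mtime;	/* copy of the inode fields in the block */
	bool in_writeout;	/* block must not be modified while set */
};

struct toy_inode {
	uint64_t atime, mtime;	/* in-memory values */
	bool i_dirty;		/* stand-in for I_Dirty */
	struct toy_dblock db;
};

/* May run under touch_atime: must not block, fail, or touch db. */
static void toy_dirty_inode(struct toy_inode *ino)
{
	ino->i_dirty = true;
}

/*
 * Run when the data block is being prepared for writeout.
 * Returns true if the block contents were updated.
 */
static bool toy_flush_inode_to_db(struct toy_inode *ino)
{
	assert(!ino->db.in_writeout);
	if (!ino->i_dirty)
		return false;
	ino->db.atime = ino->atime;
	ino->db.mtime = ino->mtime;
	ino->i_dirty = false;
	return true;
}
```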

 Reflecting on 7c - cluster_flush might find ->my_inode is NULL.
  my_inode is set
     lafs_import_inode
         iget and mount-time stuff
     lafs_inode_dblock

  my_inode is cleared
    When I_Destroyed is set and the last ref on the block is dropped
    When inode_map_new_prepare claims an inodeblock

  So we could easily not have a my_inode - e.g. just cleaning the data block.
  ->my_inode cannot disappear while we hold the block, so a test is safe.


 ----------------------------------------------
 Space reservation and file-system-full conditions.

  Space is needed for everything we write.
  Some things we can reject if the fs is too full
  Some things we can delay when space is tight
  Some things we need to write in order to free up space.
  Others absolutely must be written so we need to always have
  a reserve.

  The things that must be written are
       - cluster header  - which we never allocate
       - some seg-usage and youth blocks - and quota blocks
         These continually have credit attached - it is a bug if there
          are not enough. (We hit this bug)

  Things that we need to write to free up space are
   any block - data or index - that the cleaner finds.

  Things that we can delay, but not fail, are any change to a block that
   has already been written or allocated.

  When space is needed it can come from one of three places.
     - the remainder of the current main segment
     - the remainder of the current cleaner segment
     - a new segment.

  Only Realloc blocks can go to the cleaner segment, so the
  'must write' blocks cannot go there, so unused + main must have enough
  space for all those.
  Realloc blocks can go anywhere - we don't need a cleaner segment if things
  are too tight.

  When we run out of space there are several things we can do to get more:
   - incorporate index blocks.  This tends to free up uninc-credits which
     are normally over-allocated for safety.
   - cluster_allocate/cluster_flush so more blocks get allocated and so
     more can be incorporated.  See above.  This is probably most helpful
     for data blocks.
   - clean several segments into whole cleaner segments or into the main segment.
  Much of this happens by triggering a snapshot, however we should only do that
  when we have full cleaner-segments (or zero cleaner segments).

  When cleaning we don't want to over-clean.  i.e. we don't want to commit
  any blocks from a second segment if that will stop us from committing blocks
  from the first segment.  Otherwise we might use one cleaning segment up by
  making 4 half-clean.  This doesn't help.


  So: we reserve multiple segments for the cleaner, possibly zero.

  We clean up to that many segments at a time, though if that many is zero,
  we clean one segment at a time.
  lafs_cluster_allocate only succeeds if there was room in an allocated segment.
  If allocating a new segment fails, the cluster_allocate must fail.  This
  will push extra cleaning into the main segment where allocations must not
  fail.

  The last 3(?) [adjusted for number of snapshots] segments can only be allocated
  to the main segment, and this space can only be used for cleaning.
  Once the "free_space - allocated_space"  drops below one segment, we 
  force a checkpoint.  This should free up at least one segment.

  We need some point at which we stop cleaning because the chance of finding
  something to clean is too low. At that point all 'new' requests definitely
  become failures.  They might do earlier too.
  Possibly at some point we start discounting youth from new usage scores so
  that the list becomes sorted by usage.


  Need:
    cut-off point for free_seg where we don't allow cleaner to use segments
      3? 4?

    event when we start using fixed '0x8000' youth for new segment scores.
       Maybe when we clean a segment with usage gap below 16 or 1/128
    event when we stop doing that.
       Maybe when free_segs cross some number - 8?

    point when alloc failure for NewSpace becomes ENOSPC
       same as above?

    point when we don't bother cleaning
      no cleaner segments can be allocated, and checkpoint did not increase
      number of clean segments (used as many as freed).
      Clear this state when something is deleted.


   Allocations come out of free_blocks which does not include those
   segments that have been promised to the cleaner.
   CleanSpace and AccountSpace cannot fail.
     We *know* not to ask for too many - cleaner knows when to stop.
   ReleaseSpace fails (to be retried) if available is below a threshold,
     providing the cleaner hasn't been stopped.
   NewSpace fails if below a somewhat higher threshold.  If we haven't entered
     emergency cleaning mode, these requests fail -ENOSPC, else -EAGAIN.

   
   Possibly limit some 'cleaner' segments to data only??


  So: work items.
    - change CleanSpace to never fail, but cluster_allocate new_segment
      can for cleaner segment.  This is propagated through lafs_cluster_alloc
    - cleaner pre-allocates cleaner segments (for new_segment to use)
      and only cleans that many segments at a time.
    - introduce emergency cleaning mode which causes ENOSPC to be returned
      and ignores 'youth' on score.
    - pause cleaner when we are so short of space that there is no point
      trying until something is deleted.
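
  A sketch of the allocation policy from the rules and work items above,
  as user-space C.  The four request names come from these notes; the
  struct, its fields and any watermark values are made up.  For NewSpace
  it assumes emergency mode is what turns a retryable -EAGAIN into
  -ENOSPC, as the work item says; swap the two if the other ordering is
  what is wanted.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

enum space_why { CleanSpace, AccountSpace, ReleaseSpace, NewSpace };

struct space_state {
	long free_blocks;	/* excludes segments promised to cleaner */
	long allocated_blocks;	/* already promised to earlier requests */
	long release_watermark;	/* ReleaseSpace fails below this */
	long new_watermark;	/* NewSpace fails below this (higher) */
	bool emergency;		/* emergency cleaning mode */
	bool cleaner_stopped;	/* nothing left worth cleaning */
};

static int space_alloc(struct space_state *st, long blocks,
		       enum space_why why)
{
	long avail = st->free_blocks - st->allocated_blocks - blocks;

	switch (why) {
	case CleanSpace:
	case AccountSpace:
		/* Never fail: it is a bug if there is not enough. */
		assert(avail >= 0);
		break;
	case ReleaseSpace:
		if (avail < st->release_watermark && !st->cleaner_stopped)
			return -EAGAIN;		/* to be retried */
		break;
	case NewSpace:
		if (avail < st->new_watermark)
			return st->emergency ? -ENOSPC : -EAGAIN;
		break;
	}
	st->allocated_blocks += blocks;
	return 0;
}
```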

30june2010
  notes on current issue with checkpoint misbehaving and running out of
  segments.

  1/ don't want to cluster-flush too early.  Ideally wait until segment is
   full, but we currently hold writeback on everything so we cannot delay
   indefinitely.
  2/ row goes negative!!  let's see...

    seg_remainder doesn't change the set, but just returns
        the remaining rows times the width

    seg_step  move nxt_* to *, stepping to the next ... row?
             save current as 'st_*

    seg_setsize - allocate space in the segment for 'size' blocks plus
         a bit to round off to a whole number of table/rows
               nxt_table nxt_row

    seg_setpos initialises the seg to a location and makes it empty,
       st_ and nxt_ are the same

    seg_next reports address of next block, and moves forward.

    seg_addr  simply reports address of next block

   So the sequence should be:

     seg_setpos  to initialise
     seg_remainder as much as you want
     seg_setsize when we start a cluster
     seg_next  up to seg_remainder times
     seg_step  to go to next cluster (when not seg_setpos).
            or maybe just before seg_setpos

     Need cluster_reset to be called after new_segment, or after we
     flush a cluster but don't need a new_segment.
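
     A much simplified user-space model of that sequence, to make the
     ordering concrete.  It assumes a single table, row-major addresses
     starting at 0, and that seg_next may only run inside the rows
     claimed by seg_setsize; the struct layout is invented and is not
     the real segment position structure.

```c
#include <assert.h>

struct toy_seg {
	int rows, width;	/* geometry of one segment (single table) */
	int st_row;		/* position saved by the last seg_step */
	int row, col;		/* position of the next block to hand out */
	int nxt_row;		/* first row after the current cluster */
};

/* Point at the start of an empty segment: st_, current, nxt_ agree. */
static void seg_setpos(struct toy_seg *s, int rows, int width)
{
	s->rows = rows;
	s->width = width;
	s->st_row = s->row = s->nxt_row = 0;
	s->col = 0;
}

/* Blocks not yet claimed by any cluster: remaining rows times width. */
static int seg_remainder(const struct toy_seg *s)
{
	return (s->rows - s->nxt_row) * s->width;
}

/* Claim room for 'size' blocks, rounded up to whole rows. */
static void seg_setsize(struct toy_seg *s, int size)
{
	int need = (size + s->width - 1) / s->width;

	assert(s->row + need <= s->rows);	/* must fit in the segment */
	s->nxt_row = s->row + need;
}

/* Report the address of the next block without moving. */
static int seg_addr(const struct toy_seg *s)
{
	return s->row * s->width + s->col;
}

/* Report the address of the next block and move forward. */
static int seg_next(struct toy_seg *s)
{
	int addr = seg_addr(s);

	assert(s->row < s->nxt_row);	/* still inside the cluster */
	if (++s->col == s->width) {
		s->col = 0;
		s->row++;
	}
	return addr;
}

/* Save the current position as st_ and move on to nxt_row. */
static void seg_step(struct toy_seg *s)
{
	s->st_row = s->row;
	s->row = s->nxt_row;
	s->col = 0;
}
```

     With the asserts in place, calling seg_setsize for more than
     seg_remainder reports trips immediately instead of letting the
     row count run past the end of the segment.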

   I think I'm cleaning too early ...  I am even cleaning
   the current main segment!!!!

   OK, I got rid of the worst bugs.  Now it just keeps cleaning
   the same blocks in the current segment over and over.
   2 problems I see
      1/ it cleans a segment that it should not touch
           We need to  avoid cleaner segment increasing the
             checkpoint youth number.
      2/ it has 6 free segments and doesn't use them

   clean_reserved is 3 segments, < 4, so free_block <= allocated+ watermark
   watermark is 4 segs, so free < 4.  So we have 3 allocated to cleaner,
   3 in reserve and so nothing much to clean!

   The heuristic for returning ENOSPC is not working.  Need something more
   directly related to what is happening.
   Maybe if cleaning doesn't actually increase free space.

   !Need to leave segments in the table until we have finished
   writing to them, so they cannot be cleanable. - DONE

   WAIT - problem.  If cleaner segment is part-used, the alloc_cleaner_segs
   doesn't count that.  Bad?

   When nearly full we keep checkpointing even though it cannot help.
   Need clearer rules on when there is any point pushing forward.
   Need to know when to fail requests.

02 july 2010

  I am wasting lots of space creating snapshots that don't serve any
  purpose.
  The reasons for creating a snapshot are:
    - turn clean segments into free segments
    - reduce size of required roll-forward
    - possibly flush all inode updates for 'sync'.

  We currently force one when
       newblocks > max_newblocks
          max is 1000 , newblocks is never reset!
          probably make that a number of segments.
       lafs_checkpoint_start is called
          when cleaner blocks, and space is available
          at shutdown
          on write_super if s_dirt
             __fsync_super before ->sync_fs
               freeze_bdev
               fsync_super
                 fsync_bdev
                 do_remount_sb
             generic_shutdown_super before put_super if s_dirt
             sync_supers if s_dirt
               do_sync
             file_sync !!! if s_dirt

      I think I should move checkpoint_start to
            ->sync_fs


 After testing
  - blocks remaining after truncate - one index and 1-4 data
  - truncate finds blocks being cleaned
         FIXED - move setting of I_Trunc
  - orphans aren't being cleaned up sometimes.
        Hacked by forcing the thread to run.
  - parent of index block has depth==1
        Don't reduce depth while dirty children.
        Probably don't want uninc either?

  - some sort of deadlock? lafs_cluster_update_commit_both
     has got the wc lock and wants to flush
    writepage also is flushed.
   Not sure what the blockage is.
   I think the writepage is the one in cluster_flush, and it
    is blocking

  - Async is keeping 16/0 pinned during shutdown
03July2010

  Testing overnight with 250 runs produced:
 - blocked for more than 120 seconds
      Cleaner tries to get an inode that is being deleted
      and blocks, so inode_map_free is blocked waiting for
      checkpoint to finish - deadlock.
     Need to create a ->drop_inode which provides interlock with
     cleaner/iget
 
    But this is hard to get right.
    generic_forget_inode needs to write_inode_now and flush all changes
    out and then truncate the pages off so the inode will be
    empty and can be freed.  But flushing needs the cleaner thread
    which can block on the inode lookup.
    Ahh.... I can abuse iget5_locked.
    If test sees I_WILL_FREE or similar, it fails and sets a flag.
    if the flag was set, then 'set' fails
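
    A user-space model of that trick.  toy_iget5 only imitates the
    lookup-or-create shape of iget5_locked (test returns non-zero on a
    match, a failing set makes the lookup return NULL); the cache array,
    the will_free field and the -EBUSY return are assumptions made for
    the sketch, not module code.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_inode {
	unsigned long inum;
	bool in_use;
	bool will_free;		/* stand-in for I_WILL_FREE or similar */
};

struct toy_key {
	unsigned long inum;
	bool being_freed;	/* set by test, read by set */
};

/* test: match on inum, but refuse an inode that is on its way out. */
static int toy_test(struct toy_inode *ino, void *data)
{
	struct toy_key *k = data;

	if (ino->inum != k->inum)
		return 0;
	if (ino->will_free) {
		k->being_freed = true;
		return 0;
	}
	return 1;
}

/* set: refuse to create a second copy while the old one is dying. */
static int toy_set(struct toy_inode *ino, void *data)
{
	struct toy_key *k = data;

	if (k->being_freed)
		return -EBUSY;
	ino->inum = k->inum;
	return 0;
}

/* Toy lookup-or-create: returns NULL if 'set' fails or cache is full. */
static struct toy_inode *toy_iget5(struct toy_inode *cache, size_t n,
				   int (*test)(struct toy_inode *, void *),
				   int (*set)(struct toy_inode *, void *),
				   void *data)
{
	struct toy_inode *free_slot = NULL;
	size_t i;

	for (i = 0; i < n; i++) {
		if (!cache[i].in_use) {
			if (!free_slot)
				free_slot = &cache[i];
		} else if (test(&cache[i], data)) {
			return &cache[i];
		}
	}
	if (!free_slot || set(free_slot, data))
		return NULL;
	free_slot->in_use = true;
	free_slot->will_free = false;
	return free_slot;
}
```

    The caller (the cleaner) sees NULL plus being_freed and can skip the
    inode instead of blocking on it.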


 - block.c:504 DONE (I think).
    unlink/delete_commit dirties a block without credits
    It could have been just cleaned..
    It looks like it was in Writeback for the cleaner when
    unlink pinned and allocated it....
    or maybe it was on a cluster (due to writepage) when
    it was pinned.  Then cluster_flush cleared dirty ... but
    it should still have a Credit.
    Maybe I should iolock the block ??

    On reflection it wasn't cleaning, just tiny clusters
    of recent changes which were originally written as tiny
    checkpoints. Maybe lots of directory updates triggered the clusters.
    I guess writepage is being called to sync the directory???
    Or maybe the checkpoint was pushed by s_dirt being set.

    So use PinPending and iolock to protect dir blocks from writepage.
    
 - dir.c:1266 DONE
    dir handle orphan finds a block (74/0) which is not
    valid
    This can happen if orphan_release failed to reserve a block.
    We need to retry the release.
 - inode.c:615
    index block and some data blocks still accounted to deleted file.

    No theory on this yet.  Always one index block and a small number
    of data blocks.  Maybe the index block looked dirty, but was then
    incorporated with something that was missed from the children list...
    Or maybe I_Trunc is cleared a bit early...
    Or trunc_next advanced too far?? or too soon
    ??

 - segments.c:640 DONE
     prealloc in the cleaner finds all 2315 free blocks allocated.
     no clean reserved.
    Need to be able to fail CleanSpace requests when cleaner_reserve
    is all gone.??

    or just slow down the cleaner to one segment per checkpoint when
    we are tight..  Hope that works.
 - super.c:699
     async flag on 16/0 keeping block pinned
   Maybe clear Async flag during checkpoint.  Cleaner won't need it
   No, just ensure to clear Async on all successful async calls.
   
     orphan file 8/0 has orphan reference keeping parent pinned
      [cfb64c90]8/0(1782)r1E:Valid,SegRef,PhysValid orphan(1)
   Orphan handling is failing to get a reservation to write out the
   orphan file block?  Not convincing as there should be lots of space
   at unmount, and 'orphan sleeping' has become empty.

 - Show State
     orphan inode blocked by leaf index stuck in writeback:
   [cfb68460]331/0(NoPhysAddr)r2F:Index(1),Pinned,Phase1,Valid,Dirty,SegRef,CI,CN,CNI,UninCredit,EmptyIndex{0,0}[0] primary(1) leaf(1) Leaf1(5) 
   [cfb28d20]331/336(NoPhysAddr)r2F:Index(1),Pinned,Phase1,Valid,Dirty,Writeback,Async,UninCredit,PrimaryRef{0,0}[0] async(1) cluster(1) wc[0][0] 

    This is in the write-cluster waiting to be flushed


9July2010
  Review B_Async.
    If a thread wants async something, it 
         - sets B_Async
         - checks if it can have what it wants.
           + if not, fail
           + if so, clear B_Async and succeed

    If a thread releases something that might be requested Async,
         it doesn't clear Async, but wakes up *the*thread*.

    This applies to
        IOLock      - iolock_block
        Writeback   - writeback_done, iolock_written
        Valid        - erase_dblock, wait_block
        inode I_*   - iget / drop_inode

     orphan handler, cleaner, segscan - all in the cleaner thread.
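
    The B_Async protocol for the IOLock case, as a single-threaded C
    sketch.  The bit values and toy_ names are invented, and plain
    integer operations stand in for whatever atomic bit operations the
    real code needs; only the set / check / clear-on-success order and
    the "releaser wakes but does not clear" rule are from the notes.

```c
#include <assert.h>
#include <stdbool.h>

#define B_IOLock	(1u << 0)
#define B_Async		(1u << 1)

struct toy_block {
	unsigned flags;
};

/* Async requester: returns true if the lock was obtained. */
static bool toy_iolock_async(struct toy_block *b)
{
	b->flags |= B_Async;		/* announce interest first */
	if (b->flags & B_IOLock)
		return false;		/* fail; B_Async stays set */
	b->flags |= B_IOLock;
	b->flags &= ~B_Async;		/* success: clear B_Async */
	return true;
}

/* Releaser: returns true if the waiting thread must be woken. */
static bool toy_iounlock(struct toy_block *b)
{
	assert(b->flags & B_IOLock);
	b->flags &= ~B_IOLock;
	return b->flags & B_Async;	/* B_Async itself is left alone */
}
```

    Setting B_Async before the check is what closes the race: a release
    that happens between the check and the sleep still sees the flag and
    wakes the thread.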

  107 runs,
   2 hit 'Show State' with a blocked orphan inode.
    Two children, one EmptyIndex, one PrimaryRef, Async,Writeback
    Both NoPhysAddr

   Several runs blocked in cluster_flush or waiting for writeback.

   - first case: looks like cluster flush should run but doesn't.
        cluster_flush runs:
           checkpoint, cleaner, cluster_allocate when full, update,
           writepage, sync_page
        So we have no timeout or other flush.
      I guess if we are waiting for writeback, we need to trigger a
      cluster_flush.

   - other case - cluster_flush was called but is waiting for pending count
       to go down.
       Looks like cluster_reset shouldn't be changing pending_next

   New hang.  Orphans not being processed:
        inode, because InoIdx is on leaf and checkpoint isn't pushing
        it along.
        dir block 0 is Dirty leaf

     Maybe we failed to get a mutex, and mutex_unlock doesn't wake us.
     
10July2010
  Over night it looks *very* good.
  Have one infinite loop with 31770 repeats of
  ORPH: [cfbe0000]0/328(2326)r4E:Valid,Dirty,Async,UninCredit,Claimed,PhysValid,
                   Orphan(0) async(1) delete_inode(1) orphan_list(1) drop(1)

  So either stuck in truncate_inode_pages, lafs_add_orphan, or inode_map_free
    lafs_add_orphan too short.
    tracing shows after truncate_inode_pages.
    must be blocked in inode_map_free - maybe use AccountSpace??
   But why isn't the truncate progressing?
   Probably same reason:  No ReleaseSpace available.
   Maybe we aren't cleaning because there is a free segment, and
   we aren't checkpointing because there aren't enough yet...

   Probably the cleaner has halted while CleanerBlocks - fix that.

  - 0/74 is a stuck orphan because 74/0 is a dirty leaf going nowhere..
        Need a checkpoint to release the orphan?
   ditto for 0/331 - 331/0
    XX/0 is InoIdx

VFS: Busy inodes after unmount of hdb. Self-destruct in 5 seconds.  Have a nice day...
This was pinned: [ce5914f0]16/0(2)r8F:Pinned,Phase0,PinPending,Valid,C,CI,CN,CNI,UninCredit,PhysValid leaf(1) intable(6) release(1)
 [ce5914f0]16/0(2)r8F:Pinned,Phase0,PinPending,Valid,C,CI,CN,CNI,UninCredit,PhysValid leaf(1) intable(6) release(1) Leaf0(0)
------------[ cut here ]------------
kernel BUG at /home/neilb/work/nfsbrick/fs/module/super.c:698!

Forgetting 0 0
724 != 7  (st->free.cnt after segdelete, close_segment, close_all)
------------[ cut here ]------------
WARNING: at /home/neilb/work/nfsbrick/fs/module/segments.c:844 lafs_check_seg_cn

we called segdelete on something that was on the freelist.
This happens when the final cluster starts a new segment.
Need to improve the fix though.


 lafs_inode_handle_orphan can make progress without leaving
 anything async.  Maybe we need a return status:
  -EAGAIN - try after async
  -ENOMEM - try some time soon - hope memory will be better
  0 we called orphan_release
  anything else loops.
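
 That status convention as a small C sketch.  The enum and the function
 name are hypothetical; it only shows how the caller in the cleaner
 thread might dispatch on the proposed return values.

```c
#include <assert.h>
#include <errno.h>

enum orphan_next {
	ORPHAN_WAIT_ASYNC,	/* sleep until an async event wakes us */
	ORPHAN_RETRY_SOON,	/* try again after a short delay */
	ORPHAN_DONE,		/* orphan_release was called */
	ORPHAN_LOOP,		/* progress was made: call again now */
};

/* Map the handler's return status to what the caller does next. */
static enum orphan_next orphan_dispatch(int status)
{
	switch (status) {
	case -EAGAIN:
		return ORPHAN_WAIT_ASYNC;
	case -ENOMEM:
		return ORPHAN_RETRY_SOON;
	case 0:
		return ORPHAN_DONE;
	default:
		return ORPHAN_LOOP;
	}
}
```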


 - when we allocate a segment in the last checkpoint we don't
   take references properly.

 - orphan handle spinning on: 

  ORPH: [ce545f08]0/290(1663)r4E:Valid,Dirty,Async,UninCredit,Claimed,PhysValid,Orphan(0) async(1) delete_inode(1) orphan_list(1) drop(1)
   26402 calls.
   stuck in delete_inode?? ?


  never-ending cleaning? Maybe just computer slow ??

11July2010  - on plane to Prague.
  How can we safely access ->iblock?
   normally iolock, but how do we get iolock?
   - flush data to inode
   - cluster flush takes private_lock
   - private_lock is used to set to null.
  I guess we use private_lock to get a reference
  then iolock and revalidate
  but I can probably test for NULL at any time? though that can change under private_lock
  If we own a reference to a child with a parent, then we can use
   rcu_dereference to get a ref which might change

12july2010

 ->write_inode is called by write_inode() called by __sync_single_inode
  to handle I_DIRTY_SYNC|I_DIRTY_DATASYNC after do_writepages
 Do we care?

 change to addresses we already handle with checkpoints
 change due to setattr we can handle directly if we want
 that just cleans mtime/ctime and atime.
   mtime/ctime calls ->dirty_inode
   as does atime

 So:
  getattr changes set I_Dirty so that when cluster_allocate
  happens all the changes get saved.
  
  when dirty_inode is called, we set I_Dirty but don't dirty
  the inode block.
  If anything happened to justify an inode write, it will
  be dirty anyway.  If it isn't, this is just atime

  So on dirty_inode we check if atime has changed and if so
  we schedule change to atime file

  sync_inode should write an update for the inode if I_Dirty
  but sync_filesystems should not

  Simple.  fsync calls ->fsync.  We get that to write an
  inode update, but nothing else does.

  Possibly all directory updates could be chained onto a
  directory and only written when fsync is requested before
  a checkpoint.
  both sides of a rename ??
  leave that for later.

WritePhase - what is that all about?
  We must not change a block while it is being written to previous
    phase, else we corrupt causality.
  But we probably don't want to change it any way as that would
  mess up any checksum or duplication.

 So we want to ignore WritePhase - scrap it.
 Before changing a block, we must iolock_written
  - all dir updates
  - inode update in fsync
  - orphan file
  - segusage?
  - quotas?

 But what about regular data.  If prepare_write finds a block in
 writeback, do I need to wait, or can I just mark it dirty in 
 commit_write?  If no checksum and no duplication applies, this should
 be fine.

16July2010
 BUT e.g. dir operations are in particular phases.  If the dirblock
 is pinned to the old phase, we need to flush it, then wait for io
 to complete.  So we need lafs_phase_wait as well as iolock_written.
 This is already done by pin_dblock.
 I wonder if we need a way to accelerate pinned blocks that are being
 waited for - probably not, they should be done early.

 So we probably want to iolock after phase_wait in pin_dblock.
 Though dir.c pins early.
 I need to review all of this and get it right.

 So:
  - we aren't allowed to block much holding  checkpoint_lock as
    checkpoint_start waits for that.  However phase_wait will only
    block if a new checkpoint has started already, so there is no
    chance of phase_wait ever blocking checkpoint_start.
    So it is safe to call phase_wait in checkpoint_lock.
    phase_wait will wait until block is written, added back to
    the lru clean, then found and flipped... I wonder if that is 
    good - it keeps parent from being a leaf, and so written, until
    child write has completed.
    We want to phase-flip a block as soon as it is allocated by cluster_flush.

    With directory blocks, i_mutex stops other changes, so an early iolock_written
    will leave the block clean and phase won't be an issue.

    With inode-map blocks.. we:
      set B_Pinned to ensure no-one writes except for phase change
        do that after lock_written so it starts safe.
      once we have checkpointlock, wait for phase if needed.
      then lock_written again which should be instant but ensures
      that block is locked while we change it...

  I think I want
    - refile to call phase flip if index is not dirty and is in wrong phase
       and has no pinned children in that phase.
    - Only clear PinPending if we have i_mutex or refcnt == 0
    - before transaction:
          lock_written / set PinPending / unlock
      then inside cluster_lock
          lock_written pin / change / dirty / unlock
      it will only wait for writeout if phase changed.
      so don't need phase_wait
     but want pre-pin then pindblock
     Transactions are:
        dir create/delete/update - DONE
        inode allocate/deallocate - on inode map DONE
        setattr  DONE
        orphan set/change/discard

     Orphans are a little different as when we compact the
     file, the orphan file block 'owned' by the orphan block
     can change.  As long as we keep them all PinPending it
     should be fine though.
     I think that every block in the orphan file will always be
     PinPending ???

    OK - done most of that.
    Early phase_flip is awkward.  We need an iolock to phase_flip,
    and we don't have one.  The phase_flip could cause incorporation
    which cannot happen until the write completes.  So I guess
    we leave it as it is.


   FIXME what about inode data block - cluster_allocate is removing
    PinPending after making them dirty from the index block..

  If all free inode numbers are B_Claimed, I don't think we allocate
  a new block... yes we do, as 'restarted' is local to caller.

 Also
  each device has a number of flags
   - new metadata can go here
   - new data can go here
   - clean data can go here
   - clean metadata can go here
   - non-logged segments allowed
   - priority clean - any segment can be cleaned
   - dev is shared and read-only - no state-block updates

  state block needs a uuid for an ro-filesystem that this is
  layered on.

  Is metadata an issue?
    We might want it on a faster device, but ditto for directories
    and for some data.  So probably skip that.

  Have separate segment tables for:
    - can have new data
    - can have clean data but not new. (this often empty)

  Clean data can go to new-not-clean if nothing else
  new data can go to clean-not-new ?? if not sync??
  Maybe call them 'prefer clean' and 'prefer new'

  I think we want:
    'no sync new' - don't write new data, unless it is in big chunks and
           can wait for checkpoint to be 'synced'
    'no write' - never write anything - this is readonly.
               used for removing a device from the fs.

  A 'no sync new' device can have single-block segments.
  This doesn't allow compression, but avoids any need to clean
  In this case we don't store youth and the segusage is 32 bits per segment.
  That means  - for 1K block size - about 0.4% (4/1024) of device used for segusage.  That
  feels high.  For 4K, 1/1024 so a gigabyte per terabyte.
  Then limited to 29 snapshots plus base fs, and 2 bits to record bad blocks.
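
  A quick check of that arithmetic, as a sketch only.  The bit positions
  below (base fs in bit 0, snapshots in bits 1..29, bad-block state in
  bits 30..31) and all the names are assumptions made for illustration;
  the notes only fix the totals, not the layout.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout of the 32-bit per-segment usage word (illustrative only):
 *   bit 0       - segment in use by the base filesystem
 *   bits 1..29  - segment in use by snapshot 1..29
 *   bits 30..31 - bad-block state
 */
#define SEGUSE_BASE       (1u << 0)
#define SEGUSE_SNAP(n)    (1u << (n))	/* n in 1..29 */
#define SEGUSE_BAD_SHIFT  30
#define SEGUSE_BAD_MASK   (3u << SEGUSE_BAD_SHIFT)

/* Bytes of segusage needed when every block is its own segment. */
static uint64_t segusage_bytes(uint64_t dev_bytes, uint32_t block_size)
{
	assert(block_size != 0);
	return (dev_bytes / block_size) * sizeof(uint32_t);
}

/* Same overhead expressed in parts per million of the device. */
static uint64_t segusage_ppm(uint32_t block_size)
{
	assert(block_size != 0);
	return (uint64_t)sizeof(uint32_t) * 1000000u / block_size;
}
```

  With 4K blocks a 1TB device needs 1GB of segusage (about 976 ppm);
  with 1K blocks it is about 3906 ppm.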

  Other segusage for 29 snaps is 1/million of space used.
  So we 'waste' 0.1% of device for no secondary cleaning.
  Can still do defrag though.

  clearing a snapshot on a 1TB device writes 1GB of data!! potentially.
  as does creating a snapshot.

18jul2010
 If lafs were cluster enabled we would want multiple checkpoint clusters,
 one for each node. When a node crashes some node would need to find and
 roll-forward.  For single node failure, it is enough to broadcast cluster
 address to all others.  For whole-cluster failure, need to either list all
 in superblock or link from main write cluster.

 When writing to multiple devices we may want multiple write clusters
 active for new data.  These all need to be findable from checkpoint cluster
 so linking sounds good.
 Having a single 'fork' link in cluster head might work but doesn't scale to large
 cluster.  It doesn't need to be committed to other nor does checkpoint end, so
 that should be ok.
 Could have a special group_head to list other clusters for roll forward.
 If we put fsnum first, a large value - 0xffffffff - could easily mean
 something else
 
 Or every  cluster head could point to an alternate stream, and if we want many
 quickly, each simply points to another, so we create a chain across all writers.


 Another issue...
  When we 'sync' we don't wait for blocks until after the checkpoint is started,
  and we know that will be driven through to CheckpointEnd which will commit and
  release everything.
  However 'fsync' doesn't have the same guarantee.  The sync_page call will ensure
  the data has been written, but we don't know it is safe until the next
  header is written.  So we need to push out the next cluster promptly.

  So if sync_page is called on a page in writeback, then we mark the cluster as
  synchronous.  When a sync cluster completes, the next (or even next+1) clusters
  are flushed out promptly.  Hopefully they won't be empty on a reasonably busy system,
  but it is OK if they are.

  If a block is writeback for the cleaner.. then as the cluster is VerifyNone, as soon
  as the write completes the block will be released.

  So: to clarify sync_page:
    This can be called when page is in writeback or locked.
    If locked there is nothing we can do except maybe unplug the read queue.
    If page is in writeback and block is dirty, then it is probably in
    a cluster queue and we should flush the cluster and the next.
    If page is in writeback and block is not dirty, but is writeback,
    just flush one cluster.
    But we don't want these cluster flushes to start while the previous is
    still outstanding else we stop new requests from being added.
    So as soon as the cluster can be flushed we flush, but no sooner.
    I guess we use FlushNeeded and make that be less hasty.
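
    The sync_page cases above, written out as a small decision function.
    This is only a userspace sketch: struct page_state and the enum names
    are invented here, and the deferral of the flush until the previous
    cluster write completes is not modelled.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical snapshot of the page/block state seen by sync_page. */
struct page_state {
	bool locked;		/* page is locked (read in flight) */
	bool writeback;		/* page is in writeback */
	bool block_dirty;	/* block still dirty: probably in a cluster queue */
	bool block_writeback;	/* block already submitted in a cluster */
};

/* Number of clusters that should be flushed promptly. */
enum sync_action {
	SYNC_NO_FLUSH  = 0,	/* nothing to flush; at most unplug the read queue */
	SYNC_FLUSH_ONE = 1,	/* push out the next cluster header */
	SYNC_FLUSH_TWO = 2,	/* push out this cluster and the next */
};

static enum sync_action sync_page_action(const struct page_state *p)
{
	/* sync_page is only called on a locked or writeback page */
	assert(p->locked || p->writeback);

	if (p->locked)
		return SYNC_NO_FLUSH;
	if (p->block_dirty)
		return SYNC_FLUSH_TWO;
	if (p->block_writeback)
		return SYNC_FLUSH_ONE;
	return SYNC_NO_FLUSH;
}
```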

19June2010

  superblocks....
   We currently have a superblock for each device.
   I cannot see a good reason for that.
   We can just bdev_claim for 'this' filesystem.
   Rather we should have a number of anon superblocks,
    one for each fileset, then one for each snapshot.
   Do we use different fs types? probably yes
       lafs - main filesystem made from devices
       lafs_subset - subordinate fileset, given a path to  fileset object
                 can have 'create' option when given an empty directory.
       lafs_snap - snapshot - given a path to filesys and textname.

    Cannot create a snap of a subset, only of the whole filesystem
    Is it OK to mount either snap of subset or subset of snap?
    It probably is, so need to use the same filesystem type for both.
    Maybe lafs_sub or sublafs. Needs path to directory.
    can be given 'snap=foo'.
    No: a given filesystem may not exist in a snapshot.  You need to
    mount the snapshot first, then the subset of the snapshot.
    So we have three types as above.  All subsets as 'lafs_subset',
    whether they are subset of main or of snapshot.

    Should we be able to create a snapshot or subset without mounting it?
    It doesn't really seem necessary but might be elegant..

    remount doesn't seem the right way to edit a filesystem as it forces
     some cache flushing.
    What do we want to edit?
          - add device,  remove device
          - add/remove snapshot by name
          - add/remove subset?  Not needed, just mkdir/rmdir and mount to convert
                     empty dir to subset.
          - change cleaner settings??
    Could have remount as an option. If problem find other option.

    While cleaning (which is always) we potentially need all superblocks
    available as we might need to load blocks in those filesystems to
    relocate them.
    Unfortunately each super needs to be in a global list so there is a cost
    in having them appear and disappear. I guess that is not a big deal.  They
    are refcounted and will disappear cleanly when the count hits zero.

    So:
     DONE - change all prime_sb->blocksize refs to fs->blocksize
     DONE - create an anon sb for the main filesystem
     DONE - discard the device sbs, just bd_claim the devices and add to list
     - use lafs_subset for creating/mounting subsets.

  Changed s_fs_info to point to the TypeInodeFile for the super, but
   for root/snapshot that doesn't exist early enough to differentiate the
   super in sget.
   So we make an inode before the super exists and attach it after.
   Need to do all that get_new_inode does.
        inode_stat.nr_inodes++   - just don't generic_forget the inode
        add to inode_in_use -   seems pointless - just set i_list to something
        add to sb->s_inodes - if we don't it won't flush - maybe that is good?
        add to hash - don't want
        i_state == lock|new - only really needed if hashed.
    but there is lots of initialisation in alloc_inode that we cannot access!!

   Problem is that we need s_fs_info to uniquely identify the fs with something
   that can be set in the spinlock, so allocating an inode is out.
   And also to get to the filesystem metadata which is in the inode.
   I guess we allocate a little something that stores identifier and later inode.
     for lafs  we use uuid
     for subset we use just the inode
     for snapshot we use fs and number
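
   A sketch of what that "little something" might look like, with a
   compare function of the kind an sget test callback needs.  The struct
   and field names are made up for illustration and are not the real
   lafs definitions; this is plain userspace C, not kernel code.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

enum sb_kind { SB_LAFS, SB_SUBSET, SB_SNAPSHOT };

/* Hypothetical key hung off s_fs_info: identifier now, inode later. */
struct sb_key {
	enum sb_kind kind;
	union {
		unsigned char uuid[16];		/* SB_LAFS: filesystem uuid */
		const void *fileset_inode;	/* SB_SUBSET: just the inode */
		struct {
			const void *fs;		/* SB_SNAPSHOT: fs ...      */
			int num;		/* ... and snapshot number  */
		} snap;
	} id;
	void *root_inode;	/* attached after the super exists */
};

/* Cheap enough to run under a spinlock: no allocation, no sleeping. */
static bool sb_key_match(const struct sb_key *a, const struct sb_key *b)
{
	assert(a && b);
	if (a->kind != b->kind)
		return false;
	switch (a->kind) {
	case SB_LAFS:
		return memcmp(a->id.uuid, b->id.uuid, sizeof(a->id.uuid)) == 0;
	case SB_SUBSET:
		return a->id.fileset_inode == b->id.fileset_inode;
	case SB_SNAPSHOT:
		return a->id.snap.fs == b->id.snap.fs &&
		       a->id.snap.num == b->id.snap.num;
	}
	return false;
}
```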


25July2010
  superblocks:
   - sget gives us an active super_block.  We need to attach to a vfsmnt
     using simple_set_mnt, or call deactivate_locked_super.
   - sget's set should call set_anon_super
   - kill_sb (called by deactivate_super) should then call kill_anon_super

  If we have a vfsmnt, we have an active reference, so we can atomic_inc
  s_active safely.  So use this to allow snapshots and subsets to hold a
  ref on the prime_sb and thence on the 'fs'.

26July2010
 - DONE  need to set MS_ACTIVE somewhere!!
 - FIXME if an inode is being dropped when iget comes in, it gets confused
    and the inode appears to be deleted.

   We cannot really break the dblock <-> inode link until after write_inode_now,
   but there is no call-back before generic_detach_inode is complete.
   The last is write_inode which is only called if I_DIRTY_something.
   Maybe when writeback completes on an inode dblock, we should check if
   the inode is I_WILL_FREE and if so, we break the link...

   Or maybe when we find my_inode set we can check the block and if it isn't
   dirty or being deleted we break the link directly... That makes more sense.

   So... what is the deal with freeing inodes???
     ->iblock is like a hashtable reference.  It is not refcounted
             It gets set under private_lock
      iblock is freed by memory pressure or lafs_release_index from
             destroy_inode
     when refcount of iblock is non-zero, ->dblock ref is counted,
     else it is not.
     dblock is set to NULL if I_Destroyed, or when dblock is discarded,
       (under lafs_hash_lock)
       and set to 'b' in lafs_iget and lafs_inode_dblock

     We can drop the dblock link as soon as iblock has no reference

    probably get clear_inode to break the link if possible, which it should
    be on 'forget_inode'.  Then lafs_iget can wait on the bit_waitqueue.
    or maybe do clear_inode itself

   FIXME when we drop dblock we must clear iblock! as getiref iblock assumes
      dblock is not NULL.

28July2010
  So: ->dblock and ->my_inode need to be clarified.

  Neither is a counted reference - the idea is that either can be freed and
  will destroy the pointer at the time so if the pointer is there, the
  object must be ... but we need locking for that.
  ->dblock is reasonably protected by private_lock, though if ->iblock exists
  we hold a ref of ->dblock so we can access it more safely.

  Need to check getiref_locked knows ->dblock exists when called on iblock
  and lafs_inode_fillblock
   yes, both safe!

 But ->my_inode needs locking too so the inode can safely disappear without
 having to wait for the data block to go.  After all data blocks come in sets,
 and one shouldn't keep others with inodes.
 So something light-weight like rcu might work.
 We use call_rcu to free the inode and rcu_readlock to access ->my_inode

 Yes, that will work.  Occasionally we will want an igrab too, but not
 often.
 Should look into rcu for index hash table and ->iblock as well.
 Current ->iblock is only cleared when the block is freed .. I guess that is fine...


31Jul2010
  rcu protection of ->my_inode
  A/ orphan inodes - are they protected?  
  B/ orphan blocks - are the inodes of those protected? Probably...

  inodes are 'orphan' for two reasons
    1/ a truncate is in progress
    2/ there are no remaining links, so inode should be truncated/deleted
       on restart.

  The second precludes us from holding a refcount on any orphan inode,
  else it would never get deleted.
  So we must assert that an inode with I_Deleting or I_Trunc has an implied
  reference and so delete must be delayed... not quite.
  If we set I_Trunc but not I_Deleting, then we igrab the inode until
  I_Trunc is cleared.  While we hold the igrab, I_Deleting cannot possibly
  be set as that is set when last ref is dropped.

01Aug2010
  FIXME lafs_pin_dblock in lafs_dir_handle_orphan needed to be ASYNC.
    .. and in lafs_orphan_release
  Well... only iolock_written can be a problem, and our rules require that
  only phase-change writeout can set writeback.  So the cleaner can never
  wait for writeout here.  Maybe it can wait for a lock, and maybe we don't
  really need a lock, just 'wait_writeback'.
08Aug2010
  So cleaner is in run_orphans, dir_handle_orphan pin_dblock iolock_written
   It is writeback waiting on 74/BIGNUM from file.c:329.  So writepage
   tried to write a block in a directory .. but it is PinPending so that
   must have been set after writepage got it...
   lafs_dir_handle_orphan gets an async lock, then sets PinPending.
   If write_page is before that, it will have the lock and dir_handle will try later.
   If write_page is after it will block on the lock, or see PinPending and
   release the lock.
   So someone else must be clearing PinPending!
     - checkpoint clears and re-sets under the lock, so that is safe
     - dir.c clears under i_mutex
         dir_handle_orphans always hold i_mutex ... or does it.
     - refile drops when the last non-lru reference goes.
     - inode_map_new_abort clears for inode
   No, not that - just bad test on result of iolock_written_async ;-(

  Now have an interesting deadlock.
    rm in lafs_delete_inode in inode_map_free is waiting for the block to
    flush which requires the cleaner.
    The cleaner thread in inode_handle_orphan is calling erase_dblock
     on the same inode which blocks while inode_map_free has it locked....
     no, not same block - just waiting for writeout which requires cleaner.
     lafs_erase_dblock from inode_map_free must be async!
   pin_dblock in lafs_orphan_release must too.... no - only the setting of
   PinPending needs to be async or outside of cleaner, which it is.

  Ok, got that fixed.  All seems happy again, time for a commit.


09Aug2010
   14b/  What backing-dev to show the filesystem.
     backing-dev holds:
         congested state
         unplug function
         read-ahead info
         throughput measurements

    Much of that is for generic code to use.  We need to:
     - provide an unplug function that unplugs all devices
     - provide a congested function which checks all devices,
       or for 'write' - at least the device we are writing to.

    How do we set the backing device?
    The 'struct address_space' points to one, as does struct super_block.
    set_anon_super establishes a null bdi, set_bdev_super gets it from the
    bdev->queue

    We need to bdi_init and bdi_register (if no error) our bdi.
    bdi_destroy calls unregister and reverses bdi_init
    or just bdi_setup_and_register
    but bdi_register_dev gives a better name - isn't this sick!!!

    Partly done ... but I'm hitting more bugs :-(

  -Checkpoint cannot complete because...
   Lots of dirty inodes that are orphans are not pinned!! I
   guess the InoIdx is ??
   Most of them don't have InoIdx(?)  Only '8' does.
   8/0 is also an orphan and is on wc[0]

   It seems that this block keeps getting re-written and stays in
   Phase0.
   Is that because it is a data block with PinPending.. No, that works
   as long as it becomes un-dirty: we drop pinpending, refile, and set again

   It is being dirtied again during writeout for the checkpoint
   so it doesn't get to changed phase when we lift PinPending.
   I guess we mustn't dirty it if it is in the old phase.

  -And twice inode 17 is deleted without B_Orphan being set!
   That is the only file that exists before we mount.
     Problem was orphan_release instead of orphan_forget
     I wonder why it only affected 17...

  -at shutdown we drop an inode and try to invalidate pages, but
   root inode is still dirty - I wonder why.
     The dblock is in a different phase to the iblock.
     In checkpoint we wait until root iblock changes phase, but
     not root dblock!


  UP TO:
    I'm testing subordinate filesystems, which don't work yet.
    I need to create the root directory and inode map.
    Obviously I cannot record the inode map file in the inode map....
      inode_map should ignore everything less than 16? 8? 2?
    Need to make sure creating with a given inode number works.
    Need to make sure auto-allocate inum is never less than 16.

11Aug2010
 How to map from filesys inode to superblock?
  Need in
    lafs_iget_fs
    choose_free_inum - to get inode-1
    ditto in inode_map_free
    lafs_put_super has something odd with i_sb

  Could do an sget search..
  Or could just store it in the inode (but not in i_sb!!)
  inode already a bit large though.
  Do it for now, but make a note to trim the fs_md part of inode 
  into a separate allocation.

  lafs_new_inode should take an 'sb' not a 'filesys'.
  In fact, get rid of filesys.  It is
    MAP(i->i_sb->s_fs_info)->root.

 15f - timestamps for roll-forward.
    The writeout can be much later, but logging the mtime is fairly
    boring ... we could log mtime in the group head, which might be cheap
    enough.  How much precision is needed, and against what base?
    probably mtime of last checkpoint from superblock.  That should
    be not more than 2048 seconds ago, so 16 bits gets us 30msec...
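
    Checking that: 2048 seconds over 65536 steps is 31.25msec per step.
    A sketch of the encoding, assuming nanosecond timestamps and that the
    base is the last checkpoint time; the function names are invented and
    nothing here is an on-disk format.

```c
#include <assert.h>
#include <stdint.h>

#define MTIME_WINDOW_SEC  2048u
/* 2048s spread over 2^16 ticks: 31250000ns (31.25msec) per tick */
#define MTIME_TICK_NSEC   (MTIME_WINDOW_SEC * 1000000000ull / 65536u)

/* Encode mtime as a 16-bit offset from the last checkpoint time. */
static uint16_t mtime_encode(uint64_t checkpoint_ns, uint64_t mtime_ns)
{
	uint64_t ticks;

	assert(mtime_ns >= checkpoint_ns);
	ticks = (mtime_ns - checkpoint_ns) / MTIME_TICK_NSEC;
	/* Clamp if the checkpoint is older than the window. */
	return ticks > UINT16_MAX ? UINT16_MAX : (uint16_t)ticks;
}

/* Recover the (rounded-down) mtime during roll-forward. */
static uint64_t mtime_decode(uint64_t checkpoint_ns, uint16_t ticks)
{
	return checkpoint_ns + (uint64_t)ticks * MTIME_TICK_NSEC;
}
```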

14Aug2010
 15l - decay youth info.
    Need to decay:
         youth_next and checkpoint_youth in 'struct fs'
         all blocks in youth files on storage
         all scores in seg-tracker.
           - not needed, they'll get updated in normal progress
             and being wrong for a while is no cost.
    ensure correct youth is stored in lafs_free_get
    check little-endian conversion of all youth accesses

    checkpoint_youth only used by thread, so no locking needed
    youth_next protected by fs->lock

 15m - share orphans and cleaning list_heads in datablock
   It certainly is possible to clean an orphan but it is very unlikely
   as it will have changed recently, or be changing soon.
   The cleaner could just dirty any B_Orphan it finds.
   But if orphan finds a block on the list, it must be careful...
   I guess when cleaner drops a cleaning ref, it should check if the block
   is an orphan, and re-queue if it is.

 15o - async blocks just have an extra refcount.
   This could:
     - keep PinPending set
     - keep an index block pinned - will phase-flip
     - keep ->parent link
   but not get in the way of a checkpoint.

   Should we clear any that we find though?
   Normally async is only used by cleaner, orphan processing, or segscan
   So it should all be finished when we do a checkpoint.

   So if checkpoint, or release_page, finds an async block, drop it.

 15r - further optimisations in cleaner to avoid lookups.
   We have fsnum,inum,blocknum and cluster seq number and trunc num.

   I want to introduce more async though.  Currently it only loads
   one inode at a time.
   To do more, I need to mark inodes as 'done' when they are and always
   restart from the start of the cluster (only do one cluster at a time
   for now).
   So if we get all the way through a cluster with no 'EAGAIN' we finish
   with the cluster.

 15y - when could a directory block become an orphan?
    - when deleting that last entry - we don't know if it can be fully
      deleted until we look in next block
    - when deleting an entry follows a chain back to the first block
    - when deleting the last entry in the block.

    So it could be an orphan if the entry found:
        - is at end of block
        - is first entry
        - is only entry
     or first entry is already deleted.
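
     As a predicate, assuming entries are simply indexed 0..n-1 in block
     order (the real directory block layout is more involved, and the
     function name is invented):

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Could deleting entry 'idx' leave this directory block needing orphan
 * handling?  'first_deleted' says the block's first entry is already a
 * deleted placeholder.
 */
static bool dirblock_may_orphan(int idx, int nentries, bool first_deleted)
{
	assert(nentries > 0 && idx >= 0 && idx < nentries);

	return idx == nentries - 1	/* at end of block */
	    || idx == 0			/* first entry; also covers "only entry" */
	    || first_deleted;
}
```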

15Aug2010
  looking at flushing etc when run out of space.
  We often force a checkpoint when it won't do any good as
  nothing has been cleaned.
  In fact we write lots of dead checkpoints to 0/0 until it is full,
  then move on, clean 0/0 and suddenly have space.
  We shouldn't do that.  sync should be what pushes us forwards.
  Maybe that is fixed..

  InoIdx blocks still cause confusion.  Should they ever have credits?
  or do only the data block have those?  Certainly they cannot have
  SegRef.
  And there is confusion in my mind whether data blocks can be pinned 
  while the InoIdx block is - need to clarify that.


13Sep2010 - now, where was I...
 - I've just been dropping the use of SegRef on InoIdx blocks, where it makes no sense.
 - test run: block.c:660 - no credits available while dirtying an InoIdx block during
   orphan handling.  lafs_reserve_block (under checkpoint lock) should have set credit.
   Only I just changed reserve_block to do that dblock instead - I wonder why.
   OK, I think I cleaned that up...


 - make_orphan is hanging in checkpoint_unlock_wait. So orphan_pin returned -EAGAIN
   so pin_dblock did too.  So reserve_block did too, so prealloc or summary_alloc or seg_ref_block
   returned error.
   Problem is that we don't push a checkpoint when cleaner runs out of things to do.
   But we don't want to go back to pushing a checkpoint too often.
   Maybe the problem is that we only force the checkpoint when we have enough space to do
   new allocations, but we need to force it earlier if nothing new can be cleaned.

   Once we set EmergencyClean, lafs_reserve_block will stop returning EAGAIN for newspace, so
   we need to wake 'checkpoint_wait' then.
   But for ReleaseSpace we want to wake on every checkpoint... we probably do anyway.
   ...anyway, that is sorted now at commit  95b6b05e460


  So: InoIdx blocks.
    - These never get SegRef as that is meaningless - done.
    - These can have credits.  It possibly isn't necessary but it makes things
      easier.  They are 'written' by transferring the credits to the data block, or discarding them.
    - I think dblock and iblock can both be pinned
      The problem this caused was that the dblock might get processed as a leaf before iblock.
      We now have lafs_is_leaf which causes dblock not to be a leaf even if it is pinned, if the iblock
      is pinned to the same phase.
      lafs_phase_flip refiles the dblock so that it goes back on the leaf list as does lafs_refile when
      it unpins an iblock
      So lafs_pin_dblock doesn't need to pin the inode instead.
   OK, that is fixed. - commit f1c05293bfd Mon Sep 13 15:07:27 2010 +1000
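
   The one rule that matters here, as a toy model.  This covers only the
   dblock/InoIdx case described above; whatever else lafs_is_leaf checks
   is left out, and struct blk is invented for the sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Minimal stand-in for a block: just pin state and phase. */
struct blk {
	bool pinned;
	int phase;
	struct blk *iblock;	/* InoIdx block for an inode dblock, else NULL */
};

/* An inode dblock is not a leaf while its iblock is pinned in the same phase. */
static bool dblock_is_leaf(const struct blk *d)
{
	assert(d != NULL);
	if (!d->pinned)
		return false;
	if (d->iblock && d->iblock->pinned && d->iblock->phase == d->phase)
		return false;
	return true;
}
```

   Once the iblock flips phase or is unpinned, the dblock becomes a leaf
   again, which is why it has to be refiled at those points.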

 15u - I don't need to get a segref there, but I need to have one from the original dirty block,
       so fix that up - commit Mon Sep 13 15:28:08 2010 +100

 15v - What do we have?
       lafs_dirty_dblock:  set Dirty, clear Credit clear NCredit
                           set Uninc, clear Icredit clear NICredit
       lafs_dirty_iblock:  set dirty, clear credit
                           test uninc, clear ICredit, set Unincredit - not essential
       mark_cleaning:      test realloc, / alloc / set realloc
                           test dirty / clear realloc/ set credit
                           set uninc clear icredit
       cleaner_flush:      set dirty, clear realloc, clear credit
                           test dirty, clear realloc set credit
       flush_data_to_inode:
       lafs_cluster_allocate - there is some odd code there!!
       flip_phase
       lafs_allocated_block

       all rather different really.
       Just do some tiny tidyup in lafs_cluster_allocate when dirtying dblock

 15w/ Space used by cluster updates??
       It is all fine - just some confusion of function names.

 15z/ logging symlink creation.
      Do I need to log the content? It needs to be safe on a dir sync, and you cannot sync the
      symlink itself.  So I guess we queue the block for writeout so it will go with the
      dir update.
      Yes, that works: Mon Sep 13 17:33:54 2010 +100

 15ab/ already did that in commit f90959e6f492b6


 15ac/ How can we trigger write-out of dirty index blocks which have no pin-count, thus allowing them to
    be freed after the write completes?  A checkpoint could do it, but that would write out index blocks
    that cannot be freed too.  A checkpoint would only be good after lots of data pages had been written.
    We could just wait and let other processes kick in..

    I don't think we need to do anything.  lafs_shrinker doesn't really know how tight memory
    is, and periodic checkpoint will free up any memory that we are pinning.

    .... but something is needed.  We need some trigger to write dirty index blocks
    Maybe:
        - a timeout on checkpoints - every dirty_expire_interval - but that isn't exported.
        DONE THAT.

    Not sure this is a complete solution.  I might want to incorp/flush index block when they
    have no dirty children, but I'm not sure about that.

14sep2010
  15ad - lafs_add_block_address call from lafs_phase_flip - do I handle failure correctly?
     failure happens when b2 is data block and uninc table is full so we called incorporate on the parent.
    This could split the parent which means the block could have been re-parented - it would have been in the
    child list and so found and fixed.
    lafs_allocated_block, when this happens, checks that the parent is dirty/realloc as appropriate.
    In this case, realloc isn't an issue, only dirty.  lafs_incorporate must have made it dirty and
    it won't get written while it has these in-phase children, so all is happy.

 15ae - refile race?  Someone might set B_IOLock before removing from lru, so
          onlru is 0 and refcnt is elevated so it doesn't seem to be unused.
          But then whoever has the ref will refile again when dropping it and
          so the right thing will be done.
        But more generally, do we really want the lru etc to own a counted reference?
        If it didn't:
          - we would need to refile when removing from any list
          - we would need to get a ref when removing from list.
          uhmmm..

    lafs_refile does:
            clear PinPending  if refcnt is low
            unpin   if not PinPending, or dirty etc and data or refcnt is low
            place on leaf list - if pinned etc - this can be earlier
            drop parent link, if refcnt is low, and not pinned etc
            handle dblock issues

        if lru was not refcounted, then the only things we might do when refcnt isn't zero are:
            unpin a dblock once it is not dirty
            add to lru

       But if we don't count lru, then we can lose the refcount on dblock

     Hmmm - we cannot leave things on the leaf list forever as they thus hold a reference and
       don't get freed.

   I think I want things on 'leafs' list to not hold a counted reference.
   Things *only* get removed while walking the list.
   InoIdx blocks hold a ref on the dblock both when counted and some other time.  Possibly
    when pinned.  This ensures they are held while the InoIdx is a real leaf.
   But: When we take that first ref, how do we know the dblock even exists?

   What is the lifetime of ->dblock?
         removed when page is released
         set by lafs_import_inode
         set by lafs_inode_dblock
         removed by clear_inode
   So if I don't hold a ref, I always need to be ready to call lafs_inode_dblock
   This is currently callers of getiref_locked
          - erase_dblock_locked ?? shouldn't need a lock
          - ihash_lookup - never on InoIdx
          - lafs_make_iblock - already have dblock
    So none of those really need lafs_inode_dblock
    What about when we set Pinned
         only really from set_phase ... messy.
    What about when we set ->parent
           grow index tree - not relevant
           ditto do_incorporate_*
           block_adopt
              Can be called on InoIdx from:
                lafs_make_iblock  only!!

15sep2010

  I have tidied lafs_refile up a lot but I need to make locking a lot cleaner.
  In particular I want a single lock I can take when the refcnt hits zero which will ensure no ref
  is taken until I have finished my cleanup.  I suspect the inode private_lock is the one to use.
  I also need to clean up getiref_locked and getref_locked - having both is awkward.

  So: when are they called?

   getref_locked:
     lafs_get_flushable - hold fs->lock
     first_in_seg       - holds private_lock, but shouldn't need _locked as hold a ref through child.
     (getiref_locked)
     pin_all_children   - hold private_lock
     find_better        - private_lock
   getdref_locked
     lafs_invalidate_page - to get a ref on each block to either erase or invalidate it
                          presumably page is locked
     lafs_get_block     - holds private_lock - plus once with only page_lock
     lafs_release_page  - holds private_lock
     (getiref_locked on dblock) - no locking
     lafs_inode_dblock  - private_lock of my_inode...
     lafs_delete_inode  - private_lock of my_inode
     lafs_destroy_inode - ditto
     lafs_drop_inode    - ditto
  getiref_locked
     erase_dblock_locked - private_lock
     lafs_get_flushable - fs->lock
     ihash_lookup       - lafs_hash_lock
     lafs_make_iblock   - private_lock

  So private_lock looks like a good choice.  Issues are:
       - what is the story with dblock on my_inode->private_lock
       - what is the lock ordering
       - what can refile negate that we need to be careful of.
         i.e. we want to keep things stable while refile does its tests, but what do we need to keep
           stable for others?
            + we break the parent link?? and so the siblings link
            + move things to freelist
            + can put_page
            + free dblock if not page_private

   Lock_ordering.  private_lock, then fs->lock, then lafs_hash_lock
   So if we have to hold lafs_hash_lock, we increment refcnt, drop the lock, get/drop private_lock
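
   To keep myself honest about the ordering, a tiny rank checker in
   userspace C.  The ranks mirror the order above; lock_ctx and the
   helper names are invented, and "locks" here are just bits, so this
   checks the discipline rather than providing any real exclusion.

```c
#include <assert.h>
#include <stdbool.h>

/* Lower rank must be taken first: private_lock, fs->lock, lafs_hash_lock. */
enum lock_rank { RANK_PRIVATE = 1, RANK_FS = 2, RANK_HASH = 3 };

struct lock_ctx {
	unsigned held;	/* bit r is set while the lock of rank r is held */
};

/* Taking rank r is legal only if nothing of equal or higher rank is held. */
static bool lock_ok(const struct lock_ctx *c, enum lock_rank r)
{
	return (c->held >> r) == 0;
}

static void lock_take(struct lock_ctx *c, enum lock_rank r)
{
	assert(lock_ok(c, r));
	c->held |= 1u << r;
}

static void lock_drop(struct lock_ctx *c, enum lock_rank r)
{
	assert(c->held & (1u << r));
	c->held &= ~(1u << r);
}

/*
 * Turning a hash-table pointer into a counted reference: bump refcnt
 * under the hash lock, drop it, and only then get/drop private_lock.
 */
static void ref_from_hash(struct lock_ctx *c, int *refcnt)
{
	lock_take(c, RANK_HASH);
	(*refcnt)++;
	lock_drop(c, RANK_HASH);

	lock_take(c, RANK_PRIVATE);
	lock_drop(c, RANK_PRIVATE);
}
```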

   This is getting messy - I need something nice and clear.
   So:
     Index Blocks.
        If Pinned, either has references or is on a leaf list - possibly both
        If no references and not pinned then not on leaf list, so can be on free list

        Pinned can only be set when there are references, and can only be cleared under private_lock
                  This is violated by phase_flip, which badly reads refcnt
        If refcnt is zero and not pinned, then can be moved to free_list
        If on freelist and refcnt is zero under hash_lock, can be freed

        So if lafs_get_flushable finds a block that is not pinned, then we can delete and ignore.
            Someone else must hold a ref and will put it and it will refile.  but that is pointless as
            it could immediately be cleared after we test Pinned.

        lafs_get_flushable should get a reference before deleting from list.  This ensures it won't be freed
         by lafs_shrinker, though it could be on the free list.  If it is, then it isn't pinned so it is not
         interesting to us.


       Data Blocks:
         These are removed from lru when freed - we just need the extra refcnt check after removing from list.
         No we don't - these are only pinned while refcnt or dirty and can only lose dirty while refcnt
         so they cannot disappear

    What is the story with my_inode->private_lock though?  This is used to protect ->dblock accesses.
    I guess we need to get or hold the other lock .... look at what the race is - what else is checked when dblock is cleared?
              dblock is cleared in refile for the dblock,
              or in clear_inode under the inode private lock.

    So:
     There are various places that hold a non-counted reference to a block.
     These include
            - index hash table            lafs_hash_lock
            - index free list             lafs_hash_lock
            - phase_leafs / clean_leafs   fs->lock                        only if pinned
            - inode->iblock               lafs_hash_lock
            - inode->dblock               inode->i_data.private_lock

     Each of these is protected by its own lock, but not all the same lock.
     When we turn one of these into a counted reference, we increment refcnt under the local lock,
     then after dropping that lock we take and drop b->inode->i_data.private_lock to ensure refile has
     finished.  This must be done before changing/using the block in any way.
     To free an index block it must first be removed from _leafs list.  Then if the refcount is still
     zero it can be freed - or put on freelist and subsequently freed.
     An InoIdx block - we need to hold hash_lock as well as private_lock to take a reference.
     To free a data block we similarly need to recheck refcnt after removing from leaf list.
     If it is in an inode file we also take that inode's private_lock to clear dblock.
         We use rcu to get the inode, then lock it, then clear dblock if refcnt is still zero.
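
     A minimal userspace sketch of the "increment under the local lock, then
     take and drop private_lock" rule, with pthread mutexes standing in for the
     kernel spinlocks.  The struct and function names are made up for
     illustration and are not the lafs ones: 'refile_lock' models
     b->inode->i_data.private_lock and 'list_lock' models whichever local lock
     protects the non-counted reference.

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical stand-in for a block; not the real lafs block structure. */
struct blk {
	int refcnt;                    /* counted references */
	pthread_mutex_t *refile_lock;  /* models b->inode->i_data.private_lock */
};

/* Hypothetical holder of a non-counted reference (hash table, leaf list...). */
struct holder {
	pthread_mutex_t list_lock;     /* the "local" lock for this holder */
	struct blk *b;                 /* non-counted reference, may be NULL */
};

/*
 * Turn the holder's non-counted reference into a counted one.
 * Step 1: bump refcnt while the local lock keeps the pointer stable.
 * Step 2: after dropping it, take and drop the refile lock so that any
 *         refile which saw refcnt == 0 has finished before the caller
 *         changes or uses the block.
 */
static struct blk *getref_from_holder(struct holder *h)
{
	struct blk *b;

	assert(h);
	pthread_mutex_lock(&h->list_lock);
	b = h->b;
	if (b)
		b->refcnt++;
	pthread_mutex_unlock(&h->list_lock);

	if (b) {
		pthread_mutex_lock(b->refile_lock);
		pthread_mutex_unlock(b->refile_lock);
	}
	return b;
}
```

     The local lock is always dropped before the refile lock is taken, which
     fits the ordering private_lock, then fs->lock, then lafs_hash_lock.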

17sep2010
   review lafs_refile - are some of those tests redundant? - yes, one is gone.

 So:
  15ah - What about truncated blocks sitting on an uninc chain?
       I don't see the problem.  It will eventually get incorporated and do the right thing...

  15ai - We don't want to touch the youth block during a checkpoint else it is awkward to write it out in
      a stable way.....
    No, I don't think that is really a problem.  It only gets written out in the tail of the checkpoint after
    the root.  I guess it could then get a youth number for a segment that it has no count for, if the root is
    written at the end of one segment and the segusage/youth written at the start of the next.

    But I think roll-forward is missing something.  Blocks in the next phase need to be counted into segusage.
    Are they?  oh, yes - they are. - cleaned and index blocks are ignored so there might be some wasted space,
    but the important blocks picked up by the roll-forward are handled.

    So....

     A checkpoint could cover multiple segments.  We need to be sure these each get a valid youth number.
     Probably most of them will, but we need a consistent approach to be sure.
     They don't need to be added to the segtracker, except the last needs to be active, and it already is.
     So as we find a new segment we want to do much like what lafs_free_get does with youth_update.
     But the data block - isn't that youthblk?  When is that set?
        segsum_find sets it if ssnum == 0

19sep2010
   15ak - run the orphan file at mount time.
     After roll-forward when we have a working filesystem, we need to read the orphan file, load each block
     mentioned, and register each as an orphan.
     This involves:
            - setting the orphan_slot
            - setting B_Orphan
            - lafs_add_orphan
         Just like at the start of orphan_commit
     We also need to initialise nextfree and possibly 'reserved'.
     But: can orphans be created during roll-forward?  They certainly can.  We currently hide that in a re-use of
     the orphan list..  But directory updates are possible too, and not handled.

     I guess we should examine the file as soon as root is loaded and before roll-forward, as roll-forward cannot
     change the orphan file.  Then after roll-forward, we read the original part of the file and set up
     any orphans that aren't yet.
     So we want to read once to get the size.  Then read again to process content up to that size.

   15am - filesystem name.
       This is only used for identifying snapshots

01oct2010
  - mkfs is done to an initial version of lafs-utils. !!!

 So: 15am - filesystem name - used to identify snapshots
   So the name is pointless in subordinate filesets.  So I could just shrink
    the metadata.  The primary metadata needs to be big enough to get a name
    easily though.

 15aw..
    When cleaning we have a separate credit bit 'B_Realloc' from 'B_Dirty'.
    But we have the same B_UnincCredit bit for both.  Is that safe?
    Processing the cleaner could absorb the UnincCredit while the block is
    reserved but not dirty.  Then when it gets dirtied, there may be not
    enough credits to split.
    We set Dirty from Credit, and use ICredit for UnincCredit.
    But when only Realloc (not dirty) we don't use those bits.  We allocate
    fresh credits or set Dirty if that fails.

03Oct2010
   Need lafs_iget_fs to work on other filesystems.  And other snapshots?
   We use it:
     in cleaner when parsing cluster head
     in orphan handler when loading orphan file or when rearranging it.
     in roll forward

   Each of these might need to kern-mount the fs - so we need to hold the ref
   somewhere.
   Cleaner also needs to explore snapshots.

   Don't want kern_mount - that is too heavy weight and includes a vfsmnt.
   Just split up lafs_get_subset and use sget etc. so we get an 'sb' that we need
   to hold.
   Similarly for snapshots.  Cleaner needs to consider all snapshots, so they
   all need to be mounted.

   So snapshot 'sb's are referenced by cleaner, and de-reffed when cleaner stops.
   Subset 'sb's can be attached to the parent inode and then only dropped when
   the inode goes... only sb currently references inode.
   So maybe the first ref to an sb doesn't ref the inode but others do - is that
   possible? No, as we don't see them being dropped.
   Every inode in the subset could ref the filesys inode.  That would keep it active
   the right amount of time, but release/destroy could still be racy.

   I guess cleaner/orphan/roll need to explicitly ref the fs.
     cleaner already refs inode when B_Cleaning, so hold fs too.
     B_Orphan seems to own an inode ref too.
     
   So:
       lafs_iget_fs gets a ref on the inode and the sb.
       need lafs_iput_fs to drop both references
       B_Cleaning, B_Orphan, I_Pinned and I_Trunc all hold this double ref.

    cleaner holds refs on all snapshots

    FIXME I probably need to hold inode/fs for B_Async too.
       No.  Async only refs the block, not the inode or fs.
        Something else would normally ref the inode - e.g. cleaner.
        When the inode is free, the page invalidation will notice the
         B_Async flag and release it.

    So that is all done now, except I don't hold refs on snapshots in the cleaner
    yet.

11oct2010
 DescHole
   - When is this used? directory etc don't need it.
   - a regular file might, but there is no API to punch
     a hole.... yet I guess.
   - So we just want to allocate these blocks to 0.

15oct2010 - happy birthday Daniel...
 Looking at 36:
  a/ files with nlink==0;
        If we happen to find them, we hold a reference until all roll-forward
        is done, in case a name is found - it is important not to start deletion
        early.

18oct2010
  36g - write roll_mini for directories.
   We get a name, an inode number, and one of:
      LINK UNLINK REN_SOURCE REN_NEW_TARGET REN_OLD_TARGET

   The REN_SOURCE is linked with a REN_*_TARGET which could be in a
   different directory, so we need to stash the SOURCE until the TARGET
   arrives.
   We simply impose the implied change on the directory and update the
   link count in the target inode.
   So:
     load the inode
     possibly record REN_SOURCE for later

     calls prepare/pin/commit as appropriate.
     Put the inode on orphan list if appropriate - needs care
        as we retarget orphan list.
     update inode link count.

   (28Feb2011)
   Just a refresh on the purpose of these updates.
   1/ They allow us to fsync a directory without performing a full checkpoint.
     As directory blocks are not processed in roll-forward we need the update
     for data to be safe.  As fsync of directories is rare in some common
     situations we could avoid actually writing these.  Simply queue them
     internally and discard them on a checkpoint.  If an fsync comes before the
     checkpoint, only then do we write them out.  If there are any cross-directory
     renames then the preceding updates in both directories need to be flushed
     before the cross-directory rename.  It might be easier to always flush on
     a cross-directory rename.
   2/ They ensure consistency of inode link-count wrt to names in the filesystem,
     but as link count is only updated by these (or a checkpoint) there is no
     problem with delaying.

   So: when replaying these we must update the directory content and the inode
   link count.
   It is OK to delay the write-out of these until an fsync, and not bother
   if a checkpoint happens.
   So add that to the TODO list - item 66.

28feb2011
  - roll forward directory updates ... I wonder if I got it right :-)(untested).


  I don't seem to have easy-access notes about the various meanings of
  'width' and 'stride'

  width:  The number of independent devices across which the (virtual) device
    is placed.  The normal goal is to write 'width' blocks on every single write.
    On a RAID4/5/6 this will avoid the need to pre-read for parity calculations,
    and it will keep all devices equally busy with writes.
    The 'width' blocks probably aren't consecutive.

    There are two different layouts - one with width*stride <= segment_size
    and one with width*stride > segment_size.

  width*stride <= segment_size
     This is a traditional striped layout like RAID0/4/5/6.
     The 'stride' is the chunk size, so 'width*stride' is the stripe size,
     and segment_size must be a multiple of this.
     In this case all addresses in a single segment are contiguous.   We don't
     necessarily write them in order if we want to write less than one stripe.
     segment_offset will normally be a multiple  of width*stride though this isn't
     enforced as one could have a partition with a non-aligned start.

  width*stride > segment_size
     This implies a concatenated layout.  If parity-redundancy is in use then the
     blocks which combine to form a stripe are 'stride' blocks apart.
     The benefit of this layout is that an extra drive can be added by simply
     zeroing it and joining it to the array - no re-stripe needed.
     This will make all stripes slightly larger so at first the space will not
     be available.  As cleaning happens the space will gradually become
     available.  This still requires restriping, but unlike a normal
     raid5 restripe, the space becomes available in small amounts immediately.
     When there is no demand for more space, the re-striping (cleaning) can happen
     at a very low priority with no cost.

     In this case the blocks in a segment are not contiguous.  
      'segment_size/width' are, then there is a large gap (in virtual address 
      space) to the next chunk.

     The segment_offset is an amount of space which is free at the start of
     each device.  0..segment_offset and stride..stride+segment_offset etc
     do not contain data and can be used for metadata.
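
     A rough sketch of how a (segment, offset) pair might map to a virtual
     address in the two layouts.  This is only my reading of the description
     above, not the real lafs mapping: it assumes that in the concatenated
     case device k owns virtual addresses k*stride .. (k+1)*stride-1 and that
     every segment takes segment_size/width consecutive blocks from each
     device.  The struct and function names are hypothetical, and the order in
     which blocks actually get written is a separate question.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical parameter bundle; field names follow the notes, not the code. */
struct layout {
	uint64_t segment_offset; /* free space at the start of each device */
	uint32_t segment_size;   /* blocks per segment */
	uint32_t width;          /* number of independent devices */
	uint64_t stride;         /* chunk size, or per-device span if concatenated */
};

/* Virtual address of block 'off' (in address order) within segment 'seg'. */
static uint64_t seg_block_addr(const struct layout *l, uint64_t seg, uint32_t off)
{
	assert(off < l->segment_size);
	if ((uint64_t)l->width * l->stride <= l->segment_size) {
		/* Striped: all blocks of a segment are contiguous. */
		return l->segment_offset + seg * l->segment_size + off;
	} else {
		/* Concatenated: 'per_dev' contiguous blocks, then a large gap. */
		uint32_t per_dev = l->segment_size / l->width;
		uint32_t dev, row;

		assert(per_dev > 0);
		dev = off / per_dev;
		row = off % per_dev;
		return dev * l->stride + l->segment_offset + seg * per_dev + row;
	}
}
```

     With segment_offset=16, segment_size=12, width=3, stride=1000 this puts
     the first four blocks of segment 0 at 16..19 and the fifth at 1016, so
     0..15 and 1000..1015 stay free for metadata.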

  When width > 1 it makes sense to replicate each state block across
     every device - as we want to write the whole stripe anyway.
  For now we only write and read the first two copies at the beginning, and
  the last two at the end...

  Question:  what do we want to do about metadata on flash devices?  We really
   don't want a small number of locations to store the metadata, but a large
   number that we search through - possibly a binary search. 
   These could be all at start/end or scattered throughout the device.
   The latter would make it impossible to find efficiently - there is no way to
   create useful linkage without writing something else at start or end.
   As many devices optimise for random writes where the FAT table would be,
   it makes sense to just put the metadata there and not at the end.
   We should allow one 'page' for each metadatum, which probably means
   32K.
   So we should allow all state blocks to be near the start.

01mar2011 - Autumn arrives.

  Time to add handling of 'atime' and non-logged files.

  The idea is to have a separate file for storing only 'atime'
  This is separate from the inode file because the volatility of the data
  is very different and one of the principles of log-structured-fs is that
  differently volatile data should be kept separate.

  This does mean that an inode lookup requires getting data from two files,
  but it is hoped that the 'atime' file will mostly be in cache as each
  block contains the atime for lots of different inodes.

  The atime file contains 2 bytes for each inode, so with a block size of 4K,
  each block would hold info for 2048 inodes.  1 million inodes would require
  2 megabytes.
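
  As a small worked example of that arithmetic, here is where the entry for
  a given inode would sit, assuming entries are simply indexed by inode
  number (the notes don't say so explicitly).  The names are made up.

```c
#include <assert.h>
#include <stdint.h>

#define ATIME_ENTRY_BYTES 2u   /* 16 bits per inode, as described above */

struct atime_slot {
	uint64_t block;   /* block number within the atime file */
	uint32_t offset;  /* byte offset of the 16-bit entry in that block */
};

/* Locate the atime entry for 'inum', assuming direct indexing by inum. */
static struct atime_slot atime_slot_for(uint64_t inum, uint32_t blocksize)
{
	uint32_t per_block = blocksize / ATIME_ENTRY_BYTES;
	struct atime_slot s;

	assert(per_block > 0);
	s.block = inum / per_block;
	s.offset = (uint32_t)(inum % per_block) * ATIME_ENTRY_BYTES;
	return s;
}
```

  With 4K blocks, inode 2048 is the first entry of block 1, and the last of
  a million inodes lands in block 488.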

  The 16bits are treated as a positive floating point number which
  gets added to the atime stored in the inode.  The lower 5 bits are
  the exponent, the remaining 11 bits are mantissa.  Though there is a
  little complexity in interpreting the exponent.
     If the exponent is 0, the mantissa is used as milliseconds -
       so shift right 5 and multiply by 1000000 for nanoseconds.
       The smallest change that can be recorded is 1 millisecond,
       and values up to (2^11-1) milliseconds - or 2 seconds - can be stored.
     If the exponent is 1 to 10, the mantissa has a '1' appended as a
       new msb, and is shifted by the exponent-1 and then treated as milliseconds.
       This ranges up to 2^(12+9) milliseconds or about 35 minutes, where
       the granularity will be 2^9 millisecs or 0.5 seconds

     For exponents from 11 up to 31 we add the 1 msb and treat
       the number as seconds after shifting (e-11).  So at e==31,
       we shift a number that is
       up to 4095 by 20 to get nearly 2^32 seconds or 136 years.
       At this point the granularity is 2^20 seconds or 12 days.


   So overall we can update the atime for 136 years without needing to
   update the inode, and can record differences of 1msec for the first
   couple of seconds, then gradually less granularity until we are
   down to one second an hour after the last change, and 4 hours a
   year later.

   To convert a number of seconds to this format:

   If >= 2048 seconds, we shift down until less than 4096 seconds
   counting the shift.  We add 11 to that number to form exponent,
   and shift the resulting mantissa up 5, or with exponent, and mask
   out bit 16.

   Otherwise we convert to milliseconds (divide nanoseconds by 1000000 and
   multiply seconds by 1000, and add). Then if < 2048, we shift up by
   5 leaving a zero exponent and use that.

   Otherwise we shift down until < 4096 counting shifts, add 1 to the
   shift to form an exponent, and combine with mantissa as above.
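
   The conversion rules above can be sketched in plain C roughly as follows.
   This is an illustration of the format as described here, not the kernel
   code; the function names are invented, and the decoder returns
   milliseconds only to keep the example short.

```c
#include <assert.h>
#include <stdint.h>

/* Shift *v down until it fits in 12 bits; return the number of shifts. */
static int normalise(uint64_t *v)
{
	int shift = 0;

	while (*v >= 4096) {
		*v >>= 1;
		shift++;
	}
	return shift;
}

/* Encode a non-negative delta (sec, nsec) as a 16-bit atime offset. */
static uint16_t atime_encode(uint64_t sec, uint32_t nsec)
{
	uint64_t m;
	int e;

	if (sec >= 2048) {
		m = sec;
		e = normalise(&m) + 11;        /* seconds, exponent 11..31 */
	} else {
		m = sec * 1000 + nsec / 1000000;
		if (m < 2048)
			e = 0;                 /* plain milliseconds, no implied msb */
		else
			e = normalise(&m) + 1; /* milliseconds, exponent 1..10 */
	}
	assert(e <= 31);  /* caller must keep the delta below about 136 years */
	/* Truncating to 16 bits masks out bit 16, the implied msb. */
	return (uint16_t)((m << 5) | (uint64_t)e);
}

/* Decode a 16-bit atime offset to milliseconds. */
static uint64_t atime_decode_ms(uint16_t v)
{
	unsigned e = v & 0x1f;
	uint64_t m = v >> 5;

	if (e == 0)
		return m;
	m |= 0x800;                            /* restore the implied msb */
	if (e >= 11)
		return (m << (e - 11)) * 1000; /* value was in seconds */
	return m << (e - 1);
}
```

   For example 5 msec encodes as mantissa 5 with exponent 0, 3 seconds as
   3000 msec with exponent 1, and 10000 seconds as mantissa 2500 with
   exponent 13; all three decode back exactly.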

   So that is the format - how do we implement it?

   We don't want to expose to user-space numbers that we cannot store.
   So any 'utimes' call that updates the inode directly can clear the
   value in the atime file.  Only updates due to accesses go to the atimes
   file.
   We define a 'getattr' function which looks at the atime stored in
   the vfs inode and if it has changed we need to deal with it.
    - if the inode is still dirty we simply update the lafs inode
      and use the number as-is, clearing the atimes entry
    - else we subtract the stored atime from the new atime.  If this
      is negative or exceeds 136 years we mark the inode dirty and
      store it there.  If we cannot mark the inode dirty for some
      reason we just store all 1s in the atime file.
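
    The decision just described could look something like this.  The enum and
    function names are invented for the sketch, and the exact upper bound
    (4095 << 20 seconds, the largest value the 16-bit format can hold) is my
    assumption for "136 years".

```c
#include <assert.h>
#include <stdint.h>

enum atime_action {
	ATIME_TO_INODE,  /* write atime into the inode, clear the atime-file entry */
	ATIME_TO_FILE,   /* store the encoded delta in the atime file */
	ATIME_SATURATE   /* store all 1s in the atime file */
};

/* Assumed limit: the largest delta the 16-bit format can represent. */
#define ATIME_MAX_DELTA_SEC (4095LL << 20)

/* Decide where a changed atime goes, following the rules in the notes. */
static enum atime_action atime_choose(int inode_dirty, int can_dirty,
				      int64_t stored_sec, int64_t new_sec)
{
	int64_t delta = new_sec - stored_sec;

	if (inode_dirty)
		return ATIME_TO_INODE;
	if (delta >= 0 && delta <= ATIME_MAX_DELTA_SEC)
		return ATIME_TO_FILE;
	/* Negative or too large: needs the inode, if we can dirty it. */
	return can_dirty ? ATIME_TO_INODE : ATIME_SATURATE;
}
```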

    The same operation is needed when dirty_inode is called to make
    sure atime updates get saved even when no getattr is called.

    As we always need to be able to update the atime file, it needs to
    be permanently pinned whenever an inode is read in.  For
    non-logged files this should be cheap but we must do it anyway as
    the file might not be non-logged.
    So we need to keep a permanent reference to each block while the
    inode is loaded.  That can keep it pinned.


    We don't want updates to the atime file to be flushed in any great
    hurry, especially if it is a logged file.  We would be quite happy
    to only write at 'unmount' and probably 'sync'.
    So we want to stop the pages from appearing dirty in the page
    cache (PAGECACHE_TAG_DIRTY), and the inode from appearing dirty
    (I_DIRTY).
    We can still keep them dirty in lafs metadata so if release_page
    is called we can schedule a write out then.


   So some steps:

    1/ load atime file at mount time - there is one for each
      filesystem.  It has inum of 3 and type of TypeAccesstime (6).
      Also release it on unmount.

    2/ loading an inode must take a ref to the block in the atime file
      if it exists.  A new inode flag records if this has happened.
      Unless mounted noatime, we pin the block and reserve space.

    3/ getattr and dirty_inode must resolve any issues with the
       atime.  So lafs_inode probably needs an extra field to be able
       to check for changes


  Hmm.. this is getting confusing...
  When atime is changed the only way we find out is by ->dirty_inode
  being called.  But that is called when anything is changed.
  Filtering out whether or not we need to update the inode itself
  is awkward... maybe there is some context we can use.
  ->dirty_inode is called by mark_inode_dirty which is called:
   - by touch_atime, if something changed
   - file_update_time  - at which time we also update iversion
   - setattr ... which has changed recently (2.3.37ish)
   - page_symlink
   - generic_file_direct_write - which increases size of inode
   - set_page_dirty_nobuffers

  So either the inode is pinned, or it isn't.
  If it isn't, then this *must* be an atime-only update.
  If it is, then it could be anything, but in any case we update the
  atime directly.
  So: dirty_inode should try to get dblock and check if it is pinned.
   If it is pinned, then update the atime immediately and the offset
   in the atime file too.
   If not, just update the offset


03mar2011
  ARGggg... checkpin is interfering with unmount - it keeps an
    s_active count so unmount 'works' but doesn't release anything.

  checkpin is needed to ensure that inodes remain safe while
  we are cleaning.  Particularly, while the inode index block is
  pinned, we keep the inode and fs referenced as well.  I guess the
  theory is that they won't stay pinned for long - but they do.
  e.g. segusage blocks are permanently pinned.


  We could have a rule about the prime filesystem always being mounted.
  Then we don't need refcounts, but kill off the cleaner before
  unmount...  which we sort-of do..

  All subordinate filesystems have references on the prime_sb so the
  prime_sb must be the last one to go.  When it goes it kills
  everything off...
  So we don't need checkpin to take a ref on the prime_sb.

  There might be still an issue with files in subset filesystems
  being permanently pinned so they stay around longer than they
  should... need to check on that somehow.
  The idea is that a quota file block is permanently pinned so it
  will keep the fs pinned.  That in turn will keep everything else
  pinned... Worry about that when we implement quotas FIXME

04mar2011
  I really need to sort this out, and it isn't easy...
  We really want to know when "all" filesystems have been unmounted
  so the block device(s) can be released and the cleaner stopped.
  But we don't have a count for that.  We could if that was all
  we counted - but that would mean that we only have a single
  struct super_block for all filesystems.

  So that is what I have to do.  A single super_block for all parts
  of the filesystem.  I probably still need to allocate other
  dev numbers for stat->dev, but I don't need to use them internally.
  Maybe I even allocate superblocks... Yes - we need to use
  set_anon_super and kill_anon_super to allocate the numbers.
  lafs_inode will need a pointer to the filesystem - we use that
  instead of the sb.

  -------

  Testing...
   bug at block.c:658.  Block not B_Valid in lafs_dirty_iblock from
   lafs_allocate_block  from cluster_flush.
   Block is 74/0: InoIdx block of a newly created file I think.
    '74' was /f23, then  /mnt/1/adir.  We are creating file in that
   dir.
   This is a depth=0 InoIdx block - i.e. the data is in the
   dblock, so there is no index info, so it kind-a makes sense for the
   index block to not be Valid.
     yes- commit d268a566605bf006cf33c confirms that.

   So why are we trying to dirty it?..

   Maybe:
     We create a couple of directory entries, then flush and end up
     with an in-line data block.
     Then we add more, flush again and so try to dirty parent...
   Where do we turn depth=0 inodes to depth=1??
      - erase_dblock_locked - don't want that
      - lafs_incorporate
   So I guess the 'bug' is in error - it is OK to mark that invalid
   block as dirty.

04mar2011
  So - back to the super_block reworking.  We want only one
  superblock.
  So we use the TypeInodeFile inodes a bit more to hold the details
  of different filesystems.  We need to store a unique 'dev' number in
  there: use set_anon_super/kill_anon_super on a local 'struct
  super_block' and copy s_dev in/out.

  As we only have one sb, we can only have one fstype, so we cannot
  use the fstype to choose what to do.
    - if dev_name is a block device we try a normal mount
    - if dev_name is an Inode file, we perform a subset mount
    - if dev_name is a lafs dir and '-o snapshot=name', we mount that
      snapshot
    - if dev_name is a lafs dir in root with perm zero and
      '-o subset=MAXSIZE', create a subset filesystem.

  - lafs_iget needs an inode rather than a superblock
    ditto for lafs_new_inode, lafs_inode_inuse, inode_map_free,
    choose_free_inum, inode_map_new_prepare
  - lafs_iput_fs,lafs_igrab_fs, ino_from_sb

  - NFS filehandles need careful thought
     They are 'per-super-block', not 'per-vfsmnt' which might be
     better.
     We could change that but.....
     For non-snapshot files it is easy - just record two inodes, the
     fs and the target.
     For snapshots there is nothing that is really stable.
     Maybe we could have different superblocks for snapshots.
     The snapshot doesn't need the cleaner as it is read-only, though
     the cleaner can need the snapshot...

     So the cleaner might automagically mount a snapshot, but a
     snapshot will never invoke the cleaner or any other thread stuff.

  So I guess we want one superblock for the fs and one for each
  snapshot.
  The filehandle is then either inum+gen or inum+inum+gen where first
  inum must be TypeInodeFile
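
  A sketch of the two filehandle shapes, packed as 32-bit words.  The struct,
  the 'has_fs' flag and the word order are assumptions for illustration; only
  the inum+gen / inum+inum+gen idea comes from the notes above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical in-memory form of a filehandle. */
struct lafs_fh {
	int      has_fs;   /* non-zero: long form, fs_inum is present */
	uint32_t fs_inum;  /* inode number of the TypeInodeFile fileset */
	uint32_t inum;
	uint32_t gen;
};

/* Pack as [inum, gen] or [fs_inum, inum, gen]; returns the word count. */
static size_t fh_encode(const struct lafs_fh *fh, uint32_t out[3])
{
	size_t n = 0;

	assert(fh && out);
	if (fh->has_fs)
		out[n++] = fh->fs_inum;
	out[n++] = fh->inum;
	out[n++] = fh->gen;
	return n;
}

/* Unpack; the length alone tells the two forms apart. Returns 0 or -1. */
static int fh_decode(const uint32_t *in, size_t n, struct lafs_fh *fh)
{
	if (n != 2 && n != 3)
		return -1;
	fh->has_fs = (n == 3);
	fh->fs_inum = fh->has_fs ? *in++ : 0;
	fh->inum = in[0];
	fh->gen = in[1];
	return 0;
}
```

  A real decoder would also have to check that fs_inum really names a
  TypeInodeFile inode before trusting the rest.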

07mar2011
  ... though I could just put a snapshot number and partial timestamp
      in..


08mar2011
 This isn't a new to-do list, it is a list of the main features that are
 still not implemented:
   - full 2D layout
        + at very least I don't pad with zeros yet
        + if stripe size were multiple of 3*3*5*7*2^N, then changing
          width might be manageable.
          e.g. stripe size: 40320 blocks.. But with megabyte chunksizes,
          we really want 32bit segsizes and 322560 block segments.
   - non-logged files - with interface to request access-time file
   - quotas
   - snapshots:  particularly cleaning
   - error handling
   - metadata (inode/directory/etc) CRCs and duplication
   - fsck / debugfs


  What would fsck do?
   - locate and validate device and state blocks.
   - locate and validate checkpoint cluster.
   - locate and validate filesystem root
   - roll forward to collect segusage and quota blocks.
   - load inode map, read inode file, validate each inode and make sure
     map is correct.
   - explore each file, following all indexing, count segusage for each
     segment and make sure segusage file is consistent.
   - check no block is allocated twice.  This might require multiple passes,
     each time we examine a different collection of segments.

   - checking a file requires:
          - checking inode is consistent
          - checking index blocks are consistent with depth
          - checking index/extent blocks are sorted with no overlaps
          - checking block/iblock counts are correct.
   - checking all cluster headers in the current segment to ensure they
     look consistent and agree with file information. i.e. if cluster_header
     identifies a block, the block must live there, or later in the segment.

   - scan all directories looking for consistency of hash etc.  Count links
     for all inodes.  This might need to be multi-pass too.
     Could use a bitmap for single-link files, and table for others.

   How to fix errors.
     - First must find segments which are not in use according to segusage file
       or according to block search.
       If there are none, require a new device be provided.
     - If anything looks incorrect, write corrected version to new segment
       Then write out new segusage files

   In some cases we might need to search all write-clusters for missing blocks??
   That could take a very long time!


   What do I really want to do about CRCs and hashes.
    It might be nice to store a hash for each block in the index block.
    But that wastes precious index-block space.
    If I store a CRC together with address info in the block, then I could
    be fairly sure it is the right block.  So e.g. inodes store the inode number,
    Index blocks could hold inode+depth+address.
    Last 8 bytes of each block could be a 4byte CRC and a 4byte identity.
    identity is XOR of fsinum inum blocknum generation - or a CRC of these.

    Actually, we don't need to store the identity info - we just need to
    include it in the CRC.  That either saves space, or allows more bits to
    be used for the CRC, which is probably the best use of bits for detecting
    errors.
    Though it might be nice to store phys-addr in the CRC too, we cannot as
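
    A sketch of the "include the identity in the CRC without storing it" idea.
    It uses a plain bitwise CRC-32 rather than whatever the kernel would
    provide, and the field widths and byte order of the identity are
    assumptions; the names are invented.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-32 (reflected polynomial 0xEDB88320), continuing from 'crc'. */
static uint32_t crc32_update(uint32_t crc, const void *buf, size_t len)
{
	const unsigned char *p = buf;

	crc = ~crc;
	while (len--) {
		crc ^= *p++;
		for (int k = 0; k < 8; k++)
			crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1)));
	}
	return ~crc;
}

/* Identity of a block; widths are assumed, not taken from the format. */
struct blk_ident {
	uint32_t fsinum, inum, blocknum, generation;
};

static void put_le32(unsigned char *p, uint32_t v)
{
	p[0] = v & 0xff;
	p[1] = (v >> 8) & 0xff;
	p[2] = (v >> 16) & 0xff;
	p[3] = (v >> 24) & 0xff;
}

/* CRC over identity followed by data; only the CRC needs to be stored. */
static uint32_t block_crc(const struct blk_ident *id, const void *data, size_t len)
{
	unsigned char hdr[16];

	assert(id);
	put_le32(hdr, id->fsinum);
	put_le32(hdr + 4, id->inum);
	put_le32(hdr + 8, id->blocknum);
	put_le32(hdr + 12, id->generation);
	return crc32_update(crc32_update(0, hdr, sizeof(hdr)), data, len);
}
```

    A reader recomputes block_crc() with the identity it expected to find; a
    block that belongs to a different inode then fails the check even though
    its data is intact.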

21mar2011
  My short-term todo list is:
DONE  - get 'lafs' to the stage where I can create an fs requiring roll-forward
DONE  - use 'lafs' to create images for testing, so I don't need 'fred.safe' any more.
DONE  - Make lots of 'layout' changes - see 15cb

02may2011
  - 'run' goes to completion, but segusage isn't updated in the final cluster
       and the number left over from before looks wrong.
DONE  - 'ls -l' on a subset file gets confused.
  - fs created by 'lafs' has wrong Blocks and Inodes counts
  - we lose a ref to a segsum and sometimes put it too often.
REFCNT 1 [ce0ffc48]0/182(2535)r0E:Valid,Claimed,PhysValid NP
REFCNT 1 [ce055b9c]0/187(2535)r0E:Valid,Claimed,PhysValid NP
REFCNT 1 [ce0445d8]0/182(2535)r0E:Valid,Claimed,PhysValid NP


03may2011
  Once I have these bugs sorted out I want to make some format changes.

   DONE - fs_metadata need a 'parent' link
        rename needs to be careful about what is updated!
        so does roll_mini
        lafs_get_parent needs some thought.

   DONE - roll-forward should get exact mtime stamps, and ctime.
     So each data block must have an exact timestamp
     of when the change actually happened.   Or the group_head
     has a timestamp for the most recent update to the file
     As we use nanosecond timestamps (pointless though they are)
     we need 30 bits for the nanoseconds and at least 11 for the seconds.
     So 48 bits (6 bytes) is plenty.
     So include a 64bit timestamp in the cluster_head and 48bit
     number to subtract in the group_head
     But saving 2 bytes per file isn't really worth it, and we may
     well lose it in padding.  So just store a 64bit timestamp in
     the group_head.

   DONE - use CRC in place of all checksums - lafs_calc_cluster_csum

   DONE - state block flags for inconsistencies found
	If any inconsistency found, fsck is advised.
	For some it may be imperative.
	Things that can be wrong include:
	- generic read error
	- segusage negative
	- index block incoherent
	- dir block incoherent
	- link count negative
	- cluster header incoherent
	-
	64 bits should be adequate and simple for this.
	Any unknown bit requires a full fsck.
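
	A sketch of how such a flag word might be interpreted.  The bit
	assignments and names are hypothetical; only the list of problems and
	the "unknown bit means full fsck" rule come from the notes above.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical bit assignments for the 64-bit inconsistency word. */
#define INCON_READ_ERROR      (1ULL << 0)
#define INCON_SEGUSAGE_NEG    (1ULL << 1)
#define INCON_INDEX_BLOCK     (1ULL << 2)
#define INCON_DIR_BLOCK       (1ULL << 3)
#define INCON_LINK_COUNT_NEG  (1ULL << 4)
#define INCON_CLUSTER_HEADER  (1ULL << 5)
#define INCON_KNOWN_MASK      ((1ULL << 6) - 1)

enum fsck_level { FSCK_NOT_NEEDED, FSCK_ADVISED, FSCK_FULL_REQUIRED };

/* Any bit outside the known set forces a full fsck. */
static enum fsck_level advise_fsck(uint64_t flags)
{
	if (flags & ~INCON_KNOWN_MASK)
		return FSCK_FULL_REQUIRED;
	return flags ? FSCK_ADVISED : FSCK_NOT_NEEDED;
}
```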

   DONE - 32bit segment size
        With 16bit at 4K blocks we are limited to 256Meg segments.
	64Meg with 1k blocks.  This takes about 1 second to write on
	a modern drive.  On an array it will take even less time.
	24bits gives 16 to 64 gigabytes which is plenty.
	However 24bits is awkward to access. a 1K block holds 341 1/3.
	A 4K block holds 1365 1/3.
	But this wastes less space than 256 or 1024 and so causes less IO.
	But then we probably want to size segments to be very big.
	A few thousand segments should be OK, which is tens of blocks.
	I don't think the savings with 24bits are worth it, and I do
	think v.big segments could be useful, so let's go with 32bit segments.

	Youth is currently tuned to 16bits.  Let's leave it there and
	maybe waste some space.


   - parallel new-data write clusters.
	I think it is sufficient to include a second 'next_addr' in the
	cluster_head - or maybe two.  alt_next_addr[2].
	When a thread wants to start a new stream of clusters it allocates
	the segments then attaches to the next outgoing write cluster.
	Once that is written everything in the new cluster is safe.
	On a checkpoint every stream writes at least one checkpoint cluster
	and these are linked together through alt_next_addr.
	The 'next' cluster for each must be the checkpoint cluster and must
	carry linkage but unlike with first-link, there is no need to wait.
	The data is already safe as long as the state block isn't updated
	until every cluster_end block is written.
	So really, one is enough.  I had thought 2 would enable quick fan-out
	but there is no real need for that.

	As 0 is a valid write-cluster address we use 'this_address' to signify
	that there is no alt-next.

	It is possible that a block of a file could be written to two
	different streams at different points in time between two checkpoints.
	We need to ensure that roll-forward gets these in the right order.
	'seq' can be the same in two different streams so we cannot use that.
	timestamp could possibly be used, but as times can go backwards it
	is not ideal.

	NEW IDEA.  Just use one stream of clusters.  However it can
	bounce from one device to another easily.  So two different
	threads can be building up two different write clusters at the
	same time as long as they synchronise at some point to pass
	addresses around.  They also need some other Verify mode as
	VerifyNext or VerifyNext2 will destroy any parallelism.
	As the point of this is to write to multiple devices in
	parallel, maybe VerifyDevNext{,2} meaning the next header on
	the same device serves to verify this.

   - policies.
	This includes
		maximum number of segments written between checkpoints
		whether data can be cleaned to a particular device
		whether a device can receive new data
		whether metadata duplication is needed
		whether an RO device from a different array is allowed.
	Some of these are per-device policies.  Some are per-array.

	The 'RO Device' thing is special.  I think I want an alt_uuid.
	It works like this:  You assemble the RO array when you
	mount a new filesystem identifying the old as a component.
	So the 'state' block on the new devices must identify the alt_uuid
	and state seq number.

	Do we want to record more info about which devices are in the
	array?  Currently we just record how many.  If we find enough
	with the right UUID/seq, they must be it.  What else would we
	want?

	For all the other policy statements it is probably simplest to
	allow a set of simple strings, e.g. "noclean", "nonew",
	"dup=2", "maxseg=5".
	devblock currently uses 146 bytes, so room for 878
	stateblock uses 112 plus some for snapshots, so much the same.
	We currently don't use 'version' and have no concrete plans.
	The vague idea is to allow lafs to *know* that it cannot mount
	the array, so any incompatible feature gets set.
	We could keep those in the policy sets.  From that perspective
	there are 3 types of things.
	 - if you don't understand, don't worry
	 - if you don't understand, don't try to write
	 - if you don't understand, you cannot even read.

	That last is really best avoided.  We have version info
	elsewhere in the tree so that a new index style will simply
	make that block unreadable.
	So I think give the dev and state blocks a simple incrementing
	version number which applies to that block, and have "don't
	worry" and "don't write" policies distinguished by first
	letter.
	Capital is "If you don't understand, don't write"
	Lower is "if you don't understand, don't worry".

	These are space separated strings
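
	A rough sketch of how a mount could apply the first-letter rule.
	The list of understood policy words and all the names here are
	assumptions for illustration; only the example strings and the
	capital/lower rule come from the notes above.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

enum mount_mode { MOUNT_RW, MOUNT_RO };

/* Hypothetical set of policies this implementation understands.
 * A trailing '=' means a value must follow. */
static const char *const known_policies[] = {
	"noclean", "nonew", "dup=", "maxseg=",
};

static int policy_known(const char *word, size_t len)
{
	size_t i, n = sizeof(known_policies) / sizeof(known_policies[0]);

	for (i = 0; i < n; i++) {
		const char *k = known_policies[i];
		size_t klen = strlen(k);

		if (k[klen - 1] == '=') {
			if (len > klen && strncmp(word, k, klen) == 0)
				return 1;
		} else if (len == klen && strncmp(word, k, klen) == 0) {
			return 1;
		}
	}
	return 0;
}

/* Walk the space separated policy words.  An unknown word starting
 * with a capital forces read-only; an unknown lower-case word is
 * ignored. */
static enum mount_mode policy_mount_mode(const char *policies)
{
	enum mount_mode mode = MOUNT_RW;
	const char *p = policies;

	assert(policies != NULL);
	while (*p) {
		size_t len;

		while (*p == ' ')
			p++;
		len = strcspn(p, " ");
		if (len == 0)
			break;
		if (!policy_known(p, len) && isupper((unsigned char)p[0]))
			mode = MOUNT_RO;
		p += len;
	}
	return mode;
}
```

	The "cannot even read" case is deliberately absent: that is left
	to the per-block version number.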

   - etc.

   - what about i_version?  Include in timestamp?