So, let's try to write a kernel module that implements this filesystem.
It would be good to have a plan.

- Mount filesystem, providing empty root directory
   o parse mount options - DONE
   o find/load superblocks and stateblocks - DONE
   o present empty directory - DONE
   o Compile external module - DONE
   o test - DONE
- Mount filesystem read-only with no roll-forward
   o IO address mapping
     sync_page_io or bread? - not bread I think
   o Index blocks management
   o search cluster-header for root inode
   o file read
   o Directory lookup/read
   o test
- Support roll-forward for blocks, orphans, whatever
   o manage segusage files
   o manage quota files
- Support writing
   o inode bitmap
   o cluster creation / block sorting
- Support Cleaning
- Interface for snapshots and other admin

------------------------

FIXME If a device is removed from the filesystem, we cannot reliably
 tell from the other devices or state that this is so.  Maybe we need
 to update all devblocks with a new 'seq' number...

FIXME How do we specify mounting subordinate filesets?  What superblock
 do they have?  I suspect we do a -F lafs-sub mount from the original
 filesystem.

FIXME If mount fails, we seem to be leaving a super lying around, and
 sync_supers dies on it. - DONE

FIXME Umount appears to work, but a sync_supers dies. - DONE

FIXME subordinate supers aren't being locked as much - is that a problem?

FIXME index pages never get put on an LRU - how is this supposed to work?

--------------------------

Thoughts:
 Inodes live in an address-space, much like a file.  To load the first
 inode we need an address-space, so we may as well have a 'struct inode',
 as we may want to expose it to user-space.
 Loading an inode needs the fs (lafs filesystem structure),
   which needs the subfs (maybe a lafs inode),
   which needs the snapshot - this is implied by the subfs inode.
 And the fs can be obtained from the inode, so just: inode, inum.

UPTO 03nov2005
 review block_leaf_find and make_iblock
 need to do setparent and block_adopt

next 10nov2005
 need to resolve locking for the ->siblings list

24nov2005
 peer_find
 lock_phase
 lafs_refile

 I can read a file.....!!!!!

 Code review / tidy up.
 resolve locking
 buffer vs page
 Export on a web page somewhere??

16feb2006
 (I spent a while getting large directories to work again in the
  prototype.. and some holidays).
 - Priority:
   - clean mount and unmount
   - large directories
   - multiple devices.

FIXME how do we record and handle write errors???

The iput in lafs_release - which is needed - is oopsing at iput+0xe!

23feb2006
 Ok, I finally have a clean mount/unmount. .. not quite.
 Blocks being freed at unmount still have a refcnt, which is bad.
 Next:
  - make sure we can handle 'large' directories.
  - make sure we can handle files with indexes
  - handle filesystems that span devices.

02mar2006
 Hurray - clean unmounts!!!
 There is a nasty circular reference: the root inode is stored in a
 block that it manages.  Maybe this should not happen, rather than
 having to be explicitly broken - the root-block can live elsewhere,
 not in the inode.
 Next, multi-level index blocks.  But first, I need to understand
 memory pressure and pageout.
  How are dirty pages found to be cleaned?
  How is pressure put on a filesystem to clean up?
  How are clean pages reaped?
  - call pagevec_lru_add{,_active}(pvec) to put the page on an LRU.
    lru_cache_add{,_active}(page) might be easier, but isn't exported.
  - call mark_page_accessed(page) to keep the page 'active'.
  (see the sketch further below)

09mar2006
 - make sure indexes work...
   lafs_load_block+0xf  eax,bx,cx,dx,s1 all zero
   from block_leaf_find 203
   ...
 OK, indexes seem to work.
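Before moving on - the LRU sketch promised under 02mar2006 above.  A
minimal sketch, assuming the 2.6-era pagevec calls named in that note;
lafs_lru_add() and its pagevec argument are an invented wrapper, not
anything in the real code:

    #include <linux/pagevec.h>
    #include <linux/pagemap.h>   /* page_cache_get() */
    #include <linux/swap.h>      /* mark_page_accessed() */

    /* Batch freshly-added pages onto the inactive LRU via a pagevec,
     * mirroring what the unexported lru_cache_add() does internally.
     */
    static void lafs_lru_add(struct pagevec *pvec, struct page *page)
    {
            page_cache_get(page);           /* the LRU holds a reference */
            if (!pagevec_add(pvec, page))   /* pagevec now full: drain it */
                    __pagevec_lru_add(pvec);
    }

The caller would pagevec_init(&pvec, 0) up front, call
pagevec_lru_add(&pvec) at the end to drain any remainder, and call
mark_page_accessed(page) whenever a page is re-used so it stays active.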
But 'lafs' has problems creating some large files.  Try 'tt'.
This is due to not handling errors properly.. fix it later FIXME

16mar2006
 Must make sure the index address-space gets cleared up...  I wonder
 how we find all the pages to free.  This might be one reason to keep
 them in a radix tree.  Though we should be able to walk our own data
 structures.
 Then work on mounting a 2-device filesystem.

FIXME dir_next_ent always starts from the beginning rather than
 remembering where it is up to... can this be fixed??

18mar2006 (Wedding anniversary, and Saturday ... during the Commonwealth Games)
 Mounting a snapshot needs a way to identify that it is a snapshot
 mount, and which snapshot, and which filesystem.  We could use a
 different filesystem type, but that isn't really needed.
    mount -t lafs -o snapshot=name /original/mount/point /new
 This grabs the named snapshot of /original/mount/point and places it
 at /new.  The 'snapshot=' option is the trigger.

 For a control FS, we
    mount -t lafs -o control /original/mount/point /new

 To grow a filesystem, we initialise a device (super/state blocks) and
    mount -t lafs -o remount,new=/dev/name whatever /original/mount/point
 as the dev_name isn't passed to remount.

 So, mount options are:
    snapshot=name
    dev=/dev/device
    new=/dev/device
    control
 and various name=value pairs matching what is exposed in the control
 filesystem.  (A parsing sketch appears further below.)

23mar2006
 - factored out super-block finding preparatory to finding snapshots.

 Thoughts: superblocks for snapshots and subordinate filesystems do
 not get stored in the 'state'.  There is, however, a usage count so
 that the prime filesystem cannot be unmounted until all snaps and
 subs are gone.  This should just refcount the prime_sb I suspect.
 So: a snapshot sb points to the 'struct fs' but doesn't .... what???

30mar2006
 - remove the super-block finding code by changing the layout to store
   superblock locations explicitly :-)
 - teach 'mount' to mount snapshots.
 - need to audit for bad use of ss[0]
 - need to find a better way to map 'sb' to snapshot number.
 - need to make unmount work.

01apr2006 (no, really!!)
 - rewrite index to kmalloc index blocks and use a shrinker to free
   them.  This means that an indexblock no longer has a 'page', which
   makes sense.  It also means they cannot live in highmem, which is
   sad, but could be fixed.

 Notes: superblocks and refcounts.
  Each device holding the filesystem gets a superblock.  One of these
  (arbitrarily) is the 'prime' superblock and gets to manage the whole
  filesystem.
  Each snapshot also gets a superblock, as does each subordinate
  filesystem.  These are anon sbs - using an anon dev.
  Each anon sb takes a reference to the 'struct fs', and also to the
  prime sb....  How about the reference relationship between fs and
  prime_sb???  Need to ponder this.

 - problem with getting the parent superblock due to semaphores...
 - when unmounting, put_super isn't being called, so inode 0 isn't
   released!

13apr2006 (Took a week off to play with rt2500 wireless cards)
 - Use a different filesystem type for snapshots and subordinate
   filesystems.  This removes the semaphore problem.
 + OK, mount and unmount works for snapshots... what next?
 - review index block - worry about himem?
 - review ss[0] usage - OK
 - general code review

FIXME - what should leaf_lookup/index_lookup return on a format error?
 They currently return '0', which will quietly make an empty block.
 Maybe '-1' would be better, to make an error block.
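Looking back at the mount options listed under 18mar2006, here is the
parsing sketch promised there.  It uses the kernel's match_token()
helper from <linux/parser.h>; the token table, lafs_parse_options()
and its out-parameters are assumptions, not the real code, and the
extra name=value pairs are left as a TODO:

    #include <linux/parser.h>
    #include <linux/string.h>
    #include <linux/errno.h>

    enum { Opt_snapshot, Opt_dev, Opt_new, Opt_control, Opt_err };

    static match_table_t lafs_tokens = {
            {Opt_snapshot, "snapshot=%s"},
            {Opt_dev,      "dev=%s"},
            {Opt_new,      "new=%s"},
            {Opt_control,  "control"},
            {Opt_err,      NULL}
    };

    /* Split the -o string on commas and record the options we know about. */
    static int lafs_parse_options(char *options, char **snapshot,
                                  char **dev, char **newdev, int *control)
    {
            substring_t args[MAX_OPT_ARGS];
            char *p;

            while ((p = strsep(&options, ",")) != NULL) {
                    if (!*p)
                            continue;
                    switch (match_token(p, lafs_tokens, args)) {
                    case Opt_snapshot:
                            *snapshot = match_strdup(&args[0]);
                            break;
                    case Opt_dev:
                            *dev = match_strdup(&args[0]);
                            break;
                    case Opt_new:
                            *newdev = match_strdup(&args[0]);
                            break;
                    case Opt_control:
                            *control = 1;
                            break;
                    default:
                            /* TODO: name=value pairs for the control fs */
                            return -EINVAL;
                    }
            }
            return 0;
    }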
FIXME check how other filesystems lock the setting of PagePrivate.
 Maybe we just need to lock_page.
FIXME combine find/load/wait into one operation
Review dir, super, roll, link
FIXME module refcount increases on failed mount!

18may2006
 I've been sick for too long, and not much has happened...  However I
 think more than the above comment says.  I started looking at
 roll-forward and have the basic block parsing in place, so that it
 reports what it sees in the roll.
 Also, the format has been changed a little: the address in the state
 block is the CheckpointStart cluster, and we simply roll forward to
 the CheckpointEnd, and then keep going beyond there - there is no
 longer any walking back to find the start.
 The next step is to start incorporating rolled elements into the
 filesystem:
  - data blocks: shouldn't be too hard.  Don't need to update the
    index pages just yet.
  - inode updates: should be straightforward enough, but care is
    needed as the data might be in multiple places.
  - directory updates: these are probably the most interesting..

 Question: how are symlinks created?  Currently we:
    log the inode creation
    commit the new inode
    log the directory update.
 This allows the 'value' stored in the inode to appear after the
 directory update.  That might be OK for files (which are created
 empty and then extended) but is bad for symlinks (which are created
 atomically).
 So, options include:
  - ensure the inode is in a previous cluster to the directory update.
    This slows things down too much I think.
  - log the content as well.  This is awkward if it is big, certainly
    if more than a block, which is possible.
  - directory updates could be dependent on the inode being valid.
    This is ugly.
  - log the content if it is small, else write the inode, flush, then
    create the link.
 So the fast option is:
    log inode create, log content, log filename
 and the slow/safe option is:
    log inode create, sync file, log filename
 So on roll-forward, if we see the inode we just save the data.
 Saving the whole inode seems attractive, but we want minimal order
 dependence: an inode update in the same cluster as the new inode
 should still override, even though it is earlier.

 Ok, roll-forward is proceeding slowly.  I think I am now
 incorporating new blocks into the tree properly, though the code
 probably won't compile.  It will be nice to test this and see the
 file have the right data.
 The next step would be to include the index incorporation code.
 Then:
  - directory updates
  - segusage summary
  - quota
  - stuff..

08jun2006
 - what exactly should happen when roll-forward finds a file with a
   link count of 0?  Currently all updates get lost - I wonder if they
   are lost safely?
 - roll-forward is getting the size right, but not the content - do I
   need to flag a block that ->phys is valid?
   : Ok, roll-forward picks up new blocks in a file OK, but umount has
   stopped working.  Presumably because there are pages attached to
   the inode which aren't getting released.
 What do we want to do here?  Normally those pages, or their
 addresses, need to be recorded before they are lost.  But on a
 read-only mount we don't care so much.

22jun2006
 Continuing the above thought..  When we roll-forward and pick up the
 pieces of a file, we don't want to allocate pages to hold those
 pieces (and definitely don't want to read them all).  We just want to
 attach the addresses to the parent for incorporation.
 Similarly, after writing dirty blocks in a file we want to be able to
 release them immediately rather than waiting for the addresses to be
 incorporated (as incorporation can be more efficient when delayed).
We could just allow the page associated with a block to be released,
except that the page provides the indexing to find a block.  We might
be able to live without the indexing, and hunt down the indexblock
tree, but living without the mutual exclusion provided by block
indexing would be more awkward.  And the 'struct datablock' still
contains a lot more than is needed.

So maybe we should just have a completely separate structure attached
to the indexblock which lists fileaddr/physaddr.  This could include
extent information.  The trick would be guaranteeing allocation.  We
could either allocate-late, with a fallback of attaching the 'struct
block' or performing an immediate incorporation, or allocate-early and
block the dirtying of a page until there is space to record the new
address.  This last is bound to be easiest.

So: what exactly do we use to store addresses?  Probably a linked list
of tables.  Each table contains a link pointer and an array of
fileaddr/physaddr/extentlen.  But we would need to allocate lots of
these if there are hundreds of dirty pages, and possibly only end up
using a few if they made extents very nicely.  That might be wasteful.

Or we could allocate just one.  When it is full we perform an
incorporation.  But if that causes a page split we are in trouble.  We
could have a spare page, split to it, write out one and wait for the
spare page to be written and freed.  But we cannot just release the
index page as it might still have children.  (I think I've been here
before).

A worst-case scenario involves writing one block where that requires
splitting every index up the tree to the inode.  This requires
arbitrarily many pages to be allocated.  To accommodate this we either
pre-allocate a spare page at every level of the tree down to the data
block (a bit like storage space allocation), which seems very
wasteful, or we make sure we can release one of the split pages, which
seems impossible.

I could decide not to worry about it.  Have a pool of index pages and
hope it always works.  After all, most pages are data pages, and they
can be freed successfully.  We would only have a deadlock if all dirty
memory were index pages, and that seems unbelievably unlikely.  If we
trigger a checkpoint when the count of locked pages hits some limit we
should be safe.

So:
  Keep one table per index block.
  Use simple append and sequential search.
  When the table gets full, force an incorporation.

Do we allocate the table separately, or embed it in the indexblock??
Probably embed it.  Indexblocks that don't need it can be freed at any
time, so the space waste hopefully isn't significant.
How big?  If the file is written sequentially, then everything should
gather into extents, and so it doesn't need to be enormous.  If the
file is written randomly then the index block can be expected to be
'indirect', so incorporation will be cheap.  So 'small' seems ok in
both cases.  Let's say 8.

But wait a minute.....  On a checkpoint we can be getting phys updates
for the prev and next phases.  Next-phase updates cannot be
incorporated until the indexblock has passed on to the next phase.  So
in that case, I think we still keep a linked list of unincorporated
blocks and live with the fact that we cannot free them until the phase
change passes.  That shouldn't be a big problem as it is a limited
time frame - especially for data blocks..

But does this solve our initial problem??  During roll-forward we want
to keep the addresses but not the blocks, and we don't want to force
incorporation.
That means an arbitrary list of addresses attached to an index block.
I guess we could possibly allow incorporation, but I would rather not,
as I want the fs to be able to be read-only nicely.  So that means we
need to have a list of address tables.  Maybe the normal approach is
'add a table if possible, else incorporate'?

OUCH... we may write a block a second time before incorporating the
new address, so when adding an address to the table we need to check
if it already exists.  That could be expensive.  For index blocks
might it even be a different address?  I think not, but the vague
possibility (in the future?) does complicate things somewhat.  Maybe
we just keep things in chronological order and don't worry about
duplicates until incorporate time, when we have to sort anyway.

todo:
  lafs_find_block               DONE
  free_block must free tables   DONE

Unmounting still doesn't work.  The problem is that an index block is
holding a reference on its parent, and parent references aren't
getting cleaned up.  On a read-only unmount I guess we need to walk
the list of leafs, discard any address info, and unlock the blocks.
So that should be the first task for next time.

27jul2006
 Leafs are locked blocks which have no locked children.
  So any locked data block (non-inode) is a leaf.
  Any locked index block with lockcnt[phase] 0 is a leaf.

 OK - fixed numerous bugs, but I can unmount now!!  I can even rmmod
 and insmod and all is cool.

 TODO:
  - review refile and get all the code in there from the prototype
    DONE (I hope)
  - write a combined find/load/wait function and use it  DONE
  - allocate inodes in a single memcache and avoid generic_ip
    HALF DONE. (still using kmalloc, not doing initonce well)
  - review recording of new block addresses
    + make sure we look up there on index lookup - YES
    + make sure ->uninc_next gets transferred to the table at phase
      change.
    + write the incorporation code, as it is tricky
  - review how directory updates can be incorporated into a RO
    filesystem.  No, they cannot.  We need to update the directory.
  - write directory update code
  - write cluster construction code
  - make sure indexblocks with unincorporated addresses get on to
    inc_pending ?? or is locking them enough?

 INCORPORATION - Argggghhhhh.
  The current uninc_table doesn't really lend itself to building index
  blocks... though maybe....
  Question: what happens when an index block disappears?  i.e. it has
  no addresses in it?  We clearly need to remove it from the parent.
  This should be trivial: a direct operation on the parent index
  block, setting some address to 0.  Then the next incorporation pass
  will simply lose that entry.
  OK, that might be all well and good, but how do we sort
  unincorporated addresses so we can merge them?  A linked-list merge
  sort is nice and open-ended, but does waste quite a bit of space in
  pointers.  Or maybe I should just always do small-table
  incorporations.  Is there a way that a bad ordering of writes could
  force a very bad index layout in this case?  i.e. cause a table
  split every time, but new blocks go in the first (full) table.

 OK Decision: always do small-table incorporation.  i.e. not a list of
 blocks: just a table of addresses.

FIXME check the validity of the index type when it is first read in,
 and reject early if it cannot be recognised.

24aug2006
 Took a break from incorporation.  Looking at directories.
 Wrote dir.doc in the module to sum lots of stuff up.
 Issue: dir blocks have an info structure attached.  This includes a
 counted reference to the parent.  How long does this need to hang
 around for??
  - when there is any orphan issue happening, it must stay, via the
    'pinned' flag.
  - when actually performing a dir op, we need to create and maintain
    this info.
 When the last ref of a dir block is dropped, we should drop the
 parent reference.

 Status: free list management mostly done.
 Next:
   create/delete prepare/commit/abort
   orphan handling
   dirty_block
   lock_block

FIXME should dir_new_block zero out the block?  How will commit_create
 know what to do with this block?

NOTE another type of directory orphan is a free leaf block which is on
 the part-free list.

-------------------------------------------------------------
09sep2006 (on the plane to Frankfurt)

 Don't tell me I am rethinking preallocation again???

 TODO
   dirty_inode needs to record the phase it is dirty in.
   inode_fillblock needs to check the current phase and act accordingly.
   see inode.doc
   Make sure the B_Orphan flag is set and used - or discard it.

 How do we commit creating a symlink?  If it is a full block in size
 we cannot make an update record.
  - maybe have two update records?  We cannot guarantee they are in
    the same cluster. ... but if we put the 'make dir entry' last it
    should work.

 Change the 'struct descriptor' definition: the 'block_type' aka
 'length' 16-bit field becomes
    0x0000 -> 0x8000   datablock, possibly a hole - up to 32K.
    0x8001 -> 0xc000   miniblock, up to 16K+
    0xffff             index block.

 Need to write IO routines which decrease the pending-block-count in 'wc'.

 Thinks: a 1TB filesystem with 1K blocks and 4096 blocks/seg gives
 4Meg segments.  That would be 256K segments, which at 2 bytes per
 segment - 512 segments per block - is 512 blocks in each seg usage
 file.

12oct2006
 Need to write:
 - lafs_lock_{d,}block  DONE
   Make sure the block has parents and allocation, and set the locked
   flag and the phase.
 - lafs_flush
   Given a datablock, wait for it to be written out.  This is needed
   before updating a block that is still locked in the previous phase.
 - lafs_inode_init
   Used when creating a new object/inode.
   Given a datablock which is to hold the inode, a type (Type*) and a
   mode, fill in the data block with appropriate data so that when
   lafs_import_inode looks at it, the right stuff happens.
 - phase_flip
 - lafs_prealloc
 - lafs_seg_ref
 - lafs_lock_inode

   lafs_dirty_dblock  lafs_cleaner_pause  lafs_dirty_inode
   lafs_seg_flush_all  lafs_write_all_super  lafs_quota_flush
   lafs_space_use  lafs_cluster_update_abort
   lafs_cluster_update_commit_buf  lafs_cluster_update_commit
   lafs_seg_apply_all  lafs_cluster_update_prepare
   lafs_inode_phase_check  lafs_seg_dup  lafs_dirty_block
   lafs_cluster_update_lock  lafs_checkpoint_unlock_wait
   lafs_orphan_drop  lafs_free_get  lafs_find_next

2nov2006
 - I need to know if a block is undergoing write-IO so that I can
   avoid modifying it in certain circumstances.  But I don't track
   this information.  Options:
    1/ track the info.  This means an extra field in the 'struct
       block', because I still need to know which wc has had a write.
    2/ For blocks that we care about, copy the data on write...
       But we care about all inodes and directory blocks.  That is a
       waste.
   I think we put the extra info in the block.  We need to know which
   wc was used (0,1,2) and which pending cluster in there (0-3), which
   comes to 4 bits.  But we only care about the block for wc=0, and we
   could include the which-pending in the b_end_io, or maybe put it
   all in the low bits of the block pointer....  Need max 4 bits.  Can
   only be sure of 2...
   Maybe: 'which' goes in the bottom two bits of bi_private
          'wc' goes in ->flags

4apr2007 (What a long gap !!)
 - lafs_cluster_update_*
   How do we prepare for a cluster update?  How do we lock it?
   The important thing is that the update can be written.
   That requires that there is space available.  So we need to
   preallocate space and then release it.  It is possible that each
   update might go in a different cluster, so maybe we need to
   preallocate one block per update.  That sounds a little expensive.
   After all, we aren't preallocating a cluster block for every data
   block that is dirty.
   So:  prepare does nothing
        lock preallocates the space - a full block.
        commit copies it in.
   For now at least.

24May2007
 - Can now create and delete lots of files.  This is cool.  But:
   Orphan slots just grow and grow - never to be reclaimed - why?
   After rm f*, 7 files remain.  But rm f* again and they go.
   FIXED - readdir wasn't returning them.
   The size of the directory remains large.
   And sometimes files become ghosts... (try just removing one after
   the first rm f*).
 TODO - process those orphans to clean up the directory.

20June2007 (Happy Birthday Dad)
 - Creating lots of files and then deleting them leaves 5 orphan slots
   for the directory busy, and one for inode 0??

 Directory handling uses the following orphans:
   CREATE: A new index block is created by splitting.  This needs to
           be linked in.
   DELETE: The dirent block we are deleting from.  If it becomes
           empty, it needs to go on the free list.
           The index block we are deleting from.  If it has lots of
           free space it might need to be rebalanced.
           The inode that was deleted.

 - When a file is fully deleted, we need to drop any orphan info...
   DONE
 - Need to do orphan handling of free blocks in the directory, and
   unmerged parents - but there doesn't seem much point as I am going
   to change the directory layout (again).

 So: writing to a file.  We need prepare_write, commit_write, and
 writepage.
   Prepare loads and links the page and checks there is space.
   Commit marks it as dirty so writeout is possible.
   writepage chooses a page to write out.

25June2007 - HACK week, thanks Novell!!
 - write - DONE
 - sync  Somewhat done.  Need to revise the process whereby async
   completion clears PageWriteback.  We need locking in there, and
   need to worry about 'which' wrapping too soon.  Need to not start
   IO before we set page writeback.
 - chmod  Maybe, but syncing to disk needs more thought.
 - 'df'   Partly done, need actual content.
 - mkdir  Can make a directory, but creating the first entry fails.
   - FIXED
 - symlink
 - readlink
 - new directory structure.

27Jun2007 - More HACK week :-)
 - new directory layout done - much easier!!
 - If I delete a file that was created, the blocks still have a
   ref-count and we crash.
 - mkdir doesn't increase the link count on the parent. - FIXED

 TODO:
   Orphan handling.
     Infrastructure to process orphans
     Handle specific cases
     flush orphans at key times.
     load orphans at roll-forward checkpoint
   Write out a checkpoint (when?)
   Make sure the refcount goes back to zero on blocks I write.
   Check on inode_phase_check and checkpoint_unlock and inode_dirty in
   all directory operations.

 FIX: Writing a small file leaves something non-dirty but due to be
      written, and lafs_cluster_allocate complains.
      - seems to work now.
 FIX: dir_handle_orphan doesn't lock the orphan transaction required.
 FIX: rm of a file with (small) content hangs waiting in sync_page in
      truncate_inode_pages.
 FIX: lafs_allocate hasn't been written!!!
 FIX: before updating any block in a depth=0 file, we must first load
      and 'lock' block 0.

29Jun2007 - still HACK week.
 Summary of how incorporation works.
 Each index block has a small table for unincorporated changes,
 i.e. block numbers and their addresses.  This supports efficient
 storage of extents, and is extensible by allocating more tables.
 This last is done rarely.
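To make that concrete, a rough sketch of what such a table might look
like.  The fields and the size of 8 come from the notes above;
everything else (names, types, the chaining pointer) is an assumption:

    #include <linux/types.h>

    #define UNINC_SIZE 8    /* 'small' per-indexblock table, as decided above */

    /* One unincorporated address: "file block f, for n blocks, now lives
     * at physical address p".  Sequential writes collapse into extents.
     */
    struct uninc_entry {
            u32     fileaddr;       /* first block number within the file */
            u64     physaddr;       /* device address the block was written to */
            u16     nblocks;        /* extent length, in blocks */
    };

    /* Embedded in each index block.  On a writable filesystem a full table
     * forces an incorporation; during roll-forward / read-only mounts we
     * chain another table via 'next' instead.
     */
    struct uninc_table {
            struct uninc_table *next;   /* extra tables: roll-forward only */
            int     count;              /* entries currently in use */
            struct uninc_entry ent[UNINC_SIZE];
    };

Lookups would then check the table(s) first and the children lists
second, exactly as the next paragraph says.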
When a block gets a new address, this is added to the table or, if
there is a phase mismatch, it is added to a list until a phase change
happens (so the whole block is pinned pending the phase change).
If the table is full then:
 - if the filesystem is read-only (including during roll-forward), a
   new table is allocated (else roll-forward fails).
 - otherwise we incorporate the table into the block, then add the new
   address to the (now empty) table.
If incorporation requires that we split the index block, we allocate
one from a pool.  If there are none in the pool, we wait.  As the
table is much smaller than a block, the incorporation into two blocks
will always succeed.  The 'uninc_next' and 'children' lists will then
need to be shared between the two blocks before the new address is
added to whichever table is appropriate.

When looking for a block address, we must always check the table and
then the children lists.  We do not need to check uninc_next as they
will always be children.

How do we ensure that the pool always has sufficient index blocks and
we don't deadlock?  We have two halves of the table, one for each
phase.  Before we allow a block to be dirty in a phase, we ensure that
the pool has adequate index blocks for that phase, e.g. twice the
depth of the block.  If it doesn't, we block the dirtying until space
becomes available.  For syscall writes this is easy, as we catch it in
prepare_write.
When we perform a phase change, we must be sure there are enough index
blocks for the deepest block that will stay dirty.  If there aren't,
we need to flush all dirty blocks, and unmap all writable mappings,
before starting the checkpoint.

FIX: need to work out lifetime rules so that inodes hang around while
     they have blocks.  Currently have an igrab that is never put.
FIX: Dirty isn't cleared until 'flush', but do_checkpoints requires
     'alloc' to clear it.

3Jul2007
 Checkpoint flushing is getting close.
 Current problem: InoIdx blocks are not changing phase.  Phase change
 should happen when all children have been incorporated, and then the
 write has been triggered, marking us clean.  For InoIdx blocks, we
 need to be marked clean when the data block completes.

5jul2007 - a week off
 Checkpoint flushing seems to work !!!!

 FIX: what should the filesize of a symlink be?  Other filesystems use
      len, but still zero-terminate for the vfs.

 Problem.  A chmod is followed immediately by an unlink, then a
 checkpoint.  The chmod update gets into the checkpoint cluster, but
 the unlink completes before the checkpoint is finished, so the new
 superblock sees the file as gone.  Roll-forward finds the update and
 wants to update a missing file.
 This isn't a big problem, but with slightly different details it
 could be.  One option is to ignore updates that precede the updated
 block.  That might be awkward with e.g. directory updates and
 checkpoints that cross multiple segments.  Another option might be to
 prohibit updates once a checkpoint has started, unless they are known
 to be after the phase change.

 FIX: unlink isn't punching a hole in the inode file.  The inode usage
      map isn't being updated.
      - FIXED (for create, not unlink).
 FIX: roll-forward does not pick up inodes, only data blocks.  But
      tiny files are synced to the inode, so they might not be picked
      up.  So we must process a level=0 inode like a data block.

6July2007
 Time for lots of clean up.
 DONE 1/ Index blocks to fill with 0 - use phys=0 to imply invalid.
 DONE 2/ rename 'lock' -> 'pin'
      3/ Review and fix up all locking/refcounts.  See locking.doc
 DONE 3a/ Make sure cluster_allocate can be called concurrently.  e.g.
          check B_Alloc inside the semaphore.
          Also lock the inode when copying in block 0, and probably
          when calling lafs_inode_fillblock (??)
 DONE 3b/ lafs_incorporate must take a copy of the table under a lock,
          so more allocations can come in at any time.
 NotYet 3c/ cluster_flush should start all writes before calling
          _allocate, as _allocate might block on incorporation/
          splitting.  No.  We really want _allocate to not block, but
          to queue...  I think this is too hard to get perfect just
          now, so I will leave it.
 DONE 3d/ introduce PinPending for data blocks.  Remove
          fs->phase_depth.
 LATER 3e/ Index needs a clean-lru on each filesystem, and a list of
          filesystems, so that locking of the lru doesn't have to be
          too global.
 DONE 3f/ change wc[]->hlhead to be a regular listhead as it is part
          of the lru system.
 DONE 3g/ revise refile lru handling based on new understanding.
      3h/ Utilise the WritePhase bit, to be cleared when the write
          completes.  In particular, find when to wait for Alloc to be
          cleared if WritePhase doesn't match Phase.
          - when about to perform an incorporation.
      3i/ make sure we don't re-cluster_allocate until the old-phase
          address has been recorded for incorporation.
      3j/ Check that index blocks cannot race when getting locked....
      3k/ Check what locking is needed to set PagePrivate exclusively.
 DONE 3l/ cluster_done needs to call refile, but is called in
          interrupt context.  We need to get it done in process
          context I think, and lock ->waiting access with fs->lock
          after changing it to ->lru.
 DONE 3m/ Need to know which blocks in a page are in writeback so we
          can clear writeback only when *all* have finished.
      3n/ on phase change, uninc_next blocks need to be shared out.
 NO   3o/ Make sure lafs_refile can be called from irq context.
      3p/ lock all lru accesses.
      3q/ Lock those index blocks!!!
      3r/ Can an inode data block be on leafs while the index isn't?
          What happens if we try to write it out...  FIXED
      Why are extent entries only grouped in 4s?
      If InoIdx doesn't exist, then write_inode must write the data
      block.
      4/ resolve the length of symlinks
 FIXED  - long symlink followed by 'sync' crashes.
 FIXED  - rollforward isn't calling 'allocated' on blocks, or something
 FIXED  - I cannot find 'bfile'. (inode isn't written)
 SEEMS OK...- Must flush the final segment of a cluster properly...
      5/ Review what does, and does not, need to be initialised in a
         new datablock.
      6/ document and review all guards against dirtying a block from
         a previous phase that is not yet safe on storage.  See
         lafs_dirty_dblock.
      7/ check for proper handling of error conditions
         a/ checkpoint_start might fail to start a thread!
         b/ if lafs_seg_dup fails in do_checkpoint, we need to abort
            the snapshot.
      8/ review the checkpoint loop.  Should anything be explicit, or
         will refile do whatever is needed?
      9/ Waiting.  What should checkpoint_unlock_wait wait for?  When
         do we need to wait for blocks to change state?  And how?
 DONE 10/ rebase on 2.6.current
 DONE   - use s_blocksize / s_blocksize_bits rather than fs->
     11/ load/dirty block0 before dirtying any other block in a
         depth=0 file.
     12/ Add a writecluster flag for old-phase updates.  Why is this
         needed?  Updates should always go in the new phase???
     13/ use kmem_cache for 'struct datablock'
     14/ indexblock allocation.
         use kmem_cache
         allocate the 'data' buffer late for InoIdx blocks.
         trigger flushing when space is tight.
         Understand exactly when make_iblock should be called, and
         make it so.
     15/ use a mempool for skippoints in cluster.c
     16/ Review the seg addressing code in cluster.c and make sure the
         comments are good.
 DONE 17/ Make sure create inherits uid etc from the process.
     18/ consider ranges of holes in pending_addr.
     20/ Implement the rest of "incorporate"
     21/ Implement staged truncate
         use for setattr and delete_inode
 DONE 22/ block usage counts.
     23/ review segment usage / youth handling and make a todo list.
         a/ Understand ref counting on segments and get it right.
     24/ Choose when to use VerifyNull and when to use VerifyNext2.
     25/ Store accesstime in a separate (non-logged) file.
     26/ quotas.  Make sure files are released on unmount.
     30/ cleaner.  Support 'peer' lists and peer_find. etc
     31/ subordinate filesystems:
         a/ ss[]->rootdir needs to be an array or list.
         b/ lafs_iget_fs needs to understand these.
     32/ review snapshots.
         How to create
         how they can fail / how to abort
         How to destroy
     33/ review unmount - need to clean up the checkpoint thread
         cleanly - be sure it has fully exited.
     34/ review roll-forward
         - make sure files with nlink=0 are handled well.
         - sanity check various values before trusting clusters.
     34/ Configure the index block hash_table at run time based on
         memory size??
     35/ striped layout.  Review everything that needs to handle
         laying out at cluster aligned for striping.
     36/ consider how to handle IO errors in detail, and implement it.
     37/ consider how to handle data corruption in indexing and
         directories and other metadata, and guard against problems
         (lots of -EIO I suspect).
      - check all uninc_table accesses are locked if needed.

 And more:
  1/ fs->pending_orphans and inode->orphans are largely unused!
  2/ If a datablock is memory mapped writeable, then when we write it
     out we need to fill up its credits again, or unmap it.
  3/ Need to handle orphans asynchronously.

---------
22nov2007
 Free index blocks are on two lists, both protected by the global
 hash_lock.
  1/ The per-inode free_index, so they can be destroyed with the inode.
  2/ The global freelist, so they can be freed by memory pressure.

11feb2008
 Where was I up to again?  Reviewing phase_flip and lafs_refile.
 UPTO Reading through modify.c, at 'add_indirect'.
 Plan to fix all this code.  Need to think about how index blocks
 really change: how old blocks get dis-counted from segment usage, and
 what optimisations are really good for re-incorporating index blocks.
 Operations to consider are: i) append new block, ii) truncate,
 iii) over-write, iv) fill-hole.
  i/   leaf block splits, index block gets a new entry at the end, and
       a replacement for the other entry.  Easy to handle.
  ii/  trailing entries are zeroed.  Should be easy, but isn't yet.
  iii/ probably caught in leafs.  May cause an internal split, so we
       add a new index address, which is easily handled if there is
       space.
  iv/  same as iii, though a split is more likely.
 What about merging index blocks?  That just makes addresses
 disappear, which we handle the slow way.
 Do we ever re-target index blocks?  Would need to be careful about
 that.  Make it look like a split where one block ends up empty as a
 hole.

 Need to write:
   grow_index_tree (DONE - untested)
     ib is a leaf inode that is getting full.  Copy addresses into
     'new', and make 'ib' an index block pointing at new.
   add_index/walk index (DONE - untested)
   end of do_incorporate (DONE - untested)
     new contains the early addresses.  Some remain in ib and/or ui.
     The buffers must be swapped, so ib has the early addresses.
     ui needs to be attached to new.
     return 2; - then the new uninc needs to be split.
   lafs_incorporate
     case 2 - horizontal split
     case 3 - vertical split

12feb2008
 Bother - uninc_table is a problem (again).  We can currently add at
 any time with just a spinlock.
 So when we split a block horizontally,
 Still need to:  share out children and uninc_table in do_incorporate
                 share out credits in do_incorporate

14feb2008
 Still need to do incorporate as above, but took a break to...
 Counting allocated blocks now works - stat shows the right info, and
 hopefully storage is correct too. - DONE
 next: truncate?  orphan thread?
 Then segment usage and the cleaner.

 Thoughts:
  truncate - removing blocks doesn't need to erase them...
  - nothing forces a cluster_flush promptly!!!  We need a timeout, or
    at least we need a flush before truncate_inode_pages...
  - in lafs_truncate we need to make the block an orphan and pin it
    all in a checkpoint.

21Feb2008 (Research morning)
 Discard the checkpoint thread created on demand in favour of a
 cleaner thread that runs all the time.  It cleans and checkpoints and
 orphans and scans.
 Want to: do a segment scan and get a real list of free segments and
 free-space info!

25Feb2008
 - segment usage scanning to count free blocks
 - fix up re-reading of erased blocks
 - FIX truncate can still block waiting for writeback to complete.
 - FIX allocations aren't failing when we run out of free space
 - FIX df doesn't agree with du.

 Problem: Truncate when an index block has addresses in uninc_table.
 The summary for the new address has already been performed.  We need
 to deallocate the new without disturbing the old.  However a simple
 allocation may not be possible.  I guess we can prune them all to
 zero, then incorporation can proceed.

 TOFIX: when truncating a recently created file, it is still depth=0,
 so nothing happens.  We really need to increase the depth to 1 as
 soon as we dirty any block, then reset back to 0 if it fits.

26Feb2008
 We have a file that we have written to, and the data blocks have been
 written out and the addresses stuck in uninc_table.  We then truncate
 the file.  Who releases the usage of those blocks?  And who removes
 them from uninc_table?

 OK, 'rm' returns all the blocks back now, so 'df' is almost the same
 as 'du'.  I really should make sure that inodes are getting freed
 properly and the inode map is clean and everything.

 BIG QUESTION: Do we reserve segment-usage blocks?  We cannot do it
 naively, as we get infinite recursion.  But we need it to be allowed
 to dirty the segment block.  But we cannot pin them to this phase as
 we want to write them out after this phase.
 This still needs more thought.  I avoided the recursion by setting
 SegRef before getting the ref.  But that isn't safe.

28Feb2008
 The table of cleanable segments is not working out.  Each segment
 appears multiple times, which wastes space and adds confusion.  We
 really want to be able to look up by dev/seg and also find the least.
 'Find least' sounds like we want a heap, but then we cannot discard
 the bottom half.
 We could have a skiplist for dev/segment lookup and do a merge-sort
 on a different link when we want to find the best segment.  We then
 remember the best number found since a sort, and re-sort if the top
 is worse than the best.
 We keep all this in a fixed size table.  Each entry has
 seg, dev, usage, weight, weight-sort-link, addr-sort-link and
 possibly some addr-sort-skip links.  This is 32+32+16+16+16+16 bits,
 or 16 bytes or bigger.  Say 16 bytes, 24 bytes, or 32 bytes (depth 8,
 which is plenty).
 One page of 16-byte entries (256 of them), 2/3 of a page of 24-byte
 entries, 1/3 of a page of 32-byte entries.  Total 2 pages, and
 256+113+43 = 412 entries.
 But deleting random elements is awkward... but not too awkward.
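For concreteness, a sketch of one such entry.  The field widths follow
the 32+32+16+16+16+16 layout above, with the sort links stored as
16-bit indexes into the same fixed table rather than pointers; that
detail and all the names are assumptions:

    #include <linux/types.h>

    /* One entry in the fixed-size table of cleanable/free segments.
     * The base entry is 16 bytes; the 24- and 32-byte variants carry
     * 4 or 8 extra addr-sort skip levels.
     */
    struct seg_entry {
            u32     seg;            /* segment number */
            u32     dev;            /* device it lives on */
            u16     usage;          /* live blocks in the segment */
            u16     weight;         /* cleanability weight (formula below) */
            u16     weight_link;    /* next entry in the weight-sorted list */
            u16     addr_link;      /* next entry in the dev/seg skip list */
            u16     skip[];         /* 0, 4 or 8 extra skip levels */
    };

With 4K pages this gives the 256 + 113 + 43 = 412 entries in two pages
counted above.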
 We can delete lots of entries by marking them as old, then performing
 a single pass of the skip list deleting them.
 We should keep free segments here too, on a separate list.

 So how about:
   2 pages of 16-byte entries
   1 page of 24
   1 page of 32
   The free list randomly threads through all.
   When using from 24 or 32, randomly choose a height of 2-5 or 2-9.
 Two lists run through the skiplist entries: one for cleanable, one
 for free.  Remember the nth element for some small n (10, but it
 decreases as we pull things off the front) and if we add something
 less than that, we trigger a mergesort the next time we want to
 clean.... maybe.
 Remember the end of the free list and add to there.  Maybe merge-sort
 the free list by addr occasionally.

 Questions: When can we clean, when can we free, wrt checkpoints?
  - we can clean a segment as soon as we have a checkpoint after it.
    So we record the youth of the segment holding the (start of the)
    checkpoint, and can clean any segment with a lower youth.
  - we can free a segment after the checkpoint after its usage has
    reached zero.  So if usage is zero and youth....
    We could offset the usage by one (say - for the first cluster
    header..), then when we find a segment with a usage of '1', we
    schedule an update to 0 in the next checkpoint...
 What about segments with different sizes?  They get different
 weights.  Need to divide by segment size:  usage * youth / size.

 TOFIX
  - It seems I sometimes fall off the end of the last segment !!!
    - FIXED (locking)
  - We seem to switch to a new segment when still 83 blocks remaining?
    - FIXED (delete did flush)
  - Lots of 'creates' makes lots of little clusters - need to
    optimise!  Or it could be deletes, as we currently cluster_flush
    for each delete.
    - I think this is fixed

29Feb2008
 Started looking at the cleaner.
 Need to understand how much to clean each checkpoint.
 Need to track free-space-in-active-sectors while scanning.

3Mar2008
 TOFIX
  - the cluster head is currently limited to one page.  This is not
    good.
  - Should the cleaner start before the scan is complete after a
    checkpoint?  Probably it can, but while the scan is still
    happening it might be best to be cautious ??

 STATE: try_clean is taking shape and has a few FIXMEs.
 Need to write async find_block code and get it to watch for blocks in
 a cleaning segment.

28Mar2008
 - where can padding appear in a cluster?  Between miniblocks?  At the
   end of device blocks?
 - need to track the phys block while parsing headers for cleaning..
   why?
 - determine rules for avoiding block lookup during cleaning, based on
   youth/snapshot age and truncate generation.
   We need to load the inode from each snapshot.  Can we optimise
   based on snapshot age?  Only if we know the block is newer than the
   snapshot.  So when we relocate blocks (cleaning) they must go in a
   segment that is marked as being old.  We cannot really guarantee
   that.  I guess blocks that are marked as 'new' can safely be
   skipped if the segment is newer than the snapshot.  This 'age' is
   not the youth, but is the cluster_head->seq which is stored in
   creation_age.
 - Store the rootdir for a filesystem in the metadata for the root
   inode.  Then 'struct snapshot' doesn't need rootdir.  It can have a
   root

30Jun2008
 Looking at lafs_find_block_async.
  Needs an async flag to make_iblock.  Check that.
  Can we block_adopt if there was an error?  iblock will exist.
  setparent has an async flag.
  lafs_leaf_find has an async flag.
  lafs_wait_block_async

 FIXME I wake up the cleaner every time an IO completes.  Do I really
 want that?  Maybe only when the number of async IOs hits half the
 recent maximum??
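On that last FIXME, a tiny sketch of the 'half the recent maximum'
idea.  The counters, the names, and where the wake-up actually goes
are all assumptions:

    #include <linux/spinlock.h>

    /* Track async cleaning IO and only ask for a cleaner wake-up when the
     * number in flight has drained to half of the recent peak, instead of
     * waking on every completion.
     */
    struct clean_io_counter {
            spinlock_t      lock;
            int             outstanding;    /* async IOs currently in flight */
            int             recent_max;     /* peak seen since the last wake */
    };

    static void clean_io_start(struct clean_io_counter *c)
    {
            unsigned long flags;

            spin_lock_irqsave(&c->lock, flags);
            if (++c->outstanding > c->recent_max)
                    c->recent_max = c->outstanding;
            spin_unlock_irqrestore(&c->lock, flags);
    }

    /* Returns 1 when the caller should wake the cleaner. */
    static int clean_io_done(struct clean_io_counter *c)
    {
            unsigned long flags;
            int wake;

            spin_lock_irqsave(&c->lock, flags);
            c->outstanding--;
            wake = (c->outstanding <= c->recent_max / 2);
            if (wake)
                    c->recent_max = c->outstanding;  /* start a new peak */
            spin_unlock_irqrestore(&c->lock, flags);
            return wake;
    }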
FIXME need to ensure that lafs_pin_dblock flushes committed B_Realloc
 blocks.
FIXME when we incorporate a dirty (non-realloc) address into an index
 block, we need to clear B_Realloc on the indexblock.
FIXME in lafs_incorporate we lafs_dirty_iblock 'new' without giving it
 any credits.  Where should they come from?
We don't seem to scan for free/cleanable segments often enough.
FIXME we shouldn't start a checkpoint while cleaning is happening.
FIXME need to be careful when cleaning about finding inodes that don't
 exist any more.
FIXME give credits to realloc blocks.
FIXME think about/document transitions between realloc and dirty, and
 what locking is needed.

2Jul2008
 Allowing for the FIXMEs above, the cleaner is now identifying blocks
 that need to be cleaned and marking them B_Realloc (I think).  We now
 need to gather these into a write cluster and write them.  They will
 all be on the clean_leafs list, so we can iterate that, allocating or
 incorporating as needed.  This will be similar to do_checkpoint.
 The important question is: when?  Ideally we would have some
 auto-flush mechanism.  The cleaner just keeps finding blocks to clean
 and when we start running out of resources we flush the cleaning
 queue.  However we will still want to flush the cleaner always before
 a checkpoint, so for now we can implement that bit and wait for a
 need for the other to arise.

 FIXME: cleaner lookup of 0/0/0 has interesting consequences as we
 don't record that location the same way.. how to handle?  Should
 check that 'adopt' doesn't do the wrong thing with this block.

 Realloc blocks need to be pinned.  That makes sense.  Only that way
 will they get onto the clean_leafs list.  When checkpointing we
 should probably examine clean_leafs to be on the safe side.

 Realloc and Dirty:  Both of these hold a Credit.  Both can be set at
 the same time.  The cleaner ignores Dirty and sets Realloc any time
 the block is in the wrong segment.  It also Pins the block.
 When the cleaner is flushing to the cleaning segment, it ignores
 Dirty blocks.  They get their Realloc cleared, but they remain
 pinned.  So they will get moved at the next checkpoint.
 How do we know whether an indexblock should be Dirty or Realloc?  The
 Dirty/Realloc bit is cleared before we get to incorporation.  Maybe
 we lafs_dirty_iblock the parent of any block we write out.  Then
 after incorporation, we set Realloc if it is not dirty.

 STATUS: I think I'm pinning cleaner blocks now.
  Need to make sure the dirty ones are dropped.  DONE
  Need to make sure the usage is transferred.
  Need to get free segments back into use.
  Need some more 'dump' options.  Maybe youth/usage files.  Maybe tree.
  Need to make sure scan etc are triggered often enough.

FIXME lafs_prealloc walks up ->parent without locking.  I think we
 want i_mapping->private_lock, like lafs_pin_iblock.

TODO:
 1/ a 'dump' option that triggers a scan and prints everything out.
 2/ scan must mark freeable segments as such, then subsequently free
    them.
 3/ Look at the code that decreases the usage of old segments.
 4/ Review lafs_cluster_wait_all and decide exactly how long we need
    to wait.
 5/ Review the 'FIXME that is gross' HZ/10 thing.
 6/ Review the 'wait for checkpoint to flush' msleep(500);  Maybe
    remove that altogether.

FIXME BUG_ON in grow_index_tree fires.
  sync - writepages - flush
FIXME BUG in lafs_allocated_block fired.
  from lafs_erase_dblock from invalidate_page from .. vmtruncate from
  lafs_setattr

Current problem:
 An inode data block is dirty and pinned, but the inoidx is no longer
 pinned.  Presumably it isn't dirty.
 Recheck what 'dirty' means on the two blocks and see how this can
 happen.

10july2008
 The tree gets very big!  Lots of 'Realloc' blocks that should be long
 gone.
 We are spinning in the cleaner again, and not in try_clean.

 Is it a problem that 'Pinned' is used for Realloc and dirty blocks?
 In general it shouldn't be.  The flush_cleaner process will remove
 the Realloc bits so the blocks fall off clean_leafs.  They then
 either go onto phase_leafs or get unpinned.
 But I currently have a problem with InoIdx/data.  The Pin is
 transferred to the Data block, but it doesn't go from the InoIdx
 block because it has a pincnt.  Now that is probably a bug, but what
 if it weren't?  What if, while we were cleaning, a block got dirtied?
 That would pin the whole tree.
 I guess the rule about not allocating an inodedata block while the
 InoIdx is pinned needs to be revised.  If the inodedata block is
 Realloc (and not Dirty) while the InoIdx is not Realloc, we can go
 ahead (in a cleaning segment).

 FIXME to check: adir/big1 is garbage....  big1 was removed, so why is
 it even there?  FIXED.

 echo tre > dump   # still too much stuff.

 Put cond_resched in checkpoint loops!

 Thoughts about cleaning and pinning.
 When cleaning we need to know how many dependent blocks are being
 cleaned, so that we know when *this* block can be written - i.e. when
 the count hits 0.  We cannot use the pincnt for this phase, because
 there may be dependent blocks which are dirty.  They, and therefore
 this, may get flushed at the next checkpoint, but they may not.  If
 we could be certain they would, we could just write to the
 clean-segment blocks, which can become unpinned.
 However if there is an index block being cleaned, and no dependent is
 being cleaned, but some are dirty but not pinned, then the checkpoint
 can go past without the block being moved....  but maybe we can
 detect that.

 Try this:
  We set B_Realloc precisely on blocks found in segments being
  cleaned.  We pin these blocks, and leafs which are Realloc go in
  clean_leafs.
  If a block is both Realloc and Dirty we clear Realloc but leave it
  pinned.  That way it gets written at the end of the checkpoint, but
  to the main cluster.
  When we incorporate Realloc blocks into an index block, it gets
  marked Realloc.  When we incorporate dirty blocks, mark it dirty.
  Then see above.
  On a checkpoint, we process both phase_leafs and clean_leafs.

FIXME do inode reads async better when cleaning...
FIXME if a realloc inode has been allocated to a cluster when we try
 to dirty it, confusion can ensue as the writeout won't mark it clean,
 but will use up the credits.  Maybe we need something similar to
 phasewait to not set PinPending...  But normal dirtying doesn't
 phasewait.  I think we just need to detect this case and wait for the
 clean-cluster to flush.  Messy...
FIXME make sure incorporate is doing the right thing with credits.
FIXME lafs_write_inode.  We need to be careful about clearing Dirty
 when making an update.  Need some sort of locking.  Need to review
 all inode dirty stuff and make sure we do the right thing no matter
 when it is called.
FIXME when blocks are attached to uninc_next, they don't have 'dirty'
 any more, so we don't know how to flag the index block.

2008jul13
 UPTO: unlink etc don't prealloc the inode that will be modified.
 And a WARN_ON at inode.c:579 is very noisy.

2008jul22
 FIXME: lafs_reserve_block uses CleanSpace if Realloc is set, but it
  doesn't get set until AFTER lafs_reserve_block is called.

 Here I am...  Cleaning cleans an InoIdx block, which schedules the
 data block.  Subsequently the InoIdx block gets pinned again.
 Now when we go to write the data block, we cannot, because InoIdx is
 pinned in the same phase.
 Maybe, given that the data block is pinned, we write it anyway...

 FIXME: when we realloc a block embedded in the inode, don't pluck it
  out and put it back in again.  Just realloc the inode.
 FIXME: when cleaning a directory that has shrunk, we think we have
  blocks that don't exist any more.
  FIXED - we thought '0' was in segment '0'.

2008jul23
 FIXME: lafs_dirty_iblock called from lafs_allocated_block in cluster
  flush finds no credit, for the InoIdx block of 8501.
 FIXME: do we do SEGREF on all the index blocks?  Do we need to?

2008jul24
 FIXME: seg usage for segment 0/5 isn't dropping to zero.  Part of a
  file got moved off, but the count is still there.
  FIXED - seg_move wasn't being called.
 FIXME: segusage file has inconsistent extents:
   Extent entries:
     0 -> 694 for 2
     1 -> 1291 for 1
     1 -> 15 for 1
  FIXED several bugs in walk_extent
 FIXME qphase: any locking between that changing and lafs_seg_move??
  I don't think so.  Just that seg_apply_all must be called after
  qphase is set.
 FIXME make sure we don't try to clean the current segment!!
 FIXME 'Available' goes negative!
  Creating a large file doesn't instantly reduce 'Used'.
  Deleting files plus sync doesn't increase Avail?
 FIXME a segment is in the table but doesn't print out!
 FIXME we don't cope with running out of free segments (not that we
  ever should).
 FIXME check all Credit usage and make sure credits are returned when
  ->parent is dropped.  Provide visibility into credit counts.
  Make sure we are keeping enough space for cleaning.  We should
  always have a few segments unallocatable.

2008jul25
 FIXME cannot do io completion in the cleaner thread as it can block
  on an i_mutex which might be waiting for completion.
  FIXED (keventd).
 FIXME as ->iblock isn't refcounted we need to be careful accessing
  it.  If we 'know' we have a reference, e.g. a child with a ->parent
  link, we can access it without locking.
  So: lafs_make_iblock should return a counted reference.
  If we own an (indirect?) reference to iblock, we can access both
  iblock and dblock for free... but iblock can change???
  If not, we need to get a reference to one or the other under a lock.
 FIXME block->inode should be a counted reference?
   lafs_make_iblock            OK
   lafs_leaf_find              OK
   lafs_inode_handle_orphan    OK
     inode_handle_orphan_loop  FIXED
   __lafs_find_next            OK
     find_block                FIXED
   __lafs_find_next            OK
   lafs_find_next              FIXED
   dir_lookup_blk
   dir_handle_orphan
   lafs_readdir
   lafs_inode_handle_orphan
   choose_free_inum
   find_block - FIXED
 FIXME root->iblock should always be refcounted.  Is it?
 FIXME walking siblings - what lock?

2008jul28
 FIXME several times we clear PinPending without refiling, in dir.c in
  particular.  That looks wrong.  FIXED
 Maybe lafs_new_inode should return a reference to the dblock.  Or pin
  it.  Or something.  FIXED  And pinned (when needed).
 FIXME lafs_inode_dblock might return a block without valid data...
  Need to get valid data, then load block 0 in find_block rather than
  load_block.  FIXED
 FIXME we really should own a reference to ->dblock before calling
  lafs_pin_inode.  We don't want IO during a pin request.  FIXED
 FIXME review use of PhysValid.  FIXED
 lafs_orphan_abort - what if lafs_orphan_pin was not called?  Or if
  'b' is NULL?  FIXED
 Do I need to clear PinPending when retrying??  Well, we need to be
  phase-locked when we set PinPending, so it must be Pinned to the
  current phase.  So when we unpin a datablock, we must clear
  PinPending.  FIXED  we now clear PinPending in do_checkpoint.
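A small sketch of that rule - PinPending only means anything while the
block is pinned to the current phase, so dropping the pin must drop it
too.  The flag names come from these notes; the struct layout and the
function itself are assumptions:

    #include <linux/bitops.h>

    enum { B_Pinned, B_PinPending };        /* assumed bit numbers */

    struct datablock {                      /* heavily abridged */
            unsigned long   flags;
            /* ... */
    };

    static void lafs_unpin_dblock(struct datablock *db)
    {
            clear_bit(B_PinPending, &db->flags);    /* no longer phase-locked */
            clear_bit(B_Pinned, &db->flags);
            /* then lafs_refile(db) so it drops off the leaf lists */
    }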
 Does phase_wait do the right thing when pinning an inoidx block for
  an inode?  FIXED

Pending:
 Need to understand and document the lifetime of a page with
 datablocks.  Who holds what refcount, and when can it be freed?
 Then fix up locking in lafs_refile, __putref.
 FIXME who keeps what refcount on orphan blocks/inodes??
 FIXME should dirty/pinned/etc hold a refcount?  They don't.

Later:
 FIXME make sure a failed (-EAGAIN) pinning triggers a checkpoint
  (eventually)
 FIXME make sure empty files have a depth of 1.
 FIXME Truncate proceeds lazily.  All data blocks need to be gone

26aug2008
 If I call lafs_erase_dblock while a write is underway, we have a
 problem.  We need to wait, potentially for a checkpoint to let go of
 the block and for a write to complete.  This should be done by
 waiting for PG_writeback on the page to disappear.  Check this out.

 When end_page_writeback is called, we must have dropped all
 references to the page.  When we commit to writing a block, we have
 to set PG_writeback on the page so that truncate et al can wait for
 it.  Before we have committed, truncate can just remove the page.
 Internally we differentiate by B_Alloc.  So before setting
 B_Allocated we need to test_set_page_writeback(page).  Be careful of
 races.

 I don't think we can ensure all references are dropped.  After all,
 that is the point of refcounts.  So the dblock array must exist
 without the page!
 But we need to ensure that we don't start a writeout after truncate
 has done wait_on_page_writeback.  This is done with the page locked,
 so when we want to write a page in a checkpoint, we need to lock the
 page first.  Once we have the lock, we check if the page is still
 dirty.  If it has been truncated it will be clean.
 But how do we safely reference the page if b->page can be cleared?
 How about: when we clear PagePrivate, we take a counted reference to
 the page for db->page.  This is dropped when the page is freed by
 lafs_refile.  But while it is held, it is still safe for db->page to
 be dereferenced.

 So before we commence writeout we have to lock the page and set
 PG_writeback.  After locking, we need to test if writeback is still
 appropriate.
 Maybe not.  I think we can submit blocks for writeout without setting
 the page to writeback.  If we do, then we need to be sure those
 writes finish before invalidatepage calls releasepage
 (block_invalidatepage calls discard_buffer, which calls lock_buffer,
 which waits).  In our case invalidatepage needs to make sure that no
 new write commences.  Maybe we should lafs_iolock_block before we
 allocate to a cluster, and check again if the block is dirty.

 So: lafs_cluster_allocate does:
       lafs_iolock_block
       check if still dirty.  If not, unlock and return.
       set the allocate flag
       allocate and write
       when the write completes, allocate is cleared.
       unlock the block
     invalidatepage does:
       lafs_iolock_block
       clear Valid, Dirty, Realloc
       lafs_iounlock_block
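And the same protocol sketched as code.  lafs_iolock_block(),
lafs_iounlock_block() and the B_* flags are names from these notes;
the struct layout and signatures are assumptions, and the functions
are renamed so as not to pretend to be the real lafs code:

    #include <linux/bitops.h>

    enum { B_Valid, B_Dirty, B_Realloc, B_Alloc };  /* assumed bit numbers */

    struct datablock {                              /* heavily abridged */
            unsigned long   flags;
            struct page     *page;
            /* ... */
    };

    void lafs_iolock_block(struct datablock *db);   /* from the notes above; */
    void lafs_iounlock_block(struct datablock *db); /* signatures assumed    */

    /* Called when a dirty block is chosen for a write cluster.  The io-lock
     * serialises against invalidatepage: if the block is still dirty once
     * we hold the lock, we may commit to writing it.
     */
    static int lafs_pick_for_cluster(struct datablock *db)
    {
            lafs_iolock_block(db);
            if (!test_bit(B_Dirty, &db->flags) &&
                !test_bit(B_Realloc, &db->flags)) {
                    /* truncate got here first - nothing to write */
                    lafs_iounlock_block(db);
                    return 0;
            }
            set_bit(B_Alloc, &db->flags);   /* committed to this write */
            /* ... allocate a physical address and submit the IO; completion
             * clears B_Alloc and unlocks the block ...
             */
            return 1;
    }

    /* Called from ->invalidatepage during truncate, under the page lock.
     * Taking the io-lock waits out any write that already committed.
     */
    static void lafs_invalidate_block(struct datablock *db)
    {
            lafs_iolock_block(db);
            clear_bit(B_Valid, &db->flags);
            clear_bit(B_Dirty, &db->flags);
            clear_bit(B_Realloc, &db->flags);
            lafs_iounlock_block(db);
    }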