README

   1
   2 So, let's try to write a kernel module that implements this filesystem.
   3 It would be good to have a plan.
   4
   5 - Mount filesystem, providing empty root directory
   6    o parse mount options - DONE
   7    o find/load superblocks and stateblocks - DONE
   8    o present empty directory - DONE
   9    o Compile external module - DONE
  10    o test - DONE
  11
  12 - Mount filesystem read-only with no roll-forward
  13    o IO address mapping
  14           sync_page_io or bread? - not bread I think
  15    o Index blocks management
  16    o search cluster-header for root inode
  17    o file read
  18    o Directory lookup/read
  19    o test
  20
  21 - Support roll-forward for blocks, orphans, whatever
  22    o manage segusage files
  23    o manage quota files
  24
  25 - Support writing
  26    o inode bitmap
  27    o cluster creation / block sorting
  28
  29 - Support Cleaning
  30
  31 - Interface for snapshots and other admin
  32
  33
  34
  35 ------------------------
  36 FIXME
  37  If a device is removed from the filesystem, we cannot reliably
  38  tell from the other devices or state that this is so.
  39  Maybe we need to update all devblocks with a new 'seq' number...
  40 FIXME
  41  How do we specify mounting subordinate filesets?
  42  What superblock do they have?
  43  I suspect we do a -F lafs-sub mount from the original filesystem.
  44
  45 FIXME
  46  If mount fails, we seem to be leaving a super lying around,
  47  and sync_supers dies on it. - DONE
  48
  49 FIXME
  50   Umount appears to work, but a sync_supers dies. - DONE
  51
  52 FIXME
  53   subordinate supers aren't being locked as much - is that a problem?
  54
  55 FIXME
  56   index pages never get put on an LRU - how is this supposed to work?
  57
  58
  59
  60
  61
  62
  63
  64 --------------------------
  65 Thoughts:
  66   Inodes live in an address-space, much like a file.  To load the
  67   first inode, we need an address-space, so may as well have an
  68   'struct inode' as we may want to expose it to user-space.
  69
  70  Loading an inode, need
  71    fs (lafs filesystem structure)
  72    which subfs (maybe a lafs inode)
  73    which snapshot - this is implied by the subfs inode.
  74    and fs can be obtained from inode, so just inode, inum
  75
  76
  77
  78 UPTO
  79 03nov2005
  80   review block_leaf_find and make_iblock
  81   need to do setparent and block_adopt next
  82
  83 10nov2005
  84   need to resolve locking for ->siblings list
  85
  86 24nov2005
  87   peer_find
  88   lock_phase
  89   lafs_refile
  90
  91   I can read a file.....!!!!!
  92   Code review / tidy up.
  93      resolve locking buffer vs page
  94
  95   Export on a web page somewhere??
  96
  97 16feb2006
  98  (I spent a while getting large-directories to work again in prototype..
  99   and some holidays).
 100  - Priority: clean mount and unmount
 101  - large directories
 102  - multiple devices.
 103
 104   FIXME how do we record and handle write errors???
 105
 106  The iput in lafs_release - which is needed - is oopsing
 107   at iput+0xe!
 108
 109 23feb2006
 110  Ok, I finally have a clean mount/unmount.
 111  .. not quite.  blocks being freed at unmount still have a refcnt, which is bad.
 112
 113  Next:
 114   - make sure we can handle 'large' directories.
 115   - make sure we can handle files with indexes
 116   - handle filesystems that span devices.
 117
 118 02mar2006
 119   Hurray - clean unmounts!!!
 120   There is a nasty circular reference of the root inode which is stored in
 121   a block that it manages.  Maybe this should not happen, rather than having to be
 122   explicitly broken - the root-block can live elsewhere, not in the inode.
 123
 124   Next multi-level index blocks.
 125
 126   But first, need to understand memory pressure and pageout.
 127    How are dirty pages found to be cleaned?
 128    How is pressure put on a filesystem to clean up?
 129    How are clean pages reaped?
 130
 131   - call pagevec_lru_add{,_active}(pvec)  to put the page on an LRU
 132       lru_cache_add{,_active}(page) might be easier, but isn't exported.
 133   - call mark_page_accessed(page)  to keep the page 'active'.
 134
 135 09mar2006
 136   - make sure indexes work...
 137
 138
 139  lafs_load_block+0xf
 140   eax,bx,cx,dx,s1 all zero
 141 from block_leaf_find 203
 142
 143   ... OK, indexes seem to work.
 144    But 'lafs' has problems creating some large files.
 145    Try 'tt'
 146
 147    This is due to not handling errors properly.. fix it later FIXME
 148
 149 16mar2006
 150
 151   Must make sure the index address-space gets cleared up... I wonder
 152   how we find all the pages to free.  This might be one reason to keep them
 153   in a radix tree.  Though we should be able to walk our own data structures.
 154
 155
 156   Then work on mounting a 2-device filesystem.
 157
 158
 159   FIXME dir_next_ent always starts from the beginning rather than
 160    remembering where it is up to... can this be fixed??
 161
 162
 163 18mar2006 (Wedding anniversary, and Saturday ... during commonwealth games)
 164
 165   Mounting a snapshot needs a way to identify that it is a snapshot mount
 166   and which snapshot, and which filesystem.
 167   We could use a different filesystem type, but that isn't really needed
 168
 169     mount -t lafs -o snapshot=name /original/mount/point /new
 170
 171   This grabs the named snapshot of /original/mount/point and places it at
 172       /new
 173   The 'snapshot=' option is the trigger.
 174
 175   For a control FS, we
 176         mount -t lafs -o control /original/mount/point /new
 177
 178   To grow a filesystem, we initialise a device (super/state blocks) and
 179         mount -t lafs -o remount,new=/dev/name whatever /original/mount/point
 180
 181   as the dev_name isn't passed to remount
 182
 183   So, mount options are:
 184         snapshot=name
 185         dev=/dev/device
 186         new=/dev/device
 187         control
 188     and various
 189           name=value
 190     pairs matching what is exposed in the control filesystem
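A rough user-space sketch of parsing the option list above. The struct
and the function name lafs_parse_opts are made up for illustration and
are not the module's real code; unrecognised name=value pairs are only
counted here, on the assumption that they would be handed on to whatever
backs the control filesystem.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical holder for the options listed above (not the real LaFS struct). */
struct lafs_opts {
	char snapshot[64];	/* snapshot=name */
	char dev[64];		/* dev=/dev/device */
	char newdev[64];	/* new=/dev/device */
	int control;		/* bare "control" flag */
	int other;		/* count of remaining name=value pairs */
};

/* Parse a comma-separated option string; returns 0, or -1 on a bad option. */
static int lafs_parse_opts(const char *data, struct lafs_opts *o)
{
	char buf[256];
	char *p, *next;

	assert(data && o);
	memset(o, 0, sizeof(*o));
	if (strlen(data) >= sizeof(buf))
		return -1;
	strcpy(buf, data);

	for (p = buf; p && *p; p = next) {
		/* Split off the next comma-separated token in place. */
		next = strchr(p, ',');
		if (next)
			*next++ = '\0';

		if (strncmp(p, "snapshot=", 9) == 0)
			snprintf(o->snapshot, sizeof(o->snapshot), "%s", p + 9);
		else if (strncmp(p, "dev=", 4) == 0)
			snprintf(o->dev, sizeof(o->dev), "%s", p + 4);
		else if (strncmp(p, "new=", 4) == 0)
			snprintf(o->newdev, sizeof(o->newdev), "%s", p + 4);
		else if (strcmp(p, "control") == 0)
			o->control = 1;
		else if (strchr(p, '='))
			o->other++;	/* assumed: passed on to the control-fs code */
		else
			return -1;	/* bare word we do not recognise */
	}
	return 0;
}
```

A non-empty snapshot field is then the trigger for a snapshot mount, as
described above.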
 191
 192 23mar2006
 193  - factored out super-block finding preparatory to finding snapshots.
 194
 195  Thoughts:
 196     superblocks for snapshots and subordinate filesystems do
 197     not get stored in the 'state'.  There is, however, a usage count so that
 198     the prime filesystem cannot be unmounted until all snaps and subs are gone.
 199     This should just refcount the prime_sb I suspect.
 200
 201     So: a snapshot sb points to the 'struct fs' but doesn't .... what???
 202
 203 30mar2006
 204  - remove the super-block finding code by changing the layout to store
 205     superblock locations explicitly :-)
 206
 207  - teach 'mount' to mount snapshots.
 208
 209  - need to audit for bad use of ss[0]
 210  - need to find better way to map 'sb' to snapshot number.
 211  - need to make unmount work.
 212
 213 01apr2006 (no, really!!)
 214  - rewrite index to kmalloc index blocks and use a shrinker to free them.
 215    This means that indexblock no longer has a 'page', which makes sense.
 216    It also means they cannot live in highmem, which is sad, but could
 217    be fixed.
 218
 219   Notes: superblocks and refcounts.
 220    Each device holding the filesystem gets a superblock.
 221     One of these (arbitrarily) is the 'prime' superblock and gets to
 222       manage the whole filesystem.
 223     Each snapshot also gets a superblock, as does each
 224       subordinate filesystem.  These are anon sbs - using anon dev.
 225     Each anon sb takes a reference to the 'struct fs', and also to the
 226       prime sb.... how about the reference relationship between fs and prime_sb???
 227
 228     Need to ponder this,
 229
 230    - problem with getting parent superblock due to semaphores...
 231    - when unmount, put_super isn't being called, so inode 0 isn't released!
 232
 233 13apr2006
 234   (Took a week off to play with rt2500 wireless cards)
 235   - Use different filesystem type for snapshots and subordinate filesystems.
 236     This removes the semaphore problem
 237   + OK, mount and unmount works for snapshots... what next?
 238      - review index block - worry about highmem?
 239      - review ss[0] usage - OK
 240      - general code review
 241
 242   FIXME - what should leaf_lookup/index_lookup return on format error?
 243       They currently return '0' which will quietly make an empty block.
 244       Maybe '-1' would be better to make an error block.
 245   FIXME check how other filesystems lock the setting of PagePrivate
 246      Maybe just need to lock_page
 247   FIXME combine find/load/wait into one operation
 248   Review dir, super, roll, link
 249
 250   FIXME module refcount increases on failed mount!
 251
 252 18may2006
 253   I've been sick for too long, and not much has happened... However I think more than
 254   the above comment says.  I started looking at roll-forward and have the
 255   basic block parsing in place so that it reports what it sees in the roll.
 256   Also, the format has been changed a little: the address in the state block
 257   is the CheckpointStart cluster, and we simply roll forward to the
 258   CheckpointEnd, and then keep going beyond there - there is no longer any
 259   walking back to find the start.
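A toy model of the new walk, just to pin down the control flow. The log
is flattened into an array, the flag names are invented, and treating a
missing CheckpointEnd as a failure is an assumption; none of this is the
on-disk format.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical flag names; the real cluster header layout is not shown here. */
#define CL_CHECKPOINT_START	0x1
#define CL_CHECKPOINT_END	0x2

struct cluster_hdr {
	int valid;		/* header looks sane, so the log continues */
	unsigned flags;
};

/*
 * Walk forward from the cluster named in the state block.
 * Returns how many clusters were visited, or -1 if the start is not a
 * CheckpointStart or no CheckpointEnd was seen (assumed to be fatal).
 */
static int roll_forward(const struct cluster_hdr *log, size_t n, size_t start)
{
	int seen_end = 0, visited = 0;
	size_t i;

	assert(log);
	if (start >= n || !log[start].valid ||
	    !(log[start].flags & CL_CHECKPOINT_START))
		return -1;

	/* No walking back: go to CheckpointEnd, then on until the log stops. */
	for (i = start; i < n && log[i].valid; i++) {
		if (log[i].flags & CL_CHECKPOINT_END)
			seen_end = 1;
		visited++;
	}
	return seen_end ? visited : -1;
}
```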
 260
 261   Next step is to start incorporating rolled elements into the filesystem
 262
 263    - data blocks: shouldn't be too hard.  Don't need to update the
 264            index pages just yet
 265    - inode updates: should be straight forward enough, but care is needed
 266            as the data might be in multiple places
 267    - directory updates: these are probably most interesting..
 268
 269
 270   Question: how are symlinks created?
 271     Currently we:
 272       log the inode creation
 273       commit the new inode
 274       log the directory update.
 275     This allows the 'value' stored in the inode to appear after the directory
 276     update.
 277     That might be OK for files (Which are created empty and then extended)
 278     but is bad for symlinks (which are created atomically).
 279     So, options include:
 280      - ensure inode is in a previous cluster to directory updates.
 281        This slows things down too much I think
 282      - log the content as well.  This is awkward if it is big, certainly if more
 283        than a block, which is possible.
 284      - directory updates could be dependent on the inode being valid.
 285        This is ugly.
 286      - log content if it is small, else write inode, flush, then create link.
 287
 288     So the fast option is:
 289       log inode create, log content, log filename
 290     and the slow/safe option is
 291       log inode create, sync file, log filename
 292
 293     So on roll-forward if we see the inode we just save the data.
 294     Saving the whole inode seems attractive, but we want minimal order
 295     dependence: an inode update in the same cluster as the new inode should
 296     still over-ride, even though it is earlier.
 297
 298   Ok, rollforward is proceeding slowly.  I think I am now incorporating
 299   new blocks into the tree properly, though the code probably won't compile.
 300   It will be nice to test this and see the file have the right data.
 301
 302   Next step would be to include the index incorporation code.
 303   Then
 304     - directory updates
 305     - segusage summary
 306     - quota
 307     - stuff..
 308
 309 08jun2006
 310  - what exactly should happen when rollforward finds a file with a linkcount of 0?
 311    Currently all updates get lost - I wonder if they are lost safely?
 312  - rollforward is getting the size right, but not the content
 313  - do I need to flag a block that ->phys is valid?
 314
 315  : Ok, roll-forward picks up new blocks in a file OK,
 316   but umount has stopped working.
 317     Presumably because there are pages attached to the inode which aren't
 318     getting released.  What do we want to do here?
 319     Normally those pages, or their addresses need to be recorded before
 320     they are lost.  But on a read-only mount we don't care so much.
 321
 322 22jun2006    continuing above thought..
 323
 324    When we roll-forward and pick up the pieces of a file, we don't
 325    want to allocate pages to hold those pieces (and definitely don't
 326    want to read them all).  We just want to attach the addresses
 327    to the parent for incorporation.  Similarly after writing
 328    dirty blocks in a file we want to be able to release them
 329    immediately rather than waiting for the addresses to be
 330    incorporated (as incorporation can be more efficient when delayed).
 331
 332    We could just allow the page associated with a block to be released,
 333    except that the page provides the indexing to find a block.  We might
 334    be able to live without the indexing, and hunt down the indexblock tree,
 335    but living without the mutual-exclusion provided by block indexing would
 336    be more awkward.
 337    And the 'struct datablock' still contains a lot more than is needed.
 338
 339    So maybe we should just have a completely separate structure attached to
 340    the indexblock which lists fileaddr/physaddr.  This could include
 341    extent information.  The trick would be guaranteeing allocation.
 342    We could either allocate-late with a fallback of attaching the 'struct block'
 343    or performing an immediate incorporation, or allocate-early and block
 344    the dirtying of a page until there is space to record the new address.
 345    This last is bound to be easiest.
 346
 347    So: what exactly do we use to store addresses?
 348     Probably a linked list of tables.
 349     Each table contains a link pointer and an array of
 350         fileaddr/physaddr/extentlen
 351     But we would need to allocate lots of these if there are hundreds of
 352       dirty pages, but possibly only end up using a few if they made
 353       extents very nicely.  That might be wasteful.
 354
 355     Or we could allocate just one.  When it is full we perform an
 356      incorporation.  But if that causes a page split we are in trouble.
 357        We could have a spare page, split to it, write out one
 358         and wait for the spare page to be written and free.
 359         But we cannot just release the index page as it might still have
 360         children.
 361
 362     (I think I've been here before).
 363     A worst-case scenario involves writing one block and that requires
 364       splitting every index up the tree to the inode.  This requires
 365       arbitrarily many pages to be allocated.  To accommodate this we either
 366       pre-allocate a spare page at every level of the tree down to the data
 367       block (a bit like storage space allocation) which seems very wasteful,
 368       or we make sure we can release one of the split pages, which seems impossible.
 369
 370     I could decide not to worry about it.  Have a pool of index pages and hope
 371      it always works.  After all, most pages are data pages, and they can be
 372      freed successfully.  We would only have a deadlock if all dirty memory were
 373      index pages, and that seems unbelievably unlikely.  If we trigger a
 374      checkpoint when the count of locked-pages hits some limit we should be
 375      safe.
 376
 377     So: Keep one table per index block.  Use simple append and sequential search.
 378      When table gets full, force an incorporation
 379
 380      Do we allocate the table separately, or embed it in the indexblock??
 381
 382      Probably embed it.  indexblocks that don't need it can be freed at any
 383      time so that space waste hopefully isn't significant.
 384
 385      How big?
 386       If the file is written sequentially, then everything should gather into
 387       extents, and so it doesn't need to be enormous.
 388       If the file is written randomly then the index block can be expected to
 389       be 'indirect', so incorporation will be cheap.
 390      So 'small' seems ok in both cases.
 391
 392      Let's say 8.
 393
 394      But wait a minute.....
 395      On a checkpoint we can be getting phys updates for prev and next phases.
 396      next-phase updates cannot be incorporated until the indexblock has passed
 397      on to the next phase.  So in that case, I think we still keep a linked
 398      list of unincorporated blocks and live with the fact that we cannot
 399      free them until the phase change passes.  That shouldn't be a big problem
 400      as it is a limited time frame - especially for data blocks..
 401
 402      But does this solve our initial problem??
 403      During roll-forward we want to keep the addresses but not the blocks,
 404      and we don't want to force incorporation. That means an arbitrary list
 405      of addresses attached to an index block.
 406      I guess we could possibly allow incorporation, but I would rather not
 407      as I want the fs to be able to be read-only nicely.
 408      So that means we need to have a list of address tables.
 409      Maybe the normal approach is 'add a table if possible, else incorporate'?
 410
 411      OUCH... we may write a block a second time before incorporating the
 412      new address, so when adding an address to the table we need to check
 413      if it already exists.  That could be expensive.
 414      For index blocks might it even be a different address?  I think
 415      not but the vague possibility (in the future?) does complicate
 416      things somewhat.  Maybe we just keep things in chron order and
 417      don't worry about duplicates until incorporate time, when we have to
 418      sort anyway.
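A minimal sketch of the "small table, simple append, sequential search"
idea, with 8 slots as suggested above. The struct layout, field widths
and function names are guesses for illustration only. Entries stay in
chronological order and duplicates are left in place; lookup scans from
the newest entry so a rewritten block's latest address wins.

```c
#include <assert.h>
#include <stdint.h>

#define UNINC_SLOTS 8	/* "Let's say 8" */

/* One fileaddr/physaddr/extentlen entry (field widths are assumptions). */
struct addr_ext {
	uint32_t fileaddr;
	uint64_t physaddr;
	uint16_t len;
};

struct uninc_table {
	int used;
	struct addr_ext ent[UNINC_SLOTS];
};

/*
 * Append in chronological order, growing the last extent when the new
 * address is contiguous with it.  Returns 0, or -1 when the table is
 * full and the caller must incorporate (or chain another table).
 */
static int uninc_add(struct uninc_table *t, uint32_t fa, uint64_t pa)
{
	assert(t);
	if (t->used > 0) {
		struct addr_ext *last = &t->ent[t->used - 1];

		if (fa == last->fileaddr + last->len &&
		    pa == last->physaddr + last->len &&
		    last->len < UINT16_MAX) {
			last->len++;
			return 0;
		}
	}
	if (t->used == UNINC_SLOTS)
		return -1;
	t->ent[t->used++] = (struct addr_ext){ fa, pa, 1 };
	return 0;
}

/* Newest-first sequential search; returns 0 when the address is absent. */
static uint64_t uninc_lookup(const struct uninc_table *t, uint32_t fa)
{
	assert(t);
	for (int i = t->used - 1; i >= 0; i--) {
		const struct addr_ext *e = &t->ent[i];

		if (fa >= e->fileaddr && fa - e->fileaddr < e->len)
			return e->physaddr + (fa - e->fileaddr);
	}
	return 0;
}
```

Sequential writes collapse into a single extent entry, so the table
fills slowly in that case; sorting and duplicate removal are left to
incorporation time.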
 419
 420
 421      todo:
 422         lafs_find_block  DONE
 423         free_block must free tables DONE
 424
 425
 426      Unmounting still doesn't work.
 427      Problem is that an index block is holding a reference on parent,
 428      and parent references aren't getting cleaned up.
 429      On read-only unmount I guess we need to walk the list of leafs,
 430      discard any address info, and unlock the blocks.
 431      So that should be the first task for next time.
 432
 433 27jul2006
 434   Leafs are locked blocks which have no locked children.
 435   So any locked data block (non-inode) is a leaf
 436   Any locked index block with lockcnt[phase] 0 is a leaf.
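The leaf rule as a predicate, on a cut-down stand-in for the block
structure (the struct and its field names are invented here). A locked
data block that holds an inode is not covered by the note above; this
sketch simply reports it as not a leaf, which is an assumption.

```c
#include <assert.h>

/* Cut-down, hypothetical view of a block; only what the rule needs. */
struct blk {
	int locked;
	int is_index;
	int is_inode;		/* data block that holds an inode */
	int lockcnt[2];		/* locked children per phase (index blocks) */
};

/* Return 1 if the block is a leaf in the given phase, else 0. */
static int is_leaf(const struct blk *b, int phase)
{
	assert(b && (phase == 0 || phase == 1));
	if (!b->locked)
		return 0;
	if (b->is_index)
		return b->lockcnt[phase] == 0;
	/* Plain data blocks are leaves; inode blocks are left out (assumption). */
	return !b->is_inode;
}
```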
 437
 438   OK - fixed numerous bugs, but I can unmount now!!
 439   I can even rmmod and insmod and all is cool.
 440
 441
 442 TODO:
 443  - review refile and get all the code in there from prototype
 444        DONE (I hope)
 445  - write a combined find/load/wait function and use it
 446        DONE
 447  - allocate inodes in single memcache and avoid generic_ip
 448        HALF DONE. (still using kmalloc, not doing initonce well)
 449  - review recording of new block addresses
 450     + make sure we lookup there on index lookup - YES
 451     + make sure ->uninc_next gets transferred to table at phase change.
 452     + write incorporation code as it is tricky
 453  - review how directory updates can be incorporated into a RO filesystem.
 454     No, they cannot.  We need to update the directory.
 455  - write directory update code
 456  - write cluster construction code
 457  - make sure indexblocks with unincorporated addresses get on to inc_pending
 458     ?? or is locking them enough?
 459
 460
 461 INCORPORATION - ARgggghhhhh.
 462  The current uninc_table doesn't really lend itself to building
 463   index block... though maybe....
 464  Question: what happens when an index block disappears? i.e. it has no
 465   addresses in it?
 466   We clearly need to remove it from the parent.  This should be trivial,
 467   a direct operation on the parent index block. e.g. set some number to 0.
 468   Then the next incorporation pass will simply lose that entry.
 469
 470  OK, that might be all well and good, but how do we sort unincorporated
 471   addresses so we can merge them?
 472  A linked-list merge sort is nice and open-ended, but does waste
 473   quite a bit of space in pointers.
 474
 475  Or maybe I should just always do small-table incorporations.
 476  Is there a way that a bad ordering of writes could force very bad
 477   index layout in this case? i.e. cause a table split every time,
 478   but new blocks go in the first (full) table.
 479  OK Decision: always do small-table incorporation.
 480   i.e. not a list of blocks: just a table of addresses.
 481
 482  FIXME check validity of index type when it is first read in,
 483    and reject early if it cannot be recognised.
 484
 485 24aug2006
 486  Took a break from incorporation.
 487  Looking at directories.
 488  Wrote dir.doc in module to sum lots of stuff up.
 489  Issue:
 490    dir blocks have an info structure attached.
 491    This included a counted reference to the parent.
 492    How long does this need to hang around for??
 493
 494    - when there is any orphan issue happening, it must stay, via
 495      the 'pinned' flag.
 496    - when actually performing a dir op, we need to create and
 497      maintain this info.
 498
 499    When last ref of a dir block is dropped, should drop
 500    the parent reference.
 501
 502
 503  Status:
 504     free list management mostly done.
 505     Next:
 506       create/delete prepare/commit/abort
 507       orphan handling
 508       dirty_block lock_block
 509
 510
 511  FIXME should dir_new_block zero out the block?
 512    How will commit_create know what to do with this block?
 513
 514  NOTE another type of directory orphan is a free leaf block which
 515    is on the part-free list.
 516
 517 -------------------------------------------------------------
 518 09sep2006 - on the plane to Frankfurt
 519  Don't tell me I am rethinking preallocation again ???
 520
 521  TODO
 522    dirty_inode needs to record the phase it is dirty in
 523    inode_fillblock needs to check current phase and act accordingly.
 524      see inode.doc
 525    Make sure the B_Orphan flag is set and used - or discard it.
 526
 527    How do we commit creating a symlink?
 528    If it is a full block in size we cannot make an update record.
 529     - maybe have two update records? We cannot guarantee they are in
 530       the same  cluster.
 531     ... but if we put the 'make dir entry' last it should work.
 532
 533    Change 'struct descriptor' definition
 534    the 'block_type' aka 'length' 16 field becomes
 535       0x0000 -> 0x8000 -> datablock, possibly a hole - up to 32K.
 536       0x8001 -> 0xc000 -> miniblock up to 16K+
 537       0xffff           -> index block.
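The three ranges above as a classifier. The enum and function names are
invented for illustration; values from 0xc001 to 0xfffe are not assigned
by the note, so they are reported as reserved here (an assumption).

```c
#include <assert.h>
#include <stdint.h>

enum desc_kind {
	DESC_DATA,	/* 0x0000 - 0x8000: datablock, possibly a hole */
	DESC_MINI,	/* 0x8001 - 0xc000: miniblock */
	DESC_INDEX,	/* 0xffff: index block */
	DESC_RESERVED	/* anything else: not assigned above (assumption) */
};

/* Classify the 16-bit 'block_type' aka 'length' field of a descriptor. */
static enum desc_kind desc_classify(uint16_t v)
{
	if (v <= 0x8000)
		return DESC_DATA;
	if (v <= 0xc000)
		return DESC_MINI;
	if (v == 0xffff)
		return DESC_INDEX;
	return DESC_RESERVED;
}
```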
 538
 539    Need to write IO routines which decrease pending-block-count in
 540      'wc'.
 541
 542
 543    Thinks.  a 1TB filesystem with 1K blocks and 4096 blocks/seg
 544      gives 4Meg segments. That would be 256K segments which at 2 bytes per segment
 545      - 512 segments per block - is 512 blocks in each seg usage file
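The same sum spelled out, so other block and segment sizes can be tried.
The struct and function names are made up for this sketch.

```c
#include <assert.h>
#include <stdint.h>

struct seg_layout {
	uint64_t seg_bytes;	/* bytes per segment */
	uint64_t nsegs;		/* segments in the filesystem */
	uint64_t per_block;	/* usage entries that fit in one block */
	uint64_t usage_blocks;	/* blocks needed by one seg usage file */
};

/* Work out segment count and seg usage file size for a given geometry. */
static struct seg_layout seg_layout_calc(uint64_t fs_bytes, uint64_t block_size,
					 uint64_t blocks_per_seg,
					 uint64_t entry_bytes)
{
	struct seg_layout l;

	assert(block_size && blocks_per_seg && entry_bytes);
	assert(entry_bytes <= block_size);
	l.seg_bytes = block_size * blocks_per_seg;
	l.nsegs = fs_bytes / l.seg_bytes;
	l.per_block = block_size / entry_bytes;
	/* Round up so a partly filled last block is still counted. */
	l.usage_blocks = (l.nsegs + l.per_block - 1) / l.per_block;
	return l;
}
```

For 1TB, 1K blocks, 4096 blocks/seg and 2-byte entries this gives 4MB
segments, 262144 (256K) segments, 512 entries per block and 512 blocks
per seg usage file, matching the figures above.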
 546
 547 12oct2006
 548  Need to write
 549  - lafs_lock_{d,}block  DONE
 550        Make sure the block has parents and allocation and set the locked
 551        flag and the phase.
 552
 553  - lafs_flush
 554        Given a datablock, wait for it to be written out
 555        This is needed before updating a block that is still locked in the
 556        previous phase.
 557  - lafs_inode_init
 558        Used when creating a new object/inode
 559        Given a datablock which is to hold the inode
 560          and a type (Type*) and a mode,
 561        Fill in the data block with appropriate data so that
 562           when lafs_import_inode looks at it, the right stuff happens.
 563  - phase_flip
 564  - lafs_prealloc
 565  - lafs_seg_ref
 566  - lafs_lock_inode
 567
 568 lafs_dirty_dblock
 569 lafs_cleaner_pause
 570 lafs_dirty_inode
 571 lafs_seg_flush_all
 572 lafs_write_all_super
 573 lafs_quota_flush
 574 lafs_space_use
 575 lafs_cluster_update_abort
 576 lafs_cluster_update_commit_buf
 577 lafs_cluster_update_commit
 578 lafs_seg_apply_all
 579 lafs_cluster_update_prepare
 580 lafs_inode_phase_check
 581 lafs_seg_dup
 582 lafs_dirty_block
 583 lafs_cluster_update_lock
 584 lafs_checkpoint_unlock_wait
 585 lafs_orphan_drop
 586 lafs_free_get
 587 lafs_find_next
 588
 589 2nov2006
 590  - I need to know if a block is undergoing write-io so that I can
 591    avoid modifying it in certain circumstances.  But I don't track
 592    this information.  Options:
 593     1/ track the info.  This means an extra field in the 'struct block'
 594         because I still need to know which wc has had a write.
 595     2/ For blocks that we care about copy the data on write...
 596         But we care about all inodes and directory blocks.  That is a waste.
 597    I think we put extra info in the block.
 598    We need to know which wc was used (0,1,2) and which pending cluster
 599    in there (0-3) which comes to 4 bits.
 600    But we only care about the block for wc=0. and we could include the
 601    which-pending in the b_end_io, or maybe put it all in low bits
 602    of the block pointer....  Need max 4 bits.  Can only be sure of 2...
 603
 604    Maybe:
 605        'which' goes in bottom two bits of bi_private
 606        'wc' goes in ->flags
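A small sketch of stashing 'which' in the bottom two bits of a pointer
such as bi_private. It relies on the pointed-to object being at least
4-byte aligned, which is the "can only be sure of 2" bits point above.
The helper names are invented.

```c
#include <assert.h>
#include <stdint.h>

#define WHICH_MASK ((uintptr_t)3)	/* two low bits: pending cluster 0-3 */

/* Fold 'which' (0-3) into the low bits of an aligned pointer. */
static void *private_pack(void *ptr, unsigned which)
{
	assert(((uintptr_t)ptr & WHICH_MASK) == 0);	/* needs 4-byte alignment */
	assert(which <= 3);
	return (void *)((uintptr_t)ptr | which);
}

/* Recover the original pointer. */
static void *private_ptr(void *packed)
{
	return (void *)((uintptr_t)packed & ~WHICH_MASK);
}

/* Recover 'which'. */
static unsigned private_which(void *packed)
{
	return (unsigned)((uintptr_t)packed & WHICH_MASK);
}
```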
 607
 608
 609 4apr2007  (What a long gap !!)
 610
 611  - lafs_cluster_update_*
 612    How do we prepare for a cluster update?  How do we lock it.
 613
 614    The important thing is that the update can be written.  That
 615    requires that there is space available.  So we need to preallocate
 616    space and then release it.
 617    It is possible that each update might go in a different cluster, so maybe
 618    we need to preallocate one block per update.  That sounds a little expensive.
 619    After all, we aren't preallocating a cluster block for every data block
 620    that is dirty.
 621    So: prepare does nothing
 622         lock preallocates the space - a full block.
 623         commit copies it in.
 624     For now at least.
 625
 626 24May2007
 627
 628  - Can now create and delete lots of files.  This is cool.
 629   But:
 630     Orphan slots just grow and grow - never to be reclaimed - why?
 631     After rm f*, 7 files remain.  but rm f* again and they go.
 632          FIXED - readdir wasn't returning them
 633     Size of directory remains large.
 634     And sometimes, files become ghosts... (try just removing one after first rm f*).
 635
 636   TODO - process those orphans to clean up the directory.
 637
 638 20June2007 (Happy Birthday Dad)
 639
 640  - Creating lots of files and then deleting them leaves 5 orphan slots
 641    for the directory busy, and one for inode 0??
 642
 643    Directory handling uses the following orphans:
 644     CREATE:
 645         A new index block is created by splitting.  This needs to be linked in.
 646     DELETE:
 647         The dirent block we are deleting from
 648            If it becomes empty, it needs to go on free list
 649         The index block we are deleting from
 650            If it has lots of free space it might need to be rebalanced.
 651      The inode that was deleted.
 652
 653
 654  - When a file is fully deleted, we need to drop any orphan info... DONE
 655  - Need to do orphan handling of free blocks in directory, and
 656    unmerged parents - but there doesn't seem much point as I am going to
 657    change the directory layout (again).
 658
 659  So: writing to a file.
 660    We need prepare_write, commit_write, and writepage.
 661    Prepare loads and links the page and checks there is space.
 662    commit marks it as dirty so writeout is possible.
 663    writepage chooses a page to write out
 664
 665 25June2007 - HACK week, thanks Novell!!
 666  - write - DONE
 667  - sync
 668      Somewhat done.
 669      Need to revise the process whereby async completion
 670      clears PageWriteback,
 671      We need locking in there, and need to worry about
 672        'which' wrapping too soon.
 673      Need to not start IO before we set page writeback
 674  - chmod
 675      Maybe, but syncing to disk needs more thought.
 676  - 'df'
 677     Partly done, need actual content.
 678  - mkdir
 679     Can make directory, but creating first entry fails. - FIXED
 680  - symlink
 681  - readlink
 682  - new directory structure.
 683
 684 27Jun2007 - More HACK week :-)
 685
 686  - new directory layout done - much easier!!
 687  - If I delete a file that was created, the blocks still have a ref-count
 688    and we crash.
 689  - mkdir doesn't increase link count on parent. - FIXED
 690
 691  TODO:
 692    Orphan handling.
 693      Infrastructure to process orphans
 694      Handle specific cases
 695      flush orphans at key times.
 696      load orphans at roll-forward
 697
 698    checkpoint
 699      Write out a checkpoint (when?)
 700      Make sure refcount goes back to zero on blocks I write.
 701
 702   Check on inode_phase_check and checkpoint_unlock and inode_dirty
 703    in all directory operations.
 704
 705  FIX: Writing a small file leaves something non-dirty but
 706     due to be written, and lafs_cluster_allocate complains.
 707   - seems to work now.
 708
 709  FIX: dir_handle_orphan doesn't lock the orphan transaction required.
 710
 711  FIX: rm a file with (small) content hangs waiting in sync_page in truncate_inode_pages.
 712
 713  FIX: lafs_allocate hasn't been written!!!
 714
 715  FIX: before updating any block in a depth=0 file, we must first load
 716       and 'lock' block 0.
 717
 718 29Jun2007 - still HACK week.
 719   Summary of how incorporation works.
 720
 721   Each index block has a small table for unincorporated changes. i.e.
 722   block numbers and their addresses.
 723   This supports efficient storage of extents, and is extensible by allocating
 724   more tables.  This last is done rarely.
 725
 726   When a block gets a new address, this is added to the table or, if
 727   there is a phase mismatch, it is added to a list until a phase change
 728   happens (so the whole block is pinned pending the phase change).
 729
 730   If the table is full then:
 731    - if the filesystem is read-only (including during roll-forward),
 732      a new table is allocated (else rollforward fails).
 733    - otherwise we incorporate the table into the block, then add the new
 734      address to the (now empty) table.
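The rules so far, written as a single decision function. The enum and
parameter names are invented for this sketch; it only picks the action
and does none of the work.

```c
#include <assert.h>

enum uninc_action {
	ADD_TO_TABLE,		/* room in the table, phases agree */
	QUEUE_UNTIL_PHASE,	/* phase mismatch: hold on the uninc_next list */
	ALLOC_EXTRA_TABLE,	/* table full, read-only or rolling forward */
	INCORPORATE_THEN_ADD	/* table full, writable: incorporate first */
};

/* Decide what to do with a new address for a child of an index block. */
static enum uninc_action uninc_choose(int block_phase, int index_phase,
				      int table_full, int read_only)
{
	assert(table_full == 0 || table_full == 1);
	if (block_phase != index_phase)
		return QUEUE_UNTIL_PHASE;
	if (!table_full)
		return ADD_TO_TABLE;
	if (read_only)
		return ALLOC_EXTRA_TABLE;
	return INCORPORATE_THEN_ADD;
}
```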
 735
 736   If incorporation requires that we split the index block we allocate one
 737    from a pool.  If there are none in the pool, we wait.
 738
 739   As the table is much smaller than a block, the incorporation into
 740   two blocks will always succeed.
 741   The 'uninc_next' and 'children' lists will then need to be shared
 742   between the two blocks before the new address is added to whichever
 743   table is appropriate.
 744
 745   When looking for a block address, we must always check the table and
 746   then children lists.  We do not need to check uninc_next as they will always
 747   be children.
 748
 749   How to ensure that the pool always has sufficient index blocks and we don't
 750   deadlock?
 751   We have two halves of the table, one for each phase.  Before we allow
 752   a block to be dirty in a phase, we ensure that the pool has adequate
 753   index blocks for that phase.  e.g. twice the depth of the block.  If it
 754   doesn't we block the dirtying until space becomes available.
 755   For syscall writes, this is easy as we catch in prepare_write.
 756   When we perform a phase change, we must be sure there are enough index
 757   blocks for the deepest block that will stay dirty.  If there aren't, we need
 758   to flush all dirty blocks, and unmap all writable mappings before
 759   starting the checkpoint.
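A toy model of the reservation check, assuming a simple per-phase count
of reserved index blocks and the "twice the depth" figure given above as
an example. The struct and function names are invented. A -1 return
stands for "block the dirtying until space becomes available".

```c
#include <assert.h>

/* Hypothetical pool accounting: spare index blocks and per-phase claims. */
struct iblock_pool {
	unsigned spare;		/* index blocks sitting in the pool */
	unsigned reserved[2];	/* blocks promised to each phase */
};

/* Try to reserve enough index blocks to dirty a block at 'depth'. */
static int pool_reserve(struct iblock_pool *p, int phase, unsigned depth)
{
	unsigned need = 2 * depth;	/* "twice the depth of the block" */
	unsigned claimed;

	assert(p && (phase == 0 || phase == 1));
	claimed = p->reserved[0] + p->reserved[1];
	if (claimed + need > p->spare)
		return -1;		/* caller must wait */
	p->reserved[phase] += need;
	return 0;
}

/* Give a reservation back once the block is clean again. */
static void pool_release(struct iblock_pool *p, int phase, unsigned depth)
{
	unsigned need = 2 * depth;

	assert(p && (phase == 0 || phase == 1));
	assert(p->reserved[phase] >= need);
	p->reserved[phase] -= need;
}
```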
 760
 761
 762  FIX: need to work out life time rules so that inodes hang around while they have blocks.
 763     currently have an igrab that is never put.
 764
 765  FIX: Dirty isn't cleared until 'flush', but do_checkpoints requires 'alloc' to clear it.
 766
 767 3Jul2007
 768  Checkpoint flushing is getting close.
 769  Current problem.
 770    InoIdx blocks are not changing phase.
 771    Phase change should happen when all children have been incorporated, and
 772     then the write has been triggered marking us clean.
 773   For InoIdx blocks, we need to be marked clean when the data block
 774    completes.
 775
 776 5jul2007 - a week off
 777  Checkpoint flushing seems to work !!!!
 778  FIX: what should filesize of symlink be?
 779      other filesystems use len, but still zero-terminate for vfs.
 780
 781  Problem.  A chmod is followed immediately by an unlink then a checkpoint.
 782    The chmod update gets into the checkpoint cluster, but the unlink completes
 783    before the checkpoint is finished so the new superblock sees the file
 784    as gone.  Roll-forward finds the update and wants to update a missing file.
 785
 786    This isn't a big problem, but with slightly different details, it could be.
 787
 788    One option is to ignore updates that precede the updated block.  That might
 789    be awkward with e.g. directory updates and checkpoints that cross multiple
 790    segments.
 791
 792    Another option might be to prohibit updates once a checkpoint has started
 793    unless they are known to be after the phase change.
 794
 795  FIX: unlink isn't punching a hole in the inode file.
 796       Inode usage map isn't being updated. - FIXED (For create, not unlink).
 797
 798  FIX: roll forward does not pick up inodes, only data blocks.
 799     But tiny files are synced to inode, so they might not be picked up.
 800     So we must process a level=0 inode like a data block.
 801
 802 6July2007
 803  Time for lots of clean up.
 804
 805 DONE 1/ Index blocks to fill with 0 - use phys=0 to imply invalid.
 806 DONE 2/ rename 'lock' -> 'pin'
 807  3/ Review and fix up all locking/refcounts.  See locking.doc
 808 DONE 3a/ Make sure cluster_allocate can be called concurrently. e.g. check
 809          B_Alloc inside the semaphore
 810        Also lock inode when copying in block 0 and probably
 811        when calling lafs_inode_fillblock (??)
 812 DONE 3b/ lafs_incorporate must take a copy of the table under a lock so
 813          more allocations can come in at any time.
 814 NotYet 3c/ cluster_flush should start all writes before calling _allocate
 815          as _allocate might block on incorporation/splitting.
 816        No.  We really want _allocate to not block, but to queue...
 817         I think this is too hard to get perfect just now, so I will leave it.
 818 DONE  3d/ introduce PinPending for data blocks.  remove fs->phase_depth.
 819 LATER 3e/ Index needs a clean-lru on each filesystem, and a list of filesystems
 820      so that locking of lru doesn't have to be too global
 821 DONE 3f/ change wc[]->hlhead to be a regular listhead as it is part of the
 822        lru system.
 823 DONE 3g/ revise refile lru handling based on new understanding
 824  3h/ Utilise WritePhase bit, to be cleared when write completes.
 825      In particular, find when to wait for Alloc to be cleared if
 826       WritePhase doesn't match Phase.
 827        - when about to perform an incorporation.
 828  3i/ make sure we don't re-cluster_allocate until old-phase address has
 829      been recorded for incorporation.
 830  3j/ Check that index blocks cannot race when getting locked....
 831   k/ Check what locking is needed to set PagePrivate exclusively.
 832 DONE  l/ cluster_done needs to call refile, but is called in interrupt context.
 833      We need to get it done in process context I think and lock
 834       ->waiting access with fs->lock after changing it to ->lru
 835 DONE  m/ Need to know which blocks in a page are in writeback so we can clear writeback
 836         only when *all* have finished.
 837   n/ on phase change, uninc_next blocks need to be shared out.
 838 NO 3o/ Make sure lafs_refile can be called from irq context.
 839  3p/   lock all lru accesses.
 840  3q/ Lock those index blocks!!!
 841  3r/ Can inode data block be on leafs while index isn't, what happens if we
 842        try to write it out...
 843  FIXED Why are extent entries only grouped in 4s?
 844  If InoIdx doesn't exist, then write_inode must write the data block.
 845  4/ resolve length of symlink
 846    FIXED - long symlink followed by 'sync' crashes.
 847    FIXED - rollforward isn't calling 'allocated' on blocks, or something
 848    FIXED - I cannot find 'bfile'. (inode isn't written)
 849    SEEMS OK...- Must flush final segment of a cluster properly...
 850  5/ Review what does, and does not need to be initialised in a new datablock
 851  6/ document and review all guards against dirtying a block from a previous phase
 852     that is not yet safe on storage.
 853           See lafs_dirty_dblock.
 854  7/ check for proper handling of error conditions
 855      a/ checkpoint_start might fail to start a thread!
 856      b/ if lafs_seg_dup fails in do_checkpoint, we need to abort the snapshot.
 857  8/ review checkpoint loop.
 858        Should anything be explicit, or will refile do whatever is needed?
 859  9/ Waiting.
 860        What should checkpoint_unlock_wait wait for?
 861        When do we need to wait for blocks to change state? And how?
 862 DONE 10/ rebase on 2.6.current
 863 DONE     - use s_blocksize / s_blocksize_bits rather than fs->
 864
 865  11/ load/dirty block0 before dirtying any other block in depth=0 file
 866  12/ Add writecluster flag for old-phase updates.
 867      Why is this needed?  updates should always go in the new phase???
 868  13/ use kmem_cache for 'struct datablock'
 869  14/ indexblock allocation.
 870         use kmem_cache
 871         allocate the 'data' buffer late for InoIdx block.
 872         trigger flushing when space is tight
 873         Understand exactly when make_iblock should be called, and make it so.
 874  15/ use a mempool for skippoints in cluster.c
 875  16/ Review seg addressing code in cluster.c and make sure comments are good.
 876 DONE 17/ Make sure create inherits uid etc from process.
 877  18/ consider ranges of holes in pending_addr.
 878
 879  20/ Implement rest of "incorporate"
 880  21/ Implement staged truncate
 881          use for setattr and delete_inode
 882 DONE 22/ block usage counts.
 883  23/ review segment usage /youth handling and make a todo list.
 884       a/ Understand ref counting on segments and get it right.
 885  24/ Choose when to use VerifyNull and when to use VerifyNext2.
 886  25/ Store accesstime in separate (non-logged) file.
 887  26/ quotas.
 888         make sure files are released on unmount.
 889
 890  30/ cleaner.
 891        Support 'peer' lists and peer_find. etc
 892  31/ subordinate filesystems:
 893      a/ ss[]->rootdir needs to be an array or list.
 894      b/ lafs_iget_fs need to understand these.
 895  32/ review snapshots.
 896       How to create
 897       how they can fail / how to abort
 898       How to destroy
 899  33/ review unmount
 900       - need to clean up checkpoint thread cleanly - be sure it has fully exited.
 901  34/ review roll-forward
 902       - make sure files with nlink=0 are handled well.
 903       - sanity check various values before trusting clusters.
 904
 905  34/ Configure index block hash_table at run time based on memory size??
 906  35/ striped layout.
 907          Review everything that needs to handle laying out at cluster
 908          aligned for striping.
 909
 910  36/ consider how to handle IO errors in detail, and implement it.
 911  37/ consider how to handle data corruption in indexing and directories and
 912      other metadata and guard against problems (lot of -EIO I suspect).
 913
 914  - check all uninc_table accesses are locked if needed.
 915
 916 And more:
 917   1/ fs->pending_orphans and inode->orphans are largely unused!
 918   2/ If a datablock is memory mapped writeable, then when we write it out,
 919      we need to either fill up its credits again, or unmap it.
 920   3/ Need to handle orphans asynchronously.
 921
 922 ---------
 923 22nov2007
 924 Free index blocks are on two lists, both protected by the global
 925 hash_lock.
 926   1/ The per-inode free_index, so they can be destroyed with the inode
 927   2/ The global freelist so they can be freed by memory pressure.
 928
 929 11feb2008.   Where was I up to again?
 930    reviewing phase_flip and lafs_refile.
 931
 932   UPTO
 933      Reading through modify.c, at 'add_indirect'.  Plan to fix all this code.
 934      Need to think about how index blocks really change.  How old blocks get
 935       dis-counted from segment usage, and what optimisations are really good
 936       for re-incorporating index blocks.
 937         Operations to consider are:
 938               i)Append new block, ii)truncate, iii)over-write, iv)fill-hole.
 939           i/ leaf block splits, index block gets new entry at end, and replacement
 940                   for other entry.  Easy to handle
 941          ii/ trailing entries are zeroed.  Should be easy, but isn't yet.
 942         iii/ probably caught in leafs.  May cause internal split so we add new
 943              index address, which is easily handled if there is space.
 944          iv/ same as iii, though split more likely.
 945
 946        What about merging index blocks.  That just makes addresses disappear, which
 947         we handle the slow way.
 948        Do we ever re-target index blocks?  Would need to be careful about that.
 949        Make it look like a split where one block ends up empty as a hole.
 950      Need to write
 951            grow_index_tree (DONE - untested)
 952                   ib is a leaf inode that is getting full.  Copy addresses
 953                   into 'new', and make 'ib' an index block pointing at new.
 954
 955            add_index/walk index (DONE - untested)
 956
 957            end of do_incorporate (DONE - untested)
 958                 new contains the early addresses.  Some remain in ib
 959                  and/or ui.
 960                 the buffers must be swapped, so ib has the early address.
 961                 ui needs to be attached to new
 962                 return 2; - then new uninc needs to be split
 963
 964            lafs_incorporate
 965                 case 2 - horizontal split
 966                 case 3 - vertical split
 967   12feb2008
 968    Bother - uninc_table is a problem (again).
 969    We can currently add at any time with just a spinlock.
 970    So when we split a block horizontally,
 971
 972
 973    Still need to
 974           share out children and uninc_table in do_incorporate
 975           share out credits in do_incorporate
 976
 977 14feb2008
 978    Still need to do incorporate as above but took a break to...
 979
 980    Counting allocated blocks now works - stat shows right info, hopefully
 981      storage is correct too. - DONE
 982
 983    next: truncate?  orphan thread?
 984       Then segment usage and the cleaner.
 985
 986
 987    thoughts:
 988     truncate - removing blocks doesn't need to erase them...
 989     - nothing forces a cluster_flush promptly!!!  We need a timeout
 990          or at least we need a flush before truncate_inode_pages...
 991
 992     - in lafs_truncate we need to make the block an orphan and pin it
 993       all in a checkpoint.
 994
 995 21Feb2008 (Research morning)
 996    Discard checkpoint thread created on demand in favour of a cleaner
 997    thread that runs all the time.  It cleans and checkpoints and
 998    orphans and scans.
 999
1000      want to:
1001         do segment scan and get a real list of free segments and
1002         free-space info!
1003
1004 25Feb2008
1005  - segment usage scanning to count free blocks
1006  - fix up re-reading of erased blocks
1007  - FIX truncate can still block waiting for writeback to complete.
1008  - FIX allocations aren't failing when we run out of free space
1009  - FIX df doesn't agree with du.
1010
1011  problem:
1012    Truncate when an index block has addresses in uninc_table.
1013      The summary for the new address has already been performed.
1014      We need to deallocate the new without disturbing the old.
1015      However a simple allocation may not be possible.
1016      I guess we can prune them all to zero, then incorporation
1017       can proceed.
1018
1019  TOFIX: when truncating a recently created file, it is still depth=0 so
1020     nothing happens.
1021     We really need to increase the depth to 1 as soon as we dirty
1022     any block, then reset back to 0 if it fits.
1023
1024 26Feb2008
1025   We have a file that we have written to, and the data blocks have been
1026   written out and the addresses stuck in uninc_table.
1027   We then truncate the file.  Who releases the usage of those blocks?
1028   And who removes them from uninc_table?
1029
1030   OK, 'rm' returns all the blocks back now so 'df' is almost the same as 'du'.
1031   I really should make sure that inodes are getting freed properly and the
1032   inode map is clean and everything.
1033
1034   BIG QUESTION
1035     Do we reserve segment-usage blocks.
1036      We cannot do it naively as we get infinite recursion.
1037      But we need it to be allowed to dirty the segment block.
1038      But we cannot pin them to this phase as we want to write them out
1039      after this phase
1040      This still needs more thought.  I avoided the recursion by setting SegRef
1041      before getting the ref.  But that isn't safe.
1042
1043 28Feb2008
1044   The table of cleanable segments is not working out.  Each segment appears multiple
1045   times which wastes space and adds confusion.
1046   We really want to be able to lookup by dev/seg and also find the least.
1047   'Find least' sounds like we want a heap but then we cannot discard the bottom half.
1048
1049   We could have a skiplist for dev/segment lookup and do a merge-sort on
1050   a different link when we want to find the best segment.
1051   We then remember the best number found since a sort, and re-sort if the top
1052   is worse than the best.
1053
1054   We keep all this in a fixed size table.  Each entry has
1055    seg,dev,usage,weight,weight-sort-link,addr-sort-link and possibly some
1056       addr-sort-skip links.
1057    This is 32+32+16+16+16+16 bits, or 16 bytes or bigger.
1058    Say 16bytes, 24bytes, or 32 bytes. (depth 8, which is plenty).
1059    One page of 16byte entries (256 of them)
1060    2/3 page of 24byte entries, 1/3 of 32byte entries.
1061    Total 2 pages, and 256+113+43 = 412 entries.
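
For reference, the smallest entry described above could be laid out like
this.  It is only a sketch: the assignment of widths to fields follows the
order in which they are listed (seg, dev, usage, weight, two links), which
is an assumption, and the 4096-byte page size and all names are
hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_BYTES 4096u   /* assumed page size */

/* 32+32+16+16+16+16 bits = 16 bytes, with no skip links. */
struct seg_entry16 {
    uint32_t seg;           /* segment number */
    uint32_t dev;           /* device number */
    uint16_t usage;         /* blocks in use */
    uint16_t weight;        /* sort key for choosing what to clean */
    uint16_t weight_link;   /* index of next entry in weight order */
    uint16_t addr_link;     /* index of next entry in dev/seg order */
};

_Static_assert(sizeof(struct seg_entry16) == 16,
               "smallest table entry should pack into 16 bytes");

/* How many whole entries of a given size fit in one page. */
static unsigned int entries_per_page(unsigned int entry_bytes)
{
    assert(entry_bytes > 0);
    return PAGE_BYTES / entry_bytes;
}
```

With these assumptions entries_per_page(16) gives the 256 quoted above; the
24- and 32-byte variants would append extra 16-bit skip links to the same
header.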
1062
1063   But deleting random elements is awkward... but not too awkward.  We can delete
1064   lots of entries by marking them as old, then performing a single pass of the skip
1065   list deleting them.
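
The mark-then-sweep deletion could look something like the sketch below.  To
keep it short it walks only a plain singly linked list (the bottom level of
a skiplist); a real skiplist would also have to unlink the higher levels.
All names are made up for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct seg_entry {
    unsigned int seg;
    bool old;                   /* marked for deletion, not yet unlinked */
    struct seg_entry *next;     /* level-0 link in dev/seg order */
};

/*
 * One pass over the list: unlink every entry marked 'old' and push it
 * onto the free list.  Returns the number of entries removed.
 */
static size_t sweep_old(struct seg_entry **head, struct seg_entry **freelist)
{
    struct seg_entry **pp = head;
    size_t removed = 0;

    assert(head != NULL && freelist != NULL);

    while (*pp) {
        struct seg_entry *e = *pp;

        if (e->old) {
            *pp = e->next;          /* unlink from the live list */
            e->next = *freelist;    /* push onto the free list */
            *freelist = e;
            removed++;
        } else {
            pp = &e->next;
        }
    }
    return removed;
}
```

Marking is then a cheap flag write at any time, and the cost of unlinking is
paid once per sweep rather than once per deleted entry.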
1066
1067   We should keep free segments here too, on a separate list.
1068
1069   So how about:
1070    2 pages of 16byte entries
1071    1 page of 24
1072    1 page of 32
1073
1074   free list randomly threads through all.
1075
1076   When using from 24 or 32, randomly choose height of 2-5 or 2-9
1077   Two lists run through the skiplist entries.  One for cleanable, one for free.
1078   Remember the nth element for some small n (10, but it decreases as we pull
1079   things off the front) and if we add something less than that, we trigger a
1080   mergesort on the next time we want to clean.... maybe.
1081
1082   Remember end of free list and add to there.  Maybe merge-sort the free list
1083   by addr occasionally.
1084
1085   Questions:
1086     When can we clean, when can we free wrt checkpoints?
1087       - we can clean a segment as soon as we have a checkpoint after it.
1088         So we record the youth of the segment holding the (start of the)
1089         checkpoint, and can clean any segment with a lower youth.
1090       - we can free a segment after the checkpoint after its usage has reached
1091         zero.  So if usage is zero and youth....
1092         We could offset the usage by one (say - for the first cluster header..)
1093         then when we find a segment with usage of '1', we schedule an update to
1094         0 in the next checkpoint...
1095     How about segments with different sizes - they get different weights.
1096        Need to divide by segment size:  usage * youth / size.
1097
1098   TOFIX
1099    - It seems I sometimes fall off the end of the last segment !!! - FIXED (locking)
1100    - We seem to switch to a new segment when still 83 blocks remaining? - FIXED (delete did flush)
1101
1102    - Lots of 'creates' makes lots of little clusters - need to optimise!
1103         Or it could be deletes as we currently cluster_flush for each
1104         delete.
1105          - I think this is fixed
1106
1107 29Feb2008
1108   Started looking at the cleaner.
1109   Need to understand how much to clean each checkpoint
1110   Need to track free-space-in-active-sectors while scanning.
1111
1112 3Mar2008
1113   TOFIX
1114     - the cluster head is currently limited to one page.  This is not good.
1115
1116     - Should the cleaner start before the scan is complete after a checkpoint?
1117       Probably it can, but while the scan is still happening it might be best
1118       to be cautious ??
1119
1120   STATE:
1121     try_clean is taking shape and has a few FIXMEs.
1122     need to write async find_block code and get it to watch for
1123        block in a cleaning segment.
1124
1125 28Mar2008
1126   - where can padding appear in a cluster? between miniblocks? at
1127     end of device blocks?
1128   - need to track phys block while parsing headers for cleaning.. why?
1129   - determine rules for avoiding block lookup during cleaning
1130     based on youth/snapshot age, and truncate generation.
1131      We need to load the inode from each snapshot
1132     Can we optimise based on snapshot age?
1133     only if we know the block is newer than the snapshot.
1134     So when we relocate blocks (cleaning) they must go in a segment
1135     that is marked as being old. we cannot really guarantee that.
1136     I guess blocks that are marked as 'new' can safely be skipped if
1137      segment is newer than snapshot. This 'age' is not the youth, but
1138     is the cluster_head->seq which is stored in creation_age.
1139
1140  - Store the rootdir for a filesystem in the metadata for the root inode.
1141    Then 'struct snapshot' doesn't need rootdir.  It can have a root
1142
1143 30Jun2008
1144   Looking at lafs_find_block_async.
1145      Needs async flag to make_iblock.
1146         Check that.  Can we block_adopt if there was an error?
1147              iblock will exist.
1148      setparent has async flag.
1149      lafs_leaf_find has async flag
1150      lafs_wait_block_async
1151
1152   FIXME I wakeup the cleaner every time an IO completes.
1153   Do I really want that?  Maybe only when number of async IOs hits
1154   half the recent maximum??
1155
1156   FIXME need to ensure that lafs_pin_dblock flushed committed
1157     B_Realloc blocks.
1158
1159   FIXME when we incorporate a dirty (non-realloc) address to an index block,
1160     we need to clear B_Realloc on the indexblock.
1161
1162   FIXME in lafs_incorporate we lafs_dirty_iblock 'new' without
1163     giving it any credits.  Where should they come from?
1164
1165   We don't seem to scan for free/cleanable segments often enough.
1166
1167   FIXME we shouldn't start a checkpoint while cleaning is happening.
1168
1169   FIXME need to be careful when cleaning about finding inodes that
1170     don't exist any more.
1171
1172   FIXME give credits to realloc blocks.
1173
1174   FIXME think about/document transitions between realloc and dirty,
1175     and what locking is needed.
1176
1177 2Jul2008
1178   Allowing for the FIXMEs above, the cleaner is now identifying
1179   blocks that need to be cleaned and marking them B_Realloc (I think).
1180   We now need to gather these into a write cluster and write them.
1181   They will all be on the clean_leafs list, so we can iterate that
1182    allocating or incorporating as needed.  This will be similar to
1183    do_checkpoint.
1184   Important question is: when?
1185    Ideally we would have some auto-flush mechanism.  The cleaner just
1186    keeps finding blocks to clean and when we start running out of
1187    resources we flush the cleaning queue.
1188    However we will still want to flush the cleaner always before a
1189    checkpoint, so for now we can implement that bit and wait for a
1190    need for the other to arise.
1191
1192
1193   FIXME: cleaner lookup of 0/0/0 has interesting consequences as we
1194       don't record that location the same way.. how to handle?
1195      Should check that 'adopt' doesn't do the wrong thing with this block.
1196
1197
1198   Realloc blocks need to be pinned.  That makes sense.  Only that way
1199     will they get onto the clean_leafs list.
1200   When checkpointing we should probably examine clean_leafs to be
1201    on the safe side.
1202
1203
1204   Realloc and Dirty:
1205
1206      Both of these hold a Credit.
1207      Both can be set at the same time.
1208      Cleaner ignores Dirty and sets Realloc anytime the block is in
1209       the wrong segment.  It also Pins the block.
1210      When the cleaner is flushing to the cleaning segment, it
1211       ignores Dirty blocks.  They get their Realloc cleared, but
1212       they remain pinned.  So they will get moved at the next checkpoint.
1213      How do we know whether an indexblock should be Dirty or Realloc?
1214       The Dirty/Realloc bit is cleared before we get to incorporation.
1215       Maybe we lafs_dirty_iblock the parent of any block we write
1216        out.  Then after incorporation, we set Realloc if it is not
1217        dirty.
1218
1219 STATUS:
1220   I think I'm pinning cleaner blocks now.
1221   Need to make sure the dirty ones are dropped. DONE
1222   Need to make sure the usage is transferred
1223   Need to get free segments back into use
1224   Need some more 'dump' options.  Maybe youth/usage files.
1225       Maybe tree.
1226   Need to make sure scan etc are triggered often enough.
1227
1228   FIXME lafs_prealloc walks up ->parent without locking
1229     I think we want i_mapping->private_lock like lafs_pin_iblock.
1230
1231 TODO:
1232   1/ a 'dump' option that triggers a scan and prints everything out.
1233   2/ scan must mark freeable as such, then subsequently free them.
1234   3/ Look at code that decreases usage of old segments.
1235   4/ Review lafs_cluster_wait_all and decide exactly how long we need
1236      to wait.
1237   5/ Review 'FIXME that is gross' HZ/10 thing.
1238   6/ Review 'wait for checkpoint to flush' msleep(500);
1239            Maybe remove that altogether.
1240
1241   FIXME BUG_ON in grow_index_tree fires. sync - writepages - flush
1242   FIXME BUG in lafs_allocated_block fired.
1243             from lafs_erase_dblock from invalidate_page from .. vmtruncate
1244              from lafs_setattr
1245
1246   Current problem:
1247     An inode data block is dirty and pinned, but the inoidx is no longer
1248        pinned.  Presumably it isn't dirty.
1249      Recheck what 'dirty' means on the two blocks and see how this can happen.
1250
1251 10july2008
1252   Tree gets very big!  Lots of 'Realloc' blocks that should
1253    be long gone.
1254
1255   We are spinning in cleaner again, and not in try_clean.
1256
1257   Is it a problem that 'Pinned' is used for Realloc and dirty blocks?
1258   In general it shouldn't be.  The flush_cleaner process will remove
1259    the Realloc bits so the blocks fall off clean_leafs.  They then either
1260    go onto phase_leafs or get unpinned.
1261   But I currently have a problem with InoIdx/data.
1262   The Pin is transferred to the Data block, but it doesn't go from the
1263    InoIdx block because it has a pincnt.  Now that is probably a bug, but
1264    what if it weren't?  What if, while we were cleaning, a block got dirtied.
1265    That would pin the whole tree.
1266   I guess the rule about not allocating an inodedata block while the
1267    InoIdx is pinned needs to be revised.  If the inodedata block is
1268    Realloc (and not Dirty) while the InoIdx is not Realloc, we
1269    can go ahead (in a cleaning segment).
1270
1271  FIXME to check:
1272    adir/big1 is garbage.... big1 was removed, so why is it even there?
1273               FIXED.
1274    echo tre > dump  # still too much stuff.
1275
1276
1277
1278  Put cond_resched in checkpoint loops!
1279
1280
1281  Thoughts about cleaning and pinning.
1282
1283   When cleaning we need to know how many dependent blocks are being cleaned
1284   so that we know when *this* block can be written - i.e. when the count hits 0.
1285   We cannot use the pincnt for this phase because there may be dependent blocks
1286   which are dirty.  They, and therefore this, may get flushed at next checkpoint,
1287   but they may not.  If we could be certain they would, we could just write
1288   to the clean-segment blocks which can become unpinned.  However if there
1289   is an index block being cleaned, and no dependant is being cleaned, but some
1290   are dirty but not pinned, then the checkpoint can go past without the block
1291   being moved.... but maybe we can detect that.
1292
1293   Try this:
1294     We set B_Realloc precisely on blocks found in segments being cleaned.
1295     We pin these blocks and leafs which are Realloc go in clean_leafs.
1296     If a block is both Realloc and Dirty we clear Realloc but leave pinned.
1297        That way it gets written at end of checkpoint, but to main cluster.
1298     When we incorporate Realloc blocks into an index block, it gets marked
1299        Realloc.  When we incorp dirty blocks, mark dirty.  Then see above.
1300     On a checkpoint, we process both phase_leafs and clean_leafs
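
The "Try this" rules can be written down as a tiny flag model.  This is a
user-space sketch, not the module's code: the B_* constants here are local
stand-ins for the real block flags, and the helper names are invented.

```c
#include <assert.h>

/* Local stand-ins for the block state bits discussed above. */
enum { B_DIRTY = 1 << 0, B_REALLOC = 1 << 1, B_PINNED = 1 << 2 };

enum leaf_list { LIST_NONE, PHASE_LEAFS, CLEAN_LEAFS };

/* A block that is both Realloc and Dirty loses Realloc but stays pinned. */
static unsigned int normalise(unsigned int flags)
{
    if ((flags & B_DIRTY) && (flags & B_REALLOC))
        flags &= ~(unsigned int)B_REALLOC;
    return flags;
}

/* Cleaner found this block in a segment being cleaned. */
static unsigned int cleaner_found(unsigned int flags)
{
    return normalise(flags | B_REALLOC | B_PINNED);
}

/* Incorporating a child marks the index block the same way as the child. */
static unsigned int incorporate_child(unsigned int parent, unsigned int child)
{
    return normalise(parent | (child & (B_DIRTY | B_REALLOC)));
}

/* Which leaf list a pinned leaf block belongs on. */
static enum leaf_list which_list(unsigned int flags)
{
    if (!(flags & B_PINNED))
        return LIST_NONE;
    if (flags & B_REALLOC)
        return CLEAN_LEAFS;     /* written to the cleaning segment */
    if (flags & B_DIRTY)
        return PHASE_LEAFS;     /* written to the main cluster at checkpoint */
    return LIST_NONE;
}
```

Because normalise() runs after every change, a block is never on
clean_leafs while Dirty, which is the property the rules above rely on.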
1301
1302
1303  FIXME do inode reads async better when cleaning...
1304
1305  FIXME if a realloc inode has been allocated to a cluster when we try
1306      to dirty it, confusion can ensue as the writeout won't mark it
1307      clean, but will use up the credits.
1308      Maybe we need something similar to phasewait to not set PinPending...
1309       But normal dirtying doesn't phasewait.   I think we just need to
1310       detect this case and wait for the clean-cluster to flush.
1311       Messy...
1312
1313  FIXME make sure incorporate is doing the right thing with credits.
1314
1315  FIXME lafs_write_inode. We need to be careful about clearing Dirty
1316            when making an update.  Need some sort of locking.
1317            Need to review all inode dirty stuff and make sure we do
1318            the right thing no matter when it is called.
1319
1320  FIXME when blocks are attached to uninc_next, they don't have 'dirty'
1321         anymore so we don't know how to flag the index block.
1322
1323 2008jul13
1324  UPTO: unlink etc don't prealloc the inode that will be modified.
1325     And a warnon inode.c:579 is very noisy.
1326
1327 2008jul22
1328  FIXME: lafs_reserve_block uses CleanSpace if Realloc is set,
1329      but it doesn't get set until AFTER lafs_reserve_block is called.
1330
1331  Here I am...
1332    Cleaning cleans an InoIdx block which schedules the data block.
1333     Subsequently the InoIdx block gets pinned again.
1334     Now when we go to write the data block, we cannot because InoIdx is pinned
1335      in same phase.
1336      Maybe given that data block is pinned, we write it anyway...
1337
1338  FIXME: when we realloc a block embedded in the inode, don't pluck it out
1339         and put it back in again.  Just realloc the inode.
1340
1341  FIXME: when cleaning a directory that has shrunk, we think we have
1342      blocks that don't exist any more. FIXED - we thought '0' was in
1343      segment '0'.
1344
1345 2008jul23
1346   FIXME: lafs_dirty_iblock called from lafs_allocated_block in cluster
1347      flush finds no credit. for InoIdx block of 8501
1348
1349   FIXME: do we do SEGREF on all the index blocks? do we need to?
1350
1351
1352 2008jul24
1353   FIXME: seg usage for segment 0/5 isn't dropping to zero.
1354     Part of a file got moved off, but count is still there.
1355     FIXED - seg_move wasn't being called.
1356   FIXME: segusage file has inconsistent extents:
1357       Extent entries:
1358        0 -> 694 for 2
1359        1 -> 1291 for 1
1360        1 -> 15 for 1
1361    FIXED several bugs in walk_extent
1362
1363   FIXME qphase:  any locking between that changing and lafs_seg_move??
1364     I don't think so.  Just that seg_apply_all must be called after qphase is set.
1365
1366   FIXME make sure we don't try to clean the current segment!!
1367
1368   FIXME 'Available' goes negative!
1369       Creating large file doesn't instantly reduce 'Used'.
1370       Deleting files plus sync doesn't increase Avail?
1371
1372   FIXME a segment is in the table but doesn't print out!
1373
1374   FIXME we don't cope with running out of free segments (not that we ever should).
1375
1376   FIXME check all Credit usage and make sure credits are returned when
1377     ->parent is dropped.
1378     provide visibility into credit counts.
1379     Make sure we are keeping enough space for cleaning.  We should always
1380      have a few segments unallocatable.
1381
1382 2008jul25
1383   FIXME cannot do io completion in cleaner thread as it can block on
1384      a i_mutex which might be waiting for completion. FIXED (keventd).
1385
1386   FIXME as ->iblock isn't refcounted we need to be careful accessing it.
1387             If we 'know' we have a reference, e.g. a child with a ->parent
1388             link, we can access it without locking.
1389        So:
1390            lafs_make_iblock should return a counted reference.
1391
1392        If we own an (indirect?) reference to iblock, we can access
1393         both iblock and dblock for free... but iblock can change???
1394        If not, we need to get a reference to one or other under a lock.
1395
1396   FIXME block->inode should be a counted reference?
1397
1398 lafs_make_iblock OK
1399   lafs_leaf_find OK
1400     lafs_inode_handle_orphan OK
1401       inode_handle_orphan_loop FIXED
1402     __lafs_find_next OK
1403     find_block FIXED
1404   __lafs_find_next OK
1405     lafs_find_next FIXED
1406       dir_lookup_blk
1407       dir_handle_orphan
1408       lafs_readdir
1409       lafs_inode_handle_orphan
1410       choose_free_inum
1411   find_block - FIXED
1412
1413  FIXME root->iblock should always be refcounted.  Is it?
1414  FIXME walking siblings - what lock?
1415
1416 2008jul28
1417  FIXME several times we clear PinPending without refiling, in dir.c in particular.
1418     that looks wrong. FIXED
1419
1420   Maybe  lafs_new_inode should return a reference to the dblock
1421     Or pin it. or something. FIXED  And pinned (when needed).
1422
1423  FIXME lafs_inode_dblock might return a block without valid data...
1424    Need to get valid data, then load block 0 in find_block rather than
1425        load_block.  FIXED
1426
1427  FIXME we really should own a reference to ->dblock before calling
1428     lafs_pin_inode.  We don't want IO during a pin request.
1429     FIXED
1430
1431  FIXME review use of PhysValid FIXED
1432
1433  lafs_orphan_abort - what if lafs_orphan_pin not called?
1434    or if 'b' is NULL.  FIXED
1435
1436  Do I need to clear PinPending when retrying??
1437    Well, we need to be phase-locked when we set PinPending, so
1438     it must be Pinned to the current phase.
1439     So when we unpin a datablock, we must clear PinPending.
1440   FIXED we now clear PinPending in do_checkpoint.
1441
1442  Does phase_wait do the right thing when pinning an inoidx block
1443    for an inode? FIXED
1444
1445
1446 Pending
1447   Need to understand and document the lifetime of a page with datablocks.
1448     who holds what refcount, and when can it be freed?
1449    Then fix up locking in lafs_refile, __putref.
1450
1451  FIXME who keeps what refcount on orphan blocks/inodes??
1452  FIXME should dirty/pinned/etc hold a refcount?  they don't.
1453
1454
1455 Later:
1456  FIXME make sure a failed (-EAGAIN) pinning triggers a checkpoint (eventually)
1457
1458  FIXME make sure empty files have depth of 1.
1459
1460  FIXME Truncate proceeds lazily. All data blocks need to be gone
1461
1462 26aug2008
1463  If I call lafs_erase_dblock while a write is underway, we have a problem.
1464   We need to wait potentially for a checkpoint to let go of the block and
1465    a write to complete.
1466     This should be done with waiting for PG_writeback on the page to disappear.
1467   Check this out.
1468
1469   When end_page_writeback is called, we must have dropped all references to the
1470    page.
1471   When we commit to writing a block, we have to set PG_writeback on the page
1472    so that truncate et al can wait for it.  Before we have committed, truncate
1473    can just remove the page.  Internally we differentiate by B_Alloc.
1474   So before setting B_Allocated we need to test_set_page_writeback(page).
1475   Be careful of races.
1476   I don't think we can ensure all references are dropped.  After all, that is
1477   the point of refcounts.  So dblock array must exist without page!
1478   But we need to ensure that we don't start a writeout after truncate
1479   has done wait_on_page_writeback.
1480   This is done with the page locked so when we want to write a page
1481     in a checkpoint, we need to lock the page first.  Once we have the lock,
1482     we check if the page is still dirty.  If it has been truncated it
1483     will be clean.
1484    But how do we safely reference the page if b->page can be cleared?
1485     How about:
1486       When we clear PagePrivate, we take a counted reference to the page
1487       for db->page.  This is dropped when the page is freed by lafs_refile.
1488       But while it is held, it is still safe for db->page to be dereferenced.
1489     So before we commence writeout we have to lock the page and set
1490      PG_writeback.  After locking, we need to test if writeback is still
1491      appropriate.
1492
1493   Maybe not.  I think we can submit blocks for writeout without setting the
1494   page to writeback.  If we do, then we need to be sure those writes
1495   finish before invalidatepage calls releasepage (block_invalidatepage
1496   calls discard_buffer which calls lock_buffer which waits).
1497   In our case invalidatepage needs to make sure that no new write commences.
1498   Maybe we should lafs_iolock_block before we allocate to a cluster and check
1499   again if the block is dirty.
1500
1501   So:
1502     lafs_cluster_allocate does:
1503        lafs_iolock_block
1504        check if still dirty.  If not, unlock and return
1505        set allocate flag
1506        allocate and write
1507        when write completes, allocate is cleared.
1508                     unlock block
1509
1510     invalidatepage does
1511        lafs_iolock_block
1512        clear Valid,Dirty,Realloc
1513        lafs_iounlock_block
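
  The two sequences above can be modelled in user space to check the
  ordering.  This is only a sketch under assumptions: 'struct blk' and the
  helper names are made up for illustration and are not the real LaFS
  structures, and the io lock is a plain flag, so where the kernel would
  sleep waiting for the lock this model simply asserts.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical user-space stand-in for a LaFS block; not the real struct. */
struct blk {
	bool iolocked;		/* models lafs_iolock_block/lafs_iounlock_block */
	bool valid, dirty, realloc;
	bool allocated;		/* the "allocate" flag: a write is in flight */
};

static void iolock(struct blk *b)   { assert(!b->iolocked); b->iolocked = true; }
static void iounlock(struct blk *b) { assert(b->iolocked);  b->iolocked = false; }

/* Returns true if a write was started.  The block then stays io-locked
 * until write_done() runs, so invalidate() cannot slip in between. */
static bool cluster_allocate(struct blk *b)
{
	iolock(b);
	if (!b->dirty) {	/* e.g. truncated while we waited for the lock */
		iounlock(b);
		return false;
	}
	b->allocated = true;
	return true;
}

/* Write completion: clear the allocate flag, then unlock the block. */
static void write_done(struct blk *b)
{
	b->allocated = false;
	iounlock(b);
}

/* invalidatepage side of the protocol. */
static void invalidate(struct blk *b)
{
	iolock(b);
	b->valid = b->dirty = b->realloc = false;
	iounlock(b);
}
```

  The point of the model: once invalidate() has run, a later
  cluster_allocate() sees a clean block and backs out without starting a
  write.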
1514
1515
1516
1517 2008 aug 28 - happy birthday.
1518 FIXME segsum_find calls lafs_reserve_block without a checkpoint lock.
1519 lafs_prealloc complains.
1520
1521   mark_cleaning does too, but cleaning only happens well away from a checkpoint
1522   lock.
1523 segsum_find is being called to reference a new segment when we flush a cluster.
1524  segment usage blocks are special.  Their index information doesn't
1525 need to be written out in the current checkpoint.  We can do that, but
1526 the backstop is to write just the data block in the tail of the
1527 checkpoint and write indexing information later.
1528
1529 2008sep10
1530  unlink is getting "No space left on device".  This is when trying to
1531  pin the directory block, the physaddr is 0, so it looks like we want
1532  NewSpace.  But we shouldn't even be trying to prealloc in that case because
1533  there should already be a prealloc on the block.  i.e. there should be
1534  credits.
1535  Hmmm. after multiple 'syncs' how can the block not be written out.
1536  Maybe it is embedded in the inode?
1537  When we pin a block that was embedded in the inode it isn't clear what to
1538  do.  If we might grow the file so it doesn't fit any more, we need to
1539  allocate NewSpace.  If we know it won't grow, we use Release.
1540   This still needs a proper fix.
1541
1542  Cleaning seems to be working nicely.  However we don't get all the space
1543  back that we should because lots of blocks still have credits that
1544  aren't being returned.
1545
1546  So when should credits be returned?
1547  They are set when a block is pinned.  It then gets dirtied which
1548  consumes a credit.  Then gets unpinned.  I guess if it isn't pinned,
1549  then it doesn't need any credits.
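
  A toy model of that credit lifecycle, to make the intended rule
  concrete.  All names here (struct pool, struct cblk, pin_block, ...) are
  hypothetical, and the real accounting has several credit types; this
  sketch only shows "pin reserves, dirty consumes one, unpin returns the
  rest".

```c
#include <assert.h>
#include <stdbool.h>

struct pool { int free_credits; };		/* filesystem-wide spare credits */
struct cblk { bool pinned, dirty; int credits; };	/* per-block state */

/* Pinning reserves 'want' credits; failure models an -EAGAIN result. */
static bool pin_block(struct pool *p, struct cblk *b, int want)
{
	if (b->pinned)
		return true;
	if (p->free_credits < want)
		return false;
	p->free_credits -= want;
	b->credits = want;
	b->pinned = true;
	return true;
}

/* The first dirtying of a pinned block consumes one of its credits. */
static void dirty_block(struct cblk *b)
{
	if (b->dirty)
		return;
	assert(b->pinned && b->credits > 0);
	b->credits--;
	b->dirty = true;
}

/* An unpinned block needs no credits, so hand any leftovers back. */
static void unpin_block(struct pool *p, struct cblk *b)
{
	p->free_credits += b->credits;
	b->credits = 0;
	b->pinned = false;
}
```

  If unpin forgets to return the leftovers, the pool shrinks a little on
  every pin/unpin cycle, which is the kind of leak being chased here.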
1550
1551
1552  It seems that cluster_flush is not always writing things in the correct
1553   order.  Root gets written before some other things below it.
1554    Maybe they are temporarily out of the loop??
1555  No.  There are dirty blocks which one checkpoint doesn't pick up, but
1556   they aren't holding the index block pinned. so they lose allocation.
1557
1558  But they must hold the indexblock pinned, even though they aren't pinned
1559  themselves.  We maybe do this just with the refcnt... maybe.  That will cause
1560  it to phase-flip rather than drop pinning, which I think is right.
1561
1562  So: too many credits remain allocated.  Where are they?  There are 1464
1563    outstanding credits. 290 are in the tree so 1200 or so are elsewhere??
1564    But things removed from the tree have credits removed.
1565
1566
1567
1568 FIXME roll forward ignores inodes.  But what about an inode that contains
1569    data.  Should that be ignored?  I think not.
1570 FIXME delete adir/big2 then delete adir and it cannot release:
1571   Cannot release [cee29000]74/0(0)r1:Pinned,Phase0,Valid,Dirty,SegRef,UninCredit,PhysValid,Prealloc
1572  presumably there is orphan processing or something to complete???
1573 FIXME when files are deleted, the space isn't returned!
1574    This seems to be mostly fixed - need to test.
1575 FIXME when I "rm [b-z]*" it waits for writeback on something???
1576    zfile again!!!  OK, I think that is fixed.
1577
1578
1579 12sep2008
1580   Current problem:
1581     seg_apply_all dirties dblocks.  When should they be reserved?
1582     They originally get reserved by a lafs_reserve_block call in
1583     segsum_find called from e.g. lafs_seg_ref which is called by lafs_reserve_block.
1584     However: that block might get written before *and* after a checkpoint.
1585     So we need N* Credits.  These are usually only used for Index blocks.
1586     We can set these easily enough if inode type is TypeSegmentMap.
1587     We move them across to Credit in seg_apply_all.
1588     But when do we clear them if they aren't needed?  I guess
1589      when we drop the last segref.  Yes, we already do that.
1590     FIXME need to make sure these get flushed on next checkpoint
1591      if we cannot allocate new credits after a checkpoint.
1592
1593   New Problem.  The 'cleanable' table reports a size of 3, but it is empty!
1594     Think that is fixed.
1595
1596   Some problems.
1597     1/ see above:  rm x/y; rmdir x -> BUG - FIXED
1598     2/ Spins on 'CURRENT=1' ??
1599     3/ if alloc_space gives EAGAIN while deleting, we don't survive.
1600     4/ When I create/delete a file, ablocks_used increments by one.
1601         The inode hasn't been allocated yet, so it seems the deallocation
1602          isn't adjusting ablocks_used??
1603     5/ open_namei (for dd) got caught on a mutex_lock.
1604     6/ When a large file is shrunk we don't reduce the level of the InoIdx block
1605        I'm not sure where we should and am not thinking very clearly.
1606        Will fudge something in flush_data_to_inode for now, but it MUST be fixed.
1607     7/ unlink (at least) can get stuck in iolock_block.  Who could be holding
1608        the lock?  Writeout that hasn't completed?
1609        Yes.  writepage calls lafs_allocated_block without calling flush.
1610        So the block could be sitting waiting for a flush.  How long do we
1611        wait??
1612     8/ It seems that some datablock can need NCredits.  Make sure these
1613        are handled properly re flush-or-refill after checkpoint and
1614        flip_phase rather than unpin.
1615     9/ Maybe after lafs_writepage cluster_flush isn't getting called soon
1616        enough, and we lock up (see 7).  Need to flush the first block
1617        straight away, and the next one as soon as the first finishes, etc.
1618        Or something like that.  Then remove the comment from lafs_writepage.
1619
1620 8th December 2008
1621
1622   I seem to be getting only 4 blocks to a cluster at the moment.
1623    This is good as it motivates the code to handle block splitting in
1624    the Btree.   But it shouldn't happen.
1625
1626   ....
1627   Block splitting might work - it doesn't crash at least.
1628   But
1629   After deleting all files, the tree is full of stuff.
1630   Lots of inode data/InoIdx blocks.
1631   Many but not all are Pinned.  The others are OnFree
1632   The Pinned ones have outstanding references.
1633   Others
1634
1635   ....
1636   Problem with the block splitting, when adding an index block.
1637   The index block is initially empty - we need to find things by looking
1638   at children.  But we don't.  We BUG_ON the iphys==0.
1639   In general, when we add a block below an index block and before we incorporate,
1640   the block must be found by finding the first indexed block and looking to
1641   see if there is a 'next' block that contains the address we need.
1642   FIXED
1643
1644   But if we truncate a file while an index block is pinned and dirty,
1645   we spin on trying to incorporate it, which should make it empty.
1646
1647 11th December 2008
1648   deadlock.
1649   sync is trying to get lock in lafs_cluster_flush
1650   pdflush holds the lock and is stuck in cluster_flush+0xa40
1651     some wait_event I expect.
1652     Maybe we need an unplug ??
1653
1654  - checkpoint/seg_apply_all/dirty_dblock doesn't have the credits.
1655    This is in clean_free.  We try to update the 'youth' to mark
1656    the segment as free, and we don't have a reservation to do it.
1657    Maybe just reserve it there and then.
1658
1659
1660 12th December 2008
1661   When doing a lookup in an index block, we need to check the unincorp
1662   address list.  It isn't enough to look for unincorp blocks as they
1663   might have disappeared.
1664   For INDIRECT and EXTENT this is easy enough as full information is in
1665   'uninc'.
1666   For INDEX it is a little tricky as we need to look at the full set of
1667    addresses to know where a particular address fits.
1668    We could force an incorporate first, but that has awkward implications
1669     if it requires a split.
1670    Maybe if we get from the lookup "start+range"....
1671      That is not enough as the 'start' might get zeroed by an update.
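
   For the easy (INDIRECT/EXTENT) case the lookup order can be sketched
   as below.  This is an illustration only: 'struct extent', the array
   representation, and the rule that a physaddr of 0 means "no block" are
   assumptions of the sketch, not the on-disk or in-memory LaFS format.
   Unincorporated addresses are newer, so they are consulted first.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical extent: 'len' file blocks starting at 'fileaddr'. */
struct extent { uint32_t fileaddr; uint64_t physaddr; uint32_t len; };

/* Search newest-first; returns 1 and sets *phys if 'addr' is covered. */
static int ext_lookup(const struct extent *e, size_t n,
		      uint32_t addr, uint64_t *phys)
{
	for (size_t i = n; i-- > 0; ) {
		if (addr >= e[i].fileaddr && addr - e[i].fileaddr < e[i].len) {
			/* physaddr 0 is taken to mean "block removed" */
			*phys = e[i].physaddr
				? e[i].physaddr + (addr - e[i].fileaddr) : 0;
			return 1;
		}
	}
	return 0;
}

/* Leaf lookup: the uninc list overrides what the index block says. */
static uint64_t leaf_lookup(const struct extent *incorporated, size_t ni,
			    const struct extent *uninc, size_t nu,
			    uint32_t addr)
{
	uint64_t phys = 0;

	if (ext_lookup(uninc, nu, addr, &phys))
		return phys;
	if (ext_lookup(incorporated, ni, addr, &phys))
		return phys;
	return 0;	/* not mapped */
}
```

   The INDEX case does not fit this shape, for the reason given above: a
   single entry does not say where an address belongs without looking at
   the whole set.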
1672
1673
1674    rm adir/* doesn't work as readdir doesn't get all the entries
1675     for some reason.
1676    Reason is that they are being put in the wrong block.
1677    lafs_find_next doesn't correctly find the 'next' block if it
1678    hasn't been incorporated yet.
1679    Block can be:
1680      in index tree -- easy to find
1681      in uninc_table -- not too hard
1682      only in the ->children list, or attached to a page.
1683    It would be nice to use find_get_pages but that isn't exported so try
1684     something else for now.
1685    For index blocks
1686         Look in index block for 'next
1687
1688 15th December 2008
1689    FIXME when we split an index block, we need to hold a reference to
1690    the original so it doesn't disappear until the split-off copy is
1691    written.  This is because we search from an index block to find
1692    split-off copies.
1693    [ note from Feb09.  This should be OK now. Both will need
1694    incorporation, and we now hold on to blocks until they are
1695    incorporated.]
1696
1697
1698
1699 23rd February 2009
1700   - index block.  What changes are allowed exactly.
1701      - splitting certainly makes sense.
1702      - merging two adjacent blocks is fine, of which a special case
1703        is finding that a block is empty and so removing it.
1704      - What about a 2->3 split which would require removing a block
1705         and adding another at the same time?
1706        or noticing that the first blocks addressed are all missing, so
1707        moving the index forward?
1708        In each case, searching down by indexes will find a block that
1709        has been replaced by a later address.  We could manage that as
1710        long as the new block is attached after the replaced block.
1711        So we cannot move a block.  We must delete and replace.
1712
1713   - unincorporated index blocks..
1714     unincorporated data blocks are not pinned in memory.  Once they have
1715     been written out, they can be freed.  Their address is stored in the
1716     uninc-table.  This means we can delay incorporation while many
1717     extents are written out and freed.  When we come to incorporate, we
1718     may have many hundreds of addresses in a few extents that can be incorporated
1719     efficiently without holding all that data pinned in memory.
1720     The same scale doesn't apply to index blocks.  An index block can
1721     reference only 102 blocks (for 1K block size).  And the uninc table can
1722     hold far fewer so we will naturally incorporate more often.
1723     So keeping index/indirect/extent blocks pinned until they are incorporated
1724     is reasonable.  And it makes lookup a lot easier, as we have
1725     guarantees about ordering of block in the children list that we
1726     don't have in the uninc table.
1727
1728     Incorporation could have some atomicity issues.  There is no
1729     concern about bad stuff appearing on disk as the phase-change
1730     process handles that.  In memory it might be awkward if we split
1731     an index block before incorporating a block that would span them.
1732     That could conceivably happen if we only incorporate 8 blocks
1733     (size of uninc table) at a time.
1734     So maybe we should incorporate a full uninc list (not table) at
1735     a time.
1736     This means quite different code paths for incorporating leaf
1737     and internal index blocks....
1738
1739
1740   - uninc_table lists are a real problem.
1741     They can only be created during roll-forward so they hardly ever
1742     happen.
1743     But if the block is split while processing earlier things on the
1744     list, then splitting an uninc table would be very messy.
1745     Is there any way around this?
1746     Why not just do incorporation during roll-forward?
1747     We only need to incorporate leafs, not internal blocks because we
1748     don't use uninc_table for internal blocks any more.
1749     So during roll forward, all index blocks that are touched need to
1750     be held in cache...
1751     I think we live with that.  If it ever becomes a problem, we will
1752     need to perform the roll-forward twice.  The first time collects
1753     the usage information so that we know where we can start writing,
1754     then the second just applies all the changes to the rest of the
1755     filesystem.
1756
1757
1758    So:
1759      uninc table only used for leaves, and has no linked list
1760      unincorporated index blocks are stored on a list, which we
1761      sort before applying.
1762      All uninc index blocks are therefore kept in the index tree.
1763      Their order on the children list allows us to find the correct
1764      index. Each block for which the fileaddr is in the parent is
1765      followed by any blocks that have been split off and end after
1766      this one starts.  Blocks that have been emptied are Hole and are
1767      skipped over when looking for a block.
1768
1769      When we split an internal block, the remaining uninc blocks
1770      must not start with a Hole.
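
     A sketch of the search this ordering is meant to allow.  It assumes
     the ->children list can be viewed as an array sorted by starting
     file address, with a flag for Hole; 'struct child' and find_child
     are invented names, not the real code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical view of one entry on the ->children list. */
struct child {
	uint32_t fileaddr;	/* first file address this block covers */
	bool hole;		/* emptied block: skipped during lookup */
};

/* Return the index of the last non-Hole child that starts at or before
 * 'addr', or -1 if there is none.  Split-off blocks follow the block
 * they were split from, so a simple forward walk finds them. */
static int find_child(const struct child *c, int n, uint32_t addr)
{
	int best = -1;

	for (int i = 0; i < n; i++) {
		if (c[i].fileaddr > addr)
			break;
		if (!c[i].hole)
			best = i;
	}
	return best;
}
```

     A result of -1 corresponds to the case the rule above forbids after
     a split: the list starting with a Hole and nothing usable before
     the wanted address.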
1771
1772    FIXME: what locking do I need around lafs_incorporate?
1773       i_mutex?? i_alloc_sem??
1774       i_alloc_sem is imposed by truncate (inode_setattr) and
1775          direct_io possibly.  So it is really about adding/removing
1776          blocks.  Not updating internals.
1777          Maybe our own mutex.  Could even be per-index-block !!
1778       Whatever it is, we need to protect walking ->children too.
1779
1780
1781 24th February 2009
1782   "rm -r" problem from 12/dec/2008 fixed now.
1783   incorporate code got a make-over and is probably much better.
1784
1785   New problems:  After test runs, cannot create files due to no space
1786      on devices!!  But directory tree is empty.
1787   I can see:
1788
1789     free_blocks=3256 allocated=1425 max_seg=512 clean_reserved=0
1790
1791   The problem is that we think 1425 has been allocated to data that
1792   might still need to be written, leaving not enough room for more.
1793   Index Dump shows
1794   ====================414 credits ==============================
1795   which doesn't explain everything, but does explain a lot.  There
1796   really should be nothing in the Index tree (except fs-root and
1797   tree-root)
1798   There is also:
1799   Some inodes which are OnFree and hold no credits.
1800     0 DATA (1)  52 [0]ESegRef,Claimed,PhysValid
1801     52    1 (0)   0 [2564]{0,00000000}L on free Index(1),InoIdx,OnFree,PhysValid
1802
1803   Some other inodes which are pinned with lots of credits and are
1804     on the phase_leaf list
1805     0 DATA (1) 299 [0]ESegRef,C,CI,Claimed,PhysValid
1806    299    1 (1)   0 [0]{0,00000000} [0, 0]L Leaf1(40) Index(1),Pinned,Phase1,InoIdx,Valid,SegRef,C,CI,CN,CNI,PhysValid,Prealloc
1807
1808   And that is about it.  some are not Valid, some are...
1809   checkpoint just wants to 'flip' them.
1810   They mostly have a refcnt of 1... I wonder who is holding that....
1811   The reference on the dblock is held by the iblock.
1812   But what is the iblock remaining?  Who holds that reference?
1813
1814   I restored some code to clean iblock, and now:
1815   free_blocks=3229 allocated=1277 max_seg=512 clean_reserved=0
1816   ====================244 credits ==============================
1817   which saved 130 credits.  That helps.
1818   There seem to be many fewer of the many-credits blocks
1819   Lot of index blocks in tree are 'OnFree' and have a
1820   0 refcnt, but haven't been removed.  Why?
1821   It seems that they have ->parent == NULL, so lafs_refile never
1822   bothers to remove them.  I guess it should...
1823   OK, lots of InoIdx blocks have gone now with their DATA blocks.
1824
1825   So, remaining blocks are pinned to their phase with lots of Credits,
1826     have no pincnt, mostly have physaddr==0.
1827    It is just the stray refcnt that keeps them there..
1828    inums are 40, 56, 62-73, 275-278, 280
1829     40 is f22
1830     56 is first adir
1831     63-69 are directories 2/3/4/5/6/7/8/9
1832     70-73 are looooong symlinks
1833     275 is cfile
1834     276 is dfile - same as cfile but truncated.
1835       Then some nbfile-X that were big enough.
1836
1837    So: what do they have in common:
1838      Several only use the in-inode data block, but
1839        probably not all
1840
1841     Can it be that it is refcounted on the Leaf list, and so
1842     cannot get off??  Yes, I think so!
1843     We only unpin things that have a zero refcount.
1844
1845     So: what to do?
1846       checkpoint takes it off the list, then flips the phase and puts it
1847       on the other list with refile.  During that time it has a refcount, so
1848       it doesn't lose the pinning.
1849       Do we want to:
1850         1/ Not have it on the list despite being pinned.
1851         2/ Drop the PIN despite the refcnt.
1852         3/ have refile do the phase_flip so it has a chance to
1853            notice the refcount has hit zero.
1854
1855       2 isn't really an option.  We need PIN to persist whenever we have
1856        a reference.  We could possibly use PinPending for index blocks too,
1857        but that would require a lot of thinking.
1858       1 requires another criterion for being on the list.  I suspect that would
1859        get messy fast.
1860       3 we used to do I think... But refile is in a big lock, and we
1861         cannot really do a phase_flip under that.. and phase flip calls
1862          refile anyway so we would get recursion.
1863       So:4 - get lafs_phase_flip to notice and de-pin rather than flip.
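
       Option 4 might look roughly like this.  It is a sketch under
       assumptions: 'struct iblk' is a made-up stand-in, the leaf list is
       reduced to a flag, and the checkpoint is assumed to call in holding
       exactly the one reference it took over from the list.  A real
       version would have more conditions to check before dropping a pin.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a pinned index block. */
struct iblk {
	int refcnt;		/* includes the reference owned by the list */
	bool pinned;
	int phase;		/* 0 or 1 */
	bool on_leaf_list;
};

/* Called by the checkpoint after taking the block off its phase list,
 * holding the reference that the list used to own.
 * Returns true if the block was flipped, false if it was de-pinned. */
static bool phase_flip(struct iblk *b)
{
	assert(b->pinned && b->refcnt >= 1 && !b->on_leaf_list);

	if (b->refcnt == 1) {
		/* Nobody else cares: drop the pin instead of flipping. */
		b->pinned = false;
		b->refcnt = 0;
		return false;
	}
	b->phase = !b->phase;
	b->on_leaf_list = true;	/* the other list now owns our reference */
	return true;
}
```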
1864
1865       FIXME use kzalloc where appropriate.
1866
1867       FIXME Maybe test refcnt-!listempty in refile and de-pin if that is zero.
1868
1869 25th February 2009
1870   Good progress.
1871   Only 54 credits in Index Tree now.
1872   Inodes 1 2 8 16 are present. (fsroot, dirroot, inodemap, segusage)
1873   plus '74', which seems to be scheduled for deletion - root has uninc_table.
1874    ... and 'sync' got rid of that and left 44 credits.
1875   Also have data blocks for inode 50 55 72 73 74 with 2 credits of 74.
1876     50  link
1877     55  zfile
1878     72  long84
1879     73  long85
1880     74  adir
1881   These seem to be the files that used data-in-the-inode
1882   They still have a refcnt of 1 (or 2 for adir).
1883   ... OK, that's gone now.  I found a refcount leak.
1884
1885   So now:  42 Credits in Index Dump.   No stray files.
1886
1887   df: tot=4608 free=4597 avail=3045(4130-1085) cb=8 pb=0 ab=3
1888   So we still seem to have 1085 blocks allocated.  42 are accounted
1889   for, so 1043 still missing... either we lost the count, or lost the tree.
1890
1891   create a tiny file, remove, and sync, now
1892   df: tot=4608 free=4597 avail=3018(4118-1100) cb=8 pb=0 ab=3
1893
1894   so I lost 15, but now 48 are in tree.  Let's try again...
1895   df: tot=4608 free=4597 avail=3006(4108-1102) cb=8 pb=0 ab=3
1896   and 44 in tree
1897   and again:
1898   df: tot=4608 free=4597 avail=2986(4098-1112) cb=8 pb=0 ab=3
1899
1900   Definitely losing more than the difference in the tree.
1901
1902   Try creating empty files...
1903 df: tot=4608 free=4597 avail=2986(4098-1112) cb=8 pb=0 ab=3
1904 df: tot=4608 free=4597 avail=2974(4088-1114) cb=8 pb=0 ab=3
1905 df: tot=4608 free=4597 avail=2954(4078-1124) cb=8 pb=0 ab=3
1906 df: tot=4608 free=4597 avail=2942(4068-1126) cb=8 pb=0 ab=3
1907 df: tot=4608 free=4597 avail=2922(4058-1136) cb=8 pb=0 ab=3
1908 df: tot=4608 free=4597 avail=2910(4048-1138) cb=8 pb=0 ab=3
1909 df: tot=4608 free=4597 avail=2890(4038-1148) cb=8 pb=0 ab=3
1910
1911  very strong pattern there.
1912  What about 2 files at a time.
1913 df: tot=4608 free=4597 avail=2879(4028-1149) cb=8 pb=0 ab=3
1914 df: tot=4608 free=4597 avail=2860(4018-1158) cb=8 pb=0 ab=3
1915 df: tot=4608 free=4597 avail=2849(4008-1159) cb=8 pb=0 ab=3
1916 df: tot=4608 free=4597 avail=2830(3998-1168) cb=8 pb=0 ab=3
1917 df: tot=4608 free=4597 avail=2819(3988-1169) cb=8 pb=0 ab=3
1918
1919   Slightly different pattern - not as bad.
1920   Have to try 4 now.
1921 df: tot=4608 free=4597 avail=2802(3978-1176) cb=8 pb=0 ab=3
1922 df: tot=4608 free=4597 avail=2793(3968-1175) cb=8 pb=0 ab=3
1923 df: tot=4608 free=4597 avail=2776(3958-1182) cb=8 pb=0 ab=3
1924 df: tot=4608 free=4597 avail=2767(3948-1181) cb=8 pb=0 ab=3
1925
1926   Strange, isn't it....
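
  Comparing these lines by eye is error-prone, so a small user-space
  helper to pull the numbers out of a "df:" line may be worth having.
  This is a sketch: the field names are guesses (in particular 'base' for
  the first number inside the parentheses, whose meaning is not spelled
  out in these notes); only 'allocated' matches how the second number is
  used above.

```c
#include <assert.h>
#include <stdio.h>

/* Fields of one "df:" debug line.  Names are assumptions of this sketch. */
struct df {
	int tot, free_blocks, avail;
	int base;		/* first number in parentheses */
	int allocated;		/* second number in parentheses */
	int cb, pb, ab;
};

/* Parse one line; returns 1 on success.  Also checks that the printed
 * 'avail' really equals base - allocated. */
static int parse_df(const char *line, struct df *d)
{
	int n = sscanf(line,
		       " df: tot=%d free=%d avail=%d(%d-%d) cb=%d pb=%d ab=%d",
		       &d->tot, &d->free_blocks, &d->avail,
		       &d->base, &d->allocated, &d->cb, &d->pb, &d->ab);

	return n == 8 && d->avail == d->base - d->allocated;
}

/* Allocated credits that the Index Dump does not account for. */
static int unaccounted(const struct df *d, int tree_credits)
{
	return d->allocated - tree_credits;
}

/* Growth in 'allocated' between two consecutive samples. */
static int alloc_delta(const struct df *before, const struct df *after)
{
	return after->allocated - before->allocated;
}
```

  Fed the first line of the day together with the 42 credits seen in the
  Index Dump, unaccounted() gives the 1043 noted earlier; alloc_delta()
  over the single-file runs shows the alternating +2/+10 steps.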
1927
1928   Making sure we clear UnincCredit... result looks worse.
1929
1930 26th February 2009
1931   I fixed up the credit accounting in 'incorporate' and then fixed a couple
1932   more little bugs.  And now:
1933
1934
1935
1936 ====================48 credits ==============================
1937 df: tot=4608 free=4597 avail=3172(3940-768) cb=10 pb=0 ab=1
1938
1939 So we still have 720 allocated credits that aren't accounted for.
1940 But we are nicely under 100...
1941
1942 .... and now
1943
1944
1945 ====================76 credits ==============================
1946 df: tot=4608 free=4256 avail=2160(2402-242) cb=350 pb=0 ab=2
1947
1948 That is different.  The count of missing blocks is way down,
1949 but there is some extra cruft in the index tree.
1950 Quite a few like
1951     0 DATA (1) 303 [0]L Leaf1(13) SegRef,Claimed,PhysValid
1952     0 DATA (1) 302 [0]L Leaf1(14) SegRef,Claimed,PhysValid
1953 and even one
1954     0 DATA (2) 330 [0]L Leaf1(1) SegRef,C,CI,Claimed,PhysValid
1955    330    1 (1)   0 [0]{0,00000000} [0, 0]L Leaf1(0) Index(1),Pinned,Phase1,InoIdx,SegRef,C,CI,CN,CNI,PhysValid,Prealloc
1956 Time for a commit though....
1957
1958 and now
1959 ====================46 credits ==============================
1960 df: tot=4608 free=4257 avail=4253(4458-205) cb=350 pb=0 ab=1
1961
1962 so the strays in The index tree are gone. but still have 159 outstanding
1963 credits.