README

   1
   2 So, let's try to write a kernel module that implements this filesystem.
   3 It would be good to have a plan.
   4
   5 - Mount filesystem, providing empty root directory
   6    o parse mount options - DONE
   7    o find/load superblocks and stateblocks - DONE
   8    o present empty directory - DONE
   9    o Compile external module - DONE
  10    o test - DONE
  11
  12 - Mount filesystem read-only with no roll-forward
  13    o IO address mapping
  14           sync_page_io or bread? - not bread I think
  15    o Index blocks management
  16    o search cluster-header for root inode
  17    o file read
  18    o Directory lookup/read
  19    o test
  20
  21 - Support roll-forward for blocks, orphans, whatever
  22    o manage segusage files
  23    o manage quota files
  24
  25 - Support writing
  26    o inode bitmap
  27    o cluster creation / block sorting
  28
  29 - Support Cleaning
  30
  31 - Interface for snapshots and other admin
  32
  33
  34
  35 ------------------------
  36 FIXME
  37  If a device is removed from the filesystem, we cannot reliably
  38  tell from the other devices or state that this is so.
  39  Maybe we need to update all devblocks with a new 'seq' number...
  40 FIXME
  41  How do we specify mounting subordinate filesets?
  42  What superblock do they have?
  43  I suspect we do a -F lafs-sub mount from the original filesystem.
  44
  45 FIXME
  46  If mount fails, we seem to be leaving a super lying around,
  47  and sync_supers dies on it. - DONE
  48
  49 FIXME
  50   Umount appears to work, but a sync_supers dies. - DONE
  51
  52 FIXME
  53   subordinate supers aren't being locked as much - is that a problem?
  54
  55 FIXME
  56   index pages never get put on an LRU - how is this supposed to work?
  57
  58
  59
  60
  61
  62
  63
  64 --------------------------
  65 Thoughts:
  66   Inodes live in an address-space, much like a file.  To load the
  67   first inode, we need an address-space, so may as well have an
  68   'struct inode' as we may want to expose it to user-space.
  69
  70  Loading an inode, need
  71    fs (lafs filesystem structure)
  72    which subfs (maybe a lafs inode)
  73    which snapshot - this is implied by the subfs inode.
  74    and fs can be obtained from inode, so just inode, inum
  75
  76
  77
  78 UPTO
  79 03nov2005
  80   review block_leaf_find and make_iblock
  81   need to do setparent and block_adopt next
  82
  83 10nov2005
  84   need to resolve locking for ->siblings list
  85
  86 24nov2005
  87   peer_find
  88   lock_phase
  89   lafs_refile
  90
  91   I can read a file.....!!!!!
  92   Code review / tidy up.
  93      resolve locking buffer vs page
  94
  95   Export on a web page somewhere??
  96
  97 16feb2006
  98  (I spent a while getting large-directories to work again in prototype..
  99   and some holidays).
 100  - Priority: clean mount and unmount
 101  - large directories
 102  - multiple devices.
 103
 104   FIXME how do we record and handle write errors???
 105
 106  The iput in lafs_release - which is needed - is oopsing
 107   at iput+0xe!
 108
 109 23feb2006
 110  Ok, I finally have a clean mount/unmount.
 111  .. not quite.  blocks being freed at unmount still have a refcnt, which is bad.
 112
 113  Next:
 114   - make sure we can handle 'large' directories.
 115   - make sure we can handle files with indexes
 116   - handle filesystems that span devices.
 117
 118 02mar2006
 119   Hurray - clean unmounts!!!
 120   There is a nasty circular reference of the root inode which is stored in
 121   a block that it manages.  Maybe this should not happen, rather than having to be
 122   explicitly broken - the root-block can live elsewhere, not in the inode.
 123
 124   Next multi-level index blocks.
 125
 126   But first, need to understand memory pressure and pageout.
 127    How are dirty pages found to be cleaned?
 128    How is pressure put on a filesystem to clean up?
 129    How are clean pages reaped?
 130
 131   - call pagevec_lru_add{,_active}(pvec)  to put the page on an LRU
 132       lru_cache_add{,_active}(page) might be easier, but isn't exported.
 133   - call mark_page_accessed(page)  to keep the page 'active'.
 134
 135 09mar2006
 136   - make sure indexes work...
 137
 138
 139  lafs_load_block+0xf
 140   eax,bx,cx,dx,s1 all zero
 141 from block_leaf_find 203
 142
 143   ... OK, indexes seem to work.
 144    But 'lafs' has problems creating some large files.
 145    Try 'tt'
 146
 147    This is due to not handling errors properly.. fix it later FIXME
 148
 149 16mar2006
 150
 151   Must make sure the index address-space gets cleared up... I wonder
 152   how we find all the pages to free.  This might be one reason to keep them
 153   in a radix tree.  Though we should be able to walk our own data structures.
 154
 155
 156   Then work on mounting a 2-device filesystem.
 157
 158
 159   FIXME dir_next_ent always starts from the beginning rather than
 160    remembering where it is up to... can this be fixed??
 161
 162
 163 18mar2006 (Wedding anniversary, and Saturday ... during commonwealth games)
 164
 165   Mounting snapshot needs a way to identify that it is a snapshotmount
 166   and which snapshot, and which filesystem.
 167   We could use a different filesystem type, but that isn't really needed
 168
 169     mount -t lafs -o snapshot=name /original/mount/point /new
 170
 171   This grabs the named snapshot of /original/mount/point and places it at
 172       /new
 173   The 'snapshot=' option is the trigger.
 174
 175   For a control FS, we
 176         mount -t lafs -o control /original/mount/point /new
 177
 178   To grow a filesystem, we initialise a device (super/state blocks) and
 179         mount -t lafs -o remount,new=/dev/name whatever /original/mount/point
 180
 181   as the dev_name isn't passed to remount
 182
 183   So, mount options are:
 184         snapshot=name
 185         dev=/dev/device
 186         new=/dev/device
 187         control
 188     and various
 189           name=value
 190     pairs matching what is exposed in the control filesystem
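
A rough userspace sketch of splitting such an option string, for illustration only.
The struct lafs_opts layout, the buffer sizes and the function names are all
assumptions made up for this example; the module's real parser is not shown in
these notes.  Unrecognised name=value pairs are only counted here.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical container for parsed options; not the module's real struct. */
struct lafs_opts {
    char snapshot[64];   /* snapshot=name   */
    char dev[64];        /* dev=/dev/device */
    char newdev[64];     /* new=/dev/device */
    int  control;        /* bare "control" flag */
    int  unknown;        /* other name=value pairs, left for the control fs */
};

/* If tok is "key=value", copy value into dst (truncating) and return 1. */
static int take_value(const char *tok, const char *key, char *dst, size_t len)
{
    size_t klen = strlen(key);

    if (strncmp(tok, key, klen) != 0 || tok[klen] != '=')
        return 0;
    snprintf(dst, len, "%s", tok + klen + 1);
    return 1;
}

/* Parse a comma-separated option string.  Returns 0, or -1 if it is too long. */
int lafs_parse_opts(const char *data, struct lafs_opts *o)
{
    char buf[256];
    char *tok;

    assert(data != NULL && o != NULL);
    memset(o, 0, sizeof(*o));
    if (strlen(data) >= sizeof(buf))
        return -1;
    strcpy(buf, data);

    for (tok = buf; tok != NULL; ) {
        char *comma = strchr(tok, ',');

        if (comma)
            *comma = '\0';
        if (strcmp(tok, "control") == 0)
            o->control = 1;
        else if (take_value(tok, "snapshot", o->snapshot, sizeof(o->snapshot)) ||
                 take_value(tok, "dev", o->dev, sizeof(o->dev)) ||
                 take_value(tok, "new", o->newdev, sizeof(o->newdev)))
            ;                       /* value already stored by take_value */
        else if (*tok != '\0')
            o->unknown++;           /* skip empty tokens, count the rest */
        tok = comma ? comma + 1 : NULL;
    }
    return 0;
}
```

With this sketch, a non-empty 'snapshot' field would be the trigger for a snapshot
mount, and 'control' for a control-filesystem mount.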
 191
 192 23mar2006
 193  - factored out super-block finding preparatory to finding snapshots.
 194
 195  Thoughts:
 196     superblocks for snapshots and sub-ordinate filesystems do
 197     not get stored in the 'state'.  There is, however, a usage count so that
 198     the prime filesystem cannot be unmounted until all snaps and subs are gone.
 199     This should just refcount the prime_sb I suspect.
 200
 201     So: a snapshot sb points to the 'struct fs' but doesn't .... what???
 202
 203 30mar2006
 204  - remove the super-block finding code by changing the layout to store
 205     superblock locations explicitly :-)
 206
 207  - teach 'mount' to mount snapshots.
 208
 209  - need to audit for bad use of ss[0]
 210  - need to find better way to map 'sb' to snapshot number.
 211  - need to make unmount work.
 212
 213 01apr2006 (no, really!!)
 214  - rewrite index to kmalloc index blocks and use a shrinker to free them.
 215    This means that indexblock no longer has a 'page', which makes sense.
 216    It also means they cannot live in highmem, which is sad, but could
 217    be fixed.
 218
 219   Notes: superblocks and refcounts.
 220    Each device holding the filesystem gets a superblock.
 221     One of these (arbitrarily) is the 'prime' superblock and gets to
 222       manage the whole filesystem.
 223     Each snapshot also gets a superblock, as does each
 224       subordinate filesystem.  These are anon sbs - using anon dev.
 225     Each anon sb takes a reference to the 'struct fs', and also to the
 226       prime sb.... how about the reference relationship between fs and prime_sb???
 227
 228     Need to ponder this,
 229
 230    - problem with getting parent superblock due to semaphores...
 231    - when unmount, put_super isn't being called, so inode 0 isn't released!
 232
 233 13apr2006
 234   (Took a week off to play with rt2500 wireless cards)
 235   - Use different filesystem type for snapshots and subordinate filesystems.
 236     This removes the semaphore problem
 237   + OK, mount and unmount works for snapshots... what next?
 238      - review index block - worry about highmem?
 239      - review ss[0] usage - OK
 240      - general code review
 241
 242   FIXME - what should leaf_lookup/index_lookup return on format error?
 243       They currently return '0' which will quietly make an empty block.
 244       Maybe '-1' would be better to make an error block.
 245   FIXME check how other filesystems lock the setting of PagePrivate
 246      Maybe just need to lock_page
 247   FIXME combine find/load/wait into one operation
 248   Review dir, super, roll, link
 249
 250   FIXME module refcount increases on failed mount!
 251
 252 18may2006
 253   I've been sick for too long, and not much has happened... However I think more than
 254   the above comment says.  I started looking at roll-forward and have the
 255   basic block parsing in place so that it reports what it sees in the roll.
 256   Also, the format has been changed a little: the address in the state block
 257   is the CheckpointStart cluster, and we simply roll forward to the
 258   CheckpointEnd, and then keep going beyond there - there is no longer any
 259   walking back to find the start.
 260
 261   Next step is to start incorporating rolled elements into the filesystem
 262
 263    - data blocks: shouldn't be too hard.  Don't need to update the
 264            index pages just yet
 265    - inode updates: should be straight forward enough, but care is needed
 266            as the data might be in multiple places
 267    - directory updates: these are probably most interesting..
 268
 269
 270   Question: how are symlinks created?
 271     Currently we:
 272       log the inode creation
 273       commit the new inode
 274       log the directory update.
 275     This allows the 'value' stored in the inode to appear after the directory
 276     update.
 277     That might be OK for files (which are created empty and then extended)
 278     but is bad for symlinks (which are created atomically).
 279     So, options include:
 280      - ensure inode is in a previous cluster to directory updates.
 281        This slows things down too much I think
 282      - log the content as well.  This is awkward if it is big, certainly if more
 283        than a block, which is possible.
 284      - directory updates could be dependent on the inode being valid.
 285        This is ugly.
 286      - log content if it is small, else write inode, flush, then create link.
 287
 288     So the fast option is:
 289       log inode create, log content, log filename
 290     and the slow/safe option is
 291       log inode create, sync file, log filename
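
The choice between the two orderings can be written as a one-line decision.  This is
only a sketch: the names are invented here, and 'max_update' (the largest content an
update record could carry) is an assumed parameter, not a value taken from the format.

```c
#include <assert.h>
#include <stddef.h>

enum symlink_commit {
    LOG_CONTENT_INLINE,   /* fast: log inode create, log content, log filename */
    SYNC_THEN_LINK        /* slow/safe: log inode create, sync file, log filename */
};

/* Pick the fast path only when the link target fits in one update record. */
enum symlink_commit choose_symlink_commit(size_t target_len, size_t max_update)
{
    assert(max_update > 0);
    return target_len <= max_update ? LOG_CONTENT_INLINE : SYNC_THEN_LINK;
}
```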
 292
 293     So on roll-forward if we see the inode we just save the data.
 294     Saving the whole inode seems attractive, but we want minimal order
 295     dependence: an inode update in the same cluster as the new inode should
 296     still over-ride, even though it is earlier.
 297
 298   Ok, rollforward is proceeding slowly.  I think I am now incorporating
 299   new blocks into the tree properly, though the code probably won't compile.
 300   It will be nice to test this and see the file have the right data.
 301
 302   Next step would be to include the index incorporation code.
 303   Then
 304     - directory updates
 305     - segusage summary
 306     - quota
 307     - stuff..
 308
 309 08jun2006
 310  - what exactly should happen when rollforward finds a file with a linkcount of 0?
 311    Currently all updates get lost - I wonder if they are lost safely?
 312  - rollforward is getting the size right, but not the content
 313  - do I need to flag a block that ->phys is valid?
 314
 315  : Ok, roll-forward picks up new blocks in a file OK,
 316   but umount has stopped working.
 317     Presumably because there are pages attached to the inode which aren't
 318     getting released.  What do we want to do here?
 319     Normally those pages, or their addresses need to be recorded before
 320     they are lost.  But on a read-only mount we don't care so much.
 321
 322 22jun2006    continuing above thought..
 323
 324    When we roll-forward and pick up the pieces of a file, we don't
 325    want to allocate pages to hold those pieces (and definitely don't
 326    want to read them all).  We just want to attach the addresses
 327    to the parent for incorporation.  Similarly after writing
 328    dirty blocks in a file we want to be able to release them
 329    immediately rather than waiting for the addresses to be
 330    incorporated (as incorporation can be more efficient when delayed).
 331
 332    We could just allow the page associated with a block to be released,
 333    except that the page provides the indexing to find a block.  We might
 334    be able to live without the indexing, and hunt down the indexblock tree,
 335    but living without the mutual-exclusion provided by block indexing would
 336    be more awkward.
 337    And the 'struct datablock' still contains a lot more than is needed.
 338
 339    So maybe we should just have a completely separate structure attached to
 340    the indexblock which lists fileaddr/physaddr.  This could include
 341    extent information.  The trick would be guaranteeing allocation.
 342    We could either allocate-late with a fallback of attaching the 'struct block'
 343    or performing an immediate incorporation, or allocate-early and block
 344    the dirtying of a page until there is space to record the new address.
 345    This last is bound to be easiest.
 346
 347    So: what exactly do we use to store addresses?
 348     Probably a linked list of tables.
 349     Each table contains a link pointer and an array of
 350         fileaddr/physaddr/extentlen
 351     But we would need to allocate lots of these if there are hundreds of
 352       dirty pages, but possibly only end up using a few if they made
 353       extents very nicely.  That might be wasteful.
 354
 355     Or we could allocate just one.  When it is full we perform an
 356      incorporation.  But if that causes a page split we are in trouble.
 357        We could have a spare page, split to it, write out one
 358         and wait for the spare page to be written and free.
 359         But we cannot just release the index page as it might still have
 360         children.
 361
 362     (I think I've been here before).
 363     A worst-case scenario involves writing one block and that requires
 364       splitting every index up the tree to the inode.  This requires
 365       arbitrarily many pages to be allocated.  To accommodate this we either
 366       pre-allocate a spare page at every level of the tree down to the data
 367       block (a bit like storage space allocation) which seems very wasteful,
 368       or we make sure we can release one of the split pages, which seems impossible.
 369
 370     I could decide not to worry about it.  Have a pool of index pages and hope
 371      it always works.  After all, most pages are data pages, and they can be
 372      freed successfully.  We would only have a deadlock if all dirty memory were
 373      index pages, and that seems unbelievably unlikely.  If we trigger a
 374      checkpoint when the count of locked-pages hits some limit we should be
 375      safe.
 376
 377     So: Keep one table per index block.  Use simple append and sequential search.
 378      When table gets full, force an incorporation
 379
 380      Do we allocate the table separately, or embed it in the indexblock??
 381
 382      Probably embed it.  indexblocks that don't need it can be freed at any
 383      time so that space waste hopefully isn't significant.
 384
 385      How big?
 386       If the file is written sequentially, then everything should gather into
 387       extents, and so it doesn't need to be enormous.
 388       If the file is written randomly then the index block can be expected to
 389       be 'indirect', so incorporation will be cheap.
 390      So 'small' seems ok in both cases.
 391
 392      Let's say 8.
 393
 394      But wait a minute.....
 395      On a checkpoint we can be getting phys updates for prev and next phases.
 396      next-phase updates cannot be incorporated until the indexblock has passed
 397      on to the next phase.  So in that case, I think we still keep a linked
 398      list of unincorporated blocks and live with the fact that we cannot
 399      free them until the phase change passes.  That shouldn't be a big problem
 400      as it is a limited time frame - especially for data blocks..
 401
 402      But does this solve our initial problem??
 403      During roll-forward we want to keep the addresses but not the blocks,
 404      and we don't want to force incorporation. That means an arbitrary list
 405      of addresses attached to an index block.
 406      I guess we could possibly allow incorporation, but I would rather not
 407      as I want the fs to be able to be read-only nicely.
 408      So that means we need to have a list of address tables.
 409      Maybe the normal approach is 'add a table if possible, else incorporate'?
 410
 411      OUCH... we may write a block a second time before incorporating the
 412      new address, so when adding an address to the table we need to check
 413      if it already exists.  That could be expensive.
 414      For index blocks might it even be a different address?  I think
 415      not but the vague possibility (in the future?) does complicate
 416      things somewhat.  Maybe we just keep things in chron order and
 417      don't worry about duplicates until incorporate time, when we have to
 418      sort anyway.
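
A minimal sketch of the "small table, simple append, sequential search" idea, with
entries kept in chronological order and duplicates left alone until incorporation.
The struct and function names and the field widths are assumptions for illustration;
returning 0 for "not found" assumes a physical address of 0 is never valid.

```c
#include <assert.h>
#include <stdint.h>

#define UNINC_SLOTS 8            /* "Let's say 8" */

struct uninc_ent {
    uint32_t fileaddr;           /* first file block covered by the extent */
    uint64_t physaddr;           /* device address of that first block */
    uint16_t extentlen;          /* number of consecutive blocks */
};

struct uninc_table {
    int used;
    struct uninc_ent ent[UNINC_SLOTS];
};

/* Append in chronological order, growing the newest extent when the new
 * block continues it.  Returns 0 on success, -1 when the table is full. */
int uninc_add(struct uninc_table *t, uint32_t fileaddr, uint64_t physaddr)
{
    assert(t->used >= 0 && t->used <= UNINC_SLOTS);
    if (t->used > 0) {
        struct uninc_ent *last = &t->ent[t->used - 1];

        if (last->extentlen < UINT16_MAX &&
            fileaddr == last->fileaddr + last->extentlen &&
            physaddr == last->physaddr + last->extentlen) {
            last->extentlen++;
            return 0;
        }
    }
    if (t->used == UNINC_SLOTS)
        return -1;               /* caller must incorporate (or add a table) */
    t->ent[t->used++] = (struct uninc_ent){ fileaddr, physaddr, 1 };
    return 0;
}

/* Sequential search, newest entry first, so that a block written twice
 * resolves to its latest address.  Returns 0 if the block is not listed. */
uint64_t uninc_lookup(const struct uninc_table *t, uint32_t fileaddr)
{
    for (int i = t->used - 1; i >= 0; i--) {
        const struct uninc_ent *e = &t->ent[i];

        if (fileaddr >= e->fileaddr && fileaddr - e->fileaddr < e->extentlen)
            return e->physaddr + (fileaddr - e->fileaddr);
    }
    return 0;
}
```

Searching newest-first is what makes it safe to skip the duplicate check on append:
the stale entry is simply shadowed until the sort at incorporation time drops it.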
 419
 420
 421      todo:
 422         lafs_find_block  DONE
 423         free_block must free tables DONE
 424
 425
 426      Unmounting still doesn't work.
 427      Problem is that an index block is holding a reference on parent,
 428      and parent references aren't getting cleaned up.
 429      On read-only unmount I guess we need to walk the list of leafs,
 430      discard any address info, and unlock the blocks.
 431      So that should be the first task for next time.
 432
 433 27jul2006
 434   Leafs are locked blocks which have no locked children.
 435   So any locked data block (non-inode) is a leaf
 436   Any locked index block with lockcnt[phase] 0 is a leaf.
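
The leaf rule as a predicate, sketched with made-up types.  'struct blk' and its
fields are assumptions, not the module's structures.  The note does not say how
inode data blocks are classified, so this sketch treats them as non-leaves.

```c
#include <assert.h>
#include <stdbool.h>

enum blk_kind { BLK_DATA, BLK_INODE_DATA, BLK_INDEX };

struct blk {
    enum blk_kind kind;
    bool locked;
    int  lockcnt[2];     /* locked children, counted per phase (index blocks) */
};

/* A leaf is a locked block with no locked children in the given phase. */
bool blk_is_leaf(const struct blk *b, int phase)
{
    assert(phase == 0 || phase == 1);
    if (!b->locked)
        return false;
    if (b->kind == BLK_DATA)
        return true;                       /* plain data has no children */
    if (b->kind == BLK_INDEX)
        return b->lockcnt[phase] == 0;
    return false;                          /* inode data block: assumption */
}
```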
 437
 438   OK - fixed numerous bugs, but I can unmount now!!
 439   I can even rmmod and insmod and all is cool.
 440
 441
 442 TODO:
 443  - review refile and get all the code in there from prototype
 444        DONE (I hope)
 445  - write a combined find/load/wait function and use it
 446        DONE
 447  - allocate inodes in single memcache and avoid generic_ip
 448        HALF DONE. (still using kmalloc, not doing initonce well)
 449  - review recording of new block addresses
 450     + make sure we lookup there on index lookup - YES
 451     + make sure ->uninc_next gets transferred to table at phase change.
 452     + write incorporation code as it is tricky
 453  - review how directory updates can be incorporated into a RO filesystem.
 454     No, they cannot.  We need to update the directory.
 455  - write directory update code
 456  - write cluster construction code
 457  - make sure indexblocks with unincorporated addresses get on to inc_pending
 458     ?? or is locking them enough?
 459
 460
 461 INCORPORATION - ARgggghhhhh.
 462  The current uninc_table doesn't really lend itself to building
 463   index block... though maybe....
 464  Question: what happens when an index block disappears? i.e. it has no
 465   addresses in it?
 466   We clearly need to remove it from the parent.  This should be trivial,
 467   a direct operation on the parent index block, e.g. set some number to 0.
 468   Then the next incorporation pass will simply lose that entry.
 469
 470  OK, that might be all well and good, but how do we sort unincorporated
 471   addresses so we can merge them?
 472  A linked-list merge sort is nice and open-ended, but does waste
 473   quite a bit of space in pointers.
 474
 475  Or maybe I should just always do small-table incorporations.
 476  Is there a way that a bad ordering of writes could force very bad
 477   index layout in this case? i.e. cause a table split every time,
 478   but new blocks go in the first (full) table.
 479  OK Decision: always do small-table incorporation.
 480   i.e. not a list of blocks: just a table of addresses.
 481
 482  FIXME check validity of index type when it is first read in,
 483    and reject early if it cannot be recognised.
 484
 485 24aug2006
 486  Took a break from incorporation.
 487  Looking at directories.
 488  Wrote dir.doc in module to sum lots of stuff up.
 489  Issue:
 490    dir blocks have an info structure attached.
 491    This included a counted reference to the parent.
 492    How long does this need to hang around for??
 493
 494    - when there is any orphan issue happening, it must stay, via
 495      the 'pinned' flag.
 496    - when actually performing a dir op, we need to create and
 497      maintain this info.
 498
 499    When last ref of a dir block is dropped, should drop
 500    the parent reference.
 501
 502
 503  Status:
 504     free list management mostly done.
 505     Next:
 506       create/delete prepare/commit/abort
 507       orphan handling
 508       dirty_block lock_block
 509
 510
 511  FIXME should dir_new_block zero out the block?
 512    How will commit_create know what to do with this block?
 513
 514  NOTE another type of directory orphan is a free leaf block which
 515    is on the part-free list.
 516
 517 -------------------------------------------------------------
 518 09sep2006 - on the plane to Frankfurt
 519  Don't tell me I am rethinking preallocation again ???
 520
 521  TODO
 522    dirty_inode needs to record the phase it is dirty in
 523    inode_fillblock needs to check current phase and act accordingly.
 524      see inode.doc
 525    Make sure the B_Orphan flag is set and used - or discard it.
 526
 527    How do we commit creating a symlink?
 528    If it is a full block in size we cannot make an update record.
 529     - maybe have two update records? We cannot guarantee they are in
 530       the same  cluster.
 531     ... but if we put the 'make dir entry' last it should work.
 532
 533    Change 'struct descriptor' definition
 534    the 'block_type' aka 'length' 16 field becomes
 535       0x0000 -> 0x8000 -> datablock, possibly a hole - up to 32K.
 536       0x8001 -> 0xc000 -> miniblock up to 16K+
 537       0xffff           -> index block.
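
A small sketch of decoding that 16-bit field.  The enum and function names are
invented here.  Values 0xc001..0xfffe are not assigned above, so they are reported
as invalid, and taking the miniblock length as the offset above 0x8000 is an
assumption rather than something these notes state.

```c
#include <assert.h>
#include <stdint.h>

enum desc_kind { DESC_DATA, DESC_MINI, DESC_INDEX, DESC_INVALID };

/* Classify the 'block_type' aka 'length' field of a descriptor. */
enum desc_kind desc_classify(uint16_t block_type)
{
    if (block_type <= 0x8000)
        return DESC_DATA;       /* data block (possibly a hole), up to 32K */
    if (block_type <= 0xc000)
        return DESC_MINI;       /* miniblock */
    if (block_type == 0xffff)
        return DESC_INDEX;      /* index block */
    return DESC_INVALID;        /* unassigned range */
}

/* Assumption: a miniblock's length is its offset above 0x8000. */
unsigned desc_mini_len(uint16_t block_type)
{
    assert(desc_classify(block_type) == DESC_MINI);
    return block_type - 0x8000u;
}
```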
 538
 539    Need to write IO routines which decrease pending-block-count in
 540      'wc'.
 541
 542
 543    Thinks.  a 1TB filesystem with 1K blocks and 4096 blocks/seg
 544      gives 4Meg segments. That would be 256K segments which at 2 bytes per segment
 545      - 512 segments per block - is 512 blocks in each seg usage file
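
The same arithmetic as a checkable helper (names are hypothetical).  For 1TB
(2^40 bytes), 1K blocks, 4096 blocks per segment and 2 bytes per usage entry it
gives 4M segments in bytes, 256K segments, 512 entries per block and 512 blocks
per seg usage file, matching the figures above.

```c
#include <assert.h>
#include <stdint.h>

struct segusage_size {
    uint64_t seg_bytes;       /* bytes per segment */
    uint64_t nsegs;           /* segments in the filesystem */
    uint64_t segs_per_block;  /* usage entries that fit in one block */
    uint64_t usage_blocks;    /* blocks needed for one seg usage file */
};

struct segusage_size segusage_calc(uint64_t fs_bytes, uint64_t block_bytes,
                                   uint64_t blocks_per_seg,
                                   uint64_t bytes_per_entry)
{
    struct segusage_size s;

    assert(blocks_per_seg > 0 && bytes_per_entry > 0);
    assert(block_bytes >= bytes_per_entry);
    s.seg_bytes = block_bytes * blocks_per_seg;
    s.nsegs = fs_bytes / s.seg_bytes;
    s.segs_per_block = block_bytes / bytes_per_entry;
    /* round up so a partly filled last block is still counted */
    s.usage_blocks = (s.nsegs + s.segs_per_block - 1) / s.segs_per_block;
    return s;
}
```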
 546
 547 12oct2006
 548  Need to write
 549  - lafs_lock_{d,}block  DONE
 550        Make sure the block has parents and allocation and set the locked
 551        flag and the phase.
 552
 553  - lafs_flush
 554        Given a datablock, wait for it to be written out
 555        This is needed before updating a block that is still locked in the
 556        previous phase.
 557  - lafs_inode_init
 558        Used when creating a new object/inode
 559        Given a datablock which is to hold the inode
 560          and a type (Type*) and a mode,
 561        Fill in the data block with appropriate data so that
 562           when lafs_import_inode looks at it, the right stuff happens.
 563  - phase_flip
 564  - lafs_prealloc
 565  - lafs_seg_ref
 566  - lafs_lock_inode
 567
 568 lafs_dirty_dblock
 569 lafs_cleaner_pause
 570 lafs_dirty_inode
 571 lafs_seg_flush_all
 572 lafs_write_all_super
 573 lafs_quota_flush
 574 lafs_space_use
 575 lafs_cluster_update_abort
 576 lafs_cluster_update_commit_buf
 577 lafs_cluster_update_commit
 578 lafs_seg_apply_all
 579 lafs_cluster_update_prepare
 580 lafs_inode_phase_check
 581 lafs_seg_dup
 582 lafs_dirty_block
 583 lafs_cluster_update_lock
 584 lafs_checkpoint_unlock_wait
 585 lafs_orphan_drop
 586 lafs_free_get
 587 lafs_find_next
 588
 589 2nov2006
 590  - I need to know if a block is undergoing write-io so that I can
 591    avoid modifying it in certain circumstances.  But I don't track
 592    this information.  Options:
 593     1/ track the info.  This means an extra field in the 'struct block'
 594         because I still need to know which wc has had a write.
 595     2/ For blocks that we care about copy the data on write...
 596         But we care about all inodes and directory blocks.  That is a waste.
 597    I think we put extra info in the block.
 598    We need to know which wc was used (0,1,2) and which pending cluster
 599    in there (0-3) which comes to 4 bits.
 600    But we only care about the block for wc=0. and we could include the
 601    which-pending in the b_end_io, or maybe put it all in low bits
 602    of the block pointer....  Need max 4 bits.  Can only be sure of 2...
 603
 604    Maybe:
 605        'which' goes in bottom two bits of bi_private
 606        'wc' goes in ->flags
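
Stashing 'which' in the bottom two bits of a pointer could look roughly like this.
It is a generic pointer-tagging sketch with invented helper names, not code from
the module, and it only works if the pointer is at least 4-byte aligned (hence
"can only be sure of 2" bits).

```c
#include <assert.h>
#include <stdint.h>

#define WHICH_MASK ((uintptr_t)3)   /* two low bits: pending cluster 0-3 */

/* Fold 'which' into the low bits of an aligned pointer. */
void *tag_private(void *ptr, unsigned which)
{
    assert(((uintptr_t)ptr & WHICH_MASK) == 0);   /* needs 4-byte alignment */
    assert(which <= 3);
    return (void *)((uintptr_t)ptr | which);
}

/* Recover the original pointer. */
void *private_ptr(void *tagged)
{
    return (void *)((uintptr_t)tagged & ~WHICH_MASK);
}

/* Recover the 'which' value. */
unsigned private_which(void *tagged)
{
    return (unsigned)((uintptr_t)tagged & WHICH_MASK);
}
```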
 607
 608
 609 4apr2007  (What a long gap !!)
 610
 611  - lafs_cluster_update_*
 612    How do we prepare for a cluster update?  How do we lock it.
 613
 614    The important thing is that the update can be written.  That
 615    requires that there is space available.  So we need to preallocate
 616    space and then release it.
 617    It is possible that each update might go in a different cluster, so maybe
 618    we need to preallocate one block per update.  That sounds a little expensive.
 619    After all, we aren't preallocating a cluster block for every data block
 620    that is dirty.
 621    So: prepare does nothing
 622         lock preallocates the space - a full block.
 623         commit copies it in.
 624     For now at least.
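
A toy model of that prepare/lock/commit split.  Everything here is assumed for the
sake of the example: the names, the 1K block size, the counters, and the choice
that commit consumes the reservation while abort hands it back.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define UPD_BLOCK_SIZE 1024          /* hypothetical block size */

struct wc_space { int free_blocks; int reserved; };
struct cluster_buf { char data[UPD_BLOCK_SIZE]; size_t used; };

/* prepare: nothing to do for now. */
int update_prepare(void)
{
    return 0;
}

/* lock: preallocate a full block, or fail if no space is left. */
int update_lock(struct wc_space *sp)
{
    if (sp->free_blocks == 0)
        return -1;
    sp->free_blocks--;
    sp->reserved++;
    return 0;
}

/* commit: copy the update into the cluster and use up the reservation. */
void update_commit(struct wc_space *sp, struct cluster_buf *cb,
                   const void *upd, size_t len)
{
    assert(sp->reserved > 0);
    assert(len <= UPD_BLOCK_SIZE - cb->used);
    memcpy(cb->data + cb->used, upd, len);
    cb->used += len;
    sp->reserved--;
}

/* abort: give the preallocated block back. */
void update_abort(struct wc_space *sp)
{
    assert(sp->reserved > 0);
    sp->reserved--;
    sp->free_blocks++;
}
```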
 625
 626 24May2007
 627
 628  - Can now create and delete lots of files.  This is cool.
 629   But:
 630     Orphan slots just grow and grow - never to be reclaimed - why?
 631     After rm f*, 7 files remain.  but rm f* again and they go.
 632          FIXED - readdir wasn't returning them
 633     Size of directory remains large.
 634     And sometimes, files become ghosts... (try just removing one after first rm f*).
 635
 636   TODO - process those orphans to clean up the directory.
 637
 638 20June2007 (Happy Birthday Dad)
 639
 640  - Creating lots of files and then deleting them leaves 5 orphan slots
 641    for the directory busy, and one for inode 0??
 642
 643    Directory handling uses the following orphans:
 644     CREATE:
 645         A new index block is created by splitting.  This needs to be linked in.
 646     DELETE:
 647         The dirent block we are deleting from
 648            If it becomes empty, it needs to go on free list
 649         The index block we are deleting from
 650            If it has lots of free space it might need to be rebalanced.
 651      The inode that was deleted.
 652
 653
 654  - When a file is fully deleted, we need to drop any orphan info... DONE
 655  - Need to do orphan handling of free blocks in directory, and
 656    unmerged parents - but there doesn't seem much point as I am going to
 657    change the directory layout (again).
 658
 659  So: writing to a file.
 660    We need prepare_write, commit_write, and writepage.
 661    Prepare loads and links the page and checks there is space.
 662    commit marks it as dirty so writeout is possible.
 663    writepage chooses a page to write out
 664
 665 25June2007 - HACK week, thanks Novell!!
 666  - write - DONE
 667  - sync
 668      Somewhat done.
 669      Need to revise the process whereby async completion
 670      clears PageWriteback,
 671      We need locking in there, and need to worry about
 672        'which' wrapping too soon.
 673      Need to not start IO before we set page writeback
 674  - chmod
 675      Maybe, but syncing to disk needs more thought.
 676  - 'df'
 677     Partly done, need actual content.
 678  - mkdir
 679     Can make directory, but creating first entry fails. - FIXED
 680  - symlink
 681  - readlink
 682  - new directory structure.
 683
 684 27Jun2007 - More HACK week :-)
 685
 686  - new directory layout done - much easier!!
 687  - If I delete a file that was created, the blocks still have a ref-count
 688    and we crash.
 689  - mkdir doesn't increase link count on parent. - FIXED
 690
 691  TODO:
 692    Orphan handling.
 693      Infrastructure to process orphans
 694      Handle specific cases
 695      flush orphans at key times.
 696      load orphans at roll-forward
 697
 698    checkpoint
 699      Write out a checkpoint (when?)
 700      Make sure refcount goes back to zero on blocks I write.
 701
 702   Check on inode_phase_check and checkpoint_unlock and inode_dirty
 703    in all directory operations.
 704
 705  FIX: Writing a small file leaves something non-dirty but
 706     due to be written, and lafs_cluster_allocate complains.
 707   - seems to work now.
 708
 709  FIX: dir_handle_orphan doesn't lock the orphan transaction required.
 710
 711  FIX: rm a file with (small) content hangs waiting in sync_page in truncate_inode_pages.
 712
 713  FIX: lafs_allocate hasn't been written!!!
 714
 715  FIX: before updating any block in a depth=0 file, we must first load
 716       and 'lock' block 0.
 717
 718 29Jun2007 - still HACK week.
 719   Summary of how incorporation works.
 720
 721   Each index block has a small table for unincorporated changes, i.e.
 722   block numbers and their addresses.
 723   This supports efficient storage of extents, and is extensible by allocating
 724   more tables.  This last is done rarely.
 725
 726   When a block gets a new address, this is added to the table or, if
 727   there is a phase mismatch, it is added to a list until a phase change
 728   happens (so the whole block is pinned pending the phase change).
 729
 730   If the table is full then:
 731    - if the filesystem is read-only (including during roll-forward),
 732      a new table is allocated (else rollforward fails).
 733    - otherwise we incorporate the table into the block, then add the new
 734      address to the (now empty) table.
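
The rules for where a new address goes can be condensed into one decision function.
This is a sketch with invented names; it only picks the action and leaves the table
handling itself to the caller.

```c
#include <assert.h>
#include <stdbool.h>

enum uninc_action {
    UNINC_APPEND,            /* room in the table: just add the address */
    UNINC_QUEUE_FOR_PHASE,   /* phase mismatch: hold on a list until the flip */
    UNINC_NEW_TABLE,         /* table full, read-only or rolling forward */
    UNINC_INCORPORATE        /* table full, writable: incorporate, then add */
};

/* Decide what to do with a newly assigned block address. */
enum uninc_action uninc_decide(int used, int slots, bool read_only,
                               bool phase_matches)
{
    assert(slots > 0 && used >= 0 && used <= slots);
    if (!phase_matches)
        return UNINC_QUEUE_FOR_PHASE;
    if (used < slots)
        return UNINC_APPEND;
    return read_only ? UNINC_NEW_TABLE : UNINC_INCORPORATE;
}
```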
 735
 736   If incorporation requires that we split the index block we allocate one
 737    from a pool.  If there are none in the pool, we wait.
 738
 739   As the table is much smaller than a block, the incorporation into
 740   two blocks will always succeed.
 741   The 'uninc_next' and 'children' lists will then need to be shared
 742   between the two blocks before the new address is added to whichever
 743   table is appropriate.
 744
 745   When looking for a block address, we must always check the table and
 746   then children lists.  We do not need to check uninc_next as they will always
 747   be children.
 748
 749   How to ensure that the pool always has sufficient index blocks and we don't
 750   deadlock?
 751   We have two halves of the table, one for each phase.  Before we allow
 752   a block to be dirty in a phase, we ensure that the pool has adequate
 753   index blocks for that phase.  e.g. twice the depth of the block.  If it
 754   doesn't we block the dirtying until space becomes available.
 755   For syscall writes, this is easy as we catch in prepare_write.
 756   When we perform a phase change, we must be sure there are enough index
 757   blocks for the deepest block that will stay dirty.  If there aren't, we need
 758   to flush all dirty blocks, and unmap all writable mappings before
 759   starting the checkpoint.
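
The "twice the depth" rule might be modelled like this.  The names are hypothetical,
and treating the check as an actual reservation (subtracting from the pool) is an
assumption of this sketch; the notes only say the pool must be adequate.

```c
#include <assert.h>
#include <stdbool.h>

/* Spare index blocks, split into one half per phase. */
struct iblock_pool { int avail[2]; };

/* Try to set aside enough index blocks to dirty a block at 'depth' in
 * 'phase'.  Returns false when the caller has to wait for space. */
bool pool_reserve_for_dirty(struct iblock_pool *p, int phase, int depth)
{
    int need = 2 * depth;            /* "e.g. twice the depth of the block" */

    assert(phase == 0 || phase == 1);
    assert(depth >= 0);
    if (p->avail[phase] < need)
        return false;                /* block the dirtying until space frees up */
    p->avail[phase] -= need;
    return true;
}

/* Return a reservation once the block has been written and incorporated. */
void pool_release(struct iblock_pool *p, int phase, int depth)
{
    assert(phase == 0 || phase == 1);
    assert(depth >= 0);
    p->avail[phase] += 2 * depth;
}
```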
 760
 761
 762  FIX: need to work out life time rules so that inodes hang around while they have blocks.
 763     currently have an igrab that is never put.
 764
 765  FIX: Dirty isn't cleared until 'flush', but do_checkpoints requires 'alloc' to clear it.
 766
 767 3Jul2007
 768  Checkpoint flushing is getting close.
 769  Current problem.
 770    InoIdx blocks are not changing phase.
 771    Phase change should happen when all children have been incorporated, and
 772     then the write has been triggered marking us clean.
 773   For InoIdx blocks, we need to be marked clean when the data block
 774    completes.
 775
 776 5jul2007 - a week off
 777  Checkpoint flushing seems to work !!!!
 778  FIX: what should filesize of symlink be?
 779      other filesystems use len, but still zero-terminate for vfs.
 780
 781  Problem.  A chmod is followed immediately by an unlink then a checkpoint.
 782    The chmod update gets into the checkpoint cluster, but the unlink completes
 783    before the checkpoint is finished so the new superblock sees the file
 784    as gone.  Roll-forward finds the update and wants to update a missing file.
 785
 786    This isn't a big problem, but with slightly different details, it could be.
 787
 788    One option is to ignore updates that precede the updated block.  That might
 789    be awkward with e.g. directory updates and checkpoints that cross multiple
 790    segments.
 791
 792    Another option might be to prohibit updates once a checkpoint has started
 793    unless they are known to be after the phase change.
 794
 795  FIX: unlink isn't punching a hole in the inode file.
 796       Inode usage map isn't being updated. - FIXED (For create, not unlink).
 797
 798  FIX: roll forward does not pick up inodes, only data blocks.
 799     But tiny files are synced to inode, so they might not be picked up.
 800     So we must process a level=0 inode like a data block.
 801
 802 6July2007
 803  Time for lots of clean up.
 804
 805 DONE 1/ Index blocks to fill with 0 - use phys=0 to imply invalid.
 806 DONE 2/ rename 'lock' -> 'pin'
 807  3/ Review and fix up all locking/refcounts.  See locking.doc
 808 DONE 3a/ Make sure cluster_allocate can be called concurrently. e.g. check
 809          B_Alloc inside the semaphore
 810        Also lock inode when copying in block 0 and probably
 811        when calling lafs_inode_fillblock (??)
 812 DONE 3b/ lafs_incorporate must take a copy of the table under a lock so
 813          more allocations can come in at any time.
 814 NotYet 3c/ cluster_flush should start all writes before calling _allocate
 815          as _allocate might block on incorporation/splitting.
 816        No.  We really want _allocate to not block, but to queue...
 817         I think this is too hard to get perfect just now, so I will leave it.
 818 DONE  3d/ introduce PinPending for data blocks.  remove fs->phase_depth.
 819 LATER 3e/ Index needs a clean-lru on each filesystem, and a list of filesystems
 820      so that locking of lru doesn't have to be too global
 821 DONE 3f/ change wc[]->hlhead to be a regular listhead as it is part of the
 822        lru system.
 823 DONE 3g/ revise refile lru handling based on new understanding
 824  3h/ Utilise WritePhase bit, to be cleared when write completes.
 825      In particular, find when to wait for Alloc to be cleared if
 826       WritePhase doesn't match Phase.
 827        - when about to perform an incorporation.
 828  3i/ make sure we don't re-cluster_allocate until old-phase address has
 829      been recorded for incorporation.
 830  3j/ Check that index blocks cannot race when getting locked....
 831   k/ Check what locking is needed to set PagePrivate exclusively.
 832 DONE  l/ cluster_done needs to call refile, but is called in interrupt context.
 833      We need to get it done in process context I think and lock
 834       ->waiting access with fs->lock after changing it to ->lru
 835 DONE  m/ Need to know which blocks in a page are in writeback so we can clear writeback
 836         only when *all* have finished.
 837 DONE  n/ on phase change, uninc_next blocks need to be shared out.
 838 NO 3o/ Make sure lafs_refile can be called from irq context.
 839  3p/   lock all lru accesses.
 840  3q/ Lock those index blocks!!!
 841  3r/ Can inode data block be on leafs while index isn't, what happens if we
 842        try to write it out...
 843  FIXED Why are extent entries only grouped in 4s?
 844  If InoIdx doesn't exist, then write_inode must write the data block.
 845  4/ resolve length of symlink
 846    FIXED - long symlink followed by 'sync' crashes.
 847    FIXED - rollforward isn't calling 'allocated' on blocks, or something
 848    FIXED - I cannot find 'bfile'. (inode isn't written)
 849    SEEMS OK...- Must flush final segment of a cluster properly...
 850  5/ Review what does, and does not need to be initialised in a new datablock
 851  6/ document and review all guards against dirtying a block from a previous phase
 852     that is not yet safe on storage.
 853           See lafs_dirty_dblock.
 854  7/ check for proper handling of error conditions
 855      a/ checkpoint_start might fail to start a thread!
 856      b/ if lafs_seg_dup fails in do_checkpoint, we need to abort the snapshot.
 857  8/ review checkpoint loop.
 858        Should anything be explicit, or will refile do whatever is needed?
 859  9/ Waiting.
 860        What should checkpoint_unlock_wait wait for?
 861        When do we need to wait for blocks to change state?  And how?
 862 DONE 10/ rebase on 2.6.current
 863 DONE     - use s_blocksize / s_blocksize_bits rather than fs->
 864
 865  11/ load/dirty block0 before dirtying any other block in depth=0 file
 866  12/ Add writecluster flag for old-phase updates.
 867      Why is this needed?  updates should always go in the new phase???
 868  13/ use kmem_cache for 'struct datablock'
 869  14/ indexblock allocation.
 870         use kmem_cache
 871         allocate the 'data' buffer late for InoIdx block.
 872         trigger flushing when space is tight
 873         Understand exactly when make_iblock should be called, and make it so.
 874  15/ use a mempool for skippoints in cluster.c
 875  16/ Review seg addressing code in cluster.c and make sure comments are good.
 876 DONE 17/ Make sure create inherits uid etc from process.
 877  18/ consider ranges of holes in pending_addr.
 878
 879 DONE 20/ Implement rest of "incorporate"
 880 DONE 21/ Implement staged truncate
 881 DONE         use for setattr and delete_inode
 882 DONE 22/ block usage counts.
 883  23/ review segment usage /youth handling and make a todo list.
 884       a/ Understand ref counting on segments and get it right.
 885  24/ Choose when to use VerifyNull and when to use VerifyNext2.
 886  25/ Store accesstime in separate (non-logged) file.
 887  26/ quotas.
 888         make sure files are released on unmount.
 889
 890  30/ cleaner.
 891        Support 'peer' lists and peer_find. etc
 892  31/ subordinate filesystems:
 893      a/ ss[]->rootdir needs to be an array or list.
 894      b/ lafs_iget_fs need to understand these.
 895  32/ review snapshots.
 896       How to create
 897       how they can fail / how to abort
 898       How to destroy
 899  33/ review unmount
 900       - need to clean up checkpoint thread cleanly - be sure it has fully exited.
 901  34/ review roll-forward
 902       - make sure files with nlink=0 are handled well.
 903       - sanity check various values before trusting clusters.
 904
 905  34/ Configure index block hash_table at run time based on memory size??
 906  35/ striped layout.
 907          Review everything that needs to handle laying out at cluster
 908          aligned for striping.
 909
 910  36/ consider how to handle IO errors in detail, and implement it.
 911  37/ consider how to handle data corruption in indexing and directories and
 912      other metadata and guard against problems (lot of -EIO I suspect).
 913
 914  - check all uninc_table accesses are locked if needed.
 915
 916 And more:
 917   1/ fs->pending_orphans and inode->orphans are largely unused!
 918   2/ If a datablock is memory mapped writeable, then when we write it out,
 919      we need to either fill up its credits again, or unmap it.
 920   3/ Need to handle orphans asynchronously.
 921
 922 ---------
 923 22nov2007
 924 Free index blocks are on two lists, both protected by the global
 925 hash_lock.
 926   1/ The per-inode free_index, so they can be destroyed with the inode
 927   2/ The global freelist so they can be freed by memory pressure.
 928
 929 11feb2008.   Where was I up to again?
 930    reviewing phase_flip and lafs_refile.
 931
 932   UPTO
 933      Reading through modify.c, at 'add_indirect'.  Plan to fix all this code.
 934      Need to think about how index blocks really change.  How old blocks get
 935       dis-counted from segment usage, and what optimisations are really good
 936       for re-incorporating index blocks.
 937         Operations to consider are:
 938               i)Append new block, ii)truncate, iii)over-write, iv)fill-hole.
 939           i/ leaf block splits, index block gets new entry at end, and replacement
 940                   for other entry.  Easy to handle
 941          ii/ trailing entries are zeroed.  Should be easy, but isn't yet.
 942         iii/ probably caught in leafs.  May cause internal split so we add new
 943              index address, which is easily handled if there is space.
 944          iv/ same as iii, though split more likely.
 945
 946        What about merging index blocks.  That just makes addresses disappear, which
 947         we handle the slow way.
 948        Do we ever re-target index blocks?  Would need to be careful about that.
 949        Make it look like a split where one block ends up empty as a hole.
 950      Need to write
 951            grow_index_tree (DONE - untested)
 952                   ib is a leaf inode that is getting full.  Copy addresses
 953                   into 'new', and make 'ib' an index block pointing at new.
 954
 955            add_index/walk index (DONE - untested)
 956
 957            end of do_incorporate (DONE - untested)
 958                 new contains the early addresses.  Some remain in ib
 959                  and/or ui.
 960                 the buffers must be swapped, so ib has the early address.
 961                 ui needs to be attached to new
 962                 return 2; - then new uninc needs to be split
 963
 964            lafs_incorporate
 965                 case 2 - horizontal split
 966                 case 3 - vertical split
 967   12feb2008
 968    Bother - uninc_table is a problem (again).
 969    We can currently add at any time with just a spinlock.
 970    So when we split a block horizontally,
 971
 972
 973    Still need to
 974           share out children and uninc_table in do_incorporate
 975           share out credits in do_incorporate
 976
 977 14feb2008
 978    Still need to do incorporate as above but took a break to...
 979
 980    Counting allocated blocks now works - stat shows the right info, hopefully
 981      storage is correct too. - DONE
 982
 983    next: truncate?  orphan thread?
 984       Then segment usage and the cleaner.
 985
 986
 987    thoughts:
 988     truncate - removing blocks doesn't need to erase them...
 989     - nothing forces a cluster_flush promptly!!!  We need a timeout
 990          or at least we need a flush before truncate_inode_pages...
 991
 992     - in lafs_truncate we need to make the block an orphan and pin it
 993       all in a checkpoint.
 994
 995 21Feb2008 (Research morning)
 996    Discard checkpoint thread created on demand in favour of a cleaner
 997    thread that runs all the time.  It cleans and checkpoints and
 998    orphans and scans.
 999
1000      want to:
1001         do segment scan and get a real list of free segments and
1002         free-space info!
1003
1004 25Feb2008
1005  - segment usage scanning to count free blocks
1006  - fix up re-reading of erased blocks
1007  - FIX truncate can still block waiting for writeback to complete.
1008  - FIX allocations aren't failing when we run out of free space
1009  - FIX df doesn't agree with du.
1010
1011  problem:
1012    Truncate when an index block has addresses in uninc_table.
1013      The summary for the new address has already been performed.
1014      We need to deallocate the new without disturbing the old.
1015      However a simple allocation may not be possible.
1016      I guess we can prune them all to zero, then incorporation
1017       can proceed.
1018
1019  TOFIX: when truncating a recently created file, it is still depth=0 so
1020     nothing happens.
1021     We really need to increase the depth to 1 as soon as we dirty
1022     any block, then reset back to 0 if it fits.
1023
1024 26Feb2008
1025   We have a file that we have written to, and the data blocks have been
1026   written out and the addresses stuck in uninc_table.
1027   We then truncate the file.  Who releases the usage of those blocks?
1028   And who removes them from uninc_table?
1029
1030   OK, 'rm' returns all the blocks back now so 'df' is almost the same as 'du'.
1031   I really should make sure that inodes are getting freed properly and the
1032   inode map is clean and everything.
1033
1034   BIG QUESTION
1035     Do we reserve segment-usage blocks.
1036      We cannot do it naively as we get infinite recursion.
1037      But we need it to be allowed to dirty the segment block.
1038      But we cannot pin them to this phase as we want to write them out
1039      after this phase
1040      This still needs more thought.  I avoided the recursion by setting SegRef
1041      before getting the ref.  But that isn't safe.
1042
1043 28Feb2008
1044   The table of cleanable segments is not working out.  Each segment appears multiple
1045   times which wastes space and adds confusion.
1046   We really want to be able to lookup by dev/seg and also find the least.
1047   'Find least' sounds like we want a heap but then we cannot discard the bottom half.
1048
1049   We could have a skiplist for dev/segment lookup and do a merge-sort on
1050   a different link when we want to find the best segment.
1051   We then remember the best number found since a sort, and re-sort if the top
1052   is worse than the best.
1053
1054   We keep all this in a fixed size table.  Each entry has
1055    seg,dev,usage,weight,weight-sort-link,addr-sort-link and possibly some
1056       addr-sort-skip links.
1057    This is 32+32+16+16+16+16 bits, or 16 bytes or bigger.
1058    Say 16bytes, 24bytes, or 32 bytes. (depth 8, which is plenty).
1059    One page of 16byte entries (256 of them)
1060    2/3 page of 24byte entries, 1/3 of 32byte entries.
1061    Total 2 pages, and 256+113+43 = 412 entries.
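
  Checking the arithmetic above with a small sketch.  The struct names,
  field names and the 4096-byte page are assumptions, and so is the reading
  that the 24- and 32-byte entries add 4 and 8 sixteen-bit skip links to
  the base entry.  The count of 43 only works out if the 32-byte entries
  take whatever is left of the shared page after the 24-byte ones, so that
  is what the sketch does.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_BYTES 4096   /* assumed page size */

struct seg_entry16 {
    uint32_t seg;
    uint32_t dev;
    uint16_t usage;
    uint16_t weight;
    uint16_t weight_link;   /* weight-sort link */
    uint16_t addr_link;     /* addr-sort link (bottom skiplist level) */
};

struct seg_entry24 { struct seg_entry16 base; uint16_t skip[4]; };
struct seg_entry32 { struct seg_entry16 base; uint16_t skip[8]; };

_Static_assert(sizeof(struct seg_entry16) == 16, "32+32+16+16+16+16 bits");
_Static_assert(sizeof(struct seg_entry24) == 24, "base + 4 skip links");
_Static_assert(sizeof(struct seg_entry32) == 32, "base + 8 skip links");

struct layout { int n16, n24, n32; };

/* One full page of 16-byte entries, one page shared by the other two. */
static struct layout compute_layout(void)
{
    struct layout l;

    l.n16 = PAGE_BYTES / (int)sizeof(struct seg_entry16);
    l.n24 = (PAGE_BYTES * 2 / 3) / (int)sizeof(struct seg_entry24);
    l.n32 = (PAGE_BYTES - l.n24 * (int)sizeof(struct seg_entry24))
            / (int)sizeof(struct seg_entry32);
    assert(l.n24 * 24 + l.n32 * 32 <= PAGE_BYTES);
    return l;
}
```

  With these assumptions compute_layout() gives 256, 113 and 43 entries,
  matching the total of 412.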
1062
1063   But deleting random elements is awkward... but not too awkward.  We can delete
1064   lots of entries by marking them as old, then performing a single pass of the skip
1065   list deleting them.
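
  The mark-then-sweep idea, sketched on the bottom level of the list only.
  struct seg_node and sweep_old are invented names; a real skiplist would
  also have to patch the higher-level links during the same pass.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct seg_node {
    unsigned seg;
    bool old;                 /* marked for deletion */
    struct seg_node *next;    /* bottom-level link only */
};

/* Unlink every node marked 'old' in one pass; returns how many went. */
static int sweep_old(struct seg_node **head)
{
    int removed = 0;
    struct seg_node **pp = head;

    assert(head != NULL);
    while (*pp != NULL) {
        if ((*pp)->old) {
            *pp = (*pp)->next;    /* splice the node out */
            removed++;
        } else {
            pp = &(*pp)->next;
        }
    }
    return removed;
}
```

  Marking is then just setting 'old' wherever convenient, and the cost of
  unlinking is paid once per sweep rather than once per entry.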
1066
1067   We should keep free segments here too, on a separate list.
1068
1069   So how about:
1070    2 pages of 16byte entries
1071    1 page of 24
1072    1 page of 32
1073
1074   free list randomly threads through all.
1075
1076   When using from 24 or 32, randomly choose height of 2-5 or 2-9
1077   Two lists run through the skiplist entries.  One for cleanable, one for free.
1078   Remember the nth element for some small n (10, but it decreases as we pull
1079   things off the front) and if we add something less than that, we trigger a
1080   mergesort on the next time we want to clean.... maybe.
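
  One possible way to pick the height, as a sketch.  The notes only give
  the ranges (2-5 for 24-byte entries, 2-9 for 32-byte ones); the
  coin-flip distribution and the xorshift generator below are assumptions,
  and choose_height is a made-up name.

```c
#include <assert.h>
#include <stdint.h>

/* Small xorshift PRNG; the state must be non-zero. */
static uint32_t xorshift32(uint32_t *state)
{
    uint32_t x = *state;

    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    *state = x;
    return x;
}

/*
 * Pick a height in [2, max_height]: start at 2 and grow by one level for
 * each "heads", so each extra level is half as likely as the one before.
 */
static int choose_height(int max_height, uint32_t *rng_state)
{
    int h = 2;

    assert(max_height >= 2 && *rng_state != 0);
    while (h < max_height && (xorshift32(rng_state) & 1))
        h++;
    return h;
}
```

  Call it with max_height 5 when taking an entry from the 24-byte page and
  9 when taking one from the 32-byte page.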
1081
1082   Remember end of free list and add to there.  Maybe merge-sort the free list
1083   by addr occasionally.
1084
1085   Questions:
1086     When can we clean, when can we free wrt checkpoints?
1087       - we can clean a segment as soon as we have a checkpoint after it.
1088         So we record the youth of the segment holding the (start of the)
1089         checkpoint, and can clean any segment with a lower youth.
1090       - we can free a segment after the checkpoint after its usage has reached
1091         zero.  So if usage is zero and youth....
1092         We could offset the usage by one (say - for the first cluster header..)
1093         then when we find a segment with usage of '1', we schedule an update to
1094         0 in the next checkpoint...
1095     How about segments with different sizes - they get different weights.
1096        Need to divide by segment size:  usage * youth / size.
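
  The two rules above in sketch form.  struct seg_info, can_clean and
  clean_weight are invented names and the field widths are guesses; it is
  also an assumption that a lower weight means a better cleaning candidate
  (that matches the 'find least' wording earlier).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct seg_info {
    uint32_t usage;   /* blocks still in use in the segment */
    uint32_t youth;   /* larger means written more recently */
    uint32_t size;    /* segment size in blocks */
};

/* A segment may be cleaned once a checkpoint has started after it. */
static bool can_clean(const struct seg_info *s, uint32_t checkpoint_youth)
{
    return s->youth < checkpoint_youth;
}

/* usage * youth / size, so segments of different sizes compare fairly. */
static uint64_t clean_weight(const struct seg_info *s)
{
    assert(s->size > 0);
    return (uint64_t)s->usage * s->youth / s->size;
}
```

  Integer division throws away the fraction, so a real version would
  probably scale the product first to keep some resolution between
  nearly-empty segments.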
1097
1098   TOFIX
1099    - It seems I sometimes fall off the end of the last segment !!! - FIXED (locking)
1100    - We seem to switch to a new segment when still 83 blocks remaining? - FIXED (delete did flush)
1101
1102    - Lots of 'creates' makes lots of little clusters - need to optimise!
1103         Or it could be deletes as we currently cluster_flush for each
1104         delete.
1105          - I think this is fixed
1106
1107 29Feb2008
1108   Started looking at the cleaner.
1109   Need to understand how much to clean each checkpoint
1110   Need to track free-space-in-active-sectors while scanning.
1111
1112 3Mar2008
1113   TOFIX
1114     - the cluster head is currently limited to one page.  This is not good.
1115
1116     - Should the cleaner start before the scan is complete after a checkpoint?
1117       Probably it can, but while the scan is still happening it might be best
1118       to be cautious ??
1119
1120   STATE:
1121     try_clean is taking shape and has a few FIXMEs.
1122     need to write async find_block code and get it to watch for
1123        block in a cleaning segment.
1124
1125 28Mar2008
1126   - where can padding appear in a cluster? between miniblocks? at
1127     end of device blocks?
1128   - need to track phys block while parsing headers for cleaning.. why?
1129   - determine rules for avoiding block lookup during cleaning
1130     based on youth/snapshot age, and truncate generation.
1131      We need to load the inode from each snapshot
1132     Can we optimise based on snapshot age?
1133     only if we know the block is newer than the snapshot.
1134     So when we relocate blocks (cleaning) they must go in a segment
1135     that is marked as being old. we cannot really guarantee that.
1136     I guess blocks that are marked as 'new' can safely be skipped if
1137      segment is newer than snapshot. This 'age' is not the youth, but
1138     is the cluster_head->seq which is stored in creation_age.
1139
1140  - Store the rootdir for a filesystem in the metadata for the root inode.
1141    Then 'struct snapshot' doesn't need rootdir.  It can have a root
1142
1143 30Jun2008
1144   Looking at lafs_find_block_async.
1145      Needs async flag to make_iblock.
1146         Check that.  Can we block_adopt if there was an error?
1147              iblock will exist.
1148      setparent has async flag.
1149      lafs_leaf_find has async flag
1150      lafs_wait_block_async
1151
1152   FIXME I wakeup the cleaner every time an IO completes.
1153   Do I really want that?  Maybe only when number of async IOs hits
1154   half the recent maximum??
1155
1156   FIXME need to ensure that lafs_pin_dblock flushes committed
1157     B_Realloc blocks.
1158
1159   FIXME when we incorporate a dirty (non-realloc) address to an index block,
1160     we need to clear B_Realloc on the indexblock.
1161
1162   FIXME in lafs_incorporate we lafs_dirty_iblock 'new' without
1163     giving it any credits.  Where should they come from?
1164
1165   We don't seem to scan for free/cleanable segments often enough.
1166
1167   FIXME we shouldn't start a checkpoint while cleaning is happening.
1168
1169   FIXME need to be careful when cleaning about finding inodes that
1170     don't exist any more.
1171
1172   FIXME give credits to realloc blocks.
1173
1174   FIXME think about/document transitions between realloc and dirty,
1175     and what locking is needed.
1176
1177 2Jul2008
1178   Allowing for the FIXMEs above, the cleaner is now identifying
1179   blocks that need to be cleaned and marking them B_Realloc (I think).
1180   We now need to gather these into a write cluster and write them.
1181   They will all be on the clean_leafs list, so we can iterate that
1182    allocating or incorporating as needed.  This will be similar to
1183    do_checkpoint.
1184   Important question is: when?
1185    Ideally we would have some auto-flush mechanism.  The cleaner just
1186    keeps finding blocks to clean and when we start running out of
1187    resources we flush the cleaning queue.
1188    However we will still want to flush the cleaner always before a
1189    checkpoint, so for now we can implement that bit and wait for a
1190    need for the other to arise.
1191
1192
1193   FIXME: cleaner lookup of 0/0/0 has interesting consequences as we
1194       don't record that location the same way.. how to handle?
1195      Should check that 'adopt' doesn't do the wrong thing with this block.
1196
1197
1198   Realloc blocks need to be pinned.  That makes sense.  Only that way
1199     will they get onto the clean_leafs list.
1200   When checkpointing we should probably examine clean_leafs to be
1201    on the safe side.
1202
1203
1204   Realloc and Dirty:
1205
1206      Both of these hold a Credit.
1207      Both can be set at the same time.
1208      Cleaner ignores Dirty and sets Realloc anytime the block is in
1209       the wrong segment.  It also Pins the block.
1210      When the cleaner is flushing to the cleaning segment, it
1211       ignores Dirty blocks.  They get their Realloc cleared, but
1212       they remain pinned.  So they will get moved at the next checkpoint.
1213      How do we know whether an indexblock should be Dirty or Realloc?
1214       The Dirty/Realloc bit is cleared before we get to incorporation.
1215       Maybe we lafs_dirty_iblock the parent of any block we write
1216        out.  Then after incorporation, we set Realloc if it is not
1217        dirty.
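
  A sketch of the cleaner side of these rules for a single block.  The flag
  values and the names struct blk, cleaner_mark and cleaner_flush_one are
  invented, and the write itself is not modelled: a block that would go to
  the cleaning segment is simply treated as written and unpinned at once.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum {
    B_DIRTY   = 1 << 0,
    B_REALLOC = 1 << 1,
    B_PINNED  = 1 << 2,
};

struct blk { unsigned flags; };

/* Cleaner found the block in a segment being cleaned: ignore Dirty. */
static void cleaner_mark(struct blk *b)
{
    assert(b != NULL);
    b->flags |= B_REALLOC | B_PINNED;
}

/*
 * Flush one block from the cleaning queue.  Returns true if the block
 * belongs in the cleaning segment, false if it is left alone.
 */
static bool cleaner_flush_one(struct blk *b)
{
    assert(b != NULL);
    if (!(b->flags & B_REALLOC))
        return false;
    b->flags &= ~B_REALLOC;
    if (b->flags & B_DIRTY)
        return false;            /* stays pinned; next checkpoint moves it */
    b->flags &= ~B_PINNED;       /* simplification: written and unpinned */
    return true;
}
```

  The open question about index blocks (Dirty or Realloc after
  incorporation) is not touched by this sketch.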
1218
1219 STATUS:
1220   I think I'm pinning cleaner blocks now.
1221   Need to make sure the dirty ones are dropped. DONE
1222   Need to make sure the usage is transferred
1223   Need to get free segments back into use
1224   Need some more 'dump' options.  Maybe youth/usage files.
1225       Maybe tree.
1226   Need to make sure scan etc are triggered often enough.
1227
1228   FIXME lafs_prealloc walks up ->parent without locking
1229     I think we want i_mapping->private_lock like lafs_pin_iblock.
1230
1231 TODO:
1232   1/ a 'dump' option that triggers a scan and prints everything out.
1233   2/ scan must mark freeable as such, then subsequently free them.
1234   3/ Look at code that decreases usage of old segments.
1235   4/ Review lafs_cluster_wait_all and decide exactly how long we need
1236      to wait.
1237   5/ Review 'FIXME that is gross' HZ/10 thing.
1238   6/ Review 'wait for checkpoint to flush' msleep(500);
1239            Maybe remove that altogether.
1240
1241   FIXME BUG_ON in grow_index_tree fires. sync - writepages - flush
1242   FIXME BUG in lafs_allocated_block fired.
1243             from lafs_erase_dblock from invalidate_page from .. vmtruncate
1244              from lafs_setattr
1245
1246   Current problem:
1247     An inode data block is dirty and pinned, but the inoidx is no longer
1248        pinned.  Presumably it isn't dirty.
1249      Recheck what 'dirty' means on the two blocks and see how this can happen.
1250
1251 10july2008
1252   Tree gets very big!  Lots of 'Realloc' blocks that should
1253    be long gone.
1254
1255   We are spinning in cleaner again, and not in try_clean.
1256
1257   Is it a problem that 'Pinned' is used for Realloc and dirty blocks?
1258   In general it shouldn't be.  The flush_cleaner process will remove
1259    the Realloc bits so the blocks fall off clean_leafs.  They then either
1260    go onto phase_leafs or get unpinned.
1261   But I currently have a problem with InoIdx/data.
1262   The Pin is transferred to the Data block, but it doesn't go from the
1263    InoIdx block because it has a pincnt.  Now that is probably a bug, but
1264    what if it weren't?  What if, while we were cleaning, a block got dirtied.
1265    That would pin the whole tree.
1266   I guess the rule about not allocating an inodedata block while the
1267    InoIdx is pinned needs to be revised.  If the inodedata block is
1268    Realloc (and not Dirty) while the InoIdx is not Realloc, we
1269    can go ahead (in a cleaning segment).
1270
1271  FIXME to check:
1272    adir/big1 is garbage.... big1 was removed, so why is it even there?
1273               FIXED.
1274    echo tre > dump  # still too much stuff.
1275
1276
1277
1278  Put cond_resched in checkpoint loops!
1279
1280
1281  Thoughts about cleaning and pinning.
1282
1283   When cleaning we need to know how many dependant blocks are being cleaned
1284   so that we know when *this* block can be written - i.e. when the count hits 0.
1285   We cannot use the pincnt for this phase because there may be dependant blocks
1286   which are dirty.  They, and therefore this, may get flushed at next checkpoint,
1287   but they may not.  If we could be certain they would, we could just write
1288   to the clean-segment blocks which can become unpinned.  However if there
1289   is an index block being cleaned, and no dependant is being cleaned, but some
1290   are dirty but not pinned, then the checkpoint can go past without the block
1291   being moved.... but maybe we can detect that.
1292
1293   Try this:
1294     We set B_Realloc precisely on blocks found in segments being cleaned.
1295     We pin these blocks and leafs which are Realloc go in clean_leafs.
1296     If a block is both Realloc and Dirty we clear Realloc but leave pinned.
1297        That way it gets written at end of checkpoint, but to main cluster.
1298     When we incorporate Realloc blocks into an index block, it gets marked
1299        Realloc.  When we incorp dirty blocks, mark dirty.  Then see above.
1300     On a checkpoint, we process both phase_leafs and clean_leafs
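
  How the index block's flags could follow from what gets incorporated,
  under the 'Try this' rules.  The names and flag values are invented, and
  passing in what the child was when it was written is an assumption, since
  that information has to be remembered from somewhere.

```c
#include <assert.h>

enum {
    F_DIRTY   = 1 << 0,
    F_REALLOC = 1 << 1,
};

/* Both set means Dirty wins: Realloc is cleared. */
static unsigned normalise_flags(unsigned flags)
{
    if ((flags & F_DIRTY) && (flags & F_REALLOC))
        flags &= ~F_REALLOC;
    return flags;
}

/*
 * New flags for an index block after incorporating one child address.
 * child_was holds the child's Dirty/Realloc state when it was written.
 */
static unsigned incorporate_flags(unsigned index_flags, unsigned child_was)
{
    assert((child_was & ~(F_DIRTY | F_REALLOC)) == 0);
    if (child_was & F_DIRTY)
        index_flags |= F_DIRTY;
    else if (child_was & F_REALLOC)
        index_flags |= F_REALLOC;
    return normalise_flags(index_flags);
}
```

  So an index block only stays Realloc while everything incorporated into
  it came from cleaning; one dirty child turns it into an ordinary dirty
  block for the main cluster.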
1301
1302
1303  FIXME do inode reads async better when cleaning...
1304
1305  FIXME if a realloc inode has been allocated to a cluster when we try
1306      to dirty it, confusion can ensue as the writeout won't mark it
1307      clean, but will use up the credits.
1308      Maybe we need something similar to phasewait to not set PinPending...
1309       But normal dirtying doesn't phasewait.   I think we just need to
1310       detect this case and wait for the clean-cluster to flush.
1311       Messy...
1312
1313  FIXME make sure incorporate is doing the right thing with credits.
1314
1315  FIXME lafs_write_inode. We need to be careful about clearing Dirty
1316            when making an update.  Need some sort of locking.
1317            Need to review all inode dirty stuff and make sure we do
1318            the right thing no matter when it is called.
1319
1320  FIXME when blocks are attached to uninc_next, they don't have 'dirty'
1321         anymore so we don't know how to flag the index block.
1322
1323 2008jul13
1324  UPTO: unlink etc don't prealloc the inode that will be modified.
1325     And a warnon inode.c:579 is very noisy.
1326
1327 2008jul22
1328  FIXME: lafs_reserve_block uses CleanSpace if Realloc is set,
1329      but it doesn't get set until AFTER lafs_reserve_block is called.
1330
1331  Here I am...
1332    Cleaning cleans an InoIdx block which schedules the data block.
1333     Subsequently the InoIdx block gets pinned again.
1334     Now when we go to write the data block, we cannot because InoIdx is pinned
1335      in same phase.
1336      Maybe given that data block is pinned, we write it anyway...
1337
1338  FIXME: when we realloc a block embedded in the inode, don't pluck it out
1339         and put it back in again.  Just realloc the inode.
1340
1341  FIXME: when cleaning a directory that has shrunk, we think we have
1342      blocks that don't exist any more. FIXED - we thought '0' was in
1343      segment '0'.
1344
1345 2008jul23
1346   FIXME: lafs_dirty_iblock called from lafs_allocated_block in cluster
1347      flush finds no credit. for InoIdx block of 8501
1348
1349   FIXME: do we do SEGREF on all the index blocks? do we need to?
1350
1351
1352 2008jul24
1353   FIXME: seg usage for segment 0/5 isn't dropping to zero.
1354     Part of a file got moved off, but count is still there.
1355     FIXED - seg_move wasn't being called.
1356   FIXME: segusage file has inconsistent extents:
1357       Extent entries:
1358        0 -> 694 for 2
1359        1 -> 1291 for 1
1360        1 -> 15 for 1
1361    FIXED several bugs in walk_extent
1362
1363   FIXME qphase:  any locking between that changing and lafs_seg_move??
1364     I don't think so.  Just that seg_apply_all must be called after qphase is set.
1365
1366   FIXME make sure we don't try to clean the current segment!!
1367
1368   FIXME 'Available' goes negative!
1369       Creating large file doesn't instantly reduce 'Used'.
1370       Deleting files plus sync doesn't increase Avail?
1371
1372   FIXME a segment is in the table but doesn't print out!
1373
1374   FIXME we don't cope with running out of free segments (not that we ever should).
1375
1376   FIXME check all Credit usage and make sure credits are returned when
1377     ->parent is dropped.
1378     provide visibility into credit counts.
1379     Make sure we are keeping enough space for cleaning.  We should always
1380      have a few segments unallocatable.
1381
1382 2008jul25
1383   FIXME cannot do io completion in cleaner thread as it can block on
1384      a i_mutex which might be waiting for completion. FIXED (keventd).
1385
1386   FIXME as ->iblock isn't refcounted we need to be careful accessing it.
1387             If we 'know' we have a reference, e.g. a child with a ->parent
1388             link, we can access it without locking.
1389        So:
1390            lafs_make_iblock should return a counted reference.
1391
1392        If we own an (indirect?) reference to iblock, we can access
1393         both iblock and dblock for free... but iblock can change???
1394        If not, we need to get a reference to one or other under a lock.
1395
1396   FIXME block->inode should be a counted reference?
1397
1398 lafs_make_iblock OK
1399   lafs_leaf_find OK
1400     lafs_inode_handle_orphan OK
1401       inode_handle_orphan_loop FIXED
1402     __lafs_find_next OK
1403     find_block FIXED
1404   __lafs_find_next OK
1405     lafs_find_next FIXED
1406       dir_lookup_blk
1407       dir_handle_orphan
1408       lafs_readdir
1409       lafs_inode_handle_orphan
1410       choose_free_inum
1411   find_block - FIXED
1412
1413  FIXME root->iblock should always be refcounted.  Is it?
1414  FIXME walking siblings - what lock?
1415
1416 2008jul28
1417  FIXME several times we clear PinPending without refiling, in dir.c in particular.
1418     that looks wrong. FIXED
1419
1420   Maybe  lafs_new_inode should return a reference to the dblock
1421     Or pin it. or something. FIXED  And pinned (when needed).
1422
1423  FIXME lafs_inode_dblock might return a block without valid data...
1424    Need to get valid data, then load block 0 in find_block rather than
1425        load_block.  FIXED
1426
1427  FIXME we really should own a reference to ->dblock before calling
1428     lafs_pin_inode.  We don't want IO during a pin request.
1429     FIXED
1430
1431  FIXME review use of PhysValid FIXED
1432
1433  lafs_orphan_abort - what if lafs_orphan_pin not called?
1434    or if 'b' is NULL.  FIXED
1435
1436  Do I need to clear PinPending when retrying??
1437    Well, we need to be phase-locked when we set PinPending, so
1438     it must be Pinned to the current phase.
1439     So when we unpin a datablock, we must clear PinPending.
1440   FIXED we now clear PinPending in do_checkpoint.
1441
1442  Does phase_wait do the right thing when pinning an inoidx block
1443    for an inode? FIXED
1444
1445
1446 Pending
1447   Need to understand and document the lifetime of a page with datablocks.
1448     who hold what refcount, and when can it be freed?
1449    Then fix up locking in lafs_refile, __putref.
1450
1451  FIXME who keeps what refcount on orphan blocks/inodes??
1452  FIXME should dirty/pinned/etc hold a refcount?  they don't.
1453
1454
1455 Later:
1456  FIXME make sure a failed (-EAGAIN) pinning triggers a checkpoint (eventually)
1457
1458  FIXME make sure empty files have depth of 1.
1459
1460  FIXME Truncate proceeds lazily. All data blocks need to be gone
1461
1462 26aug2008
1463  If I call lafs_erase_dblock while a write is underway, we have a problem.
1464   We need to wait potentially for a checkpoint to let go of the block and
1465    a write to complete.
1466     This should be done with waiting for PG_writeback on the page to disappear.
1467   Check this out.
1468
1469   When end_page_writeback is called, we must have dropped all references to the
1470    page.
1471   When we commit to writing a block, we have to set PG_writeback on the page
1472    so that truncate et al can wait for it.  Before we have committed, truncate
1473    can just remove the page.  Internally we differentiate by B_Alloc.
1474   So before setting B_Allocated we need to test_set_page_writeback(page).
1475   Be careful of races.
1476   I don't think we can ensure all references are dropped.  After all, that is
1477   the point of refcounts.  So dblock array must exist without page!
1478   But we need to ensure that we don't start a writeout after truncate
1479   has done wait_on_page_writeback.
1480   This is done with the page locked so when we want to write a page
1481     in a checkpoint, we need to lock the page first.  Once we have the lock,
1482     we check if the page is still dirty.  If it has been truncated it
1483     will be clean.
1484    But how do we safely reference the page if b->page can be cleared?
1485     How about:
1486       When we clear PagePrivate, we take a counted reference to the page
1487       for db->page.  This is dropped when the page is freed by lafs_refile.
1488       But while it is held, it is still safe for db->page to be dereferenced.
1489     So before we commence writeout we have to lock the page and set
1490      PG_writeback.  After locking, we need to test if writeback is still
1491      appropriate.
1492
1493   Maybe not.  I think we can submit blocks for writeout without setting the
1494   page to writeback.  If we do, then we need to be sure those writes
1495   finish before invalidatepage calls releasepage (block_invalidatepage
1496   calls discard_buffer which calls lock_buffer which waits).
1497   In our case invalidatepage needs to make sure that no new write commences.
1498   Maybe we should lafs_iolock_block before we allocate to a cluster and check
1499   again if the block is dirty.
1500
1501   So:
1502     lafs_cluster_allocate does:
1503        lafs_iolock_block
1504        check if still dirty.  If not, unlock and return
1505        set allocate flag
1506        allocate and write
1507        when write completes, allocate is cleared.
1508                     unlock block
1509
1510     invalidatepage does
1511        lafs_iolock_block
1512        clear Valid,Dirty,Realloc
1513        lafs_iounlock_block
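
   A minimal user-space sketch of the two sequences above, not the
   module's code.  The flag values, the struct layout and the bool that
   stands in for the iolock are all assumptions made for illustration,
   as is clearing Dirty at write completion (the notes only say that
   allocate is cleared there).

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag bits: names echo the notes, values are made up. */
enum { B_VALID = 1, B_DIRTY = 2, B_REALLOC = 4, B_ALLOC = 8 };

struct block {
	unsigned flags;
	bool iolocked;	/* stands in for the real sleeping iolock */
};

static void iolock_block(struct block *b)
{
	/* Single-threaded model: real code would sleep here instead. */
	assert(!b->iolocked);
	b->iolocked = true;
}

static void iounlock_block(struct block *b)
{
	assert(b->iolocked);
	b->iolocked = false;
}

/* Returns true if a write was started; the block then stays iolocked. */
static bool cluster_allocate(struct block *b)
{
	iolock_block(b);
	if (!(b->flags & B_DIRTY)) {
		/* Invalidated while it waited: nothing to write. */
		iounlock_block(b);
		return false;
	}
	b->flags |= B_ALLOC;
	/* ... allocate space in the cluster and submit the write ... */
	return true;
}

/* Called when the write completes (Dirty handling is an assumption). */
static void write_done(struct block *b)
{
	b->flags &= ~(B_ALLOC | B_DIRTY);
	iounlock_block(b);
}

static void invalidate_block(struct block *b)
{
	iolock_block(b);	/* would wait for any write in flight */
	b->flags &= ~(B_VALID | B_DIRTY | B_REALLOC);
	iounlock_block(b);
}
```

   The point of the ordering: a block that invalidate_block() has
   already cleaned is rejected by cluster_allocate(), so no new write
   can start after the invalidate has returned.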
1514
1515
1516
1517 2008 aug 28 - happy birthday.
1518 FIXME segsum_find calls lafs_reserve_block without a checkpoint lock.
1519 lafs_prealloc complains.
1520
1521   mark_cleaning does too, but cleaning only happens well away from a checkpoint
1522   lock.
1523 segsum_find is being called to reference a new segment when we flush a cluster.
1524  segment usage blocks are special.  Their index information doesn't
1525 need to be written out in the current checkpoint.  We can do that, but
1526 the backstop is to write just the data block in the tail of the
1527 checkpoint and write indexing information later.
1528
1529 2008sep10
1530  unlink is getting "No space left on device".  This is when trying to
1531  pin the directory block, the physaddr is 0, so it looks like we want
1532  NewSpace.  But we shouldn't even be trying to prealloc in that case because
1533  there should already be a prealloc on the block.  i.e. there should be
1534  credits.
1535  Hmmm. after multiple 'syncs' how can the block not be written out.
1536  Maybe it is embedded in the inode?
1537  When we pin a block that was embedded in the inode it isn't clear what to
1538  do.  If we might grow the file so it doesn't fit any more, we need to
1539  allocate NewSpace.  If we know it won't grow, we use Release.
1540   This still needs a proper fix.
1541
1542  Cleaning seems to be working nicely.  However we don't get all the space
1543  back that we should because lots of blocks still have credits that
1544  aren't being returned.
1545
1546  So when should credits be returned?
1547  They are set when a block is pinned.  It then gets dirtied which
1548  consumes a credit.  Then gets unpinned.  I guess if it isn't pinned,
1549  then it doesn't need any credits.
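
  A toy model of that credit lifecycle, assuming one credit per block
  and a single shared pool.  The struct and function names are
  hypothetical; the real accounting has several credit types.

```c
#include <assert.h>
#include <stdbool.h>

struct pool { int free_credits; };

struct block {
	bool pinned, dirty;
	int credits;	/* reserved but not yet consumed */
};

/* Pinning reserves one credit; fails (think -EAGAIN) if none are left. */
static bool pin_block(struct pool *p, struct block *b)
{
	if (!b->pinned) {
		if (b->credits == 0) {
			if (p->free_credits == 0)
				return false;
			p->free_credits--;
			b->credits++;
		}
		b->pinned = true;
	}
	return true;
}

/* Dirtying consumes the reserved credit: the space is now committed. */
static void dirty_block(struct block *b)
{
	assert(b->pinned);
	if (!b->dirty) {
		assert(b->credits > 0);
		b->credits--;
		b->dirty = true;
	}
}

/* Unpinning hands any unused credit back to the pool. */
static void unpin_block(struct pool *p, struct block *b)
{
	p->free_credits += b->credits;
	b->credits = 0;
	b->pinned = false;
}
```

  In this model a block that was pinned but never dirtied returns its
  credit on unpin, which is the behaviour the question above argues for.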
1550
1551
1552  It seems that cluster_flush is not always writing things in the correct
1553   order.  Root gets written before some other things below it.
1554    Maybe they are temporarily out of the loop??
1555  No.  There are dirty blocks which one checkpoint doesn't pick up, but
1556   they aren't holding the index block pinned. so they lose allocation.
1557
1558  But they must hold the indexblock pinned, even though they aren't pinned
1559  themselves.  We maybe do this just with the refcnt... maybe.  That will cause
1560  it to phase-flip rather than drop pinning, which I think is right.
1561
1562  So: too many credits remain allocated.  Where are they?  There are 1464
1563    outstanding credits. 290 are in the tree so 1200 or so are elsewhere??
1564    But things removed from the tree have credits removed.
1565
1566
1567
1568 FIXME roll forward ignores inodes.  But what about an inode that contains
1569    data.  Should that be ignored?  I think not.
1570 FIXME delete adir/big2 then delete adir and it cannot release:
1571   Cannot release [cee29000]74/0(0)r1:Pinned,Phase0,Valid,Dirty,SegRef,UninCredit,PhysValid,Prealloc
1572  presumably there is orphan processing or something to complete???
1573 FIXME when files are deleted, the space isn't returned!
1574    This seems to be mostly fixed - need to test.
1575 FIXME when I "rm [b-z]*" it waits for writeback on something???
1576    zfile again!!!  OK, I think that is fixed.
1577
1578
1579 12sep2008
1580   Current problem:
1581     seg_apply_all dirties dblocks.  When should they be reserved?
1582     They originally get reserved by a lafs_reserve_block call in
1583     segsum_find called from e.g. lafs_seg_ref which is called by lafs_reserve_block.
1584     However: that block might get written before *and* after a checkpoint.
1585     So we need N* Credits.  These are usually only used for Index blocks.
1586     We can set these easily enough if inode type is TypeSegmentMap.
1587     We move them across to Credit in seg_apply_all.
1588     But when do we clear them if they aren't needed?  I guess
1589      when we drop the last segref.  Yes, we already do that.
1590     FIXME need to make sure these get flushed on next checkpoint
1591      if we cannot allocate new credits after a checkpoint.
1592
1593   New Problem.  The 'cleanable' table reports a size of 3, but it is empty!
1594     Think that is fixed.
1595
1596   Some problems.
1597     1/ see above:  rm x/y; rmdir x -> BUG - FIXED
1598     2/ Spins on 'CURRENT=1' ??
1599     3/ if alloc_space gives EAGAIN while deleting, we don't survive.
1600     4/ When I create/delete a file, ablocks_used increments by one.
1601         The inode hasn't been allocated yet, so it seems the deallocation
1602          isn't adjusting ablocks_used??
1603     5/ open_namei (for dd) got caught on a mutex_lock.
1604     6/ When a large file is shrunk we don't reduce the level of the InoIdx block
1605        I'm not sure where we should and am not thinking very clearly.
1606        Will fudge something in flush_data_to_inode for now, but it MUST be fixed.
1607     7/ unlink (at least) can get stuck in iolock_block.  Who could be holding
1608        the lock?  Writeout that hasn't completed?
1609        Yes.  writepage calls lafs_allocated_block without calling flush.
1610        So the block could be sitting waiting for a flush.  How long do we
1611        wait??
1612     8/ It seems that some datablocks can need NCredits.  Make sure these
1613        are handled properly re flush-or-refill after checkpoint and
1614        flip_phase rather than unpin.
1615     9/ Maybe after lafs_writepage cluster_flush isn't getting called soon
1616        enough, and we lock up (see 7).  Need to flush the first block
1617        straight away, and the next one as soon as the first finishes, etc.
1618        Or something like that.  Then remove the comment from lafs_writepage.
1619
1620 8th December 2008
1621
1622   I seem to be getting only 4 blocks to a cluster at the moment.
1623    This is good as it motivates the code to handle block splitting in
1624    the Btree.   But it shouldn't happen.
1625
1626   ....
1627   Block splitting might work - it doesn't crash at least.
1628   But
1629   After deleting all files, the tree is full of stuff.
1630   Lots of inode data/InoIdx blocks.
1631   Many but not all are Pinned.  The others are OnFree
1632   The Pinned ones have outstanding references.
1633   Others
1634
1635   ....
1636   Problem with the block splitting, when adding an index block.
1637   The index block is initially empty - we need to find things by looking
1638   at children.  But we don't.  We BUG_ON the iphys==0.
1639   In general, when we add a block below an index block and before we incorporate,
1640   the block must be found by finding the first indexed block and looking to
1641   see if there is a 'next' block that contains the address we need.
1642   FIXED
1643
1644   But if we truncate a file while an index block is pinned and dirty,
1645   we spin on trying to incorporate it, which should make it empty.
1646
1647 11th December 2008
1648   deadlock.
1649   sync is trying to get lock in lafs_cluster_flush
1650   pdflush holds the lock and is stuck in cluster_flush+0xa40
1651     some wait_event I expect.
1652     Maybe we need an unplug ??
1653
1654  - checkpoint/seg_apply_all/dirty_dblock doesn't have the credits.
1655    This is in clean_free.  We try to update the 'youth' to mark
1656    the segment as free, and we don't have a reservation to do it.
1657    Maybe just reserve it there and then.
1658
1659
1660 12th December 2008
1661   When doing a lookup in an index block, we need to check the unincorp
1662   address list.  It isn't enough to look for unincorp blocks as they
1663   might have disappeared.
1664   For INDIRECT and EXTENT this is easy enough as full information is in
1665   'uninc'.
1666   For INDEX it is a little tricky as we need to look at the full set of
1667    addresses to know where a particular address fits.
1668    We could force an incorporate first, but that has awkward implications
1669     if it requires a split.
1670    Maybe if we get from the lookup "start+range"....
1671      That is not enough as the 'start' might get zeroed by an update.
1672
1673
1674    rm adir/* doesn't work as readdir doesn't get all the entries
1675     for some reason.
1676    Reason is that they are being put in the wrong block.
1677    lafs_find_next doesn't correctly find the 'next' block if it
1678    hasn't been incorporated yet.
1679    Block can be:
1680      in index tree -- easy to find
1681      in uninc_table -- not too hard
1682      only in the ->children list, or attached to a page.
1683    It would be nice to use find_get_pages but that isn't exported so try
1684     something else for now.
1685    For index blocks
1686         Look in index block for 'next
1687
1688 15th December 2008
1689    FIXME when we split an index block, we need to hold a reference to
1690    the original so it doesn't disappear until the split-off copy is
1691    written.  This is because we search from an index block to find
1692    split-off copies.
1693    [ note from Feb09.  This should be OK now. Both will need
1694    incorporation, and we now hold on to blocks until they are
1695    incorporated.]
1696
1697
1698
1699 23rd February 2009
1700   - index block.  What changes are allowed exactly.
1701      - splitting certainly makes sense.
1702      - merging two adjacent blocks is fine, of which a special case
1703        is finding that a block is empty and so removing it.
1704      - What about a 2->3 split which would require removing a block
1705         and adding another at the same time?
1706        or noticing that the first blocks addressed are all missing, so
1707        moving the index forward?
1708        In each case, searching down by indexes will find a block that
1709        has been replaced by a later address.  We could manage that as
1710        long as the new block is attached after the replaced block.
1711        So we cannot move a block.  We must delete and replace.
1712
1713   - unincorporated index blocks..
1714     unincorporated data blocks are not pinned in memory.  Once they have
1715     been written out, they can be freed.  Their address is stored in the
1716     uninc-table.  This means we can delay incorporation while many
1717     extents are written out and freed.  When we come to incorporate, we
1718     may have many hundreds of addresses in a few extents that can be incorporated
1719     efficiently without holding all that data pinned in memory.
1720     The same scale doesn't apply to index blocks.  An index block can
1721     reference only 102 blocks (for 1K block size).  And the uninc table can
1722     hold far fewer so we will naturally incorporate more often.
1723     So keeping index/indirect/extent blocks pinned until they are incorporated
1724     is reasonable.  And it makes lookup a lot easier, as we have
1725     guarantees about ordering of blocks in the children list that we
1726     don't have in the uninc table.
1727
1728     Incorporation could have some atomicity issues.  There is no
1729     concern about bad stuff appearing on disk as the phase-change
1730     process handles that.  In memory it might be awkward if we split
1731     an index block before incorporating a block that would span them.
1732     That could conceivably happen if we only incorporate 8 blocks
1733     (size of uninc table) at a time.
1734     So maybe we should incorporate a full uninc list (not table) at
1735     a time.
1736     This means quite different code paths for incorporating leaf
1737     and internal index blocks....
1738
1739
1740   - uninc_table lists are a real problem.
1741     They can only be created during roll-forward so they hardly ever
1742     happen.
1743     But if the block is split while processing earlier things on the
1744     list, then splitting an uninc table would be very messy.
1745     Is there any way around this?
1746     Why not just do incorporation during roll-forward?
1747     We only need to incorporate leafs, not internal blocks because we
1748     don't use uninc_table for internal blocks any more.
1749     So during roll forward, all index blocks that are touched need to
1750     be held in cache...
1751     I think we live with that.  If it ever becomes a problem, we will
1752     need to perform the roll-forward twice.  The first time collects
1753     the usage information so that we know where we can start writing,
1754     then the second just applies all the changes to the rest of the
1755     filesystem.
1756
1757
1758    So:
1759      uninc table only used for leaves, and has no linked list
1760      unincorporated index blocks are stored on a list, which we
1761      sort before applying.
1762      All uninc index blocks are therefore kept in the index tree.
1763      Their order on the children list allows us to find the correct
1764      index. Each block for which the fileaddr is in the parent is
1765      followed by any blocks that have been split off and end after
1766      this one starts.  Blocks that have been emptied are Hole and are
1767      skipped over when looking for a block.
1768
1769      When we split an internal block, the remaining uninc blocks
1770      must not start with a Hole.
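
     One way the children-list search described above could look, as a
     sketch only.  The struct, the in_parent flag and find_child() are
     hypothetical; it assumes the list is held as an array in
     ->children order and that blocks recorded in the parent appear in
     increasing fileaddr order.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct child {
	unsigned fileaddr;	/* first file address the block covers */
	bool in_parent;		/* fileaddr is recorded in the parent index */
	bool hole;		/* emptied block, skipped by lookups */
};

/*
 * kids[] is in ->children order: each in_parent block is followed by
 * the blocks split off from it.  Returns the index of the block to
 * descend into for 'addr', or -1 if there is none.
 */
static int find_child(const struct child *kids, size_t n, unsigned addr)
{
	int start = -1, best = -1;
	size_t i;

	/* Step 1: what a search of the parent's index would find. */
	for (i = 0; i < n; i++)
		if (kids[i].in_parent && kids[i].fileaddr <= addr)
			start = (int)i;
	if (start < 0)
		return -1;

	/* Step 2: walk that block and its split-off followers. */
	for (i = (size_t)start; i < n; i++) {
		if (i > (size_t)start && kids[i].in_parent)
			break;		/* next indexed block: stop */
		if (kids[i].fileaddr > addr)
			break;		/* starts beyond the target */
		if (!kids[i].hole)
			best = (int)i;
	}
	return best;
}
```

     A Hole never becomes the answer, but it does not end the walk, so
     a split-off block that follows a Hole can still be found.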
1771
1772    FIXME: what locking do I need around lafs_incorporate?
1773       i_mutex?? i_alloc_sem??
1774       i_alloc_sem is imposed by truncate (inode_setattr) and
1775          direct_io possibly.  So it is really about adding/removing
1776          blocks.  Not updating internals.
1777          Maybe our own mutex.  Could even be per-index-block !!
1778       Whatever it is, we need to protect walking ->children too.
1779
1780
1781 24th February 2009
1782   "rm -r" problem from 12/dec/2008 fixed now.
1783   incorporate code got a make-over and is probably much better.
1784
1785   New problems:  After test runs, cannot create files due to no space
1786      on devices!!  But directory tree is empty.
1787   I can see:
1788
1789     free_blocks=3256 allocated=1425 max_seg=512 clean_reserved=0
1790
1791   The problem is that we think 1425 has been allocated to data that
1792   might still need to be written, leaving not enough room for more.
1793   Index Dump shows
1794   ====================414 credits ==============================
1795   which doesn't explain everything, but does explain a lot.  There
1796   really should be nothing in the Index tree (except fs-root and
1797   tree-root)
1798   There is also:
1799   Some inodes which are OnFree and hold no credits.
1800     0 DATA (1)  52 [0]ESegRef,Claimed,PhysValid
1801     52    1 (0)   0 [2564]{0,00000000}L on free Index(1),InoIdx,OnFree,PhysValid
1802
1803   Some other inodes which are pinned with lots of credits and are
1804     on the phase_leaf list
1805     0 DATA (1) 299 [0]ESegRef,C,CI,Claimed,PhysValid
1806    299    1 (1)   0 [0]{0,00000000} [0, 0]L Leaf1(40) Index(1),Pinned,Phase1,InoIdx,Valid,SegRef,C,CI,CN,CNI,PhysValid,Prealloc
1807
1808   And that is about it.  some are not Valid, some are...
1809   checkpoint just wants to 'flip' them.
1810   They mostly have a refcnt of 1... I wonder who is holding that....
1811   The reference on the dblock is held by the iblock.
1812   But why is the iblock remaining?  Who holds that reference?
1813
1814   I restored some code to clean iblock, and now:
1815   free_blocks=3229 allocated=1277 max_seg=512 clean_reserved=0
1816   ====================244 credits ==============================
1817   which saved 170 credits.  That helps.
1818   There seem to be many fewer of the many-credits blocks
1819   Lots of index blocks in tree are 'OnFree' and have a
1820   0 refcnt, but haven't been removed.  Why?
1821   It seems that they have ->parent == NULL, so lafs_refile never
1822   bothers to remove them.  I guess it should...
1823   OK, lots of InoIdx block have gone now with their DATA blocks.
1824
1825   So, remaining blocks are pinned to their phase with lots of Credits,
1826     have no pincnt, mostly have physaddr==0.
1827    It is just the stray refcnt that keeps them there..
1828    inums are 40, 56, 62-73, 275-278, 280
1829     40 is f22
1830     56 is first adir
1831     63-69 are directories 2/3/4/5/6/7/8/9
1832     70-73 are looooong symlinks
1833     275 is cfile
1834     276 is dfile - same as cfile but truncated.
1835       Then some nbfile-X that were big enough.
1836
1837    So: what do they have in common:
1838      Several only use the in-inode data block, but
1839        probably not all
1840
1841     Can it be that it is refcounted on the Leaf list, and so
1842     cannot get off??  Yes, I think so!
1843     We only unpin things that have a zero refcount.
1844
1845     So: what to do?
1846       checkpoint takes it off the list, then flips the phase and puts it
1847       on the other list with refile.  During that time it has a refcount,
1848       so it doesn't lose the pinning.
1849       Do we want to:
1850         1/ Not have it on the list despite being pinned.
1851         2/ Drop the PIN despite the refcnt.
1852         3/ have refile do the phase_flip so it has a chance to
1853            notice the refcount has hit zero.
1854
1855       2 isn't really an option.  We need PIN to persist whenever we have
1856        a reference.  We could possibly use PinPending for index blocks too,
1857        but that would require a lot of thinking.
1858       1 requires another criterion for being on the list.  I suspect that would
1859        get messy fast.
1860       3 we used to do I think... But refile is in a big lock, and we
1861         cannot really do a phase_flip under that.. and phase flip calls
1862          refile anyway so we would get recursion.
1863       So:4 - get lafs_phase_flip to notice and de-pin rather than flip.
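
      Option 4 could be sketched like this.  It is a user-space model
      with hypothetical fields, assuming the checkpoint has already
      taken the block off the old leaf list and so inherited the list's
      reference.  The "not dirty" test is an assumption about what else
      must hold before the pin may be dropped.

```c
#include <assert.h>
#include <stdbool.h>

struct iblock {
	int refcnt;	/* includes the reference the leaf list held */
	int phase;	/* 0 or 1 */
	bool pinned, dirty, on_leaf_list;
};

/*
 * Called by the checkpoint after it has taken the block off the old
 * phase's leaf list.  Returns true if the block was flipped to the
 * other phase, false if it was de-pinned instead.
 */
static bool phase_flip(struct iblock *ib)
{
	assert(ib->pinned && !ib->on_leaf_list && ib->refcnt >= 1);

	if (ib->refcnt == 1 && !ib->dirty) {
		/* Only the list cared about it: drop the pin, not flip. */
		ib->pinned = false;
		ib->refcnt--;
		return false;
	}
	/* Still in use: move to the other phase's list, keeping the ref. */
	ib->phase = !ib->phase;
	ib->on_leaf_list = true;
	return true;
}
```

      This avoids doing the flip inside refile, so there is no
      recursion and no phase change under the big lock.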
1864
1865       FIXME use kzalloc where appropriate.
1866
1867       FIXME Maybe test refcnt-!listempty in refile and de-pin if that is zero.
1868
1869 25th February 2009
1870   Good progress.
1871   Only 54 credits in Index Tree now.
1872   Inodes 1 2 8 16 are present. (fsroot, dirroot, inodemap, segusage)
1873   plus '74', which seems to be scheduled for deletion - root has uninc_table.
1874    ... and 'sync' got rid of that and left 44 credits.
1875   Also have data blocks for inode 50 55 72 73 74 with 2 credits of 74.
1876     50  link
1877     55  zfile
1878     72  long84
1879     73  long85
1880     74  adir
1881   These seem to be the files that used data-in-the-inode
1882   They still have a refcnt of 1 (or 2 for adir).
1883   ... OK, that's gone now.  I found a refcount leak.
1884
1885   So now:  42 Credits in Index Dump.   No stray files.
1886
1887   df: tot=4608 free=4597 avail=3045(4130-1085) cb=8 pb=0 ab=3
1888   So we still seem to have 1085 blocks allocated.  42 are accounted
1889   for, so 1043 still missing... either we lost the count, or lost the tree.
1890
1891   create a tiny file, remove, and sync, now
1892   df: tot=4608 free=4597 avail=3018(4118-1100) cb=8 pb=0 ab=3
1893
1894   so I lost 15, but now 48 are in tree.  Let's try again...
1895   df: tot=4608 free=4597 avail=3006(4108-1102) cb=8 pb=0 ab=3
1896   and 44 in tree
1897   and again:
1898   df: tot=4608 free=4597 avail=2986(4098-1112) cb=8 pb=0 ab=3
1899
1900   Definitely losing more than the difference in the tree.
1901
1902   Try creating empty files...
1903 df: tot=4608 free=4597 avail=2986(4098-1112) cb=8 pb=0 ab=3
1904 df: tot=4608 free=4597 avail=2974(4088-1114) cb=8 pb=0 ab=3
1905 df: tot=4608 free=4597 avail=2954(4078-1124) cb=8 pb=0 ab=3
1906 df: tot=4608 free=4597 avail=2942(4068-1126) cb=8 pb=0 ab=3
1907 df: tot=4608 free=4597 avail=2922(4058-1136) cb=8 pb=0 ab=3
1908 df: tot=4608 free=4597 avail=2910(4048-1138) cb=8 pb=0 ab=3
1909 df: tot=4608 free=4597 avail=2890(4038-1148) cb=8 pb=0 ab=3
1910
1911  very strong pattern there.
1912  What about 2 files at a time.
1913 df: tot=4608 free=4597 avail=2879(4028-1149) cb=8 pb=0 ab=3
1914 df: tot=4608 free=4597 avail=2860(4018-1158) cb=8 pb=0 ab=3
1915 df: tot=4608 free=4597 avail=2849(4008-1159) cb=8 pb=0 ab=3
1916 df: tot=4608 free=4597 avail=2830(3998-1168) cb=8 pb=0 ab=3
1917 df: tot=4608 free=4597 avail=2819(3988-1169) cb=8 pb=0 ab=3
1918
1919   Slightly different pattern - not as bad.
1920   Have to try 4 now.
1921 df: tot=4608 free=4597 avail=2802(3978-1176) cb=8 pb=0 ab=3
1922 df: tot=4608 free=4597 avail=2793(3968-1175) cb=8 pb=0 ab=3
1923 df: tot=4608 free=4597 avail=2776(3958-1182) cb=8 pb=0 ab=3
1924 df: tot=4608 free=4597 avail=2767(3948-1181) cb=8 pb=0 ab=3
1925
1926   Strange, isn't it....
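
  To make the pattern explicit, here are the step-to-step changes in
  the "allocated" figure (the second number inside the parentheses of
  each df line).  The helper and array names are made up for this
  note; the values are copied from the three runs above.

```c
#include <assert.h>
#include <stddef.h>

/* out[i] = v[i+1] - v[i]; out must have room for n-1 entries. */
static void deltas(const int *v, size_t n, int *out)
{
	size_t i;

	for (i = 0; i + 1 < n; i++)
		out[i] = v[i + 1] - v[i];
}

/* "allocated" after each step, for 1, 2 and 4 files at a time. */
static const int one_file[]   = { 1112, 1114, 1124, 1126, 1136, 1138, 1148 };
static const int two_files[]  = { 1149, 1158, 1159, 1168, 1169 };
static const int four_files[] = { 1176, 1175, 1182, 1181 };
```

  The differences come out as +2,+10 repeating for one file, +9,+1
  repeating for two, and -1,+7,-1 for four, while the first number in
  the parentheses drops by exactly 10 at every step.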
1927
1928   Making sure we clear UnincCredit... result looks worse.
1929
1930 26th February 2009
1931   I fixed up the credit accounting in 'incorporate' and then fixed a couple
1932   more little bugs.  And now:
1933
1934
1935
1936 ====================48 credits ==============================
1937 df: tot=4608 free=4597 avail=3172(3940-768) cb=10 pb=0 ab=1
1938
1939 So we still have 720 allocated credits that aren't accounted for.
1940 But we are nicely under 100...
1941
1942 .... and now
1943
1944
1945 ====================76 credits ==============================
1946 df: tot=4608 free=4256 avail=2160(2402-242) cb=350 pb=0 ab=2
1947
1948 That is different.  The count of missing blocks is way down,
1949 but there is some extra cruft in the index tree.
1950 Quite a few like
1951     0 DATA (1) 303 [0]L Leaf1(13) SegRef,Claimed,PhysValid
1952     0 DATA (1) 302 [0]L Leaf1(14) SegRef,Claimed,PhysValid
1953 and even one
1954     0 DATA (2) 330 [0]L Leaf1(1) SegRef,C,CI,Claimed,PhysValid
1955    330    1 (1)   0 [0]{0,00000000} [0, 0]L Leaf1(0) Index(1),Pinned,Phase1,InoIdx,SegRef,C,CI,CN,CNI,PhysValid,Prealloc
1956 Time for a commit though....
1957
1958 and now
1959 ====================46 credits ==============================
1960 df: tot=4608 free=4257 avail=4253(4458-205) cb=350 pb=0 ab=1
1961
1962 So the strays in the index tree are gone, but we still have 159 outstanding
1963 credits.
1964 No change, but now
1965 ====================36 credits ==============================
1966 df: tot=4608 free=4256 avail=3787(3885-98) cb=350 pb=0 ab=2
1967
1968
1969 That is a little weird...
1970 Hmmm. back to
1971 ====================48 credits ==============================
1972 df: tot=4608 free=4257 avail=4247(4458-211) cb=350 pb=0 ab=1
1973
1974 Oh well.
1975 ====================34 credits ==============================
1976 df: tot=4608 free=4257 avail=3176(3373-197) cb=350 pb=0 ab=1
1977
1978 It seems that the unaccounted blocks are (or can be) created by
1979 writing to a file then removing the file without a sync.
1980 ..but why is cb (cblocks_used) so high?
1981
1982 27th February 2009
1983
1984  Got onto a bit of a tangent...
1985  What happens if we truncate a block while it is on a list to
1986  be cleaned?  Clearly we want the cleaner to drop it ASAP.
1987  But what if invalidate_page wants to drop it *now*
1988  Hopefully it is either still on clean_leafs and we can remove it,
1989  or it is now iolocked and we can wait for it.  So should be OK.
1990
1991  I keep getting caught in "looping on..."
1992  We are truncating an inode and some index block which is now empty
1993  is not getting removed from the tree because there is an outstanding
1994  reference.... 327/0 depth=1.  I guess I turn on the tracing.
1995
1996  ... and it seems that it is in the process of checkpointing.
1997  I guess I need to lock against that ... maybe with the iolock.
1998
1999 Credits = -1, rv=2
2000 ib = [ce814e40]328/0(2552)r3:Index(1),Pinned,Phase1,Valid,Dirty,CI,CN,CNI,UninCredit,PhysValid,Uninc{0,0}[0]
2001 ------------[ cut here ]------------
2002 kernel BUG at /home/neilb/work/nfsbrick/fs/module/segments.c:371!
2003
2004  -------
2005  Every time I create/delete a file, I get an extra 'ab' which disappears
2006  on 'sync'.
2007    ablocks_used is:
2008      decremented when +ve summary_update on non-index
2009      increased on lafs_summary_allocate... should not be done for index blocks.
2010
2011  OK:  after test run, filesystem is empty, but cblocks_used is around 360.
2012   cblocks_used:
2013         is loaded at mount time
2014         collects pblocks_used on a phase flip
2015         is updated in lafs_summary_update (unless pblocks is)
2016    So we must be missing a lafs_summary_update when phys->0
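
   A sketch of how I read the cblocks_used/pblocks_used rules above.
   The struct, the next_phase argument and the function names are
   assumptions made for illustration, not lafs_summary_update's real
   signature.

```c
#include <assert.h>
#include <stdbool.h>

struct usage {
	long cblocks_used;	/* blocks counted in the committed phase */
	long pblocks_used;	/* blocks counted in the phase being built */
};

/*
 * Account for a block moving from address old_phys to new_phys
 * (0 means "no address").  next_phase selects which counter changes.
 */
static void summary_update(struct usage *u, unsigned long old_phys,
			   unsigned long new_phys, bool next_phase)
{
	long delta = (new_phys != 0) - (old_phys != 0);

	if (next_phase)
		u->pblocks_used += delta;
	else
		u->cblocks_used += delta;
}

/* On a phase flip the pending count is folded into the committed one. */
static void usage_phase_flip(struct usage *u)
{
	u->cblocks_used += u->pblocks_used;
	u->pblocks_used = 0;
}
```

   In this model the count only returns to zero if every block that
   gained an address also gets an update when its address goes back to
   0, which is the call that appears to be missing.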
2017
2018
2019  Lots of problems:
2020    truncating big (multi-level index) seems to be bad
2021      Leaves 'pb-338 !!! and cb+689, even after sync.
2022    still 'looping on' occasionally
2023    Haven't found cblocks_used leak yet.
2024    Occasionally non-B_Valid blocks are acted on.
2025      I think I need to improve io locking.
2026
2027 ---------------
2028 1st March 2009
2029   Need some improvements to iolock locking.
2030   We use this lock to wait for a block to be written out (if that is happening)
2031    before we allow lafs_invalidate_page to complete.
2032    It is also used in lafs_erase_{d,i}block (Similar purpose)
2033   We take the lock in lafs_cluster_allocate, and then make sure the block is
2034    still dirty.
2035
2036   Also lock in lafs_new_inode as initing the inode is a form of IO ??
2037   load_block takes the lock
2038   We only clear_bit(B_Valid, ) under this lock.
2039
2040   So the issue is this:
2041     A block that is going to be written is passed to lafs_cluster_allocate.
2042     This happens either after taking it off a _leafs list, or when
2043     lafs_writepage requests the write.
2044
2045     lafs_invalidate_page needs to be able to release the page, so there needs to
2046     be no transient references.  In particular, once the block has been
2047     removed from a _leafs list it must already be iolocked.
2048     Invalidate_page can then either remove from that list and erase the block,
2049     or use io_lock_block to wait for the IO to complete.
2050   So when a datablock comes out of get_flushable it must be iolocked, and must
2051   remain iolocked until after Dirty and Alloc are clear
2052   Index blocks belong entirely to the fs, so we can be more relaxed with them.
2053   If get_flushable finds the block already iolocked, it is either being invalidated
2054   or already has IO pending, so it can be dropped.
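
  The get_flushable rule might look like this in a single-threaded
  model.  The array-based list, the bool lock and iolock_try() are
  stand-ins invented for the sketch; the real list handling and
  locking are more involved.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct block {
	bool iolocked;
	bool on_leaf_list;
};

/* Non-blocking lock attempt; a stand-in for a real trylock. */
static bool iolock_try(struct block *b)
{
	if (b->iolocked)
		return false;
	b->iolocked = true;
	return true;
}

/*
 * Take blocks off the leaf list until one can be iolocked.  A block
 * that is already iolocked is being invalidated or has IO pending,
 * so it is simply dropped from the list.  Returns NULL when the list
 * is exhausted; otherwise the returned block is iolocked.
 */
static struct block *get_flushable(struct block **leafs, size_t n, size_t *pos)
{
	while (*pos < n) {
		struct block *b = leafs[(*pos)++];
		bool got = iolock_try(b);	/* lock before unlisting */

		b->on_leaf_list = false;
		if (got)
			return b;
	}
	return NULL;
}
```

  Either way, a block that has left the list is iolocked by someone,
  so lafs_invalidate_page can always wait on the iolock.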
2055
2056
2057 16th March 2009
2058
2059   FIXME  When we sync a small file, we just write out the inode.
2060      rollforward currently ignores data in inodes I think.
2061      This needs to be fixed to ensure this data is safe.
2062
2063  - stop iblock from disappearing so much.
2064
2065  - I think...
2066     While cleaning a file, I truncate it.  This makes it appear
2067     to fit in the inode but it is very big and we get confused.
2068    We cannot allocate block 0 until all the others have been
2069    allocated to 0 and forgotten.
2070    But what if we truncate a file to 10 bytes, then fsync?
2071     We need to write the data promptly, but we like doing truncate
2072     in the background.
2073    When we extend a file we already need to wait for truncation
2074     to complete (FIXME do we do that?)  We could wait on fsync too.
2075    We cannot just delay block0 as it might be part of a checkpoint
2076     that has to complete promptly while truncation can take a long time.
2077    i.e. we have a very large file.  We update the first byte, then
2078     truncate to 2 bytes.... we don't need to write until fsync which will wait...
2079     Directory?? delete lots of entries so it shrinks to one block?
2080        There is no delayed truncate there.
2081    ?? Never clean an I_Trunc file.
2082    If we try to allocate a file with other indexes:
2083      clear Realloc
2084      if Dirty and Pinned, just do normal alloc
2085      if Dirty and not pinned, skip.
2086
2087
2088   Sometimes I run out of credits while truncating a file.
2089   I need credits - maybe only briefly - to dirty the index blocks.
2090      -- FIXED I think.
2091
2092   An indexblock remains pinned while the refcount is non-zero.
2093   A pinned index block can be on a _leaf lru
2094   The _leaf lru holds a refcount.
2095   This is an awkward referential loop.
2096   We break it at checkpoint time with special code in phase-flip.
2097   But there are other awkward times such as truncate.
2098
2099   We cannot use PinPending like we do with data blocks because there
2100   could be multiple pending Pins (from different children).
2101
2102   We could possibly treat checkpoint_lock like pinpending, but that
2103   might be racy.
2104
2105   We could not count the _leaf lru, but that might just make the race
2106   harder to find.
2107
2108   I think we want to explicitly drop the pin when we truncate a block.
2109   Normally, once we Pin an index block it will become dirty so we don't
2110   want to de-pin before a checkpoint anyway...
2111
2112   Just to clarify: an index block gets dePinned:
2113    - during checkpoint on a phase_flip if it is no longer dirty etc
2114    - on truncation when we erase it
2115    - during pre-emptive write-out which is a bit like an early phase_flip
2116            not sure that we implement that one yet.
2117
2118 17th March 2009
2119  Deadlock?
2120    - checkpoint calls incorporate calls erase_iblock calls iolock_block
2121    - rm calls orphan_pin calls phase_wait
2122  The problem is in lafs_incorporate.  It expects the block to be iolocked,
2123   but can call erase_iblock which tries to get an iolock itself...
2124  ...fixed that and it still happens.
2125  checkpoint calls phase_flip calls allocated_block (on uninc list) calls
2126     iolock_block before calling incorporate
2127  Maybe all of these should assume an IO lock.
2128
2129  FIXME truncate assumes truncate-to-zero.  We need proper ftruncate support.
2130
2131  It nearly works....
2132   Things to do:
2133     - sort out individual patches and review DONE
2134     - allow compilation without refcount tracking DONE
2135     - don't hold a 'leaf' reference. NO
2136     - clean up *ref calls - differentiate those that can be called when zero DONE
2137     - use enum for B_* DONE
2138     - support truncate to non-zero offset DONE
2139     - "looping on" found an 'OnFree' block!
2140     - clean out a lot of debugging
2141
2142  Hmmm.... deadlock.
2143   rmdir is holding i_mutex and waiting for a phase change to pin a dblock.
2144   checkpoint is also holding i_mutex.. or is trying to get one in lafs_cluster_allocate.
2145   Not cool.  i_mutex must not be taken by checkpoint
2146  Fixed that, though it is a bit of a hack....
2147
2148  New deadlock:  checkpoint calls phase_flip which calls allocate_block,
2149     to move the uninc_next across, and that tries to iolock the parent to
2150     perform a partial incorporation.  But that seems to be iolocked.
2151     Generally that is ugly as ->uninc_next might be very long and require
2152     multiple splits, and direct-driving that from phase_flip is bad.
2153     I should just move the list across
2154
2155
2156 19th March 2009
2157   Spent too long trying to remove refcount held by *_leaf lists.
2158   This leaves InoIdx block with zero refcount so Data block can get
2159   lost and bad things happen.
2160   I might be able to fix it up, but it is probably better to try the
2161   checkpoint_lock approach if I can only remember what that is.
2162
2163 Locking:
2164   Available locks:
2165
2166    Spin:
2167
2168     lafs_hash_lock
2169         Used in:
2170            lafs_shrinker
2171            lafs_refile ???
2172         Protects:
2173            ib->hash
2174            ->lru when on freelist
2175
2176     i_data.private_lock
2177         Used in:
2178            lafs_shrinker
2179         Protects:
2180            ->iblock / refcnt
2181            ->dblock / my_inode
2182            ->children / ->parent within an inode
2183            setting ->private
2184
2185     fs->alloc_lock
2186         fs->allocate_blocks
2187
2188     fs->stable_lock
2189         segsum hash table
2190         segsummary counters (in blocks)
2191
2192     fs->lock
2193         _leafs lru
2194         ->pending_blocks lru - should this be wc->lock ??.. not in 'bh'
2195         Pinned consistent with lru
2196         ->checkpointing / ->phase_locked
2197         fs->pending_orphans
2198         ->uninc and ->chain ??  Should use parent->B_IOLock ??
2199         uninc_table - should use B_IOLock
2200         free list / clean list segtrack
2201
2202    Mutex:
2203
2204     fs->wc->lock
2205       wc[0] .. something in prepare_checkpoint
2206        ->remaining etc
2207       cluster_flush
2208       mini blocks
2209
2210     i_mutex
2211       inode_map
2212       orphans
2213
2214    Other:
2215
2216     B_IOLock
2217        erase_block
2218        incorporate
2219        cluster_allocate
2220        allocated_block
2221        IO
2222        Phase flip
2223        Initialising new inode
2224     B_IOLockLock
2225          IOLock across a page
2226
2227
2228 --------------------
2229 This is a list from 18 months ago, with updates
2230
2231  - Understand how superblock 'version' should be used.
2232
2233  -  Review and fix up all locking/refcounts.  See locking.doc
2234        Also lock inode when copying in block 0 and probably
2235        when calling lafs_inode_fillblock (??)
2236  -  lafs_incorporate must take a copy of the table under a lock so
2237          more allocations can come in at any time.
2238
2239  - We don't want _allocated to block during cluster flush.  So have
2240    a no-block version and queue blocks on ->uninc if we cannot
2241    allocate quickly.  Find some way to process those ->uninc blocks.
2242
2243  - Use above for phase_flip so that we don't need to _allocated there.
2244
2245  - Utilise WritePhase bit, to be cleared when write completes.
2246      In particular, find when to wait for Alloc to be cleared if
2247       WritePhase doesn't match Phase.
2248        - when about to perform an incorporation.
2249  - make sure we don't re-cluster_allocate until old-phase address has
2250      been recorded for incorporation.
2251
2252  - allocate multiple WAIT_QUEUE_HEADS for 'block_wait'
2253
2254  - Can inode data block be on leafs while index isn't, what happens if we
2255        try to write it out...
2256
2257  -  If InoIdx doesn't exist, then write_inode must write the data block.
2258
2259  - document and review all guards against dirtying a block from a previous phase
2260     that is not yet safe on storage.
2261           See lafs_dirty_dblock.
2262  - check for proper handling of error conditions
2263      b/ if lafs_seg_dup fails in do_checkpoint, we need to abort the snapshot.
2264  - review checkpoint loop.
2265        Should anything be explicit, or will refile do whatever is needed?
2266  - Waiting.
2267        What should checkpoint_unlock_wait wait for?
2268        When do we need to wait for blocks to change state? And how?
2269
2270  - load/dirty block0 before dirtying any other block in depth=0 file
2271
2272  - use kmem_cache for 'struct datablock'
2273  - indexblock allocation.
2274         use kmem_cache
2275         allocate the 'data' buffer late for InoIdx block.
2276         trigger flushing when space is tight
2277         Understand exactly when make_iblock should be called, and make it so.
2278  - use a mempool for skippoints in cluster.c
2279  - Review seg addressing code in cluster.c and make sure comments are good.
2280  - consider ranges of holes in pending_addr.
2281
2282  - review correct placement of state block given issues with stripes.
2283
2284  - review segment usage /youth handling and make a todo list.
2285       a/ Understand ref counting on segments and get it right.
2286  - Choose when to use VerifyNull and when to use VerifyNext2.
2287  - implement non-logged files
2288  - Store accesstime in separate (non-logged) file.
2289  - quotas.
2290         make sure files are released on unmount.
2291
2292  - cleaner.
2293        Support 'peer' lists and peer_find. etc
2294  - subordinate filesystems:
2295      a/ ss[]->rootdir needs to be an array or list.
2296      b/ lafs_iget_fs needs to understand these.
2297  - review snapshots.
2298       How to create
2299       how they can fail / how to abort
2300       How to destroy
2301  - review unmount
2302       - need to clean up checkpoint thread cleanly - be sure it has fully exited.
2303  - review roll-forward
2304       - make sure files with nlink=0 are handled well.
2305       - sanity check various values before trusting clusters.
2306
2307  - Configure index block hash_table at run time based on memory size??
2308  - striped layout.
2309          Review everything that needs to handle laying out clusters
2310          aligned for striping.
2311
2312  - consider how to handle IO errors in detail, and implement it.
2313  - consider how to handle data corruption in indexing and directories and
2314      other metadata and guard against problems (lot of -EIO I suspect).
2315
2316  - check all uninc_table accesses are locked if needed.
2317
2318  - If a datablock is memory mapped writeable, then when we write it out,
2319      we need to fill up its credits again, or unmap it.
2320  - Need to handle orphans asynchronously.
2321
2322  - support 'remount'
2323  - implement 'write_super' ??
2324
2325  - pin_all_children has horrible gotos - remove them.
2326
2327  - perform consistency check on all metadata blocks read from disk
2328    e.g. don't assume index blocks are type 1 or 2.
2329
2330 23rd March 2009
2331  + looking at cleanup for unmount.
2332  - various more refcounts fixed up
2333  - B_SegRef is never dropped!  and we take a ref on a segment when
2334    we start a cluster on it, but never drop that reference.
2335   THIS is next thing - review all setting and clearing of B_SegRef.
2336
2337 30th March 2009
2338  - SegRef and lafs_reserve_block...
2339    There is room for recursion here, I need to be careful.
2340    To dirty a data block, all parent index blocks must be Pinned and must
2341    be able to be written.  That means their segusage blocks must be
2342    available for update.  And Pinning a segusage block for update requires
2343    all its parents.  So the segment for the block, the indexes, and the
2344    segusage and indexes and so-on must all be pinned.
2345    When we pin a block, we do it from the root down to avoid recursion.
2346    We probably want whatever reserve_block calls, to return an unreserved
2347    block rather than call reserve_block itself.
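
A rough userspace sketch of the "pin from the root down" idea above.
This is not the lafs code: struct blk, pin_from_root and MAX_DEPTH are
made-up names, and real pinning would also reserve space and can fail.
It only shows walking up once to record the path and then pinning on
the way back down, so no call ever has to recurse into its parent.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_DEPTH 16            /* assumed bound on index-tree depth */

struct blk {                    /* hypothetical stand-in for a lafs block */
	struct blk *parent;     /* NULL at the root */
	int pinned;
	int pin_order;          /* sequence number recorded when pinned */
};

/*
 * Pin b and every ancestor, root first, without recursing.
 * Returns how many blocks were newly pinned, or -1 if the path is
 * deeper than MAX_DEPTH.
 */
static int pin_from_root(struct blk *b, int *seq)
{
	struct blk *path[MAX_DEPTH];
	int depth = 0, newly = 0;

	assert(b && seq);
	/* Walk up once, remembering the path. */
	for (; b; b = b->parent) {
		if (depth == MAX_DEPTH)
			return -1;
		path[depth++] = b;
	}
	/* Then pin coming back down, so a parent is always pinned first. */
	while (depth--) {
		if (!path[depth]->pinned) {
			path[depth]->pinned = 1;
			path[depth]->pin_order = ++*seq;
			newly++;
		}
	}
	return newly;
}
```

A second call on the same block pins nothing new, which is the cheap
common case once the parents are already pinned.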
2348
2349   When do we clear SegRef?? We set it when Pinning, so I guess we
2350     clear it when unpinning.
2351    pin_dblock, mark_cleaning, prepare_write, truncate
2352    seg_move clean_free
2353   Well, it is really when Pinning, or Dirtying or Reallocing.
2354   So we clear when unpinning, or when a dblock gets written...
2355   Maybe just when we lose ->parent
2356
2357 6th April 2009
2358  - sometimes segsum counter goes to zero for a random data block
2359      Something is going wrong in roll-forward.  The block looks transiently valid
2360      so doesn't get read, but has no good data in it.
2361  - After deleting a directory, the block might still have incorporation
2362    to happen, but is not marked dirty
2363  - at unmount, there are various blocks that are still dirty.
2364  - sometimes hit BUG_ON(credits==0) line 1196 in cluster.c(cluster_flush)
2365
2366 12th April 2009
2367  - that rollforward problem above:
2368     When rolling the checkpoint, if we find segusage blocks we want to include
2369     them directly into file.  But by pinning the block we might preread a
2370     segusage block.. but we must be sure not to update it.
2371     So during the early stages of rollforward while still in the checkpoint,
2372     seg_inc must be called with in_phase == 0.
2373     so seg_move is called with phase != qphase.
2374     ditto for summary update.
2375     So the block must be pinned to the previous phase...
2376     Normally 'phase' changes at checkpoint-start,
2377              qphase changes at checkpoint-end
2378     So we probably want to start with qphase being 0 and phase being 1.
2379     When we reach the end of the checkpoint, we flip qphase to 1.
2380
2381  - blocks still in phase_leafs at unmount:
2382     After we force a final checkpoint we still have Pinned:
2383         root InoIdx
2384         ino==8 InoIdx due to Dirty block0
2385         ino=16 InoIdx due to dirty block0
2386      and dirty:
2387         inode block 1,  inode usage map
2388                     2,  root directory
2389                     8,  orphan
2390                    16   seg usage
2391      Problems:
2392         inode blocks dirty but not pinned?  No InoIdx...
2393         Segusage dirty - probably by seg_apply_all - disable that at umount
2394         orphan dirty ??... but not pinned!
2395            This is possible - we don't pin for clearing entries, just for setting.
2396         The inode problem stems from the datablock being dirty while the
2397          InoIdx block isn't.  That is, at best, confusing.
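
Going back to the rollforward note at the top of this entry, here is a
tiny model of the phase/qphase ordering.  It is only a sketch under my
own assumptions: struct fs_phase and the helper names are invented, and
the real fields live in the fs structure with locking around them.
The point is that starting with qphase=0 and phase=1 makes rollforward
look like "inside a checkpoint" until qphase is flipped at the end.

```c
#include <assert.h>

struct fs_phase {               /* hypothetical: just the two phase bits */
	int phase;              /* changes at checkpoint start */
	int qphase;             /* changes at checkpoint end */
};

/* Rollforward begins as if a checkpoint were already in progress. */
static void rollforward_start(struct fs_phase *p)
{
	p->qphase = 0;
	p->phase = 1;
}

static void checkpoint_start(struct fs_phase *p)
{
	p->phase = !p->phase;
}

/* At the end of the checkpoint qphase catches up with phase. */
static void checkpoint_end(struct fs_phase *p)
{
	p->qphase = p->phase;
}

/* While the two differ, segment updates belong to the previous phase. */
static int in_checkpoint(const struct fs_phase *p)
{
	return p->phase != p->qphase;
}
```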
2398
2399 13th April 2009
2400    segusage blocks aren't being pinned
2401    They need to be pinned  whenever dirty.
2402    and youth blocks aren't even made dirty some times.  They need to be
2403     pre-pinned in many cases.
2404
2405    So: segusage gets changed when we write out a cluster, and when we
2406       delete/relocate blocks.
2407       In the first case we pin the block when it becomes part of the free list,
2408       and need to keep it pinned across checkpoint changes.
2409       In the second, we pin when the block is dirtied and again must keep it pinned.
2410       Youth gets changed when a segment becomes free and again when we allocate
2411       a segment to it.
2412
2413       Keeping a datablock pinned across checkpoints is awkward - we currently need
2414       to repin for each dirty... I guess we can re-pin for each checkpoint
2415       in lafs_seg_apply_all.  That might work for segusage, but not for youth!
2416       If segsum for ssnum==0 held a reference to the youth block, that might
2417       help.  Segstat on 'clean' or 'free' would imply a reference to that segsum.
2418
2419       Is it OK to keep all youth/usage blocks for free/clean blocks
2420       pinned?  We can currently have 810 entries.  Only half will be clean/free.
2421       For each entry there can be two blocks, youth and usage.  So that could be
2422       810 blocks. 1Meg?  Normally much less.  If it became a problem we could
2423       reduce the number dynamically I guess.
2424
2425       maybe segusage blocks need to get phase_flipped, as other blocks do
2426       depend on them,   pin_all_children wouldn't be able to find them though..
2427
2428     1/ Any address on 'clean' or 'free' segtrack implies a refcount on the
2429       Youth block.
2430
2431 14th April 2009
2432    I think I want to link dirty block to the space in free segments that we
2433    actually know about.  Each of those segments has youth and usage blocks
2434    pinned (at least parent pointer is active).  So we have everything we need
2435    to write everything that is dirty.  So 'free' or 'clean' implies
2436    a segsum reference which holds youth block.
2437
2438    When we get low on space, we wait for cleaning/finding to progress.
2439    This would limit us to  400 segments, say 16Meg each, so 6Gig of dirty
2440    memory.  I guess that we need to scale the 'free' list based on available
2441    memory (FIXME).
2442
2443    When cleaning needs a segment, it needs to load the usage blocks for other
2444    snapshots too.
2445
2446    When cleaning in the presence of snapshots we need to be careful never to
2447    duplicate a block that is shared.  To allow for v.many snapshots, we don't
2448    even want to duplicate in memory.
2449    So we need to choose a 'primary' copy - probably first one found - and
2450    follow the peers link when possible...
2451
2452 18th April 2009
2453    (continuing).
2454
2455    So clean and free segments in the list carry a SegRef.  But it could be
2456    excessive if all of them did - we shouldn't be required to pin more
2457    data than we need.
2458    So for segments with a usage of 0, we use the score to record if a
2459    segref is held.  0 means 'no', 1 means 'yes'.
2460    When space_alloc wants more space we need to find an entry and
2461    segref it.  Maybe we want free lists - reffed and not-reffed.
2462
2463    Then again, SegRefs are fairly cheap as they are heavily shared.
2464    maybe 512 to a block.  If we hold 400 refs they could easily all be
2465    in one block.  We could possibly encourage this by sorting the list
2466    and discarding from one end if it is too full.
2467    Sorting is a good idea definitely.  It keeps youth/usage updates
2468    together.
2469
2470    Just check the numbers.
2471    A 1TB device with 1K blocks might have 32Meg segments, of which there
2472    would be 32768.  512 entries per block means 64 blocks or 16 pages (64K).
2473    So total segusage files is 128K plus snapshots.  Not worth worrying
2474    about surely.
2475    For 16TB, that is 2Meg plus snapshots.
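
The arithmetic above, written out so it can be re-checked for other
sizes.  The 2-byte entry size is my assumption, inferred from "512 per
block" with 1K blocks; table_bytes is a made-up helper, not anything in
the lafs source.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Bytes needed for one per-segment table (youth or usage) on a device
 * of dev_bytes, with seg_bytes per segment and entry_bytes per entry.
 */
static uint64_t table_bytes(uint64_t dev_bytes, uint64_t seg_bytes,
			    uint64_t block_bytes, uint64_t entry_bytes)
{
	uint64_t segs, per_block, blocks;

	assert(seg_bytes && entry_bytes && block_bytes >= entry_bytes);
	segs = dev_bytes / seg_bytes;
	per_block = block_bytes / entry_bytes;
	blocks = (segs + per_block - 1) / per_block;    /* round up */
	return blocks * block_bytes;
}
```

For 1TB with 32Meg segments this gives 64K per table, so 128K for
youth plus usage; for 16TB it gives 1Meg per table, 2Meg for both.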
2476
2477    So
2478     - keep a SegRef for all free and clean blocks.
2479       This must include a youthblk reference.
2480     - sort the free list when 'clean' is merged or when a pass
2481           finishes.
2482         sort clean list
2483         fix youth value
2484         merge as many as fit into free
2485         sort
2486
2487    How is the code flow...
2488       add_cleanable is called during the periodic scan.  It could hold
2489                a SegRef easily.
2490       add_cleanable calls add_clean as does lafs_get_cleanable during
2491           clean.  That might block getting a segref, might even
2492           deadlock?
2493       add_free is also called by seg_scan
2494
2495       So seg_scan should get a segref and leave it with everything!
2496
2497     BUT.....
2498     A SegRef implies a 'struct segsum' for each segment.  We don't
2499     want to allocate one of these for every segment in the table.
2500     We only want a reference to the youth and segusage block, which
2501     are heavily shared.
2502
2503     But these blocks need to be Pinned and SegReffed etc so we can
2504     write them at any time.
2505
2506 20th July 2009
2507   The refcount held by the 'leaf' lru is a problem.
2508   While it holds a count we do not unpin an index block, so it cannot
2509   be removed from the list.
2510   Thus we can only remove from the leaf lru on a phase change.....
2511   Or when doing lru based flushing... Maybe we can remove from the
2512   lru while holding the checkpoint lock.
2513   This happens when truncating..
2514
2515   No, that is just too messy as it is too easy to get put back on the list.
2516
2517   Maybe the leaf lru should not imply a reference count ... or maybe
2518   we need to split the refcount:  'inuse' and 'active'.....
2519   How about we test refcnt against list_empty(->lru)...
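
A sketch of what "test refcnt against list_empty(->lru)" could look
like.  This is a userspace model only: the list helpers are minimal
copies of the usual kernel ones, and struct iblk, leaf_add, leaf_del
and idle_apart_from_lru are hypothetical names.  It assumes the lru
holds exactly one reference while the block is on it.

```c
#include <assert.h>

struct list_head { struct list_head *next, *prev; };

static void list_init(struct list_head *h) { h->next = h->prev = h; }
static int list_empty(const struct list_head *h) { return h->next == h; }

static void list_add(struct list_head *n, struct list_head *head)
{
	n->next = head->next;
	n->prev = head;
	head->next->prev = n;
	head->next = n;
}

static void list_del_init(struct list_head *n)
{
	n->prev->next = n->next;
	n->next->prev = n->prev;
	n->next = n->prev = n;
}

struct iblk {                   /* hypothetical index block */
	int refcnt;
	struct list_head lru;
};

/* Putting the block on the leaf lru takes one reference. */
static void leaf_add(struct iblk *b, struct list_head *leafs)
{
	if (list_empty(&b->lru)) {
		list_add(&b->lru, leafs);
		b->refcnt++;
	}
}

static void leaf_del(struct iblk *b)
{
	if (!list_empty(&b->lru)) {
		list_del_init(&b->lru);
		b->refcnt--;
	}
}

/* True when nobody but (possibly) the lru holds a reference. */
static int idle_apart_from_lru(const struct iblk *b)
{
	return b->refcnt == (list_empty(&b->lru) ? 0 : 1);
}
```

With a test like this the unpin decision no longer depends on first
getting the block off the list, which is the loop I keep running into.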
2520
2521   ....
2522
2523   During truncate, we need each index block to get unpinned so they can
2524   all be cleaned up.
2525   But the InoIdx block is held pinned by the inode block being dirty.
2526   In this particular case, the InoIdx block is Invalid as the file is empty.
2527   But.... InoIdx should always be valid until after Inode is destroyed??
2528
2529
2530  umount
2531  I need to stop the cleaner and flush everything before trying to
2532  clean up.
2533
2534  This is awkward though.
2535  The 'sync' of umount is done by kill_block_super, but I call
2536  that rather late, after checking that the tree is empty.
2537  There are pinned/dirty bits left after sync that we want to magically
2538   clean.
2539  We have:
2540    - segusage/youth blocks.  Maybe if we don't seg_apply_all...
2541    - orphan block.  Maybe don't mark it dirty when we remove things?
2542    - inode map?? why is that dirty
2543
2544    - root directory is dirty still??  But it has been erased.
2545      InoIdx is valid-but-empty.  Inode Data is dirty
2546         Data block 0 is Dirty at block 0.
2547
2548   ......
2549  Ahh... need to mark page dirty when block is marked dirty !!
2550
2551  The seg usage blocks are now flushed out but not incorporated.
2552  I feel that might be correct - we don't want to care about
2553  incorporation as we will never use it.
2554  For this, segusage and quota are very special cases.
2555
2556  Inode map is no longer dirty, but is pinned
2557  Orphan does have a dirty block still
2558     The orphan table contains the root directory.
2559  root is now clean and gone
2560
2561  Segusage doesn't get incorporated after last checkpoint now
2562  so that is better.
2563  But now we have a circular reference for SegRef.  This should not
2564  be surprising given the circular problems we had setting SegRef.
2565  I guess we just erase the references in the segsum table...
2566
2567 22nd July 2009
2568  Hurray!!! I can unmount without crashing!
2569  Now I need to sort through all the fixes required to achieve that
2570  and make discrete patches, and be sure it is all OK.
2571
2572 DONE - (block.c) lafs_get_block should not have to lock that page just to do a lookup.
2573 DONE - (block.c) Mark page dirty when block becomes dirty
2574 DONE - (checkpoint.c) print orphan_slot with Orphan flag
2575 DONE - Don't incorporate segcount etc after final checkpoint
2576 DONE - Don't apply seg changes after final checkpoint.
2577 DONE - Don't start opportunistic checkpoint after final.
2578 DONE - (checkpoint) if InoIdx isn't dirty but InodeData is, then still allocate
2579 DONE - (checkpoint) when waiting, wait for checkpointneeded to get cleared
2580 DONE - (cluster) be more flexible about credit usage when flushing InoIdx
2581 DONE - (dir) do add_orphan when we abort as well as on success
2582 DONE - use inode_dec_link_count, not i_nlink--
2583 DONE - (file.c) lafs_writepage: remove from leafs when we cluster_allocate
2584 DONE - change %d/%d to strblk
2585 DONE - (index.c) refile: if B_IOLock, then it isn't on LRU
2586 DONE - (index) refile: when unpinning, remove from lru
2587  - lafs_refile: ->iblock can be non-null for inode 0.
2588 DONE - Make sure I_Deleting gets cleared when deleting finished.
2589 DONE - phase_flip should have something separate to call, not lafs_allocated_block
2590  - inode.c: lafs_dirty_inode: getref_lock used to get dblock
2591 NONO - ?? getref_locked allowed if PagePrivate
2592 DONE - segment: lafs_seg_put_all needed at unmount
2593 DONE - segdelete_all: need to put intable references
2594 DONE - lafs_free_get: put the intable references
2595 DONE - lafs_get_cleanable: put the intable references
2596 DONE - fix sort splitting in add_cleanable
2597 DONE - add lafs_empty_segment_table for unmount
2598 DONE - lafs_release: flush all dirty blocks
2599 DONE - lafs_release: force a final checkpoint
2600 DONE - lafs_release: move kill_block_super before final check
2601 DONE - lafs_put_super: release orphans and segsum files.
2602 DONE - lafs_destroy_inode: putref should be 'iblock'
2603  - lafs_destroy_inode: allow for iblock to be present but no ref held....
2604 DONE - can roll forward call lafs_allocated_block without dirty???
2605
2606 27th July 2009.
2607  - I've re-arranged lafs_release so that the flush is all done in
2608    generic_shutdown_super.  However it calls invalidate_inodes, and that has
2609    problems with pinned inodes.  So we need for fsync_super to checkpoint
2610    out all inodes that we don't hold our own reference to.
2611    If we do hold a reference, then invalidate_inodes will skip them,
2612    and ->put_super can be used to drop the references and perform the final
2613    checkpoint.
2614    fsync_super calls ->sync_fs after syncing all files.  Maybe I can
2615    do some sort of checkpoint there...
2616    There almost is a checkpoint in there.... But only when called without
2617    'wait'....
2618    I need to understand 's_dirt'.
2619    This is controlled entirely by the filesystem, common code only examines it.
2620    If it is set:
2621           file_fsync (the generic 'fsync' method) will call ->write_super
2622           fsync_super will call write_super
2623           generic_shutdown_super will call write_super
2624           sync_supers will call write_super
2625           sync_filesystems(0) will call ->sync_fs
2626    sync_fs is called:
2627         twice from 'sync', once with '0', once with '1' for 'wait'.
2628              (though in emergency_sync, both are '0').
2629         once from unmount and remount with 'wait' set to '1'.
2630         We don't want two checkpoints for a 'sync', but we want to start
2631         on 'wait=0'.
2632         Maybe if we get called with '0', we set a flag and treat the '1'
2633         differently..  There is no locking to make this really safe, but
2634         it will probably be OK...  I could take a process_id, but then
2635         parallel 'sync's could race.
2636         write_super is called before the syncs.  So it could start the checkpoint,
2637         and sync could wait for it.
2638         write_super is called multiple times at shutdown.  We really need
2639         to utilise s_dirt to avoid some of these.
2640         We set s_dirt to 0 when we set CheckpointNeeded, and set it to 1:
2641             - when we pin a dblock or dirty a this-phase iblock.
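
A small model of the "set a flag on wait=0, treat wait=1 differently"
idea for sync_fs.  It is a sketch with invented names (struct
sync_model, model_sync_fs), it only counts checkpoints rather than
running them, and like the idea itself it has no locking, so parallel
callers could still race.

```c
#include <assert.h>

struct sync_model {             /* hypothetical bookkeeping only */
	int started;            /* set by a call that started a checkpoint */
	int checkpoints;        /* checkpoints started so far */
	int waits;              /* times a caller waited for completion */
};

/*
 * wait == 0: start a checkpoint unless one was already started.
 * wait == 1: start one only if no earlier call did, then wait for it
 *            and clear the flag for the next sync.
 */
static void model_sync_fs(struct sync_model *m, int wait)
{
	if (!m->started) {
		m->checkpoints++;
		m->started = 1;
	}
	if (wait) {
		m->waits++;
		m->started = 0;
	}
}
```

So a 'sync' (0 then 1) costs one checkpoint, an unmount (just 1) costs
one, and an emergency_sync (0 then 0) starts one and never waits.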
2642
2643 29jul2009
2644   at unmount, we iput the root inode which de-references the dblock
2645   before clearing ->iblock, which fails an assertion ... why?
2646    Apart from the shrinker, ->iblock is only set to NULL in refile
2647    when we find an I_Destroyed inode... I guess the root block isn't
2648    getting Destroyed...
2649  The protocol for freeing iblocks is bad.  Should be:
2650    - it only gets freed by the shrinker
2651    - when inode dies, set ->inode to NULL
2652    - when InoIdx iblock dies, set ->iblock to NULL
2653    ...???
2654 30Jul2009
2655   So, what exactly is the protocol?
2656     - index blocks live either in the parent/sibling tree, or
2657       on the inode's free_index list
2658     - when refcnt is 0, they live on 'freelist.lru'.  When refcount
2659       is elevated they stay on lru until they need to be
2660       added to some other lru (leafs or cluster)
2661     - when shrinker finds block on freelist.lru with non-zero refcnt,
2662       it just removes from lru
2663     - when shrinker finds free block, it removes from free_index and discards
2664       the block FIXME can refcnt=0 still have Pinned,Uninc,Realloc,Dirty ??
2665         I think not as such would either have children or be on an lru
2666     - When we destroy an inode, all index blocks get disconnected from the
2667       inode and freed.  This must include the ->iblock
2668     - When an index block becomes free due to index tree shrinkage,
2669       we set the ->depth to -1 so that it cannot be found by mistake,
2670       and leave it for shrinker or inode destruction.
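
The shrinker part of the protocol above, reduced to flags so the cases
are easy to see.  Everything here is a stand-in I made up for the
sketch (struct iblk with on_lru / on_free_index flags, model_shrink,
tree_shrunk); the real code uses list membership and the hash lock.

```c
#include <assert.h>

struct iblk {                   /* hypothetical index block state */
	int refcnt;
	int on_lru;             /* on freelist.lru */
	int on_free_index;      /* on the inode's free_index list */
	int depth;              /* -1 once dropped from the tree */
	int discarded;
};

/* Index tree shrank: make the block unfindable, free it later. */
static void tree_shrunk(struct iblk *b)
{
	b->depth = -1;
}

/*
 * One shrinker pass over the blocks.  A block with a non-zero refcnt
 * is only taken off the lru; a free one is removed from free_index
 * and discarded.  Returns the number discarded.
 */
static int model_shrink(struct iblk **blocks, int n)
{
	int i, freed = 0;

	for (i = 0; i < n; i++) {
		struct iblk *b = blocks[i];

		if (!b->on_lru)
			continue;
		b->on_lru = 0;
		if (b->refcnt)
			continue;       /* in use: just drop from lru */
		b->on_free_index = 0;
		b->discarded = 1;
		freed++;
	}
	return freed;
}
```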
2671
2672    Confused about inode<->dblock dependence.
2673    We don't want the inode to refcnt the dblock as that wastes space.
2674    We don't want the dblock to refcnt the inode as that stops it from being freed.
2675    So each must disconnect from other when freed.
2676    What locking?
2677    inode takes private_lock, then checks dblock
2678    dblock cannot take private_lock before checking ->my_inode..
2679    Maybe: destroy_inode takes ref on dblock, then sets I_Destroyed, then
2680      drops ref
2681
2682 1Aug2009.
2683   Tracking down the 'credit' count and making sure it stays correct.
2684   It seems that I have a Dirty InoIdx block which is not pinned.
2685   Due to this it has no refcount and so the data block disappears so
2686   the InoIdx block is not visible in the tree.  This isn't a definite bug
2687   but it means I cannot count credits properly.
2688   And surely Dirty index blocks must always be pinned!!??
2689
2690   When a small file is flushed to the inode we were dirtying the
2691   iblock.  That seems wrong - should dirty the dblock?  Need to
2692   check that is valid
2693
2694   I got a hang in 'rm adir/4'.
2695   rm is in lafs_cluster_update_commit_both
2696        getting a mutex.
2697   cleaner is in lafs_do_checkpoint+0xe4
2698   pdflush is in writepage/lafs_cluster_flush waiting on a lock
2699   so I guess cleaner is holding a mutex and waiting for something
2700    that won't happen?
2701
2702
2703   Hang again at 'seq 1 200' in 'cd /mnt/1/adir'.
2704    cleaner is at some point, holding a mutex to stop 'sh'.
2705   0e4 == 228
2706
2707   ahh.. prepare checkpoint holds wc[0].lock while waiting for checkpoint
2708    to be allowed.
2709   So when something locks the checkpoint and needs to flush, we have problems....
2710
2711
2712   I seem to have fixed the above.  Now:
2713     Free space is a real problem.  When I remount after the successful unmount,
2714     we find a usage pattern like:
2715 CLEANABLE: 0/0 y=10 u=34179
2716 CLEANABLE: 0/1 y=0 u=65144
2717 CLEANABLE: 0/2 y=0 u=65535
2718 CLEANABLE: 0/3 y=32773 u=32910
2719 CLEANABLE: 0/4 y=32772 u=149
2720 CLEANABLE: 0/5 y=0 u=0
2721 CLEANABLE: 0/6 y=32770 u=16529
2722 CLEANABLE: 0/7 y=32769 u=35084
2723 CLEANABLE: 0/8 y=32768 u=31877
2724
2725     Which is ridiculous.
2726    Better fix up what I have first...
2727
2728  ...
2729  In rm /mnt/1/nbfile* we hang..
2730    rm is in lafs_phase_Wait from pin_dblock in unlink
2731 wait for [ce5c2d20]277/0(0)r2F:Index(1),Pinned,Phase0,InoIdx,Valid,Dirty,CN,CNI,UninCredit,PhysValid{0,0}[8] pindb(1) leaf(1)
2732
2733    cleaner is in lafs_iolock_block from add_block_address in phase_flip
2734 iowait for [ce5c33b0]286/0(0)r6E:Index(1),Pinned,Phase1,WPhase0,InoIdx,Valid,Dirty,CN,CNI,UninCredit,IOLock,PhysValid{0,0}[1] child(5) leaf(1)
2735
2736  So cleaner is probably deadlocking against itself via iolock_block.
2737   This is taken:
2738     - in lafs_invalidate_page just to wait for any io - it isn't held long
2739     - in lafs_erase_dblock while we erase and 'allocated_block'
2740     - in lafs_get_flushable to protect blocks being checkpointed
2741     - in lafs_writepage to call cluster_allocate (which releases), both for
2742              data block or for inode when data was flushed there.
2743     - lafs_add_block_address to process pending incorporations to make room.
2744          This is what is trapping the cleaner.
2745     - lafs_inode_handle_orphan when truncate finishes to erase_iblock
2746     - lafs_inode_handle_orphan again to incorporate all removal
2747     - and again to erase_iblock
2748     - and for partial truncate to incorporate some removals
2749     - and again....
2750     - lafs_new_inode to keep it from being cleaned while being created
2751     - roll_block to add addresses
2752     - lafs_load_block during IO
2753
2754   So: who holds it?.... let's use the code to find out...
2755   And the answer is : lafs_get_flushable.
2756    So get_flushable iolocks the block then calls phase_flip which tries to
2757    incorporate other-phase children which try to iolock the block.  Deadlock.
2758    Do we need to hold iolock during phase_flip ??.  Not for all of it..
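
The self-deadlock above in miniature.  This is a single-threaded model
with made-up names (struct blk, iolock_try, flush_one, the two
incorporate variants); a trylock that fails stands in for the real
lock that would sleep forever.  It contrasts a callee that takes the
iolock itself with one that assumes the caller already holds it.

```c
#include <assert.h>

struct blk {                    /* hypothetical block with an iolock bit */
	int iolocked;
	int incorporated;
};

/* Non-recursive lock: returns 0 where the real one would block. */
static int iolock_try(struct blk *b)
{
	if (b->iolocked)
		return 0;
	b->iolocked = 1;
	return 1;
}

static void iounlock(struct blk *b)
{
	assert(b->iolocked);
	b->iolocked = 0;
}

/* Takes the lock itself: cannot be called with the lock held. */
static int incorporate_locking(struct blk *b)
{
	if (!iolock_try(b))
		return -1;              /* would hang for real */
	b->incorporated++;
	iounlock(b);
	return 0;
}

/* Assumes the caller holds the iolock. */
static int incorporate_locked(struct blk *b)
{
	assert(b->iolocked);
	b->incorporated++;
	return 0;
}

/* Like get_flushable + phase_flip: lock the block, then incorporate. */
static int flush_one(struct blk *b, int (*incorporate)(struct blk *))
{
	int err;

	if (!iolock_try(b))
		return -1;
	err = incorporate(b);
	iounlock(b);
	return err;
}
```

The other way out is the one asked about above: drop the iolock before
the part of phase_flip that incorporates children.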
2759
2760 02August2009
2761    FIXME When erasing a block, do I need an uninc credit?  I usually don't
2762     have one and the need certainly isn't as great...
2763
2764   Now... let's try to get free space accounting right.
2765    Observed problems:
2766      - unlink sometimes failed with ENOSPC
2767      - usage scan shows segments with enormous usage - 23039!!
2768
2769   no credits: [ce9a55cc]16/1(2651)r11E:Pinned,Phase1,WPhase1,Valid,Dirty,SegRef,C,CI,CN,CNI,UninCredit,IOLock,PhysValid intable(7) ss(3) cluster(1)
2770   no credits: [cfb695cc]16/1(1840)r12E:Pinned,Phase0,WPhase0,Valid,Dirty,SegRef,C,CI,CN,CNI,UninCredit,IOLock,PhysValid intable(7) ss(4) cluster(1)
2771
2772   no uninc credit [cfb695cc]16/1(2019)r13F:Pinned,Phase0,WPhase0,Valid,Dirty,SegRef,C,CI,CN,CNI,IOLock,PhysValid intable(7) ss(5) cluster(1)
2773
2774
2775   after umount/remount df says "4608 7 1544" but cannot
2776    create anything.
2777 df: tot=4608 free=4601 avail=1544(1564-20) cb=7 pb=0 ab=0
2778 ============= Cleanable table (7) =================
2779 pos: dev/seg  usage score
2780   0:   0/0        1 0
2781   1:   0/5        1 64
2782   2:   0/6        6 384
2783   3:   0/7        2 128
2784   4:   0/8        3 192
2785   5:   0/3        1 64
2786   6:   0/2        2 128
2787 ...sorted....
2788   0:   0/0        1 0
2789   1:   0/3        1 64
2790   2:   0/5        1 64
2791   3:   0/2        2 128
2792   4:   0/7        2 128
2793   5:   0/8        3 192
2794   6:   0/6        6 384
2795 --------------- Free table (1) ---------------
2796 12290:   0/4        0 0
2797 --------------- Clean table (0) ---------------
2798 CLEANABLE: 0/0 y=10 u=1
2799 CLEANABLE: 0/1 y=32775 u=3
2800 CLEANABLE: 0/2 y=32774 u=2
2801 CLEANABLE: 0/3 y=32773 u=1
2802 CLEANABLE: 0/4 y=0 u=0
2803 CLEANABLE: 0/5 y=32771 u=1
2804 CLEANABLE: 0/6 y=32770 u=6
2805 CLEANABLE: 0/7 y=32769 u=2
2806 CLEANABLE: 0/8 y=32768 u=3
2807
2808
2809 03Aug2009
2810  Current issues:
2811 FIXED 1/ Data 16/1 is being Reallocated, but is Dirty, not Realloc
2812 Gone,presume FIXED 2/ Data 16/1 has no uninc credit in cluster_flush
2813  3/ in cleaner, ->dblock is uninitialised.... actually inode has been freed.
2814  4/ invalidate_page finds Realloc set, even after iolock ..
2815      This is during umount  in generic_shutdown/lafs_put_super/iput
2816  5/
2817
2818
2819  Thoughts:
2820    If we flag a block for Realloc then Dirty before it is allocated,
2821      then all is fine.
2822    But if we have already allocated to a cleaning cluster... what happens?
2823     We need to treat this like it was dirtied after being written, so
2824     it gets written to a regular cluster as well.
2825     As we only have one uninc bit for both Dirty and Realloc, we need
2826     to *not* incorporate the Realloc update if the block is still dirty.
2827    So:
2828         - block gets chosen for cleaning and allocated to a clean-cluster
2829         - block gets marked dirty.  This must not clear Realloc
2830         - cluster is flushed, block is dirty, so don't call lafs_allocated_block
2831         - Return the Realloc credit, but keep dirty and Uninc.
2832      Is there a race if Dirty is set after we enter lafs_allocated_block?
2833       As long as the index block gets marked Dirty, not Realloc we might
2834        be safe... though it gets awkward if the Dirty writeout falls in to
2835        the next phase.  But reserve_block will have provided NCredits for that.
2836      So:
2837         1/ don't clear Realloc when setting Dirty
2838         2/ do clear Realloc if cleaner finds the block is Dirty
2839         3/ avoid calling lafs_allocated_block when cleaning a dirty block.
2840                    This is an optimisation.
2841
2842     Almost...  A B_Realloc block no longer has B_Credit so B_Dirty cannot be
2843        set.
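
  A rough userspace sketch of rules 1/ to 3/ above.  The flag names follow
  the notes; struct blk, blk_mark_dirty and blk_clean_flush are made-up
  names for illustration, and the credit handling is an assumption, not
  the module's code.  It also shows the "Almost..." case: a Realloc block
  with no B_Credit cannot be made Dirty.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag bits; names mirror the B_* flags in the notes. */
enum {
    B_Dirty   = 1u << 0,
    B_Realloc = 1u << 1,
    B_Credit  = 1u << 2,
};

struct blk { unsigned flags; };

/* Rule 1: setting Dirty must not clear Realloc.
 * Returns false when there is no credit to pay for becoming dirty. */
static bool blk_mark_dirty(struct blk *b)
{
    if (b->flags & B_Dirty)
        return true;
    if (!(b->flags & B_Credit))
        return false;
    b->flags &= ~B_Credit;   /* the credit is consumed by Dirty */
    b->flags |= B_Dirty;     /* B_Realloc is left untouched */
    return true;
}

/* Rules 2 and 3: when the cleaning cluster is flushed, Realloc is
 * dropped.  Returns true if the caller should record the new address,
 * false if the block is Dirty and will be written again anyway. */
static bool blk_clean_flush(struct blk *b)
{
    assert(b->flags & B_Realloc);
    b->flags &= ~B_Realloc;
    return !(b->flags & B_Dirty);
}
```

  A block that is dirtied after being picked for cleaning keeps Dirty,
  loses Realloc at flush time, and skips the address update.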
2844
2845
2846   Thoughts3.
2847      When cleaning blocks we hold no reference to the inode and it can disappear.
2848      We don't want to hold the inode active, but need a reference much like
2849       the truncate code has.
2850      I think we need a subordinate refcount for both cleaning and truncate.
2851       These hold inode present but not active.
2852      Maybe every block->inode should be counted like this.
2853      And this might simplify the my_inode->dblock inter-relationship.
2854      For later..
2855        We need to ensure that if a new iget is called on an inode that still
2856        exists, we don't allocate a new one but just reuse the old.
2857        But that won't work as we cannot add an inode back into the hash table.
2858      So I think when cleaning a block we need to ref the inode.
2859       i.e. B_Realloc implies an i_grab
2860
2861 05aug2009
2862  So I have a problem with the cleaner wanting to hold an inode that
2863  the VFS is destroying.
2864  I don't want the cleaner to hold i_count as that delays truncate etc.
2865  So we need a second counter subordinate to i_count.
2866  This is held by the cleaner and by delayed truncate, and by i_count.
2867  Possibly ->my_inode holds this, which means it can be a single bit...
2868
2869  When a lookup wants an inode, we need to load the inode data block and
2870  see if it has my_inode.  If it does, we insert that inode in to the
2871  hash table.  If not we fall back to regular inode creation....
2872
2873  On reflection, that is too complicated and hard and error prone.
2874  When relocating a file we need the data so it had best be in the page
2875  cache so the filesystem really needs to know that the inode is still
2876  active.
2877  So cleaning needs to keep a reference to the inode.
2878  The cost of this is that if an inode is being deleted while it is
2879  being cleaned the truncate cannot happen until the cleaning
2880  completes.  This means that space usage will be wrong.
2881  When nlink becomes zero we can drop the cleaner reference.  When
2882  the inode is dropped/destroyed we can tie the cleaning in with the
2883  delayed truncate so that the final destruction doesn't happen until
2884  the cleaner has let go.
2885
2886  So: how to track that the cleaner has a reference to the inode?
2887  Maybe every B_Realloc block owns a ref on the inode.... but dropping
2888  those references when i_nlink hits zero would be difficult.
2889  They could hold a secondary refcount which, if non-zero, implies a
2890  ref on the inode.
2891
2892  So:
2893   - Set B_Cleaning when we look at a block for cleaning, and clear
2894     it when we find Realloc clear and ....????
2895   - Whenever a block has B_Cleaning set, it holds a counted reference
2896     on LAFSI(b->inode)->cleaner_ref
2897   - When cleaner_ref is non-zero and I_Deleting is not set, we hold
2898     a reference on the inode (i_grab).
2899   - when i_nlink hits zero, set I_Deleting and drop any reference
2900     held by the cleaner.
2901  DONE - cleaner must be careful not to process any block that has been
2902     truncated, or file that is dead.
2903  DONE - Make sure the cleaner doesn't start up after the FinalCheckpoint.
2904   - What about filesystem inodes... how do they fit in??
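
  The cleaner_ref bookkeeping in the list above might look roughly like
  this.  It is only a sketch: struct cinode and the helper names are
  invented here, 'deleting' stands in for I_Deleting, and i_count is a
  plain int rather than the real VFS counter.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the inode fields the scheme touches. */
struct cinode {
    int      i_count;      /* VFS-style reference count */
    unsigned cleaner_ref;  /* one per block with B_Cleaning set */
    bool     deleting;     /* stands in for I_Deleting */
};

/* The cleaner owns exactly one i_count reference while this is true. */
static bool cleaner_holds_inode(const struct cinode *i)
{
    return i->cleaner_ref > 0 && !i->deleting;
}

/* Re-sync i_count after any change to cleaner_ref or deleting. */
static void cleaner_sync(struct cinode *i, bool held_before)
{
    bool held_now = cleaner_holds_inode(i);

    if (held_now && !held_before) {
        i->i_count++;            /* take the inode reference */
    } else if (!held_now && held_before) {
        assert(i->i_count > 0);
        i->i_count--;            /* drop the inode reference */
    }
}

/* A block gains B_Cleaning. */
static void set_cleaning(struct cinode *i)
{
    bool held = cleaner_holds_inode(i);

    i->cleaner_ref++;
    cleaner_sync(i, held);
}

/* A block loses B_Cleaning. */
static void clear_cleaning(struct cinode *i)
{
    bool held = cleaner_holds_inode(i);

    assert(i->cleaner_ref > 0);
    i->cleaner_ref--;
    cleaner_sync(i, held);
}

/* i_nlink reached zero: set the flag and drop the cleaner's reference. */
static void nlink_hit_zero(struct cinode *i)
{
    bool held = cleaner_holds_inode(i);

    i->deleting = true;
    cleaner_sync(i, held);
}
```

  Any number of B_Cleaning blocks cost only one inode reference, and that
  reference goes away as soon as the deleting flag is set.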
2905
2906
2907   Question. When are the index blocks for an inode flushed?
2908   We need to have them gone when the inode disappears.
2909   For deleted inodes, this happens in background truncate.
2910   For memory-pressure inodes it will hopefully happen well in advance,
2911   but we need to make sure in destroy_inode that everything is
2912   written. - FIXME
2913
2914
2915   Thinking again about B_Cleaning, any B_Realloc block will hold a
2916   reference through to InoIdx and so dblock will be present and the
2917   inode won't be freed.  So we only need an extra reference during
2918   the first little phase of cleaning when we are collecting blocks.
2919   After that a reference can be useful as it will delay flushing so it
2920   can be more efficient...
2921
2922   Maybe this is all much simpler than I thought.
2923   If we hold a ref on the inode whenever the InoIdx block is Pinned
2924   and i_nlink is non-zero, then we won't be forgotten until all
2925   index blocks are written.  We may still be deleted, but as that
2926   is one-way we can hold on to the inode at little cost.
2927
2928   getting/putting that ref at exactly those times turns out to be
2929   messy.
2930   It might be best to have a flag to say "We hold an extra ref".
2931   Then we occasionally call a function that validates the setting.
2932   It is most important to drop the count at the right time, so
2933   after unlink/rmdir/rename and when B_Pinned is dropped.
2934
2935   B_Pinned is set in:
2936      set_phase which is called from:
2937           lafs_cluster_allocated when moving 'pin' across to data block
2938               so don't need checkpin
2939           lafs_pin_block_ph
2940               only need check_pin if dropping spinlock
2941           pin_all_children
2942               only pins data blocks (Index are already pinned if relevant).
2943           grow_index_tree
2944               where "inoidx block pinning" doesn't change
2945           do_incorporate_leaf
2946               No InoIdx involved
2947           do_incorporate_internal
2948               ditto
2949    So only need check in lafs_pin_block_ph and maybe pin_all_children...
2950
2951 08Aug2009
2952   - credits get out of sync from
2953       lafs_incorporate->refile->space_return from checkpoint.
2954       counter is one more than we can find.
2955       returning space on
2956          i [cfb9aaf0]327/0(2261)r1E:Index(1),Valid,PhysValid[0] NP
2957        Note it is an Index but not InoIdx.  The parent is still in the tree.
2958      That is FIXED
2959
2960   - and out by 8! at
2961       delete_inode -> truncate -> invalidate_page->erase_dblock->space_return
2962     FIXED that.
2963
2964   - BUG credits<0 in space_return from lafs_incorporate from add_block_address
2965      from phase_flip
2966 Just Grew [cfbb5c70]331/0(NoPhysAddr)r2E:Index(2),Pinned,Phase1,InoIdx,Valid,Dirty,UninCredit{0,1}[0] child(1) inc(1)
2967      from [cfbb5a40]331/0(0)r205F:Index(1),Pinned,Phase1,WPhase0,Valid,Dirty,CN,CNI,UninCredit,IOLock,PhysValid{0,0}[0] child(203) leaf(2)
2968 msg: (1,3,1)(1,1,-1)
2969 Credits = -1, rv=1
2970 ib = [cfbb5a40]331/0(0)r205F:Index(1),Pinned,Phase1,WPhase0,Valid,Dirty,CN,CNI,UninCredit,IOLock,PhysValid{0,0}[0] child(203) leaf(2)
2971
2972     This is a predicted but not handled problem.
2973     The answer is that not all blocks need ICredit/UnincCredit.
2974     The purpose of this credit is to allow for a split in the parent.
2975     pre-existing index blocks can never split the parent themselves
2976     If an index block becomes full, it will split and this might split
2977     the parent.
2978     If an index block has free space, then it will only overflow if it
2979     gets multiple child updates and this will provide multiple credits.
2980     So an index block with space for 3 or more new addresses does not need
2981     an ICredit/UnincCredit.  So when we split we don't need to provide an
2982     uninc credit.
2983     In particular.
2984     When we have a full InoIdx block and a single new child with 1 UnincCredit,
2985     each block already is either 'Dirty' or has a 'Credit', and the InoIdx has
2986     an ICredit, then create a new intermediate such that
2987         InoIdx is Dirty and has an ICredit
2988         New Index is Dirty with no ICredit - it used the UnincCredit
2989         New child loses its UnincCredit
2990     When another block in the new index arrives, its UnincCredit is used to
2991     provide an ICredit
2992
2993     When a leaf block cannot fit a single address it will have ICredit.
2994     The block is split so that each has 3 spaces and so do not need ICredit,
2995     but as soon as ICredit is available, they take it.
2996
2997     Worst case is that every ancestor is full and the leaf is split
2998     We then get two full branches, each block half empty so not needing ICredit.
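
  The "3 or more free slots" rule can be written down as a small
  predicate.  This is a sketch under assumptions: capacity/used are
  counts of address slots, the even split is my simplification, and the
  function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

#define ICREDIT_FREE_SLOTS 3   /* threshold taken from the notes */

/* An index block with room for 3 or more new addresses does not need
 * an ICredit/UnincCredit; a fuller one does. */
static bool index_needs_icredit(unsigned capacity, unsigned used)
{
    assert(used <= capacity);
    return capacity - used < ICREDIT_FREE_SLOTS;
}

/* Split a full block evenly and report whether either half would
 * still need a credit. */
static bool split_halves_need_credit(unsigned capacity)
{
    unsigned left = capacity / 2;
    unsigned right = capacity - left;

    return index_needs_icredit(capacity, left) ||
           index_needs_icredit(capacity, right);
}
```

  With any realistic block capacity both halves come out with plenty of
  free slots, which is why the split itself needs no uninc credit.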
2999
3000
3001   Then...
3002     free data being used in lafs_refile from cleaner.
3003     b->inode->i_sb is 0x6b6b6b6b, so inode has been freed before cleaner frees it.
3004     Answer: lafs_refile was dereferencing ->inode when it wasn't safe.
3005      Need to at least have a parent before it is safe.
3006
3007   Hang:
3008      soft lockup cleaner->lafs_iget->ifind_fast ....
3009     Then (may be caused)
3010 Oh dear: [cfb63670]284/0(0)r1E:IOLock,PhysValid cleaning(1)
3011 .......: [cfb2ad20]284/0(2198)r2F:Index(1),Pinned,Phase1,InoIdx,Valid,PhysValid{0,0}[0] child(1) leaf(1)
3012 Why have I no credits? [cfb2ad20]284/0(2198)r2F:Index(1),Pinned,Phase1,InoIdx,Valid,Dirty,PhysValid{0,0}[0] child(1) leaf(1)
3013 ------------[ cut here ]------------
3014 kernel BUG at /home/neilb/work/nfsbrick/fs/module/block.c:656!
3015
3016     It seems the cleaner gets confused and goes spinning.
3017
3018
3019   So: space problems:
3020     After the run, we have -14 used and 2055 available (of 4608), and
3021     cannot create anything.
3022     4 segments are free, one is cleanable.
3023    free_blocks=2103 allocated=56 max_seg=512 clean_reserved=0
3024 or
3025    free_blocks=1722 allocated=64 max_seg=512 clean_reserved=0
3026 or
3027    df: tot=4608 free=4630 avail=977(1033-56) cb=10 pb=0 ab=-32
3028    free_blocks=1033 allocated=56 max_seg=512 clean_reserved=0
3029    and very little free
3030
3031   ablocks_used is going negative - why?
3032    Probably we erase a dblock without clearing Prealloc.
3033    Then when Prealloc later gets cleared, ablocks_used is
3034    wrongly decremented.... no...
3035
3036
3037 10aug2009  (don't forget above problems)
3038   Another problem.
3039    read->touch_atime->dirty_inode->inode_fillblock->dirty_dblock
3040      getiref_lock triggers BUG.
3041    This is presumably because I have just fixed it to get the correct
3042      iblock and not the iblock of the filesystem.
3043
3044   FIXME I hacked around this but I'm not sure the result is right.
3045     The question is about when the InoIdx should be dirty and when
3046     the inode data block should be dirty.
3047    In this particular case we are writing a page of a small file.
3048      cluster_allocate calls flush_data_to_inode which tried to dirty
3049      the inode dblock but finds that iblock is not pinned...
3050      When we dirty a data page we aren't pinning the parent!
3051    That might be OK - we only need to count and reserve the parent.
3052     We don't need to pin it until it becomes dirty.
3053
3054    Still need to resolve when which block gets to be dirty, and also
3055     exactly when an index block needs to be pinned.  And how does that
3056     relate to holding a ref on the inode when the inoidx is pinned.
3057     Maybe it should be when the inoidx is referenced.
3058    FIXME
3059
3060 11aug2009
3061    Another problem. unlink->handle_orphans->erase_dblock->allocated_block
3062     and get a zero from lafs_add_block_address but parent is not pinned.
3063   And... On unmount, orphan file still has pinned blocks so the inode
3064     isn't free.
3065   And ... root still old phase after lots of 'rm' then sync.
3066     Inode 244 has pinned inode block held by writepage0 and writepage
3067          this is adir/170
3068
3069 13aug2009
3070   - lots of bugs introduced by change to marking inode blocks dirty:
3071      writepage/cluster_allocate wants to Dirty inode data block with no credits.
3072          because I put credit in iblock!
3073
3074   - ohhh.... The phase contour is broken.  When a block is added to a
3075     cluster for allocation it isn't in the phaseleafs any more, but prevents
3076     its parent from joining.  So we cannot assume that if dblock is on
3077     list then iblock or a child will be too.
3078     So when we find dblock we do need to remove it.... done that.
3079
3080   - root not changing because Data 1/0 is Pinned and IOPending
3081      and held by writepage!!
3082      Problem is that IOPending blocks aren't put back on lru.
3083      But that should only be blocks on the cluster list.....
3084      But that is where I am putting it.
3085      Maybe I need exclusion between checkpointing and any other
3086        code that writes to checkpoint so checkpoint can wait
3087        for that ... can we use wc->lock??  That doesn't lock
3088        against cleaner, but that isn't a problem...
3089    But now 0/228 is still pinned and in writepage and IOPending
3090     So there is more to it than that.
3091     When checkpoint finds an IOLocked block, it might be about to
3092      join a cluster, in which case we don't really want to wait, or it
3093      might be undergoing incorporation in which case we want to wait.
3094      or it could be being erased, so wait..
3095      Maybe I wait until it appears on some list.... yes.
3096
3097 14aug2009
3098     At unmount Index 8/0 with child and leaf is still pinned
3099   This was pinned: [cfb29810]8/0(9)r3F:Index(1),Pinned,Phase0,InoIdx,Valid,SegRef,C,CI,CN,CNI,PhysValid{0,0}[0] release(1) child(1) leaf(1)
3100
3101   and..
3102
3103   A problem is that something goes wrong in the erase process.
3104   We find new children after we erase the inoidx block!
3105
3106   This was pinned: [cfb3d810]8/0(9)r284016F:Index(1),Pinned,Phase0,InoIdx,Valid,SegRef,C,CI,CN,CNI,PhysValid{0,0}[0] release(1) child(1) leaf(284014)
3107
3108   When/how do we erase indexblock and particularly inoidx blocks?
3109   Does an inValid InoIdx simply mean there is no indexing and does not
3110   reflect on the Data block?
3111
3112 .xlooping on [cfbe28c0]331/0(0)r2F:Index(1),Pinned,Phase1,InoIdx,Valid,Dirty,SegRef,C,CI,CN,CNI,UninCredit,PhysValid{0,0}[0] inode_handle_orphan(1) leaf(1)
3113
3114  Orphan problem:
3115 nextfree = 0
3116 reserved = 0
3117 VFS: Busy inodes after unmount of hdb. Self-destruct in 5 seconds.  Have a nice day...
3118 This was pinned: [cfb2ad20]0/0(13)r5F:Index(1),Pinned,Phase1,InoIdx,Valid,Root,C,CI,CN,CNI,PhysValid{0,0}[0] NP release(1) child(3) leaf(1)
3119 [cfb2ad20]0/0(13)r5F:Index(1),Pinned,Phase1,InoIdx,Valid,Root,C,CI,CN,CNI,PhysValid{0,0}[0] NP release(1) child(3) leaf(1) Leaf1(0)
3120   [cfa5374c]0/1(772)r0E:Valid,Dirty,UninCredit,PhysValid
3121   [cfb54430]0/8(775)r0E:Valid,Dirty,SegRef,UninCredit,PhysValid
3122   [cfb54c90]0/16(777)r0E:Valid,Dirty,SegRef,C,CI,UninCredit,PhysValid
3123
3124 nextfree = 1
3125 reserved = 0
3126   0: 1 0 0 304
3127 VFS: Busy inodes after unmount of hdb. Self-destruct in 5 seconds.  Have a nice day...
3128 This was pinned: [cfb273b0]8/0(9)r3F:Index(1),Pinned,Phase1,InoIdx,Valid,SegRef,C,CI,CN,CNI,PhysValid{0,0}[0] release(1) child(1) leaf(1)
3129 [cfb273b0]8/0(9)r3F:Index(1),Pinned,Phase1,InoIdx,Valid,SegRef,C,CI,CN,CNI,PhysValid{0,0}[0] release(1) child(1) leaf(1) Leaf1(0)
3130   [cfb74c90]8/0(2404)r1E:Valid,SegRef,CN,PhysValid orphan(1)
3131 badcnt 0 0 [cfb268c0]0/0(13)r4E:Index(1),Pinned,Phase1,InoIdx,Valid,Root,C,CI,CN,CNI,PhysValid{0,1}[0] NP child(4)
3132
3133
3134 erase Inoidx [ce5ab3b0]172/0(402)r1F:Index(1),InoIdx,Valid,IOLock,OnFree,PhysValid[0] inode_handle_orphan(1)
3135 erase Inoidx [ce5ab5e0]74/0(0)r2F:Index(1),Pinned,Phase0,WPhase0,InoIdx,Valid,Dirty,SegRef,C,CI,CN,CNI,UninCredit,IOLock,PhysValid{0,0}[0] child(1) leaf(1)
3136 ------------[ cut here ]------------
3137 WARNING: at /home/neilb/work/nfsbrick/fs/module/block.c:579 lafs_erase_iblock+0x
3138 unlink/orphan/erase_dblock_allocated_block
3139 ---[ end trace 61b8bd59512ea4da ]---
3140 zz [ce50d6a8]74/1059005010(0)r1E:SegRef,C,CI,UninCredit,IOLock,PhysValid,Orphan(0) orphan(1)
3141    [ce5ab5e0]74/0(0)r1E:Index(1),InoIdx,SegRef,C,CI,CN,CNI,UninCredit,PhysValid[0] child(1)
3142    [ce5ab5e0]74/0(0)r1E:Index(1),InoIdx,SegRef,C,CI,CN,CNI,UninCredit,PhysValid[0] child(1)
3143 ------------[ cut here ]------------
3144 kernel BUG at /home/neilb/work/nfsbrick/fs/module/index.c:1955!
3145
3146 BINGO.  When we remove last entry from directory we erase the InoIdx block,
3147  then when we add entries, we hit problems.
3148
3149
3150 nextfree = 3
3151 reserved = 0
3152   0: 1 0 0 306
3153   1: 1 0 0 307
3154   2: 1 0 0 74
3155 VFS: Busy inodes after unmount of hdb. Self-destruct in 5 seconds.  Have a nice day...
3156
3157 This was pinned: [cfb2caf0]16/0(10)r3F:Index(1),Pinned,Phase1,InoIdx,Valid,SegRef,C,CI,CN,CNI,PhysValid{0,0}[0] release(1) child(1) leaf(1)
3158 [cfb2caf0]16/0(10)r3F:Index(1),Pinned,Phase1,InoIdx,Valid,SegRef,C,CI,CN,CNI,PhysValid{0,0}[0] release(1) child(1) leaf(1) Leaf1(0)
3159   [ce9794f0]16/0(2200)r1E:Valid,SegRef,CN,CNI,PhysValid cleaning(1)
3160
3161 This was pinned: [cfb2d3b0]8/0(9)r3F:Index(1),Pinned,Phase1,InoIdx,Valid,SegRef,C,CI,CN,CNI,PhysValid{0,0}[0] release(1) child(1) leaf(1)
3162 [cfb2d3b0]8/0(9)r3F:Index(1),Pinned,Phase1,InoIdx,Valid,SegRef,C,CI,CN,CNI,PhysValid{0,0}[0] release(1) child(1) leaf(1) Leaf1(0)
3163   [ce968c90]8/0(2175)r3E:Valid,SegRef,C,CI,CN,CNI,PhysValid orphan(3)
3164
3165 This was pinned: [cfb2d180]1/0(8)r3F:Index(1),Pinned,Phase1,InoIdx,Valid,SegRef,C,CI,CN,CNI,PhysValid{0,0}[0] release(1) child(1) leaf(1)
3166 [cfb2d180]1/0(8)r3F:Index(1),Pinned,Phase1,InoIdx,Valid,SegRef,C,CI,CN,CNI,PhysValid{0,0}[0] release(1) child(1) leaf(1) Leaf1(0)
3167   [ce968430]1/0(2174)r1E:Valid,SegRef,C,CI,PhysValid cleaning(1)
3168
3169 We have stray 'cleaning' references.
3170 It is taken -
3171    on a data block that was in a to-clean segment
3172      at which point we igrab the inode
3173      the block is put on the ->cleaning list.
3174 It is put:
3175    when we get an error finding the block
3176    when we find that it isn't in the segment
3177    when an error occurs loading the block-to-be-relocated
3178    and when we mark that block for cleaning.
3179   i.e. always unless we got EAGAIN or some space error.
3180    If we still hold some blocks, try_clean returns 0.
3181
3182 VFS: Busy inodes after unmount of hdb. Self-destruct in 5 seconds.  Have a nice day...
3183 This was pinned: [cfb253b0]0/0(13)r5F:Index(1),Pinned,Phase1,InoIdx,Valid,Root,C,CI,CN,CNI,PhysValid{0,0}[0] NP release(1) child(3) leaf(1)
3184 [cfb253b0]0/0(13)r5F:Index(1),Pinned,Phase1,InoIdx,Valid,Root,C,CI,CN,CNI,PhysValid{0,0}[0] NP release(1) child(3) leaf(1) Leaf1(0)
3185   [cfa57b7c]0/1(2228)r0E:Valid,Dirty,UninCredit,PhysValid
3186   [ce5a4430]0/8(2231)r0E:Valid,Dirty,UninCredit,PhysValid
3187   [ce5a4c90]0/16(1028)r0E:Valid,Dirty,SegRef,C,CI,UninCredit,PhysValid
3188
3189 NOTE these inode data blocks are not pinned and so did not get written!!
3190
3191 FIXME I should wait for the checkpoint to finish
3192 nextfree = 1
3193 reserved = 0
3194   0: 1 0 0 301
3195 VFS: Busy inodes after unmount of hdb. Self-destruct in 5 seconds.  Have a nice day...
3196 This was pinned: [cfb27810]8/0(9)r3F:Index(1),Pinned,Phase0,InoIdx,Valid,SegRef,C,CI,CN,CNI,PhysValid{0,0}[0] release(1) child(1) leaf(1)
3197 [cfb27810]8/0(9)r3F:Index(1),Pinned,Phase0,InoIdx,Valid,SegRef,C,CI,CN,CNI,PhysValid{0,0}[0] release(1) child(1) leaf(1) Leaf0(0)
3198   [ce564c90]8/0(3983)r1E:Valid,SegRef,PhysValid orphan(1)
3199
3200 16Aug2009
3201   When I clean and find an inode that is already deleted, I need to be
3202   very careful not to resurrect anything.. I wonder if I am.... Yes, I seem
3203   to be.  lafs_delete_inode gets called a lot, but mostly for dead inodes.
3204
3205   BUGS:
3206 FIXED orphans don't get cleaned up.  It seems a 'create' fails and leaves
3207       an orphan block un-released.
3208    - sometimes inodes 1,8,16 didn't get written out as they were dirty but not pinned
3209    - Not sure that we handle complete truncation, then adding blocks properly.
3210      - what should the state of the InoIdx block be?
3211    - On remount, the filesystem contains rubbish.
3212    - create fails even when there should be free space.
3213    - sometimes BUG in checkpoint.c - not finishing checkpoint properly...
3214    - iblock not valid for inode 327 under cluster_flush/lafs_allocated_block
3215           and 74 has similar issue
3216      327 = adir/big1   74=adir
3217
3218
3219 17Aug2009
3220   Segusage blocks aren't always Pinned when we make them dirty.
3221   Yes. That is correct.  They are not forced out by phase change but by
3222   lafs_seg_flush_all at the end of a checkpoint.  So they need to be
3223   preallocated, but not Pinned.
3224   But, once we have finished the last checkpoint we don't want to
3225   dirty Segusage blocks any more.. I wonder if we are.
3226   No, but we were Pinning inodes without PinPending and they
3227   lost the pinning straight away!
3228
3229   OK, other annoyance.
3230    InoIdx block and similar are getting erased at the wrong
3231    time.
3232    We can only safely erase them when they have no children.
3233    I guess what we really want is that incorporation leaves them
3234    existing but empty, and when we go to write them out, if they
3235    are empty we register an address of 0.
3236    When we drop the ->parent pointer of an Index block it
3237    just goes away...
3238    So:
3239     When incorporate or truncate produces an empty index block
3240      it simply clears B_Valid.
3241     When incorporate wants to add to an index block, we set B_Valid
3242     When cluster_allocate gets a non-Valid index block it calls
3243     block_allocated with phys of 0.
3244
3245     Yes, that seems to work.  Mostly
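
  A toy model of that B_Valid scheme, to check the three rules hang
  together.  struct iblk and the iblk_* helpers are invented for this
  sketch; 'valid' stands in for B_Valid and 'phys' for the address
  passed to block_allocated.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct iblk {
    bool     valid;     /* stands in for B_Valid */
    unsigned nr_addrs;  /* addresses currently held */
    uint64_t phys;      /* address last registered with the parent */
};

/* incorporate/truncate: removing the last address just clears Valid. */
static void iblk_remove_addr(struct iblk *ib)
{
    assert(ib->nr_addrs > 0);
    if (--ib->nr_addrs == 0)
        ib->valid = false;
}

/* incorporate: adding to an index block sets Valid again. */
static void iblk_add_addr(struct iblk *ib)
{
    ib->nr_addrs++;
    ib->valid = true;
}

/* cluster_allocate: a non-Valid index block registers a phys of 0
 * instead of taking the next real address. */
static uint64_t iblk_allocate(struct iblk *ib, uint64_t next_phys)
{
    ib->phys = ib->valid ? next_phys : 0;
    return ib->phys;
}
```

  The block itself never disappears in this model; it only flips
  between Valid and non-Valid, which is the point of the change.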
3246
3247 18Aug2009
3248   On remount, check_credits dies: 16/20-0
3249     In pin_dblock/reserve/seg_ref/prealloc/space_return ?? in lafs_mount.
3250
3251 19Aug2009
3252   OK, this index block clearing is a mess.  There must be a neat model I can
3253   follow that will make it "just work".
3254   The key seems to be children.  If an index block has children, then it
3255   really must exist.  If it has no children and no content, then it can
3256   be discarded, in which case it needs to be unlinked from its sibling list.
3257   What locking do we use here?  Probably IOLock on the parent index block.
3258   So we need iolock while looking in a parent for children, and we take
3259   IOLock while incorporating or pruning.
3260   Once the empty index block has dropped out it will never be found again.
3261   When we incorporate the zero address, the index block becomes invisible
3262   unless it is shortly after its predecessor in the sibling list.  But
3263   that is hard to ensure, especially if the first child is the one that
3264   is being erased.  So if an index block is erased, then it must be
3265   discarded quickly and any children need to be relocated...
3266   Or maybe not.... maybe if there are children, we just write an empty block?
3267
3268 22Aug2009
3269   We need better locking of the index information.
3270   It seems best to use IOLock as that is already held during incorporation.
3271   So any code that accesses or updates an index block must hold IOLock.
3272   This might be a bit of a restriction if we try to do a lookup while
3273   writeout is happening.... Maybe we need a separate writeback flag for that.
3274   But I think it is good to use IOLock for now.
3275   Places we need this are:
3276      flush_data_to_inode needs to lock the InoIdx block
3277        - DONE
3278      lafs_leaf_find as it recurses down.  This should return a locked leaf.
3279        - DONE
3280      callers of clear_index
3281          erase_dblock for depth=0??
3282        - DONE
3283      incorporate should lock new blocks for consistency
3284        - DONE
3285
3286    Locking dependency rule is that if we hold a lock, we are allowed to
3287    lock a child index block, but not a parent.  If we hold a data block,
3288    we are allowed to lock an index block.
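
  One possible reading of that dependency rule, as a checker that could
  sit behind a debug assertion.  struct lblk, is_ancestor and may_lock
  are hypothetical; in particular, "child" is taken here to mean any
  index block below the held one, which the notes do not spell out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct lblk {
    bool         is_index;
    struct lblk *parent;   /* NULL at the top of the tree */
};

/* True if 'anc' is a proper ancestor of 'b'. */
static bool is_ancestor(const struct lblk *anc, const struct lblk *b)
{
    for (b = b->parent; b != NULL; b = b->parent)
        if (b == anc)
            return true;
    return false;
}

/* May we take IOLock on 'want' while already holding 'held'? */
static bool may_lock(const struct lblk *held, const struct lblk *want)
{
    assert(held != want);
    if (!held->is_index)
        return want->is_index;  /* data held: an index block is allowed */
    /* index held: only an index block below us, never a parent */
    return want->is_index && is_ancestor(held, want);
}
```

  Walking down from the InoIdx (as lafs_leaf_find does) passes this
  check at every step; walking up from an index block never does.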
3289
3290
3291   The read/write completion seems all wrong.  It unlocks if the page was locked,
3292    and that isn't really safe, because it might not have been locked for read..
3293    We need to flag block0 to say if lock or writeback need to be cleared.
3294    Given that, I don't need IOPending any more:
3295     Read: We submit all reads, then set 'do_unlock', then check if we should unlock.
3296     Write: We queue all writes, then set 'do_clear_writeback', then check.
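
  The read side of that completion scheme, sketched with C11 atomics.
  This is an assumption about how the 'do_unlock' flag could be used,
  not the module's code: struct page_io and the pio_* names are made up,
  and 'unlocked' stands in for actually unlocking the page.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

struct page_io {
    atomic_int  pending;    /* reads submitted but not yet completed */
    atomic_bool do_unlock;  /* set once every read has been submitted */
    atomic_bool unlocked;   /* stands in for the page being unlocked */
};

/* Unlock only when submission has finished and nothing is pending.
 * atomic_exchange makes sure only one caller performs the unlock.
 * Returns true if this call did it. */
static bool pio_check_unlock(struct page_io *p)
{
    if (!atomic_load(&p->do_unlock) || atomic_load(&p->pending) != 0)
        return false;
    return !atomic_exchange(&p->unlocked, true);
}

/* Called once per block read that is submitted. */
static void pio_submit(struct page_io *p)
{
    atomic_fetch_add(&p->pending, 1);
}

/* Called from the completion path of each read. */
static bool pio_complete(struct page_io *p)
{
    int before = atomic_fetch_sub(&p->pending, 1);

    assert(before > 0);
    (void)before;
    return pio_check_unlock(p);
}

/* Called by the submitter after the last read has been queued. */
static bool pio_submit_done(struct page_io *p)
{
    atomic_store(&p->do_unlock, true);
    return pio_check_unlock(p);
}
```

  Whichever of "last completion" and "submission finished" happens
  second does the unlock, so a page that was never locked for read is
  never unlocked by accident.  The write side would be the same shape
  with 'do_clear_writeback'.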
3297
3298   Now... can we use a writeback flag to avoid waiting to read while writeout
3299   is happening?  We would need:
3300      set writeback in cluster_allocate
3301      wait_writeback after some lock_block
3302      clear_writeback when writeout finishes.
3303      Extra checks where we already check for IOLock
3304
3305
3306 24aug2009
3307  Lots of progress but....
3308    cluster_flush calls cluster_done calls refile calls iput calls
3309     drop_inode calls write_inode_now calls writepage calls cluster_flush
3310   and we get a locking loop.
3311    I think we need to run that cluster_done from a different thread.
3312
3313
3314  We seem to have a refcnt problem with segsum.
3315
3316 25aug2009
3317  Lots more progress but.....
3318
3319   orphan_release is finding that the orphan block has no credits.
3320   We can allocate credits and simply not do the update if they
3321   are not available:  having an extra entry in the orphan file isn't
3322   a problem.  However we need some mechanism to clean up other than
3323   waiting for a remount..
3324   I think we leave that until we redo orphan handling.
3325
3326  and: adir sometimes loses one block so it and the contents don't get
3327    deleted.
3328
3329  and: it seems we sometimes try to clean the segment being written
3330    to.  We must avoid that.
3331
3332  (long ago I wrote::
3333   FIXME When pin fails, we need to remove PinPending from everything!!!
3334  and never followed up ... I wonder?
3335  )
3336
3337 25Aug2009
3338  Orphan handling.
3339   Every orphan block goes on a per-fs list and gets removed only
3340   if the B_Orphan bit is clear.
3341   There are two times when we want to expedite orphan handling.
3342   1/ on rmdir we need to know if the directory is really empty.
3343      This requires that we expedite the orphan handling of all
3344      blocks.  As soon as we find a non-orphan, we can give up.
3345      Then we need to make sure the index tree has collapsed.  We
3346      can borrow that code from truncate.
3347
3348   2/ When writing past Trunc_next.  We just pass the block to
3349      special orphan handling.
3350
3351   This requires that orphan handling is re-entrant.
3352   For dir, that is protected by i_mutex, but rmdir needs to come
3353    in under the radar.
3354   For trunc, the iolock on the index blocks should be enough.
3355   I wonder if IOLock can be used on dir as well... allowing
3356   parallel orphan handling in the one dir even!!.
3357
3358   We need to ensure exclusion of orphan handling, including:
3359       - only one orphan handler at a time
3360       - don't run orphan handler while still processing action
3361         that makes it an orphan.
3362   Maybe if we just use IOLock for that?  Does that work?  Maybe
3363   but it gets messy for directories (on first attempt anyway).
3364   For directories we can just use i_mutex.
3365   Maybe i_mutex for files as well?
3366
3367 27Aug2009
3368   Orphan handling is going well... but not perfect.
3369   I'm using IOLock to ensure exclusion for orphan handling.
3370   However:
3371     I'm not really implementing that on directories
3372     Inodes go bad because lafs_erase_dblock needs the lock too.
3373     The call from rmdir will always fail because we hold i_mutex.
3374
3375   Bigger problem.  I'm IOLocking inodes across checkpoints to preserve
3376    Orphan status.  But that might stop the checkpoint proceeding.
3377    .. so use i_mutex, not IOLock - fine.
3378
3379   Now... it seems I've confused myself.  Orphans don't get handled
3380   immediately.  In particular, inodes should not be handled until
3381   the final delete_inode.  So setting the B_Orphan flag and putting
3382   on the list are two separate events.  The flag must come first,
3383   but the list may come much later.  So some of that mucking around
3384   with i_mutex is pointless.
3385   So:
3386     make_orphan makes sure it is in orphan file, sets bit, and removes
3387       from list (if present).
3388     add_orphan puts it on the list for handling.
3389
3390     For inodes: lafs_new_inode sets the bit and delete_inode puts on queue,
3391         as does any unlink/rmdir/rename that fails.
3392
3393     For directories: put it on list in commit/abort.
3394
3395
3396   And...
3397     I hit the BUG where find_leaf wants an address of 0.
3398       If an index block gets cleaned out it doesn't disappear
3399       immediately.. there is no leaf to find in that direction.
3400       We probably need to avoid non-Valid blocks or something...
3401   And...
3402     Orphans 0/299 to 0/329 and  0/280 are still on the list
3403      but are not orphans.
3404      Maybe I need to catch mutex_unlock to run the orphans??
3405   And...
3406     We underflow a segment through orphans at unmount.
3407       We are cleaning and truncating at the same time.
3408       The same block gets allocated to 0 and to 1225
3409       in quick succession.
3410       Problem is that we apply new address while in writeback
3411       so a new lafs_allocated_block
3412
3413 29Aug2009
3414
3415   Review of inodes in orphan list:
3416     lafs_new_inode makes an orphan for a non-existent inode.
3417     If the inode cannot be created, orphan_release is called.
3418     If it can, a 'struct inode' is filled in with valid type
3419     and nlink==1 (!!) and attached.  The inode will only be
3420     detached when the refcnt hits 0, and the orphan list implies
3421     a refcount, so if we ever find something on the orphan list
3422     with a NULL my_inode, it must be very new and can be ignored.
3423
3424     When we find an inode block with a my_inode there are a few options:
3425       if I_Trunc is set, we must progress truncation providing we can
3426             get the i_mutex
3427       else if I_Deleting we must delete the inode
3428       else if nlink is 0, we remove from the list
3429       else nlink > 0 and we must remove orphan status.
3430     This means that if nlink is elevated, we need to be holding the mutex...
3431     So don't elevate nlink any more...
3432
3433     When nlink becomes non-zero the block needs to be put back on the
3434     orphan list (it must already be an orphan).  Also when we set
3435     I_Deleting or I_Trunc it must go on the list.
3436    .. OK, I think I have all of that.
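
    The review above boils down to a small decision table.  A minimal
    sketch in C, assuming a made-up 'struct orphan_state' and action
    names (none of these are the real LaFS types), and assuming that a
    truncation which cannot get i_mutex is simply retried later, which
    the notes do not actually spell out:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical snapshot of what the orphan handler looks at.
 * These field names are illustrative, not the real LaFS structures. */
struct orphan_state {
    bool has_inode;     /* my_inode is non-NULL */
    bool i_trunc;       /* I_Trunc is set */
    bool i_deleting;    /* I_Deleting is set */
    unsigned nlink;     /* link count of the inode */
    bool got_i_mutex;   /* whether i_mutex could be taken right now */
};

enum orphan_action {
    ORPHAN_IGNORE,          /* NULL my_inode: very new, can be ignored */
    ORPHAN_RETRY_LATER,     /* I_Trunc set but no i_mutex (assumed policy) */
    ORPHAN_TRUNCATE,        /* progress the truncation */
    ORPHAN_DELETE,          /* delete the inode */
    ORPHAN_DROP_FROM_LIST,  /* nlink == 0: just remove from the list */
    ORPHAN_CLEAR_STATUS     /* nlink > 0: remove orphan status */
};

/* Pick the next step for one inode block found on the orphan list,
 * testing the conditions in the same order as the notes. */
static enum orphan_action orphan_next_action(const struct orphan_state *s)
{
    assert(s != NULL);
    if (!s->has_inode)
        return ORPHAN_IGNORE;
    if (s->i_trunc)
        return s->got_i_mutex ? ORPHAN_TRUNCATE : ORPHAN_RETRY_LATER;
    if (s->i_deleting)
        return ORPHAN_DELETE;
    if (s->nlink == 0)
        return ORPHAN_DROP_FROM_LIST;
    return ORPHAN_CLEAR_STATUS;
}
```

    The order of the tests matters: I_Trunc wins over I_Deleting, and
    nlink is only consulted when neither flag is set.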
3437
3438
3439 30Aug2009.
3440    I have some weirdness that seems to be caused by the orphan stuff,
3441    probably due to it all being async now.
3442    - A deleted inode clears I_Trunc and then sets it again.  The only
3443      explanation seems to be that delete_inode is being called again,
3444      so I must be igrabbing it again, maybe from cleaning.
3445    - bits of directories aren't getting deleted.  Sometimes single
3446      blocks, though the referred files are deleted.  Sometimes
3447      the whole directory... More interestingly, those blocks then
3448      don't get cleaned, so something about them means that they
3449      don't get deleted and don't get cleaned either.
3450
3451    Even weirder... I just had a case where file 331 had a different
3452    index block for every 4 data blocks...
3453
3454
3455    FIXME:
3456     - What stops pinned blocks from being flushed by bdflush in middle
3457       of operation and so losing allocation?  Must make sure to set
3458       them dirty very late.
3459     - orphan_release can fail, so must make sure we can always call
3460       it, even if my_inode is NULL.... but how?
3461
3462
3463     - make_orphan could fail due to lack of space, which is not OK.
3464       I made it loop, but I'm not 100% sure that is right... it isn't.
3465       I need to pass down the 'I'm freeing space' flag, and I need to
3466       not require Credit of Dirty is set, etc.
3467
3468
3469     - I seem to have a deadlock at unmount.
3470        umount is waiting for lafs_checkpoint_lock_wait in
3471           lafs_put_super
3472        pdflush is in down_read in sync_supers
3473        lafs_cleaner is iget_locked/ifind_fast/inode_wait
3474                 This is waiting for I_LOCK to be clear.
3475
3476
3477 31Aug2009
3478   - When a file shrinks and becomes level-0, make sure
3479     old addresses get deallocated.  I seem to have
3480     a directory where they didn't.
3481
3482   - Due to the fact that we over-preallocate, we really shouldn't
3483     return ENOSPC until we have flushed dirty data and performed
3484     a checkpoint??
3485
3486
3487   - When I removed the last index from an inode
3488     (Indirect type) it seems that I didn't write
3489     out the corrected block..??
3490
3491 1sep2009
3492  I ran my simple test run repeatedly overnight.
3493  It ran 208 times before I stopped it.
3494  There are 3 possible failure modes:
3495    1/ didn't complete within 500 seconds
3496    2/ triggered a BUG
3497    3/ appeared to complete, but the number of blocks
3498       in use was not the correct '7'.
3499
3500  74 (35%) did not fail!
3501  31 (15%) did not complete
3502  40 (19%) triggered a BUG
3503  2 did not complete but did not trigger a bug
3504
3505  94 of those that failed did not have a BUG
3506  92 actually completed.  Of these:
3507       1 final blocks 1
3508       1 final blocks 110
3509       1 final blocks 23
3510       2 final blocks 12
3511       5 final blocks 0
3512       6 final blocks 10
3513      11 final blocks 8
3514      21 final blocks 11
3515      44 final blocks 9
3516
3517  of the BUGs,
3518        1 BUG: sleeping function called from invalid context at kernel/nsproxy.c:217
3519       1 BUG: spinlock lockup on CPU#0, rm/1330, cfb2dae4
3520       1 kernel BUG at /home/neilb/work/nfsbrick/fs/module/block.c:485!
3521       1 kernel BUG at /home/neilb/work/nfsbrick/fs/module/cluster.c:1219!
3522       1 kernel BUG at /home/neilb/work/nfsbrick/fs/module/inode.c:821!
3523       2 BUG: soft lockup - CPU#0 stuck for 61s! [lafs_cleaner:1177]
3524       3 kernel BUG at /home/neilb/work/nfsbrick/fs/module/segments.c:1028!
3525       3 kernel BUG at /home/neilb/work/nfsbrick/fs/module/segments.c:351!
3526       5 kernel BUG at /home/neilb/work/nfsbrick/fs/module/lafs.h:276!
3527       6 kernel BUG at /home/neilb/work/nfsbrick/fs/module/block.c:529!
3528       7 BUG: unable to handle kernel paging request at 6b6b6bfb
3529      11 kernel BUG at /home/neilb/work/nfsbrick/fs/module/super.c:655!
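
  The tallies above can be cross-checked mechanically.  A small sketch
  in C; the only number not taken directly from the lists is the '2'
  secondary reports, which is an assumption that the spinlock lockup
  and the sleeping-function report each came from a run that had
  already hit another BUG:

```c
#include <assert.h>
#include <stddef.h>

/* Add up a list of per-category run counts. */
static int sum_counts(const int *v, size_t n)
{
    int total = 0;
    for (size_t i = 0; i < n; i++)
        total += v[i];
    return total;
}

/* Runs per 'final blocks' value, as listed in the notes. */
static const int final_block_runs[] = { 1, 1, 1, 2, 5, 6, 11, 21, 44 };
/* Reports per BUG message, as listed in the notes. */
static const int bug_reports[] = { 1, 1, 1, 1, 1, 2, 3, 3, 5, 6, 7, 11 };

/* Returns 1 if the figures in the notes agree with each other. */
static int tallies_consistent(void)
{
    const int total = 208, did_not_fail = 74;
    const int failed_without_bug = 94, hung_without_bug = 2;
    const int secondary_reports = 2;   /* assumption, see lead-in */

    const int failed = total - did_not_fail;               /* 134 */
    const int bugged = failed - failed_without_bug;        /* 40  */
    const int completed_wrong = failed_without_bug - hung_without_bug;

    int finals = sum_counts(final_block_runs,
                            sizeof final_block_runs / sizeof final_block_runs[0]);
    int reports = sum_counts(bug_reports,
                             sizeof bug_reports / sizeof bug_reports[0]);

    return bugged == 40
        && completed_wrong == finals
        && reports - secondary_reports == bugged;
}
```

  So 134 runs failed, 40 of them with a BUG, and the 92 that completed
  with a wrong block count match the 'final blocks' table exactly.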
3530
3531
3532  super.c:655 is "block is still pinned" at unmount time.
3533   The block was always an InoIdx with a child.
3534   Either inode 0 or 16.
3535   child is held by various things:
3536       [cfb555cc]16/1(2098)r131E:Valid,Async,SegRef,CN,CNI,UninCredit,PhysValid async(1) clean2(130)
3537       [cfb554f0]16/0(1050)r25E:Valid,SegRef,CN,CNI,PhysValid clean2(25)
3538       [cfa57c58]0/2(3676)r0E:Valid,Dirty,UninCredit,PhysValid
3539       [cfa5bc58]0/2(3110)r0E:Valid,Dirty,UninCredit,PhysValid
3540       [ce5b94f0]16/0(519)r1E:Valid,Async,SegRef,CN,CNI,PhysValid async(1)
3541       [cfb4d4f0]16/0(4249)r1E:Valid,Async,SegRef,CN,CNI,PhysValid async(1)
3542       [ce5ad4f0]16/0(612)r1E:Valid,Async,SegRef,CN,CNI,PhysValid async(1)
3543       [ce5c2fc8]0/74(0)r129E:SegRef,C,Claimed,PhysValid clean2(129)
3544       [cfa57c58]0/2(1895)r0E:Valid,Dirty,UninCredit,PhysValid
3545       [cfb4d5cc]16/1(4543)r105E:Valid,SegRef,CN,CNI,UninCredit,PhysValid clean2(105)
3546       [ce5754f0]16/0(1290)r178E:Valid,SegRef,CN,CNI,PhysValid clean2(178)
3547
3548  The "unable to handle kernel paging request" is always in
3549  umount.
3550      invalidate_inode_buffers(26/46)/lock_acquire
3551
3552
3553  block.c:529
3554     This is iblock valid when erasing a block
3555     The block we are erasing is always 0/327 or 0/328.  It is
3556     an orphan we are handling, iolocked but not always pinned
3557
3558  lafs.h:276
3559     Map an iblock which is not IOLocked
3560        always in lafs_clear_index for the InoIdx block for a directory
3561        which is in Writeback.
3562        Call is in lafs_allocated_block from cluster_flush.
3563
3564  segments.c:351
3565     seg_inc reduces seg usage below 0
3566       - lots of blocks (inode 327) that were cleaned, were then erased twice.
3567       - 2 blocks (inode 328) were erased twice, both from prune
3568       - ditto
3569
3570  segments.c: 1028
3571      The free list is empty.... odd as only first segment is currently
3572      in use.
3573
3574  soft lockup:
3575      Still orphan: 0/328  Index(1) is in Writeback and Dirty
3576        again inode_handle_orphan2 is in Writeback
3577
3578  inode.c:821
3579      inode_handle_orphan at end, child list is not empty.
3580        The children seem to be in Realloc - cleaner needs to let go.
3581
3582  cluster.c:1219
3583      my_inode is null while cluster_flush an inode and want to set
3584         WritePhase.
3585
3586
3587  block.c:485
3588      no ICredit for unincredit in dirty_dblock from dir_delete_commit
3589      from lafs_unlock.
3590
3591
3592  spinlock lockup is subsequent to real bug
3593  ditto for sleeping function.
3594
3595  Of the '44' which claimed final blocks of 9, 14 really had 7, and 4
3596  appear to have other strange values....
3597
3598  A selected '9' has two extra blocks for the directory '74'.
3599  But that directory is long gone.
3600  These dir blocks are currently fully populated with numbers.
3601  This seems to be the pattern with all non-7 blocks.
3602
3603
3604  02Sep2009
3605   Found a problem, possibly related to the dir blocks not being
3606   cleaned up.
3607   When lafs_incorporate sets ->depth to 1 it doesn't dirty the inode,
3608   so that fact is never copied in to the datablock.
3609   On further exploration, the I_Dirty bit is set but never used, which
3610   isn't good.
3611   So: exactly when do we copy inode into datablock, and what do we do
3612   when dirty_inode is called (if anything).
3613   We could just set I_Dirty when dirty_inode is called, checking that
3614   the block is Pinned which it usually will be.
3615   Then we copy inode to data just before writing data block.
3616   However that defeats transactional properties.  We need to copy in the
3617   same transaction, and that means either straight away, or when
3618   the data block's phase changes.
3619   So dirty_inode either copies to the block, or sets I_Dirty.
3620   When lafs_refile unpins an inode data block, it needs to check
3621   I_Dirty and possibly re-dirty it.
3622
3623   To redirty it we must steal the NCredits.  Any further dirty attempt
3624   will have to allocate more.
3625   The stealing is done automatically by dirty_dblock, so we just flip
3626   the phase and call dirty_inode ... making sure it doesn't try to
3627   prealloc too hard.
3628
3629   Need to review when inodes get dirtied.
3630     - commit_write only sets I_Dirty !
3631
3632     We call lafs_dirty_inode:
3633       dir_create_commit - a child of inode is PinPending
3634       lafs_create - ditto
3635       lafs_link - before dir_create_commit
3636       lafs_unlink, lafs_rmdir - data block is pinned
3637       lafs_symlink - before create_commit
3638       lafs_mkdir - before create_commit, or block pinned
3639       lafs_mknod - before create_commit
3640       lafs_rename - (moved to) before create_commit/update_commit
3641                      or data block is pinned
3642       lafs_dir_handle_orphan - (assured that) child is pinned.
3643       choose_free_inum - child is pinned
3644       lafs_incorporate - block is pinned
3645
3646     So either the data block is pinned, or the index block is pinned.
3647     In either case it is OK to set something to Dirty.
3648
3649     (the new) lafs_dirty_vfs_inode gets called by mark_inode_dirty{,_sync}
3650     this is called from:
3651         inode_inc_link_count
3652         inode_dec_link_count
3653         ..various quota ops...
3654         inode_setattr
3655         __set_page_dirty (Which we don't use)
3656         other buffer stuff
3657         other quota stuff we won't use
3658         touch_atime
3659         file_update_time
3660         page_symlink
3661
3662     only the time updates are interesting.  Others we have locking
3663     for.
3664     file_update_time is called from generic_file_aio_write_nolock etc
3665     before ->prepare_write/->commit_write.  So they can pick up the
3666     change.
3667     Similarly before set_page_dirty is called.
3668     touch_atime is called from do_follow_link and readlink and
3669     file_accessed which is called all over the place.
3670
3671     So what to do?
3672     If block is pinned, then dirty it to ensure writeout.
3673     If not, don't.  But copy data in any case.
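
     That policy is small enough to write down.  A minimal sketch in
     C, assuming hypothetical 'struct vfs_fields' and
     'struct inode_dblock' stand-ins (the real code works on the VFS
     inode and the LaFS datablock, and would also have to deal with
     credits and phase, which this ignores):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical subset of the in-memory inode fields that get copied. */
struct vfs_fields {
    long atime;
    long mtime;
    unsigned nlink;
};

/* Hypothetical stand-in for the inode's data block. */
struct inode_dblock {
    bool pinned;               /* block is Pinned in the current phase */
    bool dirty;                /* block is marked Dirty for writeout */
    struct vfs_fields ondisk;  /* the copy that would be written out */
};

/* Always copy the inode into the block; only mark the block dirty
 * when it is already pinned.  Returns true if it was dirtied. */
static bool sketch_dirty_vfs_inode(const struct vfs_fields *in,
                                   struct inode_dblock *db)
{
    assert(in != NULL && db != NULL);
    db->ondisk = *in;          /* copy data in any case */
    if (!db->pinned)
        return false;          /* not pinned: do not force writeout */
    db->dirty = true;          /* pinned: make sure it gets written */
    return true;
}
```

     An unpinned block therefore carries the new times in memory but
     only reaches storage when something else dirties it.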
3674
3675
3676 4sep2009
3677
3678     OK, I've decided that I don't like clearing B_Valid when an index
3679     block contains no indexes.  The final straw was that I seemed
3680     to need to initialise the index block when I didn't hold IOLock.
3681     That was probably fixable, but I'm sure more problems were coming.
3682
3683     So: what to do instead?
3684     One issue that must be resolved is that an index block can still
3685     have valid children even when it becomes empty.
3686     This can happen if we erase blocks from a file, then add them back
3687     after a checkpoint, and so in the next phase.
3688     The checkpoint writeout could need to show an empty index block,
3689     but the next phase will see real addresses.
3690     We cannot easily avoid this, so we must handle it.
3691     This interacts badly with the index lookup algorithm that finds
3692     the best index block currently in the parent, and then scans
3693     the children.  If there is no index block in the parent, we
3694     cannot find any children.
3695     This could be handled by responding to an empty index block by
3696     scanning all children.  But that isn't a full solution as if
3697     just one index block got erased, its unincorporated siblings
3698     would still be lost.
3699     We could treat empty index blocks like orphans.  i.e. don't
3700     discard them immediately but leave them with possibly real
3701     addresses.  Then when they have no children we allocate them to
3702     address 0.
3703     But we still need to ensure that index blocks off which siblings
3704     have been split but not yet incorporated remain present in the
3705     tree to mark the place for their siblings.
3706     There is another problem.  A horizontal split could leave the
3707     new block with no addresses and everything in the uninc list.
3708     Nothing can be found in there.
3709
3710     So maybe we need to revise the lookup mechanism.
3711     The goal is to find an index block that starts at or before
3712     the target and contains an address at or after the target.
3713     Then our search can stop.
3714     In rare cases.....
3715
3716 7sep2009
3717     I thought about this more over the weekend and think I have an answer.
3718     We need to treat internal and leaf index blocks somewhat differently.
3719
3720     An internal index block must never be empty (while unlocked).
3721     Any child block which has not had its address incorporated must be
3722     attached (simply in the sibling list) to a block which has been
3723     incorporated.  This will be the block that it was split off from.
3724     The uninc block needs to hold a reference so that the primary isn't
3725     released.
3726     When a 'primary' becomes empty it cannot be discarded, so the
3727     addresses in the first dependent index block must be copied
3728     across.  This is awkward for indirect blocks so they might be
3729     allowed to be empty (they aren't internal so don't violate the
3730     above).
3731     When a horizontal split breaks a sequence of dependent blocks
3732     between two parents, the second parent must be incorporated
3733     immediately so that the first block in the second half of the
3734     sequence is incorporated.
3735     If an internal index block does become empty and it has no
3736     dependent blocks to fill from, it must be invalidated immediately.
3737     It cannot have any children - even in next phase - as at least one
3738     would have to be incorporated and so the block would not be empty.
3739     Invalidating involves allocating to address 0.
3740     If index lookup finds a block with PhysValid address of 0, it
3741     must look to the previous index block.  If there was none .... it
3742     gets a bit complex.
3743
3744     Leaf index blocks can become empty, but we try to avoid it.
3745     If a leaf has blocks which have been created in the next phase,
3746     and others which have been deleted in this phase, it can be empty
3747     but still have children.  In this case we just treat it as a real
3748     index block that doesn't actually have any addresses.  We still
3749     write it out even though that is a waste of space.
3750
3751     We have been working on the assumption that every address always
3752     has a corresponding leaf index block.  It is the leaf with the
3753     highest index at or below the target address.
3754     However this requires that every internal index block has a child
3755     with the same address as the parent.
3756     Preserving this requirement when the first child of an internal
3757     index block becomes empty requires either:
3758        - loading the 'next' child and reassigning this to the start
3759        - changing the address of the parent to match the first child.
3760     The former requires possibly reading a block from storage.
3761     The latter only involves modifying blocks that are due to be
3762     written out anyway, but makes block look up slightly interesting.
3763     When lookup finds an invalid block that is 'first', it needs to
3764     start again from the top.
3765     When incorporation creates an invalid block that is first, it
3766     needs to walk down from the top and any index block at the same
3767     address needs to be relocated/rehashed.  If the block is
3768     incorporated, the incorporated address needs to be updated.
3769     So:
3770      - flag for unincorporated index blocks which implies a reference
3771        on primary
3772      - after split, immediately incorporate second block
3773      - change lookup to retry when finding invalid block
3774      - When internal block becomes empty, either merge with
3775        first dependent or invalidate.  If first in parent,
3776        update address and parent and recurse.
3777        Need some 'clever' locking here.
3778        Before unlocking the invalidated block, we take i_alloc_sem,
3779        then walk up the ->parent tree locking blocks as
3780        required.
3781        The index lookup, when it finds an invalid block will take
3782        i_alloc_sem, then drop it, then start again.
3783        Or maybe some other lock than i_alloc_sem...
3784      - When leaf becomes empty, invalidate only if it has no children.
3785        When internal leaf becomes unpinned, check if empty.
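
     The lookup rule used in this entry (take the entry with the
     highest index at or below the target, and retry from the top if
     that entry turns out to be an invalidated block) can be sketched
     in C.  This assumes a flat, sorted array of hypothetical
     'struct index_entry' records and treats physaddr 0 as
     'invalidated'; the real index blocks are packed on-disk formats
     and the retry would also involve taking and dropping a lock:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical decoded index entry: child starts at fileaddr and
 * lives at physaddr.  physaddr 0 stands for an invalidated block. */
struct index_entry {
    uint32_t fileaddr;
    uint64_t physaddr;
};

enum lookup_result {
    LOOKUP_FOUND,   /* *slot holds the covering entry */
    LOOKUP_NONE,    /* nothing starts at or below the target */
    LOOKUP_RETRY    /* covering entry is invalid: restart from the top */
};

/* Binary search for the last entry with fileaddr <= target.
 * Entries must be sorted by ascending fileaddr. */
static enum lookup_result index_lookup(const struct index_entry *e, size_t n,
                                       uint32_t target, size_t *slot)
{
    size_t lo = 0, hi = n;

    assert(slot != NULL);
    assert(e != NULL || n == 0);
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (e[mid].fileaddr <= target)
            lo = mid + 1;       /* candidate; look further right */
        else
            hi = mid;           /* starts after target; look left */
    }
    if (lo == 0)
        return LOOKUP_NONE;     /* every entry starts after target */
    *slot = lo - 1;
    return e[*slot].physaddr == 0 ? LOOKUP_RETRY : LOOKUP_FOUND;
}
```

     On LOOKUP_RETRY the caller would start again from the InoIdx
     block rather than trust anything it read on the way down.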
3786
3787 21sep2009
3788    That locking doesn't look like it will work, and we can never 'merge
3789    with first dependent' as it is not valid to have an index block
3790    where the first child is at a different address.
3791    And we cannot always change the parent address, particularly if it
3792    is zero - increasing it then cannot work.
3793    And there is no need to load a block if we are just going to change
3794    its start address (not internal index blocks anyway).
3795    Let's drop the idea of relocating the parent.
3796    If an internal index block becomes empty:
3797      If it is last in parent, no loss, just discard
3798        If parent would be empty, need to recurse up.
3799      If it is not last relocate the next sibling to this location,
3800       rehashing it and updating the parent.
3801    If a leaf index block becomes empty we cannot just delegate to
3802       next as it might be indirect... not a problem if address is
3803       stored.  But that requires a format change... now might be a
3804       good time!
3805
3806
3807    So:
3808      If we hold an index block locked and it becomes empty and we choose
3809      to invalidate it, we need to ensure that doing so does not
3810      break any indexing paths.
3811      So we take a separate lock (i_alloc_sem??) and flag the block as invalid
3812      by setting physaddr to 0 while PhysValid is set, and unlock the block.
3813      Any lookup that finds such a block must take and release i_alloc_sem,
3814      and then restart from the top.
3815      - If the block was not incorporated, we just remove from sibling list
3816           and all is done - the space in implicitly included in
3817           previous block.
3818      - If the block has a different fileaddr than the parent then update
3819           the parent directly, either removing the entry, or changing it to
3820           point to the first unincorporated sibling (if there is one).
3821           This requires taking the lock on the parent of course.  That is
3822           why we dropped the lock on the child.
3823           Then all done.
3824      - If the block has the same address as the parent we need to find
3825           a 'next block' to relocate to the start of the parent.
3826           It is either the first unincorporated sibling, or the next
3827           block in the index block, or nothing, meaning the parent is
3828           about to become empty.
3829         We lock the parent (still holding i_alloc_sem), and rehash the
3830           chosen child.  If it doesn't exist, or is not dirty, we need
3831           to update the phys address directly in the parent
3832           accordingly, erasing or replacing the first address.
3833           Then we need to rehash the index block, but we need to lock
3834           the parent for that.
3835           So set a 'busy' flag on the block, unlock it, lock parent,
3836           rehash, clear busy flag, and repeat.
3837       - We can never relocate a block with fileaddr of zero, as the
3838           InoIdx block cannot be relocated.  So leaf index block 0
3839           must never be erased unless the file is empty.  So
3840
3841 28sep2009
3842   New idea.
3843   We store the start address of an indirect block in the block.
3844   This means that the meaning of any index block is completely
3845   independent of the location of the block, so we can change the location
3846   easily and without touching the block.
3847   So if a block becomes empty, we simply move the next block back to
3848   fill the gap.
3849   i.e. when an index block becomes truly empty (i.e. no children)
3850    - if it wasn't incorporated, simply remove it
3851    - if it was,
3852        - if there is a dependent block, rehash it to take my address
3853        - if there is a next block that is dirty, rehash it
3854        - if there is a next block that is not dirty,
3855           update parent to merge my entry with next, and rehash next
3856           if it exists
3857        - if there is no next block but we are not first, just update
3858           parent
3859        - if no next block and we are first, parent becomes empty,
3860           recurse upwards.
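
  The case list above reads naturally as a decision function.  A
  minimal sketch in C, assuming a hypothetical 'struct empty_ctx'
  that summarises the neighbourhood of the empty block (all names
  here are made up for illustration, and the actual rehashing and
  parent updates are left out):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical summary of an index block that has become truly
 * empty (no addresses and no children). */
struct empty_ctx {
    bool incorporated;   /* block's address is in the parent */
    bool has_dependent;  /* an unincorporated sibling hangs off it */
    bool has_next;       /* parent has an entry after this one */
    bool next_dirty;     /* that next block is dirty */
    bool is_first;       /* this is the first entry in the parent */
};

enum empty_action {
    EMPTY_REMOVE,            /* never incorporated: simply remove it */
    EMPTY_REHASH_DEPENDENT,  /* dependent block takes over my address */
    EMPTY_REHASH_NEXT,       /* dirty next block is rehashed into the gap */
    EMPTY_MERGE_WITH_NEXT,   /* clean next: merge entries in the parent */
    EMPTY_UPDATE_PARENT,     /* no next, not first: just fix the parent */
    EMPTY_RECURSE_UP         /* no next and first: parent is now empty */
};

/* Choose what to do with an empty index block, testing the cases in
 * the same order as the list in the notes. */
static enum empty_action on_index_block_empty(const struct empty_ctx *c)
{
    assert(c != NULL);
    if (!c->incorporated)
        return EMPTY_REMOVE;
    if (c->has_dependent)
        return EMPTY_REHASH_DEPENDENT;
    if (c->has_next)
        return c->next_dirty ? EMPTY_REHASH_NEXT : EMPTY_MERGE_WITH_NEXT;
    return c->is_first ? EMPTY_RECURSE_UP : EMPTY_UPDATE_PARENT;
}
```

  Only the last case touches anything above the parent, so the
  recursion is bounded by the depth of the index tree.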
3861
3862 12Oct2009
3863  - too long, I've forgotten what I was up to..
3864    + I've changed the format of indirect blocks to store an address.
3865    + I've handled incorporation of an empty block
3866    So now internal index blocks can never be empty - they get immediately
3867    unlinked if they are.
3868    Leaf index blocks can be empty while they have children.  We don't
3869    flag them as empty, but rather wait until another child gets incorporated.
3870    But I don't think I really like that.  It is an external ugliness based
3871    entirely on internal implementation details.  Empty index blocks should
3872    not get written out.  We need some way to reliably find an empty index
3873    block.  The address won't appear in the parent so a lookup will find the
3874    previous block which we cannot link to now as it may not exist yet.
3875    Worse - if first index block goes empty, we can only unlink it by moving
3876    the parent to start at the next block.  That would make this index block
3877    totally unfindable.
3878    So I think we have to stick with writing out empty index blocks very
3879    rarely.  So we need to be sure they disappear properly.
3880    The difficult case is if an index block becomes empty while it has some
3881    children which don't end up getting dirtied. e.g. an update aborts.
3882    We need to leave the block with enough credits to be written out.
3883    I guess the Ncredit should be enough...
3884    Maybe worry about that later.
3885
3886  - what about InoIdx blocks when they become empty?  It would be helpful
3887    to flag them so that inode deletion can check....
3888    Maybe just set depth to 0..
3889
3890  ARRGGG... I've completely lost it.  I need another ITO week.
3891   I just got a bug in summary.c:71!!
3892
3893 7 Jun 2010
3894  - summary.c:71.
3895    ablocks_used has hit zero too soon.
3896    This should be the count of blocks for which space has been allocated
3897    (B_Prealloc is set) but have not been given a phys address yet - at which
3898    point the usage count is moved to cblocks_used or pblocks_used.
3899    The last block (which may not be the cause of the problem) does not have
3900    B_Prealloc set, yet physaddr == 0.
3901    The block is 0/1, so the inode for the inode usage map.  This should have
3902    physaddr 8 !!
3903    We did find 8, then changed to 73, but then changed to 0!
3904   Ahhh... recent fix exposed a subtle bug ... fixed.
3905
3906  Now cluster.c:619: [ce9233f8]0/282(0)r2F:Pinned,Phase1,PinPending,CN,CNI,Claimed,PhysValid iblock(1) leaf(1)
3907      cluster.c:619: [ce570a18]0/286(0)r2F:Pinned,Phase1,PinPending,SegRef,CN,CNI,Claimed,PhysValid iblock(1) leaf(1)
3908    We are allocating an InoIdx block, but data block is not valid??
3909