README

   1
   2 So, let's try to write a kernel module that implements this filesystem.
   3 It would be good to have a plan.
   4
   5 - Mount filesystem, providing empty root directory
   6    o parse mount options - DONE
   7    o find/load superblocks and stateblocks - DONE
   8    o present empty directory - DONE
   9    o Compile external module - DONE
  10    o test DONE
  11
  12 - Mount filesystem read-only with no roll-forward
  13    o IO address mapping
  14           sync_page_io or bread? - not bread I think
  15    o Index blocks management
  16    o search cluster-header for root inode
  17    o file read
  18    o Directory lookup/read
  19    o test
  20
  21 - Support roll-forward for blocks, orphans, whatever
  22    o manage segusage files
  23    o manage quota files
  24
  25 - Support writing
  26    o inode bitmap
  27    o cluster creation / block sorting
  28
  29 - Support Cleaning
  30
  31 - Interface for snapshots and other admin
  32
  33
  34
  35 ------------------------
  36 FIXME
  37  If a device is removed from the filesystem, we cannot reliably
  38  tell from the other devices or state that this is so.
  39  Maybe we need to update all devblocks with a new 'seq' number...
  40 FIXME
  41  How do we specify mounting subordinate filesets?
  42  What superblock do they have?
  43  I suspect we do a -F lafs-sub mount from the original filesystem.
  44
  45 FIXME
  46  If mount fails, we seem to be leaving a super lying around,
  47  and sync_supers dies on it. - DONE
  48
  49 FIXME
  50   Umount appears to work, but a sync_supers dies. - DONE
  51
  52 FIXME
  53   subordinate supers aren't being locked as much - is that a problem?
  54
  55 FIXME
  56   index pages never get put on an LRU - how is this supposed to work?
  57
  58
  59
  60
  61
  62
  63
  64 --------------------------
  65 Thoughts:
  66   Inodes live in an address-space, much like a file.  To load the
  67   first inode, we need an address-space, so may as well have an
  68   'struct inode' as we may want to expose it to user-space.
  69
  70  Loading an inode, need
  71    fs (lafs filesystem structure)
  72    which subfs (maybe a lafs inode)
  73    which snapshot - this is implied by the subfs inode.
  74    and fs can be obtained from inode, so just inode, inum
  75
  76
  77
  78 UPTO
  79 03nov2005
  80   review block_leaf_find and make_iblock
  81   need to do setparent and block_adopt next
  82
  83 10nov2005
  84   need to resolve locking for ->siblings list
  85
  86 24nov2005
  87   peer_find
  88   lock_phase
  89   lafs_refile
  90
  91   I can read a file.....!!!!!
  92   Code review / tidy up.
  93      resolve locking buffer vs page
  94
  95   Export on a web page somewhere??
  96
  97 16feb2006
  98  (I spent a while getting large-directories to work again in prototype..
  99   and some holidays).
 100  - Priority: clean mount and unmount
 101  - large directories
 102  - multiple devices.
 103
 104   FIXME how do we record and handle write errors???
 105
 106  The iput in lafs_release - which is needed - is oopsing
 107   at iput+0xe!
 108
 109 23feb2006
 110  Ok, I finally have a clean mount/unmount.
 111  .. not quite.  blocks being freed at unmount still have a refcnt, which is bad.
 112
 113  Next:
 114   - make sure we can handle 'large' directories.
 115   - make sure we can handle files with indexes
 116   - handle filesystems that span devices.
 117
 118 02mar2006
 119   Hurray - clean unmounts!!!
 120   There is a nasty circular reference of the root inode which is stored in
 121   a block that it manages.  Maybe this should not happen, rather than having to be
 122   explicitly broken - the root-block can live elsewhere, not in the inode.
 123
 124   Next multi-level index blocks.
 125
 126   But first, need to understand memory pressure and pageout.
 127    How are dirty pages found to be cleaned?
 128    How is pressure put on a filesystem to clean up?
 129    How are clean pages reaped?
 130
 131   - call pagevec_lru_add{,_active}(pvec)  to put the page on an LRU
 132       lru_cache_add{,_active}(page) might be easier, but isn't exported.
 133   - call mark_page_accessed(page)  to keep the page 'active'.
 134
 135 09mar2006
 136   - make sure indexes work...
 137
 138
 139  lafs_load_block+0xf
 140   eax,bx,cx,dx,s1 all zero
 141 from block_leaf_find 203
 142
 143   ... OK, indexes seem to work.
 144    But 'lafs' have problems creating some large files.
 145    Try 'tt'
 146
 147    This is due to not handling errors properly.. fix it later FIXME
 148
 149 16mar2006
 150
 151   Must make sure the index address-space gets cleared up... I wonder
 152   how we find all the pages to free.  This might be one reason to keep them
 153   in a radix tree.  Though we should be able to walk our own data structures.
 154
 155
 156   Then work on mounting a 2-device filesystem.
 157
 158
 159   FIXME dir_next_ent always starts from the beginning rather than
 160    remembering where it is up to... can this be fixed??
 161
 162
 163 18mar2006 (Wedding anniversary, and Saturday ... during commonwealth games)
 164
 165   Mounting snapshot needs a way to identify that it is a snapshotmount
 166   and which snapshot, and which filesystem.
 167   We could use a different filesystem type, but that isn't really needed
 168
 169     mount -t lafs -o snapshot=name /original/mount/point /new
 170
 171   This grabs the named snapshot of /original/mount/point and places it at
 172       /new
 173   The 'snapshot=' option is the trigger.
 174
 175   For a control FS, we
 176         mount -t lafs -o control /original/mount/point /new
 177
 178   To grow a filesystem, we initialise a device (super/state blocks) and
 179         mount -t lafs -o remount,new=/dev/name whatever /original/mount/point
 180
 181   as the dev_name isn't passed to remount
 182
 183   So, mount options are:
 184         snapshot=name
 185         dev=/dev/device
 186         new=/dev/device
 187         control
 188     and various
 189           name=value
 190     pairs matching what is exposed in the control filesystem
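
  A minimal user-space sketch of parsing the option string listed above.
  The struct, its field names and the function name are made up for
  illustration and are not taken from the module; a real implementation
  would live in the kernel and would more likely use the kernel's own
  option-parsing helpers.  Unrecognised name=value pairs are only
  counted here, on the assumption that they are handed on to whatever
  backs the control filesystem.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical container for the options; names are illustrative only. */
struct lafs_opts {
    char snapshot[64];  /* snapshot=name */
    char dev[64];       /* dev=/dev/device */
    char newdev[64];    /* new=/dev/device */
    int  control;       /* bare "control" flag */
    int  unknown;       /* other tokens, left for later handling */
};

/* Copy at most dstlen-1 bytes of a value and NUL-terminate it. */
static void copy_val(char *dst, size_t dstlen, const char *src, size_t n)
{
    if (n >= dstlen)
        n = dstlen - 1;
    memcpy(dst, src, n);
    dst[n] = '\0';
}

/* Parse a comma-separated option string.
 * Returns 0 on success, -1 if a known key has no value. */
static int lafs_parse_opts(const char *s, struct lafs_opts *o)
{
    assert(s && o);
    memset(o, 0, sizeof(*o));
    while (*s) {
        size_t len = strcspn(s, ",");           /* length of this token */
        const char *eq = memchr(s, '=', len);   /* '=' inside the token? */
        size_t klen = eq ? (size_t)(eq - s) : len;
        const char *val = eq ? eq + 1 : "";
        size_t vlen = eq ? len - klen - 1 : 0;

        if (klen == 7 && !eq && !strncmp(s, "control", 7)) {
            o->control = 1;
        } else if (klen == 8 && !strncmp(s, "snapshot", 8)) {
            if (!vlen)
                return -1;
            copy_val(o->snapshot, sizeof(o->snapshot), val, vlen);
        } else if (klen == 3 && !strncmp(s, "dev", 3)) {
            if (!vlen)
                return -1;
            copy_val(o->dev, sizeof(o->dev), val, vlen);
        } else if (klen == 3 && !strncmp(s, "new", 3)) {
            if (!vlen)
                return -1;
            copy_val(o->newdev, sizeof(o->newdev), val, vlen);
        } else if (len) {
            o->unknown++;
        }
        s += len;
        if (*s == ',')
            s++;
    }
    return 0;
}
```

  Calling lafs_parse_opts("snapshot=daily,control", &o) fills in
  o.snapshot and sets o.control; a bare "snapshot" with no value is
  rejected.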
 191
 192 23mar2006
 193  - factored out super-block finding preparatory to finding snapshots.
 194
 195  Thoughts:
 196     superblocks for snapshots and sub-ordinate filesystems do
 197     not get stored in the 'state'.  There is, however, a usage count so that
 198     the prime filesystem cannot be unmounted until all snaps and subs are gone.
 199     This should just refcount the prime_sb I suspect.
 200
 201     So: a snapshot sb points to the 'struct fs' but doesn't .... what???
 202
 203 30mar2006
 204  - remove the super-block finding code by changing the layout to store
 205     superblock locations explicitly :-)
 206
 207  - teach 'mount' to mount snapshots.
 208
 209  - need to audit for bad use of ss[0]
 210  - need to find better way to map 'sb' to snapshot number.
 211  - need to make unmount work.
 212
 213 01apr2006 (no, really!!)
 214  - rewrite index to kmalloc index blocks and use a shrinker to free them.
 215    This means that indexblock no longer has a 'page', which makes sense.
 216    It also means they cannot live in highmem, which is sad, but could
 217    be fixed.
 218
 219   Notes: superblocks and refcounts.
 220    Each device holding the filesystem gets a superblock.
 221     One of these (arbitrarily) is the 'prime' superblock and gets to
 222       manage the whole filesystem.
 223     Each snapshot also gets a superblock, as does each
 224       subordinate filesystem.  These are anon sbs - using anon dev.
 225     Each anon sb takes a reference to the 'struct fs', and also to the
 226       prime sb.... how about the reference relationship between fs and prime_sb???
 227
 228     Need to ponder this,
 229
 230    - problem with getting parent superblock due to semaphores...
 231    - when unmount, put_super isn't being called, so inode 0 isn't released!
 232
 233 13apr2006
 234   (Took a week off to play with rt2500 wireless cards)
 235   - Use different filesystem type for snapshots and subordinate filesystems.
 236     This removes the semaphore problem
 237   + OK, mount and unmount works for snapshots... what next?
 238      - review index block - worry about highmem?
 239      - review ss[0] usage - OK
 240      - general code review
 241
 242   FIXME - what should leaf_lookup/index_lookup return on format error?
 243       They currently return '0', which will quietly make an empty block.
 244       Maybe '-1' would be better, to make an error block.
 245   FIXME check how other filesystems lock the setting of PagePrivate
 246      Maybe just need to lock_page
 247   FIXME combine find/load/wait into one operation
 248   Review dir, super, roll, link
 249
 250   FIXME module refcount increases on failed mount!
 251
 252 18may2006
 253   I've been sick for too long, and not much has happened... However I think more than
 254   the above comment says.  I started looking at roll-forward and have the
 255   basic block parsing in place so that it reports what it sees in the roll.
 256   Also, the format has been changed a little: the address in the state block
 257   is the CheckpointStart cluster, and we simply roll forward to the
 258   CheckpointEnd, and then keep going beyond there - there is no longer any
 259   walking back to find the start.
 260
 261   Next step is to start incorporating rolled elements into the filesystem
 262
 263    - data blocks: shouldn't be too hard.  Don't need to update the
 264            index pages just yet
 265    - inode updates: should be straight forward enough, but care is needed
 266            as the data might be in multiple places
 267    - directory updates: these are probably most interesting..
 268
 269
 270   Question: how are symlinks created?
 271     Currently we:
 272       log the inode creation
 273       commit the new inode
 274       log the directory update.
 275     This allows the 'value' stored in the inode to appear after the directory
 276     update.
 277     That might be OK for files (Which are created empty and then extended)
 278     but is bad for symlinks (which are created atomically).
 279     So, options include:
 280      - ensure inode is in a previous cluster to directory updates.
 281        This slows things down too much I think
 282      - log the content as well.  This is awkward if it is big, certainly if more
 283        than a block, which is possible.
 284      - directory updates could be dependent on the inode being valid.
 285        This is ugly.
 286      - log content if it is small, else write inode, flush, then create link.
 287
 288     So the fast option is:
 289       log inode create, log content, log filename
 290     and the slow/safe option is
 291       log inode create, sync file, log filename
 292
 293     So on roll-forward if we see the inode we just save the data.
 294     Saving the whole inode seems attractive, but we want minimal order
 295     dependence: an inode update in the same cluster as the new inode should
 296     still over-ride, even though it is earlier.
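
    The fast/slow choice above can be sketched as a small planning
    function.  This is only an illustration: the enum, the function name
    and the max_logged threshold (how much content is "small" enough to
    log) are assumptions, not something the notes have settled.

```c
#include <assert.h>
#include <stddef.h>

enum commit_step {
    LOG_INODE_CREATE,
    LOG_CONTENT,     /* fast path: content travels in the log */
    SYNC_FILE,       /* slow/safe path: write the inode and flush first */
    LOG_FILENAME     /* the directory update always goes last */
};

/* Fill steps[] (room for 3 entries) with the commit order for a new
 * symlink whose target is target_len bytes long.  max_logged is a
 * hypothetical limit on how much content may be logged directly.
 * Returns the number of steps written. */
static int symlink_commit_plan(size_t target_len, size_t max_logged,
                               enum commit_step steps[3])
{
    int n = 0;

    assert(steps);
    steps[n++] = LOG_INODE_CREATE;
    steps[n++] = (target_len <= max_logged) ? LOG_CONTENT : SYNC_FILE;
    steps[n++] = LOG_FILENAME;
    return n;
}
```

    Either way the filename is logged last, so a directory entry never
    becomes visible before the symlink's content is recoverable.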
 297
 298   Ok, rollforward is proceeding slowly.  I think I am now incorporating
 299   new blocks into the tree properly, though the code probably won't compile.
 300   It will be nice to test this and see the file have the right data.
 301
 302   Next step would be to include the index incorporation code.
 303   Then
 304     - directory updates
 305     - segusage summary
 306     - quota
 307     - stuff..
 308
 309 08jun2006
 310  - what exactly should happen when rollforward finds a file with a linkcount of 0?
 311    Currently all updates get lost - I wonder if they are lost safely?
 312  - rollforward is getting the size right, but not the content
 313  - do I need to flag a block that ->phys is valid?
 314
 315  : Ok, roll-forward picks up new blocks in a file OK,
 316   but umount has stopped working.
 317     Presumably because there are pages attached to the inode which aren't
 318     getting released.  What do we want to do here?
 319     Normally those pages, or their addresses need to be recorded before
 320     they are lost.  But on a read-only mount we don't care so much.
 321
 322 22jun2006    continuing above thought..
 323
 324    When we roll-forward and pick up the pieces of a file, we don't
 325    want to allocate pages to hold those pieces (and definitely don't
 326    want to read them all).  We just want to attach the addresses
 327    to the parent for incorporation.  Similarly after writing
 328    dirty blocks in a file we want to be able to release them
 329    immediately rather than waiting for the addresses to be
 330    incorporated (as incorporation can be more efficient when delayed).
 331
 332    We could just allow the page associated with a block to be released,
 333    except that the page provides the indexing to find a block.  We might
 334    be able to live without the indexing, and hunt down the indexblock tree,
 335    but living without the mutual-exclusion provided by block indexing would
 336    be more awkward.
 337    And the 'struct datablock' still contains a lot more than is needed.
 338
 339    So maybe we should just have a completely separate structure attached to
 340    the indexblock which lists fileaddr/physaddr.  This could include
 341    extent information.  The trick would be guaranteeing allocation.
 342    We could either allocate-late with a fallback of attaching the 'struct block'
 343    or performing an immediate incorporation, or allocate-early and block
 344    the dirtying of a page until there is space to record the new address.
 345    This last is bound to be easiest.
 346
 347    So: what exactly do we use to store addresses?
 348     Probably a linked list of tables.
 349     Each table contains a link pointer and an array of
 350         fileaddr/physaddr/extentlen
 351     But we would need to allocate lots of these if there are hundreds of
 352       dirty pages, but possibly only end up using a few if they made
 353       extents very nicely.  That might be wasteful.
 354
 355     Or we could allocate just one.  When it is full we perform an
 356      incorporation.  But if that causes a page split we are in trouble.
 357        We could have a spare page, split to it, write out one
 358         and wait for the spare page to be written and free.
 359         But we cannot just release the index page as it might still have
 360         children.
 361
 362     (I think I've been here before).
 363     A worst-case scenario involves writing one block and that requires
 364       splitting every index up the tree to the inode.  This requires
 365       arbitrarily many pages to be allocated.  To accommodate this we either
 366       pre-allocate a spare page at every level of the tree down to the data
 367       block (a bit like storage space allocation) which seems very wasteful,
 368       or we make sure we can release one of the split pages, which seems impossible.
 369
 370     I could decide not to worry about it.  Have a pool of index pages and hope
 371      it always works.  After all, most pages are data pages, and they can be
 372      freed successfully.  We would only have a deadlock if all dirty memory were
 373      index pages, and that seems unbelievably unlikely.  If we trigger a
 374      checkpoint when the count of locked-pages hits some limit we should be
 375      safe.
 376
 377     So: Keep one table per index block.  Use simple append and sequential search.
 378      When table gets full, force an incorporation
 379
 380      Do we allocate the table separately, or embed it in the indexblock??
 381
 382      Probably embed it.  indexblocks that don't need it can be freed at any
 383      time so that space waste hopefully isn't significant.
 384
 385      How big?
 386       If the file is written sequentially, then everything should gather into
 387       extents, and so it doesn't need to be enormous.
 388       If the file is written randomly then the index block can be expected to
 389       be 'indirect', so incorporation will be cheap.
 390      So 'small' seems ok in both cases.
 391
 392      Let's say 8.
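
     A rough user-space sketch of such a table: 8 slots of
     fileaddr/physaddr/extentlen, simple append, sequential search.
     The type and function names are invented for illustration.  The
     merge-into-previous-extent step and the use of 0 as "not found"
     are assumptions of this sketch.

```c
#include <assert.h>
#include <stdint.h>

#define UNINC_SLOTS 8   /* "Let's say 8" */

struct addr_ent {
    uint32_t fileaddr;  /* first file block covered */
    uint64_t physaddr;  /* device address of that block */
    uint32_t len;       /* number of consecutive blocks */
};

struct uninc_table {
    int used;
    struct addr_ent ent[UNINC_SLOTS];
};

/* Append one block address.  A block that directly continues the most
 * recent extent is merged into it, so sequential writes use one slot.
 * Returns 0 on success, -1 when the table is full and the caller must
 * force an incorporation. */
static int uninc_add(struct uninc_table *t, uint32_t fileaddr,
                     uint64_t physaddr)
{
    assert(t && t->used >= 0 && t->used <= UNINC_SLOTS);
    if (t->used > 0) {
        struct addr_ent *last = &t->ent[t->used - 1];

        if (last->fileaddr + last->len == fileaddr &&
            last->physaddr + last->len == physaddr) {
            last->len++;
            return 0;
        }
    }
    if (t->used == UNINC_SLOTS)
        return -1;
    t->ent[t->used].fileaddr = fileaddr;
    t->ent[t->used].physaddr = physaddr;
    t->ent[t->used].len = 1;
    t->used++;
    return 0;
}

/* Sequential search, newest entry first so a later write of the same
 * block wins.  Returns the physical address, or 0 if not recorded. */
static uint64_t uninc_lookup(const struct uninc_table *t, uint32_t fileaddr)
{
    for (int i = t->used - 1; i >= 0; i--) {
        const struct addr_ent *e = &t->ent[i];

        if (fileaddr >= e->fileaddr && fileaddr - e->fileaddr < e->len)
            return e->physaddr + (fileaddr - e->fileaddr);
    }
    return 0;
}
```

     With this shape, a sequential write of 100 blocks occupies a single
     slot, while 8 scattered writes fill the table and the next add asks
     for an incorporation.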
 393
 394      But wait a minute.....
 395      On a checkpoint we can be getting phys updates for prev and next phases.
 396      next-phase updates cannot be incorporated until the indexblock has passed
 397      on to the next phase.  So in that case, I think we still keep a linked
 398      list of unincorporated blocks and live with the fact that we cannot
 399      free them until the phase change passes.  That shouldn't be a big problem
 400      as it is a limited time frame - especially for data blocks..
 401
 402      But does this solve our initial problem??
 403      During roll-forward we want to keep the addresses but not the blocks,
 404      and we don't want to force incorporation. That means an arbitrary list
 405      of addresses attached to an index block.
 406      I guess we could possibly allow incorporation, but I would rather not
 407      as I want the fs to be able to be read-only nicely.
 408      So that means we need to have a list of address tables.
 409      Maybe the normal approach is 'add a table if possible, else incorporate'?
 410
 411      OUCH... we may write a block a second time before incorporating the
 412      new address, so when adding an address to the table we need to check
 413      if it already exists.  That could be expensive.
 414      For index blocks might it even be a different address?  I think
 415      not but the vague possibility (in the future?) does complicate
 416      things somewhat.  Maybe we just keep things in chron order and
 417      don't worry about duplicates until incorporate time, when we have to
 418      sort anyway.
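
     A sketch of that "sort at incorporate time" step, in user-space C.
     Records are kept in chronological order; at incorporation they are
     sorted by file address and only the newest record for each address
     survives.  The struct and function names are hypothetical.  Since
     qsort() is not guaranteed stable, the position in the chronological
     list is stamped into each record and used as a tie-breaker.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

struct addr_rec {
    uint32_t fileaddr;
    uint64_t physaddr;
    uint32_t seq;       /* position in the chronological list */
};

/* Order by file address, then by age (oldest first). */
static int addr_rec_cmp(const void *a, const void *b)
{
    const struct addr_rec *x = a, *y = b;

    if (x->fileaddr != y->fileaddr)
        return x->fileaddr < y->fileaddr ? -1 : 1;
    return (x->seq > y->seq) - (x->seq < y->seq);
}

/* r[0..n-1] is in chronological order on entry.  On return the first
 * 'result' records are sorted by fileaddr with duplicates removed,
 * keeping the most recently written address for each block. */
static size_t uninc_sort_dedup(struct addr_rec *r, size_t n)
{
    size_t out = 0;

    assert(r || n == 0);
    for (size_t i = 0; i < n; i++)
        r[i].seq = (uint32_t)i;
    qsort(r, n, sizeof(*r), addr_rec_cmp);
    for (size_t i = 0; i < n; i++) {
        if (i + 1 < n && r[i + 1].fileaddr == r[i].fileaddr)
            continue;   /* a newer record for this block follows */
        r[out++] = r[i];
    }
    return out;
}
```

     Adding an address stays a cheap append this way; the duplicate
     check is paid once per incorporation instead of once per write.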
 419
 420
 421      todo:
 422         lafs_find_block  DONE
 423         free_block must free tables DONE
 424
 425
 426      Unmounting still doesn't work.
 427      Problem is that an index block is holding a reference on parent,
 428      and parent references aren't getting cleaned up.
 429      On read-only unmount I guess we need to walk the list of leafs,
 430      discard any address info, and unlock the blocks.
 431      So that should be the first task for next time.
 432
 433 27jul2006
 434   Leafs are locked blocks which have no locked children.
 435   So any locked data block (non-inode) is a leaf
 436   Any locked index block with lockcnt[phase] 0 is a leaf.
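
  The leaf rule can be written as a predicate.  This is a sketch with
  an invented struct; only lockcnt[phase] comes from the notes above.
  The rule does not say how to treat inode data blocks, so the sketch
  simply reports them as non-leaves, which is an assumption.

```c
#include <assert.h>
#include <stdbool.h>

enum blk_kind {
    BLK_DATA,        /* ordinary data block */
    BLK_INODE_DATA,  /* data block holding an inode */
    BLK_INDEX        /* index block */
};

/* Hypothetical in-memory block; not the module's real structure. */
struct blk {
    enum blk_kind kind;
    bool locked;
    int  phase;       /* 0 or 1: phase the block is locked in */
    int  lockcnt[2];  /* locked children, counted per phase */
};

/* A leaf is a locked block with no locked children. */
static bool blk_is_leaf(const struct blk *b)
{
    assert(b && (b->phase == 0 || b->phase == 1));
    if (!b->locked)
        return false;
    switch (b->kind) {
    case BLK_DATA:
        return true;                        /* data blocks have no children */
    case BLK_INDEX:
        return b->lockcnt[b->phase] == 0;   /* no locked children in phase */
    default:
        return false;   /* inode data blocks: left undecided by the rule */
    }
}
```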
 437
 438   OK - fixed numerous bugs, but I can unmount now!!
 439   I can even rmmod and insmod and all is cool.
 440
 441
 442 TODO:
 443  - review refile and get all the code in there from prototype
 444        DONE (I hope)
 445  - write a combined find/load/wait function and use it
 446        DONE
 447  - allocate inodes in single memcache and avoid generic_ip
 448        HALF DONE. (still using kmalloc, not doing initonce well)
 449  - review recording of new block addresses
 450     + make sure we lookup there on index lookup - YES
 451     + make sure ->uninc_next gets transferred to table at phase change.
 452     + write incorporation code as it is tricky
 453  - review how directory updates can be incorporated into a RO filesystem.
 454     No, they cannot.  We need to update the directory.
 455  - write directory update code
 456  - write cluster construction code
 457  - make sure indexblocks with unincorporated addresses get on to inc_pending
 458     ?? or is locking them enough?
 459
 460
 461 INCORPORATION - ARgggghhhhh.
 462  The current uninc_table doesn't really lend itself to building
 463   index block... though maybe....
 464  Question: what happens when an index block disappears? i.e. it has no
 465   addresses in it?
 466   We clearly need to remove it from the parent.  This should be trivial,
 467   a direct operation on the parent index block, e.g. set some number to 0.
 468   Then the next incorporation pass will simply lose that entry.
 469
 470  OK, that might be all well and good, but how do we sort unincorporated
 471   addresses so we can merge them?
 472  A linked-list merge sort is nice and open-ended, but does waste
 473   quite a bit of space in pointers.
 474
 475  Or maybe I should just always do small-table incorporations.
 476  Is there a way that a bad ordering of writes could force very bad
 477   index layout in this case? i.e. cause a table split every time,
 478   but new blocks go in the first (full) table.
 479  OK Decision: always do small-table incorporation.
 480   i.e. not a list of blocks: just a table of addresses.
 481
 482  FIXME check validity of index type when it is first read in,
 483    and reject early if it cannot be recognised.
 484
 485 24aug2006
 486  Took a break from incorporation.
 487  Looking at directories.
 488  Wrote dir.doc in module to sum lots of stuff up.
 489  Issue:
 490    dir blocks have an info structure attached.
 491    This included a counted reference to the parent.
 492    How long does this need to hang around for??
 493
 494    - when there is any orphan issue happening, it must stay, via
 495      the 'pinned' flag.
 496    - when actually performing a dir op, we need to create and
 497      maintain this info.
 498
 499    When last ref of a dir block is dropped, should drop
 500    the parent reference.
 501
 502
 503  Status:
 504     free list management mostly done.
 505     Next:
 506       create/delete prepare/commit/abort
 507       orphan handling
 508       dirty_block lock_block
 509
 510
 511  FIXME should dir_new_block zero out the block?
 512    How will commit_create know what to do with this block?
 513
 514  NOTE another type of directory orphan is a free leaf block which
 515    is on the part-free list.
 516
 517 -------------------------------------------------------------
 518 09sep2006 - on the plane to Frankfurt
 519  Don't tell me I am rethinking preallocation again ???
 520
 521  TODO
 522    dirty_inode needs to record the phase it is dirty in
 523    inode_fillblock needs to check current phase and act accordingly.
 524      see inode.doc
 525    Make sure the B_Orphan flag is set and used - or discard it.
 526
 527    How do we commit creating a symlink?
 528    If it is a full block in size we cannot make an update record.
 529     - maybe have two update records? We cannot guarantee they are in
 530       the same  cluster.
 531     ... but if we put the 'make dir entry' last it should work.
 532
 533    Change 'struct descriptor' definition
 534    the 'block_type' aka 'length' 16 field becomes
 535       0x0000 -> 0x8000 -> datablock, possibly a hole - upto 32K.
 536       0x8001 -> 0xc000 -> miniblock upto 16K+
 537       0xffff           -> index block.
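
   A sketch of decoding that 16-bit field, following the three ranges
   above.  The enum and function names are invented.  Taking the
   miniblock length as (value - 0x8000) is an assumption, as is
   treating the otherwise unlisted values 0xc001..0xfffe as invalid.

```c
#include <assert.h>
#include <stdint.h>

enum desc_kind { DESC_DATA, DESC_MINI, DESC_INDEX, DESC_INVALID };

/* Classify the block_type/length field and report a byte length. */
static enum desc_kind desc_decode(uint16_t v, uint32_t *len)
{
    assert(len);
    if (v <= 0x8000) {          /* datablock, up to 32K */
        *len = v;
        return DESC_DATA;
    }
    if (v <= 0xc000) {          /* miniblock; length offset is assumed */
        *len = (uint32_t)v - 0x8000;
        return DESC_MINI;
    }
    *len = 0;
    if (v == 0xffff)            /* index block */
        return DESC_INDEX;
    return DESC_INVALID;        /* not covered by the listed ranges */
}
```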
 538
 539    Need to write IO routines which decrease pending-block-count in
 540      'wc'.
 541
 542
 543    Thinks.  a 1TB filesystem with 1K blocks and 4096 blocks/seg
 544      gives 4Meg segments. That would be 256K segments which at 2 bytes per segment
 545      - 512 segments per block - is 512 blocks in each seg usage file
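
   The same arithmetic as a small helper, so other block and segment
   sizes can be tried.  The struct and function names are invented for
   this sketch; "1TB" is taken as 2^40 bytes and "1K" as 1024 bytes.

```c
#include <assert.h>
#include <stdint.h>

struct segusage_layout {
    uint64_t seg_bytes;       /* bytes per segment */
    uint64_t nsegs;           /* segments in the filesystem */
    uint64_t ents_per_block;  /* usage entries per block */
    uint64_t usage_blocks;    /* blocks in each seg usage file */
};

static struct segusage_layout segusage_compute(uint64_t fs_bytes,
                                               uint64_t block_bytes,
                                               uint64_t blocks_per_seg,
                                               uint64_t bytes_per_ent)
{
    struct segusage_layout l;

    assert(block_bytes && blocks_per_seg);
    assert(bytes_per_ent && bytes_per_ent <= block_bytes);
    l.seg_bytes = block_bytes * blocks_per_seg;
    l.nsegs = fs_bytes / l.seg_bytes;
    l.ents_per_block = block_bytes / bytes_per_ent;
    /* round up so a partly filled last block is counted */
    l.usage_blocks = (l.nsegs + l.ents_per_block - 1) / l.ents_per_block;
    return l;
}
```

   For the case above, segusage_compute(1ULL << 40, 1024, 4096, 2)
   gives 4MiB segments, 262144 (256K) segments, 512 entries per block
   and 512 blocks per seg usage file.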
 546
 547 12oct2006
 548  Need to write
 549  - lafs_lock_{d,}block  DONE
 550        Make sure the block has parents and allocation and set the locked
 551        flag and the phase.
 552
 553  - lafs_flush
 554        Given a datablock, wait for it to be written out
 555        This is needed before updating a block that is still locked in the
 556        previous phase.
 557  - lafs_inode_init
 558        Used when creating a new object/inode
 559        Given a datablock which is to hold the inode
 560          and a type (Type*) and a mode,
 561        Fill in the data block with appropriate data so that
 562           when lafs_import_inode looks at it, the right stuff happens.
 563  - phase_flip
 564  - lafs_prealloc
 565  - lafs_seg_ref
 566  - lafs_lock_inode
 567
 568 lafs_dirty_dblock
 569 lafs_cleaner_pause
 570 lafs_dirty_inode
 571 lafs_seg_flush_all
 572 lafs_write_all_super
 573 lafs_quota_flush
 574 lafs_space_use
 575 lafs_cluster_update_abort
 576 lafs_cluster_update_commit_buf
 577 lafs_cluster_update_commit
 578 lafs_seg_apply_all
 579 lafs_cluster_update_prepare
 580 lafs_inode_phase_check
 581 lafs_seg_dup
 582 lafs_dirty_block
 583 lafs_cluster_update_lock
 584 lafs_checkpoint_unlock_wait
 585 lafs_orphan_drop
 586 lafs_free_get
 587 lafs_find_next
 588
 589 2nov2006
 590  - I need to know if a block is undergoing write-io so that I can
 591    avoid modifying it in certain circumstances.  But I don't track
 592    this information.  Options:
 593     1/ track the info.  This means an extra field in the 'struct block'
 594         because I still need to know which wc has had a write.
 595     2/ For blocks that we care about copy the data on write...
 596         But we care about all inodes and directory blocks.  That is a waste.
 597    I think we put extra info in the block.
 598    We need to know which wc was used (0,1,2) and which pending cluster
 599    in there (0-3) which comes to 4 bits.
 600    But we only care about the block for wc=0. and we could include the
 601    which-pending in the b_end_io, or maybe put it all in low bits
 602    of the block pointer....  Need max 4 bits.  Can only be sure of 2...
 603
 604    Maybe:
 605        'which' goes in bottom two bits of bi_private
 606        'wc' goes in ->flags
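
   A user-space sketch of that packing.  The helper names and the
   position of the 'wc' field within the flags word (WC_SHIFT) are made
   up.  It relies on the pointer being at least 4-byte aligned, which
   is the "can only be sure of 2" bits mentioned above.

```c
#include <assert.h>
#include <stdint.h>

#define WHICH_MASK ((uintptr_t)3)   /* two low bits of the pointer */
#define WC_SHIFT   8                /* hypothetical position in flags */
#define WC_MASK    (3UL << WC_SHIFT)

/* Store 'which' (0-3) in the low bits of an aligned pointer. */
static void *private_tag(void *ptr, unsigned which)
{
    assert(((uintptr_t)ptr & WHICH_MASK) == 0);
    assert(which <= 3);
    return (void *)((uintptr_t)ptr | which);
}

/* Recover the original pointer. */
static void *private_ptr(void *tagged)
{
    return (void *)((uintptr_t)tagged & ~WHICH_MASK);
}

/* Recover 'which'. */
static unsigned private_which(void *tagged)
{
    return (unsigned)((uintptr_t)tagged & WHICH_MASK);
}

/* Store 'wc' (0-2) in a two-bit field of a flags word. */
static unsigned long flags_set_wc(unsigned long flags, unsigned wc)
{
    assert(wc <= 2);
    return (flags & ~WC_MASK) | ((unsigned long)wc << WC_SHIFT);
}

static unsigned flags_get_wc(unsigned long flags)
{
    return (unsigned)((flags & WC_MASK) >> WC_SHIFT);
}
```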
 607
 608
 609 4apr2007  (What a long gap !!)
 610
 611  - lafs_cluster_update_*
 612    How do we prepare for a cluster update?  How do we lock it.
 613
 614    The important thing is that the update can be written.  That
 615    requires that there is space available.  So we need to preallocate
 616    space and then release it.
 617    It is possible that each update might go in a different cluster, so maybe
 618    we need to preallocate one block per update.  That sounds a little expensive.
 619    After all, we aren't preallocating a cluster block for every data block
 620    that is dirty.
 621    So: prepare does nothing
 622         lock preallocates the space - a full block.
 623         commit copies it in.
 624     For now at least.
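
   A toy model of that prepare/lock/commit split, in user-space C.
   These are not the lafs_cluster_update_* functions, just stand-ins
   with invented names and types.  Releasing the reservation at commit
   (and at abort) is my reading of "preallocate space and then release
   it", so treat that as an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct space  { unsigned free_blocks; };  /* blocks still available */
struct update { int locked; };            /* one pending update */

/* prepare does nothing beyond initialising the handle */
static void upd_prepare(struct update *u)
{
    u->locked = 0;
}

/* lock preallocates a full block; fails if none is available */
static int upd_lock(struct space *sp, struct update *u)
{
    if (sp->free_blocks == 0)
        return -1;
    sp->free_blocks--;
    u->locked = 1;
    return 0;
}

/* commit copies the record into the cluster buffer (capacity 'cap',
 * current fill '*off') and releases the reservation */
static int upd_commit(struct space *sp, struct update *u,
                      unsigned char *cluster, size_t cap, size_t *off,
                      const void *rec, size_t len)
{
    assert(u->locked);          /* committing without lock is a bug */
    assert(*off <= cap);
    if (len > cap - *off)
        return -1;
    memcpy(cluster + *off, rec, len);
    *off += len;
    sp->free_blocks++;          /* assumed: reservation released here */
    u->locked = 0;
    return 0;
}

/* abort gives the reservation back without writing anything */
static void upd_abort(struct space *sp, struct update *u)
{
    if (u->locked) {
        sp->free_blocks++;
        u->locked = 0;
    }
}
```

   With one free block, a second upd_lock() fails until the first
   update is committed or aborted, which is the point of reserving at
   lock time.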
 625
 626 24May2007
 627
 628  - Can now create and delete lots of files.  This is cool.
 629   But:
 630     Orphan slots just grow and grow - never to be reclaimed - why?
 631     After rm f*, 7 files remain.  but rm f* again and they go.
 632          FIXED - readdir wasn't returning them
 633     Size of directory remains large.
 634     And sometimes, files become ghosts... (try just removing one after first rm f*).
 635
 636   TODO - process those orphans to clean up the directory.
 637
 638 20June2007 (Happy Birthday Dad)
 639
 640  - Creating lots of files and then deleting them leaves 5 orphan slots
 641    for the directory busy, and one for inode 0??
 642
 643    Directory handling uses the following orphans:
 644     CREATE:
 645         A new index block is created by splitting.  This needs to be linked in.
 646     DELETE:
 647         The dirent block we are deleting from
 648            If it becomes empty, it needs to go on free list
 649         The index block we are deleting from
 650            If it has lots of free space it might need to be rebalanced.
 651      The inode that was deleted.
 652
 653
 654  - When a file is fully deleted, we need to drop any orphan info... DONE
 655  - Need to do orphan handling of free blocks in directory, and
 656    unmerged parents - but there doesn't seem much point as I am going to
 657    change the directory layout (again).
 658
 659  So: writing to a file.
 660    We need prepare_write, commit_write, and writepage.
 661    Prepare loads and links the page and checks there is space.
 662    commit marks it as dirty so writeout is possible.
 663    writepage chooses a page to write out
 664
 665 25June2007 - HACK week, thanks Novell!!
 666  - write - DONE
 667  - sync
 668      Somewhat done.
 669      Need to revise the process whereby async completion
 670      clears PageWriteback.
 671      We need locking in there, and need to worry about
 672        'which' wrapping too soon.
 673      Need to not start IO before we set page writeback
 674  - chmod
 675      Maybe, but syncing to disk needs more thought.
 676  - 'df'
 677     Partly done, need actual content.
 678  - mkdir
 679     Can make directory, but creating first entry fails. - FIXED
 680  - symlink
 681  - readlink
 682  - new directory structure.
 683
 684 27Jun2007 - More HACK week :-)
 685
 686  - new directory layout done - much easier!!
 687  - If I delete a file that was created, the blocks still have a ref-count
 688    and we crash.
 689  - mkdir doesn't increase link count on parent. - FIXED
 690
 691  TODO:
 692    Orphan handling.
 693      Infrastructure to process orphans
 694      Handle specific cases
 695      flush orphans at key times.
 696      load orphans at roll-forward
 697
 698    checkpoint
 699      Write out a checkpoint (when?)
 700      Make sure refcount goes back to zero on blocks I write.
 701
 702   Check on inode_phase_check and checkpoint_unlock and inode_dirty
 703    in all directory operations.
 704
 705  FIX: Writing a small file leaves something non-dirty but
 706     due to be written, and lafs_cluster_allocate complains.
 707   - seems to work now.
 708
 709  FIX: dir_handle_orphan doesn't lock the orphan transaction required.
 710
 711  FIX: rm a file with (small) content hangs waiting in sync_page in truncate_inode_pages.
 712
 713  FIX: lafs_allocate hasn't been written!!!
 714
 715  FIX: before updating any block in a depth=0 file, we must first load
 716       and 'lock' block 0.
 717
 718 29Jun2007 - still HACK week.
 719   Summary of how incorporation works.
 720
 721   Each index block has a small table for unincorporated changes. i.e.
 722   block numbers and their addresses.
 723   This supports efficient storage of extents, and is extensible by allocating
 724   more tables.  This last is done rarely.
 725
 726   When a block gets a new address, this is added to the table or, if
 727   there is a phase mismatch, it is added to a list until a phase change
 728   happens (so the whole block is pinned pending the phase change).
 729
 730   If the table is full then:
 731    - if the filesystem is read-only (including during roll-forward),
 732      a new table is allocated (else rollforward fails).
 733    - otherwise we incorporate the table into the block, then add the new
 734      address to the (now empty) table.
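
  The decision described so far fits in a few lines.  The enum and
  function names here are invented for the sketch; it only picks the
  action and does none of the real work.

```c
#include <assert.h>
#include <stdbool.h>

enum uninc_action {
    UNINC_ADD_TO_TABLE,          /* room in the table: just append */
    UNINC_QUEUE_FOR_PHASE,       /* phase mismatch: hold on the list */
    UNINC_ALLOC_TABLE,           /* read-only / roll-forward: chain a table */
    UNINC_INCORPORATE_THEN_ADD   /* writable: empty the table first */
};

/* Decide what to do with a block's new address. */
static enum uninc_action uninc_decide(bool phase_mismatch, bool table_full,
                                      bool read_only)
{
    if (phase_mismatch)
        return UNINC_QUEUE_FOR_PHASE;
    if (!table_full)
        return UNINC_ADD_TO_TABLE;
    return read_only ? UNINC_ALLOC_TABLE : UNINC_INCORPORATE_THEN_ADD;
}
```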
 735
 736   If incorporation requires that we split the index block we allocate one
 737    from a pool.  If there are none in the pool, we wait.
 738
 739   As the table is much smaller than a block, the incorporation into
 740   two blocks will always succeed.
 741   The 'uninc_next' and 'children' lists will then need to be shared
 742   between the two blocks before the new address is added to whichever
 743   table is appropriate.
 744
 745   When looking for a block address, we must always check the table and
 746   then children lists.  We do not need to check uninc_next as they will always
 747   be children.
 748
 749   How to ensure that the pool always has sufficient index blocks and we don't
 750   deadlock?
 751   We have two halves of the table, one for each phase.  Before we allow
 752   a block to be dirty in a phase, we ensure that the pool has adequate
 753   index blocks for that phase.  e.g. twice the depth of the block.  If it
 754   doesn't we block the dirtying until space becomes available.
 755   For syscall writes, this is easy as we catch in prepare_write.
 756   When we perform a phase change, we must be sure there are enough index
 757   blocks for the deepest block that will stay dirty.  If there aren't, we need
 758   to flush all dirty blocks, and unmap all writable mappings before
 759   starting the checkpoint.
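
  A sketch of the per-phase reservation, as a user-space model with
  invented names.  It takes "twice the depth of the block" literally
  as the amount to reserve.  A failed reservation stands for "block
  the dirtying until space becomes available"; the waiting itself is
  not modelled.

```c
#include <assert.h>

/* Hypothetical pool of spare index blocks, reserved per phase. */
struct iblock_pool {
    unsigned total;        /* spare index blocks in the pool */
    unsigned reserved[2];  /* blocks promised to each phase */
};

/* Reserve enough spare index blocks to dirty a block at 'depth' in
 * 'phase'.  Returns 0 on success, -1 if the caller must wait. */
static int pool_reserve(struct iblock_pool *p, int phase, unsigned depth)
{
    unsigned need = 2 * depth;
    unsigned committed = p->reserved[0] + p->reserved[1];

    assert(phase == 0 || phase == 1);
    assert(committed <= p->total);
    if (p->total - committed < need)
        return -1;
    p->reserved[phase] += need;
    return 0;
}

/* Give back a reservation once the block is no longer dirty. */
static void pool_release(struct iblock_pool *p, int phase, unsigned depth)
{
    unsigned need = 2 * depth;

    assert(phase == 0 || phase == 1);
    assert(p->reserved[phase] >= need);
    p->reserved[phase] -= need;
}
```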
 760
 761
 762  FIX: need to work out life time rules so that inodes hang around while they have blocks.
 763     currently have an igrab that is never put.
 764
 765  FIX: Dirty isn't cleared until 'flush', but do_checkpoints requires 'alloc' to clear it.
 766
 767 3Jul2007
 768  Checkpoint flushing is getting close.
 769  Current problem.
 770    InoIdx blocks are not changing phase.
 771    Phase change should happen when all children have been incorporated, and
 772     then the write has been triggered marking us clean.
 773   For InoIdx blocks, we need to be marked clean when the data block
 774    completes.
 775
 776 5jul2007 - a week off
 777  Checkpoint flushing seems to work !!!!
 778  FIX: what should filesize of symlink be?
 779      other filesystems use len, but still zero-terminate for vfs.
 780
 781  Problem.  A chmod is followed immediately by an unlink then a checkpoint.
 782    The chmod update gets into the checkpoint cluster, but the unlink completes
 783    before the checkpoint is finished so the new superblock sees the file
 784    as gone.  Roll-forward finds the update and wants to update a missing file.
 785
 786    This isn't a big problem, but with slightly different details, it could be.
 787
 788    One option is to ignore updates that precede the updated block.  That might
 789    be awkward with e.g. directory updates and checkpoints that cross multiple
 790    segments.
 791
 792    Another option might be to prohibit updates once a checkpoint has started
 793    unless they are known to be after the phase change.
 794
 795  FIX: unlink isn't punching a hole in the inode file.
 796       Inode usage map isn't being updated. - FIXED (For create, not unlink).
 797
 798  FIX: roll forward does not pick up inodes, only data blocks.
 799     But tiny files are synced to inode, so they might not be picked up.
 800     So we must process a level=0 inode like a data block.
 801
 802 6July2007
 803  Time for lots of clean up.
 804
 805 DONE 1/ Index blocks to fill with 0 - use phys=0 to imply invalid.
 806 DONE 2/ rename 'lock' -> 'pin'
 807  3/ Review and fix up all locking/refcounts.  See locking.doc
 808 DONE 3a/ Make sure cluster_allocate can be called concurrently. e.g. check
 809          B_Alloc inside the semaphore
 810        Also lock inode when copying in block 0 and probably
 811        when calling lafs_inode_fillblock (??)
 812 DONE 3b/ lafs_incorporate must take a copy of the table under a lock so
 813          more allocations can come in at any time.
 814 NotYet 3c/ cluster_flush should start all writes before calling _allocate
 815          as _allocate might block on incorporation/splitting.
 816        No.  We really want _allocate to not block, but to queue...
 817         I think this is too hard to get perfect just now, so I will leave it.
 818 DONE  3d/ introduce PinPending for data blocks.  remove fs->phase_depth.
 819 LATER 3e/ Index needs a clean-lru on each filesystem, and a list of filesystems
 820      so that locking of lru doesn't have to be too global
 821 DONE 3f/ change wc[]->hlhead to be a regular listhead as it is part of the
 822        lru system.
 823 DONE 3g/ revise refile lru handling based on new understanding
 824  3h/ Utilise WritePhase bit, to be cleared when write completes.
 825      In particular, find when to wait for Alloc to be cleared if
 826       WritePhase doesn't match Phase.
 827        - when about to perform an incorporation.
 828  3i/ make sure we don't re-cluster_allocate until old-phase address has
 829      been recorded for incorporation.
 830  3j/ Check that index blocks cannot race when getting locked....
 831   k/ Check what locking is needed to set PagePrivate exclusively.
 832 DONE  l/ cluster_done needs to call refile, but is called in interrupt context.
 833      We need to get it done in process context I think and lock
 834       ->waiting access with fs->lock after changing it to ->lru
 835 DONE  m/ Need to know which blocks in a page are in writeback so we can clear writeback
 836         only when *all* have finished.
 837 DONE  n/ on phase change, uninc_next blocks need to be shared out.
 838 NO 3o/ Make sure lafs_refile can be called from irq context.
 839  3p/   lock all lru accesses.
 840  3q/ Lock those index blocks!!!
 841  3r/ Can inode data block be on leafs while index isn't, what happens if we
 842        try to write it out...
 843  FIXED Why are extent entries only grouped in 4s?
 844  If InoIdx doesn't exist, then write_inode must write the data block.
 845  4/ resolve length of symlink
 846    FIXED - long symlink followed by 'sync' crashes.
 847    FIXED - rollforward isn't calling 'allocated' on blocks, or something
 848    FIXED - I cannot find 'bfile'. (inode isn't written)
 849    SEEMS OK...- Must flush final segment of a cluster properly...
 850  5/ Review what does, and does not need to be initialised in a new datablock
 851  6/ document and review all guards against dirtying a block from a previous phase
 852     that is not yet safe on storage.
 853           See lafs_dirty_dblock.
 854  7/ check for proper handling of error conditions
 855      a/ checkpoint_start might fail to start a thread!
 856      b/ if lafs_seg_dup fails in do_checkpoint, we need to abort the snapshot.
 857  8/ review checkpoint loop.
 858        Should anything be explicit, or will refile do whatever is needed?
 859  9/ Waiting.
 860        What should checkpoint_unlock_wait wait for?
 861        When do we need to wait for blocks to change state?  And how?
 862 DONE 10/ rebase on 2.6.current
 863 DONE     - use s_blocksize / s_blocksize_bits rather than fs->
 864
 865  11/ load/dirty block0 before dirtying any other block in depth=0 file
 866  12/ Add writecluster flag for old-phase updates.
 867      Why is this needed?  updates should always go in the new phase???
 868  13/ use kmem_cache for 'struct datablock'
 869  14/ indexblock allocation.
 870         use kmem_cache
 871         allocate the 'data' buffer late for InoIdx block.
 872         trigger flushing when space is tight
 873         Understand exactly when make_iblock should be called, and make it so.
 874  15/ use a mempool for skippoints in cluster.c
 875  16/ Review seg addressing code in cluster.c and make sure comments are good.
 876 DONE 17/ Make sure create inherits uid etc from process.
 877  18/ consider ranges of holes in pending_addr.
 878
 879 DONE 20/ Implement rest of "incorporate"
 880 DONE 21/ Implement staged truncate
 881 DONE         use for setattr and delete_inode
 882 DONE 22/ block usage counts.
 883  23/ review segment usage /youth handling and make a todo list.
 884       a/ Understand ref counting on segments and get it right.
 885  24/ Choose when to use VerifyNull and when to use VerifyNext2.
 886  25/ Store accesstime in separate (non-logged) file.
 887  26/ quotas.
 888         make sure files are released on unmount.
 889
 890  30/ cleaner.
 891        Support 'peer' lists and peer_find. etc
 892  31/ subordinate filesystems:
 893      a/ ss[]->rootdir needs to be an array or list.
 894      b/ lafs_iget_fs needs to understand these.
 895  32/ review snapshots.
 896       How to create
 897       how they can fail / how to abort
 898       How to destroy
 899  33/ review unmount
 900       - need to clean up checkpoint thread cleanly - be sure it has fully exited.
 901  34/ review roll-forward
 902       - make sure files with nlink=0 are handled well.
 903       - sanity check various values before trusting clusters.
 904
 905  34/ Configure index block hash_table at run time based on memory size??
 906  35/ striped layout.
 907          Review everything that needs to handle laying out at cluster
 908          aligned for striping.
 909
 910  36/ consider how to handle IO errors in detail, and implement it.
 911  37/ consider how to handle data corruption in indexing and directories and
 912      other metadata and guard against problems (lots of -EIO I suspect).
 913
 914  - check all uninc_table accesses are locked if needed.
 915
 916 And more:
 917   1/ fs->pending_orphans and inode->orphans are largely unused!
 918   2/ If a datablock is memory mapped writeable, then when we write it out,
 919      we need to either fill up its credits again, or unmap it.
 920   3/ Need to handle orphans asynchronously.
 921
 922 ---------
 923 22nov2007
 924 Free index blocks are on two lists, both protected by the global
 925 hash_lock.
 926   1/ The per-inode free_index, so they can be destroyed with the inode
 927   2/ The global freelist so they can be freed by memory pressure.
 928
 929 11feb2008.   Where was I up to again?
 930    reviewing phase_flip and lafs_refile.
 931
 932   UPTO
 933      Reading through modify.c, at 'add_indirect'.  Plan to fix all this code.
 934      Need to think about how index blocks really change.  How old blocks get
 935       dis-counted from segment usage, and what optimisations are really good
 936       for re-incorporating index blocks.
 937         Operations to consider are:
 938               i)Append new block, ii)truncate, iii)over-write, iv)fill-hole.
 939           i/ leaf block splits, index block gets new entry at end, and replacement
 940                   for other entry.  Easy to handle
 941          ii/ trailing entries are zeroed.  Should be easy, but isn't yet.
 942         iii/ probably caught in leafs.  May cause internal split so we add new
 943              index address, which is easily handled if there is space.
 944          iv/ same as iii, though split more likely.
 945
 946        What about merging index blocks.  That just makes addresses disappear, which
 947         we handle the slow way.
 948        Do we ever re-target index blocks?  Would need to be careful about that.
 949        Make it look like a split where one block ends up empty as a hole.
 950      Need to write
 951            grow_index_tree (DONE - untested)
 952                   ib is a leaf inode that is getting full.  Copy addresses
 953                   into 'new', and make 'ib' an index block pointing at new.
 954
 955            add_index/walk index (DONE - untested)
 956
 957            end of do_incorporate (DONE - untested)
 958                 new contains the early addresses.  Some remain in ib
 959                  and/or ui.
 960                 the buffers must be swapped, so ib has the early address.
 961                 ui needs to be attached to new
 962                 return 2; - then new uninc needs to be split
 963
 964            lafs_incorporate
 965                 case 2 - horizontal split
 966                 case 3 - vertical split
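
  Sketch of the grow_index_tree step listed above (copy the addresses out of
  the full leaf 'ib' into 'new', then turn 'ib' into an index block pointing
  at it).  This is a userspace C model only: struct iblock, SLOTS, the depth
  field and the name 'newb' (standing in for 'new') are assumptions, not the
  structures used in modify.c.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SLOTS 4   /* assumed number of entries per block */

enum blk_kind { BLK_LEAF, BLK_INDEX };

struct iblock {
    enum blk_kind kind;
    int depth;                     /* 1 for a leaf in this model */
    int count;                     /* entries in use */
    unsigned long addr[SLOTS];     /* leaf: data block addresses */
    struct iblock *child[SLOTS];   /* index: blocks one level down */
};

/* Push the contents of a full leaf down one level, keeping 'ib' in place. */
static void grow_index_tree(struct iblock *ib, struct iblock *newb)
{
    assert(ib != NULL && newb != NULL && ib->kind == BLK_LEAF);

    /* 'newb' takes over every address that 'ib' held */
    memset(newb, 0, sizeof(*newb));
    newb->kind = BLK_LEAF;
    newb->depth = ib->depth;
    newb->count = ib->count;
    memcpy(newb->addr, ib->addr, sizeof(newb->addr));

    /* 'ib' becomes an index block with a single entry: 'newb' */
    ib->kind = BLK_INDEX;
    ib->depth += 1;
    ib->count = 1;
    memset(ib->addr, 0, sizeof(ib->addr));
    ib->child[0] = newb;
}
```

  Keeping 'ib' in place means whatever already points at it stays valid;
  only the new block has to be attached.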
 967   12feb2008
 968    Bother - uninc_table is a problem (again).
 969    We can currently add at any time with just a spinlock.
 970    So when we split a block horizontally,
 971
 972
 973    Still need to
 974           share out children and uninc_table in do_incorporate
 975           share out credits in do_incorporate
 976
 977 14feb2008
 978    Still need to do incorporate as above but took a break to...
 979
 980    Counting allocated blocks now works - stat shows the right info, hopefully
 981      storage is correct too. - DONE
 982
 983    next: truncate?  orphan thread?
 984       Then segment usage and the cleaner.
 985
 986
 987    thoughts:
 988     truncate - removing blocks doesn't need to erase them...
 989     - nothing forces a cluster_flush promptly!!!  We need a timeout
 990          or at least we need a flush before truncate_inode_pages...
 991
 992     - in lafs_truncate we need to make the block an orphan and pin it
 993       all in a checkpoint.
 994
 995 21Feb2008 (Research morning)
 996    Discard checkpoint thread created on demand in favour of a cleaner
 997    thread that runs all the time.  It cleans and checkpoints and
 998    orphans and scans.
 999
1000      want to:
1001         do segment scan and get a real list of free segments and
1002         free-space info!
1003
1004 25Feb2008
1005  - segment usage scanning to count free blocks
1006  - fix up re-reading of erased blocks
1007  - FIX truncate can still block waiting for writeback to complete.
1008  - FIX allocations aren't failing when we run out of free space
1009  - FIX df doesn't agree with du.
1010
1011  problem:
1012    Truncate when an index block has addresses in uninc_table.
1013      The summary for the new address has already been performed.
1014      We need to deallocate the new without disturbing the old.
1015      However a simple allocation may not be possible.
1016      I guess we can prune them all to zero, then incorporation
1017       can proceed.
1018
1019  TOFIX: when truncating a recently created file, it is still depth=0 so
1020     nothing happens.
1021     We really need to increase the depth to 1 as soon as we dirty
1022     any block, then reset back to 0 if it fits.
1023
1024 26Feb2008
1025   We have a file that we have written to, and the data blocks have been
1026   written out and the addresses stuck in uninc_table.
1027   We then truncate the file.  Who releases the usage of those blocks?
1028   And who removes them from uninc_table?
1029
1030   OK, 'rm' returns all the blocks back now so 'df' is almost the same as 'du'.
1031   I really should make sure that inodes are getting freed properly and the
1032   inode map is clean and everything.
1033
1034   BIG QUESTION
1035     Do we reserve segment-usage blocks.
1036      We cannot do it naively as we get infinite recursion.
1037      But we need it to be allowed to dirty the segment block.
1038      But we cannot pin them to this phase as we want to write them out
1039      after this phase
1040      This still needs more thought.  I avoided the recursion by setting SegRef
1041      before getting the ref.  But that isn't safe.
1042
1043 28Feb2008
1044   The table of cleanable segments is not working out.  Each segment appears multiple
1045   times which wastes space and adds confusion.
1046   We really want to be able to lookup by dev/seg and also find the least.
1047   'Find least' sounds like we want a heap but then we cannot discard the bottom half.
1048
1049   We could have a skiplist for dev/segment lookup and do a merge-sort on
1050   a different link when we want to find the best segment.
1051   We then remember the best number found since a sort, and re-sort if the top
1052   is worse than the best.
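
  Sketch of the "re-sort only when needed" test above, as a userspace C
  model.  It assumes a lower weight means a better cleaning candidate (as
  'find least' suggests); struct clean_order and the function names are
  hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct clean_order {
    uint32_t top_weight;        /* weight at the head of the sorted list */
    uint32_t best_since_sort;   /* lowest weight added since the last sort */
    bool     added_since_sort;  /* best_since_sort is only valid when set */
};

/* Call after a merge-sort: 'top_weight' is the new head of the list. */
static void note_sorted(struct clean_order *o, uint32_t top_weight)
{
    assert(o != NULL);
    o->top_weight = top_weight;
    o->added_since_sort = false;
}

/* Call whenever an entry is added without re-sorting. */
static void note_added(struct clean_order *o, uint32_t weight)
{
    assert(o != NULL);
    if (!o->added_since_sort || weight < o->best_since_sort)
        o->best_since_sort = weight;
    o->added_since_sort = true;
}

/* The head is stale if something better has arrived since the sort. */
static bool needs_resort(const struct clean_order *o)
{
    assert(o != NULL);
    return o->added_since_sort && o->top_weight > o->best_since_sort;
}
```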
1053
1054   We keep all this in a fixed size table.  Each entry has
1055    seg,dev,usage,weight,weight-sort-link,addr-sort-link and possibly some
1056       addr-sort-skip links.
1057    This is 32+32+16+16+16+16 bits, or 16 bytes or bigger.
1058    Say 16bytes, 24bytes, or 32 bytes. (depth 8, which is plenty).
1059    One page of 16byte entries (256 of them)
1060    2/3 page of 24byte entries, 1/3 of 32byte entries.
1061    Total 2 pages, and 256+113+43 = 412 entries.
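
  The arithmetic above can be checked with a few lines of C.  A 4096-byte
  page is assumed, and struct layout / two_page_layout() are hypothetical
  names.  Note that the 32-byte entries are given whatever is left of the
  second page after the 24-byte ones; a strict third of a page would only
  hold 42.

```c
#include <assert.h>

#define PAGE_BYTES 4096u   /* assumed page size */

struct layout {
    unsigned int n16;   /* entries with no skip links */
    unsigned int n24;   /* entries with 8 extra bytes of skip links */
    unsigned int n32;   /* entries with 16 extra bytes of skip links */
};

/* seg, dev, usage, weight, weight-sort-link, addr-sort-link, in bits */
static unsigned int base_entry_bytes(void)
{
    return (32 + 32 + 16 + 16 + 16 + 16) / 8;
}

static struct layout two_page_layout(void)
{
    struct layout l;

    assert(base_entry_bytes() == 16);
    /* first page: 16-byte entries only */
    l.n16 = PAGE_BYTES / 16;
    /* second page: about two thirds for 24-byte entries ... */
    l.n24 = (PAGE_BYTES * 2 / 3) / 24;
    /* ... and whatever remains for 32-byte entries */
    l.n32 = (PAGE_BYTES - l.n24 * 24) / 32;
    return l;
}
```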
1062
1063   But deleting random elements is awkward... but not too awkward.  We can delete
1064   lots of entries by marking them as old, then performing a single pass of the skip
1065   list deleting them.
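
  Sketch of the mark-then-sweep deletion above.  To keep it short this is a
  userspace C model on a plain singly linked list, not a skiplist; a real
  skiplist would repeat the unlink at every level.  struct seg_entry and
  sweep_old() are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct seg_entry {
    unsigned int seg;
    bool old;                  /* marked for deletion */
    struct seg_entry *next;
};

/* Unlink every entry marked 'old' in a single pass; returns how many. */
static size_t sweep_old(struct seg_entry **head)
{
    size_t removed = 0;
    struct seg_entry **link = head;

    assert(head != NULL);
    while (*link) {
        struct seg_entry *e = *link;

        if (e->old) {
            *link = e->next;   /* bypass e; 'link' itself stays put */
            e->next = NULL;
            removed++;
        } else {
            link = &e->next;
        }
    }
    return removed;
}
```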
1066
1067   We should keep free segments here too, on a separate list.
1068
1069   So how about:
1070    2 pages of 16byte entries
1071    1 page of 24
1072    1 page of 32
1073
1074   free list randomly threads through all.
1075
1076   When using from 24 or 32, randomly choose height of 2-5 or 2-9
1077   Two lists run through the skiplist entries.  One for cleanable, one for free.
1078   Remember the nth element for some small n (10, but it decreases as we pull
1079   things off the front) and if we add something less than that, we trigger a
1080   mergesort on the next time we want to clean.... maybe.
1081
1082   Remember end of free list and add to there.  Maybe merge-sort the free list
1083   by addr occasionally.
1084
1085   Questions:
1086     When can we clean, when can we free wrt checkpoints?
1087       - we can clean a segment as soon as we have a checkpoint after it.
1088         So we record the youth of the segment holding the (start of the)
1089         checkpoint, and can clean any segment with a lower youth.
1090       - we can free a segment after the checkpoint after its usage has reached
1091         zero.  So if usage is zero and youth....
1092         We could offset the usage by one (say - for the first cluster header..)
1093         then when we find a segment with usage of '1', we schedule an update to
1094         0 in the next checkpoint...
1095     How about segments with different sizes - they get different weights.
1096        Need to divide by segment size:  usage * youth / size.
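
  Sketch of the two rules above: the size-normalised weight and the "younger
  than the checkpoint cannot be cleaned" test.  Userspace C model; struct
  segment and the function names are hypothetical, and it assumes a larger
  youth value means a more recently written segment.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct segment {
    uint32_t usage;   /* blocks still in use */
    uint32_t youth;   /* assumed: larger means written more recently */
    uint32_t size;    /* total blocks in the segment */
};

/* usage * youth / size, so segments of different sizes compare fairly. */
static uint64_t seg_weight(const struct segment *s)
{
    assert(s->size > 0);
    return (uint64_t)s->usage * s->youth / s->size;
}

/*
 * A segment may be cleaned once a checkpoint has been written after it,
 * i.e. its youth is lower than that of the segment holding the start of
 * the checkpoint.
 */
static bool seg_can_clean(const struct segment *s, uint32_t checkpoint_youth)
{
    return s->youth < checkpoint_youth;
}
```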
1097
1098   TOFIX
1099    - It seems I sometimes fall off the end of the last segment !!! - FIXED (locking)
1100    - We seem to switch to a new segment when still 83 blocks remaining? - FIXED (delete did flush)
1101
1102    - Lots of 'creates' makes lots of little clusters - need to optimise!
1103         Or it could be deletes as we currently cluster_flush for each
1104         delete.
1105          - I think this is fixed
1106
1107 29Feb2008
1108   Started looking at the cleaner.
1109   Need to understand how much to clean each checkpoint
1110   Need to track free-space-in-active-sectors while scanning.
1111
1112 3Mar2008
1113   TOFIX
1114     - the cluster head is currently limited to one page.  This is not good.
1115
1116     - Should the cleaner start before the scan is complete after a checkpoint?
1117       Probably it can, but while the scan is still happening it might be best
1118       to be cautious ??
1119
1120   STATE:
1121     try_clean is taking shape and has a few FIXMEs.
1122     need to write async find_block code and get it to watch for
1123        block in a cleaning segment.
1124
1125 28Mar2008
1126   - where can padding appear in a cluster? between miniblocks? at
1127     end of device blocks?
1128   - need to track phys block while parsing headers for cleaning.. why?
1129   - determine rules for avoiding block lookup during cleaning
1130     based on youth/snapshot age, and truncate generation.
1131      We need to load the inode from each snapshot
1132     Can we optimise based on snapshot age?
1133     only if we know the block is newer than the snapshot.
1134     So when we relocate blocks (cleaning) they must go in a segment
1135     that is marked as being old.  We cannot really guarantee that.
1136     I guess blocks that are marked as 'new' can safely be skipped if
1137      segment is newer than snapshot. This 'age' is not the youth, but
1138     is the cluster_head->seq which is stored in creation_age.
1139
1140  - Store the rootdir for a filesystem in the metadata for the root inode.
1141    Then 'struct snapshot' doesn't need rootdir.  It can have a root
1142
1143 30Jun2008
1144   Looking at lafs_find_block_async.
1145      Needs async flag to make_iblock.
1146         Check that.  Can we block_adopt if there was an error?
1147              iblock will exist.
1148      setparent has async flag.
1149      lafs_leaf_find has async flag
1150      lafs_wait_block_async
1151
1152   FIXME I wakeup the cleaner every time an IO completes.
1153   Do I really want that?  Maybe only when number of async IOs hits
1154   half the recent maximum??
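
  One way the "half the recent maximum" idea could look, as a userspace C
  model.  Everything here is an assumption: the names, waking at or below
  half instead of exactly at half, and the absence of any decay of the
  recorded maximum.

```c
#include <assert.h>
#include <stdbool.h>

struct async_io_stats {
    unsigned int in_flight;    /* async IOs currently outstanding */
    unsigned int recent_max;   /* highest in_flight seen; never decays here */
};

static void io_started(struct async_io_stats *s)
{
    s->in_flight++;
    if (s->in_flight > s->recent_max)
        s->recent_max = s->in_flight;
}

/* Returns true when the cleaner should be woken for this completion. */
static bool io_completed(struct async_io_stats *s)
{
    assert(s->in_flight > 0);
    s->in_flight--;
    return s->in_flight <= s->recent_max / 2;
}
```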
1155
1156   FIXME need to ensure that lafs_pin_dblock flushes committed
1157     B_Realloc blocks.
1158
1159   FIXME when we incorporate a dirty (non-realloc) address to an index block,
1160     we need to clear B_Realloc on the indexblock.
1161
1162   FIXME in lafs_incorporate we lafs_dirty_iblock 'new' without
1163     giving it any credits.  Where should they come from?
1164
1165   We don't seem to scan for free/cleanable segments often enough.
1166
1167   FIXME we shouldn't start a checkpoint while cleaning is happening.
1168
1169   FIXME need to be careful when cleaning about finding inodes that
1170     don't exist any more.
1171
1172   FIXME give credits to realloc blocks.
1173
1174   FIXME think about/document transitions between realloc and dirty,
1175     and what locking is needed.
1176
1177 2Jul2008
1178   Allowing for the FIXMEs above, the cleaner is now identifying
1179   blocks that need to be cleaned and marking them B_Realloc (I think).
1180   We now need to gather these into a write cluster and write them.
1181   They will all be on the clean_leafs list, so we can iterate that
1182    allocating or incorporating as needed.  This will be similar to
1183    do_checkpoint.
1184   Important question is: when?
1185    Ideally we would have some auto-flush mechanism.  The cleaner just
1186    keeps finding blocks to clean and when we start running out of
1187    resources we flush the cleaning queue.
1188    However we will still want to flush the cleaner always before a
1189    checkpoint, so for now we can implement that bit and wait for a
1190    need for the other to arise.
1191
1192
1193   FIXME: cleaner lookup of 0/0/0 has interesting consequences as we
1194       don't record that location the same way.. how to handle?
1195      Should check that 'adopt' doesn't do the wrong thing with this block.
1196
1197
1198   Realloc blocks need to be pinned.  That makes sense.  Only that way
1199     will they get onto the clean_leafs list.
1200   When checkpointing we should probably examine clean_leafs to be
1201    on the safe side.
1202
1203
1204   Realloc and Dirty:
1205
1206      Both of these hold a Credit.
1207      Both can be set at the same time.
1208      Cleaner ignores Dirty and sets Realloc anytime the block is in
1209       the wrong segment.  It also Pins the block.
1210      When the cleaner is flushing to the cleaning segment, it
1211       ignores Dirty blocks.  They get their Realloc cleared, but
1212       they remain pinned.  So they will get moved at the next checkpoint.
1213      How do we know whether an indexblock should be Dirty or Realloc?
1214       The Dirty/Realloc bit is cleared before we get to incorporation.
1215       Maybe we lafs_dirty_iblock the parent of any block we write
1216        out.  Then after incorporation, we set Realloc if it is not
1217        dirty.
1218
1219 STATUS:
1220   I think I'm pinning cleaner blocks now.
1221   Need to make sure the dirty ones are dropped. DONE
1222   Need to make sure the usage is transferred
1223   Need to get free segments back into use
1224   Need some more 'dump' options.  Maybe youth/usage files.
1225       Maybe tree.
1226   Need to make sure scan etc are triggered often enough.
1227
1228   FIXME lafs_prealloc walks up ->parent without locking
1229     I think we want i_mapping->private_lock like lafs_pin_iblock.
1230
1231 TODO:
1232   1/ a 'dump' option that triggers a scan and prints everything out.
1233   2/ scan must mark freeable as such, then subsequently free them.
1234   3/ Look at code that decreases usage of old segments.
1235   4/ Review lafs_cluster_wait_all and decide exactly how long we need
1236      to wait.
1237   5/ Review 'FIXME that is gross' HZ/10 thing.
1238   6/ Review 'wait for checkpoint to flush' msleep(500);
1239            Maybe remove that altogether.
1240
1241   FIXME BUG_ON in grow_index_tree fires. sync - writepages - flush
1242   FIXME BUG in lafs_allocated_block fired.
1243             from lafs_erase_dblock from invalidate_page from .. vmtruncate
1244              from lafs_setattr
1245
1246   Current problem:
1247     An inode data block is dirty and pinned, but the inoidx is no longer
1248        pinned.  Presumably it isn't dirty.
1249      Recheck what 'dirty' means on the two blocks and see how this can happen.
1250
1251 10july2008
1252   Tree gets very big!  Lots of 'Realloc' blocks that should
1253    be long gone.
1254
1255   We are spinning in cleaner again, and not in try_clean.
1256
1257   Is it a problem that 'Pinned' is used for Realloc and dirty blocks?
1258   In general it shouldn't be.  The flush_cleaner process will remove
1259    the Realloc bits so the blocks fall off clean_leafs.  They then either
1260    go onto phase_leafs or get unpinned.
1261   But I currently have a problem with InoIdx/data.
1262   The Pin is transferred to the Data block, but it doesn't go from the
1263    InoIdx block because it has a pincnt.  Now that is probably a bug, but
1264    what if it weren't?  What if, while we were cleaning, a block got dirtied.
1265    That would pin the whole tree.
1266   I guess the rule about not allocating an inodedata block while the
1267    InoIdx is pinned needs to be revised.  If the inodedata block is
1268    Realloc (and not Dirty) while the InoIdx is not Realloc, we
1269    can go ahead (in a cleaning segment).
1270
1271  FIXME to check:
1272    adir/big1 is garbage.... big1 was removed, so why is it even there?
1273               FIXED.
1274    echo tre > dump  # still too much stuff.
1275
1276
1277
1278  Put cond_resched in checkpoint loops!
1279
1280
1281  Thoughts about cleaning and pinning.
1282
1283   When cleaning we need to know how many dependent blocks are being cleaned
1284   so that we know when *this* block can be written - i.e. when the count hits 0.
1285   We cannot use the pincnt for this phase because there may be dependent blocks
1286   which are dirty.  They, and therefore this, may get flushed at next checkpoint,
1287   but they may not.  If we could be certain they would, we could just write
1288   to the clean-segment blocks which can become unpinned.  However if there
1289   is an index block being cleaned, and no dependant is being cleaned, but some
1290   are dirty but not pinned, then the checkpoint can go past without the block
1291   being moved.... but maybe we can detect that.
1292
1293   Try this:
1294     We set B_Realloc precisely on blocks found in segments being cleaned.
1295     We pin these blocks and leafs which are Realloc go in clean_leafs.
1296     If a block is both Realloc and Dirty we clear Realloc but leave pinned.
1297        That way it gets written at end of checkpoint, but to main cluster.
1298     When we incorporate Realloc blocks into an index block, it gets marked
1299        Realloc.  When we incorp dirty blocks, mark dirty.  Then see above.
1300     On a checkpoint, we process both phase_leafs and clean_leafs
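
  Sketch of how the "Try this" rules combine, as a userspace C model.
  B_Realloc is the name used in these notes; B_Dirty, enum leaf_list and the
  function names are assumed.  It also assumes the state a child had when it
  was written is still known at incorporation time, which the notes flag as
  an open problem.

```c
#include <assert.h>

enum {
    B_Dirty   = 1 << 0,
    B_Realloc = 1 << 1
};

enum leaf_list { NO_LIST, PHASE_LEAFS, CLEAN_LEAFS };

/* Both Realloc and Dirty: Dirty wins and Realloc is dropped. */
static unsigned int settle(unsigned int flags)
{
    if ((flags & B_Dirty) && (flags & B_Realloc))
        flags &= ~(unsigned int)B_Realloc;
    return flags;
}

/* Incorporating a child marks the index block the same way the child was. */
static unsigned int incorporate_child(unsigned int index_flags,
                                      unsigned int child_was)
{
    assert((child_was & ~(unsigned int)(B_Dirty | B_Realloc)) == 0);
    return settle(index_flags | child_was);
}

/* Which list a pinned leaf belongs on. */
static enum leaf_list list_for(unsigned int flags)
{
    flags = settle(flags);
    if (flags & B_Realloc)
        return CLEAN_LEAFS;
    if (flags & B_Dirty)
        return PHASE_LEAFS;
    return NO_LIST;
}
```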
1301
1302
1303  FIXME do inode reads async better when cleaning...
1304
1305  FIXME if a realloc inode has been allocated to a cluster when we try
1306      to dirty it, confusion can ensue as the writeout won't mark it
1307      clean, but will use up the credits.
1308      Maybe we need something similar to phasewait to not set PinPending...
1309       But normal dirtying doesn't phasewait.   I think we just need to
1310       detect this case and wait for the clean-cluster to flush.
1311       Messy...
1312
1313  FIXME make sure incorporate is doing the right thing with credits.
1314
1315  FIXME lafs_write_inode. We need to be careful about clearing Dirty
1316            when making an update.  Need some sort of locking.
1317            Need to review all inode dirty stuff and make sure we do
1318            the right thing no matter when it is called.
1319
1320  FIXME when blocks are attached to uninc_next, they don't have 'dirty'
1321         anymore so we don't know how to flag the index block.
1322
1323 2008jul13
1324  UPTO: unlink etc don't prealloc the inode that will be modified.
1325     And a warnon inode.c:579 is very noisy.
1326
1327 2008jul22
1328  FIXME: lafs_reserve_block uses CleanSpace if Realloc is set,
1329      but it doesn't get set until AFTER lafs_reserve_block is called.
1330
1331  Here I am...
1332    Cleaning cleans an InoIdx block which schedules the data block.
1333     Subsequently the InoIdx block gets pinned again.
1334     Now when we go to write the data block, we cannot because InoIdx is pinned
1335      in same phase.
1336      Maybe given that data block is pinned, we write it anyway...
1337
1338  FIXME: when we realloc a block embedded in the inode, don't pluck it out
1339         and put it back in again.  Just realloc the inode.
1340
1341  FIXME: when cleaning a directory that has shrunk, we think we have
1342      blocks that don't exist any more. FIXED - we thought '0' was in
1343      segment '0'.
1344
1345 2008jul23
1346   FIXME: lafs_dirty_iblock called from lafs_allocated_block in cluster
1347      flush finds no credit. for InoIdx block of 8501
1348
1349   FIXME: do we do SEGREF on all the index blocks? do we need to?
1350
1351
1352 2008jul24
1353   FIXME: seg usage for segment 0/5 isn't dropping to zero.
1354     Part of a file got moved off, but count is still there.
1355     FIXED - seg_move wasn't being called.
1356   FIXME: segusage file has inconsistent extents:
1357       Extent entries:
1358        0 -> 694 for 2
1359        1 -> 1291 for 1
1360        1 -> 15 for 1
1361    FIXED several bugs in walk_extent
1362
1363   FIXME qphase:  any locking between that changing and lafs_seg_move??
1364     I don't think so.  Just that seg_apply_all must be called after qphase is set.
1365
1366   FIXME make sure we don't try to clean the current segment!!
1367
1368   FIXME 'Available' goes negative!
1369       Creating large file doesn't instantly reduce 'Used'.
1370       Deleting files plus sync doesn't increase Avail?
1371
1372   FIXME a segment is in the table but doesn't print out!
1373
1374   FIXME we don't cope with running out of free segments (not that we ever should).
1375
1376   FIXME check all Credit usage and make sure credits are returned when
1377     ->parent is dropped.
1378     provide visibility into credit counts.
1379     Make sure we are keeping enough space for cleaning.  We should always
1380      have a few segments unallocatable.
1381
1382 2008jul25
1383   FIXME cannot do io completion in cleaner thread as it can block on
1384      an i_mutex which might be waiting for completion. FIXED (keventd).
1385
1386   FIXME as ->iblock isn't refcounted we need to be careful accessing it.
1387             If we 'know' we have a reference, e.g. a child with a ->parent
1388             link, we can access it without locking.
1389        So:
1390            lafs_make_iblock should return a counted reference.
1391
1392        If we own an (indirect?) reference to iblock, we can access
1393         both iblock and dblock for free... but iblock can change???
1394        If not, we need to get a reference to one or other under a lock.
1395
1396   FIXME block->inode should be a counted reference?
1397
1398 lafs_make_iblock OK
1399   lafs_leaf_find OK
1400     lafs_inode_handle_orphan OK
1401       inode_handle_orphan_loop FIXED
1402     __lafs_find_next OK
1403     find_block FIXED
1404   __lafs_find_next OK
1405     lafs_find_next FIXED
1406       dir_lookup_blk
1407       dir_handle_orphan
1408       lafs_readdir
1409       lafs_inode_handle_orphan
1410       choose_free_inum
1411   find_block - FIXED
1412
1413  FIXME root->iblock should always be refcounted.  Is it?
1414  FIXME walking siblings - what lock?
1415
1416 2008jul28
1417  FIXME several times we clear PinPending without refiling, in dir.c in particular.
1418     that looks wrong. FIXED
1419
1420   Maybe  lafs_new_inode should return a reference to the dblock
1421     Or pin it. or something. FIXED  And pinned (when needed).
1422
1423  FIXME lafs_inode_dblock might return a block without valid data...
1424    Need to get valid data, then load block 0 in find_block rather than
1425        load_block.  FIXED
1426
1427  FIXME we really should own a reference to ->dblock before calling
1428     lafs_pin_inode.  We don't want IO during a pin request.
1429     FIXED
1430
1431  FIXME review use of PhysValid FIXED
1432
1433  lafs_orphan_abort - what if lafs_orphan_pin not called?
1434    or if 'b' is NULL.  FIXED
1435
1436  Do I need to clear PinPending when retrying??
1437    Well, we need to be phase-locked when we set PinPending, so
1438     it must be Pinned to the current phase.
1439     So when we unpin a datablock, we must clear PinPending.
1440   FIXED we now clear PinPending in do_checkpoint.
1441
1442  Does phase_wait do the right thing when pinning an inoidx block
1443    for an inode? FIXED
1444
1445
1446 Pending
1447   Need to understand and document the lifetime of a page with datablocks.
1448     who hold what refcount, and when can it be freed?
1449    Then fix up locking in lafs_refile, __putref.
1450
1451  FIXME who keeps what refcount on orphan blocks/inodes??
1452  FIXME should dirty/pinned/etc hold a refcount?  they don't.
1453
1454
1455 Later:
1456  FIXME make sure a failed (-EAGAIN) pinning triggers a checkpoint (eventually)
1457
1458  FIXME make sure empty files have depth of 1.
1459
1460  FIXME Truncate proceeds lazily. All data blocks need to be gone
1461
1462 26aug2008
1463  If I call lafs_erase_dblock while a write is underway, we have a problem.
1464   We need to wait potentially for a checkpoint to let go of the block and
1465    a write to complete.
1466     This should be done with waiting for PG_writeback on the page to disappear.
1467   Check this out.
1468
1469   When end_page_writeback is called, we must have dropped all references to the
1470    page.
1471   When we commit to writing a block, we have to set PG_writeback on the page
1472    so that truncate et al can wait for it.  Before we have committed, truncate
1473    can just remove the page.  Internally we differentiate by B_Alloc.
1474   So before setting B_Allocated we need to test_set_page_writeback(page).
1475   Be careful of races.
1476   I don't think we can ensure all references are dropped.  After all, that is
1477   the point of refcounts.  So dblock array must exist without page!
1478   But we need to ensure that we don't start a writeout after truncate
1479   has done wait_on_page_writeback.
1480   This is done with the page locked so when we want to write a page
1481     in a checkpoint, we need to lock the page first.  Once we have the lock,
1482     we check if the page is still dirty.  If it has been truncated it
1483     will be clean.
1484    But how do we safely reference the page if b->page can be cleared?
1485     How about:
1486       When we clear PagePrivate, we take a counted reference to the page
1487       for db->page.  This is dropped when the page is freed by lafs_refile.
1488       But while it is held, it is still safe for db->page to be dereferenced.
1489     So before we commence writeout we have to lock the page and set
1490      PG_writeback.  After locking, we need to test if writeback is still
1491      appropriate.
1492
1493   Maybe not.  I think we can submit blocks for writeout without setting the
1494   page to writeback.  If we do, then we need to be sure those writes
1495   finish before invalidatepage calls releasepage (block_invalidatepage
1496   calls discard_buffer which calls lock_buffer which waits).
1497   In our case invalidatepage needs to make sure that no new write commences.
1498   Maybe we should lafs_iolock_block before we allocate to a cluster and check
1499   again if the block is dirty.
1500
1501   So:
1502     lafs_cluster_allocate does:
1503        lafs_iolock_block
1504        check if still dirty.  If not, unlock and return
1505        set allocate flag
1506        allocate and write
1507        when write completes, allocate is cleared.
1508                     unlock block
1509
1510     invalidatepage does
1511        lafs_iolock_block
1512        clear Valid,Dirty,Realloc
1513        lafs_iounlock_block
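
   The two sequences above can be modelled in user space as a rough
   sketch.  Everything here is hypothetical: 'struct blk' is not the real
   LaFS block, the boolean 'iolocked' stands in for lafs_iolock_block /
   lafs_iounlock_block, and where real code would sleep this model just
   reports failure.  Clearing Dirty at write completion is my assumption;
   the notes only say that the allocate flag is cleared.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical single-threaded model of a block; not the LaFS struct. */
struct blk {
    bool iolocked;              /* stands in for the block iolock */
    bool valid, dirty, realloc;
    bool allocated;             /* "allocate flag": a write is committed */
};

/* lafs_cluster_allocate as sketched in the notes.
 * Returns true if a write was started (the iolock stays held). */
static bool cluster_allocate(struct blk *b)
{
    if (b->iolocked)            /* real code would wait for the lock */
        return false;
    b->iolocked = true;
    if (!b->dirty) {            /* truncated or invalidated meanwhile */
        b->iolocked = false;
        return false;
    }
    b->allocated = true;        /* allocate and write */
    return true;
}

/* Write completion: clear the allocate flag, then unlock. */
static void write_complete(struct blk *b)
{
    b->allocated = false;
    b->dirty = false;           /* assumption, see lead-in */
    b->iolocked = false;
}

/* invalidatepage: only touches the flags under the iolock.
 * Returns false where real code would block waiting for the IO. */
static bool invalidate(struct blk *b)
{
    if (b->iolocked)
        return false;
    b->iolocked = true;
    b->valid = b->dirty = b->realloc = false;
    b->iolocked = false;
    return true;
}
```

   The point of the ordering is that invalidate can never clear the flags
   while a write is in flight, and a write can never start on a block that
   invalidate has already cleaned.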
1514
1515
1516
1517 2008 aug 28 - happy birthday.
1518 FIXME segsum_find calls lafs_reserve_block without a checkpoint lock.
1519 lafs_prealloc complains.
1520
1521   mark_cleaning does too, but cleaning only happens well away from a checkpoint
1522   lock.
1523 segsum_find is being called to reference a new segment when we flush a cluster.
1524  segment usage blocks are special.  Their index information doesn't
1525 need to be written out in the current checkpoint.  We can do that, but
1526 the backstop is to write just the data block in the tail of the
1527 checkpoint and write indexing information later.
1528
1529 2008sep10
1530  unlink is getting "No space left on device".  This is when trying to
1531  pin the directory block, the physaddr is 0, so it looks like we want
1532  NewSpace.  But we shouldn't even be trying to prealloc in that case because
1533  there should already be a prealloc on the block.  i.e. there should be
1534  credits.
1535  Hmmm. after multiple 'syncs' how can the block not be written out.
1536  Maybe it is embedded in the inode?
1537  When we pin a block that was embedded in the inode it isn't clear what to
1538  do.  If we might grow the file so it doesn't fit any more, we need to
1539  allocate NewSpace.  If we know it won't grow, we use Release.
1540   This still needs a proper fix.
1541
1542  Cleaning seems to be working nicely.  However we don't get all the space
1543  back that we should because lots of blocks still have credits that
1544  aren't being returned.
1545
1546  So when should credits be returned?
1547  They are set when a block is pinned.  It then gets dirtied which
1548  consumes a credit.  Then gets unpinned.  I guess if it isn't pinned,
1549  then it doesn't need any credits.
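
  A minimal sketch of that credit life cycle, assuming a single global
  pool.  The names (struct pool, struct cblk, pin, mark_dirty, unpin) are
  made up for illustration and are not the LaFS functions; the only rules
  taken from the notes are "pin sets credits, dirtying consumes one,
  an unpinned block needs none".

```c
#include <assert.h>
#include <stdbool.h>

struct pool { int free_credits; };          /* hypothetical global pool */
struct cblk { int credits; bool pinned, dirty; };

/* Pinning reserves 'want' credits on the block, or fails
 * (compare the -EAGAIN / "No space left on device" cases). */
static bool pin(struct pool *p, struct cblk *b, int want)
{
    if (p->free_credits < want)
        return false;
    p->free_credits -= want;
    b->credits += want;
    b->pinned = true;
    return true;
}

/* Dirtying a clean block consumes one of its credits. */
static void mark_dirty(struct cblk *b)
{
    if (b->dirty)
        return;
    assert(b->pinned && b->credits > 0);
    b->credits--;
    b->dirty = true;
}

/* An unpinned block needs no credits: hand back whatever is left. */
static void unpin(struct pool *p, struct cblk *b)
{
    p->free_credits += b->credits;
    b->credits = 0;
    b->pinned = false;
}
```

  If unpin (or its real equivalent) forgets to return the leftover
  credits, the pool shrinks a little on every pin/unpin cycle, which is
  the kind of leak described above.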
1550
1551
1552  It seems that cluster_flush is not always writing things in the correct
1553   order.  Root gets written before some other things below it.
1554    Maybe they are temporarily out of the loop??
1555  No.  There are dirty blocks which one checkpoint doesn't pick up, but
1556   they aren't holding the index block pinned. so they lose allocation.
1557
1558  But they must hold the indexblock pinned, even though they aren't pinned
1559  themselves.  We maybe do this just with the refcnt... maybe.  That will cause
1560  it to phase-flip rather than drop pinning, which I think is right.
1561
1562  So: too many credits remain allocated.  Where are they?  There are 1464
1563    outstanding credits. 290 are in the tree so 1200 or so are elsewhere??
1564    But things removed from the tree have credits removed.
1565
1566
1567
1568 FIXME roll forward ignores inodes.  But what about an inode that contains
1569    data.  Should that be ignored?  I think not.
1570 FIXME delete adir/big2 then delete adir and it cannot release:
1571   Cannot release [cee29000]74/0(0)r1:Pinned,Phase0,Valid,Dirty,SegRef,UninCredit,PhysValid,Prealloc
1572  presumably there is orphan processing or something to complete???
1573 FIXME when files are deleted, the space isn't returned!
1574    This seems to be mostly fixed - need to test.
1575 FIXME when I "rm [b-z]*" it waits for writeback on something???
1576    zfile again!!!  OK, I think that is fixed.
1577
1578
1579 12sep2008
1580   Current problem:
1581     seg_apply_all dirties dblocks.  When should they be reserved?
1582     They originally get reserved by a lafs_reserve_block call in
1583     segsum_find called from e.g. lafs_seg_ref which is called by lafs_reserve_block.
1584     However: that block might get written before *and* after a checkpoint.
1585     So we need N* Credits.  These are usually only used for Index blocks.
1586     We can set these easily enough if inode type is TypeSegmentMap.
1587     We move them across to Credit in seg_apply_all.
1588     But when do we clear them if they aren't needed?  I guess
1589      when we drop the last segref.  Yes, we already do that.
1590     FIXME need to make sure these get flushed on next checkpoint
1591      if we cannot allocate new credits after a checkpoint.
1592
1593   New Problem.  The 'cleanable' table reports a size of 3, but it is empty!
1594     Think that is fixed.
1595
1596   Some problems.
1597     1/ see above:  rm x/y; rmdir x -> BUG - FIXED
1598     2/ Spins on 'CURRENT=1' ??
1599     3/ if alloc_space gives EAGAIN while deleting, we don't survive.
1600     4/ When I create/delete a file, ablocks_used increments by one.
1601         The inode hasn't been allocated yet, so it seems the deallocation
1602          isn't adjusting ablocks_used??
1603     5/ open_namei (for dd) got caught on a mutex_lock.
1604     6/ When a large file is shrunk we don't reduce the level of the InoIdx block
1605        I'm not sure where we should and am not thinking very clearly.
1606        Will fudge something in flush_data_to_inode for now, but it MUST be fixed.
1607     7/ unlink (at least) can get stuck in iolock_block.  Who could be holding
1608        the lock?  Writeout that hasn't completed?
1609        Yes.  writepage calls lafs_allocated_block without calling flush.
1610        So the block could be sitting waiting for a flush.  How long do we
1611        wait??
1612     8/ It seems that some datablock can need NCredits.  Make sure these
1613        are handled properly re flush-or-refill after checkpoint and
1614        flip_phase rather than unpin.
1615     9/ Maybe after lafs_writepage cluster_flush isn't getting called soon
1616        enough, and we lock up (see 7).  Need to flush the first block
1617        straight away, and the next one as soon as the first finishes, etc.
1618        Or something like that.  Then remove the comment from lafs_writepage.
1619
1620 8th December 2008
1621
1622   I seem to be getting only 4 blocks to a cluster at the moment.
1623    This is good as it motivates the code to handle block splitting in
1624    the Btree.   But it shouldn't happen.
1625
1626   ....
1627   Block splitting might work - it doesn't crash at least.
1628   But
1629   After deleting all files, the tree is full of stuff.
1630   Lots of inode data/InoIdx blocks.
1631   Many but not all are Pinned.  The others are OnFree
1632   The Pinned ones have outstanding references.
1633   Others
1634
1635   ....
1636   Problem with the block splitting, when adding an index block.
1637   The index block is initially empty - we need to find things by looking
1638   at children.  But we don't.  We BUG_ON the iphys==0.
1639   In general, when we add a block below an index block and before we incorporate,
1640   the block must be found by finding the first indexed block and looking to
1641   see if there is a 'next' block that contains the address we need.
1642   FIXED
1643
1644   But if we truncate a file while an index block is pinned and dirty,
1645   we spin on trying to incorporate it, which should make it empty.
1646
1647 11th December 2008
1648   deadlock.
1649   sync is trying to get lock in lafs_cluster_flush
1650   pdflush holds the lock and is stuck in cluster_flush_0xa40
1651     some wait_event I expect.
1652     Maybe we need an unplug ??
1653
1654  - checkpoint/seg_apply_all/dirty_dblock doesn't have the credits.
1655    This is in clean_free.  We try to update the 'youth' to mark
1656    the segment as free, and we don't have a reservation to do it.
1657    Maybe just reserve it there and then.
1658
1659
1660 12th December 2008
1661   When doing a lookup in an index block, we need to check the unincorp
1662   address list.  It isn't enough to look for unincorp blocks as they
1663   might have disappeared.
1664   For INDIRECT and EXTENT this is easy enough as full information is in
1665   'uninc'.
1666   For INDEX it is a little tricky as we need to look at the full set of
1667    addresses to know where a particular address fits.
1668    We could force an incorporate first, but that has awkward implications
1669     if it requires a split.
1670    Maybe if we get from the lookup "start+range"....
1671      That is not enough as the 'start' might get zeroed by an update.
1672
1673
1674    rm adir/* doesn't work as readdir doesn't get all the entries
1675     for some reason.
1676    Reason is that they are being put in the wrong block.
1677    lafs_find_next doesn't correctly find the 'next' block if it
1678    hasn't been incorporated yet.
1679    Block can be:
1680      in index tree -- easy to find
1681      in uninc_table -- not too hard
1682      only in the ->children list, or attached to a page.
1683    It would be nice to use find_get_pages but that isn't exported so try
1684     something else for now.
1685    For index blocks
1686         Look in index block for 'next
1687
1688 15th December 2008
1689    FIXME when we split an index block, we need to hold a reference to
1690    the original so it doesn't disappear until the split-off copy is
1691    written.  This is because we search from an index block to find
1692    split-off copies.
1693    [ note from Feb09.  This should be OK now. Both will need
1694    incorporation, and we now hold on to blocks until they are
1695    incorporated.]
1696
1697
1698
1699 23rd February 2009
1700   - index block.  What changes are allowed exactly.
1701      - splitting certainly makes sense.
1702      - merging two adjacent blocks is fine, of which a special case
1703        is finding that a block is empty and so removing it.
1704      - What about a 2->3 split which would require removing a block
1705         and adding another at the same time?
1706        or noticing that the first blocks addressed are all missing, so
1707        moving the index forward?
1708        In each case, searching down by indexes will find a block that
1709        has been replaced by a later address.  We could manage that as
1710        long as the new block is attached after the replaced block.
1711        So we cannot move a block.  We must delete and replace.
1712
1713   - unincorporated index blocks..
1714     unincorporated data blocks are not pinned in memory.  Once they have
1715     been written out, they can be freed.  Their address is stored in the
1716     uninc-table.  This means we can delay incorporation while many
1717     extents are written out and freed.  When we come to incorporate, we
1718     may have many hundreds of addresses in a few extents that can be incorporated
1719     efficiently without holding all that data pinned in memory.
1720     The same scale doesn't apply to index blocks.  An index block can
1721     reference only 102 blocks (for 1K block size).  And the uninc table can
1722     hold far fewer so we will naturally incorporate more often.
1723     So keeping index/indirect/extent blocks pinned until they are incorporated
1724     is reasonable.  And it makes lookup a lot easier, as we have
1725     guarantees about ordering of blocks in the children list that we
1726     don't have in the uninc table.
1727
1728     Incorporation could have some atomicity issues.  There is no
1729     concern about bad stuff appearing on disk as the phase-change
1730     process handles that.  In memory it might be awkward if we split
1731     an index block before incorporating a block that would span them.
1732     That could conceivably happen if we only incorporate 8 blocks
1733     (size of uninc table) at a time.
1734     So maybe we should incorporate a full uninc list (not table) at
1735     a time.
1736     This means quite different code paths for incorporating leaf
1737     and internal index blocks....
1738
1739
1740   - uninc_table lists are a real problem.
1741     They can only be created during roll-forward so they hardly ever
1742     happen.
1743     But if the block is split while processing earlier things on the
1744     list, then splitting an uninc table would be very messy.
1745     Is there any way around this?
1746     Why not just do incorporation during roll-forward?
1747     We only need to incorporate leafs, not internal blocks because we
1748     don't use uninc_table for internal blocks any more.
1749     So during roll forward, all index blocks that are touched need to
1750     be held in cache...
1751     I think we live with that.  If it ever becomes a problem, we will
1752     need to perform the roll-forward twice.  The first time collects
1753     the usage information so that we know where we can start writing,
1754     then the second just applies all the changes to the rest of the
1755     filesystem.
1756
1757
1758    So:
1759      uninc table only used for leaves, and has no linked list
1760      unincorporated index blocks are stored on a list, which we
1761      sort before applying.
1762      All uninc index blocks are therefore kept in the index tree.
1763      Their order on the children list allows us to find the correct
1764      index. Each block for which the fileaddr is in the parent is
1765      followed by any blocks that have been split off and end after
1766      this one starts.  Blocks that have been emptied are Hole and are
1767      skipped over when looking for a block.
1768
1769      When we split an internal block, the remaining uninc blocks
1770      must not start with a Hole.
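
     The ordering rule on the children list can be sketched as a lookup.
     This is only a model under my own assumptions: the children are
     shown as an array rather than a list, 'struct child' and find_child
     are hypothetical names, and 'from' is the entry that the parent's
     index pointed us at.

```c
#include <assert.h>
#include <stdbool.h>

struct child {
    unsigned start;     /* first file address the block covers */
    bool in_parent;     /* its fileaddr is recorded in the parent */
    bool hole;          /* emptied block: skipped when searching */
};

/* Starting at the child found via the parent index, walk over the
 * blocks split off from it and return the index of the last non-Hole
 * block that starts at or before 'addr', or -1 if there is none. */
static int find_child(const struct child *c, int n, int from, unsigned addr)
{
    int best = -1;

    assert(from >= 0 && from < n && c[from].in_parent);
    if (!c[from].hole && c[from].start <= addr)
        best = from;
    /* Split-off blocks follow their origin and are not in the parent. */
    for (int j = from + 1; j < n && !c[j].in_parent; j++) {
        if (c[j].hole)
            continue;
        if (c[j].start > addr)
            break;
        best = j;
    }
    return best;
}
```

     The walk stops at the next block that is recorded in the parent,
     because anything beyond that point belongs to a later index entry.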
1771
1772    FIXME: what locking do I need around lafs_incorporate?
1773       i_mutex?? i_alloc_sem??
1774       i_alloc_sem is imposed by truncate (inode_setattr) and
1775          direct_io possibly.  So it is really about adding/removing
1776          blocks.  Not updating internals.
1777          Maybe our own mutex.  Could even be per-index-block !!
1778       Whatever it is, we need to protect walking ->children too.
1779
1780
1781 24th February 2009
1782   "rm -r" problem from 12/dec/2008 fixed now.
1783   incorporate code got a make-over and is probably much better.
1784
1785   New problems:  After test runs, cannot create files due to no space
1786      on devices!!  But directory tree is empty.
1787   I can see:
1788
1789     free_blocks=3256 allocated=1425 max_seg=512 clean_reserved=0
1790
1791   The problem is that we think 1425 has been allocated to data that
1792   might still need to be written, leaving not enough room for more.
1793   Index Dump shows
1794   ====================414 credits ==============================
1795   which doesn't explain everything, but does explain a lot.  There
1796   really should be nothing in the Index tree (except fs-root and
1797   tree-root)
1798   There is also:
1799   Some inodes which are OnFree and hold no credits.
1800     0 DATA (1)  52 [0]ESegRef,Claimed,PhysValid
1801     52    1 (0)   0 [2564]{0,00000000}L on free Index(1),InoIdx,OnFree,PhysValid
1802
1803   Some other inodes which are pinned with lots of credits and are
1804     on the phase_leaf list
1805     0 DATA (1) 299 [0]ESegRef,C,CI,Claimed,PhysValid
1806    299    1 (1)   0 [0]{0,00000000} [0, 0]L Leaf1(40) Index(1),Pinned,Phase1,InoIdx,Valid,SegRef,C,CI,CN,CNI,PhysValid,Prealloc
1807
1808   And that is about it.  some are not Valid, some are...
1809   checkpoint just wants to 'flip' them.
1810   They mostly have a refcnt of 1... I wonder who is holding that....
1811   The reference on the dblock is held by the iblock.
1812   But what is the iblock remaining?  Who holds that reference?
1813
1814   I restored some code to clean iblock, and now:
1815   free_blocks=3229 allocated=1277 max_seg=512 clean_reserved=0
1816   ====================244 credits ==============================
1817   which saved 170 credits (414 - 244).  That helps.
1818   There seem to be many fewer of the many-credits blocks
1819   Lots of index blocks in tree are 'OnFree' and have a
1820   0 refcnt, but haven't been removed.  Why?
1821   It seems that they have ->parent == NULL, so lafs_refile never
1822   bothers to remove them.  I guess it should...
1823   OK, lots of InoIdx blocks have gone now with their DATA blocks.
1824
1825   So, remaining blocks are pinned to their phase with lots of Credits,
1826     have no pincnt, mostly have physaddr==0.
1827    It is just the stray refcnt that keeps them there..
1828    inums are 40, 56, 62-73, 275-278, 280
1829     40 is f22
1830     56 is first adir
1831     63-69 are directories 2/3/4/5/6/7/8/9
1832     70-73 are looooong symlinks
1833     275 is cfile
1834     276 is dfile - same as cfile but truncated.
1835       Then some nbfile-X that were big enough.
1836
1837    So: what do they have in common:
1838      Several only use the in-inode data block, but
1839        probably not all
1840
1841     Can it be that it is refcounted on the Leaf list, and so
1842     cannot get off??  Yes, I think so!
1843     We only unpin things that have a zero refcount.
1844
1845     So: what to do?
1846       checkpoint takes it off the list, then flips the phase and puts it
1847       on the other list with refile.  During that time it has a refcount
1848       so it doesn't lose the pinning.
1849       Do we want to:
1850         1/ Not have it on the list despite being pinned.
1851         2/ Drop the PIN despite the refcnt.
1852         3/ have refile do the phase_flip so it has a chance to
1853            notice the refcount has hit zero.
1854
1855       2 isn't really an option.  We need PIN to persist whenever we have
1856        a reference.  We could possibly use PinPending for index blocks too,
1857        but that would require a lot of thinking.
1858       1 requires another criterion for being on the list.  I suspect that would
1859        get messy fast.
1860       3 we used to do I think... But refile is in a big lock, and we
1861         cannot really do a phase_flip under that.. and phase flip calls
1862          refile anyway so we would get recursion.
1863       So:4 - get lafs_phase_flip to notice and de-pin rather than flip.
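
       Option 4 could look roughly like this.  It is a sketch only:
       'struct iblk' and phase_flip are stand-ins, not lafs_phase_flip
       itself, and treating "not dirty" as the extra condition for
       de-pinning is my assumption.  The idea is to discount the
       reference that the leaf list itself holds before deciding.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of a pinned index block. */
struct iblk {
    int refcnt;
    int phase;              /* 0 or 1 */
    bool pinned, dirty;
    bool on_leaf_list;      /* the leaf list holds one reference */
};

/* Returns true if the block was flipped to the other phase,
 * false if it was de-pinned instead. */
static bool phase_flip(struct iblk *b)
{
    /* References other than the one owned by the leaf list. */
    int others = b->refcnt - (b->on_leaf_list ? 1 : 0);

    if (others == 0 && !b->dirty) {
        /* Nobody else cares: drop the list reference and the pin. */
        if (b->on_leaf_list) {
            b->on_leaf_list = false;
            b->refcnt--;
        }
        b->pinned = false;
        return false;
    }
    b->phase = !b->phase;   /* stays pinned, moves to the other list */
    return true;
}
```

       This avoids the loop where the list reference alone keeps the
       block pinned, without needing refile to do the flip.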
1864
1865       FIXME use kzalloc where appropriate.
1866
1867       FIXME Maybe test refcnt-!listempty in refile and de-pin if that is zero.
1868
1869 25th February 2009
1870   Good progress.
1871   Only 54 credits in Index Tree now.
1872   Inodes 1 2 8 16 are present. (fsroot, dirroot, inodemap, segusage)
1873   plus '74', which seems to be scheduled for deletion - root has uninc_table.
1874    ... and 'sync' got rid of that and left 44 credits.
1875   Also have data blocks for inode 50 55 72 73 74 with 2 credits of 74.
1876     50  link
1877     55  zfile
1878     72  long84
1879     73  long85
1880     74  adir
1881   These seem to be the files that used data-in-the-inode
1882   They still have a refcnt of 1 (or 2 for adir).
1883   ... OK, that's gone now.  I found a refcount leak.
1884
1885   So now:  42 Credits in Index Dump.   No stray files.
1886
1887   df: tot=4608 free=4597 avail=3045(4130-1085) cb=8 pb=0 ab=3
1888   So we still seem to have 1085 blocks allocated.  42 are accounted
1889   for, so 1043 still missing... either we lost the count, or lost the tree.
1890
1891   create a tiny file, remove, and sync, now
1892   df: tot=4608 free=4597 avail=3018(4118-1100) cb=8 pb=0 ab=3
1893
1894   so I lost 15, but now 48 are in tree.  Let's try again...
1895   df: tot=4608 free=4597 avail=3006(4108-1102) cb=8 pb=0 ab=3
1896   and 44 in tree
1897   and again:
1898   df: tot=4608 free=4597 avail=2986(4098-1112) cb=8 pb=0 ab=3
1899
1900   Definitely losing more than the difference in the tree.
1901
1902   Try creating empty files...
1903 df: tot=4608 free=4597 avail=2986(4098-1112) cb=8 pb=0 ab=3
1904 df: tot=4608 free=4597 avail=2974(4088-1114) cb=8 pb=0 ab=3
1905 df: tot=4608 free=4597 avail=2954(4078-1124) cb=8 pb=0 ab=3
1906 df: tot=4608 free=4597 avail=2942(4068-1126) cb=8 pb=0 ab=3
1907 df: tot=4608 free=4597 avail=2922(4058-1136) cb=8 pb=0 ab=3
1908 df: tot=4608 free=4597 avail=2910(4048-1138) cb=8 pb=0 ab=3
1909 df: tot=4608 free=4597 avail=2890(4038-1148) cb=8 pb=0 ab=3
1910
1911  very strong pattern there.
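
  To make the pattern explicit: in the seven lines above the second
  number in the parentheses (the allocated count) goes up by 2, then 10,
  then 2, then 10 ..., while the first number drops by 10 every time.
  A small user-space helper for pulling the deltas out of such lines,
  assuming the line format shown above.  The field names 'base' and
  'allocated' are my own labels for the two numbers in parentheses.

```c
#include <assert.h>
#include <stdio.h>

struct df_line { int tot, free_blocks, avail, base, allocated; };

/* Parse one "df: tot=... free=... avail=A(B-C) ..." line.
 * Returns 1 on success, 0 if the line does not match. */
static int parse_df(const char *line, struct df_line *d)
{
    return sscanf(line, "df: tot=%d free=%d avail=%d(%d-%d)",
                  &d->tot, &d->free_blocks, &d->avail,
                  &d->base, &d->allocated) == 5;
}

/* Store allocated[i] - allocated[i-1] for i = 1..n-1 in deltas[].
 * Returns the number of deltas written, or -1 on a parse error. */
static int allocated_deltas(const char *const *lines, int n, int *deltas)
{
    struct df_line prev, cur;

    if (n < 1 || !parse_df(lines[0], &prev))
        return -1;
    for (int i = 1; i < n; i++) {
        if (!parse_df(lines[i], &cur))
            return -1;
        deltas[i - 1] = cur.allocated - prev.allocated;
        prev = cur;
    }
    return n - 1;
}
```

  Each line also satisfies avail == base - allocated, which is a cheap
  sanity check on the parse.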
1912  What about 2 files at a time.
1913 df: tot=4608 free=4597 avail=2879(4028-1149) cb=8 pb=0 ab=3
1914 df: tot=4608 free=4597 avail=2860(4018-1158) cb=8 pb=0 ab=3
1915 df: tot=4608 free=4597 avail=2849(4008-1159) cb=8 pb=0 ab=3
1916 df: tot=4608 free=4597 avail=2830(3998-1168) cb=8 pb=0 ab=3
1917 df: tot=4608 free=4597 avail=2819(3988-1169) cb=8 pb=0 ab=3
1918
1919   Slightly different pattern - not as bad.
1920   Have to try 4 now.
1921 df: tot=4608 free=4597 avail=2802(3978-1176) cb=8 pb=0 ab=3
1922 df: tot=4608 free=4597 avail=2793(3968-1175) cb=8 pb=0 ab=3
1923 df: tot=4608 free=4597 avail=2776(3958-1182) cb=8 pb=0 ab=3
1924 df: tot=4608 free=4597 avail=2767(3948-1181) cb=8 pb=0 ab=3
1925
1926   Strange, isn't it....
1927
1928   Making sure we clear UnincCredit... result looks worse.
1929
1930 26th February 2009
1931   I fixed up the credit accounting in 'incorporate' and then fixed a couple
1932   more little bugs.  And now:
1933
1934
1935
1936 ====================48 credits ==============================
1937 df: tot=4608 free=4597 avail=3172(3940-768) cb=10 pb=0 ab=1
1938
1939 So we still have 720 allocated credits that aren't accounted for.
1940 But we are nicely under 100...
1941
1942 .... and now
1943
1944
1945 ====================76 credits ==============================
1946 df: tot=4608 free=4256 avail=2160(2402-242) cb=350 pb=0 ab=2
1947
1948 That is different.  The count of missing blocks is way down,
1949 but there is some extra cruft in the index tree.
1950 Quite a few like
1951     0 DATA (1) 303 [0]L Leaf1(13) SegRef,Claimed,PhysValid
1952     0 DATA (1) 302 [0]L Leaf1(14) SegRef,Claimed,PhysValid
1953 and even one
1954     0 DATA (2) 330 [0]L Leaf1(1) SegRef,C,CI,Claimed,PhysValid
1955    330    1 (1)   0 [0]{0,00000000} [0, 0]L Leaf1(0) Index(1),Pinned,Phase1,InoIdx,SegRef,C,CI,CN,CNI,PhysValid,Prealloc
1956 Time for a commit though....
1957
1958 and now
1959 ====================46 credits ==============================
1960 df: tot=4608 free=4257 avail=4253(4458-205) cb=350 pb=0 ab=1
1961
1962 so the strays in the index tree are gone, but still have 159 outstanding
1963 credits.
1964 No change, but now
1965 ====================36 credits ==============================
1966 df: tot=4608 free=4256 avail=3787(3885-98) cb=350 pb=0 ab=2
1967
1968
1969 That is a little weird...
1970 Hmmm. back to
1971 ====================48 credits ==============================
1972 df: tot=4608 free=4257 avail=4247(4458-211) cb=350 pb=0 ab=1
1973
1974 Oh well.
1975 ====================34 credits ==============================
1976 df: tot=4608 free=4257 avail=3176(3373-197) cb=350 pb=0 ab=1
1977
1978 It seems that the unaccounted blocks are (or can be) created by
1979 writing to a file then removing the file without a sync.
1980 ..but why is cb (cblocks_used) so high?
1981
1982 27th February 2009
1983
1984  Got onto a bit of a tangent...
1985  What happens if we truncate a block while it is on a list to
1986  be cleaned?  Clearly we want the cleaner to drop it ASAP.
1987  But what if invalidate_page wants to drop it *now*
1988  Hopefully it is either still on clean_leafs and we can remove it,
1989  or it is now iolocked and we can wait for it.  So should be OK.
1990
1991  I keep getting caught in "looping on..."
1992  We are truncating an inode and some index block which is now empty
1993  is not getting removed from the tree because there is an outstanding
1994  reference.... 327/0 depth=1.  I guess I turn on the tracing.
1995
1996  ... and it seems that it is in the process of checkpointing.
1997  I guess I need to lock against that ... maybe with the iolock.
1998
1999 Credits = -1, rv=2
2000 ib = [ce814e40]328/0(2552)r3:Index(1),Pinned,Phase1,Valid,Dirty,CI,CN,CNI,UninCredit,PhysValid,Uninc{0,0}[0]
2001 ------------[ cut here ]------------
2002 kernel BUG at /home/neilb/work/nfsbrick/fs/module/segments.c:371!
2003
2004  -------
2005  Every time I create/delete a file, I get an extra 'ab' which disappears
2006  on 'sync'.
2007    ablocks_used is:
2008      decremented when +ve summary_update on non-index
2009      increased on lafs_summary_allocate... should not be done for index blocks.
2010
2011  OK:  after test run, filesystem is empty, but cblocks_used is around 360.
2012   cblocks_used:
2013         is loaded at mount time
2014         collects pblocks_used on a phase flip
2015         is updated in lafs_summary_update (unless pblocks is)
2016    So we must be missing a lafs_summary_update when phys->0
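
   The cblocks_used rules listed above, as a tiny model.  The struct and
   function names are illustrative only, and the 'next_phase' flag is my
   assumption about what decides whether an update lands in pblocks_used
   or cblocks_used; the notes only say "unless pblocks is".

```c
#include <assert.h>
#include <stdbool.h>

struct summary { long cblocks_used, pblocks_used; };

/* Stand-in for lafs_summary_update: a change of 'delta' blocks is
 * charged to pblocks_used or to cblocks_used, never both. */
static void summary_update(struct summary *s, long delta, bool next_phase)
{
    if (next_phase)
        s->pblocks_used += delta;
    else
        s->cblocks_used += delta;
}

/* On a phase flip cblocks_used collects whatever pblocks_used held. */
static void summary_phase_flip(struct summary *s)
{
    s->cblocks_used += s->pblocks_used;
    s->pblocks_used = 0;
}
```

   With this model every block whose phys goes to 0 must show up as a
   negative delta somewhere; a missing call leaves cblocks_used high
   forever, since nothing else ever lowers it.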
2017
2018
2019  Lots of problems:
2020    truncating big (multi-level index) seems to be bad
2021      Leaves 'pb-338 !!! and cb+689, even after sync.
2022    still 'looping on' occasionally
2023    Haven't found cblocks_used leak yet.
2024    Occasionally non-B_Valid blocks are acted on.
2025      I think I need to improve io locking.
2026
2027 ---------------
2028 1st March 2009
2029   Need some improvements to iolock locking.
2030   We use this lock to wait for a block to be written out (if that is happening)
2031    before we allow lafs_invalidate_page to complete.
2032    It is also used in lafs_erase_{d,i}block (Similar purpose)
2033   We take the lock in lafs_cluster_allocate, and then make sure the block is
2034    still dirty.
2035
2036   Also lock in lafs_new_inode as initing the inode is a form of IO ??
2037   load_block takes the lock
2038   We only clear_bit(B_Valid, ) under this lock.
2039
2040   So the issue is this:
2041     A block that is going to be written is passed to lafs_cluster_allocate.
2042     This happens either after taking it off a _leafs list, or when
2043     lafs_writepage requests the write.
2044
2045     lafs_invalidate_page needs to be able to release the page, so there needs to
2046     be no transient references.  In particular, once the block has been
2047     removed from a _leafs list it must already be iolocked.
2048     Invalidate_page can then either remove from that list and erase the block,
2049     or use io_lock_block to wait for the IO to complete.
2050   So when a datablock comes out of get_flushable it must be iolocked, and must
2051   remain iolocked until after Dirty and Alloc are clear
2052   Index blocks belong entirely to the fs, so we can be more relaxed with them.
2053   If get_flushable finds the block already iolocked, it is either being invalidated
2054   or already has IO pending, so it can be dropped.
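
  A sketch of that get_flushable rule, as a single-threaded model with
  made-up types.  'struct fblk' and the singly linked leaf list are not
  the real structures, and returning index blocks without taking the
  iolock is just my reading of "more relaxed".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct fblk {
    bool is_index;
    bool iolocked;
    struct fblk *next;      /* hypothetical _leafs list linkage */
};

/* Take blocks off the leaf list until one is usable.  A data block is
 * returned already iolocked; a data block that is found locked is being
 * invalidated or already has IO pending, so it is simply dropped. */
static struct fblk *get_flushable(struct fblk **leafs)
{
    while (*leafs) {
        struct fblk *b = *leafs;

        *leafs = b->next;
        b->next = NULL;
        if (b->is_index)
            return b;       /* index blocks belong entirely to the fs */
        if (b->iolocked)
            continue;       /* somebody else owns it: drop */
        b->iolocked = true; /* real code would use a trylock */
        return b;
    }
    return NULL;
}
```

  Because the lock is taken before the block leaves get_flushable, there
  is no window where a data block is off the list but not yet iolocked.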
2055
2056
2057 16th March 2009
2058
2059   FIXME  When we sync a small file, we just write out the inode.
2060      rollforward currently ignores data in inodes I think.
2061      That needs to be fixed to ensure this data is safe.
2062
2063  - stop iblock from disappearing so much.
2064
2065  - I think...
2066     While cleaning a file, I truncate it.  This makes it appear
2067     to fit in the inode but it is very big and we get confused.
2068    We cannot allocate block 0 until all the others have been
2069    allocated to 0 and forgotten.
2070    But what if we truncate a file to 10 bytes, then fsync?
2071     We need to write the data promptly, but we like doing truncate
2072     in the background.
2073    When we extend a file we already need to wait for truncation
2074     to complete (FIXME do we do that?)  We could wait on fsync too.
2075    We cannot just delay block0 as it might be part of a checkpoint
2076     that has to complete promptly while truncation can take a long time.
2077    i.e. we have a very large file.  We update the first byte, then
2078     truncate to 2 bytes.... we don't need to write until fsync which will wait...
2079     Directory?? delete lots of entries so it shrinks to one block?
2080        There is no delayed truncate there.
2081    ?? Never clean an I_Trunc file.
2082    If we try to allocate a file with other indexes:
2083      clear Realloc
2084      if Dirty and Pinned, just do normal alloc
2085      if Dirty and not pinned, skip.
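
   The three rules above as a decision function.  This is only a sketch:
   the enum and the function name are invented, and skipping a block
   that is not Dirty at all is my assumption (the notes do not cover
   that case).

```c
#include <assert.h>
#include <stdbool.h>

enum alloc_action { ALLOC_SKIP, ALLOC_NORMAL };

struct tblk { bool dirty, pinned, realloc; };

/* Decide what to do with a block of a file that still has other
 * index entries (i.e. truncation has not finished yet). */
static enum alloc_action trunc_alloc_action(struct tblk *b)
{
    b->realloc = false;             /* never clean it: clear Realloc */
    if (b->dirty && b->pinned)
        return ALLOC_NORMAL;        /* part of a checkpoint: write it */
    return ALLOC_SKIP;              /* dirty but unpinned, or clean */
}
```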
2086
2087
2088   Sometimes I run out of credits while truncating a file.
2089   I need credits - maybe only briefly - to dirty the index blocks.
2090      -- FIXED I think.
2091
2092   An indexblock remains pinned while the refcount is non-zero.
2093   A pinned index block can be on a _leaf lru
2094   The _leaf lru holds a refcount.
2095   This is an awkward referential loop.
2096   We break it at checkpoint time with special code in phase-flip.
2097   But there are other awkward times such as truncate.
2098
2099   We cannot use PinPending like we do with data blocks because there
2100   could be multiple pending Pins (from different children).
2101
2102   We could possibly treat checkpoint_lock like pinpending, but that
2103   might be racy.
2104
2105   We could not count the _leaf lru, but that might just make the race
2106   harder to find.
2107
2108   I think we want to explicitly drop the pin when we truncate a block.
2109   Normally, once we Pin an index block it will become dirty so we don't
2110   want to de-pin before a checkpoint anyway...
2111
2112   Just to clarify: an index block gets dePinned:
2113    - during checkpoint on a phase_flip if it is no longer dirty etc
2114    - on truncation when we erase it
2115    - during pre-emptive write-out which is a bit like an early phase_flip
2116            not sure that we implement that one yet.
2117
2118 17th March 2009
2119  Deadlock?
2120    - checkpoint calls incorporate calls erase_iblock calls iolock_block
2121    - rm calls orphan_pin calls phase_wait
2122  The problem is in lafs_incorporate.  It expects the block to be iolocked,
2123   but can call erase_iblock which tries to get an iolock itself...
2124  ...fixed that and it still happens.
2125  checkpoint calls phase_flip calls allocated_block (on uninc list) calls
2126     iolock_block before calling incorporate
2127  Maybe all of these should assume an IO lock.
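
      A rough user-space sketch of the "assume an IO lock" convention, not
      the real lafs code.  Everything here (struct blk, iolock_try, the
      *_locked helpers, checkpoint_incorporate) is made up for illustration.
      The point is only that the inner helpers assert the lock is held and
      never try to take it again, so only the outermost caller locks.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a block with a non-recursive IO lock bit. */
struct blk {
    bool iolocked;
    bool erased;
    int incorporated;
};

/* Returns false where real code would sleep forever on a lock that the
 * same thread already holds. */
static bool iolock_try(struct blk *b)
{
    if (b->iolocked)
        return false;
    b->iolocked = true;
    return true;
}

static void iounlock(struct blk *b)
{
    assert(b->iolocked);
    b->iolocked = false;
}

/* Convention: caller already holds the IO lock. */
static void erase_iblock_locked(struct blk *b)
{
    assert(b->iolocked);
    b->erased = true;
}

/* Convention: caller already holds the IO lock. */
static void incorporate_locked(struct blk *b, bool becomes_empty)
{
    assert(b->iolocked);
    b->incorporated++;
    if (becomes_empty)
        erase_iblock_locked(b);   /* no second lock attempt here */
}

/* The only place that takes and drops the lock. */
static bool checkpoint_incorporate(struct blk *b, bool becomes_empty)
{
    if (!iolock_try(b))
        return false;
    incorporate_locked(b, becomes_empty);
    iounlock(b);
    return true;
}
```

      If every inner function follows the "_locked" rule, the self-deadlock
      described above cannot happen through this path, though it does
      nothing about two different threads waiting on each other.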
2128
2129  FIXME truncate assumes truncate-to-zero.  We need proper ftruncate support.
2130
2131  It nearly works....
2132   Things to do:
2133     - sort out individual patches and review DONE
2134     - allow compilation without refcount tracking DONE
2135     - don't hold a 'leaf' reference. NO
2136     - clean up *ref calls - differentiate those that can be called when zero DONE
2137     - use enum for B_* DONE
2138     - support truncate to non-zero offset DONE
2139     - "looping on" found an 'OnFree' block!
2140     - clean out a lot of debugging
2141
2142  Hmmm.... deadlock.
2143   rmdir is holding i_mutex and waiting for a phase change to pin a dblock.
2144   checkpoint is also holding i_mutex.. or is trying to get one in lafs_cluster_allocate.
2145   Not cool.  i_mutex must not be taken by checkpoint
2146  Fixed that, though it is a bit of a hack....
2147
2148  New deadlock:  checkpoint calls phase_flip which calls allocate_block,
2149     to move the uninc_next across, and that tries to iolock the parent to
2150     perform a partial incorporation.  But that seems to be iolocked.
2151     Generally that is ugly as ->uninc_next might be very long and require
2152     multiple splits, and direct-driving that from phase_flip is bad.
2153     I should just move the list across
2154
2155
2156 19th March 2009
2157   Spent too long trying to remove refcount held by *_leaf lists.
2158   This leaves InoIdx block with zero refcount so Data block can get
2159   lost and bad things happen.
2160   I might be able to fix it up, but it is probably better to try the
2161   checkpoint_lock approach if I can only remember what that is.
2162
2163 Locking:
2164   Available locks:
2165
2166    Spin:
2167
2168     lafs_hash_lock
2169         Used in:
2170            lafs_shrinker
2171            lafs_refile ???
2172         Protects:
2173            ib->hash
2174            ->lru when on freelist
2175
2176     i_data.private_lock
2177         Used in:
2178            lafs_shrinker
2179         Protects:
2180            ->iblock / refcnt
2181            ->dblock / my_inode
2182            ->children / ->parent within an inode
2183            setting ->private
2184
2185     fs->alloc_lock
2186         fs->allocate_blocks
2187
2188     fs->stable_lock
2189         segsum hash table
2190         segsummary counters (in blocks)
2191
2192     fs->lock
2193         _leafs lru
2194         ->pending_blocks lru - should this be wc->lock ??.. not in 'bh'
2195         Pinned consistent with lru
2196         ->checkpointing / ->phase_locked
2197         fs->pending_orphans
2198         ->uninc and ->chain ??  Should use parent->B_IOLock ??
2199         uninc_table - should use B_IOLock
2200         free list / clean list segtrack
2201
2202    Mutex:
2203
2204     fs->wc->lock
2205       wc[0] .. something in prepare_checkpoint
2206        ->remaining etc
2207       cluster_flush
2208       mini blocks
2209
2210     i_mutex
2211       inode_map
2212       orphans
2213
2214    Other:
2215
2216     B_IOLock
2217        erase_block
2218        incorporate
2219        cluster_allocate
2220        allocated_block
2221        IO
2222        Phase flip
2223        Initialising new inode
2224     B_IOLockLock
2225          IOLock across a page
2226
2227
2228 --------------------
2229 This is a list from 18 months ago, with updates
2230
2231  - Understand how superblock 'version' should be used.
2232
2233  -  Review and fix up all locking/refcounts.  See locking.doc
2234        Also lock inode when copying in block 0 and probably
2235        when calling lafs_inode_fillblock (??)
2236  -  lafs_incorporate must take a copy of the table under a lock so
2237          more allocations can come in at any time.
2238
2239  - We don't want _allocated to block during cluster flush.  So have
2240    a no-block version and queue blocks on ->uninc if we cannot
2241    allocate quickly.  Find some way to process those ->uninc blocks.
2242
2243  - Use above for phase_flip so that we don't need to _allocated there.
2244
2245  - Utilise WritePhase bit, to be cleared when write completes.
2246      In particular, find when to wait for Alloc to be cleared if
2247       WritePhase doesn't match Phase.
2248        - when about to perform an incorporation.
2249  - make sure we don't re-cluster_allocate until old-phase address has
2250      been recorded for incorporation.
2251
2252  - allocate multiple WAIT_QUEUE_HEADS for 'block_wait'
2253
2254  - Can inode data block be on leafs while index isn't, what happens if we
2255        try to write it out...
2256
2257  -  If InoIdx doesn't exist, then write_inode must write the data block.
2258
2259  - document and review all guards against dirtying a block from a previous phase
2260     that is not yet safe on storage.
2261           See lafs_dirty_dblock.
2262  - check for proper handling of error conditions
2263      b/ if lafs_seg_dup fails in do_checkpoint, we need to abort the snapshot.
2264  - review checkpoint loop.
2265        Should anything be explicit, or will refile do whatever is needed?
2266  - Waiting.
2267        What should checkpoint_unlock_wait wait for?
2268        When do we need to wait for blocks to change state?  And how?
2269
2270  - load/dirty block0 before dirtying any other block in depth=0 file
2271
2272  - use kmem_cache for 'struct datablock'
2273  - indexblock allocation.
2274         use kmem_cache
2275         allocate the 'data' buffer late for InoIdx block.
2276         trigger flushing when space is tight
2277         Understand exactly when make_iblock should be called, and make it so.
2278  - use a mempool for skippoints in cluster.c
2279  - Review seg addressing code in cluster.c and make sure comments are good.
2280  - consider ranges of holes in pending_addr.
2281
2282  - review correct placement of state block given issues with stripes.
2283
2284  - review segment usage /youth handling and make a todo list.
2285       a/ Understand ref counting on segments and get it right.
2286  - Choose when to use VerifyNull and when to use VerifyNext2.
2287  - implement non-logged files
2288  - Store accesstime in separate (non-logged) file.
2289  - quotas.
2290         make sure files are released on unmount.
2291
2292  - cleaner.
2293        Support 'peer' lists and peer_find. etc
2294  - subordinate filesystems:
2295      a/ ss[]->rootdir needs to be an array or list.
2296      b/ lafs_iget_fs needs to understand these.
2297  - review snapshots.
2298       How to create
2299       how they can fail / how to abort
2300       How to destroy
2301  - review unmount
2302       - need to clean up checkpoint thread cleanly - be sure it has fully exited.
2303  - review roll-forward
2304       - make sure files with nlink=0 are handled well.
2305       - sanity check various values before trusting clusters.
2306
2307  - Configure index block hash_table at run time based on memory size??
2308  - striped layout.
2309          Review everything that needs to handle laying out clusters
2310          aligned for striping.
2311
2312  - consider how to handle IO errors in detail, and implement it.
2313  - consider how to handle data corruption in indexing and directories and
2314      other metadata and guard against problems (lots of -EIO I suspect).
2315
2316  - check all uninc_table accesses are locked if needed.
2317
2318  - If a datablock is memory mapped writeable, then when we write it out,
2319      we need to either fill up its credits again, or unmap it.
2320  - Need to handle orphans asynchronously.
2321
2322  - support 'remount'
2323  - implement 'write_super' ??
2324
2325  - pin_all_children has horrible gotos - remove them.
2326
2327  - perform consistency check on all metadata blocks read from disk
2328    e.g. don't assume index blocks are type 1 or 2.
2329
2330 23rd March 2009
2331  + looking at cleanup for unmount.
2332  - various more refcounts fixed up
2333  - B_SegRef is never dropped!  and we take a ref on a segment when
2334    we start a cluster on it, but never drop that reference.
2335   THIS is the next thing - review all setting and clearing of B_SegRef.
2336
2337 30th March 2009
2338  - SegRef and lafs_reserve_block...
2339    There is room for recursion here, I need to be careful.
2340    To dirty a data block, all parent index blocks must be Pinned and must
2341    be able to be written.  That means their segusage blocks must be
2342    available for update.  And Pinning a segusage block for update requires
2343    all its parents.  So the segment for the block, the indexes, and the
2344    segusage and indexes and so-on must all be pinned.
2345    When we pin a block, we do it from the root down to avoid recursion.
2346    We probably want whatever reserve_block calls, to return an unreserved
2347    block rather than call reserve_block itself.
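
      A minimal sketch of "pin from the root down" without recursion.  This
      is not the lafs implementation: struct node, MAX_DEPTH and
      pin_from_root are invented names, and the only thing modelled is the
      ordering (walk up to collect the path, then pin root first).

```c
#include <assert.h>
#include <stddef.h>

#define MAX_DEPTH 16   /* assumed bound on tree depth, for the sketch only */

struct node {
    struct node *parent;   /* NULL at the root */
    int pinned;
    int pin_order;         /* records the order in which pins happened */
};

/* Pin b and every ancestor, root first, using a loop instead of recursion.
 * Returns how many blocks were newly pinned, or -1 if the tree is deeper
 * than MAX_DEPTH. */
static int pin_from_root(struct node *b)
{
    struct node *path[MAX_DEPTH];
    int depth = 0, order = 0;

    /* Walk up to the root, remembering the path. */
    for (struct node *n = b; n; n = n->parent) {
        if (depth == MAX_DEPTH)
            return -1;
        path[depth++] = n;
    }
    /* Walk back down: the root is the last entry collected. */
    for (int i = depth - 1; i >= 0; i--) {
        if (!path[i]->pinned) {
            path[i]->pinned = 1;
            path[i]->pin_order = ++order;
        }
    }
    return order;
}
```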
2348
2349   When do we clear SegRef?? We set it when Pinning, so I guess we
2350     clear it when unpinning.
2351    pin_dblock, mark_cleaning, prepare_write, truncate
2352    seg_move clean_free
2353   Well, it is really when Pinning, or Dirtying or Reallocing.
2354   So we clear when unpinning, or when a dblock gets written...
2355   Maybe just when we lose ->parent
2356
2357 6th April 2009
2358  - sometimes the segsum counter goes to zero for a random data block
2359      Something is going wrong in roll-forward.  The block looks transiently valid
2360      so doesn't get read, but has no good data in it.
2361  - After deleting a directory, the block might still have incorporation
2362    to happen, but is not marked dirty
2363  - at unmount, there are various blocks that are still dirty.
2364  - sometimes hit BUG_ON(credits==0) line 1196 in cluster.c(cluster_flush)
2365
2366 12th April 2009
2367  - that rollforward problem above:
2368     When rolling the checkpoint, if we find segusage blocks we want to include
2369     them directly into file.  But by pinning the block we might preread a
2370     segusage block.. but we must be sure not to update it.
2371     So during the early stages of rollforward while still in the checkpoint,
2372     seg_inc must be called with in_phase == 0.
2373     so seg_move is called with phase != qphase.
2374     ditto for summary update.
2375     So the block must be pinned to the previous phase...
2376     Normally 'phase' changes at checkpoint-start,
2377              qphase changes at checkpoint-end
2378     So we probably want to start with qphase being 0 and phase being 1.
2379     When we reach the end of the checkpoint, we flip qphase to 1.
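
      A toy model of the phase/qphase relationship described above, assuming
      both are single bits.  struct fs_phase and the helper names are made
      up for this sketch; only the starting values and the order of the
      flips come from the notes.

```c
#include <assert.h>

struct fs_phase {
    int phase;    /* flips at checkpoint start */
    int qphase;   /* flips at checkpoint end */
};

/* Roll-forward starts as if we were already inside a checkpoint. */
static void rollforward_init(struct fs_phase *p)
{
    p->phase = 1;
    p->qphase = 0;
}

static void checkpoint_start(struct fs_phase *p)
{
    p->phase = !p->phase;
}

static void checkpoint_end(struct fs_phase *p)
{
    p->qphase = !p->qphase;
}

/* While the two differ, a checkpoint is in progress. */
static int in_checkpoint(const struct fs_phase *p)
{
    return p->phase != p->qphase;
}
```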
2380
2381  - blocks still in phase_leafs at unmount:
2382     After we force a final checkpoint we still have Pinned:
2383         root InoIdx
2384         ino==8 InoIdx due to Dirty block0
2385         ino=16 InoIdx due to dirty block0
2386      and dirty:
2387         inode block 1,  inode usage map
2388                     2,  root directory
2389                     8,  orphan
2390                    16   seg usage
2391      Problems:
2392         inode blocks dirty but not pinned?  No InoIdx...
2393         Segusage dirty - probably by seg_apply_all - disable that at umount
2394         orphan dirty ??... but not pinned!
2395            This is possible - we don't pin for clearing entries, just for setting.
2396         The inode problem stems from the datablock being dirty while the
2397          InoIdx block isn't.  That is, at best, confusing.
2398
2399 13th April 2009
2400    segusage blocks aren't being pinned
2401    They need to be pinned  whenever dirty.
2402    and youth blocks aren't even made dirty sometimes.  They need to be
2403     pre-pinned in many cases.
2404
2405    So: segusage gets changed when we write out a cluster, and when we
2406       delete/relocate blocks.
2407       In the first case we pin the block when it becomes part of the free list,
2408       and need to keep it pinned across checkpoint changes.
2409       In the second, we pin when the block is dirtied and again must keep it pinned.
2410       Youth gets changed when a segment becomes free and again when we allocate
2411       a segment to it.
2412
2413       Keeping a datablock pinned across checkpoints is awkward - we currently need
2414       to repin for each dirty... I guess we can re-pin for each checkpoint
2415       in lafs_seg_apply_all.  That might work for segusage, but not for youth!
2416       If segsum for ssnum==0 held a reference to the youth block, that might
2417       help.  Segstat on 'clean' or 'free' would imply a reference to that segsum.
2418
2419       Is it OK to keep all youth/usage blocks for free/clean blocks
2420       pinned?  We can currently have 810 entries.  Only half will be clean/free.
2421       For each entry there can be two blocks, youth and usage.  So that could be
2422       810 blocks. 1Meg?  Normally much less.  If it became a problem we could
2423       reduce the number dynamically I guess.
2424
2425       maybe segusage blocks need to get phase_flipped, as other blocks do
2426       depend on them,   pin_all_children wouldn't be able to find them though..
2427
2428     1/ Any address on 'clean' or 'free' segtrack implies a refcount on the
2429       Youth block.
2430
2431 14th April 2009
2432    I think I want to link dirty block to the space in free segments that we
2433    actually know about.  Each of those segments has youth and usage blocks
2434    pinned (at least parent pointer is active).  So we have everything we need
2435    to write everything that is dirty.  So 'free' or 'clean' implies
2436    a segsum reference which holds youth block.
2437
2438    When we get low on space, we wait for cleaning/finding to progress.
2439    This would limit us to  400 segments, say 16Meg each, so 6Gig of dirty
2440    memory.  I guess that we need to scale the 'free' list based on available
2441    memory (FIXME).
2442
2443    When cleaning needs a segment, it needs to load the usage blocks for other
2444    snapshots too.
2445
2446    When cleaning in the presence of snapshots we need to be careful never to
2447    duplicate a block that is shared.  To allow for v.many snapshots, we don't
2448    even want to duplicate in memory.
2449    So we need to choose a 'primary' copy - probably first one found - and
2450    follow the peers link when possible...
2451
2452 18th April 2009
2453    (continuing).
2454
2455    So clean and free segments in the list carry a SegRef.  But it could be
2456    excessive if all of them did - we shouldn't be required to pin more
2457    data than we need.
2458    So for segments with a usage of 0, we use the score to record if a
2459    segref is held.  0 means 'no', 1 means 'yes'.
2460    When space_alloc wants more space we need to find an entry and
2461    segref it.  Maybe we want free lists - reffed and not-reffed.
2462
2463    Then again, SegRefs are fairly cheap as they are heavily shared.
2464    maybe 512 to a block.  If we hold 400 refs they could easily all be
2465    in one block.  We could possibly encourage this by sorting the list
2466    and discarding from one end if it is too full.
2467    Sorting is a good idea definitely.  It keeps youth/usage updates
2468    together.
2469
2470    Just check the numbers.
2471    A 1TB device with 1K blocks might have 32MB segments, of which there
2472    would be 32768.  512 entries per block means 64 blocks or 16 pages (64K).
2473    So total segusage files is 128K plus snapshots.  Not worth worrying
2474    about surely.
2475    For 16TB, that is 2Meg plus snapshots.
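
      The same arithmetic as a small calculation, so the numbers can be
      re-checked for other device sizes.  The function names are invented
      for this sketch; "files" is 2 on the assumption that the 128K total
      counts one youth file plus one usage file.

```c
#include <assert.h>
#include <stdint.h>

/* Number of segments on a device. */
static uint64_t segment_count(uint64_t dev_bytes, uint64_t seg_bytes)
{
    return dev_bytes / seg_bytes;
}

/* Blocks needed to hold one entry per segment, rounded up. */
static uint64_t table_blocks(uint64_t segments, uint64_t entries_per_block)
{
    return (segments + entries_per_block - 1) / entries_per_block;
}

/* Total bytes for 'files' per-segment tables (e.g. youth and usage). */
static uint64_t segusage_bytes(uint64_t dev_bytes, uint64_t seg_bytes,
                               uint64_t block_bytes,
                               uint64_t entries_per_block, unsigned files)
{
    uint64_t segs = segment_count(dev_bytes, seg_bytes);

    return table_blocks(segs, entries_per_block) * block_bytes * files;
}
```

      With 1TB, 32MB segments, 1K blocks and 512 entries per block this
      gives 32768 segments, 64 blocks per file and 131072 bytes (128K) for
      two files.  For 16TB it gives 2097152 bytes (2Meg).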
2476
2477    So
2478     - keep a SegRef for all free and clean blocks.
2479       This must include a youthblk reference.
2480     - sort the free list when 'clean' is merged or when a pass
2481           finishes.
2482         sort clean list
2483         fix youth value
2484         merge as many as fit into free
2485         sort
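
      One possible shape for those four steps, as a user-space sketch.
      Assumptions, none of them from the real code: the sort key is
      (dev, seg) so that neighbouring segments share youth/usage blocks,
      "fix youth value" is modelled as storing a caller-supplied youth, and
      struct seg / merge_clean_into_free are invented names.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

struct seg {
    unsigned dev, seg, youth;
};

/* Order by device, then by segment number. */
static int seg_cmp(const void *a, const void *b)
{
    const struct seg *x = a, *y = b;

    if (x->dev != y->dev)
        return x->dev < y->dev ? -1 : 1;
    if (x->seg != y->seg)
        return x->seg < y->seg ? -1 : 1;
    return 0;
}

/* Sort clean[], move as many entries as fit into free_list[] (capacity
 * free_cap, currently *nfree entries), then re-sort free_list[].
 * Returns the number of entries left in clean[]. */
static size_t merge_clean_into_free(struct seg *clean, size_t nclean,
                                    struct seg *free_list, size_t *nfree,
                                    size_t free_cap, unsigned new_youth)
{
    size_t room, take;

    assert(*nfree <= free_cap);
    room = free_cap - *nfree;
    take = nclean < room ? nclean : room;

    qsort(clean, nclean, sizeof(*clean), seg_cmp);          /* sort clean list */
    for (size_t i = 0; i < take; i++) {
        clean[i].youth = new_youth;                         /* fix youth value */
        free_list[(*nfree)++] = clean[i];                   /* merge into free */
    }
    memmove(clean, clean + take, (nclean - take) * sizeof(*clean));
    qsort(free_list, *nfree, sizeof(*free_list), seg_cmp);  /* sort */
    return nclean - take;
}
```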
2486
2487    How is the code flow...
2488       add_cleanable is called during the periodic scan.  It could hold
2489                a SegRef easily.
2490       add_cleanable calls add_clean as does lafs_get_cleanable during
2491           clean.  That might block getting a segref, might even
2492           deadlock?
2493       add_free is also called by seg_scan
2494
2495       So seg_scan should get a segref and leave it with everything!
2496
2497     BUT.....
2498     A SegRef implies a 'struct segsum' for each segment.  We don't
2499     want to allocate one of these for every segment in the table.
2500     We only want a reference to the youth and segusage block, which
2501     are heavily shared.
2502
2503     But these blocks need to be Pinned and SegReffed etc so we can
2504     write them at any time.
2505
2506 20th July 2009
2507   The refcount held by the 'leaf' lru is a problem.
2508   While it holds a count we do not unpin an index block, so it cannot
2509   be removed from the list.
2510   Thus we can only remove from the leaf lru on a phase change.....
2511   Or when doing lru based flushing... Maybe we can remove from the
2512   lru while holding the checkpoint lock.
2513   This happens when truncating..
2514
2515   No, that is just too messy as it is too easy to get put back on the list.
2516
2517   Maybe the leaf lru should not imply a reference count ... or maybe
2518   we need to split the refcount:  'inuse' and 'active'.....
2519   How about we test refcnt against list_empty(->lru)...
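
      One reading of "test refcnt against list_empty(->lru)", as a
      user-space sketch: keep the single counter, but treat a block as
      otherwise unreferenced when its count is exactly what list membership
      accounts for.  The tiny list below only imitates the kernel's
      list_head, and struct iblock, leaf_add, leaf_del and only_lru_ref are
      names invented for the sketch.

```c
#include <assert.h>
#include <stdbool.h>

/* Minimal circular list, shaped like the kernel's list_head. */
struct list_head {
    struct list_head *next, *prev;
};

static void list_init(struct list_head *h)
{
    h->next = h->prev = h;
}

static bool list_is_empty(const struct list_head *h)
{
    return h->next == h;
}

static void list_add_head(struct list_head *n, struct list_head *head)
{
    n->next = head->next;
    n->prev = head;
    head->next->prev = n;
    head->next = n;
}

static void list_del_reinit(struct list_head *n)
{
    n->prev->next = n->next;
    n->next->prev = n->prev;
    list_init(n);
}

struct iblock {
    int refcnt;
    struct list_head lru;
};

/* Being on the leaf lru holds one reference. */
static void leaf_add(struct iblock *ib, struct list_head *leafs)
{
    list_add_head(&ib->lru, leafs);
    ib->refcnt++;
}

static void leaf_del(struct iblock *ib)
{
    list_del_reinit(&ib->lru);
    ib->refcnt--;
}

/* True when nothing except (possibly) the lru holds a reference. */
static bool only_lru_ref(const struct iblock *ib)
{
    int lru_refs = list_is_empty(&ib->lru) ? 0 : 1;

    return ib->refcnt == lru_refs;
}
```

      A real version would need the list test and the refcnt read to happen
      under the same lock, which this sketch ignores.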
2520
2521   ....
2522
2523   During truncate, we need each index block to get unpinned so they can
2524   all be cleaned up.
2525   But the InoIdx block is held pinned by the inode block being dirty.
2526   In this particular case, the InoIdx block is Invalid as the file is empty.
2527   But.... InoIdx should always be valid until after Inode is destroyed??
2528
2529
2530  umount
2531  I need to stop the cleaner and flush everything before trying to
2532  clean up.
2533
2534  This is awkward though.
2535  The 'sync' of umount is done by kill_block_super, but I call
2536  that rather late, after checking that the tree is empty.
2537  There are pinned/dirty bits left after sync that we want to magically
2538   clean.
2539  We have:
2540    - segusage/youth blocks.  Maybe if we don't seg_apply_all...
2541    - orphan block.  Maybe don't mark it dirty when we remove things?
2542    - inode map?? why is that dirty
2543
2544    - root directory is dirty still??  But it has been erased.
2545      InoIdx is valid-but-empty.  Inode Data is dirty
2546         Data block 0 is Dirty at block 0.
2547
2548   ......
2549  Ahh... need to mark page dirty when block is marked dirty !!
2550
2551  The seg usage blocks are now flushed out but not incorporated.
2552  I feel that might be correct - we don't want to care about
2553  incorporation as we will never use it.
2554  For this, segusage and quota are very special cases.
2555
2556  Inode map is no longer dirty, but is pinned
2557  Orphan does have a dirty block still
2558     The orphan table contains the root directory.
2559  root is now clean and gone
2560
2561  Segusage doesn't get incorporated after last checkpoint now
2562  so that is better.
2563  But now we have a circular reference for SegRef.  This should not
2564  be surprising given the circular problems we had setting SegRef.
2565  I guess we just erase the references in the segsum table...
2566
2567 22nd July 2009
2568  Hurray!!! I can unmount without crashing!
2569  Now I need to sort through all the fixes required to achieve that
2570  and make discrete patches, and be sure it is all OK.
2571
2572 DONE - (block.c) lafs_get_block should not have to lock that page just to do a lookup.
2573 DONE - (block.c) Mark page dirty when block becomes dirty
2574 DONE - (checkpoint.c) print orphan_slot with Orphan flag
2575 DONE - Don't incorporate segcount etc after final checkpoint
2576 DONE - Don't apply seg changes after final checkpoint.
2577 DONE - Don't start opportunistic checkpoint after final.
2578 DONE - (checkpoint) if InoIdx isn't dirty but InodeData is, then still allocate
2579 DONE - (checkpoint) when waiting, wait for checkpointneeded to get cleared
2580 DONE - (cluster) be more flexible about credit usage when flushing InoIdx
2581 DONE - (dir) do add_orphan when we abort as well as on success
2582 DONE - use inode_dec_link_count, not i_nlink--
2583 DONE - (file.c) lafs_writepage: remove from leafs when we cluster_allocate
2584 DONE - change %d/%d to strblk
2585 DONE - (index.c) refile: IF B_IOLOCK, then it isn't on LRU
2586 DONE - (index) refile: when unpinning, remove from lru
2587  - lafs_refile: ->iblock can be non-null for inode 0.
2588 DONE - Make sure I_Deleting gets cleared when deleting finished.
2589 DONE - phase_flip should have something separate to call, not lafs_allocated_block
2590  - inode.c: lafs_dirty_inode: getref_lock used to get dblock
2591 NONO - ?? getref_locked allowed if PagePrivate
2592 DONE - segment: lafs_seg_put_all needed at unmount
2593 DONE - segdelete_all: need to put intable references
2594 DONE - lafs_free_get: put the intable references
2595 DONE - lafs_get_cleanable: put the intable references
2596 DONE - fix sort splitting in add_cleanable
2597 DONE - add lafs_empty_segment_table for unmount
2598 DONE - lafs_release: flush all dirty blocks
2599 DONE - lafs_release: force a final checkpoint
2600 DONE - lafs_release: move kill_block_super before final check
2601 DONE - lafs_put_super: release orphans and segsum files.
2602 DONE - lafs_destroy_inode: putref should be 'iblock'
2603  - lafs_destroy_inode: allow for iblock to be present but no ref held....
2604 DONE - can roll forward call lafs_allocated_block without dirty???
2605
2606 27th July 2009.
2607  - I've re-arranged lafs_release so that the flush is all done in
2608    generic_shutdown_super.  However it calls invalidate_inodes, and that has
2609    problems with pinned inodes.  So we need for fsync_super to checkpoint
2610    out all inodes that we don't hold our own reference to.
2611    If we do hold a reference, then invalidate_inodes will skip them,
2612    and ->put_super can be used to drop the references and perform the final
2613    checkpoint.
2614    fsync_super calls ->sync_fs after syncing all files.  Maybe I can
2615    do some sort of checkpoint there...
2616    There almost is a checkpoint in there.... But only when called without
2617    'wait'....
2618    I need to understand 's_dirt'.
2619    This is controlled entirely by the filesystem, common code only examines it.
2620    If it is set:
2621           file_fsync (the generic 'fsync' method) will call ->write_super
2622           fsync_super will call write_super
2623           generic_shutdown_super will call write_super
2624           sync_supers will call write_super
2625           sync_filesystems(0) will call ->sync_fs
2626    sync_fs is called:
2627         twice from 'sync', once with '0', once with '1' for 'wait'.
2628              (though in emergency_sync, both are '0').
2629         once from unmount and remount with 'wait' set to '1'.
2630         We don't want two checkpoints for a 'sync', but we want to start
2631         on 'wait=0'.
2632         Maybe if we get called with '0', we set a flag and treat the '1'
2633         differently..  There is no locking to make this really safe, but
2634         it will probably be OK...  I could take a process_id, but then
2635         parallel 'sync's could race.
2636         write_super is called before the syncs.  So it could start the checkpoint,
2637         and sync could wait for it.
2638         write_super is called multiple times at shutdown.  We really need
2639         to utilise s_dirt to avoid some of these.
2640         We set s_dirt to 0 when we set CheckpointNeeded, and set it to 1:
2641             - when we pin a dblock or dirty a this-phase iblock.
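
      A sketch of the "set a flag on wait=0, treat wait=1 differently" idea
      for sync_fs.  It is a user-space model only: struct sync_state,
      sketch_sync_fs and the two counters are invented, and as noted above
      there is no locking, so parallel syncs could still race.

```c
#include <assert.h>
#include <stdbool.h>

struct sync_state {
    bool started_nowait;   /* a wait=0 call already started a checkpoint */
    int checkpoints;       /* how many checkpoints were started */
    int waits;             /* how many times we waited for one */
};

static void start_checkpoint(struct sync_state *s)
{
    s->checkpoints++;
}

static void wait_for_checkpoint(struct sync_state *s)
{
    s->waits++;
}

/* Models ->sync_fs(sb, wait). */
static int sketch_sync_fs(struct sync_state *s, int wait)
{
    if (!wait) {
        /* First half of 'sync': kick off the checkpoint and return. */
        start_checkpoint(s);
        s->started_nowait = true;
        return 0;
    }
    /* Second half of 'sync', or unmount/remount: only start a new
     * checkpoint if the wait=0 call did not already do so. */
    if (!s->started_nowait)
        start_checkpoint(s);
    s->started_nowait = false;
    wait_for_checkpoint(s);
    return 0;
}
```

      So a 'sync' (0 then 1) costs one checkpoint, while an unmount (1 only)
      still gets its own.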
2642
2643 29jul2009
2644   at unmount, we iput the root inode which de-references the dblock
2645   before clearing ->iblock, which fails an assertion ... why?
2646    Apart from the shrinker, ->iblock is only set to NULL in refile
2647    when we find an I_Destroyed inode... I guess the root block isn't
2648    getting Destroyed...
2649  The protocol for freeing iblocks is bad.  Should be:
2650    - it only gets freed by the shrinker
2651    - when inode dies, set ->inode to NULL
2652    - when InoIdx iblock dies, set ->iblock to NULL
2653    ...???
2654 30Jul2009
2655   So, what exactly is the protocol?
2656     - index blocks live either in the parent/sibling tree, or
2657       on the inode's free_index list
2658     - when refcnt is 0, they live on 'freelist.lru'.  When refcount
2659       is elevated they stay on lru until they need to be
2660       added to some other lru (leafs or cluster)
2661     - when shrinker finds block on freelist.lru with non-zero refcnt,
2662       it just removes from lru
2663     - when shrinker finds free block, it removes from free_index and discards
2664       the block FIXME can refcnt=0 still have Pinned,Uninc,Realloc,Dirty ??
2665         I think not as such would either have children or be on an lru
2666     - When we destroy an inode, all index blocks get disconnected from the
2667       inode and freed.  This must include the ->iblock
2668     - When an index block becomes free due to index tree shrinkage,
2669       we set the ->depth to -1 so that it cannot be found by mistake,
2670       and leave it for shrinker or inode destruction.
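
      The shrinker part of that protocol as a toy model, using flags in an
      array instead of real lists.  struct iblock here and shrink_scan are
      invented for the sketch; it only shows the two cases (in-use block is
      dropped from the lru, free block is also discarded).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct iblock {
    int refcnt;
    bool on_free_lru;     /* still linked on freelist.lru */
    bool on_free_index;   /* still linked on the inode's free_index list */
    bool discarded;       /* stands in for actually freeing the block */
};

/* One shrinker pass over everything on freelist.lru.
 * Returns the number of blocks discarded. */
static size_t shrink_scan(struct iblock *blocks, size_t n)
{
    size_t freed = 0;

    for (size_t i = 0; i < n; i++) {
        struct iblock *ib = &blocks[i];

        if (!ib->on_free_lru)
            continue;             /* not on this lru at all */
        ib->on_free_lru = false;  /* both cases remove it from the lru */
        if (ib->refcnt != 0)
            continue;             /* in use: leave the block alone */
        ib->on_free_index = false;
        ib->discarded = true;
        freed++;
    }
    return freed;
}
```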
2671
2672    Confused about inode<->dblock dependence.
2673    We don't want the inode to refcnt the dblock as that wastes space.
2674    We don't want the dblock to refcnt the inode as that stops it from being freed.
2675    So each must disconnect from other when freed.
2676    What locking?
2677    inode takes private_lock, then checks dblock
2678    dblock cannot take private_lock before checking ->my_inode..
2679    Maybe: destroy_inode takes ref on dblock, then sets I_Destroyed, then
2680      drops ref
2681
2682 1Aug2009.
2683   Tracking down the 'credit' count and making sure it stays correct.
2684   It seems that I have a Dirty InoIdx block which is not pinned.
2685   Due to this it has no refcount and so the data block disappears so
2686   the InoIdx block is not visible in the tree.  This isn't a definite bug
2687   but it means I cannot count credits properly.
2688   And surely Dirty index blocks must always be pinned!!??
2689
2690   When a small file is flushed to the inode we were dirtying the
2691   iblock.  That seems wrong - should dirty the dblock?  Need to
2692   check that is valid
2693
2694   I got a hang in 'rm adir/4'.
2695   rm is in lafs_cluster_update_commit_both
2696        getting a mutex.
2697   cleaner is in lafs_do_checkpoint+0xe4
2698   pdflush is in writepage/lafs_cluster_flush waiting on a lock
2699   so I guess cleaner is holding a mutex and waiting for something
2700    that won't happen?
2701
2702
2703   Hang again at 'seq 1 200' in 'cd /mnt/1/adir'.
2704    cleaner is at some point, holding a mutex to stop 'sh'.
2705   0e4 == 228
2706
2707   ahh.. prepare checkpoint holds wc[0].lock while waiting for checkpoint
2708    to be allowed.
2709   So when something locks the checkpoint and needs to flush, we have problems....
2710
2711
2712   I seem to have fixed the above.  Now:
2713     Free space is a real problem.  When I remount after the successful unmount,
2714     we find a usage pattern like:
2715 CLEANABLE: 0/0 y=10 u=34179
2716 CLEANABLE: 0/1 y=0 u=65144
2717 CLEANABLE: 0/2 y=0 u=65535
2718 CLEANABLE: 0/3 y=32773 u=32910
2719 CLEANABLE: 0/4 y=32772 u=149
2720 CLEANABLE: 0/5 y=0 u=0
2721 CLEANABLE: 0/6 y=32770 u=16529
2722 CLEANABLE: 0/7 y=32769 u=35084
2723 CLEANABLE: 0/8 y=32768 u=31877
2724
2725     Which is ridiculous.
2726    Better fix up what I have first...
2727
2728  ...
2729  In rm /mnt/1/nbfile* we hang..
2730    rm is in lafs_phase_Wait from pin_dblock in unlink
2731 wait for [ce5c2d20]277/0(0)r2F:Index(1),Pinned,Phase0,InoIdx,Valid,Dirty,CN,CNI,UninCredit,PhysValid{0,0}[8] pindb(1) leaf(1)
2732
2733    cleaner is in lafs_iolock_block from add_block_address in phase_flip
2734 iowait for [ce5c33b0]286/0(0)r6E:Index(1),Pinned,Phase1,WPhase0,InoIdx,Valid,Dirty,CN,CNI,UninCredit,IOLock,PhysValid{0,0}[1] child(5) leaf(1)
2735
2736  So cleaner is probably deadlocking against itself via iolock_block.
2737   This is taken:
2738     - in lafs_invalidate_page just to wait for any io - it isn't held long
2739     - in lafs_erase_dblock while we erase and 'allocated_block'
2740     - in lafs_get_flushable to protect blocks being checkpointed
2741     - in lafs_writepage to call cluster_allocate (which releases), both for
2742              data block or for inode when data was flushed there.
2743     - lafs_add_block_address to process pending incorporations to make room.
2744          This is what is trapping the cleaner.
2745     - lafs_inode_handle_orphan when truncate finishes to erase_iblock
2746     - lafs_inode_handle_orphan again to incorporate all removal
2747     - and again to erase_iblock
2748     - and for partial truncate to incorporate some removals
2749     - and again....
2750     - lafs_new_inode to keep it from being cleaned while being created
2751     - roll_block to add addresses
2752     - lafs_load_block during IO
2753
2754   So: who holds it?.... let's use the code to find out...
2755   And the answer is : lafs_get_flushable.
2756    So get_flushable iolocks the block then calls phase_flip which tries to
2757    incorporate other-phase children which try to iolock the block.  Deadlock.
2758    Do we need to hold iolock during phase_flip ??.  Not for all of it..
2759
2760 02August2009
2761    FIXME When erasing a block, do I need an uninc credit?  I usually don't
2762     have one and the need certainly isn't as great...
2763
2764   Now... let's try to get free space accounting right.
2765    Observed problems:
2766      - unlink sometimes failed with ENOSPC
2767      - usage scan shows segments with enormous usage - 23039!!
2768
2769   no credits: [ce9a55cc]16/1(2651)r11E:Pinned,Phase1,WPhase1,Valid,Dirty,SegRef,C,CI,CN,CNI,UninCredit,IOLock,PhysValid intable(7) ss(3) cluster(1)
2770   no credits: [cfb695cc]16/1(1840)r12E:Pinned,Phase0,WPhase0,Valid,Dirty,SegRef,C,CI,CN,CNI,UninCredit,IOLock,PhysValid intable(7) ss(4) cluster(1)
2771
2772   no uninc credit [cfb695cc]16/1(2019)r13F:Pinned,Phase0,WPhase0,Valid,Dirty,SegRef,C,CI,CN,CNI,IOLock,PhysValid intable(7) ss(5) cluster(1)
2773
2774
2775   after umount/remount df says "4608 7 1544" but cannot
2776    create anything.
2777 df: tot=4608 free=4601 avail=1544(1564-20) cb=7 pb=0 ab=0
2778 ============= Cleanable table (7) =================
2779 pos: dev/seg  usage score
2780   0:   0/0        1 0
2781   1:   0/5        1 64
2782   2:   0/6        6 384
2783   3:   0/7        2 128
2784   4:   0/8        3 192
2785   5:   0/3        1 64
2786   6:   0/2        2 128
2787 ...sorted....
2788   0:   0/0        1 0
2789   1:   0/3        1 64
2790   2:   0/5        1 64
2791   3:   0/2        2 128
2792   4:   0/7        2 128
2793   5:   0/8        3 192
2794   6:   0/6        6 384
2795 --------------- Free table (1) ---------------
2796 12290:   0/4        0 0
2797 --------------- Clean table (0) ---------------
2798 CLEANABLE: 0/0 y=10 u=1
2799 CLEANABLE: 0/1 y=32775 u=3
2800 CLEANABLE: 0/2 y=32774 u=2
2801 CLEANABLE: 0/3 y=32773 u=1
2802 CLEANABLE: 0/4 y=0 u=0
2803 CLEANABLE: 0/5 y=32771 u=1
2804 CLEANABLE: 0/6 y=32770 u=6
2805 CLEANABLE: 0/7 y=32769 u=2
2806 CLEANABLE: 0/8 y=32768 u=3
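
   The "...sorted...." half of the cleanable table above is what you get
   by ordering on score and breaking ties on dev/seg.  A minimal sketch
   of that ordering, assuming a made-up entry struct (clean_ent and
   sort_cleanable are hypothetical names, not taken from the module):

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical table entry; field names are invented for this sketch. */
struct clean_ent {
	int dev, seg;
	unsigned usage, score;
};

/* Lowest score first; ties broken by dev, then seg. */
static int cmp_score(const void *a, const void *b)
{
	const struct clean_ent *x = a, *y = b;

	if (x->score != y->score)
		return x->score < y->score ? -1 : 1;
	if (x->dev != y->dev)
		return x->dev < y->dev ? -1 : 1;
	return (x->seg > y->seg) - (x->seg < y->seg);
}

static void sort_cleanable(struct clean_ent *tab, size_t n)
{
	assert(tab != NULL || n == 0);
	qsort(tab, n, sizeof(*tab), cmp_score);
}
```

   Whether the real code breaks ties this way is an assumption; it just
   happens to reproduce the listing above.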
2807
2808
2809 03Aug2009
2810  Current issues:
2811 FIXED 1/ Data 16/1 is being Reallocated, but is Dirty, not Realloc
2812 Gone,presume FIXED 2/ Data 16/1 has no uninc credit in cluster_flush
2813  3/ in cleaner, ->dblock is uninitialised.... actually inode has been freed.
2814  4/ invalidate_page finds Realloc set, even after iolock ..
2815      This is during umount  in generic_shutdown/lafs_put_super/iput
2816  5/
2817
2818
2819  Thoughts:
2820    If we flag a block for Realloc then Dirty before it is allocated,
2821      then all is fine.
2822    But if we have already allocated to a cleaning cluster... what happens?
2823     We need to treat this as if it was dirtied after being written, so
2824     it gets written to a regular cluster as well.
2825     As we only have one uninc bit for both Dirty and Realloc, we need
2826     to *not* incorporate the Realloc update if the block is still dirty.
2827    So:
2828         - block gets chosen for cleaning and allocated to a clean-cluster
2829         - block gets marked dirty.  This must not clear Realloc
2830         - cluster is flushed, block is dirty, so don't call lafs_allocated_block
2831         - Return the Realloc credit, but keep dirty and Uninc.
2832      Is there a race if Dirty is set after we enter lafs_allocated_block?
2833       As long as the index block gets marked Dirty, not Realloc we might
2834        be safe... though it gets awkward if the Dirty writeout falls in to
2835        the next phase.  But reserve_block will have provided NCredits for that.
2836      So:
2837         1/ don't clear Realloc when setting Dirty
2838         2/ do clear Realloc if cleaner finds the block is Dirty
2839         3/ avoid calling lafs_allocated_block when cleaning a dirty block.
2840                    This is an optimisation.
2841
2842     Almost...  A B_Realloc block no longer has B_Credit so B_Dirty cannot be
2843        set.
2844
2845
2846   Thoughts3.
2847      When cleaning blocks we hold no reference to the inode and it can disappear.
2848      We don't want to hold the inode active, but need a reference much like
2849       the truncate code has.
2850      I think we need a subordinate refcount for both cleaning and truncate.
2851       These hold inode present but not active.
2852      Maybe every block->inode should be counted like this.
2853      And this might simplify the my_inode->dblock inter-relationship.
2854      For later..
2855        We need to ensure that if a new iget is called on an inode that still
2856        exists, we don't allocate a new one but just reuse the old.
2857        But that won't work as we cannot add an inode back into the hash table.
2858      So I think when cleaning a block we need to ref the inode.
2859       i.e. B_Realloc implies an i_grab
2860
2861 05aug2009
2862  So I have a problem with the cleaner wanting to hold an inode that
2863  the VFS is destroying.
2864  I don't want the cleaner to hold i_count as that delays truncate etc.
2865  So we need a second counter subordinate to i_count.
2866  This is held by the cleaner and by delayed truncate, and by i_count.
2867  Possibly ->my_inode holds this, which means it can be a single bit...
2868
2869  When a lookup wants an inode, we need to load the inode data block and
2870  see if it has my_inode.  If it does, we insert that inode in to the
2871  hash table.  If not we fall back to regular inode creation....
2872
2873  On reflection, that is too complicated and hard and error prone.
2874  When relocating a file we need the data so it had best be in the page
2875  cache so the filesystem really needs to know that the inode is still
2876  active.
2877  So cleaning needs to keep a reference to the inode.
2878  The cost of this is that if an inode is being deleted while it is
2879  being cleaned the truncate cannot happen until the cleaning
2880  completes.  This means that space usage will be wrong.
2881  When nlink becomes zero we can drop the cleaner reference.  When
2882  the inode is dropped/destroyed we can tie the cleaning in with the
2883  delayed truncate so that the final destruction doesn't happen until
2884  the cleaner has let go.
2885
2886  So: how to track that the cleaner has a reference to the inode?
2887  Maybe every B_Realloc block owns a ref on the inode.... but dropping
2888  those references when i_nlink hits zero would be difficult.
2889  They could hold a secondary refcount which, if non-zero, implies a
2890  ref on the inode.
2891
2892  So:
2893   - Set B_Cleaning when we look at a block for cleaning, and clear
2894     it when we find Realloc clear and ....????
2895   - Whenever a block has B_Cleaning set, it holds a counted reference
2896     on LAFSI(b->inode)->cleaner_ref
2897   - When cleaner_ref is non-zero and I_Deleting is not set, we hold
2898     a reference on the inode (i_grab).
2899   - when i_nlink hits zero, set I_Deleting and drop any reference
2900     held by the cleaner.
2901  DONE - cleaner must be careful not to process any block that has been
2902     truncated, or file that is dead.
2903  DONE - Make sure the cleaner doesn't start up after the FinalCheckpoint.
2904   - What about filesystem inodes... how do they fit in??
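
   A rough model of the cleaner_ref bookkeeping in the list above.
   Everything here is a stand-in: ino_sketch is not the real inode
   structure, i_count only imitates the VFS refcount, and the function
   names are invented.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the LaFS inode info; not the real struct. */
struct ino_sketch {
	int i_count;		/* models the VFS reference count */
	int cleaner_ref;	/* number of blocks with B_Cleaning set */
	bool deleting;		/* models I_Deleting */
};

/* A block gains B_Cleaning: the first one pins the inode unless it is dying. */
static void cleaning_get(struct ino_sketch *i)
{
	if (i->cleaner_ref++ == 0 && !i->deleting)
		i->i_count++;
}

/* A block loses B_Cleaning: the last one releases the inode reference. */
static void cleaning_put(struct ino_sketch *i)
{
	assert(i->cleaner_ref > 0);
	if (--i->cleaner_ref == 0 && !i->deleting)
		i->i_count--;
}

/* i_nlink hit zero: set I_Deleting and drop any ref held by the cleaner. */
static void nlink_hit_zero(struct ino_sketch *i)
{
	if (i->deleting)
		return;
	i->deleting = true;
	if (i->cleaner_ref > 0)
		i->i_count--;
}
```

   The point of the model is that the inode reference follows the
   condition "cleaner_ref non-zero and I_Deleting clear", whichever side
   of that condition changes.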
2905
2906
2907   Question. When are the index blocks for an inode flushed?
2908   We need to have them gone when the inode disappears.
2909   For deleted inodes, this happens in background truncate.
2910   For memory-pressure inodes it will hopefully happen well in advance,
2911   but we need to make sure in destroy_inode that everything is
2912   written. - FIXME
2913
2914
2915   Thinking again about B_Cleaning, any B_Realloc block will hold a
2916   reference through to InoIdx and so dblock will be present and the
2917   inode won't be freed.  So we only need an extra reference during
2918   the first little phase of cleaning when we are collecting blocks.
2919   After that a reference can be useful as it will delay flushing so it
2920   can be more efficient...
2921
2922   Maybe this is all much simpler than I thought.
2923   If we hold a ref on the inode whenever the InoIdx block is Pinned
2924   and i_nlink is non-zero, then we won't be forgotten until all
2925   index blocks are written.  We may still be deleted, but as that
2926   is one-way we can hold on to the inode at little cost.
2927
2928   getting/putting that ref at exactly those times turns out to be
2929   messy.
2930   It might be best to have a flag to say "We hold an extra ref".
2931   Then we occasionally call a function that validates the setting.
2932   It is most important to drop the count at the right time, so
2933   after unlink/rmdir/rename and when B_Pinned is dropped.
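
   The "flag plus occasional validation" idea might look something like
   this.  A sketch only: pin_sketch and check_extra_ref are hypothetical
   names, and i_count again just imitates the real refcount.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the state the validation looks at. */
struct pin_sketch {
	bool inoidx_pinned;	/* B_Pinned on the InoIdx block */
	unsigned nlink;		/* i_nlink */
	bool extra_ref;		/* "we hold an extra ref" flag */
	int i_count;		/* models the inode refcount */
};

/*
 * Bring extra_ref in line with "InoIdx is Pinned and i_nlink is
 * non-zero".  Returns +1 if a ref was taken, -1 if one was dropped,
 * 0 if the flag was already right.
 */
static int check_extra_ref(struct pin_sketch *i)
{
	bool want = i->inoidx_pinned && i->nlink > 0;

	if (want == i->extra_ref)
		return 0;
	i->extra_ref = want;
	i->i_count += want ? 1 : -1;
	assert(i->i_count >= 0);
	return want ? 1 : -1;
}
```

   Calling this after unlink/rmdir/rename and whenever B_Pinned is
   dropped would cover the cases listed above; calling it too often is
   harmless because it is idempotent.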
2934
2935   B_Pinned is set in:
2936      set_phase which is called from:
2937           lafs_cluster_allocated when moving 'pin' across to data block
2938               so don't need checkpin
2939           lafs_pin_block_ph
2940               only need check_pin if dropping spinlock
2941           pin_all_children
2942               only pins data blocks (Index are already pinned if relevant).
2943           grow_index_tree
2944               where "inoidx block pinning" doesn't change
2945           do_incorporate_leaf
2946               No InoIdx involved
2947           do_incorporate_internal
2948               ditto
2949    So only need check in lafs_pin_block_ph and maybe pin_all_children...
2950
2951 08Aug2009
2952   - credits get out of sync from
2953       lafs_incorporate->refile->space_return from checkpoint.
2954       counter is one more than we can find.
2955       returning space on
2956          i [cfb9aaf0]327/0(2261)r1E:Index(1),Valid,PhysValid[0] NP
2957        Note it is an Index but not InoIdx.  The parent is still in the tree.
2958      That is FIXED
2959
2960   - and out by 8! at
2961       delete_inode -> truncate -> invalidate_page->erase_dblock->space_return
2962     FIXED that.
2963
2964   - BUG credits<0 in space_return from lafs_incorporate from add_block_address
2965      from phase_flip
2966 Just Grew [cfbb5c70]331/0(NoPhysAddr)r2E:Index(2),Pinned,Phase1,InoIdx,Valid,Dirty,UninCredit{0,1}[0] child(1) inc(1)
2967      from [cfbb5a40]331/0(0)r205F:Index(1),Pinned,Phase1,WPhase0,Valid,Dirty,CN,CNI,UninCredit,IOLock,PhysValid{0,0}[0] child(203) leaf(2)
2968 msg: (1,3,1)(1,1,-1)
2969 Credits = -1, rv=1
2970 ib = [cfbb5a40]331/0(0)r205F:Index(1),Pinned,Phase1,WPhase0,Valid,Dirty,CN,CNI,UninCredit,IOLock,PhysValid{0,0}[0] child(203) leaf(2)
2971
2972     This is a predicted but not handled problem.
2973     The answer is that not all blocks need ICredit/UnincCredit.
2974     The purpose of this credit is to allow for a split in the parent.
2975     pre-existing index blocks can never split the parent themselves
2976     If an index block becomes full, it will split and this might split
2977     the parent.
2978     If an index block has free space, then it will only overflow if it
2979     gets multiple child updates and this will provide multiple credits.
2980     So an index block with space for 3 or more new addresses does not need
2981     an ICredit/UnincCredit.  So when we split we don't need to provide an
2982     uninc credit.
2983     In particular:
2984     When we have a full InoIdx block and a single new child with 1 UnincCredit,
2985     each block already is either 'Dirty' or has a 'Credit', and the InoIdx has
2986     an ICredit, then create a new intermediate such that
2987         InoIdx is Dirty and has an ICredit
2988         New Index is Dirty with no ICredit - it used the UnincCredit
2989         New child loses its UnincCredit
2990     When another block in the new index arrives, its UnincCredit is used to
2991     provide an ICredit.
2992
2993     When a leaf block cannot fit a single address it will have ICredit.
2994     The block is split so that each has 3 spaces and so do not need ICredit,
2995     but as soon as ICredit is available, they take it.
2996
2997     Worst case is that every ancestor is full and the leaf is split
2998     We then get two full branches, each block half empty so not needing ICredit.
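
   The "3 or more free slots" rule as a predicate, plus a check of what
   an even split leaves behind.  A sketch under assumptions: capacity is
   counted in address slots, the split is taken to be even, and both
   function names are invented.

```c
#include <assert.h>
#include <stdbool.h>

#define ICREDIT_FREE_SLOTS 3	/* threshold taken from the notes */

/* An index block needs an ICredit only if it has fewer than 3 free slots. */
static bool needs_icredit(unsigned capacity, unsigned used)
{
	assert(used <= capacity);
	return capacity - used < ICREDIT_FREE_SLOTS;
}

/*
 * A full block plus one new address is split evenly in two.
 * Does the fuller half still need an ICredit?
 */
static bool split_half_needs_icredit(unsigned capacity)
{
	unsigned total = capacity + 1;
	unsigned bigger_half = (total + 1) / 2;

	return needs_icredit(capacity, bigger_half);
}
```

   With an even split the claim that neither half needs an ICredit only
   holds once a block has room for about 7 addresses; real index blocks
   should be far bigger than that.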
2999
3000
3001   Then...
3002     free data being used in lafs_refile from cleaner.
3003     b->inode->i_sb is 0x6b6b6b6b, so inode has been freed before cleaner frees it.
3004     Answer: lafs_refile was dereferencing ->inode when it wasn't safe.
3005      Need to at least have a parent before it is safe.
3006
3007   Hang:
3008      soft lockup cleaner->lafs_iget->ifind_fast ....
3009     Then (may be caused)
3010 Oh dear: [cfb63670]284/0(0)r1E:IOLock,PhysValid cleaning(1)
3011 .......: [cfb2ad20]284/0(2198)r2F:Index(1),Pinned,Phase1,InoIdx,Valid,PhysValid{0,0}[0] child(1) leaf(1)
3012 Why have I no credits? [cfb2ad20]284/0(2198)r2F:Index(1),Pinned,Phase1,InoIdx,Valid,Dirty,PhysValid{0,0}[0] child(1) leaf(1)
3013 ------------[ cut here ]------------
3014 kernel BUG at /home/neilb/work/nfsbrick/fs/module/block.c:656!
3015
3016     It seems the cleaner gets confused and goes spinning.
3017
3018
3019   So: space problems:
3020     After the run, we have -14 used and 2055 available (of 4608), and
3021     cannot create anything.
3022     4 segments are free, one is cleanable.
3023    free_blocks=2103 allocated=56 max_seg=512 clean_reserved=0
3024 or
3025    free_blocks=1722 allocated=64 max_seg=512 clean_reserved=0
3026 or
3027    df: tot=4608 free=4630 avail=977(1033-56) cb=10 pb=0 ab=-32
3028    free_blocks=1033 allocated=56 max_seg=512 clean_reserved=0
3029    and very little free
3030
3031   ablocks_used is going negative - why?
3032    Probably we erase a dblock without clearing Prealloc.
3033    Then when Prealloc later gets cleared, ablocks_used is
3034    wrongly decremented.... no...
3035
3036
3037 10aug2009  (don't forget above problems)
3038   Another problem.
3039    read->touch_atime->dirty_inode->inode_fillblock->dirty_dblock
3040      getiref_lock triggers BUG.
3041    This is presumably because I have just fixed it to get the correct
3042      iblock and not the iblock of the filesystem.
3043
3044   FIXME I hacked around this but I'm not sure the result is right.
3045     The question is about when the InoIdx should be dirty and when
3046     the inode data block should be dirty.
3047    In this particular case we are writing a page of a small file.
3048      cluster_allocate calls flush_data_to_inode which tried to dirty
3049      the inode dblock but finds that iblock is not pinned...
3050      When we dirty a data page we aren't pinning the parent!
3051    That might be OK - we only need to count and reserve the parent.
3052     We don't need to pin it until it becomes dirty.
3053
3054    Still need to resolve when which block gets to be dirty, and also
3055     exactly when an index block needs to be pinned.  And how does that
3056     relate to holding a ref on the inode when the inoidx is pinned.
3057     Maybe it should be when the inoidx is referenced.
3058    FIXME
3059
3060 11aug2009
3061    Another problem. unlink->handle_orphans->erase_dblock->allocated_block
3062     and get a zero from lafs_add_block_address but parent is not pinned.
3063   And... On unmount, orphan file still has pinned blocks so the inode
3064     isn't free.
3065   And ... root still old phase after lots of 'rm' then sync.
3066     Inode 244 has pinned inode block held by writepage0 and writepage
3067          this is adir/170
3068
3069 13aug2009
3070   - lots of bugs introduced by change to marking inode blocks dirty:
3071      writepage/cluster_allocate wants to Dirty inode data block with no credits.
3072          because I put credit in iblock!
3073
3074   - ohhh.... The phase contour is broken.  When a block is added to a
3075     cluster for allocation it isn't in the phaseleafs any more, but prevents
3076     its parent from joining.  So we cannot assume that if dblock is on
3077     list then iblock or a child will be too.
3078     So when we find dblock we do need to remove it.... done that.
3079
3080   - root not changing because Data 1/0 is Pinned and IOPending
3081      and held by writepage!!
3082      Problem is that IOPending blocks aren't put back on lru.
3083      But that should only be blocks on the cluster list.....
3084      But that is where I am putting it.
3085      Maybe I need exclusion between checkpointing and any other
3086        code that writes to checkpoint so checkpoint can wait
3087        for that ... can we use wc->lock??  That doesn't lock
3088        against cleaner, but that isn't a problem...
3089    But now 0/228 is still pinned and in writepage and IOPending
3090     So there is more to it than that.
3091     When checkpoint finds an IOLocked block, it might be about to
3092      join a cluster, in which case we don't really want to wait, or it
3093      might be undergoing incorporation in which case we want to wait.
3094      or it could be being erased, so wait..
3095      Maybe I wait until it appears on some list.... yes.
3096
3097 14aug2009
3098     At unmount Index 8/0 with child and leaf is still pinned
3099   This was pinned: [cfb29810]8/0(9)r3F:Index(1),Pinned,Phase0,InoIdx,Valid,SegRef,C,CI,CN,CNI,PhysValid{0,0}[0] release(1) child(1) leaf(1)
3100
3101   and..
3102
3103   A problem is that something goes wrong in the erase process.
3104   We find new children after we erase the inoidx block!
3105
3106   This was pinned: [cfb3d810]8/0(9)r284016F:Index(1),Pinned,Phase0,InoIdx,Valid,SegRef,C,CI,CN,CNI,PhysValid{0,0}[0] release(1) child(1) leaf(284014)
3107
3108   When/how do we erase indexblock and particularly inoidx blocks?
3109   Does a non-Valid InoIdx simply mean there is no indexing and does not
3110   reflect on the Data block?
3111
3112 .xlooping on [cfbe28c0]331/0(0)r2F:Index(1),Pinned,Phase1,InoIdx,Valid,Dirty,SegRef,C,CI,CN,CNI,UninCredit,PhysValid{0,0}[0] inode_handle_orphan(1) leaf(1)
3113
3114  Orphan problem:
3115 nextfree = 0
3116 reserved = 0
3117 VFS: Busy inodes after unmount of hdb. Self-destruct in 5 seconds.  Have a nice day...
3118 This was pinned: [cfb2ad20]0/0(13)r5F:Index(1),Pinned,Phase1,InoIdx,Valid,Root,C,CI,CN,CNI,PhysValid{0,0}[0] NP release(1) child(3) leaf(1)
3119 [cfb2ad20]0/0(13)r5F:Index(1),Pinned,Phase1,InoIdx,Valid,Root,C,CI,CN,CNI,PhysValid{0,0}[0] NP release(1) child(3) leaf(1) Leaf1(0)
3120   [cfa5374c]0/1(772)r0E:Valid,Dirty,UninCredit,PhysValid
3121   [cfb54430]0/8(775)r0E:Valid,Dirty,SegRef,UninCredit,PhysValid
3122   [cfb54c90]0/16(777)r0E:Valid,Dirty,SegRef,C,CI,UninCredit,PhysValid
3123
3124 nextfree = 1
3125 reserved = 0
3126   0: 1 0 0 304
3127 VFS: Busy inodes after unmount of hdb. Self-destruct in 5 seconds.  Have a nice day...
3128 This was pinned: [cfb273b0]8/0(9)r3F:Index(1),Pinned,Phase1,InoIdx,Valid,SegRef,C,CI,CN,CNI,PhysValid{0,0}[0] release(1) child(1) leaf(1)
3129 [cfb273b0]8/0(9)r3F:Index(1),Pinned,Phase1,InoIdx,Valid,SegRef,C,CI,CN,CNI,PhysValid{0,0}[0] release(1) child(1) leaf(1) Leaf1(0)
3130   [cfb74c90]8/0(2404)r1E:Valid,SegRef,CN,PhysValid orphan(1)
3131 badcnt 0 0 [cfb268c0]0/0(13)r4E:Index(1),Pinned,Phase1,InoIdx,Valid,Root,C,CI,CN,CNI,PhysValid{0,1}[0] NP child(4)
3132
3133
3134 erase Inoidx [ce5ab3b0]172/0(402)r1F:Index(1),InoIdx,Valid,IOLock,OnFree,PhysValid[0] inode_handle_orphan(1)
3135 erase Inoidx [ce5ab5e0]74/0(0)r2F:Index(1),Pinned,Phase0,WPhase0,InoIdx,Valid,Dirty,SegRef,C,CI,CN,CNI,UninCredit,IOLock,PhysValid{0,0}[0] child(1) leaf(1)
3136 ------------[ cut here ]------------
3137 WARNING: at /home/neilb/work/nfsbrick/fs/module/block.c:579 lafs_erase_iblock+0x
3138 unlink/orphan/erase_dblock_allocated_block
3139 ---[ end trace 61b8bd59512ea4da ]---
3140 zz [ce50d6a8]74/1059005010(0)r1E:SegRef,C,CI,UninCredit,IOLock,PhysValid,Orphan(0) orphan(1)
3141    [ce5ab5e0]74/0(0)r1E:Index(1),InoIdx,SegRef,C,CI,CN,CNI,UninCredit,PhysValid[0] child(1)
3142    [ce5ab5e0]74/0(0)r1E:Index(1),InoIdx,SegRef,C,CI,CN,CNI,UninCredit,PhysValid[0] child(1)
3143 ------------[ cut here ]------------
3144 kernel BUG at /home/neilb/work/nfsbrick/fs/module/index.c:1955!
3145
3146 BINGO.  When we remove last entry from directory we erase the InoIdx block,
3147  then when we add entries, we hit problems.
3148
3149
3150 nextfree = 3
3151 reserved = 0
3152   0: 1 0 0 306
3153   1: 1 0 0 307
3154   2: 1 0 0 74
3155 VFS: Busy inodes after unmount of hdb. Self-destruct in 5 seconds.  Have a nice day...
3156
3157 This was pinned: [cfb2caf0]16/0(10)r3F:Index(1),Pinned,Phase1,InoIdx,Valid,SegRef,C,CI,CN,CNI,PhysValid{0,0}[0] release(1) child(1) leaf(1)
3158 [cfb2caf0]16/0(10)r3F:Index(1),Pinned,Phase1,InoIdx,Valid,SegRef,C,CI,CN,CNI,PhysValid{0,0}[0] release(1) child(1) leaf(1) Leaf1(0)
3159   [ce9794f0]16/0(2200)r1E:Valid,SegRef,CN,CNI,PhysValid cleaning(1)
3160
3161 This was pinned: [cfb2d3b0]8/0(9)r3F:Index(1),Pinned,Phase1,InoIdx,Valid,SegRef,C,CI,CN,CNI,PhysValid{0,0}[0] release(1) child(1) leaf(1)
3162 [cfb2d3b0]8/0(9)r3F:Index(1),Pinned,Phase1,InoIdx,Valid,SegRef,C,CI,CN,CNI,PhysValid{0,0}[0] release(1) child(1) leaf(1) Leaf1(0)
3163   [ce968c90]8/0(2175)r3E:Valid,SegRef,C,CI,CN,CNI,PhysValid orphan(3)
3164
3165 This was pinned: [cfb2d180]1/0(8)r3F:Index(1),Pinned,Phase1,InoIdx,Valid,SegRef,C,CI,CN,CNI,PhysValid{0,0}[0] release(1) child(1) leaf(1)
3166 [cfb2d180]1/0(8)r3F:Index(1),Pinned,Phase1,InoIdx,Valid,SegRef,C,CI,CN,CNI,PhysValid{0,0}[0] release(1) child(1) leaf(1) Leaf1(0)
3167   [ce968430]1/0(2174)r1E:Valid,SegRef,C,CI,PhysValid cleaning(1)
3168
3169 We have stray 'cleaning' references.
3170 It is taken -
3171    on a data block that was in a to-clean segment
3172      at which point we igrab the inode
3173      the block is put on the ->cleaning list.
3174 It is put:
3175    when we get an error finding the block
3176    when we find that it isn't in the segment
3177    when an error occurs loading the block-to-be-relocated
3178    and when we mark that block for cleaning.
3179   i.e. always unless we got EAGAIN or some space error.
3180    If we still hold some blocks, try_clean returns 0.
3181
3182 VFS: Busy inodes after unmount of hdb. Self-destruct in 5 seconds.  Have a nice day...
3183 This was pinned: [cfb253b0]0/0(13)r5F:Index(1),Pinned,Phase1,InoIdx,Valid,Root,C,CI,CN,CNI,PhysValid{0,0}[0] NP release(1) child(3) leaf(1)
3184 [cfb253b0]0/0(13)r5F:Index(1),Pinned,Phase1,InoIdx,Valid,Root,C,CI,CN,CNI,PhysValid{0,0}[0] NP release(1) child(3) leaf(1) Leaf1(0)
3185   [cfa57b7c]0/1(2228)r0E:Valid,Dirty,UninCredit,PhysValid
3186   [ce5a4430]0/8(2231)r0E:Valid,Dirty,UninCredit,PhysValid
3187   [ce5a4c90]0/16(1028)r0E:Valid,Dirty,SegRef,C,CI,UninCredit,PhysValid
3188
3189 NOTE these inode data blocks are not pinned and so did not get written!!
3190
3191 FIXME I should wait for the checkpoint to finish
3192 nextfree = 1
3193 reserved = 0
3194   0: 1 0 0 301
3195 VFS: Busy inodes after unmount of hdb. Self-destruct in 5 seconds.  Have a nice day...
3196 This was pinned: [cfb27810]8/0(9)r3F:Index(1),Pinned,Phase0,InoIdx,Valid,SegRef,C,CI,CN,CNI,PhysValid{0,0}[0] release(1) child(1) leaf(1)
3197 [cfb27810]8/0(9)r3F:Index(1),Pinned,Phase0,InoIdx,Valid,SegRef,C,CI,CN,CNI,PhysValid{0,0}[0] release(1) child(1) leaf(1) Leaf0(0)
3198   [ce564c90]8/0(3983)r1E:Valid,SegRef,PhysValid orphan(1)
3199
3200 16Aug2009
3201   When I clean and find an inode that is already deleted, I need to be
3202   very careful not to resurrect anything.. I wonder if I am.... Yes, I seem
3203   to be.  lafs_delete_inode gets called a lot, but mostly for dead inodes.
3204
3205   BUGS:
3206 FIXED orphans don't get cleaned up.  It seems a 'create' fails and leaves
3207       an orphan block un-released.
3208    - sometimes inodes 1,8,16 didn't get written out as they were dirty but not pinned
3209    - Not sure that we handle complete truncation, then adding blocks properly.
3210      - what should the state of the InoIdx block be?
3211    - On remount, the filesystem contains rubbish.
3212    - create fails even when there should be free space.
3213    - sometimes BUG in checkpoint.c - not finishing checkpoint properly...
3214    - iblock not valid for inode 327 under cluster_flush/lafs_allocated_block
3215           and 74 has similar issue
3216      327 = adir/big1   74=adir
3217
3218
3219 17Aug2009
3220   Segusage blocks aren't always Pinned when we make them dirty.
3221   Yes. That is correct.  They are not forced out by phase change but by
3222   lafs_seg_flush_all at the end of a checkpoint.  So they need to be
3223   preallocated, but not Pinned.
3224   But, once we have finished the last checkpoint we don't want to
3225   dirty Segusage blocks any more.. I wonder if we are.
3226   No, but we were Pinning inodes without PinPending and they
3227   lost the pinning straight away!
3228
3229   OK, other annoyance.
3230    InoIdx block and similar are getting erased at the wrong
3231    time.
3232    We can only safely erase them when they have no children.
3233    I guess what we really want is that incorporation leaves them
3234    existing but empty, and when we go to write them out, if they
3235    are empty we register an address of 0.
3236    When we drop the ->parent pointer of an Index block it
3237    just goes away...
3238    So:
3239     When incorporate or truncate produces an empty index block
3240      it simply clears B_Valid.
3241     When incorporate wants to add to an index block, we set B_Valid.
3242     When cluster_allocate gets a non-Valid index block it calls
3243     block_allocated with phys of 0.
3244
3245     Yes, that seems to work.  Mostly
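
   The three rules amount to a very small state machine.  A sketch,
   assuming an entry count stands in for "empty"; iblk_sketch and the
   helper names are hypothetical:

```c
#include <assert.h>
#include <stdint.h>

#define B_VALID 1u	/* models B_Valid */

/* Hypothetical index block: just a flag word and an entry count. */
struct iblk_sketch {
	unsigned flags;
	unsigned nentries;
};

/* Incorporation adds an address: the block is (again) Valid. */
static void index_add(struct iblk_sketch *ib)
{
	ib->nentries++;
	ib->flags |= B_VALID;
}

/* Incorporation or truncate removes one: an empty block loses B_Valid. */
static void index_remove(struct iblk_sketch *ib)
{
	assert(ib->nentries > 0);
	if (--ib->nentries == 0)
		ib->flags &= ~B_VALID;
}

/* At write-out a non-Valid block registers address 0 with its parent. */
static uint64_t writeout_addr(const struct iblk_sketch *ib, uint64_t new_phys)
{
	return (ib->flags & B_VALID) ? new_phys : 0;
}
```

   Note the block itself is never freed here; it stays around, empty,
   so that any children still have a parent.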
3246
3247 18Aug2009
3248   On remount, check_credits dies: 16/20-0
3249     In pin_dblock/reserve/seg_ref/prealloc/space_return ?? in lafs_mount.
3250
3251 19Aug2009
3252   OK, this index block clearing is a mess.  There must be a neat model I can
3253   follow that will make it "just work".
3254   The key seems to be children.  If an index block has children, then it
3255   really must exist.  If it has no children and no content, then it can
3256   be discarded, in which case it needs to be unlinked from its sibling list.
3257   What locking do we use here?  Probably IOLock on the parent index block.
3258   So we need iolock while looking in a parent for children, and we take
3259   IOLock while incorporating or pruning.
3260   Once the empty index block has dropped out it will never be found again.
3261   When we incorporate the zero address, the index block becomes invisible
3262   unless it is shortly after its predecessor in the sibling list.  But
3263   that is hard to ensure, especially if the first child is the one that
3264   is being erased.  So if an index block is erased, then it must be
3265   discarded quickly and any children need to be relocated...
3266   Or maybe not.... maybe if there are children, we just write an empty block?
3267
3268 22Aug2009
3269   We need better locking of the index information.
3270   It seems best to use IOLock as that is already held during incorporation.
3271   So any code that accesses or updates an index block must hold IOLock.
3272   This might be a bit of a restriction if we try to do a lookup while
3273   writeout is happening.... Maybe we need a separate writeback flag for that.
3274   But I think it is good to use IOLock for now.
3275   Places we need this are:
3276      flush_data_to_inode needs to lock the InoIdx block
3277        - DONE
3278      lafs_leaf_find as it recurses down.  This should return a locked leaf.
3279        - DONE
3280      callers of clear_index
3281          erase_dblock for depth=0??
3282        - DONE
3283      incorporate should lock new blocks for consistency
3284        - DONE
3285
3286    Locking dependency rule is that if we hold a lock, we are allowed to
3287    lock a child index block, but not a parent.  If we hold a data block,
3288    we are allowed to lock an index block.
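
   The dependency rule could be checked with something like the
   following.  This is a sketch: lblk and may_lock are invented, "child"
   is read as "any descendant", and cases the rule does not mention
   (locking a data block while holding something) are simply refused.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical block: only what the ordering rule needs. */
struct lblk {
	bool is_index;
	struct lblk *parent;	/* NULL at the top of the tree */
};

/* Is 'a' somewhere above 'b' in the index tree? */
static bool is_ancestor(const struct lblk *a, const struct lblk *b)
{
	for (b = b->parent; b; b = b->parent)
		if (b == a)
			return true;
	return false;
}

/* While holding 'held', may we take the lock on 'want'? */
static bool may_lock(const struct lblk *held, const struct lblk *want)
{
	assert(held && want && held != want);
	if (!want->is_index)
		return false;	/* not covered by the rule */
	if (!held->is_index)
		return true;	/* data block held: any index block is OK */
	return is_ancestor(held, want);	/* index held: only go downwards */
}
```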
3289
3290
3291   The read/write completion seems all wrong.  It unlocks if the page was locked,
3292    and that isn't really safe, because it might not have been locked for read..
3293    We need to flag block0 to say if lock or writeback need to be cleared.
3294    Given that, I don't need IOPending any more:
3295     Read: We submit all reads, then set 'do_unlock', then check if we should unlock.
3296     Write: We queue all writes, then set 'do_clear_writeback', then check.
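
   A single-threaded model of the read side.  The names (page_io,
   submit_read, read_done, submit_finished) are made up, and real code
   would need atomics or a spinlock around the counter and flag; this
   only shows the ordering.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-page state kept on block0. */
struct page_io {
	int pending;		/* requests submitted but not yet completed */
	bool do_unlock;		/* set once every request has been submitted */
	bool locked;		/* models the page lock */
};

/* Unlock only when submission is finished and nothing is in flight. */
static void maybe_unlock(struct page_io *p)
{
	if (p->do_unlock && p->pending == 0)
		p->locked = false;
}

static void submit_read(struct page_io *p)
{
	assert(p->locked && !p->do_unlock);
	p->pending++;
}

/* Called from the completion path of each request. */
static void read_done(struct page_io *p)
{
	assert(p->pending > 0);
	p->pending--;
	maybe_unlock(p);
}

/* Called by the submitter after the last submit_read(). */
static void submit_finished(struct page_io *p)
{
	p->do_unlock = true;
	maybe_unlock(p);
}
```

   The flag is what stops an early completion from unlocking the page
   while later blocks of the same page are still to be submitted.  The
   write side would be the same shape with 'do_clear_writeback'.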
3297
3298   Now... can we use a writeback flag to avoid waiting to read while writeout
3299   is happening?  We would need:
3300      set writeback in cluster_allocate
3301      wait_writeback after some lock_block
3302      clear_writeback when writeout finishes.
3303      Extra checks where we already check for IOLock
3304
3305
3306 24aug2009
3307  Lots of progress but....
3308    cluster_flush calls cluster_done calls refile calls iput calls
3309     drop_inode calls write_inode_now calls writepage calls cluster_flush
3310   and we get a locking loop.
3311    I think we need to run that cluster_done from a different thread.
3312
3313
3314  We seem to have a refcnt problem with segsum.
3315
3316 25aug2009
3317  Lots more progress but.....
3318
3319   orphan_release is finding that the orphan block has no credits.
3320   We can allocate credits and simply not do the update if they
3321   are not available:  having an extra entry in the orphan file isn't
3322   a problem.  However we need some mechanism to clean up other than
3323   waiting for a remount..
3324   I think we leave that until we redo orphan handling.
3325
3326  and: adir sometimes loses one block so it and the contents don't get
3327    deleted.
3328
3329  and: it seems we sometimes try to clean the segment being written
3330    to.  We must avoid that.
3331
3332  (long ago I wrote::
3333   FIXME When pin fails, we need to remove PinPending from everything!!!
3334  and never followed up ... I wonder?
3335  )
3336
3337 25Aug2009
3338  Orphan handling.
3339   Every orphan block goes on a per-fs list and gets removed only
3340   if the B_Orphan bit is clear.
3341   There are two times when we want to expedite orphan handling.
3342   1/ on rmdir we need to know if the directory is really empty.
3343      This requires that we expedite the orphan handling of all
3344      blocks.  As soon as we find a non-orphan, we can give up.
3345      Then we need to make sure the index tree has collapsed.  We
3346      can borrow that code from truncate.
3347
3348   2/ When writing past Trunc_next.  We just pass the block to
3349      special orphan handling.
3350
3351   This requires that orphan handling is re-entrant.
3352   For dir, that is protected by i_mutex, but rmdir needs to come
3353    in under the radar.
3354   For trunc, the iolock on the index blocks should be enough.
3355   I wonder if IOLock can be used on dir as well... allowing
3356   parallel orphan handling in the one dir even!!.
3357
3358   We need to ensure exclusion of orphan handling, including:
3359       - only one orphan handler at a time
3360       - don't run orphan handler while still processing action
3361         that makes it an orphan.
3362   Maybe if we just use IOLock for that?  Does that work?  Maybe
3363   but it gets messy for directories (on first attempt anyway).
3364   For directories we can just use i_mutex.
3365   Maybe i_mutex for files as well?
3366
3367 27Aug2009
3368   Orphan handling is going well... but not perfect.
3369   I'm using IOLock to ensure exclusion for orphan handling.
3370   However:
3371     I'm not really implementing that on directories
3372     Inodes go bad because lafs_erase_dblock needs the lock too.
3373     The call from rmdir will always fail because we hold i_mutex.
3374
3375   Bigger problem.  I'm IOLocking inodes across checkpoints to preserve
3376    Orphan status.  But that might stop the checkpoint proceeding.
3377    .. so use i_mutex, not IOLock - fine.
3378
3379   Now... it seems I've confused myself.  Orphans don't get handled
3380   immediately.  In particular, inodes should not be handled until
3381   the final delete_inode.  So setting the B_Orphan flag and putting
3382   on the list are two separate events.  The flag must come first,
3383   but the list may come much later.  So some of that mucking around
3384   with i_mutex is pointless.
3385   So:
3386     make_orphan makes sure it is in orphan file, sets bit, and removes
3387       from list (if present).
3388     add_orphan puts it on the list for handling.
3389
3390     For inodes: lafs_new_inode sets the bit and delete_inode puts on queue,
3391         as does any unlink/rmdir/rename that fails.
3392
3393     For directories: put it on list in commit/abort.
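
   The split between "is an orphan" and "is queued for handling" in
   miniature.  make_orphan and add_orphan are the names used above; the
   struct and its fields are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical view of an orphan-capable block. */
struct oblk {
	bool in_orphan_file;	/* has a slot in the orphan file */
	bool b_orphan;		/* models B_Orphan */
	bool on_list;		/* queued for the orphan handler */
};

/* Record orphan status; handling is not requested yet. */
static void make_orphan(struct oblk *b)
{
	b->in_orphan_file = true;
	b->b_orphan = true;
	b->on_list = false;	/* remove from list if present */
}

/* Queue for handling; the flag must already be set. */
static void add_orphan(struct oblk *b)
{
	assert(b->b_orphan);
	b->on_list = true;
}
```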
3394
3395
3396   And...
3397     I hit the BUG where find_leaf wants an address of 0.
3398       If an index block gets cleaned out it doesn't disappear
3399       immediately.. there is no leaf to find in that direction.
3400       We probably need to avoid non-Valid blocks or something...
3401   And...
3402     Orphans 0/299 to 0/329 and  0/280 are still on the list
3403      but are not orphans.
3404      Maybe I need to catch mutex_unlock to run the orphans??
3405   And...
3406     We underflow a segment through orphans at unmount.
3407       We are cleaning and truncating at the same time.
3408       The same block gets allocated to 0 and to 1225
3409       in quick succession.
3410       Problem is that we apply new address while in writeback
3411       so a new lafs_allocated_block
3412
3413 29Aug2009
3414
3415   Review of inodes in orphan list:
3416     lafs_new_inode makes an orphan for a non-existent inode.
3417     If the inode cannot be created, orphan_release is called.
3418     If it can, a 'struct inode' is filled in with valid type
3419     and nlink==1 (!!) and attached.  The inode will only be
3420     detached when the refcnt hits 0, and the orphan list implies
3421     a refcount, so if we ever find something on the orphan list
3422     with a NULL my_inode, it must be very new and can be ignored.
3423
3424     When we find an inode block with a my_inode there are a few options:
3425       if I_Trunc is set, we must progress truncation providing we can
3426             get the i_mutex
3427       else if I_Deleting we must delete the inode
3428       else if nlink is 0, we remove from the list
3429       else nlink > 0 and we must remove orphan status.
3430     This means that if nlink is elevated, we need to be holding the mutex...
3431     So don't elevate nlink any more...
3432
3433     When nlink becomes non-zero the block needs to be put back on the
3434     orphan list (it must already be an orphan).  Also when we set
3435     I_Deleting or I_Trunc it must go on the list.
3436    .. OK, I think I have all of that.
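
    The options listed above can be written as one decision function.
    This is only a sketch: the struct, the enum and the function name are
    hypothetical, and ORPHAN_RETRY_LATER is an assumption, since the notes
    do not say what happens when i_mutex cannot be taken.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum orphan_action {
    ORPHAN_IGNORE,        /* no my_inode: very new, can be ignored */
    ORPHAN_TRUNCATE,      /* I_Trunc set and i_mutex obtained */
    ORPHAN_RETRY_LATER,   /* I_Trunc set, no i_mutex (assumed outcome) */
    ORPHAN_DELETE,        /* I_Deleting: delete the inode */
    ORPHAN_UNLIST,        /* nlink == 0: remove from the list */
    ORPHAN_CLEAR_STATUS   /* nlink > 0: remove orphan status */
};

/* Hypothetical stand-in for the bits of the inode that matter here. */
struct ino_model {
    bool i_trunc;
    bool i_deleting;
    unsigned nlink;
};

/* ino == NULL models an inode block whose my_inode is NULL. */
static enum orphan_action orphan_decide(const struct ino_model *ino,
                                        bool got_mutex)
{
    if (ino == NULL)
        return ORPHAN_IGNORE;
    if (ino->i_trunc)
        return got_mutex ? ORPHAN_TRUNCATE : ORPHAN_RETRY_LATER;
    if (ino->i_deleting)
        return ORPHAN_DELETE;
    if (ino->nlink == 0)
        return ORPHAN_UNLIST;
    return ORPHAN_CLEAR_STATUS;
}
```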
3437
3438
3439 30Aug2009.
3440    I have some weirdness that seems to be caused by the orphan stuff,
3441    probably due to it all being async now.
3442    - A deleted inode clears I_Trunc and then sets it again.  The only
3443      explanation seems to be that delete_inode is being called again,
3444      so I must be igrab-ing it again, maybe from cleaning.
3445    - bits of directories aren't getting deleted.  Sometimes single
3446      blocks, though the referred files are deleted.  Sometimes
3447      the whole directory... More interestingly, those blocks then
3448      don't get cleaned, so something about them means that they
3449      don't get deleted and don't get cleaned either.
3450
3451    Even weirder... I just had a case where file 331 had a different
3452    index block for every 4 data blocks...
3453
3454
3455    FIXME:
3456     - What stops pinned blocks from being flushed by bdflush in middle
3457       of operation and so losing allocation?  Must make sure to set
3458       them dirty very late.
3459     - orphan_release can fail, so must make sure we can always call
3460       it, even if my_inode is NULL.... but how?
3461
3462
3463     - make_orphan could fail due to lack of space, which is not OK.
3464       I made it loop, but I'm not 100% sure that is right... it isn't.
3465       I need to pass down the 'I'm freeing space' flag, and I need to
3466       not require Credit of Dirty is set, etc.
3467
3468
3469     - I seem to have a deadlock at unmount.
3470        umount is waiting for lafs_checkpoint_lock_wait in
3471           lafs_put_super
3472        pdflush is in down_read in sync_supers
3473        lafs_cleaner is iget_locked/ifind_fast/inode_wait
3474                 This is waiting for I_LOCK to be clear.
3475
3476
3477 31Aug2009
3478   - When a file shrinks and becomes level-0, make sure
3479     old addresses get deallocated.  I seem to have
3480     a directory where they didn't.
3481
3482   - Due to the fact that we over-preallocate, we really shouldn't
3483     return ENOSPC until we have flushed dirty data and performed
3484     a checkpoint??
3485
3486
3487   - When I removed the last index from an inode
3488     (Indirect type) it seems that I didn't write
3489     out the corrected block..??
3490
3491 1sep2009
3492  I ran my simple test run repeatedly overnight.
3493  It ran 208 times before I stopped it.
3494  There are 3 possible failure modes:
3495    1/ didn't complete within 500 seconds
3496    2/ triggered a BUG
3497    3/ appeared to complete, but the number of blocks
3498       in use was not the correct '7'.
3499
3500  74 (35%) did not fail!
3501  31 () did not complete
3502  40 () triggered a BUG
3503  2 did not complete but did not trigger a bug
3504
3505  94 of those that failed did not have a BUG
3506  92 actually completed.  Of these:
3507       1 final blocks 1
3508       1 final blocks 110
3509       1 final blocks 23
3510       2 final blocks 12
3511       5 final blocks 0
3512       6 final blocks 10
3513      11 final blocks 8
3514      21 final blocks 11
3515      44 final blocks 9
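
  A quick cross-check of those counts, as a sketch.  The constants are the
  figures quoted above; the function names are made up for illustration.
  The "31 did not complete" figure is left out because the notes do not
  say how it overlaps with the BUG count.

```c
#include <assert.h>
#include <stddef.h>

enum { TOTAL_RUNS = 208, CLEAN_RUNS = 74, BUG_RUNS = 40, HUNG_NO_BUG = 2 };

/* Runs per "final blocks" value, in the order listed in the notes. */
static const int final_block_runs[] = { 1, 1, 1, 2, 5, 6, 11, 21, 44 };

/* Failed runs that did not hit a BUG. */
static int failed_without_bug(void)
{
    return TOTAL_RUNS - CLEAN_RUNS - BUG_RUNS;
}

/* Runs that completed but ended with the wrong block count. */
static int completed_with_wrong_count(void)
{
    return failed_without_bug() - HUNG_NO_BUG;
}

/* Sum of the "final blocks" histogram. */
static int histogram_total(void)
{
    int sum = 0;
    size_t n = sizeof(final_block_runs) / sizeof(final_block_runs[0]);

    for (size_t i = 0; i < n; i++)
        sum += final_block_runs[i];
    return sum;
}

/* Whole-number percentage of all runs, rounded down. */
static int percent_of_runs(int n)
{
    return n * 100 / TOTAL_RUNS;
}
```

  The histogram total agrees with the number of runs that completed with
  a wrong count, so the table accounts for every such run.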
3516
3517  of the BUGs,
3518       1 BUG: sleeping function called from invalid context at kernel/nsproxy.c:217
3519       1 BUG: spinlock lockup on CPU#0, rm/1330, cfb2dae4
3520       1 kernel BUG at /home/neilb/work/nfsbrick/fs/module/block.c:485!
3521       1 kernel BUG at /home/neilb/work/nfsbrick/fs/module/cluster.c:1219!
3522       1 kernel BUG at /home/neilb/work/nfsbrick/fs/module/inode.c:821!
3523       2 BUG: soft lockup - CPU#0 stuck for 61s! [lafs_cleaner:1177]
3524       3 kernel BUG at /home/neilb/work/nfsbrick/fs/module/segments.c:1028!
3525       3 kernel BUG at /home/neilb/work/nfsbrick/fs/module/segments.c:351!
3526       5 kernel BUG at /home/neilb/work/nfsbrick/fs/module/lafs.h:276!
3527       6 kernel BUG at /home/neilb/work/nfsbrick/fs/module/block.c:529!
3528       7 BUG: unable to handle kernel paging request at 6b6b6bfb
3529      11 kernel BUG at /home/neilb/work/nfsbrick/fs/module/super.c:655!
3530
3531
3532  super.c:655 is "block is still pinned" at unmount time.
3533   The block was always an InoIdx with a child.
3534   Either inode 0 or 16.
3535   child is held by various things:
3536       [cfb555cc]16/1(2098)r131E:Valid,Async,SegRef,CN,CNI,UninCredit,PhysValid async(1) clean2(130)
3537       [cfb554f0]16/0(1050)r25E:Valid,SegRef,CN,CNI,PhysValid clean2(25)
3538       [cfa57c58]0/2(3676)r0E:Valid,Dirty,UninCredit,PhysValid
3539       [cfa5bc58]0/2(3110)r0E:Valid,Dirty,UninCredit,PhysValid
3540       [ce5b94f0]16/0(519)r1E:Valid,Async,SegRef,CN,CNI,PhysValid async(1)
3541       [cfb4d4f0]16/0(4249)r1E:Valid,Async,SegRef,CN,CNI,PhysValid async(1)
3542       [ce5ad4f0]16/0(612)r1E:Valid,Async,SegRef,CN,CNI,PhysValid async(1)
3543       [ce5c2fc8]0/74(0)r129E:SegRef,C,Claimed,PhysValid clean2(129)
3544       [cfa57c58]0/2(1895)r0E:Valid,Dirty,UninCredit,PhysValid
3545       [cfb4d5cc]16/1(4543)r105E:Valid,SegRef,CN,CNI,UninCredit,PhysValid clean2(105)
3546       [ce5754f0]16/0(1290)r178E:Valid,SegRef,CN,CNI,PhysValid clean2(178)
3547
3548  The "unable to handle kernel paging request" is always in
3549  umount.
3550      invalidate_inode_buffers(26/46)/lock_acquire
3551
3552
3553  block.c:529
3554     This is iblock valid when erasing a block
3555     The block we are erasing is always 0/327 or 0/328.  It is
3556     an orphan we are handling, iolocked but not always pinned
3557
3558  lafs.h:276
3559     Map an iblock which is not IOLocked
3560        always in lafs_clear_index for the InoIdx block for a directory
3561        which is in Writeback.
3562        Call is in lafs_allocated_block from cluster_flush.
3563
3564  segments.c:351
3565     seg_inc reduces seg usage below 0
3566       - lots of blocks (inode 327) that were cleaned, were then erased twice.
3567       - 2 blocks (inode 328) were erased twice, both from prune
3568       - ditto
3569
3570  segments.c: 1028
3571      The free list is empty.... odd as only first segment is currently
3572      in use.
3573
3574  soft lockup:
3575      Still orphan: 0/328  Index(1) is in Writeback and Dirty
3576        again inode_handle_orphan2 is in Writeback
3577
3578  inode.c:821
3579      inode_handle_orphan at end, child list is not empty.
3580        The children seem to be in Realloc - cleaner needs to let go.
3581
3582  cluster.c:1219
3583      my_inode is null while cluster_flush handles an inode and wants to set
3584         WritePhase.
3585
3586
3587  block.c:485
3588      no ICredit for unincredit in dirty_dblock from dir_delete_commit
3589      from lafs_unlock.
3590
3591
3592  spinlock lockup is subsequent to real bug
3593  ditto for sleeping function.
3594
3595  Of the '44' which claimed final blocks of 9, 14 really had 7, and 4
3596  appear to have other strange values....
3597
3598  A selected '9' has two extra blocks for the directory '74'.
3599  But that directory is long gone.
3600  These dir blocks are currently fully populated with numbers.
3601  This seems to be the pattern with all non-7 blocks.
3602
3603
3604  02Sep2009
3605   Found a problem, possibly related to the dir blocks not being
3606   cleaned up.
3607   When lafs_incorporate sets ->depth to 1 it doesn't dirty the inode,
3608   so that fact is never copied in to the datablock.
3609   On further exploration, the I_Dirty bit is set but never used, which
3610   isn't good.
3611   So: exactly when do we copy inode into datablock, and what do we do
3612   when dirty_inode is called (if anything).
3613   We could just set I_Dirty when dirty_inode is called, checking that
3614   the block is Pinned which it usually will be.
3615   Then we copy inode to data just before writing data block.
3616   However that defeats transactional properties.  We need to copy in the
3617   same transaction, and that means either straight away, or when
3618   the data block's phase changes.
3619   So dirty_inode either copies to the block, or sets I_Dirty.
3620   When lafs_refile unpins an inode data block, it needs to check
3621   I_Dirty and possibly re-dirty it.
3622
3623   To redirty it we must steal the NCredits.  Any further dirty attempt
3624   will have to allocate more.
3625   The stealing is done automatically by dirty_dblock, so we just flip
3626   the phase and call dirty_inode ... making sure it doesn't try to
3627   prealloc too hard.
3628
3629   Need to review when inodes get dirtied.
3630     - commit_write only sets I_Dirty !
3631
3632     We call lafs_dirty_inode:
3633       dir_create_commit - a child of inode is PinPending
3634       lafs_create - ditto
3635       lafs_link - before dir_create_commit
3636       lafs_unlink, lafs_rmdir - data block is pinned
3637       lafs_symlink - before create_commit
3638       lafs_mkdir - before create_commit, or block pinned
3639       lafs_mknod - before create_commit
3640       lafs_rename - (moved to) before create_commit/update_commit
3641                      or data block is pinned
3642       lafs_dir_handle_orphan - (assured that) child is pinned.
3643       choose_free_inum - child is pinned
3644       lafs_incorporate - block is pinned
3645
3646     So either the data block is pinned, or the index block is pinned.
3647     In either case it is OK to set something to Dirty.
3648
3649     (the new) lafs_dirty_vfs_inode gets called by mark_inode_dirty{,_sync}
3650     this is called from:
3651         inode_inc_link_count
3652         inode_dec_link_count
3653         ..various quota ops...
3654         inode_setattr
3655         __set_page_dirty (Which we don't use)
3656         other buffer stuff
3657         other quota stuff we won't use
3658         touch_atime
3659         file_update_time
3660         page_symlink
3661
3662     only the time updates are interesting.  Others we have locking
3663     for.
3664     file_update_time is called from generic_file_aio_write_nolock etc
3665     before ->prepare_write/->commit_write.  So they can pick up the
3666     change.
3667     Similarly before set_page_dirty is called.
3668     touch_atime is called from do_follow_link and readlink and
3669     file_accessed which is called all over the place.
3670
3671     So what to do?
3672     If block is pinned, then dirty it to ensure writeout.
3673     If not, don't.  But copy data in any case.
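
    That rule is small enough to sketch directly.  The structs below are
    hypothetical stand-ins for the VFS inode and the inode's data block;
    the function is a model of the policy, not the real
    lafs_dirty_vfs_inode.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the fields a time update touches. */
struct inode_model {
    long atime;
    long mtime;
};

/* Hypothetical stand-in for the data block holding the on-disk inode. */
struct dblock_model {
    struct inode_model copy;  /* inode contents as stored in the block */
    bool pinned;
    bool dirty;
};

/* Copy the inode into the block in any case; only a pinned block is
 * dirtied, so an unpinned block is never forced to be written out. */
static void model_dirty_vfs_inode(const struct inode_model *ino,
                                  struct dblock_model *db)
{
    db->copy = *ino;
    if (db->pinned)
        db->dirty = true;
}
```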
3674
3675
3676 4sep2009
3677
3678     OK, I've decided that I don't like clearing B_Valid when an index
3679     block contains no indexes.  The final straw was that I seemed
3680     to need to initialise the index block when I didn't hold IOLock.
3681     That was probably fixable, but I'm sure more problems were coming.
3682
3683     So: what to do instead?
3684     One issue that must be resolved is that an index block can still
3685     have valid children even when it becomes empty.
3686     This can happen if we erase blocks from a file, then add them back
3687     after a checkpoint, and so in the next phase.
3688     The checkpoint writeout could need to show an empty index block,
3689     but the next phase will see real addresses.
3690     We cannot easily avoid this, so we must handle it.
3691     This interacts badly with the index lookup algorithm that finds
3692     the best index block currently in the parent, and then scans
3693     the children.  If there is no index block in the parent, we
3694     cannot find any children.
3695     This could be handled by responding to an empty index block by
3696     scanning all children.  But that isn't a full solution as if
3697     just one index block got erased, its unincorporated siblings
3698     would still be lost.
3699     We could treat empty index blocks like orphans.  i.e. don't
3700     discard them immediately but leave them with possibly real
3701     addresses.  Then when they have no children we allocate them
3702     to 0.
3703     But we still need to ensure that index blocks off which siblings
3704     have been split but not yet incorporated remain present in the
3705     tree to mark the place for their siblings.
3706     There is another problem.  A horizontal split could leave the
3707     new block with no addresses and everything in the uninc list.
3708     Nothing can be found in there.
3709
3710     So maybe we need to revise the lookup mechanism.
3711     The goal is to find an index block that starts at or before
3712     the target and contains an address at or after the target.
3713     Then our search can stop.
3714     In rare cases.....
3715
3716 7sep2009
3717     I thought about this more over the weekend and think I have an answer.
3718     We need to treat internal and leaf index blocks somewhat differently.
3719
3720     An internal index block must never be empty (while unlocked).
3721     Any child block which has not had its address incorporated must be
3722     attached (simply in the sibling list) to a block which has been
3723     incorporated.  This will be the block that it was split off from.
3724     The uninc block needs to hold a reference so that the primary isn't
3725     released.
3726     When a 'primary' becomes empty it cannot be discarded, so the
3727     addresses in the first dependent index block must be copied
3728     across.  This is awkward for indirect blocks so they might be
3729     allowed to be empty (they aren't internal so don't violate the
3730     above).
3731     When a horizontal split breaks a sequence of dependent blocks
3732     between two parents, the second parent must be incorporated
3733     immediately so that the first block in the second half of the
3734     sequence is incorporated.
3735     If an internal index block does become empty and it has no
3736     dependent blocks to fill from, it must be invalidated immediately.
3737     It cannot have any children - even in next phase - as at least one
3738     would have to be incorporated and so the block would not be empty.
3739     Invalidating involves allocating to address 0.
3740     If index lookup finds a block with PhysValid address of 0, it
3741     must look to the previous index block.  If there was none .... it
3742     gets a bit complex.
3743
3744     Leaf index blocks can become empty, but we try to avoid it.
3745     If a leaf has blocks which have been created in the next phase,
3746     and others which have been deleted in this phase, it can be empty
3747     but still have children.  In this case we just treat it as a real
3748     index block that doesn't actually have any addresses.  We still
3749     write it out even though that is a waste of space.
3750
3751     We have been working on the assumption that every address always
3752     has a corresponding leaf index block.  It is the leaf with the
3753     highest index at or below the target address.
3754     However this requires that every internal index block has a child
3755     with the same address as the parent.
3756     Preserving this requirement when the first child of an internal
3757     block becomes empty requires either:
3758        - loading the 'next' child and reassigning this to the start
3759        - changing the address of the parent to match the first child.
3760     The former requires possibly reading a block from storage.
3761     The latter only involves modifying blocks that are due to be
3762     written out anyway, but makes block look up slightly interesting.
3763     When lookup finds an invalid block that is 'first', it needs to
3764     start again from the top.
3765     When incorporation creates an invalid block that is first, it
3766     needs to walk down from the top and any index block at the same
3767     address needs to be relocated/rehashed.  If the block is
3768     incorporated, the incorporated address needs to be updated.
3769     So:
3770      - flag for unincorporated index blocks which implies a reference
3771        on primary
3772      - after split, immediately incorporate second block
3773      - change lookup to retry when finding invalid block
3774      - When internal block becomes empty, either merge with
3775        first dependent or invalidate.  If first in parent,
3776        update address and parent and recurse.
3777        Need some 'clever' locking here.
3778        Before unlocking the invalidated block, we take i_alloc_sem,
3779        then walk up the ->parent tree locking blocks as
3780        required.
3781        The index lookup, when it finds an invalid block will take
3782        i_alloc_sem, then drop it, then start again.
3783        Or maybe some other lock than i_alloc_sem...
3784      - When leaf becomes empty, invalidate only if it has no children.
3785        When internal leaf becomes unpinned, check if empty.
3786
3787 21sep2009
3788    That locking doesn't look like it will work, and we can never 'merge
3789    with first dependent' as it is not valid to have an index block
3790    where the first child is at a different address.
3791    And we cannot always change the parent address, particularly if it
3792    is zero - increasing it then cannot work.
3793    And there is no need to load a block if we are just going to change
3794    its start address (not internal index blocks anyway).
3795    Let's drop the idea of relocating the parent.
3796    If an internal index block becomes empty:
3797      If it is last in parent, no loss, just discard
3798        If parent would be empty, need to recurse up.
3799      If it is not last relocate the next sibling to this location,
3800       rehashing it and updating the parent.
3801    If a leaf index block becomes empty we cannot just delegate to
3802       next as it might be indirect... not a problem if address is
3803       stored.  But that requires a format change... now might be a
3804       good time!
3805
3806
3807    So:
3808      If we hold an index block locked and it becomes empty and we choose
3809      to invalidate it, we need to ensure that doing so does not
3810      break any indexing paths.
3811      So we take a separate lock (i_alloc_sem??) and flag the block as invalid
3812      by setting physaddr to 0 while PhysValid is set, and unlock the block.
3813      Any lookup that finds such a block must take and release i_alloc_sem,
3814      and then restart from the top.
3815      - If the block was not incorporated, we just remove from sibling list
3816           and all is done - the space is implicitly included in
3817           previous block.
3818      - If the block has a different fileaddr than the parent then update
3819           the parent directly, either removing the entry, or changing it to
3820           point to the first unincorporated sibling (if there is one).
3821           This requires taking the lock on the parent of course.  That is
3822           why we dropped the lock on the child.
3823           Then all done.
3824      - If the block has the same address as the parent we need to find
3825           a 'next block' to relocate to the start of the parent.
3826           It is either the first unincorporated sibling, or the next
3827           block in the index block, or nothing, meaning the parent is
3828           about to become empty.
3829         We lock the parent (still holding i_alloc_sem), and rehash the
3830           chosen child.  If it doesn't exist, or is not dirty, we need
3831           to update the phys address directly in the parent
3832           accordingly, erasing or replacing the first address.
3833           Then we need to rehash the index block, but we need to lock
3834           the parent for that.
3835           So set a 'busy' flag on the block, unlock it, lock parent,
3836           rehash, clear busy flag, and repeat.
3837       - We can never relocate a block with fileaddr of zero, as the
3838           InoIdx block cannot be relocated.  So leaf index block 0
3839           must never be erased unless the file is empty.  So
3840
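   The "physaddr 0 while PhysValid is set" marker gives three states a
   lookup can see.  A sketch of the classification, assuming a toy block
   struct; the bit value and all names here are hypothetical, only the
   encoding itself comes from the notes.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_PHYS_VALID 0x1u  /* hypothetical bit for the PhysValid flag */

struct iblk_model {
    unsigned flags;
    unsigned long physaddr;
};

enum phys_state {
    PHYS_UNKNOWN,      /* PhysValid clear: address not known yet */
    PHYS_INVALIDATED,  /* PhysValid set, physaddr == 0 */
    PHYS_LIVE          /* PhysValid set, real address */
};

static enum phys_state classify_phys(const struct iblk_model *b)
{
    if (!(b->flags & MODEL_PHYS_VALID))
        return PHYS_UNKNOWN;
    return b->physaddr == 0 ? PHYS_INVALIDATED : PHYS_LIVE;
}

/* Flag a block as invalid in the way the notes describe. */
static void invalidate(struct iblk_model *b)
{
    b->physaddr = 0;
    b->flags |= MODEL_PHYS_VALID;
}

/* A lookup that lands on such a block must wait for the invalidator
 * (take and release the lock) and then restart from the top. */
static bool lookup_must_restart(const struct iblk_model *b)
{
    return classify_phys(b) == PHYS_INVALIDATED;
}
```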
3841 28sep2009
3842   New idea.
3843   We store the start address of an indirect block in the block.
3844   This means that the meaning of any index block is completely
3845   independent of the location of the block, so we can change the location
3846   easily and without touching the block.
3847   So if a block becomes empty, we simply move the next block back to
3848   fill the gap.
3849   i.e. when an index block becomes truly empty (i.e. no children)
3850    - if it wasn't incorporated, simply remove it
3851    - if it was,
3852        - if there is a dependent block, rehash it to take my address
3853        - if there is a next block that is dirty, rehash it
3854        - if there is a next block that is not dirty,
3855           update parent to merge my entry with next, and rehash next
3856           if it exists
3857        - if there is no next block but we are not first, just update
3858           parent
3859        - if no next block and we are first, parent becomes empty,
3860           recurse upwards.
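
   The case analysis above, written out as a pure decision function.  A
   sketch only: the context struct, enum and function name are
   hypothetical, and the checks are tried in the order the list gives
   them.

```c
#include <assert.h>
#include <stdbool.h>

enum empty_action {
    EMPTY_REMOVE,            /* never incorporated: simply remove it */
    EMPTY_REHASH_DEPENDENT,  /* dependent block takes my address */
    EMPTY_REHASH_NEXT,       /* dirty next block takes my address */
    EMPTY_MERGE_INTO_NEXT,   /* update parent to merge entry, rehash next */
    EMPTY_UPDATE_PARENT,     /* no next, not first: just update parent */
    EMPTY_RECURSE_UP         /* no next and first: parent becomes empty */
};

/* Hypothetical summary of what is known about the empty index block. */
struct empty_ctx {
    bool incorporated;
    bool has_dependent;
    bool has_next;
    bool next_dirty;
    bool is_first;
};

static enum empty_action empty_decide(const struct empty_ctx *c)
{
    if (!c->incorporated)
        return EMPTY_REMOVE;
    if (c->has_dependent)
        return EMPTY_REHASH_DEPENDENT;
    if (c->has_next)
        return c->next_dirty ? EMPTY_REHASH_NEXT : EMPTY_MERGE_INTO_NEXT;
    return c->is_first ? EMPTY_RECURSE_UP : EMPTY_UPDATE_PARENT;
}
```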
3861
3862 12Oct2009
3863  - too long, I've forgotten what I was up to..
3864    + I've changed the format of indirect blocks to store an address.
3865    + I've handled incorporation of an empty block
3866    So now internal index blocks can never be empty - they get immediately
3867    unlinked if they are.
3868    Leaf index blocks can be empty while they have children.  We don't
3869    flag them as empty, but rather wait until another child gets incorporated.
3870    But I don't think I really like that.  It is an external ugliness based
3871    entirely on internal implementation details.  Empty index blocks should
3872    not get written out.  We need some way to reliably find an empty index
3873    block.  The address won't appear in the parent so a lookup will find the
3874    previous block which we cannot link to now as it may not exist yet.
3875    Worse - if first index block goes empty, we can only unlink it by moving
3876    the parent to start at the next block.  That would make this index block
3877    totally unfindable.
3878    So I think we have to stick with writing out empty index blocks very
3879    rarely.  So we need to be sure they disappear properly.
3880    The difficult case is if an index block becomes empty while it has some
3881    children which don't end up getting dirtied. e.g. an update aborts.
3882    We need to leave the block with enough credits to be written out.
3883    I guess the Ncredit should be enough...
3884    Maybe worry about that later.
3885
3886  - what about InoIdx blocks when they become empty?  It would be helpful
3887    to flag them so that inode deletion can check....
3888    Maybe just set depth to 0..
3889
3890  ARRGGG... I've completely lost it.  I need another ITO week.
3891   I just got a bug in summary.c:71!!
3892
3893 7 Jun 2010
3894  - summary.c:71.
3895    ablocks_used has hit zero too soon.
3896    This should be the count of blocks for which space has been allocated
3897    (B_Prealloc is set) but have not been given a phys address yet - at which
3898    point the usage count is moved to cblocks_used or pblocks_used.
3899    The last block (which may not be the cause of the problem) does not have
3900    B_Prealloc set, yet physaddr == 0.
3901    The block is 0/1, so the inode for the inode usage map.  This should have
3902    physaddr 8 !!
3903    We did find 8, then changed to 73, but then changed to 0!
3904   Ahhh... recent fix exposed a subtle bug ... fixed.
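
  For reference, a sketch of the accounting that check protects.  The
  counter names are the ones used above; the struct and function names are
  hypothetical, and the to_cblocks argument just selects the destination
  counter without claiming when each one applies.

```c
#include <assert.h>
#include <stdbool.h>

struct usage_model {
    long ablocks_used;  /* space allocated, no phys address yet */
    long cblocks_used;
    long pblocks_used;
};

/* Space is reserved for a block (B_Prealloc gets set). */
static void usage_prealloc(struct usage_model *u)
{
    u->ablocks_used++;
}

/* The block is given a phys address: move its count out of
 * ablocks_used.  Returning false models ablocks_used hitting zero
 * too soon, i.e. the condition the BUG reports. */
static bool usage_assign_phys(struct usage_model *u, bool to_cblocks)
{
    if (u->ablocks_used <= 0)
        return false;
    u->ablocks_used--;
    if (to_cblocks)
        u->cblocks_used++;
    else
        u->pblocks_used++;
    return true;
}
```

  A block that reaches usage_assign_phys without a matching
  usage_prealloc is exactly the "physaddr changes without B_Prealloc"
  case described above.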
3905
3906  Now cluster.c:619: [ce9233f8]0/282(0)r2F:Pinned,Phase1,PinPending,CN,CNI,Claimed,PhysValid iblock(1) leaf(1)
3907      cluster.c:619: [ce570a18]0/286(0)r2F:Pinned,Phase1,PinPending,SegRef,CN,CNI,Claimed,PhysValid iblock(1) leaf(1)
3908      cluster.c:619: [ce588d6c]0/17(0)r2F:Pinned,Phase0,PinPending,CN,CNI,Claimed,PhysValid iblock(1) leaf(1)
3909      cluster.c:619: [ce51dfe4]0/283(0)r2F:Pinned,Phase0,PinPending,CN,CNI,Claimed,PhysValid iblock(1) leaf(1)
3910      cluster.c:619: [cfbb8430]0/328(0)r2F:Pinned,Phase1,PinPending,CN,CNI,Claimed,PhysValid iblock(1) leaf(1)
3911    We are allocating an InoIdx block, but data block is not valid??
3912
3913  That isn't very reproducible so I'll have to leave it for now...
3914     erasedblock had been called on the data block .. inode 17??
3915
3916   Problem is that I keep changing the rules.
3917    I don't erase the InoIdx block any more.
3918    I used to, then changed it to iolock_block/cluster_allocate->0
3919
3920  Problem: When all files are removed, usage is still quite high, two
3921    segments have over 400 blocks (out of 512).  Cleaning keeps running and
3922    not making much progress.
3923   segment 6 has usage of 484.
3924   'cluster 3072' shows: cluster 3072, 3085, 3086 3092
3925     Inode 0:  blocks 267 272 276
3926     Inode 277: blocks 0/4 6/2
3927     Inode 0: blocks 0/2 8 16
3928     Inode 0: block 16 70/2 131/3 135/4 140/9 150/2 ... 296/7
3929     Inode 16: 1/1
3930     Inode 17: 0/28
3931     Inode 283: 12/18
3932           etc.
3933
3934   All 'old', so must be the product of cleaning, as you would expect.
3935   All (most) of this has been deleted though, but count didn't drop.
3936    'Count' adds to 508, plus the 4 cluster heads makes 512 - good.
3937   lafs_seg_move definitely isn't being called on these blocks.
3938   it is only called from lafs_summary_update
3939   cblocks_used "exactly" matches the number of un-removed blocks.
3940
3941
3942   Another problem
3943 bad [ce5bcf50]301/0(0)r1E:Index(0),Pinned,Phase1,WPhase0,InoIdx,Valid,SegRef,C,CN,CNI,UninCredit,IOLock,PhysValid{0,0}[0] leaf(1)
3944 /home/neilb/work/nfsbrick/fs/module/modify.c:1652: [ce5bcf50]301/0(0)r1E:Index(0),Pinned,Phase1,WPhase0,InoIdx,Valid,SegRef,C,CN,CNI,UninCredit,IOLock,PhysValid{0,0}[0] leaf(1)
3945 bad [cfbf6000]327/0(0)r1E:Index(0),Pinned,Phase1,WPhase0,InoIdx,Valid,SegRef,C,CN,CNI,UninCredit,IOLock,PhysValid{0,0}[0] leaf(1)
3946 /home/neilb/work/nfsbrick/fs/module/modify.c:1656: [cfbf6000]327/0(0)r1E:Index(0),Pinned,Phase1,WPhase0,InoIdx,Valid,SegRef,C,CN,CNI,UninCredit,IOLock,PhysValid{0,0}[0] leaf(1)
3947 bad [cfb62d20]291/0(0)r1E:Index(0),Pinned,Phase1,WPhase1,InoIdx,Valid,SegRef,C,CN,CNI,UninCredit,IOLock,PhysValid{0,0}[0] leaf(1)
3948 /home/neilb/work/nfsbrick/fs/module/modify.c:1656: [cfb62d20]291/0(0)r1E:Index(0),Pinned,Phase1,WPhase1,InoIdx,Valid,SegRef,C,CN,CNI,UninCredit,IOLock,PhysValid{0,0}[0] leaf(1)
3949
3950  and
3951 free_blocks=1842 allocated=449 max_seg=512 clean_reserved=0
3952 Want dump of usage
3953
3954 ------------[ cut here ]------------
3955 kernel BUG at /home/neilb/work/nfsbrick/fs/module/segments.c:1028!
3956  free list is empty - that should not be.
3957
3958 and another...
3959 /home/neilb/work/nfsbrick/fs/module/modify.c:1219: [ce9893b0]74/0(0)r1E:Index(1),Pinned,Phase0,WPhase1,InoIdx,Valid,Dirty,CN,CNI,UninCredit,IOLock,PhysValid{0,0}[0] leaf(1)
3960 /home/neilb/work/nfsbrick/fs/module/modify.c:1219: [ce5ba690]74/0(0)r1E:Index(1),Pinned,Phase1,WPhase0,InoIdx,Valid,Dirty,CN,CNI,UninCredit,IOLock,PhysValid{0,0}[0] leaf(1)
3961  [<d0a57bc8>] ? lafs_get_flushable+0x131/0x191 [lafs]
3962  [<d0a5856d>] ? lafs_do_checkpoint+0x1b3/0x3a2 [lafs]
3963  [<d0a5fe7e>] ? cleaner+0x105/0x1426 [lafs]
3964  [<c02256bf>] ? autoremove_wake_function+0x0/0x33
3965  [<d0a5fd79>] ? cleaner+0x0/0x1426 [lafs]
3966
3967
3968 08Jun2010
3969  Weirdness with truncating.
3970  The cleaner relocates a file resulting in the InoIdx block being
3971  Maybe-dirty and phys_addr == 0.
3972  Then truncate doesn't prune but just incorporates, finding
3973   something weird there..
3974   file 278, blocks around 4100
3975   seem to find 1949 instead??
3976
3977  Note: When a non-InoIdx block is erased we set PhysValid
3978   and physaddr == 0 to record the fact because it will not be stored...
3979
3980 modify.c:1654: [ce5b4460]327/336(16)r4F:Index(1),Pinned,Phase0,WPhase1,Valid,Async,SegRef,C,CI,CN,CNI,IOLock,PhysValid{0,0}[0] leaf(1) inode_handle_orphan2(1) async(1) inode_handle_orphan3(1)
3981 Async ??
3982 modify.c:1657: [cfb90690]327/340(787)r4F:Index(1),Pinned,Phase1,WPhase0,Valid,Async,SegRef,C,CI,CN,CNI,IOLock,PhysValid{0,0}[0] leaf(1) inode_handle_orphan2(1) async(1) inode_handle_orphan3(1)
3983 Still Async ... wonder what it means.
3984
3985 - directory block got corrupted.  Maybe conversion to indexed??
3986
3987
3988 Getting bug in remove_from_index because the addr isn't
3989 there, possibly block is empty.  But incorporation is
3990 ??? instant?  No it isn't.
3991 If an index block hasn't been incorporated it has B_PrimaryRef
3992 set as it holds a ref to some earlier index.
3993 But what if nothing is incorporated?
3994
3995
3996 Allocated [ce402230]328/340(0)r5F:Index(1),Pinned,Phase0,WPhase0,Valid,Dirty,Async,SegRef,CN,CNI,UninCredit,IOLock,PhysValid,Uninc{0,0}[0] inode_handle_orphan2(1) leaf(1) uninc(1) async(1) inode_handle_orphan3(1) -> 0
3997 looping on [ce402230]328/340(0)r5F:Index(1),Pinned,Phase0,Valid,Dirty,Writeback,Async,SegRef,CN,CNI,UninCredit,PhysValid,Uninc{0,0}[0] inode_handle_orphan2(1) cluster(1) uninc(1) async(1) inode_handle_orphan3(1)
3998
3999 Then spin in a soft-lockup in lafs_inode_handle_orphan
4000
4001
4002 -----------
4003  - grow_index_tree needs to do initial incorporation so things can be found.
4004     just like end of do_incorporate_internal.
4005    NO - cannot incorp yet as do not have phys addr.  Don't need to as
4006    lafs_leaf_find explicitly handles this.
4007    For truncate case we don't use the stored address, but ensure all
4008    leaf indexes must be dirty (or gone) so whole tree must be
4009    accessible for walking around.
4010  - do_incorporate_internal needs to set B_PrimaryRef and take the ref
4011  - when we remove a B_PrimaryRef without incorporating it, we need to
4012    drop a ref if the *next* in the list is B_PrimaryRef
4013  - need to use a constant to identify 'async' calls etc.
4014  - maybe I need other iolock_block in truncate ?? to ensure it is Valid so
4015    it isn't found as async....
4016
4017 09Jun2010
4018  STILL struggling with incorporation.
4019  We have a premise that any file address is covered by precisely
4020  one leaf index block.  Every leaf index has an implicit address
4021  and it covers all addresses from there to the next leaf.  The last
4022  leaf covers to EOF.
4023  So there must always be a leaf at address 0.
4024  This applies within the tree from an internal index block too.
4025  Beneath an internal index block there must be a leaf covering every
4026  address up to the next internal index block.  So there must be
4027  a first.  So storing the first address is pointless.  And harmful.
4028  When an index block becomes empty and disappears its coverage is
4029  included in the previous block unless there is none, in which case
4030  the next index block must be re-addressed.  If there is no 'next',
4031  this index block must be empty and so must disappear.
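
  A sketch of that premise, modelling the leaves under one parent as a
  sorted array of start addresses.  The array model and both function
  names are hypothetical; the real code works on index blocks, not
  arrays.

```c
#include <assert.h>
#include <stddef.h>

/* starts[] is sorted ascending, n >= 1 and starts[0] <= addr.
 * Returns the index of the leaf covering addr: the highest start
 * at or below addr.  The last leaf covers to EOF. */
static size_t covering_leaf(const unsigned long *starts, size_t n,
                            unsigned long addr)
{
    size_t lo = 0, hi = n;  /* invariant: starts[lo] <= addr, answer < hi */

    while (hi - lo > 1) {
        size_t mid = lo + (hi - lo) / 2;

        if (starts[mid] <= addr)
            lo = mid;
        else
            hi = mid;
    }
    return lo;
}

/* Remove empty leaf i (i < n) and return the new count.  Its coverage
 * falls to the previous leaf; if there was no previous leaf, the next
 * one is re-addressed to the old first start. */
static size_t remove_leaf(unsigned long *starts, size_t n, size_t i)
{
    unsigned long first = starts[0];

    for (size_t j = i; j + 1 < n; j++)
        starts[j] = starts[j + 1];
    n--;
    if (i == 0 && n > 0)
        starts[0] = first;
    return n;
}
```

  With starts {0, 16, 100}, removing the middle leaf makes address 50
  resolve to leaf 0, and then removing leaf 0 re-addresses the old
  "100" leaf to 0 so that it covers everything.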
4032
4033  BUT if we re-address an index block, we implicitly re-address the
4034  first child - recursively - so we need to move/rehash them all
4035  or lose them... or record where they are.  Or do lookup not by
4036  addr....
4037  I think just rehashing them all - with an iolock - is simple
4038  and safe.  So just do that.
4039
4040
4041  So:  I cleaned up index handling and truncation somewhat.
4042   Now running looptest to see what patterns emerge:
4043
4044   block.c:197 (*9+1) During umount, the Root datablock is
4045         Dirty+Realloc
4046         Maybe just need for cleaner to become inactive
4047         during umount - hope that doesn't deadlock
4048         didn't even work...
4049   block.c:529 (*4+1)  erase dblock while iblock depth > 0
4050         When pruning InoIdx we want to set depth to 0.
4051         FIXME is this really what I want, or is depth=0
4052         only for data-inode ... FIXME
4053   cluster.c:533 (*2) cluster_allocate on invalid block
4054           Block is 8/0 in writepage from sync_inodes
4055           This is the orphan file.
4056                    blocks aren't dirty
4057           I guess the file gets truncated while we wait for it.
4058           Just need to re-test.
4059   index.c:1936 (*2).   An index block is Root - FIXED??
4060   modify.c:1056 - secondary bug, ignore for now.
4061   modify.c:1650 update_index fails to find target.
4062               second call, phys==0
4063               Code was bad ... may not be the cause though.
4064   modify.c:1696 (*4) lafs_incorporate gets non-dirty Index(1) block
4065                    from orphan handler.
4066                 Maybe just change the do/while back to 'do'.
4067   modify.c:1704: (*2) lafs_inc gets leaf with uninc list???
4068                Index(0)/InoIdx
4069                in do_checkpoint
4070                uninc list gets set in lafs_add_block_address (parent of iblk),
4071                 do_incorporate_internal,
4072                Maybe the InoIdx still had children.
4073   segments.c:1028.  (*4) The free list becomes empty.
4074   super.c:655 (*3)   Busy inodes after umount, and root InoIdx block
4075          is still pinned as inode 16 data block was still dirty.
4076          segusage slow.  Maybe same as block.c:197 ??
4077   invalid address 6b6b6bfb: invalidate_inode_buffers in shutdown
4078           finds invalid lock.
4079           presumably the inode was freed before being invalidated.
4080   spin on writeback during truncate (r3a) 8 times. now 10
4081         Probably because writeback cannot proceed while
4082         orphan processing keeps looping.
4083   kmalloc-1024 problems - (*2)
4084           A block - should be start of page - isn't what it appears...
4085
4086  Others complete with 'cb' ranging from 202 to 715
4087
4088
4089 10 June 2010
4090
4091  Looking at segments.c:1028
4092   We run a seg_scan every checkpoint, so that should keep free segments
4093   in the list.....
4094   Ahh.. do_checkpoint is looping because root isn't changing phase.
4095
4096   Lowest block pinned to old phase is
4097   [cfb7df08]0/74(4253)r0E:Pinned,Phase1,WPhase0,Valid,Dirty,CN,CNI,UninCredit,IOLock,Claimed,PhysValid
4098   which is not on leaf list because it has IOLock
4099   With more debugging:
4100   [ce5c5f08]0/74(4250)r0E:Pinned,Phase1,WPhase0,Valid,Dirty,Realloc,SegRef,CN,CNI,UninCredit,IOLock,Claimed,PhysValid</io.c:368>
4101   or better (that was in lafs_iolock_written)
4102   [ce5c05e8]0/74(4257)r0E:Pinned,Phase0,WPhase0,Valid,Realloc,SegRef,C,CN,CNI,UninCredit,IOLock,Claimed,PhysValid</file.c:247>
4103   FIXED - I didn't unlock if it wasn't dirty any more.
4104   Well almost - it occurs much less now.
4105   Out of 48 runs:
4106       8 BUG: soft lockup - CPU#0 stuck for 61s! [lafs_cleaner:1180]
4107       1 BUG: unable to handle kernel NULL pointer dereference at 000001b4
4108       2 BUG: unable to handle kernel paging request at 6b6b6bfb
4109       3 kernel BUG at /home/neilb/work/nfsbrick/fs/module/block.c:197!
4110       2 kernel BUG at /home/neilb/work/nfsbrick/fs/module/block.c:529!
4111       1 kernel BUG at /home/neilb/work/nfsbrick/fs/module/modify.c:1650!
4112       2 kernel BUG at /home/neilb/work/nfsbrick/fs/module/modify.c:1696!
4113       4 kernel BUG at /home/neilb/work/nfsbrick/fs/module/segments.c:1028!
4114       1 kernel BUG at /home/neilb/work/nfsbrick/fs/module/segments.c:332!
4115
4116   So we now have 1/12 rather than 2/3.
4117   a/ pinned by IOLock from file.c:220 - FIXED
4118   b/ as above
4119   c/  Root is pinned by 4 children
4120       328/0  with 196 of data blocks in writeback/realloc, in a cluster
4121       0/1, 74/0, 0/8   all in a cluster waiting writeout.
4122      Don't understand this.
4123   d/ as a,b
4124
4125   Of the 48, 11 ran to completion leaving blocks from 286 to 899
4126
4127
4128   Looking at the loss of blocks when truncating.
4129    tracing shows a small number of files with remaining blocks at delete.
4130      sum is 26+22+14+272+11+2 == 347 cf df shows cb=457
4131    next attempt: 14+24+26*11 =324 cf cb=1124
4132    next attempt 26+6+15+68+29 == 144 cf cb=383
4133    26+18+14+19+284 = 361 cf 379
4134     files are (in order)
4135    49    bfile       - 30K
4136    325   nbfile-49   - 30K
4137    320   nbfile-44   - 30K
4138    296   nbfile-20   - 30K
4139       ??331??
4140
4141 11 June 2010
4142
4143  Thinking about truncate and index blocks becoming empty while
4144  they still have children.
4145  For leaf indexes, we need to leave the block in place in case
4146  the children get written.  We need to find a time to ultimately
4147  delete it...
4148  For internal indexes,.... uhm, it just works, OK??
4149
4150  When I drop an uninc block, I need to remove it from the
4151   uninc list, and from phase_leafs
4152   clearing dirty and refiling should remove from leafs.
4153
4154  When we recurse to a parent, we need to remove
4155  *this* block from the uninc list for said parent.
4156  It should be the only thing in the list.
4157  But even when we don't recurse, the fact that we have
4158  incorporated means that we should tidy up the ->uninc
4159  list.
4160
4161
4162
4163 12 June 2010
4164   unmount hung after lafs_run_orphans from lafs_put_super
4165   There are two orphans in Writeback which cannot progress
4166   until the current cluster is written...
4167   But they keep getting re-written!
4168   Other time, one orphan, index block is Dirty on a leaf ???
4169
4170 orph=[cfbdcf24]0/331(3780)r2E:Valid,SegRef,C,CI,CN,CNI,Claimed,PhysValid,Orphan(0) orphan_list(1) iblock(1)
4171 [cfb8e460]331/0(NoPhysAddr)r1F:Index(1),Pinned,Phase0,InoIdx,Valid,Dirty,SegRef,CN,CNI,UninCredit{0,0}[0] leaf(1) Leaf0(1)
4172 LAFS_cluster_flush 1
4173
4174
4175 orph=[ce5c9bb4]0/327(3317)r2E:Valid,SegRef,C,CI,CN,CNI,Claimed,PhysValid,Orphan(0) iblock(1) orphan_list(1)
4176 [cfbe3a40]327/0(NoPhysAddr)r1F:Index(1),Pinned,Phase0,InoIdx,Valid,Dirty,SegRef,CN,CNI,UninCredit{0,0}[0] leaf(1) Leaf0(0)
4177
4178  OK, problem is that when we truncate and remove an index block, the
4179  next index block expands backwards to fill the space.
4180  Then we apply prune_some, but don't check if anything was done.
4181  We always mark it dirty, so it has to be written and then
4182  we loop through again...
4183  So need to check if prune_some did anything.
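A small sketch of that fix, assuming a prune step that reports how much it
removed.  prune_some_model and truncate_pass are hypothetical stand-ins and
do not reflect the real prune_some signature; the point is only that the
block is dirtied when, and only when, pruning changed it.

```c
#include <assert.h>
#include <stdbool.h>

struct idx_model {
	int entries;	/* addresses still held by this index block */
	bool dirty;
};

/* Hypothetical stand-in for prune_some: drops up to 'max' entries and
 * reports how many it actually removed. */
static int prune_some_model(struct idx_model *ib, int max)
{
	int n;

	assert(max > 0);
	n = ib->entries < max ? ib->entries : max;
	ib->entries -= n;
	return n;
}

/* One orphan-handling pass.  Returns false when nothing was pruned, so
 * the caller can stop instead of rewriting an unchanged block. */
static bool truncate_pass(struct idx_model *ib, int max)
{
	int done = prune_some_model(ib, max);

	if (done > 0)
		ib->dirty = true;
	return done > 0;
}
```

Without the 'done > 0' test the block would be marked dirty on every pass,
written out, and found again as an orphan, which is the loop seen at unmount.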
4184
4185 TODO:
4186  - prune_some needs to get more done at a time
4187  - let cleaner finish up before umount
4188  - use early segments first ??
4189  - look at write-clusters and check OK
4190  - check that df:cb= drops properly.
4191
4192 Bugs:
4193       1 BUG: spinlock lockup on CPU#0, sh/1168, c0441170  - SECONDARY BUG
4194       1 BUG: unable to handle kernel NULL pointer dereference at 000001b4
4195       3 BUG: unable to handle kernel paging request at 00100104
4196       5 BUG: unable to handle kernel paging request at 6b6b6bfb
4197       1 BUG: unable to handle kernel paging request at 7fffffff
4198       7 kernel BUG at /home/neilb/work/nfsbrick/fs/module/block.c:197!
4199       9 kernel BUG at /home/neilb/work/nfsbrick/fs/module/block.c:479!
4200       2 kernel BUG at /home/neilb/work/nfsbrick/fs/module/block.c:529!
4201       2 kernel BUG at /home/neilb/work/nfsbrick/fs/module/block.c:67!
4202       2 kernel BUG at /home/neilb/work/nfsbrick/fs/module/inode.c:828!
4203       4 kernel BUG at /home/neilb/work/nfsbrick/fs/module/inode.c:843!
4204       1 kernel BUG at /home/neilb/work/nfsbrick/fs/module/modify.c:1708!
4205       7 kernel BUG at /home/neilb/work/nfsbrick/fs/module/segments.c:1028!
4206       2 kernel BUG at /home/neilb/work/nfsbrick/fs/module/segments.c:332!
4207      30 kernel BUG at /home/neilb/work/nfsbrick/fs/module/super.c:655!
4208
4209 Quite a haul there!
4210
4211 super.c:655
4212     Pinned block in lafs_release:
4213          0/2 is Dirty with plenty of credits, so it is a child
4214          0/16 is Dirty/Realloc, or once Async
4215      Dirty, but not on a leaf list, not pinned
4216
4217 segments.c:332
4218     seg_deref with refcnt , 2 in lafs_seg_put_all
4219
4220 segments.c:1028
4221      No free segments - no real pattern.
4222
4223 modify.c:1708
4224      lafs_incorporate on non-dirty/realloc block
4225        328/0 Index(1).  1 in uninc_table - probably during truncate.
4226      Either we add uninc while not dirty
4227      Or we clear Dirty while uninc present
4228      or there is a race between the two.
4229
4230      Don't know:  add a bugon
4231      Bugon in get_flushable didn't fire.
4232
4233 inode.c:843
4234      children present in truncate after final incorp...
4235        328/0.  64 children, no uninc list.  Maybe we ran the orphans too early??
4236       or invalidate_page isn't removing the children.
4237       Might want print_tree here?- added that.
4238      Answer: all the children are in Realloc on Clean_leafs
4239        Maybe erase_page needs to disconnect from cleaner too??
4240
4241 inode.c:828
4242      Orphan handling - uninc but not dirty: is Realloc (sometimes)
4243      Maybe like  mod:1708
4244
4245 block.c:67 *
4246       delref 'primary' from modify.c:2063 in the q2 branch.
4247       nxt has PrimaryRef... Maybe  move earlier, but that shouldn't make a diff.
4248       ditto at modify.c:2035  nxt is primary as was I, so drop mine.
4249       Don't know - looks like sibling list got broken.
4250       Tidied up a bit and added a print-tree.
4251       v.interesting result.  Lots of consecutive index blocks all holding primary-ref
4252             on single primary - which is wrong.
4253       1/ When setting PrimaryRef, if next holds PrimaryRef, then must take reference
4254             on self, as are being inserted into chain
4255       2/ When splitting, new block must be addressed as first block which cannot
4256            fit, not first block which doesn't fit.  Else incorping in reverse order
4257            can make lots of tiny index blocks.
4258
4259 block.c:529 *
4260         erase with index depth > 1.
4261         0/328 in orphan handling.  Still have 8 or 15 blocks registered!
4262        Maybe caused by index block errors.  Added some printks.
4263
4264 block.c:479 *
4265         not enough credits to dirty block 2/0 in dir_delete_commit for unlink.
4266         74/xxxx in unlink
4267         16/1 in seg_inc/seg_move...allocated_block/cluster_flush
4268
4269         - writepage wrote the page??
4270         - checkpoint wrote it and didn't replenish the credits?
4271
4272 block.c:197 XX
4273         invalidated pages finds dirty block after EOF, after iolock_written
4274          0/0 Dirty/Realloc in unmount - all Realloc!
4275        Need to wait for cleaner etc to finish at unmount time.
4276
4277 NULL deref in 1b4  YY
4278     cleaner->cluster_flush->count_credits->lock??
4279     Trying to get a lock on an inode that has since been freed??
4280         spin_lock(&dblk(b)->my_inode->i_data.private_lock);
4281
4282
4283 001001 YY
4284      generic_drop_inode -- extra iput??  in lafs_inode_checkpin from refile
4285 6b6b6b YY
4286       invalidate_inode_buffers!! in kill.  use-after-free
4287
4288 7fffff
4289     seginsert from scan_seg
4290      MAX/number-elements confusion.  Worked around for now.
4291
4292
4293 18  June 2010
4294 After a couple of fixes:
4295       1 BUG: unable to handle kernel NULL pointer dereference at 000001b4
4296       1 BUG: unable to handle kernel paging request at 00100104
4297       5 BUG: unable to handle kernel paging request at 6b6b6bfb
4298       4 kernel BUG at /home/neilb/work/nfsbrick/fs/module/block.c:209!
4299       4 kernel BUG at /home/neilb/work/nfsbrick/fs/module/block.c:496!
4300       3 kernel BUG at /home/neilb/work/nfsbrick/fs/module/block.c:67!
4301       2 kernel BUG at /home/neilb/work/nfsbrick/fs/module/cluster.c:531!
4302      16 kernel BUG at /home/neilb/work/nfsbrick/fs/module/inode.c:601!
4303       1 kernel BUG at /home/neilb/work/nfsbrick/fs/module/inode.c:852!
4304                 Realloc blocks confusing truncate
4305       2 kernel BUG at /home/neilb/work/nfsbrick/fs/module/modify.c:118!
4306       1 kernel BUG at /home/neilb/work/nfsbrick/fs/module/modify.c:1699!
4307       7 kernel BUG at /home/neilb/work/nfsbrick/fs/module/segments.c:1033!
4308      19 kernel BUG at /home/neilb/work/nfsbrick/fs/module/super.c:655!
4309
4310
4311 TODO:
4312  - truncate gets confused by blocks being cleaned.
4313    Need to flush cleaner, or just remove the blocks.
4314  - when add PrimaryRef in middle of list, take the right ref.
4315  - fix up wait-for-cleaner at unmount time.
4316
4317 19 Jun 2010
4318
4319       3 BUG: unable to handle kernel paging request at 6b6b6bfb
4320       5 kernel BUG at /home/neilb/work/nfsbrick/fs/module/block.c:209!
4321       5 kernel BUG at /home/neilb/work/nfsbrick/fs/module/index.c:1890!
4322      22 kernel BUG at /home/neilb/work/nfsbrick/fs/module/inode.c:601!
4323       2 kernel BUG at /home/neilb/work/nfsbrick/fs/module/inode.c:835!
4324       3 kernel BUG at /home/neilb/work/nfsbrick/fs/module/inode.c:852!
4325       9 kernel BUG at /home/neilb/work/nfsbrick/fs/module/segments.c:1033!
4326      17 kernel BUG at /home/neilb/work/nfsbrick/fs/module/super.c:656!
4327     251 SysRq : Resetting
4328       3 SysRq : Show State
4329
4330  - We can erase a dblock while it is in the uninc_pending or
4331    uninc_next - need to be careful
4332  - At umount, 0/2 is Dirty but not Pinned, so not written out
4333    ditto from 0/16
4334    16/0 sometimes is Async
4335       16/0 Async might be from the segment scan - so wait for that.
4336    Dirty but not pinned can happen when InoIdx is pinned.
4337
4338  - I think the uninc_next list (At least) should be sorted before
4339     being allocated.
4340
4341  - root block dirty/realloc/leaf in final iput
4342    Could be it was changed during last checkpoint so
4343    pushed in to next phase?  But why Realloc?
4344    Maybe still issue with losing inode data block.
4345
4346 20 June 2010 Happy Birthday Dad!!
4347
4348 420 runs.
4349       4 BUG: unable to handle kernel paging request at 6b6b6bfb
4350      26 kernel BUG at /home/neilb/work/nfsbrick/fs/module/block.c:209!
4351      87 kernel BUG at /home/neilb/work/nfsbrick/fs/module/inode.c:601!
4352       1 kernel BUG at /home/neilb/work/nfsbrick/fs/module/inode.c:839!
4353       4 kernel BUG at /home/neilb/work/nfsbrick/fs/module/inode.c:856!
4354       1 kernel BUG at /home/neilb/work/nfsbrick/fs/module/modify.c:1719!
4355      12 kernel BUG at /home/neilb/work/nfsbrick/fs/module/segments.c:1033!
4356       2 kernel BUG at /home/neilb/work/nfsbrick/fs/module/super.c:656!
4357
4358  Problems:
4359   - inode in i_sb_list has been freed.
4360   - block 0/0 is dirty/realloc/leaf after final iput
4361   - not all blocks freed by truncate
4362   - Index block with uninc is not dirty - not FIXED: more iolock in phase_flip
4363   - still children when truncate should have finished.
4364     all are Realloc
4365         Maybe inode has become unhashed and we re-load it??
4366         it is invalid after all!!
4367   - Index block not dirty when incorp - has uninc. ??
4368   - didn't wait for free segments
4369   - Data 16/0 is dirty but not pinned after final checkpoint - FIXED
4370
4371
4372 watch -d 'awk -f checkseg /tmp/log; echo ====== ; grep -h -E "(blocked for more|BUG|SysRq : )" /var/tmp/lafs-logs/log* | sort | uniq -c ; echo === ; ls /var/tmp/lafs-logs/log* | wc -l'
4373 watch -d 'echo ====== ; grep -h -E "(blocked for more|BUG|Busy inodes after|SysRq : )" /var/tmp/lafs-logs/log* | sort | uniq -c ; echo === ; ls /var/tmp/lafs-logs/log* | wc -l'
4374
4375
4376  Unclear on dirtying index blocks.
4377    We normally mark it dirty first, then add the address to the uninc list.
4378    Note that this is the reverse of data blocks which are changed first, then
4379    dirtied.  So maybe we should mark dirty afterwards.  We then need to
4380    avoid incorporation while we are adding addresses else we might find it
4381    has addresses but is not dirty.  Only try if dirty?
4382    Maybe we should iolock the parent.  We need to do that anyway to flush
4383    incorporations when the table is full.   Yes, that fits the VM model
4384    better.  Always lock while updating and preparing to write.  Set
4385    writeback once write has started, then unlock.  Cool.
4386    But a block is already iolocked when we allocate (to 0), so we cannot lock the parent..
4387
4388 21June2010
4389   Apart from tracking down the remaining bugs, I need to:
4390   1/ Decide on locking for incorporation and attaching new address to a block
4391     and implement it.
4392     In particular we need to not lose the Dirty flag before the update is done.
4393   2/ Resolve handling of pinned inode data/index blocks
4394   3/ Correct handling of empty index blocks, particularly when parent is in
4395     different phase.  Make lookup be more careful?
4396   4/ Wait for there to be enough free segments before allowing allocation.
4397
4398   2:  Problem is that we cannot handle a pinned inode-data block while the
4399      InoIdx block is pinned in the same phase.
4400      We currently unpin it so it drops off the leaf list.  But then we
4401      need to re-pin it when the InoIdx is unpinned or phase-flipped, and that
4402      gets ugly.  Possible though.
4403      An alternate is to treat it like a parent and keep it off the list
4404      while the InoIdx is pinned/same-phase.  So we would need to
4405      re-assess it after unpinning or flipping the InoIdx.  That is probably
4406      a lot easier than re-pinning it.
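The alternative in point 2 can be written as a single predicate.  This is a
sketch under assumptions: blk_model and dblock_on_leaf_list are hypothetical
names, and the real leaf-list decision depends on more flags than the two
modelled here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct blk_model {
	bool pinned;
	int phase;
};

/*
 * Should a pinned inode data block sit on the leaf list?  Treat the
 * InoIdx like a pinned child: while it is pinned in the same phase the
 * data block stays off the list, and it is re-assessed after the InoIdx
 * is unpinned or phase-flipped.
 */
static bool dblock_on_leaf_list(const struct blk_model *dblk,
				const struct blk_model *inoidx)
{
	assert(dblk != NULL);
	if (!dblk->pinned)
		return false;
	if (inoidx && inoidx->pinned && inoidx->phase == dblk->phase)
		return false;
	return true;
}
```

The data block keeps its pin throughout, so nothing has to be re-pinned later;
only this test has to be run again when the InoIdx changes state.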
4407
4408   1: We would normally set 'dirty' after changing the block.  But we need
4409      to differentiate Dirty from Realloc, so we set before adding addresses.
4410      This requires that we are careful not to write an index block while there
4411      are pending changes.  The fact that pinned children stop any writing,
4412      as do pending addresses in a list should ensure this.
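A sketch of the ordering in point 1, using a hypothetical index_model with
just the three facts the argument relies on.  None of these helpers exist in
the module under these names.

```c
#include <assert.h>
#include <stdbool.h>

struct index_model {
	bool dirty;
	unsigned int pinned_children;
	unsigned int uninc;	/* addresses queued, not yet incorporated */
};

/* Dirty is set first so the block is known to be Dirty rather than
 * Realloc before any address is attached. */
static void add_address(struct index_model *ib)
{
	ib->dirty = true;
	ib->uninc++;
}

/* Incorporation must never find pending addresses on a clean block. */
static void incorporate(struct index_model *ib)
{
	assert(ib->uninc == 0 || ib->dirty);
	ib->uninc = 0;
}

/* Pinned children and pending addresses both hold off writing. */
static bool may_write(const struct index_model *ib)
{
	return ib->dirty && ib->pinned_children == 0 && ib->uninc == 0;
}
```

The assert in incorporate() corresponds to the "Index block with uninc is not
dirty" failures listed earlier: they mean Dirty was lost between add_address
and incorporation.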
4413
4414   3: When an index block becomes empty we need to make sure that
4415      future lookup doesn't get confused by it.  Specifically future
4416      index lookup must avoid the block so nothing new gets added.
4417      Possibly a previous block will split again, but this block must remain
4418      unused.
4419      However we cannot update the parent block immediately as it might
4420      be in a different phase.
4421      So we must record both "don't touch this" and "where to look instead"
4422      elsewhere - in children.
4423      If the block being deleted is *not* the first child in the parent,
4424      then we direct index lookup to the earlier block.
4425      If the block being deleted *is* the first child in the parent,
4426      then redirect to the second child if there is one and we weren't just there.
4427      If there is no other block we flag the parent as empty and retry
4428      from the top.
4429      We flag a parent as empty with B_EmptyIndex.
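The three redirect rules can be sketched as one decision function.  The enum,
the function name and the 'came_from' argument are hypothetical; in
particular, reading "we weren't just there" as "lookup did not just arrive
from child 1" is an assumption.

```c
#include <assert.h>

enum redirect { GO_PREV, GO_NEXT, PARENT_EMPTY };

/*
 * idx:        position of the empty block among the parent's children
 * nchildren:  number of children the parent currently addresses
 * came_from:  position lookup was last redirected from, or -1
 */
static enum redirect redirect_from_empty(int idx, int nchildren, int came_from)
{
	assert(idx >= 0 && idx < nchildren);

	if (idx > 0)
		return GO_PREV;		/* earlier block owns the space */
	if (nchildren > 1 && came_from != 1)
		return GO_NEXT;		/* try the second child */
	return PARENT_EMPTY;		/* flag parent, retry from the top */
}
```

PARENT_EMPTY is the case where the caller would set B_EmptyIndex on the parent
and restart the lookup.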
4430
4431      What locks do we need to walk around the sibling list?
4432      the inode private_lock is minimal, but we cannot hold that to take a
4433      iolock - just to get a reference.
4434      I guess we
4435         - iolock the parent
4436         - try to find a good block using private_lock
4437         - get a ref and wait for it.
4438         - check if it is still a good block.  If not, start again
4439
4440      If we find an EmptyIndex block, it must be directly addressed by parent.
4441      It will never be followed by a PrimaryRef block because if there were
4442      such a block, we would have readdressed it back and hidden the EmptyIndex.
4443      So we need to look around for an address in the parent that leads to
4444      a non-EmptyIndex block.
4445
4446      If all children are empty, we need to make the parent empty.  But
4447      what if it is InoIdx?
4448      Maybe I am making this too hard.  I could just use i_alloc_sem to
4449      block lookups while truncate is happening.  That doesn't address
4450      single block removal e.g. from directories.
4451      So I need to be able to wait for incorporation to happen on an
4452      empty index block.  We hold iolock on the parent.  If there are blocks
4453      on ->uninc, we just process them immediately.  If there are blocks on
4454      ->uninc_next, we wait for the checkpoint to complete
4455
4456      What does lafs_incorporate actually do with EmptyIndex blocks?
4457      Provided they match currently incorporated addresses, they just cause
4458      those addresses to disappear.
4459
4460      If a block is in the uninc list for its parent, then is phase_flipped
4461      and changed and written out it could get a new physaddr before
4462      it is incorporated.
4463      I guess we never allocate a B_Uninc block which is in a different phase
4464      to the parent.  Currently we wouldn't do that anyway except in truncate
4465      though memory pressure on index blocks might one day??
4466      Truncate?  We cannot allocate directly in lafs_incorporate.
4467      We should get lafs_cluster_allocate to notice and DTRT.
4468
4469      Only hash index blocks when they are incorporated.  Not needed before then.
4470      When processing an uninc list, if an address appears twice, prefer the one
4471      that isn't EmptyIndex...
4472
4473 22June2010
4474     I need a clear picture of the "Steady state" for an internal index block
4475     with its children.
4476     The internal index block contains 1 or more addresses.  For each address there
4477     may be a child index block.  If there is, it may be the head of a list of
4478     blocks with B_PrimaryRef set thus holding the whole list in place until
4479     incorporation happens.
4480     Each of these children can be on either ->uninc_list or ->uninc_next,
4481     or possibly neither if they haven't been queued for writing yet.  Any
4482     PrimaryRef block will be Pinned.
4483
4484     When a child is incorporated and found to be Empty it is flagged as such
4485     and then must never be returned by index lookup.  Index lookup will either
4486     add a block to a leaf index so it doesn't appear empty, or will hit an EmptyIndex
4487     block and so have to start again from the top.
4488
4489     When a PrimaryRef block becomes empty it is simply removed from the
4490     PrimaryRef chain so it cannot be found.  The space now belongs to the
4491     previous block.
4492     When a non-PrimaryRef block which isn't the first becomes empty it is
4493     flagged and left in place so that following blocks can be found.  The
4494     address space now belongs to the previous block.
4495     When the first child (fileaddr matches parent) becomes empty - what?
4496       We could re-address first child but that forces early address change -
4497           old might not be incorp yet
4498       We could re-address the parent, but that doesn't work for InoIdx
4499       We could leave it there with physaddr == 0
4500
4501     Last sounds promising.  So we never re-address an index block.
4502
4503    So: From the top.
4504
4505     Index blocks, Indirect blocks, extent blocks each have an address
4506     that never changes.
4507     When a block becomes over-full it splits - a new block appears with
4508     a new address thus implicitly limiting the address space covered
4509     by the original.
4510
4511     When an index block becomes empty and has no pinned children it is
4512     marked as EmptyIndex (under IOLock).
4513     When an EmptyIndex is allocated it goes to phys==0
4514     An EmptyIndex which is not first (->fileaddr != ->parent->fileaddr)
4515     is never used again.  Its address space is ceded to the previous
4516     index block - which could split several times...
4517     An EmptyIndex which is first can be re-used.  Once it gets pinned
4518     children the EmptyIndex is cleared.
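The EmptyIndex rules above, written as a tiny state model.  ib_model and the
ib_* helpers are hypothetical; only the fileaddr comparison for "first", the
phys==0 allocation and the clear-on-pin behaviour come from these notes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct ib_model {
	uint32_t fileaddr;
	uint32_t parent_fileaddr;
	unsigned int entries;
	unsigned int pinned_children;
	bool empty_index;
	uint64_t phys;
};

static bool ib_is_first(const struct ib_model *ib)
{
	return ib->fileaddr == ib->parent_fileaddr;
}

/* Would run under IOLock: empty and no pinned children => EmptyIndex. */
static void ib_check_empty(struct ib_model *ib)
{
	if (ib->entries == 0 && ib->pinned_children == 0)
		ib->empty_index = true;
}

/* A non-first EmptyIndex is never used again, so lookup must skip it. */
static bool ib_usable_for_lookup(const struct ib_model *ib)
{
	return !ib->empty_index || ib_is_first(ib);
}

/* Gaining a pinned child brings a first EmptyIndex back into use. */
static void ib_pin_child(struct ib_model *ib)
{
	assert(ib_usable_for_lookup(ib));
	ib->pinned_children++;
	ib->empty_index = false;
}

/* An EmptyIndex is "written" to phys 0; anything else gets a real address. */
static void ib_allocate(struct ib_model *ib, uint64_t new_phys)
{
	ib->phys = ib->empty_index ? 0 : new_phys;
}
```

Note that nothing here ever changes fileaddr, which is the point of the
redesign: an index block is never re-addressed.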
4519
4520     An Index block always has an entry for the first address.  It might
4521     be implicit to phys==0.  Loading such a block creates an empty
4522     block.
4523
4524     InoIdx doesn't get EmptyIndex, rather it gets ->depth=1
4525
4526     Indirect *doesn't* store the first address any more.
4527
4528     Changes:
4529 DONE     - remove forcestart from layoutinfo
4530 DONE     - remove start-address from Indirect blocks
4531 DONE     - only hash index blocks when they are known to be incorporated.
4532 DONE     - when incorporating an uninc list, ignore phys==0 if also a block with
4533        same fileaddr and phys!=0.  so sort phys==0 first
4534 DONE     - Create EmptyIndex flag
4535 DONE     - Clear the flag when adding child pin to index block
4536 DONE     - avoid EmptyIndex non-start blocks during index lookup
4537 DONE     - allow index blocks to be loaded with ->phys==0
4538 DONE     - allow EmptyIndex index block to be "written" to phys 0
4539 DONE     - ensure index lookup finds implicit start address, possibly 0
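For the uninc-list item in the change list above ("ignore phys==0 if also a
block with same fileaddr and phys!=0, so sort phys==0 first"), a minimal
sketch using qsort.  The struct and function names are hypothetical and the
real uninc table is not a flat array like this.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

struct uninc_addr {
	uint32_t fileaddr;
	uint32_t phys;
};

/* Order by fileaddr; within one fileaddr, phys==0 sorts first. */
static int cmp_uninc(const void *a, const void *b)
{
	const struct uninc_addr *x = a, *y = b;

	if (x->fileaddr != y->fileaddr)
		return x->fileaddr < y->fileaddr ? -1 : 1;
	if ((x->phys == 0) != (y->phys == 0))
		return x->phys == 0 ? -1 : 1;
	return 0;
}

/* Sort, then drop each phys==0 entry that shares its fileaddr with a
 * phys!=0 entry.  Returns the new length. */
static size_t uninc_dedup(struct uninc_addr *v, size_t n)
{
	size_t out = 0;

	qsort(v, n, sizeof(*v), cmp_uninc);
	for (size_t i = 0; i < n; i++) {
		size_t j = i + 1;

		/* skip over further phys==0 entries for the same fileaddr */
		while (j < n && v[j].fileaddr == v[i].fileaddr && v[j].phys == 0)
			j++;
		if (v[i].phys == 0 && j < n && v[j].fileaddr == v[i].fileaddr)
			continue;	/* a real address follows: ignore this one */
		v[out++] = v[i];
	}
	return out;
}
```

Because the zeros sort first, the "is there a real address for this fileaddr"
test only ever has to look forward.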
4540
4541 So now after 36 runs
4542       3 kernel BUG at /home/neilb/work/nfsbrick/fs/module/index.c:1939!
4543       1 kernel BUG at /home/neilb/work/nfsbrick/fs/module/index.c:403!
4544      10 kernel BUG at /home/neilb/work/nfsbrick/fs/module/inode.c:605!
4545      14 kernel BUG at /home/neilb/work/nfsbrick/fs/module/segments.c:1034!
4546       4 kernel BUG at /home/neilb/work/nfsbrick/fs/module/segments.c:624!
4547       1 kernel BUG at /home/neilb/work/nfsbrick/fs/module/super.c:657!
4548       3 SysRq : Resetting
4549
4550
4551 index.c:1939
4552    block 0/2 is Realloc and being allocated from cluster_flush while
4553    parent is not Realloc or dirty
4554    That is bad as Realloc gets set in lafs_allocated_block ... except
4555     that the code was bad.  FIXED.
4556
4557 index.c:403
4558   cleaner is pinning a block (299/25) which is not Realloc,
4559     and phase isn't locked.  We are only meant to pin data blocks
4560     for updates while holding a phase lock.
4561     Ahhh - bad code again. FIXED
4562
4563 inode.c:605
4564    Truncate doesn't clean up properly.
4565     327 has 60+1
4566     331 has 108+1
4567     327 has 34+1
4568     327 has 60+1
4569    No sign of any children.
4570
4571    Very weird.  Sign of incorporation going wrong.
4572      Added more debugging.
4573
4574 Found 4084 4 12 at 890
4575 Added 4084 4 12
4576 Found 4089 4 16 at 878
4577 Added 4089 4 16
4578 Found 4094 2 20 at 866
4579 Added 4094 2 20
4580 Found 2561 2 22 at 854
4581 Added 514 2 22
4582 Found 2564 4 24 at 842
4583 Found 2569 2 28 at 830
4584 Found 0 0 0 at 818
4585
4586 Why are 2564 etc lost?  No sign of alloc-to-0
4587
4588 segments.c:1034
4589    no free segments - need to wait somewhere.
4590
4591 segments.c:624
4592    allocated_blocks has gone over free_blocks!
4593    in lafs_prealloc/reserve_block/free_get/ss_put/new_segment.../checkpoint.
4594    Wanted CleanSpace to reserve the youthblk
4595    Maybe related to not waiting - ignore for now.
4596
4597 super.c:657
4598   block 0/2 was dirty but not pinned.  Should not happen to inodes.
4599   block 0/0 was Pinned because it had a child - as above.
4600
4601   Maybe we don't carry the pin across when we collapse dir
4602   into inode??... looks quite likely
4603
4604
4605 23 June 2010
4606
4607 116 runs.
4608       1 BUG: unable to handle kernel paging request at 6b6b6bfb
4609       1 kernel BUG at /home/neilb/work/nfsbrick/fs/module/block.c:497!
4610       3 kernel BUG at /home/neilb/work/nfsbrick/fs/module/dir.c:710!
4611       7 kernel BUG at /home/neilb/work/nfsbrick/fs/module/inode.c:606!
4612      61 kernel BUG at /home/neilb/work/nfsbrick/fs/module/segments.c:1034!
4613       1 kernel BUG at /home/neilb/work/nfsbrick/fs/module/super.c:657!
4614      42 SysRq : Resetting
4615
4616
4617 6b6b6bfb:
4618   invalidate_inode_buffers called at shutdown.
4619   Still weird
4620
4621 block.c:497  FIXED??
4622   block 16/1 is not dirty with no credits.
4623   Maybe writepage got to it?
4624
4625 dir.c:710
4626   ouch! dir lookup failed in unlink.
4627    No real hints.  Must be hash based - some off-by-one probably.
4628    Need to stare at the code.
4629
4630 inode.c:606  FIXED
4631   Blocks still present after truncate.
4632   typically about 60, but in 1 case '4'.  No index blocks.
4633   So probably content of second index block.
4634   Yes, lafs_leaf_next was doing the wrong thing for addresses
4635    before start of block.
4636
4637 segments.c:1034
4638   same old
4639
4640 super.c:657  FIXED
4641   dir inode 0/2 is still Dirty but not pinned.
4642   Maybe lafs_dirty_inode should be pinning the block
4643
4644   But now this triggers for 16/X still dirty.
4645
4646
4647 How and when to write blocks in a SegmentMap file?
4648  - We don't want normal write-back to write them unless they have
4649    no references
4650  - We need to write them in tail of checkpoint, and index info must
4651    follow in the next checkpoint.
4652
4653 lafs_space_alloc is called from
4654   - mark_cleaning:  always CleanSpace, failure is OK
4655   - lafs_cluster_update_pin: ReleaseSpace.  -EAGAIN is OK (CHECK THIS) but failure
4656               is not - or shouldn't be.
4657   - lafs_allocated_block: CleanSpace, checking if parent of Realloc block
4658         can be saved separately from any Dirty version.  Failure OK, blocking not.
4659   - lafs_prealloc - general space allocation.
4660   -
4661 lafs_cluster_update_pin is called from:
4662   - lafs_create, lafs_link, lafs_unlink, lafs_rmdir, lafs_symlink, lafs_mkdir
4663     lafs_mknod, lafs_rename,
4664   - lafs_write_inode
4665      So best to return -EAGAIN, and it should be handled adequately.
4666
4667 lafs_prealloc is called from:
4668   - lafs_reserve_block, after modifying the alloc_type extensively.
4669   - lafs_phase_flip to re-fill the 'next' credits.  If they aren't available
4670       we simply pin all children so they aren't needed.
4671       So failure is OK
4672   - lafs_seg_ref_block: getting CleanSpace to save segusage blocks.
4673        If this fails .. what?? lafs_reserve_block fails. so...
4674
4675 lafs_reserve_block is called from
4676   - mark_cleaning - CleanSpace
4677   - lafs_pin_dblock - type is passed in...
4678   - lafs_prepare_write - on failure write will fail or retry after checkpoint
4679   - lafs_inode_handle_orphan - to help with delete. On failure we allow
4680          cleaning to happen
4681   - lafs_seg_move - should be elsewhere.  Failure BAD !
4682   - lafs_free_get - as above, failure BAD
4683   - clean_free - update youth for new clean blocks - Failure BAD
4684
4685 lafs_pin_dblock is called from
4686   - dir_create_pin - fail or again handled
4687   - dir_delete_pin
4688   - dir_update_pin
4689   - lafs_create etc
4690   - lafs_dir_handle_orphan
4691   - choose_free_inum
4692   - inode_map_new_pin
4693   - lafs_new_inode
4694     ...
4695   - lafs_orphan_release !! cannot handle failure
4696   - roll_block should use AccountSpace
4697
4698 So:  It seems we need a new allocation class that will never fail.
4699   Maybe it is allowed to BUG though?
4700    AccountSpace - i.e. space needed to account for the use of space.
4701      Must never ever fail.
4702
4703 Then we must ask where blocking should happen on -EAGAIN.
4704   dir.c does "lafs_checkpoint_unlock_wait", then tries again.
4705   prepare_write does too.
4706
4707 For that to work we must start a checkpoint on returned EAGAIN.... Don't
4708 we want to wait for some cleaning to happen first though?  Maybe an extra
4709 flag, and a count of the number of empty (but not clean) blocks.
4710
4711 - Should I skip orphan handling when tight on space?  Probably not.  It will
4712   just keep failing while we keep cleaning...
4713 - roll_block should use account_space .. or not
4714
4715 - lafs_space_alloc simply allocates space, or fails.  'why' is used to
4716    guide watermark choice.
4717 - lafs_prealloc allocates space to a block and all its parents based on
4718   'why' for watermarks.  It either succeeds or fails.
4719
4720 - lafs_cluster_update_pin and lafs_reserve_block decide whether to respond
4721   to failure as -ENOSPC or -EAGAIN based on 'why'.
4722
4723 - lafs_pin_dblock simply passes on the failure, which must be handled.
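
A rough userspace sketch of the layering described above, for illustration
only.  The class names NewSpace, CleanSpace and AccountSpace come from these
notes; the watermark values, the struct and every sketch_* name are
assumptions, and so is the mapping of NewSpace to -ENOSPC and CleanSpace to
-EAGAIN (the notes leave that question open).

```c
#include <assert.h>
#include <errno.h>

/* Allocation classes named in the notes; the ordering is an assumption. */
enum sketch_why { SKETCH_NEW_SPACE, SKETCH_CLEAN_SPACE, SKETCH_ACCOUNT_SPACE };

struct sketch_space {
	long free_blocks;	/* blocks that exist to be handed out */
	long allocated;		/* blocks already promised to callers */
};

/* Hypothetical watermarks: blocks that must stay in reserve per class. */
long sketch_watermark(enum sketch_why why)
{
	switch (why) {
	case SKETCH_NEW_SPACE:     return 64;	/* leave room for cleaning */
	case SKETCH_CLEAN_SPACE:   return 8;	/* leave room for accounting */
	case SKETCH_ACCOUNT_SPACE: return 0;	/* may take the last block */
	}
	return 0;
}

/* In the spirit of lafs_space_alloc: allocate or fail, 'why' picks the mark. */
int sketch_space_alloc(struct sketch_space *s, long credits,
		       enum sketch_why why)
{
	if (s->free_blocks - s->allocated - credits < sketch_watermark(why))
		return 0;
	s->allocated += credits;
	return 1;
}

/* In the spirit of lafs_reserve_block: turn failure into an errno by 'why'. */
int sketch_reserve(struct sketch_space *s, long credits, enum sketch_why why)
{
	if (sketch_space_alloc(s, credits, why))
		return 0;
	if (why == SKETCH_NEW_SPACE)
		return -ENOSPC;	/* assumed: nothing more to wait for */
	/* CleanSpace: assumed retry after cleaning and a checkpoint.
	 * AccountSpace should never get here; real code might BUG instead. */
	return -EAGAIN;
}
```

A caller in the position of lafs_pin_dblock would simply pass the return
value up, leaving the retry-or-fail decision to whoever holds the
checkpoint lock.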
4724
4725 So: What to do when we return -EAGAIN?
4726  We need to wait until there are *enough* clean segments, then cause a checkpoint
4727  so they become free.
4728  So a flag that says 'waiting for free space' and a count of segments
4729  required.
4730
4731  But how do we differentiate ENOSPC and EAGAIN for NewSpace requests?
4732  Maybe we don't ??  Or do it later.
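
The flag-plus-count idea might look something like this.  It is a
single-threaded sketch with no locking or wait queues; struct sketch_wait
and all function names are hypothetical and nothing here is taken from the
module.

```c
#include <assert.h>
#include <stdbool.h>

struct sketch_wait {
	bool waiting_for_space;	/* someone got -EAGAIN and is waiting */
	unsigned segs_required;	/* clean segments wanted before checkpoint */
	unsigned segs_clean;	/* segments cleaned but not yet free */
};

/* Called where -EAGAIN is returned: record how much cleaning is wanted. */
void sketch_note_eagain(struct sketch_wait *w, unsigned segs)
{
	w->waiting_for_space = true;
	if (segs > w->segs_required)
		w->segs_required = segs;
}

/* Called by the cleaner each time a segment becomes clean. */
void sketch_segment_cleaned(struct sketch_wait *w)
{
	w->segs_clean++;
}

/* A checkpoint is only worth forcing once enough segments are clean. */
bool sketch_want_checkpoint(const struct sketch_wait *w)
{
	return w->waiting_for_space && w->segs_clean >= w->segs_required;
}

/* The checkpoint turns clean segments into free ones; waiters may retry. */
unsigned sketch_checkpoint_done(struct sketch_wait *w)
{
	unsigned freed = w->segs_clean;

	w->segs_clean = 0;
	w->segs_required = 0;
	w->waiting_for_space = false;
	return freed;
}
```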
4733
4734 Still to do:
4735 - Audit all AccountSpace and justify them
4736  + lafs_seg_move is probably wrong.  Should have allocated when the
4737    free segment was allocated
4738 - lafs_orphan_release calls lafs_pin_dblock but cannot handle failure
4739 - Need to wait not just for "enough space" but for "enough clean segments".
4740
4741 - how is 'free_blocks' set - what does this tell us??
4742
4743    free_blocks is the sum of known-clean segments.
4744    We probably want:
4745          clean segments
4746          remainder for each active segment
4747    then reserve some segments for cleaning.
4748    And separate 'allocated_block' for each ?
4749
4750 Notes:
4751  segments.c:647 fired: AccountSpace had no space available.
4752    Reserving space to write the segusage of youth block for a newly
4753    allocated segment.
4754  super.c:657 STILL
4755     0/2 is Dirty but not Pinned  Maybe we need PinPending
4756  soft lockup
4757     in the cleaner!
4758     Maybe I need cond_resched??
4759
4760 Maybe I want two separate 'free_blocks' counters.
4761  One that includes all free blocks for use in 'df' etc.
4762  One that only includes completely free segments for use in allocation...
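
To make the difference between the two counters concrete, here is a small
sketch.  struct sketch_seg and both function names are made up for the
illustration, and the reservation of segments for cleaning is left out.

```c
#include <assert.h>
#include <stddef.h>

struct sketch_seg {
	unsigned size;	/* blocks in the segment */
	unsigned used;	/* blocks still in use */
};

/* Counter for 'df': every block that is not in use, wherever it is. */
unsigned long sketch_free_for_df(const struct sketch_seg *segs, size_t n)
{
	unsigned long total = 0;
	size_t i;

	for (i = 0; i < n; i++)
		total += segs[i].size - segs[i].used;
	return total;
}

/* Counter for allocation: only blocks in completely free segments. */
unsigned long sketch_free_for_alloc(const struct sketch_seg *segs, size_t n)
{
	unsigned long total = 0;
	size_t i;

	for (i = 0; i < n; i++)
		if (segs[i].used == 0)
			total += segs[i].size;
	return total;
}
```

For example, seven hypothetical 512-block segments with usage
0, 0, 0, 6, 124, 273, 0 give 3181 blocks for 'df' but only 2048 blocks
that allocation could count on.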
4763
4764
4765 24 June 2010
4766
4767  Something is wrong with cleaning and segment tracking
4768  We have 5 free segments and we get them all without writing
4769  anything!  We consume them all with cluster_flush!
4770  It seems that the root inode is not changing phase!
4771  Nothing is on the phase leafs.
4772  Most children are in Writeback on cluster. and are Realloc
4773  Others have pinned children.
4774  They are all in 'cluster', but 'flush' doesn't flush them,
4775  so they must be in a different cluster???  Is the cleaner still
4776  cleaning?  Yes, they are on the cleaner 'wc' list so they are
4777  queued but not flushed for the cleaner.
4778
4779 25 June 2010
4780  At last it looks like I nearly have a working FS. Out of 361 test
4781  runs, 9 triggered BUGS and one hung at umount.
4782
4783  I need a new TODO list, starting with 6 jul 2007(!) and adding any
4784  FIXMEs etc.
4785
4786 DONE 0/ start TODO list
4787 DONE 1/ document new bugs
4788 DONE 2/ Tidy up all recent changes as individual commits.
4789 DONE 3/ clean up the various 'scratch' patches discarding any tracing that
4790     I don't think I need, and making the rest 'dprintk' etc.
4791 DONE 4/ check in this README file
4792 DONE 5/ Write rest of the TODO list
4793
4794 DONE 5a/ index.c:1982. Data block with Phys and no UnincCredit
4795     It is Dirty but only has *N credits.
4796     16/1 ...
4797
4798 DONE 5b/ phase_flip/pin_all_children/lafs_refile finds refcnt == 0;
4799    I guess we should getref/putref.
4800
4801 DONE 5c/ dirty_inode might find InoIdx is allocated but datablock not
4802     and doesn't cope well.
4803
4804 DONE 5d/ At unmount, 16/1 is still pinned.
4805
4806  6/ soft lockup in unlink call.
4807     EIP is at lafs_hash_name+0xa5/0x10f [lafs]
4808  [<d0a56283>] hash_piece+0x18/0x65 [lafs]
4809  [<d0a564c3>] lafs_dir_del_ent+0x4e/0x404 [lafs]
4810  [<d0a56256>] ? lafs_hash_name+0xfa/0x10f [lafs]
4811  [<d0a4b35c>] dir_delete_commit+0xdb/0x187 [lafs]
4812  [<d0a4be3f>] lafs_unlink+0x144/0x1f4 [lafs]
4813  [<c02602c1>] vfs_unlink+0x4e/0x92
4814
4815   Don't know. Looks like cleaning up a chain in dir_delete_commit.
4816   Added a BUG_ON.
4817
4818   Would we be spinning on -EAGAIN ?? 4 empty segments are present.
4819
4820  6a/ index.c:1947 - lafs_add_block_address of index block where parent
4821           has depth of 1.
4822 looping on [cfbd4690]327/336(0)r3F:Index(1),Pinned,Phase0,Valid,SegRef,CI,CN,CNI,UninCredit,PhysValid,PrimaryRef,EmptyIndex,Uninc{0,0}[0] uninc(1) inode_handle_orphan2(1) leaf(1)
4823 /home/neilb/work/nfsbrick/fs/module/index.c:1947: [cfbd5c70]327/0(0)r2F:Index(1),Pinned,Phase0,Valid,Dirty,Writeback,SegRef,CI,CN,CNI,UninCredit,PhysValid,EmptyIndex,Uninc{0,0}[0] inode_handle_orphan2(1) leaf(1)
4824
4825  6b/  check_seg_cnt seems to be spinning on the 3rd section
4826     the clean list has no end!
4827     we were in seg scan
4828 CLEANABLE: 0/0 y=0 u=0 cpy=32773
4829 CLEANABLE: 0/1 y=0 u=0 cpy=32773
4830 CLEANABLE: 0/2 y=0 u=0 cpy=32773
4831 CLEANABLE: 0/3 y=32773 u=6 cpy=32773
4832 CLEANABLE: 0/4 y=32772 u=124 cpy=32773
4833 CLEANABLE: 0/5 y=32771 u=273 cpy=32773
4834 CLEANABLE: 0/6 y=32770 u=0 cpy=32773
4835
4836 of
4837 0 0
4838 1
4839 2
4840 3 6
4841 4 124
4842 5 273
4843 6 0
4844 7 496
4845 8 0
4846
4847
4848  6c/ at shut down, some simple orphans remain
4849     missing wakeup ???
4850
4851 DONE 7/ block.c:624 in lafs_dirty_iblock - no pin, no credits
4852    truncate -> lafs_invalidate_page -> lafs_erase_dblock -> lafs_allocated_block / lafs_dirty_iblock
4853 Allocated [ce44f240]327/144(1499)r2E:Writeback,PhysValid clean2(1) cleaning(1) -> 0
4854 SEGMOVE 1499 0
4855 Oh dear: [ce44f240]327/144(0)r2E:Writeback,PhysValid clean2(1) cleaning(1)
4856 .......: [cfb69180]327/0(349)r2F:Index(1),Pinned,Phase0,Valid,PhysValid{0,0}[0] child(1) leaf(1)
4857 Why have I no credits?
4858 /home/neilb/work/nfsbrick/fs/module/block.c:624: [cfb69180]327/0(349)r2F:Index(1),Pinned,Phase0,Valid,Dirty,PhysValid{0,0}[0] child(1) leaf(1)
4859
4860    Cleaning is racing with truncate, and that cannot happen!!
4861    Actually it could - if i_size changed at the wrong time.
4862
4863 DONE 7a/ block.c:507 in lafs_dirty_dblock - no credits for 0/2
4864    block.c:507: [cfa63c58]0/2(4348)r2F:Valid,Dirty,Writeback,PhysValid cluster(1) iblock(1)
4865    in touch_atime.  I think I know this one.
4866
4867  7b/ soft lockup in cleaner between 0x5e6, then 0x799-7f6 then 0x990 of 0x1502
4868                i.e. 1510, 1945-2038, 2448 of 5378
4869     Appears to be looping in first loop of try_clean, maybe
4870      group_size_words == 0 ??
4871     Add BUG_ON and wait.
4872
4873 DONE 7c/ NULL pointer deref - 000001b4
4874      Could be cluster_flush finds inode dblock without inode.
4875      Have a BUG_ON of this now.
4876
4877 DONE 7d/ paging request at 6b6b6bfb.
4878     invalidate_inode_buffers called, so inode_has_buffers,
4879     so private_list is not empty.  So presumably use-after-free.
4880     But is on s_inodes list.
4881      Probably cleaner is still active (if this is first call to
4882      invalidate_inodes in generic_shutdown_super) so list gets broken.
4883      We need locking or earlier flush.
4884
4885 DONE 7e/ Remove BUG block.c:273 as cleaner can cause this.
4886      Check for Realloc too.
4887
4888 PRESUME-FIXED 7f/ index.c:2024 no uninc credit
4889         [ce532338]0/306(2996)r1F:Pinned,Phase0,Valid,Dirty,Writeback,SegRef,Claimed,PhysValid cluster(1)
4890       found during checkpoint.  Maybe inode credit problem.
4891
4892 PRESUME-FIXED 7g/  inode.c:831 InoIdx 283/0 is Realloc, not dirty, and has
4893       ->uninc blocks.  This is during truncate.  Need some
4894       interlock with cleaner maybe?
4895       Probably the same race between cleaner and truncate.
4896
4897 DONE 7h/ inode.c:845 truncate finds children - Realloc on clean-leafs
4898
4899 NOLONGERRELEVANT 7j/ resolve space allocation issues.
4900     Understand why CleanSpace can be tried and failed 1000
4901     times before there is any change.
4902
4903 DONE  7k/ use B_Async for all async waits, don't depend on B_Orphan to do
4904      a wakeup.
4905      write lafs_iolock_written_async.
4906
4907 DONE 7l/ make sure i_blocks is correct.
4908           set on 'import_inode'
4909           decreased when lafs_summary_update assigned block to '0'
4910           changed when lafs_summary_allocate changes e.g. quota.
4911
4912       lafs_summary_update is called when a block is assigned to a location,
4913         or to zero.  It is real usage.
4914       lafs_summary_allocate is called when we set Prealloc on phys==0 or
4915          clear Prealloc on phys==0
4916       So allocate must be followed exactly.
4917        update is already counted for setting !=0, so only dec on ==0.
4918       So all is good.
4919      What about quota? - hidden in quota_allocate / qcommit
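
The i_blocks rule worked out in 7l can be written down as a tiny model.
This is only a sketch of the counting rule as stated above: the struct,
the function names and the "oldphys != 0" guard are assumptions, and
quota is ignored.

```c
#include <assert.h>

struct sketch_inode {
	long i_blocks;	/* block count as set by import_inode, then adjusted */
};

/* Prealloc set (+1) or cleared (-1) while phys == 0: follow it exactly. */
void sketch_summary_allocate(struct sketch_inode *ino, int delta)
{
	ino->i_blocks += delta;
}

/* A block moves from oldphys to newphys.  Assigning a non-zero address
 * was already counted by the allocate step, so only going to 0 counts. */
void sketch_summary_update(struct sketch_inode *ino,
			   unsigned long oldphys, unsigned long newphys)
{
	if (oldphys != 0 && newphys == 0)
		ino->i_blocks -= 1;
}
```

Walking one block through its life (preallocated, written, relocated by
the cleaner, then truncated) should bring the count back to where it
started.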
4920
4921 7m/ delete inode could not progress through inode_map_free, so
4922    ORPH: [ce545f08]0/290(1663)r4E:Valid,Dirty,Async,UninCredit,Claimed,PhysValid,Orphan(0) async(1) delete_inode(1) orphan_list(1) drop(1)
4923    was permanently an orphan.
4924
4925 DONE 8/ looping in do_checkpoint
4926    root is still in Phase1 because 0/2 is in Phase 1
4927   [cfa57c58]0/2(2078)r1E:Pinned,Phase1,WPhase0,Valid,Dirty,C,CI,CN,CNI,UninCredit,IOLock,PhysValid</file.c:269> writepageflush(1)
4928    Seems to be waiting for writeback, but writeback is clear.
4929      Need to call lafs_io_wake in lafs_iocheck_writeback for when
4930      it is called by lafs_writepage
4931
4932 DONE 9/ cluster.c:478
4933     flush_data_to_inode finds Realloc (not dirty) block
4934     and InoIdx block is not Valid.
4935   [cfb5ef50]2/0(3)r1F:Index(0),Pinned,Phase1,InoIdx,SegRef,C,CI,CN,CNI,IOLock,OnFree,PhysValid{0,1}[0]</cluster.c:435> child(1)
4936   I wonder if it was PinPending, or where it was IOLocked (or if).
4937
4938    I guess we truncated, then added data, then tried to clean.
4939    Probably just a bad 'bug' given recent changes.
4940    No, I think it is the race between truncate and clean which is now fixed.
4941
4942 SEEMS TO BE GONE 10/ inode.c:606
4943     Deleting inode 328: 2+0+0 1+0
4944
4945     2 level index.
4946     first index at level 1 was full and pruned properly.
4947     Nothing else found empty.
4948     Somehow the second index block and contents were lost.
4949
4950 ASSUME_DONE 11/ super.c:657
4951     Root still pinned at unmount.
4952      0/2 is Dirty:  [cfa53c58]0/2(1750)r0E:Valid,Dirty,CN,CNI,UninCredit,PhysValid
4953                     [cfa5fc58]0/2(2852)r0E:Valid,Dirty,SegRef,CN,CNI,UninCredit,PhysValid
4954                     [cfa53c58]0/2(3570)r0E:Valid,Dirty,CN,CNI,UninCredit,PhysValid
4955                     [cfa53828]0/2(2969)r0E:Valid,Dirty,CN,CNI,UninCredit,PhysValid
4956                     [cfa75c58]0/2(579)r0E:Valid,Dirty,UninCredit,PhysValid
4957     maybe dir-orphan handling stuffed up
4958     Or maybe it is the I_Dirty issue.  Assume fixed.
4959
4960
4961 ASSUME_DONE 12/ timeout/showstate in unmount
4962     umount is in sync_inodes / do_writepages / lafs_writepage / lafs_iolock_written
4963     That looks similar to 8
4964
4965 DONE 13/ delete_inode should wait for pending truncate to complete.
4966     Document I_Trunc somewhere - including that i_mutex is needed to set it.
4967     Verify that assertion.
4968     Actually it requires i_alloc_sem, or the inode to be deleted.
4969
4970
4971 DONE 14/ Review writepage and flush and make sure we flush often enough but
4972     not too often.
4973     Probably just remove the cluster_flush from write-page as lafs_flush
4974     will do that.
4975     But leave for now as it encourages heavy indexing.
4976
4977 DONE 14a/ use bio_add_page to write clusters.
4978
4979 DONE 14b/ Figure out what backing_dev to present for the filesystem.
4980
4981 DONE 15/ The inode map file lost some credits.  I think it lost a PinPending because
4982     it isn't locked properly.  Don't clear PinPending if someone else might
4983     have set it.
4984
4985 DONE 15a/ Find all FIXMEs and add them here.
4986
4987
4988 DONE 15b/ Report directory size less confusingly
4989
4990 DONE 15c/ roll-forward should not update index if physaddr hasn't changed (roll_block)
4991
4992 DONE 15d/ What does I_Dirty mean - and implement it.
4993
4994 FIXED 15e/ setattr should queue an update for the inode metadata.
4995      and clean up lafs_write_inode at the same time (it shouldn't do an update).
4996      and confirm when s_dirt should be set.  It causes fsync to run a
4997      checkpoint.
4998
4999 15f/ include timestamp in cluster_head to set mtime/ctime properly on roll-forward?
5000 ## Items from 6 jul 2007.
5001
5002 15g/ test directories with non-random sequential hash.
5003
5004 DONE 15h/ orphan deadlock
5005     lafs_run_orphans- lafs_orphan_release can block waiting for written
5006      in erase_dblock, but that won't complete until cleaner gets to run,
5007      but this is the cleaner blocked on orphans.
5008
5009
5010 DONE 15i/ separate thread management from 'cleaner' name.
5011
5012 15j/ review rules in getref_locked - and document them
5013
5014  - fix accesses to iblock
5015
5016 DONE 15k/ newblocks should probably be a count of segments.  Review that.
5017
5018 DONE 15l/ make sure checkpoint_youth is decayed properly.  Review youth decay.
5019
5020 DONE 15m/ consider combining .orphans and .cleaning lists.  If something is an
5021     orphan, we probably don't want to clean it just now(?).
5022
5023 DONE 15n/ consider if lafs_pin_dblock should check for iolock.  Maybe
5024      iolock or PinPending (which must be set under iolock).
5025      Just require PinPending and always get iolock_written for that
5026      except in special cases.
5027
5028 DONE 15o/ Can there be async blocks when checkpoint starts?  Could they
5029      pin blocks in old phase?  Do I need to check for them?
5030
5031 DONE 15p/ Review and remove the 'if cleaner is active then don't checkpoint just
5032      yet' thing - or somehow avoid the yuckiness.
5033
5034 DONE 15q/ check checksums when reading cluster_header for cleaner
5035        This is already done!
5036
5037 15r/ consider further optimisation in cleaner to avoid lookups.
5038
5039 15s/ memory barrier for i_size check in cleaner???
5040
5041 15t/ review usable-space calculations in clean.
5042
5043 15u/ Do I need a SegRef when pin-dblock-by-hand in flush_data_to_inode
5044
5045 15v/ tidy up all code that fiddles bits and credits - maybe make some
5046      common helpers.
5047
5048 15w/ review cluster updates and make sure space used is accounted properly.
5049
5050 15x/ Consider caching result of a failed dir lookup in case we immediately
5051      try to create it.  Would this actually save anything significant?
5052
5053 15y/ Don't make dir blocks into orphans if it cannot be needed?
5054
5055 15z/ make sure symlink creation is safe - do I need to log the body??
5056
5057 15aa/ lafs_rename should flush orphans just like lafs_rmdir does.
5058
5059 15ab/ Does writepage need to recheck if my_inode and/or iblock have appeared
5060      after lock is taken on block?
5061
5062 15ac/ if lafs_shrinker cannot reclaim enough index blocks, trigger some
5063       writeout.
5064
5065 15ad/ review lafs_phase_flip's call to lafs_add_block_address and wonder
5066         if more is needed.
5067
5068 15ae/ refile wonders about a race with cluster_allocate which gets IOLock
5069     before removing from lru.
5070
5071 15af/ Review all locking in lafs_refile
5072
5073 15ag/ Don't allocate data part of InoIdx block.
5074
5075 15ah/ Is there a problem with lafs_allocated_block putting an
5076     about-to-be-truncated block on an uninc list?
5077
5078 15ai/ When allocating a new segment during checkpoint, delay the
5079     youth-block update until after the checkpoint
5080
5081 15aj/ When roll-forward finds a new segment, make sure youth number is
5082     updated.
5083
5084 15ak/ Load orphan file during roll-forward and make every block an
5085     orphan.
5086
5087 15al/ set filesystem update_time somewhere.
5088
5089 15am/ filesystem 'name' needs to be handled uniformly.
5090
5091 15an/ can we be sure 'b' will be non-null in delete_inode?
5092
5093 15ao/ determine what locking is needed to walk the children list
5094     in lafs_inode_handle_orphans.  Probably the address_space private lock.
5095
5096 15ap/ Make sure write_inode has been cleaned up.  See if this applies to
5097     rollforward of a symlink (see FIXME)
5098
5099 15aq/ change inode map to be little-endian, not host-endian
5100
5101 15ar/ understand what to do about errors in lafs_truncate
5102
5103 15as/ handle errors from lafs_write_super ???
5104
5105 15at/ More wait_queues to wait for different blocks.
5106    An array which we hash in to ??
5107
5108 15au/ How should iocheck_block set the page error?
5109        and block_loaded
5110
5111 15av/ ditto for write errors?
5112
5113 15aw/ when lafs_incorporate makes a new block where the
5114       old is Realloc, the new should be Realloc too.
5115
5116 15ax/ Think about what happens when we relocate a block
5117     in the orphan list (lafs_orphan_release), particularly
5118     if the block isn't actually loaded.
5119
5120 15ay/ Wonder if there is any way for orphan_run to get a wakeup
5121     when an inode or dir mutex is released.
5122
5123 15az/ Sanity check all values in cluster head during roll-forward
5124       i.e. in roll_valid.  If the head isn't complete, we can still
5125       use this to commit some previous checkpoints.
5126
5127 15ba/ roll forward should not BUG on bad data like inodefile in
5128     non-primary filesystem.
5129
5130 15bb/ Do I need to sync something before copying an update over part
5131     of an inode, then reloading the inode.
5132
5133 15bc/ Handle DescHole in roll forward.
5134
5135 15bd/ Call lafs_add_block_address from writeback rather than iolock
5136     in roll forward, just for consistency.
5137
5138 15be/ Confirm various files loaded at mount time (segusage, orphan ...)
5139     are actually the correct type.
5140
5141 15bf/ Avoid quadratics in lafs_seg_put_all - nothing else should be doing
5142    a lookup - or at least we can test for that.
5143    lafs_seg_apply_all has similar problems and needs a good solution.
5144
5145 15bg/ lafs_seg_ref_block is worried about losing implicit ref on parent
5146     if parent splits.  See what to do about that.
5147
5148 15bh/ after roll-forward, check that free_blocks hasn't gone negative.
5149   or handle if it has.
5150
5151 15bi/ Set EmergencyClean a bit later - need at least one checkpoint first.
5152   to twostage.
5153
5154 15bj/ Make sure .last link in segtracker is kept up to date, particularly in
5155    segdelete.
5156
5157 15bk/ make sure get_cleanable doesn't lose a race before calling add_clean
5158
5159 15bl/ better checks for 'valid state block address' in valid_devblock
5160     include that segment_count is credible
5161     also in valid_stateblock
5162
5163 15bm/ make sure everything gets freed properly on error during mount / lafs_load
5164
5165 15bn/ How does refcounting of 'struct fs' work with multiple filesets?
5166
5167 15bo/ use put_super to drop last reference to superblocks
5168
5169 15bp/ review all superblock - maybe use more anon??
5170
5171 15bq/ check readonly status in lafs_get_sb
5172
5173 15br/ sync_fs should probably wait for something if 'wait'.
5174
5175 15bs/ set f_fsid properly in lafs_statfs
5176
5177  - use new write_begin / write_end
5178     - review how we ensure that credits remain with block.
5179
5180 15ca/ When pin inode data block, pin it as well as index block I think
5181     It is still kept off the leaf list until the index block is done with
5182     I think.
5183
5184 15cb/ Layout issues:
5185      - subset filesys still needs a parent pointer
5186      - cluster head needs mtime/ctime to log these.
5187      - need better tracking of which devices are in this array??
5188             Need to be able to have read-only devices that are shared
5189             among arrays.
5190      - need multiple parallel write-clusters to allow parallel writes.
5191      - record tuning in state block:
5192            - max_segs
5193      - use crc or something, not toy checksum (e.g. cluster - state already has)
5194      - flags for inconsistencies found, at layout/fileset/file levels(?)
5195      - policies of whether old or new data is allowed on each device
5196      - policies of how much duplication of metadata is required
5197
5198 15cc/ free any stray B_Async block found in destroy_inode
5199
5200 15cd/ Some code assume a cluster header does not exceed 1 page.
5201      Is this safe?  Is it true? Is it enforced?
5202
5203 16/ Update locking.doc
5204
5205 17/ cluster_flush calls lafs_cluster_allocate calls lafs_add_block_address
5206     calls  lafs_iolock_written.  How do we know that won't block on cluster_flush?
5207
5208 18/ See if per-fs shrinker is available yet and consider it for index blocks.
5209
5210 19/ Review WritePhase and make sure it is used properly.
5211
5212 20/ Review places where we update blocks and be sure they are not in writeout
5213     or in a different phase.
5214
5215 21/ Review and document all lru uses (locking.doc) and make sure they are
5216     all locked properly.
5217
5218 22/ Check possible failures:
5219     - thread allocation
5220     - memory allocation
5221     - reading critical metadata
5222     ...
5223
5224 23/ Rebase on 2.6.latest
5225
5226 24/ load/dirty block0 before dirtying any other block in depth=0 file,
5227     else we might lose block0
5228
5229 25/ use kmem_cache for
5230         datablock
5231         indexblock - probably a mempool because we cannot allow failure when
5232                      splitting an index block.
5233         skippoint (mempool?)
5234         segsum - mempool??
5235         others?
5236
5237 26/ Review seg addressing code for 2-D geometries.
5238
5239 27/ Allow ranges of holes in pending_addr so partial truncate can be more efficient.
5240
5241 28/ Make sure youth blocks are always referenced properly.
5242
5243 29/ Make sure new segments are referenced properly.  I think there might be
5244     some double referencing.
5245
5246 30/ Decide when to use VerifyNULL or VerifyNext2
5247
5248 31/ Implement non-logged files
5249
5250 32/ Store access time in non-logged file
5251
5252 33/ Support quota : group / user / tree
5253
5254 34/ handle subordinate filesystems:
5255      ss[]->rootdir needs to be array or list
5256      lafs_iget_fs needs to understand this
5257
5258 35/ review snapshots:
5259       - peer lists and cleaning
5260       - how to create
5261       - failure modes
5262       - how to destroy
5263
5264 36/ review roll-forward
5265       make sure files with nlink == 0 are handled well
5266       sanity check before trusting clusters
5267
5268 37/ Configure index block hash_table at run time based on mem size??
5269
5270 38/ striped layout
5271         review everything needed for safe RAID5
5272
5273 39/ How to handle all different IO errors
5274
5275 40/ Guard against data corruption at every level.
5276
5277 41/ Add checksums on index blocks and dir blocks and Inodes and ???
5278
5279 42/ Store duplicates of some blocks.  At least index and inode.
5280
5281 43/ Handle writepage on mem-mapped page, adding new credits or unmapping.
5282     Make sure ->page_mkwrite sets up credits properly
5283
5284 44/ Examine created filesystem and make sure everything looks good.
5285
5286 45/ mkfs.lafs
5287
5288 46/ fsck.lafs
5289
5290 47/ Write good documentation
5291
5292 48/ Review all code, improve all comments, remove all bugs.
5293
5294 49/ measure performance
5295
5296 50/ Support O_DIRECT
5297
5298 51/ Check support for multiple devices
5299     - add a device to a live array
5300     - remove a device from a live array
5301
5302 DONE 52/ NFS export
5303
5304 53/ 'overlay' support
5305         So I mount one device read-only and another device
5306         writable which gets all the updates.  metadata on first
5307         device not updated.
5308
5309 54/ cluster support - is this possible?
5310
5311 55/ is any useful variant of reflink  possible?
5312
5313 56/ Review roll-forward completely.
5314
5315 57/ learn about FS_HAS_SUBTYPE and document it.
5316
5317 58/ Consider embedding symlinks and device files in directory.
5318     Need owner/group/perm for device file, but not for symlink.
5319     Can we create unique inode numbers?
5320     hard links for dev-files would be problematic.
5321
5322 59/ Fix NeedFlush handling so we don't drop-then-retake
5323     a mutex, as that isn't sensible.
5324
5325 60/ Introduce some fs state recording that fsck is needed and possibly
5326     identifying what sort of fsck.
5327
5328 61/ Try to make the inode struct smaller - maybe move some of the
5329     fs metadata into a separately-allocated struct.
5330
5331 26June2010
5332  Investigating 5a
5333
5334    Normal sequence is to surrender UnincCredit, then to clear Dirty,
5335     then to write.  If anyone re-dirties after Dirty is clear, they
5336     will naturally have to add an UnincCredit having reserved space first.
5337    However it seems that the Cleaner gets in the way as the block in question
5338    has just previously been cleaned, which consumed the UnincCredit
5339    Do we need ReallocUnincCredit?? I hope not.
5340    We generally need a way to say "I might want to write to this" so cleaner
5341    doesn't write it early.
5342    For index blocks that is pincnt.  For data it is 'PinPending'.
5343    This keeps index blocks off clean_leafs until they are ready, but
5344    not data blocks.
5345    And in any case, TypeSegmentMap blocks don't get PinPending as they
5346    get written *after* the checkpoint.  That is a rather ugly exception.
5347    Maybe we make their different handling more explicit.  We put them on
5348    a separate list unpinned so the rest of the checkpoint can complete.
5349    Then we flush that list?
5350    Then PinPending keeps them off the clean_leafs list.
5351
5352    So to clarify the plan:  If a block is already Pinned to this phase,
5353    we can "clean" it by marking it Dirty rather than Realloc.  This is
5354    appropriate for blocks that are likely to change soon (as blocks written
5355    to the cleaner segment are not likely to change soon).
5356    For data blocks we take "PinPending" to say "might change soon".  For
5357    index blocks ... we don't know if it is pinned by Realloc or Dirty or
5358    PinPending children.  So we set Realloc and wait for any children to
5359    be unpinned for whatever reason.  If it is only pinned by Realloc blocks,
5360    it will end up on clean_leafs and be processed to the cleaner segment.
5361    If it is pinned by anything else it will be found by the checkpoint and
5362    processed to the new-data segment.
5363
5364    So Index blocks always get Realloc, PinPending blocks get Dirty,
5365    Other data blocks get Realloc.  Good.
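
As a compact restatement of that rule, a sketch (the struct and names are
invented for the example; the module keeps these as flag bits on the
block, not as separate fields):

```c
#include <assert.h>
#include <stdbool.h>

enum sketch_mark { SKETCH_MARK_REALLOC, SKETCH_MARK_DIRTY };

struct sketch_block {
	bool is_index;		/* index block rather than data block */
	bool pin_pending;	/* data block that "might change soon" */
};

/* How the cleaner should mark a block it wants moved. */
enum sketch_mark sketch_clean_mark(const struct sketch_block *b)
{
	if (b->is_index)
		return SKETCH_MARK_REALLOC;	/* index blocks: always Realloc */
	if (b->pin_pending)
		return SKETCH_MARK_DIRTY;	/* goes to the new-data segment */
	return SKETCH_MARK_REALLOC;		/* goes to the cleaner segment */
}
```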
5366
5367    Must review PinPending usage... always set, then maybe-dirty inside
5368    checkpoint lock.  In cases of unlocked usage (inode map) we don't clear
5369    PinPending until checkpoint so it has longer exposure to Realloc->Dirty.
5370    It is likely to be changing though, so not a big cost.  Even good.
5371
5372    Could make the distinction later.  PinPending blocks don't go on
5373    clean_leafs.  So if they are still realloc at the checkpoint, we Realloc
5374    to the new-data segment.  This has the same net effect but is arguably
5375    cleaner.  It means that if a realloc block gets pinpending set, it
5376    immediately stops being a clean leaf and so is safe.
5377    So: just keep PinPending blocks off clean_leafs.  Keep them on phase_leafs.
5378    However there is no mechanism for moving things from phase_leafs to clean_leafs.
5379    So maybe they stay on clean_leafs, but when the cleaner gets to them, it
5380    dirties them and drops them.... that would work.
5381
5382    So; if cleaner finds a block (on clean_leafs during cleaner-flush) which is
5383    Dirty or PinPending, it makes sure it is Dirty and drops it for phase_leafs
5384    to pick up.
5385
5386    BUT:  Does this work for TypeSegmentMap blocks?  They aren't PinPending.
5387
5388    We could treat them specially in the cleaner.  Or we could set PinPending
5389    and pin them to the phase, but treat them differently in checkpoint.
5390    If we gathered them onto a separate list, then flush the list after
5391    the phase had changed, it might be quite neat.  No more getting writepages
5392    to do our work for us.
5393    They would need to be re-pinned to the next phase, then written out.
5394    Or just unpinned, and let seg_inc re-pin as appropriate... except that
5395    seg_inc is too late to pin.  It dirties.  We need to pin when we get
5396    SegRef.  We currently reserve but we don't pin.
5397    We really do need to phase_flip these segmentmap blocks.  But that requires
5398    getting extra credits, and Pinning everything if new credits are not available.
5399    And we don't really have a good list of 'everything' that depends on a segment.
5400    But seeing the space_alloc never fails for these...
5401    So Pin them, and flip them with AccountSpace
5402
5403    So:
5404     - split out common 'flip' code
5405     - add 'flip' for data blocks
5406     - create list of accounting blocks and flip accounting file blocks onto
5407       that list during checkpoint
5408       Flush should write that list,  not the files.
5409     - Get cleaner to ignore pinpending blocks, marking them dirty.
5410     - pin segusage blocks while ref on them is held.
5411     - writepage no longer needs special case for TypeSegmentMap, just PinPending
5412     - lafs_prealloc just tests PinPending
5413
5414
5415    [[aside: quota files seem to be handled like segmentmap files.  Is that
5416      right??
5417      We only track usage of data blocks based on various 'owners' of the file.
5418      We need to know if a block was written in one phase or the next, and
5419      only count blocks written/allocated in the one.
5420      Data blocks can slip into 'this' phase quite late - any time before the
5421      parent is finally incorporated.  So we don't write quota blocks
5422      until checkpoint is done.  So yes, they are like SegmentMap
5423    ]]
5424
5425
5426   segsums....
5427    If there are hundreds of snapshots, then a block being cleaned (whether to
5428    cleaner segment or new-data segment) could affect hundreds of segment
5429    usage counters.  That would be clumsy to work with.  Every block in the
5430    free table would need to hold references to hundreds of blocks.  This
5431    is do-able and might not be a big waste of space, but is still clumsy.
5432    I could change the arrangement for accounting per-snapshot usage by having
5433    a limited number of snapshots and having all the counters for one segment
5434    in the one block.  So a 1024-byte block could hold 512 counters (youth plus
5435    base plus 510 snapshots).  Half that if I go to 4byte counters.
5436    In the more common case of 32 snapshots, could fit counters for 8 segments in
5437    a block.  This means using space/io for all possible snapshots rather than
5438    all active snapshots.  It would also mean having a fairly fixed upper limit.
5439    I wonder what NILFS does....
5440    Worry about this later.
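
The counter arithmetic can be checked with a few lines.  The helper names
are hypothetical.  Note the assumption in the last helper: the "8 segments
per block" figure falls out if a segment takes 32 slots of 4-byte counters
in a 1024-byte block; with 2-byte counters the same layout would give 16.
Whether the 32 includes youth and base is not settled above.

```c
#include <assert.h>

/* How many counters of a given width fit in one block. */
unsigned sketch_counters_per_block(unsigned block_bytes, unsigned counter_bytes)
{
	return block_bytes / counter_bytes;
}

/* One segment per block: every slot except youth and base is a snapshot. */
unsigned sketch_max_snapshots(unsigned block_bytes, unsigned counter_bytes)
{
	return sketch_counters_per_block(block_bytes, counter_bytes) - 2;
}

/* Several segments per block, each taking a fixed number of slots. */
unsigned sketch_segments_per_block(unsigned block_bytes, unsigned counter_bytes,
				   unsigned slots_per_segment)
{
	return sketch_counters_per_block(block_bytes, counter_bytes) /
	       slots_per_segment;
}
```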
5441
5442   Still trying to get pinning of SegmentMap blocks right.
5443   Normally we need a phase-lock when pinning a data block so that we
5444   don't lose the pinning before we dirty.  But as we phase_flip
5445   these it doesn't matter... So just add that to the test??
5446
5447 28June2010
5448  Reflecting on 5c - dirty_inode might find InoIdx pre-allocated but
5449   datablock not, and doesn't cope.
5450   We either prealloc both, which seems clumsy, or always defer
5451   to InoIdx if it is present and pinned.
5452   lafs_prealloc does both Index and Data blocks for inode.
5453   But Data could lose at writeout while index will replenish at
5454   phase_flip, so maybe not a good idea.
5455   If lafs_allocate_cluster finds a Dirty InoIdx it will copy the Dirty
5456   credits across to the data block (on non-cleaning segments) so the
5457   Data block doesn't need to have credits.
5458
5459   dirty_inode gets called:
5460      {__,}mark_inode_dirty{,_sync}
5461      inode_{inc,dec}_link_count
5462      [[various quota ops]]
5463     inode_setattr
5464     touch_atime
5465       file_accessed
5466     file_update_time
5467       generic_file_...write
5468       do_wp_page
5469
5470   updates through inode_setattr go to lafs_setattr so the
5471   data block will be pinpending and the checkpoint lock will be held.
5472
5473   updates through inode_*_link_count happen in filesystem and the inode data
5474    block is PinPending, or a block in the file is pinned and will be
5475    dirty, so it will get written.
5476
5477   updates through touch_atime or file_update_time are unexpected and
5478   cannot be prepared for.  file_update_time changes will be caught by
5479   normal file writeout.  atime changes will be lost until we get the
5480   atime file working.
5481
5482   So:
5483     dirty_inode cannot change the block as it might be in writeout, and
5484     it cannot lock anything as it might be in touch_atime which shouldn't
5485     block and cannot fail.
5486     So just set I_Dirty and use that to flush inode to db at writeout.
5487     Any changes which must be in the next phase will come via setattr and
5488     so will wait for incompatible changes to be written out.
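  A minimal userspace sketch of that conclusion: dirty_inode only sets a
  flag, and the copy into the data block happens at writeout.  The struct
  and function names are hypothetical and a single mtime field stands in
  for all the attributes; the real code would use the kernel's bit
  operations rather than C11 atomics.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct sk_inode {
	atomic_bool i_dirty;	/* stands in for the I_Dirty flag */
	long mtime;		/* in-memory attribute */
	long db_mtime;		/* copy held in the inode data block */
};

/* dirty_inode: must not block, must not fail, must not touch the block. */
void sk_dirty_inode(struct sk_inode *ino)
{
	assert(ino != NULL);
	atomic_store(&ino->i_dirty, true);
}

/* Called at writeout.  Returns true if the data block was refreshed. */
bool sk_flush_inode_to_db(struct sk_inode *ino)
{
	if (!atomic_exchange(&ino->i_dirty, false))
		return false;	/* nothing changed since the last flush */
	ino->db_mtime = ino->mtime;
	return true;
}
```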
5489
5490  Reflecting on 7c - cluster_flush might find ->my_inode is NULL.
5491   my_inode is set
5492      lafs_import_inode
5493          iget and mount-time stuff
5494      lafs_inode_dblock
5495
5496   my_inode is cleared
5497     When I_Destroyed is set and the last ref on the block is dropped
5498     When inode_map_new_prepare claims an inodeblock
5499
5500   So we could easily not have a my_inode - e.g. just cleaning the data block.
5501   ->my_inode cannot disappear while we hold the block, so a test is safe.
5502
5503
5504  ----------------------------------------------
5505  Space reservation and file-system-full conditions.
5506
5507   Space is needed for everything we write.
5508   Some things we can reject if the fs is too full
5509   Some things we can delay when space is tight
5510   Some things we need to write in order to free up space.
5511   Others absolutely must be written so we need to always have
5512   a reserve.
5513
5514   The things that must be written are
5515        - cluster header  - which we never allocate
5516        - some seg-usage and youth blocks - and quota blocks
5517          These continually have credit attached - it is a bug if there
5518           are not enough. (We hit this bug)
5519
5520   Things that we need to write to free up space are
5521    any block - data or index - that the cleaner finds.
5522
5523   Things that we can delay, but not fail, are any change to a block that
5524    has already been written or allocated.
5525
5526   When space is needed it can come from one of three places.
5527      - the remainder of the current main segment
5528      - the remainder of the current cleaner segment
5529      - a new segment.
5530
5531   Only Realloc blocks can go to the cleaner segment, so the
5532   'must write' blocks cannot go there, so unused + main must have enough
5533   space for all those.
5534   Realloc blocks can go anywhere - we don't need a cleaner segment if things
5535   are too tight.
5536
5537   When we run out of space there are several things we can do to get more:
5538    - incorporate index blocks.  This tends to free up uninc-credits which
5539      are normally over-allocated for safety.
5540    - cluster_allocate/cluster_flush so more blocks get allocated and so
5541      more can be incorporated.  See above.  This is probably most helpful
5542      for data blocks.
5543    - clean several segments into whole cleaner segments or into the main segment.
5544   Much of this happens by triggering a snapshot, however we should only do that
5545   when we have full cleaner-segments (or zero cleaner segments).
5546
5547   When cleaning we don't want to over-clean.  i.e. we don't want to commit
5548   any blocks from a second segment if that will stop us from committing blocks
5549   from the first segment.  Otherwise we might use one cleaning segment up by
5550   making 4 half-clean.  This doesn't help.
5551
5552
5553   So: we reserve multiple segments for the cleaner, possibly zero.
5554
5555   We clean up to that many segments at a time, though if that many is zero,
5556   we clean one segment at a time.
5557   lafs_cluster_allocate only succeeds if there was room in an allocated segment.
5558   If allocating a new segment fails, the cluster_allocate must fail.  This
5559   will push extra cleaning into the main segment where allocations must not
5560   fail.
5561
5562   The last 3(?) [adjusted for number of snapshots] segments can only be allocated
5563   to the main segment, and this space can only be used for cleaning.
5564   Once the "free_space - allocated_space"  drops below one segment, we
5565   force a checkpoint.  This should free up at least one segment.
5566
5567   We need some point at which we stop cleaning because the chance of finding
5568   something to clean is too low. At that point all 'new' requests definitely
5569   become failures.  They might do earlier too.
5570   Possibly at some point we start discounting youth from new usage scores so
5571   that the list becomes sorted by usage.
5572
5573
5574   Need:
5575     cut-off point for free_seg where we don't allow cleaner to use segments
5576       3? 4?
5577
5578     event when we start using fixed '0x8000' youth for new segment scores.
5579        Maybe when we clean a segment with usage gap below 16 or 1/128
5580     event when we stop doing that.
5581        Maybe when free_segs cross some number - 8?
5582
5583     point when alloc failure for NewSpace becomes ENOSPC
5584        same as above?
5585
5586     point when we don't bother cleaning
5587       no cleaner segments can be allocated, and checkpoint did not increase
5588       number of clean segments (used as many as freed).
5589       Clear this state when something is deleted.
5590
5591
5592    Allocations come out of free_blocks which does not include those
5593    segments that have been promised to the cleaner.
5594    CleanSpace and AccountSpace cannot fail.
5595      We *know* not to ask for too many - cleaner knows when to stop.
5596    ReleaseSpace fail (to be retried) if available is below a threshold,
5597      providing the cleaner hasn't been stopped.
5598    NewSpace fail if below a somewhat higher threshold.  If we have entered
5599      emergency cleaning mode, these requests fail -ENOSPC, else -EAGAIN.
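  A rough sketch of those rules as a single decision function.  The
  threshold fields and the struct are invented for illustration, and
  'available' is taken to mean what would be left after the request.  It
  also assumes that emergency cleaning mode is what turns a NewSpace retry
  into ENOSPC, which is how the work items describe it.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

enum sk_space_why { CleanSpace, AccountSpace, ReleaseSpace, NewSpace };

struct sk_space_state {
	long free_blocks;	/* excludes segments promised to the cleaner */
	long release_threshold;	/* hypothetical: below this ReleaseSpace waits */
	long new_threshold;	/* hypothetical: somewhat higher, for NewSpace */
	bool cleaner_stopped;
	bool emergency_clean;
};

/* Returns 0 if the request may proceed, else -EAGAIN or -ENOSPC. */
int sk_space_check(const struct sk_space_state *st, enum sk_space_why why,
		   long want)
{
	long left;

	assert(st->new_threshold >= st->release_threshold);
	left = st->free_blocks - want;

	switch (why) {
	case CleanSpace:
	case AccountSpace:
		return 0;	/* cannot fail: callers know when to stop */
	case ReleaseSpace:
		if (left < st->release_threshold && !st->cleaner_stopped)
			return -EAGAIN;	/* the cleaner may still free space */
		return 0;
	case NewSpace:
		if (left >= st->new_threshold)
			return 0;
		return st->emergency_clean ? -ENOSPC : -EAGAIN;
	}
	return -EINVAL;
}
```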
5600
5601
5602    Possibly limit some 'cleaner' segments to data only??
5603
5604
5605   So: work items.
5606     - change CleanSpace to never fail, but cluster_allocate new_segment
5607       can for cleaner segment.  This is propagated through lafs_cluster_alloc
5608     - cleaner pre-allocates cleaner segments (for new_segment to use)
5609       and only cleans that many segments at a time.
5610     - introduce emergency cleaning mode which causes ENOSPC to be returned
5611       and ignores 'youth' on score.
5612     - pause cleaner when we are so short of space that there is no point
5613       trying until something is deleted.
5614
5615 30june2010
5616   notes on current issue with checkpoint misbehaving and running out of
5617   segments.
5618
5619   1/ don't want to cluster-flush too early.  Ideally wait until segment is
5620    full, but we currently hold writeback on everything so we cannot delay
5621    indefinitely.
5622   2/ row goes negative!!  let's see...
5623
5624     seg_remainder doesn't change the set, but just returns
5625         the remaining rows times the width
5626
5627     seg_step  move nxt_* to *, stepping to the next ... row?
5628              save current as 'st_*
5629
5630     seg_setsize - allocate space in the segment for 'size' blocks plus
5631          a bit to round off to a whole number of table/rows
5632                nxt_table nxt_row
5633
5634     seg_setpos initialises the seg to a location and makes it empty,
5635        st_ and nxt_ are the same
5636
5637     seg_next reports address of next block, and moves forward.
5638
5639     seg_addr  simply reports address of next block
5640
5641    So the sequence should be:
5642
5643      seg_setpos  to initialise
5644      seg_remainder as much as you want
5645      seg_setsize when we start a cluster
5646      seg_next  up to seg_remainder times
5647      seg_step  to go to next cluster (when not seg_setpos).
5648             or maybe just before seg_setpos
5649
5650      Need cluster_reset to be called after new_segment, or after we
5651      flush a cluster but don't need a new_segment.
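  The calling sequence can be tried out on a much simplified model.  This
  sketch assumes a single table where a block address is just
  base + row * width + col; the real layout has tables and strides that are
  not modelled, and the sk_ names and struct are invented here.  The assert
  in sk_seg_setsize is the guard against the row count going negative.

```c
#include <assert.h>

struct sk_segpos {
	unsigned base;		/* address of the segment's first block */
	unsigned width;		/* blocks per row */
	unsigned rows;		/* rows in the segment */
	unsigned st_row;	/* first row of the current cluster */
	unsigned row, col;	/* position of the next block to hand out */
	unsigned nxt_row;	/* first row after the current cluster */
};

/* Point at a segment and make it empty: st_ and nxt_ are the same. */
void sk_seg_setpos(struct sk_segpos *s, unsigned base, unsigned rows,
		   unsigned width)
{
	assert(width > 0);
	s->base = base;
	s->width = width;
	s->rows = rows;
	s->st_row = s->row = s->nxt_row = 0;
	s->col = 0;
}

/* Remaining rows times the width.  Changes nothing. */
unsigned sk_seg_remainder(const struct sk_segpos *s)
{
	return (s->rows - s->nxt_row) * s->width;
}

/* Claim room for 'size' blocks, rounded up to a whole number of rows. */
void sk_seg_setsize(struct sk_segpos *s, unsigned size)
{
	unsigned need = (size + s->width - 1) / s->width;

	assert(need <= s->rows - s->row);
	s->nxt_row = s->row + need;
}

/* Address of the next block, without moving. */
unsigned sk_seg_addr(const struct sk_segpos *s)
{
	return s->base + s->row * s->width + s->col;
}

/* Address of the next block, then move forward. */
unsigned sk_seg_next(struct sk_segpos *s)
{
	unsigned addr = sk_seg_addr(s);

	assert(s->row < s->nxt_row);	/* stay inside the claimed rows */
	if (++s->col == s->width) {
		s->col = 0;
		s->row++;
	}
	return addr;
}

/* Move on to the next cluster: nxt_ becomes current and is saved as st_. */
void sk_seg_step(struct sk_segpos *s)
{
	s->st_row = s->row = s->nxt_row;
	s->col = 0;
}
```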
5652
5653    I think I'm cleaning too early ...  I am even cleaning
5654    the current main segment!!!!
5655
5656    OK, I got rid of the worst bugs.  Now it just keeps cleaning
5657    the same blocks in the current segment over and over.
5658    2 problems I see
5659       1/ it cleans a segment that it should not touch
5660            We need to  avoid cleaner segment increasing the
5661              checkpoint youth number.
5662       2/ it has 6 free segments and doesn't use them
5663
5664    clean_reserved is 3 segments, < 4, so free_block <= allocated+ watermark
5665    watermark is 4 segs, so free < 4.  So we have 3 allocated to cleaner,
5666    3 in reserve and so nothing much to clean!
5667
5668    The heuristic for returning ENOSPC is not working.  Need something more
5669    directly related to what is happening.
5670    Maybe if cleaning doesn't actually increase free space.
5671
5672    !Need to leave segments in the table until we have finished
5673    writing to them, so they cannot be cleanable. - DONE
5674
5675    WAIT - problem.  If cleaner segment is part-used, the alloc_cleaner_segs
5676    doesn't count that.  Bad?
5677
5678    When nearly full we keep checkpointing even though it cannot help.
5679    Need clearer rules on when there is any point pushing forward.
5680    Need to know when to fail requests.
5681
5682 02 july 2010
5683
5684   I am wasting lots of space creating snapshots that don't serve any
5685   purpose.
5686   The reasons for creating a snapshot are:
5687     - turn clean segments into free segments
5688     - reduce size of required roll-forward
5689     - possibly flush all inode updates for 'sync'.
5690
5691   We currently force one when
5692        newblocks > max_newblocks
5693           max is 1000 , newblocks is never reset!
5694           probably make that a number of segments.
5695        lafs_checkpoint_start is called
5696           when cleaner blocks, and space is available
5697           at shutdown
5698           on write_super is s_dirt
5699              __fsync_super before ->sync_fs
5700                freeze_bdev
5701                fsync_super
5702                  fsync_bdev
5703                  do_remount_sb
5704              generic_shutdown_super before put_super if s_dirt
5705              sync_supers is s_dirt
5706                do_sync
5707              file_sync !!! is s_dirt
5708
5709       I think I should move checkpoint_start to
5710             ->sync_fs
5711
5712
5713  After testing
5714   - blocks remaining after truncate - one index and 1-4 data
5715   - truncate finds blocks being cleaned
5716          FIXED - move setting of I_Trunc
5717   - orphans aren't being cleaned up sometimes.
5718         Hacked by forcing the thread to run.
5719   - parent of index block has depth==1
5720         Don't reduce depth while dirty children.
5721         Probably don't want uninc either?
5722
5723   - some sort of deadlock? lafs_cluster_update_commit_both
5724      has got the wc lock and wants to flush
5725     writepage also is flushed.
5726    Not sure what the blockage is.
5727    I think the writepage is the one in cluster_flush, and it
5728     is blocking
5729
5730   - Async is keeping 16/0 pinned during shutdown
5731 03July2010
5732
5733   Testing overnight with 250 runs produced:
5734  - blocked for more than 120 seconds
5735       Cleaner tries to get an inode that is being deleted
5736       and blocks, so inode_map_free is blocked waiting for
5737       checkpoint to finish - deadlock.
5738      Need to create a ->drop_inode which provides interlock with
5739      cleaner/iget
5740
5741     But this is hard to get right.
5742     generic_forget_inode needs to write_inode_now and flush all changes
5743     out and then truncate the pages off so the inode will be
5744     empty and can be freed.  But flushing needs the cleaner thread
5745     which can block on the inode lookup.
5746     Ahh.... I can abuse iget5_locked.
5747     If test sees I_WILL_FREE or similar, it fails and sets a flag.
5748     if the flag was set, then 'set' fails
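  A userspace model of that trick.  The two callbacks have the same shape
  as the test and set callbacks that iget5_locked takes (an inode pointer
  plus an opaque data pointer), but the inode struct, the lookup key and
  the will_free field are stand-ins invented for this sketch.  As far as I
  recall, a non-zero return from 'set' makes iget5_locked return NULL,
  which is what lets the cleaner back off instead of blocking.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct sk_inode {
	unsigned long ino;
	bool will_free;		/* stands in for I_WILL_FREE or similar */
};

struct sk_lookup {
	unsigned long ino;
	bool saw_dying;		/* set by test, consulted by set */
};

/* 'test' callback: never match an inode that is being freed, but
 * remember that we saw it. */
int sk_test(struct sk_inode *inode, void *data)
{
	struct sk_lookup *key = data;

	assert(key != NULL);
	if (inode->ino != key->ino)
		return 0;
	if (inode->will_free) {
		key->saw_dying = true;
		return 0;
	}
	return 1;
}

/* 'set' callback: fail if test saw a dying inode, so that no fresh
 * inode is created while the old one is still going away. */
int sk_set(struct sk_inode *inode, void *data)
{
	struct sk_lookup *key = data;

	if (key->saw_dying)
		return -1;
	inode->ino = key->ino;
	inode->will_free = false;
	return 0;
}
```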
5749
5750
5751  - block.c:504 DONE (I think).
5752     unlink/delete_commit dirties a block without credits
5753     It could have been just cleaned..
5754     It looks like it was in Writeback for the cleaner when
5755     unlink pinned and allocated it....
5756     or maybe it was on a cluster (due to writepage) when
5757     it was pinned.  Then cluster_flush cleared dirty ... but
5758     it should still have a Credit.
5759     Maybe I should iolock the block ??
5760
5761     On reflection it wasn't cleaning, just tiny clusters
5762     of recent changes which were originally written as tiny
5763     checkpoints. Maybe lots of directory updates triggered the clusters.
5764     I guess writepage is being called to sync the directory???
5765     Or maybe the checkpoint was pushed by s_dirt being set.
5766
5767     So use PinPending and iolock to protect dir blocks from writepage.
5768
5769  - dir.c:1266 DONE
5770     dir handle orphan finds a block (74/0) which is not
5771     valid
5772     This can happen if orphan_release failed to reserve a block.
5773     We need to retry the release.
5774  - inode.c:615
5775     index block and some data blocks still accounted to deleted file.
5776
5777     No theory on this yet.  Always one index block and a small number
5778     of data blocks.  Maybe the index block looked dirty, but was then
5779     incorporated with something that was missed from the children list...
5780     Or maybe I_Trunc is cleared a bit early...
5781     Or trunc_next advanced too far?? or too soon
5782     ??
5783
5784  - segments.c:640 DONE
5785      prealloc in the cleaner finds all 2315 free blocks allocated.
5786      no clean reserved.
5787     Need to be able to fail CleanSpace requests when cleaner_reserve
5788     is all gone.??
5789
5790     or just slow down the cleaner to one segment per checkpoint when
5791     we are tight..  Hope that works.
5792  - super.c:699
5793      async flag on 16/0 keeping block pinned
5794    Maybe clear Async flag during checkpoint.  Cleaner won't need it
5795    No, just ensure to clear Async on all successful async calls.
5796
5797      orphan file 8/0 has orphan reference keeping parent pinned
5798       [cfb64c90]8/0(1782)r1E:Valid,SegRef,PhysValid orphan(1)
5799    Orphan handling is failing to get a reservation to write out the
5800    orphan file block?  Not convincing as there should be lots of space
5801    at unmount, and 'orphan sleeping' has become empty.
5802
5803  - Show State
5804      orphan inode blocked by leaf index stuck in writeback:
5805    [cfb68460]331/0(NoPhysAddr)r2F:Index(1),Pinned,Phase1,Valid,Dirty,SegRef,CI,CN,CNI,UninCredit,EmptyIndex{0,0}[0] primary(1) leaf(1) Leaf1(5)
5806    [cfb28d20]331/336(NoPhysAddr)r2F:Index(1),Pinned,Phase1,Valid,Dirty,Writeback,Async,UninCredit,PrimaryRef{0,0}[0] async(1) cluster(1) wc[0][0]
5807
5808     This is in the write-cluster waiting to be flushed
5809
5810
5811 9July2010
5812   Review B_Async.
5813     If a thread wants async something, it
5814          - sets B_Async
5815          - checks if it can have what it wants.
5816            + if not, fail
5817            + if so, clear B_Async and succeed
5818
5819     If a thread releases something that might be requested Async,
5820          it doesn't clear Async, but wakes up *the*thread*.
5821
5822     This applies to
5823         IOLock      - iolock_block
5824         Writeback   - writeback_done, iolock_written
5825         Valid        - erase_dblock, wait_block
5826         inode I_*   - iget / drop_inode
5827
5828      orphan handler, cleaner, segscan - all in the cleaner thread.
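  The B_Async rules, sketched for the IOLock case only.  This is a
  single-threaded model with plain flag words and a wake-up counter; the
  names are hypothetical and the real code would need atomic bit
  operations and an actual wake-up of the cleaner thread.

```c
#include <assert.h>
#include <errno.h>

enum { SK_B_IOLock = 1 << 0, SK_B_Async = 1 << 1 };

struct sk_block {
	unsigned flags;
	int wakeups;	/* counts wake-ups sent to the one async thread */
};

/* Async request for the IO lock: never sleeps. */
int sk_iolock_async(struct sk_block *b)
{
	b->flags |= SK_B_Async;		/* ask to be woken if we fail */
	if (b->flags & SK_B_IOLock)
		return -EAGAIN;		/* Async stays set until we succeed */
	b->flags |= SK_B_IOLock;
	b->flags &= ~SK_B_Async;	/* success: clear Async */
	return 0;
}

/* Release: does not clear Async, just wakes *the* thread. */
void sk_iounlock(struct sk_block *b)
{
	assert(b->flags & SK_B_IOLock);
	b->flags &= ~SK_B_IOLock;
	if (b->flags & SK_B_Async)
		b->wakeups++;
}
```

  Leaving Async set on failure is what lets the releaser know a wake-up is
  wanted; clearing it on every successful call is what stops a stale flag
  from keeping a block pinned.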
5829
5830   107 runs,
5831    2 hit 'Show State' with a blocked orphan inode.
5832     Two children, one EmptyIndex, one PrimaryRef, Async,Writeback
5833     Both NoPhysAddr
5834
5835    Several runs blocked in cluster_flush or waiting for writeback.
5836
5837    - first case: looks like cluster flush should run but doesn't.
5838         cluster_flush runs:
5839            checkpoint, cleaner, cluster_allocate when full, update,
5840            writepage, sync_page
5841         So we have no timeout or other flush.
5842       I guess if we are waiting for writeback, we need to trigger a
5843       cluster_flush.
5844
5845    - other case - cluster_flush was called but is waiting for pending count
5846        to go down.
5847        Looks like cluster_reset shouldn't be changing pending_next
5848
5849    New hang.  Orphans not being processed:
5850         inode, because InoIdx is on leaf and checkpoint isn't pushing
5851         it along.
5852         dir block 0 is Dirty leaf
5853
5854      Maybe we failed to get a mutex, and mutex_unlock doesn't wake us.
5855
5856 10July2010
5857   Over night it looks *very* good.
5858   Have one infinite loop with 31770 repeats of
5859   ORPH: [cfbe0000]0/328(2326)r4E:Valid,Dirty,Async,UninCredit,Claimed,PhysValid,
5860                    Orphan(0) async(1) delete_inode(1) orphan_list(1) drop(1)
5861
5862   So either stuck in truncate_inode_pages, lafs_add_orphan, or inode_map_free
5863     lafs_add_orphan too short.
5864     tracing shows after truncate_inode_pages.
5865     must be blocked in inode_map_free - maybe use AccountSpace??
5866    But why isn't the truncate progressing?
5867    Probably same reason:  No ReleaseSpace available.
5868    Maybe we aren't cleaning because there is a free segment, and
5869    we aren't checkpointing because there aren't enough yet...
5870
5871    Probably the cleaner has halted while CleanerBlocks - fix that.
5872
5873   - 0/74 is a stuck orphan because 74/0 is a dirty leaf going nowhere..
5874         Need a checkpoint to release the orphan?
5875    ditto for 0/331 - 331/0
5876     XX/0 is InoIdx
5877
5878 VFS: Busy inodes after unmount of hdb. Self-destruct in 5 seconds.  Have a nice
5879 day...
5880 This was pinned: [ce5914f0]16/0(2)r8F:Pinned,Phase0,PinPending,Valid,C,CI,CN,CNI
5881 ,UninCredit,PhysValid leaf(1) intable(6) release(1)
5882  [ce5914f0]16/0(2)r8F:Pinned,Phase0,PinPending,Valid,C,CI,CN,CNI,UninCredit,Phys
5883 Valid leaf(1) intable(6) release(1) Leaf0(0)
5884 ------------[ cut here ]------------
5885 kernel BUG at /home/neilb/work/nfsbrick/fs/module/super.c:698!
5886
5887 Forgetting 0 0
5888 724 != 7  (st->free.cnt afte segdelete, close_segment, close_all)
5889 ------------[ cut here ]------------
5890 WARNING: at /home/neilb/work/nfsbrick/fs/module/segments.c:844 lafs_check_seg_cn
5891
5892 we called segdelete on something that was on the freelist.
5893 This happens when the final cluster starts a new segment.
5894 Need to improve the fix though.
5895
5896
5897  lafs_inode_handle_orphan can make progress without leaving
5898  anything async.  Maybe we need a return status:
5899   -EAGAIN - try after async
5900   -ENOMEM - try some time soon - hope memory will be better
5901   0 we called orphan_release
5902   anything else loops.
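  The proposed return convention, written out as a dispatcher the orphan
  thread could use.  The enum and function are hypothetical; only the
  mapping from status to action comes from the list above.

```c
#include <assert.h>
#include <errno.h>

enum sk_orphan_action {
	SK_WAIT_ASYNC,	/* -EAGAIN: try again after the async event */
	SK_RETRY_SOON,	/* -ENOMEM: try again some time soon */
	SK_DONE,	/* 0: orphan_release was called */
	SK_LOOP		/* anything else: call the handler again now */
};

enum sk_orphan_action sk_orphan_dispatch(int status)
{
	switch (status) {
	case -EAGAIN:
		return SK_WAIT_ASYNC;
	case -ENOMEM:
		return SK_RETRY_SOON;
	case 0:
		return SK_DONE;
	default:
		return SK_LOOP;
	}
}
```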
5903
5904
5905  - when we allocate a segment in the last checkpoint we don't
5906    take references properly.
5907
5908  - orphan handle spinning on:
5909
5910   ORPH: [ce545f08]0/290(1663)r4E:Valid,Dirty,Async,UninCredit,Claimed,PhysValid,Orphan(0) async(1) delete_inode(1) orphan_list(1) drop(1)
5911    26402 calls.
5912    stuck in delete_inode?? ?
5913
5914
5915   never-ending cleaning? Maybe just computer slow ??
5916
5917 11July2010  - on plane to Prague.
5918   How can we safely access ->iblock?
5919    normally iolock, but how do we get iolock?
5920    - flush data to inode
5921    - cluster flush takes private_lock
5922    - private_lock is used to set to null.
5923   I guess we use private_lock to get a reference
5924   then iolock and revalidate
5925   but I can probably test for NULL at any time? though that can change under private_lock
5926   If we own a reference to a child with a parent, then we can use
5927    rcu_dereference to get a ref which might change
5928
5929 12july2010
5930
5931  ->write_inode is called by write_inode() called by __sync_single_inode
5932   to handle I_DIRTY_SYNC|I_DIRTY_DATASYNC after do_writepages
5933  Do we care?
5934
5935  change to addresses we already handle with checkpoints
5936  change due to setattr we can handle directly if we want
5937  that just cleans mtime/ctime and atime.
5938    mtime/ctime calls ->dirty_inode
5939    as does atime
5940
5941  So:
5942   getattr changes set I_Dirty so that when cluster_allocate
5943   happens all the changes get saved.
5944
5945   when dirty_inode is called, we set I_Dirty but don't dirty
5946   the inode block.
5947   If anything happened to justify an inode write, it will
5948   be dirty anyway.  If it isn't, this is just atime
5949
5950   So on dirty_inode we check if atime has changed and if so
5951   we schedule change to atime file
5952
5953   sync_inode should write an update for the inode if I_Dirty
5954   but sync_filesystems should not
5955
5956   Simple.  fsync calls ->fsync.  We get that to write an
5957   inode update, but nothing else does.
5958
5959   Possibly all directory updates could be chained onto a
5960   directory and only written when fsync is requested before
5961   a checkpoint.
5962   both sides of a rename ??
5963   leave that for later.
5964
5965 WritePhase - what is that all about?
5966   We must not change a block while it is being written to previous
5967     phase, else we corrupt causality.
5968   But we probably don't want to change it any way as that would
5969   mess up any checksum or duplication.
5970
5971  So we want to ignore WritePhase - scrap it.
5972  Before changing a block, we must iolock_written
5973   - all dir updates
5974   - inode update in fsync
5975   - orphan file
5976   - segusage?
5977   - quotas?
5978
5979  But what about regular data.  If prepare_write finds a block in
5980  writeback, do I need to wait, or can I just mark it dirty in
5981  commit_write?  If no checksum and no duplication applies, this should
5982  be fine.
5983
5984 16July2010
5985  BUT e.g. dir operations are in particular phases.  If the dirblock
5986  is pinned to the old phase, we need to flush it, then wait for io
5987  to complete.  So we need lafs_phase_wait as well as iolock_written.
5988  This is already done by pin_dblock.
5989  I wonder if we need a way to accelerate pinned blocks that are being
5990  waited for - probably not, they should be done early.
5991
5992  So we probably want to iolock after phase_wait in pin_dblock.
5993  Though dir.c pins early.
5994  I need to review all of this and get it right.
5995
5996  So:
5997   - we aren't allowed to block much while holding checkpoint_lock as
5998     checkpoint_start waits for that.  However phase_wait will only
5999     block if a new checkpoint has started already, so there is no
6000     chance of phase_wait ever blocking checkpoint_start.
6001     So it is safe to call phase_wait in checkpoint_lock.
6002     phase_wait will wait until block is written, added back to
6003     the lru clean, then found and flipped... I wonder if that is
6004     good - it keeps parent from being a leaf, and so written, until
6005     child write has completed.
6006     We want to phase-flip a block as soon as it is allocated by cluster_flush.
6007
6008     With directory blocks, i_mutex stops other changes, so an early iolock_written
6009     will leave the block clean and phase won't be an issue.
6010
6011     With inode-map blocks.. we:
6012       set B_Pinned to ensure no-one writes except for phase change
6013         do that after lock_written so it starts safe.
6014       once we have checkpointlock, wait for phase if needed.
6015       then lock_written again which should be instant but ensures
6016       that block is locked while we change it...
6017
6018   I think I want
6019     - refile to call phase flip if index is not dirty and is in wrong phase
6020        and has no pinned children in that phase.
6021     - Only clear PinPending if we have i_mutex or refcnt == 0
6022     - before transaction:
6023           lock_written / set PinPending / unlock
6024       then inside cluster_lock
6025           lock_written pin / change / dirty / unlock
6026       it will only wait for writeout if phase changed.
6027       so don't need phase_wait
6028      but want pre-pin then pindblock
6029      Transactions are:
6030         dir create/delete/update - DONE
6031         inode allocate/deallocate - on inode map DONE
6032         setattr  DONE
6033         orphan set/change/discard
6034
6035      Orphans are a little different as when we compact the
6036      file, the orphan file block 'owned' by the orphan block
6037      can change.  As long as we keep them all PinPending it
6038      should be fine though.
6039      I think that every block in the orphan file will always be
6040      PinPending ???
6041
6042     OK - done most of that.
6043     Early phase_flip is awkward.  We need an iolock to phase_flip,
6044     and we don't have one.  The phase_flip could cause incorporation
6045     which cannot happen until the write completes.  So I guess
6046     we leave it as it is.
6047
6048
6049    FIXME what about inode data block - cluster_allocate is removing
6050     PinPending after making them dirty from the index block..
6051
6052   If all free inode numbers are B_Claimed, I don't think we allocate
6053   a new block... yes we do, as 'restarted' is local to caller.
6054
6055  Also
6056   each device has a number of flags
6057    - new metadata can go here
6058    - new data can go here
6059    - clean data can go here
6060    - clean metadata can go here
6061    - non-logged segments allowed
6062    - priority clean - any segment can be cleaned
6063    - dev is shared and read-only - no state-block updates
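  One way the per-device flags might be written down, with a helper that
  answers "may this kind of block be written to this device?".  All names
  are invented for this sketch, and it assumes a shared read-only device
  accepts no writes at all, which goes a little further than "no
  state-block updates".

```c
#include <assert.h>
#include <stdbool.h>

enum sk_dev_flags {
	SK_DEV_NEW_META   = 1 << 0,	/* new metadata can go here */
	SK_DEV_NEW_DATA   = 1 << 1,	/* new data can go here */
	SK_DEV_CLEAN_DATA = 1 << 2,	/* clean data can go here */
	SK_DEV_CLEAN_META = 1 << 3,	/* clean metadata can go here */
	SK_DEV_NONLOGGED  = 1 << 4,	/* non-logged segments allowed */
	SK_DEV_PRIO_CLEAN = 1 << 5,	/* any segment can be cleaned */
	SK_DEV_SHARED_RO  = 1 << 6	/* shared and read-only */
};

/* 'cleaning' selects the clean/new pair, 'metadata' the meta/data pair. */
bool sk_dev_accepts(unsigned flags, bool cleaning, bool metadata)
{
	unsigned need;

	if (flags & SK_DEV_SHARED_RO)
		return false;
	if (cleaning)
		need = metadata ? SK_DEV_CLEAN_META : SK_DEV_CLEAN_DATA;
	else
		need = metadata ? SK_DEV_NEW_META : SK_DEV_NEW_DATA;
	return (flags & need) != 0;
}
```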
6064
6065   state block needs a uuid for an ro-filesystem that this is
6066   layered on.
6067
6068   Is metadata an issue?
6069     We might want it on a faster device, but ditto for directories
6070     and for some data.  So probably skip that.
6071
6072   Have separate segment tables for:
6073     - can have new data
6074     - can have clean data but not new. (this often empty)
6075
6076   Clean data can go to new-not-clean if nothing else
6077   new data can go to clean-not-new ?? if not sync??
6078   Maybe call them 'prefer clean' and 'prefer new'
6079
6080   I think we want:
6081     'no sync new' - don't write new data, unless it is in big chunks and
6082            can wait for checkpoint to be 'synced'
6083     'no write' - never write anything - this is readonly.
6084                used for removing a device from the fs.
6085
6086   A 'no sync new' device can have single-block segments.
6087   This doesn't allow compression, but avoids any need to clean
6088   In this case we don't store youth and the segusage is 32 bits per segment.
6089   That means  - for 1K block size - 0.5% of devices used for segusage.  That
6090   feels high.  For 4K, 1/1024 so a gigabyte per terabyte.
6091   Then limited to 29 snapshots plus base fs, and 2 bits to record bad blocks.
6092
6093   Other segusage for 29 snaps is 1/million of space used.
6094   So we 'waste' 0.1% of device for no secondary cleaning.
6095   Can still do defrag though.
6096
6097   clearing a snapshot on a 1TB device writes 1GB of data!! potentially.
6098   as does creating a snapshot.
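  The overhead figures can be reproduced with a few lines of arithmetic.
  This assumes single-block segments with one 32-bit usage word each, and
  reads the word as one bit per snapshot plus one for the base fs and two
  for bad blocks; the function names are invented.

```c
#include <assert.h>

/* Device blocks covered by one block of segusage: one 4-byte word per
 * single-block segment. */
unsigned long sk_segusage_ratio(unsigned long block_bytes)
{
	assert(block_bytes >= 4);
	return block_bytes / 4;
}

/* Bytes of segusage needed for a device of 'dev_bytes'. */
unsigned long long sk_segusage_bytes(unsigned long long dev_bytes,
				     unsigned long block_bytes)
{
	return dev_bytes / sk_segusage_ratio(block_bytes);
}

/* Snapshot bits left in the 32-bit word after the base fs bit and the
 * two bad-block bits. */
unsigned sk_snapshot_bits(void)
{
	return 32 - 1 - 2;
}
```

  For 1K blocks the ratio is 1/256, about 0.4%, which the note rounds up to
  0.5%.  For 4K blocks it is 1/1024, so a 1TB device needs 1GB of segusage,
  and that is the amount that may be rewritten when a snapshot is created
  or cleared.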
6099
6100 18jul2010
6101  If lafs were cluster enabled we would want multiple checkpoint clusters,
6102  one for each node. When a node crashes some node would need to find and
6103  roll-forward.  For single node failure, it is enough to broadcast cluster
6104  address to all others.  For whole-cluster failure, need to either list all
6105  in superblock or link from main write cluster.
6106
6107  When writing to multiple devices we may want multiple write clusters
6108  active for new data.  These all need to be findable from checkpoint cluster
6109  so linking sounds good.
6110  Having a single 'fork' link in cluster head might work but does scale to large
6111  cluster.  It doesn't need to be committed to others, nor does checkpoint end, so
6112  that should be ok.
6113  Could have a special group_head to list other clusters for roll forward.
6114  If we put fsnum first, a large value - 0xffffffff - could easily mean
6115  something else
6116
6117  Or every  cluster head could point to an alternate stream, and if we want many
6118  quickly, each simply points to another, so we create a chain across all writers.
6119
6120
6121  Another issue...
6122   When we 'sync' we don't wait for blocks until after the checkpoint is started,
6123   and we know that will be driven through to CheckpointEnd which will commit and
6124   release everything.
6125   However 'fsync' doesn't have the same guarantee.  The sync_page call will ensure
6126   the data has been written, but we don't know it is safe until the next
6127   header is written.  So we need to push out the next cluster promptly.
6128
6129   So if sync_page is called on a page in writeback, then we mark the cluster as
6130   synchronous.  When a sync cluster completes, the next (or even next+1) clusters
6131   are flushed out promptly.  Hopefully they won't be empty on a reasonably busy system,
6132   but it is OK if they are.
6133
6134   If a block is writeback for the cleaner.. then as the cluster is VerifyNone, as soon
6135   as the write completes the block will be released.
6136
6137   So: to clarify sync_page:
6138     This can be called when page is in writeback or locked.
6139     If locked there is nothing we can do except maybe unplug the read queue.
6140     If page is in writeback and block is dirty, then it is probably in
6141     a cluster queue and we should flush the cluster and the next.
6142     If page is in writeback and block is not dirty, but is writeback,
6143     just flush one cluster.
6144     But we don't want these cluster flushes to start while the previous is
6145     still outstanding else we stop new requests from being added.
6146     So as soon as the cluster can be flushed we flush, but no sooner.
6147     I guess we use FlushNeeded and make that be less hasty.
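  The sync_page cases reduce to "how many clusters should be flushed".  A
  sketch with a hypothetical helper taking the four states as booleans; it
  does not model the rule that a flush must wait until the previous
  cluster is no longer outstanding.

```c
#include <assert.h>
#include <stdbool.h>

/* Number of cluster flushes sync_page should request. */
int sk_sync_page_flushes(bool page_locked, bool page_writeback,
			 bool block_dirty, bool block_writeback)
{
	if (page_locked || !page_writeback)
		return 0;	/* at most unplug the read queue */
	if (block_dirty)
		return 2;	/* probably still queued: this cluster and the next */
	if (block_writeback)
		return 1;	/* already being written: one more cluster */
	return 0;
}
```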
6148
6149 19June2010
6150
6151   superblocks....
6152    We currently have a superblock for each device.
6153    I cannot see a good reason for that.
6154    We can just bdev_claim for 'this' filesystem.
6155    Rather we should have a number of anon superblocks,
6156     one for each fileset, then one for each snapshot.
6157    Do we use different fs types? probably yes
6158        lafs - main filesystem made from devices
6159        lafs_subset - subordinate fileset, given a path to  fileset object
6160                  can have 'create' option when given an empty directory.
6161        lafs_snap - snapshot - given a path to filesys and textname.
6162
6163     Cannot create a snap of a subset, only of the whole filesystem
6164     Is it OK to mount either snap of subset or subset of snap?
6165     It probably is, so we need to use the same filesystem type for both.
6166     Maybe lafs_sub or sublafs. Needs path to directory.
6167     can be given 'snap=foo'.
6168     No: a given filesystem may not exist in a snapshot.  You need to
6169     mount the snapshot first, then the subset of the snapshot.
6170     So we have three types as above.  All subsets as 'lafs_subset',
6171     whether they are subset of main or of snapshot.
6172
6173     Should we be able to create a snapshot or subset without mounting it?
6174     It doesn't really seem necessary but might be elegant..
6175
6176     remount doesn't seem the right way to edit a filesystem as it forces
6177      some cache flushing.
6178     What do we want to edit?
6179           - add device,  remove device
6180           - add/remove snapshot by name
6181           - add/remove subset?  Not needed, just mkdir/rmdir and mount to convert
6182                      empty dir to subset.
6183           - change cleaner settings??
6184     Could have remount as an option. If problem find other option.
6185
6186     While cleaning (which is always) we potentially need all superblocks
6187     available as we might need to load blocks in those filesystems to
6188     relocate them.
6189     Unfortunately each super needs to be in a global list so there is a cost
6190     in having them appear and disappear. I guess that is not a big deal.  They
6191     are refcounted and will disappear cleanly when the count hits zero.
6192
6193     So:
6194      DONE - change all prime_sb->blocksize refs to fs->blocksize
6195      DONE - create an anon sb for the main filesystem
6196      DONE - discard the device sbs, just bd_claim the devices and add to list
6197      - use lafs_subset for creating/mounting subsets.
6198
6199   Changed s_fs_info to point to the TypeInodeFile for the super, but
6200    for root/snapshot that doesn't exist early enough to differentiate the
6201    super in sget.
6202    So we make an inode before the super exists and attach it after.
6203    Need to do all that get_new_inode does.
6204         inode_stat.nr_inodes++   - just don't generic_forget the inode
6205         add to inode_in_use -   seems pointless - just set i_list to something
6206         add to sb->s_inodes - if we don't it won't flush - maybe that is good?
6207         add to hash - don't want
6208         i_state == lock|new - only really needed if hashed.
6209     but there is lots of initialisation in alloc_inode that we cannot access!!
6210
6211    Problem is that we need s_fs_info to uniquely identify the fs with something
6212    that can be set in the spinlock, so allocating an inode is out.
6213    And also to get to the filesystem metadata which is in the inode.
6214    I guess we allocate a little something that stores identifier and later inode.
6215      for lafs  we use uuid
6216      for subset we use just the inode
6217      for snapshot we use fs and number
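
  A minimal sketch of that "little something": a key that can be filled in
  and compared without allocating anything (so it is usable inside the sget
  test callback), with a slot for the inode to be attached later.  All the
  names here (sb_key, sb_key_match, the enum values) are made up for
  illustration, and plain pointers stand in for the kernel structures.

```c
#include <stdbool.h>
#include <string.h>

enum sb_kind { SB_MAIN, SB_SUBSET, SB_SNAPSHOT };

/* Hypothetical payload for s_fs_info: an identifier now, the inode later. */
struct sb_key {
    enum sb_kind kind;
    union {
        unsigned char uuid[16];       /* lafs: the filesystem uuid */
        const void *fileset_inode;    /* lafs_subset: just the inode */
        struct {
            const void *fs;           /* lafs_snap: owning filesystem ... */
            int num;                  /* ... and snapshot number */
        } snap;
    } id;
    void *root_inode;                 /* NULL until the TypeInodeFile inode exists */
};

/* Comparison suitable for an sget-style test: no allocation, no sleeping. */
static bool sb_key_match(const struct sb_key *a, const struct sb_key *b)
{
    if (a->kind != b->kind)
        return false;
    switch (a->kind) {
    case SB_MAIN:
        return memcmp(a->id.uuid, b->id.uuid, sizeof(a->id.uuid)) == 0;
    case SB_SUBSET:
        return a->id.fileset_inode == b->id.fileset_inode;
    case SB_SNAPSHOT:
        return a->id.snap.fs == b->id.snap.fs &&
               a->id.snap.num == b->id.snap.num;
    }
    return false;
}
```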
6218
6219
6220 25July2010
6221   superblocks:
6222    - sget gives us an active super_block.  We need to attach to a vfsmnt
6223      using simple_set_mnt, or call deactivate_locked_super.
6224    - sget's set should call set_anon_super
6225    - kill_sb (called by deactivate_super) should then call kill_anon_super
6226
6227   If we have a vfsmnt, we have an active reference, so we can atomic_inc
6228   s_active safely.  So use this to allow snapshots and subsets to hold a
6229   ref on the prime_sb and thence on the 'fs'.
6230
6231 26July2010
6232  - DONE  need to set MS_ACTIVE somewhere!!
6233  - FIXME if an inode is being dropped when iget comes in, it gets confused
6234     and the inode appears to be deleted.
6235
6236    We cannot really break the dblock <-> inode link until after write_inode_now,
6237    but there is no call-back before generic_detach_inode is complete.
6238    The last is write_inode which is only called if I_DIRTY_something.
6239    Maybe when writeback completes on an inode dblock, we should check if
6240    the inode is I_WILL_FREE and if so, we break the link...
6241
6242    Or maybe when we find my_inode set we can check the block and if it isn't
6243    dirty or being deleted we break the link directly... That makes more sense.
6244
6245    So... what is the deal with freeing inodes???
6246      ->iblock is like a hashtable reference.  It is not refcounted
6247              It gets set under private_lock
6248       iblock is freed by memory pressure or lafs_release_index from
6249              destroy_inode
6250      when refcount of iblock is non-zero, ->dblock ref is counted,
6251      else it is not.
6252      dblock is set to NULL if I_Destroyed, or when dblock is discarded,
6253        (under lafs_hash_lock)
6254        and set to 'b' in lafs_iget and lafs_inode_dblock
6255
6256      We can drop the dblock link as soon as iblock has no reference
6257
6258     probably get clear_inode to break the link if possible, which it should
6259     be on 'forget_inode'.  Then lafs_iget can wait on the bit_waitqueue.
6260     or maybe do clear_inode itself
6261
6262    FIXME when we drop dblock we must clear iblock! as getiref iblock assumes
6263       dblock is not NULL.
6264
6265 28July2010
6266   So: ->dblock and ->my_inode need to be clarified.
6267
6268   Neither is a counted reference - the idea is that either can be freed and
6269   will destroy the pointer at the time so if the pointer is there, the
6270   object must be ... but we need locking for that.
6271   ->dblock is reasonably protected by private_lock, though if ->iblock exists
6272   we hold a ref of ->dblock so we can access it more safely.
6273
6274   Need to check getiref_locked knows ->dblock exists when called on iblock
6275   and lafs_inode_fillblock
6276    yes, both safe!
6277
6278  But ->my_inode needs locking too so the inode can safely disappear without
6279  having to wait for the data block to go.  After all, data blocks come in sets,
6280  and one shouldn't keep others with inodes.
6281  So something light-weight like rcu might work.
6282  We use call_rcu to free the inode and rcu_read_lock to access ->my_inode
6283
6284  Yes, that will work.  Occasionally we will want an igrab too, but not
6285  often.
6286  Should look into rcu for index hash table and ->iblock as well.
6287  Currently ->iblock is only cleared when the block is freed .. I guess that is fine...
6288
6289
6290 31Jul2010
6291   rcu protection of ->my_inode
6292   A/ orphan inodes - are they protected?
6293   B/ orphan blocks - are the inodes of those protected? Probably...
6294
6295   inodes are 'orphan' for two reasons
6296     1/ a truncate is in progress
6297     2/ there are no remaining links, so inode should be truncated/deleted
6298        on restart.
6299
6300   The second precludes us from holding a refcount on any orphan inode,
6301   else it would never get deleted.
6302   So we must assert that an inode with I_Deleting or I_Trunc has an implied
6303   reference and so delete must be delayed... not quite.
6304   If we set I_Trunc but not I_Deleting, then we igrab the inode until
6305   I_Trunc is cleared.  While we hold the igrab, I_Deleting cannot possibly
6306   be set as that is set when last ref is dropped.
6307
6308 01Aug2010
6309   FIXME lafs_pin_dblock in lafs_dir_handle_orphan needed to be ASYNC.
6310     .. and in lafs_orphan_release
6311   Well... only iolock_written can be a problem, and our rules require that
6312   only phase-change writeout can set writeback.  So the cleaner can never
6313   wait for writeout here.  Maybe it can wait for a lock, and maybe we don't
6314   really need a lock, just 'wait_writeback'.
6315 08Aug2010
6316   So cleaner is in run_orphans, dir_handle_orphan pin_dblock iolock_written
6317    It is writeback waiting on 74/BIGNUM from file.c:329.  So writepage
6318    tried to write a block in a directory .. but it is PinPending so that
6319    must have been set after writepage got it...
6320    lafs_dir_handle_orphan gets an async lock, then sets PinPending.
6321    If write_page is before that, it will have the lock and dir_handle will try later.
6322    If write_page is after it will block on the lock, or see PinPending and
6323    release the lock.
6324    So someone else must be clearing PinPending!
6325      - checkpoint clears and re-sets under the lock, so that is safe
6326      - dir.c clears under i_mutex
6327          dir_handle_orphans always holds i_mutex ... or does it?
6328      - refile drops when the last non-lru reference goes.
6329      - inode_map_new_abort clears for inode
6330    No, not that - just bad test on result of iolock_written_async ;-(
6331
6332   Now have an interesting deadlock.
6333     rm in lafs_delete_inode in inode_map_free is waiting for the block to
6334     flush which requires the cleaner.
6335     The cleaner thread in inode_handle_orphan is calling erase_dblock
6336      on the same inode which blocks while inode_map_free has it locked....
6337      no, not same block - just waiting for writeout which requires cleaner.
6338      lafs_erase_dblock from inode_map_free must be async!
6339    pin_dblock in lafs_orphan_release must be too.... no - only the setting of
6340    PinPending needs to be async or outside of cleaner, which it is.
6341
6342   Ok, got that fixed.  All seems happy again, time for a commit.
6343
6344
6345 09Aug2010
6346    14b/  What backing-dev to show the filesystem.
6347      backing-dev holds:
6348          congested state
6349          unplug function
6350          read-ahead info
6351          throughput measurements
6352
6353     Much of that is for generic code to use.  We need to:
6354      - provide an unplug function that unplugs all devices
6355      - provide a congested function that checks all devices,
6356        or for 'write' - at least the device we are writing to.
6357
6358     How do we set the backing device?
6359     The 'struct address_space' points to one, as does struct super_block.
6360     set_anon_super establishes a null bdi, set_bdev_super gets it from the
6361     bdev->queue
6362
6363     We need to bdi_init and bdi_register (if no error) our bdi.
6364     bdi_destroy calls unregister and reverses bdi_init
6365     or just bdi_setup_and_register
6366     but bdi_register_dev gives a better name - isn't this sick!!!
6367
6368     Partly done ... but I'm hitting more bugs :-(
6369
6370   -Checkpoint cannot complete because...
6371    Lots of dirty inodes that are orphans are not pinned!! I
6372    guess the InoIdx is ??
6373    Most of them don't have InoIdx(?)  Only '8' does.
6374    8/0 is also an orphan and is on wc[0]
6375
6376    It seems that this block keeps getting re-written and stays in
6377    Phase0.
6378    Is that because it is a data block with PinPending.. No, that works
6379    as long as it becomes un-dirty: we drop pinpending, refile, and set again
6380
6381    It is being dirtied again during writeout for the checkpoint
6382    so it doesn't get to change phase when we lift PinPending.
6383    I guess we mustn't dirty it if it is in the old phase.
6384
6385   -And twice inode 17 is deleted without B_Orphan being set!
6386    That is the only file that exists before we mount.
6387      Problem was orphan_release instead of orphan_forget
6388      I wonder why it only affected 17...
6389
6390   -at shutdown we drop an inode and try to invalidate pages, but
6391    root inode is still dirty - I wonder why.
6392      The dblock is in a different phase to the iblock.
6393      In checkpoint we wait until root iblock changes phase, but
6394      not root dblock!
6395
6396
6397   UP TO:
6398     I'm testing subordinate filesystems, which don't work yet.
6399     I need to create the root directory and inode map.
6400     Obviously I cannot record the inode map file in the inode map....
6401       inode_map should ignore everything less than 16? 8? 2?
6402     Need to make sure creating with a given inode number works.
6403     Need to make sure auto-allocate inum is never less than 16.
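
  The two requirements above (create with a given inode number, never
  auto-allocate below the reserved range) can be modelled with a toy bitmap.
  This is a sketch only: struct inode_map, alloc_inum and claim_inum are
  hypothetical, the map is a single 64-bit word, and 16 is assumed as the
  threshold even though the notes still leave 16/8/2 open.

```c
#include <stdint.h>

#define FIRST_AUTO_INUM 16   /* assumed; inums below this are never in the map */
#define MAP_BITS 64

/* Toy inode map: bit n set means inum n is in use. */
struct inode_map {
    uint64_t used;
};

/* Auto-allocate the lowest free inum, never below FIRST_AUTO_INUM. */
static int alloc_inum(struct inode_map *m)
{
    for (int i = FIRST_AUTO_INUM; i < MAP_BITS; i++) {
        uint64_t bit = UINT64_C(1) << i;
        if (!(m->used & bit)) {
            m->used |= bit;
            return i;
        }
    }
    return -1;   /* map full */
}

/* Create with a given inum.  Reserved inums are ignored by the map. */
static int claim_inum(struct inode_map *m, int inum)
{
    if (inum < 0 || inum >= MAP_BITS)
        return -1;
    if (inum < FIRST_AUTO_INUM)
        return 0;            /* e.g. root dir or the inode map file itself */
    uint64_t bit = UINT64_C(1) << inum;
    if (m->used & bit)
        return -1;           /* already in use */
    m->used |= bit;
    return 0;
}
```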
6404
6405 11Aug2010
6406  How to map from filesys inode to superblock?
6407   Need in
6408     lafs_iget_fs
6409     choose_free_inum - to get inode-1
6410     ditto in inode_map_free
6411     lafs_put_super has something odd with i_sb
6412
6413   Could do an sget search..
6414   Or could just store it in the inode (but not in i_sb!!)
6415   inode already a bit large though.
6416   Do it for now, but make a note to trim the fs_md part of inode
6417   into a separate allocation.
6418
6419   lafs_new_inode should take an 'sb' not a 'filesys'.
6420   In fact, get rid of filesys.  It is
6421     MAP(i->i_sb->s_fs_info)->root.
6422
6423  15f - timestamps for roll-forward.
6424     The writeout can be much later, but logging the mtime is fairly
6425     boring ... we could log mtime in the group head, which might be cheap
6426     enough.  How much precision is needed, and against what base?
6427     probably mtime of last checkpoint from superblock.  That should
6428     be not more than 2048 seconds ago, so 16 bits gets us about 30msec...
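
  The arithmetic: 2048 seconds spread over 2^16 values is 32 ticks per
  second, i.e. 31.25 msec per tick.  A rough sketch of such an encoding
  follows; mtime_encode and mtime_decode are hypothetical helpers working in
  milliseconds, not anything that exists in the module.

```c
#include <stdint.h>

#define MTIME_WINDOW_SECS   2048u
#define MTIME_TICKS_PER_SEC (65536u / MTIME_WINDOW_SECS)  /* 32, so 31.25 msec/tick */

/*
 * Encode mtime as 16-bit ticks since the last checkpoint.
 * Returns 0 on success, -1 if mtime falls outside the 2048 second window.
 */
static int mtime_encode(uint64_t checkpoint_ms, uint64_t mtime_ms, uint16_t *out)
{
    uint64_t delta_ms;

    if (mtime_ms < checkpoint_ms)
        return -1;
    delta_ms = mtime_ms - checkpoint_ms;
    if (delta_ms >= (uint64_t)MTIME_WINDOW_SECS * 1000)
        return -1;
    *out = (uint16_t)(delta_ms * MTIME_TICKS_PER_SEC / 1000);
    return 0;
}

/* Recover the (rounded-down) mtime in milliseconds. */
static uint64_t mtime_decode(uint64_t checkpoint_ms, uint16_t ticks)
{
    return checkpoint_ms + (uint64_t)ticks * 1000 / MTIME_TICKS_PER_SEC;
}
```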
6429
6430 14Aug2010
6431  15l - decay youth info.
6432     Need to decay:
6433          youth_next and checkpoint_youth in 'struct fs'
6434          all blocks in youth files on storage
6435          all scores in seg-tracker.
6436            - not needed, they'll get updated in normal progress
6437              and being wrong for a while is no cost.
6438     ensure correct youth is stored in lafs_free_get
6439     check little-endian conversion of all youth accesses
6440
6441     checkpoint_youth only used by thread, so no locking needed
6442     youth_next protected by fs->lock
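
  For the on-storage part, a sketch of decaying one block of youth values
  with explicit little-endian conversion.  Big assumption: "decay" is taken
  here to mean halving each 16-bit value; the notes do not say what the real
  decay function is.  get_le16, put_le16, decay_youth and decay_youth_block
  are all illustrative names.

```c
#include <stddef.h>
#include <stdint.h>

/* Read a little-endian 16-bit value regardless of host byte order. */
static uint16_t get_le16(const unsigned char *p)
{
    return (uint16_t)(p[0] | (p[1] << 8));
}

/* Write a 16-bit value in little-endian order. */
static void put_le16(unsigned char *p, uint16_t v)
{
    p[0] = (unsigned char)(v & 0xff);
    p[1] = (unsigned char)(v >> 8);
}

/* Assumed decay rule: halve the youth value. */
static uint16_t decay_youth(uint16_t youth)
{
    return (uint16_t)(youth / 2);
}

/* Decay every 16-bit youth entry in one block of a youth file. */
static void decay_youth_block(unsigned char *block, size_t nbytes)
{
    for (size_t i = 0; i + 1 < nbytes; i += 2)
        put_le16(block + i, decay_youth(get_le16(block + i)));
}
```

  youth_next and checkpoint_youth in 'struct fs' would need the same rule
  applied in memory (youth_next under fs->lock) so they stay comparable with
  the stored values.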
6443
6444  15m - share orphans and cleaning list_heads in datablock
6445    It certainly is possible to clean an orphan but it is very unlikely
6446    as it will have changed recently, or be changing soon.
6447    The cleaner could just dirty any B_Orphan it finds.
6448    But if orphan finds a block on the list, it must be careful...
6449    I guess when cleaner drops a cleaning ref, it should check if the block
6450    is an orphan, and re-queue if it is.
6451
6452  15o - async blocks just have an extra refcount.
6453    This could:
6454      - keep PinPending set
6455      - keep an index block pinned - will phase-flip
6456      - keep ->parent link
6457    but must not get in the way of a checkpoint.
6458
6459    Should we clear any that we find though?
6460    Normally async is only used by cleaner, orphan processing, or segscan
6461    So it should all be finished when we do a checkpoint.
6462
6463    So if checkpoint, or release_page, finds an async block, drop it.