README

   1
   2 So, let's try to write a kernel module that implements this filesystem.
   3 It would be good to have a plan.
   4
   5 - Mount filesystem, providing empty root directory
   6    o parse mount options - DONE
   7    o find/load superblocks and stateblocks - DONE
   8    o present empty directory - DONE
   9    o Compile external module - DONE
   10    o test - DONE
  11
  12 - Mount filesystem read-only with no roll-forward
  13    o IO address mapping
  14           sync_page_io or bread? - not bread I think
  15    o Index blocks management
  16    o search cluster-header for root inode
  17    o file read
  18    o Directory lookup/read
  19    o test
  20
  21 - Support roll-forward for blocks, orphans, whatever
  22    o manage segusage files
  23    o manage quota files
  24
  25 - Support writing
  26    o inode bitmap
  27    o cluster creation / block sorting
  28
  29 - Support Cleaning
  30
  31 - Interface for snapshots and other admin
  32
  33
  34
  35 ------------------------
  36 FIXME
  37  If a device is removed from the filesystem, we cannot reliably
  38  tell from the other devices or state that this is so.
  39  Maybe we need to update all devblocks with a new 'seq' number...
  40 FIXME
  41  How do we specify mounting subordinate filesets?
  42  What superblock do they have?
  43  I suspect we do a -F lafs-sub mount from the original filesystem.
  44
  45 FIXME
  46  If mount fails, we seem to be leaving a super lying around,
  47  and sync_supers dies on it. - DONE
  48
  49 FIXME
   50   Umount appears to work, but a sync_supers dies. - DONE
  51
  52 FIXME
  53   subordinate supers aren't being locked as much - is that a problem?
  54
  55 FIXME
  56   index pages never get put on an LRU - how is this supposed to work?
  57
  58
  59 --------------------------
  60 Thoughts:
  61   Inodes live in an address-space, much like a file.  To load the
  62   first inode, we need an address-space, so may as well have an
  63   'struct inode' as we may want to expose it to user-space.
  64
  65  Loading an inode, need
  66    fs (lafs filesystem structure)
  67    which subfs (maybe a lafs inode)
  68    which snapshot - this is implied by the subfs inode.
  69    and fs can be obtained from inode, so just inode, inum
  70
  71
  72
  73 UPTO
  74 03nov2005
  75   review block_leaf_find and make_iblock
  76   need to do setparent and block_adopt next
  77
  78 10nov2005
  79   need to resolve locking for ->siblings list
  80
  81 24nov2005
  82   peer_find
  83   lock_phase
  84   lafs_refile
  85
  86   I can read a file.....!!!!!
  87   Code review / tidy up.
  88      resolve locking buffer vs page
  89
  90   Export on a web page somewhere??
  91
  92 16feb2006
  93  (I spent a while getting large-directories to work again in prototype..
  94   and some holidays).
  95  - Priority: clean mount and unmount
  96  - large directories
  97  - multiple devices.
  98
  99   FIXME how do we record and handle write errors???
 100
 101  The iput in lafs_release - which is needed - is oopsing
 102   at iput+0xe!
 103
 104 23feb2006
 105  Ok, I finally have a clean mount/unmount.
 106  .. not quite.  blocks being freed at unmount still have a refcnt, which is bad.
 107
 108  Next:
 109   - make sure we can handle 'large' directories.
 110   - make sure we can handle files with indexes
 111   - handle filesystems that span devices.
 112
 113 02mar2006
 114   Hurray - clean unmounts!!!
 115   There is a nasty circular reference of the root inode which is stored in
 116   a block that it manages.  Maybe this should not happen, rather than having to be
 117   explicitly broken - the root-block can live elsewhere, not in the inode.
 118
 119   Next multi-level index blocks.
 120
 121   But first, need to understand memory pressure and pageout.
 122    How are dirty pages found to be cleaned?
 123    How is pressure put on a filesystem to clean up?
 124    How are clean pages reaped?
 125
  126   - call pagevec_lru_add{,_active}(pvec)  to put the page on an LRU
 127       lru_cache_add{,_active}(page) might be easier, but isn't exported.
 128   - call mark_page_accessed(page)  to keep the page 'active'.
 129
 130 09mar2006
 131   - make sure indexes work...
 132
 133
 134  lafs_load_block+0xf
 135   eax,bx,cx,dx,s1 all zero
 136 from block_leaf_find 203
 137
 138   ... OK, indexes seem to work.
  139    But 'lafs' has problems creating some large files.
 140    Try 'tt'
 141
 142    This is due to not handling error properly.. fix it later FIXME
 143
 144 16mar2006
 145
  146   Must make sure the index address-space gets cleared up... I wonder
 147   how we find all the pages to free.  This might be one reason to keep them
 148   in a radix tree.  Though we should be able to walk our own data structures.
 149
 150
 151   Then work on mounting a 2-device filesystem.
 152
 153
 154   FIXME dir_next_ent always starts from the beginning rather than
 155    remembering where it is up to... can this be fixed??
 156
 157
 158 18mar2006 (Wedding anniversary, and Saturday ... during commonwealth games)
 159
 160   Mounting snapshot needs a way to identify that it is a snapshotmount
 161   and which snapshot, and which filesystem.
 162   We could use a different filesystem type, but that isn't really needed
 163
 164     mount -t lafs -o snapshot=name /original/mount/point /new
 165
 166   This grabs the named snapshot of /original/mount/point and places it at
 167       /new
 168   The 'snapshot=' option is the trigger.
 169
 170   For a control FS, we
 171         mount -t lafs -o control /original/mount/point /new
 172
 173   To grow a filesystem, we initialise a device (super/state blocks) and
 174         mount -t lafs -o remount,new=/dev/name whatever /original/mount/point
 175
 176   as the dev_name isn't passed to remount
 177
 178   So, mount options are:
 179         snapshot=name
 180         dev=/dev/device
 181         new=/dev/device
 182         control
 183     and various
 184           name=value
 185     pairs matching what is exposed in the control filesystem
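
A rough user-space sketch of parsing the option list above.  The struct
and function names (lafs_opts, lafs_parse_opts) are made up for
illustration, not taken from the module; the free-form name=value pairs
are only counted here, on the assumption that they would be handed on
to the control-filesystem code.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical container for the options listed above. */
struct lafs_opts {
    char snapshot[64];   /* snapshot=name */
    char dev[64];        /* dev=/dev/device */
    char newdev[64];     /* new=/dev/device */
    int  control;        /* bare 'control' flag */
    int  other;          /* count of remaining name=value pairs */
};

/* Copy a value into a fixed buffer, truncating if it is too long. */
static void copy_val(char *dst, size_t len, const char *src)
{
    snprintf(dst, len, "%s", src);
}

/* Parse a comma-separated option string.  'opts' is modified in place.
 * Returns 0 on success, -1 if a known key has no value. */
static int lafs_parse_opts(char *opts, struct lafs_opts *o)
{
    char *tok = opts;

    memset(o, 0, sizeof(*o));
    while (tok && *tok) {
        char *next = strchr(tok, ',');
        char *val;

        if (next)
            *next++ = '\0';
        if (*tok == '\0') {          /* skip empty tokens such as ",," */
            tok = next;
            continue;
        }
        val = strchr(tok, '=');
        if (val)
            *val++ = '\0';

        if (strcmp(tok, "control") == 0 && !val) {
            o->control = 1;
        } else if (strcmp(tok, "snapshot") == 0) {
            if (!val || !*val)
                return -1;
            copy_val(o->snapshot, sizeof(o->snapshot), val);
        } else if (strcmp(tok, "dev") == 0) {
            if (!val || !*val)
                return -1;
            copy_val(o->dev, sizeof(o->dev), val);
        } else if (strcmp(tok, "new") == 0) {
            if (!val || !*val)
                return -1;
            copy_val(o->newdev, sizeof(o->newdev), val);
        } else {
            o->other++;              /* left for the control fs to interpret */
        }
        tok = next;
    }
    return 0;
}
```

In the kernel the same job would more likely be done with the usual
mount-option helpers; this is only meant to pin down which keys take a
value and which do not.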
 186
 187 23mar2006
 188  - factored out super-block finding preparatory to finding snapshots.
 189
 190  Thoughts:
 191     superblocks for snapshots and sub-ordinate filesystems do
 192     not get stored in the 'state'.  There is, however, a usage count so that
 193     the prime filesystem cannot be unmounted until all snaps and subs are gone.
 194     This should just refcount the prime_sb I suspect.
 195
 196     So: a snapshot sb points to the 'struct fs' but doesn't .... what???
 197
 198 30mar2006
 199  - remove the super-block finding code by changing the layout to store
 200     superblock locations explicitly :-)
 201
 202  - teach 'mount' to mount snapshots.
 203
 204  - need to audit for bad use of ss[0]
 205  - need to find better way to map 'sb' to snapshot number.
 206  - need to make unmount work.
 207
 208 01apr2006 (no, really!!)
 209  - rewrite index to kmalloc index blocks and use a shrinker to free them.
 210    This means that indexblock no longer has a 'page', which makes sense.
 211    It also means they cannot live in highmem, which is sad, but could
 212    be fixed.
 213
 214   Notes: superblocks and refcounts.
 215    Each device holding the filesystem gets a superblock.
 216     One of these (arbitrarily) is the 'prime' superblock and gets to
 217       manage the whole filesystem.
 218     Each snapshot also gets a superblock, as does each
 219       subordinate filesystem.  These are anon sbs - using anon dev.
 220     Each anon sb takes a reference to the 'struct fs', and also to the
 221       prime sb.... how about the reference relationship between fs and prime_sb???
 222
 223     Need to ponder this,
 224
 225    - problem with getting parent superblock due to semaphores...
 226    - when unmount, put_super isn't being called, so inode 0 isn't released!
 227
 228 13apr2006
 229   (Took a week off to play with rt2500 wireless cards)
 230   - Use different filesystem type for snapshots and subordinate filesystems.
 231     This removes the semaphore problem
 232   + OK, mount and unmount works for snapshots... what next?
 233      - review index block - worry about himem?
 234      - review ss[0] usage - OK
 235      - general code review
 236
 237   FIXME - what should leaf_lookup/index_lookup return on format error?
  238       They currently return '0' which will quietly make an empty block.
  239       Maybe '-1' would be better to make an error block.
  240   FIXME check how other filesystems lock the setting of PagePrivate
 241      Maybe just need to lock_page
 242   FIXME combine find/load/wait into one operation
 243   Review dir, super, roll, link
 244
 245   FIXME module refcount increases on failed mount!
 246
 247 18may2006
 248   I've been sick for too long, and not much has happened... However I think more than
 249   the above comment says.  I started looking at roll-forward and have the
 250   basic block parsing in place so that it reports what it sees in the roll.
  251   Also, the format has been changed a little: the address in the state block
 252   is the CheckpointStart cluster, and we simply roll forward to the
 253   CheckpointEnd, and then keep going beyond there - there is no longer any
 254   walking back to find the start.
 255
 256   Next step is to start incorporating rolled elements into the filesystem
 257
 258    - data blocks: shouldn't be too hard.  Don't need to update the
 259            index pages just yet
 260    - inode updates: should be straight forward enough, but care is needed
 261            as the data might be in multiple places
 262    - directory updates: these are probably most interesting..
 263
 264
 265   Question: how are symlinks created?
 266     Currently we:
 267       log the inode creation
 268       commit the new inode
 269       log the directory update.
 270     This allows the 'value' stored in the inode to appear after the directory
 271     update.
 272     That might be OK for files (Which are created empty and then extended)
 273     but is bad for symlinks (which are created atomically).
 274     So, options include:
 275      - ensure inode is in a previous cluster to directory updates.
 276        This slows things down too much I think
 277      - log the content as well.  This is awkward if it is big, certainly if more
 278        than a block, which is possible.
  279      - directory updates could be dependent on the inode being valid.
 280        This is ugly.
 281      - log content if it is small, else write inode, flush, then create link.
 282
 283     So the fast option is:
 284       log inode create, log content, log filename
 285     and the slow/safe option is
  286       log inode create, sync file, log filename
 287
 288     So on roll-forward if we see the inode we just save the data.
 289     Saving the whole inode seems attractive, but we want minimal order
  290     dependence: an inode update in the same cluster as the new inode should
 291     still over-ride, even though it is earlier.
 292
 293   Ok, rollforward is proceeding slowly.  I think I am now incorporating
 294   new blocks into the tree properly, though the code probably won't compile.
 295   It will be nice to test this and see the file have the right data.
 296
 297   Next step would be to include the index incorporation code.
 298   Then
 299     - directory updates
 300     - segusage summary
 301     - quota
 302     - stuff..
 303
 304 08jun2006
 305  - what exactly should happen when rollforward finds a file with a linkcount of 0?
 306    Currently all updates get lost - I wonder if they are lost safely?
 307  - rollforward is getting the size right, but not the content
 308  - do I need to flag a block that ->phys is valid?
 309
 310  : Ok, roll-forward picks up new blocks in a file OK,
 311   but umount has stopped working.
 312     Presumably because there are pages attached to the inode which aren't
 313     getting released.  What do we want to do here?
 314     Normally those pages, or their addresses need to be recorded before
 315     they are lost.  But on a read-only mount we don't care so much.
 316
 317 22jun2006    continuing above thought..
 318
 319    When we roll-forward and pick up the pieces of a file, we don't
 320    want to allocate pages to hold those pieces (and definitely don't
 321    want to read them all).  We just want to attach the addresses
 322    to the parent for incorporation.  Similarly after writing
 323    dirty blocks in a file we want to be able to release them
 324    immediately rather than waiting for the addresses to be
 325    incorporated (as incorporation can be more efficient when delayed).
 326
 327    We could just allow the page associated with a block to be released,
 328    except that the page provides the indexing to find a block.  We might
 329    be able to live without the indexing, and hunt down the indexblock tree,
 330    but living without the mutual-exclusion provided by block indexing would
 331    be more awkward.
 332    And the 'struct datablock' still contains a lot more than is needed.
 333
 334    So maybe we should just have a completely separate structure attached to
 335    the indexblock which lists fileaddr/physaddr.  This could include
  336    extent information.  The trick would be guaranteeing allocation.
 337    We could either allocate-late with a fallback of attaching the 'struct block'
 338    or performing an immediate incorporation, or allocate-early and block
 339    the dirtying of a page until there is space to record the new address.
 340    This last is bound to be easiest.
 341
 342    So: what exactly do we use to store addresses?
 343     Probably a linked list of tables.
 344     Each table contains a link pointer and an array of
 345         fileaddr/physaddr/extentlen
 346     But we would need to allocate lots of these if there are hundreds of
 347       dirty pages, but possibly only end up using a few if they made
 348       extents very nicely.  That might be wasteful.
 349
 350     Or we could allocate just one.  When it is full we perform an
 351      incorporation.  But if that causes a page split we are in trouble.
 352        We could have a spare page, split to it, write out one
 353         and wait for the spare page to be written and free.
 354         But we cannot just release the index page as it might still have
 355         children.
 356
 357     (I think I've been here before).
 358     A worst-case scenario involves writing one block and that requires
  359       splitting every index up the tree to the inode.  This requires
  360       arbitrarily many pages to be allocated.  To accommodate this we either
 361       pre-allocate a spare page at every level of the tree down to the data
 362       block (a bit like storage space allocation) which seems very wasteful,
 363       or we make sure we can release one of the split pages, which seems impossible.
 364
 365     I could decide not to worry about it.  Have a pool of index pages and hope
  366      it always works.  After all, most pages are data pages, and they can be
 367      freed successfully.  We would only have a deadlock if all dirty memory were
 368      index pages, and that seems unbelievably unlikely.  If we trigger a
 369      checkpoint when the count of locked-pages hits some limit we should be
 370      safe.
 371
 372     So: Keep one table per index block.  Use simple append and sequential search.
 373      When table gets full, force an incorporation
 374
 375      Do we allocate the table separately, or embed it in the indexblock??
 376
 377      Probably embed it.  indexblocks that don't need it can be freed at any
 378      time so that space waste hopefully isn't significant.
 379
 380      How big?
 381       If the file is written sequentially, then everything should gather into
 382       extents, and so it doesn't need to be enormous.
 383       If the file is written randomly then the index block can be expected to
 384       be 'indirect', so incorporation will be cheap.
 385      So 'small' seems ok in both cases.
 386
 387      Let's say 8.
 388
 389      But wait a minute.....
 390      On a checkpoint we can be getting phys updates for prev and next phases.
 391      next-phase updates cannot be incorporated until the indexblock has passed
 392      on to the next phase.  So in that case, I think we still keep a linked
 393      list of unincorporated blocks and live with the fact that we cannot
 394      free them until the phase change passes.  That shouldn't be a big problem
 395      as it is a limited time frame - especially for data blocks..
 396
 397      But does this solve our initial problem??
 398      During roll-forward we want to keep the addresses but not the blocks,
 399      and we don't want to force incorporation. That means an arbitrary list
 400      of addresses attached to an index block.
 401      I guess we could possibly allow incorporation, but I would rather not
 402      as I want the fs to be able to be read-only nicely.
 403      So that means we need to have a list of address tables.
 404      Maybe the normal approach is 'add a table if possible, else incorporate'?
 405
 406      OUCH... we may write a block a second time before incorporating the
 407      new address, so when adding an address to the table we need to check
 408      if it already exists.  That could be expensive.
 409      For index blocks might it even be a different address?  I think
 410      not but the vague possibility (in the future?) does complicate
  411      things somewhat.  Maybe we just keep things in chronological order and
 412      don't worry about duplicates until incorporate time, when we have to
 413      sort anyway.
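
A minimal sketch of the "simple append, sequential search, keep things
in order of arrival" idea above.  The names (uninc_table, uninc_add,
uninc_lookup) and the field widths are assumptions for illustration,
and extents are left out.  Because entries stay in the order they were
written, lookup scans from newest to oldest so a rewritten block
resolves to its latest address without any duplicate check at add time.

```c
#include <assert.h>
#include <stdint.h>

#define UNINC_SLOTS 8            /* "Let's say 8" */

struct uninc_ent {
    uint32_t fileaddr;           /* block number within the file */
    uint64_t physaddr;           /* where it was written */
};

struct uninc_table {
    int nr;                      /* slots in use */
    struct uninc_ent ent[UNINC_SLOTS];
};

/* Append a new address.  Returns 0 on success, or -1 when the table is
 * full, in which case the caller has to incorporate or add a table. */
static int uninc_add(struct uninc_table *t, uint32_t fileaddr,
                     uint64_t physaddr)
{
    if (t->nr >= UNINC_SLOTS)
        return -1;
    t->ent[t->nr].fileaddr = fileaddr;
    t->ent[t->nr].physaddr = physaddr;
    t->nr++;
    return 0;
}

/* Sequential search, newest entry first, so the latest write wins.
 * Returns 1 and fills *physaddr if found, else 0. */
static int uninc_lookup(const struct uninc_table *t, uint32_t fileaddr,
                        uint64_t *physaddr)
{
    int i;

    for (i = t->nr - 1; i >= 0; i--) {
        if (t->ent[i].fileaddr == fileaddr) {
            *physaddr = t->ent[i].physaddr;
            return 1;
        }
    }
    return 0;
}
```

Duplicates are only dealt with when the table is sorted for
incorporation, which matches the "don't worry until incorporate time"
choice above.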
 414
 415
 416      todo:
 417         lafs_find_block  DONE
 418         free_block must free tables DONE
 419
 420
 421      Unmounting still doesn't work.
 422      Problem is that an index block is holding a reference on parent,
 423      and parent references aren't getting cleaned up.
 424      On read-only unmount I guess we need to walk the list of leafs,
 425      discard any address info, and unlock the blocks.
 426      So that should be the first task for next time.
 427
 428 27jul2006
 429   Leafs are locked blocks which have no locked children.
 430   So any locked data block (non-inode) is a leaf
 431   Any locked index block with lockcnt[phase] 0 is a leaf.
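
The leaf rule above written out as a predicate.  This is a sketch: the
struct layout and names are invented here, and the note does not say
what to do with inode data blocks, so the sketch simply answers "no"
for them.

```c
#include <assert.h>
#include <stdbool.h>

enum blk_kind { BLK_DATA, BLK_INODE_DATA, BLK_INDEX };

/* Hypothetical cut-down block; the real one carries much more. */
struct blk {
    enum blk_kind kind;
    bool locked;
    int  phase;          /* 0 or 1: phase the block is locked in */
    int  lockcnt[2];     /* locked children per phase (index blocks) */
};

/* A leaf is a locked block with no locked children. */
static bool blk_is_leaf(const struct blk *b)
{
    if (!b->locked)
        return false;
    switch (b->kind) {
    case BLK_DATA:
        return true;                          /* data blocks have no children */
    case BLK_INDEX:
        return b->lockcnt[b->phase] == 0;     /* no locked children this phase */
    default:
        return false;                         /* inode data: not covered by the rule */
    }
}
```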
 432
 433   OK - fixed numerous bugs, but I can unmount now!!
 434   I can even rmmod and insmod and all is cool.
 435
 436
 437 TODO:
 438  - review refile and get all the code in there from prototype
 439        DONE (I hope)
 440  - write a combined find/load/wait function and use it
 441        DONE
 442  - allocate inodes in single memcache and avoid generic_ip
 443        HALF DONE. (still using kmalloc, not doing initonce well)
 444  - review recording of new block addresses
 445     + make sure we lookup there on index lookup - YES
  446     + make sure ->uninc_next gets transferred to table at phase change.
 447     + write incorporation code as it is tricky
 448  - review how directory updates can be incorporated into a RO filesystem.
 449     No, they cannot.  We need to update the directory.
 450  - write directory update code
 451  - write cluster construction code
 452  - make sure indexblocks with unincorporated addresses get on to inc_pending
 453     ?? or is locking them enough?
 454
 455
 456 INCORPORATION - ARgggghhhhh.
 457  The current uninc_table doesn't really lend itself to building
 458   index block... though maybe....
 459  Question: what happens when an index block disappears? i.e. it has no
 460   addresses in it?
 461   We clearly need to remove it from the parent.  This should be trivial,
  462   a direct operation on the parent index block, e.g. set some number to 0.
  463   Then the next incorporation pass will simply lose that entry.
 464
 465  OK, that might be all well and good, but how do we sort unincorporated
 466   addresses so we can merge them?
 467  A linked-list merge sort is nice and open-ended, but does waste
 468   quite a bit of space in pointers.
 469
 470  Or maybe I should just always do small-table incorporations.
 471  Is there a way that a bad ordering of writes could force very bad
 472   index layout in this case? i.e. cause a table split every time,
 473   but new blocks go in the first (full) table.
 474  OK Decision: always do small-table incorporation.
 475   i.e. not a list of blocks: just a table of addresses.
 476
 477  FIXME check validity of index type when it is first read in,
 478    and reject early if it cannot be recognised.
 479
 480 24aug2006
 481  Took a break from incorporation.
 482  Looking at directories.
 483  Wrote dir.doc in module to sum lots of stuff up.
 484  Issue:
 485    dir blocks have an info structure attached.
 486    This included a counted reference to the parent.
 487    How long does this need to hang around for??
 488
 489    - when there is any orphan issue happening, it must stay, via
 490      the 'pinned' flag.
 491    - when actually performing a dir op, we need to create and
 492      maintain this info.
 493
 494    When last ref of a dir block is dropped, should drop
 495    the parent reference.
 496
 497
 498  Status:
 499     free list management mostly done.
 500     Next:
 501       create/delete prepare/commit/abort
 502       orphan handling
 503       dirty_block lock_block
 504
 505
 506  FIXME should dir_new_block zero out the block?
 507    How will commit_create know what to do with this block?
 508
 509  NOTE another type of directory orphan is a free leaf block which
 510    is on the part-free list.
 511
 512 -------------------------------------------------------------
  513 09sep2006 - on the plane to Frankfurt
 514  Don't tell me I am rethinking preallocation again ???
 515
 516  TODO
 517    dirty_inode needs to record the phase it is dirty in
 518    inode_fillblock needs to check current phase and act accordingly.
  519      see inode.doc
 520    Make sure the B_Orphan flag is set and used - or discard it.
 521
 522    How do we commit creating a symlink?
 523    If it is a full block in size we cannot make an update record.
 524     - maybe have two update records? We cannot guarantee they are in
 525       the same  cluster.
 526     ... but if we put the 'make dir entry' last it should work.
 527
 528    Change 'struct descriptor' definition
 529    the 'block_type' aka 'length' 16 field becomes
 530       0x0000 -> 0x8000 -> datablock, possibly a hole - upto 32K.
 531       0x8001 -> 0xc000 -> miniblock upto 16K+
 532       0xffff           -> index block.
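
The proposed ranges for the 16-bit 'block_type' field, sketched as a
classifier.  The enum and function names are illustrative only.  The
note does not assign 0xc001..0xfffe, so the sketch treats those values
as invalid; that is an assumption.

```c
#include <assert.h>
#include <stdint.h>

enum desc_kind { DESC_DATA, DESC_MINI, DESC_INDEX, DESC_INVALID };

/* Map a descriptor's block_type value onto the ranges listed above. */
static enum desc_kind desc_classify(uint16_t block_type)
{
    if (block_type <= 0x8000)
        return DESC_DATA;        /* datablock, possibly a hole */
    if (block_type <= 0xc000)
        return DESC_MINI;        /* miniblock */
    if (block_type == 0xffff)
        return DESC_INDEX;       /* index block */
    return DESC_INVALID;         /* 0xc001..0xfffe: unassigned in the note */
}
```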
 533
 534    Need to write IO routines which decrease pending-block-count in
 535      'wc'.
 536
 537
 538    Thinks.  a 1TB filesystem with 1K blocks and 4096 blocks/seg
 539      gives 4Meg segments. That would be 256K segments which at 2 bytes per segment
 540      - 512 segments per block - is 512 blocks in each seg usage file
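
The arithmetic above, spelled out so other geometries can be tried.
Names are made up for the sketch; "1TB" is taken as 2^40 bytes and
"1K" as 1024 bytes.

```c
#include <assert.h>
#include <stdint.h>

struct seg_geom {
    uint64_t seg_bytes;       /* bytes per segment */
    uint64_t nr_segs;         /* segments in the filesystem */
    uint64_t ents_per_block;  /* usage entries that fit in one block */
    uint64_t usage_blocks;    /* blocks in each seg usage file */
};

static struct seg_geom seg_geometry(uint64_t fs_bytes, uint64_t block_bytes,
                                    uint64_t blocks_per_seg,
                                    uint64_t bytes_per_ent)
{
    struct seg_geom g;

    g.seg_bytes = block_bytes * blocks_per_seg;
    g.nr_segs = fs_bytes / g.seg_bytes;
    g.ents_per_block = block_bytes / bytes_per_ent;
    /* round up so a partly filled last block is still counted */
    g.usage_blocks = (g.nr_segs + g.ents_per_block - 1) / g.ents_per_block;
    return g;
}
```

With seg_geometry(1ULL << 40, 1024, 4096, 2) this gives 4 MiB segments,
262144 (256K) segments, 512 entries per block and 512 blocks per seg
usage file, agreeing with the figures above.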
 541
 542 12oct2006
 543  Need to write
 544  - lafs_lock_{d,}block  DONE
 545        Make sure the block has parents and allocation and set the locked
 546        flag and the phase.
 547
 548  - lafs_flush
 549        Given a datablock, wait for it to be written out
 550        This is needed before updating a block that is still locked in the
 551        previous phase.
 552  - lafs_inode_init
 553        Used when creating a new object/inode
 554        Given a datablock which is to hold the inode
 555          and a type (Type*) and a mode,
 556        Fill in the data block with appropriate data so that
 557           when lafs_import_inode looks at it, the right stuff happens.
 558  - phase_flip
 559  - lafs_prealloc
 560  - lafs_seg_ref
 561  - lafs_lock_inode
 562
 563 lafs_dirty_dblock
 564 lafs_cleaner_pause
 565 lafs_dirty_inode
 566 lafs_seg_flush_all
 567 lafs_write_all_super
 568 lafs_quota_flush
 569 lafs_space_use
 570 lafs_cluster_update_abort
 571 lafs_cluster_update_commit_buf
 572 lafs_cluster_update_commit
 573 lafs_seg_apply_all
 574 lafs_cluster_update_prepare
 575 lafs_inode_phase_check
 576 lafs_seg_dup
 577 lafs_dirty_block
 578 lafs_cluster_update_lock
 579 lafs_checkpoint_unlock_wait
 580 lafs_orphan_drop
 581 lafs_free_get
 582 lafs_find_next
 583
 584 2nov2006
 585  - I need to know if a block is undergoing write-io so that I can
 586    avoid modifying it in certain circumstances.  But I don't track
 587    this information.  Options:
 588     1/ track the info.  This means an extra field in the 'struct block'
 589         because I still need to know which wc has had a write.
 590     2/ For blocks that we care about copy the data on write...
 591         But we care about all inodes and directory blocks.  That is a waste.
 592    I think we put extra info in the block.
 593    We need to know which wc was used (0,1,2) and which pending cluster
 594    in there (0-3) which comes to 4 bits.
 595    But we only care about the block for wc=0. and we could include the
 596    which-pending in the b_end_io, or maybe put it all in low bits
 597    of the block pointer....  Need max 4 bits.  Can only be sure of 2...
 598
 599    Maybe:
 600        'which' goes in bottom two bits of bi_private
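
A sketch of the "bottom two bits" trick, in plain user-space C.  The
helper names are invented; the only real requirement is that the
pointer stored in bi_private is at least 4-byte aligned, so its two low
bits are free to carry 'which' (0-3).

```c
#include <assert.h>
#include <stdint.h>

#define WHICH_MASK ((uintptr_t)3)

/* Fold a 2-bit 'which' into the low bits of an aligned pointer. */
static void *pack_private(void *ptr, unsigned which)
{
    uintptr_t p = (uintptr_t)ptr;

    assert((p & WHICH_MASK) == 0);   /* pointer must be 4-byte aligned */
    assert(which <= 3);
    return (void *)(p | which);
}

/* Recover the original pointer. */
static void *private_ptr(void *priv)
{
    return (void *)((uintptr_t)priv & ~WHICH_MASK);
}

/* Recover the pending-cluster number. */
static unsigned private_which(void *priv)
{
    return (unsigned)((uintptr_t)priv & WHICH_MASK);
}
```

This covers only 'which'; 'wc' would still need its bits in ->flags as
suggested above.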
 601        'wc' goes in ->flags
 602
 603
 604 4apr2007  (What a long gap !!)
 605
 606  - lafs_cluster_update_*
 607    How do we prepare for a cluster update?  How do we lock it.
 608
 609    The important thing is that the update can be written.  That
 610    requires that there is space available.  So we need to preallocate
 611    space and then release it.
 612    It is possible that each update might go in a different cluster, so maybe
 613    we need to preallocate one block per update.  That sounds a little expensive.
 614    After all, we aren't preallocating a cluster block for every data block
 615    that is dirty.
 616    So: prepare does nothing
 617         lock preallocates the space - a full block.
 618         commit copies it in.
 619     For now at least.
 620
 621 24May2007
 622
 623  - Can now create and delete lots of files.  This is cool.
 624   But:
 625     Orphan slots just grow and grow - never to be reclaimed - why?
  626     After rm f*, 7 files remain.  but rm f* again and they go.
 627          FIXED - readdir wasn't returning them
 628     Size of directory remains large.
 629     And sometimes, files become ghosts... (try just removing one after first rm f*).
 630
 631   TODO - process those orphans to clean up the directory.
 632
 633 20June2007 (Happy Birthday Dad)
 634
  635  - Creating lots of files and then deleting them leaves 5 orphan slots
 636    for the directory busy, and one for inode 0??
 637
 638    Directory handling uses the following orphans:
 639     CREATE:
 640         A new index block is created by splitting.  This needs to be linked in.
 641     DELETE:
 642         The dirent block we are deleting from
 643            If it becomes empty, it needs to go on free list
 644         The index block we are deleting from
 645            If it has lots of free space it might need to be rebalanced.
 646      The inode that was deleted.
 647
 648
 649  - When a file is fully deleted, we need to drop any orphan info... DONE
 650  - Need to do orphan handling of free blocks in directory, and
 651    unmerged parents - but there doesn't seem much point as I am going to
 652    change the directory layout (again).
 653
 654  So: writing to a file.
 655    We need prepare_write, commit_write, and writepage.
 656    Prepare loads and links the page and checks there is space.
 657    commit marks it as dirty so writeout is possible.
 658    writepage chooses a page to write out
 659
 660 25June2007 - HACK week, thanks Novell!!
 661  - write - DONE
 662  - sync
 663      Somewhat done.
 664      Need to revise the process whereby async completion
  665      clears PageWriteback,
 666      We need locking in there, and need to worry about
 667        'which' wrapping too soon.
 668      Need to not start IO before we set page writeback
 669  - chmod
 670      Maybe, but syncing to disk needs more thought.
 671  - 'df'
 672     Partly done, need actual content.
 673  - mkdir
 674     Can make directory, but creating first entry fails. - FIXED
 675  - symlink
 676  - readlink
 677  - new directory structure.
 678
 679 27Jun2007 - More HACK week :-)
 680
 681  - new directory layout done - much easier!!
 682  - If I delete a file that was created, the blocks still have a ref-count
 683    and we crash.
 684  - mkdir doesn't increase link count on parent. - FIXED
 685
 686  TODO:
 687    Orphan handling.
 688      Infrastructure to process orphans
 689      Handle specific cases
 690      flush orphans at key times.
 691      load orphans at roll-forward
 692
 693    checkpoint
 694      Write out a checkpoint (when?)
 695      Make sure refcount goes back to zero on blocks I write.
 696
 697   Check on inode_phase_check and checkpoint_unlock and inode_dirty
 698    in all directory operations.
 699
 700  FIX: Writing a small file leaves something non-dirty but
 701     due to be written, and lafs_cluster_allocate complains.
 702   - seems to work now.
 703
 704  FIX: dir_handle_orphan doesn't lock the orphan transaction required.
 705
  706  FIX: rm a file with (small) content hangs waiting in sync_page in truncate_inode_pages.
 707
 708  FIX: lafs_allocate hasn't been written!!!
 709
 710  FIX: before updating any block in a depth=0 file, we must first load
 711       and 'lock' block 0.
 712
 713 29Jun2007 - still HACK week.
 714   Summary of how incorporation works.
 715
  716   Each index block has a small table for unincorporated changes. i.e.
  717   block numbers and their addresses.
 718   This supports efficient storage of extents, and is extensible by allocating
 719   more tables.  This last is done rarely.
 720
 721   When a block gets a new address, this is added to the table or, if
  722   there is a phase mismatch, it is added to a list until a phase change
 723   happens (so the whole block is pinned pending the phase change).
 724
 725   If the table is full then:
 726    - if the filesystem is read-only (including during roll-forward),
 727      a new table is allocated (else rollforward fails).
 728    - otherwise we incorporate the table into the block, then add the new
 729      address to the (now empty) table.
 730
 731   If incorporation requires that we split the index block we allocate one
 732    from a pool.  If there are none in the pool, we wait.
 733
 734   As the table is much smaller than a block, the incorporation into
  735   two blocks will always succeed.
 736   The 'uninc_next' and 'children' lists will then need to be shared
 737   between the two blocks before the new address is added to whichever
 738   table is appropriate.
 739
 740   When looking for a block address, we must always check the table and
 741   then children lists.  We do not need to check uninc_next as they will always
 742   be children.
 743
 744   How to ensure that the pool always has sufficient index blocks and we don't
 745   deadlock?
 746   We have two halves of the table, one for each phase.  Before we allow
 747   a block to be dirty in a phase, we ensure that the pool has adequate
 748   index blocks for that phase.  e.g. twice the depth of the block.  If it
 749   doesn't we block the dirtying until space becomes available.
 750   For syscall writes, this is easy as we catch in prepare_write.
 751   When we perform a phase change, we must be sure there are enough index
  752   blocks for the deepest block that will stay dirty.  If there aren't, we need
  753   to flush all dirty blocks, and unmap all writable mappings before
 754   starting the checkpoint.
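
One way the per-phase pool reservation might look.  This is a sketch
under assumptions: the names are invented, "twice the depth" is the
example figure from the text rather than a settled rule, and a real
version would sleep instead of returning false.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical pool of spare index blocks, split by phase. */
struct idx_pool {
    int free;            /* index blocks not promised to anyone */
    int reserved[2];     /* blocks promised to each phase */
};

/* Reserve enough index blocks to dirty a block at 'depth' in 'phase'.
 * Returns false if the pool is short, in which case the caller must
 * hold off dirtying the block until space becomes available. */
static bool pool_reserve(struct idx_pool *p, int phase, int depth)
{
    int need = 2 * depth;

    if (p->free < need)
        return false;
    p->free -= need;
    p->reserved[phase] += need;
    return true;
}

/* Return 'n' unused reserved blocks from 'phase' to the pool. */
static void pool_release(struct idx_pool *p, int phase, int n)
{
    assert(n <= p->reserved[phase]);
    p->reserved[phase] -= n;
    p->free += n;
}
```

For syscall writes the reserve call would sit in prepare_write, where a
failure can be turned into a wait.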
 755
 756
 757  FIX: need to work out life time rules so that inodes hang around while they have blocks.
 758     currently have an igrab that is never put.
 759
 760  FIX: Dirty isn't cleared until 'flush', but do_checkpoints requires 'alloc' to clear it.
 761
 762 3Jul2007
 763  Checkpoint flushing is getting close.
 764  Current problem.
 765    InoIdx blocks are not changing phase.
 766    Phase change should happen when all children have been incorporated, and
 767     then the write has been triggered marking us clean.
 768   For InoIdx blocks, we need to be marked clean when the data block
 769    completes.
 770
 771 5jul2007 - a week off
 772  Checkpoint flushing seems to work !!!!
 773  FIX: what should filesize of symlink be?
 774      other filesystems use len, but still zero-terminate for vfs.
 775
 776  Problem.  A chmod is followed immediately by an unlink then a checkpoint.
 777    The chmod update gets into the checkpoint cluster, but the unlink completes
 778    before the checkpoint is finished so the new superblock sees the file
 779    as gone.  Roll-forward finds the update and wants to update a missing file.
 780
 781    This isn't a big problem, but with slightly different details, it could be.
 782
 783    One option is to ignore updates that precede the updated block.  That might
 784    be awkward with e.g. directory updates and checkpoints that cross multiple
 785    segments.
 786
 787    Another option might be to prohibit updates once a checkpoint has started
 788    unless they are known to be after the phase change.
 789
 790  FIX: unlink isn't punching a hole in the inode file.
 791       Inode usage map isn't being updated. - FIXED (For create, not unlink).
 792
 793  FIX: roll forward does not pick up inodes, only data blocks.
 794     But tiny files are synced to inode, so they might not be picked up.
 795     So we must process a level=0 inode like a data block.
 796
 797 6July2007
 798  Time for lots of clean up.
 799
 800 DONE 1/ Index blocks to fill with 0 - use phys=0 to imply invalid.
 801 DONE 2/ rename 'lock' -> 'pin'
 802  3/ Review and fix up all locking/refcounts.  See locking.doc
 803 DONE 3a/ Make sure cluster_allocate can be called concurrently. e.g. check
 804          B_Alloc inside the semaphore
 805        Also lock inode when copying in block 0 and probably
 806        when calling lafs_inode_fillblock (??)
 807 DONE 3b/ lafs_incorporate must take a copy of the table under a lock so
 808          more allocations can come in at any time.
 809 NotYet 3c/ cluster_flush should start all writes before calling _allocate
 810          as _allocate might block on incorporation/splitting.
 811        No.  We really want _allocate to not block, but to queue...
 812         I think this is too hard to get perfect just now, so I will leave it.
 813 DONE  3d/ introduce PinPending for data blocks.  remove fs->phase_depth.
 814 LATER 3e/ Index needs a clean-lru on each filesystem, and a list of filesystems
 815      so that locking of lru doesn't have to be too global
 816 DONE 3f/ change wc[]->hlhead to be a regular listhead as it is part of the
 817        lru system.
 818 DONE 3g/ revise refile lru handling based on new understanding
 819  3h/ Utilise WritePhase bit, to be cleared when write completes.
 820      In particular, find when to wait for Alloc to be cleared if
 821       WritePhase doesn't match Phase.
 822        - when about to perform an incorporation.
 823  3i/ make sure we don't re-cluster_allocate until old-phase address has
 824      been recorded for incorporation.
 825  3j/ Check that index blocks cannot race when getting locked....
 826   k/ Check what locking is needed to set PagePrivate exclusively.
 827 DONE  l/ cluster_done needs to call refile, but is called in interrupt context.
 828      We need to get it done in process context I think and lock
 829       ->waiting access with fs->lock after changing it to ->lru
 830 DONE  m/ Need to know which blocks in a page are in writeback so we can clear writeback
 831         only when *all* have finished.
 832 DONE  n/ on phase change, uninc_next blocks need to be shared out.
 833 NO 3o/ Make sure lafs_refile can be called from irq context.
 834  3p/   lock all lru accesses.
 835  3q/ Lock those index blocks!!!
 836  3r/ Can inode data block be on leafs while index isn't, what happens if we
 837        try to write it out...
 838  FIXED Why are extent entries only grouped in 4s?
 839  If InoIdx doesn't exist, then write_inode must write the data block.
 840  4/ resolve length of symlink
 841    FIXED - long symlink followed by 'sync' crashes.
 842    FIXED - rollforward isn't calling 'allocated' on blocks, or something
 843    FIXED - I cannot find 'bfile'. (inode isn't written)
 844    SEEMS OK...- Must flush final segment of a cluster properly...
 845  5/ Review what does, and does not need to be initialised in a new datablock
 846  6/ document and review all guards against dirtying a block from a previous phase
 847     that is not yet safe on storage.
 848           See lafs_dirty_dblock.
 849  7/ check for proper handling of error conditions
 850      a/ checkpoint_start might fail to start a thread!
 851      b/ if lafs_seg_dup fails in do_checkpoint, we need to abort the snapshot.
 852  8/ review checkpoint loop.
 853        Should anything be explicit, or will refile do whatever is needed?
 854  9/ Waiting.
 855        What should checkpoint_unlock_wait wait for?
 856        When do we need to wait for blocks to change state?  And how?
 857 DONE 10/ rebase on 2.6.current
 858 DONE     - use s_blocksize / s_blocksize_bits rather than fs->
 859
 860  11/ load/dirty block0 before dirtying any other block in depth=0 file
 861  12/ Add writecluster flag for old-phase updates.
 862      Why is this needed?  updates should always go in the new phase???
 863  13/ use kmem_cache for 'struct datablock'
 864  14/ indexblock allocation.
 865         use kmem_cache
 866         allocate the 'data' buffer late for InoIdx block.
 867         trigger flushing when space is tight
 868         Understand exactly when make_iblock should be called, and make it so.
 869  15/ use a mempool for skippoints in cluster.c
 870  16/ Review seg addressing code in cluster.c and make sure comments are good.
 871 DONE 17/ Make sure create inherits uid etc from process.
 872  18/ consider ranges of holes in pending_addr.
 873
 874 DONE 20/ Implement rest of "incorporate"
 875 DONE 21/ Implement staged truncate
 876 DONE         use for setattr and delete_inode
 877 DONE 22/ block usage counts.
 878  23/ review segment usage /youth handling and make a todo list.
 879       a/ Understand ref counting on segments and get it right.
 880  24/ Choose when to use VerifyNull and when to use VerifyNext2.
 881  25/ Store accesstime in separate (non-logged) file.
 882  26/ quotas.
 883         make sure files are released on unmount.
 884
 885  30/ cleaner.
 886        Support 'peer' lists and peer_find. etc
 887  31/ subordinate filesystems:
 888      a/ ss[]->rootdir needs to be an array or list.
 889      b/ lafs_iget_fs need to understand these.
 890  32/ review snapshots.
 891       How to create
 892       how they can fail / how to abort
 893       How to destroy
 894  33/ review unmount
 895       - need to clean up checkpoint thread cleanly - be sure it has fully exited.
 896  34/ review roll-forward
 897       - make sure files with nlink=0 are handled well.
 898       - sanity check various values before trusting clusters.
 899
 900  34/ Configure index block hash_table at run time based on memory size??
 901  35/ striped layout.
 902          Review everything that needs to handle laying out at cluster
 903          aligned for striping.
 904
 905  36/ consider how to handle IO errors in detail, and implement it.
 906  37/ consider how to handle data corruption in indexing and directories and
 907      other metadata and guard against problems (lot of -EIO I suspect).
 908
 909  - check all uninc_table accesses are locked if needed.
 910
 911 And more:
 912   1/ fs->pending_orphans and inode->orphans are largely unused!
 913   2/ If a datablock is memory mapped writeable, then when we write it out,
 914      we need to either fill up its credits again, or unmap it.
 915   3/ Need to handle orphans asynchronously.
 916
 917 ---------
 918 22nov2007
 919 Free index blocks are on two lists, both protected by the global
 920 hash_lock.
 921   1/ The per-inode free_index, so they can be destroyed with the inode
 922   2/ The global freelist so they can be freed by memory pressure.
 923
 924 11feb2008.   Where was I up to again?
 925    reviewing phase_flip and lafs_refile.
 926
 927   UPTO
 928      Reading through modify.c, at 'add_indirect'.  Plan to fix all this code.
 929      Need to think about how index blocks really change.  How old blocks get
 930       dis-counted from segment usage, and what optimisations are really good
 931       for re-incorporating index blocks.
 932         Operations to consider are:
 933               i)Append new block, ii)truncate, iii)over-write, iv)fill-hole.
 934           i/ leaf block splits, index block gets new entry at end, and replacement
 935                   for other entry.  Easy to handle
 936          ii/ trailing entries are zeroed.  Should be easy, but isn't yet.
 937         iii/ probably caught in leafs.  May cause internal split so we add new
 938              index address, which is easily handled if there is space.
 939          iv/ same as iii, though split more likely.
 940
 941        What about merging index blocks.  That just makes addresses disappear, which
 942         we handle the slow way.
 943        Do we ever re-target index blocks?  Would need to be careful about that.
 944        Make it look like a split where one block ends up empty as a hole.
 945      Need to write
 946            grow_index_tree (DONE - untested)
 947                   ib is a leaf inode that is getting full.  Copy addresses
 948                   into 'new', and make 'ib' an index block pointing at new.
 949
 950            add_index/walk index (DONE - untested)
 951
 952            end of do_incorporate (DONE - untested)
 953                 new contains the early addresses.  Some remain in ib
 954                  and/or ui.
 955                 the buffers must be swapped, so ib has the early address.
 956                 ui needs to be attached to new
 957                 return 2; - then new uninc needs to be split
 958
 959            lafs_incorporate
 960                 case 2 - horizontal split
 961                 case 3 - vertical split
 962   12feb2008
 963    Bother - uninc_table is a problem (again).
 964    We can currently add at any time with just a spinlock.
 965    So when we split a block horizontally,
 966
 967
 968    Still need to
 969           share out children and uninc_table in do_incorporate
 970           share out credits in do_incorporate
 971
 972 14feb2008
 973    Still need to do incorporate as above but took a break to...
 974
 975    Counting allocated blocks now works - stat shows the right info, hopefully
 976      storage is correct too. - DONE
 977
 978    next: truncate?  orphan thread?
 979       Then segment usage and the cleaner.
 980
 981
 982    thoughts:
 983     truncate - removing blocks doesn't need to erase them...
 984     - nothing forces a cluster_flush promptly!!!  We need a timeout
 985          or at least we need a flush before truncate_inode_pages...
 986
 987     - in lafs_truncate we need to make the block an orphan and pin it
 988       all in a checkpoint.
 989
 990 21Feb2008 (Research morning)
 991    Discard checkpoint thread created on demand in favour of a cleaner
 992    thread that runs all the time.  It cleans and checkpoints and
 993    orphans and scans.
 994
 995      want to:
 996         do segment scan and get a real list of free segments and
 997         free-space info!
 998
 999 25Feb2008
1000  - segment usage scanning to count free blocks
1001  - fix up re-reading of erased blocks
1002  - FIX truncate can still block waiting for writeback to complete.
1003  - FIX allocations aren't failing when we run out of free space
1004  - FIX df doesn't agree with du.
1005
1006  problem:
1007    Truncate when an index block has addresses in uninc_table.
1008      The summary for the new address has already been performed.
1009      We need to deallocate the new without disturbing the old.
1010      However a simple allocation may not be possible.
1011      I guess we can prune them all to zero, then incorporation
1012       can proceed.
1013
1014  TOFIX: when truncating a recently created file, it is still depth=0 so
1015     nothing happens.
1016     We really need to increase the depth to 1 as soon as we dirty
1017     any block, then reset back to 0 if it fits.
1018
1019 26Feb2008
1020   We have a file that we have written to, and the data blocks have been
1021   written out and the addresses stuck in uninc_table.
1022   We then truncate the file.  Who releases the usage of those blocks?
1023   And who removes them from uninc_table?
1024
1025   OK, 'rm' returns all the blocks back now so 'df' is almost the same as 'du'.
1026   I really should make sure that inodes are getting freed properly and the
1027   inode map is clean and everything.
1028
1029   BIG QUESTION
1030     Do we reserve segment-usage blocks.
1031      We cannot do it naively as we get infinite recursion.
1032      But we need it to be allowed to dirty the segment block.
1033      But we cannot pin them to this phase as we want to write them out
1034      after this phase
1035      This still needs more thought.  I avoided the recursion by setting SegRef
1036      before getting the ref.  But that isn't safe.
1037
1038 28Feb2008
1039   The table of cleanable segments is not working out.  Each segment appears multiple
1040   times which wastes space and adds confusion.
1041   We really want to be able to lookup by dev/seg and also find the least.
1042   'Find least' sounds like we want a heap but then we cannot discard the bottom half.
1043
1044   We could have a skiplist for dev/segment lookup and do a merge-sort on
1045   a different link when we want to find the best segment.
1046   We then remember the best number found since a sort, and re-sort if the top
1047   is worse than the best.
1048
1049   We keep all this in a fixed size table.  Each entry has
1050    seg,dev,usage,weight,weight-sort-link,addr-sort-link and possibly some
1051       addr-sort-skip links.
1052    This is 32+32+16+16+16+16 bits, or 16 bytes or bigger.
1053    Say 16bytes, 24bytes, or 32 bytes. (depth 8, which is plenty).
1054    One page of 16byte entries (256 of them)
1055    2/3 page of 24byte entries, 1/3 of 32byte entries.
1056    Total 2 pages, and 256+113+43 = 412 entries.
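
       The arithmetic above can be checked with a short sketch.  It assumes
       4096-byte pages, 16-bit links, and that the 32-byte entries take
       whatever is left of the second page after the 24-byte entries.  None
       of those are stated above, but together they reproduce 256+113+43.
       The struct and function names are made up.

```c
#include <assert.h>

#define PAGE_BYTES      4096    /* assumed page size */
#define LINK_BYTES      2       /* assumed: each sort/skip link is 16 bits */
/* seg, dev, usage, weight, weight-sort-link, addr-sort-link */
#define BASE_ENTRY_BITS (32 + 32 + 16 + 16 + 16 + 16)

struct table_layout {
    int n16;    /* entries with no extra skip links */
    int n24;    /* entries with a few extra skip links */
    int n32;    /* entries with the most skip links */
};

/* Extra addr-sort-skip links that fit in an entry of the given size. */
static int extra_links(int entry_bytes)
{
    return (entry_bytes - BASE_ENTRY_BITS / 8) / LINK_BYTES;
}

/* One page of 16-byte entries, then a second page split roughly 2/3 : 1/3. */
static struct table_layout layout_two_pages(void)
{
    struct table_layout l;

    l.n16 = PAGE_BYTES / 16;
    l.n24 = (PAGE_BYTES * 2 / 3) / 24;
    l.n32 = (PAGE_BYTES - l.n24 * 24) / 32;     /* the remainder of page two */
    return l;
}
```

       With these assumptions a 24-byte entry has room for 4 extra links
       and a 32-byte entry for 8.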
1057
1058   But deleting random elements is awkward... but not too awkward.  We can delete
1059   lots of entries by marking them as old, then performing a single pass of the skip
1060   list deleting them.
1061
1062   We should keep free segments here too, on a separate list.
1063
1064   So how about:
1065    2 pages of 16byte entries
1066    1 page of 24
1067    1 page of 32
1068
1069   free list randomly threads through all.
1070
1071   When using from 24 or 32, randomly choose height of 2-5 or 2-9
1072   Two lists run through the skiplist entries.  One for cleanable, one for free.
1073   Remember the nth element for some small n (10, but it decreases as we pull
1074   things off the front) and if we add something less than that, we trigger a
1075   mergesort on the next time we want to clean.... maybe.
1076
1077   Remember end of free list and add to there.  Maybe merge-sort the free list
1078   by addr occasionally.
1079
1080   Questions:
1081     When can we clean, when can we free wrt checkpoints?
1082       - we can clean a segment as soon as we have a checkpoint after it.
1083         So we record the youth of the segment holding the (start of the)
1084         checkpoint, and can clean any segment with a lower youth.
1085       - we can free a segment after the checkpoint after its usage has reached
1086         zero.  So if usage is zero and youth....
1087         We could offset the usage by one (say - for the first cluster header..)
1088         then when we find a segment with usage of '1', we schedule an update to
1089         0 in the next checkpoint...
1090     How about segments with different sizes - they get different weights.
1091        Need to divide by segment size:  usage * youth / size.
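
       Sketch only: the two rules above (a segment is cleanable once its
       youth is lower than that of the segment holding the checkpoint, and
       weight = usage * youth / size) written as a linear scan.  struct
       seg_info and the function names are hypothetical.  Preferring the
       least weight is an assumption based on the "find the least" remark
       earlier, and the real table would use the sorted lists rather than
       a scan.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct seg_info {
    uint32_t usage;     /* live blocks in the segment */
    uint32_t youth;     /* larger means written more recently */
    uint32_t size;      /* segment size in blocks */
};

/* A segment may be cleaned once a checkpoint has been written after it. */
static int seg_cleanable(const struct seg_info *s, uint32_t checkpoint_youth)
{
    return s->youth < checkpoint_youth;
}

/* Weight normalised by segment size so that devices can be compared. */
static uint64_t seg_weight(const struct seg_info *s)
{
    return (uint64_t)s->usage * s->youth / s->size;
}

/* Index of the cleanable segment with the least weight, or -1 if none. */
static int pick_segment(const struct seg_info *segs, size_t n,
                        uint32_t checkpoint_youth)
{
    int best = -1;
    uint64_t best_weight = 0;

    for (size_t i = 0; i < n; i++) {
        if (segs[i].size == 0 || !seg_cleanable(&segs[i], checkpoint_youth))
            continue;
        uint64_t w = seg_weight(&segs[i]);
        if (best < 0 || w < best_weight) {
            best = (int)i;
            best_weight = w;
        }
    }
    return best;
}
```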
1092
1093   TOFIX
1094    - It seems I sometimes fall off the end of the last segment !!! - FIXED (locking)
1095    - We seem to switch to a new segment when still 83 blocks remaining? - FIXED (delete did flush)
1096
1097    - Lots of 'creates' makes lots of little clusters - need to optimise!
1098         Or it could be deletes as we currently cluster_flush for each
1099         delete.
1100          - I think this is fixed
1101
1102 29Feb2008
1103   Started looking at the cleaner.
1104   Need to understand how much to clean each checkpoint
1105   Need to track free-space-in-active-sectors while scanning.
1106
1107 3Mar2008
1108   TOFIX
1109     - the cluster head is currently limited to one page.  This is not good.
1110
1111     - Should the cleaner start before the scan is complete after a checkpoint?
1112       Probably it can, but while the scan is still happening it might be best
1113       to be cautious ??
1114
1115   STATE:
1116     try_clean is taking shape and has a few FIXMEs.
1117     need to write async find_block code and get it to watch for
1118        block in a cleaning segment.
1119
1120 28Mar2008
1121   - where can padding appear in a cluster? between miniblocks? at
1122     end of device blocks?
1123   - need to track phys block while parsing headers for cleaning.. why?
1124   - determine rules for avoiding block lookup during cleaning
1125     based on youth/snapshot age, and truncate generation.
1126      We need to load the inode from each snapshot
1127     Can we optimise based on snapshot age?
1128     only if we know the block is newer than the snapshot.
1129     So when we relocate blocks (cleaning) they must go in a segment
1130     that is marked as being old. We cannot really guarantee that.
1131     I guess blocks that are marked as 'new' can safely be skipped if
1132      segment is newer than snapshot. This 'age' is not the youth, but
1133     is the cluster_head->seq which is stored in creation_age.
1134
1135  - Store the rootdir for a filesystem in the metadata for the root inode.
1136    Then 'struct snapshot' doesn't need rootdir.  It can have a root
1137
1138 30Jun2008
1139   Looking at lafs_find_block_async.
1140      Needs async flag to make_iblock.
1141         Check that.  Can we block_adopt if there was an error?
1142              iblock will exist.
1143      setparent has async flag.
1144      lafs_leaf_find has async flag
1145      lafs_wait_block_async
1146
1147   FIXME I wakeup the cleaner every time an IO completes.
1148   Do I really want that?  Maybe only when number of async IOs hits
1149   half the recent maximum??
1150
1151   FIXME need to ensure that lafs_pin_dblock flushed committed
1152     B_Realloc blocks.
1153
1154   FIXME when we incorporate a dirty (non-realloc) address to an index block,
1155     we need to clear B_Realloc on the indexblock.
1156
1157   FIXME in lafs_incorporate we lafs_dirty_iblock 'new' without
1158     giving it any credits.  Where should they come from?
1159
1160   We don't seem to scan for free/cleanable segments often enough.
1161
1162   FIXME we shouldn't start a checkpoint while cleaning is happening.
1163
1164   FIXME need to be careful when cleaning about finding inodes that
1165     don't exist any more.
1166
1167   FIXME give credits to realloc blocks.
1168
1169   FIXME think about/document transitions between realloc and dirty,
1170     and what locking is needed.
1171
1172 2Jul2008
1173   Allowing for the FIXMEs above, the cleaner is now identifying
1174   blocks that need to be cleaned and marking them B_Realloc (I think).
1175   We now need to gather these into a write cluster and write them.
1176   They will all be on the clean_leafs list, so we can iterate that
1177    allocating or incorporating as needed.  This will be similar to
1178    do_checkpoint.
1179   Important question is: when?
1180    Ideally we would have some auto-flush mechanism.  The cleaner just
1181    keeps finding blocks to clean and when we start running out of
1182    resources we flush the cleaning queue.
1183    However we will still want to flush the cleaner always before a
1184    checkpoint, so for now we can implement that bit and wait for a
1185    need for the other to arise.
1186
1187
1188   FIXME: cleaner lookup of 0/0/0 has interesting consequences as we
1189       don't record that location the same way.. how to handle?
1190      Should check that 'adopt' doesn't do the wrong thing with this block.
1191
1192
1193   Realloc blocks need to be pinned.  That makes sense.  Only that way
1194     will they get onto the clean_leafs list.
1195   When checkpointing we should probably examine clean_leafs to be
1196    on the safe side.
1197
1198
1199   Realloc and Dirty:
1200
1201      Both of these hold a Credit.
1202      Both can be set at the same time.
1203      Cleaner ignores Dirty and sets Realloc anytime the block is in
1204       the wrong segment.  It also Pins the block.
1205      When the cleaner is flushing to the cleaning segment, it
1206       ignores Dirty blocks.  They get their Realloc cleared, but
1207       they remain pinned.  So they will get moved at the next checkpoint.
1208      How do we know whether an indexblock should be Dirty or Realloc?
1209       The Dirty/Realloc bit is cleared before we get to incorporation.
1210       Maybe we lafs_dirty_iblock the parent of any block we write
1211        out.  Then after incorporation, we set Realloc if it is not
1212        dirty.
1213
1214 STATUS:
1215   I think I'm pinning cleaner blocks now.
1216   Need to make sure the dirty ones are dropped. DONE
1217   Need to make sure the usage is transferred
1218   Need to get free segments back into use
1219   Need some more 'dump' options.  Maybe youth/usage files.
1220       Maybe tree.
1221   Need to make sure scan etc are triggered often enough.
1222
1223   FIXME lafs_prealloc walks up ->parent without locking
1224     I think we want i_mapping->private_lock like lafs_pin_iblock.
1225
1226 TODO:
1227   1/ a 'dump' option that triggers a scan and prints everything out.
1228   2/ scan must mark freeable as such, then subsequently free them.
1229   3/ Look at code that decreases usage of old segments.
1230   4/ Review lafs_cluster_wait_all and decide exactly how long we need
1231      to wait.
1232   5/ Review 'FIXME that is gross' HZ/10 thing.
1233   6/ Review 'wait for checkpoint to flush' msleep(500);
1234            Maybe remove that altogether.
1235
1236   FIXME BUG_ON in grow_index_tree fires. sync - writepages - flush
1237   FIXME BUG in lafs_allocated_block fired.
1238             from lafs_erase_dblock from invalidate_page from .. vmtruncate
1239              from lafs_setattr
1240
1241   Current problem:
1242     An inode data block is dirty and pinned, but the inoidx is no longer
1243        pinned.  Presumably it isn't dirty.
1244      Recheck what 'dirty' means on the two blocks and see how this can happen.
1245
1246 10july2008
1247   Tree gets very big!  Lots of 'Realloc' blocks that should
1248    be long gone.
1249
1250   We are spinning in cleaner again, and not in try_clean.
1251
1252   Is it a problem that 'Pinned' is used for Realloc and dirty blocks?
1253   In general it shouldn't be.  The flush_cleaner process will remove
1254    the Realloc bits so the blocks fall off clean_leafs.  They then either
1255    go onto phase_leafs or get unpinned.
1256   But I currently have a problem with InoIdx/data.
1257   The Pin is transferred to the Data block, but it doesn't go from the
1258    InoIdx block because it has a pincnt.  Now that is probably a bug, but
1259    what if it weren't?  What if, while we were cleaning, a block got dirtied.
1260    That would pin the whole tree.
1261   I guess the rule about not allocating an inodedata block while the
1262    InoIdx is pinned needs to be revised.  If the inodedata block is
1263    Realloc (and not Dirty) while the InoIdx is not Realloc, we
1264    can go ahead (in a cleaning segment).
1265
1266  FIXME to check:
1267    adir/big1 is garbage.... big1 was removed, so why is it even there?
1268               FIXED.
1269    echo tre > dump  # still too much stuff.
1270
1271
1272
1273  Put cond_resched in checkpoint loops!
1274
1275
1276  Thoughts about cleaning and pinning.
1277
1278   When cleaning we need to know how many dependant blocks are being cleaned
1279   so that we know when *this* block can be written - i.e. when the count hits 0.
1280   We cannot use the pincnt for this phase because there may be dependant blocks
1281   which are dirty.  They, and therefore this, may get flushed at next checkpoint,
1282   but they may not.  If we could be certain they would, we could just write
1283   to the clean-segment blocks which can become unpinned.  However if there
1284   is an index block being cleaned, and no dependant is being cleaned, but some
1285   are dirty but not pinned, then the checkpoint can go past without the block
1286   being moved.... but maybe we can detect that.
1287
1288   Try this:
1289     We set B_Realloc precisely on blocks found in segments being cleaned.
1290     We pin these blocks and leafs which are Realloc go in clean_leafs.
1291     If a block is both Realloc and Dirty we clear Realloc but leave pinned.
1292        That way it gets written at end of checkpoint, but to main cluster.
1293     When we incorporate Realloc blocks into an index block, it gets marked
1294        Realloc.  When we incorp dirty blocks, mark dirty.  Then see above.
1295     On a checkpoint, we process both phase_leafs and clean_leafs
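
       Sketch only: the "Try this" rules as a small flag model.  The bit
       values, the enum and the function names are invented here and are
       not the module's definitions.  Blocks are assumed to be leafs, and
       locking is ignored.

```c
#include <assert.h>

/* Hypothetical bit values, for illustration only. */
enum {
    B_DIRTY   = 1 << 0,
    B_REALLOC = 1 << 1,
    B_PINNED  = 1 << 2
};

enum leaf_list { LIST_NONE, LIST_PHASE_LEAFS, LIST_CLEAN_LEAFS };

/* The cleaner found this block in a segment that is being cleaned. */
static unsigned mark_for_cleaning(unsigned flags)
{
    flags |= B_REALLOC | B_PINNED;
    if (flags & B_DIRTY)
        flags &= ~B_REALLOC;    /* both set: clear Realloc, leave it pinned */
    return flags;
}

/* A pinned Realloc leaf goes on clean_leafs, a pinned Dirty one on phase_leafs. */
static enum leaf_list which_list(unsigned flags)
{
    if (!(flags & B_PINNED))
        return LIST_NONE;
    if (flags & B_REALLOC)
        return LIST_CLEAN_LEAFS;
    if (flags & B_DIRTY)
        return LIST_PHASE_LEAFS;
    return LIST_NONE;
}

/*
 * Incorporating a child's new address marks the index block the same way
 * the child was marked when written; Dirty overrides Realloc.
 */
static unsigned incorporate_child(unsigned parent, unsigned child)
{
    if (child & B_DIRTY) {
        parent |= B_DIRTY;
        parent &= ~B_REALLOC;
    } else if ((child & B_REALLOC) && !(parent & B_DIRTY)) {
        parent |= B_REALLOC;
    }
    return parent;
}
```

       A checkpoint would then walk both lists, which is why a block that
       lost its Realloc bit still gets written, just to the main cluster.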
1296
1297
1298  FIXME do inode reads async better when cleaning...
1299
1300  FIXME if a realloc inode has been allocated to a cluster when we try
1301      to dirty it, confusion can ensue as the writeout won't mark it
1302      clean, but will use up the credits.
1303      Maybe we need something similar to phasewait to not set PinPending...
1304       But normal dirtying doesn't phasewait.   I think we just need to
1305       detect this case and wait for the clean-cluster to flush.
1306       Messy...
1307
1308  FIXME make sure incorporate is doing the right thing with credits.
1309
1310  FIXME lafs_write_inode. We need to be careful about clearing Dirty
1311            when making an update.  Need some sort of locking.
1312            Need to review all inode dirty stuff and make sure we do
1313            the right thing no matter when it is called.
1314
1315  FIXME when blocks are attached to uninc_next, they don't have 'dirty'
1316         anymore so we don't know how to flag the index block.
1317
1318 2008jul13
1319  UPTO: unlink etc don't prealloc the inode that will be modified.
1320     And a warnon inode.c:579 is very noisy.
1321
1322 2008jul22
1323  FIXME: lafs_reserve_block uses CleanSpace if Realloc is set,
1324      but it doesn't get set until AFTER lafs_reserve_block is called.
1325
1326  Here I am...
1327    Cleaning cleans an InoIdx block which schedules the data block.
1328     Subsequently the InoIdx block gets pinned again.
1329     Now when we go to write the data block, we cannot because InoIdx is pinned
1330      in same phase.
1331      Maybe given that data block is pinned, we write it anyway...
1332
1333  FIXME: when we realloc a block embedded in the inode, don't pluck it out
1334         and put it back in again.  Just realloc the inode.
1335
1336  FIXME: when cleaning a directory that has shrunk, we think we have
1337      blocks that don't exist any more. FIXED - we thought '0' was in
1338      segment '0'.
1339
1340 2008jul23
1341   FIXME: lafs_dirty_iblock called from lafs_allocated_block in cluster
1342      flush finds no credit. for InoIdx block of 8501
1343
1344   FIXME: do we do SEGREF on all the index blocks? do we need to?
1345
1346
1347 2008jul24
1348   FIXME: seg usage for segment 0/5 isn't dropping to zero.
1349     Part of a file got moved off, but count is still there.
1350     FIXED - seg_move wasn't being called.
1351   FIXME: segusage file has inconsistent extents:
1352       Extent entries:
1353        0 -> 694 for 2
1354        1 -> 1291 for 1
1355        1 -> 15 for 1
1356    FIXED several bugs in walk_extent
1357
1358   FIXME qphase:  any locking between that changing and lafs_seg_move??
1359     I don't think so.  Just that seg_apply_all must be called after qphase is set.
1360
1361   FIXME make sure we don't try to clean the current segment!!
1362
1363   FIXME 'Available' goes negative!
1364       Creating large file doesn't instantly reduce 'Used'.
1365       Deleting files plus sync doesn't increase Avail?
1366
1367   FIXME a segment is in the table but doesn't print out!
1368
1369   FIXME we don't cope with running out of free segments (not that we ever should).
1370
1371   FIXME check all Credit usage and make sure credits are returned when
1372     ->parent is dropped.
1373     provide visibility into credit counts.
1374     Make sure we are keeping enough space for cleaning.  We should always
1375      have a few segments unallocatable.
1376
1377 2008jul25
1378   FIXME cannot do io completion in cleaner thread as it can block on
1379      a i_mutex which might be waiting for completion. FIXED (keventd).
1380
1381   FIXME as ->iblock isn't refcounted we need to be careful accessing it.
1382             If we 'know' we have a reference, e.g. a child with a ->parent
1383             link, we can access it without locking.
1384        So:
1385            lafs_make_iblock should return a counted reference.
1386
1387        If we own an (indirect?) reference to iblock, we can access
1388         both iblock and dblock for free... but iblock can change???
1389        If not, we need to get a reference to one or other under a lock.
1390
1391   FIXME block->inode should be a counted reference?
1392
1393 lafs_make_iblock OK
1394   lafs_leaf_find OK
1395     lafs_inode_handle_orphan OK
1396       inode_handle_orphan_loop FIXED
1397     __lafs_find_next OK
1398     find_block FIXED
1399   __lafs_find_next OK
1400     lafs_find_next FIXED
1401       dir_lookup_blk
1402       dir_handle_orphan
1403       lafs_readdir
1404       lafs_inode_handle_orphan
1405       choose_free_inum
1406   find_block - FIXED
1407
1408  FIXME root->iblock should always be refcounted.  Is it?
1409  FIXME walking siblings - what lock?
1410
1411 2008jul28
1412  FIXME several times we clear PinPending without refiling, in dir.c in particular.
1413     that looks wrong. FIXED
1414
1415   Maybe  lafs_new_inode should return a reference to the dblock
1416     Or pin it. or something. FIXED  And pinned (when needed).
1417
1418  FIXME lafs_inode_dblock might return a block without valid data...
1419    Need to get valid data, then load block 0 in find_block rather than
1420        load_block.  FIXED
1421
1422  FIXME we really should own a reference to ->dblock before calling
1423     lafs_pin_inode.  We don't want IO during a pin request.
1424     FIXED
1425
1426  FIXME review use of PhysValid FIXED
1427
1428  lafs_orphan_abort - what if lafs_orphan_pin not called?
1429    or if 'b' is NULL.  FIXED
1430
1431  Do I need to clear PinPending when retrying??
1432    Well, we need to be phase-locked when we set PinPending, so
1433     it must be Pinned to the current phase.
1434     So when we unpin a datablock, we must clear PinPending.
1435   FIXED we now clear PinPending in do_checkpoint.
1436
1437  Does phase_wait do the right thing when pinning an inoidx block
1438    for an inode? FIXED
1439
1440
1441 Pending
1442   Need to understand and document the lifetime of a page with datablocks.
1443     who holds what refcount, and when can it be freed?
1444    Then fix up locking in lafs_refile, __putref.
1445
1446  FIXME who keeps what refcount on orphan blocks/inodes??
1447  FIXME should dirty/pinned/etc hold a refcount?  they don't.
1448
1449
1450 Later:
1451  FIXME make sure a failed (-EAGAIN) pinning triggers a checkpoint (eventually)
1452
1453  FIXME make sure empty files have depth of 1.
1454
1455  FIXME Truncate proceeds lazily. All data blocks need to be gone
1456
1457 26aug2008
1458  If I call lafs_erase_dblock while a write is underway, we have a problem.
1459   We need to wait potentially for a checkpoint to let go of the block and
1460    a write to complete.
1461     This should be done with waiting for PG_writeback on the page to disappear.
1462   Check this out.
1463
1464   When end_page_writeback is called, we must have dropped all references to the
1465    page.
1466   When we commit to writing a block, we have to set PG_writeback on the page
1467    so that truncate et al can wait for it.  Before we have committed, truncate
1468    can just remove the page.  Internally we differentiate by B_Alloc.
1469   So before setting B_Allocated we need to test_set_page_writeback(page).
1470   Be careful of races.
1471   I don't think we can ensure all references are dropped.  After all, that is
1472   the point of refcounts.  So dblock array must exist without page!
1473   But we need to ensure that we don't start a writeout after truncate
1474   has done wait_on_page_writeback.
1475   This is done with the page locked so when we want to write a page
1476     in a checkpoint, we need to lock the page first.  Once we have the lock,
1477     we check if the page is still dirty.  If it has been truncated it
1478     will be clean.
1479    But how do we safely reference the page if b->page can be cleared?
1480     How about:
1481       When we clear PagePrivate, we take a counted reference to the page
1482       for db->page.  This is dropped when the page is freed by lafs_refile.
1483       But while it is held, it is still safe for db->page to be dereferenced.
1484     So before we commence writeout we have to lock the page and set
1485      PG_writeback.  After locking, we need to test if writeback is still
1486      appropriate.
1487
1488   Maybe not.  I think we can submit blocks for writeout without setting the
1489   page to writeback.  If we do, then we need to be sure those writes
1490   finish before invalidatepage calls releasepage (block_invalidatepage
1491   calls discard_buffer which calls lock_buffer which waits).
1492   In our case invalidatepage needs to make sure that no new write commences.
1493   Maybe we should lafs_iolock_block before we allocate to a cluster and check
1494   again if the block is dirty.
1495
1496   So:
1497     lafs_cluster_allocate does:
1498        lafs_iolock_block
1499        check if still dirty.  If not, unlock and return
1500        set allocate flag
1501        allocate and write
1502        when write completes, allocate is cleared.
1503                     unlock block
1504
1505     invalidatepage does
1506        lafs_iolock_block
1507        clear Valid,Dirty,Realloc
1508        lafs_iounlock_block
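
  A rough user-space model of the two sequences above, just to pin the
  idea down.  Everything here is an assumption of the sketch and not the
  real lafs code: the struct, the flag bits, and the function names are
  made up, the iolock is modelled with a C11 atomic_flag that spins
  rather than sleeps, and the write is treated as completing
  synchronously.  Clearing Dirty together with the allocate flag is also
  a guess.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical flag bits, named after the flags mentioned in these notes. */
enum { B_VALID = 1, B_DIRTY = 2, B_REALLOC = 4, B_ALLOC = 8 };

struct block {
    atomic_flag iolock;   /* stand-in for the lafs iolock */
    unsigned flags;
    int writes;           /* writes issued, for illustration only */
};

void block_init(struct block *b, unsigned flags)
{
    atomic_flag_clear(&b->iolock);
    b->flags = flags;
    b->writes = 0;
}

/* Spin until we own the lock; real code would sleep instead. */
void iolock_block(struct block *b)
{
    while (atomic_flag_test_and_set(&b->iolock))
        ;
}

void iounlock_block(struct block *b)
{
    atomic_flag_clear(&b->iolock);
}

/* Returns true if a write was issued, false if the block was no longer dirty. */
bool cluster_allocate(struct block *b)
{
    iolock_block(b);
    /* Nobody else can be mid-write while we hold the iolock. */
    assert(!(b->flags & B_ALLOC));
    if (!(b->flags & B_DIRTY)) {
        iounlock_block(b);           /* invalidated while we waited */
        return false;
    }
    b->flags |= B_ALLOC;             /* committed to writing */
    b->writes++;                     /* "allocate and write" */
    /* Write completion, synchronous in this model.  Clearing Dirty
     * here as well as Alloc is an assumption of the sketch. */
    b->flags &= ~(unsigned)(B_ALLOC | B_DIRTY);
    iounlock_block(b);
    return true;
}

void invalidate(struct block *b)
{
    iolock_block(b);                 /* waits for any write in flight */
    b->flags &= ~(unsigned)(B_VALID | B_DIRTY | B_REALLOC);
    iounlock_block(b);
}
```

  The point of the model is only the ordering: the dirty test happens
  under the same lock that invalidate takes, so a block that has been
  invalidated can never start a new write.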
1509
1510
1511
1512 2008 aug 28 - happy birthday.
1513 FIXME segsum_find calls lafs_reserve_block without a checkpoint lock.
1514 lafs_prealloc complains.
1515
1516   mark_cleaning does too, but cleaning only happens well away from a checkpoint
1517   lock.
1518 segsum_find is being called to reference a new segment when we flush a cluster.
1519  segment usage blocks are special.  Their index information doesn't
1520 need to be written out in the current checkpoint.  We can do that, but
1521 the backstop is to write just the data block in the tail of the
1522 checkpoint and write indexing information later.
1523
1524 2008sep10
1525  unlink is getting "No space left on device".  This is when trying to
1526  pin the directory block, the physaddr is 0, so it looks like we want
1527  NewSpace.  But we shouldn't even be trying to prealloc in that case because
1528  there should already be a prealloc on the block.  i.e. there should be
1529  credits.
1530  Hmmm. after multiple 'syncs' how can the block not be written out.
1531  Maybe it is embedded in the inode?
1532  When we pin a block that was embedded in the inode it isn't clear what to
1533  do.  If we might grow the file so it doesn't fit any more, we need to
1534  allocate NewSpace.  If we know it won't grow, we use Release.
1535   This still needs a proper fix.
1536
1537  Cleaning seems to be working nicely.  However we don't get all the space
1538  back that we should because lots of blocks still have credits that
1539  aren't being returned.
1540
1541  So when should credits be returned?
1542  They are set when a block is pinned.  It then gets dirtied which
1543  consumes a credit.  Then gets unpinned.  I guess if it isn't pinned,
1544  then it doesn't need any credits.
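
  To make the credit life-cycle concrete, here is a toy model of it.
  It is a sketch under assumptions, not the lafs implementation: the
  pool (struct fs_model), the per-block counter, and the function names
  pin/mark_dirty/unpin are all hypothetical, and each block is assumed
  to need exactly one credit.

```c
#include <assert.h>
#include <stdbool.h>

struct fs_model {
    int free_credits;   /* hypothetical pool of reservable blocks */
};

struct blk {
    bool pinned;
    bool dirty;
    int credits;        /* credits currently held by this block */
};

/* Pinning reserves one credit; fails if the pool is empty. */
bool pin(struct fs_model *fs, struct blk *b)
{
    if (b->credits == 0) {
        if (fs->free_credits == 0)
            return false;
        fs->free_credits--;
        b->credits = 1;
    }
    b->pinned = true;
    return true;
}

/* Dirtying consumes the credit that the pin reserved. */
bool mark_dirty(struct blk *b)
{
    if (b->dirty)
        return true;
    if (!b->pinned || b->credits == 0)
        return false;
    b->credits--;
    b->dirty = true;
    return true;
}

/* Unpinning returns any credit that was never consumed. */
void unpin(struct fs_model *fs, struct blk *b)
{
    assert(b->credits >= 0);
    fs->free_credits += b->credits;
    b->credits = 0;
    b->pinned = false;
}
```

  In this model a block that was pinned but never dirtied hands its
  credit back on unpin, which is the case that seems to be leaking.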
1545
1546
1547  It seems that cluster_flush is not always writing things in the correct
1548   order.  Root gets written before some other things below it.
1549    Maybe they are temporarily out of the loop??
1550  No.  There are dirty blocks which one checkpoint doesn't pick up, but
1551   they aren't holding the index block pinned. so they lose allocation.
1552
1553  But they must hold the indexblock pinned, even though they aren't pinned
1554  themselves.  We maybe do this just with the refcnt... maybe.  That will cause
1555  it to phase-flip rather than drop pinning, which I think is right.
1556
1557  So: too many credits remain allocated.  Where are they?  There are 1464
1558    outstanding credits. 290 are in the tree so 1200 or so are elsewhere??
1559    But things removed from the tree have credits removed.
1560
1561
1562
1563 FIXME roll forward ignores inodes.  But what about an inode that contains
1564    data.  Should that be ignored?  I think not.
1565 FIXME delete adir/big2 then delete adir and it cannot release:
1566   Cannot release [cee29000]74/0(0)r1:Pinned,Phase0,Valid,Dirty,SegRef,UninCredit,PhysValid,Prealloc
1567  presumably there is orphan processing or something to complete???
1568 FIXME when files are deleted, the space isn't returned!
1569    This seems to be mostly fixed - need to test.
1570 FIXME when I "rm [b-z]*" it waits for writeback on something???
1571    zfile again!!!  OK, I think that is fixed.
1572
1573
1574 12sep2008
1575   Current problem:
1576     seg_apply_all dirties dblocks.  When should they be reserved?
1577     They originally get reserved by a lafs_reserve_block call in
1578     segsum_find called from e.g. lafs_seg_ref which is called by lafs_reserve_block.
1579     However: that block might get written before *and* after a checkpoint.
1580     So we need N* Credits.  These are usually only used for Index blocks.
1581     We can set these easily enough if inode type is TypeSegmentMap.
1582     We move them across to Credit in seg_apply_all.
1583     But when do we clear them if they aren't needed?  I guess
1584      when we drop the last segref.  Yes, we already do that.
1585     FIXME need to make sure these get flushed on next checkpoint
1586      if we cannot allocate new credits after a checkpoint.
1587
1588   New Problem.  The 'cleanable' table reports a size of 3, but it is empty!
1589     Think that is fixed.
1590
1591   Some problems.
1592     1/ see above:  rm x/y; rmdir x -> BUG - FIXED
1593     2/ Spins on 'CURRENT=1' ??
1594     3/ if alloc_space gives EAGAIN while deleting, we don't survive.
1595     4/ When I create/delete a file, ablocks_used increments by one.
1596         The inode hasn't been allocated yet, so it seems the deallocation
1597          isn't adjusting ablocks_used??
1598     5/ open_namei (for dd) got caught on a mutex_lock.
1599     6/ When a large file is shrunk we don't reduce the level of the InoIdx block
1600        I'm not sure where we should and am not thinking very clearly.
1601        Will fudge something in flush_data_to_inode for now, but it MUST be fixed.
1602     7/ unlink (at least) can get stuck in iolock_block.  Who could be holding
1603        the lock?  Writeout that hasn't completed?
1604        Yes.  writepage calls lafs_allocated_block without calling flush.
1605        So the block could be sitting waiting for a flush.  How long do we
1606        wait??
1607     8/ It seems that some datablock can need NCredits.  Make sure these
1608        are handled properly re flush-or-refill after checkpoint and
1609        flip_phase rather than unpin.
1610     9/ Maybe after lafs_writepage cluster_flush isn't getting called soon
1611        enough, and we lock up (see 7).  Need to flush the first block
1612        straight away, and the next one as soon as the first finishes, etc.
1613        Or something like that.  Then remove the comment from lafs_writepage.
1614
1615 8th December 2008
1616
1617   I seem to be getting only 4 blocks to a cluster at the moment.
1618    This is good as it motivates the code to handle block splitting in
1619    the Btree.   But it shouldn't happen.
1620
1621   ....
1622   Block splitting might work - it doesn't crash at least.
1623   But
1624   After deleting all files, the tree is full of stuff.
1625   Lots of inode data/InoIdx blocks.
1626   Many but not all are Pinned.  The others are OnFree
1627   The Pinned ones have outstanding references.
1628   Others
1629
1630   ....
1631   Problem with the block splitting, when adding an index block.
1632   The index block is initially empty - we need to find things by looking
1633   at children.  But we don't.  We BUG_ON the iphys==0.
1634   In general, when we add a block below an index block and before we incorporate,
1635   the block must be found by finding the first indexed block and looking to
1636   see if there is a 'next' block that contains the address we need.
1637   FIXED
1638
1639   But if we truncate a file while an index block is pinned and dirty,
1640   we spin on trying to incorporate it, which should make it empty.
1641
1642 11th December 2008
1643   deadlock.
1644   sync is trying to get lock in lafs_cluster_flush
1645   pdflush holds the lock and is stuck in cluster_flush_0xa40
1646     some wait_event I expect.
1647     Maybe we need an unplug ??
1648
1649  - checkpoint/seg_apply_all/dirty_dblock doesn't have the credits.
1650    This is in clean_free.  We try to update the 'youth' to mark
1651    the segment as free, and we don't have a reservation to do it.
1652    Maybe just reserve it there and then.
1653
1654
1655 12th December 2008
1656   When doing a lookup in an index block, we need to check the unincorp
1657   address list.  It isn't enough to look for unincorp blocks as they
1658   might have disappeared.
1659   For INDIRECT and EXTENT this is easy enough as full information is in
1660   'uninc'.
1661   For INDEX it is a little tricky as we need to look at the full set of
1662    addresses to know where a particular address fits.
1663    We could force an incorporate first, but that has awkward implications
1664     if it requires a split.
1665    Maybe if we get from the lookup "start+range"....
1666      That is not enough as the 'start' might get zeroed by an update.
1667
1668
1669    rm adir/* doesn't work as readdir doesn't get all the entries
1670     for some reason.
1671    Reason is that they are being put in the wrong block.
1672    lafs_find_next doesn't correctly find the 'next' block if it
1673    hasn't been incorporated yet.
1674    Block can be:
1675      in index tree -- easy to find
1676      in uninc_table -- not too hard
1677      only in the ->children list, or attached to a page.
1678    It would be nice to use find_get_pages but that isn't exported so try
1679     something else for now.
1680    For index blocks
1681         Look in index block for 'next
1682
1683 15th December 2008
1684    FIXME when we split an index block, we need to hold a reference to
1685    the original so it doesn't disappear until the split-off copy is
1686    written.  This is because we search from an index block to find
1687    split-off copies.
1688    [ note from Feb09.  This should be OK now. Both will need
1689    incorporation, and we now hold on to blocks until they are
1690    incorporated.]
1691
1692
1693
1694 23rd February 2009
1695   - index block.  What changes are allowed exactly.
1696      - splitting certainly makes sense.
1697      - merging two adjacent blocks is fine, of which a special case
1698        is finding that a block is empty and so removing it.
1699      - What about a 2->3 split which would require removing a block
1700         and adding another at the same time?
1701        or noticing that the first blocks addressed are all missing, so
1702        moving the index forward?
1703        In each case, searching down by indexes will find a block that
1704        has been replaced by a later address.  We could manage that as
1705        long as the new block is attached after the replaced block.
1706        So we cannot move a block.  We must delete and replace.
1707
1708   - unincorporated index blocks..
1709     unincorporated data blocks are not pinned in memory.  Once they have
1710     been written out, they can be freed.  Their address is stored in the
1711     uninc-table.  This means we can delay incorporation while many
1712     extents are written out and freed.  When we come to incorporate, we
1713     may have many hundreds of addresses in a few extents that can be incorporated
1714     efficiently without holding all that data pinned in memory.
1715     The same scale doesn't apply to index blocks.  An index block can
1716     reference only 102 blocks (for 1K block size).  And the uninc table can
1717     hold far fewer so we will naturally incorporate more often.
1718     So keeping index/indirect/extent blocks pinned until they are incorporated
1719     is reasonable.  And it makes lookup a lot easier, as we have
1720     guarantees about ordering of blocks in the children list that we
1721     don't have in the uninc table.
1722
1723     Incorporation could have some atomicity issues.  There is no
1724     concern about bad stuff appearing on disk as the phase-change
1725     process handles that.  In memory it might be awkward if we split
1726     an index block before incorporating a block that would span them.
1727     That could conceivably happen if we only incorporate 8 blocks
1728     (size of uninc table) at a time.
1729     So maybe we should incorporate a full uninc list (not table) at
1730     a time.
1731     This means quite different code paths for incorporating leaf
1732     and internal index blocks....
1733
1734
1735   - uninc_table lists are a real problem.
1736     They can only be created during roll-forward so they hardly ever
1737     happen.
1738     But if the block is split while processing earlier things on the
1739     list, then splitting an uninc table would be very messy.
1740     Is there any way around this?
1741     Why not just do incorporation during roll-forward?
1742     We only need to incorporate leafs, not internal blocks because we
1743     don't use uninc_table for internal blocks any more.
1744     So during roll forward, all index blocks that are touched need to
1745     be held in cache...
1746     I think we live with that.  If it ever becomes a problem, we will
1747     need to perform the roll-forward twice.  The first time collects
1748     the usage information so that we know where we can start writing,
1749     then the second just applies all the changes to the rest of the
1750     filesystem.
1751
1752
1753    So:
1754      uninc table only used for leaves, and has no linked list
1755      unincorporated index blocks are stored on a list, which we
1756      sort before applying.
1757      All uninc index blocks are therefore kept in the index tree.
1758      Their order on the children list allows us to find the correct
1759      index. Each block for which the fileaddr is in the parent is
1760      followed by any blocks that have been split off and end after
1761      this one starts.  Blocks that have been emptied are Hole and are
1762      skipped over when looking for a block.
1763
1764      When we split an internal block, the remaining uninc blocks
1765      must not start with a Hole.
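
  The children-list search described above might look roughly like
  this.  It is a sketch only: struct iblk and find_child are
  hypothetical names, the list is a plain singly linked list, and it
  assumes the split-off blocks that follow an indexed block are kept in
  ascending address order.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct iblk {
    unsigned start;      /* first file address covered */
    bool in_parent;      /* address is recorded in the parent's index */
    bool hole;           /* emptied block, skipped by lookups */
    struct iblk *next;   /* next sibling on the parent's children list */
};

/*
 * 'indexed' is the child found through the parent's index for 'addr'.
 * Walk on through the blocks that were split off from it and return the
 * last live (non-Hole) one that starts at or before 'addr', or NULL if
 * every candidate is a Hole.
 */
struct iblk *find_child(struct iblk *indexed, unsigned addr)
{
    struct iblk *best = NULL;
    struct iblk *b;

    assert(indexed != NULL);
    for (b = indexed; b; b = b->next) {
        if (b != indexed && b->in_parent)
            break;       /* reached the next indexed block */
        if (b->start > addr)
            break;       /* later split-off blocks start too late */
        if (!b->hole)
            best = b;
    }
    return best;
}
```

  The walk stops at the next block whose address is in the parent, so
  it only ever looks at one indexed block and its split-off siblings.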
1766
1767    FIXME: what locking do I need around lafs_incorporate?
1768       i_mutex?? i_alloc_sem??
1769       i_alloc_sem is imposed by truncate (inode_setattr) and
1770          direct_io possibly.  So it is really about adding/removing
1771          blocks.  Not updating internals.
1772          Maybe our own mutex.  Could even be per-index-block !!
1773       Whatever it is, we need to protect walking ->children too.
1774
1775
1776 24th February 2009
1777   "rm -r" problem from 12/dec/2008 fixed now.
1778   incorporate code got a make-over and is probably much better.
1779
1780   New problems:  After test runs, cannot create files due to no space
1781      on devices!!  But directory tree is empty.
1782   I can see:
1783
1784     free_blocks=3256 allocated=1425 max_seg=512 clean_reserved=0
1785
1786   The problem is that we think 1425 has been allocated to data that
1787   might still need to be written, leaving not enough room for more.
1788   Index Dump shows
1789   ====================414 credits ==============================
1790   which doesn't explain everything, but does explain a lot.  There
1791   really should be nothing in the Index tree (except fs-root and
1792   tree-root)
1793   There is also:
1794   Some inodes which are OnFree and hold no credits.
1795     0 DATA (1)  52 [0]ESegRef,Claimed,PhysValid
1796     52    1 (0)   0 [2564]{0,00000000}L on free Index(1),InoIdx,OnFree,PhysValid
1797
1798   Some other inodes which are pinned with lots of credits and are
1799     on the phase_leaf list
1800     0 DATA (1) 299 [0]ESegRef,C,CI,Claimed,PhysValid
1801    299    1 (1)   0 [0]{0,00000000} [0, 0]L Leaf1(40) Index(1),Pinned,Phase1,InoIdx,Valid,SegRef,C,CI,CN,CNI,PhysValid,Prealloc
1802
1803   And that is about it.  some are not Valid, some are...
1804   checkpoint just wants to 'flip' them.
1805   They mostly have a refcnt of 1... I wonder who is holding that....
1806   The reference on the dblock is held by the iblock.
1807   But why is the iblock remaining?  Who holds that reference?
1808
1809   I restored some code to clean iblock, and now:
1810   free_blocks=3229 allocated=1277 max_seg=512 clean_reserved=0
1811   ====================244 credits ==============================
1812   which saved 130 credits.  That helps.
1813   There seem to be many fewer of the many-credits blocks
1814   Lots of index blocks in tree are 'OnFree' and have a
1815   0 refcnt, but haven't been removed.  Why?
1816   It seems that they have ->parent == NULL, so lafs_refile never
1817   bothers to remove them.  I guess it should...
1818   OK, lots of InoIdx blocks have gone now with their DATA blocks.
1819
1820   So, remaining blocks are pinned to their phase with lots of Credits,
1821     have no pincnt, mostly have physaddr==0.
1822    It is just the stray refcnt that keeps them there..
1823    inums are 40, 56, 62-73, 275-278, 280
1824     40 is f22
1825     56 is first adir
1826     63-69 are directories 2/3/4/5/6/7/8/9
1827     70-73 are looooong symlinks
1828     275 is cfile
1829     276 is dfile - same as cfile but truncated.
1830       Then some nbfile-X that were big enough.
1831
1832    So: what do they have in common:
1833      Several only use the in-inode data block, but
1834        probably not all
1835
1836     Can it be that it is refcounted on the Leaf list, and so
1837     cannot get off??  Yes, I think so!
1838     We only unpin things that have a zero refcount.
1839
1840     So: what to do?
1841       checkpoint takes it off the list, then flips the phase and puts it
1842       on the other list with refile.  During that time it has a refcount,
1843       so it doesn't lose the pinning.
1844       Do we want to:
1845         1/ Not have it on the list despite being pinned.
1846         2/ Drop the PIN despite the refcnt.
1847         3/ have refile do the phase_flip so it has a chance to
1848            notice the refcount has hit zero.
1849
1850       2 isn't really an option.  We need PIN to persist whenever we have
1851        a reference.  We could possibly use PinPending for index blocks too,
1852        but that would require a lot of thinking.
1853       1 requires another criterion for being on the list.  I suspect that would
1854        get messy fast.
1855       3 we used to do I think... But refile is in a big lock, and we
1856         cannot really do a phase_flip under that.. and phase flip calls
1857          refile anyway so we would get recursion.
1858       So:4 - get lafs_phase_flip to notice and de-pin rather than flip.
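
       Option 4 could be sketched like this.  The struct and the
       function are a hypothetical model, not the real lafs_phase_flip:
       it assumes the checkpoint has already taken the block off the
       leaf list and still holds the reference the list owned, and the
       extra "not dirty" test is my assumption about when de-pinning is
       safe.

```c
#include <assert.h>
#include <stdbool.h>

struct iblock {
    int refcnt;    /* all references, including the one the leaf list held */
    bool pinned;
    bool dirty;
    int phase;     /* 0 or 1 */
};

/*
 * Called by the checkpoint after taking the block off the leaf list,
 * while still holding the reference that the list owned.
 * Returns true if the block was flipped, false if it was de-pinned.
 */
bool phase_flip(struct iblock *b)
{
    assert(b->pinned && b->refcnt >= 1);

    if (b->refcnt == 1 && !b->dirty) {
        /* Only the list reference is left: de-pin rather than flip. */
        b->pinned = false;
        b->refcnt = 0;
        return false;
    }
    b->phase = !b->phase;   /* stays pinned, refiled on the other list */
    return true;
}
```

       This keeps refile out of the picture, so there is no recursion
       and nothing extra has to happen under the big lock.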
1859
1860       FIXME use kzalloc where appropriate.
1861
1862       FIXME Maybe test refcnt-!listempty in refile and de-pin if that is zero.
1863
1864 25th February 2009
1865   Good progress.
1866   Only 54 credits in Index Tree now.
1867   Inodes 1 2 8 16 are present. (fsroot, dirroot, inodemap, segusage)
1868   plus '74', which seems to be scheduled for deletion - root has uninc_table.
1869    ... and 'sync' got rid of that and left 44 credits.
1870   Also have data blocks for inode 50 55 72 73 74 with 2 credits of 74.
1871     50  link
1872     55  zfile
1873     72  long84
1874     73  long85
1875     74  adir
1876   These seem to be the files that used data-in-the-inode
1877   They still have a refcnt of 1 (or 2 for adir).
1878   ... OK, that's gone now.  I found a refcount leak.
1879
1880   So now:  42 Credits in Index Dump.   No stray files.
1881
1882   df: tot=4608 free=4597 avail=3045(4130-1085) cb=8 pb=0 ab=3
1883   So we still seem to have 1085 blocks allocated.  42 are accounted
1884   for, so 1043 still missing... either we lost the count, or lost the tree.
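
  The arithmetic on these df lines is easy to get wrong by hand, so
  here is a small helper that does it.  It is only a sketch: the struct
  and function names are made up, and "usable" is my label for the
  first number inside the parentheses, whose exact meaning is not
  spelled out here.  All it relies on is that avail is printed as
  (usable - allocated).

```c
#include <assert.h>
#include <stdio.h>

struct df_line {
    int tot, free_blocks, avail;
    int usable, allocated;        /* the two numbers inside the parentheses */
    int cb, pb, ab;
};

/* Parse one "df:" debug line; returns 1 on success, 0 on failure. */
int parse_df(const char *s, struct df_line *d)
{
    int n = sscanf(s, "df: tot=%d free=%d avail=%d(%d-%d) cb=%d pb=%d ab=%d",
                   &d->tot, &d->free_blocks, &d->avail, &d->usable,
                   &d->allocated, &d->cb, &d->pb, &d->ab);
    return n == 8;
}

/* Allocated blocks that the credits seen in the Index Dump do not explain. */
int unaccounted(const struct df_line *d, int tree_credits)
{
    assert(tree_credits >= 0);
    return d->allocated - tree_credits;
}
```

  Fed the line above with 42 credits in the tree, unaccounted() gives
  the 1043 missing blocks.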
1885
1886   create a tiny file, remove, and sync, now
1887   df: tot=4608 free=4597 avail=3018(4118-1100) cb=8 pb=0 ab=3
1888
1889   so I lost 15, but now 48 are in tree.  Let's try again...
1890   df: tot=4608 free=4597 avail=3006(4108-1102) cb=8 pb=0 ab=3
1891   and 44 in tree
1892   and again:
1893   df: tot=4608 free=4597 avail=2986(4098-1112) cb=8 pb=0 ab=3
1894
1895   Definitely losing more than the difference in the tree.
1896
1897   Try creating empty files...
1898 df: tot=4608 free=4597 avail=2986(4098-1112) cb=8 pb=0 ab=3
1899 df: tot=4608 free=4597 avail=2974(4088-1114) cb=8 pb=0 ab=3
1900 df: tot=4608 free=4597 avail=2954(4078-1124) cb=8 pb=0 ab=3
1901 df: tot=4608 free=4597 avail=2942(4068-1126) cb=8 pb=0 ab=3
1902 df: tot=4608 free=4597 avail=2922(4058-1136) cb=8 pb=0 ab=3
1903 df: tot=4608 free=4597 avail=2910(4048-1138) cb=8 pb=0 ab=3
1904 df: tot=4608 free=4597 avail=2890(4038-1148) cb=8 pb=0 ab=3
1905
1906  very strong pattern there: the allocated count rises by 2, then 10, alternating.
1907  What about 2 files at a time.
1908 df: tot=4608 free=4597 avail=2879(4028-1149) cb=8 pb=0 ab=3
1909 df: tot=4608 free=4597 avail=2860(4018-1158) cb=8 pb=0 ab=3
1910 df: tot=4608 free=4597 avail=2849(4008-1159) cb=8 pb=0 ab=3
1911 df: tot=4608 free=4597 avail=2830(3998-1168) cb=8 pb=0 ab=3
1912 df: tot=4608 free=4597 avail=2819(3988-1169) cb=8 pb=0 ab=3
1913
1914   Slightly different pattern - not as bad.
1915   Have to try 4 now.
1916 df: tot=4608 free=4597 avail=2802(3978-1176) cb=8 pb=0 ab=3
1917 df: tot=4608 free=4597 avail=2793(3968-1175) cb=8 pb=0 ab=3
1918 df: tot=4608 free=4597 avail=2776(3958-1182) cb=8 pb=0 ab=3
1919 df: tot=4608 free=4597 avail=2767(3948-1181) cb=8 pb=0 ab=3
1920
1921   Strange, isn't it....
1922
1923   Making sure we clear UnincCredit... result looks worse.
1924
1925 26th February 2009
1926   I fixed up the credit accounting in 'incorporate' and then fixed a couple
1927   more little bugs.  And now:
1928
1929
1930
1931 ====================48 credits ==============================
1932 df: tot=4608 free=4597 avail=3172(3940-768) cb=10 pb=0 ab=1
1933
1934 So we still have 720 allocated credits that aren't accounted for.
1935 But we are nicely under 100...
1936
1937 .... and now
1938
1939
1940 ====================76 credits ==============================
1941 df: tot=4608 free=4256 avail=2160(2402-242) cb=350 pb=0 ab=2
1942
1943 That is different.  The count of missing blocks is way down,
1944 but there is some extra cruft in the index tree.
1945 Quite a few like
1946     0 DATA (1) 303 [0]L Leaf1(13) SegRef,Claimed,PhysValid
1947     0 DATA (1) 302 [0]L Leaf1(14) SegRef,Claimed,PhysValid
1948 and even one
1949     0 DATA (2) 330 [0]L Leaf1(1) SegRef,C,CI,Claimed,PhysValid
1950    330    1 (1)   0 [0]{0,00000000} [0, 0]L Leaf1(0) Index(1),Pinned,Phase1,InoIdx,SegRef,C,CI,CN,CNI,PhysValid,Prealloc
1951 Time for a commit though....
1952
1953 and now
1954 ====================46 credits ==============================
1955 df: tot=4608 free=4257 avail=4253(4458-205) cb=350 pb=0 ab=1
1956
1957 So the strays in the index tree are gone, but we still have 159 outstanding
1958 credits.
1959 No change, but now
1960 ====================36 credits ==============================
1961 df: tot=4608 free=4256 avail=3787(3885-98) cb=350 pb=0 ab=2
1962
1963
1964 That is a little weird...
1965 Hmmm. back to
1966 ====================48 credits ==============================
1967 df: tot=4608 free=4257 avail=4247(4458-211) cb=350 pb=0 ab=1
1968
1969 Oh well.
1970 ====================34 credits ==============================
1971 df: tot=4608 free=4257 avail=3176(3373-197) cb=350 pb=0 ab=1
1972
1973 It seems that the unaccounted blocks are (or can be) created by
1974 writing to a file then removing the file without a sync.
1975 ..but why is cb (cblocks_used) so high?
1976
1977 27th February 2009
1978
1979  Got onto a bit of a tangent...
1980  What happens if we truncate a block while it is on a list to
1981  be cleaned?  Clearly we want the cleaner to drop it ASAP.
1982  But what if invalidate_page wants to drop it *now*?
1983  Hopefully it is either still on clean_leafs and we can remove it,
1984  or it is now iolocked and we can wait for it.  So should be OK.
1985
1986  I keep getting caught in "looping on..."
1987  We are truncating an inode and some index block which is now empty
1988  is not getting removed from the tree because there is an outstanding
1989  reference.... 327/0 depth=1.  I guess I turn on the tracing.
1990
1991  ... and it seems that it is in the process of checkpointing.
1992  I guess I need to lock against that ... maybe with the iolock.
1993
1994 Credits = -1, rv=2
1995 ib = [ce814e40]328/0(2552)r3:Index(1),Pinned,Phase1,Valid,Dirty,CI,CN,CNI,UninCredit,PhysValid,Uninc{0,0}[0]
1996 ------------[ cut here ]------------
1997 kernel BUG at /home/neilb/work/nfsbrick/fs/module/segments.c:371!
1998
1999  -------
2000  Every time I create/delete a file, I get an extra 'ab' which disappears
2001  on 'sync'.
2002    ablocks_used is:
2003      decremented when +ve summary_update on non-index
2004      increased on lafs_summary_allocate... should not be done for index blocks.
2005
2006  OK:  after test run, filesystem is empty, but cblocks_used is around 360.
2007   cblocks_used:
2008         is loaded at mount time
2009         collects pblocks_used on a phase flip
2010         is updated in lafs_summary_update (unless pblocks is)
2011    So we must be missing a lafs_summary_update when phys->0
2012
2013
2014  Lots of problems:
2015    truncating big (multi-level index) seems to be bad
2016      Leaves 'pb-338 !!! and cb+689, even after sync.
2017    still 'looping on' occasionally
2018    Haven't found cblocks_used leak yet.
2019    Occasionally non-B_Valid blocks are acted on.
2020      I think I need to improve io locking.
2021
2022 ---------------
2023 1st March 2009
2024   Need some improvements to iolock locking.
2025   We use this lock to wait for a block to be written out (if that is happening)
2026    before we allow lafs_invalidate_page to complete.
2027    It is also used in lafs_erase_{d,i}block (Similar purpose)
2028   We take the lock in lafs_cluster_allocate, and then make sure the block is
2029    still dirty.
2030
2031   Also lock in lafs_new_inode as initing the inode is a form of IO ??
2032   load_block takes the lock
2033   We only clear_bit(B_Valid, ) under this lock.
2034
2035   So the issue is this:
2036     A block that is going to be written is passed to lafs_cluster_allocate.
2037     This happens either after taking it off a _leafs list, or when
2038     lafs_writepage requests the write.
2039
2040     lafs_invalidate_page needs to be able to release the page, so there needs to
2041     be no transient references.  In particular, once the block has been
2042     removed from a _leafs list it must already be iolocked.
2043     Invalidate_page can then either remove from that list and erase the block,
2044     or use io_lock_block to wait for the IO to complete.
2045   So when a datablock comes out of get_flushable it must be iolocked, and must
2046   remain iolocked until after Dirty and Alloc are clear
2047   Index blocks belong entirely to the fs, so we can be more relaxed with them.
2048   If get_flushable finds the block already iolocked, it is either being invalidated
2049   or already has IO pending, so it can be dropped.
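
  The get_flushable rule could be modelled like this.  Again a sketch
  with made-up names (struct fblock, fblock_init, and a singly linked
  leaf list), using a C11 atomic_flag test-and-set as a trylock in
  place of the real iolock.  The lock is attempted before the block is
  unlinked, so a block is never off the list without being iolocked.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct fblock {
    atomic_flag iolock;    /* stand-in for the lafs iolock */
    struct fblock *next;   /* next block on the leaf list */
};

void fblock_init(struct fblock *b, struct fblock *next)
{
    atomic_flag_clear(&b->iolock);
    b->next = next;
}

/*
 * Take blocks off the head of the list until one can be iolocked.
 * Returns that block with the iolock held, or NULL if none is left.
 * A block that is already iolocked is being invalidated or already
 * has IO pending, so it is simply dropped from the list.
 */
struct fblock *get_flushable(struct fblock **leafs)
{
    assert(leafs != NULL);
    while (*leafs) {
        struct fblock *b = *leafs;
        bool got_lock = !atomic_flag_test_and_set(&b->iolock);

        *leafs = b->next;       /* unlink only after trying the lock */
        b->next = NULL;
        if (got_lock)
            return b;
    }
    return NULL;
}
```

  The caller is expected to keep the lock until Dirty and Alloc are
  both clear, and only then release it.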
2050
2051
2052 16th March 2009
2053
2054   FIXME  When we sync a small file, we just write out the inode.
2055      rollforward currently ignores data in inodes I think.
2056      That needs to be fixed to ensure this data is safe.
2057
2058  - stop iblock from disappearing so much.
2059
2060  - I think...
2061     While cleaning a file, I truncate it.  This makes it appear
2062     to fit in the inode but it is very big and we get confused.
2063    We cannot allocate block 0 until all the others have been
2064    allocated to 0 and forgotten.
2065    But what if we truncate a file to 10 bytes, then fsync?
2066     We need to write the data promptly, but we like doing truncate
2067     in the background.
2068    When we extend a file we already need to wait for truncation
2069     to complete (FIXME do we do that?)  We could wait on fsync too.
2070    We cannot just delay block0 as it might be part of a checkpoint
2071     that has to complete promptly while truncation can take a long time.
2072    i.e. we have a very large file.  We update the first byte, then
2073     truncate to 2 bytes.... we don't need to write until fsync which will wait...
2074     Directory?? delete lots of entries so it shrinks to one block?
2075        There is no delayed truncate there.
2076    ?? Never clean an I_Trunc file.
2077    If we try to allocate a file with other indexes:
2078      clear Realloc
2079      if Dirty and Pinned, just do normal alloc
2080      if Dirty and not pinned, skip.
2081
2082
2083   Sometimes I run out of credits while truncating a file.
2084   I need credits - maybe only briefly - to dirty the index blocks.
2085      -- FIXED I think.
2086
2087   An indexblock remains pinned while the refcount is non-zero.
2088   A pinned index block can be on a _leaf lru
2089   The _leaf lru holds a refcount.
2090   This is an awkward referential loop.
2091   We break it at checkpoint time with special code in phase-flip.
2092   But there are other awkward times such as truncate.
2093
2094   We cannot use PinPending like we do with data blocks because there
2095   could be multiple pending Pins (from different children).
2096
2097   We could possibly treat checkpoint_lock like pinpending, but that
2098   might be racy.
2099
2100   We could not count the _leaf lru, but that might just make the race
2101   harder to find.
2102
2103   I think we want to explicitly drop the pin when we truncate a block.
2104   Normally, once we Pin an index block it will become dirty so we don't
2105   want to de-pin before a checkpoint anyway...
2106
2107   Just to clarify: an index block gets dePinned:
2108    - during checkpoint on a phase_flip if it is no longer dirty etc
2109    - on truncation when we erase it
2110    - during pre-emptive write-out which is a bit like an early phase_flip
2111            not sure that we implement that one yet.
2112
2113 17th March 2009
2114  Deadlock?
2115    - checkpoint calls incorporate calls erase_iblock calls iolock_block
2116    - rm calls orphan_pin calls phase_wait
2117  The problem is in lafs_incorporate.  It expects the block to be iolocked,
2118   but can call erase_iblock which tries to get an iolock itself...
2119  ...fixed that and it still happens.
2120  checkpoint calls phase_flip calls allocated_block (on uninc list) calls
2121     iolock_block before calling incorporate
2122  Maybe all of these should assume an IO lock.
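
  A rough user-space sketch of the "assume an IO lock" idea.  This is not
  the LaFS code: struct blk, the *_locked names and checkpoint_one are
  made up for illustration, and a plain flag stands in for B_IOLock.  The
  point is only the convention: inner helpers never take the lock, they
  assert that the caller already holds it, so a path such as
  incorporate -> erase_iblock cannot try to lock the same block twice.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for an index block; 'iolocked' models the B_IOLock bit. */
struct blk {
    bool iolocked;
    int  nchildren;
    bool erased;
};

/* Non-recursive lock: taking it twice on one path is the bug described
 * in the notes, so the toy version reports that as a failure. */
bool iolock_block(struct blk *b)
{
    if (b->iolocked)
        return false;          /* the real thing would wait forever here */
    b->iolocked = true;
    return true;
}

void iounlock_block(struct blk *b)
{
    assert(b->iolocked);
    b->iolocked = false;
}

/* Caller must already hold the IO lock. */
void erase_iblock_locked(struct blk *b)
{
    assert(b->iolocked);
    b->erased = true;
}

/* Caller must already hold the IO lock; never takes it itself. */
void incorporate_locked(struct blk *b)
{
    assert(b->iolocked);
    if (b->nchildren == 0)
        erase_iblock_locked(b);
}

/* Entry point for callers that do not hold the lock yet. */
bool checkpoint_one(struct blk *b)
{
    if (!iolock_block(b))
        return false;
    incorporate_locked(b);
    iounlock_block(b);
    return true;
}
```

  With this split, only the outermost caller (checkpoint_one here) ever
  calls iolock_block, which is the property the deadlocks above violate.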
2123
2124  FIXME truncate assumes truncate-to-zero.  We need proper ftruncate support.
2125
2126  It nearly works....
2127   Things to do:
2128     - sort out individual patches and review DONE
2129     - allow compilation without refcount tracking DONE
2130     - don't hold a 'leaf' reference. NO
2131     - clean up *ref calls - differentiate those that can be called when zero DONE
2132     - use enum for B_* DONE
2133     - support truncate to non-zero offset DONE
2134     - "looping on" found an 'OnFree' block!
2135     - clean out lots of debugging
2136
2137  Hmmm.... deadlock.
2138   rmdir is holding i_mutex and waiting for a phase change to pin a dblock.
2139   checkpoint is also holding i_mutex.. or is trying to get one in lafs_cluster_allocate.
2140   Not cool.  i_mutex must not be taken by checkpoint
2141  Fixed that, though it is a bit of a hack....
2142
2143  New deadlock:  checkpoint calls phase_flip which calls allocated_block,
2144     to move the uninc_next across, and that tries to iolock the parent to
2145     perform a partial incorporation.  But that seems to be iolocked.
2146     Generally that is ugly as ->uninc_next might be very long and require
2147     multiple splits, and direct-driving that from phase_flip is bad.
2148     I should just move the list across
2149
2150
2151 19th March 2009
2152   Spent too long trying to remove refcount held by *_leaf lists.
2153   This leaves InoIdx block with zero refcount so Data block can get
2154   lost and bad things happen.
2155   I might be able to fix it up, but it is probably better to try the
2156   checkpoint_lock approach if I can only remember what that is.
2157
2158 Locking:
2159   Available locks:
2160
2161    Spin:
2162
2163     lafs_hash_lock
2164         Used in:
2165            lafs_shrinker
2166            lafs_refile ???
2167         Protects:
2168            ib->hash
2169            ->lru when on freelist
2170
2171     i_data.private_lock
2172         Used in:
2173            lafs_shrinker
2174         Protects:
2175            ->iblock / refcnt
2176            ->dblock / my_inode
2177            ->children / ->parent within an inode
2178            setting ->private
2179
2180     fs->alloc_lock
2181         fs->allocate_blocks
2182
2183     fs->stable_lock
2184         segsum hash table
2185         segsummary counters (in blocks)
2186
2187     fs->lock
2188         _leafs lru
2189         ->pending_blocks lru - should this be wc->lock ??.. not in 'bh'
2190         Pinned consistent with lru
2191         ->checkpointing / ->phase_locked
2192         fs->pending_orphans
2193         ->uninc and ->chain ??  Should use parent->B_IOLock ??
2194         uninc_table - should use B_IOLock
2195         free list / clean list segtrack
2196
2197    Mutex:
2198
2199     fs->wc->lock
2200       wc[0] .. something in prepare_checkpoint
2201        ->remaining etc
2202       cluster_flush
2203       mini blocks
2204
2205     i_mutex
2206       inode_map
2207       orphans
2208
2209    Other:
2210
2211     B_IOLock
2212        erase_block
2213        incorporate
2214        cluster_allocate
2215        allocated_block
2216        IO
2217        Phase flip
2218        Initialising new inode
2219     B_IOLockLock
2220          IOLock across a page
2221
2222
2223 --------------------
2224 This is a list from 18 months ago, with updates
2225
2226  - Understand how superblock 'version' should be used.
2227
2228  -  Review and fix up all locking/refcounts.  See locking.doc
2229        Also lock inode when copying in block 0 and probably
2230        when calling lafs_inode_fillblock (??)
2231  -  lafs_incorporate must take a copy of the table under a lock so
2232          more allocations can come in at any time.
2233
2234  - We don't want _allocated to block during cluster flush.  So have
2235    a no-block version and queue blocks on ->uninc if we cannot
2236    allocate quickly.  Find some way to process those ->uninc blocks.
2237
2238  - Use above for phase_flip so that we don't need to _allocated there.
2239
2240  - Utilise WritePhase bit, to be cleared when write completes.
2241      In particular, find when to wait for Alloc to be cleared if
2242       WritePhase doesn't match Phase.
2243        - when about to perform an incorporation.
2244  - make sure we don't re-cluster_allocate until old-phase address has
2245      been recorded for incorporation.
2246
2247  - allocate multiple WAIT_QUEUE_HEADS for 'block_wait'
2248
2249  - Can inode data block be on leafs while index isn't, what happens if we
2250        try to write it out...
2251
2252  -  If InoIdx doesn't exist, then write_inode must write the data block.
2253
2254  - document and review all guards against dirtying a block from a previous phase
2255     that is not yet safe on storage.
2256           See lafs_dirty_dblock.
2257  - check for proper handling of error conditions
2258      b/ if lafs_seg_dup fails in do_checkpoint, we need to abort the snapshot.
2259  - review checkpoint loop.
2260        Should anything be explicit, or will refile do whatever is needed?
2261  - Waiting.
2262        What should checkpoint_unlock_wait wait for?
2263        When do we need to wait for blocks to change state. And how?
2264
2265  - load/dirty block0 before dirtying any other block in depth=0 file
2266
2267  - use kmem_cache for 'struct datablock'
2268  - indexblock allocation.
2269         use kmem_cache
2270         allocate the 'data' buffer late for InoIdx block.
2271         trigger flushing when space is tight
2272         Understand exactly when make_iblock should be called, and make it so.
2273  - use a mempool for skippoints in cluster.c
2274  - Review seg addressing code in cluster.c and make sure comments are good.
2275  - consider ranges of holes in pending_addr.
2276
2277  - review correct placement of state block given issues with stripes.
2278
2279  - review segment usage /youth handling and make a todo list.
2280       a/ Understand ref counting on segments and get it right.
2281  - Choose when to use VerifyNull and when to use VerifyNext2.
2282  - implement non-logged files
2283  - Store accesstime in separate (non-logged) file.
2284  - quotas.
2285         make sure files are released on unmount.
2286
2287  - cleaner.
2288        Support 'peer' lists and peer_find. etc
2289  - subordinate filesystems:
2290      a/ ss[]->rootdir needs to be an array or list.
2291      b/ lafs_iget_fs needs to understand these.
2292  - review snapshots.
2293       How to create
2294       how they can fail / how to abort
2295       How to destroy
2296  - review unmount
2297       - need to clean up checkpoint thread cleanly - be sure it has fully exited.
2298  - review roll-forward
2299       - make sure files with nlink=0 are handled well.
2300       - sanity check various values before trusting clusters.
2301
2302  - Configure index block hash_table at run time based on memory size??
2303  - striped layout.
2304          Review everything that needs to handle laying out clusters
2305          aligned for striping.
2306
2307  - consider how to handle IO errors in detail, and implement it.
2308  - consider how to handle data corruption in indexing and directories and
2309      other metadata and guard against problems (lot of -EIO I suspect).
2310
2311  - check all uninc_table accesses are locked if needed.
2312
2313  - If a datablock is memory mapped writeable, then when we write it out,
2314      we need to either fill up its credits again, or unmap it.
2315  - Need to handle orphans asynchronously.
2316
2317  - support 'remount'
2318  - implement 'write_super' ??
2319
2320  - pin_all_children has horrible gotos - remove them.
2321
2322  - perform consistency check on all metadata blocks read from disk
2323    e.g. don't assume index blocks are type 1 or 2.
2324
2325 23rd March 2009
2326  + looking at cleanup for unmount.
2327  - various more refcounts fixed up
2328  - B_SegRef is never dropped!  and we take a ref on a segment when
2329    we start a cluster on it, but never drop that reference.
2330   THIS is next thing - review all setting and clearing of B_SegRef.
2331
2332 30th March 2009
2333  - SegRef and lafs_reserve_block...
2334    There is room for recursion here, I need to be careful.
2335    To dirty a data block, all parent index blocks must be Pinned and must
2336    be able to be written.  That means their segusage blocks must be
2337    available for update.  And Pinning a segusage block for update requires
2338    all its parents.  So the segment for the block, the indexes, and the
2339    segusage and indexes and so-on must all be pinned.
2340    When we pin a block, we do it from the root down to avoid recursion.
2341    We probably want whatever reserve_block calls, to return an unreserved
2342    block rather than call reserve_block itself.
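
  A sketch of pinning "from the root down" without recursion, in
  user-space C.  struct blk, pin_from_root and MAX_DEPTH are hypothetical
  names, not LaFS ones.  It also assumes that a pinned block always has
  pinned ancestors, so the upward walk can stop at the first pinned
  parent; that invariant is my assumption, not something stated above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_DEPTH 16

struct blk {
    struct blk *parent;   /* NULL for the root */
    bool pinned;
    int  pin_order;       /* order in which the block was pinned (demo only) */
};

/* Pin 'b' and every unpinned ancestor, root-most first, without recursion.
 * Returns the number of blocks newly pinned, or -1 if the chain is too deep. */
int pin_from_root(struct blk *b, int *counter)
{
    struct blk *chain[MAX_DEPTH];
    int depth = 0, pinned = 0;

    /* Walk up, remembering the path; stop at an already-pinned ancestor. */
    for (; b != NULL && !b->pinned; b = b->parent) {
        if (depth == MAX_DEPTH)
            return -1;
        chain[depth++] = b;
    }
    /* Walk back down so a parent is always pinned before its child. */
    while (depth > 0) {
        struct blk *p = chain[--depth];
        p->pinned = true;
        p->pin_order = (*counter)++;
        pinned++;
    }
    return pinned;
}
```

  Pinning a data block under root -> index -> data pins three blocks in
  that order; pinning a second data block under the same index then only
  pins the one new block.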
2343
2344   When do we clear SegRef?? We set it when Pinning, so I guess we
2345     clear it when unpinning.
2346    pin_dblock, mark_cleaning, prepare_write, truncate
2347    seg_move clean_free
2348   Well, it is really when Pinning, or Dirtying or Reallocing.
2349   So we clear when unpinning, or when a dblock gets written...
2350   Maybe just when we lose ->parent
2351
2352 6th April 2009
2353  - sometimes segsum counter goes to zero for a random data block
2354      Something is going wrong in roll-forward.  The block looks transiently valid
2355      so doesn't get read, but has no good data in it.
2356  - After deleting a directory, the block might still have incorporation
2357    to happen, but is not marked dirty
2358  - at unmount, there are various blocks that are still dirty.
2359  - sometimes hit BUG_ON(credits==0) line 1196 in cluster.c(cluster_flush)
2360
2361 12th April 2009
2362  - that rollforward problem above:
2363     When rolling the checkpoint, if we find segusage blocks we want to include
2364     them directly into file.  But by pinning the block we might preread a
2365     segusage block.. but we must be sure not to update it.
2366     So during the early stages of rollforward while still in the checkpoint,
2367     seg_inc must be called with in_phase == 0.
2368     so seg_move is called with phase != qphase.
2369     ditto for summary update.
2370     So the block must be pinned to the previous phase...
2371     Normally 'phase' changes at checkpoint-start,
2372              qphase changes at checkpoint-end
2373     So we probably want to start with qphase being 0 and phase being 1.
2374     When we reach the end of the checkpoint, we flip qphase to 1.
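
  The phase/qphase rule can be modelled in a few lines of user-space C.
  struct fs_phase and the function names are invented for this sketch;
  only the rule itself (phase flips at checkpoint-start, qphase catches
  up at checkpoint-end, roll-forward starts with qphase 0 and phase 1)
  comes from the notes above.

```c
#include <assert.h>
#include <stdbool.h>

struct fs_phase {
    int phase;    /* flips when a checkpoint starts */
    int qphase;   /* catches up when that checkpoint ends */
};

/* Roll-forward begins "inside" a checkpoint: phase is 1, qphase still 0. */
void rollforward_start(struct fs_phase *p)
{
    p->phase = 1;
    p->qphase = 0;
}

void checkpoint_start(struct fs_phase *p)
{
    p->phase = !p->phase;
}

void checkpoint_end(struct fs_phase *p)
{
    p->qphase = p->phase;
}

/* True while a checkpoint is in progress, i.e. phase != qphase. */
bool in_checkpoint(const struct fs_phase *p)
{
    return p->phase != p->qphase;
}
```

  After rollforward_start, in_checkpoint() is true until checkpoint_end
  sets qphase to 1, which matches "phase != qphase" during the early
  stages of roll-forward.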
2375
2376  - blocks still in phase_leafs at unmount:
2377     After we force a final checkpoint we still have Pinned:
2378         root InoIdx
2379         ino==8 InoIdx due to Dirty block0
2380         ino=16 InoIdx due to dirty block0
2381      and dirty:
2382         inode block 1,  inode usage map
2383                     2,  root directory
2384                     8,  orphan
2385                    16   seg usage
2386      Problems:
2387         inode blocks dirty but not pinned?  No InoIdx...
2388         Segusage dirty - probably by seg_apply_all - disable that at umount
2389         orphan dirty ??... but not pinned!
2390            This is possible - we don't pin for clearing entries, just for setting.
2391         The inode problem stems from the datablock being dirty while the
2392          InoIdx block isn't.  That is, at best, confusing.
2393
2394 13th April 2009
2395    segusage blocks aren't being pinned
2396    They need to be pinned  whenever dirty.
2397    and youth blocks aren't even made dirty sometimes.  They need to be
2398     pre-pinned in many cases.
2399
2400    So: segusage gets changed when we write out a cluster, and when we
2401       delete/relocate blocks.
2402       In the first case we pin the block when it becomes part of the free list,
2403       and need to keep it pinned across checkpoint changes.
2404       In the second, we pin when the block is dirtied and again must keep it pinned.
2405       Youth gets changed when a segment becomes free and again when we allocate
2406       a segment to it.
2407
2408       Keeping a datablock pinned across checkpoints is awkward - we currently need
2409       to repin for each dirty... I guess we can re-pin for each checkpoint
2410       in lafs_seg_apply_all.  That might work for segusage, but not for youth!
2411       If segsum for ssnum==0 held a reference to the youth block, that might
2412       help.  Segstat on 'clean' or 'free' would imply a reference to that segsum.
2413
2414       Is it OK to keep all youth/usage blocks for free/clean blocks
2415       pinned?  We can currently have 810 entries.  Only half will be clean/free.
2416       For each entry there can be two blocks, youth and usage.  So that could be
2417       810 blocks. 1Meg?  Normally much less.  If it became a problem we could
2418       reduce the number dynamically I guess.
2419
2420       maybe segusage blocks need to get phase_flipped, as other blocks do
2421       depend on them,   pin_all_children wouldn't be able to find them though..
2422
2423     1/ Any address on 'clean' or 'free' segtrack implies a refcount on the
2424       Youth block.
2425
2426 14th April 2009
2427    I think I want to link dirty block to the space in free segments that we
2428    actually know about.  Each of those segments has youth and usage blocks
2429    pinned (at least parent pointer is active).  So we have everything we need
2430    to write everything that is dirty.  So 'free' or 'clean' implies
2431    a segsum reference which holds youth block.
2432
2433    When we get low on space, we wait for cleaning/finding to progress.
2434    This would limit us to  400 segments, say 16Meg each, so 6Gig of dirty
2435    memory.  I guess that we need to scale the 'free' list based on available
2436    memory (FIXME).
2437
2438    When cleaning needs a segment, it needs to load the usage blocks for other
2439    snapshots too.
2440
2441    When cleaning in the presence of snapshots we need to be careful never to
2442    duplicate a block that is shared.  To allow for v.many snapshots, we don't
2443    even want to duplicate in memory.
2444    So we need to choose a 'primary' copy - probably first one found - and
2445    follow the peers link when possible...
2446
2447 18th April 2009
2448    (continuing).
2449
2450    So clean and free segments in the list carry a SegRef.  But it could be
2451    excessive if all of them did - we shouldn't be required to pin more
2452    data than we need.
2453    So for segments with a usage of 0, we use the score to record if a
2454    segref is held.  0 means 'no', 1 means 'yes'.
2455    When space_alloc wants more space we need to find an entry and
2456    segref it.  Maybe we want free lists - reffed and not-reffed.
2457
2458    Then again, SegRefs are fairly cheap as they are heavily shared.
2459    maybe 512 to a block.  If we hold 400 refs they could easily all be
2460    in one block.  We could possibly encourage this by sorting the list
2461    and discarding from one end if it is too full.
2462    Sorting is a good idea definitely.  It keeps youth/usage updates
2463    together.
2464
2465    Just check the numbers.
2466    A 1TB device with 1K blocks might have 32MB segments, of which there
2467    would be 32768.  512 entries per block means 64 blocks or 16 pages (64K).
2468    So total segusage files is 128K plus snapshots.  Not worth worrying
2469    about surely.
2470    For 16TB, that is 2Meg plus snapshots.
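
  The numbers can be re-checked with a small user-space C helper.  It
  assumes 2-byte entries (which is what 512 per 1K block implies) and
  32MB segments; segtable_bytes is a made-up name, not a LaFS function.
  It returns the size of one per-segment table, so youth plus usage is
  twice the result.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes needed for one per-segment table (youth or usage). */
uint64_t segtable_bytes(uint64_t dev_bytes, uint64_t seg_bytes,
                        uint64_t block_bytes)
{
    const uint64_t entry_bytes = 2;   /* assumed: 512 entries per 1K block */
    uint64_t segments = dev_bytes / seg_bytes;
    uint64_t per_block = block_bytes / entry_bytes;
    uint64_t blocks = (segments + per_block - 1) / per_block;  /* round up */

    return blocks * block_bytes;
}
```

  For 1TB, 32MB segments and 1K blocks this gives 64K per table, so 128K
  for the pair; for 16TB the pair comes to 2Meg, as estimated above.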
2471
2472    So
2473     - keep a SegRef for all free and clean blocks.
2474       This must include a youthblk reference.
2475     - sort the free list when 'clean' is merged or when a pass
2476           finishes.
2477         sort clean list
2478         fix youth value
2479         merge as many as fit into free
2480         sort
2481
2482    How is the code flow...
2483       add_cleanable is called during the periodic scan.  It could hold
2484                a SegRef easily.
2485       add_cleanable calls add_clean as does lafs_get_cleanable during
2486           clean.  That might block getting a segref, might even
2487           deadlock?
2488       add_free is also called by seg_scan
2489
2490       So seg_scan should get a segref and leave it with everything!
2491
2492     BUT.....
2493     A SegRef implies a 'struct segsum' for each segment.  We don't
2494     want to allocate one of these for every segment in the table.
2495     We only want a reference to the youth and segusage block, which
2496     are heavily shared.
2497
2498     But these blocks need to be Pinned and SegReffed etc so we can
2499     write them at any time.
2500
2501 20th July 2009
2502   The refcount held by the 'leaf' lru is a problem.
2503   While it holds a count we do not unpin an index block, so it cannot
2504   be removed from the list.
2505   Thus we can only remove from the leaf lru on a phase change.....
2506   Or when doing lru based flushing... Maybe we can remove from the
2507   lru while holding the checkpoint lock.
2508   This happens when truncating..
2509
2510   No, that is just too messy as it is too easy to get put back on the list.
2511
2512   Maybe the leaf lru should not imply a reference count ... or maybe
2513   we need to split the refcount:  'inuse' and 'active'.....
2514   How about we test refcnt against list_empty(->lru)...
2515
2516   ....
2517
2518   During truncate, we need each index block to get unpinned so they can
2519   all be cleaned up.
2520   But the InoIdx block is held pinned by the inode block being dirty.
2521   In this particular case, the InoIdx block is Invalid as the file is empty.
2522   But.... InoIdx should always be valid until after Inode is destroyed??
2523
2524
2525  umount
2526  I need to stop the cleaner and flush everything before trying to
2527  clean up.
2528
2529  This is awkward though.
2530  The 'sync' of umount is done by kill_block_super, but I call
2531  that rather late, after checking that the tree is empty.
2532  There are pinned/dirty bits left after sync that we want to magically
2533   clean.
2534  We have:
2535    - segusage/youth blocks.  Maybe if we don't seg_apply_all...
2536    - orphan block.  Maybe don't mark it dirty when we remove things?
2537    - inode map?? why is that dirty
2538
2539    - root directory is dirty still??  But it has been erased.
2540      InoIdx is valid-but-empty.  Inode Data is dirty
2541         Data block 0 is Dirty at block 0.
2542
2543   ......
2544  Ahh... need to mark page dirty when block is marked dirty !!
2545
2546  The seg usage blocks are now flushed out but not incorporated.
2547  I feel that might be correct - we don't want to care about
2548  incorporation as we will never use it.
2549  For this, segusage and quota are very special cases.
2550
2551  Inode map is no longer dirty, but is pinned
2552  Orphan does have a dirty block still
2553     The orphan table contains the root directory.
2554  root is now clean and gone
2555
2556  Segusage doesn't get incorporated after last checkpoint now
2557  so that is better.
2558  But now we have a circular reference for SegRef.  This should not
2559  be surprising given the circular problems we had setting SegRef.
2560  I guess we just erase the references in the segsum table...
2561
2562 22nd July 2009
2563  Hurray!!! I can unmount without crashing!
2564  Now I need to sort through all the fixes required to achieve that
2565  and make discrete patches, and be sure it is all OK.
2566
2567 DONE - (block.c) lafs_get_block should not have to lock that page just to do a lookup.
2568 DONE - (block.c) Mark page dirty when block becomes dirty
2569 DONE - (checkpoint.c) print orphan_slot with Orphan flag
2570 DONE - Don't incorporate segcount etc after final checkpoint
2571 DONE - Don't apply seg changes after final checkpoint.
2572 DONE - Don't start opportunistic checkpoint after final.
2573 DONE - (checkpoint) if InoIdx isn't dirty but InodeData is, then still allocate
2574 DONE - (checkpoint) when waiting, wait for checkpointneeded to get cleared
2575 DONE - (cluster) be more flexible about credit usage when flushing InoIdx
2576 DONE - (dir) do add_orphan when we abort as well as on success
2577 DONE - use inode_dec_link_count, not i_nlink--
2578 DONE - (file.c) lafs_writepage: remove from leafs when we cluster_allocate
2579 DONE - change %d/%d to strblk
2580 DONE - (index.c) refile: IF B_IOLOCK, then it isn't on LRU
2581 DONE - (index) refile: when unpinning, remove from lru
2582  - lafs_refile: ->iblock can be non-null for inode 0.
2583 DONE - Make sure I_Deleting gets cleared when deleting finished.
2584 DONE - phase_flip should have something separate to call, not lafs_allocated_block
2585  - inode.c: lafs_dirty_inode: getref_lock used to get dblock
2586 NONO - ?? getref_locked allowed if PagePrivate
2587 DONE - segment: lafs_seg_put_all needed at unmount
2588 DONE - segdelete_all: need to put intable references
2589 DONE - lafs_free_get: put the intable references
2590 DONE - lafs_get_cleanable: put the intable references
2591 DONE - fix sort splitting in add_cleanable
2592 DONE - add lafs_empty_segment_table for unmount
2593 DONE - lafs_release: flush all dirty blocks
2594 DONE - lafs_release: force a final checkpoint
2595 DONE - lafs_release: move kill_block_super before final check
2596 DONE - lafs_put_super: release orphans and segsum files.
2597 DONE - lafs_destroy_inode: putref should be 'iblock'
2598  - lafs_destroy_inode: allow for iblock to be present but no ref held....
2599 DONE - can roll forward call lafs_allocated_block without dirty???
2600
2601 27th July 2009.
2602  - I've re-arranged lafs_release so that the flush is all done in
2603    generic_shutdown_super.  However it calls invalidate_inodes, and that has
2604    problems with pinned inodes.  So we need for fsync_super to checkpoint
2605    out all inodes that we don't hold our own reference to.
2606    If we do hold a reference, then invalidate_inodes will skip them,
2607    and ->put_super can be used to drop the references and perform the final
2608    checkpoint.
2609    fsync_super calls ->sync_fs after syncing all files.  Maybe I can
2610    do some sort of checkpoint there...
2611    There almost is a checkpoint in there.... But only when called without
2612    'wait'....
2613    I need to understand 's_dirt'.
2614    This is controlled entirely by the filesystem, common code only examines it.
2615    If it is set:
2616           file_fsync (the generic 'fsync' method) will call ->write_super
2617           fsync_super will call write_super
2618           generic_shutdown_super will call write_super
2619           sync_supers will call write_super
2620           sync_filesystems(0) will call ->sync_fs
2621    sync_fs is called:
2622         twice from 'sync', once with '0', once with '1' for 'wait'.
2623              (though in emergency_sync, both are '0').
2624         once from unmount and remount with 'wait' set to '1'.
2625         We don't want two checkpoints for a 'sync', but we want to start
2626         on 'wait=0'.
2627         Maybe if we get called with '0', we set a flag and treat the '1'
2628         differently..  There is no locking to make this really safe, but
2629         it will probably be OK...  I could take a process_id, but then
2630         parallel 'sync's could race.
2631         write_super is called before the syncs.  So it could start the checkpoint,
2632         and sync could wait for it.
2633         write_super is called multiple times at shutdown.  We really need
2634         to utilise s_dirt to avoid some of these.
2635         We set s_dirt to 0 when we set CheckpointNeeded, and set it to 1:
2636             - when we pin a dblock or dirty a this-phase iblock.
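
  One way the sync_fs / s_dirt idea could hang together, as a user-space
  toy.  Everything here (struct toy_sb, the toy_* functions, the
  started_nowait flag) is hypothetical and only models the bookkeeping;
  it is not the VFS API and it has no locking, so it has the same
  parallel-sync race mentioned above.

```c
#include <assert.h>
#include <stdbool.h>

struct toy_sb {
    bool s_dirt;              /* set when something is pinned or dirtied */
    bool checkpoint_needed;   /* a checkpoint has been requested */
    bool started_nowait;      /* sync_fs(wait=0) already requested one */
    int  checkpoints;         /* checkpoints that actually ran (demo only) */
};

void toy_pin_block(struct toy_sb *sb)
{
    sb->s_dirt = true;
}

/* Request a checkpoint; s_dirt is cleared at the same moment. */
void toy_start_checkpoint(struct toy_sb *sb)
{
    sb->checkpoint_needed = true;
    sb->s_dirt = false;
}

/* Stand-in for waiting on the checkpoint thread. */
void toy_wait_checkpoint(struct toy_sb *sb)
{
    if (sb->checkpoint_needed) {
        sb->checkpoints++;
        sb->checkpoint_needed = false;
    }
}

void toy_sync_fs(struct toy_sb *sb, int wait)
{
    if (!wait) {
        toy_start_checkpoint(sb);
        sb->started_nowait = true;
        return;
    }
    /* wait=1 right after wait=0: reuse the checkpoint already requested. */
    if (!sb->started_nowait)
        toy_start_checkpoint(sb);
    sb->started_nowait = false;
    toy_wait_checkpoint(sb);
}

void toy_write_super(struct toy_sb *sb)
{
    if (sb->s_dirt)           /* repeated calls at shutdown become no-ops */
        toy_start_checkpoint(sb);
}
```

  In this model a 'sync' (wait=0 then wait=1) runs one checkpoint rather
  than two, and a write_super with s_dirt clear requests nothing.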
2637
2638 29jul2009
2639   at unmount, we iput the root inode which de-references the dblock
2640   before clearing ->iblock, which fails an assertion ... why?
2641    Apart from the shrinker, ->iblock is only set to NULL in refile
2642    when we find an I_Destroyed inode... I guess the root block isn't
2643    getting Destroyed...
2644  The protocol for freeing iblocks is bad.  Should be:
2645    - it only gets freed by the shrinker
2646    - when inode dies, set ->inode to NULL
2647    - when InoIdx iblock dies, set ->iblock to NULL
2648    ...???
2649 30Jul2009
2650   So, what exactly is the protocol?
2651     - index blocks live either in the parent/sibling tree, or
2652       on the inode's free_index list
2653     - when refcnt is 0, they live on 'freelist.lru'.  When refcount
2654       is elevated they stay on lru until they need to be
2655       added to some other lru (leafs or cluster)
2656     - when shrinker finds block on freelist.lru with non-zero refcnt,
2657       it just removes from lru
2658     - when shrinker finds free block, it removes from free_index and discards
2659       the block FIXME can refcnt=0 still have Pinned,Uninc,Realloc,Dirty ??
2660         I think not as such would either have children or be on an lru
2661     - When we destroy an inode, all index blocks get disconnected from the
2662       inode and freed.  This must include the ->iblock
2663     - When an index block becomes free due to index tree shrinkage,
2664       we set the ->depth to -1 so that it cannot be found by mistake,
2665       and leave it for shrinker or inode destruction.
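
  The shrinker part of that protocol, as a user-space sketch.  struct
  iblk and its fields are stand-ins chosen for the example (an array
  replaces the real lru list), and the FIXME about Pinned/Uninc/Realloc/
  Dirty with refcnt=0 is not modelled at all.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct iblk {
    int  refcnt;
    bool on_freelist_lru;
    bool on_free_index;    /* still linked from the inode's free_index list */
    int  depth;            /* -1 once dropped from the index tree */
    bool discarded;
};

/* Tree shrinkage: hide the block from lookups and leave it for later. */
void index_block_unlinked(struct iblk *ib)
{
    ib->depth = -1;
}

/* One shrinker pass over blocks on freelist.lru.
 * Returns how many blocks were discarded. */
int shrink_scan(struct iblk **lru, size_t n)
{
    int freed = 0;

    for (size_t i = 0; i < n; i++) {
        struct iblk *ib = lru[i];

        if (!ib->on_freelist_lru)
            continue;
        ib->on_freelist_lru = false;   /* leaves the lru either way */
        if (ib->refcnt != 0)
            continue;                  /* in use: just remove from lru */
        ib->on_free_index = false;     /* free: unlink and discard */
        ib->discarded = true;
        freed++;
    }
    return freed;
}
```

  A block with an elevated refcount simply drops off the lru and stays
  on free_index; only a block with refcnt 0 is unlinked and discarded.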
2666
2667    Confused about inode<->dblock dependence.
2668    We don't want the inode to refcnt the dblock as that wastes space.
2669    We don't want the dblock to refcnt the inode as that stops it from being freed.
2670    So each must disconnect from other when freed.
2671    What locking?
2672    inode takes private_lock, then checks dblock
2673    dblock cannot take private_lock before checking ->my_inode..
2674    Maybe: destroy_inode takes ref on dblock, then sets I_Destroyed, then
2675      drops ref
2676
2677 1Aug2009.
2678   Tracking down the 'credit' count and making sure it stays correct.
2679   It seems that I have a Dirty InoIdx block which is not pinned.
2680   Due to this it has no refcount and so the data block disappears so
2681   the InoIdx block is not visible in the tree.  This isn't a definite bug
2682   but it means I cannot count credits properly.
2683   And surely Dirty index blocks must always be pinned!!??
2684
2685   When a small file is flushed to the inode we were dirtying the
2686   iblock.  That seems wrong - should dirty the dblock?  Need to
2687   check that is valid
2688
2689   I got a hang in 'rm adir/4'.
2690   rm is in lafs_cluster_update_commit_both
2691        getting a mutex.
2692   cleaner is in lafs_do_checkpoint+0xe4
2693   pdflush is in writepage/lafs_cluster_flush waiting on a lock
2694   so I guess cleaner is holding a mutex and waiting for something
2695    that won't happen?
2696
2697
2698   Hang again at 'seq 1 200' in 'cd /mnt/1/adir'.
2699    cleaner is at some point, holding a mutex to stop 'sh'.
2700   0e4 == 228
2701
2702   ahh.. prepare checkpoint holds wc[0].lock while waiting for checkpoint
2703    to be allowed.
2704   So when something locks the checkpoint and needs to flush, we have problems....
2705
2706
2707   I seem to have fixed the above.  Now:
2708     Free space is a real problem.  When I remount after the successful unmount,
2709     we find a usage pattern like:
2710 CLEANABLE: 0/0 y=10 u=34179
2711 CLEANABLE: 0/1 y=0 u=65144
2712 CLEANABLE: 0/2 y=0 u=65535
2713 CLEANABLE: 0/3 y=32773 u=32910
2714 CLEANABLE: 0/4 y=32772 u=149
2715 CLEANABLE: 0/5 y=0 u=0
2716 CLEANABLE: 0/6 y=32770 u=16529
2717 CLEANABLE: 0/7 y=32769 u=35084
2718 CLEANABLE: 0/8 y=32768 u=31877
2719
2720     Which is ridiculous.
2721    Better fix up what I have first...
2722
2723  ...
2724  In rm /mnt/1/nbfile* we hang..
2725    rm is in lafs_phase_wait from pin_dblock in unlink
2726 wait for [ce5c2d20]277/0(0)r2F:Index(1),Pinned,Phase0,InoIdx,Valid,Dirty,CN,CNI,UninCredit,PhysValid{0,0}[8] pindb(1) leaf(1)
2727
2728    cleaner is in lafs_iolock_block from add_block_address in phase_flip
2729 iowait for [ce5c33b0]286/0(0)r6E:Index(1),Pinned,Phase1,WPhase0,InoIdx,Valid,Dirty,CN,CNI,UninCredit,IOLock,PhysValid{0,0}[1] child(5) leaf(1)
2730
2731  So cleaner is probably deadlocking against itself via iolock_block.
2732   This is taken:
2733     - in lafs_invalidate_page just to wait for any io - it isn't held long
2734     - in lafs_erase_dblock while we erase and 'allocated_block'
2735     - in lafs_get_flushable to protect blocks being checkpointed
2736     - in lafs_writepage to call cluster_allocate (which releases), both for
2737              data block or for inode when data was flushed there.
2738     - lafs_add_block_address to process pending incorporations to make room.
2739          This is what is trapping the cleaner.
2740     - lafs_inode_handle_orphan when truncate finishes to erase_iblock
2741     - lafs_inode_handle_orphan again to incorporate all removal
2742     - and again to erase_iblock
2743     - and for partial truncate to incorporate some removals
2744     - and again....
2745     - lafs_new_inode to keep it from being cleaned while being created
2746     - roll_block to add addresses
2747     - lafs_load_block during IO
2748
2749   So: who holds it?.... let's use the code to find out...
2750   And the answer is : lafs_get_flushable.
2751    So get_flushable iolocks the block then calls phase_flip which tries to
2752    incorporate other-phase children which try to iolock the block.  Deadlock.
2753    Do we need to hold iolock during phase_flip ??.  Not for all of it..
2754
2755 02August2009
2756    FIXME When erasing a block, do I need an uninc credit?  I usually don't
2757     have one and the need certainly isn't as great...
2758
2759   Now... let's try to get free space accounting right.
2760    Observed problems:
2761      - unlink sometimes failed with ENOSPC
2762      - usage scan shows segments with enormous usage - 23039!!
2763
2764   no credits: [ce9a55cc]16/1(2651)r11E:Pinned,Phase1,WPhase1,Valid,Dirty,SegRef,C,CI,CN,CNI,UninCredit,IOLock,PhysValid intable(7) ss(3) cluster(1)
2765   no credits: [cfb695cc]16/1(1840)r12E:Pinned,Phase0,WPhase0,Valid,Dirty,SegRef,C,CI,CN,CNI,UninCredit,IOLock,PhysValid intable(7) ss(4) cluster(1)
2766
2767   no uninc credit [cfb695cc]16/1(2019)r13F:Pinned,Phase0,WPhase0,Valid,Dirty,SegRef,C,CI,CN,CNI,IOLock,PhysValid intable(7) ss(5) cluster(1)
2768
2769
2770   after umount/remount df says "4608 7 1544" but cannot
2771    create anything.
2772 df: tot=4608 free=4601 avail=1544(1564-20) cb=7 pb=0 ab=0
2773 ============= Cleanable table (7) =================
2774 pos: dev/seg  usage score
2775   0:   0/0        1 0
2776   1:   0/5        1 64
2777   2:   0/6        6 384
2778   3:   0/7        2 128
2779   4:   0/8        3 192
2780   5:   0/3        1 64
2781   6:   0/2        2 128
2782 ...sorted....
2783   0:   0/0        1 0
2784   1:   0/3        1 64
2785   2:   0/5        1 64
2786   3:   0/2        2 128
2787   4:   0/7        2 128
2788   5:   0/8        3 192
2789   6:   0/6        6 384
2790 --------------- Free table (1) ---------------
2791 12290:   0/4        0 0
2792 --------------- Clean table (0) ---------------
2793 CLEANABLE: 0/0 y=10 u=1
2794 CLEANABLE: 0/1 y=32775 u=3
2795 CLEANABLE: 0/2 y=32774 u=2
2796 CLEANABLE: 0/3 y=32773 u=1
2797 CLEANABLE: 0/4 y=0 u=0
2798 CLEANABLE: 0/5 y=32771 u=1
2799 CLEANABLE: 0/6 y=32770 u=6
2800 CLEANABLE: 0/7 y=32769 u=2
2801 CLEANABLE: 0/8 y=32768 u=3
2802
2803
2804 03Aug2009
2805  Current issues:
2806 FIXED 1/ Data 16/1 is being Reallocated, but is Dirty, not Realloc
2807 Gone,presume FIXED 2/ Data 16/1 has no uninc credit in cluster_flush
2808  3/ in cleaner, ->dblock is uninitialised.... actually inode has been freed.
2809  4/ invalidate_page finds Realloc set, even after iolock ..
2810      This is during umount  in generic_shutdown/lafs_put_super/iput
2811  5/
2812
2813
2814  Thoughts:
2815    If we flag a block for Realloc then Dirty before it is allocated,
2816      then all is fine.
2817    But if we have already allocated to a cleaning cluster... what happens?
2818     We need to treat this like it was dirtied after being written, so
2819     it gets written to a regular cluster as well.
2820     As we only have one uninc bit for both Dirty and Realloc, we need
2821     to *not* incorporate the Realloc update if the block is still dirty.
2822    So:
2823         - block gets chosen for cleaning and allocated to a clean-cluster
2824         - block gets marked dirty.  This must not clear Realloc
2825         - cluster is flushed, block is dirty, so don't call lafs_allocated_block
2826         - Return the Realloc credit, but keep dirty and Uninc.
2827      Is there a race if Dirty is set after we enter lafs_allocated_block?
2828       As long as the index block gets marked Dirty, not Realloc we might
2829        be safe... though it gets awkward if the Dirty writeout falls into
2830        the next phase.  But reserve_block will have provided NCredits for that.
2831      So:
2832         1/ don't clear Realloc when setting Dirty
2833         2/ do clear Realloc if cleaner finds the block is Dirty
2834         3/ avoid calling lafs_allocated_block when cleaning a dirty block.
2835                    This is an optimisation.
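  A minimal sketch of the three rules above, as plain flag handling.
  The flag names and struct here are hypothetical stand-ins for the
  real B_Dirty/B_Realloc bits, and clearing Realloc after a clean
  relocation is an assumption, not something these notes state.

```c
#include <stdbool.h>

/* Hypothetical flag bits standing in for B_Dirty and B_Realloc. */
enum { F_DIRTY = 1u << 0, F_REALLOC = 1u << 1 };

struct blk { unsigned flags; };

/* Rule 1: setting Dirty leaves Realloc untouched. */
static void mark_dirty(struct blk *b)
{
    b->flags |= F_DIRTY;
}

/*
 * Rules 2 and 3: what the cleaner does when its cluster is flushed.
 * Returns true if the new address should be recorded (the point at
 * which lafs_allocated_block would be called), false if the block
 * is left for the regular Dirty writeout.
 */
static bool cleaner_flush(struct blk *b)
{
    bool was_dirty = (b->flags & F_DIRTY) != 0;

    b->flags &= ~F_REALLOC;   /* rule 2 (and, by assumption, the clean case) */
    return !was_dirty;        /* rule 3: skip the address update if Dirty */
}
```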
2836
2837     Almost...  A B_Realloc block no longer has B_Credit so B_Dirty cannot be
2838        set.
2839
2840
2841   Thoughts3.
2842      When cleaning blocks we hold no reference to the inode and it can disappear.
2843      We don't want to hold the inode active, but need a reference much like
2844       the truncate code has.
2845      I think we need a subordinate refcount for both cleaning and truncate.
2846       These hold inode present but not active.
2847      Maybe every block->inode should be counted like this.
2848      And this might simplify the my_inode->dblock inter-relationship.
2849      For later..
2850        We need to ensure that if a new iget is called on an inode that still
2851        exists, we don't allocate a new one but just reuse the old.
2852        But that won't work as we cannot add an inode back into the hash table.
2853      So I think when cleaning a block we need to ref the inode.
2854       i.e. B_Realloc implies an igrab
2855
2856 05aug2009
2857  So I have a problem with the cleaner wanting to hold an inode that
2858  the VFS is destroying.
2859  I don't want the cleaner to hold i_count as that delays truncate etc.
2860  So we need a second counter subordinate to i_count.
2861  This is held by the cleaner and by delayed truncate, and by i_count.
2862  Possibly ->my_inode holds this, which means it can be a single bit...
2863
2864  When a lookup wants an inode, we need to load the inode data block and
2865  see if it has my_inode.  If it does, we insert that inode in to the
2866  hash table.  If not we fall back to regular inode creation....
2867
2868  On reflection, that is too complicated and hard and error prone.
2869  When relocating a file we need the data so it had best be in the page
2870  cache so the filesystem really needs to know that the inode is still
2871  active.
2872  So cleaning needs to keep a reference to the inode.
2873  The cost of this is that if an inode is being deleted while it is
2874  being cleaned the truncate cannot happen until the cleaning
2875  completes.  This means that space usage will be wrong.
2876  When nlink becomes zero we can drop the cleaner reference.  When
2877  the inode is dropped/destroyed we can tie the cleaning in with the
2878  delayed truncate so that the final destruction doesn't happen until
2879  the cleaner has let go.
2880
2881  So: how to track that the cleaner has a reference to the inode?
2882  Maybe every B_Realloc block owns a ref on the inode.... but dropping
2883  those references when i_nlink hits zero would be difficult.
2884  They could hold a secondary refcount which, if non-zero, implies a
2885  ref on the inode.
2886
2887  So:
2888   - Set B_Cleaning when we look at a block for cleaning, and clear
2889     it when we find Realloc clear and ....????
2890   - Whenever a block has B_Cleaning set, it holds a counted reference
2891     on LAFSI(b->inode)->cleaner_ref
2892   - When cleaner_ref is non-zero and I_Deleting is not set, we hold
2893     a reference on the inode (igrab).
2894   - when i_nlink hits zero, set I_Deleting and drop any reference
2895     held by the cleaner.
2896  DONE - cleaner must be careful not to process any block that has been
2897     truncated, or file that is dead.
2898  DONE - Make sure the cleaner doesn't start up after the FinalCheckpoint.
2899   - What about filesystem inode... how do they fit in??
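  The cleaner_ref idea above could be sketched roughly like this.
  Everything here is a user-space stand-in: the struct, the
  cleaner_holds flag and the function names are hypothetical, and the
  locking that real code would need is left out.

```c
#include <stdbool.h>

struct ino {
    int  i_count;        /* stand-in for the VFS reference count */
    int  cleaner_ref;    /* number of blocks with B_Cleaning set */
    bool deleting;       /* stand-in for I_Deleting */
    bool cleaner_holds;  /* true while cleaner_ref pins i_count */
};

/* Drop the single inode reference held on behalf of the cleaner. */
static void drop_hold(struct ino *i)
{
    if (i->cleaner_holds) {
        i->cleaner_holds = false;
        i->i_count--;            /* the iput */
    }
}

/* A block gains B_Cleaning: the first one takes the inode reference. */
static void cleaning_get(struct ino *i)
{
    if (i->cleaner_ref++ == 0 && !i->deleting) {
        i->i_count++;            /* the igrab */
        i->cleaner_holds = true;
    }
}

/* A block loses B_Cleaning: the last one releases the inode. */
static void cleaning_put(struct ino *i)
{
    if (--i->cleaner_ref == 0)
        drop_hold(i);
}

/* i_nlink hit zero: set I_Deleting and let go straight away. */
static void nlink_hit_zero(struct ino *i)
{
    i->deleting = true;
    drop_hold(i);
}
```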
2900
2901
2902   Question. When are the index blocks for an inode flushed?
2903   We need to have them gone when the inode disappears.
2904   For deleted inodes, this happens in background truncate.
2905   For memory-pressure inodes it will hopefully happen well in advance,
2906   but we need to make sure in destroy_inode that everything is
2907   written. - FIXME
2908
2909
2910   Thinking again about B_Cleaning, any B_Realloc block will hold a
2911   reference through to InoIdx and so dblock will be present and the
2912   inode won't be freed.  So we only need an extra reference during
2913   the first little phase of cleaning when we are collecting blocks.
2914   After that a reference can be useful as it will delay flushing so it
2915   can be more efficient...
2916
2917   Maybe this is all much simpler than I thought.
2918   If we hold a ref on the inode whenever the InoIdx block is Pinned
2919   and i_nlink is non-zero, then we won't be forgotten until all
2920   index blocks are written.  We may still be deleted, but as that
2921   is one-way we can hold on to the inode at little cost.
2922
2923   getting/putting that ref at exactly those times turns out to be
2924   messy.
2925   It might be best to have a flag to say "We hold an extra ref".
2926   Then we occasionally call a function that validates the setting.
2927   It is most important to drop the count at the right time, so
2928   after unlink/rmdir/rename and when B_Pinned is dropped.
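  A sketch of the "flag plus validating function" approach, assuming
  the rule is exactly "hold a ref while InoIdx is Pinned and i_nlink
  is non-zero".  The struct and the name check_pin_ref are
  hypothetical.

```c
#include <stdbool.h>

struct pinned_ino {
    int  i_count;        /* stand-in for the VFS reference count */
    int  i_nlink;
    bool inoidx_pinned;  /* B_Pinned on the InoIdx block */
    bool extra_ref;      /* "we hold an extra ref" flag */
};

/*
 * Bring extra_ref into line with the current state.  Meant to be
 * called after unlink/rmdir/rename and whenever B_Pinned changes.
 */
static void check_pin_ref(struct pinned_ino *i)
{
    bool want = i->inoidx_pinned && i->i_nlink > 0;

    if (want && !i->extra_ref) {
        i->i_count++;
        i->extra_ref = true;
    } else if (!want && i->extra_ref) {
        i->extra_ref = false;
        i->i_count--;
    }
}
```

  Calling it twice in a row is harmless, which is the point of keeping
  the flag rather than pairing every get with a put by hand.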
2929
2930   B_Pinned is set in:
2931      set_phase which is called from:
2932           lafs_cluster_allocated when moving 'pin' across to data block
2933               so don't need checkpin
2934           lafs_pin_block_ph
2935               only need check_pin if dropping spinlock
2936           pin_all_children
2937               only pins data blocks (Index are already pinned if relevant).
2938           grow_index_tree
2939               where "inoidx block pinning" doesn't change
2940           do_incorporate_leaf
2941               No InoIdx involved
2942           do_incorporate_internal
2943               ditto
2944    So only need check in lafs_pin_block_ph and maybe pin_all_children...
2945
2946 08Aug2009
2947   - credits get out of sync from
2948       lafs_incorporate->refile->space_return from checkpoint.
2949       counter is one more than we can find.
2950       returning space on
2951          i [cfb9aaf0]327/0(2261)r1E:Index(1),Valid,PhysValid[0] NP
2952        Note it is an Index but not InoIdx.  The parent is still in the tree.
2953      That is FIXED
2954
2955   - and out by 8! at
2956       delete_inode -> truncate -> invalidate_page->erase_dblock->space_return
2957     FIXED that.
2958
2959   - BUG credits<0 in space_return from lafs_incorporate from add_block_address
2960      from phase_flip
2961 Just Grew [cfbb5c70]331/0(NoPhysAddr)r2E:Index(2),Pinned,Phase1,InoIdx,Valid,Dirty,UninCredit{0,1}[0] child(1) inc(1)
2962      from [cfbb5a40]331/0(0)r205F:Index(1),Pinned,Phase1,WPhase0,Valid,Dirty,CN,CNI,UninCredit,IOLock,PhysValid{0,0}[0] child(203) leaf(2)
2963 msg: (1,3,1)(1,1,-1)
2964 Credits = -1, rv=1
2965 ib = [cfbb5a40]331/0(0)r205F:Index(1),Pinned,Phase1,WPhase0,Valid,Dirty,CN,CNI,UninCredit,IOLock,PhysValid{0,0}[0] child(203) leaf(2)
2966
2967     This is a predicted but not handled problem.
2968     The answer is that not all blocks need ICredit/UnincCredit.
2969     The purpose of this credit is to allow for a split in the parent.
2970     Pre-existing index blocks can never split the parent themselves.
2971     If an index block becomes full, it will split and this might split
2972     the parent.
2973     If an index block has free space, then it will only overflow if it
2974     gets multiple child updates and this will provide multiple credits.
2975     So an index block with space for 3 or more new addresses does not need
2976     an ICredit/UnincCredit.  So when we split we don't need to provide an
2977     uninc credit.
2978     In particular.
2979     When we have a full InoIdx block and a single new child with 1 UnincCredit,
2980     each block already is either 'Dirty' or has a 'Credit', and the InoIdx has
2981     an ICredit, then create a new intermediate such that
2982         InoIdx is Dirty and has an ICredit
2983         New Index is Dirty with no ICredit - it used the UnincCredit
2984         New child loses its UnincCredit
2985     When another block in the new index arrives, its UnincCredit is used to
2986     provide an ICredit
2987
2988     When a leaf block cannot fit a single address it will have ICredit.
2989     The block is split so that each has 3 spaces and so do not need ICredit,
2990     but as soon as ICredit is available, they take it.
2991
2992     Worst case is that every ancestor is full and the leaf is split
2993     We then get two full branches, each block half empty so not needing ICredit.
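  The "3 free slots" rule can be written down as a one-line
  predicate.  This sketch assumes an index block can be described by a
  simple capacity and used count in addresses, which is a
  simplification; the names are hypothetical.

```c
#include <stdbool.h>

/* Threshold taken from the note above: 3 or more free slots. */
#define SPARE_SLOTS_NO_CREDIT 3u

/*
 * True if an index block with 'used' of 'capacity' address slots
 * filled should hold an ICredit/UnincCredit against a future split.
 */
static bool index_needs_icredit(unsigned capacity, unsigned used)
{
    unsigned free_slots = capacity > used ? capacity - used : 0;

    return free_slots < SPARE_SLOTS_NO_CREDIT;
}
```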
2994
2995
2996   Then...
2997     free data being used in lafs_refile from cleaner.
2998     b->inode->i_sb is 0x6b6b6b6b, so inode has been freed before cleaner frees it.
2999     Answer: lafs_refile was dereferencing ->inode when it wasn't safe.
3000      Need to at least have a parent before it is safe.
3001
3002   Hang:
3003      soft lockup cleaner->lafs_iget->ifind_fast ....
3004     Then (may be caused)
3005 Oh dear: [cfb63670]284/0(0)r1E:IOLock,PhysValid cleaning(1)
3006 .......: [cfb2ad20]284/0(2198)r2F:Index(1),Pinned,Phase1,InoIdx,Valid,PhysValid{0,0}[0] child(1) leaf(1)
3007 Why have I no credits? [cfb2ad20]284/0(2198)r2F:Index(1),Pinned,Phase1,InoIdx,Valid,Dirty,PhysValid{0,0}[0] child(1) leaf(1)
3008 ------------[ cut here ]------------
3009 kernel BUG at /home/neilb/work/nfsbrick/fs/module/block.c:656!
3010
3011     It seems the cleaner gets confused and goes spinning.
3012
3013
3014   So: space problems:
3015     After the run, we have -14 used and 2055 available (of 4608), and
3016     cannot create anything.
3017     4 segments are free, one is cleanable.
3018    free_blocks=2103 allocated=56 max_seg=512 clean_reserved=0
3019 or
3020    free_blocks=1722 allocated=64 max_seg=512 clean_reserved=0
3021 or
3022    df: tot=4608 free=4630 avail=977(1033-56) cb=10 pb=0 ab=-32
3023    free_blocks=1033 allocated=56 max_seg=512 clean_reserved=0
3024    and very little free
3025
3026   ablocks_used is going negative - why?
3027    Probably we erase a dblock without clearing Prealloc.
3028    Then when Prealloc later gets cleared, ablocks_used is
3029    wrongly decremented.... no...
3030
3031
3032 10aug2009  (don't forget above problems)
3033   Another problem.
3034    read->touch_atime->dirty_inode->inode_fillblock->dirty_dblock
3035      getiref_lock triggers BUG.
3036    This is presumably because I have just fixed it to get the correct
3037      iblock and not the iblock of the filesystem.
3038
3039   FIXME I hacked around this but I'm not sure the result is right.
3040     The question is about when the InoIdx should be dirty and when
3041     the inode data block should be dirty.
3042    In this particular case we are writing a page of a small file.
3043      cluster_allocate calls flush_data_to_inode which tried to dirty
3044      the inode dblock but finds that iblock is not pinned...
3045      When we dirty a data page we aren't pinning the parent!
3046    That might be OK - we only need to count and reserve the parent.
3047     We don't need to pin it until it becomes dirty.
3048
3049    Still need to resolve when which block gets to be dirty, and also
3050     exactly when an index block needs to be pinned.  And how does that
3051     relate to holding a ref on the inode when the inoidx is pinned.
3052     Maybe it should be when the inoidx is referenced.
3053    FIXME
3054
3055 11aug2009
3056    Another problem. unlink->handle_orphans->erase_dblock->allocated_block
3057     and get a zero from lafs_add_block_address but parent is not pinned.
3058   And... On unmount, orphan file still has pinned blocks so the inode
3059     isn't free.
3060   And ... root still old phase after lots of 'rm' then sync.
3061     Inode 244 has pinned inode block held by writepage0 and writepage
3062          this is adir/170
3063
3064 13aug2009
3065   - lots of bugs introduced by change to marking inode blocks dirty:
3066      writepage/cluster_allocate wants to Dirty inode data block with no credits.
3067          because I put credit in iblock!
3068
3069   - ohhh.... The phase contour is broken.  When a block is added to a
3070     cluster for allocation it isn't in the phaseleafs any more, but prevents
3071     its parent from joining.  So we cannot assume that if dblock is on
3072     list then iblock or a child will be too.
3073     So when we find dblock we do need to remove it.... done that.
3074
3075   - root not changing because Data 1/0 is Pinned and IOPending
3076      and held by writepage!!
3077      Problem is that IOPending blocks aren't put back on lru.
3078      But that should only be blocks on the cluster list.....
3079      But that is where I am putting it.
3080      Maybe I need exclusion between checkpointing and any other
3081        code that writes to checkpoint so checkpoint can wait
3082        for that ... can we use wc->lock??  That doesn't lock
3083        against cleaner, but that isn't a problem...
3084    But now 0/228 is still pinned and in writepage and IOPending
3085     So there is more to it than that.
3086     When checkpoint finds an IOLocked block, it might be about to
3087      join a cluster, in which case we don't really want to wait, or it
3088      might be undergoing incorporation in which case we want to wait.
3089      or it could be being erased, so wait..
3090      Maybe I wait until it appears on some list.... yes.
3091
3092 14aug2009
3093     At unmount Index 8/0 with child and leaf is still pinned
3094   This was pinned: [cfb29810]8/0(9)r3F:Index(1),Pinned,Phase0,InoIdx,Valid,SegRef,C,CI,CN,CNI,PhysValid{0,0}[0] release(1) child(1) leaf(1)
3095
3096   and..
3097
3098   A problem is that something goes wrong in the erase process.
3099   We find new children after we erase the inoidx block!
3100
3101   This was pinned: [cfb3d810]8/0(9)r284016F:Index(1),Pinned,Phase0,InoIdx,Valid,SegRef,C,CI,CN,CNI,PhysValid{0,0}[0] release(1) child(1) leaf(284014)
3102
3103   When/how do we erase indexblock and particularly inoidx blocks?
3104   Does an invalid InoIdx simply mean there is no indexing and does not
3105   reflect on the Data block?
3106
3107 .xlooping on [cfbe28c0]331/0(0)r2F:Index(1),Pinned,Phase1,InoIdx,Valid,Dirty,SegRef,C,CI,CN,CNI,UninCredit,PhysValid{0,0}[0] inode_handle_orphan(1) leaf(1)
3108
3109  Orphan problem:
3110 nextfree = 0
3111 reserved = 0
3112 VFS: Busy inodes after unmount of hdb. Self-destruct in 5 seconds.  Have a nice day...
3113 This was pinned: [cfb2ad20]0/0(13)r5F:Index(1),Pinned,Phase1,InoIdx,Valid,Root,C,CI,CN,CNI,PhysValid{0,0}[0] NP release(1) child(3) leaf(1)
3114 [cfb2ad20]0/0(13)r5F:Index(1),Pinned,Phase1,InoIdx,Valid,Root,C,CI,CN,CNI,PhysValid{0,0}[0] NP release(1) child(3) leaf(1) Leaf1(0)
3115   [cfa5374c]0/1(772)r0E:Valid,Dirty,UninCredit,PhysValid
3116   [cfb54430]0/8(775)r0E:Valid,Dirty,SegRef,UninCredit,PhysValid
3117   [cfb54c90]0/16(777)r0E:Valid,Dirty,SegRef,C,CI,UninCredit,PhysValid
3118
3119 nextfree = 1
3120 reserved = 0
3121   0: 1 0 0 304
3122 VFS: Busy inodes after unmount of hdb. Self-destruct in 5 seconds.  Have a nice day...
3123 This was pinned: [cfb273b0]8/0(9)r3F:Index(1),Pinned,Phase1,InoIdx,Valid,SegRef,C,CI,CN,CNI,PhysValid{0,0}[0] release(1) child(1) leaf(1)
3124 [cfb273b0]8/0(9)r3F:Index(1),Pinned,Phase1,InoIdx,Valid,SegRef,C,CI,CN,CNI,PhysValid{0,0}[0] release(1) child(1) leaf(1) Leaf1(0)
3125   [cfb74c90]8/0(2404)r1E:Valid,SegRef,CN,PhysValid orphan(1)
3126 badcnt 0 0 [cfb268c0]0/0(13)r4E:Index(1),Pinned,Phase1,InoIdx,Valid,Root,C,CI,CN,CNI,PhysValid{0,1}[0] NP child(4)
3127
3128
3129 erase Inoidx [ce5ab3b0]172/0(402)r1F:Index(1),InoIdx,Valid,IOLock,OnFree,PhysValid[0] inode_handle_orphan(1)
3130 erase Inoidx [ce5ab5e0]74/0(0)r2F:Index(1),Pinned,Phase0,WPhase0,InoIdx,Valid,Dirty,SegRef,C,CI,CN,CNI,UninCredit,IOLock,PhysValid{0,0}[0] child(1) leaf(1)
3131 ------------[ cut here ]------------
3132 WARNING: at /home/neilb/work/nfsbrick/fs/module/block.c:579 lafs_erase_iblock+0x
3133 unlink/orphan/erase_dblock_allocated_block
3134 ---[ end trace 61b8bd59512ea4da ]---
3135 zz [ce50d6a8]74/1059005010(0)r1E:SegRef,C,CI,UninCredit,IOLock,PhysValid,Orphan(0) orphan(1)
3136    [ce5ab5e0]74/0(0)r1E:Index(1),InoIdx,SegRef,C,CI,CN,CNI,UninCredit,PhysValid[0] child(1)
3137    [ce5ab5e0]74/0(0)r1E:Index(1),InoIdx,SegRef,C,CI,CN,CNI,UninCredit,PhysValid[0] child(1)
3138 ------------[ cut here ]------------
3139 kernel BUG at /home/neilb/work/nfsbrick/fs/module/index.c:1955!
3140
3141 BINGO.  When we remove last entry from directory we erase the InoIdx block,
3142  then when we add entries, we hit problems.
3143
3144
3145 nextfree = 3
3146 reserved = 0
3147   0: 1 0 0 306
3148   1: 1 0 0 307
3149   2: 1 0 0 74
3150 VFS: Busy inodes after unmount of hdb. Self-destruct in 5 seconds.  Have a nice day...
3151
3152 This was pinned: [cfb2caf0]16/0(10)r3F:Index(1),Pinned,Phase1,InoIdx,Valid,SegRef,C,CI,CN,CNI,PhysValid{0,0}[0] release(1) child(1) leaf(1)
3153 [cfb2caf0]16/0(10)r3F:Index(1),Pinned,Phase1,InoIdx,Valid,SegRef,C,CI,CN,CNI,PhysValid{0,0}[0] release(1) child(1) leaf(1) Leaf1(0)
3154   [ce9794f0]16/0(2200)r1E:Valid,SegRef,CN,CNI,PhysValid cleaning(1)
3155
3156 This was pinned: [cfb2d3b0]8/0(9)r3F:Index(1),Pinned,Phase1,InoIdx,Valid,SegRef,C,CI,CN,CNI,PhysValid{0,0}[0] release(1) child(1) leaf(1)
3157 [cfb2d3b0]8/0(9)r3F:Index(1),Pinned,Phase1,InoIdx,Valid,SegRef,C,CI,CN,CNI,PhysValid{0,0}[0] release(1) child(1) leaf(1) Leaf1(0)
3158   [ce968c90]8/0(2175)r3E:Valid,SegRef,C,CI,CN,CNI,PhysValid orphan(3)
3159
3160 This was pinned: [cfb2d180]1/0(8)r3F:Index(1),Pinned,Phase1,InoIdx,Valid,SegRef,C,CI,CN,CNI,PhysValid{0,0}[0] release(1) child(1) leaf(1)
3161 [cfb2d180]1/0(8)r3F:Index(1),Pinned,Phase1,InoIdx,Valid,SegRef,C,CI,CN,CNI,PhysValid{0,0}[0] release(1) child(1) leaf(1) Leaf1(0)
3162   [ce968430]1/0(2174)r1E:Valid,SegRef,C,CI,PhysValid cleaning(1)
3163
3164 We have stray 'cleaning' references.
3165 It is taken -
3166    on a data block that was in a to-clean segment
3167      at which point we igrab the inode
3168      the block is put on the ->cleaning list.
3169 It is put:
3170    when we get an error finding the block
3171    when we find that it isn't in the segment
3172    when an error occurs loading the block-to-be-relocated
3173    and when we mark that block for cleaning.
3174   i.e. always unless we got EAGAIN or some space error.
3175    If we still hold some blocks, try_clean returns 0.
3176
3177 VFS: Busy inodes after unmount of hdb. Self-destruct in 5 seconds.  Have a nice day...
3178 This was pinned: [cfb253b0]0/0(13)r5F:Index(1),Pinned,Phase1,InoIdx,Valid,Root,C,CI,CN,CNI,PhysValid{0,0}[0] NP release(1) child(3) leaf(1)
3179 [cfb253b0]0/0(13)r5F:Index(1),Pinned,Phase1,InoIdx,Valid,Root,C,CI,CN,CNI,PhysValid{0,0}[0] NP release(1) child(3) leaf(1) Leaf1(0)
3180   [cfa57b7c]0/1(2228)r0E:Valid,Dirty,UninCredit,PhysValid
3181   [ce5a4430]0/8(2231)r0E:Valid,Dirty,UninCredit,PhysValid
3182   [ce5a4c90]0/16(1028)r0E:Valid,Dirty,SegRef,C,CI,UninCredit,PhysValid
3183
3184 NOTE these inode data blocks are not pinned and so did not get written!!
3185
3186 FIXME I should wait for the checkpoint to finish
3187 nextfree = 1
3188 reserved = 0
3189   0: 1 0 0 301
3190 VFS: Busy inodes after unmount of hdb. Self-destruct in 5 seconds.  Have a nice day...
3191 This was pinned: [cfb27810]8/0(9)r3F:Index(1),Pinned,Phase0,InoIdx,Valid,SegRef,C,CI,CN,CNI,PhysValid{0,0}[0] release(1) child(1) leaf(1)
3192 [cfb27810]8/0(9)r3F:Index(1),Pinned,Phase0,InoIdx,Valid,SegRef,C,CI,CN,CNI,PhysValid{0,0}[0] release(1) child(1) leaf(1) Leaf0(0)
3193   [ce564c90]8/0(3983)r1E:Valid,SegRef,PhysValid orphan(1)
3194
3195 16Aug2009
3196   When I clean and find an inode that is already deleted, I need to be
3197   very careful not to resurrect anything.. I wonder if I am.... Yes, I seem
3198   to be.  lafs_delete_inode gets called a lot, but mostly for dead inodes.
3199
3200   BUGS:
3201 FIXED orphans don't get cleaned up.  It seems a 'create' fails and leaves
3202       an orphan block un-released.
3203    - sometimes inodes 1,8,16 didn't get written out as they were dirty but not pinned
3204    - Not sure that we handle complete truncation, then adding blocks properly.
3205      - what should the state of the InoIdx block be?
3206    - On remount, the filesystem contains rubbish.
3207    - create fails even when there should be free space.
3208    - sometimes BUG in checkpoint.c - not finishing checkpoint properly...
3209    - iblock not valid for inode 327 under cluster_flush/lafs_allocated_block
3210           and 74 has similar issue
3211      327 = adir/big1   74=adir
3212
3213
3214 17Aug2009
3215   Segusage blocks aren't always Pinned when we make them dirty.
3216   Yes. That is correct.  They are not forced out by phase change but by
3217   lafs_seg_flush_all at the end of a checkpoint.  So they need to be
3218   preallocated, but not Pinned.
3219   But, once we have finished the last checkpoint we don't want to
3220   dirty Segusage blocks any more.. I wonder if we are.
3221   No, but we were Pinning inodes without PinPending and they
3222   lost the pinning straight away!
3223
3224   OK, other annoyance.
3225    InoIdx block and similar are getting erased at the wrong
3226    time.
3227    We can only safely erase them when they have no children.
3228    I guess what we really want is that incorporation leaves them
3229    existing but empty, and when we go to write them out, if they
3230    are empty we register an address of 0.
3231    When we drop the ->parent pointer of an Index block it
3232    just goes away...
3233    So:
3234     When incorporate or truncate produces an empty index block
3235      it simply clears B_Valid.
3236     When incorporate wants to add to an index block, we set B_Valid
3237     When cluster_allocate gets a non-Valid index block it calls
3238     block_allocated with phys of 0.
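    The three steps above, sketched against a toy index block.  The
    struct and function names are hypothetical; only the B_Valid
    behaviour and the "address of 0" convention come from these notes.

```c
#include <stdbool.h>
#include <stdint.h>

struct iblk {
    bool     valid;     /* stand-in for B_Valid */
    unsigned nentries;  /* addresses currently held */
    uint64_t phys;      /* address recorded in the parent */
};

/* Incorporation or truncate removes an entry; empty means !Valid. */
static void incorporate_remove(struct iblk *ib)
{
    if (ib->nentries > 0 && --ib->nentries == 0)
        ib->valid = false;
}

/* Incorporation adds an entry; the block becomes Valid again. */
static void incorporate_add(struct iblk *ib)
{
    ib->nentries++;
    ib->valid = true;
}

/*
 * Writeout: a non-Valid index block is not written, and its parent
 * records address 0 instead of the next free address.
 */
static uint64_t cluster_allocate_index(struct iblk *ib, uint64_t next_free)
{
    ib->phys = ib->valid ? next_free : 0;
    return ib->phys;
}
```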
3239
3240     Yes, that seems to work.  Mostly
3241
3242 18Aug2009
3243   On remount, check_credits dies: 16/20-0
3244     In pin_dblock/reserve/seg_ref/prealloc/space_return ?? in lafs_mount.
3245
3246 19Aug2009
3247   OK, this index block clearing is a mess.  There must be a neat model I can
3248   follow that will make it "just work".
3249   The key seems to be children.  If an index block has children, then it
3250   really must exist.  If it has no children and no content, then it can
3251   be discarded, in which case it needs to be unlinked from its sibling list.
3252   What locking do we use here?  Probably IOLock on the parent index block.
3253   So we need iolock while looking in a parent for children, and we take
3254   IOLock while incorporating or pruning.
3255   Once the empty index block has dropped out it will never be found again.
3256   When we incorporate the zero address, the index block becomes invisible
3257   unless it is shortly after its predecessor in the sibling list.  But
3258   that is hard to ensure, especially if the first child is the one that
3259   is being erased.  So if an index block is erased, then it must be
3260   discarded quickly and any children need to be relocated...
3261   Or maybe not.... maybe if there are children, we just write an empty block?
3262
3263 22Aug2009
3264   We need better locking of the index information.
3265   It seems best to use IOLock as that is already held during incorporation.
3266   So any code that accesses or updates an index block must hold IOLock.
3267   This might be a bit of a restriction if we try to do a lookup while
3268   writeout is happening.... Maybe we need a separate writeback flag for that.
3269   But I think it is good to use IOLock for now.
3270   Places we need this are:
3271      flush_data_to_inode needs to lock the InoIdx block
3272        - DONE
3273      lafs_leaf_find as it recurses down.  This should return a locked leaf.
3274        - DONE
3275      callers of clear_index
3276          erase_dblock for depth=0??
3277        - DONE
3278      incorporate should lock new blocks for consistency
3279        - DONE
3280
3281    Locking dependency rule is that if we hold a lock, we are allowed to
3282    lock a child index block, but not a parent.  If we hold a data block,
3283    we are allowed to lock an index block.
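   The dependency rule can be expressed as a check that debugging
   code might run before taking a second lock.  This is only a sketch:
   the node struct is hypothetical, "child" is read as "any
   descendant", and cases the rule does not mention (such as locking a
   data block while holding an index block) are simply refused.

```c
#include <stdbool.h>
#include <stddef.h>

enum kind { K_DATA, K_INDEX };

struct lnode {
    enum kind           kind;
    const struct lnode *parent;  /* NULL at the top of the tree */
};

/* True if 'a' appears on the parent chain above 'n'. */
static bool is_ancestor(const struct lnode *a, const struct lnode *n)
{
    for (n = n->parent; n != NULL; n = n->parent)
        if (n == a)
            return true;
    return false;
}

/* May we lock 'want' while already holding 'held'? */
static bool may_lock(const struct lnode *held, const struct lnode *want)
{
    if (want->kind != K_INDEX)
        return false;              /* not covered by the rule */
    if (held->kind == K_DATA)
        return true;               /* data holder may lock an index block */
    return is_ancestor(held, want); /* index holder: downwards only */
}
```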
3284
3285
3286   The read/write completion seems all wrong.  It unlocks if the page was locked,
3287    and that isn't really safe, because it might not have been locked for read..
3288    We need to flag block0 to say if lock or writeback need to be cleared.
3289    Given that, I don't need IOPending any more:
3290     Read: We submit all reads, then set 'do_unlock', then check if we should unlock.
3291     Write: We queue all writes, then set 'do_clear_writeback', then check.
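   A sketch of the read side of this scheme using C11 atomics in user
   space.  The io_group struct and its function names are
   hypothetical; the idea taken from the notes is only that nobody
   unlocks until 'do_unlock' has been set after the last submission.

```c
#include <stdatomic.h>
#include <stdbool.h>

struct io_group {
    atomic_int  pending;    /* reads submitted but not yet completed */
    atomic_bool do_unlock;  /* set once every read has been submitted */
    bool        locked;     /* stand-in for the page lock */
};

static void io_group_init(struct io_group *g)
{
    atomic_init(&g->pending, 0);
    atomic_init(&g->do_unlock, false);
    g->locked = true;
}

/* Unlock only if all reads are done and unlocking was requested. */
static void io_check(struct io_group *g)
{
    if (atomic_load(&g->pending) != 0)
        return;
    /* exchange ensures that at most one caller performs the unlock */
    if (atomic_exchange(&g->do_unlock, false))
        g->locked = false;
}

static void submit_read(struct io_group *g)
{
    atomic_fetch_add(&g->pending, 1);
}

static void read_done(struct io_group *g)
{
    atomic_fetch_sub(&g->pending, 1);
    io_check(g);
}

/* Called by the submitter after the last submit_read(). */
static void all_submitted(struct io_group *g)
{
    atomic_store(&g->do_unlock, true);
    io_check(g);
}
```

   The write side would look the same with 'do_clear_writeback' in
   place of 'do_unlock'.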
3292
3293   Now... can we use a writeback flag to avoid waiting to read while writeout
3294   is happening?  We would need:
3295      set writeback in cluster_allocate
3296      wait_writeback after some lock_block
3297      clear_writeback when writeout finishes.
3298      Extra checks where we already check for IOLock
3299
3300
3301 24aug2009
3302  Lots of progress but....
3303    cluster_flush calls cluster_done calls refile calls iput calls
3304     drop_inode calls write_inode_now calls writepage calls cluster_flush
3305   and we get a locking loop.
3306    I think we need to run that cluster_done from a different thread.
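   One way to break such a loop is to queue the completion work and
   let another context drain the queue, so cluster_flush never ends up
   calling itself.  The sketch below is a single-threaded user-space
   model with hypothetical names; in the kernel a workqueue or a
   dedicated thread would presumably play the role of wq_run().

```c
#include <stddef.h>

struct work {
    struct work *next;
    void (*fn)(struct work *);
};

struct workq {
    struct work  *head;
    struct work **tail;
};

static void wq_init(struct workq *q)
{
    q->head = NULL;
    q->tail = &q->head;
}

/* Called where cluster_done would otherwise run inline. */
static void wq_defer(struct workq *q, struct work *w)
{
    w->next = NULL;
    *q->tail = w;
    q->tail = &w->next;
}

/* Run by the other thread; returns the number of items handled. */
static int wq_run(struct workq *q)
{
    int n = 0;

    while (q->head != NULL) {
        struct work *w = q->head;

        q->head = w->next;
        if (q->head == NULL)
            q->tail = &q->head;
        w->fn(w);
        n++;
    }
    return n;
}

/* Demo callback standing in for the deferred cluster_done. */
static int done_calls;
static void count_done(struct work *w)
{
    (void)w;
    done_calls++;
}
```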
3307
3308
3309  We seem to have a refcnt problem with segsum.
3310
3311 25aug2009
3312  Lots more progress but.....
3313
3314   orphan_release is finding that the orphan block has no credits.
3315   We can allocate credits and simply not do the update if they
3316   are not available:  having an extra entry in the orphan file isn't
3317   a problem.  However we need some mechanism to clean up other than
3318   waiting for a remount..
3319   I think we leave that until we redo orphan handling.
3320
3321  and: adir sometimes loses one block so it and the contents don't get
3322    deleted.
3323
3324  and: it seems we sometimes try to clean the segment being written
3325    to.  We must avoid that.
3326
3327  (long ago I wrote::
3328   FIXME When pin fails, we need to remove PinPending from everything!!!
3329  and never followed up ... I wonder?
3330  )
3331
3332 25Aug2009
3333  Orphan handling.
3334   Every orphan block goes on a per-fs list and gets removed only
3335   if the B_Orphan bit is clear.
3336   There are two times when we want to expedite orphan handling.
3337   1/ on rmdir we need to know if the directory is really empty.
3338      This requires that we expedite the orphan handling of all
3339      blocks.  As soon as we find a non-orphan, we can give up.
3340      Then we need to make sure the index tree has collapsed.  We
3341      can borrow that code from truncate.
3342
3343   2/ When writing past Trunc_next.  We just pass the block to
3344      special orphan handling.
3345
3346   This requires that orphan handling is re-entrant.
3347   For dir, that is protected by i_mutex, but rmdir needs to come
3348    in under the radar.
3349   For trunc, the iolock on the index blocks should be enough.
3350   I wonder if IOLock can be used on dir as well... allowing
3351   parallel orphan handling in the one dir even!!.
3352
3353   We need to ensure exclusion of orphan handling, including:
3354       - only one orphan handler at a time
3355       - don't run orphan handler while still processing action
3356         that makes it an orphan.
3357   Maybe if we just use IOLock for that?  Does that work?  Maybe
3358   but it gets messy for directories (on first attempt anyway).
3359   For directories we can just use i_mutex.
3360   Maybe i_mutex for files as well?
3361
3362 27Aug2009
3363   Orphan handling is going well... but not perfect.
3364   I'm using IOLock to ensure exclusion for orphan handling.
3365   However:
3366     I'm not really implementing that on directories
3367     Inodes go bad because lafs_erase_dblock needs the lock too.
3368     The call from rmdir will always fail because we hold i_mutex.
3369
3370   Bigger problem.  I'm IOLocking inodes across checkpoints to preserve
3371    Orphan status.  But that might stop the checkpoint proceeding.
3372    .. so use i_mutex, not IOLock - fine.
3373
3374   Now... it seems I've confused myself.  Orphans don't get handled
3375   immediately.  In particular, inodes should not be handled until
3376   their final delete_inode.  So setting the B_Orphan flag and putting
3377   on the list are two separate events.  The flag must come first,
3378   but the list may come much later.  So some of that mucking around
3379   with i_mutex is pointless.
3380   So:
3381     make_orphan makes sure it is in orphan file, sets bit, and removes
3382       from list (if present).
3383     add_orphan puts it on the list for handling.
3384
3385     For inodes: lafs_new_inode sets the bit and delete_inode puts on queue,
3386         as does any unlink/rmdir/rename that fails.
3387
3388     For directories: put it on list in commit/abort.
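    The split between the two events can be modelled with three
    booleans.  This is a sketch with hypothetical field names; the
    only rule it encodes is that the flag must be set before the
    block may go on the list.

```c
#include <stdbool.h>

struct oblk {
    bool in_file;     /* recorded in the orphan file */
    bool orphan_bit;  /* stand-in for B_Orphan */
    bool on_list;     /* queued for orphan handling */
};

/* Record the orphan and set the flag, but do not queue it yet. */
static void make_orphan(struct oblk *b)
{
    b->in_file = true;
    b->orphan_bit = true;
    b->on_list = false;   /* remove from the list if present */
}

/* Queue for handling; refused unless the flag is already set. */
static bool add_orphan(struct oblk *b)
{
    if (!b->orphan_bit)
        return false;
    b->on_list = true;
    return true;
}
```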
3389
3390
3391   And...
3392     I hit the BUG where find_leaf wants an address of 0.
3393       If an index block gets cleaned out it doesn't disappear
3394       immediately.. there is no leaf to find in that direction.
3395       We probably need to avoid non-Valid blocks or something...
3396   And...
3397     Orphans 0/299 to 0/329 and  0/280 are still on the list
3398      but are not orphans.
3399      Maybe I need to catch mutex_unlock to run the orphans??
3400   And...
3401     We underflow a segment through orphans at unmount.
3402       We are cleaning and truncating at the same time.
3403       The same block gets allocated to 0 and to 1225
3404       in quick succession.
3405       Problem is that we apply new address while in writeback
3406       so a new lafs_allocated_block
3407
3408 29Aug2009
3409
3410   Review of inodes in orphan list:
3411     lafs_new_inode makes an orphan for a non-existent inode.
3412     If the inode cannot be created, orphan_release is called.
3413     If it can, a 'struct inode' is filled in with valid type
3414     and nlink==1 (!!) and attached.  The inode will only be
3415     detached when the refcnt hits 0, and the orphan list implies
3416     a refcount, so if we ever find something on the orphan list
3417     with a NULL my_inode, it must be very new and can be ignored.
3418
3419     When we find an inode block with a my_inode there are a few options:
3420       if I_Trunc is set, we must progress truncation providing we can
3421             get the i_mutex
3422       else if I_Deleting we must delete the inode
3423       else if nlink is 0, we remove from the list
3424       else nlink > 0 and we must remove orphan status.
3425     This means that if nlink is elevated, we need to be holding the mutex...
3426     So don't elevate nlink any more...
3427
3428     When nlink becomes non-zero the block needs to be put back on the
3429     orphan list (it must already be an orphan).  Also when we set
3430     I_Deleting or I_Trunc it must go on the list.
3431    .. OK, I think I have all of that.
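The options listed above for an inode block found on the orphan list can
be written as one decision function.  This is a hedged Python sketch, not
the module's code: the function and enum names are invented, and the
RETRY_LATER outcome for a busy i_mutex is an assumption, since the notes
only say truncation progresses "providing we can get the i_mutex".

```python
from enum import Enum, auto


class Action(Enum):
    IGNORE = auto()             # no my_inode: very new, nothing to do yet
    PROGRESS_TRUNCATE = auto()  # I_Trunc set and i_mutex obtained
    RETRY_LATER = auto()        # I_Trunc set but i_mutex busy (assumed)
    DELETE_INODE = auto()       # I_Deleting set
    REMOVE_FROM_LIST = auto()   # nlink == 0: off the list, still an orphan
    CLEAR_ORPHAN = auto()       # nlink > 0: orphan status is removed


def orphan_action(has_my_inode, i_trunc, i_deleting, nlink, got_i_mutex):
    """Pick the action for one inode block taken from the orphan list."""
    if not has_my_inode:
        return Action.IGNORE
    if i_trunc:
        return Action.PROGRESS_TRUNCATE if got_i_mutex else Action.RETRY_LATER
    if i_deleting:
        return Action.DELETE_INODE
    if nlink == 0:
        return Action.REMOVE_FROM_LIST
    return Action.CLEAR_ORPHAN
```

The order of the tests matters: I_Trunc is checked before I_Deleting, and
nlink is only consulted when neither flag is set.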
3432
3433
3434 30Aug2009.
3435    I have some weirdness that seems to be caused by the orphan stuff,
3436    probably due to it all being async now.
3437    - A deleted inode clears I_Trunc and then sets it again.  The only
3438      explanation seems to be that delete_inode is being called again,
3439      so I must be igrabbing it again, maybe from cleaning.
3440    - bits of directories aren't getting deleted.  Sometimes single
3441      blocks, though the referred files are deleted.  Sometimes
3442      the whole directory... More interestingly, those blocks then
3443      don't get cleaned, so something about them means that they
3444      don't get deleted and don't get cleaned either.
3445
3446    Even weirder... I just had a case where file 331 had a different
3447    index block for every 4 data blocks...
3448
3449
3450    FIXME:
3451     - What stops pinned blocks from being flushed by bdflush in middle
3452       of operation and so losing allocation?  Must make sure to set
3453       them dirty very late.
3454     - orphan_release can fail, so must make sure we can always call
3455       it, even if my_inode is NULL.... but how?
3456
3457
3458     - make_orphan could fail due to lack of space, which is not OK.
3459       I made it loop, but I'm not 100% sure that is right... it isn't.
3460       I need to pass down the 'I'm freeing space' flag, and I need to
3461       not require Credit of Dirty is set, etc.
3462
3463
3464     - I seem to have a deadlock at unmount.
3465        umount is waiting for lafs_checkpoint_lock_wait in
3466           lafs_put_super
3467        pdflush is in down_read in sync_supers
3468        lafs_cleaner is iget_locked/ifind_fast/inode_wait
3469                 This is waiting for I_LOCK to be clear.
3470
3471
3472 31Aug2009
3473   - When a file shrinks and becomes level-0, make sure
3474     old addresses get deallocated.  I seem to have
3475     a directory where they didn't.
3476
3477   - Due to the fact that we over-preallocate, we really shouldn't
3478     return ENOSPC until we have flushed dirty data and performed
3479     a checkpoint??
3480
3481
3482   - When I removed the last index from an inode
3483     (Indirect type) it seems that I didn't write
3484     out the corrected block..??
3485
3486 1sep2009
3487  I ran my simple test run repeatedly overnight.
3488  It ran 208 times before I stopped it.
3489  There are 3 possible failure modes:
3490    1/ didn't complete within 500 seconds
3491    2/ triggered a BUG
3492    3/ appeared to complete, but the number of blocks
3493       in use was not the correct '7'.
3494
3495  74 (35%) did not fail!
3496  31 (15%) did not complete
3497  40 (19%) triggered a BUG
3498  2 did not complete but did not trigger a bug
3499
3500  94 of those that failed did not have a BUG
3501  92 actually completed.  Of these:
3502       1 final blocks 1
3503       1 final blocks 110
3504       1 final blocks 23
3505       2 final blocks 12
3506       5 final blocks 0
3507       6 final blocks 10
3508      11 final blocks 8
3509      21 final blocks 11
3510      44 final blocks 9
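The figures reported so far can be cross-checked mechanically.  The
Python sketch below only re-adds the numbers above; the variable names
are invented here and nothing in it comes from the test harness itself.

```python
runs = 208
no_failure = 74
did_not_complete = 31
triggered_bug = 40
incomplete_without_bug = 2

failed = runs - no_failure                        # runs with any failure
failed_without_bug = failed - triggered_bug       # failures with no BUG
completed_wrong = failed_without_bug - incomplete_without_bug

# Histogram from the notes: final block count -> number of runs.
final_blocks = {1: 1, 110: 1, 23: 1, 12: 2, 0: 5, 10: 6, 8: 11, 11: 21, 9: 44}


def percent(n):
    """Share of all runs, to one decimal place."""
    return round(100.0 * n / runs, 1)


print(failed_without_bug, completed_wrong, sum(final_blocks.values()))
print(percent(no_failure), percent(did_not_complete), percent(triggered_bug))
```

The histogram total agrees with the count of runs that completed with a
wrong block count, so the breakdown is internally consistent.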
3511
3512  of the BUGs,
3513        1 BUG: sleeping function called from invalid context at kernel/nsproxy.c:217
3514       1 BUG: spinlock lockup on CPU#0, rm/1330, cfb2dae4
3515       1 kernel BUG at /home/neilb/work/nfsbrick/fs/module/block.c:485!
3516       1 kernel BUG at /home/neilb/work/nfsbrick/fs/module/cluster.c:1219!
3517       1 kernel BUG at /home/neilb/work/nfsbrick/fs/module/inode.c:821!
3518       2 BUG: soft lockup - CPU#0 stuck for 61s! [lafs_cleaner:1177]
3519       3 kernel BUG at /home/neilb/work/nfsbrick/fs/module/segments.c:1028!
3520       3 kernel BUG at /home/neilb/work/nfsbrick/fs/module/segments.c:351!
3521       5 kernel BUG at /home/neilb/work/nfsbrick/fs/module/lafs.h:276!
3522       6 kernel BUG at /home/neilb/work/nfsbrick/fs/module/block.c:529!
3523       7 BUG: unable to handle kernel paging request at 6b6b6bfb
3524      11 kernel BUG at /home/neilb/work/nfsbrick/fs/module/super.c:655!
3525
3526
3527  super.c:655 is "block is still pinned" at unmount time.
3528   The block was always an InoIdx with a child.
3529   Either inode 0 or 16.
3530   child is held by various things:
3531       [cfb555cc]16/1(2098)r131E:Valid,Async,SegRef,CN,CNI,UninCredit,PhysValid async(1) clean2(130)
3532       [cfb554f0]16/0(1050)r25E:Valid,SegRef,CN,CNI,PhysValid clean2(25)
3533       [cfa57c58]0/2(3676)r0E:Valid,Dirty,UninCredit,PhysValid
3534       [cfa5bc58]0/2(3110)r0E:Valid,Dirty,UninCredit,PhysValid
3535       [ce5b94f0]16/0(519)r1E:Valid,Async,SegRef,CN,CNI,PhysValid async(1)
3536       [cfb4d4f0]16/0(4249)r1E:Valid,Async,SegRef,CN,CNI,PhysValid async(1)
3537       [ce5ad4f0]16/0(612)r1E:Valid,Async,SegRef,CN,CNI,PhysValid async(1)
3538       [ce5c2fc8]0/74(0)r129E:SegRef,C,Claimed,PhysValid clean2(129)
3539       [cfa57c58]0/2(1895)r0E:Valid,Dirty,UninCredit,PhysValid
3540       [cfb4d5cc]16/1(4543)r105E:Valid,SegRef,CN,CNI,UninCredit,PhysValid clean2(105)
3541       [ce5754f0]16/0(1290)r178E:Valid,SegRef,CN,CNI,PhysValid clean2(178)
3542
3543  The "unable to handle kernel paging request" is always in
3544  umount.
3545      invalidate_inode_buffers(26/46)/lock_acquire
3546
3547
3548  block.c:529
3549     This is iblock valid when erasing a block
3550     The block we are erasing is always 0/327 or 0/328.  It is
3551     an orphan we are handling, iolocked but not always pinned
3552
3553  lafs.h:276
3554     Map an iblock which is not IOLocked
3555        always in lafs_clear_index for the InoIdx block for a directory
3556        which is in Writeback.
3557        Call is in lafs_allocated_block from cluster_flush.
3558
3559  segments.c:351
3560     seg_inc reduces seg usage below 0
3561       - lots of blocks (inode 327) that were cleaned, were then erased twice.
3562       - 2 blocks (inode 328) were erased twice, both from prune
3563       - ditto
3564
3565  segments.c: 1028
3566      The free list is empty.... odd as only first segment is currently
3567      in use.
3568
3569  soft lockup:
3570      Still orphan: 0/328  Index(1) is in Writeback and Dirty
3571        again inode_handle_orphan2 is in Writeback
3572
3573  inode.c:821
3574      inode_handle_orphan at end, child list is not empty.
3575        The children seem to be in Realloc - cleaner needs to let go.
3576
3577  cluster.c:1219
3578      my_inode is null while cluster_flush is flushing an inode and wants to set
3579         WritePhase.
3580
3581
3582  block.c:485
3583      no ICredit for unincredit in dirty_dblock from dir_delete_commit
3584      from lafs_unlock.
3585
3586
3587  spinlock lockup is subsequent to real bug
3588  ditto for sleeping function.
3589
3590  Of the '44' which claimed final blocks of 9, 14 really had 7, and 4
3591  appear to have other strange values....
3592
3593  A selected '9' has two extra blocks for the directory '74'.
3594  But that directory is long gone.
3595  These dir blocks are currently fully populated with numbers.
3596  This seems to be the pattern with all non-7 blocks.
3597
3598
3599 02Sep2009
3600   Found a problem, possibly related to the dir blocks not being
3601   cleaned up.
3602   When lafs_incorporate sets ->depth to 1 it doesn't dirty the inode,
3603   so that fact is never copied in to the datablock.
3604   On further exploration, the I_Dirty bit is set but never used, which
3605   isn't good.
3606   So: exactly when do we copy inode into datablock, and what do we do
3607   when dirty_inode is called (if anything).
3608   We could just set I_Dirty when dirty_inode is called, checking that
3609   the block is Pinned which it usually will be.
3610   Then we copy inode to data just before writing data block.
3611   However that defeats transactional properties.  We need to copy in the
3612   same transaction, and that means either straight away, or when
3613   the data block's phase changes.
3614   So dirty_inode either copies to the block, or sets I_Dirty.
3615   When lafs_refile unpins an inode data block, it needs to check
3616   I_Dirty and possibly re-dirty it.
3617
3618   To redirty it we must steal the NCredits.  Any further dirty attempt
3619   will have to allocate more.
3620   The stealing is done automatically by dirty_dblock, so we just flip
3621   the phase and call dirty_inode ... making sure it doesn't try to
3622   prealloc too hard.
3623
3624   Need to review when inodes get dirtied.
3625     - commit_write only sets I_Dirty !
3626
3627     We call lafs_dirty_inode:
3628       dir_create_commit - a child of inode is PinPending
3629       lafs_create - ditto
3630       lafs_link - before dir_create_commit
3631       lafs_unlink, lafs_rmdir - data block is pinned
3632       lafs_symlink - before create_commit
3633       lafs_mkdir - before create_commit, or block pinned
3634       lafs_mknod - before create_commit
3635       lafs_rename - (moved to) before create_commit/update_commit
3636                      or data block is pinned
3637       lafs_dir_handle_orphan - (assured that) child is pinned.
3638       choose_free_inum - child is pinned
3639       lafs_incorporate - block is pinned
3640
3641     So either the data block is pinned, or the index block is pinned.
3642     In either case it is OK to set something to Dirty.
3643
3644     (the new) lafs_dirty_vfs_inode gets called by mark_inode_dirty{,_sync}
3645     this is called from:
3646         inode_inc_link_count
3647         inode_dec_link_count
3648         ..various quota ops...
3649         inode_setattr
3650         __set_page_dirty (Which we don't use)
3651         other buffer stuff
3652         other quota stuff we won't use
3653         touch_atime
3654         file_update_time
3655         page_symlink
3656
3657     only the time updates are interesting.  Others we have locking
3658     for.
3659     file_update_time is called from generic_file_aio_write_nolock etc
3660     before ->prepare_write/->commit_write.  So they can pick up the
3661     change.
3662     Similarly before set_page_dirty is called.
3663     touch_atime is called from do_follow_link and readlink and
3664     file_accessed which is called all over the place.
3665
3666     So what to do?
3667     If block is pinned, then dirty it to ensure writeout.
3668     If not, don't.  But copy data in any case.
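The conclusion above (always copy, dirty only when pinned) is small
enough to model directly.  This is a Python sketch under assumptions: the
class, its fields, and the dict of inode fields are invented for
illustration and do not mirror the real lafs_dirty_vfs_inode.

```python
class InodeDataBlock:
    """Stand-in for the data block that stores an inode."""

    def __init__(self, pinned):
        self.pinned = pinned
        self.dirty = False
        self.contents = {}


def dirty_vfs_inode(inode_fields, dblock):
    # Copy the inode into the block in any case, so it is never stale.
    dblock.contents = dict(inode_fields)
    # Only a pinned block is dirtied here, to ensure writeout.
    if dblock.pinned:
        dblock.dirty = True
    return dblock.dirty


pinned = InodeDataBlock(pinned=True)
unpinned = InodeDataBlock(pinned=False)
dirty_vfs_inode({"atime": 100}, pinned)
dirty_vfs_inode({"atime": 100}, unpinned)
```

In the unpinned case the time update reaches the block but only gets
written out if something else later dirties that block.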
3669
3670
3671 4sep2009
3672
3673     OK, I've decided that I don't like clearing B_Valid when an index
3674     block contains no indexes.  The final straw was that I seemed
3675     to need to initialise the index block when I didn't hold IOLock.
3676     That was probably fixable, but I'm sure more problems were coming.
3677
3678     So: what to do instead?
3679     One issue that must be resolved is that an index block can still
3680     have valid children even when it becomes empty.
3681     This can happen if we erase blocks from a file, then add them back
3682     after a checkpoint, and so in the next phase.
3683     The checkpoint writeout could need to show an empty index block,
3684     but the next phase will see real addresses.
3685     We cannot easily avoid this, so we must handle it.
3686     This interacts badly with the index lookup algorithm that finds
3687     the best index block currently in the parent, and then scans
3688     the children.  If there is no index block in the parent, we
3689     cannot find any children.
3690     This could be handled by responding to an empty index block by
3691     scanning all children.  But that isn't a full solution as if
3692     just one index block got erased, its unincorporated siblings
3693     would still be lost.
3694     We could treat empty index blocks like orphans.  i.e. don't
3695     discard them immediately but leave them with possibly real
3696     addresses.  Then when they have no children we allocate them to
3697     address 0.
3698     But we still need to ensure that index blocks off which siblings
3699     have been split but not yet incorporated remain present in the
3700     tree to mark the place for their siblings.
3701     There is another problem.  A horizontal split could leave the
3702     new block with no addresses and everything in the uninc list.
3703     Nothing can be found in there.
3704
3705     So maybe we need to revise the lookup mechanism.
3706     The goal is to find an index block that starts at or before
3707     the target and contains an address at or after the target.
3708     Then our search can stop.
3709     In rare cases.....
3710
3711 7sep2009
3712     I thought about this more over the weekend and think I have an answer.
3713     We need to treat internal and leaf index blocks somewhat differently.
3714
3715     An internal index block must never be empty (while unlocked).
3716     Any child block which has not had its address incorporated must be
3717     attached (simply in the sibling list) to a block which has been
3718     incorporated.  This will be the block that it was split off from.
3719     The uninc block needs to hold a reference so that the primary isn't
3720     released.
3721     When a 'primary' becomes empty it cannot be discarded, so the
3722     addresses in the first dependent index block must be copied
3723     across.  This is awkward for indirect blocks so they might be
3724     allowed to be empty (they aren't internal so don't violate the
3725     above).
3726     When a horizontal split breaks a sequence of dependent blocks
3727     between two parents, the second parent must be incorporated
3728     immediately so that the first block in the second half of the
3729     sequence is incorporated.
3730     If an internal index block does become empty and it has no
3731     dependent blocks to fill from, it must be invalidated immediately.
3732     It cannot have any children - even in next phase - as at least one
3733     would have to be incorporated and so the block would not be empty.
3734     Invalidating involves allocating to address 0.
3735     If index lookup finds a block with PhysValid address of 0, it
3736     must look to the previous index block.  If there was none .... it
3737     gets a bit complex.
3738
3739     Leaf index blocks can become empty, but we try to avoid it.
3740     If a leaf has blocks which have been created in the next phase,
3741     and others which have been deleted in this phase, it can be empty
3742     but still have children.  In this case we just treat it as a real
3743     index block that doesn't actually have any addresses.  We still
3744     write it out even though that is a waste of space.
3745
3746     We have been working on the assumption that every address always
3747     has a corresponding leaf index block.  It is the leaf with the
3748     highest index at or below the target address.
3749     However this requires that every internal index block has a child
3750     with the same address as the parent.
3751     Preserving this requirement when the first child of an internal
3752     index block becomes empty requires either:
3753        - loading the 'next' child and reassigning this to the start
3754        - changing the address of the parent to match the first child.
3755     The former requires possibly reading a block from storage.
3756     The latter only involves modifying blocks that are due to be
3757     written out anyway, but makes block look up slightly interesting.
3758     When lookup finds an invalid block that is 'first', it needs to
3759     start again from the top.
3760     When incorporation creates an invalid block that is first, it
3761     needs to walk down from the top and any index block at the same
3762     address needs to be relocated/rehashed.  If the block is
3763     incorporated, the incorporated address needs to be updated.
3764     So:
3765      - flag for unincorporated index blocks which implies a reference
3766        on primary
3767      - after split, immediately incorporate second block
3768      - change lookup to retry when finding invalid block
3769      - When internal block becomes empty, either merge with
3770        first dependent or invalidate.  If first in parent,
3771        update address and parent and recurse.
3772        Need some 'clever' locking here.
3773        Before unlocking the invalidated block, we take i_alloc_sem,
3774        then walk up the ->parent tree locking blocks as
3775        required.
3776        The index lookup, when it finds an invalid block will take
3777        i_alloc_sem, then drop it, then start again.
3778        Or maybe some other lock than i_alloc_sem...
3779      - When leaf becomes empty, invalidate only if it has no children.
3780        When internal leaf becomes unpinned, check if empty.
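The "retry when finding an invalid block" step in the list above can be
sketched as a loop.  This is Python rather than kernel C and makes
several assumptions: a threading.Lock stands in for i_alloc_sem (or
whatever lock ends up being used), walk_from_top is a hypothetical
callback that performs one top-down lookup, and max_tries exists only so
the sketch cannot spin forever.

```python
import threading


def is_invalidated(blk):
    # "Invalid" here means PhysValid is set and the physical address is 0.
    return blk["phys_valid"] and blk["physaddr"] == 0


def lookup_with_retry(walk_from_top, tree_lock, max_tries=16):
    for _ in range(max_tries):
        blk = walk_from_top()
        if not is_invalidated(blk):
            return blk
        # Take and immediately drop the lock.  This only waits for
        # whoever invalidated the block to finish fixing up the tree.
        with tree_lock:
            pass
    raise RuntimeError("index lookup kept finding invalidated blocks")


# Usage: the first walk sees an invalidated block, the second succeeds.
results = iter([
    {"fileaddr": 340, "phys_valid": True, "physaddr": 0},
    {"fileaddr": 336, "phys_valid": True, "physaddr": 1949},
])
found = lookup_with_retry(lambda: next(results), threading.Lock())
```

The lookup never holds the lock while walking, so the invalidating side
can hold it for the whole walk up the ->parent chain without deadlock in
this model.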
3781
3782 21sep2009
3783    That locking doesn't look like it will work, and we can never 'merge
3784    with first dependent' as it is not valid to have an index block
3785    where the first child is at a different address.
3786    And we cannot always change the parent address, particularly if it
3787    is zero - increasing it then cannot work.
3788    And there is no need to load a block if we are just going to change
3789    its start address (not internal index blocks anyway).
3790    Let's drop the idea of relocating the parent.
3791    If an internal index block becomes empty:
3792      If it is last in parent, no loss, just discard
3793        If parent would be empty, need to recurse up.
3794      If it is not last relocate the next sibling to this location,
3795       rehashing it and updating the parent.
3796    If a leaf index block becomes empty we cannot just delegate to
3797       next as it might be indirect... not a problem if address is
3798       stored.  But that requires a format change... now might be a
3799       good time!
3800
3801
3802    So:
3803      If we hold an index block locked and it becomes empty and we choose
3804      to invalidate it, we need to ensure that doing so does not
3805      break any indexing paths.
3806      So we take a separate lock (i_alloc_sem??) and flag the block as invalid
3807      by setting physaddr to 0 while PhysValid is set, and unlock the block.
3808      Any lookup that finds such a block must take and release i_alloc_sem,
3809      and then restart from the top.
3810      - If the block was not incorporated, we just remove from sibling list
3811           and all is done - the space is implicitly included in
3812           previous block.
3813      - If the block has a different fileaddr than the parent then update
3814           the parent directly, either removing the entry, or changing it to
3815           point to the first unincorporated sibling (if there is one).
3816           This requires taking the lock on the parent of course.  That is
3817           why we dropped the lock on the child.
3818           Then all done.
3819      - If the block has the same address as the parent we need to find
3820           a 'next block' to relocate to the start of the parent.
3821           It is either the first unincorporated sibling, or the next
3822           block in the index block, or nothing, meaning the parent is
3823           about to become empty.
3824         We lock the parent (still holding i_alloc_sem), and rehash the
3825           chosen child.  If it doesn't exist, or is not dirty, we need
3826           to update the phys address directly in the parent
3827           accordingly, erasing or replacing the first address.
3828           Then we need to rehash the index block, but we need to lock
3829           the parent for that.
3830           So set a 'busy' flag on the block, unlock it, lock parent,
3831           rehash, clear busy flag, and repeat.
3832       - We can never relocate a block with fileaddr of zero, as the
3833           InoIdx block cannot be relocated.  So leaf index block 0
3834           must never be erased unless the file is empty.  So
3835
3836 28sep2009
3837   New idea.
3838   We store the start address of an indirect block in the block.
3839   This means that the meaning of any index block is completely
3840   independent of the location of the block, so we can change the location
3841   easily and without touching the block.
3842   So if a block becomes empty, we simply move the next block back to
3843   fill the gap.
3844   i.e. when an index block becomes truly empty (i.e. no children)
3845    - if it wasn't incorporated, simply remove it
3846    - if it was,
3847        - if there is a dependent block, rehash it to take my address
3848        - if there is a next block that is dirty, rehash it
3849        - if there is a next block that is not dirty,
3850           update parent to merge my entry with next, and rehash next
3851           if it exists
3852        - if there is no next block but we are not first, just update
3853           parent
3854        - if no next block and we are first, parent becomes empty,
3855           recurse upwards.
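The case analysis above reads naturally as a single function.  The
Python sketch below is only a restatement of that list: the function
name and the step strings are invented, and reading "rehash next if it
exists" as "if the next block is currently in memory" is an assumption.

```python
def empty_index_block_steps(incorporated, has_dependent, next_block, is_first):
    """Steps to take when an index block becomes truly empty.

    next_block is None, or a dict with 'dirty' and 'in_memory' flags.
    """
    if not incorporated:
        return ["remove block"]
    if has_dependent:
        return ["rehash dependent block to this address"]
    if next_block is not None and next_block["dirty"]:
        return ["rehash next block to this address"]
    if next_block is not None:
        steps = ["update parent: merge this entry with next"]
        if next_block["in_memory"]:
            steps.append("rehash next block to this address")
        return steps
    if not is_first:
        return ["update parent"]
    return ["parent becomes empty", "recurse upwards"]
```

For example, an incorporated block with no dependent and no next block
that is also first in its parent yields the two "recurse" steps, which is
the only path that touches more than one level of the tree.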
3856
3857 12Oct2009
3858  - too long, I've forgotten what I was up to..
3859    + I've changed the format of indirect blocks to store an address.
3860    + I've handled incorporation of an empty block
3861    So now internal index blocks can never be empty - they get immediately
3862    unlinked if they are.
3863    Leaf index blocks can be empty while they have children.  We don't
3864    flag them as empty, but rather wait until another child gets incorporated.
3865    But I don't think I really like that.  It is an external ugliness based
3866    entirely on internal implementation details.  Empty index blocks should
3867    not get written out.  We need some way to reliably find an empty index
3868    block.  The address won't appear in the parent so a lookup will find the
3869    previous block which we cannot link to now as it may not exist yet.
3870    Worse - if first index block goes empty, we can only unlink it by moving
3871    the parent to start at the next block.  That would make this index block
3872    totally unfindable.
3873    So I think we have to stick with writing out empty index blocks very
3874    rarely.  So we need to be sure they disappear properly.
3875    The difficult case is if an index block becomes empty while it has some
3876    children which don't end up getting dirtied. e.g. an update aborts.
3877    We need to leave the block with enough credits to be written out.
3878    I guess the Ncredit should be enough...
3879    Maybe worry about that later.
3880
3881  - what about InoIdx blocks when they become empty?  It would be helpful
3882    to flag them so that inode deletion can check....
3883    Maybe just set depth to 0..
3884
3885  ARRGGG... I've completely lost it.  I need another ITO week.
3886   I just got a bug in summary.c:71!!
3887
3888 7 Jun 2010
3889  - summary.c:71.
3890    ablocks_used has hit zero too soon.
3891    This should be the count of blocks for which space has been allocated
3892    (B_Prealloc is set) but have not been given a phys address yet - at which
3893    point the usage count is moved to cblocks_used or pblocks_used.
3894    The last block (which may not be the cause of the problem) does not have
3895    B_Prealloc set, yet physaddr == 0.
3896    The block is 0/1, so the inode for the inode usage map.  This should have
3897    physaddr 8 !!
3898    We did find 8, then changed to 73, but then changed to 0!
3899   Ahhh... recent fix exposed a subtle bug ... fixed.
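The counter invariant described above can be stated as a toy model.
This is a Python sketch with assumed names: the notes do not say what
decides between cblocks_used and pblocks_used, so the to_cblocks
parameter is purely a placeholder for that choice.

```python
class UsageCounters:
    """Toy model of ablocks_used / cblocks_used / pblocks_used."""

    def __init__(self):
        self.ablocks_used = 0   # space allocated, no phys address yet
        self.cblocks_used = 0
        self.pblocks_used = 0

    def prealloc(self, blk):
        # Setting B_Prealloc accounts for the block exactly once.
        if not blk["prealloc"]:
            blk["prealloc"] = True
            self.ablocks_used += 1

    def assign_phys(self, blk, physaddr, to_cblocks):
        # Moving the count requires that it was taken in the first place.
        if not blk["prealloc"]:
            raise AssertionError("no B_Prealloc: ablocks_used would underflow")
        blk["prealloc"] = False
        blk["physaddr"] = physaddr
        self.ablocks_used -= 1
        if to_cblocks:
            self.cblocks_used += 1
        else:
            self.pblocks_used += 1


counters = UsageCounters()
block = {"prealloc": False, "physaddr": 0}
counters.prealloc(block)
counters.assign_phys(block, 8, to_cblocks=True)
```

A block with B_Prealloc clear and physaddr == 0, as seen in the failing
case, would trip the check in assign_phys rather than silently driving
ablocks_used to zero too soon.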
3900
3901  Now cluster.c:619: [ce9233f8]0/282(0)r2F:Pinned,Phase1,PinPending,CN,CNI,Claimed,PhysValid iblock(1) leaf(1)
3902      cluster.c:619: [ce570a18]0/286(0)r2F:Pinned,Phase1,PinPending,SegRef,CN,CNI,Claimed,PhysValid iblock(1) leaf(1)
3903      cluster.c:619: [ce588d6c]0/17(0)r2F:Pinned,Phase0,PinPending,CN,CNI,Claimed,PhysValid iblock(1) leaf(1)
3904      cluster.c:619: [ce51dfe4]0/283(0)r2F:Pinned,Phase0,PinPending,CN,CNI,Claimed,PhysValid iblock(1) leaf(1)
3905      cluster.c:619: [cfbb8430]0/328(0)r2F:Pinned,Phase1,PinPending,CN,CNI,Claimed,PhysValid iblock(1) leaf(1)
3906    We are allocating an InoIdx block, but data block is not valid??
3907
3908  That isn't very reproducible so I'll have to leave it for now...
3909     erasedblock had been called on the data block .. inode 17??
3910
3911   Problem is that I keep changing the rules.
3912    I don't erase the InoIdx block any more.
3913    I used to, then changed it to iolock_block/cluster_allocate->0
3914
3915  Problem: When all files are removed, usage is still quite high, two
3916    segments have over 400 blocks (out of 512).  Cleaning keeps running and
3917    not making much progress.
3918   segment 6 has usage of 484.
3919   'cluster 3072' shows: cluster 3072, 3085, 3086 3092
3920     Inode 0:  blocks 267 272 276
3921     Inode 277: blocks 0/4 6/2
3922     Inode 0: blocks 0/2 8 16
3923     Inode 0: block 16 70/2 131/3 135/4 140/9 150/2 ... 296/7
3924     Inode 16: 1/1
3925     Inode 17: 0/28
3926     Inode 283: 12/18
3927           etc.
3928
3929   All 'old', so must be the product of cleaning, as you would expect.
3930   All (most) of this has been deleted though, but count didn't drop.
3931    'Count' adds to 508, plus the 4 cluster heads makes 512 - good.
3932   lafs_seg_move definitely isn't being called on these blocks.
3933   it is only called from lafs_summary_update
3934   cblocks_used "exactly" matches the number of un-removed blocks.
3935
3936
3937   Another problem
3938 bad [ce5bcf50]301/0(0)r1E:Index(0),Pinned,Phase1,WPhase0,InoIdx,Valid,SegRef,C,CN,CNI,UninCredit,IOLock,PhysValid{0,0}[0] leaf(1)
3939 /home/neilb/work/nfsbrick/fs/module/modify.c:1652: [ce5bcf50]301/0(0)r1E:Index(0),Pinned,Phase1,WPhase0,InoIdx,Valid,SegRef,C,CN,CNI,UninCredit,IOLock,PhysValid{0,0}[0] leaf(1)
3940 bad [cfbf6000]327/0(0)r1E:Index(0),Pinned,Phase1,WPhase0,InoIdx,Valid,SegRef,C,CN,CNI,UninCredit,IOLock,PhysValid{0,0}[0] leaf(1)
3941 /home/neilb/work/nfsbrick/fs/module/modify.c:1656: [cfbf6000]327/0(0)r1E:Index(0),Pinned,Phase1,WPhase0,InoIdx,Valid,SegRef,C,CN,CNI,UninCredit,IOLock,PhysValid{0,0}[0] leaf(1)
3942 bad [cfb62d20]291/0(0)r1E:Index(0),Pinned,Phase1,WPhase1,InoIdx,Valid,SegRef,C,CN,CNI,UninCredit,IOLock,PhysValid{0,0}[0] leaf(1)
3943 /home/neilb/work/nfsbrick/fs/module/modify.c:1656: [cfb62d20]291/0(0)r1E:Index(0),Pinned,Phase1,WPhase1,InoIdx,Valid,SegRef,C,CN,CNI,UninCredit,IOLock,PhysValid{0,0}[0] leaf(1)
3944
3945  and
3946 free_blocks=1842 allocated=449 max_seg=512 clean_reserved=0
3947 Want dump of usage
3948
3949 ------------[ cut here ]------------
3950 kernel BUG at /home/neilb/work/nfsbrick/fs/module/segments.c:1028!
3951  free list is empty - that should not be.
3952
3953 and another...
3954 /home/neilb/work/nfsbrick/fs/module/modify.c:1219: [ce9893b0]74/0(0)r1E:Index(1),Pinned,Phase0,WPhase1,InoIdx,Valid,Dirty,CN,CNI,UninCredit,IOLock,PhysValid{0,0}[0] leaf(1)
3955 /home/neilb/work/nfsbrick/fs/module/modify.c:1219: [ce5ba690]74/0(0)r1E:Index(1),Pinned,Phase1,WPhase0,InoIdx,Valid,Dirty,CN,CNI,UninCredit,IOLock,PhysValid{0,0}[0] leaf(1)
3956  [<d0a57bc8>] ? lafs_get_flushable+0x131/0x191 [lafs]
3957  [<d0a5856d>] ? lafs_do_checkpoint+0x1b3/0x3a2 [lafs]
3958  [<d0a5fe7e>] ? cleaner+0x105/0x1426 [lafs]
3959  [<c02256bf>] ? autoremove_wake_function+0x0/0x33
3960  [<d0a5fd79>] ? cleaner+0x0/0x1426 [lafs]
3961
3962
3963 08Jun2010
3964  Weirdness with truncating.
3965  The cleaner relocates a file resulting in the InoIdx block being
3966  Maybe-dirty and phys_addr == 0.
3967  Then truncate doesn't prune but just incorporates, finding
3968   something weird there..
3969   file 278, blocks around 4100
3970   seem to find 1949 instead??
3971
3972  Note: When a non-InoIdx block is erased we set PhysValid
3973   and physaddr == 0 to record the fact because it will not be stored...
3974
3975 modify.c:1654: [ce5b4460]327/336(16)r4F:Index(1),Pinned,Phase0,WPhase1,Valid,Async,SegRef,C,CI,CN,CNI,IOLock,PhysValid{0,0}[0] leaf(1) inode_handle_orphan2(1) async(1) inode_handle_orphan3(1)
3976 Async ??
3977 modify.c:1657: [cfb90690]327/340(787)r4F:Index(1),Pinned,Phase1,WPhase0,Valid,Async,SegRef,C,CI,CN,CNI,IOLock,PhysValid{0,0}[0] leaf(1) inode_handle_orphan2(1) async(1) inode_handle_orphan3(1)
3978 Still Async ... wonder what it means.
3979
3980 - directory block got corrupted.  Maybe conversion to indexed??
3981
3982
3983 Getting bug in remove_from_index because the addr isn't
3984 there, possibly block is empty.  But incorporation is
3985 ??? instant?  No it isn't.
3986 If an index block hasn't been incorporated it has B_PrimaryRef
3987 set as it holds a ref to some earlier index.
3988 But what if nothing is incorporated?
3989
3990
3991 Allocated [ce402230]328/340(0)r5F:Index(1),Pinned,Phase0,WPhase0,Valid,Dirty,Async,SegRef,CN,CNI,UninCredit,IOLock,PhysValid,Uninc{0,0}[0] inode_handle_orphan2(1) leaf(1) uninc(1) async(1) inode_handle_orphan3(1) -> 0
3992 looping on [ce402230]328/340(0)r5F:Index(1),Pinned,Phase0,Valid,Dirty,Writeback,Async,SegRef,CN,CNI,UninCredit,PhysValid,Uninc{0,0}[0] inode_handle_orphan2(1) cluster(1) uninc(1) async(1) inode_handle_orphan3(1)
3993
3994 Then spin in a soft-lockup in lafs_inode_handle_orphan
3995
3996
3997 -----------
3998  - grow_index_tree needs to do initial incorporation so things can be found.
3999     just like end of do_incorporate_internal.
4000    NO - cannot incorp yet as do not have phys addr.  Don't need to as
4001    lafs_leaf_find explicitly handles this.
4002    For truncate case we don't use the stored address, but ensure all
4003    leaf indexes must be dirty (or gone) so whole tree must be
4004    accessible for walking around.
4005  - do_incorporate_internal needs to set B_PrimaryRef and take the ref
4006  - when we remove a B_PrimaryRef without incorporating it, we need to
4007    drop a ref if the *next* in the list is B_PrimaryRef
4008  - need to use a constant to identify 'async' calls etc.
4009  - maybe I need other iolock_block in truncate ?? to ensure it is Valid so
4010    it isn't found as async....
4011
4012 09Jun2010
4013  STILL struggling with incorporation.
4014  We have a premise that any file address is covered by precisely
4015  one leaf index block.  Every leaf index has an implicit address
4016  and it covers all addresses from there to the next leaf.  The last
4017  leaf covers to EOF.
4018  So there must always be a leaf at address 0.
4019  This applies within the tree from an internal index block too.
4020  Beneath an internal index block there must be a leaf covering every
4021  address up to the next internal index block.  So there must be
4022  a first.  So storing the first address is pointless.  And harmful.
4023  When an index block becomes empty and disappears its coverage is
4024  included in the previous block unless there is none, in which case
4025  the next index block must be re-addressed.  If there is no 'next',
4026  this index block must be empty and so must disappear.
4027
4028  BUT if we re-address an index block, we implicitly re-address the
4029  first child - recursively - so we need to move/rehash them all
4030  or lose them... or record where they are.  Or do lookup not by
4031  addr....
4032  I think just rehashing them all - with an iolock - is simple
4033  and safe.  So just do that.
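The coverage premise and the re-addressing rule can be checked on a toy
model.  This Python sketch represents the leaves under one parent as a
sorted list of start addresses; the function names are invented, and the
rehashing of children that the notes call for is reduced to a comment.

```python
import bisect


def covering_leaf(leaf_addrs, target):
    """Return the start address of the leaf that covers target."""
    i = bisect.bisect_right(leaf_addrs, target) - 1
    if i < 0:
        raise LookupError("no leaf at or below target")
    return leaf_addrs[i]


def remove_empty_leaf(leaf_addrs, addr):
    """Drop an empty leaf and return the new sorted list of starts."""
    i = leaf_addrs.index(addr)
    rest = leaf_addrs[:i] + leaf_addrs[i + 1:]
    if i == 0 and rest:
        # No previous leaf to absorb the range, so the next leaf is
        # re-addressed to start where the removed one did.  In the real
        # tree its first child, recursively, must be rehashed as well.
        rest[0] = addr
    return rest


leaves = [0, 100, 250]
after_middle = remove_empty_leaf(leaves, 100)   # range joins the leaf at 0
after_first = remove_empty_leaf(leaves, 0)      # leaf at 100 moves to 0
```

In both cases every address is still covered by exactly one leaf, and an
empty result means the parent itself is now empty and must disappear.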
4034
4035
4036  So:  I cleaned up index handling and truncation somewhat.
4037   Now running looptest to see what patterns emerge:
4038
4039   block.c:197 (*9+1) During umount, the Root datablock is
4040         Dirty+Realloc
4041         Maybe just need for cleaner to become inactive
4042         during umount - hope that doesn't deadlock
4043         didn't even work...
4044   block.c:529 (*4+1)  erase dblock while iblock depth > 0
4045         When pruning InoIdx we want to set depth to 0.
4046         FIXME is this really what I want, or is depth=0
4047         only for data-inode ... FIXME
4048   cluster.c:533 (*2) cluster_allocate on invalid block
4049           Block is 8/0 in writepage from sync_inodes
4050           This is the orphan file.
4051                    blocks aren't dirty
4052           I guess the file gets truncated while we wait for it.
4053           Just need to re-test.
4054   index.c:1936 (*2).   An index block is Root - FIXED??
4055   modify.c:1056 - secondary bug, ignore for now.
4056   modify.c:1650 update_index fails to find target.
4057               second call, phys==0
4058               Code was bad ... may not be the cause though.
4059   modify.c:1696 (*4) lafs_incorporate gets non-dirty Index(1) block
4060                    from orphan handler.
4061                 Maybe just change the do/while back to 'do'.
4062   modify.c:1704: (*2) lafs_inc gets leaf with uninc list???
4063                Index(0)/InoIdx
4064                in do_checkpoint
4065                uninc list gets set in lafs_add_block_address (parent of iblk),
4066                 do_incorporate_internal,
4067                Maybe the InoIdx still had children.
4068   segments.c:1028.  (*4) The free list becomes empty.
4069   super.c:655 (*3)   Busy inodes after umount, and root InoIdx block
4070          is still pinned as inode 16 data block was still dirty.
4071          segusage slow.  Maybe same as block.c:197 ??
4072   invalid address 6b6b6bfb: invalidate_inode_buffers in shutdown
4073           finds invalid lock.
4074           presumably the inode was freed before being invalidated.
4075   spin on writeback during truncate (r3a) 8 times. now 10
4076         Probably because writeback cannot proceed while
4077         orphan processing keeps looping.
4078   kmalloc-1024 problems - (*2)
4079           A block - should be start of page - isn't what it appears...
4080
4081  Others complete with 'cb' ranging from 202 to 715
4082
4083
4084 10 June 2010
4085
4086  Looking at segments.c:1028
4087   We run a seg_scan every checkpoint, so that should keep free segments
4088   in the list.....
4089   Ahh.. do_checkpoint is looping because root isn't changing phase.
4090
4091   Lowest block pinned to old phase is
4092   [cfb7df08]0/74(4253)r0E:Pinned,Phase1,WPhase0,Valid,Dirty,CN,CNI,UninCredit,IOLock,Claimed,PhysValid
4093   which is not on leaf list because it has IOLock
4094   With more debugging:
4095   [ce5c5f08]0/74(4250)r0E:Pinned,Phase1,WPhase0,Valid,Dirty,Realloc,SegRef,CN,CNI,UninCredit,IOLock,Claimed,PhysValid</io.c:368>
4096   or better (that was in lafs_iolock_written)
4097   [ce5c05e8]0/74(4257)r0E:Pinned,Phase0,WPhase0,Valid,Realloc,SegRef,C,CN,CNI,UninCredit,IOLock,Claimed,PhysValid</file.c:247>
4098   FIXED - I didn't unlock if it wasn't dirty any more.
4099   Well almost - it occurs much less now.
4100   Out of 48 runs:
4101       8 BUG: soft lockup - CPU#0 stuck for 61s! [lafs_cleaner:1180]
4102       1 BUG: unable to handle kernel NULL pointer dereference at 000001b4
4103       2 BUG: unable to handle kernel paging request at 6b6b6bfb
4104       3 kernel BUG at /home/neilb/work/nfsbrick/fs/module/block.c:197!
4105       2 kernel BUG at /home/neilb/work/nfsbrick/fs/module/block.c:529!
4106       1 kernel BUG at /home/neilb/work/nfsbrick/fs/module/modify.c:1650!
4107       2 kernel BUG at /home/neilb/work/nfsbrick/fs/module/modify.c:1696!
4108       4 kernel BUG at /home/neilb/work/nfsbrick/fs/module/segments.c:1028!
4109       1 kernel BUG at /home/neilb/work/nfsbrick/fs/module/segments.c:332!
4110
4111   So we now have 1/12 rather than 2/3.
4112   a/ pinned by IOLock from file.c:220 - FIXED
4113   b/ as above
4114   c/  Root is pinned by 4 children
4115       328/0  with 196 of data blocks in writeback/realloc, in a cluster
4116       0/1, 74/0, 0/8   all in a cluster waiting writeout.
4117      Don't understand this.
4118   d/ as a,b
4119
4120   Of the 48, 11 ran to completion leaving blocks from 286 to 899
4121
4122
4123   Looking at the loss of blocks when truncating.
4124    tracing shows a small number of files with remaining blocks at delete.
4125      sum is 26+22+14+272+11+2 == 347 cf df shows cb=457
4126    next attempt: 14+24+26*11 =324 cf cb=1124
4127    next attempt 26+6+15+68+29 == 144 cf cb=383
4128    26+18+14+19+284 = 361 cf 379
4129     files are (in order)
4130    49    bfile       - 30K
4131    325   nbfile-49   - 30K
4132    320   nbfile-44   - 30K
4133    296   nbfile-20   - 30K
4134       ??331??
4135
4136 11 June 2010
4137
4138  Thinking about truncate and index blocks becoming empty while
4139  they still have children.
4140  For leaf indexes, we need to leave the block in place in case
4141  the children get written.  We need to find a time to ultimately
4142  delete it...
4143  For internal indexes,.... uhm, it just works, OK??
4144
4145  When I drop an uninc block, I need to remove it from the
4146   uninc list, and from phase_leafs
4147   clearing dirty and refiling should remove from leafs.
4148
4149  When we recurse to a parent, we need to remove
4150  *this* block from the uninc list for said parent.
4151  It should be the only thing in the list.
4152  But even when we don't recurse, the fact that we have
4153  incorporated means that we should tidy up the ->uninc
4154  list.
4155
4156
4157
4158 12 June 2010
4159   unmount hung after lafs_run_orphans from lafs_put_super
4160   There are two orphans in Writeback which cannot progress
4161   until the current cluster is written...
4162   But they keep getting re-written!
4163   Another time, one orphan, index block is Dirty on a leaf ???
4164
4165 orph=[cfbdcf24]0/331(3780)r2E:Valid,SegRef,C,CI,CN,CNI,Claimed,PhysValid,Orphan(0) orphan_list(1) iblock(1)
4166 [cfb8e460]331/0(NoPhysAddr)r1F:Index(1),Pinned,Phase0,InoIdx,Valid,Dirty,SegRef,CN,CNI,UninCredit{0,0}[0] leaf(1) Leaf0(1)
4167 LAFS_cluster_flush 1
4168
4169
4170 orph=[ce5c9bb4]0/327(3317)r2E:Valid,SegRef,C,CI,CN,CNI,Claimed,PhysValid,Orphan(0) iblock(1) orphan_list(1)
4171 [cfbe3a40]327/0(NoPhysAddr)r1F:Index(1),Pinned,Phase0,InoIdx,Valid,Dirty,SegRef,CN,CNI,UninCredit{0,0}[0] leaf(1) Leaf0(0)
4172
4173  OK, problem is that when we truncate and remove an index block, the
4174  next index block expands backwards to fill the space.
4175  Then we apply prune_some, but don't check if anything was done.
4176  We always mark it dirty, so it has to be written and then
4177  we loop through again...
4178  So need to check if prune_some did anything.
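
      One way to picture the fix, as a sketch only: prune_some here is a
      toy that just counts entries, and struct idx_block and truncate_step
      are hypothetical.  The point is that Dirty is set only when
      something was actually pruned, so an unchanged block is not
      rewritten on every pass.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for an index block during truncate. */
struct idx_block {
	int entries;	/* addresses still recorded in the block */
	bool dirty;	/* needs to be written out */
};

/* Remove up to 'batch' entries and return how many were removed.
 * The real prune_some works on the on-disk index formats; this only
 * models the "did anything change?" result. */
static int prune_some(struct idx_block *b, int batch)
{
	int n;

	assert(batch > 0);
	n = b->entries < batch ? b->entries : batch;
	b->entries -= n;
	return n;
}

/* One truncate step: mark the block dirty only when pruning changed
 * it.  Returns true if progress was made, so the caller can stop
 * looping once nothing is left to do. */
static bool truncate_step(struct idx_block *b, int batch)
{
	int removed = prune_some(b, batch);

	if (removed > 0)
		b->dirty = true;
	return removed > 0;
}
```

      A caller would keep invoking truncate_step while it returns true
      and leave the block alone once it returns false.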
4179
4180 TODO:
4181  - prune_some needs to get more done at a time
4182  - let cleaner finish up before umount
4183  - use early segments first ??
4184  - look at write-clusters and check OK
4185  - check that df:cb= drops properly.
4186
4187 Bugs:
4188       1 BUG: spinlock lockup on CPU#0, sh/1168, c0441170  - SECONDARY BUG
4189       1 BUG: unable to handle kernel NULL pointer dereference at 000001b4
4190       3 BUG: unable to handle kernel paging request at 00100104
4191       5 BUG: unable to handle kernel paging request at 6b6b6bfb
4192       1 BUG: unable to handle kernel paging request at 7fffffff
4193       7 kernel BUG at /home/neilb/work/nfsbrick/fs/module/block.c:197!
4194       9 kernel BUG at /home/neilb/work/nfsbrick/fs/module/block.c:479!
4195       2 kernel BUG at /home/neilb/work/nfsbrick/fs/module/block.c:529!
4196       2 kernel BUG at /home/neilb/work/nfsbrick/fs/module/block.c:67!
4197       2 kernel BUG at /home/neilb/work/nfsbrick/fs/module/inode.c:828!
4198       4 kernel BUG at /home/neilb/work/nfsbrick/fs/module/inode.c:843!
4199       1 kernel BUG at /home/neilb/work/nfsbrick/fs/module/modify.c:1708!
4200       7 kernel BUG at /home/neilb/work/nfsbrick/fs/module/segments.c:1028!
4201       2 kernel BUG at /home/neilb/work/nfsbrick/fs/module/segments.c:332!
4202      30 kernel BUG at /home/neilb/work/nfsbrick/fs/module/super.c:655!
4203
4204 Quite a haul there!
4205
4206 super.c:655
4207     Pinned block in lafs_release:
4208          0/2 is Dirty with plenty of credits, so it is a child
4209          0/16 is Dirty/Realloc, or once Async
4210      Dirty, but not on a leaf list, not pinned
4211
4212 segments.c:332
4213     seg_deref with refcnt < 2 in lafs_seg_put_all
4214
4215 segments.c:1028
4216      No free segments - no real pattern.
4217
4218 modify.c:1708
4219      lafs_incorporate on non-dirty/realloc block
4220        328/0 Index(1).  1 in uninc_table - probably during truncate.
4221      Either we add uninc while not dirty
4222      Or we clear Dirty while uninc present
4223      or there is a race between the two.
4224
4225      Don't know:  add a bugon
4226      Bugon in get_flushable didn't fire.
4227
4228 inode.c:843
4229      children present in truncate after final incorp...
4230        328/0.  64 children, no uninc list.  Maybe we ran the orphans too early??
4231       or invalidate_page isn't removing the children.
4232       Might want print_tree here?- added that.
4233      Answer: all the children are in Realloc on Clean_leafs
4234        Maybe erase_page needs to disconnect from cleaner too??
4235
4236 inode.c:828
4237      Orphan handling - uninc but not dirty: is Realloc (sometimes)
4238      Maybe like  mod:1708
4239
4240 block.c:67 *
4241       delref 'primary' from modify.c:2063 in the q2 branch.
4242       nxt has PrimaryRef... Maybe  move earlier, but that shouldn't make a diff.
4243       ditto at modify.c:2035  nxt is primary as was I, so drop mine.
4244       Don't know - looks like sibling list got broken.
4245       Tidied up a bit and added a print-tree.
4246       v.interesting result.  Lots of consecutive index blocks all holding primary-ref
4247             on single primary - which is wrong.
4248       1/ When setting PrimaryRef, if next holds PrimaryRef, then must take reference
4249             on self, as are being inserted into chain
4250       2/ When splitting, new block must be addressed as first block which cannot
4251            fit, not first block which doesn't fit.  Else incorping in reverse order
4252            can make lots of tiny index blocks.
4253
4254 block.c:529 *
4255         erase with index depth > 1.
4256         0/328 in orphan handling.  Still have 8 or 15 blocks registered!
4257        Maybe caused by index block errors.  Added some printks.
4258
4259 block.c:479 *
4260         not enough credits to dirty block 2/0 in dir_delete_commit for unlink.
4261         74/xxxx in unlink
4262         16/1 in seg_inc/seg_move...allocated_block/cluster_flush
4263
4264         - writepage wrote the page??
4265         - checkpoint wrote it and didn't replenish the credits?
4266
4267 block.c:197 XX
4268         invalidated pages finds dirty block after EOF, after iolock_written
4269          0/0 Dirty/Realloc in unmount - all Realloc!
4270        Need to wait for cleaner etc to finish at unmount time.
4271
4272 NULL deref in 1b4  YY
4273     cleaner->cluster_flush->count_credits->lock??
4274     Trying to get a lock on an inode that has since been freed??
4275         spin_lock(&dblk(b)->my_inode->i_data.private_lock);
4276
4277
4278 001001 YY
4279      generic_drop_inode -- extra iput??  in lafs_inode_checkpin from refile
4280 6b6b6b YY
4281       invalidate_inode_buffers!! in kill.  use-after-free
4282
4283 7fffff
4284     seginsert from scan_seg
4285      MAX/number-elements confusion.  Worked around for now.
4286
4287
4288 18  June 2010
4289 After a couple of fixes:
4290       1 BUG: unable to handle kernel NULL pointer dereference at 000001b4
4291       1 BUG: unable to handle kernel paging request at 00100104
4292       5 BUG: unable to handle kernel paging request at 6b6b6bfb
4293       4 kernel BUG at /home/neilb/work/nfsbrick/fs/module/block.c:209!
4294       4 kernel BUG at /home/neilb/work/nfsbrick/fs/module/block.c:496!
4295       3 kernel BUG at /home/neilb/work/nfsbrick/fs/module/block.c:67!
4296       2 kernel BUG at /home/neilb/work/nfsbrick/fs/module/cluster.c:531!
4297      16 kernel BUG at /home/neilb/work/nfsbrick/fs/module/inode.c:601!
4298       1 kernel BUG at /home/neilb/work/nfsbrick/fs/module/inode.c:852!
4299                 Realloc blocks confusing truncate
4300       2 kernel BUG at /home/neilb/work/nfsbrick/fs/module/modify.c:118!
4301       1 kernel BUG at /home/neilb/work/nfsbrick/fs/module/modify.c:1699!
4302       7 kernel BUG at /home/neilb/work/nfsbrick/fs/module/segments.c:1033!
4303      19 kernel BUG at /home/neilb/work/nfsbrick/fs/module/super.c:655!
4304
4305
4306 TODO:
4307  - truncate gets confused by blocks being cleaned.
4308    Need to flush cleaner, or just remove the blocks.
4309  - when add PrimaryRef in middle of list, take the right ref.
4310  - fix up wait-for-cleaner at unmount time.
4311
4312 19 Jun 2010
4313
4314       3 BUG: unable to handle kernel paging request at 6b6b6bfb.
4315       5 kernel BUG at /home/neilb/work/nfsbrick/fs/module/block.c:209!
4316       5 kernel BUG at /home/neilb/work/nfsbrick/fs/module/index.c:1890!
4317      22 kernel BUG at /home/neilb/work/nfsbrick/fs/module/inode.c:601!
4318       2 kernel BUG at /home/neilb/work/nfsbrick/fs/module/inode.c:835!
4319       3 kernel BUG at /home/neilb/work/nfsbrick/fs/module/inode.c:852!
4320       9 kernel BUG at /home/neilb/work/nfsbrick/fs/module/segments.c:1033!
4321      17 kernel BUG at /home/neilb/work/nfsbrick/fs/module/super.c:656!
4322     251 SysRq : Resetting
4323       3 SysRq : Show State
4324
4325  - We can erase a dblock while it is in the uninc_pending or
4326    uninc_next - need to be careful
4327  - At umount, 0/2 is Dirty but not Pinned, so not written out
4328    ditto from 0/16
4329    16/0 sometimes is Async
4330       16/0 Async might be from the segment scan - so wait for that.
4331    Dirty but not pinned can happen when InoIdx is pinned.
4332
4333  - I think the uninc_next list (at least) should be sorted before
4334     being allocated.
4335
4336  - root block dirty/realloc/leaf in final iput
4337    Could be it was changed during last checkpoint so
4338    pushed in to next phase?  But why Realloc?
4339    Maybe still issue with losing inode data block.
4340
4341 20 June 2010 Happy Birthday Dad!!
4342
4343 420 runs.
4344       4 BUG: unable to handle kernel paging request at 6b6b6bfb.
4345      26 kernel BUG at /home/neilb/work/nfsbrick/fs/module/block.c:209!
4346      87 kernel BUG at /home/neilb/work/nfsbrick/fs/module/inode.c:601!
4347       1 kernel BUG at /home/neilb/work/nfsbrick/fs/module/inode.c:839!
4348       4 kernel BUG at /home/neilb/work/nfsbrick/fs/module/inode.c:856!
4349       1 kernel BUG at /home/neilb/work/nfsbrick/fs/module/modify.c:1719!
4350      12 kernel BUG at /home/neilb/work/nfsbrick/fs/module/segments.c:1033!
4351       2 kernel BUG at /home/neilb/work/nfsbrick/fs/module/super.c:656!
4352
4353  Problems:
4354   - inode in i_sb_list has been freed.
4355   - block 0/0 is dirty/realloc/leaf after final iput
4356   - not all blocks freed by truncate
4357   - Index block with uninc is not dirty - not FIXED: more iolock in phase_flip
4358   - still children when truncate should have finished.
4359     all are Realloc
4360         Maybe inode has become unhashed and we re-load it??
4361         it is invalid after all!!
4362   - Index block not dirty when incorp - has uninc. ??
4363   - didn't wait for free segments
4364   - Data 16/0 is dirty but not pinned after final checkpoint - FIXED
4365
4366
4367 watch -d 'awk -f checkseg /tmp/log; echo ====== ; grep -h -E "(blocked for more|BUG|SysRq : )" /var/tmp/lafs-logs/log* | sort | uniq -c ; echo === ; ls /var/tmp/lafs-logs/log* | wc -l'
4368 watch -d 'echo ====== ; grep -h -E "(blocked for more|BUG|Busy inodes after|SysRq : )" /var/tmp/lafs-logs/log* | sort | uniq -c ; echo === ; ls /var/tmp/lafs-logs/log* | wc -l'
4369
4370
4371  Unclear on dirtying index blocks.
4372    We normally mark it dirty first, then add the address to the uninc list.
4373    Note that this is the reverse of data blocks which are changed first, then
4374    dirtied.  So maybe we should mark dirty afterwards.  We then need to
4375    avoid incorporation while we are adding addresses else we might find it
4376    has addresses but is not dirty.  Only try if dirty?
4377    Maybe we should iolock the parent.  We need to do that anyway to flush
4378    incorporations when the table is full.   Yes, that fits the VM model
4379    better.  Always lock while updating and preparing to write.  Set
4380    writeback once write has started, then unlock.  Cool.
4381    Only a block is iolocked when we allocate (to 0), so we cannot lock the parent..
4382
4383 21June2010
4384   Apart from tracking down the remaining bugs, I need to:
4385   1/ Decide on locking for incorporation and attaching new address to a block
4386     and implement it.
4387     In particular we need to not lose the Dirty flag before the update is done.
4388   2/ Resolve handling of pinned inode data/index blocks
4389   3/ Correct handling of empty index blocks, particularly when parent is in
4390     different phase.  Make lookup be more careful?
4391   4/ Wait for there to be enough free segments before allowing allocation.
4392
4393   2:  Problem is that we cannot handle a pinned inode-data block while the
4394      InoIdx block is pinned in the same phase.
4395      We currently unpin it so it drops off the leaf list.  But then we
4396      need to re-pin it when the InoIdx is unpinned or phase-flipped, and that
4397      gets ugly.  Possible though.
4398      An alternate is to treat it like a parent and keep it off the list
4399      while the InoIdx is pinned/same-phase.  So we would need to
4400      re-assess it after unpinning or flipping the InoIdx.  That is probably
4401      a lot easier than re-pinning it.
4402
4403   1: We would normally set 'dirty' after changing the block.  But we need
4404      to differentiate Dirty from Realloc, so we set before adding addresses.
4405      This requires that we are careful not to write an index block while there
4406      are pending changes.  The fact that pinned children stop any writing,
4407      as do pending addresses in a list should ensure this.
4408
4409   3: When an index block becomes empty we need to make sure that
4410      future lookup doesn't get confused by it.  Specifically future
4411      index lookup must avoid the block so nothing new gets added.
4412      Possibly a previous block will split again, but this block must remain
4413      unused.
4414      However we cannot update the parent block immediately as it might
4415      be in a different phase.
4416      So we must record both "don't touch this" and "where to look instead"
4417      elsewhere - in children.
4418      If the block being deleted is *not* the first child in the parent,
4419      then we direct index lookup to the earlier block.
4420      If the block being deleted *is* the first child in the parent,
4421      then redirect to the second child if there is one and we weren't just there.
4422      If there is no other block we flag the parent as empty and retry
4423      from the top.
4424      We flag a parent as empty with B_EmptyIndex.
4425
4426      What locks do we need to walk around the sibling list?
4427      the inode private_lock is minimal, but we cannot hold that to take a
4428      iolock - just to get a reference.
4429      I guess we
4430         - iolock the parent
4431         - try to find a good block using private_lock
4432         - get a ref and wait for it.
4433         - check if it is still a good block.  If not, start again
4434
4435      If we find an EmptyIndex block, it must be directly addressed by parent.
4436      It will never be followed by a PrimaryRef block because if there were
4437      such a block, we would have readdressed it back and hidden the EmptyIndex.
4438      So we need to look around for an address in the parent that leads to
4439      a non-EmptyIndex block.
4440
4441      If all children are empty, we need to make the parent empty.  But
4442      what if it is InoIdx?
4443      Maybe I am making this too hard.  I could just use i_alloc_sem to
4444      block lookups while truncate is happening.  That doesn't address
4445      single block removal e.g. from directories.
4446      So I need to be able to wait for incorporation to happen on an
4447      empty index block.  We hold iolock on the parent.  If there are blocks
4448      on ->uninc, we just process them immediately.  If there are blocks on
4449      ->uninc_next, we wait for the checkpoint to complete
4450
4451      What does lafs_incorporate actually do with EmptyIndex blocks?
4452      Provided they match currently incorporated addresses, they just cause
4453      those addresses to disappear.
4454
4455      If a block is in the uninc list for its parent, then is phase_flipped
4456      and changed and written out it could get a new physaddr before
4457      it is incorporated.
4458      I guess we never allocate a B_Uninc block which is in a different phase
4459      to the parent.  Currently we wouldn't do that anyway except in truncate
4460      though memory pressure on index blocks might one day??
4461      Truncate?  We cannot allocate directly in lafs_incorporate.
4462      We should get lafs_cluster_allocate to notice and DTRT.
4463
4464      Only hash index blocks when they are incorporated.  Not needed before then.
4465      When processing an uninc list, if an address appears twice, prefer the one
4466      that isn't EmptyIndex...
4467
4468 22June2010
4469     I need a clear picture of the "Steady state" for an internal index block
4470     with its children.
4471     The internal index block contains 1 or more addresses.  For each address there
4472     may be a child index block.  If there is, it may be the head of a list of
4473     blocks with B_PrimaryRef set thus holding the whole list in place until
4474     incorporation happens.
4475     Each of these children can be on either ->uninc_list or ->uninc_next,
4476     or possibly neither if they haven't been queued for writing yet.  Any
4477     PrimaryRef block will be Pinned.
4478
4479     When a child is incorporated and found to be Empty it is flagged as such
4480     and then must never be returned by index lookup.  Index lookup will either
4481     add a block to a leaf index so it doesn't appear empty, or will hit an EmptyIndex
4482     block and so have to start again from the top.
4483
4484     When a PrimaryRef block becomes empty it is simply removed from the
4485     PrimaryRef chain so it cannot be found.  The space now belongs to the
4486     previous block.
4487     When a non-PrimaryRef block which isn't the first becomes empty it is
4488     flagged and left in place so that following blocks can be found.  The
4489     address space now belongs to the previous block.
4490     When the first child (fileaddr matches parent) becomes empty - what?
4491       We could re-address first child but that forces early address change -
4492           old might not be incorp yet
4493       We could re-address the parent, but that doesn't work for InoIdx
4494       We could leave it there with physaddr == 0
4495
4496     Last sounds promising.  So we never re-address an index block.
4497
4498    So: From the top.
4499
4500     Index blocks, Indirect blocks, extent blocks each have an address
4501     that never changes.
4502     When a block becomes over-full it splits - a new block appears with
4503     a new address thus implicitly limiting the address space covered
4504     by the original.
4505
4506     When an index block becomes empty and has no pinned children it is
4507     marked as EmptyIndex (under IOLock).
4508     When an EmptyIndex is allocated it goes to phys==0
4509     An EmptyIndex which is not first (->fileaddr != ->parent->fileaddr)
4510     is never used again.  Its address space is ceded to the previous
4511     index block - which could split several times...
4512     An EmptyIndex which is first can be re-used.  Once it gets pinned
4513     children the EmptyIndex is cleared.
4514
4515     An Index block always has an entry for the first address.  It might
4516     be implicit to phys==0.  Loading such a block creates an empty
4517     block.
4518
4519     InoIdx doesn't get EmptyIndex, rather it gets ->depth=1
4520
4521     Indirect *doesn't* store the first address any more.
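
      The rules above, restated as a small user-space sketch.  struct iblk
      and the helper names are hypothetical and much simpler than the
      real block structures.  The InoIdx case is modelled only as "no
      parent, never flagged"; its ->depth handling is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified index block: only what these rules need. */
struct iblk {
	uint32_t fileaddr;	/* address of the block; never changes */
	struct iblk *parent;	/* NULL for the InoIdx block in this model */
	int nentries;		/* addresses recorded in the block */
	int pinned_children;
	bool empty_index;	/* models the EmptyIndex flag */
};

/* "First" means the block shares its parent's file address. */
static bool is_first_child(const struct iblk *b)
{
	return b->parent && b->fileaddr == b->parent->fileaddr;
}

/* Called (under the iolock in the real code) after incorporation:
 * an empty block with no pinned children becomes EmptyIndex. */
static void note_maybe_empty(struct iblk *b)
{
	assert(b->nentries >= 0 && b->pinned_children >= 0);
	if (b->parent && b->nentries == 0 && b->pinned_children == 0)
		b->empty_index = true;
}

/* May index lookup return this block?  A non-first EmptyIndex block
 * has ceded its address space to the previous block for good. */
static bool lookup_may_use(const struct iblk *b)
{
	return !b->empty_index || is_first_child(b);
}

/* Pinning a child under a block clears EmptyIndex again. */
static void pin_child(struct iblk *b)
{
	b->pinned_children++;
	b->empty_index = false;
}
```

      A lookup that lands on a block where lookup_may_use is false would
      have to start again from the top, as described earlier.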
4522
4523     Changes:
4524 DONE     - remove forcestart from layoutinfo
4525 DONE     - remove start-address from Indirect blocks
4526 DONE     - only hash index blocks when they are known to be incorporated.
4527 DONE     - when incorporating an uninc list, ignore phys==0 if also a block with
4528        same fileaddr and phys!=0.  so sort phys==0 first
4529 DONE     - Create EmptyIndex flag
4530 DONE     - Clear the flag when adding child pin to index block
4531 DONE     - avoid EmptyIndex non-start blocks during index lookup
4532 DONE     - allow index blocks to be loaded with ->phys==0
4533 DONE     - allow EmptyIndex index block to be "written" to phys 0
4534 DONE     - ensure index lookup finds implicit start address, possibly 0
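
      A sketch of the "sort phys==0 first" rule from the list above,
      assuming the uninc list has been flattened into an array.  struct
      uninc_ent and uninc_prepare are made-up names; the real list is a
      chain of blocks, not an array.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical flattened form of one unincorporated address. */
struct uninc_ent {
	uint32_t fileaddr;
	uint64_t phys;	/* 0 means "nothing here", e.g. an EmptyIndex */
};

/* Order by fileaddr; for equal fileaddr put phys == 0 first. */
static int uninc_cmp(const void *a, const void *b)
{
	const struct uninc_ent *x = a, *y = b;

	if (x->fileaddr != y->fileaddr)
		return x->fileaddr < y->fileaddr ? -1 : 1;
	return (x->phys != 0) - (y->phys != 0);
}

/* Sort, then drop any phys == 0 entry that is immediately followed by
 * an entry for the same fileaddr with a real address.
 * Returns the new length of the array. */
static size_t uninc_prepare(struct uninc_ent *e, size_t n)
{
	size_t in, out = 0;

	assert(e != NULL || n == 0);
	qsort(e, n, sizeof(*e), uninc_cmp);
	for (in = 0; in < n; in++) {
		if (e[in].phys == 0 && in + 1 < n &&
		    e[in + 1].fileaddr == e[in].fileaddr &&
		    e[in + 1].phys != 0)
			continue;	/* superseded by the real address */
		e[out++] = e[in];
	}
	return out;
}
```

      Sorting phys==0 first means the check only ever has to look at the
      next entry, which keeps the pass over the list linear.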
4535
4536 So now after 36 runs
4537       3 kernel BUG at /home/neilb/work/nfsbrick/fs/module/index.c:1939!
4538       1 kernel BUG at /home/neilb/work/nfsbrick/fs/module/index.c:403!
4539      10 kernel BUG at /home/neilb/work/nfsbrick/fs/module/inode.c:605!
4540      14 kernel BUG at /home/neilb/work/nfsbrick/fs/module/segments.c:1034!
4541       4 kernel BUG at /home/neilb/work/nfsbrick/fs/module/segments.c:624!
4542       1 kernel BUG at /home/neilb/work/nfsbrick/fs/module/super.c:657!
4543       3 SysRq : Resetting
4544
4545
4546 index.c:1939
4547    block 0/2 is Realloc and being allocated from cluster_flush while
4548    parent is not Realloc or dirty
4549    That is bad as Realloc gets set in lafs_allocated_block ... except
4550     that the code was bad.  FIXED.
4551
4552 index.c:403
4553   cleaner is pinning a block (299/25) which is not Realloc,
4554     and phase isn't locked.  We are only meant to pin data blocks
4555     for updates while holding a phase lock.
4556     Ahhh - bad code again. FIXED
4557
4558 inode.c:605
4559    Truncate doesn't clean up properly.
4560     327 has 60+1
4561     331 has 108+1
4562     327 has 34+1
4563     327 has 60+1
4564    No sign of any children.
4565
4566    Very weird.  Sign of incorporation going wrong.
4567      Added more debugging.
4568
4569 Found 4084 4 12 at 890
4570 Added 4084 4 12
4571 Found 4089 4 16 at 878
4572 Added 4089 4 16
4573 Found 4094 2 20 at 866
4574 Added 4094 2 20
4575 Found 2561 2 22 at 854
4576 Added 514 2 22
4577 Found 2564 4 24 at 842
4578 Found 2569 2 28 at 830
4579 Found 0 0 0 at 818
4580
4581 Why are 2564 etc lost?  No sign of alloc-to-0
4582
4583 segments.c:1034
4584    no free segments - need to wait somewhere.
4585
4586 segments.c:624
4587    allocated_blocks has gone over free_blocks!
4588    in lafs_prealloc/reserve_block/free_get/ss_put/new_segment.../checkpoint.
4589    Wanted CleanSpace to reserve the youthblk
4590    Maybe related to not waiting - ignore for now.
4591
4592 super.c:657
4593   block 0/2 was dirty but not pinned.  Should not happen to inodes.
4594   block 0/0 was Pinned because it had a child - as above.
4595
4596   Maybe we don't carry the pin across when we collapse dir
4597   into inode??... looks quite likely
4598
4599
4600 23 June 2010
4601
4602 116 runs.
4603       1 BUG: unable to handle kernel paging request at 6b6b6bfb
4604       1 kernel BUG at /home/neilb/work/nfsbrick/fs/module/block.c:497!
4605       3 kernel BUG at /home/neilb/work/nfsbrick/fs/module/dir.c:710!
4606       7 kernel BUG at /home/neilb/work/nfsbrick/fs/module/inode.c:606!
4607      61 kernel BUG at /home/neilb/work/nfsbrick/fs/module/segments.c:1034!
4608       1 kernel BUG at /home/neilb/work/nfsbrick/fs/module/super.c:657!
4609      42 SysRq : Resetting
4610
4611
4612 6b6b6bfb:
4613   invalidate_inode_buffers called at shutdown.
4614   Still weird
4615
4616 block.c:497  FIXED??
4617   block 16/1 is not dirty with no credits.
4618   Maybe writepage got to it?
4619
4620 dir.c:710
4621   ouch! dir lookup failed in unlink.
4622    No real hints.  Must be hash based - some off-by-one probably.
4623    Need to stare at the code.
4624
4625 inode.c:606  FIXED
4626   Blocks still present after truncate.
4627   typically about 60, but in 1 case '4'.  No index blocks.
4628   So probably content of second index block.
4629   Yes, lafs_leaf_next was doing the wrong thing for addresses
4630    before start of block.
4631
4632 segments.c:1034
4633   same old
4634
4635 super.c:657  FIXED
4636   dir inode 0/2 is still Dirty but not pinned.
4637   Maybe lafs_dirty_inode should be pinning the block
4638
4639   But now this triggers for 16/X still dirty.
4640
4641
4642 How and when to write blocks in a SegmentMap file?
4643  - We don't want normal write-back to write them unless they have
4644    no references
4645  - We need to write them in tail of checkpoint, and index info must
4646    follow in the next checkpoint.
4647
4648 lafs_space_alloc is called from
4649   - mark_cleaning:  always CleanSpace, failure is OK
4650   - lafs_cluster_update_pin: ReleaseSpace.  -EAGAIN is OK (CHECK THIS) but failure
4651               is not - or shouldn't be.
4652   - lafs_allocated_block: CleanSpace, checking if parent of Realloc block
4653         can be saved separately from any Dirty version.  Failure OK, blocking not.
4654   - lafs_prealloc - general space allocation.
4655   -
4656 lafs_cluster_update_pin is called from:
4657   - lafs_create, lafs_link, lafs_unlink, lafs_rmdir, lafs_symlink, lafs_mkdir
4658     lafs_mknod, lafs_rename,
4659   - lafs_write_inode
4660      So best to return -EAGAIN, and it should be handled adequately.
4661
4662 lafs_prealloc is called from:
4663   - lafs_reserve_block, after modifying the alloc_type extensively.
4664   - lafs_phase_flip to re-fill the 'next' credits.  If they aren't available
4665       we simply pin all children so they aren't needed.
4666       So failure is OK
4667   - lafs_seg_ref_block: getting CleanSpace to save segusage blocks.
4668        If this fails .. what?? lafs_reserve_block fails. so...
4669
4670 lafs_reserve_block is called from
4671   - mark_cleaning - CleanSpace
4672   - lafs_pin_dblock - type is passed in...
4673   - lafs_prepare_write - on failure write will fail or retry after checkpoint
4674   - lafs_inode_handle_orphan - to help with delete. On failure we allow
4675          cleaning to happen
4676   - lafs_seg_move - should be elsewhere.  Failure BAD !
4677   - lafs_free_get - as above, failure BAD
4678   - clean_free - update youth for new clean blocks - Failure BAD
4679
4680 lafs_pin_dblock is called from
4681   - dir_create_pin - fail or again handled
4682   - dir_delete_pin
4683   - dir_update_pin
4684   - lafs_create etc
4685   - lafs_dir_handle_orphan
4686   - choose_free_inum
4687   - inode_map_new_pin
4688   - lafs_new_inode
4689     ...
4690   - lafs_orphan_release !! cannot handle failure
4691   - roll_block should use AccountSpace
4692
4693 So:  It seems we need a new allocation class that will never fail.
4694   Maybe it is allowed to BUG though?
4695    AccountSpace - i.e. space needed to account for the use of space.
4696      Must never ever fail.
4697
4698 Then we must ask where blocking should happen on -EAGAIN.
4699   dir.c does "lafs_checkpoint_unlock_wait", then tries again.
4700   prepare_write does too.
4701
4702 For that to work we must start a checkpoint on returned EAGAIN.... Don't
4703 we want to wait for some cleaning to happen first though?  Maybe an extra
4704 flag, and a count of the number of empty (but not clean) blocks.
4705
4706 - Should I skip orphan handling when tight on space?  Probably not.  It will
4707   just keep failing while we keep cleaning...
4708 - roll_block should use account_space .. or not
4709
4710 - lafs_space_alloc simply allocates space, or fails.  'why' is used to
4711    guide watermark choice.
4712 - lafs_prealloc allocates space to a block and all its parents based on
4713   'why' for watermarks.  It either succeeds or fails.
4714
4715 - lafs_cluster_update_pin and lafs_reserve_block decide whether to respond
4716   to failure as -ENOSPC or -EAGAIN based on 'why'.
4717
4718 - lafs_pin_dblock simply passes on the failure, which must be handled.
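
A minimal sketch of the failure policy in the list above, as plain userspace C.
The class names NewSpace, CleanSpace and AccountSpace come from these notes; the
helper name space_failure_errno and the particular mapping are assumptions for
illustration only, since the notes leave the NewSpace case undecided.

```c
#include <assert.h>
#include <errno.h>

/* Allocation classes named in these notes. */
enum alloc_why { NewSpace, CleanSpace, AccountSpace };

/*
 * Hypothetical helper (not from the LaFS source): choose the error a
 * caller such as lafs_reserve_block would return when prealloc fails.
 * The mapping is an assumption:
 *   NewSpace     -> -ENOSPC
 *   CleanSpace   -> -EAGAIN (retry after cleaning and a checkpoint)
 *   AccountSpace -> must never fail, so reaching it is a logic error.
 */
static int space_failure_errno(enum alloc_why why)
{
	switch (why) {
	case NewSpace:
		return -ENOSPC;
	case CleanSpace:
		return -EAGAIN;
	case AccountSpace:
		assert(!"AccountSpace allocation failed");
		break;
	}
	return -EINVAL;
}
```

lafs_pin_dblock would then pass the returned value straight to its caller.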
4719
4720 So: What to do when we return -EAGAIN?
4721  We need to wait until there are *enough* clean segments, then cause a checkpoint
4722  so they become free.
4723  So a flag that says 'waiting for free space' and a count of segments
4724  required.
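
The flag and count could look something like this.  It is only a sketch: the
struct and function names are hypothetical, and it models the decision "are
there enough clean segments to make a checkpoint worthwhile", not the real
checkpoint machinery.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical bookkeeping; none of these names come from the LaFS source. */
struct space_wait {
	bool waiting_for_space;   /* the 'waiting for free space' flag */
	unsigned int segs_needed; /* clean segments required before a retry */
};

/* Record that a request got -EAGAIN and needs 'segs' clean segments. */
static void space_wait_begin(struct space_wait *w, unsigned int segs)
{
	w->waiting_for_space = true;
	if (segs > w->segs_needed)
		w->segs_needed = segs;
}

/*
 * Called when the cleaner has produced 'clean_segs' clean segments.
 * Returns true when a checkpoint should be started so that the clean
 * segments become free; waiters can then retry.
 */
static bool space_wait_should_checkpoint(struct space_wait *w,
					 unsigned int clean_segs)
{
	if (!w->waiting_for_space || clean_segs < w->segs_needed)
		return false;
	w->waiting_for_space = false;
	w->segs_needed = 0;
	return true;
}
```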
4725
4726  But how do we differentiate ENOSPC and EAGAIN for NewSpace requests?
4727  Maybe we don't ??  Or do it later.
4728
4729 Still to do:
4730 - Audit all AccountSpace and justify them
4731  + lafs_seg_move is probably wrong.  Should have allocated when the
4732    free segment was allocated
4733 - lafs_orphan_release calls lafs_pin_dblock but cannot handle failure
4734 - Need to wait not just for "enough space" but for "enough clean segments".
4735
4736 - how is 'free_blocks' set - what does this tell us??
4737
4738    free_blocks is the sum of known-clean segments.
4739    We probably want:
4740          clean segments
4741          remainder for each active segment
4742    then reserve some segments for cleaning.
4743    And separate 'allocated_block' for each ?
4744
4745 Notes:
4746  segments.c:647 fired: AccountSpace had no space available.
4747    Reserving space to write the segusage of youth block for a newly
4748    allocated segment.
4749  super.c:657 STILL
4750     0/2 is Dirty but not Pinned  Maybe we need PinPending
4751  soft lockup
4752     in the cleaner!
4753     Maybe I need cond_resched??
4754
4755 Maybe I want two separate 'free_blocks' counters.
4756  One that includes all free blocks for use in 'df' etc.
4757  One that only includes completely free segments for use in allocation...
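
A rough sketch of the two counters, assuming a simple per-segment summary with
a size and a used count.  The struct and field names are made up for the
example and do not reflect the real segment tracking code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-segment summary; field names are illustrative only. */
struct seg_info {
	unsigned int size; /* blocks in the segment */
	unsigned int used; /* blocks currently in use */
};

struct free_counts {
	unsigned long df_free;    /* every free block, for 'df' etc. */
	unsigned long alloc_free; /* blocks in completely free segments only */
};

static struct free_counts count_free(const struct seg_info *segs, size_t n)
{
	struct free_counts fc = { 0, 0 };
	size_t i;

	for (i = 0; i < n; i++) {
		fc.df_free += segs[i].size - segs[i].used;
		if (segs[i].used == 0)
			fc.alloc_free += segs[i].size;
	}
	return fc;
}
```

With one empty segment, one nearly full and one full, 'df' sees the stray
free blocks in the nearly full segment but allocation does not.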
4758
4759
4760 24 June 2010
4761
4762  Something is wrong with cleaning and segment tracking
4763  We have 5 free segments and we get them all without writing
4764  anything!  We consume them all with cluster_flush!
4765  It seems that the root inode is not changing phase!
4766  Nothing is on the phase leafs.
4767  Most children are in Writeback on cluster. and are Realloc
4768  Others have pinned children.
4769  They are all in 'cluster', but 'flush' doesn't flush them,
4770  so they must be in a different cluster???  Is the cleaner still
4771  cleaning?  Yes, they are on the cleaner 'wc' list so they are
4772  queued but not flushed for the cleaner.
4773
4774 25 June 2010
4775  At last it looks like I nearly have a working FS. Out of 361 test
4776  runs, 9 triggered BUGS and one hung at umount.
4777
4778  I need a new TODO list, starting with 6 jul 2007(!) and adding any
4779  FIXMEs etc.
4780
4781 DONE 0/ start TODO list
4782 DONE 1/ document new bugs
4783 DONE 2/ Tidy up all recent changes as individual commits.
4784 DONE 3/ clean up the various 'scratch' patches discarding any tracing that
4785     I don't think I need, and making the rest 'dprintk' etc.
4786 DONE 4/ check in this README file
4787 DONE 5/ Write rest of the TODO list
4788
4789 DONE 5a/ index.c:1982. Data block with Phys and no UnincCredit
4790     It is Dirty but only has *N credits.
4791     16/1 ...
4792
4793 DONE 5b/ phase_flip/pin_all_children/lafs_refile finds refcnt == 0;
4794    I guess we should getref/putref.
4795
4796 DONE 5c/ dirty_inode might find InoIdx is allocated but datablock not
4797     and doesn't cope well.
4798
4799 DONE 5d/ At unmount, 16/1 is still pinned.
4800
4801  6/ soft lockup in unlink call.
4802     EIP is at lafs_hash_name+0xa5/0x10f [lafs]
4803  [<d0a56283>] hash_piece+0x18/0x65 [lafs]
4804  [<d0a564c3>] lafs_dir_del_ent+0x4e/0x404 [lafs]
4805  [<d0a56256>] ? lafs_hash_name+0xfa/0x10f [lafs]
4806  [<d0a4b35c>] dir_delete_commit+0xdb/0x187 [lafs]
4807  [<d0a4be3f>] lafs_unlink+0x144/0x1f4 [lafs]
4808  [<c02602c1>] vfs_unlink+0x4e/0x92
4809
4810   Don't know. Looks like cleaning up a chain in dir_delete_commit.
4811   Added a BUG_ON.
4812
4813   Would we be spinning on -EAGAIN ?? 4 empty segments are present.
4814
4815  6a/ index.c:1947 - lafs_add_block_address of index block where parent
4816           has depth of 1.
4817 looping on [cfbd4690]327/336(0)r3F:Index(1),Pinned,Phase0,Valid,SegRef,CI,CN,CNI,UninCredit,PhysValid,PrimaryRef,EmptyIndex,Uninc{0,0}[0] uninc(1) inode_handle_orphan2(1) leaf(1)
4818 /home/neilb/work/nfsbrick/fs/module/index.c:1947: [cfbd5c70]327/0(0)r2F:Index(1),Pinned,Phase0,Valid,Dirty,Writeback,SegRef,CI,CN,CNI,UninCredit,PhysValid,EmptyIndex,Uninc{0,0}[0] inode_handle_orphan2(1) leaf(1)
4819
4820  6b/  check_seg_cnt seems to be spinning on the 3rd section
4821     the clean list has no end!
4822     we were in seg scan
4823 CLEANABLE: 0/0 y=0 u=0 cpy=32773
4824 CLEANABLE: 0/1 y=0 u=0 cpy=32773
4825 CLEANABLE: 0/2 y=0 u=0 cpy=32773
4826 CLEANABLE: 0/3 y=32773 u=6 cpy=32773
4827 CLEANABLE: 0/4 y=32772 u=124 cpy=32773
4828 CLEANABLE: 0/5 y=32771 u=273 cpy=32773
4829 CLEANABLE: 0/6 y=32770 u=0 cpy=32773
4830
4831 of
4832 0 0
4833 1
4834 2
4835 3 6
4836 4 124
4837 5 273
4838 6 0
4839 7 496
4840 8 0
4841
4842
4843  6c/ at shut down, some simple orphans remain
4844     missing wakeup ???
4845
4846 DONE 7/ block.c:624 in lafs_dirty_iblock - no pin, no credits
4847    truncate -> lafs_invalidate_page -> lafs_erase_dblock -> lafs_allocated_block / lafs_dirty_iblock
4848 Allocated [ce44f240]327/144(1499)r2E:Writeback,PhysValid clean2(1) cleaning(1) -> 0
4849 SEGMOVE 1499 0
4850 Oh dear: [ce44f240]327/144(0)r2E:Writeback,PhysValid clean2(1) cleaning(1)
4851 .......: [cfb69180]327/0(349)r2F:Index(1),Pinned,Phase0,Valid,PhysValid{0,0}[0] child(1) leaf(1)
4852 Why have I no credits?
4853 /home/neilb/work/nfsbrick/fs/module/block.c:624: [cfb69180]327/0(349)r2F:Index(1),Pinned,Phase0,Valid,Dirty,PhysValid{0,0}[0] child(1) leaf(1)
4854
4855    Cleaning is racing with truncate, and that cannot happen!!
4856    Actually it could - if i_size changed at the wrong time.
4857
4858 DONE 7a/ block.c:507 in lafs_dirty_dblock - no credits for 0/2
4859    block.c:507: [cfa63c58]0/2(4348)r2F:Valid,Dirty,Writeback,PhysValid cluster(1) iblock(1)
4860    in touch_atime.  I think I know this one.
4861
4862  7b/ soft lockup in cleaner between 0x5e6, then 0x799-7f6 then 0x990 of 0x1502
4863                i.e. 1510, 1945-2038, 2448 of 5378
4864     Appears to be looping in first loop of try_clean, maybe
4865      group_size_words == 0 ??
4866     Add BUG_ON and wait.
4867
4868 DONE 7c/ NULL pointer deref - 000001b4
4869      Could be cluster_flush finds inode dblock without inode.
4870      Have a BUG_ON of this now.
4871
4872 DONE 7d/ paging request at 6b6b6bfb.
4873     invalidate_inode_buffers called, so inode_has_buffers,
4874     so private_list is not empty.  So presumably use-after-free.
4875     But is on s_inodes list.
4876      Probably cleaner is still active (if this is first call to
4877      invalidate_inodes in generic_shutdown_super) so list gets broken.
4878      We need locking or earlier flush.
4879
4880 DONE 7e/ Remove BUG block.c:273 as cleaner can cause this.
4881      Check for Realloc too.
4882
4883 PRESUME-FIXED 7f/ index.c:2024 no uninc credit
4884         [ce532338]0/306(2996)r1F:Pinned,Phase0,Valid,Dirty,Writeback,SegRef,Claimed,PhysValid cluster(1)
4885       found during checkpoint.  Maybe inode credit problem.
4886
4887 PRESUME-FIXED 7g/  inode.c:831 InoIdx 283/0 is Realloc, not dirty, and has
4888       ->uninc blocks.  This is during truncate.  Need some
4889       interlock with cleaner maybe?
4890       Probably the same race between cleaner and truncate.
4891
4892 DONE 7h/ inode.c:845 truncate finds children - Realloc on clean-leafs
4893
4894 NOLONGERRELEVANT 7j/ resolve space allocation issues.
4895     Understand why CleanSpace can be tried and failed 1000
4896     times before there is any change.
4897
4898 DONE  7k/ use B_Async for all async waits, don't depend on B_Orphan to do
4899      a wakeup.
4900      write lafs_iolock_written_async.
4901
4902 DONE 7l/ make sure i_blocks is correct.
4903           set on 'import_inode'
4904           decreased when lafs_summary_update assigned block to '0'
4905           changed when lafs_summary_allocate changes e.g. quota.
4906
4907       lafs_summary_update is called when a block is assigned to a location,
4908         or to zero.  It is real usage.
4909       lafs_summary_allocate is called when we set Prealloc on phys==0 or
4910          clear Prealloc on phys==0
4911       So allocate must be followed exactly.
4912        update is already counted for setting !=0, so only dec on ==0.
4913       So all is good.
4914      What about quota? - hidden in quota_allocate / qcommit
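
The i_blocks reasoning above can be modelled in a few lines.  This is a sketch
under assumptions: the two functions only imitate lafs_summary_allocate and
lafs_summary_update, their signatures are invented for the example, and quota
is ignored.

```c
#include <assert.h>

/* Hypothetical, simplified inode; only the block count is modelled. */
struct ino_summary {
	long blocks; /* stands in for i_blocks, counted in filesystem blocks */
};

/*
 * Models lafs_summary_allocate: called when Prealloc is set (+1) or
 * cleared (-1) on a block with phys == 0.  The count follows it exactly.
 */
static void summary_allocate(struct ino_summary *s, int delta)
{
	s->blocks += delta;
}

/*
 * Models lafs_summary_update: a block moves from 'oldphys' to 'newphys'.
 * Assigning a non-zero address was already counted at allocate time,
 * so only assignment to zero changes the count.
 */
static void summary_update(struct ino_summary *s,
			   unsigned long oldphys, unsigned long newphys)
{
	if (oldphys != 0 && newphys == 0)
		s->blocks--;
}
```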
4915
4916 7m/ delete inode could not progress through inode_map_free, so
4917    ORPH: [ce545f08]0/290(1663)r4E:Valid,Dirty,Async,UninCredit,Claimed,PhysValid,Orphan(0) async(1) delete_inode(1) orphan_list(1) drop(1)
4918    was permanently an orphan.
4919
4920 DONE 8/ looping in do_checkpoint
4921    root is still in Phase1 because 0/2 is in Phase 1
4922   [cfa57c58]0/2(2078)r1E:Pinned,Phase1,WPhase0,Valid,Dirty,C,CI,CN,CNI,UninCredit,IOLock,PhysValid</file.c:269> writepageflush(1)
4923    Seems to be waiting for writeback, but writeback is clear.
4924      Need to call lafs_io_wake in lafs_iocheck_writeback for when
4925      it is called by lafs_writepage
4926
4927 DONE 9/ cluster.c:478
4928     flush_data_to_inode finds Realloc (not dirty) block
4929     and InoIdx block is not Valid.
4930   [cfb5ef50]2/0(3)r1F:Index(0),Pinned,Phase1,InoIdx,SegRef,C,CI,CN,CNI,IOLock,OnFree,PhysValid{0,1}[0]</cluster.c:435> child(1)
4931   I wonder if it was PinPending, or where it was IOLocked (or if).
4932
4933    I guess we truncated, then added data, then tried to clean.
4934    Probably just a bad 'bug' given recent changes.
4935    No, I think it is the race between truncate and clean which is now fixed.
4936
4937 SEEMS TO BE GONE 10/ inode.c:606
4938     Deleting inode 328: 2+0+0 1+0
4939
4940     2 level index.
4941     first index at level 1 was full and pruned properly.
4942     Nothing else found empty.
4943     Somehow the second index block and contents were lost.
4944
4945 ASSUME_DONE 11/ super.c:657
4946     Root still pinned at unmount.
4947      0/2 is Dirty:  [cfa53c58]0/2(1750)r0E:Valid,Dirty,CN,CNI,UninCredit,PhysValid
4948                     [cfa5fc58]0/2(2852)r0E:Valid,Dirty,SegRef,CN,CNI,UninCredit,PhysValid
4949                     [cfa53c58]0/2(3570)r0E:Valid,Dirty,CN,CNI,UninCredit,PhysValid
4950                     [cfa53828]0/2(2969)r0E:Valid,Dirty,CN,CNI,UninCredit,PhysValid
4951                     [cfa75c58]0/2(579)r0E:Valid,Dirty,UninCredit,PhysValid
4952     maybe dir-orphan handling stuffed up
4953     Or maybe it is the I_Dirty issue.  Assume fixed.
4954
4955
4956 ASSUME_DONE 12/ timeout/showstate in unmount
4957     umount is in sync_inodes / do_writepages / lafs_writepage / lafs_iolock_written
4958     That looks similar to 8
4959
4960 DONE 13/ delete_inode should wait for pending truncate to complete.
4961     Document I_Trunc somewhere - including that i_mutex is needed to set it.
4962     Verify that assertion.
4963     Actually it requires i_alloc_sem, or the inode to be deleted.
4964
4965
4966 DONE 14/ Review writepage and flush and make sure we flush often enough but
4967     not too often.
4968     Probably just remove the cluster_flush from write-page as lafs_flush
4969     will do that.
4970     But leave for now as it encourages heavy indexing.
4971
4972 DONE 14a/ use bio_add_page to write clusters.
4973
4974 DONE 14b/ Figure out what backing_dev to present for the filesystem.
4975
4976 DONE 15/ The inode map file lost some credits.  I think it lost a PinPending because
4977     it isn't locked properly.  Don't clear PinPending if someone else might
4978     have set it.
4979
4980 DONE 15a/ Find all FIXMEs and add them here.
4981
4982
4983 DONE 15b/ Report directory size less confusingly
4984
4985 DONE 15c/ roll-forward should not update index if physaddr hasn't changed (roll_block)
4986
4987 DONE 15d/ What does I_Dirty mean - and implement it.
4988
4989 FIXED 15e/ setattr should queue an update for the inode metadata.
4990      and clean up lafs_write_inode at the same time (it shouldn't do an update).
4991      and confirm when s_dirt should be set.  It causes fsync to run a
4992      checkpoint.
4993
4994 15f/ include timestamp in cluster_head to set mtime/ctime properly on roll-forward?
4995 ## Items from 6 jul 2007.
4996
4997 15g/ test directories with non-random sequential hash.
4998
4999 DONE 15h/ orphan deadlock
5000     lafs_run_orphans -> lafs_orphan_release can block waiting for written
5001      in erase_dblock, but that won't complete until cleaner gets to run,
5002      but this is the cleaner blocked on orphans.
5003
5004
5005 DONE 15i/ separate thread management from 'cleaner' name.
5006
5007 DONE 15j/ review rules in getref_locked - and document them
5008
5009 DONE  - fix accesses to iblock
5010
5011 DONE 15k/ newblocks should probably be a count of segments.  Review that.
5012
5013 DONE 15l/ make sure checkpoint_youth is decayed properly.  Review youth decay.
5014
5015 DONE 15m/ consider combining .orphans and .cleaning lists.  If something is an
5016     orphan, we probably don't want to clean it just now(?).
5017
5018 DONE 15n/ consider if lafs_pin_dblock should check for iolock.  Maybe
5019      iolock or PinPending (which must be set under iolock).
5020      Just require PinPending and always get iolock_written for that
5021      except in special cases.
5022
5023 DONE 15o/ Can there be async blocks when checkpoint starts?  Could they
5024      pin blocks in old phase?  Do I need to check for them?
5025
5026 DONE 15p/ Review and remove the 'if cleaner is active then don't checkpoint just
5027      yet' thing - or somehow avoid the yuckiness.
5028
5029 DONE 15q/ check checksums when reading cluster_header for cleaner
5030        This is already done!
5031
5032 DONE 15r/ consider further optimisation in cleaner to avoid lookups.
5033
5034 DONE 15s/ memory barrier for i_size check in cleaner???
5035
5036 DONE 15t/ review usable-space calculations in clean.
5037
5038 DONE 15u/ Do I need a SegRef when pin-dblock-by-hand in flush_data_to_inode
5039
5040 DONE 15v/ tidy up all code that fiddles bits and credits - maybe make some
5041      common helpers.
5042
5043 DONE 15w/ review cluster updates and make sure space used is accounted properly.
5044
5045 DONT BOTHER 15x/ Consider caching result of a failed dir lookup in case we immediately
5046      try to create it.  Would this actually save anything significant?
5047
5048 DONE 15y/ Don't make dir blocks into orphans if it cannot be needed?
5049
5050 DONE 15z/ make sure symlink creation is safe - do I need to log the body??
5051
5052 DONE 15aa/ lafs_rename should flush orphans just like lafs_rmdir does.
5053
5054 DONE 15ab/ Does writepage need to recheck if my_inode and/or iblock have appeared
5055      after lock is taken on block?
5056
5057 DONE 15ac/ if lafs_shrinker cannot reclaim enough index blocks, trigger some
5058       writeout.
5059
5060 DONE 15ad/ review lafs_phase_flip's call to lafs_add_block_address and wonder
5061         if more is needed.
5062
5063 DONE 15ae/ refile wonders about a race with cluster_allocate which gets IOLock
5064     before removing from lru.
5065
5066 DONE 15af/ Review all locking in lafs_refile
5067
5068 DONE 15ag/ Don't allocate data part of InoIdx block.
5069
5070 DONE 15ah/ Is there a problem with lafs_allocated_block putting an
5071     about-to-be-truncated block on an uninc list?
5072
5073 DONE 15ai/ When allocating a new segment during checkpoint, delay the
5074     youth-block update until after the checkpoint
5075
5076 DONE 15aj/ When roll-forward finds a new segment, make sure youth number is
5077     updated.
5078
5079 DONE 15ak/ Load orphan file during roll-forward and make every block an
5080     orphan.
5081
5082 DONE 15al/ set filesystem update_time somewhere.
5083
5084 DONE 15am/ filesystem 'name' needs to be handled uniformly.
5085
5086 DONE 15an/ can we be sure 'b' will be non-null in delete_inode?
5087
5088 DONE 15ao/ determine what locking is needed to walk the children list
5089     in lafs_inode_handle_orphan.  Probably the address_space private lock.
5090
5091 15ap/ Make sure write_inode has been cleaned up.  See if this applies to
5092     rollforward of a symlink (see FIXME)
5093
5094 DONE 15aq/ change inode map to be little-endian, not host-endian
5095
5096 DONE 15ar/ understand what to do about errors in lafs_truncate
5097
5098 15as/ handle errors from lafs_write_super ???
5099
5100 DONE 15at/ More wait_queues to wait for different blocks.
5101     just use wait_on_bit / wake_up_bit
5102
5103 DONE 15au/ How should iocheck_block set the page error?
5104        and block_loaded <- this gets it right.
5105
5106 15av/ ditto for write errors?
5107
5108 DONE 15aw/ when lafs_incorporate makes a new block where the
5109       old is Realloc, the new should be Realloc too.
5110
5111 15aw2 / When a block is a snapshot block it can never be dirty
5112     so we only need credits for realloc...
5113
5114 DONE 15ax/ Think about what happens when we relocate a block
5115     in the orphan list (lafs_orphan_release), particularly
5116     if the block isn't actually loaded.
5117     FIXME still need to make sure errors while loading the orphan
5118     file are handled correctly - I guess we mark all bad orphans as
5119     type==0 and when we find those during release, reduce the size
5120     of the orphan file.
5121
5122 DONE 15ay/ Wonder if there is any way for run_orphans to get a wakeup
5123     when an inode or dir mutex is released.
5124     No, there isn't.
5125
5126 DONE 15az/ Sanity check all values in cluster head during roll-forward
5127       i.e. in roll_valid.  If the head isn't complete, we can still
5128       use this to commit some previous checkpoints.
5129
5130 DONE 15ba/ roll forward should not BUG on bad data like inodefile in
5131     non-primary filesystem.
5132
5133 DONE 15bb/ Do I need to sync something before copying an update over part
5134     of an inode, then reloading the inode.
5135
5136 DONE 15bc/ Handle DescHole in roll forward.
5137
5138 DONE 15bd/ Call lafs_add_block_address from writeback rather than iolock
5139     in roll forward, just for consistency.
5140
5141 DONE 15be/ Confirm various files loaded at mount time (segusage, orphan ...)
5142     are actually the correct type.
5143
5144 DONE 15bf/ Avoid quadratics in lafs_seg_put_all - nothing else should be doing
5145    a lookup - or at least we can test for that.
5146    lafs_seg_apply_all has similar problems and needs a good solution.
5147
5148 DONE 15bg/ lafs_seg_ref_block is worried about losing implicit ref on parent
5149     if parent splits.  See what to do about that.
5150
5151 DONE 15bh/ after roll-forward, check that free_blocks hasn't gone negative.
5152   or handle if it has.
5153
5154 DONE 15bi/ Set EmergencyClean a bit later - need at least one checkpoint first.
5155   to twostage.
5156
5157 DONE 15bj/ Make sure .last link in segtracker is kept uptodate, particularly in
5158    segdelete.
5159
5160 DONE 15bk/ make sure get_cleanable doesn't lose a race before calling add_clean
5161
5162 DONE 15bl/ better checks for 'valid state block address' in valid_devblock
5163     include that segment_count is credible
5164     also in valid_stateblock
5165
5166 15bm/ make sure everything gets freed properly on error during mount / lafs_load
5167
5168 15bn/ How does refcounting of 'struct fs' work with multiple filesets?
5169
5170 DONE 15bo/ use put_super to drop last reference to superblocks
5171
5172 DONE 15bp/ review all superblocks - maybe use more anon??
5173
5174 15bq/ check readonly status in lafs_get_sb
5175
5176 DONE 15br/ sync_fs should probably wait for something if 'wait'.
5177
5178 DONE 15bs/ set f_fsid properly in lafs_statfs
5179
5180 DONE  - use new write_begin / write_end
5181
5182 15bt/    - review how we ensure that credits remain with block.
5183
5184 15ca/ When pinning an inode data block, pin it as well as the index block, I think.
5185     It is still kept off the leaf list until the index block is done with,
5186     I think.
5187
5188 15cb/ Layout issues:
5189      DONE - subset filesys still needs a parent pointer
5190      DONE - cluster head needs mtime/ctime to log these.
5191      - need better tracking of which devices are in this array??
5192             Need to be able to have read-only devices that are shared
5193              among arrays.
5194      DONE - need multiple parallel write-clusters to allow parallel writes.
5195      - record tuning in state block:
5196            - max_segs
5197      DONE - use crc or something, not toy checksum (e.g. cluster - state already has)
5198      - flags for inconsistencies found, at layout/fileset/file levels(?) (see 60)
5199      - policies of whether old or new data is allowed on each device
5200      - policies of how much duplication of metadata is required
5201      DONE - inode map - not host-endian
5202      DONE - segments > 16bit:
5203         segusage file - what about youth?
5204         cluster_head Clength
5205
5206 15cc/ free any stray B_Async block found in destroy_inode
5207
5208 15cd/ Some code assumes a cluster header does not exceed 1 page.
5209      Is this safe?  Is it true? Is it enforced?
5210      roll-forward now handles large cluster_head.
5211      Need cleaner to handle it, and need to possibly write large
5212      cluster head when making new clusters.
5213
5214 15ce/ classify BUGs as
5215         - internal logic errors
5216         - IO errors
5217         - unusual conditions I want a warning of
5218         - data corruption errors
5219
5220 DONE 15cf/ lafs_iget_fs needs to sometimes do in-kernel mounts for subset filesystems
5221      This is needed for the cleaner - the cleaner needs to hold a ref somehow.
5222
5223 15cg/ lafs_sync_inode is weird - why the lafs_checkpoint_start and update_cluster
5224       stuff??
5225
5226 15ch/ Review values of youth and checkpoint_youth and think about off-by-one
5227      issues.
5228
5229 15da/ Replace directory updates!!!!!
5230
5231 15db/ Decide how version string will be used.
5232
5233 15dc/ resolve table_size - it should be stored in the segusage file and validated
5234       based on device geometry.
5235
5236 15ea/ rollforward should recognise VerifyDevNext{,2} to allow next
5237       cluster on same device to verify previous.
5238
5239 15eb/ When multiple devices and lots to do and plenty of free space,
5240         allow multiple segments, one per device, to be open at once,
5241         and possibly be writing multiple clusters at once using
5242         VerifyDevNext2
5243
5244 15ec/ Implement i_version tracking.  This should be a 64bit number
5245         that appears to change every time the file changes.  We only
5246         need a new number when someone looks at the value with
5247         getattr.
5248         We could simply use mtime with the sub-millisecond part being
5249         a counter of times that getattr sees a change in the same
5250         millisecond.
5251         However as mtime can go backwards we might get i_version going
5252         backwards, which is awkward.  I wonder if I care.
5253         Otherwise, leave for an inode extension later.
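
One way the mtime-plus-counter idea in 15ec might be sketched, in userspace C.
Everything here is hypothetical (struct, function names, nanosecond mtime as a
plain integer); it only shows the arithmetic of replacing the sub-millisecond
part of mtime with a counter that advances when getattr sees a change.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NSEC_PER_MSEC 1000000ULL

/* Hypothetical per-inode state; not taken from the LaFS source. */
struct iversion_state {
	uint64_t last_msec; /* millisecond of mtime last reported */
	uint64_t counter;   /* changes seen by getattr within that millisecond */
	bool changed;       /* file modified since the last getattr */
};

/* Note a modification; no new number is needed until getattr looks. */
static void iversion_note_change(struct iversion_state *s)
{
	s->changed = true;
}

/* Compute the value getattr would report for an mtime of 'mtime_ns'. */
static uint64_t iversion_getattr(struct iversion_state *s, uint64_t mtime_ns)
{
	uint64_t msec = mtime_ns / NSEC_PER_MSEC;

	if (msec != s->last_msec) {
		s->last_msec = msec;
		s->counter = 0;
	} else if (s->changed) {
		s->counter++;
	}
	s->changed = false;
	return msec * NSEC_PER_MSEC + s->counter;
}
```

As the note says, nothing here stops the value going backwards if mtime does.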
5254
5255 16/ Update locking.doc
5256
5257 17/ cluster_flush calls lafs_cluster_allocate calls lafs_add_block_address
5258     calls  lafs_iolock_written.  How do we know that won't block on cluster_flush?
5259
5260 18/ See if per-fs shrinker is available yet and consider it for index blocks.
5261
5262 19/ Review WritePhase and make sure it is used properly.
5263
5264 20/ Review places where we update blocks and be sure they are not in writeout
5265     or in a different phase.
5266
5267 21/ Review and document all lru uses (locking.doc) and make sure they are
5268     all locked properly.
5269
5270 22/ Check possible failures:
5271     - thread allocation
5272     - memory allocation
5273     - reading critical metadata
5274     ...
5275
5276 23/ Rebase on 2.6.latest.  Done for .38
5277
5278 24/ load/dirty block0 before dirtying any other block in depth=0 file,
5279     else we might lose block0
5280
5281 25/ use kmem_cache for
5282         datablock
5283         indexblock - probably a mempool because we cannot allow failure when
5284                      splitting an index block.
5285         skippoint (mempool?)
5286         segsum - mempool??
5287         others?
5288
5289 26/ Review seg addressing code for 2-D geometries.
5290
5291 27/ Allow ranges of holes in pending_addr so partial truncate can be more efficient.
5292
5293 28/ Make sure youth blocks are always referenced properly.
5294
5295 29/ Make sure new segments are referenced properly.  I think there might be
5296     some double referencing.
5297
5298 30/ Decide when to use VerifyNULL or VerifyNext2
5299
5300 31/ Implement non-logged files
5301
5302 DONE 32a/ Store access time in a file
5303 32b/ Make it a non-logged file
5304 32c/ Avoid writing out dirty atime file blocks when not necessary.
5305       i.e. keep the page clean and active, and trigger 'write'
5306      on release_page.
5307
5308 33/ Support quota : group / user / tree
5309
5310 34/ handle subordinate filesystems:
5311      ss[]->rootdir needs to be array or list
5312      lafs_iget_fs needs to understand this
5313
5314 35/ review snapshots:
5315       - peer lists and cleaning
5316       - how to create
5317       - failure modes
5318       - how to destroy
5319
5320 36/ review roll-forward
5321
5322 DONE 36a/  make sure files with nlink == 0 are handled well
5323 DONE 36b/  sanity check before trusting clusters
5324 DONE 36c/ handle miniblocks which create new inodes.
5325 DONE 36d/ Handle DescHole in roll_block
5326 DONE 36e/ When dirtying a block in roll_block, maybe use writeback rather
5327      than just iolock, for consistency...
5328 DONE 36f/ What to do if table becomes full when add_block_address in
5329      roll_block ??
5330 DONE 36g/ Write roll_mini for directories.
5331 DONE 36h/ In roll_one, use the cluster counting code to find block number and
5332      make sure we don't exceed the segment.
5333 DONE 36i/ add more general error checking to lafs_mount -
5334             lafs_iget orphans and segsum.  Check type is correct.
5335          errors from lafs_count_orphans or lafs_add_orphans.
5336          alloc_page failure for chead - maybe allocate something bigger??
5337
5338 37/ Configure index block hash_table at run time based on mem size??
5339
5340 38/ striped layout
5341         review everything needed for safe RAID5
5342
5343 39/ How to handle all different IO errors
5344
5345 40/ Guard against data corruption at every level.
5346
5347 41/ Add checksums on index blocks and dir blocks and Inodes and ???
5348
5349 42/ Store duplicates of some blocks.  At least index and inode.
5350
5351 43/ Handle writepage on mem-mapped page, adding new credits or unmapping.
5352     Make sure ->page_mkwrite sets up credits properly
5353
5354 44/ Examine created filesystem and make sure everything looks good.
5355
5356 DONE 45/ mkfs.lafs
5357
5358 46/ fsck.lafs
5359
5360 47/ Write good documentation
5361
5362 48/ Review all code, improve all comments, remove all bugs.
5363
5364 49/ measure performance
5365
5366 50/ Support O_DIRECT
5367
5368 51/ Check support for multiple devices
5369     - add a device to a live array
5370     - remove a device from a live array
5371
5372 DONE 52/ NFS export
5373
5374 53/ 'overlay' support
5375         So I mount one device read-only and another device
5376         writable which gets all the updates.  metadata on first
5377         device not updated.
5378
5379 54/ cluster support - is this possible?
5380
5381 55/ is any useful variant of reflink  possible?
5382
5383 56/ Review roll-forward completely.
5384
5385 57/ learn about FS_HAS_SUBTYPE and document it.
5386     This is for fuse in particular so users can know the real type
5387
5388 58/ Consider embedding symlinks and device files in directory.
5389     Need owner/group/perm for device file, but not for symlink.
5390     Can we create unique inode numbers?
5391     hard links for dev-files would be problematic.
5392     What do we gain?  Maybe something for short symlinks.
5393     40 seems a good length to get 70% of symlinks.
5394
5395 59/ Fix NeedFlush handling so we don't drop-then-retake
5396     a mutex as that isn't sensible.
5397
5398 60/ Introduce some fs state recording that fsck is needed and possibly
5399     identifying what sort of fsck.
5400
5401 61/ Try to make the inode struct smaller - maybe move some of the
5402     fs metadata into a separately-allocated struct.
5403
5404 62/ System/trusted extended attributes:
5405          fileset max size
5406          directory hash/seed
5407
5408 63/ user extended attributes.
5409
5410 64/ wonder if index blocks can be flushed out by memory pressure somehow.
5411    e.g. if a data block is written by reclaim, flag the index block.
5412    When a flagged index block has no children, it is incorporated and written.
5413     ??
5414
5415 65/ review why lafs_allocated_block needs the new_parent label.  Should not
5416    lafs_incorporate leave all parents dirty? Maybe it is just the need for
5417    B_Realloc - so maybe lafs_incorporate should leave the new block either
5418    realloc or dirty rather than lafs_allocated_block doing it.?
5419    See also 15ad above.
5420
5421 66/ Delay writeout of directory updates until an fsync.  If a checkpoint happens
5422    first, discard the updates (and fsync waits for checkpoint to complete).
5423    If a cross-directory rename happens care is needed:  either flush updates
5424    first or ensure that a flush does happen before the cross-directory
5425    update is flushed.
5426    Note that if the target of a rename is a directory, it must also be fully
5427    flushed before the rename can proceed.
5428
5429 26June2010
5430  Investigating 5a
5431
5432    Normal sequence is to surrender UnincCredit, then to clear Dirty,
5433     then to write.  If anyone re-dirties after Dirty is clear, they
5434     will naturally have to add an UnincCredit having reserved space first.
5435    However it seems that the Cleaner gets in the way as the block in question
5436    has just previously been cleaned, which consumed the UnincCredit.
5437    Do we need ReallocUnincCredit?? I hope not.
5438    We generally need a way to say "I might want to write to this" so cleaner
5439    doesn't write it early.
5440    For index blocks that is pincnt.  For data it is 'PinPending'.
5441    This keeps index blocks off clean_leafs until they are ready, but
5442    not data blocks.
5443    And in any case, TypeSegmentMap blocks don't get PinPending as they
5444    get written *after* the checkpoint.  That is a rather ugly exception.
5445    Maybe we make their different handling more explicit.  We put them on
5446    a separate list unpinned so the rest of the checkpoint can complete.
5447    Then we flush that list?
5448    Then PinPending keeps them off the clean_leafs list.
5449
5450    So to clarify the plan:  If a block is already Pinned to this phase,
5451    we can "clean" it by marking it Dirty rather than Realloc.  This is
5452    appropriate for blocks that are likely to change soon (as blocks written
5453    to the cleaner segment are not likely to change soon).
5454    For data blocks we take "PinPending" to say "might change soon".  For
5455    index blocks ... we don't know if it is pinned by Realloc or Dirty or
5456    PinPending children.  So we set Realloc and wait for any children to
5457    be unpinned for whatever reason.  If it is only pinned by Realloc blocks,
5458    it will end up on clean_leafs and be processed to the cleaner segment.
5459    If it is pinned by anything else it will be found by the checkpoint and
5460    processed to the new-data segment.
5461
5462    So Index blocks always get Realloc, PinPending blocks get Dirty,
5463    Other data blocks get Realloc.  Good.
5464
5465    Must review PinPending usage... always set, then maybe-dirty inside
5466    checkpoint lock.  In cases of unlocked usage (inode map) we don't clear
5467    PinPending until checkpoint so it has longer exposure to Realloc->Dirty.
5468    It is likely to be changing though, so not a big cost.  Even good.
5469
5470    Could make the distinction later.  PinPending blocks don't go on
5471    clean_leafs.  So if they are still realloc at the checkpoint, we Realloc
5472    to the new-data segment.  This has the same net effect but is arguably
5473    cleaner.  It means that if a realloc block gets pinpending set, it
5474    immediately stops being a clean leaf and so is safe.
5475    So: just keep PinPending blocks off clean_leafs.  Keep them on phase_leafs.
5476    However there is no mechanism for moving things from phase_leafs to clean_leafs.
5477    So maybe they stay on clean_leafs, but when the cleaner gets to them, it
5478    dirties them and drops them.... that would work.
5479
5480    So; if cleaner finds a block (on clean_leafs during cleaner-flush) which is
5481    Dirty or PinPending, it makes sure it is Dirty and drops it for phase_leafs
5482    to pick up.
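
   As a sketch of that rule (a userspace toy, not the module code): the
   struct and the cleaner_disposition() helper are hypothetical names, and
   the booleans merely stand in for the real block flags.  Index blocks are
   tested first, on the assumption that "Index blocks always get Realloc"
   takes priority over the Dirty/PinPending test.

```c
#include <stdbool.h>

/* Hypothetical, simplified view of a block; the real code keeps flag bits. */
struct toy_blk {
    bool is_index;     /* index block rather than data block */
    bool dirty;        /* stands in for B_Dirty */
    bool pin_pending;  /* stands in for B_PinPending */
};

enum clean_action {
    CLEAN_REALLOC,  /* mark Realloc: may be written to the cleaner segment */
    CLEAN_DIRTY     /* make sure it is Dirty and leave it for phase_leafs */
};

/* Decide what the cleaner does with a block it found on clean_leafs. */
static enum clean_action cleaner_disposition(const struct toy_blk *b)
{
    if (b->is_index)
        return CLEAN_REALLOC;          /* index blocks always get Realloc */
    if (b->dirty || b->pin_pending)
        return CLEAN_DIRTY;            /* likely to change soon */
    return CLEAN_REALLOC;              /* other data blocks */
}
```

   A PinPending data block therefore ends up Dirty and goes out with the
   checkpoint, while everything else is left to the cleaner.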
5483
5484    BUT:  Does this work for TypeSegmentMap blocks?  They aren't PinPending.
5485
5486    We could treat them specially in the cleaner.  Or we could set PinPending
5487    and pin them to the phase, but treat them differently in checkpoint.
5488    If we gathered them onto a separate list, then flush the list after
5489    the phase had changed, it might be quite neat.  No more getting writepages
5490    to do our work for us.
5491    They would need to be re-pinned to the next phase, then written out.
5492    Or just unpinned, and let seg_inc re-pin as appropriate... except that
5493    seg_inc is too late to pin.  It dirties.  We need to pin when we get
5494    SegRef.  We currently reserve but we don't pin.
5495    We really do need to phase_flip these segmentmap blocks.  But that requires
5496    getting extra credits, and Pinning everything if new credits are not available.
5497    And we don't really have a good list of 'everything' that depends on a segment.
5498    But seeing that space_alloc never fails for these...
5499    So Pin them, and flip them with AccountSpace.
5500
5501    So:
5502     - split out common 'flip' code
5503     - add 'flip' for data blocks
5504     - create list of accounting blocks and flip accounting file blocks onto
5505       that list during checkpoint
5506       Flush should write that list,  not the files.
5507     - Get cleaner to ignore pinpending blocks, marking them dirty.
5508     - pin segusage blocks while ref on them is held.
5509     - writepage no longer needs special case for TypeSegmentMap, just PinPending
5510     - lafs_prealloc just tests PinPending
5511
5512
5513    [[aside: quota files seem to be handled like segmentmap files.  Is that
5514      right??
5515      We only track usage of data blocks based on various 'owners' of the file.
5516      We need to know if a block was written in one phase or the next, and
5517      only count blocks written/allocated in the one.
5518      Data blocks can slip into 'this' phase quite late - any time before the
5519      parent is finally incorporated.  So we don't write quota blocks
5520      until checkpoint is done.  So yes, they are like SegmentMap
5521    ]]
5522
5523
5524   segsums....
5525    If there are hundreds of snapshots, then a block being cleaned (whether to
5526    cleaner segment or new-data segment) could affect hundreds of segment
5527    usage counters.  That would be clumsy to work with.  Every block in the
5528    free table would need to hold references to hundreds of blocks.  This
5529    is do-able and might not be a big waste of space, but is still clumsy.
5530    I could change the arrangement for accounting per-snapshot usage by having
5531    a limited number of snapshots and having all the counters for one segment
5532    in the one block.  So a 1024-byte block could hold 512 counters (youth plus
5533    base plus 510 snapshots).  Half that if I go to 4-byte counters.
5534    In the more common case of 32 snapshots, could fit counters for 8 segments in
5535    a block.  This means using space/io for all possible snapshots rather than
5536    all active snapshots.  It would also mean having a fairly fixed upper limit.
5537    I wonder what NILFS does....
5538    Worry about this later.
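
   The arithmetic above as a small helper.  segs_per_block() is a
   hypothetical name.  The "8 segments" figure comes out if the 32 is read
   as the total number of counter slots (youth and base included) with
   4-byte counters; that reading is an assumption.

```c
#include <stddef.h>

/*
 * How many segments' worth of counters fit in one block when every
 * segment gets a fixed array of 'slots' counters.
 * 'slots' is assumed to count youth + base + snapshots together.
 */
static size_t segs_per_block(size_t block_bytes, size_t counter_bytes,
                             size_t slots)
{
    if (counter_bytes == 0 || slots == 0)
        return 0;
    return block_bytes / (counter_bytes * slots);
}
```

   segs_per_block(1024, 2, 512) gives 1 and segs_per_block(1024, 4, 32)
   gives 8; the cost is that the space is used whether or not the
   snapshots exist.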
5539
5540   Still trying to get pinning of SegmentMap blocks right.
5541   Normally we need a phase-lock when pinning a data block so that we
5542   don't lose the pinning before we dirty.  But as we phase_flip
5543   these it doesn't matter... So just add that to the test??
5544
5545 28June2010
5546  Reflecting on 5c - dirty_inode might find InoIdx pre-allocated but
5547   datablock not, and doesn't cope.
5548   We either prealloc both, which seems clumsy, or always defer
5549   to InoIdx if it is present and pinned.
5550   lafs_prealloc does both Index and Data blocks for inode.
5551   But Data could lose credits at writeout while index will replenish at
5552   phase_flip, so maybe not a good idea.
5553   If lafs_allocate_cluster finds a Dirty InoIdx it will copy the Dirty
5554   credits across to the data block (on non-cleaning segments) so the
5555   Data block doesn't need to have credits.
5556
5557   dirty_inode gets called:
5558      {__,}mark_inode_dirty{,_sync}
5559      inode_{inc,dec}_link_count
5560      [[various quota ops]]
5561     inode_setattr
5562     touch_atime
5563       file_accessed
5564     file_update_time
5565       generic_file_...write
5566       do_wp_page
5567
5568   updates through inode_setattr go to lafs_setattr so the
5569   data block will be pinpending and the checkpoint lock will be held.
5570
5571   updates through inode_*_link_count happen in filesystem and the inode data
5572    block is PinPending, or a block in the file is pinned and will be
5573    dirty, so it will get written.
5574
5575   updates through touch_atime or file_update_time are unexpected and
5576   cannot be prepared for.  file_update_time changes will be caught by
5577   normal file writeout.  atime changes will be lost until we get the
5578   atime file working.
5579
5580   So:
5581     dirty_inode cannot change the block as it might be in writeout, and
5582     it cannot lock anything as it might be in touch_atime which shouldn't
5583     block and cannot fail.
5584     So just set I_Dirty and use that to flush inode to db at writeout.
5585     Any changes which must be in the next phase will come via setattr and
5586     so will wait for incompatible changes to be written out.
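
   A toy model of that decision, single-threaded and in userspace.  The
   struct and both helpers are hypothetical; a plain bool stands in for
   the I_Dirty flag and two fields stand in for the inode data block.

```c
#include <stdbool.h>

struct toy_inode {
    bool i_dirty;             /* stands in for I_Dirty */
    long atime, mtime;        /* in-memory attributes */
    long db_atime, db_mtime;  /* copy held in the inode data block */
};

/* dirty_inode: must not block, must not fail, must not touch the block. */
static void toy_dirty_inode(struct toy_inode *ino)
{
    ino->i_dirty = true;
}

/*
 * Called at writeout, when the data block is safe to change.
 * Returns true if the in-memory attributes were copied across.
 */
static bool toy_flush_inode_to_db(struct toy_inode *ino)
{
    if (!ino->i_dirty)
        return false;
    ino->i_dirty = false;
    ino->db_atime = ino->atime;
    ino->db_mtime = ino->mtime;
    return true;
}
```

   The point of the split is that touch_atime only ever reaches the first
   function, so it can never block on a block that is in writeout.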
5587
5588  Reflecting on 7c - cluster_flush might find ->my_inode is NULL.
5589   my_inode is set
5590      lafs_import_inode
5591          iget and mount-time stuff
5592      lafs_inode_dblock
5593
5594   my_inode is cleared
5595     When I_Destroyed is set and the last ref on the block is dropped
5596     When inode_map_new_prepare claims an inodeblock
5597
5598   So we could easily not have a my_inode - e.g. just cleaning the data block.
5599   ->my_inode cannot disappear while we hold the block, so a test is safe.
5600
5601
5602  ----------------------------------------------
5603  Space reservation and file-system-full conditions.
5604
5605   Space is needed for everything we write.
5606   Some things we can reject if the fs is too full
5607   Some things we can delay when space is tight
5608   Some things we need to write in order to free up space.
5609   Others absolutely must be written so we need to always have
5610   a reserve.
5611
5612   The things that must be written are
5613        - cluster header  - which we never allocate
5614        - some seg-usage and youth blocks - and quota blocks
5615          These continually have credit attached - it is a bug if there
5616           are not enough. (We hit this bug)
5617
5618   Things that we need to write to free up space are
5619    any block - data or index - that the cleaner finds.
5620
5621   Things that we can delay, but not fail, are any change to a block that
5622    has already been written or allocated.
5623
5624   When space is needed it can come from one of three places.
5625      - the remainder of the current main segment
5626      - the remainder of the current cleaner segment
5627      - a new segment.
5628
5629   Only Realloc blocks can go to the cleaner segment, so the
5630   'must write' blocks cannot go there, so unused + main must have enough
5631   space for all those.
5632   Realloc blocks can go anywhere - we don't need a cleaner segment if things
5633   are too tight.
5634
5635   When we run out of space there are several things we can do to get more:
5636    - incorporate index blocks.  This tends to free up uninc-credits which
5637      are normally over-allocated for safety.
5638    - cluster_allocate/cluster_flush so more blocks get allocated and so
5639      more can be incorporated.  See above.  This is probably most helpful
5640      for data blocks.
5641    - clean several segments into whole cleaner segments or into the main segment.
5642   Much of this happens by triggering a snapshot, however we should only do that
5643   when we have full cleaner-segments (or zero cleaner segments).
5644
5645   When cleaning we don't want to over-clean.  i.e. we don't want to commit
5646   any blocks from a second segment if that will stop us from committing blocks
5647   from the first segment.  Otherwise we might use one cleaning segment up by
5648   making 4 segments half-clean.  This doesn't help.
5649
5650
5651   So: we reserve multiple segments for the cleaner, possibly zero.
5652
5653   We clean up to that many segments at a time, though if that many is zero,
5654   we clean one segment at a time.
5655   lafs_cluster_allocate only succeeds if there was room in an allocated segment.
5656   If allocating a new segment fails, the cluster_allocate must fail.  This
5657   will push extra cleaning into the main segment where allocations must not
5658   fail.
5659
5660   The last 3(?) [adjusted for number of snapshots] segments can only be allocated
5661   to the main segment, and this space can only be used for cleaning.
5662   Once the "free_space - allocated_space"  drops below one segment, we
5663   force a checkpoint.  This should free up at least one segment.
5664
5665   We need some point at which we stop cleaning because the chance of finding
5666   something to clean is too low. At that point all 'new' requests definitely
5667   become failures.  They might do earlier too.
5668   Possibly at some point we start discounting youth from new usage scores so
5669   that the list becomes sorted by usage.
5670
5671
5672   Need:
5673     cut-off point for free_seg where we don't allow cleaner to use segments
5674       3? 4?
5675
5676     event when we start using fixed '0x8000' youth for new segment scores.
5677        Maybe when we clean a segment with usage gap below 16 or 1/128
5678     event when we stop doing that.
5679        Maybe when free_segs cross some number - 8?
5680
5681     point when alloc failure for NewSpace becomes ENOSPC
5682        same as above?
5683
5684     point when we don't bother cleaning
5685       no cleaner segments can be allocated, and checkpoint did not increase
5686       number of clean segments (used as many as freed).
5687       Clear this state when something is deleted.
5688
5689
5690    Allocations come out of free_blocks which does not include those
5691    segments that have been promised to the cleaner.
5692    CleanSpace and AccountSpace cannot fail.
5693      We *know* not to ask for too many - cleaner knows when to stop.
5694    ReleaseSpace fails (to be retried) if available is below a threshold,
5695      providing the cleaner hasn't been stopped.
5696    NewSpace fails if below a somewhat higher threshold.  If we haven't entered
5697      emergency cleaning mode, these requests fail -ENOSPC, else -EAGAIN.
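
   A sketch of those four request classes as one check.  The struct, its
   field names and space_alloc_check() are hypothetical, and the two
   thresholds are just parameters.  One assumption needs flagging: the
   work items in these notes say emergency cleaning mode is what causes
   ENOSPC, which reads the opposite way to the sentence above; the sketch
   follows the work item.

```c
#include <errno.h>
#include <stdbool.h>

enum alloc_why { CleanSpace, AccountSpace, ReleaseSpace, NewSpace };

/* Hypothetical summary of the state the decision depends on. */
struct space_state {
    long available;          /* free minus allocated, cleaner segments excluded */
    long release_threshold;  /* below this, ReleaseSpace is asked to retry */
    long new_threshold;      /* somewhat higher; below this, NewSpace fails */
    bool cleaner_stopped;    /* cleaner has given up until something is deleted */
    bool emergency_clean;    /* emergency cleaning mode */
};

/* Returns 0 if the request may proceed, else a negative errno. */
static int space_alloc_check(const struct space_state *s, enum alloc_why why)
{
    switch (why) {
    case CleanSpace:
    case AccountSpace:
        return 0;                       /* these cannot fail */
    case ReleaseSpace:
        if (!s->cleaner_stopped && s->available < s->release_threshold)
            return -EAGAIN;             /* retry once the cleaner has run */
        return 0;
    case NewSpace:
        if (s->available < s->new_threshold)
            /* assumption: emergency mode means "really full" */
            return s->emergency_clean ? -ENOSPC : -EAGAIN;
        return 0;
    }
    return -EINVAL;
}
```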
5698
5699
5700    Possibly limit some 'cleaner' segments to data only??
5701
5702
5703   So: work items.
5704     - change CleanSpace to never fail, but cluster_allocate new_segment
5705       can fail for cleaner segment.  This is propagated through lafs_cluster_alloc
5706     - cleaner pre-allocates cleaner segments (for new_segment to use)
5707       and only cleans that many segments at a time.
5708     - introduce emergency cleaning mode which causes ENOSPC to be returned
5709       and ignores 'youth' on score.
5710     - pause cleaner when we are so short of space that there is no point
5711       trying until something is deleted.
5712
5713 30june2010
5714   notes on current issue with checkpoint misbehaving and running out of
5715   segments.
5716
5717   1/ don't want to cluster-flush too early.  Ideally wait until segment is
5718    full, but we currently hold writeback on everything so we cannot delay
5719    indefinitely.
5720   2/ row goes negative!!  let's see...
5721
5722     seg_remainder doesn't change the set, but just returns
5723         the remaining rows times the width
5724
5725     seg_step  move nxt_* to *, stepping to the next ... row?
5726              save current as 'st_*'
5727
5728     seg_setsize - allocate space in the segment for 'size' blocks plus
5729          a bit to round off to a whole number of table/rows
5730                nxt_table nxt_row
5731
5732     seg_setpos initialises the seg to a location and makes it empty,
5733        st_ and nxt_ are the same
5734
5735     seg_next reports address of next block, and moves forward.
5736
5737     seg_addr  simply reports address of next block
5738
5739    So the sequence should be:
5740
5741      seg_setpos  to initialise
5742      seg_remainder as much as you want
5743      seg_setsize when we start a cluster
5744      seg_next  up to seg_remainder times
5745      seg_step  to go to next cluster (when not seg_setpos).
5746             or maybe just before seg_setpos
5747
5748      Need cluster_reset to be called after new_segment, or after we
5749      flush a cluster but don't need a new_segment.
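
   To make that sequence concrete, a toy model with a single table laid
   out row-major.  Everything here is hypothetical (names carry a toy_
   prefix, and the real segment layout is more involved); it only
   illustrates how st_, the current position and nxt_ relate, and it
   refuses a setsize that would not fit, so the remainder cannot go
   negative.

```c
/* One table of 'rows' rows, each 'width' blocks (width > 0), from 'base'. */
struct toy_segpos {
    unsigned base, rows, width;
    unsigned st_row;    /* first row of the current cluster */
    unsigned row, col;  /* next block to hand out */
    unsigned nxt_row;   /* first row after the current cluster */
};

/* Initialise to a location and make it empty: st_ and nxt_ are the same. */
static void toy_seg_setpos(struct toy_segpos *s, unsigned base,
                           unsigned rows, unsigned width)
{
    s->base = base; s->rows = rows; s->width = width;
    s->st_row = s->row = s->nxt_row = 0;
    s->col = 0;
}

/* Remaining rows times the width; changes nothing. */
static unsigned toy_seg_remainder(const struct toy_segpos *s)
{
    return (s->rows - s->nxt_row) * s->width;
}

/* Reserve room for 'size' blocks, rounded up to whole rows. */
static int toy_seg_setsize(struct toy_segpos *s, unsigned size)
{
    unsigned need = (size + s->width - 1) / s->width;

    if (need > s->rows - s->st_row)
        return -1;                      /* would run past the segment */
    s->nxt_row = s->st_row + need;
    return 0;
}

/* Address of the next block, without moving. */
static unsigned toy_seg_addr(const struct toy_segpos *s)
{
    return s->base + s->row * s->width + s->col;
}

/* Address of the next block, then move forward. */
static unsigned toy_seg_next(struct toy_segpos *s)
{
    unsigned addr = toy_seg_addr(s);

    if (++s->col == s->width) {
        s->col = 0;
        s->row++;
    }
    return addr;
}

/* Move nxt_ to the current position and remember it as st_. */
static void toy_seg_step(struct toy_segpos *s)
{
    s->st_row = s->row = s->nxt_row;
    s->col = 0;
}
```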
5750
5751    I think I'm cleaning too early ...  I am even cleaning
5752    the current main segment!!!!
5753
5754    OK, I got rid of the worst bugs.  Now it just keeps cleaning
5755    the same blocks in the current segment over and over.
5756    2 problems I see
5757       1/ it cleans a segment that it should not touch
5758            We need to  avoid cleaner segment increasing the
5759              checkpoint youth number.
5760       2/ it has 6 free segments and doesn't use them
5761
5762    clean_reserved is 3 segments, < 4, so free_block <= allocated+ watermark
5763    watermark is 4 segs, so free < 4.  So we have 3 allocated to cleaner,
5764    3 in reserve and so nothing much to clean!
5765
5766    The heuristic for returning ENOSPC is not working.  Need something more
5767    directly related to what is happening.
5768    Maybe if cleaning doesn't actually increase free space.
5769
5770    !Need to leave segments in the table until we have finished
5771    writing to them, so they cannot be cleanable. - DONE
5772
5773    WAIT - problem.  If cleaner segment is part-used, the alloc_cleaner_segs
5774    doesn't count that.  Bad?
5775
5776    When nearly full we keep checkpointing even though it cannot help.
5777    Need clearer rules on when there is any point pushing forward.
5778    Need to know when to fail requests.
5779
5780 02 july 2010
5781
5782   I am wasting lots of space creating snapshots that don't serve any
5783   purpose.
5784   The reasons for creating a snapshot are:
5785     - turn clean segments into free segments
5786     - reduce size of required roll-forward
5787     - possibly flush all inode updates for 'sync'.
5788
5789   We currently force one when
5790        newblocks > max_newblocks
5791           max is 1000 , newblocks is never reset!
5792           probably make that a number of segments.
5793        lafs_checkpoint_start is called
5794           when cleaner blocks, and space is available
5795           at shutdown
5796           on write_super if s_dirt
5797              __fsync_super before ->sync_fs
5798                freeze_bdev
5799                fsync_super
5800                  fsync_bdev
5801                  do_remount_sb
5802              generic_shutdown_super before put_super if s_dirt
5803              sync_supers if s_dirt
5804                do_sync
5805              file_sync !!! if s_dirt
5806
5807       I think I should move checkpoint_start to
5808             ->sync_fs
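
   On the "newblocks is never reset" problem: a minimal sketch of a
   trigger that is expressed in segments and cleared when a checkpoint
   completes.  The struct and function names are hypothetical and the
   limit is whatever the caller passes in.

```c
#include <stdbool.h>

struct ckpt_trigger {
    unsigned long newblocks;      /* blocks written since the last checkpoint */
    unsigned long max_newblocks;  /* limit, derived from a number of segments */
};

static void ckpt_trigger_init(struct ckpt_trigger *t, unsigned long max_segs,
                              unsigned long blocks_per_seg)
{
    t->newblocks = 0;
    t->max_newblocks = max_segs * blocks_per_seg;
}

/* Account for 'n' newly written blocks; true means "force a checkpoint". */
static bool ckpt_note_blocks(struct ckpt_trigger *t, unsigned long n)
{
    t->newblocks += n;
    return t->newblocks > t->max_newblocks;
}

/* Call when a checkpoint completes, so the next one is not forced at once. */
static void ckpt_done(struct ckpt_trigger *t)
{
    t->newblocks = 0;
}
```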
5809
5810
5811  After testing
5812   - blocks remaining after truncate - one index and 1-4 data
5813   - truncate finds blocks being cleaned
5814          FIXED - move setting of I_Trunc
5815   - orphans aren't being cleaned up sometimes.
5816         Hacked by forcing the thread to run.
5817   - parent of index block has depth==1
5818         Don't reduce depth while dirty children.
5819         Probably don't want uninc either?
5820
5821   - some sort of deadlock? lafs_cluster_update_commit_both
5822      has got the wc lock and wants to flush
5823     writepage also is flushed.
5824    Not sure what the blockage is.
5825    I think the writepage is the one in cluster_flush, and it
5826     is blocking.
5827
5828   - Async is keeping 16/0 pinned during shutdown
5829 03July2010
5830
5831   Testing overnight with 250 runs produced:
5832  - blocked for more than 120 seconds
5833       Cleaner tries to get an inode that is being deleted
5834       and blocks, so inode_map_free is blocked waiting for
5835       checkpoint to finish - deadlock.
5836      Need to create a ->drop_inode which provides interlock with
5837      cleaner/iget
5838
5839     But this is hard to get right.
5840     generic_forget_inode needs to write_inode_now and flush all changes
5841     out and then truncate the pages off so the inode will be
5842     empty and can be freed.  But flushing needs the cleaner thread
5843     which can block on the inode lookup.
5844     Ahh.... I can abuse iget5_locked.
5845     If test sees I_WILL_FREE or similar, it fails and sets a flag.
5846     if the flag was set, then 'set' fails
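
   What that trick might look like, modelled in userspace.  The callback
   shapes follow the test/set callbacks that iget5_locked takes, but the
   inode struct, the key struct and toy_iget() are all hypothetical
   stand-ins, and -EBUSY is just a convenient non-zero error.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_inode {
    unsigned long ino;
    bool will_free;   /* stands in for I_WILL_FREE or similar */
};

struct toy_key {
    unsigned long ino;
    bool saw_dying;   /* set by 'test' when the match is being freed */
};

/* 'test': a dying inode never matches, but we remember that we saw it. */
static int toy_test(struct toy_inode *inode, void *data)
{
    struct toy_key *k = data;

    if (inode->ino != k->ino)
        return 0;
    if (inode->will_free) {
        k->saw_dying = true;
        return 0;
    }
    return 1;
}

/* 'set': refuse to create a fresh inode if 'test' saw a dying one. */
static int toy_set(struct toy_inode *inode, void *data)
{
    struct toy_key *k = data;

    if (k->saw_dying)
        return -EBUSY;
    inode->ino = k->ino;
    inode->will_free = false;
    return 0;
}

/* Toy lookup: search the table, else try to set up 'fresh'; NULL on failure. */
static struct toy_inode *toy_iget(struct toy_inode *table, size_t n,
                                  struct toy_inode *fresh, struct toy_key *k)
{
    for (size_t i = 0; i < n; i++)
        if (toy_test(&table[i], k))
            return &table[i];
    if (toy_set(fresh, k) != 0)
        return NULL;
    return fresh;
}
```

   The cleaner then gets a failure straight away instead of blocking on
   an inode that is on its way out.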
5847
5848
5849  - block.c:504 DONE (I think).
5850     unlink/delete_commit dirties a block without credits
5851     It could have been just cleaned..
5852     It looks like it was in Writeback for the cleaner when
5853     unlink pinned and allocated it....
5854     or maybe it was on a cluster (due to writepage) when
5855     it was pinned.  Then cluster_flush cleared dirty ... but
5856     it should still have a Credit.
5857     Maybe I should iolock the block ??
5858
5859     On reflection it wasn't cleaning, just tiny clusters
5860     of recent changes which were originally written as tiny
5861     checkpoints. Maybe lots of directory updates triggered the clusters.
5862     I guess writepage is being called to sync the directory???
5863     Or maybe the checkpoint was pushed by s_dirt being set.
5864
5865     So use PinPending and iolock to protect dir blocks from writepage.
5866
5867  - dir.c:1266 DONE
5868     dir handle orphan finds a block (74/0) which is not
5869     valid.
5870     This can happen if orphan_release failed to reserve a block.
5871     We need to retry the release.
5872  - inode.c:615
5873     index block and some data blocks still accounted to deleted file.
5874
5875     No theory on this yet.  Always one index block and a small number
5876     of data blocks.  Maybe the index block looked dirty, but was then
5877     incorporated with something that was missed from the children list...
5878     Or maybe I_Trunc is cleared a bit early...
5879     Or trunc_next advanced too far?? or too soon
5880     ??
5881
5882  - segments.c:640 DONE
5883      prealloc in the cleaner finds all 2315 free blocks allocated.
5884      no clean reserved.
5885     Need to be able to fail CleanSpace requests when cleaner_reserve
5886     is all gone.??
5887
5888     or just slow down the cleaner to one segment per checkpoint when
5889     we are tight..  Hope that works.
5890  - super.c:699
5891      async flag on 16/0 keeping block pinned
5892    Maybe clear Async flag during checkpoint.  Cleaner won't need it
5893    No, just ensure to clear Async on all successful async calls.
5894
5895      orphan file 8/0 has orphan reference keeping parent pinned
5896       [cfb64c90]8/0(1782)r1E:Valid,SegRef,PhysValid orphan(1)
5897    Orphan handling is failing to get a reservation to write out the
5898    orphan file block?  Not convincing as there should be lots of space
5899    at unmount, and 'orphan sleeping' has become empty.
5900
5901  - Show State
5902      orphan inode blocked by leaf index stuck in writeback:
5903    [cfb68460]331/0(NoPhysAddr)r2F:Index(1),Pinned,Phase1,Valid,Dirty,SegRef,CI,CN,CNI,UninCredit,EmptyIndex{0,0}[0] primary(1) leaf(1) Leaf1(5)
5904    [cfb28d20]331/336(NoPhysAddr)r2F:Index(1),Pinned,Phase1,Valid,Dirty,Writeback,Async,UninCredit,PrimaryRef{0,0}[0] async(1) cluster(1) wc[0][0]
5905
5906     This is in the write-cluster waiting to be flushed
5907
5908
5909 9July2010
5910   Review B_Async.
5911     If a thread wants async something, it
5912          - sets B_Async
5913          - checks if it can have what it wants.
5914            + if not, fail
5915            + if so, clear B_Async and succeed
5916
5917     If a thread releases something that might be requested Async,
5918          it doesn't clear Async, but wakes up *the*thread*.
5919
5920     This applies to
5921         IOLock      - iolock_block
5922         Writeback   - writeback_done, iolock_written
5923         Valid        - erase_dblock, wait_block
5924         inode I_*   - iget / drop_inode
5925
5926      orphan handler, cleaner, segscan - all in the cleaner thread.
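
   The B_Async convention above, shown for the IOLock case as a
   single-threaded toy.  The struct and both helpers are hypothetical,
   and plain bools replace the atomic bit operations the real flags would
   need.

```c
#include <stdbool.h>

struct toy_block {
    bool async;     /* stands in for B_Async */
    bool iolocked;  /* stands in for B_IOLock */
};

/*
 * Async request for the iolock: never waits.
 * On failure B_Async stays set, so whoever releases the lock knows to
 * wake the requesting thread.
 */
static bool toy_iolock_async(struct toy_block *b)
{
    b->async = true;                /* set B_Async first */
    if (b->iolocked)
        return false;               /* cannot have it: fail */
    b->iolocked = true;
    b->async = false;               /* got it: clear B_Async and succeed */
    return true;
}

/*
 * Release: does not clear Async.
 * Returns true if the (single) async thread should be woken.
 */
static bool toy_iounlock(struct toy_block *b)
{
    b->iolocked = false;
    return b->async;
}
```

   Because only the requester ever clears Async, a wakeup is never lost
   between the failed attempt and the retry.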
5927
5928   107 runs,
5929    2 hit 'Show State' with a blocked orphan inode.
5930     Two children, one EmptyIndex, one PrimaryRef, Async,Writeback
5931     Both NoPhysAddr
5932
5933    Several runs blocked in cluster_flush or waiting for writeback.
5934
5935    - first case: looks like cluster flush should run but doesn't.
5936         cluster_flush runs:
5937            checkpoint, cleaner, cluster_allocate when full, update,
5938            writepage, sync_page
5939         So we have no timeout or other flush.
5940       I guess if we are waiting for writeback, we need to trigger a
5941       cluster_flush.
5942
5943    - other case - cluster_flush was called but is waiting for pending count
5944        to go down.
5945        Looks like cluster_reset shouldn't be changing pending_next
5946
5947    New hang.  Orphans not being processed:
5948         inode, because InoIdx is on leaf and checkpoint isn't pushing
5949         it along.
5950         dir block 0 is Dirty leaf
5951
5952      Maybe we failed to get a mutex, and mutex_unlock doesn't wake us.
5953
5954 10July2010
5955   Over night it looks *very* good.
5956   Have one infinite loop with 31770 repeats of
5957   ORPH: [cfbe0000]0/328(2326)r4E:Valid,Dirty,Async,UninCredit,Claimed,PhysValid,
5958                    Orphan(0) async(1) delete_inode(1) orphan_list(1) drop(1)
5959
5960   So either stuck in truncate_inode_pages, lafs_add_orphan, or inode_map_free
5961     lafs_add_orphan too short.
5962     tracing shows after truncate_inode_pages.
5963     must be blocked in inode_map_free - maybe use AccountSpace??
5964    But why isn't the truncate progressing?
5965    Probably same reason:  No ReleaseSpace available.
5966    Maybe we aren't cleaning because there is a free segment, and
5967    we aren't checkpointing because there aren't enough yet...
5968
5969    Probably the cleaner has halted while CleanerBlocks - fix that.
5970
5971   - 0/74 is a stuck orphan because 74/0 is a dirty leaf going nowhere..
5972         Need a checkpoint to release the orphan?
5973    ditto for 0/331 - 331/0
5974     XX/0 is InoIdx
5975
5976 VFS: Busy inodes after unmount of hdb. Self-destruct in 5 seconds.  Have a nice day...
5978 This was pinned: [ce5914f0]16/0(2)r8F:Pinned,Phase0,PinPending,Valid,C,CI,CN,CNI,UninCredit,PhysValid leaf(1) intable(6) release(1)
5980  [ce5914f0]16/0(2)r8F:Pinned,Phase0,PinPending,Valid,C,CI,CN,CNI,UninCredit,PhysValid leaf(1) intable(6) release(1) Leaf0(0)
5982 ------------[ cut here ]------------
5983 kernel BUG at /home/neilb/work/nfsbrick/fs/module/super.c:698!
5984
5985 Forgetting 0 0
5986 724 != 7  (st->free.cnt afte segdelete, close_segment, close_all)
5987 ------------[ cut here ]------------
5988 WARNING: at /home/neilb/work/nfsbrick/fs/module/segments.c:844 lafs_check_seg_cn
5989
5990 we called segdelete on something that was on the freelist.
5991 This happens when the final cluster starts a new segment.
5992 Need to improve the fix though.
5993
5994
5995  lafs_inode_handle_orphan can make progress without leaving
5996  anything async.  Maybe we need a return status:
5997   -EAGAIN - try after async
5998   -ENOMEM - try some time soon - hope memory will be better
5999   0 we called orphan_release
6000   anything else loops.
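
   That return convention written out as a mapping the cleaner thread
   could switch on.  The enum and orphan_next_step() are hypothetical
   names; only the four cases come from the list above.

```c
#include <errno.h>

enum orphan_next {
    ORPHAN_WAIT_ASYNC,  /* -EAGAIN: try again after the async event */
    ORPHAN_RETRY_SOON,  /* -ENOMEM: try some time soon */
    ORPHAN_DONE,        /* 0: orphan_release was called */
    ORPHAN_LOOP         /* anything else: progress was made, call again */
};

/* Translate the handler's return status into what the caller does next. */
static enum orphan_next orphan_next_step(int status)
{
    switch (status) {
    case -EAGAIN:
        return ORPHAN_WAIT_ASYNC;
    case -ENOMEM:
        return ORPHAN_RETRY_SOON;
    case 0:
        return ORPHAN_DONE;
    default:
        return ORPHAN_LOOP;
    }
}
```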
6001
6002
6003  - if we allocate a segment in the last checkpoint we don't
6004    take references properly.
6005
6006  - orphan handle spinning on:
6007
6008   ORPH: [ce545f08]0/290(1663)r4E:Valid,Dirty,Async,UninCredit,Claimed,PhysValid,Orphan(0) async(1) delete_inode(1) orphan_list(1) drop(1)
6009    26402 calls.
6010    stuck in delete_inode?? ?
6011
6012
6013   never-ending cleaning? Maybe just computer slow ??
6014
6015 11July2010  - on plane to Prague.
6016   How can we safely access ->iblock?
6017    normally iolock, but how do we get iolock?
6018    - flush data to inode
6019    - cluster flush takes private_lock
6020    - private_lock is used to set to null.
6021   I guess we use private_lock to get a reference
6022   then iolock and revalidate
6023   but I can probably test for NULL at any time? though that can change under private_lock
6024   If we own a reference to a child with a parent, then we can use
6025    rcu_dereference to get a ref which might change
6026
6027 12july2010
6028
6029  ->write_inode is called by write_inode() called by __sync_single_inode
6030   to handle I_DIRTY_SYNC|I_DIRTY_DATASYNC after do_writepages
6031  Do we care?
6032
6033  change to addresses we already handle with checkpoints
6034  change due to setattr we can handle directly if we want
6035  that just leaves mtime/ctime and atime.
6036    mtime/ctime calls ->dirty_inode
6037    as does atime
6038
6039  So:
6040   setattr changes set I_Dirty so that when cluster_allocate
6041   happens all the changes get saved.
6042
6043   when dirty_inode is called, we set I_Dirty but don't dirty
6044   the inode block.
6045   If anything happened to justify an inode write, it will
6046   be dirty anyway.  If it isn't, this is just atime
6047
6048   So on dirty_inode we check if atime has changed and if so
6049   we schedule change to atime file
6050
6051   sync_inode should write an update for the inode if I_Dirty
6052   but sync_filesystems should not
6053
6054   Simple.  fsync calls ->fsync.  We get that to write an
6055   inode update, but nothing else does.
6056
6057   Possibly all directory updates could be chained onto a
6058   directory and only written when fsync is requested before
6059   a checkpoint.
6060   both sides of a rename ??
6061   leave that for later.
6062
6063 WritePhase - what is that all about?
6064   We must not change a block while it is being written to previous
6065     phase, else we corrupt causality.
6066   But we probably don't want to change it any way as that would
6067   mess up any checksum or duplication.
6068
6069  So we want to ignore WritePhase - scrap it.
6070  Before changing a block, we must iolock_written
6071   - all dir updates
6072   - inode update in fsync
6073   - orphan file
6074   - segusage?
6075   - quotas?
6076
6077  But what about regular data.  If prepare_write finds a block in
6078  writeback, do I need to wait, or can I just mark it dirty in
6079  commit_write?  If no checksum and no duplication applies, this should
6080  be fine.
6081
6082 16July2010
6083  BUT e.g. dir operations are in particular phases.  If the dirblock
6084  is pinned to the old phase, we need to flush it, then wait for io
6085  to complete.  So we need lafs_phase_wait as well as iolock_written.
6086  This is already done by pin_dblock.
6087  I wonder if we need a way to accelerate pinned blocks that are being
6088  waited for - probably not, they should be done early.
6089
6090  So we probably want to iolock after phase_wait in pin_dblock.
6091  Though dir.c pins early.
6092  I need to review all of this and get it right.
6093
6094  So:
6095   - we aren't allowed to block much holding  checkpoint_lock as
6096     checkpoint_start waits for that.  However phase_wait will only
6097     block if a new checkpoint has started already, so there is no
6098     chance of phase_wait ever blocking checkpoint_start.
6099     So it is safe to call phase_wait in checkpoint_lock.
6100     phase_wait will wait until block is written, added back to
6101     the lru clean, then found and flipped... I wonder if that is
6102     good - it keeps parent from being a leaf, and so written, until
6103     child write has completed.
6104     We want to phase-flip a block as soon as it is allocated by cluster_flush.
6105
6106     With directory blocks, i_mutex stops other changes, so an early iolock_written
6107     will leave the block clean and phase won't be an issue.
6108
6109     With inode-map blocks.. we:
6110       set B_Pinned to ensure no-one writes except for phase change
6111         do that after lock_written so it starts safe.
6112       once we have checkpointlock, wait for phase if needed.
6113       then lock_written again which should be instant but ensures
6114       that block is locked while we change it...
6115
6116   I think I want
6117     - refile to call phase flip if index is not dirty and is in wrong phase
6118        and has no pinned children in that phase.
6119     - Only clear PinPending if we have i_mutex or refcnt == 0
6120     - before transaction:
6121           lock_written / set PinPending / unlock
6122       then inside cluster_lock
6123           lock_written pin / change / dirty / unlock
6124       it will only wait for writeout if phase changed.
6125       so don't need phase_wait
6126      but want pre-pin then pindblock
6127      Transactions are:
6128         dir create/delete/update - DONE
6129         inode allocate/deallocate - on inode map DONE
6130         setattr  DONE
6131         orphan set/change/discard
6132
6133      Orphans are a little different as when we compact the
6134      file, the orphan file block 'owned' by the orphan block
6135      can change.  As long as we keep them all PinPending it
6136      should be fine though.
6137      I think that every block in the orphan file will always be
6138      PinPending ???
6139
6140     OK - done most of that.
6141     Early phase_flip is awkward.  We need an iolock to phase_flip,
6142     and we don't have one.  The phase_flip could cause incorporation
6143     which cannot happen until the write completes.  So I guess
6144     we leave it as it is.
6145
6146
6147    FIXME what about inode data block - cluster_allocate is removing
6148     PinPending after making them dirty from the index block..
6149
6150   If all free inode numbers are B_Claimed, I don't think we allocate
6151   a new block... yes we do, as 'restarted' is local to caller.
6152
6153  Also
6154   each device has a number of flags
6155    - new metadata can go here
6156    - new data can go here
6157    - clean data can go here
6158    - clean metadata can go here
6159    - non-logged segments allowed
6160    - priority clean - any segment can be cleaned
6161    - dev is shared and read-only - no state-block updates
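
The per-device flags listed above could be kept as a simple bitmask.  A
minimal sketch in C follows.  Every name in it (dev_flag, DEV_*,
dev_accepts) is hypothetical and not taken from the lafs source, and it
assumes that a shared read-only device accepts no writes at all.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag names, one bit per item in the list above. */
enum dev_flag {
	DEV_NEW_METADATA   = 1u << 0, /* new metadata can go here */
	DEV_NEW_DATA       = 1u << 1, /* new data can go here */
	DEV_CLEAN_DATA     = 1u << 2, /* clean data can go here */
	DEV_CLEAN_METADATA = 1u << 3, /* clean metadata can go here */
	DEV_NON_LOGGED     = 1u << 4, /* non-logged segments allowed */
	DEV_PRIORITY_CLEAN = 1u << 5, /* any segment can be cleaned */
	DEV_SHARED_RO      = 1u << 6, /* shared, read-only, no state-block updates */
};

/* Can a write of the given kind (one of the first four flags) be
 * directed at a device with these flags?  Assumption: a shared
 * read-only device never accepts writes. */
static bool dev_accepts(uint32_t dev_flags, uint32_t kind)
{
	if (dev_flags & DEV_SHARED_RO)
		return false;
	return (dev_flags & kind) == kind;
}
```

For example, dev_accepts(DEV_NEW_DATA | DEV_CLEAN_DATA, DEV_NEW_DATA) is
true, while the same query against a device that also has DEV_SHARED_RO
set is false.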
6162
6163   state block needs a uuid for an ro-filesystem that this is
6164   layered on.
6165
6166   Is metadata an issue?
6167     We might want it on a faster device, but ditto for directories
6168     and for some data.  So probably skip that.
6169
6170   Have separate segment tables for:
6171     - can have new data
6172     - can have clean data but not new. (this often empty)
6173
6174   Clean data can go to new-not-clean if nothing else
6175   new data can go to clean-not-new ?? if not sync??
6176   Maybe call them 'prefer clean' and 'prefer new'
6177
6178   I think we want:
6179     'no sync new' - don't write new data, unless it is in big chunks and
6180            can wait for checkpoint to be 'synced'
6181     'no write' - never write anything - this is readonly.
6182                used for removing a device from the fs.
6183
6184   A 'no sync new' device can have single-block segments.
6185   This doesn't allow compression, but avoids any need to clean
6186   In this case we don't store youth and the segusage is 32 bits per segment.
6187   That means  - for 1K block size - 0.5% of devices used for segusage.  That
6188   feels high.  For 4K, 1/1024 so a giga per terabyte.
6189   Then limited to 29 snapshots plus base fs, and 2 bits to record bad blocks.
6190
6191   Other segusage for 29 snaps is 1/million of space used.
6192   So we 'waste' 0.1% of device for no secondary cleaning.
6193   Can still do defrag though.
6194
6195   clearing a snapshot on a 1TB device writes 1GB of data!! potentially.
6196   as does creating a snapshot.
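
The segusage arithmetic above can be checked with a few lines of C.  This
is only a sketch: segusage_bytes is a made-up helper, and it assumes
single-block segments with 32 bits per segment (30 usage bits for 29
snapshots plus the base fs, and 2 bits for bad blocks).

```c
#include <stdint.h>

/* Bytes of segusage needed for a device of dev_bytes, assuming
 * single-block segments and 32 bits (4 bytes) recorded per segment. */
static uint64_t segusage_bytes(uint64_t dev_bytes, uint32_t block_size)
{
	uint64_t segments = dev_bytes / block_size;

	return segments * 4;
}
```

For a 1TB (2^40 byte) device this gives 1GB with 4K blocks, i.e. 1/1024
of the device, and 4GB with 1K blocks, i.e. 1/256 or about 0.4%, which
the note above rounds to 0.5%.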
6197
6198 18jul2010
6199  If lafs were cluster enabled we would want multiple checkpoint clusters,
6200  one for each node. When a node crashes some node would need to find and
6201  roll-forward.  For single node failure, it is enough to broadcast cluster
6202  address to all others.  For whole-cluster failure, need to either list all
6203  in superblock or link from main write cluster.
6204
6205  When writing to multiple devices we may want multiple write clusters
6206  active for new data.  These all need to be findable from checkpoint cluster
6207  so linking sounds good.
6208  Having a single 'fork' link in cluster head might work but doesn't scale to large
6209  cluster.  It doesn't need to be committed to other nor does checkpoint end, so
6210  that should be ok.
6211  Could have a special group_head to list other clusters for roll forward.
6212  If we put fsnum first, a large value - 0xffffffff - could easily mean
6213  something else
6214
6215  Or every  cluster head could point to an alternate stream, and if we want many
6216  quickly, each simply points to another, so we create a chain across all writers.
6217
6218
6219  Another issue...
6220   When we 'sync' we don't wait for blocks until after the checkpoint is started,
6221   and we know that will be driven through to CheckpointEnd which will commit and
6222   release everything.
6223   However 'fsync' doesn't have the same guarantee.  The sync_page call will ensure
6224   the data has been written, but we don't know it is safe until the next
6225   header is written.  So we need to push out the next cluster promptly.
6226
6227   So if sync_page is called on a page in writeback, then we mark the cluster as
6228   synchronous.  When a sync cluster completes, the next (or even next+1) clusters
6229   are flushed out promptly.  Hopefully they won't be empty on a reasonably busy system,
6230   but it is OK if they are.
6231
6232   If a block is writeback for the cleaner.. then as the cluster is VerifyNone, as soon
6233   as the write completes the block will be released.
6234
6235   So: to clarify sync_page:
6236     This can be called when page is in writeback or locked.
6237     If locked there is nothing we can do except maybe unplug the read queue.
6238     If page is in writeback and block is dirty, then it is probably in
6239     a cluster queue and we should flush the cluster and the next.
6240     If page is in writeback and block is not dirty, but is writeback,
6241     just flush one cluster.
6242     But we don't want these cluster flushes to start while the previous is
6243     still outstanding else we stop new requests from being added.
6244     So as soon as the cluster can be flushed we flush, but no sooner.
6245     I guess we use FlushNeeded and make that be less hasty.
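
The sync_page cases above reduce to a small decision function.  The
sketch below is userspace C with hypothetical names (sync_state,
clusters_to_flush); it only encodes how many clusters to push, not the
"no sooner than the previous flush completes" timing handled through
FlushNeeded.

```c
#include <stdbool.h>

/* Hypothetical summary of the state that sync_page sees. */
struct sync_state {
	bool page_locked;     /* page is locked, e.g. a read is in progress */
	bool page_writeback;  /* page is in writeback */
	bool block_dirty;     /* block still dirty: probably in a cluster queue */
	bool block_writeback; /* block has already been submitted */
};

/* Number of clusters that should be flushed promptly: 0, 1 or 2. */
static int clusters_to_flush(const struct sync_state *s)
{
	if (!s->page_writeback)
		return 0; /* locked or idle: at most unplug the read queue */
	if (s->block_dirty)
		return 2; /* flush the cluster holding the block, and the next */
	if (s->block_writeback)
		return 1; /* data is on its way; one more header makes it safe */
	return 0;
}
```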
6246
6247 19June2010
6248
6249   superblocks....
6250    We currently have a superblock for each device.
6251    I cannot see a good reason for that.
6252    We can just bdev_claim for 'this' filesystem.
6253    Rather we should have a number of anon superblocks,
6254     one for each fileset, then one for each snapshot.
6255    Do we use different fs types? probably yes
6256        lafs - main filesystem made from devices
6257        lafs_subset - subordinate fileset, given a path to  fileset object
6258                  can have 'create' option when given an empty directory.
6259        lafs_snap - snapshot - given a path to filesys and textname.
6260
6261     Cannot create a snap of a subset, only of the whole filesystem
6262     Is it OK to mount either snap of subset or subset of snap?
6263     It probably does, so need to use the same filesystem type for both.
6264     Maybe lafs_sub or sublafs. Needs path to directory.
6265     can be given 'snap=foo'.
6266     No: a given filesystem may not exist in a snapshot.  You need to
6267     mount the snapshot first, then the subset of the snapshot.
6268     So we have three types as above.  All subsets as 'lafs_subset',
6269     whether they are subset of main or of snapshot.
6270
6271     Should we be able to create a snapshot or subset without mounting it?
6272     It doesn't really seem necessary but might be elegant..
6273
6274     remount doesn't seem the right way to edit a filesystem as it forces
6275      some cache flushing.
6276     What do we want to edit?
6277           - add device,  remove device
6278           - add/remove snapshot by name
6279           - add/remove subset?  Not needed, just mkdir/rmdir and mount to convert
6280                      empty dir to subset.
6281           - change cleaner settings??
6282     Could have remount as an option. If problem find other option.
6283
6284     While cleaning (which is always) we potentially need all superblocks
6285     available as we might need to load blocks in those filesystems to
6286     relocate them.
6287     Unfortunately each super needs to be in a global list so there is a cost
6288     in having them appear and disappear. I guess that is not a big deal.  They
6289     are refcounted and will disappear cleanly when the count hits zero.
6290
6291     So:
6292      DONE - change all prime_sb->blocksize refs to fs->blocksize
6293      DONE - create an anon sb for the main filesystem
6294      DONE - discard the device sbs, just bd_claim the devices and add to list
6295      - use lafs_subset for creating/mounting subsets.
6296
6297   Changed s_fs_info to point to the TypeInodeFile for the super, but
6298    for root/snapshot that doesn't exist early enough to differentiate the
6299    super in sget.
6300    So we make an inode before the super exists and attach it after.
6301    Need to do all that get_new_inode does.
6302         inode_stat.nr_inodes++   - just don't generic_forget the inode
6303         add to inode_in_use -   seems pointless - just set i_list to something
6304         add to sb->s_inodes - if we don't it won't flush - maybe that is good?
6305         add to hash - don't want
6306         i_state == lock|new - only really needed if hashed.
6307     but there is lots of initialisation in alloc_inode that we cannot access!!
6308
6309    Problem is that we need s_fs_info to uniquely identify the fs with something
6310    that can be set in the spinlock, so allocating an inode is out.
6311    And also to get to the filesystem metadata which is in the inode.
6312    I guess we allocate a little something that stores identifier and later inode.
6313      for lafs  we use uuid
6314      for subset we use just the inode
6315      for snapshot we use fs and number
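
The "little something" hung off s_fs_info could look roughly like the
sketch below: an identifier that can be compared under the spinlock in
sget's test callback, plus a slot for the inode that is attached later.
This is plain userspace C for illustration only; sb_key, sb_kind and
sb_key_match are hypothetical names, and void pointers stand in for the
kernel's inode and fs structures.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

enum sb_kind { SB_LAFS, SB_SUBSET, SB_SNAPSHOT };

struct sb_key {
	enum sb_kind kind;
	union {
		uint8_t uuid[16];     /* lafs: identified by uuid */
		const void *inode;    /* subset: identified by the fileset inode */
		struct {
			const void *fs;   /* snapshot: owning fs ... */
			int num;          /* ... and snapshot number */
		} snap;
	} id;
	void *root_inode;         /* filled in once an inode can be allocated */
};

/* Comparison suitable for a test callback: no allocation, no sleeping. */
static bool sb_key_match(const struct sb_key *a, const struct sb_key *b)
{
	if (a->kind != b->kind)
		return false;
	switch (a->kind) {
	case SB_LAFS:
		return memcmp(a->id.uuid, b->id.uuid, sizeof(a->id.uuid)) == 0;
	case SB_SUBSET:
		return a->id.inode == b->id.inode;
	case SB_SNAPSHOT:
		return a->id.snap.fs == b->id.snap.fs &&
		       a->id.snap.num == b->id.snap.num;
	}
	return false;
}
```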
6316
6317
6318 25July2010
6319   superblocks:
6320    - sget gives us an active super_block.  We need to attach to a vfsmnt
6321      using simple_set_mnt, or call deactivate_locked_super.
6322    - sget's set should call set_anon_super
6323    - kill_sb (called by deactivate_super) should then call kill_anon_super
6324
6325   If we have a vfsmnt, we have an active reference, so we can atomic_inc
6326   s_active safely.  So use this to allow snapshots and subsets to hold a
6327   ref on the prime_sb and thence on the 'fs'.
6328
6329 26July2010
6330  - DONE  need to set MS_ACTIVE somewhere!!
6331  - FIXME if an inode is being dropped when iget comes in, it gets confused
6332     and the inode appears to be deleted.
6333
6334    We cannot really break the dblock <-> inode link until after write_inode_now,
6335    but there is no call-back before generic_detach_inode is complete.
6336    The last is write_inode which is only called if I_DIRTY_something.
6337    Maybe when writeback completes on an inode dblock, we should check if
6338    the inode is I_WILL_FREE and if so, we break the link...
6339
6340    Or maybe when we find my_inode set we can check the block and if it isn't
6341    dirty or being deleted we break the link directly... That makes more sense.
6342
6343    So... what is the deal with freeing inodes???
6344      ->iblock is like a hashtable reference.  It is not refcounted
6345              It gets set under private_lock
6346       iblock is freed by memory pressure or lafs_release_index from
6347              destroy_inode
6348      when refcount of iblock is non-zero, ->dblock ref is counted,
6349      else it is not.
6350      dblock is set to NULL if I_Destroyed, or when dblock is discarded,
6351        (under lafs_hash_lock)
6352        and set to 'b' in lafs_iget and lafs_inode_dblock
6353
6354      We can drop the dblock link as soon as iblock has no reference
6355
6356     probably get clear_inode to break the link if possible, which it should
6357     be on 'forget_inode'.  Then lafs_iget can wait on the bit_waitqueue.
6358     or maybe do clear_inode itself
6359
6360    FIXME when we drop dblock we must clear iblock! as getiref iblock assumes
6361       dblock is not NULL.
6362
6363 28July2010
6364   So: ->dblock and ->my_inode need to be clarified.
6365
6366   Neither is a counted reference - the idea is that either can be freed and
6367   will destroy the pointer at the time so if the pointer is there, the
6368   object must be ... but we need locking for that.
6369   ->dblock is reasonably protected by private_lock, though if ->iblock exists
6370   we hold a ref of ->dblock so we can access it more safely.
6371
6372   Need to check getiref_locked knows ->dblock exists when called on iblock
6373   and lafs_inode_fillblock
6374    yes, both safe!
6375
6376  But ->my_inode needs locking too so the inode can safely disappear without
6377  having to wait for the data block to go.  After all, data blocks come in sets,
6378  and one shouldn't keep others with inodes.
6379  So something light-weight like rcu might work.
6380  We use call_rcu to free the inode and rcu_read_lock to access ->my_inode
6381
6382  Yes, that will work.  Occasionally we will want an igrab too, but not
6383  often.
6384  Should look into rcu for index hash table and ->iblock as well.
6385  Current ->iblock is only cleared when the block is freed .. I guess that is fine...
6386
6387
6388 31Jul2010
6389   rcu protection of ->my_inode
6390   A/ orphan inodes - are they protected?
6391   B/ orphan blocks - are the inodes of those protected? Probably...
6392
6393   inodes are 'orphan' for two reasons
6394     1/ a truncate is in progress
6395     2/ there are no remaining links, so inode should be truncated/deleted
6396        on restart.
6397
6398   The second precludes us from holding a refcount on any orphan inode,
6399   else it would never get deleted.
6400   So we must assert that an inode with I_Deleting or I_Trunc has an implied
6401   reference and so delete must be delayed... not quite.
6402   If we set I_Trunc but not I_Deleting, then we igrab the inode until
6403   I_Trunc is cleared.  While we hold the igrab, I_Deleting cannot possibly
6404   be set as that is set when last ref is dropped.
6405
6406 01Aug2010
6407   FIXME lafs_pin_dblock in lafs_dir_handle_orphan needed to be ASYNC.
6408     .. and in lafs_orphan_release
6409   Well... only iolock_written can be a problem, and our rules require that
6410   only phase-change writeout can set writeback.  So the cleaner can never
6411   wait for writeout here.  Maybe it can wait for a lock, and maybe we don't
6412   really need a lock, just 'wait_writeback'.
6413 08Aug2010
6414   So cleaner is in run_orphans, dir_handle_orphan pin_dblock iolock_written
6415    It is writeback waiting on 74/BIGNUM from file.c:329.  So writepage
6416    tried to write a block in a directory .. but it is PinPending so that
6417    must have been set after writepage got it...
6418    lafs_dir_handle_orphan gets an async lock, then sets PinPending.
6419    If write_page is before that, it will have the lock and dir_handle will try later.
6420    If write_page is after it will block on the lock, or see PinPending and
6421    release the lock.
6422    So someone else must be clearing PinPending!
6423      - checkpoint clears and re-sets under the lock, so that is safe
6424      - dir.c clears under i_mutex
6425          dir_handle_orphans always hold i_mutex ... or does it.
6426      - refile drops when the last non-lru reference goes.
6427      - inode_map_new_abort clears for inode
6428    No, not that - just bad test on result of iolock_written_async ;-(
6429
6430   Now have an interesting deadlock.
6431     rm in lafs_delete_inode in inode_map_free is waiting for the block to
6432     flush which requires the cleaner.
6433     The cleaner thread in inode_handle_orphan is calling erase_dblock
6434      on the same inode which blocks while inode_map_free has it locked....
6435      no, not same block - just waiting for writeout which requires cleaner.
6436      lafs_erase_dblock from inode_map_free must be async!
6437    pin_dblock in lafs_orphan_release must too.... no - only the setting of
6438    PinPending needs to be async or outside of cleaner, which it is.
6439
6440   Ok, got that fixed.  All seems happy again, time for a commit.
6441
6442
6443 09Aug2010
6444    14b/  What backing-dev to show the filesystem.
6445      backing-dev holds:
6446          congested state
6447          unplug function
6448          read-ahead info
6449          throughput measurements
6450
6451     Much of that is for generic code to use.  We need to:
6452      - provide an unplug function that unplugs all devices
6453      - provide a congested function which checks all devices,
6454        or for 'write' - at least the device we are writing to.
6455
6456     How do we set the backing device?
6457     The 'struct address_space' points to one, as does struct super_block.
6458     set_anon_super establishes a null bdi, set_bdev_super gets it from the
6459     bdev->queue
6460
6461     We need to bdi_init and bdi_register (if no error) our bdi.
6462     bdi_destroy calls unregister and reverses bdi_init
6463     or just bdi_setup_and_register
6464     but bdi_register_dev gives a better name - isn't this sick!!!
6465
6466     Partly done ... but I'm hitting more bugs :-(
6467
6468   -Checkpoint cannot complete because...
6469    Lots of dirty inodes that are orphans are not pinned!! I
6470    guess the InoIdx is ??
6471    Most of them don't have InoIdx(?)  Only '8' does.
6472    8/0 is also an orphan and is on wc[0]
6473
6474    It seems that this block keeps getting re-written and stays in
6475    Phase0.
6476    Is that because it is a data block with PinPending.. No, that works
6477    as long as it becomes un-dirty: we drop pinpending, refile, and set again
6478
6479    It is being dirtied again during writeout for the checkpoint
6480    so it doesn't get to change phase when we lift PinPending.
6481    I guess we mustn't dirty it if it is in the old phase.
6482
6483   -And twice inode 17 is deleted without B_Orphan being set!
6484    That is the only file that exists before we mount.
6485      Problem was orphan_release instead of orphan_forget
6486      I wonder why it only affected 17...
6487
6488   -at shutdown we drop an inode and try to invalidate pages, but
6489    root inode is still dirty - I wonder why.
6490      The dblock is in a different phase to the iblock.
6491      In checkpoint we wait until root iblock changes phase, but
6492      not root dblock!
6493
6494
6495   UP TO:
6496     I'm testing subordinate filesystems, which don't work yet.
6497     I need to create the root directory and inode map.
6498     Obviously I cannot record the inode map file in the inode map....
6499       inode_map should ignore everything less than 16? 8? 2?
6500     Need to make sure creating with a given inode number works.
6501     Need to make sure auto-allocate inum is never less than 16.
6502
6503 11Aug2010
6504  How to map from filesys inode to superblock?
6505   Need in
6506     lafs_iget_fs
6507     choose_free_inum - to get inode-1
6508     ditto in inode_map_free
6509     lafs_put_super has something odd with i_sb
6510
6511   Could do an sget search..
6512   Or could just store it in the inode (but not in i_sb!!)
6513   inode already a bit large though.
6514   Do it for now, but make a note to trim the fs_md part of inode
6515   into a separate allocation.
6516
6517   lafs_new_inode should take an 'sb' not a 'filesys'.
6518   In fact, get rid of filesys.  It is
6519     MAP(i->i_sb->s_fs_info)->root.
6520
6521  15f - timestamps for roll-forward.
6522     The writeout can be much later, but logging the mtime is fairly
6523     boring ... we could log mtime in the group head, which might be cheap
6524     enough.  How much precision is needed, and against what base?
6525     probably mtime of last checkpoint from superblock.  That should
6526     be not more than 2048 seconds ago, so 16 bits gets us about 30msec...
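
To make the precision estimate concrete: 2048 seconds spread over 2^16
values is 1/32 second, i.e. 31.25msec per tick.  A sketch of such an
encoding follows, in userspace C.  The names (MTIME_TICK_NS,
mtime_encode, mtime_decode) and the choice to saturate rather than wrap
are assumptions, not anything lafs defines.

```c
#include <stdint.h>

/* 2048 s / 65536 = 1/32 s = 31,250,000 ns per tick. */
#define MTIME_TICK_NS 31250000ULL

/* Encode mtime as 16-bit ticks since the last checkpoint.
 * Values before the checkpoint encode as 0; values 2048 s or more
 * after it saturate at 0xffff. */
static uint16_t mtime_encode(uint64_t checkpoint_ns, uint64_t mtime_ns)
{
	uint64_t ticks;

	if (mtime_ns <= checkpoint_ns)
		return 0;
	ticks = (mtime_ns - checkpoint_ns) / MTIME_TICK_NS;
	return ticks > 0xffff ? 0xffff : (uint16_t)ticks;
}

/* Recover an approximate mtime (rounded down to a tick boundary). */
static uint64_t mtime_decode(uint64_t checkpoint_ns, uint16_t ticks)
{
	return checkpoint_ns + (uint64_t)ticks * MTIME_TICK_NS;
}
```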
6527
6528 14Aug2010
6529  15l - decay youth info.
6530     Need to decay:
6531          youth_next and checkpoint_youth in 'struct fs'
6532          all blocks in youth files on storage
6533          all scores in seg-tracker.
6534            - not needed, they'll get updated in normal progress
6535              and being wrong for a while is no cost.
6536     ensure correct youth is stored in lafs_free_get
6537     check little-endian conversion of all youth accesses
6538
6539     checkpoint_youth only used by thread, so no locking needed
6540     youth_next protected by fs->lock
6541
6542  15m - share orphans and cleaning list_heads in datablock
6543    It certainly is possible to clean an orphan but it is very unlikely
6544    as it will have changed recently, or be changing soon.
6545    The cleaner could just dirty any B_Orphan it finds.
6546    But if orphan finds a block on the list, it must be careful...
6547    I guess when cleaner drops a cleaning ref, it should check if the block
6548    is an orphan, and re-queue if it is.
6549
6550  15o - async blocks just have an extra refcount.
6551    This could:
6552      - keep PinPending set
6553      - keep an index block pinned - will phase-flip
6554      - keep ->parent link
6555    but not get in the way of a checkpoint.
6556
6557    Should we clear any that we find though?
6558    Normally async is only used by cleaner, orphan processing, or segscan
6559    So it should all be finished when we do a checkpoint.
6560
6561    So if checkpoint, or release_page, finds an async block, drop it.
6562
6563  15r - further optimisations in cleaner to avoid lookups.
6564    We have fsnum,inum,blocknum and cluster seq number and trunc num.
6565
6566    I want to introduce more async though.  Currently it only loads
6567    one inode at a time.
6568    To do more, I need to mark inodes as 'done' when they are and always
6569    restart from the start of the cluster (only do one cluster at a time
6570    for now).
6571    So if we get all the way through a cluster with no 'EAGAIN' we finish
6572    with the cluster.
6573
6574  15y - when could a directory block become an orphan?
6575     - when deleting that last entry - we don't know if it can be fully
6576       deleted until we look in next block
6577     - when deleting an entry follows a chain back to the first block
6578     - when deleting the last entry in the block.
6579
6580     So it could be an orphan if the entry found:
6581         - is at end of block
6582         - is first entry
6583         - is only entry
6584      or first entry is already deleted.
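
The conditions in 15y can be written as one predicate.  The sketch below
is a simplification under a loud assumption: it treats a directory block
as a flat, ordered array of entries, which may not match the real
on-disk layout.  dir_del and dir_block_may_be_orphan are hypothetical
names.

```c
#include <stdbool.h>

/* Hypothetical description of where the entry being deleted sits. */
struct dir_del {
	int index;             /* position of the entry being deleted */
	int nr_entries;        /* entries in the block, deleted ones included */
	bool first_is_deleted; /* first entry is already marked deleted */
};

/* True if deleting this entry could leave the block needing orphan
 * handling, per the list above. */
static bool dir_block_may_be_orphan(const struct dir_del *d)
{
	bool is_first = d->index == 0;
	bool is_last = d->index == d->nr_entries - 1;

	/* "only entry" is the case where is_first and is_last both hold */
	return is_first || is_last || d->first_is_deleted;
}
```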
6585
6586 15Aug2010
6587   looking at flushing etc when run out of space.
6588   We often force a checkpoint when it won't do any good as
6589   nothing has been cleaned.
6590   In fact we write lots of dead checkpoints to 0/0 until it is full,
6591   then move on, clean 0/0 and suddenly have space.
6592   We shouldn't do that.  sync should be what pushes us forwards.
6593   Maybe that is fixed..
6594
6595   InoIdx blocks still cause confusion.  Should they ever have credits?
6596   or does only the data block have those?  Certainly they cannot have
6597   SegRef.
6598   And there is confusion in my mind whether data blocks can be pinned
6599   while the InoIdx block is - need to clarify that.
6600
6601
6602 13Sep2010 - now, where was I...
6603  - I've just been dropping the use of SegRef on InoIdx blocks, where it makes no sense.
6604  - test run: block.c:660 - no credits available while dirtying an InoIdx block during
6605    orphan handling.  lafs_reserve_block (under checkpoint lock) should have set credit.
6606    Only I just changed reserve_block to do that dblock instead - I wonder why.
6607    OK, I think I cleaned that up...
6608
6609
6610  - make_orphan is hanging in checkpoint_unlock_wait. So orphan_pin returned -EAGAIN
6611    so pin_dblock did too.  So reserve_block did too, so prealloc or summary_alloc or seg_ref_block
6612    returned error.
6613    Problem is that we don't push a checkpoint when cleaner runs out of things to do.
6614    But we don't want to go back to pushing a checkpoint too often.
6615    Maybe the problem is that we only force the checkpoint when we have enough space to do
6616    new allocations, but we need to force it earlier if nothing new can be cleaned.
6617
6618    Once we set EmergencyClean, lafs_reserve_block will stop returning EAGAIN for newspace, so
6619    we need to wake 'checkpoint_wait' then.
6620    But for ReleaseSpace we want to wake on every checkpoint... we probably do anyway.
6621    ...anyway, that is sorted now at commit  95b6b05e460
6622
6623
6624   So: InoIdx blocks.
6625     - These never get SegRef as that is meaningless - done.
6626     - These can have credits.  It possibly isn't necessary but it makes things
6627       easier.  They are 'written' by transferring the credits to the data block, or discarding them.
6628     - I think dblock and iblock can both be pinned
6629       The problem this caused was that the dblock might get processed as a leaf before iblock.
6630       We now have lafs_is_leaf which causes dblock not to be a leaf even if it is pinned, if the iblock
6631       is pinned to the same phase.
6632       lafs_phase_flip refiles the dblock so that it goes back on the leaf list as does lafs_refile when
6633       it unpins an iblock
6634       So lafs_pin_dblock doesn't need to pin the inode instead.
6635    OK, that is fixed. - commit f1c05293bfd Mon Sep 13 15:07:27 2010 +1000
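
The dblock/iblock leaf rule described above can be sketched as follows.
This is not lafs_is_leaf itself, only a userspace illustration of the one
rule discussed here; struct blk and dblock_is_leaf are hypothetical, and
the real test also has to consider pinned children of index blocks.

```c
#include <stdbool.h>
#include <stddef.h>

struct blk {
	bool pinned;
	int phase;                /* 0 or 1, meaningful only while pinned */
	const struct blk *iblock; /* InoIdx block for an inode dblock, else NULL */
};

/* An inode data block is not a leaf while its InoIdx block is pinned
 * in the same phase: the iblock has to be processed first. */
static bool dblock_is_leaf(const struct blk *db)
{
	const struct blk *ib = db->iblock;

	if (!db->pinned)
		return false;
	if (ib != NULL && ib->pinned && ib->phase == db->phase)
		return false;
	return true;
}
```

Once the iblock flips phase or is unpinned, the same dblock tests as a
leaf again, which is why phase_flip and refile need to refile it.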
6636
6637  15u - I don't need to get a segref there, but I need to have one from the original dirty block,
6638        so fix that up - commit Mon Sep 13 15:28:08 2010 +100
6639
6640  15v - What do we have?
6641        lafs_dirty_dblock:  set Dirty, clear Credit clear NCredit
6642                            set Uninc, clear Icredit clear NICredit
6643        lafs_dirty_iblock:  set dirty, clear credit
6644                            test uninc, clear ICredit, set Unincredit - not essential
6645        mark_cleaning:      test realloc, / alloc / set realloc
6646                            test dirty / clear realloc/ set credit
6647                            set uninc clear icredit
6648        cleaner_flush:      set dirty, clear realloc, clear credit
6649                            test dirty, clear realloc set credit
6650        flush_data_to_inode:
6651        lafs_cluster_allocate - there is some odd code there!!
6652        flip_phase
6653        lafs_allocated_block
6654
6655        all rather different really.
6656        Just do some tiny tidyup in lafs_cluster_allocate when dirtying dblock
6657
6658  15w/ Space used by cluster updates??
6659        It is all fine - just some confusion of function names.
6660
6661  15z/ logging symlink creation.
6662       Do I need to log the content? It needs to be safe on a dir sync, and you cannot sync the
6663       symlink itself.  So I guess we queue the block for writeout so it will go with the
6664       dir update.
6665       Yes, that works: Mon Sep 13 17:33:54 2010 +100
6666
6667  15ab/ already did that in commit f90959e6f492b6
6668
6669
6670  15ac/ How can we trigger write-out of dirty index blocks which have no pin-count, thus allowing them to
6671     be freed after the write completes?  A checkpoint could do it, but that would write out index blocks
6672     that cannot be freed too.  A checkpoint would only be good after lots of data pages had been written.
6673     We could just wait and let other processes kick in..
6674
6675     I don't think we need to do anything.  lafs_shrinker doesn't really know how tight memory
6676     is, and periodic checkpoint will free up any memory that we are pinning.
6677
6678     .... but something is needed.  We need some trigger to write dirty index blocks
6679     Maybe:
6680         - a timeout on checkpoints - every dirty_expire_interval - but that isn't exported.
6681         DONE THAT.
6682
6683     Not sure this is a complete solution.  I might want to incorp/flush index blocks when they
6684     have no dirty children, but I'm not sure about that.
6685
6686 14sep2010
6687   15ad - lafs_add_block_address call from lafs_phase_flip - do I handle failure correctly?
6688      failure happens when b2 is data block and uninc table is full so we called incorporate on the parent.
6689     This could split the parent which means the block could have been re-parented - it would have been in the
6690     child list and so found and fixed.
6691     lafs_allocated_block, when this happens, checks that the parent is dirty/realloc as appropriate.
6692     In this case, realloc isn't an issue, only dirty.  lafs_incorporate must have made it dirty and
6693     it won't get written while it has these in-phase children, so all is happy.
6694
6695  15ae - refile race?  Someone might set B_IOLock before removing from lru, so
6696           onlru is 0 and refcnt is elevated so it doesn't seem to be unused.
6697           But then whoever has the ref will refile again when dropping it and
6698           so the right thing will be done.
6699         But more generally, do we really want the lru etc to own a counted reference?
6700         If it didn't:
6701           - we would need to refile when removing from any list
6702           - we would need to get a ref when removing from list.
6703           uhmmm..
6704
6705     lafs_refile does:
6706             clear PinPending  if refcnt is low
6707             unpin   if not PinPending, or dirty etc and data or refcnt is low
6708             place on leaf list - if pinned etc - this can be earlier
6709             drop parent link, if refcnt is low, and not pinned etc
6710             handle dblock issues
6711
6712         if lru was not refcounted, then the only things we might do when refcnt isn't zero are:
6713             unpin a dblock once it is not dirty
6714             add to lru
6715
6716        But if we don't count lru, then we can lose the refcount on dblock
6717
6718      Hmmm - we cannot leave things on the leaf list forever as they thus hold a reference and
6719        don't get freed.
6720
6721    I think I want things on 'leafs' list to not hold a counted reference.
6722    Things *only* get removed while walking the list.
6723    InoIdx blocks hold a ref on the dblock both when counted and some other time.  Possibly
6724     when pinned.  This ensures they are held while InoIdx is a real leaf.
6725    But: When we take that first ref, how do we know the dblock even exists?
6726
6727    What is the lifetime of ->dblock?
6728          removed when page is released
6729          set by lafs_import_inode
6730          set by lafs_inode_dblock
6731          removed by clear_inode
6732    So if I don't hold a ref, I always need to be ready to call lafs_inode_dblock
6733    This is currently callers of getiref_locked
6734           - erase_dblock_locked ?? shouldn't need a lock
6735           - ihash_lookup - never on InoIdx
6736           - lafs_make_iblock - already have dblock
6737     So none of those really need lafs_inode_dblock
6738     What about when we set Pinned
6739          only really from set_phase ... messy.
6740     What about when we set ->parent
6741            grow index tree - not relevant
6742            ditto do_incorporate_*
6743            block_adopt
6744               Can be called on InoIdx from:
6745                 lafs_make_iblock  only!!
6746
6747 15sep2010
6748
6749   I have tidied lafs_refile up a lot but I need to make locking a lot cleaner.
6750   In particular I want a single lock I can take when the refcnt hits zero which will ensure no ref
6751   is taken until I have finished my cleanup.  I suspect the inode private_lock is the one to use.
6752   I also need to clean up getiref_locked and getref_locked - having both is awkward.
6753
6754   So: when are they called?
6755
6756    getref_locked:
6757      lafs_get_flushable - hold fs->lock
6758      first_in_seg       - holds private_lock, but shouldn't need _locked as hold a ref through child.
6759      (getiref_locked)
6760      pin_all_children   - hold private_lock
6761      find_better        - private_lock
6762    getdref_locked
6763      lafs_invalidate_page - to get a ref on each block to either erase or invalidate it
6764                           presumably page is locked
6765      lafs_get_block     - holds private_lock - plus once with only page_lock
6766      lafs_release_page  - holds private_lock
6767      (getiref_locked on dblock) - no locking
6768      lafs_inode_dblock  - private_lock of my_inode...
6769      lafs_delete_inode  - private_lock of my_inode
6770      lafs_destroy_inode - ditto
6771      lafs_drop_inode    - ditto
6772   getiref_locked
6773      erase_dblock_locked - private_lock
6774      lafs_get_flushable - fs->lock
6775      ihash_lookup       - lafs_hash_lock
6776      lafs_make_iblock   - private_lock
6777
6778   So private_lock looks like a good choice.  Issues are:
6779        - what is the story with dblock on my_inode->private_lock
6780        - what is the lock ordering
6781        - what can refile negate that we need to be careful of.
6782          i.e. we want to keep things stable while refile does its tests, but what do we need to keep
6783            stable for others?
6784             + we break the parent link?? and so the siblings link
6785             + move things to freelist
6786             + can put_page
6787             + free dblock if not page_private
6788
6789    Lock_ordering.  private_lock, then fs->lock, then lafs_hash_lock
6790    So if we have to hold lafs_hash_lock, we increment refcnt, drop the lock, get/drop private_lock
6791
6792    This is getting messy - I need something nice and clear.
6793    So:
6794      Index Blocks.
6795         If Pinned, either has references or is on a leaf list - possibly both
6796         If no references and not pinned then not on leaf list, so can be on free list
6797
6798         Pinned can only be set when there are references, and can only be cleared under private_lock
6799                   This is violated by phase_flip, which badly reads refcnt
6800         If refcnt is zero and not pinned, then can be moved to free_list
6801         If on freelist and refcnt is zero under hash_lock, can be freed
6802
6803         So if lafs_get_flushable finds a block that is not pinned, then we can delete and ignore.
6804             Someone else must hold a ref and will put it and it will refile.  but that is pointless as
6805             it could immediately be cleared after we test Pinned.
6806
6807         lafs_get_flushable should get a reference before deleting from list.  This ensures it won't be freed
6808          by lafs_shrinker, though it could be on the free list.  If it is, then it isn't pinned so it is not
6809          interesting to us.
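
A minimal sketch of the index-block rules listed above, written as plain
user-space C predicates rather than kernel code.  The struct and function
names (iblock_state, iblock_state_ok, ...) are hypothetical and only mirror
the wording of the notes; the real lafs block structure is not shown here.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified view of an index block; not the real lafs struct. */
struct iblock_state {
	int  refcnt;        /* counted references */
	bool pinned;        /* the Pinned flag */
	bool on_leaf_list;  /* linked on a leaf list */
	bool on_free_list;  /* linked on the index free list */
};

/* Check the two list-membership rules stated in the notes. */
bool iblock_state_ok(const struct iblock_state *b)
{
	/* Pinned: must have references or be on a leaf list (possibly both). */
	if (b->pinned && b->refcnt == 0 && !b->on_leaf_list)
		return false;
	/* No references and not pinned: must not be on a leaf list. */
	if (!b->pinned && b->refcnt == 0 && b->on_leaf_list)
		return false;
	return true;
}

/* refcnt zero and not pinned: may be moved to the free list. */
bool iblock_may_move_to_freelist(const struct iblock_state *b)
{
	return b->refcnt == 0 && !b->pinned;
}

/* Only meaningful while the hash lock is held (assumed by the caller). */
bool iblock_may_free(const struct iblock_state *b)
{
	return b->on_free_list && b->refcnt == 0;
}
```

The predicates say nothing about which lock must be held when the flags
change; that part of the rules still has to be enforced by the callers.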
6810
6811
6812        Data Blocks:
6813          These are removed from lru when freed - we just need the extra refcnt check after removing from list.
6814          No we don't - these are only pinned while refcnt or dirty and can only lose dirty while refcnt
6815          so they cannot disappear
6816
6817     What is the story with my_inode->private_lock though?  This is used to protect ->dblock accesses.
6818     I guess we need to get or hold the other lock .... look at what the race is - what else is checked when dblock is cleared?
6819               dblock is cleared in refile for the dblock,
6820               or in clear_inode under the inode private lock.
6821
6822     So:
6823      There are various places that hold a non-counted reference to a block.
6824      These include
6825             - index hash table            lafs_hash_lock
6826             - index free list             lafs_hash_lock
6827             - phase_leafs / clean_leafs   fs->lock                        only if pinned
6828             - inode->iblock               lafs_hash_lock
6829             - inode->dblock               inode->i_data.private_lock
6830
6831      Each of these is protected by its own lock, but not all the same lock.
6832      When we turn one of these into a counted reference, we increment refcnt under the local lock,
6833      then after dropping that lock we take and drop b->inode->i_data.private_lock to ensure refile has
6834      finished.  This must be done before changing/using the block in any way.
6835      To free an index block it must first be removed from _leafs list.  Then if the refcount is still
6836      zero it can be freed - or put on freelist and subsequently freed.
6837      An InoIdx block - we need to hold hash_lock as well as private_lock to take a reference.
6838      To free a data block we similarly need to recheck refcnt after removing from leaf list.
6839      If it is in an inode file we also take that inode's private_lock to clear dblock.
6840          We use rcu to get the inode, then lock it, then clear dblock if refcnt is still zero.
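
The step of turning a non-counted reference into a counted one can be
sketched as below.  This is a single-threaded toy, assuming fake locks that
only record the order of events; toy_lock, toy_block and get_counted_ref are
made-up names, and the real code would use spinlocks and the actual block
structure.

```c
#include <assert.h>
#include <string.h>

static char trace[64];	/* records the order of events, for inspection */

static void note(const char *name, const char *what)
{
	/* Toy buffer: silently stop recording when it is full. */
	if (strlen(trace) + strlen(name) + strlen(what) + 2 > sizeof(trace))
		return;
	strcat(trace, name);
	strcat(trace, what);
	strcat(trace, " ");
}

struct toy_lock {
	const char *name;
	int held;
};

static void toy_acquire(struct toy_lock *l)
{
	assert(!l->held);
	l->held = 1;
	note(l->name, "+");
}

static void toy_release(struct toy_lock *l)
{
	assert(l->held);
	l->held = 0;
	note(l->name, "-");
}

struct toy_block {
	int refcnt;
	struct toy_lock *private_lock;	/* stands in for b->inode->i_data.private_lock */
};

/*
 * 'local' is whichever lock protects the place the block was found
 * (hash table, free list, leaf list, ...).
 */
void get_counted_ref(struct toy_block *b, struct toy_lock *local)
{
	toy_acquire(local);
	b->refcnt++;			/* increment under the local lock */
	note("ref", "");
	toy_release(local);

	/* Take and drop private_lock so any refile in progress has finished. */
	toy_acquire(b->private_lock);
	toy_release(b->private_lock);
	/* Only now may the caller use or change the block. */
}
```

Dropping the local lock before touching private_lock keeps to the ordering
given earlier (private_lock, then fs->lock, then lafs_hash_lock).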
6841
6842 17sep2010
6843    review lafs_refile - are some of those tests redundant? - yes, one is gone.
6844
6845  So:
6846   15ah - What about truncated blocks sitting on an uninc chain?
6847        I don't see the problem.  It will eventually get incorporated and do the right thing...
6848
6849   15ai - We don't want to touch the youth block during a checkpoint else it is awkward to write it out in
6850       a stable way.....
6851     No, I don't think that is really a problem.  It only gets written out in the tail of the checkpoint after
6852     the root.  I guess it could then get a youth number for a segment that it has no count for, if the root is
6853     written at the end of one segment and the segusage/youth written at the start of the next.
6854
6855     But I think roll-forward is missing something.  Blocks in the next phase need to be counted into segusage.
6856     Are they?  oh, yes - they are. - cleaned and index blocks are ignored so there might be some wasted space,
6857     but the important blocks picked up by the roll-forward are handled.
6858
6859     So....
6860
6861      A checkpoint could cover multiple segments.  We need to be sure these each get a valid youth number.
6862      Probably most of them will, but we need a consistent approach to be sure.
6863      They don't need to be added to the segtracker, except the last needs to be active, and it already is.
6864      So as we find a new segment we want to do much like what lafs_free_get does for youth_update.
6865      But the data block - isn't that youthblk?  When is that set?
6866         segsum_find sets it if ssnum == 0
6867
6868 19sep2010
6869    15ak - run the orphan file at mount time.
6870      After roll-forward when we have a working filesystem, we need to read the orphan file, load each block
6871      mentioned, and register each as an orphan.
6872      This involves:
6873             - setting the orphan_slot
6874             - setting B_Orphan
6875             - lafs_add_orphan
6876          Just like at the start of orphan_commit
6877      We also need to initialise nextfree and possibly 'reserved'.
6878      But: can orphans be created during roll-forward?  They certainly can.  We currently hide that in a re-use of
6879      the orphan list..  But directory updates are possible too, and not handled.
6880
6881      I guess we should examine the file as soon as root is loaded and before roll-forward, as roll-forward cannot
6882      change the orphan file.  Then after roll-forward, we read the original part of the file and set up
6883      any orphans that aren't yet.
6884      So we want to read once to get the size.  Then read again to process content up to that size.
6885
6886    15am - filesystem name.
6887        This is only used for identifying snapshots
6888
6889 01oct2010
6890   - mkfs is done to an initial version of lafs-utils. !!!
6891
6892  So: 15am - filesystem name - used to identify snapshots
6893    So the name is pointless in subordinate filesets.  So I could just shrink
6894     the metadata.  The primary metadata needs to be big enough to get a name
6895     easily though.
6896
6897  15aw..
6898     When cleaning we have a separate credit bit 'B_Realloc' from 'B_Dirty'.
6899     But we have the same B_UnincCredit bit for both.  Is that safe?
6900     Processing the cleaner could absorb the UnincCredit while the block is
6901     reserved but not dirty.  Then when it gets dirtied, there may not be
6902     enough credits to split.
6903     We set Dirty from Credit, and use ICredit for UnincCredit.
6904     But when only Realloc (not dirty) we don't use those bits.  We allocate
6905     fresh credits or set Dirty if that fails.
6906
6907 03Oct2010
6908    Need lafs_iget_fs to work on other filesystems.  And other snapshots?
6909    We use it:
6910      in cleaner when parsing cluster head
6911      in orphan handler when loading orphan file or when rearranging it.
6912      in roll forward
6913
6914    Each of these might need to kern-mount the fs - so we need to hold the ref
6915    somewhere.
6916    Cleaner also needs to explore snapshots.
6917
6918    Don't want kern_mount - that is too heavy weight and includes a vfsmnt.
6919    Just split up lafs_get_subset and use sget etc. so we get an 'sb' that we need
6920    to hold.
6921    Similarly for snapshots.  Cleaner needs to consider all snapshots, so they
6922    all need to be mounted.
6923
6924    So snapshot 'sb's are referenced by cleaner, and de-reffed when cleaner stops.
6925    Subset 'sb's can be attached to the parent inode and then only dropped when
6926    the inode goes... only sb currently references inode.
6927    So maybe the first ref to an sb doesn't ref the inode but others do - is that
6928    possible? No, as we don't see them being dropped.
6929    Every inode in the subset could ref the filesys inode.  That would keep it active
6930    the right amount of time, but release/destroy could still be racy.
6931
6932    I guess cleaner/orphan/roll need to explicitly ref the fs.
6933      cleaner already refs inode when B_Cleaning, so hold fs too.
6934      B_Orphan seems to own an inode ref too.
6935
6936    So:
6937        lafs_iget_fs gets a ref on the inode and the sb.
6938        need lafs_iput_fs to drop both references
6939        B_Cleaning, B_Orphan, I_Pinned and I_Trunc all hold this double ref.
6940
6941     cleaner holds refs on all snapshots
6942
6943     FIXME I probably need to hold inode/fs for B_Async too.
6944        No.  Async only refs the block, not the inode or fs.
6945         Something else would normally ref the inode - e.g. cleaner.
6946         When the inode is free, the page invalidation will notice the
6947          B_Async flag and release it.
6948
6949     So that is all done now, except I don't hold refs on snapshots in the cleaner
6950     yet.
6951
6952 11oct2010
6953  DescHole
6954    - When is this used? directory etc don't need it.
6955    - a regular file might, but there is no API to punch
6956      a hole.... yet I guess.
6957    - So we just want to allocate these blocks to 0.
6958
6959 15oct2010 - happy birthday Daniel...
6960  Looking at 36:
6961   a/ files with nlink==0;
6962         If we happen to find them, we hold a reference until all roll-forward
6963         is done, in case a name is found - it is important not to start deletion
6964         early.
6965
6966 18oct2010
6967   36g - write roll_mini for directories.
6968    We get a name, an inode number, and one of:
6969       LINK UNLINK REN_SOURCE REN_NEW_TARGET REN_OLD_TARGET
6970
6971    The REN_SOURCE is linked with a REN_*_TARGET which could be in a
6972    different directory, so we need to stash the SOURCE until the TARGET
6973    arrives.
6974    We simply impose the implied change on the directory and update the
6975    link count in the target inode.
6976    So:
6977      load the inode
6978      possibly record REN_SOURCE for later
6979
6980      calls prepare/pin/commit as appropriate.
6981      Put the inode on orphan list if appropriate - needs care
6982         as we retarget orphan list.
6983      update inode link count.
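
A rough sketch of the stash-the-SOURCE-until-the-TARGET-arrives step, in
user-space C.  The record names come from the list above; everything else
(struct pending_rename, roll_mini_dir, the return codes) is hypothetical.
Link-count changes for renames are deliberately left out, since the notes
don't spell them out; only LINK and UNLINK report a delta.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

enum dir_update { LINK, UNLINK, REN_SOURCE, REN_NEW_TARGET, REN_OLD_TARGET };

/* One stashed REN_SOURCE, waiting for its REN_*_TARGET. */
struct pending_rename {
	int      valid;
	uint32_t inum;
	char     name[256];
};

/*
 * Returns 1 if the update can be applied now, 0 if it was stashed,
 * -1 if a target arrived with no stashed source.
 * *nlink_delta is the link-count change for LINK/UNLINK, else 0.
 */
int roll_mini_dir(struct pending_rename *pend, enum dir_update op,
		  uint32_t inum, const char *name, int *nlink_delta)
{
	*nlink_delta = 0;
	switch (op) {
	case LINK:
		*nlink_delta = 1;
		return 1;
	case UNLINK:
		*nlink_delta = -1;
		return 1;
	case REN_SOURCE:
		/* The target may be in another directory: remember the source. */
		pend->valid = 1;
		pend->inum = inum;
		strncpy(pend->name, name, sizeof(pend->name) - 1);
		pend->name[sizeof(pend->name) - 1] = '\0';
		return 0;
	case REN_NEW_TARGET:
	case REN_OLD_TARGET:
		if (!pend->valid)
			return -1;
		pend->valid = 0;	/* source and target are now paired */
		return 1;
	}
	return -1;
}
```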
6984
6985    (28Feb2011)
6986    Just a refresh on the purpose of these updates.
6987    1/ They allow us to fsync a directory without performing a full checkpoint.
6988      As directory blocks are not processed in roll-forward we need the update
6989      for data to be safe.  As fsync of directories is rare in some common
6990      situations we could avoid actually writing these.  Simply queue them
6991      internally and discard them on a checkpoint.  If an fsync comes before the
6992      checkpoint, only then do we write them out.  If there are any cross-directory
6993      renames then the preceding updates in both directories need to be flushed
6994      before the cross-directory rename.  It might be easier to always flush on
6995      a cross-directory rename.
6996    2/ They ensure consistency of inode link-count wrt names in the filesystem,
6997      but as link count is only updated by these (or a checkpoint) there is no
6998      problem with delaying.
6999
7000    So: when replaying these we must update the directory content and the inode
7001    link count.
7002    It is OK to delay the write-out of these until an fsync, and not bother
7003    if a checkpoint happens.
7004    So add that to the TODO list - item 66.
7005
7006 28feb2011
7007   - roll forward directory updates ... I wonder if I got it right :-)(untested).
7008
7009
7010   I don't seem to have easy-access notes about the various meanings of
7011   'width' and 'stride'
7012
7013   width:  The number of independent devices across which the (virtual) device
7014     is placed.  The normal goal is to write 'width' blocks on every single write.
7015     On a RAID4/5/6 this will avoid the need to pre-read for parity calculations,
7016     and it will keep all devices equally busy with writes.
7017     The 'width' blocks probably aren't consecutive.
7018
7019     There are two different layouts - one with width*stride <= segment_size
7020     and one with width*stride > segment_size.
7021
7022   width*stride <= segment_size
7023      This is a traditional striped layout like RAID0/4/5/6.
7024      The 'stride' is the chunk size, so 'width*stride' is the stripe size,
7025      and segment_size must be a multiple of this.
7026      In this case all addresses in a single segment are contiguous.   We don't
7027      necessarily write them in order if we want to write less than one stripe.
7028      segment_offset will normally be a multiple  of width*stride though this isn't
7029      enforced as one could have a partition with a non-aligned start.
7030
7031   width*stride > segment_size
7032      This implies a concatenated layout.  If parity-redundancy is in use then the
7033      blocks which combine to form a stripe are 'stride' blocks apart.
7034      The benefit of this layout is that an extra drive can be added by simply
7035      zeroing it and joining it to the array - no re-stripe needed.
7036      This will make all stripes slightly larger so at first the space will not
7037      be available.  As cleaning happens the space will gradually become
7038      available.  This still requires restriping, but unlike a normal
7039      raid5 restripe, the space becomes available in small amounts immediately.
7040      When there is no demand for more space, the re-striping (cleaning) can happen
7041      at a very low priority with no cost.
7042
7043      In this case the blocks in a segment are not contiguous.
7044       Runs of 'segment_size/width' blocks are, then there is a large gap (in virtual address
7045       space) to the next chunk.
7046
7047      The segment_offset is an amount of space which is free at the start of
7048      each device.  0..segment_offset and stride..stride+segment_offset etc
7049      do not contain data and can be used for metadata.
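
One possible reading of the two layouts, as an address calculation in
user-space C.  This is only a sketch under assumptions: struct layout and
block_vaddr are hypothetical names, the concatenated case assumes device d
occupies virtual addresses d*stride onwards, and the order in which blocks
are actually written within a stripe is not modelled.

```c
#include <assert.h>
#include <stdint.h>

/* All values are in blocks. */
struct layout {
	uint64_t width;
	uint64_t stride;
	uint64_t segment_size;
	uint64_t segment_offset;
};

/* width*stride <= segment_size is the traditional striped layout. */
int layout_is_striped(const struct layout *l)
{
	return l->width * l->stride <= l->segment_size;
}

/* Virtual address of block 'off' within segment 'seg'. */
uint64_t block_vaddr(const struct layout *l, uint64_t seg, uint64_t off)
{
	uint64_t run, dev;

	assert(off < l->segment_size);
	if (layout_is_striped(l))
		/* All addresses in one segment are contiguous. */
		return l->segment_offset + seg * l->segment_size + off;

	/*
	 * Concatenated: segment_size/width contiguous blocks per device,
	 * then a jump of 'stride' to the same place on the next device.
	 */
	run = l->segment_size / l->width;
	dev = off / run;
	return dev * l->stride + l->segment_offset + seg * run + off % run;
}
```

With width 4, stride 1000, segment_size 64 and segment_offset 8, segment 1
would start at address 24, and its second run would start at 1024.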
7050
7051   When width > 1 it makes sense to replicate each state block across
7052      every device - as we want to write the whole stripe anyway.
7053   For now we only write and read the first two copies at the beginning, and
7054   the last two at the end...
7055
7056   Question:  what do we want to do about metadata on flash devices?  We really
7057    don't want a small number of locations to store the metadata, but a large
7058    number that we search through - possibly a binary search.
7059    These could be all at start/end or scattered throughout the device.
7060    The latter would make it impossible to find efficiently - there is no way to
7061    create useful linkage without writing something else at start or end.
7062    As many devices optimise for random writes where the FAT table would be,
7063    it makes sense to just put the metadata there and not at the end.
7064    We should allow one 'page' for each metadatum, which probably means
7065    32K.
7066    So we should allow all state blocks to be near the start.
7067
7068 01mar2011 - Autumn arrives.
7069
7070   Time to add handling of 'atime' and non-logged files.
7071
7072   The idea is to have a separate file for storing only 'atime'
7073   This is separate from the inode file because the volatility of the data
7074   is very different and one of the principles of log-structured-fs is that
7075   differently volatile data should be kept separate.
7076
7077   This does mean that an inode lookup requires getting data from two files,
7078   but it is hoped that the 'atime' file will mostly be in cache as each
7079   block contains the atime for lots of different inodes.
7080
7081   The atime file contains 2 bytes for each inode, so with a block size of 4K,
7082   each block would hold info for 2048 inodes.  1 million inodes would require
7083   2 megabytes.
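
The arithmetic above as a small user-space C sketch.  It assumes the slot
for an inode is found directly from its inode number, which the notes
imply but do not state; the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define ATIME_BYTES_PER_INODE 2

/* Which block of the atime file holds the slot for 'inum'. */
uint32_t atime_block_of(uint32_t inum, uint32_t blocksize)
{
	return inum / (blocksize / ATIME_BYTES_PER_INODE);
}

/* Byte offset of the slot within that block. */
uint32_t atime_offset_of(uint32_t inum, uint32_t blocksize)
{
	return (inum % (blocksize / ATIME_BYTES_PER_INODE)) * ATIME_BYTES_PER_INODE;
}

/* Total size of the atime file for 'ninodes' inodes. */
uint64_t atime_file_bytes(uint64_t ninodes)
{
	return ninodes * ATIME_BYTES_PER_INODE;
}
```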
7084
7085   The 16bits are treated as a positive floating point number which
7086   gets added to the atime stored in the inode.  The lower 5 bits are
7087   the exponent, the remaining 11 bits are mantissa.  Though there is a
7088   little complexity in interpreting the exponent.
7089      If the exponent is 0, the mantissa is used as milliseconds -
7090        so shift right 5 and multiply by 1000000 for nanoseconds.
7091        The smallest change that can be recorded is 1 millisecond,
7092        and values up to (2^11-1) milliseconds - or 2 seconds - can be stored.
7093      If the exponent is 1 to 10, the mantissa has a '1' added as a
7094        new msb, and is shifted by the exponent-1 and then treated as milliseconds.
7095        This ranges up to 2^(12+9) milliseconds or about 35 minutes, where
7096        the granularity will be 2^9 millisecs or 0.5 seconds
7097
7098      For exponents from 11 up to 31 we add the 1 msb and treat
7099        the number as seconds after shifting (e-11).  So at e==31,
7100        we shift a number that is
7101        up to 4095 by 20 to get nearly 2^32 seconds or 136 years.
7102        At this point the granularity is 2^20 seconds or 12 days.
7103
7104
7105    So overall we can update the atime for 136 years without needing to
7106    update the inode, and can record differences of 1msec for the first
7107    couple of seconds, then gradually less granularity until we are
7108    down to one second an hour after the last change, and 4 hours a
7109    year later.
7110
7111    To convert a number of seconds to this format:
7112
7113    If >= 2048 seconds, we shift down until less than 4096 seconds
7114    counting the shift.  We add 11 to that number to form exponent,
7115    and shift the resulting mantissa up 5, or with exponent, and mask
7116    out bit 16.
7117
7118    Otherwise we convert to milliseconds (divide nano by 1000000 and
7119    multiply seconds by 1000, and add). Then if < 2048, we shift up by
7120    5 leaving a zero exponent and use that.
7121
7122    Otherwise we shift down until < 4096 counting shifts, add 1 to the
7123    shift to form an exponent, and combine with mantissa as above.
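
A minimal sketch of the format and the conversion just described, in
user-space C.  The names atime_decode_ms and atime_encode are hypothetical.
Two things are assumptions rather than taken from the notes: the decoder
returns milliseconds for every exponent, and the encoder saturates to all
1s when the value is too large to store.  nsec is assumed to be below
1000000000.

```c
#include <assert.h>
#include <stdint.h>

#define ATIME_EXP_BITS 5
#define ATIME_EXP_MASK 0x1f
#define ATIME_HIDDEN   0x800	/* the '1' added above the 11-bit mantissa */

/* Decode a 16-bit atime value to milliseconds. */
uint64_t atime_decode_ms(uint16_t v)
{
	unsigned e = v & ATIME_EXP_MASK;
	uint64_t m = v >> ATIME_EXP_BITS;

	if (e == 0)
		return m;			/* plain milliseconds */
	m |= ATIME_HIDDEN;
	if (e <= 10)
		return m << (e - 1);		/* scaled milliseconds */
	return (m << (e - 11)) * 1000;		/* scaled seconds */
}

/* Encode a (seconds, nanoseconds) difference; rounds down. */
uint16_t atime_encode(uint64_t secs, uint32_t nsec)
{
	uint64_t m;
	unsigned shift = 0;
	unsigned base;

	if (secs >= 2048) {
		m = secs;
		base = 11;
	} else {
		m = secs * 1000 + nsec / 1000000;
		if (m < 2048)
			return (uint16_t)(m << ATIME_EXP_BITS);	/* exponent 0 */
		base = 1;
	}
	while (m >= 4096) {
		m >>= 1;
		shift++;
	}
	if (base + shift > 31)
		return 0xffff;	/* assumption: saturate when out of range */
	/* The mask drops bit 16, which is the implied msb. */
	return (uint16_t)(((m << ATIME_EXP_BITS) | (base + shift)) & 0xffff);
}
```

For example, 3 seconds becomes 3000 milliseconds with exponent 1, and 2048
seconds encodes as mantissa 0 with exponent 11.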
7124
7125    So that is the format - how do we implement it?
7126
7127    We don't want to expose to user-space numbers that we cannot store.
7128    So any 'utimes' call updates the inode directly and can clear the
7129    value in the atime file.  Only updates due to accesses go to the atime
7130    file.
7131    We define a 'getattr' function which looks at the atime stored in
7132    the vfs inode and if it has changed we need to deal with it.
7133     - if the inode is still dirty we simply update the lafs inode
7134       and use the number as-is, clearing the atimes entry
7135     - else we subtract the stored atime from the new atime.  If this
7136       is negative or exceeds 136 years we mark the inode dirty and
7137       store it there.  If we cannot mark the inode dirty for some
7138       reason we just store all 1s in the atime file.
7139
7140     The same operation is needed when dirty_inode is called to make
7141     sure atime updates get saved even when no getattr is called.
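
The decision made by getattr / dirty_inode can be sketched as a pure
function.  This is user-space C with hypothetical names (atime_action,
atime_resolve).  The "store the difference in the atime file" outcome is
inferred from the surrounding text rather than stated in the list above,
and the limit used is the largest encodable value, which is roughly the
136 years mentioned.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum atime_action {
	ATIME_STORE_IN_INODE,	/* write atime to the inode, clear the atime-file entry */
	ATIME_STORE_DELTA,	/* keep inode atime, store the difference in the atime file */
	ATIME_STORE_ALL_ONES	/* cannot dirty the inode: store all 1s */
};

/* Largest difference the 16-bit format can hold, in seconds. */
#define ATIME_MAX_DELTA_SECS (((int64_t)4095) << 20)

enum atime_action atime_resolve(bool inode_dirty, bool can_dirty_inode,
				int64_t inode_atime, int64_t new_atime)
{
	int64_t delta;

	if (inode_dirty)
		return ATIME_STORE_IN_INODE;

	delta = new_atime - inode_atime;
	if (delta >= 0 && delta <= ATIME_MAX_DELTA_SECS)
		return ATIME_STORE_DELTA;

	/* Negative or too large: needs the inode itself. */
	return can_dirty_inode ? ATIME_STORE_IN_INODE : ATIME_STORE_ALL_ONES;
}
```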
7142
7143     As we always need to be able to update the atime file, it needs to
7144     be permanently pinned whenever an inode is read in.  For
7145     non-logged files this should be cheap but we must do it anyway as
7146     the file might not be non-logged.
7147     So we need to keep a permanent reference to each block while the
7148     inode is loaded.  That can keep it pinned.
7149
7150
7151     We don't want updates to the atime file to be flushed in any great
7152     hurry, especially if it is a logged file.  We would be quite happy
7153     to only write at 'unmount' and probably 'sync'.
7154     So we want to stop the pages from appearing dirty in the page
7155     cache (PAGECACHE_TAG_DIRTY), and the inode from appearing dirty
7156     (I_DIRTY).
7157     We can still keep them dirty in lafs metadata so if release_page
7158     is called we can schedule a write out then.
7159
7160
7161    So some steps:
7162
7163     1/ load atime file at mount time - there is one for each
7164       filesystem.  It has inum of 3 and type of TypeAccesstime (6).
7165       Also release it on unmount.
7166
7167     2/ loading an inode must take a ref to the block in the atime file
7168       if it exists.  A new inode flag records if this has happened.
7169       Unless mounted noatime, we pin the block and reserve space.
7170
7171     3/ getattr and dirty_inode must resolve any issues with the
7172        atime.  So lafs_inode probably needs an extra field to be able
7173        to check for changes
7174
7175
7176
7177   Hmm.. this is getting confusing...
7178   When atime is changed the only way we find out is by ->dirty_inode
7179   being called.  But that is called when anything is changed.
7180   Filtering out whether or not we need to update the inode itself
7181   is awkward... maybe there is some context we can use.
7182   ->dirty_inode is called by mark_inode_dirty which is called:
7183    - by touch_atime, if something changed
7184    - file_update_time  - at which time we also update iversion
7185    - setattr ... which has changed recently (2.6.37ish)
7186    - page_symlink
7187    - generic_file_direct_write - which increases size of inode
7188    - set_page_dirty_nobuffers
7189
7190   So either the inode is pinned, or it isn't.
7191   If it isn't, then this *must* be an atime-only update.
7192   If it is, then it could be anything, but in any case we update the
7193   atime directly.
7194   So: dirty_inode should try to get dblock and check if it is pinned.
7195    If it is pinned, then update the atime immediately and the offset
7196    in the atime file too.
7197    If not, just update the offset
7198
7199
7200 03mar2011
7201   ARGggg... checkpin is interfering with unmount - it keeps an
7202     s_active count so unmount 'works' but doesn't release anything.
7203
7204   checkpin is needed to ensure that inodes remain safe while
7205   we are cleaning.  Particularly, while the inode index block is
7206   pinned, we keep the inode and fs referenced as well.  I guess the
7207   theory is that they won't stay pinned for long - but they do.
7208   e.g. segusage blocks are permanently pinned.
7209
7210
7211   We could have a rule about the prime filesystem always being mounted.
7212   Then we don't need refcounts, but kill off the cleaner before
7213   unmount...  which we sort-of do..
7214
7215   All subordinate filesystems have references on the prime_sb so the
7216   prime_sb must be the last one to go.  When it goes it kills
7217   everything off...
7218   So we don't need checkpin to take a ref on the prime_sb.
7219
7220   There might be still an issue with files in subset filesystems
7221   being permanently pinned so they stay around longer than they
7222   should... need to check on that somehow.
7223   The idea is that a quota file block is permanently pinned so it
7224   will keep the fs pinned.  That in turn will keep everything else
7225   pinned... Worry about that when we implement quotas FIXME
7226
7227 04mar2011
7228   I really need to sort this out, and it isn't easy...
7229   We really want to know when "all" filesystems have been unmounted
7230   so the block device(s) can be released and the cleaner stopped.
7231   But we don't have a count for that.  We could if that was all
7232   we counted - but that would mean that we only have a single
7233   struct super_block for all filesystems.
7234
7235   So that is what I have to do.  A single super_block for all parts
7236   of the filesystem.  I probably still need to allocate other
7237   dev numbers for stat->dev, but I don't need to use them internally.
7238   Maybe I even allocate superblocks... Yes - we need to use
7239   set_anon_super and kill_anon_super to allocate the numbers.
7240   lafs_inode will need a pointer to the filesystem - we use that
7241   instead of the sb.
7242
7243   -------
7244
7245   Testing...
7246    bug at block.c:658.  Block not B_Valid in lafs_dirty_iblock from
7247    lafs_allocate_block  from cluster_flush.
7248    Block is 74/0: InoIdx block of a newly created file I think.
7249     '74' was /f23, then  /mnt/1/adir.  We are creating file in that
7250    dir.
7251    This is a depth=0 InoIdx block - i.e. the data is in the
7252    dblock, so there is no index info, so it kind-a makes sense for the
7253    index block to not be Valid.
7254      yes- commit d268a566605bf006cf33c confirms that.
7255
7256    So why are we trying to dirty it?..
7257
7258    Maybe:
7259      We create a couple of directory entries, then flush and end up
7260      with an in-line data block.
7261      Then we add more, flush again and so try to dirty parent...
7262    Where do we turn depth=0 inodes to depth=1??
7263       - erase_dblock_locked - don't want that
7264       - lafs_incorporate
7265    So I guess the 'bug' is in error - it is OK to mark that invalid
7266    block as dirty.
7267
7268 04mar2011
7269   So - back to the super_block reworking.  We want only one
7270   superblock.
7271   So we use the TypeInodeFile inodes a bit more to hold the details
7272   of different filesystems.  We need to store a unique 'dev' number in
7273   there: use set_anon_super/kill_anon_super on a local 'struct
7274   super_block' and copy s_dev in/out.
7275
7276   As we only have one sb, we can only have one fstype, so we cannot
7277   use the fstype to choose what to do.
7278     - if dev_name is a block device we try a normal mount
7279     - if dev_name is an Inode file, we perform a subset mount
7280     - if dev_name is a lafs dir and '-o snapshot=name', we mount that
7281       snapshot
7282     - if dev_name is a lafs dir in root with perm zero and
7283       '-o subset=MAXSIZE', create a subset filesystem.
7284
7285   - lafs_iget needs an inode rather than a superblock
7286     ditto for lafs_new_inode, lafs_inode_inuse, inode_map_free,
7287     choose_free_inum, inode_map_new_prepare
7288   - lafs_iput_fs,lafs_igrab_fs, ino_from_sb
7289
7290   - NFS filehandles need careful thought
7291      They are 'per-super-block', not 'per-vfsmnt' which might be
7292      better.
7293      We could change that but.....
7294      For non-snapshot files it is easy - just record two inodes, the
7295      fs and the target.
7296      For snapshots there is nothing that is really stable.
7297      Maybe we could have different superblocks for snapshots.
7298      The snapshot doesn't need the cleaner as it is read-only, though
7299      the cleaner can need the snapshot...
7300
7301      So the cleaner might automagically mount a snapshot, but a
7302      snapshot will never invoke the cleaner or any other thread stuff.
7303
7304   So I guess we want one superblock for the fs and one for each
7305   snapshot.
7306   The filehandle is then either inum+gen or inum+inum+gen where first
7307   inum must be TypeInodeFile
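
A sketch of the two filehandle shapes in user-space C.  The 32-bit word
size, struct lafs_fh and the helper names are assumptions for illustration
only; the check that the first inum really is a TypeInodeFile is left to
the caller.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical handle: 2 words (inum+gen) or 3 words (fs inum+inum+gen). */
struct lafs_fh {
	int      nwords;
	uint32_t w[3];
};

struct lafs_fh fh_encode(int in_subset, uint32_t fs_inum,
			 uint32_t inum, uint32_t gen)
{
	struct lafs_fh fh = { 0, { 0, 0, 0 } };

	if (in_subset)
		fh.w[fh.nwords++] = fs_inum;	/* must name a TypeInodeFile inode */
	fh.w[fh.nwords++] = inum;
	fh.w[fh.nwords++] = gen;
	return fh;
}

/* Returns 0 on success, -1 if the handle has an unexpected length. */
int fh_decode(const struct lafs_fh *fh, int *in_subset,
	      uint32_t *fs_inum, uint32_t *inum, uint32_t *gen)
{
	const uint32_t *p = fh->w;

	if (fh->nwords != 2 && fh->nwords != 3)
		return -1;
	*in_subset = (fh->nwords == 3);
	*fs_inum = *in_subset ? *p++ : 0;
	*inum = *p++;
	*gen = *p;
	return 0;
}
```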
7308
7309 07mar2011
7310   ... though I could just put a snapshot number and partial timestamp
7311       in..
7312
7313
7314 08mar2011
7315  This isn't a new to-do list, it is a list of the main features that are
7316  still not implemented:
7317    - full 2D layout
7318         + at very least I don't pad with zeros yet
7319         + if stripe size were multiple of 3*3*5*7*2^N, then changing
7320           width might be manageable.
7321           e.g. stripe size: 40320 blocks.. But with megabyte chunksizes,
7322           we really want 32bit segsizes and 322560 block segments.
7323    - non-logged files - with interface to request access-time file
7324    - quotas
7325    - snapshots:  particularly cleaning
7326    - error handling
7327    - metadata (inode/directory/etc) CRCs and duplication
7328    - fsck / debugfs
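
On the stripe-size figures in the 2D layout item above: both numbers are
of the form 3*3*5*7*2^N (N=7 gives 40320, N=10 gives 322560), so they
divide evenly by every width from 1 to 10.  A small user-space C check,
with a hypothetical helper name:

```c
#include <assert.h>
#include <stdint.h>

/* Count the widths in 1..max_width that divide the stripe size exactly. */
unsigned widths_dividing(uint64_t stripe_blocks, unsigned max_width)
{
	unsigned w, n = 0;

	for (w = 1; w <= max_width; w++)
		if (stripe_blocks % w == 0)
			n++;
	return n;
}
```

A width of 11 is the first that does not divide 40320.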
7329
7330
7331   What would fsck do?
7332    - locate and validate device and state blocks.
7333    - locate and validate checkpoint cluster.
7334    - locate and validate filesystem root
7335    - roll forward to collect segusage and quota blocks.
7336    - load inode map, read inode file, validate each inode and make sure
7337      map is correct.
7338    - explore each file, following all indexing, count segusage for each
7339      segment and make sure segusage file is consistent.
7340    - check no block is allocated twice.  This might require multiple passes,
7341      each time we examine a different collection of segments.
7342
7343    - checking a file requires:
7344           - checking inode is consistent
7345           - checking index blocks are consistent with depth
7346           - checking index/extent blocks are sorted with no overlaps
7347           - checking block/iblock counts are correct.
7348    - checking all cluster headers in the current segment to ensure they
7349      look consistent and agree with file information. i.e. if cluster_header
7350      identifies a block, the block must live there, or later in the segment.
7351
7352    - scan all directories looking for consistency of hash etc.  Count links
7353      for all inodes.  This might need to be multi-pass too.
7354      Could use a bitmap for single-link files, and table for others.
7355
7356    How to fix errors.
7357      - First must find segments which are not in use according to segusage file
7358        or according to block search.
7359        If there are none, require a new device be provided.
7360      - If anything looks incorrect, write corrected version to new segment
7361        Then write out new segusage files
7362
7363    In some cases we might need to search all write-clusters for missing blocks??
7364    That could take a very long time!
7365
7366
7367    What do I really want to do about CRCs and hashes.
7368     It might be nice to store a hash for each block in the index block.
7369     But that wastes precious index-block space.
7370     If I store a CRC together with address info in the block, then I could
7371     be fairly sure it is the right block.  So e.g. inodes store the inode number,
7372     Index blocks could hold inode+depth+address.
7373     Last 8 bytes of each block could be a 4byte CRC and a 4byte identity.
7374     identity is XOR of fsinum inum blocknum generation - or a CRC of these.
7375
7376     Actually, we don't need to store the identity info - we just need to
7377     include it in the CRC.  That either saves space, or allows more bits to
7378     be used for the CRC, which is probably the best use of bits for detecting
7379     errors.
7380     Though it might be nice to store phys-addr in the CRC too, we cannot as
7381
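
The idea of folding the identity into the CRC instead of storing it can be
sketched like this, in user-space C.  The CRC used here is the common
reflected CRC-32 (polynomial 0xEDB88320), chosen only for illustration;
the notes do not say which CRC lafs uses.  struct block_identity and
block_crc are hypothetical names, and the identity is fed in host byte
order, which an on-disk format would have to pin down.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-32 update; caller supplies the initial value and final XOR. */
static uint32_t crc32_update(uint32_t crc, const void *buf, size_t len)
{
	const uint8_t *p = buf;
	int k;

	while (len--) {
		crc ^= *p++;
		for (k = 0; k < 8; k++)
			crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1)));
	}
	return crc;
}

/* The identity fields named in the notes; never written to the block. */
struct block_identity {
	uint32_t fsinum;
	uint32_t inum;
	uint32_t blocknum;
	uint32_t generation;
};

/* CRC over the block contents followed by the identity. */
uint32_t block_crc(const void *data, size_t len,
		   const struct block_identity *id)
{
	uint32_t crc = 0xFFFFFFFFu;

	crc = crc32_update(crc, data, len);
	crc = crc32_update(crc, id, sizeof(*id));
	return crc ^ 0xFFFFFFFFu;
}
```

A block read back for the wrong inode then fails the CRC check even if its
contents are intact.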
7382 21mar2011
7383   My short-term todo list is:
7384 DONE  - get 'lafs' to the stage where I can create an fs requiring roll-forward
7385 DONE  - use 'lafs' to create images for testing, so I don't need 'fred.safe' any more.
7386 DONE  - Make lots of 'layout' changes - see 15cb
7387
7388 02may2011
7389   - 'run' goes to completion, but segusage isn't updated in the final cluster
7390        and the number left over from before looks wrong.
7391 DONE  - 'ls -l' on a subset file gets confused.
7392   - fs created by 'lafs' has wrong Blocks and Inodes counts
7393   - we lose a ref to a segsum and sometimes put it too often.
7394 REFCNT 1 [ce0ffc48]0/182(2535)r0E:Valid,Claimed,PhysValid NP
7395 REFCNT 1 [ce055b9c]0/187(2535)r0E:Valid,Claimed,PhysValid NP
7396 REFCNT 1 [ce0445d8]0/182(2535)r0E:Valid,Claimed,PhysValid NP
7397
7398
7399 03may2011
7400   Once I have these bugs sorted out I want to make some format changes.
7401
7402    DONE - fs_metadata needs a 'parent' link
7403         rename needs to be careful about what is updated!
7404         so does roll_mini
7405         lafs_get_parent needs some thought.
7406
7407    DONE - roll-forward should get exact mtime stamps, and ctime.
7408      So each data block must have an exact timestamp
7409      of when the change actually happened.   Or the group_head
7410      has a timestamp for the most recent update to the file.
7411      As we use nanosecond timestamps (pointless though they are)
7412      we need 30 bits for the nanoseconds and at least 11 for the seconds.
7413      So 48 bits (6 bytes) is plenty.
7414      So include a 64bit timestamp in the cluster_head and 48bit
7415      number to subtract in the group_head
7416      But saving 2 bytes per file isn't really worth it, and we may
7417      well lose it in padding.  So just store a 64bit timestamp in
7418      the group_head.
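
     The bit counting above can be sketched as follows.  The packing
     (seconds in the high bits, nanoseconds in the low 30) and all the
     names are assumptions for illustration; the notes only settle on
     "a 64bit timestamp in the group_head".  ts_delta48() shows the
     rejected 48bit "number to subtract" variant.

```c
#include <stdint.h>

#define NSEC_BITS   30	/* 10^9 < 2^30, so 30 bits hold any nanosecond count */
#define NSEC_MASK   ((UINT64_C(1) << NSEC_BITS) - 1)
#define DELTA48_MAX ((UINT64_C(1) << 48) - 1)

/* Assumed 64bit layout: seconds << 30 | nanoseconds. */
static uint64_t ts_pack(uint64_t sec, uint32_t nsec)
{
	return (sec << NSEC_BITS) | nsec;
}

static uint64_t ts_sec(uint64_t ts)
{
	return ts >> NSEC_BITS;
}

static uint32_t ts_nsec(uint64_t ts)
{
	return (uint32_t)(ts & NSEC_MASK);
}

/* Rejected variant: store only (cluster_ts - group_ts) in 48 bits.
 * That leaves 18 bits of seconds, about three days of range.
 * Returns 0 and sets *delta on success, -1 if it does not fit. */
static int ts_delta48(uint64_t cluster_ts, uint64_t group_ts, uint64_t *delta)
{
	if (group_ts > cluster_ts || cluster_ts - group_ts > DELTA48_MAX)
		return -1;
	*delta = cluster_ts - group_ts;
	return 0;
}
```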
7419
7420    DONE - use CRC in place of all checksums - lafs_calc_cluster_csum
7421
7422    DONE - state block flags for inconsistencies found
7423         If any inconsistency found, fsck is advised.
7424         For some it may be imperative.
7425         Things that can be wrong include:
7426         - generic read error
7427         - segusage negative
7428         - index block incoherent
7429         - dir block incoherent
7430         - link count negative
7431         - cluster header incoherent
7432         -
7433         64 bits should be adequate and simple for this.
7434         Any unknown bit requires a full fsck.
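
     A sketch of how such a flag word could be tested.  The bit
     assignments and every name here are hypothetical; the notes only
     list the kinds of problem and the rule that an unknown bit means a
     full fsck.

```c
#include <stdint.h>

/* Hypothetical bit assignments, one per problem listed above. */
#define INCON_READ_ERROR     (UINT64_C(1) << 0)
#define INCON_SEGUSAGE_NEG   (UINT64_C(1) << 1)
#define INCON_INDEX_BLOCK    (UINT64_C(1) << 2)
#define INCON_DIR_BLOCK      (UINT64_C(1) << 3)
#define INCON_LINK_COUNT_NEG (UINT64_C(1) << 4)
#define INCON_CLUSTER_HEADER (UINT64_C(1) << 5)
#define INCON_KNOWN_MASK     ((UINT64_C(1) << 6) - 1)

enum fsck_advice { FSCK_NONE, FSCK_ADVISED, FSCK_FULL };

/* Any bit this code does not know about requires a full fsck;
 * any known bit means fsck is advised. */
static enum fsck_advice inconsistency_advice(uint64_t flags)
{
	if (flags & ~INCON_KNOWN_MASK)
		return FSCK_FULL;
	if (flags)
		return FSCK_ADVISED;
	return FSCK_NONE;
}
```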
7435
7436    DONE - 32bit segment size
7437         With 16bit at 4K blocks we are limited to 256Meg segments.
7438         64Meg with 1k blocks.  This takes about 1 second to write on
7439         a modern drive.  On an array it will take even less time.
7440         24bits gives 16 to 64 gigabytes which is plenty.
7441         However 24bits is awkward to access.  A 1K block holds 341 1/3.
7442         A 4K block holds 1365 1/3.
7443         But this wastes less space than 256 or 1024 and so causes less IO.
7444         But then we probably want to size segments to be very big.
7445         A few thousand segments should be OK, which is tens of blocks.
7446         I don't think the savings with 24bits are worth it, and I do
7447         think v.big segments could be useful, so let's go with 32bit segments.
7448
7449         Youth is currently tuned to 16bits.  Let's leave it there and
7450         maybe waste some space.
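
     The numbers above can be checked with a couple of helpers.  The
     function names are made up, and "counter" is assumed to mean one
     per-segment usage entry of the given width.

```c
#include <assert.h>
#include <stdint.h>

/* Largest segment, in bytes, when the block count is 'bits' wide. */
static uint64_t max_segment_bytes(unsigned bits, uint32_t block_size)
{
	assert(bits > 0 && bits <= 32);
	return (UINT64_C(1) << bits) * block_size;
}

/* Whole counters of 'bits' width that fit in one block. */
static uint32_t counters_per_block(uint32_t block_size, unsigned bits)
{
	assert(bits > 0 && bits % 8 == 0);
	return block_size / (bits / 8);
}
```

     16bit gives 256Meg at 4K blocks, 24bit gives 16Gig at 1K blocks, and
     a 1K block holds 341 whole 24bit counters against 256 32bit ones.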
7451
7452
7453    - parallel new-data write clusters.
7454         I think it is sufficient to include a second 'next_addr' in the
7455         cluster_head - or maybe two.  alt_next_addr[2].
7456         When a thread wants to start a new stream of clusters it allocates
7457         the segments then attaches to the next outgoing write cluster.
7458         Once that is written everything in the new cluster is safe.
7459         On a checkpoint every stream writes at least one checkpoint cluster
7460         and these are linked together through alt_next_addr.
7461         The 'next' cluster for each must be the checkpoint cluster and must
7462         carry linkage but unlike with first-link, there is no need to wait.
7463         The data is already safe as long as the state block isn't updated
7464         until every cluster_end block is written.
7465         So really, one is enough.  I had thought 2 would enable quick fan-out
7466         but there is no real need for that.
7467
7468         As 0 is a valid write-cluster address we use 'this_address' to signify
7469         that there is no alt-next.
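
     A sketch of that sentinel convention.  The struct is a made-up
     subset of a cluster head, and the field and function names are
     assumptions apart from alt_next_addr itself.

```c
#include <stdint.h>

/* Hypothetical subset of a cluster head, for illustration only. */
struct cluster_head_sketch {
	uint64_t this_addr;	/* address of this write cluster */
	uint64_t next_addr;	/* next cluster in this stream */
	uint64_t alt_next_addr;	/* first cluster of another stream, if any */
};

/* A fresh header has no alt-next: point alt_next_addr back at ourselves,
 * because 0 is a valid write-cluster address and cannot mean "none". */
static void cluster_head_init(struct cluster_head_sketch *ch,
			      uint64_t this_addr, uint64_t next_addr)
{
	ch->this_addr = this_addr;
	ch->next_addr = next_addr;
	ch->alt_next_addr = this_addr;
}

static int cluster_has_alt_next(const struct cluster_head_sketch *ch)
{
	return ch->alt_next_addr != ch->this_addr;
}
```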
7470
7471         It is possible that a block of a file could be written to two
7472         different streams at different points in time between two checkpoints.
7473         We need to ensure that roll-forward gets these in the right order.
7474         'seq' can be the same in two different streams so we cannot use that.
7475         timestamp could possibly be used, but as times can go backwards it
7476         is not ideal.
7477
7478         NEW IDEA.  Just use one stream of clusters.  However it can
7479         bounce from one device to another easily.  So two different
7480         threads can be building up two different write clusters at the
7481         same time as long as they synchronise at some point to pass
7482         addresses around.  They also need some other Verify mode as
7483         VerifyNext or VerifyNext2 will destroy any parallelism.
7484         As the point of this is to write to multiple devices in
7485         parallel, maybe VerifyDevNext{,2} meaning the next header on
7486         the same device serves to verify this.
7487
7488    - policies.
7489         This includes
7490                 maximum number of segments written between checkpoints
7491                 whether data can be cleaned to a particular device
7492                 whether a device can receive new data
7493                 whether metadata duplication is needed
7494                 whether an RO device from a different array is allowed.
7495         Some of these are per-device policies.  Some are per-array.
7496
7497         The 'RO Device' thing is special.  I think I want an alt_uuid.
7498         It works like this:  You assemble the RO array when you
7499         mount a new filesystem identifying the old as a component.
7500         So that 'state' block on the new devices must identify the alt_uuid
7501         and state seq number.
7502
7503         Do we want to record more info about which devices are in the
7504         array?  Currently we just record how many.  If we find enough
7505         with the right UUID/seq, they must be it.  What else would we
7506         want?
7507
7508         For all the other policy statements it is probably simplest to
7509         allow a set of simple strings. e.g. "noclean", "nonew",
7510         "dup=2", "maxseg=5"
7511         devblock currently uses 146 bytes, so room for 878
7512         stateblock uses 112 plus some for snapshots, so much the same.
7513         We currently don't use 'version' and have no concrete plans.
7514         The vague idea is to allow lafs to *know* that it cannot mount
7515         the array, so any incompatible feature gets set.
7516         We could keep those in the policy sets.  From that perspective
7517         there are 3 types of things.
7518          - if you don't understand, don't worry
7519          - if you don't understand, don't try to write
7520          - if you don't understand, you cannot even read.
7521
7522         That last is really best avoided.  We have version info
7523         elsewhere in the tree so that a new index style will simply
7524         make that block unreadable.
7525         So I think give the dev and state blocks a simple incrementing
7526         version number which applies to that block, and have "don't
7527         worry" and "don't write" policies distinguished by first
7528         letter.
7529         Capital is "If you don't understand, don't write"
7530         Lower is "if you don't understand, don't worry".
7531
7532         These are space separated strings.
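
     A sketch of the first-letter rule applied to a policy string.  All
     names are hypothetical, and treating the text before '=' as the
     policy name is an assumption based on the "dup=2" example.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

enum mount_mode { MOUNT_RW, MOUNT_RO };

/* Is the name part of this word (up to '=' or its end) in the
 * NULL-terminated list of policies we understand? */
static int policy_known(const char *word, size_t len,
			const char *const *known)
{
	size_t namelen = 0;

	while (namelen < len && word[namelen] != '=')
		namelen++;
	for (; *known; known++)
		if (strlen(*known) == namelen &&
		    memcmp(*known, word, namelen) == 0)
			return 1;
	return 0;
}

/* Walk the space separated policy words.  An unknown word starting with
 * a capital means "don't write"; an unknown lower-case word is ignored. */
static enum mount_mode policy_mount_mode(const char *policies,
					 const char *const *known)
{
	enum mount_mode mode = MOUNT_RW;
	const char *p = policies;

	while (*p) {
		const char *start;
		size_t len;

		while (*p == ' ')
			p++;
		start = p;
		while (*p && *p != ' ')
			p++;
		len = (size_t)(p - start);
		if (len == 0)
			continue;	/* only trailing spaces were left */
		if (!policy_known(start, len, known) &&
		    isupper((unsigned char)start[0]))
			mode = MOUNT_RO;
	}
	return mode;
}
```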
7533
7534    - etc.
7535
7536    - what about i_version?  Include in timestamp?