index pages never get put on an LRU - how is this supposed to work?
-
-
-
-
-
--------------------------
Thoughts:
Inodes live in an address-space, much like a file. To load the
->waiting access with fs->lock after changing it to ->lru
DONE m/ Need to know which blocks in a page are in writeback so we can clear writeback
only when *all* have finished.
- n/ on phase change, uninc_next blocks need to be shared out.
+DONE n/ on phase change, uninc_next blocks need to be shared out.
NO 3o/ Make sure lafs_refile can be called from irq context.
3p/ lock all lru accesses.
3q/ Lock those index blocks!!!
DONE 17/ Make sure create inherits uid etc from process.
18/ consider ranges of holes in pending_addr.
- 20/ Implement rest of "incorporate"
- 21/ Implement staged truncate
- use for setattr and delete_inode
+DONE 20/ Implement rest of "incorporate"
+DONE 21/ Implement staged truncate
+DONE use for setattr and delete_inode
DONE 22/ block usage counts.
23/ review segment usage /youth handling and make a todo list.
a/ Understand ref counting on segments and get it right.
FIXME make sure a failed (-EAGAIN) pinning triggers a checkpoint (eventually)
FIXME make sure empty files have depth of 1.
+
+ FIXME Truncate proceeds lazily. All data blocks need to be gone
+
+26aug2008
+ If I call lafs_erase_dblock while a write is underway, we have a problem.
+ We need to wait potentially for a checkpoint to let go of the block and
+ a write to complete.
+ This should be done with waiting for PG_writeback on the page to disappear.
+ Check this out.
+
+ When end_page_writeback is called, we must have dropped all references to the
+ page.
+ When we commit to writing a block, we have to set PG_writeback on the page
+ so that truncate et al can wait for it. Before we have committed, truncate
+ can just remove the page. Internally we differentiate by B_Alloc.
+ So before setting B_Allocated we need to test_set_page_writeback(page).
+ Be careful of races.
+ I don't think we can ensure all references are dropped. After all, that is
+ the point of refcounts. So dblock array must exist without page!
+ But we need to ensure that we don't start a writeout after truncate
+ has done wait_on_page_writeback.
+ This is done with the page locked so when we want to write a page
+ in a checkpoint, we need to lock the page first. Once we have the lock,
+ we check if the page is still dirty. If it has been truncated it
+ will be clean.
+ But how do we safely reference the page if b->page can be cleared?
+ How about:
+ When we clear PagePrivate, we take a counted reference to the page
+ for db->page. This is dropped when the page is freed by lafs_refile.
+ But while it is held, it is still safe for db->page to be dereferenced.
+ So before we commence writeout we have to lock the page and set
+ PG_writeback. After locking, we need to test if writeback is still
+ appropriate.
+
+ Maybe not. I think we can submit blocks for writeout without setting the
+ page to writeback. If we do, then we need to be sure those writes
+ finish before invalidatepage calls releasepage (block_invalidatepage
+ calls discard_buffer which calls lock_buffer which waits).
+ In our case invalidatepage needs to make sure that no new write commences.
+ Maybe we should lafs_iolock_block before we allocate to a cluster and check
+ again if the block is dirty.
+
+ So:
+ lafs_cluster_allocate does:
+ lafs_iolock_block
+ check if still dirty. If not, unlock and return
+ set allocate flag
+ allocate and write
+ when write completes, allocate is cleared.
+ unlock block
+
+ invalidatepage does
+ lafs_iolock_block
+ clear Valid,Dirty,Realloc
+ lafs_iounlock_block
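+ A minimal userspace sketch of the protocol above, to check the ordering.
+ Everything named model_* is hypothetical and is not the real lafs code.
+ The lock here fails instead of sleeping, and clearing Dirty at write
+ completion is an assumption of the model.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace model of a block; not the real LaFS struct. */
struct model_blk {
	bool iolocked;	/* stands in for B_IOLock */
	bool valid, dirty, realloc, alloc;
	int writes;	/* writes started, for demonstration only */
};

/* Non-sleeping stand-in for lafs_iolock_block: fails instead of waiting. */
static bool model_iolock(struct model_blk *b)
{
	if (b->iolocked)
		return false;
	b->iolocked = true;
	return true;
}

static void model_iounlock(struct model_blk *b)
{
	b->iolocked = false;
}

/* Writeout side: lock, re-check Dirty, set the allocate flag, start the write.
 * Returns true if a write was started; the lock stays held until completion. */
static bool model_cluster_allocate(struct model_blk *b)
{
	if (!model_iolock(b))
		return false;
	if (!b->dirty) {
		model_iounlock(b);
		return false;
	}
	b->alloc = true;
	b->writes++;
	return true;
}

/* Write completion: clear the allocate flag, then unlock.
 * Clearing Dirty here is an assumption of the model. */
static void model_write_complete(struct model_blk *b)
{
	b->alloc = false;
	b->dirty = false;
	model_iounlock(b);
}

/* Truncate side: returns false where the real code would wait for the lock. */
static bool model_invalidate(struct model_blk *b)
{
	if (!model_iolock(b))
		return false;
	b->valid = b->dirty = b->realloc = false;
	model_iounlock(b);
	return true;
}
```

+ With this ordering an invalidate that arrives mid-write has to wait, and a
+ writeout that arrives after an invalidate finds the block clean and backs off.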
+
+
+
+2008 aug 28 - happy birthday.
+FIXME segsum_find calls lafs_reserve_block without a checkpoint lock.
+lafs_prealloc complains.
+
+ mark_cleaning does too, but cleaning only happens well away from a checkpoint
+ lock.
+segsum_find is being called to reference a new segment when we flush a cluster.
+ segment usage blocks are special. Their index information doesn't
+need to be written out in the current checkpoint. We can do that, but
+the backstop is to write just the data block in the tail of the
+checkpoint and write indexing information later.
+
+2008sep10
+ unlink is getting "No space left on device". This is when trying to
+ pin the directory block, the physaddr is 0, so it looks like we want
+ NewSpace. But we shouldn't even be trying to prealloc in that case because
+ there should already be a prealloc on the block. i.e. there should be
+ credits.
+ Hmmm. after multiple 'syncs' how can the block not be written out.
+ Maybe it is embedded in the inode?
+ When we pin a block that was embedded in the inode it isn't clear what to
+ do. If we might grow the file so it doesn't fit any more, we need to
+ allocate NewSpace. If we know it won't grow, we use Release.
+ This still needs a proper fix.
+
+ Cleaning seems to be working nicely. However we don't get all the space
+ back that we should because lots of blocks still have credits that
+ aren't being returned.
+
+ So when should credits be returned?
+ They are set when a block is pinned. It then gets dirtied which
+ consumes a credit. Then gets unpinned. I guess if it isn't pinned,
+ then it doesn't need any credits.
+
+
+ It seems that cluster_flush is not always writing things in the correct
+ order. Root gets written before some other things below it.
+ Maybe they are temporarily out of the loop??
+ No. There are dirty blocks which one checkpoint doesn't pick up, but
+ they aren't holding the index block pinned. so they lose allocation.
+
+ But they must hold the indexblock pinned, even though they aren't pinned
+ themselves. We maybe do this just with the refcnt... maybe. That will cause
+ it to phase-flip rather than drop pinning, which I think is right.
+
+ So: too many credits remain allocated. Where are they? There are 1464
+ outstanding credits. 290 are in the tree so 1200 or so are elsewhere??
+ But things removed from the tree have credits removed.
+
+
+
+FIXME roll forward ignores inodes. But what about an inode that contains
+ data. Should that be ignored? I think not.
+FIXME delete adir/big2 then delete adir and it cannot release:
+ Cannot release [cee29000]74/0(0)r1:Pinned,Phase0,Valid,Dirty,SegRef,UninCredit,PhysValid,Prealloc
+ presumably there is orphan processing or something to complete???
+FIXME when files are deleted, the space isn't returned!
+ This seems to be mostly fixed - need to test.
+FIXME when I "rm [b-z]*" it waits for writeback on something???
+ zfile again!!! OK, I think that is fixed.
+
+
+12sep2008
+ Current problem:
+ seg_apply_all dirties dblocks. When should they be reserved?
+ They originally get reserved by a lafs_reserve_block call in
+ segsum_find called from e.g. lafs_seg_ref which is called by lafs_reserve_block.
+ However: that block might get written before *and* after a checkpoint.
+ So we need N* Credits. These are usually only used for Index blocks.
+ We can set these easily enough if inode type is TypeSegmentMap.
+ We move them across to Credit in seg_apply_all.
+ But when do we clear them if they aren't needed? I guess
+ when we drop the last segref. Yes, we already do that.
+ FIXME need to make sure these get flushed on next checkpoint
+ if we cannot allocate new credits after a checkpoint.
+
+ New Problem. The 'cleanable' table reports a size of 3, but it is empty!
+ Think that is fixed.
+
+ Some problems.
+ 1/ see above: rm x/y; rmdir x -> BUG - FIXED
+ 2/ Spins on 'CURRENT=1' ??
+ 3/ if alloc_space gives EAGAIN while deleting, we don't survive.
+ 4/ When I create/delete a file, ablocks_used increments by one.
+ The inode hasn't been allocated yet, so it seems the deallocation
+ isn't adjusting ablocks_used??
+ 5/ open_namei (for dd) got caught on a mutex_lock.
+ 6/ When a large file is shrunk we don't reduce the level of the InoIdx block
+ I'm not sure where we should and am not thinking very clearly.
+ Will fudge something in flush_data_to_inode for now, but it MUST be fixed.
+ 7/ unlink (at least) can get stuck in iolock_block. Who could be holding
+ the lock? Writeout that hasn't completed?
+ Yes. writepage calls lafs_allocated_block without calling flush.
+ So the block could be sitting waiting for a flush. How long do we
+ wait??
+ 8/ It seems that some datablock can need NCredits. Make sure these
+ are handled properly re flush-or-refill after checkpoint and
+ flip_phase rather than unpin.
+ 9/ Maybe after lafs_writepage cluster_flush isn't getting called soon
+ enough, and we lock up (see 7). Need to flush the first block
+ straight away, and the next one as soon as the first finishes, etc.
+ Or something like that. Then remove the comment from lafs_writepage.
+
+8th December 2008
+
+ I seem to be getting only 4 blocks to a cluster at the moment.
+ This is good as it motivates the code to handle block splitting in
+ the Btree. But it shouldn't happen.
+
+ ....
+ Block splitting might work - it doesn't crash at least.
+ But
+ After deleting all files, the tree is full of stuff.
+ Lots of inode data/InoIdx blocks.
+ Many but not all are Pinned. The others are OnFree
+ The Pinned ones have outstanding references.
+ Others
+
+ ....
+ Problem with the block splitting, when adding an index block.
+ The index block is initially empty - we need to find things by looking
+ at children. But we don't. We BUG_ON the iphys==0.
+ In general, when we add a block below an index block and before we incorporate,
+ the block must be found by finding the first indexed block and looking to
+ see if there is a 'next' block that contains the address we need.
+ FIXED
+
+ But if we truncate a file while an index block is pinned and dirty,
+ we spin on trying to incorporate it, which should make it empty.
+
+11th December 2008
+ deadlock.
+ sync is trying to get lock in lafs_cluster_flush
+ pdflush holds the lock and is stuck in cluster_flush_0xa40
+ some wait_event I expect.
+ Maybe we need an unplug ??
+
+ - checkpoint/seg_apply_all/dirty_dblock doesn't have the credits.
+ This is in clean_free. We try to update the 'youth' to mark
+ the segment as free, and we don't have a reservation to do it.
+ Maybe just reserve it there and then.
+
+
+12th December 2008
+ When doing a lookup in an index block, we need to check the unincorp
+ address list. It isn't enough to look for unincorp blocks as they
+ might have disappeared.
+ For INDIRECT and EXTENT this is easy enough as full information is in
+ 'uninc'.
+ For INDEX it is a little tricky as we need to look at the full set of
+ addresses to know where a particular address fits.
+ We could force an incorporate first, but that has awkward implications
+ if it requires a split.
+ Maybe if we get from the lookup "start+range"....
+ That is not enough as the 'start' might get zeroed by an update.
+
+
+ rm adir/* doesn't work as readdir doesn't get all the entries
+ for some reason.
+ Reason is that they are being put in the wrong block.
+ lafs_find_next doesn't correctly find the 'next' block if it
+ hasn't been incorporated yet.
+ Block can be:
+ in index tree -- easy to find
+ in uninc_table -- not too hard
+ only in the ->children list, or attached to a page.
+ It would be nice to use find_get_pages but that isn't exported so try
+ something else for now.
+ For index blocks
+ Look in index block for 'next
+
+15th December 2008
+ FIXME when we split an index block, we need to hold a reference to
+ the original so it doesn't disappear until the split-off copy is
+ written. This is because we search from an index block to find
+ split-off copies.
+ [ note from Feb09. This should be OK now. Both will need
+ incorporation, and we now hold on to blocks until they are
+ incorporated.]
+
+
+
+23rd February 2009
+ - index block. What changes are allowed exactly.
+ - splitting certainly makes sense.
+ - merging two adjacent blocks is fine, of which a special case
+ is finding that a block is empty and so removing it.
+ - What about a 2->3 split which would require removing a block
+ and adding another at the same time?
+ or noticing that the first blocks addressed are all missing, so
+ moving the index forward?
+ In each case, searching down by indexes will find a block that
+ has been replaced by a later address. We could manage that as
+ long as the new block is attached after the replaced block.
+ So we cannot move a block. We must delete and replace.
+
+ - unincorporated index blocks..
+ unincorporated data blocks are not pinned in memory. Once they have
+ been written out, they can be freed. Their address is stored in the
+ uninc-table. This means we can delay incorporation while many
+ extents are written out and freed. When we come to incorporate, we
+ may have many hundreds of addresses in a few extents that can be incorporated
+ efficiently without holding all that data pinned in memory.
+ The same scale doesn't apply to index blocks. An index block can
+ reference only 102 blocks (for 1K block size). And the uninc table can
+ hold far fewer so we will naturally incorporate more often.
+ So keeping index/indirect/extent blocks pinned until they are incorporated
+ is reasonable. And it makes lookup a lot easier, as we have
+ guarantees about ordering of block in the children list that we
+ don't have in the uninc table.
+
+ Incorporation could have some atomicity issues. There is no
+ concern about bad stuff appearing on disk as the phase-change
+ process handles that. In memory it might be awkward if we split
+ an index block before incorporating a block that would span them.
+ That could conceivably happen if we only incorporate 8 blocks
+ (size of uninc table) at a time.
+ So maybe we should incorporate a full uninc list (not table) at
+ a time.
+ This means quite different code paths for incorporating leaf
+ and internal index blocks....
+
+
+ - uninc_table lists are a real problem.
+ They can only be created during roll-forward so they hardly ever
+ happen.
+ But if the block is split while processing earlier things on the
+ list, then splitting an uninc table would be very messy.
+ Is there any way around this?
+ Why not just do incorporation during roll-forward?
+ We only need to incorporate leafs, not internal blocks because we
+ don't use uninc_table for internal blocks any more.
+ So during roll forward, all index blocks that are touched need to
+ be held in cache...
+ I think we live with that. If it ever becomes a problem, we will
+ need to perform the roll-forward twice. The first time collects
+ the usage information so that we know where we can start writing,
+ then the second just applies all the changes to the rest of the
+ filesystem.
+
+
+ So:
+ uninc table only used for leaves, and has no linked list
+ unincorporated index block are stored on a list, which we
+ sort before applying.
+ All uninc index blocks are therefore kept in the index tree.
+ Their order on the children list allows us to find the correct
+ index. Each block for which the fileaddr is in the parent is
+ followed by any blocks that have been split off and end after
+ this one starts. Blocks that have been emptied are Hole and are
+ skipped over when looking for a block.
+
+ When we split an internal block, the remaining uninc blocks
+ must not start with a Hole.
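+ The lookup rule above (find the block whose fileaddr is in the parent,
+ then walk forward over split-off blocks, skipping Holes) can be sketched
+ over a flat array standing in for the ->children list. The struct and
+ function names are hypothetical, and the array stands in for the real
+ linked list.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of one entry on the ->children list. */
struct model_child {
	unsigned int fileaddr;	/* first file address covered by the block */
	bool in_parent;		/* fileaddr is recorded in the parent index */
	bool hole;		/* block has been emptied; skip it on lookup */
};

/* Return the index of the child that should hold 'target', or -1. */
static int model_find_child(const struct model_child *c, int n,
			    unsigned int target)
{
	int start = -1, best = -1, i;

	/* Step 1: what the on-disk index would give us, i.e. the last
	 * indexed block starting at or before the target. */
	for (i = 0; i < n; i++)
		if (c[i].in_parent && c[i].fileaddr <= target)
			start = i;
	if (start < 0)
		return -1;

	/* Step 2: walk forward over blocks split off from it, stopping
	 * at the next indexed block and skipping Holes. */
	for (i = start; i < n; i++) {
		if (i > start && c[i].in_parent)
			break;
		if (c[i].hole)
			continue;
		if (c[i].fileaddr <= target)
			best = i;
	}
	return best;
}
```

+ This only works because of the ordering guarantee on the children list,
+ which is the guarantee the uninc table does not give.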
+
+ FIXME: what locking do I need around lafs_incorporate?
+ i_mutex?? i_alloc_sem??
+ i_alloc_sem is imposed by truncate (inode_setattr) and
+ direct_io possibly. So it is really about adding/removing
+ blocks. Not updating internals.
+ Maybe our own mutex. Could even be per-index-block !!
+ Whatever it is, we need to protect walking ->children too.
+
+
+24th February 2009
+ "rm -r" problem from 12/dec/2008 fixed now.
+ incorporate code got a make-over and is probably much better.
+
+ New problems: After test runs, cannot create files due to no space
+ on devices!! But directory tree is empty.
+ I can see:
+
+ free_blocks=3256 allocated=1425 max_seg=512 clean_reserved=0
+
+ The problem is that we think 1425 has been allocated to data that
+ might still need to be written, leaving not enough room for more.
+ Index Dump shows
+ ====================414 credits ==============================
+ which doesn't explain everything, but does explain a lot. There
+ really should be nothing in the Index tree (except fs-root and
+ tree-root)
+ There is also:
+ Some inodes which are OnFree and hold no credits.
+ 0 DATA (1) 52 [0]ESegRef,Claimed,PhysValid
+ 52 1 (0) 0 [2564]{0,00000000}L on free Index(1),InoIdx,OnFree,PhysValid
+
+ Some other inodes which are pinned with lots of credits and are
+ on the phase_leaf list
+ 0 DATA (1) 299 [0]ESegRef,C,CI,Claimed,PhysValid
+ 299 1 (1) 0 [0]{0,00000000} [0, 0]L Leaf1(40) Index(1),Pinned,Phase1,InoIdx,Valid,SegRef,C,CI,CN,CNI,PhysValid,Prealloc
+
+ And that is about it. some are not Valid, some are...
+ checkpoint just wants to 'flip' them.
+ They mostly have a refcnt of 1... I wonder who is holding that....
+ The reference on the dblock is held by the iblock.
+ But what is the iblock remaining? Who holds that reference?
+
+ I restored some code to clean iblock, and now:
+ free_blocks=3229 allocated=1277 max_seg=512 clean_reserved=0
+ ====================244 credits ==============================
+ which saved 170 credits. That helps.
+ There seem to be many fewer of the many-credits blocks
+ Lot of index blocks in tree are 'OnFree' and have a
+ 0 refcnt, but haven't been removed. Why?
+ It seems that they have ->parent == NULL, so lafs_refile never
+ bothers to remove them. I guess it should...
+ OK, lots of InoIdx block have gone now with their DATA blocks.
+
+ So, remaining blocks are pinned to their phase with lots of Credits,
+ have no pincnt, mostly have physaddr==0.
+ It is just the stray refcnt that keeps them there..
+ inums are 40, 56, 62-73, 275-278, 280
+ 40 is f22
+ 56 is first adir
+ 63-69 are directories 2/3/4/5/6/7/8/9
+ 70-73 are looooong symlinks
+ 275 is cfile
+ 276 is dfile - same as cfile but truncated.
+ Then some nbfile-X that were big enough.
+
+ So: what do they have in common:
+ Several only use the in-inode data block, but
+ probably not all
+
+ Can it be that it is refcounted on the Leaf list, and so
+ cannot get off?? Yes, I think so!
+ We only unpin things that have a zero refcount.
+
+ So: what to do?
+ checkpoint takes it off the list, then flips the phase and puts it
+ on the other list with refile. During that time it has a refcount
+ so it doesn't lose the pinning.
+ Do we want to:
+ 1/ Not have it on the list despite being pinned.
+ 2/ Drop the PIN despite the refcnt.
+ 3/ have refile do the phase_flip so it has a chance to
+ notice the refcount has hit zero.
+
+ 2 isn't really an option. We need PIN to persist whenever we have
+ a reference. We could possibly use PinPending for index blocks too,
+ but that would require a lot of thinking.
+ 1 requires another criterion for being on the list. I suspect that would
+ get messy fast.
+ 3 we used to do I think... But refile is in a big lock, and we
+ cannot really do a phase_flip under that.. and phase flip calls
+ refile anyway so we would get recursion.
+ So:4 - get lafs_phase_flip to notice and de-pin rather than flip.
+
+ FIXME use kzalloc where appropriate.
+
+ FIXME Maybe test refcnt-!listempty in refile and de-pin if that is zero.
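+ A sketch of option 4 combined with the refcnt-!listempty test: discount
+ the reference held by the leaf list, and de-pin instead of flipping when
+ nothing else holds the block. The struct, the field names and the
+ "not dirty" condition are assumptions for illustration, not the real
+ lafs_phase_flip.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of a pinned index block. */
struct model_iblk {
	int refcnt;		/* total references, including the list's */
	bool on_leaf_list;	/* the _leaf list holds one reference */
	bool pinned;
	bool dirty;
	int phase;		/* 0 or 1 */
};

/* References held by anyone other than the leaf list. */
static int model_external_refs(const struct model_iblk *b)
{
	return b->refcnt - (b->on_leaf_list ? 1 : 0);
}

/* Returns true if the block was flipped to the next phase,
 * false if it was de-pinned instead. */
static bool model_phase_flip(struct model_iblk *b)
{
	if (model_external_refs(b) == 0 && !b->dirty) {
		b->pinned = false;	/* only the list held it: let it go */
		return false;
	}
	b->phase = !b->phase;
	return true;
}
```

+ A block with refcnt 1 that is only on the list gets de-pinned; anything
+ with a real holder, or still dirty, flips phase and stays pinned.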
+
+25th February 2009
+ Good progress.
+ Only 54 credits in Index Tree now.
+ Inodes 1 2 8 16 are present. (fsroot, dirroot, inodemap, segusage)
+ plus '74', which seems to be scheduled for deletion - root has uninc_table.
+ ... and 'sync' got rid of that and left 44 credits.
+ Also have data blocks for inode 50 55 72 73 74 with 2 credits of 74.
+ 50 link
+ 55 zfile
+ 72 long84
+ 73 long85
+ 74 adir
+ These seem to be the files that used data-in-the-inode
+ They still have a refcnt of 1 (or 2 for adir).
+ ... OK, that's gone now. I found a refcount leak.
+
+ So now: 42 Credits in Index Dump. No stray files.
+
+ df: tot=4608 free=4597 avail=3045(4130-1085) cb=8 pb=0 ab=3
+ So we still seem to have 1085 blocks allocated. 42 are accounted
+ for, so 1043 still missing... either we lost the count, or lost the tree.
+
+ create a tiny file, remove, and sync, now
+ df: tot=4608 free=4597 avail=3018(4118-1100) cb=8 pb=0 ab=3
+
+ so I lost 15, but now 48 are in tree. Let's try again...
+ df: tot=4608 free=4597 avail=3006(4108-1102) cb=8 pb=0 ab=3
+ and 44 in tree
+ and again:
+ df: tot=4608 free=4597 avail=2986(4098-1112) cb=8 pb=0 ab=3
+
+ Definitely losing more than the difference in the tree.
+
+ Try creating empty files...
+df: tot=4608 free=4597 avail=2986(4098-1112) cb=8 pb=0 ab=3
+df: tot=4608 free=4597 avail=2974(4088-1114) cb=8 pb=0 ab=3
+df: tot=4608 free=4597 avail=2954(4078-1124) cb=8 pb=0 ab=3
+df: tot=4608 free=4597 avail=2942(4068-1126) cb=8 pb=0 ab=3
+df: tot=4608 free=4597 avail=2922(4058-1136) cb=8 pb=0 ab=3
+df: tot=4608 free=4597 avail=2910(4048-1138) cb=8 pb=0 ab=3
+df: tot=4608 free=4597 avail=2890(4038-1148) cb=8 pb=0 ab=3
+
+ very strong pattern there.
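+ To make the pattern explicit, the seven empty-file samples above can be
+ checked mechanically. The field names are my own labels for the
+ avail(x-y) numbers in the df line; only the second bracketed number is
+ known to be the allocated count.

```c
#include <assert.h>
#include <stddef.h>

/* One "df:" line, read as avail(limit-allocated). Labels are assumptions. */
struct df_sample {
	int avail;
	int limit;
	int allocated;
};

/* The seven samples from the empty-file runs, in order. */
static const struct df_sample empty_file_runs[] = {
	{ 2986, 4098, 1112 }, { 2974, 4088, 1114 }, { 2954, 4078, 1124 },
	{ 2942, 4068, 1126 }, { 2922, 4058, 1136 }, { 2910, 4048, 1138 },
	{ 2890, 4038, 1148 },
};

/* A sample is self-consistent if avail equals limit minus allocated. */
static int df_consistent(const struct df_sample *s)
{
	return s->avail == s->limit - s->allocated;
}

/* Growth in the allocated count between sample i-1 and sample i (i >= 1). */
static int df_alloc_delta(const struct df_sample *s, size_t i)
{
	return s[i].allocated - s[i - 1].allocated;
}

/* Drop in the bracketed limit between sample i-1 and sample i (i >= 1). */
static int df_limit_drop(const struct df_sample *s, size_t i)
{
	return s[i - 1].limit - s[i].limit;
}
```

+ The limit drops by 10 every run while allocated alternates +2, +10, so
+ avail falls by 12, then 20, then 12, and so on.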
+ What about 2 files at a time.
+df: tot=4608 free=4597 avail=2879(4028-1149) cb=8 pb=0 ab=3
+df: tot=4608 free=4597 avail=2860(4018-1158) cb=8 pb=0 ab=3
+df: tot=4608 free=4597 avail=2849(4008-1159) cb=8 pb=0 ab=3
+df: tot=4608 free=4597 avail=2830(3998-1168) cb=8 pb=0 ab=3
+df: tot=4608 free=4597 avail=2819(3988-1169) cb=8 pb=0 ab=3
+
+ Slightly different pattern - not as bad.
+ Have to try 4 now.
+df: tot=4608 free=4597 avail=2802(3978-1176) cb=8 pb=0 ab=3
+df: tot=4608 free=4597 avail=2793(3968-1175) cb=8 pb=0 ab=3
+df: tot=4608 free=4597 avail=2776(3958-1182) cb=8 pb=0 ab=3
+df: tot=4608 free=4597 avail=2767(3948-1181) cb=8 pb=0 ab=3
+
+ Strange, isn't it....
+
+ Making sure we clear UnincCredit... result looks worse.
+
+26th February 2009
+ I fixed up the credit accounting 'incorporate' and then fixed a couple
+ more little bugs. And now:
+
+
+
+====================48 credits ==============================
+df: tot=4608 free=4597 avail=3172(3940-768) cb=10 pb=0 ab=1
+
+So we still have 720 allocated credits that aren't accounted for.
+But we are nicely under 100...
+
+.... and now
+
+
+====================76 credits ==============================
+df: tot=4608 free=4256 avail=2160(2402-242) cb=350 pb=0 ab=2
+
+That is different. The count of missing blocks is way down,
+but there is some extra cruft in the index tree.
+Quite a few like
+ 0 DATA (1) 303 [0]L Leaf1(13) SegRef,Claimed,PhysValid
+ 0 DATA (1) 302 [0]L Leaf1(14) SegRef,Claimed,PhysValid
+and even one
+ 0 DATA (2) 330 [0]L Leaf1(1) SegRef,C,CI,Claimed,PhysValid
+ 330 1 (1) 0 [0]{0,00000000} [0, 0]L Leaf1(0) Index(1),Pinned,Phase1,InoIdx,SegRef,C,CI,CN,CNI,PhysValid,Prealloc
+Time for a commit though....
+
+and now
+====================46 credits ==============================
+df: tot=4608 free=4257 avail=4253(4458-205) cb=350 pb=0 ab=1
+
+so the strays in the index tree are gone. but still have 159 outstanding
+credits.
+No change but now
+====================36 credits ==============================
+df: tot=4608 free=4256 avail=3787(3885-98) cb=350 pb=0 ab=2
+
+
+That is a little weird...
+Hmmm. back to
+====================48 credits ==============================
+df: tot=4608 free=4257 avail=4247(4458-211) cb=350 pb=0 ab=1
+
+Oh well.
+====================34 credits ==============================
+df: tot=4608 free=4257 avail=3176(3373-197) cb=350 pb=0 ab=1
+
+It seems that the unaccounted blocks are (or can be) created by
+writing to a file then removing the file without a sync.
+..but why is cb (cblocks_used) so high?
+
+27th February 2009
+
+ Got onto a bit of a tangent...
+ What happens if we truncate a block while it is on a list to
+ be cleaned? Clearly we want the cleaner to drop it ASAP.
+ But what if invalidate_page wants to drop it *now*
+ Hopefully it is either still on clean_leafs and we can remove it,
+ or it is now iolocked and we can wait for it. So should be OK.
+
+ I keep getting caught in "looping on..."
+ We are truncating an inode and some index block which is now empty
+ is not getting removed from the tree because there is an outstanding
+ reference.... 327/0 depth=1. I guess I turn on the tracing.
+
+ ... and it seems that it is in the process of checkpointing.
+ I guess I need to lock against that ... maybe with the iolock.
+
+Credits = -1, rv=2
+ib = [ce814e40]328/0(2552)r3:Index(1),Pinned,Phase1,Valid,Dirty,CI,CN,CNI,UninCredit,PhysValid,Uninc{0,0}[0]
+------------[ cut here ]------------
+kernel BUG at /home/neilb/work/nfsbrick/fs/module/segments.c:371!
+
+ -------
+ Every time I create/delete a file, I get an extra 'ab' which disappears
+ on 'sync'.
+ ablocks_used is:
+ decremented when +ve summary_update on non-index
+ increased on lafs_summary_allocate... should not be done for index blocks.
+
+ OK: after test run, filesystem is empty, but cblocks_used is around 360.
+ cblocks_used:
+ is loaded at mount time
+ collects pblocks_used on a phase flip
+ is updated in lafs_summary_update (unless pblocks is)
+ So we must be missing a lafs_summary_update when phys->0
+
+
+ Lots of problems:
+ truncating big (multi-level index) seems to be bad
+ Leaves 'pb-338 !!! and cb+689, even after sync.
+ still 'looping on' occasionally
+ Haven't found cblocks_used leak yet.
+ Occasionally non-B_Valid blocks are acted on.
+ I think I need to improve io locking.
+
+---------------
+1st March 2009
+ Need some improvements to iolock locking.
+ We use this lock to wait for a block to be written out (if that is happening)
+ before we allow lafs_invalidate_page to complete.
+ It is also used in lafs_erase_{d,i}block (Similar purpose)
+ We take the lock in lafs_cluster_allocate, and then make sure the block is
+ still dirty.
+
+ Also lock in lafs_new_inode as initing the inode is a form of IO ??
+ load_block takes the lock
+ We only clear_bit(B_Valid, ) under this lock.
+
+ So the issue is this:
+ A block that is going to be written is passed to lafs_cluster_allocate.
+ This happens either after taking it off a _leafs list, or when
+ lafs_writepage requests the write.
+
+ lafs_invalidate_page needs to be able to release the page, so there needs to
+ be no transient references. In particular, once the block has been
+ removed from a _leafs list it must already be iolocked.
+ Invalidate_page can then either remove from that list and erase the block,
+ or use io_lock_block to wait for the IO to complete.
+ So when a datablock comes out of get_flushable it must be iolocked, and must
+ remain iolocked until after Dirty and Alloc are clear
+ Index blocks belong entirely to the fs, so we can be more relaxed with them.
+ If get_flushable finds the block already iolocked, it is either being invalidated
+ or already has IO pending, so it can be dropped.
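+ A sketch of the get_flushable rule, again as a userspace model with
+ hypothetical names: a block only leaves the _leafs list unlocked-by-us
+ if someone else already holds its iolock, so "off the list" always
+ implies "iolocked". An array stands in for the list.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of a data block on a _leafs list. */
struct model_fblk {
	bool iolocked;	/* stands in for B_IOLock */
	bool on_leafs;	/* still on the _leafs list */
};

/* Take the next flushable block off the list.
 * Returns its index with the iolock held by the caller, or -1 if none.
 * Blocks found already iolocked are being invalidated or have IO
 * pending, so they are dropped from the list and skipped. */
static int model_get_flushable(struct model_fblk *b, int n)
{
	int i;

	for (i = 0; i < n; i++) {
		if (!b[i].on_leafs)
			continue;
		if (b[i].iolocked) {
			b[i].on_leafs = false;	/* someone else owns it: drop */
			continue;
		}
		b[i].iolocked = true;		/* lock first ... */
		b[i].on_leafs = false;		/* ... then leave the list */
		return i;
	}
	return -1;
}
```

+ The caller keeps the lock until Dirty and Alloc are clear, so
+ invalidate_page either finds the block still on the list or can wait
+ on the iolock.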
+
+
+16th March 2009
+
+ FIXME When we sync a small file, we just write out the inode.
+ rollforward currently ignores data in inodes I think.
+ This needs to be fixed to ensure this data is safe.
+
+ - stop iblock from disappearing so much.
+
+ - I think...
+ While cleaning a file, I truncate it. This makes it appear
+ to fit in the inode but it is very big and we get confused.
+ We cannot allocate block 0 until all the others have been
+ allocated to 0 and forgotten.
+ But what if we truncate a file to 10 bytes, then fsync?
+ We need to write the data promptly, but we like doing truncate
+ in the background.
+ When we extend a file we already need to wait for truncation
+ to complete (FIXME do we do that?) We could wait on fsync too.
+ We cannot just delay block0 as it might be part of a checkpoint
+ that has to complete promptly while truncation can take a long time.
+ i.e. we have a very large file. We update the first byte, then
+ truncate to 2 bytes.... we don't need to write until fsync which will wait...
+ Directory?? delete lots of entries so it shrinks to one block?
+ There is no delayed truncate there.
+ ?? Never clean an I_Trunc file.
+ If we try to allocate a file with other indexes:
+ clear Realloc
+ if Dirty and Pinned, just do normal alloc
+ if Dirty and not pinned, skip.
+
+
+ Sometimes I run out of credits while truncating a file.
+ I need credits - maybe only briefly - to dirty the index blocks.
+ -- FIXED I think.
+
+ An indexblock remains pinned while the refcount is non-zero.
+ A pinned index block can be on a _leaf lru
+ The _leaf lru holds a refcount.
+ This is an awkward referential loop.
+ We break it at checkpoint time with special code in phase-flip.
+ But there are other awkward times such as truncate.
+
+ We cannot use PinPending like we do with data blocks because there
+ could be multiple pending Pins (from different children).
+
+ We could possibly treat checkpoint_lock like pinpending, but that
+ might be racy.
+
+ We could not count the _leaf lru, but that might just make the race
+ harder to find.
+
+ I think we want to explicitly drop the pin when we truncate a block.
+ Normally, once we Pin an index block it will become dirty so we don't
+ want to de-pin before a checkpoint anyway...
+
+ Just to clarify: an index block gets dePinned:
+ - during checkpoint on a phase_flip if it is no longer dirty etc
+ - on truncation when we erase it
+ - during pre-emptive write-out which is a bit like an early phase_flip
+ not sure that we implement that one yet.
+
+17th March 2009
+ Deadlock?
+ - checkpoint calls incorporate calls erase_iblock calls iolock_block
+ - rm calls orphan_pin calls phase_wait
+ The problem is in lafs_incorporate. It expects the block to be iolocked,
+ but can call erase_iblock which tries to get an iolock itself...
+ ...fixed that and it still happens.
+ checkpoint calls phase_flip calls allocated_block (on uninc list) calls
+ iolock_block before calling incorporate
+ Maybe all of these should assume an IO lock.
+
+ FIXME truncate assumes truncate-to-zero. We need proper ftruncate support.
+
+ It nearly works....
+ Things to do:
+ - sort out individual patches and review DONE
+ - allow compilation without refcount tracking DONE
+ - don't hold a 'leaf' reference. NO
+ - clean up *ref calls - differentiate those that can be called when zero DONE
+ - use enum for B_* DONE
+ - support truncate to non-zero offset DONE
+ - "looping on" found an 'OnFree' block!
+ - clean out lot of debugging
+
+ Hmmm.... deadlock.
+ rmdir is holding i_mutex and waiting for a phase change to pin a dblock.
+ checkpoint is also holding i_mutex.. or is trying to get one in lafs_cluster_allocate.
+ Not cool. i_mutex must not be taken by checkpoint
+ Fixed that, though it is a bit of a hack....
+
+ New deadlock: checkpoint calls phase_flip which calls allocate_block,
+ to move the uninc_next across, and that tries to iolock the parent to
+ perform a partial incorporation. But that seems to be iolocked.
+ Generally that is ugly as ->uninc_next might be very long and require
+ multiple splits, and direct-driving that from phase_flip is bad.
+ I should just move the list across
+
+
+19th March 2009
+ Spent too long trying to remove refcount held by *_leaf lists.
+ This leaves InoIdx block with zero refcount so Data block can get
+ lost and bad things happen.
+ I might be able to fix it up, but it is probably better to try the
+ checkpoint_lock approach if I can only remember what that is.
+
+Locking:
+ Available locks:
+
+ Spin:
+
+ lafs_hash_lock
+ Used in:
+ lafs_shrinker
+ lafs_refile ???
+ Protects:
+ ib->hash
+ ->lru when on freelist
+
+ i_data.private_lock
+ Used in:
+ lafs_shrinker
+ Protects:
+ ->iblock / refcnt
+ ->dblock / my_inode
+ ->children / ->parent within an inode
+ setting ->private
+
+ fs->alloc_lock
+ fs->allocate_blocks
+
+ fs->stable_lock
+ segsum hash table
+ segsummary counters (in blocks)
+
+ fs->lock
+ _leafs lru
+ ->pending_blocks lru - should this be wc->lock ??.. not in 'bh'
+ Pinned consistent with lru
+ ->checkpointing / ->phase_locked
+ fs->pending_orphans
+ ->uninc and ->chain ?? Should use parent->B_IOLock ??
+ uninc_table - should use B_IOLock
+ free list / clean list segtrack
+
+ Mutex:
+
+ fs->wc->lock
+ wc[0] .. something in prepare_checkpoint
+ ->remaining etc
+ cluster_flush
+ mini blocks
+
+ i_mutex
+ inode_map
+ orphans
+
+ Other:
+
+ B_IOLock
+ erase_block
+ incorporate
+ cluster_allocate
+ allocated_block
+ IO
+ Phase flip
+ Initialising new inode
+ B_IOLockLock
+ IOLock across a page
+
+
+--------------------
+This is a list from 18 months ago, with updates
+
+ - Understand how superblock 'version' should be used.
+
+ - Review and fix up all locking/refcounts. See locking.doc
+ Also lock inode when copying in block 0 and probably
+ when calling lafs_inode_fillblock (??)
+ - lafs_incorporate must take a copy of the table under a lock so
+ more allocations can come in at any time.
+
+ - We don't want _allocated to block during cluster flush. So have
+ a no-block version and queue blocks on ->uninc if we cannot
+ allocate quickly. Find some way to process those ->uninc blocks.
+
+ - Use above for phase_flip so that we don't need to _allocated there.
+
+ - Utilise WritePhase bit, to be cleared when write completes.
+ In particular, find when to wait for Alloc to be cleared if
+ WritePhase doesn't match Phase.
+ - when about to perform an incorporation.
+ - make sure we don't re-cluster_allocate until old-phase address has
+ been recorded for incorporation.
+
+ - allocate multiple WAIT_QUEUE_HEADS for 'block_wait'
+
+ - Can inode data block be on leafs while index isn't, what happens if we
+ try to write it out...
+
+ - If InoIdx doesn't exist, then write_inode must write the data block.
+
+ - document and review all guards against dirtying a block from a previous phase
+ that is not yet safe on storage.
+ See lafs_dirty_dblock.
+ - check for proper handling of error conditions
+ b/ if lafs_seg_dup fails in do_checkpoint, we need to abort the snapshot.
+ - review checkpoint loop.
+ Should anything be explicit, or will refile do whatever is needed?
+ - Waiting.
+ What should checkpoint_unlock_wait wait for?
+ When do we need to wait for blocks to change state? And how?
+
+ - load/dirty block0 before dirtying any other block in depth=0 file
+
+ - use kmem_cache for 'struct datablock'
+ - indexblock allocation.
+ use kmem_cache
+ allocate the 'data' buffer late for InoIdx block.
+ trigger flushing when space is tight
+ Understand exactly when make_iblock should be called, and make it so.
+ - use a mempool for skippoints in cluster.c
+ - Review seg addressing code in cluster.c and make sure comments are good.
+ - consider ranges of holes in pending_addr.
+
+ - review correct placement of state block given issues with stripes.
+
+ - review segment usage /youth handling and make a todo list.
+ a/ Understand ref counting on segments and get it right.
+ - Choose when to use VerifyNull and when to use VerifyNext2.
+ - implement non-logged files
+ - Store accesstime in separate (non-logged) file.
+ - quotas.
+ make sure files are released on unmount.
+
+ - cleaner.
+ Support 'peer' lists and peer_find. etc
+ - subordinate filesystems:
+ a/ ss[]->rootdir needs to be an array or list.
+ b/ lafs_iget_fs need to understand these.
+ - review snapshots.
+ How to create
+ how they can fail / how to abort
+ How to destroy
+ - review unmount
+ - need to clean up checkpoint thread cleanly - be sure it has fully exited.
+ - review roll-forward
+ - make sure files with nlink=0 are handled well.
+ - sanity check various values before trusting clusters.
+
+ - Configure index block hash_table at run time based on memory size??
+ - striped layout.
+ Review everything that needs to handle laying out at cluster
+ aligned for striping.
+
+ - consider how to handle IO errors in detail, and implement it.
+ - consider how to handle data corruption in indexing and directories and
+ other metadata and guard against problems (lot of -EIO I suspect).
+
+ - check all uninc_table accesses are locked if needed.
+
+ - If a datablock is memory mapped writeable, then when we write it out,
+ we need to either fill up its credits again, or unmap it.
+ - Need to handle orphans asynchronously.
+
+ - support 'remount'
+ - implement 'write_super' ??
+
+ - pin_all_children has horrible gotos - remove them.
+
+ - perform consistency check on all metadata blocks read from disk
+ e.g. don't assume index blocks are type 1 or 2.
+
+23rd March 2009
+ + looking at cleanup for unmount.
+ - various more refcounts fixed up
+ - B_SegRef is never dropped! and we take a ref on a segment when
+ we start a cluster on it, but never drop that reference.
+ THIS is next thing - review all setting and clearing of B_SegRef.
+
+30th March 2009
+ - SegRef and lafs_reserve_block...
+ There is room for recursion here, I need to be careful.
+ To dirty a data block, all parent index blocks must be Pinned and must
+ be able to be written. That means their segusage blocks must be
+ available for update. And Pinning a segusage block for update requires
+ all its parents. So the segment for the block, the indexes, and the
+ segusage and indexes and so-on must all be pinned.
+ When we pin a block, we do it from the root down to avoid recursion.
+ We probably want whatever reserve_block calls, to return an unreserved
+ block rather than call reserve_block itself.
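+
+ A minimal sketch of "pin from the root down", assuming each block has a
+ ->parent pointer and a bounded tree depth. struct pin_blk, MAX_DEPTH and
+ pin_from_root are hypothetical names for illustration, not the real code:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_DEPTH 16            /* assumed upper bound on index tree depth */

struct pin_blk {
	struct pin_blk *parent; /* NULL for the root */
	bool pinned;
	int pin_order;          /* records the order in which we pinned, for checking */
};

/*
 * Collect the path leaf -> root by following ->parent, then pin in the
 * reverse order (root first). No recursion is needed, and a parent is
 * always pinned before any of its children.
 * Returns 0 on success, -1 if the tree is deeper than MAX_DEPTH.
 */
int pin_from_root(struct pin_blk *b)
{
	struct pin_blk *path[MAX_DEPTH];
	int n = 0, order = 0;

	for (; b != NULL; b = b->parent) {
		if (n == MAX_DEPTH)
			return -1;
		path[n++] = b;
	}
	while (n--) {
		path[n]->pinned = true;
		path[n]->pin_order = order++;
	}
	return 0;
}
```

+ The real pin step can block or fail; this only shows the ordering.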
+
+ When do we clear SegRef?? We set it when Pinning, so I guess we
+ clear it when unpinning.
+ pin_dblock, mark_cleaning, prepare_write, truncate
+ seg_move clean_free
+ Well, it is really when Pinning, or Dirtying or Reallocing.
+ So we clear when unpinning, or when a dblock gets written...
+ Maybe just when we lose ->parent
+
+6th April 2009
+ - sometimes segsum counter goes to zero for random data block
+ Something is going wrong in roll-forward. The block looks transiently valid
+ so doesn't get read, but has no good data in it.
+ - After deleting a directory, the block might still have incorporation
+ to happen, but is not marked dirty
+ - at unmount, there are various blocks that are still dirty.
+ - sometimes hit BUG_ON(credits==0) line 1196 in cluster.c(cluster_flush)
+
+12th April 2009
+ - that rollforward problem above:
+ When rolling the checkpoint, if we find segusage blocks we want to include
+ them directly into file. But by pinning the block we might preread a
+ segusage block.. but we must be sure not to update it.
+ So during the early stages of rollforward while still in the checkpoint,
+ seg_inc must be called with in_phase == 0.
+ so seg_move is called with phase != qphase.
+ ditto for summary update.
+ So the block must be pinned to the previous phase...
+ Normally 'phase' changes at checkpoint-start,
+ qphase changes at checkpoint-end
+ So we probably want to start with qphase being 0 and phase being 1.
+ When we reach the end of the checkpoint, we flip qphase to 1.
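+
+ A small model of the phase/qphase idea, assuming both are single bits
+ and that "in checkpoint" means phase != qphase. struct phase_state and
+ the function names are made up for this sketch:

```c
#include <assert.h>
#include <stdbool.h>

struct phase_state {
	int phase;   /* flips at checkpoint start */
	int qphase;  /* catches up at checkpoint end */
};

/* Roll-forward begins "inside" the checkpoint it is replaying. */
struct phase_state rollforward_start(void)
{
	struct phase_state s = { .phase = 1, .qphase = 0 };
	return s;
}

void checkpoint_start(struct phase_state *s)
{
	s->phase = !s->phase;
}

void checkpoint_end(struct phase_state *s)
{
	s->qphase = s->phase;
}

/* True while a checkpoint is in progress. */
bool in_checkpoint(const struct phase_state *s)
{
	return s->phase != s->qphase;
}
```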
+
+ - blocks still in phase_leafs at unmount:
+ After we force a final checkpoint we still have Pinned:
+ root InoIdx
+ ino==8 InoIdx due to Dirty block0
+ ino=16 InoIdx due to dirty block0
+ and dirty:
+ inode block 1, inode usage map
+ 2, root directory
+ 8, orphan
+ 16 seg usage
+ Problems:
+ inode blocks dirty but not pinned? No InoIdx...
+ Segusage dirty - probably by seg_apply_all - disable that at umount
+ orphan dirty ??... but not pinned!
+ This is possible - we don't pin for clearing entries, just for setting.
+ The inode problem stems from the datablock being dirty while the
+ InoIdx block isn't. That is, at best, confusing.
+
+13th April 2009
+ segusage blocks aren't being pinned
+ They need to be pinned whenever dirty.
+ and youth blocks aren't even made dirty some times. They need to be
+ pre-pinned in many cases.
+
+ So: segusage gets changed when we write out a cluster, and when we
+ delete/relocate blocks.
+ In the first case we pin the block when it becomes part of the free list,
+ and need to keep it pinned across checkpoint changes.
+ In the second, we pin when the block is dirtied and again must keep it pinned.
+ Youth gets changed when a segment becomes free and again when we allocate
+ a segment to it.
+
+ Keeping a datablock pinned across checkpoints is awkward - we currently need
+ to repin for each dirty... I guess we can re-pin for each checkpoint
+ in lafs_seg_apply_all. That might work for segusage, but not for youth!
+ If segsum for ssnum==0 held a reference to the youth block, that might
+ help. Segstat on 'clean' or 'free' would imply a reference to that segsum.
+
+ Is it OK to keep all youth/usage blocks for free/clean blocks
+ pinned? We can currently have 810 entries. Only half will be clean/free.
+ For each entry there can be two blocks, youth and usage. So that could be
+ 810 blocks. 1Meg? Normally much less. If it became a problem we could
+ reduce the number dynamically I guess.
+
+ maybe segusage blocks need to get phase_flipped, as other blocks do
+ depend on them, pin_all_children wouldn't be able to find them though..
+
+ 1/ Any address on 'clean' or 'free' segtrack implies a refcount on the
+ Youth block.
+
+14th April 2009
+ I think I want to link dirty block to the space in free segments that we
+ actually know about. Each of those segments has youth and usage blocks
+ pinned (at least parent pointer is active). So we have everything we need
+ to write everything that is dirty. So 'free' or 'clean' implies
+ a segsum reference which holds youth block.
+
+ When we get low on space, we wait for cleaning/finding to progress.
+ This would limit us to 400 segments, say 16Meg each, so 6Gig of dirty
+ memory. I guess that we need to scale the 'free' list based on available
+ memory (FIXME).
+
+ When cleaning needs a segment, it needs to load the usage blocks for other
+ snapshots too.
+
+ When cleaning in the presence of snapshot we need to be careful never to
+ duplicate a block that is shared. To allow for v.many snapshots, we don't
+ even want to duplicate in memory.
+ So we need to choose a 'primary' copy - probably first one found - and
+ follow the peers link when possible...
+
+18th April 2009
+ (continuing).
+
+ So clean and free segments in the list carry a SegRef. But it could be
+ excessive if all of them did - we shouldn't be required to pin more
+ data than we need.
+ So for segments with a usage of 0, we use the score to record if a
+ segref is held. 0 means 'no', 1 means 'yes'.
+ When space_alloc wants more space we need to find an entry and
+ segref it. Maybe we want free lists - reffed and not-reffed.
+
+ Then again, SegRefs are fairly cheap as they are heavily shared.
+ maybe 512 to a block. If we hold 400 refs they could easily all be
+ in one block. We could possibly encourage this by sorting the list
+ and discarding from one end if it is too full.
+ Sorting is a good idea definitely. It keeps youth/usage updates
+ together.
+
+ Just check the numbers.
+ A 1TB device with 1K blocks might have 32MB segments, of which there
+ would be 32768. 512 entries per block means 64 blocks, or 16 pages (64K).
+ So total segusage files is 128K plus snapshots. Not worth worrying
+ about surely.
+ For 16TB, that is 2Meg plus snapshots.
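+
+ The numbers above can be re-checked with a few lines of arithmetic.
+ This assumes 32MB segments, 1K blocks and 512 entries per block as in
+ the note; seg_count and table_bytes are hypothetical helpers:

```c
#include <assert.h>
#include <stdint.h>

/* Number of whole segments on a device. */
uint64_t seg_count(uint64_t dev_bytes, uint64_t seg_bytes)
{
	return dev_bytes / seg_bytes;
}

/* Bytes used by one table (youth or usage), rounded up to whole blocks. */
uint64_t table_bytes(uint64_t segs, uint64_t entries_per_block,
		     uint64_t block_bytes)
{
	uint64_t blocks = (segs + entries_per_block - 1) / entries_per_block;

	return blocks * block_bytes;
}
```

+ Two tables (youth and usage) give twice table_bytes() per snapshot.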
+
+ So
+ - keep a SegRef for all free and clean blocks.
+ This must include a youthblk reference.
+ - sort the free list when 'clean' is merged or when a pass
+ finishes.
+ sort clean list
+ fix youth value
+ merge as many as fit into free
+ sort
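+
+ A rough sketch of the sort/merge/sort step, assuming entries are
+ ordered by (dev, seg) so that youth/usage updates stay together.
+ struct seg_ent and merge_clean are hypothetical, and the "fix youth
+ value" step is left out:

```c
#include <assert.h>
#include <stdlib.h>

struct seg_ent {
	unsigned dev;
	unsigned seg;
};

/* Order by device, then by segment number. */
int seg_cmp(const void *a, const void *b)
{
	const struct seg_ent *x = a, *y = b;

	if (x->dev != y->dev)
		return x->dev < y->dev ? -1 : 1;
	if (x->seg != y->seg)
		return x->seg < y->seg ? -1 : 1;
	return 0;
}

/*
 * Sort the clean list, move as many entries as fit into the free
 * table (capacity 'cap'), then sort the free table.
 * Entries that did not fit stay at the front of clean[].
 * Returns the number of entries moved.
 */
size_t merge_clean(struct seg_ent *free_tab, size_t *nfree, size_t cap,
		   struct seg_ent *clean, size_t *nclean)
{
	size_t moved = 0, i;

	qsort(clean, *nclean, sizeof(*clean), seg_cmp);
	while (moved < *nclean && *nfree < cap)
		free_tab[(*nfree)++] = clean[moved++];
	for (i = moved; i < *nclean; i++)
		clean[i - moved] = clean[i];
	*nclean -= moved;
	qsort(free_tab, *nfree, sizeof(*free_tab), seg_cmp);
	return moved;
}
```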
+
+ How is the code flow...
+ add_cleanable is called during the periodic scan. It could hold
+ a SegRef easily.
+ add_cleanable calls add_clean as does lafs_get_cleanable during
+ clean. That might block getting a segref, might even
+ deadlock?
+ add_free is also called by seg_scan
+
+ So seg_scan should get a segref and leave it with everything!
+
+ BUT.....
+ A SegRef implies a 'struct segsum' for each segment. We don't
+ want to allocate one of these for every segment in the table.
+ We only want a reference to the youth and segusage block, which
+ are heavily shared.
+
+ But these blocks need to be Pinned and SegReffed etc so we can
+ write them at any time.
+
+20th July 2009
+ The refcount held by the 'leaf' lru is a problem.
+ While it holds a count we do not unpin an index block, so it cannot
+ be removed from the list.
+ Thus we can only remove from the leaf lru on a phase change.....
+ Or when doing lru based flushing... Maybe we can remove from the
+ lru while holding the checkpoint lock.
+ This happens when truncating..
+
+ No, that is just too messy as it is too easy to get put back on the list.
+
+ Maybe the leaf lru should not imply a reference count ... or maybe
+ we need to split the refcount: 'inuse' and 'active'.....
+ How about we test refcnt against list_empty(->lru)...
+
+ ....
+
+ During truncate, we need each index block to get unpinned so they can
+ all be cleaned up.
+ But the InoIdx block is held pinned by the inode block being dirty.
+ In this particular case, the InoIdx block is Invalid as the file is empty.
+ But.... InoIdx should always be valid until after Inode is destroyed??
+
+
+ umount
+ I need to stop the cleaner and flush everything before trying to
+ clean up.
+
+ This is awkward though.
+ The 'sync' of umount is done by kill_block_super, but I call
+ that rather late, after checking that the tree is empty.
+ There are pinned/dirty bits left after sync that we want to magically
+ clean.
+ We have:
+ - segusage/youth blocks. Maybe if we don't seg_apply_all...
+ - orphan block. Maybe don't mark it dirty when we remove things?
+ - inode map?? why is that dirty
+
+ - root directory is dirty still?? But it has been erased.
+ InoIdx is valid-but-empty. Inode Data is dirty
+ Data block 0 is Dirty at block 0.
+
+ ......
+ Ahh... need to mark page dirty when block is marked dirty !!
+
+ The seg usage blocks are now flushed out but not incorporated.
+ I feel that might be correct - we don't want to care about
+ incorporation as we will never use it.
+ For this, segusage and quota are very special cases.
+
+ Inode map is no longer dirty, but is pinned
+ Orphan does have a dirty block still
+ The orphan table contains the root directory.
+ root is now clean and gone
+
+ Segusage doesn't get incorporated after last checkpoint now
+ so that is better.
+ But now we have a circular reference for SegRef. This should not
+ be surprising given the circular problems we had setting SegRef.
+ I guess we just erase the references in the segsum table...
+
+22nd July 2009
+ Hurray!!! I can unmount without crashing!
+ Now I need to sort through all the fixes required to achieve that
+ and make discrete patches, and be sure it is all OK.
+
+DONE - (block.c) lafs_get_block should not have to lock that page just to do a lookup.
+DONE - (block.c) Mark page dirty when block becomes dirty
+DONE - (checkpoint.c) print orphan_slot with Orphan flag
+DONE - Don't incorporate segcount etc after final checkpoint
+DONE - Don't apply seg changes after final checkpoint.
+DONE - Don't start opportunistic checkpoint after final.
+DONE - (checkpoint) if InoIdx isn't dirty but InodeData is, then still allocate
+DONE - (checkpoint) when waiting, wait for checkpointneeded to get cleared
+DONE - (cluster) be more flexible about credit usage when flushing InoIdx
+DONE - (dir) do add_orphan when we abort as well as on success
+DONE - use inode_dec_link_count, not i_nlink--
+DONE - (file.c) lafs_writepage: remove from leafs when we cluster_allocate
+DONE - change %d/%d to strblk
+DONE - (index.c) refile: IF B_IOLOCK, then it isn't on LRU
+DONE - (index) refile: when unpinning, remove from lru
+ - lafs_refile: ->iblock can be non-null for inode 0.
+DONE - Make sure I_Deleting gets cleared when deleting finished.
+DONE - phase_flip should have something separate to call, not lafs_allocated_block
+ - inode.c: lafs_dirty_inode: getref_lock used to get dblock
+NONO - ?? getref_locked allowed if PagePrivate
+DONE - segment: lafs_seg_put_all needed at unmount
+DONE - segdelete_all: need to put intable references
+DONE - lafs_free_get: put the intable references
+DONE - lafs_get_cleanable: put the intable references
+DONE - fix sort splitting in add_cleanable
+DONE - add lafs_empty_segment_table for unmount
+DONE - lafs_release: flush all dirty blocks
+DONE - lafs_release: force a final checkpoint
+DONE - lafs_release: move kill_block_super before final check
+DONE - lafs_put_super: release orphans and segsum files.
+DONE - lafs_destroy_inode: putref should be 'iblock'
+ - lafs_destroy_inode: allow for iblock to be present but no ref held....
+DONE - can roll forward call lafs_allocated_block without dirty???
+
+27th July 2009.
+ - I've re-arranged lafs_release so that the flush is all done in
+ generic_shutdown_super. However it calls invalidate_inodes, and that has
+ problems with pinned inodes. So we need for fsync_super to checkpoint
+ out all inodes that we don't hold our own reference to.
+ If we do hold a reference, then invalidate_inodes will skip them,
+ and ->put_super can be used to drop the references and perform the final
+ checkpoint.
+ fsync_super calls ->sync_fs after syncing all files. Maybe I can
+ do some sort of checkpoint there...
+ There almost is a checkpoint in there.... But only when called without
+ 'wait'....
+ I need to understand 's_dirt'.
+ This is controlled entirely by the filesystem, common code only examines it.
+ If it is set:
+ file_fsync (the generic 'fsync' method) will call ->write_super
+ fsync_super will call write_super
+ generic_shutdown_super will call write_super
+ sync_supers will call write_super
+ sync_filesystems(0) will call ->sync_fs
+ sync_fs is called:
+ twice from 'sync', once with '0', once with '1' for 'wait'.
+ (though in emergency_sync, both are '0').
+ once from unmount and remount with 'wait' set to '1'.
+ We don't want two checkpoints for a 'sync', but we want to start
+ on 'wait=0'.
+ Maybe if we get called with '0', we set a flag and treat the '1'
+ differently.. There is no locking to make this really safe, but
+ it will probably be OK... I could take a process_id, but then
+ parallel 'sync's could race.
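+
+ The "set a flag on '0', treat the '1' differently" idea might look
+ like this. It is only a model: struct fs_model and model_sync_fs are
+ invented names, there is no locking (as noted above), and the actual
+ waiting is not shown:

```c
#include <assert.h>
#include <stdbool.h>

struct fs_model {
	bool ckpt_started_by_sync; /* set by the wait=0 call */
	int checkpoints;           /* how many checkpoints were started */
};

void model_sync_fs(struct fs_model *fs, bool wait)
{
	if (!wait) {
		/* First half of 'sync': start a checkpoint, don't wait. */
		fs->checkpoints++;
		fs->ckpt_started_by_sync = true;
		return;
	}
	if (fs->ckpt_started_by_sync) {
		/* Second half of 'sync': reuse the one already started. */
		fs->ckpt_started_by_sync = false;
	} else {
		/* unmount/remount: nobody started one for us. */
		fs->checkpoints++;
	}
	/* Real code would wait for the checkpoint to complete here. */
}
```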
+ write_super is called before the syncs. So it could start the checkpoint,
+ and sync could wait for it.
+ write_super is called multiple times at shutdown. We really need
+ to utilise s_dirt to avoid some of these.
+ We set s_dirt to 0 when we set CheckpointNeeded, and set it to 1:
+ - when we pin a dblock or dirty a this-phase iblock.
+
+29jul2009
+ at unmount, we iput the root inode which de-references the dblock
+ before clearing ->iblock, which fails an assertion ... why?
+ Apart from the shrinker, ->iblock is only set to NULL in refile
+ when we find an I_Destroyed inode... I guess the root block isn't
+ getting Destroyed...
+ The protocol for freeing iblocks is bad. Should be:
+ - it only gets freed by the shrinker
+ - when inode dies, set ->inode to NULL
+ - when InoIdx iblock dies, set ->iblock to NULL
+ ...???
+30Jul2009
+ So, what exactly is the protocol?
+ - index blocks live either in the parent/sibling tree, or
+ on the inode's free_index list
+ - when refcnt is 0, they live on 'freelist.lru'. When refcount
+ is elevated they stay on lru until they need to be
+ added to some other lru (leafs or cluster)
+ - when shrinker finds block on freelist.lru with non-zero refcnt,
+ it just removes from lru
+ - when shrinker finds free block, it removes from free_index and discards
+ the block FIXME can refcnt=0 still have Pinned,Uninc,Realloc,Dirty ??
+ I think not as such would either have children or be on an lru
+ - When we destroy an inode, all index blocks get disconnected from the
+ inode and freed. This must include the ->iblock
+ - When an index block becomes free due to index tree shrinkage,
+ we set the ->depth to -1 so that it cannot be found by mistake,
+ and leave it for shrinker or inode destruction.
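+
+ The shrinker half of that protocol, as a toy model. The struct and
+ enum here are hypothetical; the real index block has far more state:

```c
#include <assert.h>
#include <stdbool.h>

enum shrink_result { SHRINK_SKIPPED, SHRINK_UNLISTED, SHRINK_FREED };

struct idx_blk {
	int refcnt;
	bool on_freelist_lru; /* on 'freelist.lru' */
	bool on_free_index;   /* on the inode's free_index list */
};

/* What the shrinker does with one block it meets on freelist.lru. */
enum shrink_result shrink_one(struct idx_blk *b)
{
	if (!b->on_freelist_lru)
		return SHRINK_SKIPPED;
	b->on_freelist_lru = false;
	if (b->refcnt > 0)
		return SHRINK_UNLISTED; /* in use: just drop it from the lru */
	b->on_free_index = false;       /* free: unhook; caller discards it */
	return SHRINK_FREED;
}
```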
+
+ Confused about inode<->dblock dependence.
+ We don't want the inode to refcnt the dblock as that wastes space.
+ We don't want the dblock to refcnt the inode as that stops it from being freed.
+ So each must disconnect from other when freed.
+ What locking?
+ inode takes private_lock, then checks dblock
+ dblock cannot take private_lock before checking ->my_inode..
+ Maybe: destroy_inode takes ref on dblock, then sets I_Destroyed, then
+ drops ref
+
+1Aug2009.
+ Tracking down the 'credit' count and making sure it stays correct.
+ It seems that I have a Dirty InoIdx block which is not pinned.
+ Due to this it has no refcount and so the data block disappears so
+ the InoIdx block is not visible in the tree. This isn't a definite bug
+ but it means I cannot count credits properly.
+ And surely Dirty index blocks must always be pinned!!??
+
+ When a small file is flushed to the inode we were dirtying the
+ iblock. That seems wrong - should dirty the dblock? Need to
+ check that is valid
+
+ I got a hang in 'rm adir/4'.
+ rm is in lafs_cluster_update_commit_both
+ getting a mutex.
+ cleaner is in lafs_do_checkpoint+0xe4
+ pdflush is in writepage/lafs_cluster_flush waiting on a lock
+ so I guess cleaner is holding a mutex and waiting for something
+ that won't happen?
+
+
+ Hang again at 'seq 1 200' in 'cd /mnt/1/adir'.
+ cleaner is at some point, holding a mutex to stop 'sh'.
+ 0e4 == 228
+
+ ahh.. prepare checkpoint holds wc[0].lock while waiting for checkpoint
+ to be allowed.
+ So when something locks the checkpoint and needs to flush, we have problems....
+
+
+ I seem to have fixed the above. Now:
+ Free space is a real problem. When I remount after the successful unmount,
+ we find a usage pattern like:
+CLEANABLE: 0/0 y=10 u=34179
+CLEANABLE: 0/1 y=0 u=65144
+CLEANABLE: 0/2 y=0 u=65535
+CLEANABLE: 0/3 y=32773 u=32910
+CLEANABLE: 0/4 y=32772 u=149
+CLEANABLE: 0/5 y=0 u=0
+CLEANABLE: 0/6 y=32770 u=16529
+CLEANABLE: 0/7 y=32769 u=35084
+CLEANABLE: 0/8 y=32768 u=31877
+
+ Which is ridiculous.
+ Better fix up what I have first...
+
+ ...
+ In rm /mnt/1/nbfile* we hang..
+ rm is in lafs_phase_Wait from pin_dblock in unlink
+wait for [ce5c2d20]277/0(0)r2F:Index(1),Pinned,Phase0,InoIdx,Valid,Dirty,CN,CNI,UninCredit,PhysValid{0,0}[8] pindb(1) leaf(1)
+
+ cleaner is in lafs_iolock_block from add_block_address in phase_flip
+iowait for [ce5c33b0]286/0(0)r6E:Index(1),Pinned,Phase1,WPhase0,InoIdx,Valid,Dirty,CN,CNI,UninCredit,IOLock,PhysValid{0,0}[1] child(5) leaf(1)
+
+ So cleaner is probably deadlocking against itself via iolock_block.
+ This is taken:
+ - in lafs_invalidate_page just to wait for any io - it isn't held long
+ - in lafs_erase_dblock while we erase and 'allocated_block'
+ - in lafs_get_flushable to protect blocks being checkpointed
+ - in lafs_writepage to call cluster_allocate (which releases), both for
+ data block or for inode when data was flushed there.
+ - lafs_add_block_address to process pending incorporations to make room.
+ This is what is trapping the cleaner.
+ - lafs_inode_handle_orphan when truncate finishes to erase_iblock
+ - lafs_inode_handle_orphan again to incorporate all removal
+ - and again to erase_iblock
+ - and for partial truncate to incorporate some removals
+ - and again....
+ - lafs_new_inode to keep it from being cleaned while being created
+ - roll_block to add addresses
+ - lafs_load_block during IO
+
+ So: who holds it?.... let's use the code to find out...
+ And the answer is : lafs_get_flushable.
+ So get_flushable iolocks the block then calls phase_flip which tries to
+ incorporate other-phase children which try to iolock the block. Deadlock.
+ Do we need to hold iolock during phase_flip ??. Not for all of it..
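+
+ One way to catch this kind of self-deadlock early would be to record
+ the holder of the iolock. This is a user-space model under that
+ assumption; the real B_IOLock is just a bit and has no owner field:

```c
#include <assert.h>
#include <stdbool.h>

struct io_blk {
	bool locked;
	int owner; /* hypothetical: id of the holder, for diagnosis only */
};

/* Try to take the lock; never sleeps in this model. */
bool iolock_try(struct io_blk *b, int who)
{
	if (b->locked)
		return false;
	b->locked = true;
	b->owner = who;
	return true;
}

void iolock_release(struct io_blk *b, int who)
{
	assert(b->locked && b->owner == who);
	b->locked = false;
}

/* A blocking iolock by 'who' can never succeed if 'who' already holds it. */
bool would_self_deadlock(const struct io_blk *b, int who)
{
	return b->locked && b->owner == who;
}
```

+ In the trace above, get_flushable would be the first iolock_try and
+ the incorporation under phase_flip the second attempt by the same thread.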
+
+02August2009
+ FIXME When erasing a block, do I need an uninc credit? I usually don't
+ have one and the need certainly isn't as great...
+
+ Now... let's try to get free space accounting right.
+ Observed problems:
+ - unlink sometimes failed with ENOSPC
+ - usage scan shows segments with enormous usage - 23039!!
+
+ no credits: [ce9a55cc]16/1(2651)r11E:Pinned,Phase1,WPhase1,Valid,Dirty,SegRef,C,CI,CN,CNI,UninCredit,IOLock,PhysValid intable(7) ss(3) cluster(1)
+ no credits: [cfb695cc]16/1(1840)r12E:Pinned,Phase0,WPhase0,Valid,Dirty,SegRef,C,CI,CN,CNI,UninCredit,IOLock,PhysValid intable(7) ss(4) cluster(1)
+
+ no uninc credit [cfb695cc]16/1(2019)r13F:Pinned,Phase0,WPhase0,Valid,Dirty,SegRef,C,CI,CN,CNI,IOLock,PhysValid intable(7) ss(5) cluster(1)
+
+
+ after umount/remount df says "4608 7 1544" but cannot
+ create anything.
+df: tot=4608 free=4601 avail=1544(1564-20) cb=7 pb=0 ab=0
+============= Cleanable table (7) =================
+pos: dev/seg usage score
+ 0: 0/0 1 0
+ 1: 0/5 1 64
+ 2: 0/6 6 384
+ 3: 0/7 2 128
+ 4: 0/8 3 192
+ 5: 0/3 1 64
+ 6: 0/2 2 128
+...sorted....
+ 0: 0/0 1 0
+ 1: 0/3 1 64
+ 2: 0/5 1 64
+ 3: 0/2 2 128
+ 4: 0/7 2 128
+ 5: 0/8 3 192
+ 6: 0/6 6 384
+--------------- Free table (1) ---------------
+12290: 0/4 0 0
+--------------- Clean table (0) ---------------
+CLEANABLE: 0/0 y=10 u=1
+CLEANABLE: 0/1 y=32775 u=3
+CLEANABLE: 0/2 y=32774 u=2
+CLEANABLE: 0/3 y=32773 u=1
+CLEANABLE: 0/4 y=0 u=0
+CLEANABLE: 0/5 y=32771 u=1
+CLEANABLE: 0/6 y=32770 u=6
+CLEANABLE: 0/7 y=32769 u=2
+CLEANABLE: 0/8 y=32768 u=3
+
+
+03Aug2009
+ Current issues:
+FIXED 1/ Data 16/1 is being Reallocated, but is Dirty, not Realloc
+Gone,presume FIXED 2/ Data 16/1 has no uninc credit in cluster_flush
+ 3/ in cleaner, ->dblock is uninitialised.... actually inode has been freed.
+ 4/ invalidate_page finds Realloc set, even after iolock ..
+ This is during umount in generic_shutdown/lafs_put_super/iput
+ 5/
+
+
+ Thoughts:
+ If we flag a block for Realloc then Dirty before it is allocated,
+ then all is fine.
+ But if we have already allocated to a cleaning cluster... what happens?
+ We need to treat this like it was dirtied after being written, so
+ it gets written to a regular cluster as well.
+ As we only have one uninc bit for both Dirty and Realloc, we need
+ to *not* incorporate the Realloc update if the block is still dirty.
+ So:
+ - block gets chosen for cleaning and allocated to a clean-cluster
+ - block gets marked dirty. This must not clear Realloc
+ - cluster is flushed, block is dirty, so don't call lafs_allocated_block
+ - Return the Realloc credit, but keep dirty and Uninc.
+ Is there a race if Dirty is set after we enter lafs_allocated_block?
+ As long as the index block gets marked Dirty, not Realloc we might
+ be safe... though it gets awkward if the Dirty writeout falls in to
+ the next phase. But reserve_block will have provided NCredits for that.
+ So:
+ 1/ don't clear Realloc when setting Dirty
+ 2/ do clear Realloc if cleaner finds the block is Dirty
+ 3/ avoid calling lafs_allocated_block when cleaning a dirty block.
+ This is an optimisation.
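+
+ The three rules as a flag-only model. F_DIRTY/F_REALLOC stand in for
+ B_Dirty/B_Realloc, and clearing Realloc after a clean (non-dirty)
+ flush is my assumption, not something stated above:

```c
#include <assert.h>
#include <stdbool.h>

#define F_DIRTY   1u
#define F_REALLOC 2u

/* Rule 1: setting Dirty leaves Realloc alone. */
unsigned mark_dirty(unsigned flags)
{
	return flags | F_DIRTY;
}

/*
 * Called when a cleaning cluster containing the block is flushed.
 * Returns true if the new address should be incorporated
 * (i.e. allocated_block should be called for it).
 */
bool cleaner_flush(unsigned *flags)
{
	bool dirty = (*flags & F_DIRTY) != 0;

	*flags &= ~F_REALLOC; /* rule 2 when dirty; assumed when clean */
	return !dirty;        /* rule 3: skip incorporation if dirty */
}
```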
+
+ Almost... A B_Realloc block no longer has B_Credit so B_Dirty cannot be
+ set.
+
+
+ Thoughts3.
+ When cleaning blocks we hold no reference to the inode and it can disappear.
+ We don't want to hold the inode active, but need a reference much like
+ the truncate code has.
+ I think we need a subordinate refcount for both cleaning and truncate.
+ These hold inode present but not active.
+ Maybe every block->inode should be counted like this.
+ And this might simplify the my_inode->dblock inter-relationship.
+ For later..
+ We need to ensure that if a new iget is called on an inode that still
+ exists, we don't allocate a new one but just reuse the old.
+ But that won't work as we cannot add an inode back into the hash table.
+ So I think when cleaning a block we need to ref the inode.
+ i.e. B_Realloc implies an i_grab
+
+05aug2009
+ So I have a problem with the cleaner wanting to hold an inode that
+ the VFS is destroying.
+ I don't want the cleaner to hold i_count as that delays truncate etc.
+ So we need a second counter subordinate to i_count.
+ This is held by the cleaner and by delayed truncate, and by i_count.
+ Possibly ->my_inode holds this, which means it can be a single bit...
+
+ When a lookup wants an inode, we need to load the inode data block and
+ see if it has my_inode. If it does, we insert that inode in to the
+ hash table. If not we fall back to regular inode creation....
+
+ On reflection, that is too complicated and hard and error prone.
+ When relocating a file we need the data so it had best be in the page
+ cache so the filesystem really needs to know that the inode is still
+ active.
+ So cleaning needs to keep a reference to the inode.
+ The cost of this is that if an inode is being deleted while it is
+ being cleaned the truncate cannot happen until the cleaning
+ completes. This means that space usage will be wrong.
+ When nlink becomes zero we can drop the cleaner reference. When
+ the inode is dropped/destroyed we can tie the cleaning in with the
+ delayed truncate so that the final destruction doesn't happen until
+ the cleaner has let go.
+
+ So: how to track that the cleaner has a reference to the inode?
+ Maybe every B_Realloc block owns a ref on the inode.... but dropping
+ those references when i_nlink hits zero would be difficult.
+ They could hold a secondary refcount which, if non-zero, implies a
+ ref on the inode.
+
+ So:
+ - Set B_Cleaning when we look at a block for cleaning, and clear
+ it when we find Realloc clear and ....????
+ - Whenever a block has B_Cleaning set, it holds a counted reference
+ on LAFSI(b->inode)->cleaner_ref
+ - When cleaner_ref is non-zero and I_Deleting is not set, we hold
+ a reference on the inode (i_grab).
+ - when i_nlink hits zero, set I_Deleting and drop any reference
+ held by the cleaner.
+ DONE - cleaner must be careful not to process any block that has been
+ truncated, or file that is dead.
+ DONE - Make sure the cleaner doesn't start up after the FinalCheckpoint.
+ - What about filesystem inode... how do they fit in??
+
+
+ Question. When are the index blocks for an inode flushed?
+ We need to have them gone when the inode disappears.
+ For deleted inodes, this happens in background truncate.
+ For memory-pressure inodes it will hopefully happen well in advance,
+ but we need to make sure in destroy_inode that everything is
+ written. - FIXME
+
+
+ Thinking again about B_Cleaning, any B_Realloc block will hold a
+ reference through to InoIdx and so dblock will be present and the
+ inode won't be freed. So we only need an extra reference during
+ the first little phase of cleaning when we are collecting blocks.
+ After that a reference can be useful as it will delay flushing so it
+ can be more efficient...
+
+ Maybe this is all much simpler than I thought.
+ If we hold a ref on the inode whenever the InoIdx block is Pinned
+ and i_nlink is non-zero, then we won't be forgotten until all
+ index blocks are written. We may still be deleted, but as that
+ is one-way we can hold on to the inode at little cost.
+
+ getting/putting that ref at exactly those times turns out to be
+ messy.
+ It might be best to have a flag to say "We hold an extra ref".
+ Then we occasionally call a function that validates the setting.
+ It is most important to drop the count at the right time, so
+ after unlink/rmdir/rename and when B_Pinned is dropped.
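+
+ The "flag plus validate function" idea could be as small as this.
+ struct ino_model and check_pin are hypothetical, and real code would
+ need a lock around the test-and-update:

```c
#include <assert.h>
#include <stdbool.h>

struct ino_model {
	int i_count;        /* stands in for the inode refcount */
	int nlink;
	bool inoidx_pinned; /* InoIdx block has B_Pinned */
	bool extra_ref;     /* "we hold an extra ref" flag */
};

/* Bring extra_ref (and the count it implies) in line with current state. */
void check_pin(struct ino_model *i)
{
	bool want = i->inoidx_pinned && i->nlink > 0;

	if (want && !i->extra_ref) {
		i->i_count++;
		i->extra_ref = true;
	} else if (!want && i->extra_ref) {
		i->i_count--;
		i->extra_ref = false;
	}
}
```

+ Calling it twice in the same state changes nothing, so it is safe to
+ call after unlink/rmdir/rename and whenever B_Pinned is dropped.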
+
+ B_Pinned is set in:
+ set_phase which is called from:
+ lafs_cluster_allocated when moving 'pin' across to data block
+ so don't need checkpin
+ lafs_pin_block_ph
+ only need check_pin if dropping spinlock
+ pin_all_children
+ only pins data blocks (Index are already pinned if relevant).
+ grow_index_tree
+ where "inoidx block pinning" doesn't change
+ do_incorporate_leaf
+ No InoIdx involved
+ do_incorporate_internal
+ ditto
+ So only need check in lafs_pin_block_ph and maybe pin_all_children...
+
+08Aug2009
+ - credits get out of sync from
+ lafs_incorporate->refile->space_return from checkpoint.
+ counter is one more than we can find.
+ returning space on
+ i [cfb9aaf0]327/0(2261)r1E:Index(1),Valid,PhysValid[0] NP
+ Note it is an Index but not InoIdx. The parent is still in the tree.
+ That is FIXED
+
+ - and out by 8! at
+ delete_inode -> truncate -> invalidate_page->erase_dblock->space_return
+ FIXED that.
+
+ - BUG credits<0 in space_return from lafs_incorporate from add_block_address
+ from phase_flip
+Just Grew [cfbb5c70]331/0(NoPhysAddr)r2E:Index(2),Pinned,Phase1,InoIdx,Valid,Dirty,UninCredit{0,1}[0] child(1) inc(1)
+ from [cfbb5a40]331/0(0)r205F:Index(1),Pinned,Phase1,WPhase0,Valid,Dirty,CN,CNI,UninCredit,IOLock,PhysValid{0,0}[0] child(203) leaf(2)
+msg: (1,3,1)(1,1,-1)
+Credits = -1, rv=1
+ib = [cfbb5a40]331/0(0)r205F:Index(1),Pinned,Phase1,WPhase0,Valid,Dirty,CN,CNI,UninCredit,IOLock,PhysValid{0,0}[0] child(203) leaf(2)
+
+ This is a predicted but not handled problem.
+ The answer is that not all blocks need ICredit/UnincCredit.
+ The purpose of this credit is to allow for a split in the parent.
+ pre-existing index blocks can never split the parent themselves.
+ If an index block becomes full, it will split and this might split
+ the parent.
+ If an index block has free space, then it will only overflow if it
+ gets multiple child updates and this will provide multiple credits.
+ So an index block with space for 3 or more new addresses does not need
+ an ICredit/UnincCredit. So when we split we don't need to provide an
+ uninc credit.
+ In particular:
+ When we have a full InoIdx block and a single new child with 1 UnincCredit,
+ each block already is either 'Dirty' or has a 'Credit', and the InoIdx has
+ an ICredit, then create a new intermediate such that
+ InoIdx is Dirty and has an ICredit
+ New Index is Dirty with no ICredit - it used the UnincCredit
+ New child loses its UnincCredit
+ When another block in the new index arrives, its UnincCredit is used to
+ provide an ICredit
+
+ When a leaf block cannot fit a single address it will have ICredit.
+ The block is split so that each has 3 spaces and so do not need ICredit,
+ but as soon as ICredit is available, they take it.
+
+ Worst case is that every ancestor is full and the leaf is split
+ We then get two full branches, each block half empty so not needing ICredit.
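The credit rule above can be sketched as a small Python model. This is only an illustration under assumptions: `needs_icredit`, `split_in_half` and the idea of a fixed per-block `capacity` are hypothetical, and the threshold of 3 is the figure quoted in the note.

```python
from typing import Tuple

# Threshold from the note: an index block with room for at least this many
# new addresses is assumed not to need an ICredit.
SPARE_SLOTS_WITHOUT_CREDIT = 3


def needs_icredit(capacity: int, used: int) -> bool:
    """True when the block is close enough to full that it must hold a credit."""
    return capacity - used < SPARE_SLOTS_WITHOUT_CREDIT


def split_in_half(used: int) -> Tuple[int, int]:
    """Share the addresses of a full block between the two halves of a split."""
    left = used // 2
    return left, used - left
```

With a capacity of 10, a full block needs a credit, but each half of its split has 5 free slots and does not, which is the worst-case argument made above.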
+
+
+ Then...
+ free data being used in lafs_refile from cleaner.
+ b->inode->i_sb is 0x6b6b6b6b, so inode has been freed before cleaner frees it.
+ Answer: lafs_refile was dereferencing ->inode when it wasn't safe.
+ Need to at least have a parent before it is safe.
+
+ Hang:
+ soft lockup cleaner->lafs_iget->ifind_fast ....
+ Then (may be caused)
+Oh dear: [cfb63670]284/0(0)r1E:IOLock,PhysValid cleaning(1)
+.......: [cfb2ad20]284/0(2198)r2F:Index(1),Pinned,Phase1,InoIdx,Valid,PhysValid{0,0}[0] child(1) leaf(1)
+Why have I no credits? [cfb2ad20]284/0(2198)r2F:Index(1),Pinned,Phase1,InoIdx,Valid,Dirty,PhysValid{0,0}[0] child(1) leaf(1)
+------------[ cut here ]------------
+kernel BUG at /home/neilb/work/nfsbrick/fs/module/block.c:656!
+
+ It seems the cleaner gets confused and goes spinning.
+
+
+ So: space problems:
+ After the run, we have -14 used and 2055 available (of 4608), and
+ cannot create anything.
+ 4 segments are free, one is cleanable.
+ free_blocks=2103 allocated=56 max_seg=512 clean_reserved=0
+or
+ free_blocks=1722 allocated=64 max_seg=512 clean_reserved=0
+or
+ df: tot=4608 free=4630 avail=977(1033-56) cb=10 pb=0 ab=-32
+ free_blocks=1033 allocated=56 max_seg=512 clean_reserved=0
+ and very little free
+
+ ablocks_used is going negative - why?
+ Probably we erase a dblock without clearing Prealloc.
+ Then when Prealloc later gets cleared, ablocks_used is
+ wrongly decremented.... no...
+
+
+10aug2009 (don't forget above problems)
+ Another problem.
+ read->touch_atime->dirty_inode->inode_fillblock->dirty_dblock
+ getiref_lock triggers BUG.
+ This is presumably because I have just fixed it to get the correct
+ iblock and not the iblock of the filesystem.
+
+ FIXME I hacked around this but I'm not sure the result is right.
+ The question is about when the InoIdx should be dirty and when
+ the inode data block should be dirty.
+ In this particular case we are writing a page of a small file.
+ cluster_allocate calls flush_data_to_inode which tried to dirty
+ the inode dblock but finds that iblock is not pinned...
+ When we dirty a data page we aren't pinning the parent!
+ That might be OK - we only need to count and reserve the parent.
+ We don't need to pin it until it becomes dirty.
+
+ Still need to resolve when which block gets to be dirty, and also
+ exactly when an index block needs to be pinned. And how does that
+ relate to holding a ref on the inode when the inoidx is pinned.
+ Maybe it should be when the inoidx is referenced.
+ FIXME
+
+11aug2009
+ Another problem. unlink->handle_orphans->erase_dblock->allocated_block
+ and get a zero from lafs_add_block_address but parent is not pinned.
+ And... On unmount, orphan file still has pinned blocks so the inode
+ isn't freed.
+ And ... root still old phase after lots of 'rm' then sync.
+ Inode 244 has pinned inode block held by writepage0 and writepage
+ this is adir/170
+
+13aug2009
+ - lots of bugs introduced by change to marking inode blocks dirty:
+ writepage/cluster_allocate wants to Dirty inode data block with no credits.
+ because I put credit in iblock!
+
+ - ohhh.... The phase contour is broken. When a block is added to a
+ cluster for allocation it isn't in the phaseleafs any more, but prevents
+ its parent from joining. So we cannot assume that if dblock is on
+ list then iblock or a child will be too.
+ So when we find dblock we do need to remove it.... done that.
+
+ - root not changing because Data 1/0 is Pinned and IOPending
+ and held by writepage!!
+ Problem is that IOPending blocks aren't put back on lru.
+ But that should only be blocks on the cluster list.....
+ But that is where I am putting it.
+ Maybe I need exclusion between checkpointing and any other
+ code that writes to checkpoint so checkpoint can wait
+ for that ... can we use wc->lock?? That doesn't lock
+ against cleaner, but that isn't a problem...
+ But now 0/228 is still pinned and in writepage and IOPending
+ So there is more to it than that.
+ When checkpoint finds an IOLocked block, it might be about to
+ join a cluster, in which case we don't really want to wait, or it
+ might be undergoing incorporation in which case we want to wait.
+ or it could be being erased, so wait..
+ Maybe I wait until it appears on some list.... yes.
+
+14aug2009
+ At unmount Index 8/0 with child and leaf is still pinned
+ This was pinned: [cfb29810]8/0(9)r3F:Index(1),Pinned,Phase0,InoIdx,Valid,SegRef,C,CI,CN,CNI,PhysValid{0,0}[0] release(1) child(1) leaf(1)
+
+ and..
+
+ A problem is that something goes wrong in the erase process.
+ We find new children after we erase the inoidx block!
+
+ This was pinned: [cfb3d810]8/0(9)r284016F:Index(1),Pinned,Phase0,InoIdx,Valid,SegRef,C,CI,CN,CNI,PhysValid{0,0}[0] release(1) child(1) leaf(284014)
+
+ When/how do we erase indexblock and particularly inoidx blocks?
+ Does an inValid InoIdx simply mean there is no indexing and does not
+ reflect on the Data block?
+
+.xlooping on [cfbe28c0]331/0(0)r2F:Index(1),Pinned,Phase1,InoIdx,Valid,Dirty,SegRef,C,CI,CN,CNI,UninCredit,PhysValid{0,0}[0] inode_handle_orphan(1) leaf(1)
+
+ Orphan problem:
+nextfree = 0
+reserved = 0
+VFS: Busy inodes after unmount of hdb. Self-destruct in 5 seconds. Have a nice day...
+This was pinned: [cfb2ad20]0/0(13)r5F:Index(1),Pinned,Phase1,InoIdx,Valid,Root,C,CI,CN,CNI,PhysValid{0,0}[0] NP release(1) child(3) leaf(1)
+[cfb2ad20]0/0(13)r5F:Index(1),Pinned,Phase1,InoIdx,Valid,Root,C,CI,CN,CNI,PhysValid{0,0}[0] NP release(1) child(3) leaf(1) Leaf1(0)
+ [cfa5374c]0/1(772)r0E:Valid,Dirty,UninCredit,PhysValid
+ [cfb54430]0/8(775)r0E:Valid,Dirty,SegRef,UninCredit,PhysValid
+ [cfb54c90]0/16(777)r0E:Valid,Dirty,SegRef,C,CI,UninCredit,PhysValid
+
+nextfree = 1
+reserved = 0
+ 0: 1 0 0 304
+VFS: Busy inodes after unmount of hdb. Self-destruct in 5 seconds. Have a nice day...
+This was pinned: [cfb273b0]8/0(9)r3F:Index(1),Pinned,Phase1,InoIdx,Valid,SegRef,C,CI,CN,CNI,PhysValid{0,0}[0] release(1) child(1) leaf(1)
+[cfb273b0]8/0(9)r3F:Index(1),Pinned,Phase1,InoIdx,Valid,SegRef,C,CI,CN,CNI,PhysValid{0,0}[0] release(1) child(1) leaf(1) Leaf1(0)
+ [cfb74c90]8/0(2404)r1E:Valid,SegRef,CN,PhysValid orphan(1)
+badcnt 0 0 [cfb268c0]0/0(13)r4E:Index(1),Pinned,Phase1,InoIdx,Valid,Root,C,CI,CN,CNI,PhysValid{0,1}[0] NP child(4)
+
+
+erase Inoidx [ce5ab3b0]172/0(402)r1F:Index(1),InoIdx,Valid,IOLock,OnFree,PhysValid[0] inode_handle_orphan(1)
+erase Inoidx [ce5ab5e0]74/0(0)r2F:Index(1),Pinned,Phase0,WPhase0,InoIdx,Valid,Dirty,SegRef,C,CI,CN,CNI,UninCredit,IOLock,PhysValid{0,0}[0] child(1) leaf(1)
+------------[ cut here ]------------
+WARNING: at /home/neilb/work/nfsbrick/fs/module/block.c:579 lafs_erase_iblock+0x
+unlink/orphan/erase_dblock_allocated_block
+---[ end trace 61b8bd59512ea4da ]---
+zz [ce50d6a8]74/1059005010(0)r1E:SegRef,C,CI,UninCredit,IOLock,PhysValid,Orphan(0) orphan(1)
+ [ce5ab5e0]74/0(0)r1E:Index(1),InoIdx,SegRef,C,CI,CN,CNI,UninCredit,PhysValid[0] child(1)
+ [ce5ab5e0]74/0(0)r1E:Index(1),InoIdx,SegRef,C,CI,CN,CNI,UninCredit,PhysValid[0] child(1)
+------------[ cut here ]------------
+kernel BUG at /home/neilb/work/nfsbrick/fs/module/index.c:1955!
+
+BINGO. When we remove last entry from directory we erase the InoIdx block,
+ then when we add entries, we hit problems.
+
+
+nextfree = 3
+reserved = 0
+ 0: 1 0 0 306
+ 1: 1 0 0 307
+ 2: 1 0 0 74
+VFS: Busy inodes after unmount of hdb. Self-destruct in 5 seconds. Have a nice day...
+
+This was pinned: [cfb2caf0]16/0(10)r3F:Index(1),Pinned,Phase1,InoIdx,Valid,SegRef,C,CI,CN,CNI,PhysValid{0,0}[0] release(1) child(1) leaf(1)
+[cfb2caf0]16/0(10)r3F:Index(1),Pinned,Phase1,InoIdx,Valid,SegRef,C,CI,CN,CNI,PhysValid{0,0}[0] release(1) child(1) leaf(1) Leaf1(0)
+ [ce9794f0]16/0(2200)r1E:Valid,SegRef,CN,CNI,PhysValid cleaning(1)
+
+This was pinned: [cfb2d3b0]8/0(9)r3F:Index(1),Pinned,Phase1,InoIdx,Valid,SegRef,C,CI,CN,CNI,PhysValid{0,0}[0] release(1) child(1) leaf(1)
+[cfb2d3b0]8/0(9)r3F:Index(1),Pinned,Phase1,InoIdx,Valid,SegRef,C,CI,CN,CNI,PhysValid{0,0}[0] release(1) child(1) leaf(1) Leaf1(0)
+ [ce968c90]8/0(2175)r3E:Valid,SegRef,C,CI,CN,CNI,PhysValid orphan(3)
+
+This was pinned: [cfb2d180]1/0(8)r3F:Index(1),Pinned,Phase1,InoIdx,Valid,SegRef,C,CI,CN,CNI,PhysValid{0,0}[0] release(1) child(1) leaf(1)
+[cfb2d180]1/0(8)r3F:Index(1),Pinned,Phase1,InoIdx,Valid,SegRef,C,CI,CN,CNI,PhysValid{0,0}[0] release(1) child(1) leaf(1) Leaf1(0)
+ [ce968430]1/0(2174)r1E:Valid,SegRef,C,CI,PhysValid cleaning(1)
+
+We have stray 'cleaning' references.
+It is taken -
+ on a data block that was in a to-clean segment
+ at which point we igrab the inode
+ the block is put on the ->cleaning list.
+It is put:
+ when we get an error finding the block
+ when we find that it isn't in the segment
+ when an error occurs loading the block-to-be-relocated
+ and when we mark that block for cleaning.
+ i.e. always unless we got EAGAIN or some space error.
+ If we still hold some blocks, try_clean returns 0.
+
+VFS: Busy inodes after unmount of hdb. Self-destruct in 5 seconds. Have a nice day...
+This was pinned: [cfb253b0]0/0(13)r5F:Index(1),Pinned,Phase1,InoIdx,Valid,Root,C,CI,CN,CNI,PhysValid{0,0}[0] NP release(1) child(3) leaf(1)
+[cfb253b0]0/0(13)r5F:Index(1),Pinned,Phase1,InoIdx,Valid,Root,C,CI,CN,CNI,PhysValid{0,0}[0] NP release(1) child(3) leaf(1) Leaf1(0)
+ [cfa57b7c]0/1(2228)r0E:Valid,Dirty,UninCredit,PhysValid
+ [ce5a4430]0/8(2231)r0E:Valid,Dirty,UninCredit,PhysValid
+ [ce5a4c90]0/16(1028)r0E:Valid,Dirty,SegRef,C,CI,UninCredit,PhysValid
+
+NOTE these inode data blocks are not pinned and so did not get written!!
+
+FIXME I should wait for the checkpoint to finish
+nextfree = 1
+reserved = 0
+ 0: 1 0 0 301
+VFS: Busy inodes after unmount of hdb. Self-destruct in 5 seconds. Have a nice day...
+This was pinned: [cfb27810]8/0(9)r3F:Index(1),Pinned,Phase0,InoIdx,Valid,SegRef,C,CI,CN,CNI,PhysValid{0,0}[0] release(1) child(1) leaf(1)
+[cfb27810]8/0(9)r3F:Index(1),Pinned,Phase0,InoIdx,Valid,SegRef,C,CI,CN,CNI,PhysValid{0,0}[0] release(1) child(1) leaf(1) Leaf0(0)
+ [ce564c90]8/0(3983)r1E:Valid,SegRef,PhysValid orphan(1)
+
+16Aug2009
+ When I clean and find an inode that is already deleted, I need to be
+ very careful not to resurrect anything.. I wonder if I am.... Yes, I seem
+ to be. lafs_delete_inode gets called a lot, but mostly for dead inodes.
+
+ BUGS:
+FIXED orphans don't get cleaned up. It seems a 'create' fails and leaves
+ an orphan block un-released.
+ - sometimes inodes 1,8,16 didn't get written out as they were dirty but not pinned
+ - Not sure that we handle complete truncation, then adding blocks properly.
+ - what should the state of the InoIdx block be?
+ - On remount, the filesystem contains rubbish.
+ - create fails even when there should be free space.
+ - sometimes BUG in checkpoint.c - not finishing checkpoint properly...
+ - iblock not valid for inode 327 under cluster_flush/lafs_allocated_block
+ and 74 has similar issue
+ 327 = adir/big1 74=adir
+
+
+17Aug2009
+ Segusage blocks aren't always Pinned when we make them dirty.
+ Yes. That is correct. They are not forced out by phase change but by
+ lafs_seg_flush_all at the end of a checkpoint. So they need to be
+ preallocated, but not Pinned.
+ But, once we have finished the last checkpoint we don't want to
+ dirty Segusage blocks any more.. I wonder if we are.
+ No, but we were Pinning inodes without PinPending and they
+ lost the pinning straight away!
+
+ OK, other annoyance.
+ InoIdx block and similar are getting erased at the wrong
+ time.
+ We can only safely erase them when they have no children.
+ I guess what we really want is that incorporation leaves them
+ existing but empty, and when we go to write them out, if they
+ are empty we register an address of 0.
+ When we drop the ->parent pointer of an Index block it
+ just goes away...
+ So:
+ When incorporate or truncate produces an empty index block
+ it simply clears B_Valid.
+ When incorporate wants to add to an index block, we set B_Valid
+ When cluster_allocate gets a non-Valid index block it calls
+ block_allocated with phys of 0.
+
+ Yes, that seems to work. Mostly
+
+18Aug2009
+ On remount, check_credits dies: 16/20-0
+ In pin_dblock/reserve/seg_ref/prealloc/space_return ?? in lafs_mount.
+
+19Aug2009
+ OK, this index block clearing is a mess. There must be a neat model I can
+ follow that will make it "just work".
+ The key seems to be children. If an index block has children, then it
+ really must exist. If it has no children and no content, then it can
+ be discarded, in which case it needs to be unlinked from its sibling list.
+ What locking do we use here? Probably IOLock on the parent index block.
+ So we need iolock while looking in a parent for children, and we take
+ IOLock while incorporating or pruning.
+ Once the empty index block has dropped out it will never be found again.
+ When we incorporate the zero address, the index block becomes invisible
+ unless it is shortly after its predecessor in the sibling list. But
+ that is hard to ensure, especially if the first child is the one that
+ is being erased. So if an index block is erased, then it must be
+ discarded quickly and any children need to be relocated...
+ Or maybe not.... maybe if there are children, we just write an empty block?
+
+22Aug2009
+ We need better locking of the index information.
+ It seems best to use IOLock as that is already held during incorporation.
+ So any code that accesses or updates an index block must hold IOLock.
+ This might be a bit of a restriction if we try to do a lookup while
+ writeout is happening.... Maybe we need a separate writeback flag for that.
+ But I think it is good to use IOLock for now.
+ Places we need this are:
+ flush_data_to_inode needs to lock the InoIdx block
+ - DONE
+ lafs_leaf_find as it recurses down. This should return a locked leaf.
+ - DONE
+ callers of clear_index
+ erase_dblock for depth=0??
+ - DONE
+ incorporate should lock new blocks for consistency
+ - DONE
+
+ Locking dependency rule is that if we hold a lock, we are allowed to
+ lock a child index block, but not a parent. If we hold a data block,
+ we are allowed to lock an index block.
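The locking dependency rule above can be written as a checker over a toy block tree. This is a sketch under assumptions: `Block`, `is_ancestor` and `may_lock` are hypothetical names, and "parent" is read here as "any ancestor", which the note does not state explicitly.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(eq=False)
class Block:
    name: str
    is_index: bool
    parent: Optional["Block"] = None


def is_ancestor(candidate: Block, block: Block) -> bool:
    """True if candidate is the parent, grandparent, ... of block."""
    p = block.parent
    while p is not None:
        if p is candidate:
            return True
        p = p.parent
    return False


def may_lock(held: List[Block], target: Block) -> bool:
    """held: blocks whose lock we already hold; target: index block to lock."""
    for h in held:
        # Holding an index block forbids locking anything above it.
        if h.is_index and is_ancestor(target, h):
            return False
    # Holding only data blocks (or nothing) never forbids an index lock.
    return True
```

A check like this could sit behind a debug assertion in a test harness; it says nothing about how the real IOLock is implemented.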
+
+
+ The read/write completion seems all wrong. It unlocks if the page was locked,
+ and that isn't really safe, because it might not have been locked for read..
+ We need to flag block0 to say if lock or writeback need to be cleared.
+ Given that, I don't need IOPending any more:
+ Read: We submit all reads, then set 'do_unlock', then check if we should unlock.
+ Write: We queue all writes, then set 'do_clear_writeback', then check.
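The "submit everything, then set the flag, then check" pattern for reads can be sketched as follows. This is a single-threaded toy model: `PageIO` and its methods are hypothetical names, and real completion handlers would need atomic operations or a lock around the counter and flag.

```python
class PageIO:
    """Tracks outstanding block reads for one locked page."""

    def __init__(self) -> None:
        self.pending = 0        # reads submitted but not yet completed
        self.do_unlock = False  # set once every read has been submitted
        self.locked = True

    def submit(self) -> None:
        self.pending += 1

    def complete(self) -> None:
        self.pending -= 1
        self._maybe_unlock()

    def all_submitted(self) -> None:
        self.do_unlock = True
        self._maybe_unlock()

    def _maybe_unlock(self) -> None:
        # Unlock only when submission is finished AND nothing is in flight.
        if self.do_unlock and self.pending == 0 and self.locked:
            self.locked = False
```

The same shape would apply to writes with a `do_clear_writeback` flag in place of `do_unlock`. The point of the flag is that an early completion cannot unlock the page while later blocks are still being submitted.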
+
+ Now... can we use a writeback flag to avoid waiting to read while writeout
+ is happening? We would need:
+ set writeback in cluster_allocate
+ wait_writeback after some lock_block
+ clear_writeback when writeout finishes.
+ Extra checks where we already check for IOLock
+
+
+24aug2009
+ Lots of progress but....
+ cluster_flush calls cluster_done calls refile calls iput calls
+ drop_inode calls write_inode_now calls writepage calls cluster_flush
+ and we get a locking loop.
+ I think we need to run that cluster_done from a different thread.
+
+
+ We seem to have a refcnt problem with segsum.
+
+25aug2009
+ Lots more progress but.....
+
+ orphan_release is finding that the orphan block has no credits.
+ We can allocate credits and simply not do the update if they
+ are not available: having an extra entry in the orphan file isn't
+ a problem. However we need some mechanism to clean up other than
+ waiting for a remount..
+ I think we leave that until we redo orphan handling.
+
+ and: adir sometimes loses one block so it and the contents don't get
+ deleted.
+
+ and: it seems we sometimes try to clean the segment being written
+ to. We must avoid that.
+
+ (long ago I wrote::
+ FIXME When pin fails, we need to remove PinPending from everything!!!
+ and never followed up ... I wonder?
+ )
+
+25Aug2009
+ Orphan handling.
+ Every orphan block goes on a per-fs list and gets removed only
+ if the B_Orphan bit is clear.
+ There are two times when we want to expedite orphan handling.
+ 1/ on rmdir we need to know if the directory is really empty.
+ This requires that we expedite the orphan handling of all
+ blocks. As soon as we find a non-orphan, we can give up.
+ Then we need to make sure the index tree has collapsed. We
+ can borrow that code from truncate.
+
+ 2/ When writing past Trunc_next. We just pass the block to
+ special orphan handling.
+
+ This requires that orphan handling is re-entrant.
+ For dir, that is protected by i_mutex, but rmdir needs to come
+ in under the radar.
+ For trunc, the iolock on the index blocks should be enough.
+ I wonder if IOLock can be used on dir as well... allowing
+ parallel orphan handling in the one dir even!!.
+
+ We need to ensure exclusion of orphan handling, including:
+ - only one orphan handler at a time
+ - don't run orphan handler while still processing action
+ that makes it an orphan.
+ Maybe if we just use IOLock for that? Does that work? Maybe
+ but it gets messy for directories (on first attempt anyway).
+ For directories we can just use i_mutex.
+ Maybe i_mutex for files as well?
+
+27Aug2009
+ Orphan handling is going well... but not perfect.
+ I'm using IOLock to ensure exclusion for orphan handling.
+ However:
+ I'm not really implementing that on directories
+ Inodes go bad because lafs_erase_dblock needs the lock too.
+ The call from rmdir will always fail because we hold i_mutex.
+
+ Bigger problem. I'm IOLocking inodes across checkpoints to preserve
+ Orphan status. But that might stop the checkpoint proceeding.
+ .. so use i_mutex, not IOLock - fine.
+
+ Now... it seems I've confused myself. Orphans don't get handled
+ immediately. In particular, inodes should not be handled until
+ their final delete_inode. So setting the B_Orphan flag and putting
+ on the list are two separate events. The flag must come first,
+ but the list may come much later. So some of that mucking around
+ with i_mutex is pointless.
+ So:
+ make_orphan makes sure it is in orphan file, sets bit, and removes
+ from list (if present).
+ add_orphan puts it on the list for handling.
+
+ For inodes: lafs_new_inode sets the bit and delete_inode puts on queue,
+ as does any unlink/rmdir/rename that fails.
+
+ For directories: put it on list in commit/abort.
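The split between "becoming an orphan" and "being queued for handling" can be modelled like this. It is a sketch only: `Block`, `OrphanState`, `orphan_file` and `todo` are hypothetical stand-ins, and running out of space while recording the orphan is not modelled.

```python
class Block:
    def __init__(self, name: str) -> None:
        self.name = name
        self.b_orphan = False  # stands in for the B_Orphan flag


class OrphanState:
    def __init__(self) -> None:
        self.orphan_file = set()  # names recorded in the orphan file
        self.todo = []            # blocks queued for orphan handling

    def make_orphan(self, blk: Block) -> None:
        """Record in the orphan file, set the flag, take off the list."""
        self.orphan_file.add(blk.name)
        blk.b_orphan = True
        if blk in self.todo:
            self.todo.remove(blk)

    def add_orphan(self, blk: Block) -> None:
        """Queue for handling; the flag must already be set."""
        if not blk.b_orphan:
            raise ValueError("flag must be set before queueing")
        if blk not in self.todo:
            self.todo.append(blk)
```

In this model lafs_new_inode would correspond to `make_orphan`, and delete_inode (or a failed unlink/rmdir/rename) to `add_orphan`.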
+
+
+ And...
+ I hit the BUG where find_leaf wants an address of 0.
+ If an index block gets cleaned out it doesn't disappear
+ immediately.. there is no leaf to find in that direction.
+ We probably need to avoid non-Valid blocks or something...
+ And...
+ Orphans 0/299 to 0/329 and 0/280 are still on the list
+ but are not orphans.
+ Maybe I need to catch mutex_unlock to run the orphans??
+ And...
+ We underflow a segment through orphans at unmount.
+ We are cleaning and truncating at the same time.
+ The same block gets allocated to 0 and to 1225
+ in quick succession.
+ Problem is that we apply new address while in writeback
+ so a new lafs_allocated_block
+
+29Aug2009
+
+ Review of inodes in orphan list:
+ lafs_new_inode makes an orphan for a non-existent inode.
+ If the inode cannot be created, orphan_release is called.
+ If it can, a 'struct inode' is filled in with valid type
+ and nlink==1 (!!) and attached. The inode will only be
+ detached when the refcnt hits 0, and the orphan list implies
+ a refcount, so if we ever find something on the orphan list
+ with a NULL my_inode, it must be very new and can be ignored.
+
+ When we find an inode block with a my_inode there are a few options:
+ if I_Trunc is set, we must progress truncation providing we can
+ get the i_mutex
+ else if I_Deleting we must delete the inode
+ else if nlink is 0, we remove from the list
+ else nlink > 0 and we must remove orphan status.
+ This means that if nlink is elevated, we need to be holding the mutex...
+ So don't elevate nlink any more...
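The options listed above for an inode block on the orphan list form a simple priority chain, sketched here as a pure function. `Action` and `orphan_action` are hypothetical names, and the "providing we can get the i_mutex" condition on truncation is left to the caller.

```python
from enum import Enum, auto


class Action(Enum):
    IGNORE = auto()           # no my_inode yet: very new, skip it
    TRUNCATE = auto()         # I_Trunc set: progress the truncation
    DELETE = auto()           # I_Deleting set: delete the inode
    DROP_FROM_LIST = auto()   # nlink == 0: remove from the list
    CLEAR_ORPHAN = auto()     # nlink > 0: remove orphan status


def orphan_action(has_inode: bool, i_trunc: bool,
                  i_deleting: bool, nlink: int) -> Action:
    if not has_inode:
        return Action.IGNORE
    if i_trunc:
        return Action.TRUNCATE
    if i_deleting:
        return Action.DELETE
    if nlink == 0:
        return Action.DROP_FROM_LIST
    return Action.CLEAR_ORPHAN
```

Because I_Trunc is tested first, an inode that is both truncating and deleting keeps truncating until that flag clears.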
+
+ When nlink becomes non-zero the block needs to be put back on the
+ orphan list (it must already be an orphan). Also when we set
+ I_Deleting or I_Trunc it must go on the list.
+ .. OK, I think I have all of that.
+
+
+30Aug2009.
+ I have some weirdness that seems to be caused by the orphan stuff,
+ probably due to it all being async now.
+ - A deleted inode clears I_Trunc and then sets it again. The only
+ explanation seems to be that delete_inode is being called again,
+ so I must be igrabbing it again, maybe from cleaning.
+ - bits of directories aren't getting deleted. Sometimes single
+ blocks, though the referred files are deleted. Sometimes
+ the whole directory... More interestingly, those blocks then
+ don't get cleaned, so something about them means that they
+ don't get deleted and don't get cleaned either.
+
+ Even weirder... I just had a case where file 331 had a different
+ index block for every 4 data blocks...
+
+
+ FIXME:
+ - What stops pinned blocks from being flushed by bdflush in middle
+ of operation and so losing allocation? Must make sure to set
+ them dirty very late.
+ - orphan_release can fail, so must make sure we can always call
+ it, even if my_inode is NULL.... but how?
+
+
+ - make_orphan could fail due to lack of space, which is not OK.
+ I made it loop, but I'm not 100% sure that is right... it isn't.
+ I need to pass down the 'I'm freeing space' flag, and I need to
+ not require that Credit or Dirty is set, etc.
+
+
+ - I seem to have a deadlock at unmount.
+ umount is waiting for lafs_checkpoint_lock_wait in
+ lafs_put_super
+ pdflush is in down_read in sync_supers
+ lafs_cleaner is iget_locked/ifind_fast/inode_wait
+ This is waiting for I_LOCK to be clear.
+
+
+31Aug2009
+ - When a file shrinks and becomes level-0, make sure
+ old addresses get deallocated. I seem to have
+ a directory where they didn't.
+
+ - Due to the fact that we over-preallocate, we really shouldn't
+ return ENOSPC until we have flushed dirty data and performed
+ a checkpoint??
+
+
+ - When I removed the last index from an inode
+ (Indirect type) it seems that I didn't write
+ out the corrected block..??
+
+1sep2009
+ I ran my simple test run repeatedly overnight.
+ It ran 208 times before I stopped it.
+ There are 3 possible failure modes:
+ 1/ didn't complete within 500 seconds
+ 2/ triggered a BUG
+ 3/ appeared to complete, but the number of blocks
+ in use was not the correct '7'.
+
+ 74 (35%) did not fail!
+ 31 () did not complete
+ 40 () triggered a BUG
+ 2 did not complete but did not trigger a bug
+
+ 94 of those that failed did not have a BUG
+ 92 actually completed. Of these:
+ 1 final blocks 1
+ 1 final blocks 110
+ 1 final blocks 23
+ 2 final blocks 12
+ 5 final blocks 0
+ 6 final blocks 10
+ 11 final blocks 8
+ 21 final blocks 11
+ 44 final blocks 9
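As a cross-check of the tallies above, the counts can be summed directly. The figures are copied from this entry; the variable names and the `pct` helper are only for illustration.

```python
runs = 208            # total test runs
did_not_fail = 74
triggered_bug = 40
failed_without_bug = 94

# "final blocks" value -> number of runs, for runs that completed wrongly
final_blocks = {1: 1, 110: 1, 23: 1, 12: 2, 0: 5, 10: 6, 8: 11, 11: 21, 9: 44}
completed_wrong = sum(final_blocks.values())


def pct(n: int) -> float:
    """Share of all runs, as a percentage rounded to one decimal place."""
    return round(100 * n / runs, 1)
```

The three outcome classes add up to the 208 runs, and the final-blocks table adds up to the 92 runs that completed with a wrong count, so the numbers are at least internally consistent.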
+
+ of the BUGs,
+ 1 BUG: sleeping function called from invalid context at kernel/nsproxy.c:217
+ 1 BUG: spinlock lockup on CPU#0, rm/1330, cfb2dae4
+ 1 kernel BUG at /home/neilb/work/nfsbrick/fs/module/block.c:485!
+ 1 kernel BUG at /home/neilb/work/nfsbrick/fs/module/cluster.c:1219!
+ 1 kernel BUG at /home/neilb/work/nfsbrick/fs/module/inode.c:821!
+ 2 BUG: soft lockup - CPU#0 stuck for 61s! [lafs_cleaner:1177]
+ 3 kernel BUG at /home/neilb/work/nfsbrick/fs/module/segments.c:1028!
+ 3 kernel BUG at /home/neilb/work/nfsbrick/fs/module/segments.c:351!
+ 5 kernel BUG at /home/neilb/work/nfsbrick/fs/module/lafs.h:276!
+ 6 kernel BUG at /home/neilb/work/nfsbrick/fs/module/block.c:529!
+ 7 BUG: unable to handle kernel paging request at 6b6b6bfb
+ 11 kernel BUG at /home/neilb/work/nfsbrick/fs/module/super.c:655!
+
+
+ super.c:655 is "block is still pinned" at unmount time.
+ The block was always an InoIdx with a child.
+ Either inode 0 or 16.
+ child is held by various things:
+ [cfb555cc]16/1(2098)r131E:Valid,Async,SegRef,CN,CNI,UninCredit,PhysValid async(1) clean2(130)
+ [cfb554f0]16/0(1050)r25E:Valid,SegRef,CN,CNI,PhysValid clean2(25)
+ [cfa57c58]0/2(3676)r0E:Valid,Dirty,UninCredit,PhysValid
+ [cfa5bc58]0/2(3110)r0E:Valid,Dirty,UninCredit,PhysValid
+ [ce5b94f0]16/0(519)r1E:Valid,Async,SegRef,CN,CNI,PhysValid async(1)
+ [cfb4d4f0]16/0(4249)r1E:Valid,Async,SegRef,CN,CNI,PhysValid async(1)
+ [ce5ad4f0]16/0(612)r1E:Valid,Async,SegRef,CN,CNI,PhysValid async(1)
+ [ce5c2fc8]0/74(0)r129E:SegRef,C,Claimed,PhysValid clean2(129)
+ [cfa57c58]0/2(1895)r0E:Valid,Dirty,UninCredit,PhysValid
+ [cfb4d5cc]16/1(4543)r105E:Valid,SegRef,CN,CNI,UninCredit,PhysValid clean2(105)
+ [ce5754f0]16/0(1290)r178E:Valid,SegRef,CN,CNI,PhysValid clean2(178)
+
+ The "unable to handle kernel paging request" is always in
+ umount.
+ invalidate_inode_buffers(26/46)/lock_acquire
+
+
+ block.c:529
+ This is iblock valid when erasing a block
+ The block we are erasing is always 0/327 or 0/328. It is
+ an orphan we are handling, iolocked but not always pinned
+
+ lafs.h:276
+ Map an iblock which is not IOLocked
+ always in lafs_clear_index for the InoIdx block for a directory
+ which is in Writeback.
+ Call is in lafs_allocated_block from cluster_flush.
+
+ segments.c:351
+ seg_inc reduces seg usage below 0
+ - lots of blocks (inode 327) that were cleaned, were then erased twice.
+ - 2 blocks (inode 328) were erased twice, both from prune
+ - ditto
+
+ segments.c: 1028
+ The free list is empty.... odd as only first segment is currently
+ in use.
+
+ soft lockup:
+ Still orphan: 0/328 Index(1) is in Writeback and Dirty
+ again inode_handle_orphan2 is in Writeback
+
+ inode.c:821
+ inode_handle_orphan at end, child list is not empty.
+ The children seem to be in Realloc - cleaner needs to let go.
+
+ cluster.c:1219
+ my_inode is null while cluster_flush an inode and want to set
+ WritePhase.
+
+
+ block.c:485
+ no ICredit for unincredit in dirty_dblock from dir_delete_commit
+ from lafs_unlock.
+
+
+ spinlock lockup is subsequent to real bug
+ ditto for sleeping function.
+
+ Of the '44' which claimed final blocks of 9, 14 really had 7, and 4
+ appear to have other strange values....
+
+ A selected '9' has two extra blocks for the directory '74'.
+ But that directory is long gone.
+ These dir blocks are currently fully populated with numbers.
+ This seems to be the pattern with all non-7 blocks.
+
+
+ 02Sep2009
+ Found a problem, possibly related to the dir blocks not being
+ cleaned up.
+ When lafs_incorporate sets ->depth to 1 it doesn't dirty the inode,
+ so that fact is never copied in to the datablock.
+ On further exploration, the I_Dirty bit is set but never used, which
+ isn't good.
+ So: exactly when do we copy inode into datablock, and what do we do
+ when dirty_inode is called (if anything).
+ We could just set I_Dirty when dirty_inode is called, checking that
+ the block is Pinned which it usually will be.
+ Then we copy inode to data just before writing data block.
+ However that defeats transactional properties. We need to copy in the
+ same transaction, and that means either straight away, or when
+ the data block's phase changes.
+ So dirty_inode either copies to the block, or sets I_Dirty.
+ When lafs_refile unpins an inode data block, it needs to check
+ I_Dirty and possibly re-dirty it.
+
+ To redirty it we must steal the NCredits. Any further dirty attempt
+ will have to allocate more.
+ The stealing is done automatically by dirty_dblock, so we just flip
+ the phase and call dirty_inode ... making sure it doesn't try to
+ prealloc too hard.
+
+ Need to review when inodes get dirtied.
+ - commit_write only sets I_Dirty !
+
+ We call lafs_dirty_inode:
+ dir_create_commit - a child of inode is PinPending
+ lafs_create - ditto
+ lafs_link - before dir_create_commit
+ lafs_unlink, lafs_rmdir - data block is pinned
+ lafs_symlink - before create_commit
+ lafs_mkdir - before create_commit, or block pinned
+ lafs_mknod - before create_commit
+ lafs_rename - (moved to) before create_commit/update_commit
+ or data block is pinned
+ lafs_dir_handle_orphan - (assured that) child is pinned.
+ choose_free_inum - child is pinned
+ lafs_incorporate - block is pinned
+
+ So either the data block is pinned, or the index block is pinned.
+ In either case it is OK to set something to Dirty.
+
+ (the new) lafs_dirty_vfs_inode gets called by mark_inode_dirty{,_sync}
+ this is called from:
+ inode_inc_link_count
+ inode_dec_link_count
+ ..various quota ops...
+ inode_setattr
+ __set_page_dirty (Which we don't use)
+ other buffer stuff
+ other quota stuff we won't use
+ touch_atime
+ file_update_time
+ page_symlink
+
+ only the time updates are interesting. Others we have locking
+ for.
+ file_update_time is called from generic_file_aio_write_nolock etc
+ before ->prepare_write/->commit_write. So they can pick up the
+ change.
+ Similarly before set_page_dirty is called.
+ touch_atime is called from do_follow_link and readlink and
+ file_accessed which is called all over the place.
+
+ So what to do?
+ If block is pinned, then dirty it to ensure writeout.
+ If not, don't. But copy data in any case.
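The conclusion above (always copy, only dirty when pinned) is small enough to state as code. This is a toy model with hypothetical names (`InodeDataBlock`, `dirty_vfs_inode`); credits and phases are not modelled.

```python
class InodeDataBlock:
    def __init__(self, pinned: bool) -> None:
        self.pinned = pinned
        self.dirty = False
        self.data = {}  # on-disk copy of the inode fields


def dirty_vfs_inode(fields: dict, blk: InodeDataBlock) -> None:
    # Copy the in-memory inode into the data block in every case...
    blk.data = dict(fields)
    # ...but only force writeout when the block is already pinned.
    if blk.pinned:
        blk.dirty = True
```

For an atime-only update on an unpinned block this means the new time reaches disk only if something else later dirties the block.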
+
+
+4sep2009
+
+ OK, I've decided that I don't like clearing B_Valid when an index
+ block contains no indexes. The final straw was that I seemed
+ to need to initialise the index block when I didn't hold IOLock.
+ That was probably fixable, but I'm sure more problems were coming.
+
+ So: what to do instead?
+ One issue that must be resolved is that an index block can still
+ have valid children even when it becomes empty.
+ This can happen if we erase blocks from a file, then add them back
+ after a checkpoint, and so in the next phase.
+ The checkpoint writeout could need to show an empty index block,
+ but the next phase will see real addresses.
+ We cannot easily avoid this, so we must handle it.
+ This interacts badly with the index lookup algorithm that finds
+ the best index block currently in the parent, and then scans
+ the children. If there is no index block in the parent, we
+ cannot find any children.
+ This could be handled by responding to an empty index block by
+ scanning all children. But that isn't a full solution as if
+ just one index block got erased, its unincorporated siblings
+ would still be lost.
+ We could treat empty index blocks like orphans. i.e. don't
+ discard them immediately but leave them with possibly real
+ addresses. Then when they have no children we allocate the
+ 0.
+ But we still need to ensure that index blocks off which siblings
+ have been split but not yet incorporated remain present in the
+ tree to mark the place for their siblings.
+ There is another problem. A horizontal split could leave the
+ new block with no addresses and everything in the uninc list.
+ Nothing can be found in there.
+
+ So maybe we need to revise the lookup mechanism.
+ The goal is to find an index block that starts at or before
+ the target and contains an address at or after the target.
+ Then our search can stop.
+ In rare cases.....
+
+7sep2009
+ I thought about this more over the weekend and think I have an answer.
+ We need to treat internal and leaf index blocks somewhat differently.
+
+ An internal index block must never be empty (while unlocked).
+ Any child block which has not had its address incorporated must be
+ attached (simply in the sibling list) to a block which has been
+ incorporated. This will be the block that it was split off from.
+ The uninc block needs to hold a reference so that the primary isn't
+ released.
+ When a 'primary' becomes empty it cannot be discarded, so the
+ addresses in the first dependent index block must be copied
+ across. This is awkward for indirect blocks so they might be
+ allowed to be empty (they aren't internal so don't violate the
+ above).
+ When a horizontal split breaks a sequence of dependent blocks
+ between two parents, the second parent must be incorporated
+ immediately so that the first block in the second half of the
+ sequence is incorporated.
+ If an internal index block does become empty and it has no
+ dependent blocks to fill from, it must be invalidated immediately.
+ It cannot have any children - even in next phase - as at least one
+ would have to be incorporated and so the block would not be empty.
+ Invalidating involves allocating to address 0.
+ If index lookup finds a block with PhysValid address of 0, it
+ must look to the previous index block. If there was none .... it
+ gets a bit complex.
+
+ Leaf index blocks can become empty, but we try to avoid it.
+ If a leaf has blocks which have been created in the next phase,
+ and others which have been deleted in this phase, it can be empty
+ but still have children. In this case we just treat it as a real
+ index block that doesn't actually have any addresses. We still
+ write it out even though that is a waste of space.
+
+ We have been working on the assumption that every address always
+ has a corresponding leaf index block. It is the leaf with the
+ highest index at or below the target address.
+ However this requires that every internal index block has a child
+ with the same address as the parent.
+ Preserving this requirement when the first child of an internal
+ index block becomes empty requires either:
+ - loading the 'next' child and reassigning this to the start
+ - changing the address of the parent to match the first child.
+ The former requires possibly reading a block from storage.
+ The latter only involves modifying blocks that are due to be
+ written out anyway, but makes block look up slightly interesting.
+ When lookup finds an invalid block that is 'first', it needs to
+ start again from the top.
+ When incorporation creates an invalid block that is first, it
+ needs to walk down from the top and any index block at the same
+ address needs to be relocated/rehashed. If the block is
+ incorporated, the incorporated address needs to be updated.
+ So:
+ - flag for unincorporated index blocks which implies a reference
+ on primary
+ - after split, immediately incorporate second block
+ - change lookup to retry when finding invalid block
+ - When internal block becomes empty, either merge with
+ first dependent or invalidate. If first in parent,
+ update address and parent and recurse.
+ Need some 'clever' locking here.
+ Before unlocking the invalidated block, we take i_alloc_sem,
+ then walk up the ->parent tree locking blocks as
+ required.
+ The index lookup, when it finds an invalid block will take
+ i_alloc_sem, then drop it, then start again.
+ Or maybe some other lock than i_alloc_sem...
+ - When leaf becomes empty, invalidate only if it has no children.
+ When internal leaf becomes unpinned, check if empty.
+
+21sep2009
+ That locking doesn't look like it will work, and we can never 'merge
+ with first dependent' as it is not valid to have an index block
+ where the first child is at a different address.
+ And we cannot always change the parent address, particularly if it
+ is zero - increasing it then cannot work.
+ And there is no need to load a block if we are just going to change
+ its start address (not internal index blocks anyway).
+ Let's drop the idea of relocating the parent.
+ If an internal index block becomes empty:
+ If it is last in parent, no loss, just discard
+ If parent would be empty, need to recurse up.
+ If it is not last relocate the next sibling to this location,
+ rehashing it and updating the parent.
+ If a leaf index block becomes empty we cannot just delegate to
+ next as it might be indirect... not a problem if address is
+ stored. But that requires a format change... now might be a
+ good time!
+
+
+ So:
+ If we hold an index block locked and it becomes empty and we choose
+ to invalidate it, we need to ensure that doing so does not
+ break any indexing paths.
+ So we take a separate lock (i_alloc_sem??) and flag the block as invalid
+ by setting physaddr to 0 while PhysValid is set, and unlock the block.
+ Any lookup that finds such a block must take and release i_alloc_sem,
+ and then restart from the top.
+ - If the block was not incorporated, we just remove from sibling list
+ and all is done - the space is implicitly included in
+ previous block.
+ - If the block has a different fileaddr than the parent then update
+ the parent directly, either removing the entry, or changing it to
+ point to the first unincorporated sibling (if there is one).
+ This requires taking the lock on the parent of course. That is
+ why we dropped the lock on the child.
+ Then all done.
+ - If the block has the same address as the parent we need to find
+ a 'next block' to relocate to the start of the parent.
+ It is either the first unincorporated sibling, or the next
+ block in the index block, or nothing, meaning the parent is
+ about to become empty.
+ We lock the parent (still holding i_alloc_sem), and rehash the
+ chosen child. If it doesn't exist, or is not dirty, we need
+ to update the phys address directly in the
+ accordingly, erasing or replacing the first address.
+ Then we need to rehash the index block, but we need to lock
+ the parent for that.
+ So set a 'busy' flag on the block, unlock it, lock parent,
+ rehash, clear busy flag, and repeat.
+ - We can never relocate a block with fileaddr of zero, as the
+ InoIdx block cannot be relocated. So leaf index block 0
+ must never be erased unless the file is empty. So
+
+28sep2009
+ New idea.
+ We store the start address of an indirect block in the block.
+ This means that the meaning of any index block is completely
+ independent of the location of the block, so we can change the location
+ easily and without touching the block.
+ So if a block becomes empty, we simply move the next block back to
+ fill the gap.
+ i.e. when an index block becomes truly empty (i.e. no children)
+ - if it wasn't incorporated, simply remove it
+ - if it was,
+ - if there is a dependent block, rehash it to take my address
+ - if there is a next block that is dirty, rehash it
+ - if there is a next block that is not dirty,
+ update parent to merge my entry with next, and rehash next
+ if it exists
+ - if there is no next block but we are not first, just update
+ parent
+ - if no next block and we are first, parent becomes empty,
+ recurse upwards.
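+
+ The case analysis above can be written as a small decision function. This
+ is only a sketch: 'struct empty_idx', 'enum empty_action' and
+ 'empty_index_action' are hypothetical names summarising the conditions in
+ the list, not code from the module.
```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical summary of an index block that has no children left. */
struct empty_idx {
	bool incorporated;   /* its address is recorded in the parent */
	bool has_dependent;  /* an unincorporated sibling hangs off it */
	bool has_next;       /* a later block exists in the same parent */
	bool next_dirty;     /* that later block is dirty */
	bool is_first;       /* it is the first entry in the parent */
};

enum empty_action {
	REMOVE_ONLY,                   /* never incorporated: just drop it */
	REHASH_DEPENDENT,              /* dependent takes over my address */
	REHASH_NEXT,                   /* dirty next block takes my address */
	UPDATE_PARENT_AND_REHASH_NEXT, /* merge my entry with next in parent */
	UPDATE_PARENT,                 /* no next, not first: drop my entry */
	RECURSE_TO_PARENT,             /* parent itself becomes empty */
};

/* Pick the action for an empty index block, in the order listed above. */
static enum empty_action empty_index_action(const struct empty_idx *s)
{
	assert(s);
	if (!s->incorporated)
		return REMOVE_ONLY;
	if (s->has_dependent)
		return REHASH_DEPENDENT;
	if (s->has_next)
		return s->next_dirty ? REHASH_NEXT
				     : UPDATE_PARENT_AND_REHASH_NEXT;
	if (!s->is_first)
		return UPDATE_PARENT;
	return RECURSE_TO_PARENT;
}
```
+ The ordering matters: a dependent block is preferred over the next block
+ because it needs no change to the parent.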
+
+12Oct2009
+ - too long, I've forgotten what I was up to..
+ + I've changed the format of indirect blocks to store an address.
+ + I've handled incorporation of an empty block
+ So now internal index blocks can never be empty - they get immediately
+ unlinked if they are.
+ Leaf index blocks can be empty while they have children. We don't
+ flag them as empty, but rather wait until another child gets incorporated.
+ But I don't think I really like that. It is an external ugliness based
+ entirely on internal implementation details. Empty index blocks should
+ not get written out. We need some way to reliably find an empty index
+ block. The address won't appear in the parent so a lookup will find the
+ previous block which we cannot link to now as it may not exist yet.
+ Worse - if first index block goes empty, we can only unlink it by moving
+ the parent to start at the next block. That would make this index block
+ totally unfindable.
+ So I think we have to stick with writing out empty index blocks very
+ rarely. So we need to be sure they disappear properly.
+ The difficult case is if an index block becomes empty while it has some
+ children which don't end up getting dirtied. e.g. an update aborts.
+ We need to leave the block with enough credits to be written out.
+ I guess the Ncredit should be enough...
+ Maybe worry about that later.
+
+ - what about InoIdx blocks when they become empty? It would be helpful
+ to flag them so that inode deletion can check....
+ Maybe just set depth to 0..
+
+ ARRGGG... I've completely lost it. I need another ITO week.
+ I just got a bug in summary.c:71!!
+
+7 Jun 2010
+ - summary.c:71.
+ ablocks_used has hit zero too soon.
+ This should be the count of blocks for which space has been allocated
+ (B_Prealloc is set) but have not been given a phys address yet - at which
+ point the usage count is moved to cblocks_used or pblocks_used.
+ The last block (which may not be the cause of the problem) does not have
+ B_Prealloc set, yet physaddr == 0.
+ The block is 0/1, so the inode for the inode usage map. This should have
+ physaddr 8 !!
+ We did find 8, then changed to 73, but then changed to 0!
+ Ahhh... recent fix exposed a subtle bug ... fixed.
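+
+ For reference, the accounting described above behaves like the sketch
+ below. The struct and function names are hypothetical, and 'to_cblocks' is
+ just an illustrative selector between the two destination counters; the
+ assertion stands in for a check of the kind that fired.
```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical usage counters, named after the fields in the notes. */
struct usage {
	long ablocks_used;  /* space allocated, no phys address yet */
	long cblocks_used;
	long pblocks_used;
};

/* Space has been reserved for a block (B_Prealloc would be set). */
static void usage_prealloc(struct usage *u)
{
	u->ablocks_used++;
}

/*
 * The block has been given a phys address: move one unit of usage
 * out of ablocks_used into cblocks_used or pblocks_used.
 */
static void usage_assign_phys(struct usage *u, bool to_cblocks)
{
	/* Every assignment must be matched by an earlier prealloc. */
	assert(u->ablocks_used > 0);
	u->ablocks_used--;
	if (to_cblocks)
		u->cblocks_used++;
	else
		u->pblocks_used++;
}
```
+ If a block reaches usage_assign_phys() without having gone through
+ usage_prealloc(), the counter hits zero too soon, which matches the
+ symptom described.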
+
+ Now cluster.c:619: [ce9233f8]0/282(0)r2F:Pinned,Phase1,PinPending,CN,CNI,Claimed,PhysValid iblock(1) leaf(1)
+ cluster.c:619: [ce570a18]0/286(0)r2F:Pinned,Phase1,PinPending,SegRef,CN,CNI,Claimed,PhysValid iblock(1) leaf(1)
+ cluster.c:619: [ce588d6c]0/17(0)r2F:Pinned,Phase0,PinPending,CN,CNI,Claimed,PhysValid iblock(1) leaf(1)
+ cluster.c:619: [ce51dfe4]0/283(0)r2F:Pinned,Phase0,PinPending,CN,CNI,Claimed,PhysValid iblock(1) leaf(1)
+ cluster.c:619: [cfbb8430]0/328(0)r2F:Pinned,Phase1,PinPending,CN,CNI,Claimed,PhysValid iblock(1) leaf(1)
+ We are allocating an InoIdx block, but data block is not valid??
+
+ That isn't very reproducible so I'll have to leave it for now...
+ erasedblock had been called on the data block .. inode 17??
+
+ Problem is that I keep changing the rules.
+ I don't erase the InoIdx block any more.
+ I used to, then changed it to iolock_block/cluster_allocate->0
+
+ Problem: When all files are removed, usage is still quite high, two
+ segments have over 400 blocks (out of 512). Cleaning keeps running and
+ not making much progress.
+ segment 6 has usage of 484.
+ 'cluster 3072' shows: cluster 3072, 3085, 3086 3092
+ Inode 0: blocks 267 272 276
+ Inode 277: blocks 0/4 6/2
+ Inode 0: blocks 0/2 8 16
+ Inode 0: block 16 70/2 131/3 135/4 140/9 150/2 ... 296/7
+ Inode 16: 1/1
+ Inode 17: 0/28
+ Inode 283: 12/18
+ etc.
+
+ All 'old', so must be the product of cleaning, as you would expect.
+ All (most) of this has been deleted though, but count didn't drop.
+ 'Count' adds to 508, plus the 4 cluster heads makes 512 - good.
+ lafs_seg_move definitely isn't being called on these blocks.
+ it is only called from lafs_summary_update
+ cblocks_used "exactly" matches the number of un-removed blocks.
+
+
+ Another problem
+bad [ce5bcf50]301/0(0)r1E:Index(0),Pinned,Phase1,WPhase0,InoIdx,Valid,SegRef,C,CN,CNI,UninCredit,IOLock,PhysValid{0,0}[0] leaf(1)
+/home/neilb/work/nfsbrick/fs/module/modify.c:1652: [ce5bcf50]301/0(0)r1E:Index(0),Pinned,Phase1,WPhase0,InoIdx,Valid,SegRef,C,CN,CNI,UninCredit,IOLock,PhysValid{0,0}[0] leaf(1)
+bad [cfbf6000]327/0(0)r1E:Index(0),Pinned,Phase1,WPhase0,InoIdx,Valid,SegRef,C,CN,CNI,UninCredit,IOLock,PhysValid{0,0}[0] leaf(1)
+/home/neilb/work/nfsbrick/fs/module/modify.c:1656: [cfbf6000]327/0(0)r1E:Index(0),Pinned,Phase1,WPhase0,InoIdx,Valid,SegRef,C,CN,CNI,UninCredit,IOLock,PhysValid{0,0}[0] leaf(1)
+bad [cfb62d20]291/0(0)r1E:Index(0),Pinned,Phase1,WPhase1,InoIdx,Valid,SegRef,C,CN,CNI,UninCredit,IOLock,PhysValid{0,0}[0] leaf(1)
+/home/neilb/work/nfsbrick/fs/module/modify.c:1656: [cfb62d20]291/0(0)r1E:Index(0),Pinned,Phase1,WPhase1,InoIdx,Valid,SegRef,C,CN,CNI,UninCredit,IOLock,PhysValid{0,0}[0] leaf(1)
+
+ and
+free_blocks=1842 allocated=449 max_seg=512 clean_reserved=0
+Want dump of usage
+
+------------[ cut here ]------------
+kernel BUG at /home/neilb/work/nfsbrick/fs/module/segments.c:1028!
+ free list is empty - that should not be.
+
+and another...
+/home/neilb/work/nfsbrick/fs/module/modify.c:1219: [ce9893b0]74/0(0)r1E:Index(1),Pinned,Phase0,WPhase1,InoIdx,Valid,Dirty,CN,CNI,UninCredit,IOLock,PhysValid{0,0}[0] leaf(1)
+/home/neilb/work/nfsbrick/fs/module/modify.c:1219: [ce5ba690]74/0(0)r1E:Index(1),Pinned,Phase1,WPhase0,InoIdx,Valid,Dirty,CN,CNI,UninCredit,IOLock,PhysValid{0,0}[0] leaf(1)
+ [<d0a57bc8>] ? lafs_get_flushable+0x131/0x191 [lafs]
+ [<d0a5856d>] ? lafs_do_checkpoint+0x1b3/0x3a2 [lafs]
+ [<d0a5fe7e>] ? cleaner+0x105/0x1426 [lafs]
+ [<c02256bf>] ? autoremove_wake_function+0x0/0x33
+ [<d0a5fd79>] ? cleaner+0x0/0x1426 [lafs]
+
+
+08Jun2010
+ Weirdness with truncating.
+ The cleaner relocates a file resulting in the InoIdx block being
+ Maybe-dirty and phys_addr == 0.
+ Then truncate doesn't prune but just incorporates, finding
+ something weird there..
+ file 278, blocks around 4100
+ seem to find 1949 instead??
+
+ Note: When a non-InoIdx block is erased we set PhysValid
+ and physaddr == 0 to record the fact because it will not be stored...
+
+modify.c:1654: [ce5b4460]327/336(16)r4F:Index(1),Pinned,Phase0,WPhase1,Valid,Async,SegRef,C,CI,CN,CNI,IOLock,PhysValid{0,0}[0] leaf(1) inode_handle_orphan2(1) async(1) inode_handle_orphan3(1)
+Async ??
+modify.c:1657: [cfb90690]327/340(787)r4F:Index(1),Pinned,Phase1,WPhase0,Valid,Async,SegRef,C,CI,CN,CNI,IOLock,PhysValid{0,0}[0] leaf(1) inode_handle_orphan2(1) async(1) inode_handle_orphan3(1)
+Still Async ... wonder what it means.
+
+- directory block got corrupted. Maybe conversion to indexed??
+
+
+Getting bug in remove_from_index because the addr isn't
+there, possibly block is empty. But incorporation is
+??? instant? No it isn't.
+If an index block hasn't been incorporated it has B_PrimaryRef
+set as it holds a ref to some earlier index block.
+But what if nothing is incorporated?
+
+
+Allocated [ce402230]328/340(0)r5F:Index(1),Pinned,Phase0,WPhase0,Valid,Dirty,Async,SegRef,CN,CNI,UninCredit,IOLock,PhysValid,Uninc{0,0}[0] inode_handle_orphan2(1) leaf(1) uninc(1) async(1) inode_handle_orphan3(1) -> 0
+looping on [ce402230]328/340(0)r5F:Index(1),Pinned,Phase0,Valid,Dirty,Writeback,Async,SegRef,CN,CNI,UninCredit,PhysValid,Uninc{0,0}[0] inode_handle_orphan2(1) cluster(1) uninc(1) async(1) inode_handle_orphan3(1)
+
+Then spin in a soft-lockup in lafs_inode_handle_orphan
+
+
+-----------
+ - grow_index_tree needs to do initial incorporation so things can be found.
+ just like end of do_incorporate_internal.
+ NO - cannot incorp yet as do not have phys addr. Don't need to as
+ lafs_leaf_find explicitly handles this.
+ For truncate case we don't use the stored address, but ensure all
+ leaf indexes must be dirty (or gone) so whole tree must be
+ accessible for walking around.
+ - do_incorporate_internal needs to set B_PrimaryRef and take the ref
+ - when we remove a B_PrimaryRef without incorporating it, we need to
+ drop a ref if the *next* in the list is B_PrimaryRef
+ - need to use a constant to identify 'async' calls etc.
+ - maybe I need other iolock_block in truncate ?? to ensure it is Valid so
+ it isn't found as async....
+
+09Jun2010
+ STILL struggling with incorporation.
+ We have a premise that any file address is covered by precisely
+ one leaf index block. Every leaf index has an implicit address
+ and it covers all addresses from there to the next leaf. The last
+ leaf covers to EOF.
+ So there must always be a leaf at address 0.
+ This applies within the tree from an internal index block too.
+ Beneath an internal index block there must be a leaf covering every
+ address up to the next internal index block. So there must be
+ a first. So storing the first address is pointless. And harmful.
+ When an index block becomes empty and disappears its coverage is
+ included in the previous block unless there is none, in which case
+ the next index block must be re-addressed. If there is no 'next',
+ this index block must be empty and so must disappear.
+
+ BUT if we re-address an index block, we implicitly re-address the
+ first child - recursively - so we need to move/rehash them all
+ or lose them... or record where they are. Or do lookup not by
+ addr....
+ I think just rehashing them all - with an iolock - is simple
+ and safe. So just do that.
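+
+ The coverage premise makes lookup a "greatest start address not above the
+ target" search. A minimal sketch, assuming the leaf start addresses are
+ held in a sorted array (an illustrative layout only; the name
+ 'covering_leaf' is hypothetical):
```c
#include <assert.h>
#include <stddef.h>

/*
 * Return the position of the leaf covering 'target': the greatest i
 * with starts[i] <= target.  'starts' is sorted ascending and, per the
 * premise, starts[0] must be 0 so that every address is covered.
 */
static size_t covering_leaf(const unsigned long *starts, size_t n,
			    unsigned long target)
{
	size_t lo = 0, hi = n;

	assert(n > 0 && starts[0] == 0);
	/* Invariant: starts[lo] <= target and the answer lies in [lo, hi). */
	while (hi - lo > 1) {
		size_t mid = lo + (hi - lo) / 2;

		if (starts[mid] <= target)
			lo = mid;
		else
			hi = mid;
	}
	return lo;
}
```
+ With starts {0, 10, 25}, address 24 is covered by the leaf at 10, and
+ anything from 25 up to EOF by the last leaf. Removing the middle leaf
+ leaves 10..24 covered by the leaf at 0, with no re-addressing needed.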
+
+
+ So: I cleaned up index handling and truncation somewhat.
+ Now running looptest to see what patterns emerge:
+
+ block.c:197 (*9+1) During umount, the Root datablock is
+ Dirty+Realloc
+ Maybe just need for cleaner to become inactive
+ during umount - hope that doesn't deadlock
+ didn't even work...
+ block.c:529 (*4+1) erase dblock while iblock depth > 0
+ When pruning InoIdx we want to set depth to 0.
+ FIXME is this really what I want, or is depth=0
+ only for data-inode ... FIXME
+ cluster.c:533 (*2) cluster_allocate on invalid block
+ Block is 8/0 in writepage from sync_inodes
+ This is the orphan file.
+ blocks aren't dirty
+ I guess the file gets truncated while we wait for it.
+ Just need to re-test.
+ index.c:1936 (*2). An index block is Root - FIXED??
+ modify.c:1056 - secondary bug, ignore for now.
+ modify.c:1650 update_index fails to find target.
+ second call, phys==0
+ Code was bad ... may not be the cause though.
+ modify.c:1696 (*4) lafs_incorporate gets non-dirty Index(1) block
+ from orphan handler.
+ Maybe just change the do/while back to 'do'.
+ modify.c:1704: (*2) lafs_inc gets leaf with uninc list???
+ Index(0)/InoIdx
+ in do_checkpoint
+ uninc list gets set in lafs_add_block_address (parent of iblk),
+ do_incorporate_internal,
+ Maybe the InoIdx still had children.
+ segments.c:1028. (*4) The free list becomes empty.
+ super.c:655 (*3) Busy inodes after umount, and root InoIdx block
+ is still pinned as inode 16 data block was still dirty.
+ segusage slow. Maybe same as block.c:197 ??
+ invalid address 6b6b6bfb: invalidate_inode_buffers in shutdown
+ finds invalid lock.
+ presumably the inode was freed before invalidated.
+ spin on writeback during truncate (r3a) 8 times. now 10
+ Probably because writeback cannot proceed while
+ orphan processing keeps looping.
+ kmalloc-1024 problems - (*2)
+ A block - should be start of page - isn't what it appears...
+
+ Others complete with 'cb' ranging from 202 to 715
+
+
+10 June 2010
+
+ Looking at segments.c:1028
+ We run a seg_scan every checkpoint, so that should keep free segments
+ in the list.....
+ Ahh.. do_checkpoint is looping because root isn't changing phase.
+
+ Lowest block pinned to old phase is
+ [cfb7df08]0/74(4253)r0E:Pinned,Phase1,WPhase0,Valid,Dirty,CN,CNI,UninCredit,IOLock,Claimed,PhysValid
+ which is not on leaf list because it has IOLock
+ With more debugging:
+ [ce5c5f08]0/74(4250)r0E:Pinned,Phase1,WPhase0,Valid,Dirty,Realloc,SegRef,CN,CNI,UninCredit,IOLock,Claimed,PhysValid</io.c:368>
+ or better (that was in lafs_iolock_written)
+ [ce5c05e8]0/74(4257)r0E:Pinned,Phase0,WPhase0,Valid,Realloc,SegRef,C,CN,CNI,UninCredit,IOLock,Claimed,PhysValid</file.c:247>
+ FIXED - I didn't unlock if it wasn't dirty any more.
+ Well almost - it occurs much less now.
+ Out of 48 runs:
+ 8 BUG: soft lockup - CPU#0 stuck for 61s! [lafs_cleaner:1180]
+ 1 BUG: unable to handle kernel NULL pointer dereference at 000001b4
+ 2 BUG: unable to handle kernel paging request at 6b6b6bfb
+ 3 kernel BUG at /home/neilb/work/nfsbrick/fs/module/block.c:197!
+ 2 kernel BUG at /home/neilb/work/nfsbrick/fs/module/block.c:529!
+ 1 kernel BUG at /home/neilb/work/nfsbrick/fs/module/modify.c:1650!
+ 2 kernel BUG at /home/neilb/work/nfsbrick/fs/module/modify.c:1696!
+ 4 kernel BUG at /home/neilb/work/nfsbrick/fs/module/segments.c:1028!
+ 1 kernel BUG at /home/neilb/work/nfsbrick/fs/module/segments.c:332!
+
+ So we now have 1/12 rather than 2/3.
+ a/ pinned by IOLock from file.c:220 - FIXED
+ b/ as above
+ c/ Root is pinned by 4 children
+ 328/0 with 196 of data blocks in writeback/realloc, in a cluster
+ 0/1, 74/0, 0/8 all in a cluster waiting writeout.
+ Don't understand this.
+ d/ as a,b
+
+ Of the 48, 11 ran to completion leaving blocks from 286 to 899
+
+
+ Looking at the loss of blocks when truncating.
+ tracing shows small number of files with remaining blocks at delete.
+ sum is 26+22+14+272+11+2 == 347 cf df shows cb=457
+ next attempt: 14+24+26*11 =324 cf cb=1124
+ next attempt 26+6+15+68+29 == 144 cf cb=383
+ 26+18+14+19+284 = 361 cf 379
+ files are (in order)
+ 49 bfile - 30K
+ 325 nbfile-49 - 30K
+ 320 nbfile-44 - 30K
+ 296 nbfile-20 - 30K
+ ??331??
+
+11 June 2010
+
+ Thinking about truncate and index blocks becoming empty while
+ they still have children.
+ For leaf indexes, we need to leave the block in place in case
+ the children get written. We need to find a time to ultimately
+ delete it...
+ For internal indexes,.... uhm, it just works, OK??
+
+ When I drop an uninc block, I need to remove it from the
+ uninc list, and from phase_leafs
+ clearing dirty and refiling should remove from leafs.
+
+ When we recurse to a parent, we need to remove
+ *this* block from the uninc list for said parent.
+ It should be the only thing in the list.
+ But even when we don't recurse, the fact that we have
+ incorporated means that we should tidy up the ->uninc
+ list.
+
+
+
+12 June 2010
+ unmount hung after lafs_run_orphans from lafs_put_super
+ There are two orphans in Writeback which cannot progress
+ until the current cluster is written...
+ But they keep getting re-written!
+ Other time, one orphan, index block is Dirty on a leaf ???
+
+orph=[cfbdcf24]0/331(3780)r2E:Valid,SegRef,C,CI,CN,CNI,Claimed,PhysValid,Orphan(0) orphan_list(1) iblock(1)
+[cfb8e460]331/0(NoPhysAddr)r1F:Index(1),Pinned,Phase0,InoIdx,Valid,Dirty,SegRef,CN,CNI,UninCredit{0,0}[0] leaf(1) Leaf0(1)
+LAFS_cluster_flush 1
+
+
+orph=[ce5c9bb4]0/327(3317)r2E:Valid,SegRef,C,CI,CN,CNI,Claimed,PhysValid,Orphan(0) iblock(1) orphan_list(1)
+[cfbe3a40]327/0(NoPhysAddr)r1F:Index(1),Pinned,Phase0,InoIdx,Valid,Dirty,SegRef,CN,CNI,UninCredit{0,0}[0] leaf(1) Leaf0(0)
+
+ OK, problem is that when we truncate and remove an index block, the
+ next index block expands backwards to fill the space.
+ Then we apply prune_some, but don't check if anything was done.
+ We always mark it dirty, so it has to be written and then
+ we loop through again...
+ So need to check if prune_some did anything.
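+
+ A sketch of the intended fix, assuming a hypothetical prune step that
+ reports how many entries it removed. 'struct idx', 'prune_some_sketch'
+ and 'truncate_step' are illustrative names, not the module's functions.
```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical index block: 'entries' held, 'keep' of them before EOF. */
struct idx {
	int entries;
	int keep;
	bool dirty;
};

/* Remove up to 'batch' entries beyond the new EOF; return how many. */
static int prune_some_sketch(struct idx *ib, int batch)
{
	int excess = ib->entries - ib->keep;
	int n = excess < batch ? excess : batch;

	if (n < 0)
		n = 0;
	ib->entries -= n;
	return n;
}

/*
 * One pass of the truncate loop.  Only dirty the block when pruning
 * actually changed it; otherwise it would be written out and revisited
 * forever.  Returns true if another pass may be useful.
 */
static bool truncate_step(struct idx *ib, int batch)
{
	int removed;

	assert(ib && batch > 0);
	removed = prune_some_sketch(ib, batch);
	if (removed > 0)
		ib->dirty = true;
	return removed > 0;
}
```
+ Once a pass removes nothing, the block is left clean and the orphan
+ handler has no reason to loop on it again.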
+
+TODO:
+ - prune_some needs to get more done at a time
+ - let cleaner finish up before umount
+ - use early segments first ??
+ - look at write-clusters and check OK
+ - check that df:cb= drops properly.
+
+Bugs:
+ 1 BUG: spinlock lockup on CPU#0, sh/1168, c0441170 - SECONDARY BUG
+ 1 BUG: unable to handle kernel NULL pointer dereference at 000001b4
+ 3 BUG: unable to handle kernel paging request at 00100104
+ 5 BUG: unable to handle kernel paging request at 6b6b6bfb
+ 1 BUG: unable to handle kernel paging request at 7fffffff
+ 7 kernel BUG at /home/neilb/work/nfsbrick/fs/module/block.c:197!
+ 9 kernel BUG at /home/neilb/work/nfsbrick/fs/module/block.c:479!
+ 2 kernel BUG at /home/neilb/work/nfsbrick/fs/module/block.c:529!
+ 2 kernel BUG at /home/neilb/work/nfsbrick/fs/module/block.c:67!
+ 2 kernel BUG at /home/neilb/work/nfsbrick/fs/module/inode.c:828!
+ 4 kernel BUG at /home/neilb/work/nfsbrick/fs/module/inode.c:843!
+ 1 kernel BUG at /home/neilb/work/nfsbrick/fs/module/modify.c:1708!
+ 7 kernel BUG at /home/neilb/work/nfsbrick/fs/module/segments.c:1028!
+ 2 kernel BUG at /home/neilb/work/nfsbrick/fs/module/segments.c:332!
+ 30 kernel BUG at /home/neilb/work/nfsbrick/fs/module/super.c:655!
+
+Quite a haul there!
+
+super.c:655
+ Pinned block in lafs_release:
+ 0/2 is Dirty with plenty of credits, so it is a child
+ 0/16 is Dirty/Realloc, or once Async
+ Dirty, but not on a leaf list, not pinned
+
+segments.c:332
+ seg_deref with refcnt , 2 in lafs_seg_put_all
+
+segments.c:1028
+ No free segments - no real pattern.
+
+modify.c:1708
+ lafs_incorporate on non-dirty/realloc block
+ 328/0 Index(1). 1 in uninc_table - probably during truncate.
+ Either we add uninc while not dirty
+ Or we clear Dirty while uninc present
+ or there is a race between the two.
+
+ Don't know: add a bugon
+ Bugon in get_flushable didn't fire.
+
+inode.c:843
+ children present in truncate after final incorp...
+ 328/0. 64 children, no uninc list. Maybe we ran the orphans too early??
+ or invalidate_page isn't removing the children.
+ Might want print_tree here?- added that.
+ Answer: all the children are in Realloc on Clean_leafs
+ Maybe erase_page needs to disconnect from cleaner too??
+
+inode.c:828
+ Orphan handling - uninc but not dirty: is Realloc (sometimes)
+ Maybe like mod:1708
+
+block.c:67 *
+ delref 'primary' from modify.c:2063 in the q2 branch.
+ nxt has PrimaryRef... Maybe move earlier, but that shouldn't make a diff.
+ ditto at modify.c:2035 nxt is primary as was I, so drop mine.
+ Don't know - looks like sibling list got broken.
+ Tidied up a bit and added a print-tree.
+ v.interesting result. Lots of consecutive index blocks all holding primary-ref
+ on single primary - which is wrong.
+ 1/ When setting PrimaryRef, if next holds PrimaryRef, then must take reference
+ on self, as are being inserted into chain
+ 2/ When splitting, new block must be addressed as first block which cannot
+ fit, not first block which doesn't fit. Else incorping in reverse order
+ can make lots of tiny index blocks.
+
+block.c:529 *
+ erase with index depth > 1.
+ 0/328 in orphan handling. Still have 8 or 15 blocks registered!
+ Maybe caused by index block errors. Added some printks.
+
+block.c:479 *
+ not enough credits to dirty block 2/0 in dir_delete_commit for unlink.
+ 74/xxxx in unlink
+ 16/1 in seg_inc/seg_move...allocated_block/cluster_flush
+
+ - writepage wrote the page??
+ - checkpoint wrote it and didn't replenish the credits?
+
+block.c:197 XX
+ invalidated pages finds dirty block after EOF, after iolock_written
+ 0/0 Dirty/Realloc in unmount - all Realloc!
+ Need to wait for cleaner etc to finish at unmount time.
+
+NULL deref in 1b4 YY
+ cleaner->cluster_flush->count_credits->lock??
+ Trying to get a lock on an inode that has since been freed??
+ spin_lock(&dblk(b)->my_inode->i_data.private_lock);
+
+
+001001 YY
+ generic_drop_inode -- extra iput?? in lafs_inode_checkpin from refile
+6b6b6b YY
+ invalidate_inode_buffers!! in kill. use-after-free
+
+7fffff
+ seginsert from scan_seg
+ MAX/number-elements confusion. Worked around for now.
+
+
+18 June 2010
+After a couple of fixes:
+ 1 BUG: unable to handle kernel NULL pointer dereference at 000001b4
+ 1 BUG: unable to handle kernel paging request at 00100104
+ 5 BUG: unable to handle kernel paging request at 6b6b6bfb
+ 4 kernel BUG at /home/neilb/work/nfsbrick/fs/module/block.c:209!
+ 4 kernel BUG at /home/neilb/work/nfsbrick/fs/module/block.c:496!
+ 3 kernel BUG at /home/neilb/work/nfsbrick/fs/module/block.c:67!
+ 2 kernel BUG at /home/neilb/work/nfsbrick/fs/module/cluster.c:531!
+ 16 kernel BUG at /home/neilb/work/nfsbrick/fs/module/inode.c:601!
+ 1 kernel BUG at /home/neilb/work/nfsbrick/fs/module/inode.c:852!
+ Realloc blocks confusing truncate
+ 2 kernel BUG at /home/neilb/work/nfsbrick/fs/module/modify.c:118!
+ 1 kernel BUG at /home/neilb/work/nfsbrick/fs/module/modify.c:1699!
+ 7 kernel BUG at /home/neilb/work/nfsbrick/fs/module/segments.c:1033!
+ 19 kernel BUG at /home/neilb/work/nfsbrick/fs/module/super.c:655!
+
+
+TODO:
+ - truncate gets confused by blocks being cleaned.
+ Need to flush cleaner, or just remove the blocks.
+ - when add PrimaryRef in middle of list, take the right ref.
+ - fix up wait-for-cleaner at unmount time.
+
+19 Jun 2010
+
+ 3 BUG: unable to handle kernel paging request at 6b6b6bfb
+ 5 kernel BUG at /home/neilb/work/nfsbrick/fs/module/block.c:209!
+ 5 kernel BUG at /home/neilb/work/nfsbrick/fs/module/index.c:1890!
+ 22 kernel BUG at /home/neilb/work/nfsbrick/fs/module/inode.c:601!
+ 2 kernel BUG at /home/neilb/work/nfsbrick/fs/module/inode.c:835!
+ 3 kernel BUG at /home/neilb/work/nfsbrick/fs/module/inode.c:852!
+ 9 kernel BUG at /home/neilb/work/nfsbrick/fs/module/segments.c:1033!
+ 17 kernel BUG at /home/neilb/work/nfsbrick/fs/module/super.c:656!
+ 251 SysRq : Resetting
+ 3 SysRq : Show State
+
+ - We can erase a dblock while it is in the uninc_pending or
+ uninc_next - need to be careful
+ - At umount, 0/2 is Dirty but not Pinned, so not written out
+ ditto from 0/16
+ 16/0 sometimes is Async
+ 16/0 Async might be from the segment scan - so wait for that.
+ Dirty but not pinned can happen when InoIdx is pinned.
+
+ - I think the uninc_next list (at least) should be sorted before
+ being allocated.
+
+ - root block dirty/realloc/leaf in final iput
+ Could be it was changed during last checkpoint so
+ pushed in to next phase? But why Realloc?
+ Maybe still issue with losing inode data block.
+
+20 June 2010 Happy Birthday Dad!!
+
+420 runs.
+ 4 BUG: unable to handle kernel paging request at 6b6b6bfb
+ 26 kernel BUG at /home/neilb/work/nfsbrick/fs/module/block.c:209!
+ 87 kernel BUG at /home/neilb/work/nfsbrick/fs/module/inode.c:601!
+ 1 kernel BUG at /home/neilb/work/nfsbrick/fs/module/inode.c:839!
+ 4 kernel BUG at /home/neilb/work/nfsbrick/fs/module/inode.c:856!
+ 1 kernel BUG at /home/neilb/work/nfsbrick/fs/module/modify.c:1719!
+ 12 kernel BUG at /home/neilb/work/nfsbrick/fs/module/segments.c:1033!
+ 2 kernel BUG at /home/neilb/work/nfsbrick/fs/module/super.c:656!
+
+ Problems:
+ - inode in i_sb_list has been freed.
+ - block 0/0 is dirty/realloc/leaf after final iput
+ - not all blocks freed by truncate
+ - Index block with uninc is not dirty - not FIXED: more iolock in phase_flip
+ - still children when truncate should have finished.
+ all are Realloc
+ Maybe inode has become unhashed and we re-load it??
+ it is invalid after all!!
+ - Index block not dirty when incorp - has uninc. ??
+ - didn't wait for free segments
+ - Data 16/0 is dirty but not pinned after final checkpoint - FIXED
+
+
+watch -d 'awk -f checkseg /tmp/log; echo ====== ; grep -h -E "(blocked for more|BUG|SysRq : )" /var/tmp/lafs-logs/log* | sort | uniq -c ; echo === ; ls /var/tmp/lafs-logs/log* | wc -l'
+watch -d 'echo ====== ; grep -h -E "(blocked for more|BUG|Busy inodes after|SysRq : )" /var/tmp/lafs-logs/log* | sort | uniq -c ; echo === ; ls /var/tmp/lafs-logs/log* | wc -l'
+
+
+ Unclear on dirtying index blocks.
+ We normally mark it dirty first, then add the address to the uninc list.
+ Note that this is the reverse of data blocks which are changed first, then
+ dirtied. So maybe we should mark dirty afterwards. We then need to
+ avoid incorporation while we are adding addresses else we might find it
+ has addresses but is not dirty. Only try if dirty?
+ Maybe we should iolock the parent. We need to do that anyway to flush
+ incorporations when the table is full. Yes, that fits the VM model
+ better. Always lock while updating and preparing to write. Set
+ writeback once write has started, then unlock. Cool.
+ Except that the block is already iolocked when we allocate (to 0), so we cannot lock the parent..
+
+21June2010
+ Apart from tracking down the remaining bugs, I need to:
+ 1/ Decide on locking for incorporation and attaching new address to a block
+ and implement it.
+ In particular we need to not lose the Dirty flag before the update is done.
+ 2/ Resolve handling of pinned inode data/index blocks
+ 3/ Correct handling of empty index blocks, particularly when parent is in
+ different phase. Make lookup be more careful?
+ 4/ Wait for there to be enough free segments before allowing allocation.
+
+ 2: Problem is that we cannot handle a pinned inode-data block while the
+ InoIdx block is pinned in the same phase.
+ We currently unpin it so it drops off the leaf list. But then we
+ need to re-pin it when the InoIdx is unpinned or phase_flipped, and that
+ gets ugly. Possible though.
+ An alternate is to treat it like a parent and keep it off the list
+ while the InoIdx is pinned/same-phase. So we would need to
+ re-assess it after unpinning or flipping the InoIdx. That is probably
+ a lot easier than re-pinning it.
+
+ 1: We would normally set 'dirty' after changing the block. But we need
+ to differentiate Dirty from Realloc, so we set before adding addresses.
+ This requires that we are careful not to write an index block while there
+ are pending changes. The fact that pinned children stop any writing,
+ as do pending addresses in a list should ensure this.
+
+ 3: When an index block becomes empty we need to make sure that
+ future lookup doesn't get confused by it. Specifically future
+ index lookup must avoid the block so nothing new gets added.
+ Possibly a previous block will split again, but this block must remain
+ unused.
+ However we cannot update the parent block immediately as it might
+ be in a different phase.
+ So we must record both "don't touch this" and "where to look instead"
+ elsewhere - in children.
+ If the block being deleted is *not* the first child in the parent,
+ then we direct index lookup to the earlier block.
+ If the block being deleted *is* the first child in the parent,
+ then redirect to the second child if there is one and we weren't just there.
+ If there is no other block we flag the parent as empty and retry
+ from the top.
+ We flag a parent as empty with B_EmptyIndex.
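+
+ A minimal sketch of the redirect rule above, as a userspace model and not
+ the lafs code. The function name, the child-position arguments and the
+ REDIRECT_PARENT_EMPTY value are all made up for illustration; 'came_from'
+ stands for "the child we were just redirected from", or -1 if none.

```c
#include <assert.h>

/* Hypothetical return value: flag the parent B_EmptyIndex and retry from the top. */
enum { REDIRECT_PARENT_EMPTY = -1 };

/*
 * nchildren: number of children addressed by the parent.
 * victim:    position in the parent of the index block that became empty.
 * came_from: position we were just redirected from, or -1.
 * Returns the position index lookup should try instead.
 */
int redirect_target(int nchildren, int victim, int came_from)
{
	assert(victim >= 0 && victim < nchildren);

	if (victim > 0)
		return victim - 1;	/* not the first child: use the earlier block */
	if (nchildren > 1 && came_from != 1)
		return 1;		/* first child: try the second, unless we were just there */
	return REDIRECT_PARENT_EMPTY;	/* nothing else to try */
}
```

+ This only picks a candidate; the caller would still have to take a
+ reference, wait, and re-check that the block is still good.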
+
+ What locks do we need to walk around the sibling list?
+ the inode private_lock is minimal, but we cannot hold that to take a
+ iolock - just to get a reference.
+ I guess we
+ - iolock the parent
+ - try to find a good block using private_lock
+ - get a ref and wait for it.
+ - check if it is still a good block. If not, start again
+
+ If we find an EmptyIndex block, it must be directly addressed by parent.
+ It will never be followed by a PrimaryRef block because if there were
+ such a block, we would have readdressed it back and hidden the EmptyIndex.
+ So we need to look around for an address in the parent that leads to
+ a non-EmptyIndex block.
+
+ If all children are empty, we need to make the parent empty. But
+ what if it is InoIdx?
+ Maybe I am making this too hard. I could just use i_alloc_sem to
+ block lookups while truncate is happening. That doesn't address
+ single block removal e.g. from directories.
+ So I need to be able to wait for incorporation to happen on an
+ empty index block. We hold iolock on the parent. If there are blocks
+ on ->uninc, we just process them immediately. If there are blocks on
+ ->uninc_next, we wait for the checkpoint to complete
+
+ What does lafs_incorporate actually do with EmptyIndex blocks?
+ Provided they match currently incorporated addresses, they just cause
+ those addresses to disappear.
+
+ If a block is in the uninc list for its parent, then is phase_flipped
+ and changed and written out it could get a new physaddr before
+ it is incorporated.
+ I guess we never allocate a B_Uninc block which is in a different phase
+ to the parent. Currently we wouldn't do that anyway except in truncate
+ though memory pressure on index blocks might one day??
+ Truncate? We cannot allocate directly in lafs_incorporate.
+ We should get lafs_cluster_allocate to notice and DTRT.
+
+ Only hash index blocks when they are incorporated. Not needed before then.
+ When processing an uninc list, if an address appears twice, prefer the one
+ that isn't EmptyIndex...
+
+22June2010
+ I need a clear picture of the "Steady state" for an internal index block
+ with its children.
+ The internal index block contains 1 or more addresses. For each address there
+ may be a child index block. If there is, it may be the head of a list of
+ blocks with B_PrimaryRef set thus holding the whole list in place until
+ incorporation happens.
+ Each of these children can be on either ->uninc_list or ->uninc_next,
+ or possibly neither if they haven't been queued for writing yet. Any
+ PrimaryRef block will be Pinned.
+
+ When a child is incorporated and found to be Empty it is flagged as such
+ and then must never be returned by index lookup. Index lookup will either
+ add a block to a leaf index so it doesn't appear empty, or will hit an EmptyIndex
+ block and so have to start again from the top.
+
+ When a PrimaryRef block becomes empty it is simply removed from the
+ PrimaryRef chain so it cannot be found. The space now belongs to the
+ previous block.
+ When a non-PrimaryRef block which isn't the first becomes empty it is
+ flagged and left in place so that following blocks can be found. The
+ address space now belongs to the previous block.
+ When the first child (fileaddr matches parent) becomes empty - what?
+ We could re-address first child but that forces early address change -
+ old might not be incorp yet
+ We could re-address the parent, but that doesn't work for InoIdx
+ We could leave it there with physaddr == 0
+
+ Last sounds promising. So we never re-address an index block.
+
+ So: From the top.
+
+ Index blocks, Indirect blocks, extent blocks each have an address
+ that never changes.
+ When a block becomes over-full it splits - a new block appears with
+ a new address thus implicitly limiting the address space covered
+ by the original.
+
+ When an index block becomes empty and has no pinned children it is
+ marked as EmptyIndex (under IOLock).
+ When an EmptyIndex is allocated it goes to phys==0
+ An EmptyIndex which is not first (->fileaddr != ->parent->fileaddr)
+ is never used again. Its address space is ceded to the previous
+ index block - which could split several times...
+ An EmptyIndex which is first can be re-used. Once it gets pinned
+ children the EmptyIndex is cleared.
+
+ An Index block always has an entry for the first address. It might
+ be implicit to phys==0. Loading such a block creates an empty
+ block.
+
+ InoIdx doesn't get EmptyIndex, rather it gets ->depth=1
+
+ Indirect *doesn't* store the first address any more.
+
+ Changes:
+DONE - remove forcestart from layoutinfo
+DONE - remove start-address from Indirect blocks
+DONE - only hash index blocks when they are known to be incorporated.
+DONE - when incorporating an uninc list, ignore phys==0 if also a block with
+ same fileaddr and phys!=0. so sort phys==0 first
+DONE - Create EmptyIndex flag
+DONE - Clear the flag when adding child pin to index block
+DONE - avoid EmptyIndex non-start blocks during index lookup
+DONE - allow index blocks to be loaded with ->phys==0
+DONE - allow EmptyIndex index block to be "written" to phys 0
+DONE - ensure index lookup finds implicit start address, possibly 0
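+
+ A sketch of the "sort phys==0 first, then ignore it if the same fileaddr
+ also has phys!=0" step from the list above. This is a standalone model:
+ struct uninc_addr and uninc_prepare are hypothetical names, and the real
+ uninc list is a list of blocks rather than a flat array.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical flat form of one uninc entry. */
struct uninc_addr {
	uint32_t fileaddr;
	uint64_t phys;
};

/* Order by fileaddr; within one fileaddr, phys==0 sorts first. */
static int cmp_uninc(const void *a, const void *b)
{
	const struct uninc_addr *x = a, *y = b;

	if (x->fileaddr != y->fileaddr)
		return x->fileaddr < y->fileaddr ? -1 : 1;
	return (x->phys != 0) - (y->phys != 0);
}

/*
 * Sort the entries, then drop every phys==0 entry that is shadowed by a
 * phys!=0 entry with the same fileaddr. Returns the new entry count.
 */
size_t uninc_prepare(struct uninc_addr *v, size_t n)
{
	size_t in, out = 0;

	qsort(v, n, sizeof(*v), cmp_uninc);
	for (in = 0; in < n; in++) {
		if (v[in].phys == 0) {
			size_t j = in + 1;

			/* skip further phys==0 entries for the same fileaddr */
			while (j < n && v[j].fileaddr == v[in].fileaddr &&
			       v[j].phys == 0)
				j++;
			if (j < n && v[j].fileaddr == v[in].fileaddr)
				continue;	/* a real address exists: ignore this one */
		}
		v[out++] = v[in];
	}
	return out;
}
```

+ Because the zeros sort first, one forward scan is enough to see whether a
+ real address follows for the same fileaddr.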
+
+So now after 36 runs
+ 3 kernel BUG at /home/neilb/work/nfsbrick/fs/module/index.c:1939!
+ 1 kernel BUG at /home/neilb/work/nfsbrick/fs/module/index.c:403!
+ 10 kernel BUG at /home/neilb/work/nfsbrick/fs/module/inode.c:605!
+ 14 kernel BUG at /home/neilb/work/nfsbrick/fs/module/segments.c:1034!
+ 4 kernel BUG at /home/neilb/work/nfsbrick/fs/module/segments.c:624!
+ 1 kernel BUG at /home/neilb/work/nfsbrick/fs/module/super.c:657!
+ 3 SysRq : Resetting
+
+
+index.c:1939
+ block 0/2 is Realloc and being allocated from cluster_flush while
+ parent is not Realloc or dirty
+ That is bad as Realloc gets set in lafs_allocated_block ... except
+ that the code was bad. FIXED.
+
+index.c:403
+ cleaner is pinning a block (299/25) which is not Realloc,
+ and phase isn't locked. We are only meant to pin data blocks
+ for updates while holding a phase lock.
+ Ahhh - bad code again. FIXED
+
+inode.c:605
+ Truncate doesn't clean up properly.
+ 327 has 60+1
+ 331 has 108+1
+ 327 has 34+1
+ 327 has 60+1
+ No sign of any children.
+
+ Very weird. Signs of incorporation going wrong.
+ Added more debugging.
+
+Found 4084 4 12 at 890
+Added 4084 4 12
+Found 4089 4 16 at 878
+Added 4089 4 16
+Found 4094 2 20 at 866
+Added 4094 2 20
+Found 2561 2 22 at 854
+Added 514 2 22
+Found 2564 4 24 at 842
+Found 2569 2 28 at 830
+Found 0 0 0 at 818
+
+Why are 2564 etc lost? No sign of alloc-to-0
+
+segments.c:1034
+ no free segments - need to wait somewhere.
+
+segments.c:624
+ allocated_blocks has gone over free_blocks!
+ in lafs_prealloc/reserve_block/free_get/ss_put/new_segment.../checkpoint.
+ Wanted CleanSpace to reserve the youthblk
+ Maybe related to not waiting - ignore for now.
+
+super.c:657
+ block 0/2 was dirty but not pinned. Should not happen to inodes.
+ block 0/0 was Pinned because it had a child - as above.
+
+ Maybe we don't carry the pin across when we collapse dir
+ into inode??... looks quite likely
+
+
+23 June 2010
+
+116 runs.
+ 1 BUG: unable to handle kernel paging request at 6b6b6bfb
+ 1 kernel BUG at /home/neilb/work/nfsbrick/fs/module/block.c:497!
+ 3 kernel BUG at /home/neilb/work/nfsbrick/fs/module/dir.c:710!
+ 7 kernel BUG at /home/neilb/work/nfsbrick/fs/module/inode.c:606!
+ 61 kernel BUG at /home/neilb/work/nfsbrick/fs/module/segments.c:1034!
+ 1 kernel BUG at /home/neilb/work/nfsbrick/fs/module/super.c:657!
+ 42 SysRq : Resetting
+
+
+6b6b6bfb:
+ invalidate_inode_buffers called at shutdown.
+ Still weird
+
+block.c:497 FIXED??
+ block 16/1 is not dirty with no credits.
+ Maybe writepage got to it?
+
+dir.c:710
+ ouch! dir lookup failed in unlink.
+ No real hints. Must be hash based - some off-by-one probably.
+ Need to stare at the code.
+
+inode.c:606 FIXED
+ Blocks still present after truncate.
+ typically about 60, but in 1 case '4'. No index blocks.
+ So probably content of second index block.
+ Yes, lafs_leaf_next was doing the wrong thing for addresses
+ before start of block.
+
+segments.c:1034
+ same old
+
+super.c:657 FIXED
+ dir inode 0/2 is still Dirty but not pinned.
+ Maybe lafs_dirty_inode should be pinning the block
+
+ But now this triggers for 16/X still dirty.
+
+
+How and when to write blocks in a SegmentMap file?
+ - We don't want normal write-back to write them unless they have
+ no references
+ - We need to write them in tail of checkpoint, and index info must
+ follow in the next checkpoint.
+
+lafs_space_alloc is called from
+ - mark_cleaning: always CleanSpace, failure is OK
+ - lafs_cluster_update_pin: ReleaseSpace. -EAGAIN is OK (CHECK THIS) but failure
+ is not - or shouldn't be.
+ - lafs_allocated_block: CleanSpace, checking if parent of Realloc block
+ can be saved separately from any Dirty version. Failure OK, blocking not.
+ - lafs_prealloc - general space allocation.
+ -
+lafs_cluster_update_pin is called from:
+ - lafs_create, lafs_link, lafs_unlink, lafs_rmdir, lafs_symlink, lafs_mkdir
+ lafs_mknod, lafs_rename,
+ - lafs_write_inode
+ So best to return -EAGAIN, and it should be handled adequately.
+
+lafs_prealloc is called from:
+ - lafs_reserve_block, after modifying the alloc_type extensively.
+ - lafs_phase_flip to re-fill the 'next' credits. If they aren't available
+ we simply pin all children so they aren't needed.
+ So failure is OK
+ - lafs_seg_ref_block: getting CleanSpace to save segusage blocks.
+ If this fails .. what?? lafs_reserve_block fails. so...
+
+lafs_reserve_block is called from
+ - mark_cleaning - CleanSpace
+ - lafs_pin_dblock - type is passed in...
+ - lafs_prepare_write - on failure write will fail or retry after checkpoint
+ - lafs_inode_handle_orphan - to help with delete. On failure we allow
+ cleaning to happen
+ - lafs_seg_move - should be elsewhere. Failure BAD !
+ - lafs_free_get - as above, failure BAD
+ - clean_free - update youth for new clean blocks - Failure BAD
+
+lafs_pin_dblock is called from
+ - dir_create_pin - fail or again handled
+ - dir_delete_pin
+ - dir_update_pin
+ - lafs_create etc
+ - lafs_dir_handle_orphan
+ - choose_free_inum
+ - inode_map_new_pin
+ - lafs_new_inode
+ ...
+ - lafs_orphan_release !! cannot handle failure
+ - roll_block should use AccountSpace
+
+So: It seems we need a new allocation class that will never fail.
+ Maybe it is allowed to BUG though?
+ AccountSpace - i.e. space needed to account for the use of space.
+ Must never ever fail.
+
+Then we must ask where blocking should happen on -EAGAIN.
+ dir.c does "lafs_checkpoint_unlock_wait", then tries again.
+ prepare_write does too.
+
+For that to work we must start a checkpoint on returned EAGAIN.... Don't
+we want to wait for some cleaning to happen first though? Maybe an extra
+flag, and a count of the number of empty (but not clean) blocks.
+
+- Should I skip orphan handling when tight on space? Probably not. It will
+ just keep failing while we keep cleaning...
+- roll_block should use account_space .. or not
+
+- lafs_space_alloc simply allocates space, or fails. 'why' is used to
+ guide watermark choice.
+- lafs_prealloc allocates space to a block and all its parents based on
+ 'why' for watermarks. It either succeeds or fails.
+
+- lafs_cluster_update_pin and lafs_reserve_block decide whether to respond
+ to failure as -ENOSPC or -EAGAIN based on 'why'.
+
+- lafs_pin_dblock simply passes on the failure, which must be handled.
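+
+A rough model of the layering just described: an allocator that picks a
+watermark from 'why', and a reserve step that turns failure into -ENOSPC or
+-EAGAIN. The class names come from the notes above; the watermark numbers,
+struct space, and the choice of NewSpace -> -ENOSPC with everything else ->
+-EAGAIN are assumptions for illustration only (the NewSpace question is
+still open below). abort() stands in for BUG().

```c
#include <errno.h>
#include <stdlib.h>

enum why { CleanSpace, ReleaseSpace, NewSpace, AccountSpace };

/* Hypothetical counters; the real ones live in struct fs. */
struct space {
	long free_blocks;
	long allocated_blocks;
};

/* Made-up watermarks: blocks that must stay unallocated after the request. */
static long watermark(enum why why)
{
	switch (why) {
	case NewSpace:     return 64;
	case ReleaseSpace: return 32;
	case CleanSpace:   return 8;
	case AccountSpace: return 0;
	}
	return 0;
}

/* Simply allocates or fails; never blocks. Returns 1 on success. */
int space_alloc(struct space *s, long credits, enum why why)
{
	if (s->free_blocks - s->allocated_blocks - credits < watermark(why))
		return 0;
	s->allocated_blocks += credits;
	return 1;
}

/* Decide how to report failure based on 'why'. */
int reserve(struct space *s, long credits, enum why why)
{
	if (space_alloc(s, credits, why))
		return 0;
	if (why == AccountSpace)
		abort();	/* must never ever fail */
	return why == NewSpace ? -ENOSPC : -EAGAIN;
}
```

+Callers that get -EAGAIN would wait for a checkpoint and retry, as dir.c
+and prepare_write already do.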
+
+So: What to do when we return -EAGAIN?
+ We need to wait until there are *enough* clean segments, then cause a checkpoint
+ so they become free.
+ So a flag that says 'waiting for free space' and a count of segments
+ required.
+
+ But how do we differentiate ENOSPC and EAGAIN for NewSpace requests?
+ Maybe we don't ?? Or do it later.
+
+Still to do:
+- Audit all AccountSpace and justify them
+ + lafs_seg_move is probably wrong. Should have allocated when the
+ free segment was allocated
+- lafs_orphan_release called lafs_pin_dblock but cannot handle failure
+- Need to wait not just for "enough space" but for "enough clean segments".
+
+- how is 'free_blocks' set - what does this tell us??
+
+ free_blocks is the sum of known-clean segments.
+ We probably want:
+ clean segments
+ remainder for each active segment
+ then reserve some segments for cleaning.
+ And separate 'allocated_block' for each ?
+
+Notes:
+ segments.c:647 fired: AccountSpace had no space available.
+ Reserving space to write the segusage of youth block for a newly
+ allocated segment.
+ super.c:657 STILL
+ 0/2 is Dirty but not Pinned Maybe we need PinPending
+ soft lockup
+ in the cleaner!
+ Maybe I need cond_resched??
+
+Maybe I want two separate 'free_blocks' counters.
+ One that includes all free blocks for use in 'df' etc.
+ One that only includes completely free segments for use in allocation...
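+
+ A sketch of how the two counters could be derived from a segment table.
+ struct seg, struct free_counts and count_free are hypothetical names, and
+ this ignores the segments that would be held back for cleaning.

```c
#include <stddef.h>

/* Hypothetical per-segment summary: total blocks and blocks in use. */
struct seg {
	unsigned size;
	unsigned used;
};

struct free_counts {
	unsigned long df_free;		/* every unused block, for 'df' etc. */
	unsigned long alloc_free;	/* blocks in completely free segments only */
};

struct free_counts count_free(const struct seg *segs, size_t n)
{
	struct free_counts c = { 0, 0 };
	size_t i;

	for (i = 0; i < n; i++) {
		c.df_free += segs[i].size - segs[i].used;
		if (segs[i].used == 0)
			c.alloc_free += segs[i].size;
	}
	return c;
}
```

+ The gap between the two numbers is space that only becomes allocatable
+ once the cleaner has emptied the partly used segments.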
+
+
+24 June 2010
+
+ Something is wrong with cleaning and segment tracking
+ We have 5 free segments and we get them all without writing
+ anything! We consume them all with cluster_flush!
+ It seems that the root inode is not changing phase!
+ Nothing is on the phase leafs.
+ Most children are in Writeback on cluster. and are Realloc
+ Others have pinned children.
+ They are all in 'cluster', but 'flush' doesn't flush them,
+ so they must be in a different cluster??? Is the cleaner still
+ cleaning? Yes, they are on the cleaner 'wc' list so they are
+ queued but not flushed for the cleaner.
+
+25 June 2010
+ At last it looks like I nearly have a working FS. Out of 361 test
+ runs, 9 triggered BUGS and one hung at umount.
+
+ I need a new TODO list, starting with 6 jul 2007(!) and adding any
+ FIXMEs etc.
+
+DONE 0/ start TODO list
+DONE 1/ document new bugs
+DONE 2/ Tidy up all recent changes as individual commits.
+DONE 3/ clean up the various 'scratch' patches discarding any tracing that
+ I don't think I need, and making the rest 'dprintk' etc.
+DONE 4/ check in this README file
+DONE 5/ Write rest of the TODO list
+
+DONE 5a/ index.c:1982. Data block with Phys and no UnincCredit
+ It is Dirty but only has *N credits.
+ 16/1 ...
+
+DONE 5b/ phase_flip/pin_all_children/lafs_refile finds refcnt == 0;
+ I guess we should getref/putref.
+
+DONE 5c/ dirty_inode might find InoIdx is allocated but datablock not
+ and doesn't cope well.
+
+DONE 5d/ At unmount, 16/1 is still pinned.
+
+ 6/ soft lockup in unlink call.
+ EIP is at lafs_hash_name+0xa5/0x10f [lafs]
+ [<d0a56283>] hash_piece+0x18/0x65 [lafs]
+ [<d0a564c3>] lafs_dir_del_ent+0x4e/0x404 [lafs]
+ [<d0a56256>] ? lafs_hash_name+0xfa/0x10f [lafs]
+ [<d0a4b35c>] dir_delete_commit+0xdb/0x187 [lafs]
+ [<d0a4be3f>] lafs_unlink+0x144/0x1f4 [lafs]
+ [<c02602c1>] vfs_unlink+0x4e/0x92
+
+ Don't know. Looks like cleaning up a chain in dir_delete_commit.
+ Added a BUG_ON.
+
+ Would we be spinning on -EAGAIN ?? 4 empty segments are present.
+
+ 6a/ index.c:1947 - lafs_add_block_address of index block where parent
+ has depth of 1.
+looping on [cfbd4690]327/336(0)r3F:Index(1),Pinned,Phase0,Valid,SegRef,CI,CN,CNI,UninCredit,PhysValid,PrimaryRef,EmptyIndex,Uninc{0,0}[0] uninc(1) inode_handle_orphan2(1) leaf(1)
+/home/neilb/work/nfsbrick/fs/module/index.c:1947: [cfbd5c70]327/0(0)r2F:Index(1),Pinned,Phase0,Valid,Dirty,Writeback,SegRef,CI,CN,CNI,UninCredit,PhysValid,EmptyIndex,Uninc{0,0}[0] inode_handle_orphan2(1) leaf(1)
+
+ 6b/ check_seg_cnt seems to be spinning on the 3rd section
+ the clean list has no end!
+ we were in seg scan
+CLEANABLE: 0/0 y=0 u=0 cpy=32773
+CLEANABLE: 0/1 y=0 u=0 cpy=32773
+CLEANABLE: 0/2 y=0 u=0 cpy=32773
+CLEANABLE: 0/3 y=32773 u=6 cpy=32773
+CLEANABLE: 0/4 y=32772 u=124 cpy=32773
+CLEANABLE: 0/5 y=32771 u=273 cpy=32773
+CLEANABLE: 0/6 y=32770 u=0 cpy=32773
+
+of
+0 0
+1
+2
+3 6
+4 124
+5 273
+6 0
+7 496
+8 0
+
+
+ 6c/ at shut down, some simple orphans remain
+ missing wakeup ???
+
+DONE 7/ block.c:624 in lafs_dirty_iblock - no pin, no credits
+ truncate -> lafs_invalidate_page -> lafs_erase_dblock -> lafs_allocated_block / lafs_dirty_iblock
+Allocated [ce44f240]327/144(1499)r2E:Writeback,PhysValid clean2(1) cleaning(1) -> 0
+SEGMOVE 1499 0
+Oh dear: [ce44f240]327/144(0)r2E:Writeback,PhysValid clean2(1) cleaning(1)
+.......: [cfb69180]327/0(349)r2F:Index(1),Pinned,Phase0,Valid,PhysValid{0,0}[0] child(1) leaf(1)
+Why have I no credits?
+/home/neilb/work/nfsbrick/fs/module/block.c:624: [cfb69180]327/0(349)r2F:Index(1),Pinned,Phase0,Valid,Dirty,PhysValid{0,0}[0] child(1) leaf(1)
+
+ Cleaning is racing with truncate, and that cannot happen!!
+ Actually it could - if i_size changed at the wrong time.
+
+DONE 7a/ block.c:507 in lafs_dirty_dblock - no credits for 0/2
+ block.c:507: [cfa63c58]0/2(4348)r2F:Valid,Dirty,Writeback,PhysValid cluster(1) iblock(1)
+ in touch_atime. I think I know this one.
+
+ 7b/ soft lockup in cleaner between 0x5e6, then 0x799-7f6 then 0x990 of 0x1502
+ i.e. 1510, 1945-2038, 2448 of 5378
+ Appear to be looping in first loop of try_clean, maybe
+ group_size_words == 0 ??
+ Add BUGON and wait.
+
+DONE 7c/ NULL pointer deref - 000001b4
+ Could be cluster_flush finds inode dblock without inode.
+ Have a BUG_ON of this now.
+
+DONE 7d/ paging request at 6b6b6bfb.
+ invalidate_inode_buffers called, so inode_has_buffers,
+ so private_list is not empty. So presumably use-after-free.
+ But is on s_inodes list.
+ Probably cleaner is still active (if this is first call to
+ invalidate_inodes in generic_shutdown_super) so list gets broken.
+ We need locking or earlier flush.
+
+DONE 7e/ Remove BUG block.c:273 as cleaner can cause this.
+ Check for Realloc too.
+
+PRESUME-FIXED 7f/ index.c:2024 no uninc credit
+ [ce532338]0/306(2996)r1F:Pinned,Phase0,Valid,Dirty,Writeback,SegRef,Claimed,PhysValid cluster(1)
+ found during checkpoint. Maybe inode credit problem.
+
+PRESUME-FIXED 7g/ inode.c:831 InoIdx 283/0 is Realloc, not dirty, and has
+ ->uninc blocks. This is during truncate. Need some
+ interlock with cleaner maybe?
+ Probably the same race between cleaner and truncate.
+
+DONE 7h/ inode.c:845 truncate finds children - Realloc on clean-leafs
+
+NOLONGERRELEVANT 7j/ resolve space allocation issues.
+ Understand why CleanSpace can be tried and failed 1000
+ times before there is any change.
+
+DONE 7k/ use B_Async for all async waits, don't depend on B_Orphan to do
+ a wakeup.
+ write lafs_iolock_written_async.
+
+DONE 7l/ make sure i_blocks is correct.
+ set on 'import_inode'
+ decreased when lafs_summary_update assigned block to '0'
+ changed when lafs_summary_allocate changes e.g. quota.
+
+ lafs_summary_update is called when a block is assigned to a location,
+ or to zero. It is real usage.
+ lafs_summary_allocate is called when we set Prealloc on phys==0 or
+ clear Prealloc on phys==0
+ So allocate must be followed exactly.
+ update is already counted for setting !=0, so only dec on ==0.
+ So all is good.
+ What about quota? - hidden in quota_allocate / qcommit
+
+7m/ delete inode could not progress through inode_map_free, so
+ ORPH: [ce545f08]0/290(1663)r4E:Valid,Dirty,Async,UninCredit,Claimed,PhysValid,Orphan(0) async(1) delete_inode(1) orphan_list(1) drop(1)
+ was permanently an orphan.
+
+DONE 8/ looping in do_checkpoint
+ root is still in Phase1 because 0/2 is in Phase 1
+ [cfa57c58]0/2(2078)r1E:Pinned,Phase1,WPhase0,Valid,Dirty,C,CI,CN,CNI,UninCredit,IOLock,PhysValid</file.c:269> writepageflush(1)
+ Seems to be waiting for writeback, but writeback is clear.
+ Need to call lafs_io_wake in lafs_iocheck_writeback for when
+ it is called by lafs_writepage
+
+DONE 9/ cluster.c:478
+ flush_data_to_inode finds Realloc (not dirty) block
+ and InoIdx block is not Valid.
+ [cfb5ef50]2/0(3)r1F:Index(0),Pinned,Phase1,InoIdx,SegRef,C,CI,CN,CNI,IOLock,OnFree,PhysValid{0,1}[0]</cluster.c:435> child(1)
+ I wonder if it was PinPending, or where it was IOLocked (or if).
+
+ I guess we truncated, then added data, then tried to clean.
+ Probably just a bad 'bug' given recent changes.
+ No, I think it is the race between truncate and clean which is now fixed.
+
+SEEMS TO BE GONE 10/ inode.c:606
+ Deleting inode 328: 2+0+0 1+0
+
+ 2 level index.
+ first index at level 1 was full and pruned properly.
+ Nothing else found empty.
+ Somehow the second index block and contents were lost.
+
+ASSUME_DONE 11/ super.c:657
+ Root still pinned at unmount.
+ 0/2 is Dirty: [cfa53c58]0/2(1750)r0E:Valid,Dirty,CN,CNI,UninCredit,PhysValid
+ [cfa5fc58]0/2(2852)r0E:Valid,Dirty,SegRef,CN,CNI,UninCredit,PhysValid
+ [cfa53c58]0/2(3570)r0E:Valid,Dirty,CN,CNI,UninCredit,PhysValid
+ [cfa53828]0/2(2969)r0E:Valid,Dirty,CN,CNI,UninCredit,PhysValid
+ [cfa75c58]0/2(579)r0E:Valid,Dirty,UninCredit,PhysValid
+ maybe dir-orphan handling stuffed up
+ Or maybe it is the I_Dirty issue. Assume fixed.
+
+
+ASSUME_DONE 12/ timeout/showstate in unmount
+ umount is in sync_inodes / do_writepages / lafs_writepage / lafs_iolock_written
+ That looks similar to 8
+
+DONE 13/ delete_inode should wait for pending truncate to complete.
+ Document I_Trunc somewhere - including that i_mutex is needed to set it.
+ Verify that assertion.
+ Actually it requires i_alloc_sem, or the inode to be deleted.
+
+
+DONE 14/ Review writepage and flush and make sure we flush often enough but
+ not too often.
+ Probably just remove the cluster_flush from write-page as lafs_flush
+ will do that.
+ But leave for now as it encourages heavy indexing.
+
+DONE 14a/ use bio_add_page to write clusters.
+
+DONE 14b/ Figure out what backing_dev to present for the filesystem.
+
+DONE 15/ The inode map file lost some credits. I think it lost a PinPending because
+ it isn't locked properly. Don't clear PinPending if someone else might
+ have set it.
+
+DONE 15a/ Find all FIXMEs and add them here.
+
+
+DONE 15b/ Report directory size less confusingly
+
+DONE 15c/ roll-forward should not update index if physaddr hasn't changed (roll_block)
+
+DONE 15d/ What does I_Dirty mean - and implement it.
+
+FIXED 15e/ setattr should queue an update for the inode metadata.
+ and clean up lafs_write_inode at the same time (it shouldn't do an update).
+ and confirm when s_dirt should be set. It causes fsync to run a
+ checkpoint.
+
+15f/ include timestamp in cluster_head to set mtime/ctime properly on roll-forward?
+## Items from 6 jul 2007.
+
+15g/ test directories with non-random sequential hash.
+
+DONE 15h/ orphan deadlock
+ lafs_run_orphans- lafs_orphan_release can block waiting for written
+ in erase_dblock, but that won't complete until cleaner gets to run,
+ but this is the cleaner blocked on orphans.
+
+
+DONE 15i/ separate thread management from 'cleaner' name.
+
+DONE 15j/ review rules in getref_locked - and document them
+
+DONE - fix accesses to iblock
+
+DONE 15k/ newblocks should probably be a count of segments. Review that.
+
+DONE 15l/ make sure checkpoint_youth is decayed properly. Review youth decay.
+
+DONE 15m/ consider combining .orphans and .cleaning lists. If something is an
+ orphan, we probably don't want to clean it just now(?).
+
+DONE 15n/ consider if lafs_pin_dblock should check for iolock. Maybe
+ iolock or PinPending (which must be set under iolock).
+ Just require PinPending and always get iolock_written for that
+ except in special cases.
+
+DONE 15o/ Can there be async blocks when checkpoint starts? Could they
+ pin blocks in old phase? Do I need to check for them?
+
+DONE 15p/ Review and remove the 'if cleaner is active then don't checkpoint just
+ yet' thing - or somehow avoid the yuckiness.
+
+DONE 15q/ check checksums when reading cluster_header for cleaner
+ This is already done!
+
+DONE 15r/ consider further optimisation in cleaner to avoid lookups.
+
+DONE 15s/ memory barrier for i_size check in cleaner???
+
+DONE 15t/ review usable-space calculations in clean.
+
+DONE 15u/ Do I need a SegRef when pin-dblock-by-hand in flush_data_to_inode
+
+DONE 15v/ tidy up all code that fiddles bits and credits - maybe make some
+ common helpers.
+
+DONE 15w/ review cluster updates and make sure space used is accounted properly.
+
+DONT BOTHER 15x/ Consider caching result of a failed dir lookup in case we immediately
+ try to create it. Would this actually save anything significant?
+
+DONE 15y/ Don't make dir blocks into orphans if it cannot be needed?
+
+DONE 15z/ make sure symlink creation is safe - do I need to log the body??
+
+DONE 15aa/ lafs_rename should flush orphans just like lafs_rmdir does.
+
+DONE 15ab/ Does writepage need to recheck if my_inode and/or iblock have appeared
+ after lock is taken on block?
+
+DONE 15ac/ if lafs_shrinker cannot reclaim enough index blocks, trigger some
+ writeout.
+
+DONE 15ad/ review lafs_phase_flip's call to lafs_add_block_address and wonder
+ if more is needed.
+
+DONE 15ae/ refile wonders about a race with cluster_allocate which gets IOLock
+ before removing from lru.
+
+DONE 15af/ Review all locking in lafs_refile
+
+DONE 15ag/ Don't allocate data part of InoIdx block.
+
+DONE 15ah/ Is there a problem with lafs_allocated_block putting an
+ about-to-be-truncated block on an uninc list?
+
+DONE 15ai/ When allocating a new segment during checkpoint, delay the
+ youth-block update until after the checkpoint
+
+DONE 15aj/ When roll-forward finds a new segment, make sure youth number is
+ updated.
+
+DONE 15ak/ Load orphan file during roll-forward and make every block an
+ orphan.
+
+DONE 15al/ set filesystem update_time somewhere.
+
+DONE 15am/ filesystem 'name' needs to be handled uniformly.
+
+DONE 15an/ can we be sure 'b' will be non-null in delete_inode?
+
+DONE 15ao/ determine what locking is needed to walk the children list
+ in lafs_inode_handle_orphan. Probably the address_space private lock.
+
+15ap/ Make sure write_inode has been cleaned up. See if this applies to
+ rollforward of a symlink (see FIXME)
+
+DONE 15aq/ change inode map to be little-endian, not host-endian
+
+DONE 15ar/ understand what to do about errors in lafs_truncate
+
+15as/ handle errors from lafs_write_super ???
+
+DONE 15at/ More wait_queues to wait for different blocks.
+ just use wait_on_bit / wake_bit
+
+DONE 15au/ How should iocheck_block set the page error?
+ and block_loaded <- this gets it right.
+
+15av/ ditto for write errors?
+
+DONE 15aw/ when lafs_incorporate makes a new block where the
+ old is Realloc, the new should be Realloc too.
+
+15aw2 / When a block is a snapshot block it can never be dirty
+ so we only need credits for realloc...
+
+DONE 15ax/ Think about what happens when we relocate a block
+ in the orphan list (lafs_orphan_release), particularly
+ if the block isn't actually loaded.
+ FIXME still need to make sure errors while loading the orphan
+ file are handled correctly - I guess we mark all bad orphans as
+ type==0 and when we find those during release, reduce the size
+ of the orphan file.
+
+DONE 15ay/ Wonder if there is any way for run_orphans to get a wakeup
+ when an inode or dir mutex is released.
+ No, there isn't.
+
+DONE 15az/ Sanity check all values in cluster head during roll-forward
+ i.e. in roll_valid. If the head isn't complete, we can still
+ use this to commit some previous checkpoints.
+
+DONE 15ba/ roll forward should not BUG on bad data like inodefile in
+ non-primary filesystem.
+
+DONE 15bb/ Do I need to sync something before copying an update over part
+ of an inode, then reloading the inode.
+
+DONE 15bc/ Handle DescHole in roll forward.
+
+DONE 15bd/ Call lafs_add_block_address from writeback rather than iolock
+ in roll forward, just for consistency.
+
+DONE 15be/ Confirm various files loaded at mount time (segusage, orphan ...)
+ are actually the correct type.
+
+DONE 15bf/ Avoid quadratics in lafs_seg_put_all - nothing else should be doing
+ a lookup - or at least we can test for that.
+ lafs_seg_apply_all has similar problems and needs a good solution.
+
+DONE 15bg/ lafs_seg_ref_block is worried about losing implicit ref on parent
+ if parent splits. See what to do about that.
+
+DONE 15bh/ after roll-forward, check that free_blocks hasn't gone negative.
+ or handle if it has.
+
+DONE 15bi/ Set EmergencyClean a bit later - need at least one checkpoint first.
+ to twostage.
+
+DONE 15bj/ Make sure .last link in segtracker is kept uptodate, particularly in
+ segdelete.
+
+DONE 15bk/ make sure get_cleanable doesn't lose a race before calling add_clean
+
+DONE 15bl/ better checks for 'valid state block address' in valid_devblock
+ include that segment_count is credible
+ also in valid_stateblock
+
+15bm/ make sure everything gets free properly on error during mount / lafs_load
+
+15bn/ How does refcounting of 'struct fs' work with multiple filesets?
+
+DONE 15bo/ use put_super to drop last reference to superblocks
+
+DONE 15bp/ review all superblocks - maybe use more anon??
+
+15bq/ check readonly status in lafs_get_sb
+
+DONE 15br/ sync_fs should probably wait for something if 'wait'.
+
+DONE 15bs/ set f_fsid properly in lafs_statfs
+
+DONE - use new write_begin / write_end
+
+15bt/ - review how we ensure that credits remain with block.
+
+15ca/ When pin inode data block, pin it as well as index block I think
+ It is still kept off the leaf list until the index block is done with
+ I think.
+
+15cb/ Layout issues:
+ DONE - subset filesys still needs a parent pointer
+ DONE - cluster head needs mtime/ctime to log these.
+ - need better tracking of which devices are in this array??
+ Need to be able to have read-only devices that are shared
+ among arrays.
+ DONE - need multiple parallel write-clusters to allow parallel writes.
+ - record tuning in state block:
+ - max_segs
+ DONE - use crc or something, not toy checksum (e.g. cluster - state already has)
+ - flags for inconsistencies found, at layout/fileset/file levels(?) (see 60)
+ - policies of whether old or new data is allowed on each device
+ - policies of how much duplication of metadata is required
+ DONE - inode map - not host-endian
+ DONE - segments > 16bit:
+ segusage file - what about youth?
+ cluster_head Clength
+
+15cc/ free any stray B_ASync block found in destroy_inode
+
+15cd/ Some code assumes a cluster header does not exceed 1 page.
+ Is this safe? Is it true? Is it enforced?
+ roll-forward now handles large cluster_head.
+ Need cleaner to handle it, and need to possibly write large
+ cluster head when making new clusters.
+
+15ce/ classify BUGs as
+ - internal logic errors
+ - IO errors
+ - unusual conditions I want a warning of
+ - data corruption errors
+
+DONE 15cf/ lafs_iget_fs needs to sometimes do in-kernel mounts for subset filesystems
+ This is needed for the cleaner - the cleaner needs to hold a ref somehow.
+
+15cg/ lafs_sync_inode is weird - why the lafs_checkpoint_start and update_cluster
+ stuff??
+
+15ch/ Review values of youth and checkpoint_youth and think about off-by-one
+ issues.
+
+15da/ Replace directory updates!!!!!
+
+15db/ Decide how version string will be used.
+
+15dc/ resolve table_size - it should be stored in the segusage file and validated
+ based on device geometry.
+
+15ea/ rollforward should recognise VerifyDevNext{,2} to allow next
+ cluster on same device to verify previous.
+
+15eb/ When multiple devices and lots to do and plenty of free space,
+ allow multiple segments, one per device, to be open at once,
+ and possibly be writing multiple clusters at once using
+ VerifyDevNext2
+
+15ec/ Implement i_version tracking. This should be a 64bit number
+ that appears to change every time the file changes. We only
+ need a new number when someone looks at the value with
+ getattr.
+ We could simply use mtime with the sub-millisecond part being
+ a counter of times that getattr sees a change in the same
+ millisecond.
+ However as mtime can go backwards we might get i_version going
+ backwards, which is awkward. I wonder if I care.
+ Otherwise, leave for an inode extension later.
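+
+ A rough user-space sketch of the mtime-plus-counter idea from 15ec.
+ All names here (iv_state, iv_file_changed, iv_getattr) are made up
+ for illustration and are not LaFS or VFS interfaces. It assumes
+ mtime is kept in nanoseconds and that the counter never overflows
+ the sub-millisecond part.

```c
#include <stdint.h>

#define NS_PER_MS 1000000ULL

/* Hypothetical per-inode state; none of these names come from LaFS. */
struct iv_state {
	uint64_t mtime_ns;	/* last modification time, in nanoseconds */
	uint64_t seen_ms;	/* millisecond last reported through getattr */
	uint32_t counter;	/* changes getattr has seen within seen_ms */
	int changed;		/* file changed since the last getattr */
};

/* Called whenever the file changes: cheap, no new number is made yet. */
void iv_file_changed(struct iv_state *s, uint64_t now_ns)
{
	s->mtime_ns = now_ns;
	s->changed = 1;
}

/* Called from getattr: only here do we need a fresh number. */
uint64_t iv_getattr(struct iv_state *s)
{
	uint64_t ms = s->mtime_ns / NS_PER_MS;

	if (s->changed) {
		if (ms == s->seen_ms) {
			/* another change seen in the same millisecond */
			s->counter++;
		} else {
			s->seen_ms = ms;
			s->counter = 0;
		}
		s->changed = 0;
	}
	/* millisecond part of mtime, counter in the sub-millisecond part */
	return s->seen_ms * NS_PER_MS + s->counter;
}
```

+ As noted above, if mtime goes backwards then the value returned by
+ this sketch goes backwards too.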
+
+16/ Update locking.doc
+
+17/ cluster_flush calls lafs_cluster_allocate calls lafs_add_block_address
+ calls lafs_iolock_written. How do we know that won't block on cluster_flush?
+
+18/ See if per-fs shrinker is available yet and consider it for index blocks.
+
+19/ Review WritePhase and make sure it is used properly.
+
+20/ Review places where we update blocks and be sure they are not in writeout
+ or in a different phase.
+
+21/ Review and document all lru uses (locking.doc) and make sure they are
+ all locked properly.
+
+22/ Check possible failures:
+ - thread allocation
+ - memory allocation
+ - reading critical metadata
+ ...
+
+23/ Rebase on 2.6.latest. Done for .38
+
+24/ load/dirty block0 before dirtying any other block in depth=0 file,
+ else we might lose block0
+
+25/ use kmem_cache for
+ datablock
+ indexblock - probably a mempool because we cannot allow failure when
+ splitting an index block.
+ skippoint (mempool?)
+ segsum - mempool??
+ others?
+
+26/ Review seg addressing code for 2-D geometries.
+
+27/ Allow ranges of holes in pending_addr so partial truncate can be more efficient.
+
+28/ Make sure youth blocks are always referenced properly.
+
+29/ Make sure new segments are referenced properly. I think there might be
+ some double referencing.
+
+30/ Decide when to use VerifyNULL or VerifyNext2
+
+31/ Implement non-logged files
+
+DONE 32a/ Store access time in a file
+32b/ Make it a non-logged file
+32c/ Avoid writing out dirty atime file blocks when not necessary.
+ i.e. keep the page clean and active, and trigger 'write'
+ on release_page.
+
+33/ Support quota : group / user / tree
+
+34/ handle subordinate filesystems:
+ ss[]->rootdir needs to be array or list
+ lafs_iget_fs needs to understand this
+
+35/ review snapshots:
+ - peer lists and cleaning
+ - how to create
+ - failure modes
+ - how to destroy
+
+36/ review roll-forward
+
+DONE 36a/ make sure files with nlink == 0 are handled well
+DONE 36b/ sanity check before trusting clusters
+DONE 36c/ handle miniblocks which create new inodes.
+DONE 36d/ Handle DescHole in roll_block
+DONE 36e/ When dirtying a block in roll_block, maybe use writeback rather
+ than just iolock, for consistency...
+DONE 36f/ What to do if table becomes full when add_block_address in
+ roll_block ??
+DONE 36g/ Write roll_mini for directories.
+DONE 36h/ In roll_one, use the cluster counting code to find block number and
+ make sure we don't exceed the segment.
+DONE 36i/ add more general error checking to lafs_mount -
+ lafs_iget orphans and segsum. Check type is correct.
+ errors from lafs_count_orphans or lafs_add_orphans.
+ alloc_page failure for chead - maybe allocate something bigger??
+
+37/ Configure index block hash_table at run time based on mem size??
+
+38/ striped layout
+ review everything needed for safe RAID5
+
+39/ How to handle all different IO errors
+
+40/ Guard against data corruption at every level.
+
+41/ Add checksums on index blocks and dir blocks and Inodes and ???
+
+42/ Store duplicates of some blocks. At least index and inode.
+
+43/ Handle writepage on mem-mapped page, adding new credits or unmapping.
+ Make sure ->page_mkwrite sets up credits properly
+
+44/ Examine created filesystem and make sure everything looks good.
+
+DONE 45/ mkfs.lafs
+
+46/ fsck.lafs
+
+47/ Write good documentation
+
+48/ Review all code, improve all comments, remove all bugs.
+
+49/ measure performance
+
+50/ Support O_DIRECT
+
+51/ Check support for multiple devices
+ - add a device to a live array
+ - remove a device from a live array
+
+DONE 52/ NFS export
+
+53/ 'overlay' support
+ So I mount one device read-only and another device
+ writable which gets all the updates. metadata on first
+ device not updated.
+
+54/ cluster support - is this possible?
+
+55/ is any useful variant of reflink possible?
+
+56/ Review roll-forward completely.
+
+57/ learn about FS_HAS_SUBTYPE and document it.
+ This is for fuse in particular so users can know the real type
+
+58/ Consider embedding symlinks and device files in directory.
+ Need owner/group/perm for device file, but not for symlink.
+ Can we create unique inode numbers?
+ hard links for dev-files would be problematic.
+ What do we gain? Maybe something for short symlinks.
+ 40 seems a good length to get 70% of symlinks.
+
+59/ Fix NeedFlush handling so we don't drop-then-retake
+ a mutex as that isn't sensible.
+
+60/ Introduce some fs state recording that fsck is needed and possibly
+ identifying what sort of fsck.
+
+61/ Try to make the inode struct smaller - maybe move some of the
+ fs metadata into a separately-allocated struct.
+
+62/ System/trusted extended attributes:
+ fileset max size
+ directory hash/seed
+
+63/ user extended attributes.
+
+64/ wonder if index blocks can be flushed out by memory pressure somehow.
+ e.g. if a data block is written by reclaim, flag the index block.
+ When a flagged index block has no children, it is incorporated and written.
+ ??
+
+65/ review why lafs_allocated_block needs the new_parent label. Should not
+ lafs_incorporate leave all parents dirty? Maybe it is just the need for
+ B_Realloc - so maybe lafs_incorporate should leave the new block either
+ realloc or dirty rather than lafs_allocated_block doing it.?
+ See also 15ad below.
+
+66/ Delay writeout of directory updates until an fsync. If a checkpoint happens
+ first, discard the updates (and fsync waits for checkpoint to complete).
+ If a cross-directory rename happens care is needed: either flush updates
+ first or ensure that a flush does happen before the cross-directory
+ update is flushed.
+ Note that if the target of a rename is a directory, it must also be fully
+ flushed before the rename can proceed.
+
+26June2010
+ Investigating 5a
+
+ Normal sequence is to surrender UnincCredit, then to clear Dirty,
+ then to write. If anyone re-dirties after Dirty is clear, they
+ will naturally have to add an UnincCredit having reserved space first.
+ However it seems that the Cleaner gets in the way as the block in question
+ has just previously been cleaned, which consumed the UnincCredit
+ Do we need ReallocUnincCredit?? I hope not.
+ We generally need a way to say "I might want to write to this" so cleaner
+ doesn't write it early.
+ For index blocks that is pincnt. For data it is 'PinPending'.
+ This keeps index blocks off clean_leafs until they are ready, but
+ not data blocks.
+ And in any case, TypeSegmentMap blocks don't get PinPending as they
+ get written *after* the checkpoint. That is a rather ugly exception.
+ Maybe we make their different handling more explicit. We put them on
+ a separate list unpinned so the rest of the checkpoint can complete.
+ Then we flush that list?
+ Then PinPending keeps them off the clean_leafs list.
+
+ So to clarify the plan: If a block is already Pinned to this phase,
+ we can "clean" it by marking it Dirty rather than Realloc. This is
+ appropriate for blocks that are likely to change soon (as blocks written
+ to the cleaner segment are not likely to change soon).
+ For data blocks we take "PinPending" to say "might change soon". For
+ index blocks ... we don't know if it is pinned by Realloc or Dirty or
+ PinPending children. So we set Realloc and wait for any children to
+ be unpinned for whatever reason. If it is only pinned by Realloc blocks,
+ it will end up on clean_leafs and be processed to the cleaner segment.
+ If it is pinned by anything else it will be found by the checkpoint and
+ processed to the new-data segment.
+
+ So Index blocks always get Realloc, PinPending blocks get Dirty,
+ Other data blocks get Realloc. Good.
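+
+ The rule just stated, as a tiny sketch. The flag values and the
+ name clean_mark are invented for the example; the real B_* bits
+ and the code that sets them look different.

```c
/* Stand-in flag bits, not the real LaFS B_* values. */
enum {
	B_INDEX      = 1,	/* block is an index block */
	B_PINPENDING = 2,	/* data block that might change soon */
	B_DIRTY      = 4,	/* goes to the new-data segment */
	B_REALLOC    = 8	/* may go to the cleaner segment */
};

/* What the cleaner marks a block it wants relocated. */
unsigned clean_mark(unsigned flags)
{
	if (flags & B_INDEX)
		return flags | B_REALLOC;	/* index blocks always get Realloc */
	if (flags & B_PINPENDING)
		return flags | B_DIRTY;		/* likely to change soon */
	return flags | B_REALLOC;		/* other data blocks */
}
```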
+
+ Must review PinPending usage... always set, then maybe-dirty inside
+ checkpoint lock. In cases of unlocked usage (inode map) we don't clear
+ PinPending until checkpoint so it has longer exposure to Realloc->Dirty.
+ It is likely to be changing though, so not a big cost. Even good.
+
+ Could make the distinction later. PinPending blocks don't go on
+ clean_leafs. So if they are still realloc at the checkpoint, we Realloc
+ to the new-data segment. This has the same net effect but is arguably
+ cleaner. It means that if a realloc block gets pinpending set, it
+ immediately stops being a clean leaf and so is safe.
+ So: just keep PinPending blocks off clean_leafs. Keep them on phase_leafs.
+ However there is no mechanism for moving things from phase_leafs to clean_leafs.
+ So maybe they stay on clean_leafs, but when the cleaner gets to them, it
+ dirties them and drops them.... that would work.
+
+ So; if cleaner finds a block (on clean_leafs during cleaner-flush) which is
+ Dirty or PinPending, it makes sure it is Dirty and drops it for phase_leafs
+ to pick up.
+
+ BUT: Does this work for TypeSegmentMap blocks? They aren't PinPending.
+
+ We could treat them specially in the cleaner. Or we could set PinPending
+ and pin them to the phase, but treat them differently in checkpoint.
+ If we gathered them onto a separate list, then flush the list after
+ the phase had changed, it might be quite neat. No more getting writepages
+ to do our work for us.
+ They would need to be re-pinned to the next phase, then written out.
+ Or just unpinned, and let seg_inc re-pin as appropriate... except that
+ seg_inc is too late to pin. It dirties. We need to pin when we get
+ SegRef. We currently reserve but we don't pin.
+ We really do need to phase_flip these segmentmap blocks. But that requires
+ getting extra credits, and Pinning everything if new credits are not available.
+ And we don't really have a good list of 'everything' that depends on a segment.
+ But seeing that space_alloc never fails for these...
+ So Pin them, and flip them with AccountSpace
+
+ So:
+ - split out common 'flip' code
+ - add 'flip' for data blocks
+ - create list of accounting blocks and flip accounting file blocks onto
+ that list during checkpoint
+ Flush should write that list, not the files.
+ - Get cleaner to ignore pinpending blocks, marking them dirty.
+ - pin segusage blocks while ref on them is held.
+ - writepage no longer needs special case for TypeSegmentMap, just PinPending
+ - lafs_prealloc just tests PinPending
+
+
+ [[aside: quota files seem to be handled like segmentmap files. Is that
+ right??
+ We only track usage of data blocks based on various 'owners' of the file.
+ We need to know if a block was written in one phase or the next, and
+ only count blocks written/allocated in the one.
+ Data blocks can slip into 'this' phase quite late - any time before the
+ parent is finally incorporated. So we don't write quota blocks
+ until checkpoint is done. So yes, they are like SegmentMap
+ ]]
+
+
+ segsums....
+ If there are hundreds of snapshots, then a block being cleaned (whether to
+ cleaner segment or new-data segment) could affect hundreds of segment
+ usage counters. That would be clumsy to work with. Every block in the
+ free table would need to hold references to hundreds of blocks. This
+ is do-able and might not be a big waste of space, but is still clumsy.
+ I could change the arrangement for accounting per-snapshot usage by having
+ a limited number of snapshots and having all the counters for one segment
+ in the one block. So 1024byte block could hold 512 counters (youth plus
+ base plus 510 snapshots). Half that if I go to 4byte counters.
+ In more common case of 32 snapshots, could fit counters for 8 segments in
+ a block. This means using space/io for all possible snapshots rather than
+ all active snapshots. It would also mean having a fairly fixed upper limit.
+ I wonder what NILFS does....
+ Worry about this later.
+
+ Still trying to get pinning of SegmentMap blocks right.
+ Normally we need a phase-lock when pinning a data block so that we
+ don't lose the pinning before we dirty. But as we phase_flip
+ these it doesn't matter... So just add that to the test??
+
+28June2010
+ Reflecting on 5c - dirty_inode might find InoIdx pre-allocated but
+ datablock not, and doesn't cope.
+ We either prealloc both, which seems clumsy, or always defer
+ to InoIdx if it is present and pinned.
+ lafs_prealloc does both Index and Data blocks for inode.
+ But Data could lose as writeout while index will replenish at
+ phase_flip, so maybe not a good idea.
+ If lafs_allocate_cluster finds a Dirty InoIdx it will copy the Dirty
+ credits across to the data block (on non-cleaning segments) so the
+ Data block doesn't need to have credits.
+
+ dirty_inode gets called:
+ {__,}mark_inode_dirty{,_sync}
+ inode_{inc,dec}_link_count
+ [[various quota ops]]
+ inode_setattr
+ touch_atime
+ file_accessed
+ file_update_time
+ generic_file_...write
+ do_wp_page
+
+ updates through inode_setattr go to lafs_setattr so the
+ data block will be pinpending and the checkpoint lock will be held.
+
+ updates through inode_*_link_count happen in filesystem and the inode data
+ block is PinPending, or a block in the file is pinned and will be
+ dirty, so it will get written.
+
+ updates through touch_atime or file_update_time are unexpected and
+ cannot be prepared for. file_update_time changes will be caught by
+ normal file writeout. atime changes will be lost until we get the
+ atime file working.
+
+ So:
+ dirty_inode cannot change the block as it might be in writeout, and
+ it cannot lock anything as it might be in touch_atime which shouldn't
+ block and cannot fail.
+ So just set I_Dirty and use that to flush inode to db at writeout.
+ Any changes which must be in the next phase will come via setattr and
+ so will wait for incompatible changes to be written out.
+
+ Reflecting on 7c - cluster_flush might find ->my_inode is NULL.
+ my_inode is set
+ lafs_import_inode
+ iget and mount-time stuff
+ lafs_inode_dblock
+
+ my_inode is cleared
+ When I_Destroyed is set and the last ref on the block is dropped
+ When inode_map_new_prepare claims an inodeblock
+
+ So we could easily not have a my_inode - e.g. just cleaning the data block.
+ ->my_inode cannot disappear while we hold the block, so a test is safe.
+
+
+ ----------------------------------------------
+ Space reservation and file-system-full conditions.
+
+ Space is needed for everything we write.
+ Some things we can reject if the fs is too full
+ Some things we can delay when space is tight
+ Some things we need to write in order to free up space.
+ Others absolutely must be written so we need to always have
+ a reserve.
+
+ The things that must be written are
+ - cluster header - which we never allocate
+ - some seg-usage and youth blocks - and quota blocks
+ These continually have credit attached - it is a bug if there
+ are not enough. (We hit this bug)
+
+ Things that we need to write to free up space are
+ any block - data or index - that the cleaner finds.
+
+ Things that we can delay, but not fail, are any change to a block that
+ has already been written or allocated.
+
+ When space is needed it can come from one of three places.
+ - the remainder of the current main segment
+ - the remainder of the current cleaner segment
+ - a new segment.
+
+ Only Realloc blocks can go to the cleaner segment, so the
+ 'must write' blocks cannot go there, so unused + main must have enough
+ space for all those.
+ Realloc blocks can go anywhere - we don't need a cleaner segment if things
+ are too tight.
+
+ When we run out of space there are several things we can do to get more:
+ - incorporate index blocks. This tends to free up uninc-credits which
+ are normally over-allocated for safety.
+ - cluster_allocate/cluster_flush so more blocks get allocated and so
+ more can be incorporated. See above. This is probably most helpful
+ for data blocks.
+ - clean several segments into whole cleaner segments or into the main segment.
+ Much of this happens by triggering a snapshot, however we should only do that
+ when we have full cleaner-segments (or zero cleaner segments).
+
+ When cleaning we don't want to over-clean. i.e. we don't want to commit
+ any blocks from a second segment if that will stop us from committing blocks
+ from the first segment. Otherwise we might use one cleaning segment up by
+ making 4 half-clean. This doesn't help.
+
+
+ So: we reserve multiple segments for the cleaner, possibly zero.
+
+ We clean up to that many segments at a time, though if that many is zero,
+ we clean one segment at a time.
+ lafs_cluster_allocate only succeeds if there was room in an allocated segment.
+ If allocating a new segment fails, the cluster_allocate must fail. This
+ will push extra cleaning into the main segment where allocations must not
+ fail.
+
+ The last 3(?) [adjusted for number of snapshots] segments can only be allocated
+ to the main segment, and this space can only be used for cleaning.
+ Once the "free_space - allocated_space" drops below one segment, we
+ force a checkpoint. This should free up at least one segment.
+
+ We need some point at which we stop cleaning because the chance of finding
+ something to clean is too low. At that point all 'new' requests definitely
+ become failures. They might do earlier too.
+ Possibly at some point we start discounting youth from new usage scores so
+ that the list becomes sorted by usage.
+
+
+ Need:
+ cut-off point for free_seg where we don't allow cleaner to use segments
+ 3? 4?
+
+ event when we start using fixed '0x8000' youth for new segment scores.
+ Maybe when we clean a segment with usage gap below 16 or 1/128
+ event when we stop doing that.
+ Maybe when free_segs cross some number - 8?
+
+ point when alloc failure for NewSpace becomes ENOSPC
+ same as above?
+
+ point when we don't bother cleaning
+ no cleaner segments can be allocated, and checkpoint did not increase
+ number of clean segments (used as many as freed).
+ Clear this state when something is deleted.
+
+
+ Allocations come out of free_blocks which does not include those
+ segments that have been promised to the cleaner.
+ CleanSpace and AccountSpace cannot fail.
+ We *know* not to ask for too many - cleaner knows when to stop.
+ ReleaseSpace fails (to be retried) if available is below a threshold,
+ providing the cleaner hasn't been stopped.
+ NewSpace fails if below a somewhat higher threshold. If we haven't entered
+ emergency cleaning mode, these requests fail -ENOSPC, else -EAGAIN.
+
+
+ Possibly limit some 'cleaner' segments to data only??
+
+
+ So: work items.
+ - change CleanSpace to never fail, but cluster_allocate new_segment
+ can for cleaner segment. This is propagated through lafs_cluster_alloc
+ - cleaner pre-allocates cleaner segments (for new_segment to use)
+ and only cleans that many segments at a time.
+ - introduce emergency cleaning mode which causes ENOSPC to be returned
+ and ignores 'youth' on score.
+ - pause cleaner when we are so short of space that there is no point
+ trying until something is deleted.
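+
+ A sketch of how the four request types might be told apart. The
+ struct, the two thresholds and the name space_alloc_sketch are
+ assumptions for illustration only. The errno choice for NewSpace
+ follows the work item above (emergency cleaning mode returns
+ ENOSPC, otherwise the caller is asked to retry).

```c
#include <errno.h>

enum space_why { CleanSpace, AccountSpace, ReleaseSpace, NewSpace };

/* Hypothetical bookkeeping; field names are not the real struct fs. */
struct space_state {
	long free_blocks;	/* excludes segments promised to the cleaner */
	long allocated;		/* blocks already handed out */
	long release_min;	/* assumed threshold for ReleaseSpace */
	long new_min;		/* assumed, somewhat higher, for NewSpace */
	int cleaner_stopped;	/* cleaner paused: no point retrying */
	int emergency;		/* emergency cleaning mode entered */
};

/* Returns 0 and accounts the blocks, or a negative errno. */
int space_alloc_sketch(struct space_state *sp, enum space_why why, long want)
{
	long avail = sp->free_blocks - sp->allocated;

	switch (why) {
	case CleanSpace:
	case AccountSpace:
		/* cannot fail: callers know not to ask for too many */
		break;
	case ReleaseSpace:
		if (avail < sp->release_min && !sp->cleaner_stopped)
			return -EAGAIN;		/* to be retried */
		break;
	case NewSpace:
		if (avail < sp->new_min)
			return sp->emergency ? -ENOSPC : -EAGAIN;
		break;
	}
	sp->allocated += want;
	return 0;
}
```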
+
+30june2010
+ notes on current issue with checkpoint misbehaving and running out of
+ segments.
+
+ 1/ don't want to cluster-flush too early. Ideally wait until segment is
+ full, but we currently hold writeback on everything so we cannot delay
+ indefinitely.
+ 2/ row goes negative!! let's see...
+
+ seg_remainder doesn't change the set, but just returns
+ the remaining rows times the width
+
+ seg_step moves nxt_* to *, stepping to the next ... row?
+ saves current as st_*
+
+ seg_setsize - allocate space in the segment for 'size' blocks plus
+ a bit to round off to a whole number of table/rows
+ nxt_table nxt_row
+
+ seg_setpos initialises the seg to a location and makes it empty,
+ st_ and nxt_ are the same
+
+ seg_next reports address of next block, and moves forward.
+
+ seg_addr simply reports address of next block
+
+ So the sequence should be:
+
+ seg_setpos to initialise
+ seg_remainder as much as you want
+ seg_setsize when we start a cluster
+ seg_next up to seg_remainder times
+ seg_step to go to next cluster (when not seg_setpos).
+ or maybe just before seg_setpos
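+
+ A toy model of that sequence, for a segment with a single table of
+ 'rows' rows, each 'width' blocks wide. The struct layout and the
+ address arithmetic are simplifications invented here. The real
+ segpos code also tracks tables and devices.

```c
/* Toy single-table segment position; not the real struct. */
struct segpos {
	int width, rows;	/* geometry of the segment */
	int st_row;		/* position saved by the last seg_step */
	int row, col;		/* next block to hand out */
	int nxt_row;		/* first row after the current cluster */
};

/* Initialise to a location and make it empty: st_ and nxt_ agree. */
void seg_setpos(struct segpos *sp, int row)
{
	sp->st_row = sp->row = sp->nxt_row = row;
	sp->col = 0;
}

/* Blocks left after the current cluster: remaining rows times width. */
int seg_remainder(const struct segpos *sp)
{
	return (sp->rows - sp->nxt_row) * sp->width;
}

/* Reserve room for 'size' blocks, rounded up to whole rows. */
void seg_setsize(struct segpos *sp, int size)
{
	sp->nxt_row = sp->row + (size + sp->width - 1) / sp->width;
}

/* Address of the next block, without moving. */
int seg_addr(const struct segpos *sp)
{
	return sp->row * sp->width + sp->col;
}

/* Address of the next block, then move forward. */
int seg_next(struct segpos *sp)
{
	int addr = seg_addr(sp);

	if (++sp->col == sp->width) {
		sp->col = 0;
		sp->row++;
	}
	return addr;
}

/* Save current as st_*, then move nxt_* to current. */
void seg_step(struct segpos *sp)
{
	sp->st_row = sp->row;
	sp->row = sp->nxt_row;
	sp->col = 0;
}
```

+ With width 4, a cluster of 6 blocks takes two rows, so after
+ seg_step the next cluster starts at the beginning of the third row.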
+
+ Need cluster_reset to be called after new_segment, or after we
+ flush a cluster but don't need a new_segment.
+
+ I think I'm cleaning too early ... I am even cleaning
+ the current main segment!!!!
+
+ OK, I got rid of the worst bugs. Now it just keeps cleaning
+ the same blocks in the current segment over and over.
+ 2 problems I see
+ 1/ it cleans a segment that it should not touch
+ We need to avoid cleaner segment increasing the
+ checkpoint youth number.
+ 2/ it has 6 free segments and doesn't use them
+
+ clean_reserved is 3 segments, < 4, so free_block <= allocated+ watermark
+ watermark is 4 segs, so free < 4. So we have 3 allocated to cleaner,
+ 3 in reserve and so nothing much to clean!
+
+ The heuristic for returning ENOSPC is not working. Need something more
+ directly related to what is happening.
+ Maybe if cleaning doesn't actually increase free space.
+
+ !Need to leave segments in the table until we have finished
+ writing to them, so they cannot be cleanable. - DONE
+
+ WAIT - problem. If cleaner segment is part-used, the alloc_cleaner_segs
+ doesn't count that. Bad?
+
+ When nearly full we keep checkpointing even though it cannot help.
+ Need clearer rules on when there is any point pushing forward.
+ Need to know when to fail requests.
+
+02 july 2010
+
+ I am wasting lots of space creating snapshots that don't serve any
+ purpose.
+ The reasons for creating a snapshot are:
+ - turn clean segments into free segments
+ - reduce size of required roll-forward
+ - possibly flush all inode updates for 'sync'.
+
+ We currently force one when
+ newblocks > max_newblocks
+ max is 1000 , newblocks is never reset!
+ probably make that a number of segments.
+ lafs_checkpoint_start is called
+ when cleaner blocks, and space is available
+ at shutdown
+ on write_super if s_dirt
+ __fsync_super before ->sync_fs
+ freeze_bdev
+ fsync_super
+ fsync_bdev
+ do_remount_sb
+ generic_shutdown_super before put_super if s_dirt
+ sync_supers if s_dirt
+ do_sync
+ file_sync !!! if s_dirt
+
+ I think I should move checkpoint_start to
+ ->sync_fs
+
+
+ After testing
+ - blocks remaining after truncate - one index and 1-4 data
+ - truncate finds blocks being cleaned
+ FIXED - move setting of I_Trunc
+ - orphans aren't being cleaned up sometimes.
+ Hacked by forcing the thread to run.
+ - parent of index block has depth==1
+ Don't reduce depth while dirty children.
+ Probably don't want uninc either?
+
+ - some sort of deadlock? lafs_cluster_update_commit_both
+ has got the wc lock and wants to flush
+ writepage also is flushed.
+ Not sure what the blockage is.
+ I think the writepage is the one in cluster_flush, and it
+ is blocking
+
+ - Async is keeping 16/0 pinned during shutdown
+03July2010
+
+ Testing overnight with 250 runs produced:
+ - blocked for more than 120 seconds
+ Cleaner tries to get an inode that is being deleted
+ and blocks, so inode_map_free is blocked waiting for
+ checkpoint to finish - deadlock.
+ Need to create a ->drop_inode which provides interlock with
+ cleaner/iget
+
+ But this is hard to get right.
+ generic_forget_inode needs to write_inode_now and flush all changes
+ out and then truncate the pages off so the inode will be
+ empty and can be freed. But flushing needs the cleaner thread
+ which can block on the inode lookup.
+ Ahh.... I can abuse iget5_locked.
+ If test sees I_WILL_FREE or similar, it fails and sets a flag.
+ if the flag was set, then 'set' fails
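+
+ A user-space model of that trick. In the kernel, iget5_locked takes
+ a 'test' and a 'set' callback that receive the inode and an opaque
+ data pointer. Everything else here (toy_inode, iget_key, the flag
+ value, the callback names) is invented for the sketch, and -EBUSY
+ is just one plausible non-zero return from 'set'.

```c
#include <errno.h>

#define TOY_I_WILL_FREE 0x10	/* stand-in bit, not the kernel's value */

struct toy_inode {
	unsigned long ino;
	unsigned state;
};

/* Lookup key passed as the opaque 'data' argument. */
struct iget_key {
	unsigned long ino;
	int dying;		/* set by test when the match is being freed */
};

/* 'test': refuse to match an inode that is on its way out. */
int toy_iget_test(struct toy_inode *inode, void *data)
{
	struct iget_key *key = data;

	if (inode->ino != key->ino)
		return 0;
	if (inode->state & TOY_I_WILL_FREE) {
		key->dying = 1;	/* remember why we did not match */
		return 0;
	}
	return 1;
}

/* 'set': fail rather than create a second copy of a dying inode. */
int toy_iget_set(struct toy_inode *inode, void *data)
{
	struct iget_key *key = data;

	if (key->dying)
		return -EBUSY;
	inode->ino = key->ino;
	return 0;
}
```

+ The point is that the cleaner's lookup fails immediately instead of
+ blocking on the inode that is being deleted.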
+
+
+ - block.c:504 DONE (I think).
+ unlink/delete_commit dirties a block without credits
+ It could have been just cleaned..
+ It looks like it was in Writeback for the cleaner when
+ unlink pinned and allocated it....
+ or maybe it was on a cluster (due to writepage) when
+ it was pinned. Then cluster_flush cleared dirty ... but
+ it should still have a Credit.
+ Maybe I should iolock the block ??
+
+ On reflection it wasn't cleaning, just tiny clusters
+ of recent changes which were originally written as tiny
+ checkpoints. Maybe lots of directory updates triggered the clusters.
+ I guess writepage is being called to sync the directory???
+ Or maybe the checkpoint was pushed by s_dirt being set.
+
+ So use PinPending and iolock to protect dir blocks from writepage.
+
+ - dir.c:1266 DONE
+ dir handle orphan finds a block (74/0) which is not
+ valid
+ This can happen if orphan_release failed to reserve a block.
+ We need to retry the release.
+ - inode.c:615
+ index block and some data blocks still accounted to deleted file.
+
+ No theory on this yet. Always one index block and a small number
+ of data blocks. Maybe the index block looked dirty, but was then
+ incorporated with something that was missed from the children list...
+ Or maybe I_Trunc is cleared a bit early...
+ Or trunc_next advanced too far?? or too soon
+ ??
+
+ - segments.c:640 DONE
+ prealloc in the cleaner finds all 2315 free blocks allocated.
+ no clean reserved.
+ Need to be able to fail CleanSpace requests when cleaner_reserve
+ is all gone.??
+
+ or just slow down the cleaner to one segment per checkpoint when
+ we are tight.. Hope that works.
+ - super.c:699
+ async flag on 16/0 keeping block pinned
+ Maybe clear Async flag during checkpoint. Cleaner won't need it
+ No, just ensure to clear Async on all successful async calls.
+
+ orphan file 8/0 has orphan reference keeping parent pinned
+ [cfb64c90]8/0(1782)r1E:Valid,SegRef,PhysValid orphan(1)
+ Orphan handling is failing to get a reservation to write out the
+ orphan file block? Not convincing as there should be lots of space
+ at unmount, and 'orphan sleeping' has become empty.
+
+ - Show State
+ orphan inode blocked by leaf index stuck in writeback:
+ [cfb68460]331/0(NoPhysAddr)r2F:Index(1),Pinned,Phase1,Valid,Dirty,SegRef,CI,CN,CNI,UninCredit,EmptyIndex{0,0}[0] primary(1) leaf(1) Leaf1(5)
+ [cfb28d20]331/336(NoPhysAddr)r2F:Index(1),Pinned,Phase1,Valid,Dirty,Writeback,Async,UninCredit,PrimaryRef{0,0}[0] async(1) cluster(1) wc[0][0]
+
+ This is in the write-cluster waiting to be flushed
+
+
+9July2010
+ Review B_Async.
+ If a thread wants async something, it
+ - sets B_Async
+ - checks if it can have what it wants.
+ + if not, fail
+ + if so, clear B_Async and succeed
+
+ If a thread releases something that might be requested Async,
+ it doesn't clear Async, but wakes up *the*thread*.
+
+ This applies to
+ IOLock - iolock_block
+ Writeback - writeback_done, iolock_written
+ Valid - erase_dblock, wait_block
+ inode I_* - iget / drop_inode
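+
+ The Async handshake for the IOLock case, as a single-threaded
+ model. toy_block, the flag values and the wakeups counter are made
+ up for the example. Real code would need atomic bit operations and
+ would wake the cleaner thread rather than bump a counter.

```c
/* Stand-in flag bits, not the real B_* values. */
enum { TOY_IOLOCK = 1, TOY_ASYNC = 2 };

struct toy_block {
	unsigned flags;
	int wakeups;	/* stands in for waking *the* thread */
};

/* Requester: set Async first, then check; clear Async only on success. */
int toy_iolock_async(struct toy_block *b)
{
	b->flags |= TOY_ASYNC;
	if (b->flags & TOY_IOLOCK)
		return 0;	/* fail; Async stays set so we get woken */
	b->flags |= TOY_IOLOCK;
	b->flags &= ~TOY_ASYNC;
	return 1;
}

/* Releaser: never clears Async, just wakes the thread if it is set. */
void toy_iounlock(struct toy_block *b)
{
	b->flags &= ~TOY_IOLOCK;
	if (b->flags & TOY_ASYNC)
		b->wakeups++;
}
```

+ After a wakeup the requester simply calls toy_iolock_async again,
+ which is the call that finally clears Async.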
+
+ orphan handler, cleaner, segscan - all in the cleaner thread.
+
+ 107 runs,
+ 2 hit 'Show State' with a blocked orphan inode.
+ Two children, one EmptyIndex, one PrimaryRef, Async,Writeback
+ Both NoPhysAddr
+
+ Several runs blocked in cluster_flush or waiting for writeback.
+
+ - first case: looks like cluster flush should run but doesn't.
+ cluster_flush runs:
+ checkpoint, cleaner, cluster_allocate when full, update,
+ writepage, sync_page
+ So we have no timeout or other flush.
+ I guess if we are waiting for writeback, we need to trigger a
+ cluster_flush.
+
+ - other case - cluster_flush was called but is waiting for pending count
+ to go down.
+ Looks like cluster_reset shouldn't be changing pending_next
+
+ New hang. Orphans not being processed:
+ inode, because InoIdx is on leaf and checkpoint isn't pushing
+ it along.
+ dir block 0 is Dirty leaf
+
+ Maybe we failed to get a mutex, and mutex_unlock doesn't wake us.
+
+10July2010
+ Over night it looks *very* good.
+ Have one infinite loop with 31770 repeats of
+ ORPH: [cfbe0000]0/328(2326)r4E:Valid,Dirty,Async,UninCredit,Claimed,PhysValid,
+ Orphan(0) async(1) delete_inode(1) orphan_list(1) drop(1)
+
+ So either stuck in truncate_inode_pages, lafs_add_orphan, or inode_map_free
+ lafs_add_orphan too short.
+ tracing shows after truncate_inode_pages.
+ must be blocked in inode_map_free - maybe use AccountSpace??
+ But why isn't the truncate progressing?
+ Probably same reason: No ReleaseSpace available.
+ Maybe we aren't cleaning because there is a free segment, and
+ we aren't checkpointing because there aren't enough yet...
+
+ Probably the cleaner has halted while CleanerBlocks - fix that.
+
+ - 0/74 is a stuck orphan because 74/0 is a dirty leaf going nowhere..
+ Need a checkpoint to release the orphan?
+ ditto for 0/331 - 331/0
+ XX/0 is InoIdx
+
+VFS: Busy inodes after unmount of hdb. Self-destruct in 5 seconds. Have a nice
+day...
+This was pinned: [ce5914f0]16/0(2)r8F:Pinned,Phase0,PinPending,Valid,C,CI,CN,CNI
+,UninCredit,PhysValid leaf(1) intable(6) release(1)
+ [ce5914f0]16/0(2)r8F:Pinned,Phase0,PinPending,Valid,C,CI,CN,CNI,UninCredit,Phys
+Valid leaf(1) intable(6) release(1) Leaf0(0)
+------------[ cut here ]------------
+kernel BUG at /home/neilb/work/nfsbrick/fs/module/super.c:698!
+
+Forgetting 0 0
+724 != 7 (st->free.cnt afte segdelete, close_segment, close_all)
+------------[ cut here ]------------
+WARNING: at /home/neilb/work/nfsbrick/fs/module/segments.c:844 lafs_check_seg_cn
+
+we called segdelete on something that was on the freelist.
+This happens when the final cluster starts a new segment.
+Need to improve the fix though.
+
+
+ lafs_inode_handle_orphan can make progress without leaving
+ anything async. Maybe we need a return status:
+ -EAGAIN - try after async
+ -ENOMEM - try some time soon - hope memory will be better
+ 0 we called orphan_release
+ anything else loops.
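+
+ One way the caller could act on those return values. The enum and
+ the function name are hypothetical, and only the mapping itself is
+ taken from the list above.

```c
#include <errno.h>

/* What the orphan thread does next; names invented for the sketch. */
enum orphan_next {
	ORPHAN_DONE,		/* orphan_release was called */
	ORPHAN_WAIT_ASYNC,	/* try again after the async event */
	ORPHAN_RETRY_SOON,	/* try some time soon, memory may be better */
	ORPHAN_LOOP		/* progress was made: call the handler again */
};

enum orphan_next orphan_dispatch(int status)
{
	switch (status) {
	case 0:
		return ORPHAN_DONE;
	case -EAGAIN:
		return ORPHAN_WAIT_ASYNC;
	case -ENOMEM:
		return ORPHAN_RETRY_SOON;
	default:
		return ORPHAN_LOOP;
	}
}
```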
+
+
+ - when we allocate a segment in the last checkpoint we don't
+ take references properly.
+
+ - orphan handle spinning on:
+
+ ORPH: [ce545f08]0/290(1663)r4E:Valid,Dirty,Async,UninCredit,Claimed,PhysValid,Orphan(0) async(1) delete_inode(1) orphan_list(1) drop(1)
+ 26402 calls.
+ stuck in delete_inode?? ?
+
+
+ never-ending cleaning? Maybe just computer slow ??
+
+11July2010 - on plane to Prague.
+ How can we safely access ->iblock?
+ normally iolock, but how do we get iolock?
+ - flush data to inode
+ - cluster flush takes private_lock
+ - private_lock is used to set to null.
+ I guess we use private_lock to get a reference
+ then iolock and revalidate
+ but I can probably test for NULL at any time? though that can change under private_lock
+ If we own a reference to a child with a parent, then we can use
+ rcu_dereference to get a ref which might change
+
+12july2010
+
+ ->write_inode is called by write_inode() called by __sync_single_inode
+ to handle I_DIRTY_SYNC|I_DIRTY_DATASYNC after do_writepages
+ Do we care?
+
+ changes to addresses we already handle with checkpoints
+ change due to setattr we can handle directly if we want
+ that just cleans mtime/ctime and atime.
+ mtime/ctime calls ->dirty_inode
+ as does atime
+
+ So:
+ getattr changes set I_Dirty so that when cluster_allocate
+ happens all the changes get saved.
+
+ when dirty_inode is called, we set I_Dirty but don't dirty
+ the inode block.
+ If anything happened to justify an inode write, it will
+ be dirty anyway. If it isn't, this is just atime
+
+ So on dirty_inode we check if atime has changed and if so
+ we schedule change to atime file
+
+ sync_inode should write an update for the inode if I_Dirty
+ but sync_filesystems should not
+
+ Simple. fsync calls ->fsync. We get that to write an
+ inode update, but nothing else does.
+
+ Possibly all directory updates could be chained onto a
+ directory and only written when fsync is requested before
+ a checkpoint.
+ both sides of a rename ??
+ leave that for later.
+
+WritePhase - what is that all about?
+ We must not change a block while it is being written to previous
+ phase, else we corrupt causality.
+ But we probably don't want to change it any way as that would
+ mess up any checksum or duplication.
+
+ So we want to ignore WritePhase - scrap it.
+ Before changing a block, we must iolock_written
+ - all dir updates
+ - inode update in fsync
+ - orphan file
+ - segusage?
+ - quotas?
+
+ But what about regular data. If prepare_write finds a block in
+ writeback, do I need to wait, or can I just mark it dirty in
+ commit_write? If no checksum and no duplication applies, this should
+ be fine.
+
+16July2010
+ BUT e.g. dir operations are in particular phases. If the dirblock
+ is pinned to the old phase, we need to flush it, then wait for io
+ to complete. So we need lafs_phase_wait as well as iolock_written.
+ This is already done by pin_dblock.
+ I wonder if we need a way to accelerate pinned blocks that are being
+ waited for - probably not, they should be done early.
+
+ So we probably want to iolock after phase_wait in pin_dblock.
+ Though dir.c pins early.
+ I need to review all of this and get it right.
+
+ So:
+ - we aren't allowed to block much holding checkpoint_lock as
+ checkpoint_start waits for that. However phase_wait will only
+ block if a new checkpoint has started already, so there is no
+ chance of phase_wait ever blocking checkpoint_start.
+ So it is safe to call phase_wait in checkpoint_lock.
+ phase_wait will wait until block is written, added back to
+ the lru clean, then found and flipped... I wonder if that is
+ good - it keeps parent from being a leaf, and so written, until
+ child write has completed.
+ We want to phase-flip a block as soon as it is allocated by cluster_flush.
+
+ With directory blocks, i_mutex stops other changes, so an early iolock_written
+ will leave the block clean and phase won't be an issue.
+
+ With inode-map blocks.. we:
+ set B_Pinned to ensure no-one writes except for phase change
+ do that after lock_written so it starts safe.
+ once we have checkpointlock, wait for phase if needed.
+ then lock_written again which should be instant but ensures
+ that block is locked while we change it...
+
+ I think I want
+ - refile to call phase flip if index is not dirty and is in wrong phase
+ and has no pinned children in that phase.
+ - Only clear PinPending if we have i_mutex or refcnt == 0
+ - before transaction:
+ lock_written / set PinPending / unlock
+ then inside cluster_lock
+ lock_written pin / change / dirty / unlock
+ it will only wait for writeout if phase changed.
+ so don't need phase_wait
+ but want pre-pin then pin_dblock
+ Transactions are:
+ dir create/delete/update - DONE
+ inode allocate/deallocate - on inode map DONE
+ setattr DONE
+ orphan set/change/discard
+
+ Orphans are a little different as when we compact the
+ file, the orphan file block 'owned' by the orphan block
+ can change. As long as we keep them all PinPending it
+ should be fine though.
+ I think that every block in the orphan file will always be
+ PinPending ???
+
+ OK - done most of that.
+ Early phase_flip is awkward. We need an iolock to phase_flip,
+ and we don't have one. The phase_flip could cause incorporation
+ which cannot happen until the write completes. So I guess
+ we leave it as it is.
+
+
+ FIXME what about inode data block - cluster_allocate is removing
+ PinPending after making them dirty from the index block..
+
+ If all free inode numbers are B_Claimed, don't think we allocate
+ a new block... yes we do, as 'restarted' is local to caller.
+
+ Also
+ each device has a number of flags
+ - new metadata can go here
+ - new data can go here
+ - clean data can go here
+ - clean metadata can go here
+ - non-logged segments allowed
+ - priority clean - any segment can be cleaned
+ - dev is shared and read-only - no state-block updates
+
+ state block needs a uuid for an ro-filesystem that this is
+ layered on.
+
+ Is metadata an issue?
+ We might want it on a faster device, but ditto for directories
+ and for some data. So probably skip that.
+
+ Have separate segment tables for:
+ - can have new data
+ - can have clean data but not new. (this often empty)
+
+ Clean data can go to new-not-clean if nothing else
+ new data can go to clean-not-new ?? if not sync??
+ Maybe call them 'prefer clean' and 'prefer new'
+
+ I think we want:
+ 'no sync new' - don't write new data, unless it is in big chunks and
+ can wait for checkpoint to be 'synced'
+ 'no write' - never write anything - this is readonly.
+ used for removing a device from the fs.
+
+ A 'no sync new' device can have single-block segments.
+ This doesn't allow compression, but avoids any need to clean
+ In this case we don't store youth and the segusage is 32 bits per segment.
+ That means - for 1K block size - about 0.4% of the device is used for segusage. That
+ feels high. For 4K, 1/1024 so a gigabyte per terabyte.
+ Then limited to 29 snapshots plus base fs, and 2 bits to record bad blocks.
+
+ Other segusage for 29 snaps is 1/million of space used.
+ So we 'waste' 0.1% of device for no secondary cleaning.
+ Can still do defrag though.
+
+ clearing a snapshot on a 1TB device writes 1GB of data!! potentially.
+ as does creating a snapshot.
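The segusage arithmetic for single-block segments can be checked with a short sketch. The function name is hypothetical; it assumes one table entry per segment and one block per segment, as the note describes.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Bytes of segusage table needed for a device of dev_bytes when every
 * segment is a single block of block_size bytes and each segment has
 * an entry of entry_bits bits. Hypothetical helper for illustration.
 */
uint64_t segusage_bytes(uint64_t dev_bytes, uint32_t block_size,
			uint32_t entry_bits)
{
	uint64_t segments = dev_bytes / block_size;

	return segments * entry_bits / 8;
}
```

With 32-bit entries this gives 4 bytes per block: 1/256 of the device (about 0.4%) for 1K blocks and 1/1024 for 4K blocks, which is 1 GiB of segusage on a 1 TiB device.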
+
+18jul2010
+ If lafs were cluster enabled we would want multiple checkpoint clusters,
+ one for each node. When a node crashes some node would need to find and
+ roll-forward. For single node failure, it is enough to broadcast cluster
+ address to all others. For whole-cluster failure, need to either list all
+ in superblock or link from main write cluster.
+
+ When writing to multiple devices we may want multiple write clusters
+ active for new data. These all need to be findable from checkpoint cluster
+ so linking sounds good.
+ Having a single 'fork' link in cluster head might work but doesn't scale to a large
+ cluster. It doesn't need to be committed to the others, nor does checkpoint end, so
+ that should be ok.
+ Could have a special group_head to list other clusters for roll forward.
+ If we put fsnum first, a large value - 0xffffffff - could easily mean
+ something else
+
+ Or every cluster head could point to an alternate stream, and if we want many
+ quickly, each simply points to another, so we create a chain across all writers.
+
+
+ Another issue...
+ When we 'sync' we don't wait for blocks until after the checkpoint is started,
+ and we know that will be driven through to CheckpointEnd which will commit and
+ release everything.
+ However 'fsync' doesn't have the same guarantee. The sync_page call will ensure
+ the data has been written, but we don't know it is safe until the next
+ header is written. So we need to push out the next cluster promptly.
+
+ So if sync_page is called on a page in writeback, then we mark the cluster as
+ synchronous. When a sync cluster completes, the next (or even next+1) clusters
+ are flushed out promptly. Hopefully they won't be empty on a reasonably busy system,
+ but it is OK if they are.
+
+ If a block is writeback for the cleaner.. then as the cluster is VerifyNone, as soon
+ as the write completes the block will be released.
+
+ So: to clarify sync_page:
+ This can be called when page is in writeback or locked.
+ If locked there is nothing we can do except maybe unplug the read queue.
+ If page is in writeback and block is dirty, then it is probably in
+ a cluster queue and we should flush the cluster and the next.
+ If page is in writeback and block is not dirty, but is writeback,
+ just flush one cluster.
+ But we don't want these cluster flushes to start while the previous is
+ still outstanding else we stop new requests from being added.
+ So as soon as the cluster can be flushed we flush, but no sooner.
+ I guess we use FlushNeeded and make that be less hasty.
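The sync_page cases above can be sketched as a small decision function. This is only a model of the three cases in the note: the function name and boolean parameters are hypothetical, and the "flush as soon as possible but no sooner" deferral is not modelled.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * How many clusters sync_page should ask to have flushed.
 * Hypothetical model of the cases in the note:
 *   page only locked                      -> 0 (nothing we can do)
 *   page in writeback, block dirty        -> 2 (this cluster and the next)
 *   page in writeback, block in writeback -> 1
 */
int sync_page_clusters_to_flush(bool page_writeback, bool block_dirty,
				bool block_writeback)
{
	if (!page_writeback)
		return 0;
	if (block_dirty)
		return 2;
	if (block_writeback)
		return 1;
	return 0;
}
```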
+
+19June2010
+
+ superblocks....
+ We currently have a superblock for each device.
+ I cannot see a good reason for that.
+ We can just bdev_claim for 'this' filesystem.
+ Rather we should have a number of anon superblocks,
+ one for each fileset, then one for each snapshot.
+ Do we use different fs types? probably yes
+ lafs - main filesystem made from devices
+ lafs_subset - subordinate fileset, given a path to fileset object
+ can have 'create' option when given an empty directory.
+ lafs_snap - snapshot - given a path to filesys and textname.
+
+ Cannot create a snap of a subset, only of the whole filesystem
+ Is it OK to mount either snap of subset or subset of snap?
+ It probably is, so need to use the same filesystem type for both.
+ Maybe lafs_sub or sublafs. Needs path to directory.
+ can be given 'snap=foo'.
+ No: a given filesystem may not exist in a snapshot. You need to
+ mount the snapshot first, then the subset of the snapshot.
+ So we have three types as above. All subsets as 'lafs_subset',
+ whether they are subset of main or of snapshot.
+
+ Should we be able to create a snapshot or subset without mounting it?
+ It doesn't really seem necessary but might be elegant..
+
+ remount doesn't seem the right way to edit a filesystem as it forces
+ some cache flushing.
+ What do we want to edit?
+ - add device, remove device
+ - add/remove snapshot by name
+ - add/remove subset? Not needed, just mkdir/rmdir and mount to convert
+ empty dir to subset.
+ - change cleaner settings??
+ Could have remount as an option. If problem find other option.
+
+ While cleaning (which is always) we potentially need all superblocks
+ available as we might need to load blocks in those filesystems to
+ relocate them.
+ Unfortunately each super needs to be in a global list so there is a cost
+ in having them appear and disappear. I guess that is not a big deal. They
+ are refcounted and will disappear cleanly when the count hits zero.
+
+ So:
+ DONE - change all prime_sb->blocksize refs to fs->blocksize
+ DONE - create an anon sb for the main filesystem
+ DONE - discard the device sbs, just bd_claim the devices and add to list
+ - use lafs_subset for creating/mounting subsets.
+
+ Changed s_fs_info to point to the TypeInodeFile for the super, but
+ for root/snapshot that doesn't exist early enough to differentiate the
+ super in sget.
+ So we make an inode before the super exists and attach it after.
+ Need to do all that get_new_inode does.
+ inode_stat.nr_inodes++ - just don't generic_forget the inode
+ add to inode_in_use - seems pointless - just set i_list to something
+ add to sb->s_inodes - if we don't it won't flush - maybe that is good?
+ add to hash - don't want
+ i_state == lock|new - only really needed if hashed.
+ but there is lots of initialisation in alloc_inode that we cannot access!!
+
+ Problem is that we need s_fs_info to uniquely identify the fs with something
+ that can be set in the spinlock, so allocating an inode is out.
+ And also to get to the filesystem metadata which is in the inode.
+ I guess we allocate a little something that stores identifier and later inode.
+ for lafs we use uuid
+ for subset we use just the inode
+ for snapshot we use fs and number
+
+
+25July2010
+ superblocks:
+ - sget gives us an active super_block. We need to attach to a vfsmnt
+ using simple_set_mnt, or call deactivate_locked_super.
+ - sget's set should call set_anon_super
+ - kill_sb (called by deactivate_super) should then call kill_anon_super
+
+ If we have a vfsmnt, we have an active reference, so we can atomic_inc
+ s_active safely. So use this to allow snapshots and subsets to hold a
+ ref on the prime_sb and thence on the 'fs'.
+
+26July2010
+ - DONE need to set MS_ACTIVE somewhere!!
+ - FIXME if an inode is being dropped when iget comes in, it gets confused
+ and the inode appears to be deleted.
+
+ We cannot really break the dblock <-> inode link until after write_inode_now,
+ but there is no call-back before generic_detach_inode is complete.
+ The last is write_inode which is only called if I_DIRTY_something.
+ Maybe when writeback completes on an inode dblock, we should check if
+ the inode is I_WILL_FREE and if so, we break the link...
+
+ Or maybe when we find my_inode set we can check the block and if it isn't
+ dirty or being deleted we break the link directly... That makes more sense.
+
+ So... what is the deal with freeing inodes???
+ ->iblock is like a hashtable reference. It is not refcounted
+ It gets set under private_lock
+ iblock is freed by memory pressure or lafs_release_index from
+ destroy_inode
+ when refcount of iblock is non-zero, ->dblock ref is counted,
+ else it is not.
+ dblock is set to NULL if I_Destroyed, or when dblock is discarded,
+ (under lafs_hash_lock)
+ and set to 'b' in lafs_iget and lafs_inode_dblock
+
+ We can drop the dblock link as soon as iblock has no reference
+
+ probably get clear_inode to break the link if possible, which it should
+ be on 'forget_inode'. Then lafs_iget can wait on the bit_waitqueue.
+ or maybe do clear_inode itself
+
+ FIXME when we drop dblock we must clear iblock! as getiref iblock assumes
+ dblock is not NULL.
+
+28July2010
+ So: ->dblock and ->my_inode need to be clarified.
+
+ Neither is a counted reference - the idea is that either can be freed and
+ will destroy the pointer at the time so if the pointer is there, the
+ object must be ... but we need locking for that.
+ ->dblock is reasonably protected by private_lock, though if ->iblock exists
+ we hold a ref of ->dblock so we can access it more safely.
+
+ Need to check getiref_locked knows ->dblock exists when called on iblock
+ and lafs_inode_fillblock
+ yes, both safe!
+
+ But ->my_inode needs locking too so the inode can safely disappear without
+ having to wait for the data block to go. After all data blocks come in sets,
+ and one shouldn't keep others with inodes.
+ So something light-weight like rcu might work.
+ We use call_rcu to free the inode and rcu_readlock to access ->my_inode
+
+ Yes, that will work. Occasionally we will want an igrab too, but not
+ often.
+ Should look into rcu for index hash table and ->iblock as well.
+ Current ->iblock is only cleared when the block is freed .. I guess that is fine...
+
+
+31Jul2010
+ rcu protection of ->my_inode
+ A/ orphan inodes - are they protected?
+ B/ orphan blocks - are the inodes of those protected? Probably...
+
+ inodes are 'orphan' for two reasons
+ 1/ a truncate is in progress
+ 2/ there are no remaining links, so inode should be truncated/deleted
+ on restart.
+
+ The second precludes us from holding a refcount on any orphan inode,
+ else it would never get deleted.
+ So we must assert that an inode with I_Deleting or I_Trunc has an implied
+ reference and so delete must be delayed... not quite.
+ If we set I_Trunc but not I_Deleting, then we igrab the inode until
+ I_Trunc is cleared. While we hold the igrab, I_Deleting cannot possibly
+ be set as that is set when last ref is dropped.
+
+01Aug2010
+ FIXME lafs_pin_dblock in lafs_dir_handle_orphan needed to be ASYNC.
+ .. and in lafs_orphan_release
+ Well... only iolock_written can be a problem, and our rules require that
+ only phase-change writeout can set writeback. So the cleaner can never
+ wait for writeout here. Maybe it can wait for a lock, and maybe we don't
+ really need a lock, just 'wait_writeback'.
+08Aug2010
+ So cleaner is in run_orphans, dir_handle_orphan pin_dblock iolock_written
+ It is writeback waiting on 74/BIGNUM from file.c:329. So writepage
+ tried to write a block in a directory .. but it is PinPending so that
+ must have been set after writepage got it...
+ lafs_dir_handle_orphan gets an async lock, then sets PinPending.
+ If write_page is before that, it will have the lock and dir_handle will try later.
+ If write_page is after it will block on the lock, or see PinPending and
+ release the lock.
+ So someone else must be clearing PinPending!
+ - checkpoint clears and re-sets under the lock, so that is safe
+ - dir.c clears under i_mutex
+ dir_handle_orphans always hold i_mutex ... or does it.
+ - refile drops when the last non-lru reference goes.
+ - inode_map_new_abort clears for inode
+ No, not that - just bad test on result of iolock_written_async ;-(
+
+ Now have an interesting deadlock.
+ rm in lafs_delete_inode in inode_map_free is waiting for the block to
+ flush which requires the cleaner.
+ The cleaner thread in inode_handle_orphan is calling erase_dblock
+ on the same inode which blocks while inode_map_free has it locked....
+ no, not same block - just waiting for writeout which requires cleaner.
+ lafs_erase_dblock from inode_map_free must be async!
+ pin_dblock in lafs_orphan_release must too.... no - only the setting of
+ PinPending needs to be async or outside of cleaner, which it is.
+
+ Ok, got that fixed. All seems happy again, time for a commit.
+
+
+09Aug2010
+ 14b/ What backing-dev to show the filesystem.
+ backing-dev holds:
+ congested state
+ unplug function
+ read-ahead info
+ throughput measurements
+
+ Much of that is for generic code to use. We need to:
+ - provide an unplug function that unplugs all devices
+ - provide a congested function which checks all devices,
+ or for 'write' - at least the device we are writing to.
+
+ How do we set the backing device?
+ The 'struct address_space' points to one, as does struct super_block.
+ set_anon_super establishes a null bdi, set_bdev_super gets it from the
+ bdev->queue
+
+ We need to bdi_init and bdi_register (if no error) our bdi.
+ bdi_destroy calls unregister and reverses bdi_init
+ or just bdi_setup_and_register
+ but bdi_register_dev gives a better name - isn't this sick!!!
+
+ Partly done ... but I'm hitting more bugs :-(
+
+ -Checkpoint cannot complete because...
+ Lots of dirty inodes that are orphans are not pinned!! I
+ guess the InoIdx is ??
+ Most of them don't have InoIdx(?) Only '8' does.
+ 8/0 is also an orphan and is on wc[0]
+
+ It seems that this block keeps getting re-written and stays in
+ Phase0.
+ Is that because it is a data block with PinPending.. No, that works
+ as long as it becomes un-dirty: we drop pinpending, refile, and set again
+
+ It is being dirtied again during writeout for the checkpoint
+ so it doesn't get to changed phase when we lift PinPending.
+ I guess we mustn't dirty it if it is in the old phase.
+
+ -And twice inode 17 is deleted without B_Orphan being set!
+ That is the only file that exists before we mount.
+ Problem was orphan_release instead of orphan_forget
+ I wonder why it only affected 17...
+
+ -at shutdown we drop an inode and try to invalidate pages, but
+ root inode is still dirty - I wonder why.
+ The dblock is in a different phase to the iblock.
+ In checkpoint we wait until root iblock changes phase, but
+ not root dblock!
+
+
+ UP TO:
+ I'm testing subordinate filesystems, which don't work yet.
+ I need to create the root directory and inode map.
+ Obviously I cannot record the inode map file in the inode map....
+ inode_map should ignore everything less than 16? 8? 2?
+ Need to make sure creating with a given inode number works.
+ Need to make sure auto-allocate inum is never less than 16.
+
+11Aug2010
+ How to map from filesys inode to superblock?
+ Need in
+ lafs_iget_fs
+ choose_free_inum - to get inode-1
+ ditto in inode_map_free
+ lafs_put_super has something odd with i_sb
+
+ Could do an sget search..
+ Or could just store it in the inode (but not in i_sb!!)
+ inode already a bit large though.
+ Do it for now, but make a note to trim the fs_md part of inode
+ into a separate allocation.
+
+ lafs_new_inode should take an 'sb' not a 'filesys'.
+ In fact, get rid of filesys. It is
+ MAP(i->i_sb->s_fs_info)->root.
+
+ 15f - timestamps for roll-forward.
+ The writeout can be much later, but logging the mtime is fairly
+ boring ... we could log mtime in the group head, which might be cheap
+ enough. How much precision is needed, and against what base?
+ probably mtime of last checkpoint from superblock. That should
+ be not more than 2048 seconds ago, so 16 bits gets us about 30msec...
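The precision estimate is easy to check: a 2048-second window divided into 2^16 steps gives 31.25 msec per step. A small sketch, with a hypothetical helper name:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Resolution, in microseconds, of a time offset that must cover
 * window_seconds using `bits` bits. Hypothetical helper for
 * illustration only.
 */
uint64_t offset_resolution_usec(uint64_t window_seconds, unsigned int bits)
{
	return (window_seconds * 1000000ULL) >> bits;
}
```

For a window of 2048 seconds and 16 bits this returns 31250, i.e. 31.25 msec.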
+
+14Aug2010
+ 15l - decay youth info.
+ Need to decay:
+ youth_next and checkpoint_youth in 'struct fs'
+ all blocks in youth files on storage
+ all scores in seg-tracker.
+ - not needed, they'll get updated in normal progress
+ and being wrong for a while is no cost.
+ ensure correct youth is stored in lafs_free_get
+ check little-endian conversion of all youth accesses
+
+ checkpoint_youth only used by thread, so no locking needed
+ youth_next protected by fs->lock
+
+ 15m - share orphans and cleaning list_heads in datablock
+ It certainly is possible to clean an orphan but it is very unlikely
+ as it will have changed recently, or be changing soon.
+ The cleaner could just dirty any B_Orphan it finds.
+ But if orphan finds a block on the list, it must be careful...
+ I guess when cleaner drops a cleaning ref, it should check if the block
+ is an orphan, and re-queue if it is.
+
+ 15o - async blocks just have an extra refcount.
+ This could:
+ - keep PinPending set
+ - keep an index block pinned - will phase-flip
+ - keep ->parent link
+ but not get in the way of a checkpoint.
+
+ Should we clear any that we find though?
+ Normally async is only used by cleaner, orphan processing, or segscan
+ So it should all be finished when we do a checkpoint.
+
+ So if checkpoint, or release_page, finds an async block, drop it.
+
+ 15r - further optimisations in cleaner to avoid lookups.
+ We have fsnum,inum,blocknum and cluster seq number and trunc num.
+
+ I want to introduce more async though. Currently it only loads
+ one inode at a time.
+ To do more, I need to mark inodes as 'done' when they are and always
+ restart from the start of the cluster (only do one cluster at a time
+ for now).
+ So if we get all the way through a cluster with no 'EAGAIN' we finish
+ with the cluster.
+
+ 15y - when could a directory block become an orphan?
+ - when deleting that last entry - we don't know if it can be fully
+ deleted until we look in next block
+ - when deleting an entry follows a chain back to the first block
+ - when deleting the last entry in the block.
+
+ So it could be an orphan if the entry found:
+ - is at end of block
+ - is first entry
+ - is only entry
+ or first entry is already deleted.
+
+15Aug2010
+ looking at flushing etc when run out of space.
+ We often force a checkpoint when it won't do any good as
+ nothing has been cleaned.
+ In fact we write lots of dead checkpoints to 0/0 until it is full,
+ then move on, clean 0/0 and suddenly have space.
+ We shouldn't do that. sync should be what pushes us forwards.
+ Maybe that is fixed..
+
+ InoIdx blocks still cause confusion. Should they ever have credits?
+ or do only the data blocks have those? Certainly they cannot have
+ SegRef.
+ And there is confusion in my mind whether data blocks can be pinned
+ while the InoIdx block is - need to clarify that.
+
+
+13Sep2010 - now, where was I...
+ - I've just been dropping the use of SegRef on InoIdx blocks, where it makes no sense.
+ - test run: block.c:660 - no credits available while dirtying an InoIdx block during
+ orphan handling. lafs_reserve_block (under checkpoint lock) should have set credit.
+ Only I just changed reserve_block to do that dblock instead - I wonder why.
+ OK, I think I cleaned that up...
+
+
+ - make_orphan is hanging in checkpoint_unlock_wait. So orphan_pin returned -EAGAIN
+ so pin_dblock did too. So reserve_block did too, so prealloc or summary_alloc or seg_ref_block
+ returned error.
+ Problem is that we don't push a checkpoint when cleaner runs out of things to do.
+ But we don't want to go back to pushing a checkpoint too often.
+ Maybe the problem is that we only force the checkpoint when we have enough space to do
+ new allocations, but we need to force it earlier if nothing new can be cleaned.
+
+ Once we set EmergencyClean, lafs_reserve_block will stop returning EAGAIN for newspace, so
+ we need to wake 'checkpoint_wait' then.
+ But for ReleaseSpace we want to wake on every checkpoint... we probably do anyway.
+ ...anyway, that is sorted now at commit 95b6b05e460
+
+
+ So: InoIdx blocks.
+ - These never get SegRef as that is meaningless - done.
+ - These can have credits. It possibly isn't necessary but it makes things
+ easier. They are 'written' by transferring the credits to the data block, or discarding them.
+ - I think dblock and iblock can both be pinned
+ The problem this caused was that the dblock might get processed as a leaf before iblock.
+ We now have lafs_is_leaf which causes dblock not to be a leaf even if it is pinned, if the iblock
+ is pinned to the same phase.
+ lafs_phase_flip refiles the dblock so that it goes back on the leaf list as does lafs_refile when
+ it unpins an iblock
+ So lafs_pin_dblock doesn't need to pin the inode instead.
+ OK, that is fixed. - commit f1c05293bfd Mon Sep 13 15:07:27 2010 +1000
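The leaf rule for an inode's data block can be modelled in a few lines. This is a simplified sketch, not the real lafs_is_leaf: the struct and function names are hypothetical, and it assumes a pinned dblock would otherwise count as a leaf, ignoring every other condition the real code checks.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, stripped-down view of a block: only pin state and phase. */
struct blk {
	bool pinned;
	int phase;
};

/*
 * A pinned inode dblock is not treated as a leaf while its InoIdx
 * block (iblock) is pinned to the same phase, so the iblock gets
 * processed first. iblock may be NULL if it does not exist.
 */
bool inode_dblock_is_leaf(const struct blk *dblock, const struct blk *iblock)
{
	if (!dblock->pinned)
		return false;
	if (iblock && iblock->pinned && iblock->phase == dblock->phase)
		return false;
	return true;
}
```

Once the iblock flips phase or is unpinned, the same test starts returning true, which matches the note that phase_flip and refile put the dblock back on the leaf list.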
+
+ 15u - I don't need to get a segref there, but I need to have one from the original dirty block,
+ so fix that up - commit Mon Sep 13 15:28:08 2010 +100
+
+ 15v - What do we have?
+ lafs_dirty_dblock: set Dirty, clear Credit clear NCredit
+ set Uninc, clear Icredit clear NICredit
+ lafs_dirty_iblock: set dirty, clear credit
+ test uninc, clear ICredit, set Unincredit - not essential
+ mark_cleaning: test realloc, / alloc / set realloc
+ test dirty / clear realloc/ set credit
+ set uninc clear icredit
+ cleaner_flush: set dirty, clear realloc, clear credit
+ test dirty, clear realloc set credit
+ flush_data_to_inode:
+ lafs_cluster_allocate - there is some odd code there!!
+ flip_phase
+ lafs_allocated_block
+
+ all rather different really.
+ Just do some tiny tidyup in lafs_cluster_allocate when dirtying dblock
+
+ 15w/ Space used by cluster updates??
+ It is all fine - just some confusion of function names.
+
+ 15z/ logging symlink creation.
+ Do I need to log the content? It needs to be safe on a dir sync, and you cannot sync the
+ symlink itself. So I guess we queue the block for writeout so it will go with the
+ dir update.
+ Yes, that works: Mon Sep 13 17:33:54 2010 +100
+
+ 15ab/ already did that in commit f90959e6f492b6
+
+
+ 15ac/ How can we trigger write-out of dirty index blocks which have no pin-count, thus allowing them to
+ be freed after the write completes? A checkpoint could do it, but that would write out index blocks
+ that cannot be freed too. A checkpoint would only be good after lots of data pages had been written.
+ We could just wait and let other processes kick in..
+
+ I don't think we need to do anything. lafs_shrinker doesn't really know how tight memory
+ is, and periodic checkpoint will free up any memory that we are pinning.
+
+ .... but something is needed. We need some trigger to write dirty index blocks
+ Maybe:
+ - a timeout on checkpoints - every dirty_expire_interval - but that isn't exported.
+ DONE THAT.
+
+ Not sure this is a complete solution. I might want to incorp/flush index blocks when they
+ have no dirty children, but I'm not sure about that.
+
+14sep2010
+ 15ad - lafs_add_block_address call from lafs_phase_flip - do I handle failure correctly?
+ failure happens when b2 is data block and uninc table is full so we called incorporate on the parent.
+ This could split the parent which means the block could have been re-parented - it would have been in the
+ child list and so found and fixed.
+ lafs_allocated_block, when this happens, checks that the parent is dirty/realloc as appropriate.
+ In this case, realloc isn't an issue, only dirty. lafs_incorporate must have made it dirty and
+ it won't get written while it has these in-phase children, so all is happy.
+
+ 15ae - refile race? Someone might set B_IOLock before removing from lru, so
+ onlru is 0 and refcnt is elevated so it doesn't seem to be unused.
+ But then whoever has the ref will refile again when dropping it and
+ so the right thing will be done.
+ But more generally, do we really want the lru etc to own a counted reference?
+ If it didn't:
+ - we would need to refile when removing from any list
+ - we would need to get a ref when removing from list.
+ uhmmm..
+
+ lafs_refile does:
+ clear PinPending if refcnt is low
+ unpin if not PinPending, or dirty etc and data or refcnt is low
+ place on leaf list - if pinned etc - this can be earlier
+ drop parent link if refcnt is low, and not pinned etc
+ handle dblock issues
+
+ if lru was not refcounted, then the only things we might do when refcnt isn't zero are:
+ unpin a dblock once it is not dirty
+ add to lru
+
+ But if we don't count lru, then we can lose the refcount on dblock
+
+ Hmmm - we cannot leave things on the leaf list forever as they thus hold a reference and
+ don't get freed.
+
+ I think I want things on 'leafs' list to not hold a counted reference.
+ Things *only* get removed while walking the list.
+ InoIdx blocks hold a ref on the dblock both when counted and some other time. Possibly
+ when pinned. This ensures they are held while the InoIdx is a real leaf.
+ But: When we take that first ref, how do we know the dblock even exists?
+
+ What is the lifetime of ->dblock?
+ removed when page is released
+ set by lafs_import_inode
+ set by lafs_inode_dblock
+ removed by clear_inode
+ So if I don't hold a ref, I always need to be ready to call lafs_inode_dblock
+ This is currently callers of getiref_locked
+ - erase_dblock_locked ?? shouldn't need a lock
+ - ihash_lookup - never on InoIdx
+ - lafs_make_iblock - already have dblock
+ So none of those really need lafs_inode_dblock
+ What about when we set Pinned
+ only really from set_phase ... messy.
+ What about when we set ->parent
+ grow index tree - not relevant
+ ditto do_incorporate_*
+ block_adopt
+ Can be called on InoIdx from:
+ lafs_make_iblock only!!
+
+15sep2010
+
+ I have tidied lafs_refile up a lot but I need to make locking a lot cleaner.
+ In particular I want a single lock I can take when the refcnt hits zero which will ensure no ref
+ is taken until I have finished my cleanup. I suspect the inode private_lock is the one to use.
+ I also need to clean up getiref_locked and getref_locked - having both is awkward.
+
+ So: when are they called?
+
+ getref_locked:
+ lafs_get_flushable - hold fs->lock
+ first_in_seg - holds private_lock, but shouldn't need _locked as hold a ref through child.
+ (getiref_locked)
+ pin_all_children - hold private_lock
+ find_better - private_lock
+ getdref_locked
+ lafs_invalidate_page - to get a ref on each block to either erase or invalidate it
+ presumably page is locked
+ lafs_get_block - holds private_lock - plus once with only page_lock
+ lafs_release_page - holds private_lock
+ (getiref_locked on dblock) - no locking
+ lafs_inode_dblock - private_lock of my_inode...
+ lafs_delete_inode - private_lock of my_inode
+ lafs_destroy_inode - ditto
+ lafs_drop_inode - ditto
+ getiref_locked
+ erase_dblock_locked - private_lock
+ lafs_get_flushable - fs->lock
+ ihash_lookup - lafs_hash_lock
+ lafs_make_iblock - private_lock
+
+ So private_lock looks like a good choice. Issues are:
+ - what is the story with dblock on my_inode->private_lock
+ - what is the lock ordering
+ - what can refile negate that we need to be careful of.
+ i.e. we want to keep things stable while refile does its tests, but what do we need to keep
+ stable for others?
+ + we break the parent link?? and so the siblings link
+ + move things to freelist
+ + can put_page
+ + free dblock if not page_private
+
+ Lock_ordering. private_lock, then fs->lock, then lafs_hash_lock
+ So if we have to hold lafs_hash_lock, we increment refcnt, drop the lock, get/drop private_lock
+
+ This is getting messy - I need something nice and clear.
+ So:
+ Index Blocks.
+ If Pinned, either has references or is on a leaf list - possibly both
+ If no references and not pinned then not on leaf list, so can be on free list
+
+ Pinned can only be set when there are references, and can only be cleared under private_lock
+ This is violated by phase_flip, which badly reads refcnt
+ If refcnt is zero and not pinned, then can be moved to free_list
+ If on freelist and refcnt is zero under hash_lock, can be freed
+
+ So if lafs_get_flushable finds a block that is not pinned, then we can delete and ignore.
+ Someone else must hold a ref and will put it and it will refile. but that is pointless as
+ it could immediately be cleared after we test Pinned.
+
+ lafs_get_flushable should get a reference before deleting from list. This ensures it won't be freed
+ by lafs_shrinker, though it could be on the free list. If it is, then it isn't pinned so it is not
+ interesting to us.
+
+
+ Data Blocks:
+ These are removed from lru when freed - we just need the extra refcnt check after removing from list.
+ No we don't - these are only pinned while refcnt or dirty and can only lose dirty while refcnt
+ so they cannot disappear
+
+ What is the story with my_inode->private_lock though? This is used to protect ->dblock accesses.
+ I guess we need to get or hold the other lock .... look at what the race is - what else is checked when dblock is cleared?
+ dblock is cleared in refile for the dblock,
+ or in clear_inode under the inode private lock.
+
+ So:
+ There are various places that hold a non-counted reference to a block.
+ These include
+ - index hash table lafs_hash_lock
+ - index free list lafs_hash_lock
+ - phase_leafs / clean_leafs fs->lock only if pinned
+ - inode->iblock lafs_hash_lock
+ - inode->dblock inode->i_data.private_lock
+
+ Each of these is protected by its own lock, but not all the same lock.
+ When we turn one of these into a counted reference, we increment refcnt under the local lock,
+ then after dropping that lock we take and drop b->inode->i_data.private_lock to ensure refile has
+ finished. This must be done before changing/using the block in any way.
+ To free an index block it must first be removed from _leafs list. Then if the refcount is still
+ zero it can be freed - or put on freelist and subsequently freed.
+ An InoIdx block - we need to hold hash_lock as well as private_lock to take a reference.
+ To free a data block we similarly need to recheck refcnt after removing from leaf list.
+ If it is in an inode file we also take that inode's private_lock to clear dblock.
+ We use rcu to get the inode, then lock it, then clear dblock if refcnt is still zero.
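+
+ A rough userspace sketch of the "non-counted to counted reference" protocol above.
+ Everything named toy_* is made up for illustration (a toy spinlock stands in for the
+ kernel locks), so this shows the ordering only, not the real lafs code:
+ bump refcnt under the local lock, drop it, then take and drop private_lock so any
+ refile that saw refcnt==0 has finished before the block is used.

```c
#include <assert.h>
#include <stdatomic.h>

/* Toy spinlock standing in for a kernel spinlock (illustration only). */
struct toy_lock { atomic_flag held; };

void toy_acquire(struct toy_lock *l)
{
	while (atomic_flag_test_and_set(&l->held))
		;	/* spin until the holder clears the flag */
}

void toy_release(struct toy_lock *l)
{
	atomic_flag_clear(&l->held);
}

struct toy_block {
	atomic_int refcnt;
	struct toy_lock *private_lock;	/* plays b->inode->i_data.private_lock */
};

/* Turn a non-counted reference (found under 'local') into a counted one. */
struct toy_block *toy_getref(struct toy_block *b, struct toy_lock *local)
{
	toy_acquire(local);		/* e.g. lafs_hash_lock or fs->lock */
	atomic_fetch_add(&b->refcnt, 1);
	toy_release(local);

	/* Lock order is private_lock first, so it is only taken after
	 * 'local' is dropped.  Taking and dropping it waits out a refile. */
	toy_acquire(b->private_lock);
	toy_release(b->private_lock);
	return b;
}

/* Refile side: the block may only be freed if refcnt is still zero
 * when rechecked under private_lock. */
int toy_may_free(struct toy_block *b)
{
	int ok;

	toy_acquire(b->private_lock);
	ok = (atomic_load(&b->refcnt) == 0);
	toy_release(b->private_lock);
	return ok;
}
```

+ The recheck in toy_may_free is the "if the refcount is still zero" step; without it a
+ reference taken via the hash table could race with the free.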
+
+17sep2010
+ review lafs_refile - are some of those tests redundant? - yes, one is gone.
+
+ So:
+ 15ah - What about truncated blocks sitting on an uninc chain?
+ I don't see the problem. It will eventually get incorporated and do the right thing...
+
+ 15ai - We don't want to touch the youth block during a checkpoint else it is awkward to write it out in
+ a stable way.....
+ No, I don't think that is really a problem. It only gets written out in the tail of the checkpoint after
+ the root. I guess it could then get a youth number for a segment that it has no count for, if the root is
+ written at the end of one segment and the segusage/youth written at the start of the next.
+
+ But I think roll-forward is missing something. Blocks in the next phase need to be counted into segusage.
+ Are they? oh, yes - they are. - cleaned and index blocks are ignored so there might be some wasted space,
+ but the important blocks picked up by the roll-forward are handled.
+
+ So....
+
+ A checkpoint could cover multiple segments. We need to be sure these each get a valid youth number.
+ Probably most of them will, but we need a consistent approach to be sure.
+ They don't need to be added to the segtracker, except the last needs to be active, and it already is.
+ So as we find a new segment we want to do much like what lafs_free_get does with youth_update.
+ But the data block - isn't that youthblk? When is that set?
+ segsum_find sets it if ssnum == 0
+
+19sep2010
+ 15ak - run the orphan file at mount time.
+ After roll-forward when we have a working filesystem, we need to read the orphan file, load each block
+ mentioned, and register each as an orphan.
+ This involves:
+ - setting the orphan_slot
+ - setting B_Orphan
+ - lafs_add_orphan
+ Just like at the start of orphan_commit
+ We also need to initialise nextfree and possibly 'reserved'.
+ But: can orphans be created during roll-forward? They certainly can. We currently hide that in a re-use of
+ the orphan list.. But directory updates are possible too, and not handled.
+
+ I guess we should examine the file as soon as root is loaded and before roll-forward, as roll-forward cannot
+ change the orphan file. Then after roll-forward, we read the original part of the file and set up
+ any orphans that aren't yet.
+ So we want to read once to get the size. Then read again to process content up to that size.
+
+ 15am - filesystem name.
+ This is only used for identifying snapshots
+
+01oct2010
+ - mkfs is done to an initial version of lafs-utils. !!!
+
+ So: 15am - filesystem name - used to identify snapshots
+ So the name is pointless in subordinate filesets. So I could just shrink
+ the metadata. The primary metadata needs to be big enough to get a name
+ easily though.
+
+ 15aw..
+ When cleaning we have a separate credit bit 'B_Realloc' from 'B_Dirty'.
+ But we have the same B_UnincCredit bit for both. Is that safe?
+ Processing the cleaner could absorb the UnincCredit while the block is
+ reserved but not dirty. Then when it gets dirtied, there may be not
+ enough credits to split.
+ We set Dirty from Credit, and use ICredit for UnincCredit.
+ But when only Realloc (not dirty) we don't use those bits. We allocate
+ fresh credits or set Dirty if that fails.
+
+03Oct2010
+ Need lafs_iget_fs to work on other filesystems. And other snapshots?
+ We use it:
+ in cleaner when parsing cluster head
+ in orphan handler when loading orphan file or when rearranging it.
+ in roll forward
+
+ Each of these might need to kern-mount the fs - so we need to hold the ref
+ somewhere.
+ Cleaner also needs to explore snapshots.
+
+ Don't want kern_mount - that is too heavy weight and includes a vfsmnt.
+ Just split up lafs_get_subset and use sget etc. so we get an 'sb' that we need
+ to hold.
+ Similarly for snapshots. Cleaner needs to consider all snapshots, so they
+ all need to be mounted.
+
+ So snapshot 'sb's are referenced by cleaner, and de-reffed when cleaner stops.
+ Subset 'sb's can be attached to the parent inode and then only dropped when
+ the inode goes... only sb currently references inode.
+ So maybe the first ref to an sb doesn't ref the inode but others do - is that
+ possible? No, as we don't see them being dropped.
+ Every inode in the subset could ref the filesys inode. That would keep it active
+ the right amount of time, but release/destroy could still be racy.
+
+ I guess cleaner/orphan/roll need to explicitly ref the fs.
+ cleaner already refs inode when B_Cleaning, so hold fs too.
+ B_Orphan seems to own an inode ref too.
+
+ So:
+ lafs_iget_fs gets a ref on the inode and the sb.
+ need lafs_iput_fs to drop both references
+ B_Cleaning, B_Orphan, I_Pinned and I_Trunc all hold this double ref.
+
+ cleaner holds refs on all snapshots
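+
+ A toy sketch of the paired double reference. The toy_* structs and counters are
+ stand-ins, not the real inode/super_block, and only show that get and put must
+ move both counts together:

```c
#include <assert.h>

/* Stand-ins for the real structures (illustration only). */
struct toy_sb { int s_active; };
struct toy_inode { int i_count; struct toy_sb *sb; };

/* Take a reference on both the inode and the filesystem that owns it. */
struct toy_inode *toy_iget_fs(struct toy_inode *ino)
{
	ino->i_count++;
	ino->sb->s_active++;
	return ino;
}

/* Drop both references taken by toy_iget_fs. */
void toy_iput_fs(struct toy_inode *ino)
{
	struct toy_sb *sb = ino->sb;	/* read before the inode ref is dropped */

	ino->i_count--;
	sb->s_active--;
}
```

+ Anything that sets B_Cleaning, B_Orphan, I_Pinned or I_Trunc would call the get side,
+ and clearing the flag would call the put side.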
+
+ FIXME I probably need to hold inode/fs for B_Async too.
+ No. Async only refs the block, not the inode or fs.
+ Something else would normally ref the inode - e.g. cleaner.
+ When the inode is free, the page invalidation will notice the
+ B_Async flag and release it.
+
+ So that is all done now, except I don't hold refs on snapshots in the cleaner
+ yet.
+
+11oct2010
+ DescHole
+ - When is this used? directory etc don't need it.
+ - a regular file might, but there is no API to punch
+ a hole.... yet I guess.
+ - So we just want to allocate these blocks to 0.
+
+15oct2010 - happy birthday Daniel...
+ Looking at 36:
+ a/ files with nlink==0;
+ If we happen to find them, we hold a reference until all roll-forward
+ is done, in case a name is found - it is important not to start deletion
+ early.
+
+18oct2010
+ 36g - write roll_mini for directories.
+ We get a name, an inode number, and one of:
+ LINK UNLINK REN_SOURCE REN_NEW_TARGET REN_OLD_TARGET
+
+ The REN_SOURCE is linked with a REN_*_TARGET which could be in a
+ different directory, so we need to stash the SOURCE until the TARGET
+ arrives.
+ We simply impose the implied change on the directory and update the
+ link count in the target inode.
+ So:
+ load the inode
+ possibly record REN_SOURCE for later
+
+ calls prepare/pin/commit as appropriate.
+ Put the inode on orphan list if appropriate - needs care
+ as we retarget orphan list.
+ update inode link count.
+
+ (28Feb2011)
+ Just a refresh on the purpose of these updates.
+ 1/ They allow us to fsync a directory without performing a full checkpoint.
+ As directory blocks are not processed in roll-forward we need the update
+ for data to be safe. As fsync of directories is rare in some common
+ situations we could avoid actually writing these. Simply queue them
+ internally and discard them on a checkpoint. If an fsync comes before the
+ checkpoint, only then do we write them out. If there are any cross-directory
+ renames then the preceding updates in both directories need to be flushed
+ before the cross-directory rename. It might be easier to always flush on
+ a cross-directory rename.
+ 2/ They ensure consistency of inode link-count wrt names in the filesystem,
+ but as link count is only updated by these (or a checkpoint) there is no
+ problem with delaying.
+
+ So: when replaying these we must update the directory content and the inode
+ link count.
+ It is OK to delay the write-out of these until an fsync, and not bother
+ if a checkpoint happens.
+ So add that to the TODO list - item 66.
+
+28feb2011
+ - roll forward directory updates ... I wonder if I got it right :-)(untested).
+
+
+ I don't seem to have easy-access notes about the various meanings of
+ 'width' and 'stride'
+
+ width: The number of independent devices across which the (virtual) device
+ is placed. The normal goal is to write 'width' blocks on every single write.
+ On a RAID4/5/6 this will avoid the need to pre-read for parity calculations,
+ and it will keep all devices equally busy with writes.
+ The 'width' blocks probably aren't consecutive.
+
+ There are two different layouts - one with width*stride <= segment_size
+ and one with width*stride > segment_size.
+
+ width*stride <= segment_size
+ This is a traditional striped layout like RAID0/4/5/6.
+ The 'stride' is the chunk size, so 'width*stride' is the stripe size,
+ and segment_size must be a multiple of this.
+ In this case all addresses in a single segment are contiguous. We don't
+ necessarily write them in order if we want to write less than one stripe.
+ segment_offset will normally be a multiple of width*stride though this isn't
+ enforced as one could have a partition with a non-aligned start.
+
+ width*stride > segment_size
+ This implies a concatenated layout. If parity-redundancy is in use then the
+ blocks which combine to form a stripe are 'stride' blocks apart.
+ The benefit of this layout is that an extra drive can be added by simply
+ zeroing it and joining it to the array - no re-stripe needed.
+ This will make all stripes slightly larger so at first the space will not
+ be available. As cleaning happens the space will gradually become
+ available. This still requires restriping, but unlike a normal
+ raid5 restripe, the space becomes available in small amounts immediately.
+ When there is no demand for more space, the re-striping (cleaning) can happen
+ at a very low priority with no cost.
+
+ In this case the blocks in a segment are not contiguous.
+ 'segment_size/width' are, then there is a large gap (in virtual address
+ space) to the next chunk.
+
+ The segment_offset is an amount of space which is free at the start of
+ each device. 0..segment_offset and stride..stride+segment_offset etc
+ do not contain data and can be used for metadata.
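+
+ A sketch of how a (segment, block-in-segment) pair could map to a virtual address
+ in the two layouts. struct toy_dev and toy_virt_addr are hypothetical names, and the
+ block ordering within a concatenated segment (fill one device's run, then the next)
+ is only one reading of the description above, not taken from the code:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-device geometry; field names follow the notes. */
struct toy_dev {
	uint32_t width;		/* number of independent devices */
	uint32_t stride;	/* chunk size, or per-device span when concatenated */
	uint32_t segment_size;	/* blocks per segment */
	uint32_t segment_offset;/* blocks left free at the start */
};

uint64_t toy_virt_addr(const struct toy_dev *dv, uint32_t seg, uint32_t blk)
{
	uint64_t span = (uint64_t)dv->width * dv->stride;

	if (span <= dv->segment_size)
		/* Striped layout: each segment is one contiguous run. */
		return dv->segment_offset +
			(uint64_t)seg * dv->segment_size + blk;

	/* Concatenated layout: segment_size/width contiguous blocks on
	 * each device, with the devices 'stride' blocks apart. */
	uint32_t rows = dv->segment_size / dv->width;
	uint32_t d = blk / rows;	/* which device's run */
	uint32_t r = blk % rows;	/* position within that run */

	return (uint64_t)d * dv->stride + dv->segment_offset +
		(uint64_t)seg * rows + r;
}
```

+ With width=4, stride=1000, segment_size=40 and segment_offset=8, segment 2 occupies
+ 28..37, 1028..1037, 2028..2037 and 3028..3037 under this reading.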
+
+ When width > 1 it makes sense to replicate each state block across
+ every device - as we want to write the whole stripe anyway.
+ For now we only write and read the first two copies at the beginning, and
+ the last two at the end...
+
+ Question: what do we want to do about metadata on flash devices? We really
+ don't want a small number of locations to store the metadata, but a large
+ number that we search through - possibly a binary search.
+ These could be all at start/end or scattered throughout the device.
+ The latter would make it impossible to find efficiently - there is no way to
+ create useful linkage without writing something else at start or end.
+ As many devices optimise for random writes where the FAT table would be,
+ it makes sense to just put the metadata there and not at the end.
+ We should allow one 'page' for each metadatum, which probably means
+ 32K.
+ So we should allow all state blocks to be near the start.
+
+01mar2011 - Autumn arrives.
+
+ Time to add handling of 'atime' and non-logged files.
+
+ The idea is to have a separate file for storing only 'atime'
+ This is separate from the inode file because the volatility of the data
+ is very different and one of the principles of log-structured-fs is that
+ differently volatile data should be kept separate.
+
+ This does mean that an inode lookup requires getting data from two files,
+ but it is hoped that the 'atime' file will mostly be in cache as each
+ block contains the atime for lots of different inodes.
+
+ The atime file contains 2 bytes for each inode, so with a block size of 4K,
+ each block would hold info for 2048 inodes. 1 million inodes would require
+ 2 megabytes.
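+
+ The sizing above as a small sketch. atime_locate is a made-up helper and it assumes
+ entries are indexed directly by inode number, which the notes imply but don't state:

```c
#include <assert.h>
#include <stdint.h>

/* Where the 2-byte atime entry for an inode lives (hypothetical helper). */
struct atime_pos {
	uint64_t block;		/* block number within the atime file */
	uint32_t offset;	/* byte offset within that block */
};

struct atime_pos atime_locate(uint64_t inum, uint32_t blocksize)
{
	uint32_t per_block = blocksize / 2;	/* 2 bytes per inode */
	struct atime_pos pos;

	pos.block = inum / per_block;
	pos.offset = (uint32_t)(inum % per_block) * 2;
	return pos;
}
```

+ With 4K blocks the millionth inode (inum 999999) lands in block 488, so the file
+ is 489 blocks, i.e. about 2 megabytes.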
+
+ The 16bits are treated as a positive floating point number which
+ gets added to the atime stored in the inode. The lower 5 bits are
+ the exponent, the remaining 11 bits are mantissa. Though there is a
+ little complexity in interpreting the exponent.
+ If the exponent is 0, the mantissa is used as milliseconds -
+ so shift right 5 to extract it and multiply by 1000000 for nanoseconds.
+ The smallest change that can be recorded is 1 millisecond,
+ and values up to (2^11-1) milliseconds - or 2 seconds - can be stored.
+ If the exponent is 1 to 10, the mantissa has a '1' appended as a
+ new msb, and is shifted by the exponent-1 and then treated as milliseconds.
+ This ranges up to 2^(12+9) milliseconds or about 35 minutes, where
+ the granularity will be 2^9 millisecs or 0.5 seconds
+
+ For exponents from 11 up to 31 we add the 1 msb and treat
+ the number as seconds after shifting (e-11). So at e==31,
+ we shift a number that is
+ up to 4095 by 20 to get nearly 2^32 seconds or 136 years.
+ At this point the granularity is 2^20 seconds or 12 days.
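+
+ A minimal userspace sketch of the decoding rules above. atime_decode_ns is a
+ made-up name, and it assumes the mantissa sits in the upper 11 bits, so it is
+ pulled out with a right shift of 5:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: expand a 16-bit atime delta into nanoseconds. */
uint64_t atime_decode_ns(uint16_t v)
{
	unsigned int e = v & 0x1f;	/* low 5 bits: exponent */
	uint64_t m = v >> 5;		/* high 11 bits: mantissa */

	if (e == 0)			/* plain milliseconds, 0..2047 */
		return m * 1000000ULL;

	m |= 0x800;			/* add the implied '1' as a new msb */
	if (e <= 10)			/* milliseconds, shifted by e-1 */
		return (m << (e - 1)) * 1000000ULL;

	/* e is 11..31: seconds, shifted by e-11 */
	return (m << (e - 11)) * 1000000000ULL;
}
```

+ The largest value, 0xffff, decodes to 4095<<20 seconds, which still fits in 64 bits
+ of nanoseconds.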
+
+
+ So overall we can update the atime for 136 years without needing to
+ update the inode, and can record differences of 1msec for the first
+ couple of seconds, then gradually less granularity until we are
+ down to one second an hour after the last change, and 4 hours a
+ year later.
+
+ To convert a number of seconds to this format:
+
+ If >= 2048 seconds, we shift down until less than 4096 seconds
+ counting the shift. We add 11 to that number to form exponent,
+ and shift the resulting mantissa up 5, or with exponent, and mask
+ out bit 16.
+
+ Otherwise we convert to milliseconds (divide nano by 1000000 and
+ multiply seconds by 1000, and add). Then if < 2048, we shift up by
+ 5 leaving a zero exponent and use that.
+
+ Otherwise we shift down until < 4096 counting shifts, add 1 to the
+ shift to form an exponent, and combine with mantissa as above.
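+
+ The three conversion cases above as a sketch. atime_encode is a made-up name, and
+ returning all 1s for a delta too big to represent is an assumption (borrowed from
+ the "store all 1s" fallback mentioned further on), not something these steps specify:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: pack a non-negative atime delta into 16 bits. */
uint16_t atime_encode(uint64_t sec, uint32_t nsec)
{
	uint64_t m;
	unsigned int shift = 0;

	if (sec >= 2048) {
		/* Seconds range: shift down to 12 bits, exponent = shift+11. */
		for (m = sec; m >= 4096; m >>= 1)
			shift++;
		if (shift > 20)
			return 0xffff;	/* assumption: saturate when too large */
		/* Masking with 0x7ff drops the implied '1' (bit 16 once shifted). */
		return (uint16_t)(((m & 0x7ff) << 5) | (shift + 11));
	}

	/* Below 2048 seconds: work in milliseconds. */
	m = sec * 1000 + nsec / 1000000;
	if (m < 2048)
		return (uint16_t)(m << 5);	/* zero exponent */

	for (; m >= 4096; m >>= 1)
		shift++;
	assert(shift <= 9);			/* so exponent stays in 1..10 */
	return (uint16_t)(((m & 0x7ff) << 5) | (shift + 1));
}
```

+ e.g. 3 seconds is 3000 msec, which needs no shift, so it is stored with exponent 1
+ and mantissa 3000-2048 = 952.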
+
+ So that is the format - how do we implement it?
+
+ We don't want to expose to user-space numbers that we cannot store.
+ So any 'utimes' call that updates the inode directly can clear the
+ value in the atime file. Only updates due to accesses go to the atimes
+ file.
+ We define a 'getattr' function which looks at the atime stored in
+ the vfs inode and if it has changed we need to deal with it.
+ - if the inode is still dirty we simply update the lafs inode
+ and use the number as-is, clearing the atimes entry
+ - else we subtract the stored atime from the new atime. If this
+ is negative or exceeds 136 years we mark the inode dirty and
+ store it there. If we cannot mark the inode dirty for some
+ reason we just store all 1s in the atime file.
+
+ The same operation is needed when dirty_inode is called to make
+ sure atime updates get saved even when no getattr is called.
+
+ As we always need to be able to update the atime file, it needs to
+ be permanently pinned whenever an inode is read in. For
+ non-logged files this should be cheap but we must do it anyway as
+ the file might not be non-logged.
+ So we need to keep a permanent reference to each block while the
+ inode is loaded. That can keep it pinned.
+
+
+ We don't want updates to the atime file to be flushed in any great
+ hurry, especially if it is a logged file. We would be quite happy
+ to only write at 'unmount' and probably 'sync'.
+ So we want to stop the pages from appearing dirty in the page
+ cache (PAGECACHE_TAG_DIRTY), and the inode from appearing dirty
+ (I_DIRTY).
+ We can still keep them dirty in lafs metadata so if release_page
+ is called we can schedule a write out then.
+
+
+ So some steps:
+
+ 1/ load atime file at mount time - there is one for each
+ filesystem. It has inum of 3 and type of TypeAccesstime (6).
+ Also release it on unmount.
+
+ 2/ loading an inode must take a ref to the block in the atime file
+ if it exists. A new inode flag records if this has happened.
+ Unless mounted noatime, we pin the block and reserve space.
+
+ 3/ getattr and dirty_inode must resolve any issues with the
+ atime. So lafs_inode probably needs an extra field to be able
+ to check for changes
+
+
+
+ Hmm.. this is getting confusing...
+ When atime is changed the only way we find out is by ->dirty_inode
+ being called. But that is called when anything is changed.
+ Filtering out whether or not we need to update the inode itself
+ is awkward... maybe there is some context we can use.
+ ->dirty_inode is called by mark_inode_dirty which is called:
+ - by touch_atime, if something changed
+ - file_update_time - at which time we also update iversion
+ - setattr ... which has changed recently (2.6.37ish)
+ - page_symlink
+ - generic_file_direct_write - which increases size of inode
+ - set_page_dirty_nobuffers
+
+ So either the inode is pinned, or it isn't.
+ If it isn't, then this *must* be an atime-only update.
+ If it is, then it could be anything, but in any case we update the
+ atime directly.
+ So: dirty_inode should try to get dblock and check if it is pinned.
+ If it is pinned, then update the atime immediately and the offset
+ in the atime file too.
+ If not, just update the offset
+
+
+03mar2011
+ ARGggg... checkpin is interfering with unmount - it keeps an
+ s_active count so unmount 'works' but doesn't release anything.
+
+ checkpin is needed to ensure that inodes remain safe while
+ we are cleaning. Particularly, while the inode index block is
+ pinned, we keep the inode and fs referenced as well. I guess the
+ theory is that they won't stay pinned for long - but they do.
+ e.g. segusage blocks are permanently pinned.
+
+
+ We could have a rule about the prime filesystem always being mounted.
+ Then we don't need refcounts, but kill off the cleaner before
+ unmount... which we sort-of do..
+
+ All subordinate filesystems have references on the prime_sb so the
+ prime_sb must be the last one to go. When it goes it kills
+ everything off...
+ So we don't need checkpin to take a ref on the prime_sb.
+
+ There might be still an issue with files in subset filesystems
+ being permanently pinned so they stay around longer than they
+ should... need to check on that somehow.
+ The idea is that a quota file block is permanently pinned so it
+ will keep the fs pinned. That in turn will keep everything else
+ pinned... Worry about that when we implement quotas FIXME
+
+04mar2011
+ I really need to sort this out, and it isn't easy...
+ We really want to know when "all" filesystems have been unmounted
+ so the block device(s) can be released and the cleaner stopped.
+ But we don't have a count for that. We could if that was all
+ we counted - but that would mean that we only have a single
+ struct super_block for all filesystems.
+
+ So that is what I have to do. A single super_block for all parts
+ of the filesystem. I probably still need to allocate other
+ dev numbers for stat->dev, but I don't need to use them internally.
+ Maybe I even allocate superblocks... Yes - we need to use
+ set_anon_super and kill_anon_super to allocate the numbers.
+ lafs_inode will need a pointer to the filesystem - we use that
+ instead of the sb.
+
+ -------
+
+ Testing...
+ bug at block.c:658. Block not B_Valid in lafs_dirty_iblock from
+ lafs_allocate_block from cluster_flush.
+ Block is 74/0: InoIdx block of a newly created file I think.
+ '74' was /f23, then /mnt/1/adir. We are creating file in that
+ dir.
+ This is a depth=0 InoIdx block - i.e. the data is in the
+ dblock, so there is no index info, so it kind-a makes sense for the
+ index block to not be Valid.
+ yes- commit d268a566605bf006cf33c confirms that.
+
+ So why are we trying to dirty it?..
+
+ Maybe:
+ We create a couple of directory entries, then flush and end up
+ with an in-line data block.
+ Then we add more, flush again and so try to dirty parent...
+ Where do we turn depth=0 inodes to depth=1??
+ - erase_dblock_locked - don't want that
+ - lafs_incorporate
+ So I guess the 'bug' is in error - it is OK to mark that invalid
+ block as dirty.
+
+04mar2011
+ So - back to the super_block reworking. We want only one
+ superblock.
+ So we use the TypeInodeFile inodes a bit more to hold the details
+ of different filesystems. We need to store a unique 'dev' number in
+ there: use set_anon_super/kill_anon_super on a local 'struct
+ super_block' and copy s_dev in/out.
+
+ As we only have one sb, we can only have one fstype, so we cannot
+ use the fstype to choose what to do.
+ - if dev_name is a block device we try a normal mount
+ - if dev_name is an Inode file, we perform a subset mount
+ - if dev_name is a lafs dir and '-o snapshot=name', we mount that
+ snapshot
+ - if dev_name is a lafs dir in root with perm zero and
+ '-o subset=MAXSIZE', create a subset filesystem.
+
+ - lafs_iget needs an inode rather than a superblock
+ ditto for lafs_new_inode, lafs_inode_inuse, inode_map_free,
+ choose_free_inum, inode_map_new_prepare
+ - lafs_iput_fs,lafs_igrab_fs, ino_from_sb
+
+ - NFS filehandles need careful thought
+ They are 'per-super-block', not 'per-vfsmnt' which might be
+ better.
+ We could change that but.....
+ For non-snapshot files it is easy - just record two inodes, the
+ fs and the target.
+ For snapshots there is nothing that is really stable.
+ Maybe we could have different superblocks for snapshots.
+ The snapshot doesn't need the cleaner as it is read-only, though
+ the cleaner can need the snapshot...
+
+ So the cleaner might automagically mount a snapshot, but a
+ snapshot will never invoke the cleaner or any other thread stuff.
+
+ So I guess we want one superblock for the fs and one for each
+ snapshot.
+ The filehandle is then either inum+gen or inum+inum+gen where first
+ inum must be TypeInodeFile
+
+07mar2011
+ ... though I could just put a snapshot number and partial timestamp
+ in..
+
+
+08mar2011
+ This isn't a new to-do list, it is a list of the main features that are
+ still not implemented:
+ - full 2D layout
+ + at very least I don't pad with zeros yet
+ + if stripe size were multiple of 3*3*5*7*2^N, then changing
+ width might be manageable.
+ e.g. stripe size: 40320 blocks.. But with megabyte chunksizes,
+ we really want 32bit segsizes and 322560 block segments.
+ - non-logged files - with interface to request access-time file
+ - quotas
+ - snapshots: particularly cleaning
+ - error handling
+ - metadata (inode/directory/etc) CRCs and duplication
+ - fsck / debugfs
+
+
+ What would fsck do?
+ - locate and validate device and state blocks.
+ - locate and validate checkpoint cluster.
+ - locate and validate filesystem root
+ - roll forward to collect segusage and quota blocks.
+ - load inode map, read inode file, validate each inode and make sure
+ map is correct.
+ - explore each file, following all indexing, count segusage for each
+ segment and make sure segusage file is consistent.
+ - check no block is allocated twice. This might require multiple passes,
+ each time we examine a different collection of segments.
+
+ - checking a file requires:
+ - checking inode is consistent
+ - checking index blocks are consistent with depth
+ - checking index/extent blocks are sorted with no overlaps
+ - checking block/iblock counts are correct.
+ - checking all cluster headers in the current segment to ensure they
+ look consistent and agree with file information. i.e. if cluster_header
+ identifies a block, the block must live there, or later in the segment.
+
+ - scan all directories looking for consistency of hash etc. Count links
+ for all inodes. This might need to be multi-pass too.
+ Could use a bitmap for single-link files, and table for others.
+
+ How to fix errors.
+ - First must find segments which are not in use according to segusage file
+ or according to block search.
+ If there are none, require a new device be provided.
+ - If anything looks incorrect, write corrected version to new segment
+ Then write out new segusage files
+
+ In some cases we might need to search all write-clusters for missing blocks??
+ That could take a very long time!
+
+
+ What do I really want to do about CRCs and hashes.
+ It might be nice to store a hash for each block in the index block.
+ But that wastes precious index-block space.
+ If I store a CRC together with address info in the block, then I could
+ be fairly sure it is the right block. So e.g. inodes store the inode number,
+ Index blocks could hold inode+depth+address.
+ Last 8 bytes of each block could be a 4byte CRC and a 4byte identity.
+ identity is XOR of fsinum inum blocknum generation - or a CRC of these.
+
+ Actually, we don't need to store the identity info - we just need to
+ include it in the CRC. That either saves space, or allows more bits to
+ be used for the CRC, which is probably the best use of bits for detecting
+ errors.
+ Though it might be nice to store phys-addr in the CRC too, we cannot as
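+
+ A sketch of folding the identity into the CRC instead of storing it. The toy_*
+ names are made up, and the choice of the common reflected CRC-32 polynomial
+ (0xEDB88320) and little-endian packing of the identity are assumptions; the notes
+ don't pick a CRC:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-32 (reflected, polynomial 0xEDB88320); start with crc = 0
 * and feed the previous result back in to continue over more data. */
uint32_t toy_crc32_update(uint32_t crc, const void *buf, size_t len)
{
	const unsigned char *p = buf;

	crc = ~crc;
	while (len--) {
		crc ^= *p++;
		for (int i = 0; i < 8; i++)
			crc = (crc >> 1) ^ (0xEDB88320u & -(crc & 1));
	}
	return ~crc;
}

/* CRC over the block contents followed by its (unstored) identity. */
uint32_t toy_block_crc(const void *data, size_t len, uint32_t fsinum,
		       uint32_t inum, uint32_t blocknum, uint32_t generation)
{
	uint32_t id[4] = { fsinum, inum, blocknum, generation };
	unsigned char bytes[16];
	uint32_t crc;

	/* Pack the identity little-endian so the result is host independent. */
	for (int i = 0; i < 4; i++)
		for (int j = 0; j < 4; j++)
			bytes[i * 4 + j] = (unsigned char)(id[i] >> (8 * j));

	crc = toy_crc32_update(0, data, len);
	return toy_crc32_update(crc, bytes, sizeof(bytes));
}
```

+ A block read back from the wrong place then fails the check even if its contents
+ are intact, because the reader folds in the identity it expected.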
+
+21mar2011
+ My short-term todo list is:
+DONE - get 'lafs' to the stage where I can create an fs requiring roll-forward
+DONE - use 'lafs' to create images for testing, so I don't need 'fred.safe' any more.
+DONE - Make lots of 'layout' changes - see 15cb
+
+02may2011
+ - 'run' goes to completion, but segusage isn't updated in the final cluster
+ and the number left over from before looks wrong.
+DONE - 'ls -l' on a subset file gets confused.
+ - fs created by 'lafs' has wrong Blocks and Inodes counts
+ - we lose a ref to a segsum and sometimes put it too often.
+REFCNT 1 [ce0ffc48]0/182(2535)r0E:Valid,Claimed,PhysValid NP
+REFCNT 1 [ce055b9c]0/187(2535)r0E:Valid,Claimed,PhysValid NP
+REFCNT 1 [ce0445d8]0/182(2535)r0E:Valid,Claimed,PhysValid NP
+
+
+03may2011
+ Once I have these bugs sorted out I want to make some format changes.
+
+ DONE - fs_metadata need a 'parent' link
+ rename needs to be careful about what is updated!
+ so does roll_mini
+ lafs_get_parent needs some thought.
+
+ DONE - roll-forward should get exact mtime stamps, and ctime.
+ So each data block must have an exact timestamp
+ of when the change actually happened. Or the group_head
+ has a timestamp for the most recent update to the file
+ As we use nanosecond timestamps (pointless though they are)
+ we need 30 bits for the nanoseconds and at least 11 for the seconds.
+ So 48 bits (6 bytes) is plenty.
+ So include a 64bit timestamp in the cluster_head and 48bit
+ number to subtract in the group_head
+ But saving 2 bytes per file isn't really worth it, and we may
+ well lose it in padding. So just store a 64bit timestamp in
+ the group_head.
+
+ DONE - use CRC in place of all checksums - lafs_calc_cluster_csum
+
+ DONE - state block flags for inconsistencies found
+ If any inconsistency found, fsck is advised.
+ For some it may be imperative.
+ Things that can be wrong include:
+ - generic read error
+ - segusage negative
+ - index block incoherent
+ - dir block incoherent
+ - link count negative
+ - cluster header incoherent
+ -
+ 64 bits should be adequate and simple for this.
+ Any unknown bit requires a full fsck.
+
+ DONE - 32bit segment size
+ With 16bit at 4K blocks we are limited to 256Meg segments.
+ 64Meg with 1k blocks. This takes about 1 second to write on
+ a modern drive. On an array it will take even less time.
+ 24bits gives 16 to 64 gigabytes which is plenty.
+ However 24bits is awkward to access. a 1K block holds 341 1/3.
+ A 4K block holds 1365 1/3.
+ But this wastes less space than 256 or 1024 and so causes less IO.
+ But then we probably want to size segments to be very big.
+ A few thousand segments should be OK, which is tens of blocks.
+ I don't think the savings with 24bits are worth it, and I do
+ think v.big segments could be useful, so lets go with 32bit segments.
+
+ Youth is currently tuned to 16bits. Let's leave it there and
+ maybe waste some space.
+
+
+ - parallel new-data write clusters.
+ I think it is sufficient to include a second 'next_addr' in the
+ cluster_head - or maybe two. alt_next_addr[2].
+ When a thread wants to start a new stream of clusters it allocates
+ the segments then attaches to the next outgoing write cluster.
+ Once that is written everything in the new cluster is safe.
+ On a checkpoint every stream writes at least one checkpoint cluster
+ and these are linked together through alt_next_addr.
+ The 'next' cluster for each must be the checkpoint cluster and must
+ carry linkage but unlike with first-link, there is no need to wait.
+ The data is already safe as long as the state block isn't updated
+ until every cluster_end block is written.
+ So really, one is enough. I had thought 2 would enable quick fan-out
+ but there is no real need for that.
+
+ As 0 is a valid write-cluster address we use 'this_address' to signify
+ that there is no alt-next.
+
+ It is possible that a block of a file could be written to two
+ different streams at different points in time between two checkpoints.
+ We need to ensure that roll-forward gets these in the right order.
+ 'seq' can be the same in two different streams so we cannot use that.
+ timestamp could possibly be used, but as times can go backwards it
+ is not ideal.
+
+ NEW IDEA. Just use one stream of clusters. However it can
+ bounce from one device to another easily. So two different
+ threads can be building up two different write clusters at the
+ same time as long as they synchronise at some point to pass
+ addresses around. They also need some other Verify mode as
+ VerifyNext or VerifyNext2 will destroy any parallelism.
+ As the point of this is to write to multiple devices in
+ parallel, maybe VerifyDevNext{,2} meaning the next header on
+ the same device serves to verify this.
+
+ - policies.
+ This includes
+ maximum number of segments written between checkpoints
+ whether data can be cleaned to a particular device
+ whether a device can receive new data
+ whether metadata duplication is needed
+ whether an RO device from a different array is allowed.
+ Some of these are per-device policies. Some are per-array.
+
+ The 'RO Device' thing is special. I think I want an alt_uuid.
+ It works like this: You assemble the RO array when you
+ mount a new filesystem identifying the old as a component.
+ So that 'state' block on the new devices must identify the alt_uuid
+ and state seq number.
+
+ Do we want to record more info about which devices are in the
+ array? Currently we just record how many. If we find enough
+ with the right UUID/seq, they must be it.. what else would we
+ want?
+
+ For all the other policy statements it is probably simplest to
+ allow a set of simple strings. e.g. "noclean", "nonew",
+ "dup=2" "maxseg=5"
+ devblock currently uses 146 bytes, so room for 878
+ stateblock uses 112 plus some for snapshots, so much the same.
+ We currently don't use 'version' and have no concrete plans.
+ The vague idea is to allow lafs to *know* that it cannot mount
+ the array, so any incompatible feature gets set.
+ We could keep those in the policy sets. From that perspective
+ there are 3 types of things.
+ - if you don't understand, don't worry
+ - if you don't understand, don't try to write
+ - if you don't understand, you cannot even read.
+
+ That last is really best avoided. We have version info
+ elsewhere in the tree so that a new index style will simply
+ make that block unreadable.
+ So I think make the dev and state blocks a simple incrementing
+ version number which apply to that block, and have "don't
+ worry" and "don't write" policies distinguished by first
+ letter.
+ Capital is "If you don't understand, don't write"
+ Lower is "if you don't understand, don't worry".
+
+ These are space separated strings
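+
+ A sketch of how a reader might apply the first-letter rule to the policy strings.
+ toy_policy_allows_write and the 'known' list are hypothetical, and treating the
+ part before '=' as the policy name is an assumption based on the "dup=2" example:

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Returns 1 if writing is allowed, 0 if some policy we don't understand
 * starts with a capital ("if you don't understand, don't write"). */
int toy_policy_allows_write(const char *policies,
			    const char *const known[], size_t nknown)
{
	const char *p = policies;

	while (*p) {
		size_t len, namelen, i;
		int understood = 0;

		while (*p == ' ')		/* skip separators */
			p++;
		if (!*p)
			break;
		len = strcspn(p, " ");		/* whole token */
		namelen = strcspn(p, " =");	/* name part, before any '=' */

		for (i = 0; i < nknown; i++)
			if (strlen(known[i]) == namelen &&
			    strncmp(known[i], p, namelen) == 0)
				understood = 1;

		/* Unknown lower-case policies are "don't worry". */
		if (!understood && isupper((unsigned char)*p))
			return 0;
		p += len;
	}
	return 1;
}
```

+ So "noclean somethingnew" can still be written, but "noclean Somethingnew=3" can
+ only be mounted read-only by code that doesn't know Somethingnew.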
+
+ - etc.
+
+ - what about i_version? Include in timestamp?