So, let's try to write a kernel module that implements this filesystem.

It would be good to have a plan.

 - Mount filesystem, providing empty root directory
     o parse mount options - DONE
     o find/load superblocks and stateblocks - DONE
     o present empty directory - DONE
     o Compile external module - DONE
     o test  DONE

 - Mount filesystem read-only with no roll-forward
     o IO address mapping
          sync_page_io or bread? - not bread I think
     o Index blocks management
     o search cluster-header for root inode
     o file read
     o Directory lookup/read
     o test

 - Support roll-forward for blocks, orphans, whatever
     o manage segusage files
     o manage quota files

 - Support writing
     o inode bitmap
     o cluster creation / block sorting

 - Support Cleaning

 - Interface for snapshots and other admin

------------------------
FIXME  If a device is removed from the filesystem, we cannot reliably
  tell from the other devices or state that this is so.
  Maybe we need to update all devblocks with a new 'seq' number...

FIXME  How do we specify mounting subordinate filesets?  What superblock
  do they have?  I suspect we do a -F lafs-sub mount from the original
  filesystem.

FIXME  If mount fails, we seem to be leaving a super lying around, and
  sync_supers dies on it. - DONE

FIXME  Umount appears to work, but a sync_supers dies. - DONE

FIXME  subordinate supers aren't being locked as much - is that a problem?

FIXME  index pages never get put on an LRU - how is this supposed to work?

--------------------------

Thoughts:
  Inodes live in an address-space, much like a file.
  To load the first inode, we need an address-space, so may as well
  have a 'struct inode' as we may want to expose it to user-space.

  Loading an inode, need
     fs  (lafs filesystem structure)
     which subfs (maybe a lafs inode)
     which snapshot - this is implied by the subfs inode.
       and fs can be obtained from inode, so just inode, inum


UPTO

03nov2005
  review block_leaf_find and make_iblock
  need to do setparent and block_adopt next

10nov2005
  need to resolve locking for ->siblings list

24nov2005
  peer_find
  lock_phase
  lafs_refile

  I can read a file.....!!!!!

  Code review / tidy up.
     resolve locking
     buffer vs page

  Export on a web page somewhere??

16feb2006
  (I spent a while getting large-directories to work again in
   prototype.. and some holidays).

  - Priority: clean mount and unmount
  - large directories
  - multiple devices.

  FIXME how do we record and handle write errors???

  The iput in lafs_release - which is needed - is oopsing at iput+0xe!

23feb2006
  Ok, I finally have a clean mount/unmount.
  .. not quite.  Blocks being freed at unmount still have a refcnt,
  which is bad.

  Next:
   - make sure we can handle 'large' directories.
   - make sure we can handle files with indexes
   - handle filesystems that span devices.

02mar2006
  Hurray - clean unmounts!!!
  There is a nasty circular reference of the root inode which is stored
  in a block that it manages.  Maybe this should not happen, rather than
  having to be explicitly broken - the root-block can live elsewhere,
  not in the inode.

  Next: multi-level index blocks.
  But first, need to understand memory pressure and pageout.
    How are dirty pages found to be cleaned?
    How is pressure put on a filesystem to clean up?
    How are clean pages reaped?

   - call pagevec_lru_add{,_active}(pvec) to put the page on an LRU
     lru_cache_add{,_active}(page) might be easier, but isn't exported.
   - call mark_page_accessed(page) to keep the page 'active'.

09mar2006
  - make sure indexes work...
      lafs_load_block+0xf  eax,bx,cx,dx,s1 all zero
      from block_leaf_find 203
    ... OK, indexes seem to work.  But 'lafs' has problems creating
    some large files.  Try 'tt'
    This is due to not handling errors properly.. fix it later  FIXME

16mar2006
  Must make sure the index address-space gets cleared up...
  I wonder how we find all the pages to free.
  This might be one reason to keep them in a radix tree.
  Though we should be able to walk our own data structures.

  Then work on mounting a 2-device filesystem.

  FIXME dir_next_ent always starts from the beginning rather than
   remembering where it is up to... can this be fixed??

18mar2006 (Wedding anniversary, and Saturday ... during commonwealth games)
  Mounting a snapshot needs a way to identify that it is a snapshot
  mount and which snapshot, and which filesystem.
  We could use a different filesystem type, but that isn't really needed

     mount -t lafs -o snapshot=name /original/mount/point /new

  This grabs the named snapshot of /original/mount/point and places it
  at /new.  The 'snapshot=' option is the trigger.

  For a control FS, we

     mount -t lafs -o control /original/mount/point /new

  To grow a filesystem, we initialise a device (super/state blocks) and

     mount -t lafs -o remount,new=/dev/name whatever /original/mount/point

  as the dev_name isn't passed to remount

  So, mount options are:
     snapshot=name
     dev=/dev/device
     new=/dev/device
     control
  and various name=value pairs matching what is exposed in the
  control filesystem

23mar2006
  - factored out super-block finding preparatory to finding snapshots.

  Thoughts:  superblocks for snapshots and sub-ordinate filesystems do
   not get stored in the 'state'.  There is, however, a usage count so
   that the prime filesystem cannot be unmounted until all snaps and
   subs are gone.  This should just refcount the prime_sb I suspect.
   So: a snapshot sb points to the 'struct fs' but doesn't .... what???

30mar2006
  - remove the super-block finding code by changing the layout to
    store superblock locations explicitly :-)
  - teach 'mount' to mount snapshots.
  - need to audit for bad use of ss[0]
  - need to find better way to map 'sb' to snapshot number.
  - need to make unmount work.

01apr2006 (no, really!!)
  - rewrite index to kmalloc index blocks and use a shrinker to free
    them.  This means that indexblock no longer has a 'page', which
    makes sense.
    It also means they cannot live in highmem, which is sad, but
    could be fixed.

  Notes: superblocks and refcounts.
   Each device holding the filesystem gets a superblock.  One of these
   (arbitrarily) is the 'prime' superblock and gets to manage the whole
   filesystem.
   Each snapshot also gets a superblock, as does each subordinate
   filesystem.  These are anon sbs - using anon dev.
   Each anon sb takes a reference to the 'struct fs', and also to the
   prime sb.... how about the reference relationship between fs and
   prime_sb???  Need to ponder this.

  - problem with getting parent superblock due to semaphores...
  - when unmount, put_super isn't being called, so inode 0 isn't released!

13apr2006 (Took a week off to play with rt2500 wireless cards)
  - Use different filesystem type for snapshots and subordinate
    filesystems.  This removes the semaphore problem
     + OK, mount and unmount works for snapshots... what next?

  - review index block - worry about highmem?
  - review ss[0] usage - OK
  - general code review

  FIXME - what should leaf_lookup/index_lookup return on format error?
    They currently return '0' which will quietly make an empty block.
    Maybe '-1' would be better to make an error block.

  FIXME check how other filesystems lock the setting of PagePrivate
    Maybe just need to lock_page

  FIXME combine find/load/wait into one operation

  Review dir, super, roll, link

  FIXME module refcount increases on failed mount!

18may2006
  I've been sick for too long, and not much has happened...
  However I think more than the above comment says.

  I started looking at roll-forward and have the basic block parsing
  in place so that it reports what it sees in the roll.
  Also, the format has been changed a little: the address in the state
  block is the CheckpointStart cluster, and we simply roll forward to
  the CheckpointEnd, and then keep going beyond there - there is no
  longer any walking back to find the start.

  Next step is to start incorporating rolled elements into the
  filesystem
   - data blocks: shouldn't be too hard.
       Don't need to update the index pages just yet
   - inode updates: should be straightforward enough, but care is
       needed as the data might be in multiple places
   - directory updates: these are probably most interesting..

  Question:  how are symlinks created?
   Currently we:
       log the inode creation
       commit the new inode
       log the directory update.
   This allows the 'value' stored in the inode to appear after the
   directory update.  That might be OK for files (which are created
   empty and then extended) but is bad for symlinks (which are created
   atomically).
   So, options include:
     - ensure inode is in a previous cluster to directory updates.
       This slows things down too much I think
     - log the content as well.  This is awkward if it is big,
       certainly if more than a block, which is possible.
     - directory updates could be dependent on the inode being valid.
       This is ugly.
     - log content if it is small, else write inode, flush, then
       create link.

   So the fast option is:
       log inode create, log content, log filename
   and the slow/safe option is
       log inode create, sync file, log filename

   So on roll-forward if we see the inode we just save the data.
   Saving the whole inode seems attractive, but we want minimal order
   dependence: an inode update in the same cluster as the new inode
   should still over-ride, even though it is earlier.

  Ok, rollforward is proceeding slowly.  I think I am now incorporating
  new blocks into the tree properly, though the code probably won't
  compile.  It will be nice to test this and see the file have the
  right data.
  Next step would be to include the index incorporation code.
  Then
     - directory updates
     - segusage summary
     - quota
     - stuff..

08jun2006
  - what exactly should happen when rollforward finds a file with a
    linkcount of 0?  Currently all updates get lost - I wonder if they
    are lost safely?
  - rollforward is getting the size right, but not the content - do I
    need to flag a block that ->phys is valid?
    : Ok, roll-forward picks up new blocks in a file OK, but umount
      has stopped working.
      Presumably because there are pages attached to the inode which
      aren't getting released.
      What do we want to do here?  Normally those pages, or their
      addresses, need to be recorded before they are lost.  But on a
      read-only mount we don't care so much.

22jun2006
  continuing above thought..
  When we roll-forward and pick up the pieces of a file, we don't want
  to allocate pages to hold those pieces (and definitely don't want to
  read them all).  We just want to attach the addresses to the parent
  for incorporation.
  Similarly after writing dirty blocks in a file we want to be able to
  release them immediately rather than waiting for the addresses to be
  incorporated (as incorporation can be more efficient when delayed).

  We could just allow the page associated with a block to be released,
  except that the page provides the indexing to find a block.  We might
  be able to live without the indexing, and hunt down the indexblock
  tree, but living without the mutual-exclusion provided by block
  indexing would be more awkward.  And the 'struct datablock' still
  contains a lot more than is needed.

  So maybe we should just have a completely separate structure attached
  to the indexblock which lists fileaddr/physaddr.  This could include
  extent information.
  The trick would be guaranteeing allocation.  We could either
  allocate-late with a fallback of attaching the 'struct block' or
  performing an immediate incorporation, or allocate-early and block
  the dirtying of a page until there is space to record the new
  address.  This last is bound to be easiest.

  So: what exactly do we use to store addresses?  Probably a linked
  list of tables.  Each table contains a link pointer and an array of
      fileaddr/physaddr/extentlen
  But we would need to allocate lots of these if there are hundreds of
  dirty pages, but possibly only end up using a few if they made
  extents very nicely.  That might be wasteful.

  Or we could allocate just one.  When it is full we perform an
  incorporation.
  But if that causes a page split we are in trouble.
  We could have a spare page, split to it, write out one and wait for
  the spare page to be written and free.  But we cannot just release
  the index page as it might still have children.  (I think I've been
  here before).

  A worst-case scenario involves writing one block and that requires
  splitting every index up the tree to the inode.  This requires
  arbitrarily many pages to be allocated.  To accommodate this we
  either pre-allocate a spare page at every level of the tree down to
  the data block (a bit like storage space allocation) which seems
  very wasteful, or we make sure we can release one of the split
  pages, which seems impossible.

  I could decide not to worry about it.  Have a pool of index pages
  and hope it always works.  After all, most pages are data pages, and
  they can be freed successfully.  We would only have a deadlock if
  all dirty memory were index pages, and that seems unbelievably
  unlikely.  If we trigger a checkpoint when the count of locked-pages
  hits some limit we should be safe.

  So:
    Keep one table per index block.
    Use simple append and sequential search.
    When table gets full, force an incorporation

  Do we allocate the table separately, or embed it in the indexblock??
  Probably embed it.  indexblocks that don't need it can be freed at
  any time so that space waste hopefully isn't significant.
  How big?  If the file is written sequentially, then everything
  should gather into extents, and so it doesn't need to be enormous.
  If the file is written randomly then the index block can be expected
  to be 'indirect', so incorporation will be cheap.
  So 'small' seems ok in both cases.  Let's say 8.

  But wait a minute.....
  On a checkpoint we can be getting phys updates for prev and next
  phases.  next-phase updates cannot be incorporated until the
  indexblock has passed on to the next phase.
  So in that case, I think we still keep a linked list of
  unincorporated blocks and live with the fact that we cannot free
  them until the phase change passes.  That shouldn't be a big problem
  as it is a limited time frame - especially for data blocks..

  But does this solve our initial problem??
  During roll-forward we want to keep the addresses but not the
  blocks, and we don't want to force incorporation.  That means an
  arbitrary list of addresses attached to an index block.
  I guess we could possibly allow incorporation, but I would rather
  not as I want the fs to be able to be read-only nicely.
  So that means we need to have a list of address tables.  Maybe the
  normal approach is 'add a table if possible, else incorporate'?

  OUCH... we may write a block a second time before incorporating the
  new address, so when adding an address to the table we need to check
  if it already exists.  That could be expensive.  For index blocks
  might it even be a different address?  I think not but the vague
  possibility (in the future?) does complicate things somewhat.
  Maybe we just keep things in chron order and don't worry about
  duplicates until incorporate time, when we have to sort anyway.

  todo:
    lafs_find_block  DONE
    free_block must free tables  DONE

  Unmounting still doesn't work.  Problem is that an index block is
  holding a reference on parent, and parent references aren't getting
  cleaned up.  On read-only unmount I guess we need to walk the list
  of leafs, discard any address info, and unlock the blocks.
  So that should be the first task for next time.

27jul2006
  Leafs are locked blocks which have no locked children.
  So any locked data block (non-inode) is a leaf.
  Any locked index block with lockcnt[phase] 0 is a leaf.

  OK - fixed numerous bugs, but I can unmount now!!  I can even rmmod
  and insmod and all is cool.
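
  The leaf rule from the 27jul2006 entry, as a minimal user-space
  sketch.  The struct layout, the enum and the name blk_is_leaf are
  made up for illustration and are not the module's definitions.  The
  note does not say what happens for a locked inode data block, so
  returning "not a leaf" for it is an assumption.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified stand-in for the real block structures. */
enum blk_kind { BLK_DATA, BLK_INODE_DATA, BLK_INDEX };

struct blk {
	enum blk_kind kind;
	bool locked;     /* block is locked in some phase */
	int phase;       /* phase the block is locked in: 0 or 1 */
	int lockcnt[2];  /* index blocks only: locked children per phase */
};

/* A leaf is a locked block which has no locked children. */
static bool blk_is_leaf(const struct blk *b)
{
	if (!b->locked)
		return false;

	switch (b->kind) {
	case BLK_DATA:
		/* a plain data block has no children at all */
		return true;
	case BLK_INDEX:
		assert(b->phase == 0 || b->phase == 1);
		return b->lockcnt[b->phase] == 0;
	default:
		/* inode data blocks: rule not given in the notes (assumed) */
		return false;
	}
}
```

  A read-only unmount could then walk the locked blocks, pick those
  for which blk_is_leaf() holds, drop their address info and unlock
  them, which in turn turns their parents into leafs.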
  TODO:
   - review refile and get all the code in there from prototype
        DONE (I hope)
   - write a combined find/load/wait function and use it   DONE
   - allocate inodes in single memcache and avoid generic_ip
        HALF DONE. (still using kmalloc, not doing initonce well)
   - review recording of new block addresses
       + make sure we lookup there on index lookup - YES
       + make sure ->uninc_next gets transferred to table at phase
         change.
       + write incorporation code as it is tricky
   - review how directory updates can be incorporated into a RO
     filesystem.
       No, they cannot.  We need to update the directory.
   - write directory update code
   - write cluster construction code
   - make sure indexblocks with unincorporated addresses get on to
     inc_pending ??  or is locking them enough?

  INCORPORATION - ARgggghhhhh.
   The current uninc_table doesn't really lend itself to building
   index blocks... though maybe....

   Question: what happens when an index block disappears?  i.e. it
     has no addresses in it?  We clearly need to remove it from the
     parent.  This should be trivial, a direct operation on the parent
     index block.  e.g. set some number to 0.  Then the next
     incorporation pass will simply lose that entry.

   OK, that might be all well and good, but how do we sort
   unincorporated addresses so we can merge them?
   A linked-list merge sort is nice and open-ended, but does waste
   quite a bit of space in pointers.
   Or maybe I should just always do small-table incorporations.
   Is there a way that a bad ordering of writes could force very bad
   index layout in this case?  i.e. cause a table split every time,
   but new blocks go in the first (full) table.

   OK Decision: always do small-table incorporation.  i.e. not a list
   of blocks: just a table of addresses.

  FIXME check validity of index type when it is first read in, and
   reject early if it cannot be recognised.

24aug2006
  Took a break from incorporation.  Looking at directories.
  Wrote dir.doc in module to sum lots of stuff up.

  Issue:  dir blocks have an info structure attached.
   This included a counted reference to the parent.
   How long does this need to hang around for??
    - when there is any orphan issue happening, it must stay, via the
      'pinned' flag.
    - when actually performing a dir op, we need to create and
      maintain this info.
   When last ref of a dir block is dropped, should drop the parent
   reference.

  Status: free list management mostly done.
  Next:
     create/delete prepare/commit/abort
     orphan handling
     dirty_block
     lock_block

  FIXME should dir_new_block zero out the block?  How will
    commit_create know what to do with this block?

  NOTE another type of directory orphan is a free leaf block which is
    on the part-free list.

-------------------------------------------------------------
09sep2006 - on the plane to Frankfurt
  Don't tell me I am rethinking preallocation again ???

  TODO
    dirty_inode needs to record the phase it is dirty in
    inode_fillblock needs to check current phase and act accordingly.
    we inode.doc
    Make sure the B_Orphan flag is set and used - or discard it.

  How do we commit creating a symlink?  If it is a full block in size
  we cannot make an update record.
   - maybe have two update records?  We cannot guarantee they are in
     the same cluster.
     ... but if we put the 'make dir entry' last it should work.

  Change 'struct descriptor' definition
    the 'block_type' aka 'length' 16-bit field becomes
       0x0000 -> 0x8000  ->  datablock, possibly a hole - upto 32K.
       0x8001 -> 0xc000  ->  miniblock upto 16K+
       0xffff            ->  index block.

  Need to write IO routines which decrease pending-block-count in 'wc'.

  Thinks.
   a 1TB filesystem with 1K blocks and 4096 blocks/seg gives 4Meg
   segments.  That would be 256K segments which at 2 bytes per segment
   - 512 segments per block - is 512 blocks in each seg usage file

12oct2006
  Need to write
   - lafs_lock_{d,}block   DONE
       Make sure the block has parents and allocation and set the
       locked flag and the phase.
   - lafs_flush
       Given a datablock, wait for it to be written out.
       This is needed before updating a block that is still locked in
       the previous phase.
   - lafs_inode_init
       Used when creating a new object/inode.
       Given a datablock which is to hold the inode and a type (Type*)
       and a mode, fill in the data block with appropriate data so
       that when lafs_import_inode looks at it, the right stuff
       happens.
   - phase_flip
   - lafs_prealloc
   - lafs_seg_ref
   - lafs_lock_inode
     lafs_dirty_dblock
     lafs_cleaner_pause
     lafs_dirty_inode
     lafs_seg_flush_all
     lafs_write_all_super
     lafs_quota_flush
     lafs_space_use
     lafs_cluster_update_abort
     lafs_cluster_update_commit_buf
     lafs_cluster_update_commit
     lafs_seg_apply_all
     lafs_cluster_update_prepare
     lafs_inode_phase_check
     lafs_seg_dup
     lafs_dirty_block
     lafs_cluster_update_lock
     lafs_checkpoint_unlock_wait
     lafs_orphan_drop
     lafs_free_get
     lafs_find_next

2nov2006
  - I need to know if a block is undergoing write-io so that I can
    avoid modifying it in certain circumstances.  But I don't track
    this information.
    Options:
     1/ track the info.  This means an extra field in the 'struct
        block' because I still need to know which wc has had a write.
     2/ For blocks that we care about, copy the data on write...
        But we care about all inodes and directory blocks.  That is a
        waste.

    I think we put extra info in the block.  We need to know which wc
    was used (0,1,2) and which pending cluster in there (0-3) which
    comes to 4 bits.
    But we only care about the block for wc=0, and we could include
    the which-pending in the b_end_io, or maybe put it all in low
    bits of the block pointer....  Need max 4 bits.  Can only be sure
    of 2...
    Maybe:
       'which' goes in bottom two bits of bi_private
       'wc' goes in ->flags

4apr2007 (What a long gap !!)
  - lafs_cluster_update_*
    How do we prepare for a cluster update?  How do we lock it?
    The important thing is that the update can be written.  That
    requires that there is space available.  So we need to preallocate
    space and then release it.
    It is possible that each update might go in a different cluster,
    so maybe we need to preallocate one block per update.  That sounds
    a little expensive.  After all, we aren't preallocating a cluster
    block for every data block that is dirty.
    So:
       prepare does nothing
       lock preallocates the space - a full block.
       commit copies it in.
    For now at least.

24May2007
  - Can now create and delete lots of files.  This is cool.
    But:
      Orphan slots just grow and grow - never to be reclaimed - why?
      After rm f*, 7 files remain, but rm f* again and they go.
           FIXED - readdir wasn't returning them
      Size of directory remains large.
      And sometimes, files become ghosts...
        (try just removing one after first rm f*).

  TODO
   - process those orphans to clean up the directory.

20June2007 (Happy Birthday Dad)
  - Creating lots of files and then deleting them leaves 5 orphan
    slots for the directory busy, and one for inode 0??

  Directory handling uses the following orphans:
    CREATE:
       A new index block is created by splitting.  This needs to be
       linked in.
    DELETE:
       The dirent block we are deleting from
           If it becomes empty, it needs to go on free list
       The index block we are deleting from
           If it has lots of free space it might need to be
           rebalanced.
       The inode that was deleted.

  - When a file is fully deleted, we need to drop any orphan info...
       DONE
  - Need to do orphan handling of free blocks in directory, and
    unmerged parents - but there doesn't seem much point as I am going
    to change the directory layout (again).

  So: writing to a file.
   We need prepare_write, commit_write, and writepage.
   Prepare loads and links the page and checks there is space.
   commit marks it as dirty so writeout is possible.
   writepage chooses a page to write out

25June2007 - HACK week, thanks Novell!!
  - write - DONE
  - sync
      Somewhat done.  Need to revise the process whereby async
      completion clears PageWriteback.  We need locking in there, and
      need to worry about 'which' wrapping too soon.
      Need to not start IO before we set page writeback
  - chmod
      Maybe, but syncing to disk needs more thought.
  - 'df'
      Partly done, need actual content.
  - mkdir
      Can make directory, but creating first entry fails. - FIXED
  - symlink
  - readlink
  - new directory structure.

27Jun2007 - More HACK week :-)
  - new directory layout done - much easier!!
  - If I delete a file that was created, the blocks still have a
    ref-count and we crash.
  - mkdir doesn't increase link count on parent. - FIXED

  TODO:
    Orphan handling.
       Infrastructure to process orphans
       Handle specific cases
       flush orphans at key times.
       load orphans at roll-forward
    checkpoint
       Write out a checkpoint (when?)
    Make sure refcount goes back to zero on blocks I write.
    Check on inode_phase_check and checkpoint_unlock and inode_dirty
      in all directory operations.

  FIX: Writing a small file leaves something non-dirty but due to be
       written, and lafs_cluster_allocate complains.
       - seems to work now.
  FIX: dir_handle_orphan doesn't lock the orphan transaction required.
  FIX: rm a file with (small) content hangs waiting in sync_page in
       truncate_inode_pages.
  FIX: lafs_allocate hasn't been written!!!
  FIX: before updating any block in a depth=0 file, we must first
       load and 'lock' block 0.

29Jun2007 - still HACK week.
  Summary of how incorporation works.

  Each index block has a small table for unincorporated changes,
  i.e. block numbers and their addresses.  This supports efficient
  storage of extents, and is extensible by allocating more tables.
  This last is done rarely.

  When a block gets a new address, this is added to the table or, if
  there is a phase mismatch, it is added to a list until a phase
  change happens (so the whole block is pinned pending the phase
  change).
  If the table is full then:
    - if the filesystem is read-only (including during roll-forward),
      a new table is allocated (else rollforward fails).
    - otherwise we incorporate the table into the block, then add the
      new address to the (now empty) table.
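
  The rule above for adding an address could be sketched roughly like
  this in user-space C.  Everything here is illustrative: the struct
  layouts, UNINC_SLOTS (taken from the "Let's say 8" guess in the
  22jun2006 entry), and the names add_address/table_append are made
  up, incorporate() is only a stand-in that empties the table, and the
  phase-mismatch list is reduced to a counter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

#define UNINC_SLOTS 8	/* assumed size, see lead-in */

struct uninc_ent { uint32_t fileaddr; uint64_t physaddr; uint32_t len; };

struct uninc_table {
	struct uninc_table *next;	/* extra tables, allocated rarely */
	int count;
	struct uninc_ent ent[UNINC_SLOTS];
};

struct iblock {			/* hypothetical cut-down index block */
	int phase;
	bool read_only;
	struct uninc_table table;
	int incorporations;	/* how often the stand-in ran */
	int deferred;		/* stand-in for the phase-mismatch list */
};

enum add_result { ADDED, DEFERRED, ADD_FAILED };

/* Append in arrival order, growing the last extent when contiguous. */
static bool table_append(struct uninc_table *t, uint32_t fileaddr,
			 uint64_t physaddr)
{
	assert(t->count >= 0 && t->count <= UNINC_SLOTS);
	if (t->count > 0) {
		struct uninc_ent *last = &t->ent[t->count - 1];
		if (fileaddr == last->fileaddr + last->len &&
		    physaddr == last->physaddr + last->len) {
			last->len++;
			return true;
		}
	}
	if (t->count == UNINC_SLOTS)
		return false;
	t->ent[t->count++] = (struct uninc_ent){ fileaddr, physaddr, 1 };
	return true;
}

/* Stand-in only: the real thing merges the table into the block. */
static void incorporate(struct iblock *ib)
{
	struct uninc_table *t = ib->table.next;
	while (t) {
		struct uninc_table *n = t->next;
		free(t);
		t = n;
	}
	ib->table.next = NULL;
	ib->table.count = 0;
	ib->incorporations++;
}

static enum add_result add_address(struct iblock *ib, int phase,
				   uint32_t fileaddr, uint64_t physaddr)
{
	struct uninc_table *t = &ib->table;

	if (phase != ib->phase) {
		ib->deferred++;		/* wait for the phase change */
		return DEFERRED;
	}
	while (t->next)			/* newest table is at the end */
		t = t->next;
	if (table_append(t, fileaddr, physaddr))
		return ADDED;
	if (ib->read_only) {		/* never incorporate when RO */
		struct uninc_table *extra = calloc(1, sizeof(*extra));
		if (!extra)
			return ADD_FAILED;	/* rollforward fails */
		t->next = extra;
		t = extra;
	} else {
		incorporate(ib);
		t = &ib->table;
	}
	return table_append(t, fileaddr, physaddr) ? ADDED : ADD_FAILED;
}
```

  With this shape a sequential write of 100 blocks collapses into one
  extent and never triggers an incorporation, while 9 scattered
  addresses either force one incorporation (read-write) or chain a
  second table (read-only).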
  If incorporation requires that we split the index block we allocate
  one from a pool.  If there are none in the pool, we wait.
  As the table is much smaller than a block, the incorporation into
  two blocks will always succeed.  The 'uninc_next' and 'children'
  lists will then need to be shared between the two blocks before the
  new address is added to whichever table is appropriate.

  When looking for a block address, we must always check the table
  and then children lists.  We do not need to check uninc_next as they
  will always be children.

  How to ensure that the pool always has sufficient index blocks and
  we don't deadlock?
  We have two halves of the table, one for each phase.  Before we
  allow a block to be dirty in a phase, we ensure that the pool has
  adequate index blocks for that phase, e.g. twice the depth of the
  block.  If it doesn't we block the dirtying until space becomes
  available.  For syscall writes, this is easy as we catch it in
  prepare_write.
  When we perform a phase change, we must be sure there are enough
  index blocks for the deepest block that will stay dirty.  If there
  aren't, we need to flush all dirty blocks, and unmap all writable
  mappings before starting the checkpoint.

  FIX: need to work out life time rules so that inodes hang around
       while they have blocks.  Currently have an igrab that is never
       put.
  FIX: Dirty isn't cleared until 'flush', but do_checkpoints requires
       'alloc' to clear it.

3Jul2007
  Checkpoint flushing is getting close.
  Current problem.  InoIdx blocks are not changing phase.
  Phase change should happen when all children have been incorporated,
  and then the write has been triggered marking us clean.
  For InoIdx blocks, we need to be marked clean when the data block
  completes.

5jul2007 - a week off
  Checkpoint flushing seems to work !!!!

  FIX: what should filesize of symlink be?  Other filesystems use len,
       but still zero-terminate for vfs.

  Problem.
   A chmod is followed immediately by an unlink then a checkpoint.
   The chmod update gets into the checkpoint cluster, but the unlink
   completes before the checkpoint is finished so the new superblock
   sees the file as gone.  Roll-forward finds the update and wants to
   update a missing file.
   This isn't a big problem, but with slightly different details, it
   could be.
   One option is to ignore updates that precede the updated block.
   That might be awkward with e.g. directory updates and checkpoints
   that cross multiple segments.
   Another option might be to prohibit updates once a checkpoint has
   started unless they are known to be after the phase change.

  FIX: unlink isn't punching a hole in the inode file.  Inode usage
       map isn't being updated. - FIXED (For create, not unlink).
  FIX: roll forward does not pick up inodes, only data blocks.  But
       tiny files are synced to inode, so they might not be picked up.
       So we must process a level=0 inode like a data block.

6July2007
  Time for lots of clean up.

  DONE    1/ Index blocks to fill with 0 - use phys=0 to imply invalid.
  DONE    2/ rename 'lock' -> 'pin'
          3/ Review and fix up all locking/refcounts.  See locking.doc
  DONE    3a/ Make sure cluster_allocate can be called concurrently.
              e.g. check B_Alloc inside the semaphore
              Also lock inode when copying in block 0 and probably
              when calling lafs_inode_fillblock (??)
  DONE    3b/ lafs_incorporate must take a copy of the table under a
              lock so more allocations can come in at any time.
  NotYet  3c/ cluster_flush should start all writes before calling
              _allocate as _allocate might block on
              incorporation/splitting.
              No.  We really want _allocate to not block, but to
              queue...  I think this is too hard to get perfect just
              now, so I will leave it.
  DONE    3d/ introduce PinPending for data blocks.  remove
              fs->phase_depth.
  LATER   3e/ Index needs a clean-lru on each filesystem, and a list
              of filesystems so that locking of lru doesn't have to be
              too global
  DONE    3f/ change wc[]->hlhead to be a regular listhead as it is
              part of the lru system.
  DONE    3g/ revise refile lru handling based on new understanding
          3h/ Utilise WritePhase bit, to be cleared when write
              completes.  In particular, find when to wait for Alloc
              to be cleared if WritePhase doesn't match Phase.
              - when about to perform an incorporation.
          3i/ make sure we don't re-cluster_allocate until old-phase
              address has been recorded for incorporation.
          3j/ Check that index blocks cannot race when getting
              locked....
          3k/ Check what locking is needed to set PagePrivate
              exclusively.
  DONE    3l/ cluster_done needs to call refile, but is called in
              interrupt context.  We need to get it done in process
              context I think and lock ->waiting access with fs->lock
              after changing it to ->lru
  DONE    3m/ Need to know which blocks in a page are in writeback so
              we can clear writeback only when *all* have finished.
  DONE    3n/ on phase change, uninc_next blocks need to be shared out.
  NO      3o/ Make sure lafs_refile can be called from irq context.
          3p/ lock all lru accesses.
          3q/ Lock those index blocks!!!
          3r/ Can inode data block be on leafs while index isn't, what
              happens if we try to write it out...
  FIXED   Why are extent entries only grouped in 4s?
          If InoIdx doesn't exist, then write_inode must write the
          data block.
          4/ resolve length of symlink
  FIXED   - long symlink followed by 'sync' crashes.
  FIXED   - rollforward isn't calling 'allocated' on blocks, or
            something
  FIXED   - I cannot find 'bfile'.  (inode isn't written)
  SEEMS OK...
          - Must flush final segment of a cluster properly...
          5/ Review what does, and does not need to be initialised in
             a new datablock
          6/ document and review all guards against dirtying a block
             from a previous phase that is not yet safe on storage.
             See lafs_dirty_dblock.
          7/ check for proper handling of error conditions
             a/ checkpoint_start might fail to start a thread!
             b/ if lafs_seg_dup fails in do_checkpoint, we need to
                abort the snapshot.
          8/ review checkpoint loop.  Should anything be explicit, or
             will refile do whatever is needed?
          9/ Waiting.
             What should checkpoint_unlock_wait wait for?
             When do we need to wait for blocks to change state.
             And how?
  DONE    10/ rebase on 2.6.current
  DONE      - use s_blocksize / s_blocksize_bits rather than fs->
          11/ load/dirty block0 before dirtying any other block in
              depth=0 file
          12/ Add writecluster flag for old-phase updates.
              Why is this needed?  updates should always go in the new
              phase???
          13/ use kmem_cache for 'struct datablock'
          14/ indexblock allocation.
              use kmem_cache
              allocate the 'data' buffer late for InoIdx block.
              trigger flushing when space is tight
              Understand exactly when make_iblock should be called,
              and make it so.
          15/ use a mempool for skippoints in cluster.c
          16/ Review seg addressing code in cluster.c and make sure
              comments are good.
  DONE    17/ Make sure create inherits uid etc from process.
          18/ consider ranges of holes in pending_addr.
  DONE    20/ Implement rest of "incorporate"
  DONE    21/ Implement staged truncate
  DONE        use for setattr and delete_inode
  DONE    22/ block usage counts.
          23/ review segment usage / youth handling and make a todo
              list.
              a/ Understand ref counting on segments and get it right.
          24/ Choose when to use VerifyNull and when to use
              VerifyNext2.
          25/ Store accesstime in separate (non-logged) file.
          26/ quotas.
              make sure files are released on unmount.
          30/ cleaner.
              Support 'peer' lists and peer_find.  etc
          31/ subordinate filesystems:
              a/ ss[]->rootdir needs to be an array or list.
              b/ lafs_iget_fs need to understand these.
          32/ review snapshots.
              How to create
              how they can fail / how to abort
              How to destroy
          33/ review unmount
              - need to clean up checkpoint thread cleanly - be sure
                it has fully exited.
          34/ review roll-forward
              - make sure files with nlink=0 are handled well.
              - sanity check various values before trusting clusters.
          34/ Configure index block hash_table at run time based on
              memory size??
          35/ striped layout.
              Review everything that needs to handle laying out at
              cluster aligned for striping.
          36/ consider how to handle IO errors in detail, and
              implement it.
37/ consider how to handle data corruption in indexing and directories and other metadata and guard against problems (lots of -EIO I suspect).
    - check all uninc_table accesses are locked if needed.

And more:
1/ fs->pending_orphans and inode->orphans are largely unused!
2/ If a datablock is memory mapped writeable, then when we write it out, we need to fill up its credits again, or unmap it.
3/ Need to handle orphans asynchronously.
---------
22nov2007
Free index blocks are on two lists, both protected by the global hash_lock.
1/ The per-inode free_index, so they can be destroyed with the inode
2/ The global freelist so they can be freed by memory pressure.

11feb2008.
Where was I up to again? reviewing phase_flip and lafs_refile.
UPTO Reading through modify.c, at 'add_indirect'.
Plan to fix all this code. Need to think about how index blocks really change. How old blocks get dis-counted from segment usage, and what optimisations are really good for re-incorporating index blocks.
Operations to consider are: i)Append new block, ii)truncate, iii)over-write, iv)fill-hole.
i/ leaf block splits, index block gets new entry at end, and replacement for other entry. Easy to handle
ii/ trailing entries are zeroed. Should be easy, but isn't yet.
iii/ probably caught in leafs. May cause internal split so we add new index address, which is easily handled if there is space.
iv/ same as iii, though split more likely.
What about merging index blocks. That just makes addresses disappear, which we handle the slow way.
Do we ever re-target index blocks? Would need to be careful about that. Make it look like a split where one block ends up empty as a hole.
Need to write
grow_index_tree (DONE - untested)
    ib is a leaf inode that is getting full. Copy addresses into 'new', and make 'ib' an index block pointing at new.
add_index/walk index (DONE - untested)
end of do_incorporate (DONE - untested)
    new contains the early addresses. Some remain in ib and/or ui.
    the buffers must be swapped, so ib has the early address.
    ui needs to be attached to new
    return 2; - then new uninc needs to be split
lafs_incorporate
    case 2 - horizontal split
    case 3 - vertical split

12feb2008
Bother - uninc_table is a problem (again). We can currently add at any time with just a spinlock. So when we split a block horizontally,
Still need to
    share out children and uninc_table in do_incorporate
    share out credits in do_incorporate

14feb2008
Still need to do incorporate as above but took a break to...
Counting allocated blocks now works - stat shows right info, hopefully storage is correct too. - DONE
next: truncate? orphan thread? Then segment usage and the cleaner.
thoughts: truncate
- removing blocks doesn't need to erase them...
- nothing forces a cluster_flush promptly!!! We need a timeout or at least we need a flush before truncate_inode_pages...
- in lafs_truncate we need to make the block an orphan and pin it all in a checkpoint.

21Feb2008 (Research morning)
Discard checkpoint thread created on demand in favour of a cleaner thread that runs all the time. It cleans and checkpoints and orphans and scans.
want to: do segment scan and get a real list of free segments and free-space info!

25Feb2008
- segment usage scanning to count free blocks
- fix up re-reading of erased blocks
- FIX truncate can still block waiting for writeback to complete.
- FIX allocations aren't failing when we run out of free space
- FIX df doesn't agree with du.
problem: Truncate when an index block has addresses in uninc_table. The summary for the new address has already been performed. We need to deallocate the new without disturbing the old. However a simple allocation may not be possible. I guess we can prune them all to zero, then incorporation can proceed.
TOFIX: when truncating a recently created file, it is still depth=0 so nothing happens. We really need to increase the depth to 1 as soon as we dirty any block, then reset back to 0 if it fits.
26Feb2008
We have a file that we have written to, and the data blocks have been written out and the addresses stuck in uninc_table. We then truncate the file. Who releases the usage of those blocks? And who removes them from uninc_table?
OK, 'rm' returns all the blocks back now so 'df' is almost the same as 'du'.
I really should make sure that inodes are getting freed properly and the inode map is clean and everything.

BIG QUESTION
Do we reserve segment-usage blocks. We cannot do it naively as we get infinite recursion. But we need it to be allowed to dirty the segment block. But we cannot pin them to this phase as we want to write them out after this phase.
This still needs more thought. I avoided the recursion by setting SegRef before getting the ref. But that isn't safe.

28Feb2008
The table of cleanable segments is not working out. Each segment appears multiple times which wastes space and adds confusion. We really want to be able to lookup by dev/seg and also find the least.
'Find least' sounds like we want a heap but then we cannot discard the bottom half. We could have a skiplist for dev/segment lookup and do a merge-sort on a different link when we want to find the best segment. We then remember the best number found since a sort, and re-sort if the top is worse than the best.
We keep all this in a fixed size table. Each entry has seg,dev,usage,weight,weight-sort-link,addr-sort-link and possibly some addr-sort-skip links. This is 32+32+16+16+16+16 bits, or 16 bytes or bigger. Say 16bytes, 24bytes, or 32 bytes. (depth 8, which is plenty).
One page of 16byte entries (256 of them), 2/3 page of 24byte entries, 1/3 of 32byte entries. Total 2 pages, and 256+113+43 = 412 entries.
But deleting random elements is awkward... but not too awkward. We can delete lots of entries by marking them as old, then performing a single pass of the skip list deleting them.
We should keep free segments here too, on a separate list.
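The sizing arithmetic above can be checked with a small userspace sketch. This is only an illustration: a 4096-byte page is assumed, the field layout is just one reading of the "32+32+16+16+16+16 bits" note, and the names `seg_entry`, `entries_16`, `entries_24`, `entries_32` and `total_entries` are hypothetical, not taken from the real code.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_BYTES 4096u	/* assumed page size */

/* Hypothetical 16-byte base entry: seg, dev, usage, weight and two links.
 * Bigger entries append extra 16-bit skip links: 4 more fit in a
 * 24-byte entry, 8 more in a 32-byte entry. */
struct seg_entry {
	uint32_t seg;
	uint32_t dev;
	uint16_t usage;
	uint16_t weight;
	uint16_t weight_link;	/* weight-sort-link */
	uint16_t addr_link;	/* addr-sort-link */
	uint16_t skip[];	/* optional addr-sort-skip links */
};

/* First page: 16-byte entries only. */
static unsigned entries_16(void)
{
	return PAGE_BYTES / 16;
}

/* About two thirds of the second page holds 24-byte entries. */
static unsigned entries_24(void)
{
	return (PAGE_BYTES * 2 / 3) / 24;
}

/* Whatever is left of the second page holds 32-byte entries. */
static unsigned entries_32(void)
{
	unsigned used = entries_24() * 24;

	assert(used <= PAGE_BYTES);
	return (PAGE_BYTES - used) / 32;
}

static unsigned total_entries(void)
{
	return entries_16() + entries_24() + entries_32();
}
```

With these assumptions the counts come out as 256, 113 and 43, which agrees with the 412 entries quoted in the notes.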
So how about:
    2 pages of 16byte entries
    1 page of 24
    1 page of 32
free list randomly threads through all. When using from 24 or 32, randomly choose height of 2-5 or 2-9
Two lists run through the skiplist entries. One for cleanable, one for free. Remember the nth element for some small n (10, but it decreases as we pull things off the front) and if we add something less than that, we trigger a mergesort on the next time we want to clean.... maybe.
Remember end of free list and add to there. Maybe merge-sort the free list by addr occasionally.

Questions: When can we clean, when can we free wrt checkpoints?
- we can clean a segment as soon as we have a checkpoint after it. So we record the youth of the segment holding the (start of the) checkpoint, and can clean any segment with a lower youth.
- we can free a segment after the checkpoint after its usage has reached zero. So if usage is zero and youth.... We could offset the usage by one (say - for the first cluster header..) then when we find a segment with usage of '1', we schedule an update to 0 in the next checkpoint...
How about segments with different sizes - they get different weights. Need to divide by segment size: usage * youth / size.

TOFIX
- It seems I sometimes fall off the end of the last segment !!! - FIXED (locking)
- We seem to switch to a new segment when still 83 blocks remaining? - FIXED (delete did flush)
- Lots of 'creates' makes lots of little clusters - need to optimise! Or it could be deletes as we currently cluster_flush for each delete. - I think this is fixed

29Feb2008
Started looking at the cleaner.
Need to understand how much to clean each checkpoint
Need to track free-space-in-active-sectors while scanning.

3Mar2008
TOFIX
- the cluster head is currently limited to one page. This is not good.
- Should the cleaner start before the scan is complete after a checkpoint? Probably it can, but while the scan is still happening it might be best to be cautious ??
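A sketch of the weighting and the "when can we clean" rule from the 28Feb2008 notes above. Everything here is illustrative: `seg_info`, `seg_weight`, `seg_can_clean` and `pick_segment` are hypothetical names, youth is assumed to grow as segments are written, and a lower weight is assumed to mean a better cleaning candidate ("find the least").

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-segment summary; field names are illustrative. */
struct seg_info {
	uint32_t usage;	/* blocks still in use */
	uint32_t size;	/* total blocks in the segment */
	uint32_t youth;	/* assumed: larger means written more recently */
};

/* usage * youth / size, so that segments of different sizes compare
 * fairly. Integer division, so small values truncate towards 0. */
static uint64_t seg_weight(const struct seg_info *s)
{
	assert(s->size > 0);
	return (uint64_t)s->usage * s->youth / s->size;
}

/* A segment may be cleaned once a checkpoint has started in a younger
 * segment, i.e. its youth is lower than the checkpoint segment's. */
static int seg_can_clean(const struct seg_info *s, uint32_t checkpoint_youth)
{
	return s->youth < checkpoint_youth;
}

/* Linear scan for the lowest-weight cleanable segment; returns its
 * index, or -1 if nothing can be cleaned yet. */
static int pick_segment(const struct seg_info *segs, int n,
			uint32_t checkpoint_youth)
{
	int best = -1;

	for (int i = 0; i < n; i++) {
		if (!seg_can_clean(&segs[i], checkpoint_youth))
			continue;
		if (best < 0 || seg_weight(&segs[i]) < seg_weight(&segs[best]))
			best = i;
	}
	return best;
}
```

The linear scan stands in for the skiplist plus merge-sort idea; it only shows which segment that machinery would be expected to return.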
STATE: try_clean is taking shape and has a few FIXMEs. need to write async find_block code and get it to watch for block in a cleaning segment.

28Mar2008
- where can padding appear in a cluster? between miniblocks? at end of device blocks?
- need to track phys block while parsing headers for cleaning.. why?
- determine rules for avoiding block lookup during cleaning based on youth/snapshot age, and truncate generation.
  We need to load the inode from each snapshot
  Can we optimise based on snapshot age? only if we know the block is newer than the snapshot. So when we relocate blocks (cleaning) they must go in a segment that is marked as being old. we cannot really guarantee that.
  I guess blocks that are marked as 'new' can safely be skipped if segment is newer than snapshot. This 'age' is not the youth, but is the cluster_head->seq which is stored in creation_age.
- Store the rootdir for a filesystem in the metadata for the root inode. Then 'struct snapshot' doesn't need rootdir. It can have a root

30Jun2008
Looking at lafs_find_block_async.
Needs async flag to make_iblock. Check that.
Can we block_adopt if there was an error? iblock will exist.
setparent has async flag.
lafs_leaf_find has async flag
lafs_wait_block_async
FIXME I wakeup the cleaner every time an IO completes. Do I really want that? Maybe only when number of async IOs hits half the recent maximum??
FIXME need to ensure that lafs_pin_dblock flushed committed B_Realloc blocks.
FIXME when we incorporate a dirty (non-realloc) address to an index block, we need to clear B_Realloc on the indexblock.
FIXME in lafs_incorporate we lafs_dirty_iblock 'new' without giving it any credits. Where should they come from?
We don't seem to scan for free/cleanable segments often enough.
FIXME we shouldn't start a checkpoint while cleaning is happening.
FIXME need to be careful when cleaning about finding inodes that don't exist any more.
FIXME give credits to realloc blocks.
FIXME think about/document transitions between realloc and dirty, and what locking is needed.

2Jul2008
Allowing for the FIXMEs above, the cleaner is now identifying blocks that need to be cleaned and marking them B_Realloc (I think). We now need to gather these into a write cluster and write them. They will all be on the clean_leafs list, so we can iterate that allocating or incorporating as needed. This will be similar to do_checkpoint.
Important question is: when?
Ideally we would have some auto-flush mechanism. The cleaner just keeps finding blocks to clean and when we start running out of resources we flush the cleaning queue. However we will still want to flush the cleaner always before a checkpoint, so for now we can implement that bit and wait for a need for the other to arise.
FIXME: cleaner lookup of 0/0/0 has interesting consequences as we don't record that location the same way.. how to handle? Should check that 'adopt' doesn't do the wrong thing with this block.
Realloc blocks need to be pinned. That makes sense. Only that way will they get onto the clean_leafs list. When checkpointing we should probably examine clean_leafs to be on the safe side.

Realloc and Dirty:
Both of these hold a Credit. Both can be set at the same time.
Cleaner ignores Dirty and sets Realloc anytime the block is in the wrong segment. It also Pins the block.
When the cleaner is flushing to the cleaning segment, it ignores Dirty blocks. They get their Realloc cleared, but they remain pinned. So they will get moved at the next checkpoint.
How do we know whether an indexblock should be Dirty or Realloc? The Dirty/Realloc bit is cleared before we get to incorporation. Maybe we lafs_dirty_iblock the parent of any block we write out. Then after incorporation, we set Realloc if it is not dirty.

STATUS: I think I'm pinning cleaner blocks now.
Need to make sure the dirty ones are dropped.
DONE Need to make sure the usage is transferred
Need to get free segments back into use
Need some more 'dump' options. Maybe youth/usage files. Maybe tree.
Need to make sure scan etc are triggered often enough.
FIXME lafs_prealloc walks up ->parent without locking. I think we want i_mapping->private_lock like lafs_pin_iblock.

TODO:
1/ a 'dump' option that triggers a scan and prints everything out.
2/ scan must mark freeable as such, then subsequently free them.
3/ Look at code that decreases usage of old segments.
4/ Review lafs_cluster_wait_all and decide exactly how long we need to wait.
5/ Review 'FIXME that is gross' HZ/10 thing.
6/ Review 'wait for checkpoint to flush' msleep(500); Maybe remove that altogether.
FIXME BUG_ON in grow_index_tree fires. sync - writepages - flush
FIXME BUG in lafs_allocated_block fired. from lafs_erase_dblock from invalidate_page from .. vmtruncate from lafs_setattr

Current problem:
An inode data block is dirty and pinned, but the inoidx is no longer pinned. Presumably it isn't dirty. Recheck what 'dirty' means on the two blocks and see how this can happen.

10july2008
Tree gets very big! Lots of 'Realloc' blocks that should be long gone.
We are spinning in cleaner again, and not in try_clean.
Is it a problem that 'Pinned' is used for Realloc and dirty blocks? In general it shouldn't be. The flush_cleaner process will remove the Realloc bits so the blocks fall off clean_leafs. They then either go onto phase_leafs or get unpinned.
But I currently have a problem with InoIdx/data. The Pin is transferred to the Data block, but it doesn't go from the InoIdx block because it has a pincnt.
Now that is probably a bug, but what if it weren't? What if, while we were cleaning, a block got dirtied. That would pin the whole tree. I guess the rule about not allocating an inodedata block while the InoIdx is pinned needs to be revised.
If the inodedata block is Realloc (and not Dirty) while the InoIdx is not Realloc, we can go ahead (in a cleaning segment).
FIXME to check: adir/big1 is garbage.... big1 was removed, so why is it even there? FIXED.
echo tre > dump # still too much stuff.
Put cond_resched in checkpoint loops!

Thoughts about cleaning and pinning.
When cleaning we need to know how many dependant blocks are being cleaned so that we know when *this* block can be written - i.e. when the count hits 0. We cannot use the pincnt for this phase because there may be dependant blocks which are dirty. They, and therefore this, may get flushed at next checkpoint, but they may not. If we could be certain they would, we could just write to the clean-segment blocks which can become unpinned. However if there is an index block being cleaned, and no dependant is being cleaned, but some are dirty but not pinned, then the checkpoint can go past without the block being moved.... but maybe we can detect that.
Try this:
We set B_Realloc precisely on blocks found in segments being cleaned. We pin these blocks and leafs which are Realloc go in clean_leafs.
If a block is both Realloc and Dirty we clear Realloc but leave pinned. That way it gets written at end of checkpoint, but to main cluster.
When we incorporate Realloc blocks into an index block, it gets marked Realloc. When we incorp dirty blocks, mark dirty. Then see above.
On a checkpoint, we process both phase_leafs and clean_leafs
FIXME do inode reads async better when cleaning...
FIXME if a realloc inode has been allocated to a cluster when we try to dirty it, confusion can ensue as the writeout won't mark it clean, but will use up the credits. Maybe we need something similar to phasewait to not set PinPending... But normal dirtying doesn't phasewait. I think we just need to detect this case and wait for the clean-cluster to flush. Messy...
FIXME make sure incorporate is doing the right thing with credits.
FIXME lafs_write_inode.
We need to be careful about clearing Dirty when making an update. Need some sort of locking. Need to review all inode dirty stuff and make sure we do the right thing no matter when it is called.
FIXME when blocks are attached to uninc_next, they don't have 'dirty' anymore so we don't know how to flag the index block.

2008jul13
UPTO: unlink etc don't prealloc the inode that will be modified. And a warnon inode.c:579 is very noisy.

2008jul22
FIXME: lafs_reserve_block uses CleanSpace if Realloc is set, but it doesn't get set until AFTER lafs_reserve_block is called.
Here I am...
Cleaning cleans an InoIdx block which schedules the data block. Subsequently the InoIdx block gets pinned again. Now when we go to write the data block, we cannot because InoIdx is pinned in same phase. Maybe given that data block is pinned, we write it anyway...
FIXME: when we realloc a block embedded in the inode, don't pluck it out and put it back in again. Just realloc the inode.
FIXME: when cleaning a directory that has shrunk, we think we have blocks that don't exist any more. FIXED - we thought '0' was in segment '0'.

2008jul23
FIXME: lafs_dirty_iblock called from lafs_allocated_block in cluster flush finds no credit. for InoIdx block of 8501
FIXME: do we do SEGREF on all the index blocks? do we need to?

2008jul24
FIXME: seg usage for segment 0/5 isn't dropping to zero. Part of a file got moved off, but count is still there. FIXED - seg_move wasn't being called.
FIXME: segusage file has inconsistent extents:
    Extent entries:
    0 -> 694 for 2
    1 -> 1291 for 1
    1 -> 15 for 1
    FIXED several bugs in walk_extent
FIXME qphase: any locking between that changing and lafs_seg_move?? I don't think so. Just that seg_apply_all must be called after qphase is set.
FIXME make sure we don't try to clean the current segment!!
FIXME 'Available' goes negative! Creating large file doesn't instantly reduce 'Used'. Deleting files plus sync doesn't increase Avail?
FIXME a segment is in the table but doesn't print out!
FIXME we don't cope with running out of free segments (not that we ever should).
FIXME check all Credit usage and make sure credits are returned when ->parent is dropped. provide visibility into credit counts. Make sure we are keeping enough space for cleaning. We should always have a few segments unallocatable.

2008jul25
FIXME cannot do io completion in cleaner thread as it can block on a i_mutex which might be waiting for completion. FIXED (keventd).
FIXME as ->iblock isn't refcounted we need to be careful accessing it. If we 'know' we have a reference, e.g. a child with a ->parent link, we can access it without locking.
So: lafs_make_iblock should return a counted reference. If we own an (indirect?) reference to iblock, we can access both iblock and dblock for free... but iblock can change??? If not, we need to get a reference to one or other under a lock.
FIXME block->inode should be a counted reference?
lafs_make_iblock OK lafs_leaf_find OK lafs_inode_handle_orphan OK inode_handle_orphan_loop FIXED __lafs_find_next OK find_block FIXED __lafs_find_next OK lafs_find_next FIXED dir_lookup_blk dir_handle_orphan lafs_readdir lafs_inode_handle_orphan choose_free_inum find_block - FIXED
FIXME root->iblock should always be refcounted. Is it?
FIXME walking siblings - what lock?

2008jul28
FIXME several times we clear PinPending without refiling, in dir.c in particular. that looks wrong.
FIXED Maybe lafs_new_inode should return a reference to the dblock. Or pin it. or something.
FIXED And pinned (when needed).
FIXME lafs_inode_dblock might return a block without valid data... Need to get valid data, then load block 0 in find_block rather than load_block.
FIXED FIXME we really should own a reference to ->dblock before calling lafs_pin_inode. We don't want IO during a pin request.
FIXED FIXME review use of PhysValid
FIXED lafs_orphan_abort - what if lafs_orphan_pin not called? or if 'b' is NULL.
FIXED Do I Need to clear PinPending when retrying??
Well, we need to be phase-locked when we set PinPending, so it must be Pinned to the current phase. So when we unpin a datablock, we must clear PinPending.
FIXED we now clear PinPending in do_checkpoint.
Does phase_wait do the right thing when pinning an inoidx block for an inode? FIXED

Pending
Need to understand and document the lifetime of a page with datablocks. who holds what refcount, and when can it be freed? Then fix up locking in lafs_refile, __putref.
FIXME who keeps what refcount on orphan blocks/inodes??
FIXME should dirty/pinned/etc hold a refcount? they don't.
Later:
FIXME make sure a failed (-EAGAIN) pinning triggers a checkpoint (eventually)
FIXME make sure empty files have depth of 1.
FIXME Truncate proceeds lazily. All data blocks need to be gone

26aug2008
If I call lafs_erase_dblock while a write is underway, we have a problem. We need to wait potentially for a checkpoint to let go of the block and a write to complete.
This should be done with waiting for PG_writeback on the page to disappear. Check this out.
When end_page_writeback is called, we must have dropped all references to the page.
When we commit to writing a block, we have to set PG_writeback on the page so that truncate et al can wait for it. Before we have committed, truncate can just remove the page. Internally we differentiate by B_Alloc. So before setting B_Allocated we need to test_set_page_writeback(page). Be careful of races.
I don't think we can ensure all references are dropped. After all, that is the point of refcounts. So dblock array must exist without page!
But we need to ensure that we don't start a writeout after truncate has done wait_on_page_writeback. This is done with the page locked so when we want to write a page in a checkpoint, we need to lock the page first. Once we have the lock, we check if the page is still dirty. If it has been truncated it will be clean.
But how do we safely reference the page if b->page can be cleared?
How about: When we clear PagePrivate, we take a counted reference to the page for db->page. This is dropped when the page is freed by lafs_refile. But while it is held, it is still safe for db->page to be dereferenced.
So before we commence writeout we have to lock the page and set PG_writeback. After locking, we need to test if writeback is still appropriate.
Maybe not. I think we can submit blocks for writeout without setting the page to writeback. If we do, then we need to be sure those writes finish before invalidatepage calls releasepage (block_invalidatepage calls discard_buffer which calls lock_buffer which waits).
In our case invalidatepage needs to make sure that no new write commences. Maybe we should lafs_iolock_block before we allocate to a cluster and check again if the block is dirty.
So:
lafs_cluster_allocate does:
    lafs_iolock_block
    check if still dirty. If not, unlock and return
    set allocate flag
    allocate and write
    when write completes, allocate is cleared.
    unlock block
invalidatepage does
    lafs_iolock_block
    clear Valid,Dirty,Realloc
    lafs_iounlock_block

2008 aug 28 - happy birthday.
FIXME segsum_find calls lafs_reserve_block without a checkpoint lock. lafs_prealloc complains. mark_cleaning does too, but cleaning only happens well away from a checkpoint lock.
segsum_find is being called to reference a new segment when we flush a cluster.
segment usage blocks are special. Their index information doesn't need to be written out in the current checkpoint. We can do that, but the backstop is to write just the data block in the tail of the checkpoint and write indexing information later.

2008sep10
unlink is getting "No space left on device". This is when trying to pin the directory block, the physaddr is 0, so it looks like we want NewSpace. But we shouldn't even be trying to prealloc in that case because there should already be a prealloc on the block. i.e. there should be credits.
Hmmm. after multiple 'syncs' how can the block not be written out.
Maybe it is embedded in the inode?
When we pin a block that was embedded in the inode it isn't clear what to do. If we might grow the file so it doesn't fit any more, we need to allocate NewSpace. If we know it won't grow, we use Release.
This still needs a proper fix.

Cleaning seems to be working nicely. However we don't get all the space back that we should because lots of blocks still have credits that aren't being returned.
So when should credits be returned? They are set when a block is pinned. It then gets dirtied which consumes a credit. Then gets unpinned. I guess if it isn't pinned, then it doesn't need any credits.
It seems that cluster_flush is not always writing things in the correct order. Root gets written before some other things below it. Maybe they are temporarily out of the loop?? No. There are dirty blocks which one checkpoint doesn't pick up, but they aren't holding the index block pinned. so they lose allocation. But they must hold the indexblock pinned, even though they aren't pinned themselves. We maybe do this just with the refcnt... maybe. That will cause it to phase-flip rather than drop pinning, which I think is right.
So: too many credits remain allocated. Where are they? There are 1464 outstanding credits. 290 are in the tree so 1200 or so are elsewhere?? But things removed from the tree have credits removed.
FIXME roll forward ignores inodes. But what about an inode that contains data. Should that be ignored? I think not.
FIXME delete adir/big2 then delete adir and it cannot release:
    Cannot release [cee29000]74/0(0)r1:Pinned,Phase0,Valid,Dirty,SegRef,UninCredit,PhysValid,Prealloc
    presumably there is orphan processing or something to complete???
FIXME when files are deleted, the space isn't returned! This seems to be mostly fixed - need to test.
FIXME when I "rm [b-z]*" it waits for writeback on something??? zfile again!!! OK, I think that is fixed.

12sep2008
Current problem: seg_apply_all dirties dblocks. When should they be reserved?
They originally get reserved by a lafs_reserve_block call in segsum_find called from e.g. lafs_seg_ref which is called by lafs_reserve_block. However: that block might get written before *and* after a checkpoint. So we need N* Credits. These are usually only used for Index blocks. We can set these easily enough if inode type is TypeSegmentMap. We move them across to Credit in seg_apply_all. But when do we clear them if they aren't needed? I guess when we drop the last segref. Yes, we already do that.
FIXME need to make sure these get flushed on next checkpoint if we cannot allocate new credits after a checkpoint.

New Problem. The 'cleanable' table reports a size of 3, but it is empty! Think that is fixed.

Some problems.
1/ see above: rm x/y; rmdir x -> BUG - FIXED
2/ Spins on 'CURRENT=1' ??
3/ if alloc_space gives EAGAIN while deleting, we don't survive.
4/ When I create/delete a file, ablocks_used increments by one. The inode hasn't been allocated yet, so it seems the deallocation isn't adjusting ablocks_used??
5/ open_namei (for dd) got caught on a mutex_lock.
6/ When a large file is shrunk we don't reduce the level of the InoIdx block. I'm not sure where we should and am not thinking very clearly. Will fudge something in flush_data_to_inode for now, but it MUST be fixed.
7/ unlink (at least) can get stuck in iolock_block. Who could be holding the lock? Writeout that hasn't completed? Yes. writepage calls lafs_allocated_block without calling flush. So the block could be sitting waiting for a flush. How long do we wait??
8/ It seems that some datablock can need NCredits. Make sure these are handled properly re flush-or-refill after checkpoint and flip_phase rather than unpin.
9/ Maybe after lafs_writepage cluster_flush isn't getting called soon enough, and we lock up (see 7). Need to flush the first block straight away, and the next one as soon as the first finishes, etc. Or something like that. Then remove the comment from lafs_writepage.
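A rough sketch of the next-phase credit handling discussed for segment-usage blocks in this entry (and in item 8 above). It is a toy model only: the flag names `B_Credit` and `B_NCredit`, the `blk` struct and both functions are hypothetical, and it assumes one credit of each kind per block.

```c
#include <assert.h>

/* Hypothetical flag bits; the real block flags may be named differently. */
enum {
	B_Credit  = 1 << 0,	/* can be written once in the current phase */
	B_NCredit = 1 << 1,	/* held back for a write in the next phase */
};

struct blk {
	unsigned flags;
	int segrefs;	/* outstanding segment references */
};

/* What seg_apply_all is described as doing: if the current credit has
 * been used, move the next-phase credit across.
 * Returns 1 if the block holds a Credit afterwards. */
static int promote_credit(struct blk *b)
{
	if (!(b->flags & B_Credit) && (b->flags & B_NCredit)) {
		b->flags &= ~B_NCredit;
		b->flags |= B_Credit;
	}
	return !!(b->flags & B_Credit);
}

/* Drop one segment reference. When the last one goes, an unused
 * next-phase credit is handed back. Returns the number released. */
static int drop_segref(struct blk *b)
{
	assert(b->segrefs > 0);
	if (--b->segrefs > 0)
		return 0;
	if (b->flags & B_NCredit) {
		b->flags &= ~B_NCredit;
		return 1;
	}
	return 0;
}
```

Whether an unused current-phase Credit should also be returned on the last segref is left open here, as it is in the notes.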
8th December 2008
I seem to be getting only 4 blocks to a cluster at the moment. This is good as it motivates the code to handle block splitting in the Btree. But it shouldn't happen.
....
Block splitting might work - it doesn't crash at least. But after deleting all files, the tree is full of stuff. Lots of inode data/InoIdx blocks. Many but not all are Pinned. The others are OnFree. The Pinned ones have outstanding references. Others ....
Problem with the block splitting, when adding an index block. The index block is initially empty - we need to find things by looking at children. But we don't. We BUG_ON the iphys==0.
In general, when we add a block below an index block and before we incorporate, the block must be found by finding the first indexed block and looking to see if there is a 'next' block that contains the address we need. FIXED
But if we truncate a file while an index block is pinned and dirty, we spin on trying to incorporate it, which should make it empty.

11th December 2008
deadlock. sync is trying to get lock in lafs_cluster_flush. pdflush holds the lock and is stuck in cluster_flush_0xa40, some wait_event I expect. Maybe we need an unplug ??
- checkpoint/seg_apply_all/dirty_dblock doesn't have the credits. This is in clean_free. We try to update the 'youth' to mark the segment as free, and we don't have a reservation to do it. Maybe just reserve it there and then.

12th December 2008
When doing a lookup in an index block, we need to check the unincorp address list. It isn't enough to look for unincorp blocks as they might have disappeared.
For INDIRECT and EXTENT this is easy enough as full information is in 'uninc'.
For INDEX it is a little tricky as we need to look at the full set of addresses to know where a particular address fits. We could force an incorporate first, but that has awkward implications if it requires a split.
Maybe if we get from the lookup "start+range".... That is not enough as the 'start' might get zeroed by an update.
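For the easy (INDIRECT/EXTENT) case above, the lookup order can be sketched as follows. This is a userspace illustration under stated assumptions, not the real code: `addr_ext`, `ext_lookup` and `leaf_lookup` are hypothetical names, later table entries are assumed to override earlier ones, and a physical address of 0 is assumed to mean a hole. The INDEX case is not covered.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical extent record: 'len' blocks starting at file address
 * 'fileaddr' are stored consecutively from 'physaddr'.
 * physaddr == 0 is assumed to mean the range has become a hole. */
struct addr_ext {
	uint32_t fileaddr;
	uint64_t physaddr;
	uint32_t len;
};

/* Search one table from newest to oldest entry.
 * Returns 1 and sets *phys if 'addr' is covered, else returns 0. */
static int ext_lookup(const struct addr_ext *tab, size_t n,
		      uint32_t addr, uint64_t *phys)
{
	assert(phys != NULL);
	for (size_t i = n; i-- > 0; ) {
		if (addr < tab[i].fileaddr ||
		    addr - tab[i].fileaddr >= tab[i].len)
			continue;
		*phys = tab[i].physaddr
			? tab[i].physaddr + (addr - tab[i].fileaddr)
			: 0;
		return 1;
	}
	return 0;
}

/* Unincorporated addresses are newer than what the index block holds,
 * so they are consulted first; the incorporated extents are the
 * fallback. Returns 0 when no block is mapped. */
static uint64_t leaf_lookup(const struct addr_ext *uninc, size_t n_uninc,
			    const struct addr_ext *inc, size_t n_inc,
			    uint32_t addr)
{
	uint64_t phys = 0;

	if (ext_lookup(uninc, n_uninc, addr, &phys))
		return phys;
	if (ext_lookup(inc, n_inc, addr, &phys))
		return phys;
	return 0;
}
```

Consulting the address list rather than the unincorporated blocks themselves matches the point made above: the blocks may already have been freed, but their addresses are still recorded.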
rm adir/* doesn't work as readdir doesn't get all the entries for some reason. Reason is that they are being put in the wrong block. lafs_find_next doesn't correctly find the 'next' block if it hasn't been incorporated yet.
Block can be:
    in index tree -- easy to find
    in uninc_table -- not too hard
    only in the ->children list, or attached to a page.
It would be nice to use find_get_pages but that isn't exported so try something else for now.
For index blocks: Look in index block for 'next

15th December 2008
FIXME when we split an index block, we need to hold a reference to the original so it doesn't disappear until the split-off copy is written. This is because we search from an index block to find split-off copies.
[ note from Feb09. This should be OK now. Both will need incorporation, and we now hold on to blocks until they are incorporated.]

23rd February 2009
- index block. What changes are allowed exactly.
    - splitting certainly makes sense.
    - merging two adjacent blocks is fine, of which a special case is finding that a block is empty and so removing it.
    - What about a 2->3 split which would require removing a block and adding another at the same time? or noticing that the first blocks addressed are all missing, so moving the index forward?
  In each case, searching down by indexes will find a block that has been replaced by a later address. We could manage that as long as the new block is attached after the replaced block. So we cannot move a block. We must delete and replace.
- unincorporated index blocks..
  unincorporated data blocks are not pinned in memory. Once they have been written out, they can be freed. Their address is stored in the uninc-table. This means we can delay incorporation while many extents are written out and freed. When we come to incorporate, we may have many hundreds of addresses in a few extents that can be incorporated efficiently without holding all that data pinned in memory.
  The same scale doesn't apply to index blocks.
  An index block can reference only 102 blocks (for 1K block size). And the uninc table can hold far fewer so we will naturally incorporate more often. So keeping index/indirect/extent blocks pinned until they are incorporated is reasonable. And it makes lookup a lot easier, as we have guarantees about ordering of block in the children list that we don't have in the uninc table.
  Incorporation could have some atomicity issues. There is no concern about bad stuff appearing on disk as the phase-change process handles that. In memory it might be awkward if we split an index block before incorporating a block that would span them. That could conceivably happen if we only incorporate 8 blocks (size of uninc table) at a time. So maybe we should incorporate a full uninc list (not table) at a time. This means quite different code paths for incorporating leaf and internal index blocks....
- uninc_table lists are a real problem. They can only be created during roll-forward so they hardly ever happen. But if the block is split while processing earlier things on the list, then splitting an uninc table would be very messy. Is there any way around this?
  Why not just do incorporation during roll-forward? We only need to incorporate leafs, not internal blocks because we don't use uninc_table for internal blocks any more. So during roll forward, all index blocks that are touched need to be held in cache... I think we live with that. If it ever becomes a problem, we will need to perform the roll-forward twice. The first time collects the usage information so that we know where we can start writing, then the second just applies all the changes to the rest of the filesystem.
So:
    uninc table only used for leaves, and has no linked list
    unincorporated index blocks are stored on a list, which we sort before applying.
All uninc index blocks are therefore kept in the index tree. Their order on the children list allows us to find the correct index.
   Each block for which the fileaddr is in the parent is followed by any blocks that have been split off and end after this one starts. Blocks that have been emptied are Hole and are skipped over when looking for a block.
   When we split an internal block, the remaining uninc blocks must not start with a Hole.

FIXME: what locking do I need around lafs_incorporate?
   i_mutex?? i_alloc_sem??
   i_alloc_sem is imposed by truncate (inode_setattr) and direct_io possibly. So it is really about adding/removing blocks. Not updating internals.
   Maybe our own mutex. Could even be per-index-block !!
   Whatever it is, we need to protect walking ->children too.

24th February 2009

"rm -r" problem from 12/dec/2008 fixed now. incorporate code got a make-over and is probably much better.
New problems:
 After test runs, cannot create files due to no space on devices!! But directory tree is empty.
 I can see:
    free_blocks=3256 allocated=1425 max_seg=512 clean_reserved=0
 The problem is that we think 1425 has been allocated to data that might still need to be written, leaving not enough room for more.
 Index Dump shows
    ====================414 credits ==============================
 which doesn't explain everything, but does explain a lot.
 There really should be nothing in the Index tree (except fs-root and tree-root)
 There is also:
   Some inodes which are OnFree and hold no credits.
     0 DATA (1) 52 [0]ESegRef,Claimed,PhysValid 52 1 (0) 0 [2564]{0,00000000}L on free Index(1),InoIdx,OnFree,PhysValid
   Some other inodes which are pinned with lots of credits and are on the phase_leaf list
     0 DATA (1) 299 [0]ESegRef,C,CI,Claimed,PhysValid 299 1 (1) 0 [0]{0,00000000} [0, 0]L Leaf1(40) Index(1),Pinned,Phase1,InoIdx,Valid,SegRef,C,CI,CN,CNI,PhysValid,Prealloc
 And that is about it.
 some are not Valid, some are... checkpoint just wants to 'flip' them.
 They mostly have a refcnt of 1... I wonder who is holding that....
 The reference on the dblock is held by the iblock. But why is the iblock remaining?
 Who holds that reference?

 I restored some code to clean iblock, and now:
    free_blocks=3229 allocated=1277 max_seg=512 clean_reserved=0
    ====================244 credits ==============================
 which saved 130 credits. That helps.
 There seem to be many fewer of the many-credits blocks.
 Lots of index blocks in the tree are 'OnFree' and have a 0 refcnt, but haven't been removed. Why?
 It seems that they have ->parent == NULL, so lafs_refile never bothers to remove them. I guess it should...

 OK, lots of InoIdx blocks have gone now with their DATA blocks.
 So, remaining blocks are pinned to their phase with lots of Credits, have no pincnt, mostly have physaddr==0. It is just the stray refcnt that keeps them there..
 inums are 40, 56, 62-73, 275-278, 280
    40 is f22
    56 is first adir
    63-69 are directories 2/3/4/5/6/7/8/9
    70-73 are looooong symlinks
    275 is cfile
    276 is dfile - same as cfile but truncated.
    Then some nbfile-X that were big enough.
 So: what do they have in common:
    Several only use the in-inode data block, but probably not all
 Can it be that it is refcounted on the Leaf list, and so cannot get off??
 Yes, I think so! We only unpin things that have a zero refcount.
 So: what to do? checkpoint takes it off the list, then flips the phase and puts it on the other list with refile. During that time it has a refcount, so it doesn't lose the pinning.
 Do we want to:
   1/ Not have it on the list despite being pinned.
   2/ Drop the PIN despite the refcnt.
   3/ have refile do the phase_flip so it has a chance to notice the refcount has hit zero.

 2 isn't really an option. We need PIN to persist whenever we have a reference. We could possibly use PinPending for index blocks too, but that would require a lot of thinking.
 1 requires another criterion for being on the list. I suspect that would get messy fast.
 3 we used to do I think... But refile is in a big lock, and we cannot really do a phase_flip under that.. and phase flip calls refile anyway so we would get recursion.
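 The loop described above (the Leaf list holds a reference, and we only unpin at refcount zero) can be shown with a toy model. This is only a sketch: the struct, its fields and the toy_* functions are hypothetical names for illustration, not the LaFS structures, and it ignores locking entirely. It contrasts the plain "refcnt == 0" test with one that first discounts the list's own reference.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only: hypothetical fields, not the real LaFS block. */
struct toy_block {
	int refcnt;         /* all references, including the leaf list's */
	bool on_leaf_list;  /* the leaf list holds one reference while set */
	bool pinned;
	bool dirty;
};

/* References held by anyone other than the leaf list. */
static int toy_external_refs(const struct toy_block *b)
{
	return b->refcnt - (b->on_leaf_list ? 1 : 0);
}

/* The rule as described in the notes: only unpin at refcnt == 0. */
static bool toy_can_unpin_naive(const struct toy_block *b)
{
	return b->pinned && !b->dirty && b->refcnt == 0;
}

/* Assumed alternative: ignore the reference owned by the list itself. */
static bool toy_can_unpin_adjusted(const struct toy_block *b)
{
	return b->pinned && !b->dirty && toy_external_refs(b) == 0;
}

/* De-pin a block that nobody but the list refers to; returns true on success. */
static bool toy_try_depin(struct toy_block *b)
{
	assert(b->refcnt >= 0);
	if (!toy_can_unpin_adjusted(b))
		return false;
	if (b->on_leaf_list) {
		b->on_leaf_list = false;
		b->refcnt--;    /* drop the list's reference */
	}
	b->pinned = false;
	return true;
}
```

 With refcnt=1 and the block on the list, the naive test never fires while the adjusted one does. A block with any other holder stays pinned, which keeps the rule that PIN persists while someone has a reference.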
 So:4 - get lafs_phase_flip to notice and de-pin rather than flip.

 FIXME use kzalloc where appropriate.
 FIXME Maybe test refcnt-!listempty in refile and de-pin if that is zero.

25th February 2009

Good progress. Only 54 credits in Index Tree now.
Inodes 1 2 8 16 are present. (fsroot, dirroot, inodemap, segusage)
plus '74', which seems to be scheduled for deletion - root has uninc_table.
... and 'sync' got rid of that and left 44 credits.

Also have data blocks for inode 50 55 72 73 74 with 2 credits of 74.
   50 link
   55 zfile
   72 long84
   73 long85
   74 adir
These seem to be the files that used data-in-the-inode.
They still have a refcnt of 1 (or 2 for adir).
... OK, that's gone now. I found a refcount leak.

So now: 42 Credits in Index Dump. No stray files.
   df: tot=4608 free=4597 avail=3045(4130-1085) cb=8 pb=0 ab=3
So we still seem to have 1085 blocks allocated. 42 are accounted for, so 1043 still missing... either we lost the count, or lost the tree.
create a tiny file, remove, and sync, now
   df: tot=4608 free=4597 avail=3018(4118-1100) cb=8 pb=0 ab=3
so I lost 15, but now 48 are in tree. Let's try again...
   df: tot=4608 free=4597 avail=3006(4108-1102) cb=8 pb=0 ab=3
and 44 in tree
and again:
   df: tot=4608 free=4597 avail=2986(4098-1112) cb=8 pb=0 ab=3
Definitely losing more than the difference in the tree.
Try creating empty files...
   df: tot=4608 free=4597 avail=2986(4098-1112) cb=8 pb=0 ab=3
   df: tot=4608 free=4597 avail=2974(4088-1114) cb=8 pb=0 ab=3
   df: tot=4608 free=4597 avail=2954(4078-1124) cb=8 pb=0 ab=3
   df: tot=4608 free=4597 avail=2942(4068-1126) cb=8 pb=0 ab=3
   df: tot=4608 free=4597 avail=2922(4058-1136) cb=8 pb=0 ab=3
   df: tot=4608 free=4597 avail=2910(4048-1138) cb=8 pb=0 ab=3
   df: tot=4608 free=4597 avail=2890(4038-1148) cb=8 pb=0 ab=3
very strong pattern there.
What about 2 files at a time.
   df: tot=4608 free=4597 avail=2879(4028-1149) cb=8 pb=0 ab=3
   df: tot=4608 free=4597 avail=2860(4018-1158) cb=8 pb=0 ab=3
   df: tot=4608 free=4597 avail=2849(4008-1159) cb=8 pb=0 ab=3
   df: tot=4608 free=4597 avail=2830(3998-1168) cb=8 pb=0 ab=3
   df: tot=4608 free=4597 avail=2819(3988-1169) cb=8 pb=0 ab=3
Slightly different pattern - not as bad. Have to try 4 now.
   df: tot=4608 free=4597 avail=2802(3978-1176) cb=8 pb=0 ab=3
   df: tot=4608 free=4597 avail=2793(3968-1175) cb=8 pb=0 ab=3
   df: tot=4608 free=4597 avail=2776(3958-1182) cb=8 pb=0 ab=3
   df: tot=4608 free=4597 avail=2767(3948-1181) cb=8 pb=0 ab=3
Strange, isn't it....

Making sure we clear UnincCredit... result looks worse.

26th February 2009

I fixed up the credit accounting in 'incorporate' and then fixed a couple more little bugs. And now:
   ====================48 credits ==============================
   df: tot=4608 free=4597 avail=3172(3940-768) cb=10 pb=0 ab=1
So we still have 720 allocated credits that aren't accounted for. But we are nicely under 100...
.... and now
   ====================76 credits ==============================
   df: tot=4608 free=4256 avail=2160(2402-242) cb=350 pb=0 ab=2
That is different. The count of missing blocks is way down, but there is some extra cruft in the index tree.
Quite a few like
   0 DATA (1) 303 [0]L Leaf1(13) SegRef,Claimed,PhysValid
   0 DATA (1) 302 [0]L Leaf1(14) SegRef,Claimed,PhysValid
and even one
   0 DATA (2) 330 [0]L Leaf1(1) SegRef,C,CI,Claimed,PhysValid 330 1 (1) 0 [0]{0,00000000} [0, 0]L Leaf1(0) Index(1),Pinned,Phase1,InoIdx,SegRef,C,CI,CN,CNI,PhysValid,Prealloc

Time for a commit though.... and now
   ====================46 credits ==============================
   df: tot=4608 free=4257 avail=4253(4458-205) cb=350 pb=0 ab=1
so the strays in the index tree are gone, but we still have 159 outstanding credits.

Now change but now
   ====================36 credits ==============================
   df: tot=4608 free=4256 avail=3787(3885-98) cb=350 pb=0 ab=2
That is a little weird... Hmmm.
back to
   ====================48 credits ==============================
   df: tot=4608 free=4257 avail=4247(4458-211) cb=350 pb=0 ab=1
Oh well.
   ====================34 credits ==============================
   df: tot=4608 free=4257 avail=3176(3373-197) cb=350 pb=0 ab=1

It seems that the unaccounted blocks are (or can be) created by writing to a file then removing the file without a sync.
..but why is cb (cblocks_used) so high?

27th February 2009

Got onto a bit of a tangent...
What happens if we truncate a block while it is on a list to be cleaned? Clearly we want the cleaner to drop it ASAP. But what if invalidate_page wants to drop it *now*?
Hopefully it is either still on clean_leafs and we can remove it, or it is now iolocked and we can wait for it. So should be OK.

I keep getting caught in "looping on..."
We are truncating an inode and some index block which is now empty is not getting removed from the tree because there is an outstanding reference.... 327/0 depth=1. I guess I turn on the tracing.
... and it seems that it is in the process of checkpointing. I guess I need to lock against that ... maybe with the iolock.

   Credits = -1, rv=2 ib = [ce814e40]328/0(2552)r3:Index(1),Pinned,Phase1,Valid,Dirty,CI,CN,CNI,UninCredit,PhysValid,Uninc{0,0}[0]
   ------------[ cut here ]------------
   kernel BUG at /home/neilb/work/nfsbrick/fs/module/segments.c:371!

-------
Every time I create/delete a file, I get an extra 'ab' which disappears on 'sync'.
ablocks_used is:
   decremented when +ve summary_update on non-index
   increased on lafs_summary_allocate... should not be done for index blocks.

OK: after test run, filesystem is empty, but cblocks_used is around 360.
cblocks_used:
   is loaded at mount time
   collects pblocks_used on a phase flip
   is updated in lafs_summary_update (unless pblocks is)
So we must be missing a lafs_summary_update when phys->0

Lots of problems:
   truncating big (multi-level index) seems to be bad
      Leaves 'pb-338 !!! and cb+689, even after sync.
   still 'looping on' occasionally
   Haven't found cblocks_used leak yet.
   Occasionally non-B_Valid blocks are acted on.
I think I need to improve io locking.

---------------
1st March 2009

Need some improvements to iolock locking.

We use this lock to wait for a block to be written out (if that is happening) before we allow lafs_invalidate_page to complete.
It is also used in lafs_erase_{d,i}block (Similar purpose)
We take the lock in lafs_cluster_allocate, and then make sure the block is still dirty.
Also lock in lafs_new_inode as initing the inode is a form of IO ??
load_block takes the lock
We only clear_bit(B_Valid, ) under this lock.

So the issue is this:
  A block that is going to be written is passed to lafs_cluster_allocate. This happens either after taking it off a _leafs list, or when lafs_writepage requests the write.
  lafs_invalidate_page needs to be able to release the page, so there needs to be no transient references. In particular, once the block has been removed from a _leafs list it must already be iolocked. Invalidate_page can then either remove from that list and erase the block, or use io_lock_block to wait for the IO to complete.

  So when a datablock comes out of get_flushable it must be iolocked, and must remain iolocked until after Dirty and Alloc are clear.

  Index blocks belong entirely to the fs, so we can be more relaxed with them.

  If get_flushable finds the block already iolocked, it is either being invalidated or already has IO pending, so it can be dropped.

16th March 2009

FIXME When we sync a small file, we just write out the inode. rollforward currently ignores data in inodes I think. That needs to be fixed to ensure this data is safe.

 - stop iblock from disappearing so much. - I think...

While cleaning a file, I truncate it. This makes it appear to fit in the inode but it is very big and we get confused.
We cannot allocate block 0 until all the others have been allocated to 0 and forgotten.
But what if we truncate a file to 10 bytes, then fsync? We need to write the data promptly, but we like doing truncate in the background.
When we extend a file we already need to wait for truncation to complete (FIXME do we do that?)
We could wait on fsync too.
We cannot just delay block0 as it might be part of a checkpoint that has to complete promptly while truncation can take a long time.
i.e. we have a very large file. We update the first byte, then truncate to 2 bytes.... we don't need to write until fsync which will wait...
Directory?? delete lots of entries so it shrinks to one block? There is no delayed truncate there.
?? Never clean an I_Trunc file.
If we try to allocate a file with other indexes:
   clear Realloc
   if Dirty and Pinned, just do normal alloc
   if Dirty and not pinned, skip.

Sometimes I run out of credits while truncating a file. I need credits - maybe only briefly - to dirty the index blocks.
 -- FIXED I think.

An indexblock remains pinned while the refcount is non-zero.
A pinned index block can be on a _leaf lru.
The _leaf lru holds a refcount.
This is an awkward referential loop.
We break it at checkpoint time with special code in phase-flip. But there are other awkward times such as truncate.
We cannot use PinPending like we do with data blocks because there could be multiple pending Pins (from different children).
We could possibly treat checkpoint_lock like pinpending, but that might be racy.
We could not count the _leaf lru, but that might just make the race harder to find.

I think we want to explicitly drop the pin when we truncate a block.
Normally, once we Pin an index block it will become dirty so we don't want to de-pin before a checkpoint anyway...

Just to clarify: an index block gets dePinned:
  - during checkpoint on a phase_flip if it is no longer dirty etc
  - on truncation when we erase it
  - during pre-emptive write-out which is a bit like an early phase_flip
    not sure that we implement that one yet.

17th March 2009

Deadlock?
  - checkpoint calls incorporate calls erase_iblock calls iolock_block
  - rm calls orphan_pin calls phase_wait

 The problem is in lafs_incorporate. It expects the block to be iolocked, but can call erase_iblock which tries to get an iolock itself...
 ...fixed that and it still happens.
   checkpoint calls phase_flip calls allocated_block (on uninc list) calls iolock_block before calling incorporate
 Maybe all of these should assume an IO lock.

FIXME truncate assumes truncate-to-zero. We need proper ftruncate support.

It nearly works....
Things to do: - sort out individual patches and review DONE - allow compilation without refcount tracking DONE - don't hold a 'leaf' reference. NO - clean up *ref calls - differentiate those that can be called when zero DONE - use enum for B_* DONE - support truncate to non-zero offset DONE - "looping on" found an 'OnFree' block! - clean out lot of debugging

Hmmm.... deadlock.
 rmdir is holding i_mutex and waiting for a phase change to pin a dblock.
 checkpoint is also holding i_mutex.. or is trying to get one in lafs_cluster_allocate. Not cool.
 i_mutex must not be taken by checkpoint.
 Fixed that, though it is a bit of a hack....

New deadlock:
 checkpoint calls phase_flip which calls allocate_block, to move the uninc_next across, and that tries to iolock the parent to perform a partial incorporation. But that seems to be iolocked.
 Generally that is ugly as ->uninc_next might be very long and require multiple splits, and direct-driving that from phase_flip is bad. I should just move the list across.

19th March 2009

Spent too long trying to remove refcount held by *_leaf lists. This leaves InoIdx block with zero refcount so Data block can get lost and bad things happen.
I might be able to fix it up, but it is probably better to try the checkpoint_lock approach if I can only remember what that is.

Locking:
 Available locks:
  Spin:
    lafs_hash_lock
        Used in: lafs_shrinker lafs_refile ???
        Protects: ib->hash
                  ->lru when on freelist
    i_data.private_lock
        Used in: lafs_shrinker
        Protects: ->iblock / refcnt
                  ->dblock / my_inode
                  ->children / ->parent within an inode
                  setting ->private
    fs->alloc_lock
        fs->allocate_blocks
    fs->stable_lock
        segsum hash table
        segsummary counters (in blocks)
    fs->lock
        _leafs lru
        ->pending_blocks lru - should this be wc->lock ??.. not in 'bh'
        Pinned consistent with lru
        ->checkpointing / ->phase_locked
        fs->pending_orphans
        ->uninc and ->chain ?? Should use parent->B_IOLock ??
        uninc_table - should use B_IOLock
        free list / clean list
        segtrack
  Mutex:
    fs->wc->lock
        wc[0] .. something in prepare_checkpoint
        ->remaining etc
        cluster_flush
        mini blocks
    i_mutex
        inode_map
        orphans
  Other:
    B_IOLock
        erase_block
        incorporate
        cluster_allocate
        allocated_block
        IO
        Phase flip
        Initialising new inode
    B_IOLockLock
        IOLock across a page

--------------------
This is a list from 18 months ago, with updates

 - Understand how superblock 'version' should be used.
 - Review and fix up all locking/refcounts. See locking.doc
   Also lock inode when copying in block 0 and probably when calling lafs_inode_fillblock (??)
 - lafs_incorporate must take a copy of the table under a lock so more allocations can come in at any time.
 - We don't want _allocated to block during cluster flush. So have a no-block version and queue blocks on ->uninc if we cannot allocate quickly. Find some way to process those ->uninc blocks.
 - Use above for phase_flip so that we don't need to _allocated there.
 - Utilise WritePhase bit, to be cleared when write completes. In particular, find when to wait for Alloc to be cleared if WritePhase doesn't match Phase.
     - when about to perform an incorporation.
 - make sure we don't re-cluster_allocate until old-phase address has been recorded for incorporation.
 - allocate multiple WAIT_QUEUE_HEADS for 'block_wait'
 - Can inode data block be on leafs while index isn't, what happens if we try to write it out...
 - If InoIdx doesn't exist, then write_inode must write the data block.
 - document and review all guards against dirtying a block from a previous phase that is not yet safe on storage. See lafs_dirty_dblock.
 - check for proper handling of error conditions
     b/ if lafs_seg_dup fails in do_checkpoint, we need to abort the snapshot.
 - review checkpoint loop. Should anything be explicit, or will refile do whatever is needed?
 - Waiting. What should checkpoint_unlock_wait wait for? When do we need to wait for blocks to change state? And how?
 - load/dirty block0 before dirtying any other block in depth=0 file
 - use kmem_cache for 'struct datablock'
 - indexblock allocation.
     use kmem_cache
     allocate the 'data' buffer late for InoIdx block.
     trigger flushing when space is tight
     Understand exactly when make_iblock should be called, and make it so.
 - use a mempool for skippoints in cluster.c
 - Review seg addressing code in cluster.c and make sure comments are good.
 - consider ranges of holes in pending_addr.
 - review correct placement of state block given issues with stripes.
 - review segment usage / youth handling and make a todo list.
     a/ Understand ref counting on segments and get it right.
 - Choose when to use VerifyNull and when to use VerifyNext2.
 - implement non-logged files
 - Store accesstime in separate (non-logged) file.
 - quotas. make sure files are released on unmount.
 - cleaner. Support 'peer' lists and peer_find. etc
 - subordinate filesystems:
     a/ ss[]->rootdir needs to be an array or list.
     b/ lafs_iget_fs needs to understand these.
 - review snapshots.
     How to create
     how they can fail / how to abort
     How to destroy
 - review unmount - need to clean up checkpoint thread cleanly - be sure it has fully exited.
 - review roll-forward - make sure files with nlink=0 are handled well. - sanity check various values before trusting clusters.
 - Configure index block hash_table at run time based on memory size??
 - striped layout.
     Review everything that needs to handle laying out at cluster aligned for striping.
 - consider how to handle IO errors in detail, and implement it.
 - consider how to handle data corruption in indexing and directories and other metadata and guard against problems (lot of -EIO I suspect).
 - check all uninc_table accesses are locked if needed.
 - If a datablock is memory mapped writeable, then when we write it out, we need to either fill up its credits again, or unmap it.
 - Need to handle orphans asynchronously.
 - support 'remount'
 - implement 'write_super' ??
 - pin_all_children has horrible gotos - remove them.
 - perform consistency check on all metadata blocks read from disk e.g. don't assume index blocks are type 1 or 2.

23rd March 2009

 + looking at cleanup for unmount.
 - various more refcounts fixed up
 - B_SegRef is never dropped! and we take a ref on a segment when we start a cluster on it, but never drop that reference.
   THIS is next thing - review all setting and clearing of B_SegRef.

30th March 2009

 - SegRef and lafs_reserve_block... There is room for recursion here, I need to be careful.
   To dirty a data block, all parent index blocks must be Pinned and must be able to be written. That means their segusage blocks must be available for update. And Pinning a segusage block for update requires all its parents.
   So the segment for the block, the indexes, and the segusage and indexes and so on must all be pinned.
   When we pin a block, we do it from the root down to avoid recursion.
   We probably want whatever reserve_block calls to return an unreserved block rather than call reserve_block itself.

   When do we clear SegRef?? We set it when Pinning, so I guess we clear it when unpinning.
      pin_dblock, mark_cleaning, prepare_write, truncate seg_move clean_free
   Well, it is really when Pinning, or Dirtying or Reallocing.
   So we clear when unpinning, or when a dblock gets written...
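   The "pin from the root down" idea above can be sketched without recursion: walk up the ->parent chain once to collect the ancestors, then pin them in root-first order. This is a minimal sketch under assumptions: struct toy_node, TOY_MAX_DEPTH and toy_pin_from_root are hypothetical names, the depth bound is made up, and the segusage dependency described above is not modelled at all.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_MAX_DEPTH 16   /* assumed bound on tree depth, for the sketch only */

/* Toy stand-in for a block in the index tree. */
struct toy_node {
	struct toy_node *parent;  /* NULL at the root */
	bool pinned;
	int pin_order;            /* order in which it was pinned, for illustration */
};

/*
 * Pin 'leaf' and every ancestor, root first, without recursion.
 * '*seq' is a running counter used to record the pinning order.
 * Returns the number of nodes newly pinned, or -1 if the chain is too deep.
 */
static int toy_pin_from_root(struct toy_node *leaf, int *seq)
{
	struct toy_node *chain[TOY_MAX_DEPTH];
	int depth = 0;
	int newly_pinned = 0;

	assert(leaf != NULL && seq != NULL);

	/* Walk up to the root, remembering the path. */
	for (struct toy_node *n = leaf; n != NULL; n = n->parent) {
		if (depth == TOY_MAX_DEPTH)
			return -1;
		chain[depth++] = n;
	}

	/* Pin from the root (last collected) down to the leaf. */
	for (int i = depth - 1; i >= 0; i--) {
		if (chain[i]->pinned)
			continue;
		chain[i]->pinned = true;
		chain[i]->pin_order = ++*seq;
		newly_pinned++;
	}
	return newly_pinned;
}
```

   Because every parent is already pinned by the time a child is reached, pinning a child never has to call back up the tree.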
   Maybe just when we lose ->parent

6th April 2009

 - sometimes segsum counter goes to zero for random data block
     Something is going wrong in roll-forward. The block looks transiently valid so doesn't get read, but has no good data in it.
 - After deleting a directory, the block might still have incorporation to happen, but is not marked dirty
 - at unmount, there are various blocks that are still dirty.
 - sometimes hit BUG_ON(credits==0) line 1196 in cluster.c(cluster_flush)

12th April 2009

 - that rollforward problem above:
   When rolling the checkpoint, if we find segusage blocks we want to include them directly into file. But by pinning the block we might preread a segusage block.. but we must be sure not to update it.
   So during the early stages of rollforward while still in the checkpoint, seg_inc must be called with in_phase == 0. so seg_move is called with phase != qphase. ditto for summary update.
   So the block must be pinned to the previous phase...
   Normally 'phase' changes at checkpoint-start, qphase changes at checkpoint-end.
   So we probably want to start with qphase being 0 and phase being 1. When we reach the end of the checkpoint, we flip qphase to 1.

 - blocks still in phase_leafs at unmount:
   After we force a final checkpoint we still have
   Pinned:
      root InoIdx
      ino==8 InoIdx due to Dirty block0
      ino=16 InoIdx due to dirty block0
   and dirty: inode block 1, inode usage map 2, root directory 8, orphan 16 seg usage

   Problems:
      inode blocks dirty but not pinned? No InoIdx...
      Segusage dirty - probably by seg_apply_all - disable that at umount
      orphan dirty ??... but not pinned! This is possible - we don't pin for clearing entries, just for setting.

   The inode problem stems from the datablock being dirty while the InoIdx block isn't. That is, at best, confusing.

13th April 2009

 segusage blocks aren't being pinned. They need to be pinned whenever dirty.
 and youth blocks aren't even made dirty sometimes. They need to be pre-pinned in many cases.
 So: segusage gets changed when we write out a cluster, and when we delete/relocate blocks.
 In the first case we pin the block when it becomes part of the free list, and need to keep it pinned across checkpoint changes.
 In the second, we pin when the block is dirtied and again must keep it pinned.

 Youth gets changed when a segment becomes free and again when we allocate a segment to it.

 Keeping a datablock pinned across checkpoints is awkward - we currently need to repin for each dirty...
 I guess we can re-pin for each checkpoint in lafs_seg_apply_all. That might work for segusage, but not for youth!

 If segsum for ssnum==0 held a reference to the youth block, that might help. Segstat on 'clean' or 'free' would imply a reference to that segsum.

 Is it OK to keep all youth/usage blocks for free/clean blocks pinned?
 We can currently have 810 entries. Only half will be clean/free. For each entry there can be two blocks, youth and usage. So that could be 810 blocks. 1Meg? Normally much less.
 If it became a problem we could reduce the number dynamically I guess.

 maybe segusage blocks need to get phase_flipped, as other blocks do depend on them, pin_all_children wouldn't be able to find them though..

 1/ Any address on 'clean' or 'free' segtrack implies a refcount on the Youth block.

14th April 2009

 I think I want to link dirty block to the space in free segments that we actually know about. Each of those segments has youth and usage blocks pinned (at least parent pointer is active). So we have everything we need to write everything that is dirty.
 So 'free' or 'clean' implies a segsum reference which holds youth block.
 When we get low on space, we wait for cleaning/finding to progress.
 This would limit us to 400 segments, say 16Meg each, so 6Gig of dirty memory. I guess that we need to scale the 'free' list based on available memory (FIXME).

 When cleaning needs a segment, it needs to load the usage blocks for other snapshots too.
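 The rough numbers above (810 entries, 400 segments of 16Meg) can be checked with a few lines of arithmetic. This is only a sketch: the helper names are hypothetical, and the 1K block size is an assumption carried over from earlier in these notes.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed for illustration: 1K blocks, as used elsewhere in these notes. */
#define TOY_BLOCK_BYTES 1024ULL
#define TOY_MIB (1024ULL * 1024ULL)

/*
 * Blocks kept pinned if half of 'entries' are clean/free and each such
 * entry pins 'blocks_per_entry' blocks (youth + usage = 2).
 */
static uint64_t toy_pinned_blocks(uint64_t entries, uint64_t blocks_per_entry)
{
	assert(blocks_per_entry > 0);
	return (entries / 2) * blocks_per_entry;
}

/* Bytes of dirty data that fit in 'segments' segments of 'seg_mib' MiB each. */
static uint64_t toy_dirty_limit_bytes(uint64_t segments, uint64_t seg_mib)
{
	return segments * seg_mib * TOY_MIB;
}
```

 toy_pinned_blocks(810, 2) gives 810 blocks, which at 1K each is 829440 bytes, so "1Meg" is a round-up. toy_dirty_limit_bytes(400, 16) gives 6710886400 bytes, i.e. 6400 MiB or about 6.25 GiB, which matches the "6Gig" estimate.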
 When cleaning in the presence of snapshots we need to be careful never to duplicate a block that is shared. To allow for v.many snapshots, we don't even want to duplicate in memory.
 So we need to choose a 'primary' copy - probably first one found - and follow the peers link when possible...

18th April 2009 (continuing).

 So clean and free segments in the list carry a SegRef.
 But it could be excessive if all of them did - we shouldn't be required to pin more data than we need.
 So for segments with a usage of 0, we use the score to record if a segref is held. 0 means 'no', 1 means 'yes'.
 When space_alloc wants more space we need to find an entry and segref it.
 Maybe we want free lists - reffed and not-reffed.

 Then again, SegRefs are fairly cheap as they are heavily shared. maybe 512 to a block. If we hold 400 refs they could easily all be in one block. We could possibly encourage this by sorting the list and discarding from one end if it is too full.
 Sorting is a good idea definitely. It keeps youth/usage updates together.

 Just check the numbers. a 1TB device with 1K blocks might have 32M segments of which there would be 32768. 512 per block means 64 blocks or 16 pages (64K). So total segusage files is 128K plus snapshots. Not worth worrying about surely.
 For 16TB, that is 2Meg plus snapshots.

 So
  - keep a SegRef for all free and clean blocks. This must include a youthblk reference.
  - sort the free list when 'clean' is merged or when a pass finishes.
      sort clean list
      fix youth value
      merge as many as fit into free
      sort

 How is the code flow...
  add_cleanable is called during the periodic scan. It could hold a SegRef easily.
  add_cleanable calls add_clean as does lafs_get_cleanable during clean. That might block getting a segref, might even deadlock?
  add_free is also called by seg_scan
  So seg_scan should get a segref and leave it with everything!

 BUT.....
  A SegRef implies a 'struct segsum' for each segment.
  We don't want to allocate one of these for every segment in the table. We only want a reference to the youth and segusage block, which are heavily shared.
  But these blocks need to be Pinned and SegReffed etc so we can write them at any time.

20th July 2009

 The refcount held by the 'leaf' lru is a problem. While it holds a count we do not unpin an index block, so it cannot be removed from the list. Thus we can only remove from the leaf lru on a phase change..... Or when doing lru based flushing...
 Maybe we can remove from the lru while holding the checkpoint lock. This happens when truncating..
 No, that is just too messy as it is too easy to get put back on the list.
 Maybe the leaf lru should not imply a reference count ... or maybe we need to split the refcount: 'inuse' and 'active'.....
 How about we test refcnt against list_empty(->lru)...

 ....
 During truncate, we need each index block to get unpinned so they can all be cleaned up. But the InoIdx block is held pinned by the inode block being dirty. In this particular case, the InoIdx block is Invalid as the file is empty.
 But.... InoIdx should always be valid until after Inode is destroyed??

 umount
  I need to stop the cleaner and flush everything before trying to clean up. This is awkward though. The 'sync' of umount is done by kill_block_super, but I call that rather late, after checking that the tree is empty.

  There are pinned/dirty bits left after sync that we want to magically clean. We have:
    - segusage/youth blocks. Maybe if we don't seg_apply_all...
    - orphan block. Maybe don't mark it dirty when we remove things?
    - inode map?? why is that dirty
    - root directory is dirty still?? But it has been erased.
         InoIdx is valid-but-empty.
         Inode Data is dirty
         Data block 0 is Dirty at block 0.
  ......
  Ahh... need to mark page dirty when block is marked dirty !!

  The seg usage blocks are now flushed out but not incorporated. I feel that might be correct - we don't want to care about incorporation as we will never use it.
  For this, segusage and quota are very special cases.
  Inode map is no longer dirty, but is pinned
  Orphan does have a dirty block still
  The orphan table contains the root directory.
  root is now clean and gone

  Segusage doesn't get incorporated after last checkpoint now so that is better.
  But now we have a circular reference for SegRef. This should not be surprising given the circular problems we had setting SegRef. I guess we just erase the references in the segsum table...

22nd July 2009

 Hurray!!! I can unmount without crashing!
 Now I need to sort through all the fixes required to achieve that and make discrete patches, and be sure it is all OK.

 DONE - (block.c) lafs_get_block should not have to lock that page just to do a lookup.
 DONE - (block.c) Mark page dirty when block becomes dirty
 DONE - (checkpoint.c) print orphan_slot with Orphan flag
 DONE - Don't incorporate segcount etc after final checkpoint
 DONE - Don't apply seg changes after final checkpoint.
 DONE - Don't start opportunistic checkpoint after final.
 DONE - (checkpoint) if InoIdx isn't dirty but InodeData is, then still allocate
 DONE - (checkpoint) when waiting, wait for checkpointneeded to get cleared
 DONE - (cluster) be more flexible about credit usage when flushing InoIdx
 DONE - (dir) do add_orphan when we abort as well as on success
 DONE - use inode_dec_link_count, not i_nlink--
 DONE - (file.c) lafs_writepage: remove from leafs when we cluster_allocate
 DONE - change %d/%d to strblk
 DONE - (index.c) refile: IF B_IOLOCK, then it isn't on LRU
 DONE - (index) refile: when unpinning, remove from lru
      - lafs_refile: ->iblock can be non-null for inode 0.
 DONE - Make sure I_Deleting gets cleared when deleting finished.
 DONE - phase_flip should have something separate to call, not lafs_allocated_block
      - inode.c: lafs_dirty_inode: getref_lock used to get dblock
 NONO - ?? getref_locked allowed if PagePrivate
 DONE - segment: lafs_seg_put_all needed at unmount
 DONE - segdelete_all: need to put intable references
 DONE - lafs_free_get: put the intable references
 DONE - lafs_get_cleanable: put the intable references
 DONE - fix sort splitting in add_cleanable
 DONE - add lafs_empty_segment_table for unmount
 DONE - lafs_release: flush all dirty blocks
 DONE - lafs_release: force a final checkpoint
 DONE - lafs_release: move kill_block_super before final check
 DONE - lafs_put_super: release orphans and segsum files.
 DONE - lafs_destroy_inode: putref should be 'iblock'
      - lafs_destroy_inode: allow for iblock to be present but no ref held....
 DONE - can roll forward call lafs_allocated_block without dirty???

27th July 2009.

 - I've re-arranged lafs_release so that the flush is all done in generic_shutdown_super. However it calls invalidate_inodes, and that has problems with pinned inodes.
   So we need for fsync_super to checkpoint out all inodes that we don't hold our own reference to.
   If we do hold a reference, then invalidate_inodes will skip them, and ->put_super can be used to drop the references and perform the final checkpoint.
   fsync_super calls ->sync_fs after syncing all files. Maybe I can do some sort of checkpoint there...
   There almost is a checkpoint in there.... But only when called without 'wait'....

   I need to understand 's_dirt'.
   This is controlled entirely by the filesystem, common code only examines it. If it is set:
      file_fsync (the generic 'fsync' method) will call ->write_super
      fsync_super will call write_super
      generic_shutdown_super will call write_super
      sync_supers will call write_super
      sync_filesystems(0) will call ->sync_fs

   sync_fs is called:
      twice from 'sync', once with '0', once with '1' for 'wait'. (though in emergency_sync, both are '0').
      once from unmount and remount with 'wait' set to '1'.

   We don't want two checkpoints for a 'sync', but we want to start on 'wait=0'.
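   The sync_fs calling pattern above (wait=0 then wait=1 from 'sync', a lone wait=1 from unmount) can be modelled in a few lines. This is a hypothetical sketch, not the LaFS code: struct toy_fs, the toy_* functions and the started_by_nowait flag are all made-up names, "waiting" simply completes the checkpoint, and there is no locking, so it says nothing about parallel callers.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy filesystem state; every field here is an assumption for illustration. */
struct toy_fs {
	bool started_by_nowait;    /* a wait=0 call already started a checkpoint */
	bool checkpoint_running;
	int checkpoints_started;
	int checkpoints_completed;
};

static void toy_start_checkpoint(struct toy_fs *fs)
{
	if (!fs->checkpoint_running) {
		fs->checkpoint_running = true;
		fs->checkpoints_started++;
	}
}

/* Stand-in for blocking until the running checkpoint finishes. */
static void toy_wait_checkpoint(struct toy_fs *fs)
{
	if (fs->checkpoint_running) {
		fs->checkpoint_running = false;
		fs->checkpoints_completed++;
	}
}

/* Model of ->sync_fs: start on wait=0, only wait (no second start) on wait=1. */
static void toy_sync_fs(struct toy_fs *fs, int wait)
{
	assert(wait == 0 || wait == 1);
	if (!wait) {
		toy_start_checkpoint(fs);
		fs->started_by_nowait = true;
		return;
	}
	if (!fs->started_by_nowait)
		toy_start_checkpoint(fs);   /* e.g. unmount: nothing started yet */
	fs->started_by_nowait = false;
	toy_wait_checkpoint(fs);
}
```

   In this model a 'sync' (wait=0 followed by wait=1) costs one checkpoint, and a lone wait=1 call still gets its own.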
Maybe if we get called with '0', we set a flag and treat the '1' differently.. There is no locking to make this really safe, but it will probably be OK... I could take a process_id, but then parallel 'sync's could race.

write_super is called before the syncs. So it could start the checkpoint, and sync could wait for it.

write_super is called multiple times at shutdown. We really need to utilise s_dirt to avoid some of these. We set s_dirt to 0 when we set CheckpointNeeded, and set it to 1:
 - when we pin a dblock or dirty a this-phase iblock.

29jul2009
At unmount, we iput the root inode which de-references the dblock before clearing ->iblock, which fails an assertion ... why?
Apart from the shrinker, ->iblock is only set to NULL in refile when we find an I_Destroyed inode... I guess the root block isn't getting Destroyed...

The protocol for freeing iblocks is bad. Should be:
 - it only gets freed by the shrinker
 - when inode dies, set ->inode to NULL
 - when InoIdx iblock dies, set ->iblock to NULL
...???

30Jul2009
So, what exactly is the protocol?
 - index blocks live either in the parent/sibling tree, or on the inode's free_index list
 - when refcnt is 0, they live on 'freelist.lru'. When refcount is elevated they stay on lru until they need to be added to some other lru (leafs or cluster)
 - when shrinker finds block on freelist.lru with non-zero refcnt, it just removes from lru
 - when shrinker finds free block, it removes from free_index and discards the block
   FIXME can refcnt=0 still have Pinned,Uninc,Realloc,Dirty ?? I think not, as such a block would either have children or be on an lru
 - When we destroy an inode, all index blocks get disconnected from the inode and freed. This must include the ->iblock
 - When an index block becomes free due to index tree shrinkage, we set the ->depth to -1 so that it cannot be found by mistake, and leave it for shrinker or inode destruction.

Confused about inode<->dblock dependence.
We don't want the inode to refcnt the dblock as that wastes space. We don't want the dblock to refcnt the inode as that stops it from being freed. So each must disconnect from the other when freed.
What locking?
 inode takes private_lock, then checks dblock
 dblock cannot take private_lock before checking ->my_inode..
Maybe: destroy_inode takes ref on dblock, then sets I_Destroyed, then drops ref

1Aug2009.
Tracking down the 'credit' count and making sure it stays correct.
It seems that I have a Dirty InoIdx block which is not pinned. Due to this it has no refcount and so the data block disappears so the InoIdx block is not visible in the tree. This isn't a definite bug but it means I cannot count credits properly. And surely Dirty index blocks must always be pinned!!??

When a small file is flushed to the inode we were dirtying the iblock. That seems wrong - should dirty the dblock? Need to check that is valid.

I got a hang in 'rm adir/4'.
 rm is in lafs_cluster_update_commit_both getting a mutex.
 cleaner is in lafs_do_checkpoint+0xe4
 pdflush is in writepage/lafs_cluster_flush waiting on a lock
so I guess cleaner is holding a mutex and waiting for something that won't happen?

Hang again at 'seq 1 200' in 'cd /mnt/1/adir'.
cleaner is at some point, holding a mutex to stop 'sh'.
0e4 == 228
ahh.. prepare checkpoint holds wc[0].lock while waiting for checkpoint to be allowed. So when something locks the checkpoint and needs to flush, we have problems....

I seem to have fixed the above. Now:
Free space is a real problem. When I remount after the successful unmount, we find a usage pattern like:
 CLEANABLE: 0/0 y=10 u=34179
 CLEANABLE: 0/1 y=0 u=65144
 CLEANABLE: 0/2 y=0 u=65535
 CLEANABLE: 0/3 y=32773 u=32910
 CLEANABLE: 0/4 y=32772 u=149
 CLEANABLE: 0/5 y=0 u=0
 CLEANABLE: 0/6 y=32770 u=16529
 CLEANABLE: 0/7 y=32769 u=35084
 CLEANABLE: 0/8 y=32768 u=31877
Which is ridiculous.

Better fix up what I have first...
...
In rm /mnt/1/nbfile* we hang..
rm is in lafs_phase_Wait from pin_dblock in unlink wait for [ce5c2d20]277/0(0)r2F:Index(1),Pinned,Phase0,InoIdx,Valid,Dirty,CN,CNI,UninCredit,PhysValid{0,0}[8] pindb(1) leaf(1) cleaner is in lafs_iolock_block from add_block_address in phase_flip iowait for [ce5c33b0]286/0(0)r6E:Index(1),Pinned,Phase1,WPhase0,InoIdx,Valid,Dirty,CN,CNI,UninCredit,IOLock,PhysValid{0,0}[1] child(5) leaf(1) So cleaner is probably deadlocking against itself via iolock_block. This is taken: - in lafs_invalidate_page just to wait for any io - it isn't held long - in lafs_erase_dblock while we erase and 'allocated_block' - in lafs_get_flushable to protect blocks being checkpointed - in lafs_writepage to call cluster_allocate (which releases), both for data block or for inode when data was flushed there. - lafs_add_block_address to process pending incorporations to make room. This is what is trapping the cleaner. - lafs_inode_handle_orphan when truncate finishes to erase_iblock - lafs_inode_handle_orphan again to incorporate all removal - and again to erase_iblock - and for partial truncate to incorporate some removals - and again.... - lafs_new_inode to keep it from being cleaned while being created - roll_block to add addresses - lafs_load_block during IO So: who holds it?.... let's use the code to find out... And the answer is : lafs_get_flushable. So get_flushable iolocks the block then calls phase_flip which tries to incorporate other-phase children which try to iolock the block. Deadlock. Do we need to hold iolock during phase_flip ??. Not for all of it.. 02August2009 FIXME When erasing a block, do I need an uninc credit? I usually don't have one and the need certainly isn't as great... Now... let's try to get free space accounting right. Observed problems: - unlink sometimes failed with ENOSPC - usage scan shows segmetns with enormous usage - 23039!! 
no credits: [ce9a55cc]16/1(2651)r11E:Pinned,Phase1,WPhase1,Valid,Dirty,SegRef,C,CI,CN,CNI,UninCredit,IOLock,PhysValid intable(7) ss(3) cluster(1) no credits: [cfb695cc]16/1(1840)r12E:Pinned,Phase0,WPhase0,Valid,Dirty,SegRef,C,CI,CN,CNI,UninCredit,IOLock,PhysValid intable(7) ss(4) cluster(1) no uninc credit [cfb695cc]16/1(2019)r13F:Pinned,Phase0,WPhase0,Valid,Dirty,SegRef,C,CI,CN,CNI,IOLock,PhysValid intable(7) ss(5) cluster(1) after umount/remount df says "4608 7 1544" but cannot create anything. df: tot=4608 free=4601 avail=1544(1564-20) cb=7 pb=0 ab=0 ============= Cleanable table (7) ================= pos: dev/seg usage score 0: 0/0 1 0 1: 0/5 1 64 2: 0/6 6 384 3: 0/7 2 128 4: 0/8 3 192 5: 0/3 1 64 6: 0/2 2 128 ...sorted.... 0: 0/0 1 0 1: 0/3 1 64 2: 0/5 1 64 3: 0/2 2 128 4: 0/7 2 128 5: 0/8 3 192 6: 0/6 6 384 --------------- Free table (1) --------------- 12290: 0/4 0 0 --------------- Clean table (0) --------------- CLEANABLE: 0/0 y=10 u=1 CLEANABLE: 0/1 y=32775 u=3 CLEANABLE: 0/2 y=32774 u=2 CLEANABLE: 0/3 y=32773 u=1 CLEANABLE: 0/4 y=0 u=0 CLEANABLE: 0/5 y=32771 u=1 CLEANABLE: 0/6 y=32770 u=6 CLEANABLE: 0/7 y=32769 u=2 CLEANABLE: 0/8 y=32768 u=3 03Aug2009 Current issues: FIXED 1/ Data 16/1 is being Reallocated, but is Dirty, not Realloc Gone,presume FIXED 2/ Data 16/1 has no uninc credit in cluster_flush 3/ in cleaner, ->dblock is uninitialised.... actually inode has been free. 4/ invalidate_page find Realloc set, even after iolock .. This is during umount in generic_shutdown/lafs_put_super/iput 5/ Thoughts: If we flag a block for Realloc then Dirty before it is allocated, then all is fine. But if we have already allocated to a cleaning cluster... what happens? We need to treat this like it was dirties after being written, so it gets written to a regular cluster as well. As we only have one uninc bit for both Dirty and Realloc, we need to *not* incorporate the Realloc update if the block is still dirty. 
So:
 - block gets chosen for cleaning and allocated to a clean-cluster
 - block gets marked dirty. This must not clear Realloc
 - cluster is flushed, block is dirty, so don't call lafs_allocated_block
 - Return the Realloc credit, but keep dirty and Uninc.

Is there a race if Dirty is set after we enter lafs_allocated_block? As long as the index block gets marked Dirty, not Realloc, we might be safe... though it gets awkward if the Dirty writeout falls in to the next phase. But reserve_block will have provided NCredits for that.

So:
 1/ don't clear Realloc when setting Dirty
 2/ do clear Realloc if cleaner finds the block is Dirty
 3/ avoid calling lafs_allocated_block when cleaning a dirty block. This is an optimisation.

Almost... A B_Realloc block no longer has B_Credit so B_Dirty cannot be set.

Thoughts3.
When cleaning blocks we hold no reference to the inode and it can disappear. We don't want to hold the inode active, but need a reference much like the truncate code has. I think we need a subordinate refcount for both cleaning and truncate. These hold inode present but not active. Maybe every block->inode should be counted like this. And this might simplify the my_inode->dblock inter-relationship. For later..
We need to ensure that if a new iget is called on an inode that still exists, we don't allocate a new one but just reuse the old. But that won't work as we cannot add an inode back into the hash table. So I think when cleaning a block we need to ref the inode. i.e. B_Realloc implies an igrab

05aug2009
So I have a problem with the cleaner wanting to hold an inode that the VFS is destroying. I don't want the cleaner to hold i_count as that delays truncate etc. So we need a second counter subordinate to i_count. This is held by the cleaner and by delayed truncate, and by i_count. Possibly ->my_inode holds this, which means it can be a single bit...
When a lookup wants an inode, we need to load the inode data block and see if it has my_inode.
If it does, we insert that inode in to the hash table. If not we fall back to regular inode creation....

On reflection, that is too complicated and hard and error prone.
When relocating a file we need the data so it had best be in the page cache so the filesystem really needs to know that the inode is still active. So cleaning needs to keep a reference to the inode.
The cost of this is that if an inode is being deleted while it is being cleaned the truncate cannot happen until the cleaning completes. This means that space usage will be wrong.
When nlink becomes zero we can drop the cleaner reference. When the inode is dropped/destroyed we can tie the cleaning in with the delayed truncate so that the final destruction doesn't happen until the cleaner has let go.

So: how to track that the cleaner has a reference to the inode?
Maybe every B_Realloc block owns a ref on the inode.... but dropping those references when i_nlink hits zero would be difficult. They could hold a secondary refcount which, if non-zero, implies a ref on the inode.
So:
      - Set B_Cleaning when we look at a block for cleaning, and clear it when we find Realloc clear and ....????
      - Whenever a block has B_Cleaning set, it holds a counted reference on LAFSI(b->inode)->cleaner_ref
      - When cleaner_ref is non-zero and I_Deleting is not set, we hold a reference on the inode (igrab).
      - when i_nlink hits zero, set I_Deleting and drop any reference held by the cleaner.
 DONE - cleaner must be careful not to process any block that has been truncated, or file that is dead.
 DONE - Make sure the cleaner doesn't start up after the FinalCheckpoint.
      - What about filesystem inode... how do they fit in??

Question. When are the index blocks for an inode flushed? We need to have them gone when the inode disappears. For deleted inodes, this happens in background truncate. For memory-pressure inodes it will hopefully happen well in advance, but we need to make sure in destroy_inode that everything is written.
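
A rough user-space model of the B_Cleaning / cleaner_ref rules listed above, written in Python purely to check that the counting balances. This is a sketch under assumptions, not the kernel code: the class name, the method names and the cleaner_holds_inode flag are all hypothetical, and i_count here is a plain integer standing in for igrab/iput.

```python
class InodeModel:
    """Toy stand-in for a LaFS inode; only the counters matter here."""

    def __init__(self, nlink=1):
        self.i_count = 1                   # one ordinary VFS reference
        self.nlink = nlink
        self.cleaner_ref = 0               # blocks that currently have B_Cleaning set
        self.deleting = False              # models I_Deleting
        self.cleaner_holds_inode = False   # hypothetical: cleaner's single i_count ref

    def _revalidate(self):
        # Rule: hold an inode reference while cleaner_ref != 0 and not I_Deleting.
        want = self.cleaner_ref > 0 and not self.deleting
        if want and not self.cleaner_holds_inode:
            self.i_count += 1              # stands in for igrab()
            self.cleaner_holds_inode = True
        elif not want and self.cleaner_holds_inode:
            self.i_count -= 1              # stands in for iput()
            self.cleaner_holds_inode = False

    def set_cleaning(self):
        # A block gains B_Cleaning: it takes a counted ref on cleaner_ref.
        self.cleaner_ref += 1
        self._revalidate()

    def clear_cleaning(self):
        # A block loses B_Cleaning: it drops its counted ref.
        if self.cleaner_ref == 0:
            raise ValueError("B_Cleaning cleared more often than set")
        self.cleaner_ref -= 1
        self._revalidate()

    def unlink(self):
        if self.nlink == 0:
            raise ValueError("nlink already zero")
        self.nlink -= 1
        if self.nlink == 0:
            self.deleting = True           # set I_Deleting ...
            self._revalidate()             # ... and drop the cleaner's inode ref
```

In this model any number of B_Cleaning blocks cost only one i_count reference, and that reference goes away as soon as nlink reaches zero, even though cleaner_ref itself is still non-zero until the blocks are released.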
- FIXME Thinking again about B_Cleaning, any B_Realloc block will hold a reference through to InoIdx and so dblock will be present and the inode won't be freed. So we only need an extra reference during the first little phase of cleaning when we are collecting blocks. After that a reference can be useful as it will delay flushing so it can be more efficient... Maybe this is all much simpler than I thought. If we hold a ref on the inode whenever the InoIdx block is Pinned and i_nlink is non-zero, then we won't be forgotten until all index blocks are written. We may still be deleted, but as that is one-way we can hold on to the inode at little cost. getting/putting that ref at exactly those times turns out to be messy. It might be best to have a flag to say "We hold an extra ref". Then we occasionally call a function that validates the setting. It is most important to drop the count at the right time, so after unlink/rmdir/rename and when B_Pinned is dropped. B_Pinned is set in: set_phase which is called from: lafs_cluster_allocated when moving 'pin' across to data block so don't need checkpin lafs_pin_block_ph only need check_pin if dropping spinlock pin_all_children only pins data blocks (Index are already pinned if relevant). grow_index_tree where "inoidx block pinning" doesn't change do_incorporate_leaf No InoIdx involved do_incorporate_internal ditto So only need check in lafs_pin_block_ph and maybe pin_all_children... 08Aug2009 - credits get out of sync from lafs_incorporate->refile->space_return from checkpoint. counter is one more than we can find. returning space on i [cfb9aaf0]327/0(2261)r1E:Index(1),Valid,PhysValid[0] NP Note it in an Index but not InoIdx. The parent is still in the tree. This that is FIXED - and out by 8! at delete_inode -> truncate -> invalidate_page->erase_dblock->space_return FIXED that. 
- BUG credits<0 in space_return from lafs_incorporate from add_block_address from phase_flip
  Just Grew [cfbb5c70]331/0(NoPhysAddr)r2E:Index(2),Pinned,Phase1,InoIdx,Valid,Dirty,UninCredit{0,1}[0] child(1) inc(1)
  from [cfbb5a40]331/0(0)r205F:Index(1),Pinned,Phase1,WPhase0,Valid,Dirty,CN,CNI,UninCredit,IOLock,PhysValid{0,0}[0] child(203) leaf(2)
  msg: (1,3,1)(1,1,-1)
  Credits = -1, rv=1
  ib = [cfbb5a40]331/0(0)r205F:Index(1),Pinned,Phase1,WPhase0,Valid,Dirty,CN,CNI,UninCredit,IOLock,PhysValid{0,0}[0] child(203) leaf(2)

This is a predicted but not handled problem.
The answer is that not all blocks need ICredit/UnincCredit. The purpose of this credit is to allow for a split in the parent.
Pre-existing index blocks can never split the parent themselves.
If an index block becomes full, it will split and this might split the parent.
If an index block has free space, then it will only overflow if it gets multiple child updates and this will provide multiple credits.
So an index block with space for 3 or more new addresses does not need an ICredit/UnincCredit.
So when we split we don't need to provide an uninc credit.

In particular. When we have a full InoIdx block and a single new child with 1 UnincCredit, each block already is either 'Dirty' or has a 'Credit', and the InoIdx has an ICredit, then create a new intermediate such that
  InoIdx is Dirty and has an ICredit
  New Index is Dirty with no ICredit - it used the UnincCredit
  New child loses its UnincCredit
When another block in the new index arrives, its UnincCredit is used to provide an ICredit.

When a leaf block cannot fit a single address it will have ICredit. The block is split so that each has 3 spaces and so do not need ICredit, but as soon as ICredit is available, they take it.
Worst case is that every ancestor is full and the leaf is split. We then get two full branches, each block half empty so not needing ICredit.

Then... free data being used in lafs_refile from cleaner.
b->inode->i_sb is 0x6b6b6b6b, so inode has been freed before cleaner frees it. Answer: lafs_refile was derefering ->inode when it wasn't safe. Need to at least have a parent before it is safe. Hang: soft lockup cleaner->lafs_iget->ifind_fast .... Then (may be caused) Oh dear: [cfb63670]284/0(0)r1E:IOLock,PhysValid cleaning(1) .......: [cfb2ad20]284/0(2198)r2F:Index(1),Pinned,Phase1,InoIdx,Valid,PhysValid{0,0}[0] child(1) leaf(1) Why have I no credits? [cfb2ad20]284/0(2198)r2F:Index(1),Pinned,Phase1,InoIdx,Valid,Dirty,PhysValid{0,0}[0] child(1) leaf(1) ------------[ cut here ]------------ kernel BUG at /home/neilb/work/nfsbrick/fs/module/block.c:656! It seems the cleaner gets confused and goes spinning. So: space problems: After the run, we have -14 used and 2055 available (of 4608), and cannot create anything. 4 segments ar free, one is cleanable. free_blocks=2103 allocated=56 max_seg=512 clean_reserved=0 or free_blocks=1722 allocated=64 max_seg=512 clean_reserved=0 or df: tot=4608 free=4630 avail=977(1033-56) cb=10 pb=0 ab=-32 free_blocks=1033 allocated=56 max_seg=512 clean_reserved=0 and very little free ablocks_used is going negative - why? Probably we erase a dblock without clearing Prealloc. Then when Prealloc later gets cleared, ablocks_used is wrongly decremented.... no... 10aug2009 (don't forget above problems) Another problem. read->touch_atime->dirty_inode->inode_fillblock->dirty_dblock getiref_lock triggers BUG. This is presumably because I have just fixed it to get the correct iblock and not the iblock of the filesystem. FIXME I hacked around this but I'm not sure the result is right. The question is about when the InoIdx should be dirty and when the inode data block should be dirty. In this particular case we are writing a page of a small file. cluster_allocate calls flush_data_to_inode which tried to dirty the inode dblock but finds that iblock is not pinned... When we dirty a data page we aren't pinning the parent! 
That might be OK - we only need to count and reserve the parent. We don't need to pin it until it becomes dirty.
Still need to resolve when which block gets to be dirty, and also exactly when an index block needs to be pinned. And how does that relate to holding a ref on the inode when the inoidx is pinned? Maybe it should be when the inoidx is referenced. FIXME

11aug2009
Another problem.
unlink->handle_orphans->erase_dblock->allocated_block and get a zero from lafs_add_block_address but parent is not pinned.

And... On unmount, orphan file still has pinned blocks so the inode isn't free.

And ... root still old phase after lots of 'rm' then sync.
Inode 244 has pinned inode block held by writepage0 and writepage
this is adir/170

13aug2009
 - lots of bugs introduced by change to marking inode blocks dirty: writepage/cluster_allocate wants to Dirty inode data block with no credits, because I put credit in iblock!
 - ohhh.... The phase contour is broken. When a block is added to a cluster for allocation it isn't in the phaseleafs any more, but prevents its parent from joining. So we cannot assume that if dblock is on list then iblock or a child will be too. So when we find dblock we do need to remove it.... done that.
 - root not changing because Data 1/0 is Pinned and IOPending and held by writepage!!
   Problem is that IOPending blocks aren't put back on lru. But that should only be blocks on the cluster list..... But that is where I am putting it.
   Maybe I need exclusion between checkpointing and any other code that writes to checkpoint so checkpoint can wait for that ... can we use wc->lock?? That doesn't lock against cleaner, but that isn't a problem...
   But now 0/228 is still pinned and in writepage and IOPending. So there is more to it than that.
   When checkpoint finds an IOLocked block, it might be about to join a cluster, in which case we don't really want to wait, or it might be undergoing incorporation in which case we want to wait.
or it could be being erased, so wait.. Maybe I wait until it appears on some list.... yes. 14aug2009 At unmount Index 8/0 with child and leaf is still pinned This was pinned: [cfb29810]8/0(9)r3F:Index(1),Pinned,Phase0,InoIdx,Valid,SegRef,C,CI,CN,CNI,PhysValid{0,0}[0] release(1) child(1) leaf(1) and.. A problem is that something goes wrong in the erase process. We find new children after we erase the inoidx block! This was pinned: [cfb3d810]8/0(9)r284016F:Index(1),Pinned,Phase0,InoIdx,Valid,SegRef,C,CI,CN,CNI,PhysValid{0,0}[0] release(1) child(1) leaf(284014) When/how do we erase indexblock and particularly inoidx blocks? Does and inValid InoIdx simply mean there is no indexing and does not reflect on the Data block? .xlooping on [cfbe28c0]331/0(0)r2F:Index(1),Pinned,Phase1,InoIdx,Valid,Dirty,SegRef,C,CI,CN,CNI,UninCredit,PhysValid{0,0}[0] inode_handle_orphan(1) leaf(1) Orphan problem: nextfree = 0 reserved = 0 VFS: Busy inodes after unmount of hdb. Self-destruct in 5 seconds. Have a nice day... This was pinned: [cfb2ad20]0/0(13)r5F:Index(1),Pinned,Phase1,InoIdx,Valid,Root,C,CI,CN,CNI,PhysValid{0,0}[0] NP release(1) child(3) leaf(1) [cfb2ad20]0/0(13)r5F:Index(1),Pinned,Phase1,InoIdx,Valid,Root,C,CI,CN,CNI,PhysValid{0,0}[0] NP release(1) child(3) leaf(1) Leaf1(0) [cfa5374c]0/1(772)r0E:Valid,Dirty,UninCredit,PhysValid [cfb54430]0/8(775)r0E:Valid,Dirty,SegRef,UninCredit,PhysValid [cfb54c90]0/16(777)r0E:Valid,Dirty,SegRef,C,CI,UninCredit,PhysValid nextfree = 1 reserved = 0 0: 1 0 0 304 VFS: Busy inodes after unmount of hdb. Self-destruct in 5 seconds. Have a nice day... 
This was pinned: [cfb273b0]8/0(9)r3F:Index(1),Pinned,Phase1,InoIdx,Valid,SegRef,C,CI,CN,CNI,PhysValid{0,0}[0] release(1) child(1) leaf(1) [cfb273b0]8/0(9)r3F:Index(1),Pinned,Phase1,InoIdx,Valid,SegRef,C,CI,CN,CNI,PhysValid{0,0}[0] release(1) child(1) leaf(1) Leaf1(0) [cfb74c90]8/0(2404)r1E:Valid,SegRef,CN,PhysValid orphan(1) badcnt 0 0 [cfb268c0]0/0(13)r4E:Index(1),Pinned,Phase1,InoIdx,Valid,Root,C,CI,CN,CNI,PhysValid{0,1}[0] NP child(4) erase Inoidx [ce5ab3b0]172/0(402)r1F:Index(1),InoIdx,Valid,IOLock,OnFree,PhysValid[0] inode_handle_orphan(1) erase Inoidx [ce5ab5e0]74/0(0)r2F:Index(1),Pinned,Phase0,WPhase0,InoIdx,Valid,Dirty,SegRef,C,CI,CN,CNI,UninCredit,IOLock,PhysValid{0,0}[0] child(1) leaf(1) ------------[ cut here ]------------ WARNING: at /home/neilb/work/nfsbrick/fs/module/block.c:579 lafs_erase_iblock+0x unlink/orphan/erase_dblock_allocated_block ---[ end trace 61b8bd59512ea4da ]--- zz [ce50d6a8]74/1059005010(0)r1E:SegRef,C,CI,UninCredit,IOLock,PhysValid,Orphan(0) orphan(1) [ce5ab5e0]74/0(0)r1E:Index(1),InoIdx,SegRef,C,CI,CN,CNI,UninCredit,PhysValid[0] child(1) [ce5ab5e0]74/0(0)r1E:Index(1),InoIdx,SegRef,C,CI,CN,CNI,UninCredit,PhysValid[0] child(1) ------------[ cut here ]------------ kernel BUG at /home/neilb/work/nfsbrick/fs/module/index.c:1955! BINGO. When we remove last entry from directory we erase the InoIdx block, then when we add entries, we hit problems. nextfree = 3 reserved = 0 0: 1 0 0 306 1: 1 0 0 307 2: 1 0 0 74 VFS: Busy inodes after unmount of hdb. Self-destruct in 5 seconds. Have a nice day... 
This was pinned: [cfb2caf0]16/0(10)r3F:Index(1),Pinned,Phase1,InoIdx,Valid,SegRef,C,CI,CN,CNI,PhysValid{0,0}[0] release(1) child(1) leaf(1) [cfb2caf0]16/0(10)r3F:Index(1),Pinned,Phase1,InoIdx,Valid,SegRef,C,CI,CN,CNI,PhysValid{0,0}[0] release(1) child(1) leaf(1) Leaf1(0) [ce9794f0]16/0(2200)r1E:Valid,SegRef,CN,CNI,PhysValid cleaning(1) This was pinned: [cfb2d3b0]8/0(9)r3F:Index(1),Pinned,Phase1,InoIdx,Valid,SegRef,C,CI,CN,CNI,PhysValid{0,0}[0] release(1) child(1) leaf(1) [cfb2d3b0]8/0(9)r3F:Index(1),Pinned,Phase1,InoIdx,Valid,SegRef,C,CI,CN,CNI,PhysValid{0,0}[0] release(1) child(1) leaf(1) Leaf1(0) [ce968c90]8/0(2175)r3E:Valid,SegRef,C,CI,CN,CNI,PhysValid orphan(3) This was pinned: [cfb2d180]1/0(8)r3F:Index(1),Pinned,Phase1,InoIdx,Valid,SegRef,C,CI,CN,CNI,PhysValid{0,0}[0] release(1) child(1) leaf(1) [cfb2d180]1/0(8)r3F:Index(1),Pinned,Phase1,InoIdx,Valid,SegRef,C,CI,CN,CNI,PhysValid{0,0}[0] release(1) child(1) leaf(1) Leaf1(0) [ce968430]1/0(2174)r1E:Valid,SegRef,C,CI,PhysValid cleaning(1) We have stray 'cleaning' references. It is taken - on a data block that was in a to-clean segment at which point we igrab the inode the block is put on the ->cleaning list. It is put: when we get an error finding the block when we find that it isn't in the segment when an error occurs loading the block-to-be-relocated and when we mark that block for cleaning. i.e. always unless we got EAGAIN or some space error. If we still hold some blocks, try_clean returns 0. VFS: Busy inodes after unmount of hdb. Self-destruct in 5 seconds. Have a nice day... 
This was pinned: [cfb253b0]0/0(13)r5F:Index(1),Pinned,Phase1,InoIdx,Valid,Root,C,CI,CN,CNI,PhysValid{0,0}[0] NP release(1) child(3) leaf(1) [cfb253b0]0/0(13)r5F:Index(1),Pinned,Phase1,InoIdx,Valid,Root,C,CI,CN,CNI,PhysValid{0,0}[0] NP release(1) child(3) leaf(1) Leaf1(0) [cfa57b7c]0/1(2228)r0E:Valid,Dirty,UninCredit,PhysValid [ce5a4430]0/8(2231)r0E:Valid,Dirty,UninCredit,PhysValid [ce5a4c90]0/16(1028)r0E:Valid,Dirty,SegRef,C,CI,UninCredit,PhysValid NOTE these inode data blocks are not pinned and so did not get written!! FIXME I should wait for the checkpoint to finish nextfree = 1 reserved = 0 0: 1 0 0 301 VFS: Busy inodes after unmount of hdb. Self-destruct in 5 seconds. Have a nice day... This was pinned: [cfb27810]8/0(9)r3F:Index(1),Pinned,Phase0,InoIdx,Valid,SegRef,C,CI,CN,CNI,PhysValid{0,0}[0] release(1) child(1) leaf(1) [cfb27810]8/0(9)r3F:Index(1),Pinned,Phase0,InoIdx,Valid,SegRef,C,CI,CN,CNI,PhysValid{0,0}[0] release(1) child(1) leaf(1) Leaf0(0) [ce564c90]8/0(3983)r1E:Valid,SegRef,PhysValid orphan(1) 16Aug2009 When I clean and find an inode that is already deleted, I need to be very careful not to resurrect anything.. I wonder if I am.... Yes, I seem to be. lafs_delete_inode gets called a lot, but mostly for dead inodes. BUGS: FIXED orphans don't get cleaned up. It seems a 'create' fails and leaves and orphan block un-released. - sometimes inodes 1,8,16 didn't get written out as they were dirty but not pinned - Not sure that we handle complete truncation, then adding blocks properly. - what should the state of the InoIdx block be? - On remount, the filesystem contains rubbish. - create fails even when there should be free space. - sometimes BUG in checkpoint.c - not finishing checkpoint properly... - iblock not valid for in 327 under cluster_flush/lafs_allocated_block and 74 has similar issue 327 = adir/big1 74=adir 17Aug2009 Segusage blocks aren't always Pinned when we make them dirty. Yes. That is correct. 
They are not forced out by phase change but by lafs_seg_flush_all at the end of a checkpoint. So they need to be preallocated, but not Pinned.
But, once we have finished the last checkpoint we don't want to dirty Segusage blocks any more.. I wonder if we are.
No, but we were Pinning inodes without PinPending and they lost the pinning straight away!

OK, other annoyance. InoIdx block and similar are getting erased at the wrong time. We can only safely erase them when they have no children.
I guess what we really want is that incorporation leaves them existing but empty, and when we go to write them out, if they are empty we register an address of 0. When we drop the ->parent pointer of an Index block it just goes away...
So:
  When incorporate or truncate produces an empty index block it simply clears B_Valid.
  When incorporate wants to add to an index block, we set B_Valid
  When cluster_allocate gets a non-Valid index block it calls block_allocated with phys of 0.

Yes, that seems to work. Mostly

18Aug2009
On remount, check_credits dies: 16/20-0
In pin_dblock/reserve/seg_ref/prealloc/space_return ?? in lafs_mount.

19Aug2009
OK, this index block clearing is a mess. There must be a neat model I can follow that will make it "just work".
The key seems to be children. If an index block has children, then it really must exist. If it has no children and no content, then it can be discarded, in which case it needs to be unlinked from its sibling list.
What locking do we use here? Probably IOLock on the parent index block. So we need iolock while looking in a parent for children, and we take IOLock while incorporating or pruning.
Once the empty index block has dropped out it will never be found again.

When we incorporate the zero address, the index block becomes invisible unless it is shortly after its predecessor in the sibling list. But that is hard to ensure, especially if the first child is the one that is being erased.
So if an index block is erased, then it must be discarded quickly and any children need to be relocated... Or maybe not.... maybe if there are children, we just write an empty block?

22Aug2009
We need better locking of the index information. It seems best to use IOLock as that is already held during incorporation. So any code that accesses or updates an index block must hold IOLock. This might be a bit of a restriction if we try to do a lookup while writeout is happening.... Maybe we need a separate writeback flag for that. But I think it is good to use IOLock for now.
Places we need this are:
  flush_data_to_inode needs to lock the InoIdx block - DONE
  lafs_leaf_find as it recurses down. This should return a locked leaf. - DONE
  callers of clear_index
  erase_dblock for depth=0?? - DONE
  incorporate should lock new blocks for consistency - DONE

Locking dependency rule is that if we hold a lock, we are allowed to lock a child index block, but not a parent. If we hold a data block, we are allowed to lock an index block.

The read/write completion seems all wrong. It unlocks if the page was locked, and that isn't really safe, because it might not have been locked for read..
We need to flag block0 to say if lock or writeback need to be cleared. Given that, I don't need IOPending any more:
  Read: We submit all reads, then set 'do_unlock', then check if we should unlock.
  Write: We queue all writes, then set 'do_clear_writeback', then check.

Now... can we use a writeback flag to avoid waiting to read while writeout is happening?
We would need:
  set writeback in cluster_allocate
  wait_writeback after some lock_block
  clear_writeback when writeout finishes.
  Extra checks where we already check for IOLock

24aug2009
Lots of progress but....
cluster_flush calls cluster_done calls refile calls iput calls drop_inode calls write_inode_now calls writepage calls cluster_flush and we get a locking loop.
I think we need to run that cluster_done from a different thread.
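
The locking loop in the 24aug2009 note, and the proposed fix of running cluster_done from a different thread, can be modelled in user space roughly as follows. This is only a sketch under assumptions: the Cluster class, the 'flushing' flag standing in for whatever lock cluster_flush holds, the 'pending' queue standing in for a worker thread, and the helper cluster_done_with_final_iput are all hypothetical and are not taken from the LaFS source.

```python
from collections import deque


class Cluster:
    def __init__(self):
        self.flushing = False      # models the lock held across cluster_flush
        self.pending = deque()     # completion work handed to a worker
        self.log = []

    def cluster_flush(self, done_work, defer=True):
        if self.flushing:
            # A real non-recursive lock would deadlock here; the model raises.
            raise RuntimeError("cluster_flush re-entered: locking loop")
        self.flushing = True
        try:
            self.log.append("flush")
            if defer:
                # Hand completion to the worker instead of calling it here.
                self.pending.append(done_work)
            else:
                done_work(self)    # the original, problematic direct call
        finally:
            self.flushing = False

    def run_worker(self):
        # Runs with no flush lock held, so completion work may flush again.
        while self.pending:
            work = self.pending.popleft()
            work(self)


def cluster_done_with_final_iput(cluster):
    cluster.log.append("done")
    # refile -> iput -> drop_inode -> write_inode_now -> writepage -> cluster_flush
    cluster.cluster_flush(lambda c: c.log.append("done"))
```

With defer=False the nested flush is detected as re-entry. With the default deferral the same chain of calls completes, because the second cluster_flush only starts after the first has released its lock.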
We seem to have a refcnt problem with segsum.

25aug2009
Lots more progress but.....
orphan_release is finding that the orphan block has no credits. We can allocate credits and simply not do the update if they are not available: having an extra entry in the orphan file isn't a problem. However we need some mechanism to clean up other than waiting for a remount.. I think we leave that until we redo orphan handling.

and: adir sometimes loses one block so it and the contents don't get deleted.

and: it seems we sometimes try to clean the segment being written to. We must avoid that.

(long ago I wrote::
   FIXME When pin fails, we need to remove PinPending from everything!!!
 and never followed up ... I wonder? )

25Aug2009
Orphan handling.
Every orphan block goes on a per-fs list and gets removed only if the B_Orphan bit is clear.
There are two times when we want to expedite orphan handling.
 1/ on rmdir we need to know if the directory is really empty. This requires that we expedite the orphan handling of all blocks. As soon as we find a non-orphan, we can give up. Then we need to make sure the index tree has collapsed. We can borrow that code from truncate.
 2/ When writing past Trunc_next. We just pass the block to special orphan handling.

This requires that orphan handling is re-entrant. For dir, that is protected by i_mutex, but rmdir needs to come in under the radar. For trunc, the iolock on the index blocks should be enough. I wonder if IOLock can be used on dir as well... allowing parallel orphan handling in the one dir even!!.

We need to ensure exclusion of orphan handling, including:
 - only one orphan handler at a time
 - don't run orphan handler while still processing action that makes it an orphan.
Maybe if we just use IOLock for that? Does that work? Maybe but it gets messy for directories (on first attempt anyway). For directories we can just use i_mutex. Maybe i_mutex for files as well?

27Aug2009
Orphan handling is going well... but not perfect.
I'm using IOLock to ensure exclusion for orphan handling. However:
  I'm not really implementing that on directories
  Inodes go bad because lafs_erase_dblock needs the lock too.
  The call from rmdir will always fail because we hold i_mutex.

Bigger problem. I'm IOLocking inodes across checkpoints to preserve Orphan status. But that might stop the checkpoint proceeding.
.. so use i_mutex, not IOLock - fine.

Now... it seems I've confused myself. Orphans don't get handled immediately. In particular, inodes should not be handled until their final delete_inode. So setting the B_Orphan flag and putting on the list are two separate events. The flag must come first, but the list may come much later. So some of that mucking around with i_mutex is pointless.
So:
  make_orphan makes sure it is in orphan file, sets bit, and removes from list (if present).
  add_orphan puts it on the list for handling.

For inodes: lafs_new_inode sets the bit and delete_inode puts on queue, as does any unlink/rmdir/rename that fails.
For directories: put it on list in commit/abort.

And... I hit the BUG where find_leaf wants an address of 0. If an index block gets cleaned out it doesn't disappear immediately.. there is no leaf to find in that direction. We probably need to avoid non-Valid blocks or something...

And... Orphans 0/299 to 0/329 and 0/280 are still on the list but are not orphans. Maybe I need to catch mutex_unlock to run the orphans??

And... We underflow a segment through orphans at unmount. We are cleaning and truncating at the same time. The same block gets allocated to 0 and to 1225 in quick succession.
Problem is that we apply new address while in writeback so a new lafs_allocated_block

29Aug2009
Review of inodes in orphan list:
lafs_new_inode makes an orphan for a non-existent inode. If the inode cannot be created, orphan_release is called. If it can, a 'struct inode' is filled in with valid type and nlink==1 (!!) and attached.
  The inode will only be detached when the refcnt hits 0, and the orphan
  list implies a refcount, so if we ever find something on the orphan
  list with a NULL my_inode, it must be very new and can be ignored.
  When we find an inode block with a my_inode there are a few options:
     if I_Trunc is set, we must progress truncation providing we can
        get the i_mutex
     else if I_Deleting we must delete the inode
     else if nlink is 0, we remove from the list
     else nlink > 0 and we must remove orphan status.
  This means that if nlink is elevated, we need to be holding the
  mutex...  So don't elevate nlink any more...

  When nlink becomes non-zero the block needs to be put back on the
  orphan list (it must already be an orphan).
  Also when we set I_Deleting or I_Trunc it must go on the list.

  .. OK, I think I have all of that.

30Aug2009.
  I have some weirdness that seems to be caused by the orphan stuff,
  probably due to it all being async now.
  - A deleted inode clears I_Trunc and then sets it again.
    The only explanation seems to be that delete_inode is being called
    again, so I must be igrab-ing it again, maybe from cleaning.
  - bits of directories aren't getting deleted.  Sometimes single
    blocks, though the referred files are deleted.  Sometimes the whole
    directory...
    More interestingly, those blocks then don't get cleaned, so
    something about them means that they don't get deleted and don't
    get cleaned either.

  Even weirder... I just had a case where file 331 had a different index
  block for every 4 data blocks...

  FIXME:
  - What stops pinned blocks from being flushed by bdflush in middle of
    operation and so losing allocation?  Must make sure to set them
    dirty very late.
  - orphan_release can fail, so must make sure we can always call it,
    even if my_inode is NULL.... but how?
  - make_orphan could fail due to lack of space, which is not OK.
    I made it loop, but I'm not 100% sure that is right... it isn't.
    I need to pass down the 'I'm freeing space' flag, and I need to not
    require Credit of Dirty is set, etc.
  - I seem to have a deadlock at unmount.
    umount is waiting for lafs_checkpoint_lock_wait in lafs_put_super
    pdflush is in down_read in sync_supers
    lafs_cleaner is iget_locked/ifind_fast/inode_wait
       This is waiting for I_LOCK to be clear.

31Aug2009
  - When a file shrinks and becomes level-0, make sure old addresses
    get deallocated.  I seem to have a directory where they didn't.
  - Due to the fact that we over-preallocate, we really shouldn't
    return ENOSPC until we have flushed dirty data and performed a
    checkpoint??
  - When I removed the last index from an inode (Indirect type) it
    seems that I didn't write out the corrected block..??

1sep2009
  I ran my simple test run repeatedly overnight.
  It ran 208 times before I stopped it.
  There are 3 possible failure modes:
   1/ didn't complete within 500 seconds
   2/ triggered a BUG
   3/ appeared to complete, but the number of blocks in use was not
      the correct '7'.

   74 (35%) did not fail!
   31 () did not complete
   40 () triggered a BUG
    2 did not complete but did not trigger a bug
   94 of those that failed did not have a BUG
   92 actually completed.  Of these:
       1 final blocks 1
       1 final blocks 110
       1 final blocks 23
       2 final blocks 12
       5 final blocks 0
       6 final blocks 10
      11 final blocks 8
      21 final blocks 11
      44 final blocks 9

  of the BUGs,
      1 BUG: sleeping function called from invalid context at kernel/nsproxy.c:217
      1 BUG: spinlock lockup on CPU#0, rm/1330, cfb2dae4
      1 kernel BUG at /home/neilb/work/nfsbrick/fs/module/block.c:485!
      1 kernel BUG at /home/neilb/work/nfsbrick/fs/module/cluster.c:1219!
      1 kernel BUG at /home/neilb/work/nfsbrick/fs/module/inode.c:821!
      2 BUG: soft lockup - CPU#0 stuck for 61s! [lafs_cleaner:1177]
      3 kernel BUG at /home/neilb/work/nfsbrick/fs/module/segments.c:1028!
      3 kernel BUG at /home/neilb/work/nfsbrick/fs/module/segments.c:351!
      5 kernel BUG at /home/neilb/work/nfsbrick/fs/module/lafs.h:276!
      6 kernel BUG at /home/neilb/work/nfsbrick/fs/module/block.c:529!
7 BUG: unable to handle kernel paging request at 6b6b6bfb 11 kernel BUG at /home/neilb/work/nfsbrick/fs/module/super.c:655! super.c:655 is "block is still pinned" at unmount time. The block was always an InoIdx with a child. Either inode 0 or 16. child is held by various things: [cfb555cc]16/1(2098)r131E:Valid,Async,SegRef,CN,CNI,UninCredit,PhysValid async(1) clean2(130) [cfb554f0]16/0(1050)r25E:Valid,SegRef,CN,CNI,PhysValid clean2(25) [cfa57c58]0/2(3676)r0E:Valid,Dirty,UninCredit,PhysValid [cfa5bc58]0/2(3110)r0E:Valid,Dirty,UninCredit,PhysValid [ce5b94f0]16/0(519)r1E:Valid,Async,SegRef,CN,CNI,PhysValid async(1) [cfb4d4f0]16/0(4249)r1E:Valid,Async,SegRef,CN,CNI,PhysValid async(1) [ce5ad4f0]16/0(612)r1E:Valid,Async,SegRef,CN,CNI,PhysValid async(1) [ce5c2fc8]0/74(0)r129E:SegRef,C,Claimed,PhysValid clean2(129) [cfa57c58]0/2(1895)r0E:Valid,Dirty,UninCredit,PhysValid [cfb4d5cc]16/1(4543)r105E:Valid,SegRef,CN,CNI,UninCredit,PhysValid clean2(105) [ce5754f0]16/0(1290)r178E:Valid,SegRef,CN,CNI,PhysValid clean2(178) The "unable to handle kernel paging request" is always in umount. invalidate_inode_buffers(26/46)/lock_acquire block.c:529 This is iblock valid when erasing a block The block we are erasing is always 0/327 or 0/328. It is an orphan we are handling, iolocked but not always pinned lafs.h:276 Map an iblock which is not IOLocked always in lafs_clear_index for the InoIdx block for a directory which is in Writeback. Call is in lafs_allocated_block from cluster_flush. segments.c:351 seg_inc reduces seg usage below 0 - lots of blocks (inode 327) that were cleaned, where then erased twice. - 2 block (inode 328) were erased twice, both from prune - ditto segments.c: 1028 The free list is empty.... odd as only first segment is currently in use. soft lockup: Still orphan: 0/328 Index(1) is in Writeback and Dirty again inode_handle_orphan2 is in Writeback inode.c:821 inode_handle_orphan are end, child list is not empty. 
    The children seem to be in Realloc - cleaner needs to let go.

  cluster.c:1219
    my_inode is null while cluster_flush an inode and want to set
    WritePhase.

  block.c:485
    no ICredit for unincredit in dirty_dblock from dir_delete_commit
    from lafs_unlock.

  spinlock lockup is subsequent to real bug
  ditto for sleeping function.

  Of the '44' which claimed final blocks of 9, 14 really had 7, and 4
  appear to have other strange values....
  A selected '9' has two extra blocks for the directory '74'.  But that
  directory is long gone.  These dir blocks are currently fully
  populated with numbers.
  This seems to be the pattern with all non-7 blocks.

02Sep2009
  Found a problem, possibly related to the dir blocks not being cleaned
  up.  When lafs_incorporate sets ->depth to 1 it doesn't dirty the
  inode, so that fact is never copied in to the datablock.
  On further exploration, the I_Dirty bit is set but never used, which
  isn't good.
  So: exactly when do we copy inode into datablock, and what do we do
  when dirty_inode is called (if anything).

  We could just set I_Dirty when dirty_inode is called, checking that
  the block is Pinned which it usually will be.  Then we copy inode to
  data just before writing data block.
  However that defeats transactional properties.  We need to copy in
  the same transaction, and that means either straight away, or when
  the data block's phase changes.
  So dirty_inode either copies to the block, or sets I_Dirty.
  When lafs_refile unpins an inode data block, it needs to check
  I_Dirty and possibly re-dirty it.
  To redirty it we must steal the NCredits.  Any further dirty attempt
  will have to allocate more.
  The stealing is done automatically by dirty_dblock, so we just flip
  the phase and call dirty_inode ... making sure it doesn't try to
  prealloc too hard.

  Need to review when inodes get dirtied.
  - commit_write only sets I_Dirty !
  We call lafs_dirty_inode:
     dir_create_commit        - a child of inode is PinPending
     lafs_create              - ditto
     lafs_link                - before dir_create_commit
     lafs_unlink, lafs_rmdir  - data block is pinned
     lafs_symlink             - before create_commit
     lafs_mkdir               - before create_commit, or block pinned
     lafs_mknod               - before create_commit
     lafs_rename              - (moved to) before
                                create_commit/update_commit or data
                                block is pinned
     lafs_dir_handle_orphan   - (assured that) child is pinned.
     choose_free_inum         - child is pinned
     lafs_incorporate         - block is pinned

  So either the data block is pinned, or the index block is pinned.
  In either case it is OK to set something to Dirty.

  (the new) lafs_dirty_vfs_inode gets called by
  mark_inode_dirty{,_sync}
  this is called from:
     inode_inc_link_count
     inode_dec_link_count
     ..various quota ops...
     inode_setattr
     __set_page_dirty (Which we don't use)
     other buffer stuff
     other quota stuff we won't use
     touch_atime
     file_update_time
     page_symlink

  only the time updates are interesting.  Others we have locking for.
  file_update_time is called from generic_file_aio_write_nolock etc
    before ->prepare_write/->commit_write.  So they can pick up the
    change.  Similarly before set_page_dirty is called.
  touch_atime is called from do_follow_link and readlink and
    file_accessed which is called all over the place.

  So what to do?
    If block is pinned, then dirty it to ensure writeout.
    If not, don't.  But copy data in any case.

4sep2009
  OK, I've decided that I don't like clearing B_Valid when an index
  block contains no indexes.  The final straw was that I seemed to need
  to initialise the index block when I didn't hold IOLock.  That was
  probably fixable, but I'm sure more problems were coming.

  So: what to do instead?

  One issue that must be resolved is that an index block can still have
  valid children even when it becomes empty.  This can happen if we
  erase blocks from a file, then add them back after a checkpoint, and
  so in the next phase.
  The checkpoint writeout could need to show an empty index block, but
  the next phase will see real addresses.
  We cannot easily avoid this, so we must handle it.

  This interacts badly with the index lookup algorithm that finds the
  best index block currently in the parent, and then scans the
  children.  If there is no index block in the parent, we cannot find
  any children.
  This could be handled by responding to an empty index block by
  scanning all children.  But that isn't a full solution as if just one
  index block got erased, its unincorporated siblings would still be
  lost.

  We could treat empty index blocks like orphans.  i.e. don't discard
  them immediately but leave them with possibly real addresses.  Then
  when they have no children we allocate them to 0.
  But we still need to ensure that index blocks off which siblings have
  been split but not yet incorporated remain present in the tree to
  mark the place for their siblings.

  There is another problem.  A horizontal split could leave the new
  block with no addresses and everything in the uninc list.  Nothing
  can be found in there.

  So maybe we need to revise the lookup mechanism.
  The goal is to find an index block that starts at or before the
  target and contains an address at or after the target.  Then our
  search can stop.  In rare cases.....

7sep2009
  I thought about this more over the weekend and think I have an
  answer.

  We need to treat internal and leaf index blocks somewhat differently.

  An internal index block must never be empty (while unlocked).
  Any child block which has not had its address incorporated must be
  attached (simply in the sibling list) to a block which has been
  incorporated.  This will be the block that it was split off from.
  The uninc block needs to hold a reference so that the primary isn't
  released.
  When a 'primary' becomes empty it cannot be discarded, so the
  addresses in the first dependent index block must be copied across.
  This is awkward for indirect blocks so they might be allowed to be
  empty (they aren't internal so don't violate the above).
  When a horizontal split breaks a sequence of dependent blocks between
  two parents, the second parent must be incorporated immediately so
  that the first block in the second half of the sequence is
  incorporated.

  If an internal index block does become empty and it has no dependent
  blocks to fill from, it must be invalidated immediately.  It cannot
  have any children - even in next phase - as at least one would have
  to be incorporated and so the block would not be empty.
  Invalidating involves allocating to address 0.

  If index lookup finds a block with PhysValid address of 0, it must
  look to the previous index block.  If there was none .... it gets a
  bit complex.

  Leaf index blocks can become empty, but we try to avoid it.
  If a leaf has blocks which have been created in the next phase, and
  others which have been deleted in this phase, it can be empty but
  still have children.
  In this case we just treat it as a real index block that doesn't
  actually have any addresses.  We still write it out even though that
  is a waste of space.

  We have been working on the assumption that every address always has
  a corresponding leaf index block.  It is the leaf with the highest
  index at or below the target address.
  However this requires that every internal index block has a child
  with the same address as the parent.  Preserving this requirement
  when the first child of an internal block becomes empty requires
  either:
    - loading the 'next' child and reassigning this to the start
    - changing the address of the parent to match the first child.

  The former requires possibly reading a block from storage.
  The latter only involves modifying blocks that are due to be written
  out anyway, but makes block lookup slightly interesting.
  When lookup finds an invalid block that is 'first', it needs to start
  again from the top.
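
  A minimal user-space sketch of the lookup assumption stated above,
  written in Python rather than kernel C.  The leaf for a target is the
  one with the highest start address at or below it, which only works
  if the first child of each internal block shares its parent's
  address.  The names find_leaf and first_child_matches_parent, and the
  use of a plain sorted list, are assumptions made for illustration and
  are not the lafs code.

```python
from bisect import bisect_right


def find_leaf(leaf_starts, target):
    """Return the start address of the leaf covering `target`.

    `leaf_starts` is a sorted list of leaf start addresses (a
    hypothetical in-memory representation).  The covering leaf is the
    one with the highest start address at or below `target`.
    """
    i = bisect_right(leaf_starts, target)
    if i == 0:
        # No leaf starts at or below the target: the invariant is broken.
        raise LookupError("no leaf at or below address %d" % target)
    return leaf_starts[i - 1]


def first_child_matches_parent(parent_start, child_starts):
    """Check that an internal block's first child has the block's address."""
    return bool(child_starts) and min(child_starts) == parent_start
```

  With leaves starting at 0, 64 and 200, address 63 resolves to the
  leaf at 0 and address 64 to the leaf at 64.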
  When incorporation creates an invalid block that is first, it needs
  to walk down from the top and any index block at the same address
  needs to be relocated/rehashed.  If the block is incorporated, the
  incorporated address needs to be updated.

  So:
   - flag for unincorporated index blocks which implies a reference on
     primary
   - after split, immediately incorporate second block
   - change lookup to retry when finding invalid block
   - When internal block becomes empty, either merge with first
     dependent or invalidate.  If first in parent, update address and
     parent and recurse.
     Need some 'clever' locking here.  Before unlocking the invalidated
     block, we take i_alloc_sem, then walk up the ->parent tree locking
     blocks as required.  The index lookup, when it finds an invalid
     block will take i_alloc_sem, then drop it, then start again.
     Or maybe some other lock than i_alloc_sem...
   - When leaf becomes empty, invalidate only if it has no children.
     When internal leaf becomes unpinned, check if empty.

21sep2009
  That locking doesn't look like it will work, and we can never 'merge
  with first dependent' as it is not valid to have an index block where
  the first child is at a different address.
  And we cannot always change the parent address, particularly if it is
  zero - increasing it then cannot work.
  And there is no need to load a block if we are just going to change
  its start address (not internal index blocks anyway).

  Let's drop the idea of relocating the parent.
  If an internal index block becomes empty:
     If it is last in parent, no loss, just discard
        If parent would be empty, need to recurse up.
     If it is not last
        relocate the next sibling to this location, rehashing it and
        updating the parent.
  If a leaf index block becomes empty we cannot just delegate to next
  as it might be indirect... not a problem if address is stored.  But
  that requires a format change... now might be a good time!
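
  The three outcomes listed above for an internal index block that
  becomes empty can be written as a small decision function.  This is
  only a sketch in Python: the function name, its two parameters and
  the returned labels are hypothetical, and it treats "only child" as
  the case where the parent would become empty.

```python
def empty_internal_action(position, sibling_count):
    """Decide what to do with an internal index block that became empty.

    `position` is the block's 0-based position among the
    `sibling_count` entries of its parent (hypothetical parameters).
    """
    if not 0 <= position < sibling_count:
        raise ValueError("position out of range")
    if sibling_count == 1:
        # Discarding the only child would leave the parent empty,
        # so the same question must be asked one level up.
        return "discard-and-recurse-up"
    if position == sibling_count - 1:
        # Last in parent: no loss, just discard.
        return "discard"
    # Not last: the next sibling is relocated to this address.
    return "relocate-next-sibling"
```

  For example, the first of three siblings gives "relocate-next-sibling"
  and the last of three gives "discard".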
  So:
   If we hold an index block locked and it becomes empty and we choose
   to invalidate it, we need to ensure that doing so does not break any
   indexing paths.
   So we take a separate lock (i_alloc_sem??) and flag the block as
   invalid by setting physaddr to 0 while PhysValid is set, and unlock
   the block.
   Any lookup that finds such a block must take and release
   i_alloc_sem, and then restart from the top.
   - If the block was not incorporated, we just remove from sibling
     list and all is done - the space is implicitly included in
     previous block.
   - If the block has a different fileaddr than the parent then update
     the parent directly, either removing the entry, or changing it to
     point to the first unincorporated sibling (if there is one).
     This requires taking the lock on the parent of course.  That is
     why we dropped the lock on the child.  Then all done.
   - If the block has the same address as the parent we need to find a
     'next block' to relocate to the start of the parent.  It is either
     the first unincorporated sibling, or the next block in the index
     block, or nothing, meaning the parent is about to become empty.
     We lock the parent (still holding i_alloc_sem), and rehash the
     chosen child.  If it doesn't exist, or is not dirty, we need to
     update the phys address directly in the parent accordingly,
     erasing or replacing the first address.
     Then we need to rehash the index block, but we need to lock the
     parent for that.  So set a 'busy' flag on the block, unlock it,
     lock parent, rehash, clear busy flag, and repeat.
   - We can never relocate a block with fileaddr of zero, as the InoIdx
     block cannot be relocated.  So leaf index block 0 must never be
     erased unless the file is empty.
  So

28sep2009
  New idea.
  We store the start address of an indirect block in the block.
  This means that the meaning of any index block is completely
  independent of the location of the block, so we can change the
  location easily and without touching the block.
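
  To illustrate why a stored start address makes a block's meaning
  independent of where it is filed, here is a toy sketch in Python.
  The encoding (a 32-bit start address followed by 32-bit physical
  addresses) and the names pack_indirect and unpack_indirect are
  assumptions invented for this example; this is not the LaFS on-disk
  format.

```python
import struct


def pack_indirect(start, phys_addrs):
    """Toy encoding: 32-bit start address, then 32-bit physical addresses."""
    return struct.pack("<I%dI" % len(phys_addrs), start, *phys_addrs)


def unpack_indirect(raw):
    """Return {file_addr: phys_addr} using only the block's own contents."""
    count = len(raw) // 4 - 1
    start, *phys = struct.unpack("<I%dI" % count, raw)
    return {start + i: p for i, p in enumerate(phys)}


# A hypothetical hash table keyed by the address the block is filed under.
table = {64: pack_indirect(64, [900, 901, 950])}
before = unpack_indirect(table[64])

# "Rehash" the block under a different key without touching its bytes.
table[0] = table.pop(64)
after = unpack_indirect(table[0])
```

  Here `before` and `after` are the same mapping, because decoding never
  consults the key under which the block is filed.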
  So if a block becomes empty, we simply move the next block back to
  fill the gap.
  i.e. when an index block becomes truly empty (i.e. no children)
   - if it wasn't incorporated, simply remove it
   - if it was,
     - if there is a dependent block, rehash it to take my address
     - if there is a next block that is dirty, rehash it
     - if there is a next block that is not dirty, update parent to
       merge my entry with next, and rehash next if it exists
     - if there is no next block but we are not first, just update
       parent
     - if no next block and we are first, parent becomes empty, recurse
       upwards.

12Oct2009 - too long, I've forgotten what I was up to..
  + I've changed the format of indirect blocks to store an address.
  + I've handled incorporation of an empty block

  So now internal index blocks can never be empty - they get
  immediately unlinked if they are.
  Leaf index blocks can be empty while they have children.  We don't
  flag them as empty, but rather wait until another child gets
  incorporated.
  But I don't think I really like that.  It is an external ugliness
  based entirely on internal implementation details.
  Empty index blocks should not get written out.

  We need some way to reliably find an empty index block.
  The address won't appear in the parent so a lookup will find the
  previous block which we cannot link to now as it may not exist yet.
  Worse - if first index block goes empty, we can only unlink it by
  moving the parent to start at the next block.  That would make this
  index block totally unfindable.

  So I think we have to stick with writing out empty index blocks very
  rarely.  So we need to be sure they disappear properly.
  The difficult case is if an index block becomes empty while it has
  some children which don't end up getting dirtied.  e.g. an update
  aborts.
  We need to leave the block with enough credits to be written out.
  I guess the Ncredit should be enough...  Maybe worry about that
  later.

  - what about InoIdx blocks when they become empty?
It would be helpful to flag them so that inode deletion can check.... Maybe just set depth to 0.. ARRGGG... I've completely lost it. In need another ITO week. I just got a bug in summary.c:71!! 7 Jun 2010 - summary.c:71. ablocks_used has hit zero too soon. This should be the count of blocks for which space has been allocated (B_Prealloc is set) but have not been given a phys address yet - at which point the usage count is moved to cblocks_used or pblocks_used. The last block (which may not be the cause of the problem) does not have B_Prealloc set, yet physaddr == 0. The block is 0/1, so the inode for the inode usage map. This should have physaddr 8 !! We did find 8, then change to 73, but then changed to 0! Ahhh... recent fix exposed a subtle bug ... fixed. Now cluster.c:619: [ce9233f8]0/282(0)r2F:Pinned,Phase1,PinPending,CN,CNI,Claimed,PhysValid iblock(1) leaf(1) cluster.c:619: [ce570a18]0/286(0)r2F:Pinned,Phase1,PinPending,SegRef,CN,CNI,Claimed,PhysValid iblock(1) leaf(1) cluster.c:619: [ce588d6c]0/17(0)r2F:Pinned,Phase0,PinPending,CN,CNI,Claimed,PhysValid iblock(1) leaf(1) cluster.c:619: [ce51dfe4]0/283(0)r2F:Pinned,Phase0,PinPending,CN,CNI,Claimed,PhysValid iblock(1) leaf(1) cluster.c:619: [cfbb8430]0/328(0)r2F:Pinned,Phase1,PinPending,CN,CNI,Claimed,PhysValid iblock(1) leaf(1) We are allocating an InoIdx block, but data block is not valid?? That isn't very reproducible so I'll have to leave it for now... erasedblock had been called on the data block .. inode 17?? Problem is that I keep changing the rules. I don't erase the InoIdx block any more. I used to, then change it to iolock_block/cluster_allocate->0 Problem: When all files are removed, usage is still quite high, two segments have over 400 blocks (out of 512). Cleaning keeps running and not making much progress. segment 6 has usage of 484. 
'cluster 3072' shows: cluster 3072, 3085, 3086 3092 Inode 0: blocks 267 272 276 Inode 277: blocks 0/4 6/2 Inode 0: blocks 0/2 8 16 Inode 0: block 16 70/2 131/3 135/4 140/9 150/2 ... 296/7 Inode 16: 1/1 Inode 17: 0/28 Inode 283: 12/18 etc. All 'old', so must be the product of cleaning, as you would expect. All (most) of this has been deleted though, but count didn't drop. 'Count' add to 508, plus the 4 cluster heads makes 512 - good. lafs_seg_move definitely isn't being called on these blocks. it is only called from lafs_summary_update cblocks_used "exactly" matches the number of un-removed blocks. Another problem bad [ce5bcf50]301/0(0)r1E:Index(0),Pinned,Phase1,WPhase0,InoIdx,Valid,SegRef,C,CN,CNI,UninCredit,IOLock,PhysValid{0,0}[0] leaf(1) /home/neilb/work/nfsbrick/fs/module/modify.c:1652: [ce5bcf50]301/0(0)r1E:Index(0),Pinned,Phase1,WPhase0,InoIdx,Valid,SegRef,C,CN,CNI,UninCredit,IOLock,PhysValid{0,0}[0] leaf(1) bad [cfbf6000]327/0(0)r1E:Index(0),Pinned,Phase1,WPhase0,InoIdx,Valid,SegRef,C,CN,CNI,UninCredit,IOLock,PhysValid{0,0}[0] leaf(1) /home/neilb/work/nfsbrick/fs/module/modify.c:1656: [cfbf6000]327/0(0)r1E:Index(0),Pinned,Phase1,WPhase0,InoIdx,Valid,SegRef,C,CN,CNI,UninCredit,IOLock,PhysValid{0,0}[0] leaf(1) bad [cfb62d20]291/0(0)r1E:Index(0),Pinned,Phase1,WPhase1,InoIdx,Valid,SegRef,C,CN,CNI,UninCredit,IOLock,PhysValid{0,0}[0] leaf(1) /home/neilb/work/nfsbrick/fs/module/modify.c:1656: [cfb62d20]291/0(0)r1E:Index(0),Pinned,Phase1,WPhase1,InoIdx,Valid,SegRef,C,CN,CNI,UninCredit,IOLock,PhysValid{0,0}[0] leaf(1) and free_blocks=1842 allocated=449 max_seg=512 clean_reserved=0 Want dump of usage ------------[ cut here ]------------ kernel BUG at /home/neilb/work/nfsbrick/fs/module/segments.c:1028! free list is empty - that should not be. and another... 
/home/neilb/work/nfsbrick/fs/module/modify.c:1219: [ce9893b0]74/0(0)r1E:Index(1),Pinned,Phase0,WPhase1,InoIdx,Valid,Dirty,CN,CNI,UninCredit,IOLock,PhysValid{0,0}[0] leaf(1) /home/neilb/work/nfsbrick/fs/module/modify.c:1219: [ce5ba690]74/0(0)r1E:Index(1),Pinned,Phase1,WPhase0,InoIdx,Valid,Dirty,CN,CNI,UninCredit,IOLock,PhysValid{0,0}[0] leaf(1) [] ? lafs_get_flushable+0x131/0x191 [lafs] [] ? lafs_do_checkpoint+0x1b3/0x3a2 [lafs] [] ? cleaner+0x105/0x1426 [lafs] [] ? autoremove_wake_function+0x0/0x33 [] ? cleaner+0x0/0x1426 [lafs] 08Jun2010 Weirdness with truncating. The cleaner relocates a file resulting in the InoIdx block being Maybe-dirty and phys_addr == 0. Then truncate doesn't prune but just incorporates, finding something weird there.. file 278, blocks around 4100 seem to find 1949 instead?? Note: When a non-InoIdx block is erased we set PhysValid and physaddr == 0 to record the fact because it will not be stored... modify.c:1654: [ce5b4460]327/336(16)r4F:Index(1),Pinned,Phase0,WPhase1,Valid,Async,SegRef,C,CI,CN,CNI,IOLock,PhysValid{0,0}[0] leaf(1) inode_handle_orphan2(1) async(1) inode_handle_orphan3(1) Async ?? modify.c:1657: [cfb90690]327/340(787)r4F:Index(1),Pinned,Phase1,WPhase0,Valid,Async,SegRef,C,CI,CN,CNI,IOLock,PhysValid{0,0}[0] leaf(1) inode_handle_orphan2(1) async(1) inode_handle_orphan3(1) Still Async ... wonder what it means. - directory block got corrupted. Maybe conversion to indexed?? Getting bug in remove_from_index because the addr isn't there, possibly block is empty. But incorporation is ??? instant? No it isn't. If an index block hasn't be incorporated it has B_PrimaryRef set as it hold a ref to something earlier index. But what if nothing is incorporated? 
  Allocated [ce402230]328/340(0)r5F:Index(1),Pinned,Phase0,WPhase0,Valid,Dirty,Async,SegRef,CN,CNI,UninCredit,IOLock,PhysValid,Uninc{0,0}[0] inode_handle_orphan2(1) leaf(1) uninc(1) async(1) inode_handle_orphan3(1) -> 0
  looping on [ce402230]328/340(0)r5F:Index(1),Pinned,Phase0,Valid,Dirty,Writeback,Async,SegRef,CN,CNI,UninCredit,PhysValid,Uninc{0,0}[0] inode_handle_orphan2(1) cluster(1) uninc(1) async(1) inode_handle_orphan3(1)

  Then spin in a soft-lockup in lafs_inode_handle_orphan

  -----------
  - grow_index_tree needs to do initial incorporation so things can be
    found.  just like end of do_incorporate_internal.
    NO - cannot incorp yet as do not have phys addr.
    Don't need to as lafs_leaf_find explicitly handles this.
    For truncate case we don't use the stored address, but ensure all
    leaf indexes must be dirty (or gone) so whole tree must be
    accessible for walking around.
  - do_incorporate_internal needs to set B_PrimaryRef and take the ref
  - when we remove a B_PrimaryRef without incorporating it, we need to
    drop a ref if the *next* in the list is B_PrimaryRef
  - need to use a constant to identify 'async' calls etc.
  - maybe I need other iolock_block in truncate ?? to ensure it is
    Valid so it isn't found as async....

09Jun2010
  STILL struggling with incorporation.
  We have a premise that any file address is covered by precisely one
  leaf index block.  Every leaf index has an implicit address and it
  covers all addresses from there to the next leaf.  The last leaf
  covers to EOF.
  So there must always be a leaf at address 0.
  This applies within the tree from an internal index block too.
  Beneath an internal index block there must be a leaf covering every
  address up to the next internal index block.  So there must be a
  first.  So storing the first address is pointless.  And harmful.

  When an index block becomes empty and disappears its coverage is
  included in the previous block unless there is none, in which case
  the next index block must be re-addressed.
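
  The coverage premise and the effect of removing a leaf can be checked
  with a small model.  This is a sketch in Python under stated
  assumptions: leaves are represented only by their start addresses in
  a list, `eof` is one past the last file address, and the names
  coverage and remove_leaf are hypothetical.

```python
def coverage(starts, eof):
    """Return the half-open (start, end) range covered by each leaf."""
    starts = sorted(starts)
    # Each leaf covers up to the start of the next; the last covers to EOF.
    ends = starts[1:] + [eof]
    return list(zip(starts, ends))


def remove_leaf(starts, gone):
    """Return the leaf start addresses after the leaf at `gone` disappears."""
    starts = sorted(starts)
    i = starts.index(gone)  # raises ValueError if there is no such leaf
    rest = starts[:i] + starts[i + 1:]
    if i == 0 and rest:
        # No previous leaf to absorb the range, so the next leaf is
        # re-addressed to start where the removed one did.
        rest[0] = gone
    return rest
```

  Removing the middle leaf of [0, 64, 200] leaves [0, 200], so the
  leaf at 0 now covers up to 200.  Removing the leaf at 0 re-addresses
  the leaf that started at 64, again giving [0, 200].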
  If there is no 'next', this index block must be empty and so must
  disappear.
  BUT if we re-address an index block, we implicitly re-address the
  first child - recursively - so we need to move/rehash them all or
  lose them... or record where they are.  Or do lookup not by addr....
  I think just rehashing them all - with an iolock - is simple and
  safe.  So just do that.

  So: I cleaned up index handling a truncation somewhat.
  Now running looptest to see what patterns emerge:

  block.c:197 (*9+1)
     During umount, the Root datablock is Dirty+Realloc
     Maybe just need for cleaner to become inactive during umount
       - hope that doesn't deadlock
     didn't even work...
  block.c:529 (*4+1)
     erase dblock while iblock depth > 0
     When pruning InoIdx we want to set depth to 0.
     FIXME is this really what I want, or is depth=0 only for
     data-inode ... FIXME
  cluster.c:533 (*2)
     cluster_allocate on invalid block
     Block is 8/0 in writepage from sync_inodes
     This is the orphan file.  blocks aren't dirty
     I guess the file gets truncated while we wait for it.
     Just need to re-test.
  index.c:1936 (*2).
     An index block is Root - FIXED??
  modify.c:1056 - secondary bug, ignore for now.
  modify.c:1650
     update_index fails to find target.  second call, phys==0
     Code was bad ... may not be the cause though.
  modify.c:1696 (*4)
     lafs_incorporate gets non-dirty Index(1) block from orphan
     handler.
     Maybe just change the do/while back to 'do'.
  modify.c:1704: (*2)
     lafs_inc gets leaf with uninc list???  Index(0)/InoIdx in
     do_checkpoint
     uninc list gets set in lafs_add_block_address (parent of iblk),
     do_incorporate_internal,
     Maybe the InoIdx still had children.
  segments.c:1028. (*4)
     The free list becomes empty.
  super.c:655 (*3)
     Busy inodes after umount, and root InoIdx block is still pinned as
     inode 16 data block was still dirty.  segusage slow.
     Maybe same as block.c:197 ??
  invalid address 6b6b6bfb:
     invalidate_inode_buffers in shutdown finds invalid lock.
     presumably the inode was freed before invalidated.
spin on writeback during truncate (r3a) 8 times. now 10 Probably because writeback cannot proceed while orphan processing keeps looping. kmalloc-1024 problems - (*2) A block - should be start of page - isn't not what it appears... Others complete with 'cb' ranging from 202 to 715 10 June 2010 Looking at segment.c:1028 We run a seg_scan every checkpoint, so that should keep free segments in the list..... Ahh.. do_checkpoint is looping because root isn't changing phase. Lowest block pinned to old phase is [cfb7df08]0/74(4253)r0E:Pinned,Phase1,WPhase0,Valid,Dirty,CN,CNI,UninCredit,IOLock,Claimed,PhysValid which is not on leaf list because it has IOLock With more debugging: [ce5c5f08]0/74(4250)r0E:Pinned,Phase1,WPhase0,Valid,Dirty,Realloc,SegRef,CN,CNI,UninCredit,IOLock,Claimed,PhysValid or better (that was in lafs_iolock_written) [ce5c05e8]0/74(4257)r0E:Pinned,Phase0,WPhase0,Valid,Realloc,SegRef,C,CN,CNI,UninCredit,IOLock,Claimed,PhysValid FIXED - I didn't unlock if it wasn't dirty any more. Well almost - it occurs much less now. Out of 48 runs: 8 BUG: soft lockup - CPU#0 stuck for 61s! [lafs_cleaner:1180] 1 BUG: unable to handle kernel NULL pointer dereference at 000001b4 2 BUG: unable to handle kernel paging request at 6b6b6bfbt 3 kernel BUG at /home/neilb/work/nfsbrick/fs/module/block.c:197! 2 kernel BUG at /home/neilb/work/nfsbrick/fs/module/block.c:529!6 1 kernel BUG at /home/neilb/work/nfsbrick/fs/module/modify.c:1650! 2 kernel BUG at /home/neilb/work/nfsbrick/fs/module/modify.c:1696!8 4 kernel BUG at /home/neilb/work/nfsbrick/fs/module/segments.c:1028! 1 kernel BUG at /home/neilb/work/nfsbrick/fs/module/segments.c:332!! So we now have 1/12 rather than 2/3. a/ pinned by IOLock from file.c:220 - FIXED b/ as above c/ Root is pinned by 4 children 328/0 with 196 of data blocks in writeback/realloc, in a cluster 0/1, 74/0, 0/8 all in a cluster waiting writeout. Don't understand this. 
d/ as a,b Of the 48, 11 ran to completion leaving blocks from 286 to 899 Looking at the loss of blocks when truncating. tracing show small number of files with remaining blocks at delete. sum is 26+22+14+272+11+2 == 347 cf df shows cb=457 next attempt: 14+24+26*11 =324 cf cb=1124 next attempt 26+6+15+68+29 == 144 cf cb=383 26+18+14+19+284 = 361 cf 379 files are (in order) 49 bfile - 30K 325 nbfile-49 - 30K 320 nbfile-44 - 30K 296 nbfile-20 - 30K ??331?? 11 June 2010 Thinking about truncate and index blocks becoming empty while they still have children. For leaf indexes, we need to leave the block in place in case the children get written. We need to find a time to ultimately delete it... For internal indexes,.... uhm, it just works, OK?? When I drop an uninc block, I need to remove it from the uninc list, and from phase_leafs clearing dirty and refiling should remove from leafs. When we recurse to a parent, we need to remove *this* block from the uninc list for said parent. It should be the only thing in the list. But even when we don't recurse, the fact that we have incorporated means that we should tidy up the ->uninc list. 12 June 2010 unmount hung after lafs_run_orphans from lafs_put_super There are two orphans in Writeback which cannot progress until the current cluster is written... But they keep getting re-written! Other time, one orphan, index block is Dirty on a leaf ??? 
orph=[cfbdcf24]0/331(3780)r2E:Valid,SegRef,C,CI,CN,CNI,Claimed,PhysValid,Orphan(0) orphan_list(1) iblock(1) [cfb8e460]331/0(NoPhysAddr)r1F:Index(1),Pinned,Phase0,InoIdx,Valid,Dirty,SegRef,CN,CNI,UninCredit{0,0}[0] leaf(1) Leaf0(1) LAFS_cluster_flush 1 orph=[ce5c9bb4]0/327(3317)r2E:Valid,SegRef,C,CI,CN,CNI,Claimed,PhysValid,Orphan(0) iblock(1) orphan_list(1) [cfbe3a40]327/0(NoPhysAddr)r1F:Index(1),Pinned,Phase0,InoIdx,Valid,Dirty,SegRef,CN,CNI,UninCredit{0,0}[0] leaf(1) Leaf0(0) OK, problem is that when we truncate and remove an index block, the next index block expands backwards to fill the space. Then we apply prune_some, but don't check if anything was done. We always mark it dirty, so it has to be written and then we loop through again... So need to check if prune_some did anything. TODO: - prune_some need to get more done at a time - let cleaner finish up before umount - use early segments first ?? - look at write-clusters and check OK - check that df:cb= drops properly. Bugs: 1 BUG: spinlock lockup on CPU#0, sh/1168, c0441170 - SECONDARY BUG 1 BUG: unable to handle kernel NULL pointer dereference at 000001b4 3 BUG: unable to handle kernel paging request at 00100104 5 BUG: unable to handle kernel paging request at 6b6b6bfb 1 BUG: unable to handle kernel paging request at 7fffffff 7 kernel BUG at /home/neilb/work/nfsbrick/fs/module/block.c:197! 9 kernel BUG at /home/neilb/work/nfsbrick/fs/module/block.c:479! 2 kernel BUG at /home/neilb/work/nfsbrick/fs/module/block.c:529! 2 kernel BUG at /home/neilb/work/nfsbrick/fs/module/block.c:67!! 2 kernel BUG at /home/neilb/work/nfsbrick/fs/module/inode.c:828! 4 kernel BUG at /home/neilb/work/nfsbrick/fs/module/inode.c:843! 1 kernel BUG at /home/neilb/work/nfsbrick/fs/module/modify.c:1708! 7 kernel BUG at /home/neilb/work/nfsbrick/fs/module/segments.c:1028! 2 kernel BUG at /home/neilb/work/nfsbrick/fs/module/segments.c:332! 30 kernel BUG at /home/neilb/work/nfsbrick/fs/module/super.c:655! Quite a haul there! 
super.c:655 Pinned block in lafs_release: 0/2 is Dirty with plenty of credits, so it is a child 0/16 is Dirty/Realloc, or once Async Dirty, but not on a leaf list, not pinned segments.c:332 seg_deref with refcnt , 2 in lafs_seg_put_all segments.c:1028 No free segments - no real pattern. modify.c:1708 lafs_incorporate on non-dirty/realloc block 328/0 Index(1). 1 in uninc_table - probably during truncate. Either we add uninc while not dirty Or we clear Dirty while uninc present or there is a race between the two. Don't know: add a bugon Bugon in get_flushable didn't fire. inode.c:843 children present in truncate after final incorp... 328/0. 64 children, no uninc list. Maybe we ran the orphans too early?? or invalidate_page isn't removing the children. Might want print_tree here?- added that. Answer: all the children are in Realloc on Clean_leafs Maybe erase_page needs to disconnect from cleaner too?? inode.c:828 Orphan handling - uninc but not dirty: is Realloc (sometimes) Maybe like mod:1708 block.c:67 * delref 'primary' from modify.c:2063 in the q2 branch. nxt has PrimaryRef... Maybe move earlier, but that shouldn't make a diff. ditto at modify.c:2035 nxt is primary as was I, so drop mine. Don't know - looks like sibling list got broken. Tidied up a bit and added a print-tree. v.interesting result. Lots of consecutive index blocks all holding primary-ref on single primary - which is wrong. 1/ When setting PrimaryRef, if next holds PrimaryRef, then must take reference on self, as are being inserted into chain 2/ When splitting, new block must be addressed as first block which cannot fix, not first block which doesn't fit. Else incorping in reverse order can make lots of tiny index blocks. block.c:529 * erase with index depth > 1. 0/328 in orphan handling. Still have 8 or 15 blocks registered! Maybe caused by index block errors. Added some printks. block.c:479 * not enough credits to dirty block 2/0 in dir_delete_commit for unlink. 
74/xxxx in unlink 16/1 in seg_inc/seg_move...allocated_block/cluster_flush - writepage wrote the page?? - checkpoint wrote it and didn't replenish the credits? block.c:197 XX invalidated pages finds dirty block after EOF, after iolock_written 0/0 Dirty/Realloc in unmount - all Realloc! Need to wait for cleaner etc to finish at unmount time. NULL deref in 1b4 YY cleaner->cluster_flush->count_credits->lock?? Trying to get a lock on an inode that has since been free?? spin_lock(&dblk(b)->my_inode->i_data.private_lock); 001001 YY generic_drop_inode -- extra iput?? in lafs_inode_checkpin from refile 6b6b6b YY invalidate_inode_buffers!! in kill. use-after-free 7fffff seginsert from scan_seg MAX/number-elements confusion. Worked around for now. 18 June 2010 After a couple of fixes: 1 BUG: unable to handle kernel NULL pointer dereference at 000001b4 1 BUG: unable to handle kernel paging request at 00100104 5 BUG: unable to handle kernel paging request at 6b6b6bfb 4 kernel BUG at /home/neilb/work/nfsbrick/fs/module/block.c:209! 4 kernel BUG at /home/neilb/work/nfsbrick/fs/module/block.c:496! 3 kernel BUG at /home/neilb/work/nfsbrick/fs/module/block.c:67! 2 kernel BUG at /home/neilb/work/nfsbrick/fs/module/cluster.c:531! 16 kernel BUG at /home/neilb/work/nfsbrick/fs/module/inode.c:601! 1 kernel BUG at /home/neilb/work/nfsbrick/fs/module/inode.c:852! Realloc blocks confusing truncate 2 kernel BUG at /home/neilb/work/nfsbrick/fs/module/modify.c:118! 1 kernel BUG at /home/neilb/work/nfsbrick/fs/module/modify.c:1699! 7 kernel BUG at /home/neilb/work/nfsbrick/fs/module/segments.c:1033! 19 kernel BUG at /home/neilb/work/nfsbrick/fs/module/super.c:655! TODO: - truncate gets confused by blocks being cleaned. Need to flush cleaner, or just removed the blocks. - when add PrimaryRef in middle of list, take the right ref. - fix up wait-for-cleaner at unmount time. 19 Jun 2010 3 BUG: unable to handle kernel paging request at 6b6b6bfb. 
5 kernel BUG at /home/neilb/work/nfsbrick/fs/module/block.c:209! 5 kernel BUG at /home/neilb/work/nfsbrick/fs/module/index.c:1890! 22 kernel BUG at /home/neilb/work/nfsbrick/fs/module/inode.c:601! 2 kernel BUG at /home/neilb/work/nfsbrick/fs/module/inode.c:835! 3 kernel BUG at /home/neilb/work/nfsbrick/fs/module/inode.c:852! 9 kernel BUG at /home/neilb/work/nfsbrick/fs/module/segments.c:1033! 17 kernel BUG at /home/neilb/work/nfsbrick/fs/module/super.c:656! 251 SysRq : Resetting 3 SysRq : Show State - We can erase a dblock while it is in the uninc_pending or uninc_next - need to be careful - At umount, 0/2 is Dirty but not Pinned, so not written out ditto from 0/16 16/0 sometimes is Async 16/0 Async might be from the segment scan - so wait for that. Dirty but not pinned can happen when InoIdx is pinned. - I think the uninc_next list (At least) should be sorted before being allocated. - root block dirty/realloc/leaf in final iput Could be it was changed during last checkpoint so pushed in to next phase? But why Realloc? Maybe still issue with losing inode data block. 20 June 2010 Happy Birtyhday Dad!! 420 runs. 4 BUG: unable to handle kernel paging request at 6b6b6bfb. 26 kernel BUG at /home/neilb/work/nfsbrick/fs/module/block.c:209! 87 kernel BUG at /home/neilb/work/nfsbrick/fs/module/inode.c:601! 1 kernel BUG at /home/neilb/work/nfsbrick/fs/module/inode.c:839!0 4 kernel BUG at /home/neilb/work/nfsbrick/fs/module/inode.c:856!9 1 kernel BUG at /home/neilb/work/nfsbrick/fs/module/modify.c:1719!3 12 kernel BUG at /home/neilb/work/nfsbrick/fs/module/segments.c:1033! 2 kernel BUG at /home/neilb/work/nfsbrick/fs/module/super.c:656! Problems: - inode in i_sb_list has been freed. - block 0/0 is dirty/realloc/leaf after final iput - not all blocks freed by truncate - Index block with uninc is not dirty - not FIXED: more iolock in phase_flip - still children when truncate should have finished. all are Realloc Maybe inode has become unhashed and we re-load it?? 
it is invalid after all!! - Index block not dirty when incorp - has uninc. ?? - didn't wait for free segments - Data 16/0 is dirty but not pinned after final checkpoint - FIXED watch -d 'awk -f checkseg /tmp/log; echo ====== ; grep -h -E "(blocked for more|BUG|SysRq : )" /var/tmp/lafs-logs/log* | sort | uniq -c ; echo === ; ls /var/tmp/lafs-logs/log* | wc -l' watch -d 'echo ====== ; grep -h -E "(blocked for more|BUG|Busy inodes after|SysRq : )" /var/tmp/lafs-logs/log* | sort | uniq -c ; echo === ; ls /var/tmp/lafs-logs/log* | wc -l' Unclear on dirtying index blocks. We normally mark it dirty first, then add the address to the uninc list. Note that this is the reverse of data blocks which are changed first, then dirtied. So maybe we should mark dirty afterwards. We then need to avoid incorporation while we are adding addresses else we might find it has addresses but is not dirty. Only try if dirty? Maybe we should iolock the parent. We need to do that anyway to flush incorporations when the table is full. Yes, that fits the VM model better. Always lock while updating and preparing to write. Set writeback once write has started, then unlock. Cool. Only a block is iolocked when we allocate (to 0), so we cannot lock the parent.. 21June2010 Apart from tracking down the remaining bugs, I need to: 1/ Decide on locking for incorporation and attaching new address to a block and implement it. In particular we need to not lose the Dirty flag before the update is done. 2/ Resolve handling of pinned inode data/index blocks 3/ Correct handling of empty index blocks, particularly when parent is in different phase. Make lookup be more careful? 4/ Wait for there to be enough free segments before allowing allocation. 2: Problem is that we cannot handle a pinned inode-data block while the InoIdx block is pinned in the same phase. We currently unpin it so it drops off the leaf list. But then we need to re-pin it when the InoIdx is unpinned or phasefliped, and that gets ugly. 
Possible though. An alternative is to treat it like a parent and keep it off the list while the InoIdx is pinned/same-phase. So we would need to re-assess it after unpinning or flipping the InoIdx. That is probably a lot easier than re-pinning it.

1: We would normally set 'dirty' after changing the block. But we need to differentiate Dirty from Realloc, so we set it before adding addresses. This requires that we are careful not to write an index block while there are pending changes. The fact that pinned children stop any writing, as do pending addresses in a list, should ensure this.

3: When an index block becomes empty we need to make sure that future lookup doesn't get confused by it. Specifically future index lookup must avoid the block so nothing new gets added. Possibly a previous block will split again, but this block must remain unused.
However we cannot update the parent block immediately as it might be in a different phase. So we must record both "don't touch this" and "where to look instead" elsewhere - in children.
If the block being deleted is *not* the first child in the parent, then we direct index lookup to the earlier block.
If the block being deleted *is* the first child in the parent, then redirect to the second child if there is one and we weren't just there.
If there is no other block we flag the parent as empty and retry from the top.
We flag a parent as empty with B_EmptyIndex.

What locks do we need to walk around the sibling list? The inode private_lock is minimal, but we cannot hold that to take an iolock - just to get a reference.
I guess we
 - iolock the parent
 - try to find a good block using private_lock
 - get a ref and wait for it.
 - check if it is still a good block. If not, start again

If we find an EmptyIndex block, it must be directly addressed by parent. It will never be followed by a PrimaryRef block because if there were such a block, we would have readdressed it back and hidden the EmptyIndex.
So we need to look around for an address in the parent that leads to a non-EmptyIndex block. If all children are empty, we need to make the parent empty. But what if it is InoIdx?

Maybe I am making this too hard. I could just use i_alloc_sem to block lookups while truncate is happening. That doesn't address single block removal e.g. from directories.

So I need to be able to wait for incorporation to happen on an empty index block. We hold iolock on the parent.
If there are blocks on ->uninc, we just process them immediately.
If there are blocks on ->uninc_next, we wait for the checkpoint to complete.

What does lafs_incorporate actually do with EmptyIndex blocks? Provided they match currently incorporated addresses, they just cause those addresses to disappear.

If a block is in the uninc list for its parent, and is then phase_flipped, changed and written out, it could get a new physaddr before it is incorporated. I guess we never allocate a B_Uninc block which is in a different phase to the parent. Currently we wouldn't do that anyway except in truncate, though memory pressure on index blocks might one day??

Truncate? We cannot allocate directly in lafs_incorporate. We should get lafs_cluster_allocate to notice and DTRT.

Only hash index blocks when they are incorporated. Not needed before then.

When processing an uninc list, if an address appears twice, prefer the one that isn't EmptyIndex...

22June2010

I need a clear picture of the "Steady state" for an internal index block with its children.

The internal index block contains 1 or more addresses. For each address there may be a child index block. If there is, it may be the head of a list of blocks with B_PrimaryRef set, thus holding the whole list in place until incorporation happens.
Each of these children can be on either ->uninc_list or ->uninc_next, or possibly neither if they haven't been queued for writing yet.
Any PrimaryRef block will be Pinned.
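To keep that steady state straight, here is a small sketch of it as a checkable invariant. It is only a model: struct model_child, enum model_queue and the two check functions are invented names, and the rule that the head of a chain does not itself carry PrimaryRef is my reading of "directly addressed by parent", not something the code is known to enforce.

```c
#include <stdbool.h>
#include <stddef.h>

/* Which of the parent's lists a child is queued on, if any. */
enum model_queue { Q_NONE, Q_UNINC, Q_UNINC_NEXT };

/* Hypothetical stand-in for a child index block. */
struct model_child {
	bool primary_ref;		/* models B_PrimaryRef */
	bool pinned;			/* models Pinned */
	enum model_queue queue;		/* at most one list, by construction */
	struct model_child *next;	/* next block in the same chain */
};

/* Check the chain hanging off one address in the parent. */
static bool model_chain_ok(const struct model_child *head)
{
	const struct model_child *c;

	if (!head)
		return true;	/* address present, child not loaded */
	if (head->primary_ref)
		return false;	/* head is addressed by the parent itself */
	for (c = head->next; c; c = c->next) {
		if (!c->primary_ref)
			return false;	/* followers hold the chain in place */
		if (!c->pinned)
			return false;	/* any PrimaryRef block must be Pinned */
	}
	return true;
}

/* An internal index block holds one or more addresses, each with a valid chain. */
static bool model_parent_ok(struct model_child **heads, size_t naddrs)
{
	if (naddrs < 1)
		return false;
	for (size_t i = 0; i < naddrs; i++)
		if (!model_chain_ok(heads[i]))
			return false;
	return true;
}
```

Folding the two uninc lists into one enum is a design shortcut: it makes "on either list, or neither" impossible to violate in the model, so only the PrimaryRef and Pinned rules need explicit checks.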
When a child is incorporated and found to be Empty it is flagged as such and then must never be returned by index lookup. Index lookup will either add a block to a leaf index so it doesn't appear empty, or will hit an EmptyIndex block and so have to start again from the top.

When a PrimaryRef block becomes empty it is simply removed from the PrimaryRef chain so it cannot be found. The space now belongs to the previous block.

When a non-PrimaryRef block which isn't the first becomes empty it is flagged and left in place so that following blocks can be found. The address space now belongs to the previous block.

When the first child (fileaddr matches parent) becomes empty - what?
 - We could re-address first child but that forces early address change - old might not be incorp yet
 - We could re-address the parent, but that doesn't work for InoIdx
 - We could leave it there with physaddr == 0
Last sounds promising. So we never re-address an index block.

So: From the top.

Index blocks, Indirect blocks, extent blocks each have an address that never changes. When a block becomes over-full it splits - a new block appears with a new address thus implicitly limiting the address space covered by the original.

When an index block becomes empty and has no pinned children it is marked as EmptyIndex (under IOLock). When an EmptyIndex is allocated it goes to phys==0.

An EmptyIndex which is not first (->fileaddr != ->parent->fileaddr) is never used again. Its address space is ceded to the previous index block - which could split several times...

An EmptyIndex which is first can be re-used. Once it gets pinned children the EmptyIndex is cleared.

An Index block always has an entry for the first address. It might be implicit to phys==0. Loading such a block creates an empty block.

InoIdx doesn't get EmptyIndex, rather it gets ->depth=1.

Indirect *doesn't* store the first address any more.
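The EmptyIndex rules above can be sketched as a little userspace model, mostly to check that they hang together. All names here (struct model_index, model_lookup_may_use, model_index_lookup and so on) are hypothetical, and the sibling array sorted by fileaddr is an assumption made for the sketch rather than how the real sibling list is kept.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for an index block. */
struct model_index {
	unsigned int fileaddr;		/* never changes once assigned */
	bool empty_index;		/* models B_EmptyIndex */
	int pinned_children;
	struct model_index *parent;
};

/* "First" means the block shares its parent's fileaddr. */
static bool model_is_first(const struct model_index *ib)
{
	return ib->parent && ib->fileaddr == ib->parent->fileaddr;
}

/* A non-first EmptyIndex is never used again; a first one may be re-used. */
static bool model_lookup_may_use(const struct model_index *ib)
{
	return !ib->empty_index || model_is_first(ib);
}

/* Mark empty only with no addresses left and no pinned children. */
static void model_mark_empty(struct model_index *ib, int naddrs)
{
	if (naddrs == 0 && ib->pinned_children == 0)
		ib->empty_index = true;
}

/* Adding a pinned child clears EmptyIndex. */
static void model_pin_child(struct model_index *ib)
{
	assert(model_lookup_may_use(ib));
	ib->pinned_children++;
	ib->empty_index = false;
}

/*
 * Pick the block covering addr from siblings sorted by fileaddr.
 * A skipped EmptyIndex cedes its address space to the previous usable block.
 */
static struct model_index *model_index_lookup(struct model_index **sib,
					      size_t n, unsigned int addr)
{
	struct model_index *best = NULL;

	for (size_t i = 0; i < n && sib[i]->fileaddr <= addr; i++)
		if (model_lookup_may_use(sib[i]))
			best = sib[i];
	return best;
}
```

Because the first block is always usable, lookup can only come back empty-handed when addr is below the first fileaddr, which matches "an Index block always has an entry for the first address".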
Changes: DONE - remove forcestart from layoutinfo DONE - remove start-address from Indirect blocks DONE - only hash index blocks when they are known to be incorporated. DONE - when incorporating an uninc list, ignore phys==0 if also a block with same fileaddr and phys!=0. so sort phys==0 first DONE - Create EmptyIndex flag DONE - Clear the flag when adding child pin to index block DONE - avoid EmptyIndex non-start blocks during index lookup DONE - allow index blocks to be loaded with ->phys==0 DONE - allow EmptyIndex index block to be "written" to phys 0 DONE - ensure index lookup finds implicit start address, possibly 0 So now after 36 runs 3 kernel BUG at /home/neilb/work/nfsbrick/fs/module/index.c:1939! 1 kernel BUG at /home/neilb/work/nfsbrick/fs/module/index.c:403! 10 kernel BUG at /home/neilb/work/nfsbrick/fs/module/inode.c:605! 14 kernel BUG at /home/neilb/work/nfsbrick/fs/module/segments.c:1034! 4 kernel BUG at /home/neilb/work/nfsbrick/fs/module/segments.c:624! 1 kernel BUG at /home/neilb/work/nfsbrick/fs/module/super.c:657! 3 SysRq : Resetting index.c:1939 block 0/2 is Realloc and being allocated from cluster_flush while parent is not Realloc or dirty That is bad as Realloc gets set in lafs_allocated_block ... except that the code was bad. FIXED. index.c:403 cleaner is pinning a block (299/25) which is not Realloc, and phase isn't locked. We are only meant to pin data blocks for updates while holding a phase lock. Ahhh - bad code again. FIXED inode.c:605 Truncate doesn't clean up properly. 327 has 60+1 331 has 108+1 327 has 34+1 327 has 60+1 No sign of any children. Very weird. Signed in incorporation going wrong. Added more debugging. Found 4084 4 12 at 890 Added 4084 4 12 Found 4089 4 16 at 878 Added 4089 4 16 Found 4094 2 20 at 866 Added 4094 2 20 Found 2561 2 22 at 854 Added 514 2 22 Found 2564 4 24 at 842 Found 2569 2 28 at 830 Found 0 0 0 at 818 Why are 2564 etc lost? No sign of alloc-to-0 segments.c:1034 no free segments - need to wait somewhere. 
segments.c:624 allocated_blocks has gone over free_blocks! in lafs_prealloc/reserve_block/free_get/ss_put/new_segment.../checkpoint. Wanted CleanSpace to reserve the youthblk Maybe related to not waiting - ignore for now. super.c:657 block 0/2 was dirty but not pinned. Should not happen to inodes. block 0/0 was Pinned because it had a child - as above. Maybe we don't carry the pin across when we collapse dir into inode??... looks quite likely 23 June 2010 116 runs. 1 BUG: unable to handle kernel paging request at 6b6b6bfb 1 kernel BUG at /home/neilb/work/nfsbrick/fs/module/block.c:497! 3 kernel BUG at /home/neilb/work/nfsbrick/fs/module/dir.c:710! 7 kernel BUG at /home/neilb/work/nfsbrick/fs/module/inode.c:606! 61 kernel BUG at /home/neilb/work/nfsbrick/fs/module/segments.c:1034! 1 kernel BUG at /home/neilb/work/nfsbrick/fs/module/super.c:657! 42 SysRq : Resetting 6b6b6bfb: invalidate_inode_buffers called on at shutdown. Still wierd block.c:497 FIXED?? block 16/1 is not dirty with no credits. Maybe writepage got to it? dir.c:710 ouch! dir lookup failed in unlink. No real hints. Must be hash based - some off-by-one probably. Need to stare at the code. inode.c:606 FIXED Blocks still present after truncate. typically about 60, but in 1 case '4'. No index blocks. So probably content of second index block. Yes, lafs_leaf_next was doing the wrong thing for addresses before start of block. segments.c:1034 same old super.c:657 FIXED dir inode 0/2 is still Dirty but not pinned. Maybe lafs_dirty_inode should be pinning the block But now this triggers for 16/X still dirty. How and when to write blocks in a SegmentMap file? - We don't want normal write-back to write them unless they have no references - We need to write them in tail of checkpoint, and index info must follow in the next checkpoint. lafs_space_alloc is called from - mark_cleaning: always CleanSpace, failure is OK - lafs_cluster_update_pin: ReleaseSpace. 
-EAGAIN is OK (CHECK THIS) but failure is not - or shouldn't be. - lafs_allocated_block: CleanSpace, checking if parent of Realloc block can be saved separately from any Dirty version. Failure OK, blocking not. - lafs_prealloc - general space allocation. - lafs_cluster_update_pin is call from: - lafs_create, lafs_link, lafs_unlink, lafs_rmdir, lafs_symlink, lafs_mkdir lafs_mknod, lafs_rename, - lafs_write_inode So best to return -EAGAIN, and it should be handled adequately. lafs_prealloc is called from: - lafs_reserve_block, after modifying the alloc_type extensively. - lafs_phase_flip to re-fill the 'next' credits. If they aren't available we simply pin all children so they aren't needed. So failure is OK - lafs_seg_ref_block: getting CleanSpace to save segusage blocks. If this fails .. what?? lafs_reserve_block fails. so... lafs_reserve_block is called from - mark_cleaning - CleanSpace - lafs_pin_dblock - type is passed int... - lafs_prepare_write - on failure write will fail or retry after checkpoint - lafs_inode_handle_orphan - to help with delete. On failure we allow cleaning to happen - lafs_seg_move - should be elsewhere. Failure BAD ! - lafs_free_get - as above, failure BAD - clean_free - update youth for new clean blocks - Failure BAD lafs_pin_dblock is called from - dir_create_pin - fail or again handled - dir_delete_pin - dir_update_pin - lafs_create etc - lafs_dir_handle_orphan - choose_free_inum - inode_map_new_pin - lafs_new_inode ... - lafs_orphan_release !! cannot handle failure - roll_block should use AccountSpace So: It seems we need a new allocation class that will never fail. Maybe it is allowed to BUG though? AccountSpace - i.e. space need to account for the use of space. Must never ever fail. Then we must ask where blocking should happen on -EAGAIN. dir.c does "lafs_checkpoint_unlock_wait", then tries again. prepare_write does too. For that to work we must start a checkpoint on returned EAGAIN.... 
Don't we want to wait for some cleaning to happen first though? Maybe an extra flag, and a count of the number of empty (but not clean) blocks.

 - Should I skip orphan handling when tight on space? Probably not. It will just keep failing while we keep cleaning...
 - roll_block should use account_space .. or not

 - lafs_space_alloc simply allocates space, or fails. 'why' is used to guide watermark choice.
 - lafs_prealloc allocates space to a block and all its parents based on 'why' for watermarks. It either succeeds or fails.
 - lafs_cluster_update_pin and lafs_reserve_block decide whether to respond to failure as -ENOSPC or -EAGAIN based on 'why'.
 - lafs_pin_dblock simply passes on the failure, which must be handled.

So: What to do when we return -EAGAIN? We need to wait until there are *enough* clean segments, then cause a checkpoint so they become free.
So a flag that says 'waiting for free space' and a count of segments required.

But how do we differentiate ENOSPC and EAGAIN for NewSpace requests? Maybe we don't ?? Or do it later.

Still to do:
 - Audit all AccountSpace and justify them
   + lafs_seg_move is probably wrong. Should have allocated when the free segment was allocated
 - lafs_orphan_release calls lafs_pin_dblock but cannot handle failure
 - Need to wait not just for "enough space" but for "enough clean segments".
 - how is 'free_blocks' set - what does this tell us??
   free_blocks is the sum of known-clean segments.
   We probably want:
     clean segments
     remainder for each active segment
     then reserve some segments for cleaning.
   And separate 'allocated_block' for each ?

Notes:
 segments.c:647 fired: AccountSpace had no space available. Reserving space to write the segusage of youth block for a newly allocated segment.
 super.c:657 STILL 0/2 is Dirty but not Pinned. Maybe we need PinPending
 soft lockup in the cleaner! Maybe I need cond_resched??

Maybe I want two separate 'free_blocks' counters.
One that includes all free blocks for use in 'df' etc.
One that only includes completely free segments for use in allocation...

24 June 2010

Something is wrong with cleaning and segment tracking.
We have 5 free segments and we get them all without writing anything! We consume them all with cluster_flush!
It seems that the root inode is not changing phase! Nothing is on the phase leafs.
Most children are in Writeback on cluster, and are Realloc. Others have pinned children.
They are all in 'cluster', but 'flush' doesn't flush them, so they must be in a different cluster???
Is the cleaner still cleaning? Yes, they are on the cleaner 'wc' list so they are queued but not flushed for the cleaner.

25 June 2010

At last it looks like I nearly have a working FS. Out of 361 test runs, 9 triggered BUGS and one hung at umount.

I need a new TODO list, starting with 6 jul 2007(!) and adding any FIXMEs etc.

DONE 0/ start TODO list
DONE 1/ document new bugs
DONE 2/ Tidy up all recent changes as individual commits.
DONE 3/ clean up the various 'scratch' patches discarding any tracing that I don't think I need, and making the rest 'dprintk' etc.
DONE 4/ check in this README file
DONE 5/ Write rest of the TODO list
DONE 5a/ index.c:1982. Data block with Phys and no UnincCredit. It is Dirty but only has *N credits. 16/1 ...
DONE 5b/ phase_flip/pin_all_children/lafs_refile finds refcnt == 0; I guess we should getref/putref.
DONE 5c/ dirty_inode might find InoIdx is allocated but datablock not and doesn't cope well.
DONE 5d/ At unmount, 16/1 is still pinned.
6/ soft lockup in unlink call.
   EIP is at lafs_hash_name+0xa5/0x10f [lafs]
   [] hash_piece+0x18/0x65 [lafs]
   [] lafs_dir_del_ent+0x4e/0x404 [lafs]
   [] ? lafs_hash_name+0xfa/0x10f [lafs]
   [] dir_delete_commit+0xdb/0x187 [lafs]
   [] lafs_unlink+0x144/0x1f4 [lafs]
   [] vfs_unlink+0x4e/0x92
   Don't know. Looks like cleaning up a chain in dir_delete_commit. Added a BUG_ON.
   Would we be spinning on -EAGAIN ?? 4 empty segments are present.
6a/ index.c:1947 - lafs_add_block_address of index block where parent has depth on 1. looping on [cfbd4690]327/336(0)r3F:Index(1),Pinned,Phase0,Valid,SegRef,CI,CN,CNI,UninCredit,PhysValid,PrimaryRef,EmptyIndex,Uninc{0,0}[0] uninc(1) inode_handle_orphan2(1) leaf(1) /home/neilb/work/nfsbrick/fs/module/index.c:1947: [cfbd5c70]327/0(0)r2F:Index(1),Pinned,Phase0,Valid,Dirty,Writeback,SegRef,CI,CN,CNI,UninCredit,PhysValid,EmptyIndex,Uninc{0,0}[0] inode_handle_orphan2(1) leaf(1) 6b/ check_seg_cnt sees to be spinning on the 3rd section the clean list has no end! we were in seg scan CLEANABLE: 0/0 y=0 u=0 cpy=32773 CLEANABLE: 0/1 y=0 u=0 cpy=32773 CLEANABLE: 0/2 y=0 u=0 cpy=32773 CLEANABLE: 0/3 y=32773 u=6 cpy=32773 CLEANABLE: 0/4 y=32772 u=124 cpy=32773 CLEANABLE: 0/5 y=32771 u=273 cpy=32773 CLEANABLE: 0/6 y=32770 u=0 cpy=32773 of 0 0 1 2 3 6 4 124 5 273 6 0 7 496 8 0 6c/ at shut down, some simple orphans remain missing wakeup ??? DONE 7/ block.c:624 in lafs_dirty_iblock - no pin, no credits truncate -> lafs_invalidate_page -> lafs_erase_dblock -> lafs_allocated_block / lafs_dirty_iblock Allocated [ce44f240]327/144(1499)r2E:Writeback,PhysValid clean2(1) cleaning(1) -> 0 SEGMOVE 1499 0 Oh dear: [ce44f240]327/144(0)r2E:Writeback,PhysValid clean2(1) cleaning(1) .......: [cfb69180]327/0(349)r2F:Index(1),Pinned,Phase0,Valid,PhysValid{0,0}[0] child(1) leaf(1) Why have I no credits? /home/neilb/work/nfsbrick/fs/module/block.c:624: [cfb69180]327/0(349)r2F:Index(1),Pinned,Phase0,Valid,Dirty,PhysValid{0,0}[0] child(1) leaf(1) Cleaning is racing with truncate, and that cannot happen!! Actually it could - if i_size changed at the wrong time. DONE 7a/ block.c:507 in lafs_dirty_dblock - no credits for 0/2 block.c:507: [cfa63c58]0/2(4348)r2F:Valid,Dirty,Writeback,PhysValid cluster(1) iblock(1) in touch_atime. I think I know this one. 7b/ soft lockup in cleaner between 0x5e6, then 0x799-7f6 then 0x990 of 0x1502 i.e. 
1510, 1945-2038, 2448 of 5378 Appear to be looping in first loop of try_clean, maybe group_size_words == 0 ?? Add BUGON and wait. DONE 7c/ NULL pointer deref - 000001b4 Could be cluster_flush finds inode dblock without inode. Have a BUG_ON of this now. DONE 7d/ paging request at 6b6b6bfb. invalidate_inode_buffers called, so inode_has_buffers, so private_list is not empty. So presumably use-after-free. But is on s_inodes list. Probably cleaner is still active (if this is first call to invalidate_inodes in generic_shutdown_super) so list gets broken. We need locking or earlier flush. DONE 7e/ Remove BUG block.c;273 as cleaner can cause this. Check for Realloc too. PRESUME-FIXED 7f/ index.c:2024 no uninc credit [ce532338]0/306(2996)r1F:Pinned,Phase0,Valid,Dirty,Writeback,SegRef,Claimed,PhysValid cluster(1) found during checkpoint. Maybe inode credit problem. PRESUME-FIXED 7g/ inode.c:831 InoIdx 283/0 is Realloc, not dirty, and has ->uninc blocks. This is during truncate. Need some interlock with cleaner maybe? Probably the same race between cleaner and truncate. DONE 7h/ inode.c:845 truncate finds children - Realloc on clean-leafs NOLONGERRELEVENT 7j/ resolve space allocation issues. Understand why CleanSpace can be tried and failed 1000 times before there is any change. DONE 7k/ use B_Async for all async waits, don't depend on B_Orphan to do a wakeup. write lafs_iolock_written_async. DONE 7l/ make sure i_blocks is correct. set on 'import_inode' decreased when lafs_summary_update assigned block to '0' changed when lafs_summary_allocate changes e.g. quota. lafs_summary_update is called when a block is assigned to a location, or to zero. It is real usage. lafs_summary_allocate is called when we set Prealloc on phys==0 or clear Prealloc on phys==0 So allocate must be followed exactly. update is already counted for setting !=0, so only dec on ==0. So all is good. What about quota? 
- hidden in quota_allocate / qcommit 7m/ delete inode could not progress through inode_map_free, so ORPH: [ce545f08]0/290(1663)r4E:Valid,Dirty,Async,UninCredit,Claimed,PhysValid,Orphan(0) async(1) delete_inode(1) orphan_list(1) drop(1) was permanently an orphan. DONE 8/ looping in do_checkpoint root is still i Phase1 because 0/2 is in Phase 1 [cfa57c58]0/2(2078)r1E:Pinned,Phase1,WPhase0,Valid,Dirty,C,CI,CN,CNI,UninCredit,IOLock,PhysValid writepageflush(1) Seems to be waiting for writeback, but writeback is clear. Need to call lafs_io_wake in lafs_iocheck_writeback for when it is called by lafs_writepage DONE 9/ cluster.c:478 flush_data_To_inode finds Realloc (not dirty) block and InoIdx block is not Valid. [cfb5ef50]2/0(3)r1F:Index(0),Pinned,Phase1,InoIdx,SegRef,C,CI,CN,CNI,IOLock,OnFree,PhysValid{0,1}[0] child(1) I wonder if it was PinPending, or where it was IOLocked (or if). I guess we truncated, then added data, then tried to clean. Probably just a bad 'bug' given recent changes. No, I think it is the race between truncate and clean which is now fixed. SEEMS TO BE GONE 10/ inode.c:606 Deleting inode 328: 2+0+0 1+0 2 level index. first index at level 1 was full and prune properly. Nothing else found empty. Somehow the second index block and contents were lost. ASSUME_DONE 11/ super.c:657 Root still pinned at unmount. 0/2 is Dirty: [cfa53c58]0/2(1750)r0E:Valid,Dirty,CN,CNI,UninCredit,PhysValid [cfa5fc58]0/2(2852)r0E:Valid,Dirty,SegRef,CN,CNI,UninCredit,PhysValid [cfa53c58]0/2(3570)r0E:Valid,Dirty,CN,CNI,UninCredit,PhysValid [cfa53828]0/2(2969)r0E:Valid,Dirty,CN,CNI,UninCredit,PhysValid [cfa75c58]0/2(579)r0E:Valid,Dirty,UninCredit,PhysValid maybe dir-orphan handling stuffed up Or maybe it is the I_Dirty issue. Assume fixed. ASSUME_DONE 12/ timeout/showstate in unmount umount is in sync_inodes / do_writepages / lafs_writepage / lafs_iolock_written That looks similar to 8 DONE 13/ delete_inode should wait for pending truncate to complete. 
Document I_Trunc somewhere - including that i_mutex is needed to set it. Verify that assertion. Actually it requires i_alloc_sem, or the inode to be deleted. DONE 14/ Review writepage and flush and make sure we flush often enough but not too often. Probably just remove the cluster_flush from write-page as lafs_flush will do that. But leave for now as it encourages heavy indexing. DONE 14a/ use bio_add_page to write clusters. DONE 14b/ Figure out what backing_dev to present for the filesystem. DONE 15/ The inode map file lost some credits. I think it losts a PinPending because it isn't locked properly. Don't clear PinPending if someone else might have set it. DONE15a/ Find all FIXMEs and add them here. DONE 15b/ Report directory size less confusingly DONE 15c/ roll-forward should not update index if physaddr hasn't changed (roll_block) DONE 15d/ What does I_Dirty mean - and implement it. FIXED 15e/ setattr should queue an update for the inode metadata. and clean up lafs_write_inode at the same time (it shouldn't do an update). and confirm when s_dirt should be set. It causes fsync to run a checkpoint. 15f/ include timestamp in cluster_head to set mtime/ctime properly on roll-forward? ## Items from 6 jul 2007. 15g/ test directories with non-random sequential hash. DONE 15h/ orphan deadlock lafs_run_orphans- lafs_orphan_release can block waiting for written in erase_dblock, but that won't complete until cleaner gets to run, but this is the cleaner blocked on orphans. DONE 15i/ separate thread management from 'cleaner' name. DONE 15j/ review rules in getref_locked - and document them DONE - fix accesses to iblock DONE 15k/ newblocks should probably be a count of segments. Review that. DONE 15l/ make sure checkpoint_youth is decayed properly. Review youth decay. DONE 15m/ consider combining .orphans and .cleaning lists. If something is an orphan, we probably don't want to clean it just now(?). DONE 15n/ consider if lafs_pin_dblock should check for iolock. 
Maybe iolock or PinPending (which must be set under iolock). Just require PinPending and always get iolock_written for that except in special cases. DONE 15o/ Can there be async blocks when checkpoint starts? Could they pin blocks in old phase? Do I need to check for them? DONE 15p/ Review and remove the 'if cleaner is active then don't checkpoint just yet' thing - or somehow avoid the yuckiness. DONE 15q/ check checksums when reading cluster_header for cleaner This is already done! DONE 15r/ consider further optimisation in cleaner to avoid lookups. DONE 15s/ memory barrier for i_size check in cleaner??? DONE 15t/ review usable-space calculations in clean. DONE 15u/ Do I need a SegRef when pin-dblock-by-hand in flush_data_to_inode DONE 15v/ tidy up all code that fiddles bits and credits - maybe make some common helpers. DONE 15w/ review cluster updates and make sure space used is accounted properly. DONT BOTHER 15x/ Consider caching result of a failed dir lookup in case we immediately try to create it. Would this actually save anything significant? DONE 15y/ Don't make dir blocks into orphans if it cannot be needed? DONE 15z/ make sure symlink creation is safe - do I need to log the body?? DONE 15aa/ lafs_rename should flush orphans just like lafs_rmdir does. DONE 15ab/ Does writepage need to recheck if my_inode and/or iblock have appeared after lock is taken on block? DONE 15ac/ if lafs_shrinker cannot reclaim enough index blocks, trigger some writeout. DONE 15ad/ review lafs_phase_flip's call to lafs_add_block_address and wonder if more is needed. DONE 15ae/ refile wonders about a race with cluster_allocate which gets IOLock before removing from lru. DONE 15af/ Review all locking in lafs_refile DONE 15ag/ Don't allocate data part of InoIdx block. DONE 15ah/ Is there a problem with lafs_allocated_block putting an about-to-be-truncated block on an uninc list? 
DONE 15ai/ When allocating a new segment during checkpoint, delay the youth-block update until after the checkpoint DONE 15aj/ When roll-forward finds a new segment, make sure youth number is updated. DONE 15ak/ Load orphan file during roll-forward and make every block an orphan. DONE 15al/ set filesystem update_time somewhere. DONE 15am/ filesystem 'name' needs to be handled uniformly. DONE 15an/ can we be sure 'b' will be non-null in delete_inode? DONE 15ao/ determine what locking is needed to walk the children list in lafs_inode_handle_orphan. Probably the address_space private lock. 15ap/ Make sure write_inode has been cleaned up. See if this applies to rollforward of a symlink (see FIXME) DONE 15aq/ change inode map to be little-endian, not host-endian DONE 15ar/ understand what to do about errors in lafs_truncate 15as/ handle errors from lafs_write_super ??? DONE 15at/ More wait_queues to wait for different blocks. just use wait_on_bit / wake_bit DONE 15au/ How should iocheck_block set the page error? and block_loaded <- this gets it right. 15av/ ditto for write errors? DONE 15aw/ when lafs_incorporate makes a new block where the old is Realloc, the new should be Realloc too. 15aw2 / When a block is a snapshot block it can never be dirty so we only need credits for realloc... DONE 15ax/ Think about what happens when we relocate a block in the orphan list (lafs_orphan_release), particularly if the block isn't actually loaded. FIXME still need to make sure errors will loading the orphan file are handled correctly - I guess we mark all bad orphans as type==0 and when we find those during release, reduce the size of the orphan file. DONE 15ay/ Wonder if there is any way for run_orphans to get a wakeup when an inode or dir mutex is released. No, there isn't. DONE 15az/ Sanity check all values in cluster head during roll-forward i.e. in roll_valid. If the head isn't complete, we can still use this to commit some previous checkpoints. 
DONE 15ba/ roll forward should not BUG on bad data like inodefile in non-primary filesystem.
DONE 15bb/ Do I need to sync something before copying an update over part of an inode, then reloading the inode?
DONE 15bc/ Handle DescHole in roll forward.
DONE 15bd/ Call lafs_add_block_address from writeback rather than iolock in roll forward, just for consistency.
DONE 15be/ Confirm various files loaded at mount time (segusage, orphan ...) are actually the correct type.
DONE 15bf/ Avoid quadratics in lafs_seg_put_all - nothing else should be doing a lookup - or at least we can test for that.
     lafs_seg_apply_all has similar problems and needs a good solution.
DONE 15bg/ lafs_seg_ref_block is worried about losing implicit ref on parent if parent splits. See what to do about that.
DONE 15bh/ after roll-forward, check that free_blocks hasn't gone negative, or handle if it has.
DONE 15bi/ Set EmergencyClean a bit later - need at least one checkpoint first. to twostage.
DONE 15bj/ Make sure .last link in segtracker is kept uptodate, particularly in segdelete.
DONE 15bk/ make sure get_cleanable doesn't lose a race before calling add_clean
DONE 15bl/ better checks for 'valid state block address' in valid_devblock
     include that segment_count is credible
     also in valid_stateblock
15bm/ make sure everything gets freed properly on error during mount / lafs_load
15bn/ How does refcounting of 'struct fs' work with multiple filesets?
DONE 15bo/ use put_super to drop last reference to superblocks
DONE 15bp/ review all superblocks - maybe use more anon??
15bq/ check readonly status in lafs_get_sb
DONE 15br/ sync_fs should probably wait for something if 'wait'.
DONE 15bs/ set f_fsid properly in lafs_statfs
DONE - use new write_begin / write_end
15bt/ - review how we ensure that credit remains with block.
15ca/ When pinning an inode data block, pin it as well as the index block, I think.
     It is still kept off the leaf list until the index block is done with, I think.
15cb/ Layout issues:
  DONE - subset filesys still needs a parent pointer
  DONE - cluster head needs mtime/ctime to log these.
       - need better tracking of which devices are in this array??
         Need to be able to have read-only devices that are shared among arrays.
  DONE - need multiple parallel write-clusters to allow parallel writes.
       - record tuning in state block:
           - max_segs
  DONE - use crc or something, not toy checksum (e.g. cluster - state already has)
       - flags for inconsistencies found, at layout/fileset/file levels(?) (see 60)
       - policies of whether old or new data is allowed on each device
       - policies of how much duplication of metadata is required
  DONE - inode map - not host-endian
  DONE - segments > 16bit: segusage file - what about youth? cluster_head Clength
15cc/ free any stray B_Async block found in destroy_inode
15cd/ Some code assumes a cluster header does not exceed 1 page. Is this safe? Is it true? Is it enforced?
     roll-forward now handles large cluster_head. Need cleaner to handle it, and need to possibly write large cluster head when making new clusters.
15ce/ classify BUGs as
     - internal logic errors
     - IO errors
     - unusual conditions I want a warning of
     - data corruption errors
DONE 15cf/ lafs_iget_fs needs to sometimes do in-kernel mounts for subset filesystems
     This is needed for the cleaner - the cleaner needs to hold a ref somehow.
15cg/ lafs_sync_inode is weird - why the lafs_checkpoint_start and update_cluster stuff??
15ch/ Review values of youth and checkpoint_youth and think about off-by-one issues.
15da/ Replace directory updates!!!!!
15db/ Decide how version string will be used.
15dc/ resolve table_size - it should be stored in the segusage file and validated based on device geometry.
15ea/ rollforward should recognise VerifyDevNext{,2} to allow next cluster on same device to verify previous.
15eb/ When multiple devices and lots to do and plenty of free space, allow multiple segments, one per device, to be open at once, and possibly be writing multiple clusters at once using VerifyDevNext2
15ec/ Implement i_version tracking. This should be a 64bit number that appears to change every time the file changes. We only need a new number when someone looks at the value with getattr.
     We could simply use mtime with the sub-millisecond part being a counter of times that getattr sees a change in the same millisecond. However as mtime can go backwards we might get i_version going backwards, which is awkward. I wonder if I care.
     Otherwise, leave for an inode extension later.
16/ Update locking.doc
17/ cluster_flush calls lafs_cluster_allocate calls lafs_add_block_address calls lafs_iolock_written. How do we know that won't block on cluster_flush?
18/ See if per-fs shrinker is available yet and consider it for index blocks.
19/ Review WritePhase and make sure it is used properly.
20/ Review places where we update blocks and be sure they are not in writeout or in a different phase.
21/ Review and document all lru uses (locking.doc) and make sure they are all locked properly.
22/ Check possible failures:
     - thread allocation
     - memory allocation
     - reading critical metadata
     ...
23/ Rebase on 2.6.latest. Done for .38
24/ load/dirty block0 before dirtying any other block in depth=0 file, else we might lose block0
25/ use kmem_cache for
     datablock
     indexblock - probably a mempool because we cannot allow failure when splitting an index block.
     skippoint (mempool?)
     segsum - mempool??
     others?
26/ Review seg addressing code for 2-D geometries.
27/ Allow ranges of holes in pending_addr so partial truncate can be more efficient.
28/ Make sure youth blocks are always referenced properly.
29/ Make sure new segments are referenced properly. I think there might be some double referencing.
30/ Decide when to use VerifyNULL or VerifyNext2 31/ Implement non-logged files DONE 32a/ Store access time in a file 32b/ Make it a non-logged file 32c/ Avoid writing out dirty atime file blocks when not necessary. i.e. keep the page clean and active, and trigger 'write' on release_page. 33/ Support quota : group / user / tree 34/ handle subordinate filesystems: ss[]->rootdir needs to be array or list lafs_iget_fs needs to understand this 35/ review snapshots: - peer lists and cleaning - how to create - failure modes - how to destroy 36/ review roll-forward DONE 36a/ make sure files with nlink == 0 are handled well DONE 36b/ sanity check before trusting clusters DONE 36c/ handle miniblocks which create new inodes. DONE 36d/ Handle DescHole in roll_block DONE 36e/ When dirtying a block in roll_block, maybe use writeback rather than just iolock, for consistency... DONE 36f/ What to do if table becomes full when add_block_address in roll_block ?? DONE 36g/ Write roll_mini for directories. DONE 36h/ In roll_one, use the cluster counting code to find block number and make sure we don't exceed the segment. DONE 36i/ add more general error checking to lafs_mount - lafs_iget orphans and segsum. Check type is correct. errors from lafs_count_orphans or lafs_add_orphans. alloc_page failure for chead - maybe allocate something bigger?? 37/ Configure index block hash_table at run time base on mem size?? 38/ striped layout review everything needed for safe RAID5 39/ How to handle all different IO errors 40/ Guard against data corruption at every level. 41/ Add checksums on index blocks and dir blocks and Inodes and ??? 42/ Store duplicates of some blocks. At least index and inode. 43/ Handle writepage on mem-mapped page, adding new credits or unmapping. Make sure ->page_mkwrite sets up credits properly 44/ Examine created filesystem and make sure everything looks good. 
DONE 45/ mkfs.lafs
46/ fsck.lafs
47/ Write good documentation
48/ Review all code, improve all comments, remove all bugs.
49/ measure performance
50/ Support O_DIRECT
51/ Check support for multiple devices
     - add a device to a live array
     - remove a device from a live array
DONE 52/ NFS export
53/ 'overlay' support
     So I mount one device read-only and another device writable which gets all the updates. metadata on first device not updated.
54/ cluster support - is this possible?
55/ is any useful variant of reflink possible?
56/ Review roll-forward completely.
57/ learn about FS_HAS_SUBTYPE and document it.
     This is for fuse in particular so users can know the real type
58/ Consider embedding symlinks and device files in directory.
     Need owner/group/perm for device file, but not for symlink. Can we create unique inode numbers? hard links for dev-files would be problematic.
     What do we gain? Maybe something for short symlinks. 40 seems a good length to get 70% of symlinks.
59/ Fix NeedFlush handling so we don't drop-then-retake a mutex as that isn't sensible.
60/ Introduce some fs state recording that fsck is needed and possibly identifying what sort of fsck.
61/ Try to make the inode struct smaller - maybe move some of the fs metadata into a separately-allocated struct.
62/ System/trusted extended attributes:
     fileset max size
     directory hash/seed
63/ user extended attributes.
64/ wonder if index blocks can be flushed out by memory pressure somehow.
     e.g. if a data block is written by reclaim, flag the index block. When a flagged index block has no children, it is incorporated and written. ??
65/ review why lafs_allocated_block needs the new_parent label. Should not lafs_incorporate leave all parents dirty?
     Maybe it is just the need for B_Realloc - so maybe lafs_incorporate should leave the new block either realloc or dirty rather than lafs_allocated_block doing it?
     See also 15ad above.
66/ Delay writeout of directory updates until an fsync.
If a checkpoint happens first, discard the updates (and fsync waits for checkpoint to complete). If a cross-directory rename happens care is needed: either flush updates first or ensure that a flush does happen before the cross-directory update is flushed. Note that if the target of a rename is a directory, it must also be fully flushed before the rename can proceed. 26June2010 Investigating 5a Normal sequence is to surrender UnincCredit, then to clear Dirty, then to write. If anyone re-dirties after Dirty is clear, they will naturally have to add an UnincCredit having reserved space first. However it seems that the Cleaner gets in the way as the block in question has just previously been cleaned, which consumed the UnincCredit Do we need ReallocUnincCredit?? I hope not. We generally need a way to say "I might want to write to this" so cleaner doesn't write it early. For index blocks that is pincnt. For data it is 'PinPending'. This keeps index blocks off clean_leafs until they are ready, but not data blocks. And in any case, TypeSegmentMap blocks don't get PinPending as they get written *after* the checkpoint. That is a rather ugly exception. Maybe we make their different handling more explicit. We put them on a separate list unpinned so the rest of the checkpoint can complete. Then we flush that list? Then PinPending keeps them off the clean_leafs list. So to clarify the plan: If a block is already Pinned to this phase, we can "clean" it by marking it Dirty rather than Realloc. This is appropriate for blocks that are likely to change soon (as blocks written to the cleaner segment are not likely to change soon). For data blocks we take "PinPending" to say "might change soon". For index blocks ... we don't know if it is pinned by Realloc or Dirty or PinPending children. So we set Realloc and wait for any children to be unpinned for whatever reason. If it is only pinned by Realloc blocks, it will end up on clean_leafs and be processed to the cleaner segment. 
If it is pinned by anything else it will be found by the checkpoint and processed to the new-data segment. So Index blocks always get Realloc, PinPending blocks get Dirty, Other data blocks get Realloc. Good. Must review PinPending usage... always set, then maybe-dirty inside checkpoint lock. In cases of unlocked usage (inode map) we don't clear PinPending until checkpoint so it has longer exposure to Realloc->Dirty. It is likely to be changing though, so not a big cost. Even good. Could make the distinction later. PinPending blocks don't go on clean_leafs. So if they are still realloc at the checkpoint, we Realloc to the new-data segment. This has the same net effect but is arguably cleaner. It means that if a realloc block gets pinpending set, it immediately stops being a clean leaf and so is safe. So: just keep PinPending blocks off clean_leafs. Keep them on phase_leafs. However there is no mechanism for moving things from phase_leafs to clean_leafs. So maybe they stay on clean_leafs, but when the cleaner gets to them, it dirties them and drops them.... that would work. So; if cleaner finds a block (on clean_leafs during cleaner-flush) which is Dirty or PinPending, it makes sure it is Dirty and drops it for phase_leafs to pick up. BUT: Does this work for TypeSegmentMap blocks? They aren't PinPending. We could treat them specially in the cleaner. Or we could set PinPending and pin them to the phase, but treat them differently in checkpoint. If we gathered them onto a separate list, then flush the list after the phase had changed, it might be quite neat. No more getting writepages to do our work for us. They would need to be re-pinned to the next phase, then written out. Or just unpinned, and let seg_inc re-pin as appropriate... except that seg_inc is too later to pin. It dirties. We need to pin when we get SegRef. We currently reserve but we don't pin. We really do need to phase_flip these segmentmap blocks. 
But that requires getting extra credits, and Pinning everything if new credits are not available. And we don't really have a good list of 'everything' that depends on a segment. But seeing the space_alloc never fails for these... So Pin them, and flip them with AccountSpace

So:
 - split out common 'flip' code
 - add 'flip' for data blocks
 - create list of accounting blocks and flip accounting file blocks onto that list during checkpoint
   Flush should write that list, not the files.
 - Get cleaner to ignore pinpending blocks, marking them dirty.
 - pin segusage blocks while ref on them is held.
 - writepage no longer needs special case for TypeSegmentMap, just PinPending
 - lafs_prealloc just tests PinPending

[[aside: quota files seem to be handled like segmentmap files. Is that right?? We only track usage of data blocks based on various 'owners' of the file. We need to know if a block was written in one phase or the next, and only count blocks written/allocated in the one. Data blocks can slip into 'this' phase quite late - any time before the parent is finally incorporated. So we don't write quota blocks until checkpoint is done. So yes, they are like SegmentMap ]]

segsums....
If there are hundreds of snapshots, then a block being cleaned (whether to cleaner segment or new-data segment) could affect hundreds of segment usage counters. That would be clumsy to work with. Every block in the free table would need to hold references to hundreds of blocks. This is do-able and might not be a big waste of space, but is still clumsy.
I could change the arrangement for accounting per-snapshot usage by having a limited number of snapshots and having all the counters for one segment in the one block. So 1024byte block could hold 512 counters (youth plus base plus 510 snapshots). Half that if I go to 4byte counters. In more common case of 32 snapshots, could fit counters for 8 segments in a block. This means using space/io for all possible snapshots rather than all active snapshots.
It would also mean having a fairly fixed upper limit. I wonder what NILFS does.... Worry about this later.

Still trying to get pinning of SegmentMap blocks right. Normally we need a phase-lock when pinning a data block so that we don't lose the pinning before we dirty. But as we phase_flip these it doesn't matter... So just add that to the test??

28June2010
Reflecting on 5c - dirty_inode might find InoIdx pre-allocated but datablock not, and doesn't cope.
We either prealloc both, which seems clumsy, or always defer to InoIdx if it is present and pinned. lafs_prealloc does both Index and Data blocks for inode. But Data could lose at writeout while index will replenish at phase_flip, so maybe not a good idea. If lafs_allocate_cluster finds a Dirty InoIdx it will copy the Dirty credits across to the data block (on non-cleaning segments) so the Data block doesn't need to have credits.

dirty_inode gets called:
   {__,}mark_inode_dirty{,_sync}
   inode_{inc,dec}_link_count
   [[various quota ops]]
   inode_setattr
   touch_atime
   file_accessed
   file_update_time
   generic_file_...write
   do_wp_page

updates through inode_setattr go to lafs_setattr so the data block will be pinpending and the checkpoint lock will be held.
updates through inode_*_link_count happen in filesystem and the inode data block is PinPending, or a block in the file is pinned and will be dirty, so it will get written.
updates through touch_atime or file_update_time are unexpected and cannot be prepared for. file_update_time changes will be caught by normal file writeout. atime changes will be lost until we get the atime file working.

So: dirty_inode cannot change the block as it might be in writeout, and it cannot lock anything as it might be in touch_atime which shouldn't block and cannot fail. So just set I_Dirty and use that to flush inode to db at writeout. Any changes which must be in the next phase will come via setattr and so will wait for incompatible changes to be written out.
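A minimal userspace sketch of that plan, in C, may make the ordering clearer. Everything named sk_* is hypothetical and is not LaFS code: SK_I_DIRTY stands in for I_Dirty, sk_dblock stands in for the inode's data block, and only atime/mtime are modelled. It assumes writeout already holds the block, so the copy cannot race with IO.

```c
/* Sketch only: a userspace model of "set I_Dirty now, copy at writeout". */
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define SK_I_DIRTY 0x1u        /* hypothetical bit, stands in for I_Dirty */

struct sk_inode {
    atomic_uint flags;         /* per-inode state bits */
    uint64_t atime;            /* in-memory values, may change at any time */
    uint64_t mtime;
};

struct sk_dblock {
    uint64_t atime;            /* image of the inode that gets written */
    uint64_t mtime;
    int in_writeout;           /* nonzero while the block is being written */
};

/* Called from paths such as touch_atime: it must not block, must not fail
 * and must not touch the data block, so it only records the fact. */
void sk_dirty_inode(struct sk_inode *ino)
{
    atomic_fetch_or(&ino->flags, SK_I_DIRTY);
}

/* Called when the inode's data block is about to be added to a cluster.
 * Returns 1 if the in-memory fields were copied into the block, else 0. */
int sk_flush_inode_to_db(struct sk_inode *ino, struct sk_dblock *db)
{
    unsigned int old;

    assert(!db->in_writeout);  /* assumed: caller holds the block */
    old = atomic_fetch_and(&ino->flags, ~SK_I_DIRTY);
    if (!(old & SK_I_DIRTY))
        return 0;              /* nothing recorded since the last flush */
    db->atime = ino->atime;
    db->mtime = ino->mtime;
    return 1;
}
```

The flag is cleared before the copy, so a dirty_inode call that arrives during the copy sets the flag again and the next writeout picks the change up.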
Reflecting on 7c - cluster_flush might find ->my_inode is NULL.
my_inode is set in:
   lafs_import_inode - iget and mount-time stuff
   lafs_inode_dblock
my_inode is cleared:
   When I_Destroyed is set and the last ref on the block is dropped
   When inode_map_new_prepare claims an inodeblock
So we could easily not have a my_inode - e.g. just cleaning the data block. ->my_inode cannot disappear while we hold the block, so a test is safe.

----------------------------------------------
Space reservation and file-system-full conditions.

Space is needed for everything we write.
Some things we can reject if the fs is too full
Some things we can delay when space is tight
Some things we need to write in order to free up space.
Others absolutely must be written so we need to always have a reserve.

The things that must be written are
 - cluster header - which we never allocate
 - some seg-usage and youth blocks
 - and quota blocks
These continually have credit attached - it is a bug if there are not enough. (We hit this bug)

Things that we need to write to free up space are any block - data or index - that the cleaner finds.

Things that we can delay, but not fail, are any change to a block that has already been written or allocated.

When space is needed it can come from one of three places.
 - the remainder of the current main segment
 - the remainder of the current cleaner segment
 - a new segment.

Only Realloc blocks can go to the cleaner segment, so the 'must write' blocks cannot go there, so unused + main must have enough space for all those. Realloc blocks can go anywhere - we don't need a cleaner segment if things are too tight.

When we run out of space there are several things we can do to get more:
 - incorporate index blocks. This tends to free up uninc-credits which are normally over-allocated for safety.
 - cluster_allocate/cluster_flush so more blocks get allocated and so more can be incorporated. See above. This is probably most helpful for data blocks.
 - clean several segments into whole cleaner segments or into the main segment.
Much of this happens by triggering a snapshot, however we should only do that when we have full cleaner-segments (or zero cleaner segments).

When cleaning we don't want to over-clean. i.e. we don't want to commit any blocks from a second segment if that will stop us from committing blocks from the first segment. Otherwise we might use one cleaning segment up by making 4 half-clean. This doesn't help.

So: we reserve multiple segments for the cleaner, possibly zero. We clean up to that many segments at a time, though if that many is zero, we clean one segment at a time.

lafs_cluster_allocate only succeeds if there was room in an allocated segment. If allocating a new segment fails, the cluster_allocate must fail. This will push extra cleaning into the main segment where allocations must not fail.

The last 3(?) [adjusted for number of snapshots] segments can only be allocated to the main segment, and this space can only be used for cleaning.

Once the "free_space - allocated_space" drops below one segment, we force a checkpoint. This should free up at least one segment.

We need some point at which we stop cleaning because the chance of finding something to clean is too low. At that point all 'new' requests definitely become failures. They might do earlier too.
Possibly at some point we start discounting youth from new usage scores so that the list becomes sorted by usage.

Need:
   cut-off point for free_seg where we don't allow cleaner to use segments
       3? 4?
   event when we start using fixed '0x8000' youth for new segment scores.
       Maybe when we clean a segment with usage gap below 16 or 1/128
   event when we stop doing that.
       Maybe when free_segs cross some number - 8?
   point when alloc failure for NewSpace becomes ENOSPC
       same as above?
   point when we don't bother cleaning
       no cleaner segments can be allocated, and checkpoint did not increase number of clean segments (used as many as freed).
Clear this state when something is deleted.

Allocations come out of free_blocks which does not include those segments that have been promised to the cleaner.

CleanSpace and AccountSpace cannot fail. We *know* not to ask for too many - cleaner knows when to stop.
ReleaseSpace fails (to be retried) if available is below a threshold, providing the cleaner hasn't been stopped.
NewSpace fails if below a somewhat higher threshold. If we have entered emergency cleaning mode, these requests fail -ENOSPC, else -EAGAIN.

Possibly limit some 'cleaner' segments to data only??

So: work items.
 - change CleanSpace to never fail, but cluster_allocate new_segment can for cleaner segment. This is propagated through lafs_cluster_alloc
 - cleaner pre-allocates cleaner segments (for new_segment to use) and only cleans that many segments at a time.
 - introduce emergency cleaning mode which causes ENOSPC to be returned and ignores 'youth' on score.
 - pause cleaner when we are so short of space that there is no point trying until something is deleted.

30june2010
notes on current issue with checkpoint misbehaving and running out of segments.

1/ don't want to cluster-flush too early. Ideally wait until segment is full, but we currently hold writeback on everything so we cannot delay indefinitely.
2/ row goes negative!! let's see...
   seg_remainder doesn't change the seg, but just returns the remaining rows times the width
   seg_step moves nxt_* to *, stepping to the next ... row? save current as 'st_*'
   seg_setsize - allocate space in the segment for 'size' blocks plus a bit to round off to a whole number of table/rows
       nxt_table nxt_row
   seg_setpos initialises the seg to a location and makes it empty, st_ and nxt_ are the same
   seg_next reports address of next block, and moves forward.
   seg_addr simply reports address of next block

So the sequence should be:
   seg_setpos to initialise
   seg_remainder as much as you want
   seg_setsize when we start a cluster
   seg_next up to seg_remainder times
   seg_step to go to next cluster (when not seg_setpos). or maybe just before seg_setpos

Need cluster_reset to be called after new_segment, or after we flush a cluster but don't need a new_segment.

I think I'm cleaning too early ... I am even cleaning the current main segment!!!!

OK, I got rid of the worst bugs. Now it just keeps cleaning the same blocks in the current segment over and over.
2 problems I see
1/ it cleans a segment that it should not touch
   We need to avoid cleaner segment increasing the checkpoint youth number.
2/ it has 6 free segments and doesn't use them
   clean_reserved is 3 segments, < 4, so free_block <= allocated + watermark
   watermark is 4 segs, so free < 4.
   So we have 3 allocated to cleaner, 3 in reserve and so nothing much to clean!

The heuristic for returning ENOSPC is not working. Need something more directly related to what is happening. Maybe if cleaning doesn't actually increase free space.

!Need to leave segments in the table until we have finished writing to them, so they cannot be cleanable. - DONE

WAIT - problem. If cleaner segment is part-used, the alloc_cleaner_segs doesn't count that. Bad?

When nearly full we keep checkpointing even though it cannot help. Need clearer rules on when there is any point pushing forward. Need to know when to fail requests.

02 july 2010
I am wasting lots of space creating snapshots that don't serve any purpose.
The reasons for creating a snapshot are:
 - turn clean segments into free segments
 - reduce size of required roll-forward
 - possibly flush all inode updates for 'sync'.

We currently force one when newblocks > max_newblocks
max is 1000, newblocks is never reset! probably make that a number of segments.
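The last point, counting segments rather than blocks and remembering to reset the count, can be sketched in C as below. This is an illustration under assumptions, not the LaFS implementation: the sk_* names and the max_new_segs tunable are hypothetical, and the comparison uses >= where the current block-based test uses >.

```c
/* Sketch only: force a checkpoint after a bounded number of new segments. */
#include <assert.h>

struct sk_ckpt {
    unsigned int new_segs;      /* segments opened since the last checkpoint */
    unsigned int max_new_segs;  /* hypothetical tunable, must be non-zero */
};

/* Record that the writer has moved on to a fresh segment. */
void sk_note_new_segment(struct sk_ckpt *c)
{
    c->new_segs++;
}

/* Nonzero once enough segments have been opened that a checkpoint is
 * worthwhile, which also bounds how much roll-forward has to read. */
int sk_checkpoint_wanted(const struct sk_ckpt *c)
{
    assert(c->max_new_segs > 0);
    return c->new_segs >= c->max_new_segs;
}

/* Called as the checkpoint starts: the reset that newblocks never got. */
void sk_checkpoint_started(struct sk_ckpt *c)
{
    c->new_segs = 0;
}
```

Counting whole segments keeps the trigger tied to the unit that the first two reasons care about, since both free segments and roll-forward length are measured in segments.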
lafs_checkpoint_start is called
   when cleaner blocks, and space is available
   at shutdown
   on write_super if s_dirt
       __fsync_super before ->sync_fs
       freeze_bdev
       fsync_super
       fsync_bdev
       do_remount_sb
       generic_shutdown_super before put_super if s_dirt
       sync_supers if s_dirt
       do_sync
       file_sync !!! if s_dirt

I think I should move checkpoint_start to ->sync_fs

After testing
 - blocks remaining after truncate - one index and 1-4 data
 - truncate finds blocks being cleaned
   FIXED - move setting of I_Trunc
 - orphans aren't being cleaned up sometimes.
   Hacked by forcing the thread to run.
 - parent of index block has depth==1
   Don't reduce depth while dirty children. Probably don't want uninc either?
 - some sort of deadlock? lafs_cluster_update_commit_both has got the wc lock and wants to flush
   writepage also is flushed. Not sure what the blockage is.
   I think the writepage is the one in cluster_flush, and it is blocking
 - Async is keeping 16/0 pinned during shutdown

03July2010
Testing overnight with 250 runs produced:

 - blocked for more than 120 seconds
   Cleaner tries to get an inode that is being deleted and blocks, so inode_map_free is blocked waiting for checkpoint to finish - deadlock.
   Need to create a ->drop_inode which provides interlock with cleaner/iget
   But this is hard to get right. generic_forget_inode needs to write_inode_now and flush all changes out and then truncate the pages off so the inode will be empty and can be freed. But flushing needs the cleaner thread which can block on the inode lookup.
   Ahh.... I can abuse iget5_locked. If test sees I_WILL_FREE or similar, it fails and sets a flag. if the flag was set, then 'set' fails

 - block.c:504 DONE (I think).
   unlink/delete_commit dirties a block without credits
   It could have been just cleaned..
   It looks like it was in Writeback for the cleaner when unlink pinned and allocated it.... or maybe it was on a cluster (due to writepage) when it was pinned. Then cluster_flush cleared dirty ... but it should still have a Credit.
Maybe I should iolock the block ?? On reflection it wasn't cleaning, just tiny clusters of recent changes which were originally written as tiny checkpoints. Maybe lots of directory updates triggered the clusters. I guess writepage is being called to sync the directory??? Or maybe the checkpoint was pushed by s_dirt being set. So use PinPending and iolock to protect dir blocks from writepage. - dir.c:1266 DONE dir handle orphan find a block (74/0) which is not valid This can happen if orphan_release failed to reserve a block. We need to retry the release. - inode.c:615 index block and some data blocks still accounted to deleted file. No theory on this yet. Always one index block and a small number of data blocks. Maybe the index block looked dirty, but was then incorporated with something that was missed from the children list... Or maybe I_Trunc is cleared a bit early... Or trunc_next advanced too far?? or too soon ?? - segments.c:640 DONE prealloc in the cleaner finds all 2315 free blocks allocated. no clean reserved. Need to be able to fail CleanSpace requests when cleaner_reserve is all gone.?? or just slow down the cleaner to one segment per checkpoint when we are tight.. Hope that works. - super.c:699 async flag on 16/0 keeping block pinned Maybe clear Async flag during checkpoint. Cleaner won't need it No, just ensure to clear Async on all successful async calls. orphan file 8/0 has orphan reference keeping parent pinned [cfb64c90]8/0(1782)r1E:Valid,SegRef,PhysValid orphan(1) Orphan handling is failing to get a reservation to write out the orphan file block? Not convincing as there should be lots of space at unmount, and 'orphan sleeping' has become empty. 
- Show State orphan inode blocked by leaf index stuck in writeback: [cfb68460]331/0(NoPhysAddr)r2F:Index(1),Pinned,Phase1,Valid,Dirty,SegRef,CI,CN,CNI,UninCredit,EmptyIndex{0,0}[0] primary(1) leaf(1) Leaf1(5) [cfb28d20]331/336(NoPhysAddr)r2F:Index(1),Pinned,Phase1,Valid,Dirty,Writeback,Async,UninCredit,PrimaryRef{0,0}[0] async(1) cluster(1) wc[0][0] This is in the write-cluster waiting to be flushed 9July2010 Review B_Async. If a thread wants async something, it - sets B_Async - checks if it can have what it wants. + if not, fail + if so, clear B_Async and succeed If a thread releases something that might be requested Async, it doesn't clear Async, but wakes up *the*thread*. This applies to IOLock - iolock_block Writeback - writeback_donem iolock_written Valid - erase_dblock, wait_block inode I_* - iget / drop_inode orphan handler, cleaner, segscan - all in the cleaner thread. 107 runs, 2 hit 'Show State' with a blocked orphan inode. Two children, one EmptyIndex, one PrimaryRef, Async,Writeback Both NoPhysAddr Several runs blocked in cluster_flush or waiting for writeback. - first case: looks like cluster flush should run but doesn't. cluster_flush runs: checkpoint, cleaner, cluster_allocate when full, update, writepage, sync_page So we have no timeout or other flush. I guess if we are waiting for writeback, we need to trigger a cluster_flush. - other case - cluster_flush was called but is waiting for pending count to go down. Looks like cluster_reset shouldn't be changing pending_next New hang. Orphans not being processed: inode, because InoIdx is on leaf and checkpoint isn't pushing it along. dir block 0 is Dirty leaf Maybe we failed to get a mutex, and mutex_unlock doesn't wake us. 10July2010 Over night it looks *very* good. 
Have one infinite loop with 31770 repeats of ORPH: [cfbe0000]0/328(2326)r4E:Valid,Dirty,Async,UninCredit,Claimed,PhysValid,Orphan(0) async(1) delete_inode(1) orphan_list(1) drop(1) So either stuck in truncate_inode_pages, lafs_add_orphan, or inode_map_free lafs_add_orphan too short. tracing shows after truncate_inode_pages. must be blocked in inode_map_free - maybe use AccountSpace?? But why isn't the truncate progressing? Probably same reason: No ReleaseSpace available. Maybe we aren't cleaning because there is a free segment, and we aren't checkpointing because there aren't enough yet... Probably the cleaner has halted while CleanerBlocks - fix that. - 0/74 is a stuck orphan because 74/0 is a dirty leaf going nowhere.. Need a checkpoint to release the orphan? ditto for 0/331 - 331/0 XX/0 is InoID VFS: Busy inodes after unmount of hdb. Self-destruct in 5 seconds. Have a nice day... This was pinned: [ce5914f0]16/0(2)r8F:Pinned,Phase0,PinPending,Valid,C,CI,CN,CNI,UninCredit,PhysValid leaf(1) intable(6) release(1) [ce5914f0]16/0(2)r8F:Pinned,Phase0,PinPending,Valid,C,CI,CN,CNI,UninCredit,PhysValid leaf(1) intable(6) release(1) Leaf0(0) ------------[ cut here ]------------ kernel BUG at /home/neilb/work/nfsbrick/fs/module/super.c:698! Forgetting 0 0 724 != 7 (st->free.cnt after segdelete, close_segment, close_all) ------------[ cut here ]------------ WARNING: at /home/neilb/work/nfsbrick/fs/module/segments.c:844 lafs_check_seg_cn we called segdelete on something that was on the freelist. This happens when the final cluster starts a new segment. Need to improve the fix though. lafs_inode_handle_orphan can make progress without leaving anything async. Maybe we need a return status: -EAGAIN - try after async; -ENOMEM - try some time soon, hope memory will be better; 0 - we called orphan_release; anything else loops. - we allocate a segment in last checkpoint we don't take references properly.
- orphan handle spinning on: ORPH: [ce545f08]0/290(1663)r4E:Valid,Dirty,Async,UninCredit,Claimed,PhysValid,Orphan(0) async(1) delete_inode(1) orphan_list(1) drop(1) 26402 calls. stuck in delete_inode?? ? never-ending cleaning? Maybe just computer slow ?? 11July2010 - on plane to Prague. How can we safely access ->iblock? normally iolock, but how do we get iolock? - flush data to inode - cluster flush takes private_lock - private_lock is used to set to null. I guess we use private_lock to get a reference then iolock and revalidate but I can probably test for NULL at any time? though that can change under private_lock If we own a reference to a child with a parent, then we can use rcu_dereference to get a ref which might change 12july2010 ->write_inode is called by write_inode() called by __sync_single_inode to handle I_DIRTY_SYNC|I_DIRTY_DATASYNC after do_writepages Do we care? changes to addresses we already handle with checkpoints changes due to setattr we can handle directly if we want that just leaves mtime/ctime and atime. mtime/ctime calls ->dirty_inode as does atime So: setattr changes set I_Dirty so that when cluster_allocate happens all the changes get saved. when dirty_inode is called, we set I_Dirty but don't dirty the inode block. If anything happened to justify an inode write, it will be dirty anyway. If it isn't, this is just atime So on dirty_inode we check if atime has changed and if so we schedule a change to the atime file sync_inode should write an update for the inode if I_Dirty but sync_filesystems should not Simple. fsync calls ->fsync. We get that to write an inode update, but nothing else does. Possibly all directory updates could be chained onto a directory and only written when fsync is requested before a checkpoint. both sides of a rename ?? leave that for later. WritePhase - what is that all about? We must not change a block while it is being written to previous phase, else we corrupt causality.
But we probably don't want to change it anyway as that would mess up any checksum or duplication. So we want to ignore WritePhase - scrap it. Before changing a block, we must iolock_written - all dir updates - inode update in fsync - orphan file - segusage? - quotas? But what about regular data? If prepare_write finds a block in writeback, do I need to wait, or can I just mark it dirty in commit_write? If no checksum and no duplication applies, this should be fine. 16July2010 BUT e.g. dir operations are in particular phases. If the dirblock is pinned to the old phase, we need to flush it, then wait for io to complete. So we need lafs_phase_wait as well as iolock_written. This is already done by pin_dblock. I wonder if we need a way to accelerate pinned blocks that are being waited for - probably not, they should be done early. So we probably want to iolock after phase_wait in pin_dblock. Though dir.c pins early. I need to review all of this and get it right. So: - we aren't allowed to block much holding checkpoint_lock as checkpoint_start waits for that. However phase_wait will only block if a new checkpoint has started already, so there is no chance of phase_wait ever blocking checkpoint_start. So it is safe to call phase_wait under checkpoint_lock. phase_wait will wait until block is written, added back to the lru clean, then found and flipped... I wonder if that is good - it keeps parent from being a leaf, and so written, until child write has completed. We want to phase-flip a block as soon as it is allocated by cluster_flush. With directory blocks, i_mutex stops other changes, so an early iolock_written will leave the block clean and phase won't be an issue. With inode-map blocks.. we: set B_Pinned to ensure no-one writes except for phase change; do that after lock_written so it starts safe. once we have checkpoint_lock, wait for phase if needed. then lock_written again which should be instant but ensures that block is locked while we change it...
I think I want - refile to call phase flip if index is not dirty and is in wrong phase and has no pinned children in that phase. - Only clear PinPending if we have i_mutex or refcnt == 0 - before transaction: lock_written / set PinPending / unlock, then inside cluster_lock: lock_written / pin / change / dirty / unlock it will only wait for writeout if phase changed. so don't need phase_wait but want pre-pin then pin_dblock Transactions are: dir create/delete/update - DONE inode allocate/deallocate - on inode map DONE setattr DONE orphan set/change/discard Orphans are a little different as when we compact the file, the orphan file block 'owned' by the orphan block can change. As long as we keep them all PinPending it should be fine though. I think that every block in the orphan file will always be PinPending ??? OK - done most of that. Early phase_flip is awkward. We need an iolock to phase_flip, and we don't have one. The phase_flip could cause incorporation which cannot happen until the write completes. So I guess we leave it as it is. FIXME what about inode data blocks - cluster_allocate is removing PinPending after making them dirty from the index block.. If all free inode numbers are B_Claimed, don't think we allocate a new block... yes we do, as 'restarted' is local to caller. Also each device has a number of flags - new metadata can go here - new data can go here - clean data can go here - clean metadata can go here - non-logged segments allowed - priority clean - any segment can be cleaned - dev is shared and read-only - no state-block updates state block needs a uuid for an ro-filesystem that this is layered on. Is metadata an issue? We might want it on a faster device, but ditto for directories and for some data. So probably skip that. Have separate segment tables for: - can have new data - can have clean data but not new. (this often empty) Clean data can go to new-not-clean if nothing else new data can go to clean-not-new ?? if not sync??
Maybe call them 'prefer clean' and 'prefer new' I think we want: 'no sync new' - don't write new data, unless it is in big chunks and can wait for checkpoint to be 'synced' 'no write' - never write anything - this is readonly. used for removing a device from the fs. A 'no sync new' device can have single-block segments. This doesn't allow compression, but avoids any need to clean In this case we don't store youth and the segusage is 32 bits per segment. That means - for 1K block size - about 0.4% of the device is used for segusage. That feels high. For 4K, 1/1024 so a gigabyte per terabyte. Then limited to 29 snapshots plus base fs, and 2 bits to record bad blocks. Other segusage for 29 snaps is 1/million of space used. So we 'waste' 0.1% of device for no secondary cleaning. Can still do defrag though. clearing a snapshot on a 1TB device writes 1GB of data!! potentially. as does creating a snapshot. 18jul2010 If lafs were cluster enabled we would want multiple checkpoint clusters, one for each node. When a node crashes some node would need to find and roll-forward. For single node failure, it is enough to broadcast cluster address to all others. For whole-cluster failure, need to either list all in superblock or link from main write cluster. When writing to multiple devices we may want multiple write clusters active for new data. These all need to be findable from checkpoint cluster so linking sounds good. Having a single 'fork' link in cluster head might work but doesn't scale to a large cluster. It doesn't need to be committed to other nor does checkpoint end, so that should be ok. Could have a special group_head to list other clusters for roll forward. If we put fsnum first, a large value - 0xffffffff - could easily mean something else Or every cluster head could point to an alternate stream, and if we want many quickly, each simply points to another, so we create a chain across all writers. Another issue...
When we 'sync' we don't wait for blocks until after the checkpoint is started, and we know that will be driven through to CheckpointEnd which will commit and release everything. However 'fsync' doesn't have the same guarantee. The sync_page call will ensure the data has been written, but we don't know it is safe until the next header is written. So we need to push out the next cluster promptly. So if sync_page is called on a page in writeback, then we mark the cluster as synchronous. When a sync cluster completes, the next (or even next+1) clusters are flushed out promptly. Hopefully they won't be empty on a reasonably busy system, but it is OK if they are. If a block is writeback for the cleaner.. then as the cluster is VerifyNone, as soon as the write completes the block will be released. So: to clarify sync_page: This can be called when page is in writeback or locked. If locked there is nothing we can do except maybe unplug the read queue. If page is in writeback and block is dirty, then it is probably in a cluster queue and we should flush the cluster and the next. If page is in writeback and block is not dirty, but is writeback, just flush one cluster. But we don't want these cluster flushes to start while the previous is still outstanding else we stop new requests from being added. So as soon as the cluster can be flushed we flush, but no sooner. I guess we use FlushNeeded and make that be less hasty. 19June2010 superblocks.... We currently have a superblock for each device. I cannot see a good reason for that. We can just bdev_claim for 'this' filesystem. Rather we should have a number of anon superblocks, one for each fileset, then one for each snapshot. Do we use different fs types? probably yes lafs - main filesystem made from devices lafs_subset - subordinate fileset, given a path to fileset object can have 'create' option when given an empty directory. lafs_snap - snapshot - given a path to filesys and textname. 
Cannot create a snap of a subset, only of the whole filesystem Is it OK to mount either a snap of a subset or a subset of a snap? It probably is, so need to use the same filesystem type for both. Maybe lafs_sub or sublafs. Needs path to directory. can be given 'snap=foo'. No: a given filesystem may not exist in a snapshot. You need to mount the snapshot first, then the subset of the snapshot. So we have three types as above. All subsets as 'lafs_subset', whether they are subset of main or of snapshot. Should we be able to create a snapshot or subset without mounting it? It doesn't really seem necessary but might be elegant.. remount doesn't seem the right way to edit a filesystem as it forces some cache flushing. What do we want to edit? - add device, remove device - add/remove snapshot by name - add/remove subset? Not needed, just mkdir/rmdir and mount to convert empty dir to subset. - change cleaner settings?? Could have remount as an option. If that is a problem, find another option. While cleaning (which is always) we potentially need all superblocks available as we might need to load blocks in those filesystems to relocate them. Unfortunately each super needs to be in a global list so there is a cost in having them appear and disappear. I guess that is not a big deal. They are refcounted and will disappear cleanly when the count hits zero. So: DONE - change all prime_sb->blocksize refs to fs->blocksize DONE - create an anon sb for the main filesystem DONE - discard the device sbs, just bd_claim the devices and add to list - use lafs_subset for creating/mounting subsets. Changed s_fs_info to point to the TypeInodeFile for the super, but for root/snapshot that doesn't exist early enough to differentiate the super in sget. So we make an inode before the super exists and attach it after. Need to do all that get_new_inode does.
inode_stat.nr_inodes++ - just don't generic_forget the inode add to inode_in_use - seems pointless - just set i_list to something add to sb->s_inodes - if we don't it won't flush - maybe that is good? add to hash - don't want i_state == lock|new - only really needed if hashed. but there is lots of initialisation in alloc_inode that we cannot access!! Problem is that we need s_fs_info to uniquely identify the fs with something that can be set in the spinlock, so allocating an inode is out. And also to get to the filesystem metadata which is in the inode. I guess we allocate a little something that stores identifier and later inode. for lafs we use uuid; for subset we use just the inode; for snapshot we use fs and number 25July2010 superblocks: - sget gives us an active super_block. We need to attach to a vfsmnt using simple_set_mnt, or call deactivate_locked_super. - sget's set should call set_anon_super - kill_sb (called by deactivate_super) should then call kill_anon_super If we have a vfsmnt, we have an active reference, so we can atomic_inc s_active safely. So use this to allow snapshots and subsets to hold a ref on the prime_sb and thence on the 'fs'. 26July2010 - DONE need to set MS_ACTIVE somewhere!! - FIXME if an inode is being dropped when iget comes in, it gets confused and the inode appears to be deleted. We cannot really break the dblock <-> inode link until after write_inode_now, but there is no call-back before generic_detach_inode is complete. The last is write_inode which is only called if I_DIRTY_something. Maybe when writeback completes on an inode dblock, we should check if the inode is I_WILL_FREE and if so, we break the link... Or maybe when we find my_inode set we can check the block and if it isn't dirty or being deleted we break the link directly... That makes more sense. So... what is the deal with freeing inodes??? ->iblock is like a hashtable reference.
It is not refcounted It gets set under private_lock iblock is freed by memory pressure or lafs_release_index from destroy_inode when refcount of iblock is non-zero, ->dblock ref is counted, else it is not. dblock is set to NULL if I_Destroyed, or when dblock is discarded, (under lafs_hash_lock) and set to 'b' in lafs_iget and lafs_inode_dblock We can drop the dblock link as soon as iblock has no reference probably get clear_inode to break the link if possible, which it should be on 'forget_inode'. Then lafs_iget can wait on the bit_waitqueue. or maybe do clear_inode itself FIXME when we drop dblock we must clear iblock! as getiref iblock assumes dblock is not NULL. 28July2010 So: ->dblock and ->my_inode need to be clarified. Neither is a counted reference - the idea is that either can be freed and will destroy the pointer at the time so if the pointer is there, the object must be ... but we need locking for that. ->dblock is reasonably protected by private_lock, though if ->iblock exists we hold a ref of ->dblock so we can access it more safely. Need to check getiref_locked knows ->dblock exists when called on iblock and lafs_inode_fillblock yes, both safe! But ->my_inode needs locking too so the inode can safely disappear without having to wait for the data block to go. After all data blocks come in sets, and one shouldn't keep others with inodes. So something light-weight like rcu might work. We use call_rcu to free the inode and rcu_read_lock to access ->my_inode Yes, that will work. Occasionally we will want an igrab too, but not often. Should look into rcu for index hash table and ->iblock as well. Currently ->iblock is only cleared when the block is freed .. I guess that is fine... 31Jul2010 rcu protection of ->my_inode A/ orphan inodes - are they protected? B/ orphan blocks - are the inodes of those protected? Probably...
inodes are 'orphan' for two reasons 1/ a truncate is in progress 2/ there are no remaining links, so inode should be truncated/deleted on restart. The second precludes us from holding a refcount on any orphan inode, else it would never get deleted. So we must assert that an inode with I_Deleting or I_Trunc has an implied reference and so delete must be delayed... not quite. If we set I_Trunc but not I_Deleting, then we igrab the inode until I_Trunc is cleared. While we hold the igrab, I_Deleting cannot possibly be set as that is set when last ref is dropped. 01Aug2010 FIXME lafs_pin_dblock in lafs_dir_handle_orphan needed to be ASYNC. .. and in lafs_orphan_release Well... only iolock_written can be a problem, and our rules require that only phase-change writeout can set writeback. So the cleaner can never wait for writeout here. Maybe it can wait for a lock, and maybe we don't really need a lock, just 'wait_writeback'. 08Aug2010 So cleaner is in run_orphans, dir_handle_orphan pin_dblock iolock_written It is writeback waiting on 74/BIGNUM from file.c:329. So writepage tried to write a block in a directory .. but it is PinPending so that must have been set after writepage got it... lafs_dir_handle_orphan gets an async lock, then sets PinPending. If writepage is before that, it will have the lock and dir_handle will try later. If writepage is after it will block on the lock, or see PinPending and release the lock. So someone else must be clearing PinPending! - checkpoint clears and re-sets under the lock, so that is safe - dir.c clears under i_mutex dir_handle_orphan always holds i_mutex ... or does it. - refile drops when the last non-lru reference goes. - inode_map_new_abort clears for inode No, not that - just bad test on result of iolock_written_async ;-( Now have an interesting deadlock. rm in lafs_delete_inode in inode_map_free is waiting for the block to flush which requires the cleaner.
The cleaner thread in inode_handle_orphan is calling erase_dblock on the same inode which blocks while inode_map_free has it locked.... no, not same block - just waiting for writeout which requires cleaner. lafs_erase_dblock from inode_map_free must be async! pin_dblock in lafs_orphan_release must too.... no - only the setting of PinPending needs to be async or outside of cleaner, which it is. Ok, got that fixed. All seems happy again, time for a commit. 09Aug2010 14b/ What backing-dev to show the filesystem. backing-dev holds: congested state unplug function read-ahead info throughput measurements Much of that is for generic code to use. We need to: - provide an unplug function that unplugs all devices - provide a congested function which checks all devices, or for 'write' - at least the device we are writing to. How do we set the backing device? The 'struct address_space' points to one, as does struct super_block. set_anon_super establishes a null bdi, set_bdev_super gets it from the bdev->queue We need to bdi_init and bdi_register (if no error) our bdi. bdi_destroy calls unregister and reverses bdi_init or just bdi_setup_and_register but bdi_register_dev gives a better name - isn't this sick!!! Partly done ... but I'm hitting more bugs :-( -Checkpoint cannot complete because... Lots of dirty inodes that are orphans are not pinned!! I guess the InoIdx is ?? Most of them don't have InoIdx(?) Only '8' does. 8/0 is also an orphan and is on wc[0] It seems that this block keeps getting re-written and stays in Phase0. Is that because it is a data block with PinPending.. No, that works as long as it becomes un-dirty: we drop PinPending, refile, and set again It is being dirtied again during writeout for the checkpoint so it doesn't get to change phase when we lift PinPending. I guess we mustn't dirty it if it is in the old phase. -And twice inode 17 is deleted without B_Orphan being set! That is the only file that exists before we mount.
Problem was orphan_release instead of orphan_forget I wonder why it only affected 17... -at shutdown we drop an inode and try to invalidate pages, but root inode is still dirty - I wonder why. The dblock is in a different phase to the iblock. In checkpoint we wait until root iblock changes phase, but not root dblock! UP TO: I'm testing subordinate filesystems, which don't work yet. I need to create the root directory and inode map. Obviously I cannot record the inode map file in the inode map.... inode_map should ignore everything less than 16? 8? 2? Need to make sure creating with a given inode number works. Need to make sure auto-allocate inum is never less than 16. 11Aug2010 How to map from filesys inode to superblock? Need in lafs_iget_fs choose_free_inum - to get inode-1 ditto in inode_map_free lafs_put_super has something odd with i_sb Could do an sget search.. Or could just store it in the inode (but not in i_sb!!) inode already a bit large though. Do it for now, but make a note to trim the fs_md part of inode into a separate allocation. lafs_new_inode should take an 'sb' not a 'filesys'. In fact, get rid of filesys. It is MAP(i->i_sb->s_fs_info)->root. 15f - timestamps for roll-forward. The writeout can be much later, but logging the mtime is fairly boring ... we could log mtime in the group head, which might be cheap enough. How much precision is needed, and against what base? probably mtime of last checkpoint from superblock. That should be not more than 2048 seconds ago, so 16 bits gets us about 30msec... 14Aug2010 15l - decay youth info. Need to decay: youth_next and checkpoint_youth in 'struct fs' all blocks in youth files on storage all scores in seg-tracker. - not needed, they'll get updated in normal progress and being wrong for a while is no cost.
ensure correct youth is stored in lafs_free_get check little-endian conversion of all youth accesses checkpoint_youth only used by thread, so no locking needed youth_next protected by fs->lock 15m - share orphans and cleaning list_heads in datablock It certainly is possible to clean an orphan but it is very unlikely as it will have changed recently, or be changing soon. The cleaner could just dirty any B_Orphan it finds. But if orphan finds a block on the list, it must be careful... I guess when cleaner drops a cleaning ref, it should check if the block is an orphan, and re-queue if it is. 15o - async blocks just have an extra refcount. This could: - keep PinPending set - keep an index block pinned - will phase-flip - keep ->parent link but not get in the way of a checkpoint. Should we clear any that we find though? Normally async is only used by cleaner, orphan processing, or segscan So it should all be finished when we do a checkpoint. So if checkpoint, or release_page, finds an async block, drop it. 15r - further optimisations in cleaner to avoid lookups. We have fsnum,inum,blocknum and cluster seq number and trunc num. I want to introduce more async though. Currently it only loads one inode at a time. To do more, I need to mark inodes as 'done' when they are and always restart from the start of the cluster (only do one cluster at a time for now). So if we get all the way through a cluster with no 'EAGAIN' we finish with the cluster. 15y - when could a directory block become an orphan? - when deleting the last entry - we don't know if it can be fully deleted until we look in next block - when deleting an entry follows a chain back to the first block - when deleting the last entry in the block. So it could be an orphan if the entry found: - is at end of block - is first entry - is only entry or first entry is already deleted. 15Aug2010 looking at flushing etc when we run out of space. We often force a checkpoint when it won't do any good as nothing has been cleaned.
In fact we write lots of dead checkpoints to 0/0 until it is full, then move on, clean 0/0 and suddenly have space. We shouldn't do that. sync should be what pushes us forwards. Maybe that is fixed.. InoIdx blocks still cause confusion. Should they ever have credits? or do only the data blocks have those? Certainly they cannot have SegRef. And there is confusion in my mind whether data blocks can be pinned while the InoIdx block is - need to clarify that. 13Sep2010 - now, where was I... - I've just been dropping the use of SegRef on InoIdx blocks, where it makes no sense. - test run: block.c:660 - no credits available while dirtying an InoIdx block during orphan handling. lafs_reserve_block (under checkpoint lock) should have set credit. Only I just changed reserve_block to do that to the dblock instead - I wonder why. OK, I think I cleaned that up... - make_orphan is hanging in checkpoint_unlock_wait. So orphan_pin returned -EAGAIN so pin_dblock did too. So reserve_block did too, so prealloc or summary_alloc or seg_ref_block returned error. Problem is that we don't push a checkpoint when cleaner runs out of things to do. But we don't want to go back to pushing a checkpoint too often. Maybe the problem is that we only force the checkpoint when we have enough space to do new allocations, but we need to force it earlier if nothing new can be cleaned. Once we set EmergencyClean, lafs_reserve_block will stop returning EAGAIN for newspace, so we need to wake 'checkpoint_wait' then. But for ReleaseSpace we want to wake on every checkpoint... we probably do anyway. ...anyway, that is sorted now at commit 95b6b05e460 So: InoIdx blocks. - These never get SegRef as that is meaningless - done. - These can have credits. It possibly isn't necessary but it makes things easier. They are 'written' by transferring the credits to the data block, or discarding them.
- I think dblock and iblock can both be pinned The problem this caused was that the dblock might get processed as a leaf before iblock. We now have lafs_is_leaf which causes dblock not to be a leaf even if it is pinned, if the iblock is pinned to the same phase. lafs_phase_flip refiles the dblock so that it goes back on the leaf list as does lafs_refile when it unpins an iblock So lafs_pin_dblock doesn't need to pin the inode instead. OK, that is fixed. - commit f1c05293bfd Mon Sep 13 15:07:27 2010 +1000 15u - I don't need to get a segref there, but I need to have one from the original dirty block, so fix that up - commit Mon Sep 13 15:28:08 2010 +100 15v - What do we have? lafs_dirty_dblock: set Dirty, clear Credit clear NCredit set Uninc, clear Icredit clear NICredit lafs_dirty_iblock: set dirty, clear credit test uninc, clear ICredit, set Unincredit - not essential mark_cleaning: test realloc, / alloc / set realloc test dirty / clear realloc/ set credit set uninc clear icredit cleaner_flush: set dirty, clear realloc, clear credit test dirty, clear realloc set credit flush_data_to_inode: lafs_cluster_allocate - there is some odd code there!! flip_phase lafs_allocated_block all rather different really. Just do some tiny tidyup in lafs_cluster_allocate when dirtying dblock 15w/ Space used by cluster updates?? It is all fine - just some confusion of function names. 15z/ logging symlink creation. Do I need to log the content? It needs to be safe on a dir sync, and you cannot sync the symlink itself. So I guess we queue the block for writeout so it will go with the dir update. Yes, that works: Mon Sep 13 17:33:54 2010 +100 15ab/ already did that in commit f90959e6f492b6 15ac/ How can we trigger write-out of dirty index blocks which have no pin-count, thus allowing them to be freed after the write completes? A checkpoint could do it, but that would write out index blocks that cannot be freed too. A checkpoint would only be good after lots of data pages had been written.
We could just wait and let other processes kick in.. I don't think we need to do anything. lafs_shrinker doesn't really know how tight memory is, and periodic checkpoint will free up any memory that we are pinning. .... but something is needed. We need some trigger to write dirty index blocks Maybe: - a timeout on checkpoints - every dirty_expire_interval - but that isn't exported. DONE THAT. Not sure this is a complete solution. I might want to incorp/flush index blocks when they have no dirty children, but I'm not sure about that. 14sep2010 15ad - lafs_add_block_address call from lafs_phase_flip - do I handle failure correctly? failure happens when b2 is data block and uninc table is full so we called incorporate on the parent. This could split the parent which means the block could have been re-parented - it would have been in the child list and so found and fixed. lafs_allocated_block, when this happens, checks that the parent is dirty/realloc as appropriate. In this case, realloc isn't an issue, only dirty. lafs_incorporate must have made it dirty and it won't get written while it has these in-phase children, so all is happy. 15ae - refile race? Someone might set B_IOLock before removing from lru, so onlru is 0 and refcnt is elevated so it doesn't seem to be unused. But then whoever has the ref will refile again when dropping it and so the right thing will be done. But more generally, do we really want the lru etc to own a counted reference? If it didn't: - we would need to refile when removing from any list - we would need to get a ref when removing from list. uhmmm..
lafs_refile does: clear PinPending if refcnt is low; unpin if not PinPending, or dirty etc and data or refcnt is low; place on leaf list - if pinned etc - this can be earlier; drop parent link if refcnt is low, and not pinned etc; handle dblock issues. if lru was not refcounted, then the only things we might do when refcnt isn't zero are: unpin a dblock once it is not dirty; add to lru. But if we don't count lru, then we can lose the refcount on dblock Hmmm - we cannot leave things on the leaf list forever as they thus hold a reference and don't get freed. I think I want things on 'leafs' list to not hold a counted reference. Things *only* get removed while walking the list. InoIdx blocks hold a ref on the dblock both when counted and some other time. Possibly when pinned. This ensures they are held while the InoIdx is a real leaf. But: When we take that first ref, how do we know the dblock even exists? What is the lifetime of ->dblock? removed when page is released set by lafs_import_inode set by lafs_inode_dblock removed by clear_inode So if I don't hold a ref, I always need to be ready to call lafs_inode_dblock This is currently callers of getiref_locked - erase_dblock_locked ?? shouldn't need a lock - ihash_lookup - never on InoIdx - lafs_make_iblock - already have dblock So none of those really need lafs_inode_dblock What about when we set Pinned only really from set_phase ... messy. What about when we set ->parent grow index tree - not relevant ditto do_incorporate_* block_adopt Can be called on InoIdx from: lafs_make_iblock only!! 15sep2010 I have tidied lafs_refile up a lot but I need to make locking a lot cleaner. In particular I want a single lock I can take when the refcnt hits zero which will ensure no ref is taken until I have finished my cleanup. I suspect the inode private_lock is the one to use. I also need to clean up getiref_locked and getref_locked - having both is awkward. So: when are they called?
getref_locked:
   lafs_get_flushable - hold fs->lock
   first_in_seg - holds private_lock, but shouldn't need _locked as hold a ref through child.
      (getiref_locked)
   pin_all_children - hold private_lock
   find_better - private_lock

getdref_locked
   lafs_invalidate_page - to get a ref on each block to either erase or invalidate it
      presumably page is locked
   lafs_get_block - holds private_lock - plus once with only page_lock
   lafs_release_page - holds private_lock
      (getiref_locked on dblock) - no locking
   lafs_inode_dblock - private_lock of my_inode...
   lafs_delete_inode - private_lock of my_inode
   lafs_destroy_inode - ditto
   lafs_drop_inode - ditto

getiref_locked
   erase_dblock_locked - private_lock
   lafs_get_flushable - fs->lock
   ihash_lookup - lafs_hash_lock
   lafs_make_iblock - private_lock

So private_lock looks like a good choice.
Issues are:
 - what is the story with dblock on my_inode->private_lock
 - what is the lock ordering
 - what can refile negate that we need to be careful of. i.e. we want to keep things stable while refile does its tests, but what do we need to keep stable for others?
    + we break the parent link?? and so the siblings link
    + move things to freelist
    + can put_page
    + free dblock if not page_private

Lock ordering: private_lock, then fs->lock, then lafs_hash_lock.
So if we have to hold lafs_hash_lock, we increment refcnt, drop the lock, get/drop private_lock.

This is getting messy - I need something nice and clear.
So:
Index Blocks.
   If Pinned, either has references or is on a leaf list - possibly both.
   If no references and not pinned then not on leaf list, so can be on free list.
   Pinned can only be set when there are references, and can only be cleared under private_lock.
      This is violated by phase_flip, which badly reads refcnt.
   If refcnt is zero and not pinned, then can be moved to free_list.
   If on freelist and refcnt is zero under hash_lock, can be freed.
   So if lafs_get_flushable finds a block that is not pinned, then we can delete and ignore.
Someone else must hold a ref and will put it and it will refile. But that is pointless as it could immediately be cleared after we test Pinned.
lafs_get_flushable should get a reference before deleting from list. This ensures it won't be freed by lafs_shrinker, though it could be on the free list. If it is, then it isn't pinned so it is not interesting to us.

Data Blocks:
   These are removed from lru when freed - we just need the extra refcnt check after removing from list.
   No we don't - these are only pinned while refcnt or dirty and can only lose dirty while refcnt so they cannot disappear.

What is the story with my_inode->private_lock though? This is used to protect ->dblock accesses. I guess we need to get or hold the other lock .... look at what the race is - what else is checked when dblock is cleared?
dblock is cleared in refile for the dblock, or in clear_inode under the inode private lock.

So:
There are various places that hold a non-counted reference to a block. These include
 - index hash table             lafs_hash_lock
 - index free list              lafs_hash_lock
 - phase_leafs / clean_leafs    fs->lock                       only if pinned
 - inode->iblock                lafs_hash_lock
 - inode->dblock                inode->i_data.private_lock
Each of these is protected by its own lock, but not all the same lock.
When we turn one of these into a counted reference, we increment refcnt under the local lock, then after dropping that lock we take and drop b->inode->i_data.private_lock to ensure refile has finished. This must be done before changing/using the block in any way.

To free an index block it must first be removed from _leafs list. Then if the refcount is still zero it can be freed - or put on freelist and subsequently freed.
An InoIdx block - we need to hold hash_lock as well as private_lock to take a reference.

To free a data block we similarly need to recheck refcnt after removing from leaf list. If it is in an inode file we also take that inode's private_lock to clear dblock.
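
The rule above for turning a non-counted reference into a counted one can be sketched in user space. This is only a toy model, not the lafs code: `toy_lock`, `toy_block` and `toy_get_counted_ref` are hypothetical names, and the "locks" are single-threaded flags that merely record what was taken and in what order.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy, single-threaded stand-in for a spinlock (hypothetical): it only
 * records whether it is held, so misuse trips an assertion. */
struct toy_lock {
	bool held;
	unsigned acquisitions;
};

static void toy_acquire(struct toy_lock *l)
{
	assert(!l->held);
	l->held = true;
	l->acquisitions++;
}

static void toy_release(struct toy_lock *l)
{
	assert(l->held);
	l->held = false;
}

struct toy_block {
	int refcnt;
	/* stands in for b->inode->i_data.private_lock */
	struct toy_lock *private_lock;
};

/* 'local' is whichever lock guards the place the block was found in
 * (hash table, free list, leaf list, ->iblock, ->dblock). */
static struct toy_block *toy_get_counted_ref(struct toy_block *b,
					     struct toy_lock *local)
{
	/* Step 1: bump refcnt under the local lock. */
	toy_acquire(local);
	b->refcnt++;
	toy_release(local);

	/* Step 2: take and drop private_lock so that any refile which
	 * saw refcnt == 0 has finished before the block is used. */
	toy_acquire(b->private_lock);
	toy_release(b->private_lock);
	return b;
}
```

Because the local lock is dropped before private_lock is taken, the sketch never holds lafs_hash_lock or fs->lock while acquiring private_lock, which is the point of the "increment, drop, then get/drop" sequence.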
We use rcu to get the inode, then lock it, then clear dblock if refcnt is still zero.

17sep2010

review lafs_refile - are some of those tests redundant? - yes, one is gone.
So:
15ah - What about truncated blocks sitting on an uninc chain?
I don't see the problem. It will eventually get incorporated and do the right thing...

15ai - We don't want to touch the youth block during a checkpoint else it is awkward to write it out in a stable way.....
No, I don't think that is really a problem. It only gets written out in the tail of the checkpoint after the root. I guess it could then get a youth number for a segment that it has no count for, if the root is written at the end of one segment and the segusage/youth written at the start of the next.
But I think roll-forward is missing something. Blocks in the next phase need to be counted into segusage. Are they?
oh, yes - they are. - cleaned and index blocks are ignored so there might be some wasted space, but the important blocks picked up by the roll-forward are handled.

So.... A checkpoint could cover multiple segments. We need to be sure these each get a valid youth number. Probably most of them will, but we need a consistent approach to be sure. They don't need to be added to the segtracker, except the last needs to be active, and it already is.
So as we find a new segment we want to do much like what lafs_free_get does: youth_update.
But the data block - isn't that youthblk? When is that set? segsum_find sets it if ssnum == 0.

19sep2010

15ak - run the orphan file at mount time.
After roll-forward when we have a working filesystem, we need to read the orphan file, load each block mentioned, and register each as an orphan. This involves:
 - setting the orphan_slot
 - setting B_Orphan
 - lafs_add_orphan
Just like at the start of orphan_commit.
We also need to initialise nextfree and possibly 'reserved'.
But: can orphans be created during roll-forward? They certainly can. We currently hide that in a re-use of the orphan list..
But directory updates are possible too, and not handled.
I guess we should examine the file as soon as root is loaded and before roll-forward, as roll-forward cannot change the orphan file. Then after roll-forward, we read the original part of the file and set up any orphans that aren't yet.
So we want to read once to get the size. Then read again to process content up to that size.

15am - filesystem name. This is only used for identifying snapshots

01oct2010 - mkfs is done to an initial version of lafs-utils. !!!

So:
15am - filesystem name - used to identify snapshots
So the name is pointless in subordinate filesets. So I could just shrink the metadata. The primary metadata needs to be big enough to get a name easily though.

15aw.. When cleaning we have a separate credit bit 'B_Realloc' from 'B_Dirty'. But we have the same B_UnincCredit bit for both. Is that safe?
Processing the cleaner could absorb the UnincCredit while the block is reserved but not dirty. Then when it gets dirtied, there may not be enough credits to split.
We set Dirty from Credit, and use ICredit for UnincCredit. But when only Realloc (not dirty) we don't use those bits. We allocate fresh credits or set Dirty if that fails.

03Oct2010

Need lafs_iget_fs to work on other filesystems. And other snapshots?
We use it:
   in cleaner when parsing cluster head
   in orphan handler when loading orphan file or when rearranging it.
   in roll forward
Each of these might need to kern-mount the fs - so we need to hold the ref somewhere. Cleaner also needs to explore snapshots.
Don't want kern_mount - that is too heavy weight and includes a vfsmnt. Just split up lafs_get_subset and use sget etc. so we get an 'sb' that we need to hold. Similarly for snapshots.
Cleaner needs to consider all snapshots, so they all need to be mounted.
So snapshot 'sb's are referenced by cleaner, and de-reffed when cleaner stops.
Subset 'sb's can be attached to the parent inode and then only dropped when the inode goes...
only sb currently references inode. So maybe the first ref to an sb doesn't ref the inode but others do - is that possible? No, as we don't see them being dropped.
Every inode in the subset could ref the filesys inode. That would keep it active the right amount of time, but release/destroy could still be racy.
I guess cleaner/orphan/roll need to explicitly ref the fs.
cleaner already refs inode when B_Cleaning, so hold fs too.
B_Orphan seems to own an inode ref too.
So: lafs_iget_fs gets a ref on the inode and the sb.
    need lafs_iput_fs to drop both references
    B_Cleaning, B_Orphan, I_Pinned and I_Trunc all hold this double ref.
    cleaner holds refs on all snapshots
FIXME I probably need to hold inode/fs for B_Async too.
No. Async only refs the block, not the inode or fs. Something else would normally ref the inode - e.g. cleaner. When the inode is free, the page invalidation will notice the B_Async flag and release it.

So that is all done now, except I don't hold refs on snapshots in the cleaner yet.

11oct2010

DescHole - When is this used? directory etc don't need it.
 - a regular file might, but there is no API to punch a hole.... yet I guess.
 - So we just want to allocate these blocks to 0.

15oct2010 - happy birthday Daniel...

Looking at 36:
a/ files with nlink==0;
If we happen to find them, we hold a reference until all roll-forward is done, in case a name is found - it is important not to start deletion early.

18oct2010

36g - write roll_mini for directories.
We get a name, an inode number, and one of:
   LINK
   UNLINK
   REN_SOURCE
   REN_NEW_TARGET
   REN_OLD_TARGET
The REN_SOURCE is linked with a REN_*_TARGET which could be in a different directory, so we need to stash the SOURCE until the TARGET arrives.
We simply impose the implied change on the directory and update the link count in the target inode.
So:
   load the inode
   possibly record REN_SOURCE for later
   calls prepare/pin/commit as appropriate.
   Put the inode on orphan list if appropriate - needs care as we retarget orphan list.
   update inode link count.

(28Feb2011)
Just a refresh on the purpose of these updates.
1/ They allow us to fsync a directory without performing a full checkpoint. As directory blocks are not processed in roll-forward we need the update for data to be safe. As fsync of directories is rare in some common situations we could avoid actually writing these. Simply queue them internally and discard them on a checkpoint. If an fsync comes before the checkpoint, only then do we write them out.
If there are any cross-directory renames then the preceding updates in both directories need to be flushed before the cross-directory rename. It might be easier to always flush on a cross-directory rename.
2/ They ensure consistency of inode link-count wrt names in the filesystem, but as link count is only updated by these (or a checkpoint) there is no problem with delaying.

So: when replaying these we must update the directory content and the inode link count. It is OK to delay the write-out of these until an fsync, and not bother if a checkpoint happens.
So add that to the TODO list - item 66.

28feb2010 - roll forward directory updates ... I wonder if I got it right :-)(untested).

I don't seem to have easy-access notes about the various meanings of 'width' and 'stride'.

width: The number of independent devices across which the (virtual) device is placed. The normal goal is to write 'width' blocks on every single write. On a RAID4/5/6 this will avoid the need to pre-read for parity calculations, and it will keep all devices equally busy with writes.
The 'width' blocks probably aren't consecutive.

There are two different layouts - one with width*stride <= segment_size and one with width*stride > segment_size.

width*stride <= segment_size
This is a traditional striped layout like RAID0/4/5/6.
The 'stride' is the chunk size, so 'width*stride' is the stripe size, and segment_size must be a multiple of this.
In this case all addresses in a single segment are contiguous. We don't necessarily write them in order if we want to write less than one stripe.
segment_offset will normally be a multiple of width*stride though this isn't enforced as one could have a partition with a non-aligned start.

width*stride > segment_size
This implies a concatenated layout. If parity-redundancy is in use then the blocks which combine to form a stripe are 'stride' blocks apart. The benefit of this layout is that an extra drive can be added by simply zeroing it and joining it to the array - no re-stripe needed. This will make all stripes slightly larger so at first the space will not be available. As cleaning happens the space will gradually become available. This still requires restriping, but unlike a normal raid5 restripe, the space becomes available in small amounts immediately, and when there is no demand for more space, the re-striping (cleaning) can happen at a very low priority with no cost.
In this case the blocks in a segment are not contiguous. 'segment_size/width' are, then there is a large gap (in virtual address space) to the next chunk.
The segment_offset is an amount of space which is free at the start of each device. 0..segment_offset and stride..stride+segment_offset etc do not contain data and can be used for metadata.

When width > 1 it makes sense to replicate each state block across every device - as we want to write the whole stripe anyway. For now we only write and read the first two copies at the beginning, and the last two at the end...

Question: what do we want to do about metadata on flash devices? We really don't want a small number of locations to store the metadata, but a large number that we search through - possibly a binary search. These could be all at start/end or scattered throughout the device.
The latter would make it impossible to find efficiently - there is no way to create useful linkage without writing something else at start or end.
As many devices optimise for random writes where the FAT table would be, it makes sense to just put the metadata there and not at the end. We should allow one 'page' for each metadatum, which probably means 32K.
So we should allow all state blocks to be near the start.

01mar2011 - Autumn arrives.

Time to add handling of 'atime' and non-logged files.
The idea is to have a separate file for storing only 'atime'. This is separate from the inode file because the volatility of the data is very different and one of the principles of log-structured-fs is that differently volatile data should be kept separate.
This does mean that an inode lookup requires getting data from two files, but it is hoped that the 'atime' file will mostly be in cache as each block contains the atime for lots of different inodes.

The atime file contains 2 bytes for each inode, so with a block size of 4K, each block would hold info for 2048 inodes. 1 million inodes would require 2 megabytes.

The 16bits are treated as a positive floating point number which gets added to the atime stored in the inode.
The lower 5 bits are the exponent, the remaining 11 bits are mantissa. Though there is a little complexity in interpreting the exponent.
If the exponent is 0, the mantissa is used as milliseconds - so shift right 5 and multiply by 1000000 for nanoseconds. The smallest change that can be recorded is 1 millisecond, and values up to (2^11-1) milliseconds - or 2 seconds - can be stored.
If the exponent is 1 to 10, the mantissa has a '1' prepended as a new msb, and is shifted by the exponent-1 and then treated as milliseconds. This ranges up to 2^(12+9) milliseconds or about 35 minutes, where the granularity will be 2^9 millisecs or 0.5 seconds.
For exponents from 11 up to 31 we add the 1 msb and treat the number as seconds after shifting (e-11).
So at e==31, we shift a number that is up to 4095 by 20 to get nearly 2^32 seconds or 136 years. At this point the granularity is 2^20 seconds or 12 days.

So overall we can update the atime for 136 years without needing to update the inode, and can record differences of 1msec for the first couple of seconds, then gradually less granularity until we are down to one second an hour after the last change, and 4 hours a year later.

To convert a number of seconds to this format:
If >= 2048 seconds, we shift down until less than 4096 seconds counting the shift. We add 11 to that number to form exponent, and shift the resulting mantissa up 5, or with exponent, and mask out bit 16.
Otherwise we convert to milliseconds (divide nanoseconds by 1000000 and multiply seconds by 1000, and add). Then if < 2048, we shift up by 5 leaving a zero exponent and use that. Otherwise we shift down until < 4096 counting shifts, add 1 to the shift to form an exponent, and combine with mantissa as above.

So that is the format - how do we implement it?
We don't want to expose to user-space numbers that we cannot store. So any 'utimes' call that updates the inode directly can clear the value in the atime file. Only updates due to accesses go to the atime file.
We define a 'getattr' function which looks at the atime stored in the vfs inode and if it has changed we need to deal with it.
 - if the inode is still dirty we simply update the lafs inode and use the number as-is, clearing the atime entry
 - else we subtract the stored atime from the new atime. If this is negative or exceeds 136 years we mark the inode dirty and store it there. If we cannot mark the inode dirty for some reason we just store all 1s in the atime file.
The same operation is needed when dirty_inode is called to make sure atime updates get saved even when no getattr is called.

As we always need to be able to update the atime file, it needs to be permanently pinned whenever an inode is read in.
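
The 16-bit atime delta format and the conversion rules described above can be sketched as a pair of user-space helpers. This is an illustration only: the names `atime_decode_ns` and `atime_encode` and the `ATIME_*` constants are made up here, and returning all 1s for an out-of-range delta is an assumption borrowed from the "store all 1s" fallback mentioned above.

```c
#include <assert.h>
#include <stdint.h>

#define ATIME_EXP_BITS	5
#define ATIME_EXP_MASK	0x1fu
#define ATIME_MANT_MASK	0x7ffu		/* 11 stored mantissa bits */
#define ATIME_HIDDEN	0x800u		/* implied msb for e >= 1 */
#define ATIME_ALL_ONES	0xffffu		/* assumed "cannot store" marker */

/* Decode a 16-bit entry into a delta in nanoseconds. */
static uint64_t atime_decode_ns(uint16_t v)
{
	unsigned e = v & ATIME_EXP_MASK;
	uint64_t m = v >> ATIME_EXP_BITS;

	if (e == 0)			/* plain milliseconds */
		return m * 1000000ULL;
	m |= ATIME_HIDDEN;
	if (e <= 10)			/* scaled milliseconds */
		return (m << (e - 1)) * 1000000ULL;
	return (m << (e - 11)) * 1000000000ULL;	/* scaled seconds */
}

/* Encode a non-negative delta; precision is lost by truncation. */
static uint16_t atime_encode(uint64_t sec, uint32_t nsec)
{
	unsigned shift = 0;
	uint64_t m;

	assert(nsec < 1000000000u);
	if (sec >= 2048) {
		for (m = sec; m >= 4096; m >>= 1)
			shift++;
		if (shift > 20)		/* exponent would exceed 31 */
			return ATIME_ALL_ONES;
		return (uint16_t)(((m & ATIME_MANT_MASK) << ATIME_EXP_BITS)
				  | (shift + 11));
	}
	m = sec * 1000 + nsec / 1000000;
	if (m < 2048)
		return (uint16_t)(m << ATIME_EXP_BITS);
	for (; m >= 4096; m >>= 1)
		shift++;
	return (uint16_t)(((m & ATIME_MANT_MASK) << ATIME_EXP_BITS)
			  | (shift + 1));
}
```

Masking with `ATIME_MANT_MASK` before shifting is the same thing as "mask out bit 16" after shifting. Note that the largest representable delta, 4095 shifted by 20 seconds, encodes to the same all-1s pattern that this sketch uses as the overflow marker, so a decoder cannot tell the two apart.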
For non-logged files this should be cheap but we must do it anyway as the file might not be non-logged. So we need to keep a permanent reference to each block while the inode is loaded. That can keep it pinned.

We don't want updates to the atime file to be flushed in any great hurry, especially if it is a logged file. We would be quite happy to only write at 'unmount' and probably 'sync'. So we want to stop the pages from appearing dirty in the page cache (PAGECACHE_TAG_DIRTY), and the inode from appearing dirty (I_DIRTY). We can still keep them dirty in lafs metadata so if release_page is called we can schedule a write out then.

So some steps:
1/ load atime file at mount time - there is one for each filesystem. It has inum of 3 and type of TypeAccesstime (6). Also release it on unmount.
2/ loading an inode must take a ref to the block in the atime file if it exists. A new inode flag records if this has happened. Unless mounted noatime, we pin the block and reserve space.
3/ getattr and dirty_inode must resolve any issues with the atime. So lafs_inode probably needs an extra field to be able to check for changes.

Hmm.. this is getting confusing...
When atime is changed the only way we find out is by ->dirty_inode being called. But that is called when anything is changed. Filtering out whether or not we need to update the inode itself is awkward... maybe there is some context we can use.

->dirty_inode is called by mark_inode_dirty which is called:
 - by touch_atime, if something changed
 - file_update_time - at which time we also update iversion
 - setattr ... which has changed recently (2.3.37ish)
 - page_symlink
 - generic_file_direct_write - which increases size of inode
 - set_page_dirty_nobuffers

So either the inode is pinned, or it isn't. If it isn't, then this *must* be an atime-only update. If it is, then it could be anything, but in any case we update the atime directly.
So: dirty_inode should try to get dblock and check if it is pinned.
If it is pinned, then update the atime immediately and the offset in the atime file too. If not, just update the offset.

03mar2011

ARGggg... checkpin is interfering with unmount - it keeps an s_active count so unmount 'works' but doesn't release anything.
checkpin is needed to ensure that inodes remain safe while we are cleaning. Particularly, while the inode index block is pinned, we keep the inode and fs referenced as well.
I guess the theory is that they won't stay pinned for long - but they do. e.g. segusage blocks are permanently pinned.

We could have a rule about the prime filesystem always being mounted. Then we don't need refcounts, but kill off the cleaner before unmount... which we sort-of do..
All subordinate filesystems have references on the prime_sb so the prime_sb must be the last one to go. When it goes it kills everything off...
So we don't need checkpin to take a ref on the prime_sb.

There might still be an issue with files in subset filesystems being permanently pinned so they stay around longer than they should... need to check on that somehow.
The idea is that a quota file block is permanently pinned so it will keep the fs pinned. That in turn will keep everything else pinned...
Worry about that when we implement quotas FIXME

04mar2011

I really need to sort this out, and it isn't easy...
We really want to know when "all" filesystems have been unmounted so the block device(s) can be released and the cleaner stopped.
But we don't have a count for that. We could if that was all we counted - but that would mean that we only have a single struct super_block for all filesystems.

So that is what I have to do. A single super_block for all parts of the filesystem. I probably still need to allocate other dev numbers for stat->dev, but I don't need to use them internally. Maybe I even allocate superblocks... Yes - we need to use set_anon_super and kill_anon_super to allocate the numbers.
lafs_inode will need a pointer to the filesystem - we use that instead of the sb.

-------
Testing... bug at block.c:658. Block not B_Valid in lafs_dirty_iblock from lafs_allocate_block from cluster_flush.
Block is 74/0: InoIdx block of a newly created file I think. '74' was /f23, then /mnt/1/adir. We are creating file in that dir.
This is a depth=0 InoIdx block - i.e. the data is in the dblock, so there is no index info, so it kind-a makes sense for the index block to not be Valid.
yes - commit d268a566605bf006cf33c confirms that.
So why are we trying to dirty it?..
Maybe: We create a couple of directory entries, then flush and end up with an in-line data block. Then we add more, flush again and so try to dirty parent...
Where do we turn depth=0 inodes to depth=1??
 - erase_dblock_locked - don't want that
 - lafs_incorporate
So I guess the 'bug' is in error - it is OK to mark that invalid block as dirty.

04mar2011

So - back to the super_block reworking.
We want only one superblock. So we use the TypeInodeFile inodes a bit more to hold the details of different filesystems.
We need to store a unique 'dev' number in there: use set_anon_super/kill_anon_super on a local 'struct super_block' and copy s_dev in/out.
As we only have one sb, we can only have one fstype, so we cannot use the fstype to choose what to do.
 - if dev_name is a block device we try a normal mount
 - if dev_name is an Inode file, we perform a subset mount
 - if dev_name is a lafs dir and '-o snapshot=name', we mount that snapshot
 - if dev_name is a lafs dir in root with perm zero and '-o subset=MAXSIZE', create a subset filesystem.

 - lafs_iget needs an inode rather than a superblock
   ditto for lafs_new_inode, lafs_inode_inuse, inode_map_free, choose_free_inum, inode_map_new_prepare
 - lafs_iput_fs, lafs_igrab_fs, ino_from_sb
 - NFS filehandles need careful thought
   They are 'per-super-block', not 'per-vfsmnt' which might be better. We could change that but.....
   For non-snapshot files it is easy - just record two inodes, the fs and the target. For snapshots there is nothing that is really stable.
   Maybe we could have different superblocks for snapshots. The snapshot doesn't need the cleaner as it is read-only, though the cleaner can need the snapshot...
   So the cleaner might automagically mount a snapshot, but a snapshot will never invoke the cleaner or any other thread stuff.
   So I guess we want one superblock for the fs and one for each snapshot.
   The filehandle is then either inum+gen or inum+inum+gen where first inum must be TypeInodeFile

07mar2011

... though I could just put a snapshot number and partial timestamp in..

08mar2011

This isn't a new to-do list, it is a list of the main features that are still not implemented:
 - full 2D layout
    + at very least I don't pad with zeros yet
    + if stripe size were multiple of 3*3*5*7*2^N, then changing width might be manageable. e.g. stripe size: 40320 blocks.. But with megabyte chunksizes, we really want 32bit segsizes and 322560 block segments.
 - non-logged files - with interface to request access-time file
 - quotas
 - snapshots: particularly cleaning
 - error handling
 - metadata (inode/directory/etc) CRCs and duplication
 - fsck / debugfs

What would fsck do?
 - locate and validate device and state blocks.
 - locate and validate checkpoint cluster.
 - locate and validate filesystem root
 - roll forward to collect segusage and quota blocks.
 - load inode map, read inode file, validate each inode and make sure map is correct.
 - explore each file, following all indexing, count segusage for each segment and make sure segusage file is consistent.
 - check no block is allocated twice. This might require multiple passes, each time we examine a different collection of segments.
 - checking a file requires:
    - checking inode is consistent
    - checking index blocks are consistent with depth
    - checking index/extent blocks are sorted with no overlaps
    - checking block/iblock counts are correct.
 - checking all cluster headers in the current segment to ensure they look consistent and agree with file information. i.e. if cluster_header identifies a block, the block must live there, or later in the segment.
 - scan all directories looking for consistency of hash etc. Count links for all inodes. This might need to be multi-pass too. Could use a bitmap for single-link files, and table for others.

How to fix errors.
 - First must find segments which are not in use according to segusage file or according to block search. If there are none, require a new device be provided.
 - If anything looks incorrect, write corrected version to new segment. Then write out new segusage files.
In some cases we might need to search all write-clusters for missing blocks?? That could take a very long time!

What do I really want to do about CRCs and hashes?
It might be nice to store a hash for each block in the index block. But that wastes precious index-block space.
If I store a CRC together with address info in the block, then I could be fairly sure it is the right block.
So e.g. inodes store the inode number, Index blocks could hold inode+depth+address.
Last 8 bytes of each block could be a 4byte CRC and a 4byte identity. identity is XOR of fsinum inum blocknum generation - or a CRC of these.

Actually, we don't need to store the identity info - we just need to include it in the CRC. That either saves space, or allows more bits to be used for the CRC, which is probably the best use of bits for detecting errors.

Though it might be nice to store phys-addr in the CRC too, we cannot as

21mar2011

My short-term todo list is:
DONE - get 'lafs' to the stage where I can create an fs requiring roll-forward
DONE - use 'lafs' to create images for testing, so I don't need 'fred.safe' any more.
DONE - Make lots of 'layout' changes - see 15cb

02may2011
 - 'run' goes to completion, but segusage isn't updated in the final cluster and the number left over from before looks wrong.
DONE - 'ls -l' on a subset file gets confused.
 - fs created by 'lafs' has wrong Blocks and Inodes counts
 - we lose a ref to a segsum and sometimes put it too often.
   REFCNT 1 [ce0ffc48]0/182(2535)r0E:Valid,Claimed,PhysValid NP
   REFCNT 1 [ce055b9c]0/187(2535)r0E:Valid,Claimed,PhysValid NP
   REFCNT 1 [ce0445d8]0/182(2535)r0E:Valid,Claimed,PhysValid NP

03may2011

Once I have these bugs sorted out I want to make some format changes.

DONE - fs_metadata need a 'parent' link
       rename needs to be careful about what is updated! so does roll_mini
       lafs_get_parent needs some thought.
DONE - roll-forward should get exact mtime stamps, and ctime.
       So each data block must have an exact timestamp of when the change actually happened. Or the group_head has a timestamp for the most recent update to the file.
       As we use nanosecond timestamps (pointless though they are) we need 30 bits for the nanoseconds and at least 11 for the seconds. So 48 bits (6 bytes) is plenty. So include a 64bit timestamp in the cluster_head and 48bit number to subtract in the group_head.
       But saving 2 bytes per file isn't really worth it, and we may well lose it in padding. So just store a 64bit timestamp in the group_head.
DONE - use CRC in place of all checksums - lafs_calc_cluster_csum
DONE - state block flags for inconsistencies found
       If any inconsistency found, fsck is advised. For some it may be imperative.
       Things that can be wrong include:
        - generic read error
        - segusage negative
        - index block incoherent
        - dir block incoherent
        - link count negative
        - cluster header incoherent
       64 bits should be adequate and simple for this. Any unknown bit requires a full fsck.
DONE - 32bit segment size
       With 16bit at 4K blocks we are limited to 256Meg segments. 64Meg with 1k blocks. This takes about 1 second to write on a modern drive. On an array it will take even less time.
       24bits gives 16 to 64 gigabytes which is plenty. However 24bits is awkward to access. A 1K block holds 341 1/3. A 4K block holds 1365 1/3.
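
The segment-size figures just quoted can be re-derived with a few lines of arithmetic. The helper names below are hypothetical, and treating the per-block figures as "whole fixed-width entries that fit in one block" is an assumption about what is being packed.

```c
#include <assert.h>
#include <stdint.h>

/* Largest segment, in bytes, when the segment size is a count of
 * blocks held in a field 'bits' wide. */
static uint64_t max_segment_bytes(unsigned bits, uint32_t block_size)
{
	assert(bits < 64);
	return ((uint64_t)1 << bits) * block_size;
}

/* Whole entries of 'entry_bytes' each that fit in one block; the
 * fractional 1/3 entry quoted for 24bit values is dropped here. */
static uint32_t entries_per_block(uint32_t block_size, uint32_t entry_bytes)
{
	assert(entry_bytes > 0);
	return block_size / entry_bytes;
}
```

With these, a 16bit field gives 64Meg (1K blocks) to 256Meg (4K blocks) and a 24bit field gives 16 to 64 gigabytes. Three-byte entries fit 341 and 1365 to a block, against 256 and 1024 for four-byte entries.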
       But this wastes less space than 256 or 1024 and so causes less IO. But then we probably want to size segments to be very big. A few thousand segments should be OK, which is tens of blocks.
       I don't think the savings with 24bits are worth it, and I do think v.big segments could be useful, so let's go with 32bit segments. Youth is currently tuned to 16bits. Let's leave it there and maybe waste some space.

 - parallel new-data write clusters.
   I think it is sufficient to include a second 'next_addr' in the cluster_head - or maybe two. alt_next_addr[2].
   When a thread wants to start a new stream of clusters it allocates the segments then attaches to the next outgoing write cluster. Once that is written everything in the new cluster is safe.
   On a checkpoint every stream writes at least one checkpoint cluster and these are linked together through alt_next_addr. The 'next' cluster for each must be the checkpoint cluster and must carry linkage but unlike with first-link, there is no need to wait. The data is already safe as long as the state block isn't updated until every cluster_end block is written.
   So really, one is enough. I had thought 2 would enable quick fan-out but there is no real need for that.
   As 0 is a valid write-cluster address we use 'this_address' to signify that there is no alt-next.

   It is possible that a block of a file could be written to two different streams at different points in time between two checkpoints. We need to ensure that roll-forward gets these in the right order. 'seq' can be the same in two different streams so we cannot use that. timestamp could possibly be used, but as times can go backwards it is not ideal.

   NEW IDEA. Just use one stream of clusters. However it can bounce from one device to another easily. So two different threads can be building up two different write clusters at the same time as long as they synchronise at some point to pass addresses around.
   They also need some other Verify mode as VerifyNext or VerifyNext2 will destroy any parallelism. As the point of this is to write to multiple devices in parallel, maybe VerifyDevNext{,2} meaning the next header on the same device serves to verify this.

 - policies.
   This includes
      maximum number of segments written between checkpoints
      whether data can be cleaned to a particular device
      whether a device can receive new data
      whether metadata duplication is needed
      whether an RO device from a different array is allowed.
   Some of these are per-device policies. Some are per-array.
   The 'RO Device' thing is special. I think I want an alt_uuid. It works like this: You assemble the RO array when you mount a new filesystem identifying the old as a component. So that 'state' block on the new devices must identify the alt_uuid and state seq number.
   Do we want to record more info about which devices are in the array? Currently we just record how many. If we find enough with the right UUID/seq, they must be it.. what else would we want?

   For all the other policy statements it is probably simplest to allow a set of simple strings. e.g. "noclean", "nonew", "dup=2" "maxseg=5"
   devblock currently uses 146 bytes, so room for 878
   stateblock uses 112 plus some for snapshots, so much the same.

   We currently don't use 'version' and have no concrete plans. The vague idea is to allow lafs to *know* that it cannot mount the array, so any incompatible feature gets set. We could keep those in the policy sets. From that perspective there are 3 types of things.
    - if you don't understand, don't worry
    - if you don't understand, don't try to write
    - if you don't understand, you cannot even read.
   That last is really best avoided. We have version info elsewhere in the tree so that a new index style will simply make that block unreadable.
   So I think give the dev and state blocks a simple incrementing version number which applies to that block, and have "don't worry" and "don't write" policies distinguished by first letter. Capital is "if you don't understand, don't write". Lower is "if you don't understand, don't worry". These are space-separated strings - etc.

 - what about i_version? Include in timestamp?
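
The first-letter convention for policy strings proposed above (capital means "don't write if you don't understand", lower case means "don't worry") could be checked along these lines. This is a sketch under assumptions: the function names, the `mount_mode` enum and the table of understood policies are all hypothetical, and only the name before any '=' is compared.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <string.h>

enum mount_mode { MOUNT_RW, MOUNT_RO };

/* Hypothetical set of policy names this implementation understands. */
static const char *const known_policies[] = {
	"noclean", "nonew", "dup", "maxseg",
};

static bool policy_known(const char *tok, size_t len)
{
	const char *eq = memchr(tok, '=', len);
	size_t nlen = eq ? (size_t)(eq - tok) : len;
	size_t i;

	for (i = 0; i < sizeof(known_policies) / sizeof(known_policies[0]); i++)
		if (strlen(known_policies[i]) == nlen &&
		    strncmp(known_policies[i], tok, nlen) == 0)
			return true;
	return false;
}

/* Walk a space-separated policy string.  An unknown policy starting
 * with a capital letter forces read-only; unknown lower-case ones are
 * ignored ("don't worry"). */
static enum mount_mode policy_mount_mode(const char *policies)
{
	enum mount_mode mode = MOUNT_RW;
	const char *p = policies;

	while (*p) {
		size_t len;

		while (*p == ' ')
			p++;
		len = strcspn(p, " ");
		if (len == 0)
			break;
		if (!policy_known(p, len) && isupper((unsigned char)p[0]))
			mode = MOUNT_RO;
		p += len;
	}
	return mode;
}
```

For example, "noclean dup=2 futurething" would still mount read-write, while "nonew Futurething=3" would drop to read-only. The third class, "you cannot even read", is deliberately not modelled since the text above prefers to avoid it.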