So, let's try to write a kernel module that implements this filesystem.
It would be good to have a plan.

- Mount filesystem, providing empty root directory
   o parse mount options - DONE
   o find/load superblocks and stateblocks - DONE
   o present empty directory - DONE
   o Compile external module - DONE
   o test - DONE
- Mount filesystem read-only with no roll-forward
   o IO address mapping
     sync_page_io or bread? - not bread I think
   o Index blocks management
   o search cluster-header for root inode
   o file read
   o Directory lookup/read
   o test
- Support roll-forward for blocks, orphans, whatever
   o manage segusage files
   o manage quota files
- Support writing
   o inode bitmap
   o cluster creation / block sorting
- Support Cleaning
- Interface for snapshots and other admin

------------------------

FIXME If a device is removed from the filesystem, we cannot reliably
 tell from the other devices or state that this is so.  Maybe we need
 to update all devblocks with a new 'seq' number...

FIXME How do we specify mounting subordinate filesets?  What superblock
 do they have?  I suspect we do a -F lafs-sub mount from the original
 filesystem.

FIXME If mount fails, we seem to be leaving a super lying around, and
 sync_supers dies on it. - DONE

FIXME Umount appears to work, but a sync_supers dies. - DONE

FIXME subordinate supers aren't being locked as much - is that a problem?

FIXME index pages never get put on an LRU - how is this supposed to work?

--------------------------

Thoughts:
 Inodes live in an address-space, much like a file.  To load the first
 inode we need an address-space, so we may as well have a 'struct inode',
 as we may want to expose it to user-space.
 Loading an inode needs the fs (lafs filesystem structure),
   which needs the subfs (maybe a lafs inode),
   which needs the snapshot - this is implied by the subfs inode.
 And the fs can be obtained from the inode, so just: inode, inum.

UPTO 03nov2005
 review block_leaf_find and make_iblock
 need to do setparent and block_adopt

next 10nov2005
 need to resolve locking for the ->siblings list

24nov2005
 peer_find
 lock_phase
 lafs_refile

 I can read a file.....!!!!!

 Code review / tidy up.
 resolve locking
 buffer vs page
 Export on a web page somewhere??

16feb2006
 (I spent a while getting large directories to work again in the
  prototype.. and some holidays).
 - Priority:
   - clean mount and unmount
   - large directories
   - multiple devices.

FIXME how do we record and handle write errors???

The iput in lafs_release - which is needed - is oopsing at iput+0xe!

23feb2006
 Ok, I finally have a clean mount/unmount. .. not quite.
 Blocks being freed at unmount still have a refcnt, which is bad.
 Next:
  - make sure we can handle 'large' directories.
  - make sure we can handle files with indexes
  - handle filesystems that span devices.

02mar2006
 Hurray - clean unmounts!!!
 There is a nasty circular reference: the root inode is stored in a
 block that it manages.  Maybe this should not happen, rather than
 having to be explicitly broken - the root-block can live elsewhere,
 not in the inode.
 Next, multi-level index blocks.  But first, I need to understand
 memory pressure and pageout.
  How are dirty pages found to be cleaned?
  How is pressure put on a filesystem to clean up?
  How are clean pages reaped?
  - call pagevec_lru_add{,_active}(pvec) to put the page on an LRU.
    lru_cache_add{,_active}(page) might be easier, but isn't exported.
  - call mark_page_accessed(page) to keep the page 'active'.
  (see the sketch further below)

09mar2006
 - make sure indexes work...
   lafs_load_block+0xf  eax,bx,cx,dx,s1 all zero
   from block_leaf_find 203
   ...
 OK, indexes seem to work.
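Before moving on - the LRU sketch promised under 02mar2006 above.  A
minimal sketch, assuming the 2.6-era pagevec calls named in that note;
lafs_lru_add() and its pagevec argument are an invented wrapper, not
anything in the real code:

    #include <linux/pagevec.h>
    #include <linux/pagemap.h>   /* page_cache_get() */
    #include <linux/swap.h>      /* mark_page_accessed() */

    /* Batch freshly-added pages onto the inactive LRU via a pagevec,
     * mirroring what the unexported lru_cache_add() does internally.
     */
    static void lafs_lru_add(struct pagevec *pvec, struct page *page)
    {
            page_cache_get(page);           /* the LRU holds a reference */
            if (!pagevec_add(pvec, page))   /* pagevec now full: drain it */
                    __pagevec_lru_add(pvec);
    }

The caller would pagevec_init(&pvec, 0) up front, call
pagevec_lru_add(&pvec) at the end to drain any remainder, and call
mark_page_accessed(page) whenever a page is re-used so it stays active.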
But 'lafs' has problems creating some large files.  Try 'tt'.
This is due to not handling errors properly.. fix it later FIXME

16mar2006
 Must make sure the index address-space gets cleared up...  I wonder
 how we find all the pages to free.  This might be one reason to keep
 them in a radix tree.  Though we should be able to walk our own data
 structures.
 Then work on mounting a 2-device filesystem.

FIXME dir_next_ent always starts from the beginning rather than
 remembering where it is up to... can this be fixed??

18mar2006 (Wedding anniversary, and Saturday ... during the Commonwealth Games)
 Mounting a snapshot needs a way to identify that it is a snapshot
 mount, and which snapshot, and which filesystem.  We could use a
 different filesystem type, but that isn't really needed.
    mount -t lafs -o snapshot=name /original/mount/point /new
 This grabs the named snapshot of /original/mount/point and places it
 at /new.  The 'snapshot=' option is the trigger.

 For a control FS, we
    mount -t lafs -o control /original/mount/point /new

 To grow a filesystem, we initialise a device (super/state blocks) and
    mount -t lafs -o remount,new=/dev/name whatever /original/mount/point
 as the dev_name isn't passed to remount.

 So, mount options are:
    snapshot=name
    dev=/dev/device
    new=/dev/device
    control
 and various name=value pairs matching what is exposed in the control
 filesystem.  (A parsing sketch appears further below.)

23mar2006
 - factored out super-block finding preparatory to finding snapshots.

 Thoughts: superblocks for snapshots and subordinate filesystems do
 not get stored in the 'state'.  There is, however, a usage count so
 that the prime filesystem cannot be unmounted until all snaps and
 subs are gone.  This should just refcount the prime_sb I suspect.
 So: a snapshot sb points to the 'struct fs' but doesn't .... what???

30mar2006
 - remove the super-block finding code by changing the layout to store
   superblock locations explicitly :-)
 - teach 'mount' to mount snapshots.
 - need to audit for bad use of ss[0]
 - need to find a better way to map 'sb' to snapshot number.
 - need to make unmount work.

01apr2006 (no, really!!)
 - rewrite index to kmalloc index blocks and use a shrinker to free
   them.  This means that an indexblock no longer has a 'page', which
   makes sense.  It also means they cannot live in highmem, which is
   sad, but could be fixed.

 Notes: superblocks and refcounts.
  Each device holding the filesystem gets a superblock.  One of these
  (arbitrarily) is the 'prime' superblock and gets to manage the whole
  filesystem.
  Each snapshot also gets a superblock, as does each subordinate
  filesystem.  These are anon sbs - using an anon dev.
  Each anon sb takes a reference to the 'struct fs', and also to the
  prime sb....  How about the reference relationship between fs and
  prime_sb???  Need to ponder this.

 - problem with getting the parent superblock due to semaphores...
 - when unmounting, put_super isn't being called, so inode 0 isn't
   released!

13apr2006 (Took a week off to play with rt2500 wireless cards)
 - Use a different filesystem type for snapshots and subordinate
   filesystems.  This removes the semaphore problem.
 + OK, mount and unmount works for snapshots... what next?
 - review index block - worry about himem?
 - review ss[0] usage - OK
 - general code review

FIXME - what should leaf_lookup/index_lookup return on a format error?
 They currently return '0', which will quietly make an empty block.
 Maybe '-1' would be better, to make an error block.
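Looking back at the mount options listed under 18mar2006, here is the
parsing sketch promised there.  It uses the kernel's match_token()
helper from <linux/parser.h>; the token table, lafs_parse_options()
and its out-parameters are assumptions, not the real code, and the
extra name=value pairs are left as a TODO:

    #include <linux/parser.h>
    #include <linux/string.h>
    #include <linux/errno.h>

    enum { Opt_snapshot, Opt_dev, Opt_new, Opt_control, Opt_err };

    static match_table_t lafs_tokens = {
            {Opt_snapshot, "snapshot=%s"},
            {Opt_dev,      "dev=%s"},
            {Opt_new,      "new=%s"},
            {Opt_control,  "control"},
            {Opt_err,      NULL}
    };

    /* Split the -o string on commas and record the options we know about. */
    static int lafs_parse_options(char *options, char **snapshot,
                                  char **dev, char **newdev, int *control)
    {
            substring_t args[MAX_OPT_ARGS];
            char *p;

            while ((p = strsep(&options, ",")) != NULL) {
                    if (!*p)
                            continue;
                    switch (match_token(p, lafs_tokens, args)) {
                    case Opt_snapshot:
                            *snapshot = match_strdup(&args[0]);
                            break;
                    case Opt_dev:
                            *dev = match_strdup(&args[0]);
                            break;
                    case Opt_new:
                            *newdev = match_strdup(&args[0]);
                            break;
                    case Opt_control:
                            *control = 1;
                            break;
                    default:
                            /* TODO: name=value pairs for the control fs */
                            return -EINVAL;
                    }
            }
            return 0;
    }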
FIXME check how other filesystems lock the setting of PagePrivate.
 Maybe we just need to lock_page.
FIXME combine find/load/wait into one operation
Review dir, super, roll, link
FIXME module refcount increases on failed mount!

18may2006
 I've been sick for too long, and not much has happened...  However I
 think more than the above comment says.  I started looking at
 roll-forward and have the basic block parsing in place, so that it
 reports what it sees in the roll.
 Also, the format has been changed a little: the address in the state
 block is the CheckpointStart cluster, and we simply roll forward to
 the CheckpointEnd, and then keep going beyond there - there is no
 longer any walking back to find the start.
 The next step is to start incorporating rolled elements into the
 filesystem:
  - data blocks: shouldn't be too hard.  Don't need to update the
    index pages just yet.
  - inode updates: should be straightforward enough, but care is
    needed as the data might be in multiple places.
  - directory updates: these are probably the most interesting..

 Question: how are symlinks created?  Currently we:
    log the inode creation
    commit the new inode
    log the directory update.
 This allows the 'value' stored in the inode to appear after the
 directory update.  That might be OK for files (which are created
 empty and then extended) but is bad for symlinks (which are created
 atomically).
 So, options include:
  - ensure the inode is in a previous cluster to the directory update.
    This slows things down too much I think.
  - log the content as well.  This is awkward if it is big, certainly
    if more than a block, which is possible.
  - directory updates could be dependent on the inode being valid.
    This is ugly.
  - log the content if it is small, else write the inode, flush, then
    create the link.
 So the fast option is:
    log inode create, log content, log filename
 and the slow/safe option is:
    log inode create, sync file, log filename
 So on roll-forward, if we see the inode we just save the data.
 Saving the whole inode seems attractive, but we want minimal order
 dependence: an inode update in the same cluster as the new inode
 should still override, even though it is earlier.

 Ok, roll-forward is proceeding slowly.  I think I am now
 incorporating new blocks into the tree properly, though the code
 probably won't compile.  It will be nice to test this and see the
 file have the right data.
 The next step would be to include the index incorporation code.
 Then:
  - directory updates
  - segusage summary
  - quota
  - stuff..

08jun2006
 - what exactly should happen when roll-forward finds a file with a
   link count of 0?  Currently all updates get lost - I wonder if they
   are lost safely?
 - roll-forward is getting the size right, but not the content - do I
   need to flag a block that ->phys is valid?
   : Ok, roll-forward picks up new blocks in a file OK, but umount has
   stopped working.  Presumably because there are pages attached to
   the inode which aren't getting released.
 What do we want to do here?  Normally those pages, or their
 addresses, need to be recorded before they are lost.  But on a
 read-only mount we don't care so much.

22jun2006
 Continuing the above thought..  When we roll-forward and pick up the
 pieces of a file, we don't want to allocate pages to hold those
 pieces (and definitely don't want to read them all).  We just want to
 attach the addresses to the parent for incorporation.
 Similarly, after writing dirty blocks in a file we want to be able to
 release them immediately rather than waiting for the addresses to be
 incorporated (as incorporation can be more efficient when delayed).
We could just allow the page associated with a block to be released,
except that the page provides the indexing to find a block.  We might
be able to live without the indexing, and hunt down the indexblock
tree, but living without the mutual exclusion provided by block
indexing would be more awkward.  And the 'struct datablock' still
contains a lot more than is needed.

So maybe we should just have a completely separate structure attached
to the indexblock which lists fileaddr/physaddr.  This could include
extent information.  The trick would be guaranteeing allocation.  We
could either allocate-late, with a fallback of attaching the 'struct
block' or performing an immediate incorporation, or allocate-early and
block the dirtying of a page until there is space to record the new
address.  This last is bound to be easiest.

So: what exactly do we use to store addresses?  Probably a linked list
of tables.  Each table contains a link pointer and an array of
fileaddr/physaddr/extentlen.  But we would need to allocate lots of
these if there are hundreds of dirty pages, and possibly only end up
using a few if they made extents very nicely.  That might be wasteful.

Or we could allocate just one.  When it is full we perform an
incorporation.  But if that causes a page split we are in trouble.  We
could have a spare page, split to it, write out one and wait for the
spare page to be written and freed.  But we cannot just release the
index page as it might still have children.  (I think I've been here
before).

A worst-case scenario involves writing one block where that requires
splitting every index up the tree to the inode.  This requires
arbitrarily many pages to be allocated.  To accommodate this we either
pre-allocate a spare page at every level of the tree down to the data
block (a bit like storage space allocation), which seems very
wasteful, or we make sure we can release one of the split pages, which
seems impossible.

I could decide not to worry about it.  Have a pool of index pages and
hope it always works.  After all, most pages are data pages, and they
can be freed successfully.  We would only have a deadlock if all dirty
memory were index pages, and that seems unbelievably unlikely.  If we
trigger a checkpoint when the count of locked pages hits some limit we
should be safe.

So:
  Keep one table per index block.
  Use simple append and sequential search.
  When the table gets full, force an incorporation.

Do we allocate the table separately, or embed it in the indexblock??
Probably embed it.  Indexblocks that don't need it can be freed at any
time, so the space waste hopefully isn't significant.
How big?  If the file is written sequentially, then everything should
gather into extents, and so it doesn't need to be enormous.  If the
file is written randomly then the index block can be expected to be
'indirect', so incorporation will be cheap.  So 'small' seems ok in
both cases.  Let's say 8.

But wait a minute.....  On a checkpoint we can be getting phys updates
for the prev and next phases.  Next-phase updates cannot be
incorporated until the indexblock has passed on to the next phase.  So
in that case, I think we still keep a linked list of unincorporated
blocks and live with the fact that we cannot free them until the phase
change passes.  That shouldn't be a big problem as it is a limited
time frame - especially for data blocks..

But does this solve our initial problem??  During roll-forward we want
to keep the addresses but not the blocks, and we don't want to force
incorporation.
That means an arbitrary list of addresses attached to an index block.
I guess we could possibly allow incorporation, but I would rather not,
as I want the fs to be able to be read-only nicely.  So that means we
need to have a list of address tables.  Maybe the normal approach is
'add a table if possible, else incorporate'?

OUCH... we may write a block a second time before incorporating the
new address, so when adding an address to the table we need to check
if it already exists.  That could be expensive.  For index blocks
might it even be a different address?  I think not, but the vague
possibility (in the future?) does complicate things somewhat.  Maybe
we just keep things in chronological order and don't worry about
duplicates until incorporate time, when we have to sort anyway.

todo:
  lafs_find_block               DONE
  free_block must free tables   DONE

Unmounting still doesn't work.  The problem is that an index block is
holding a reference on its parent, and parent references aren't
getting cleaned up.  On a read-only unmount I guess we need to walk
the list of leafs, discard any address info, and unlock the blocks.
So that should be the first task for next time.

27jul2006
 Leafs are locked blocks which have no locked children.
  So any locked data block (non-inode) is a leaf.
  Any locked index block with lockcnt[phase] 0 is a leaf.

 OK - fixed numerous bugs, but I can unmount now!!  I can even rmmod
 and insmod and all is cool.

 TODO:
  - review refile and get all the code in there from the prototype
    DONE (I hope)
  - write a combined find/load/wait function and use it  DONE
  - allocate inodes in a single memcache and avoid generic_ip
    HALF DONE. (still using kmalloc, not doing initonce well)
  - review recording of new block addresses
    + make sure we look up there on index lookup - YES
    + make sure ->uninc_next gets transferred to the table at phase
      change.
    + write the incorporation code, as it is tricky
  - review how directory updates can be incorporated into a RO
    filesystem.  No, they cannot.  We need to update the directory.
  - write directory update code
  - write cluster construction code
  - make sure indexblocks with unincorporated addresses get on to
    inc_pending ?? or is locking them enough?

 INCORPORATION - Argggghhhhh.
  The current uninc_table doesn't really lend itself to building index
  blocks... though maybe....
  Question: what happens when an index block disappears?  i.e. it has
  no addresses in it?  We clearly need to remove it from the parent.
  This should be trivial: a direct operation on the parent index
  block, setting some address to 0.  Then the next incorporation pass
  will simply lose that entry.
  OK, that might be all well and good, but how do we sort
  unincorporated addresses so we can merge them?  A linked-list merge
  sort is nice and open-ended, but does waste quite a bit of space in
  pointers.  Or maybe I should just always do small-table
  incorporations.  Is there a way that a bad ordering of writes could
  force a very bad index layout in this case?  i.e. cause a table
  split every time, but new blocks go in the first (full) table.

 OK Decision: always do small-table incorporation.  i.e. not a list of
 blocks: just a table of addresses.

FIXME check the validity of the index type when it is first read in,
 and reject early if it cannot be recognised.

24aug2006
 Took a break from incorporation.  Looking at directories.
 Wrote dir.doc in the module to sum lots of stuff up.
 Issue: dir blocks have an info structure attached.  This includes a
 counted reference to the parent.  How long does this need to hang
 around for??
  - when there is any orphan issue happening, it must stay, via the
    'pinned' flag.
  - when actually performing a dir op, we need to create and maintain
    this info.
 When the last ref of a dir block is dropped, we should drop the
 parent reference.

 Status: free list management mostly done.
 Next:
   create/delete prepare/commit/abort
   orphan handling
   dirty_block
   lock_block

FIXME should dir_new_block zero out the block?  How will commit_create
 know what to do with this block?

NOTE another type of directory orphan is a free leaf block which is on
 the part-free list.

-------------------------------------------------------------
09sep2006 (on the plane to Frankfurt)

 Don't tell me I am rethinking preallocation again???

 TODO
   dirty_inode needs to record the phase it is dirty in.
   inode_fillblock needs to check the current phase and act accordingly.
   see inode.doc
   Make sure the B_Orphan flag is set and used - or discard it.

 How do we commit creating a symlink?  If it is a full block in size
 we cannot make an update record.
  - maybe have two update records?  We cannot guarantee they are in
    the same cluster. ... but if we put the 'make dir entry' last it
    should work.

 Change the 'struct descriptor' definition: the 'block_type' aka
 'length' 16-bit field becomes
    0x0000 -> 0x8000   datablock, possibly a hole - up to 32K.
    0x8001 -> 0xc000   miniblock, up to 16K+
    0xffff             index block.

 Need to write IO routines which decrease the pending-block-count in 'wc'.

 Thinks: a 1TB filesystem with 1K blocks and 4096 blocks/seg gives
 4Meg segments.  That would be 256K segments, which at 2 bytes per
 segment - 512 segments per block - is 512 blocks in each seg usage
 file.

12oct2006
 Need to write:
 - lafs_lock_{d,}block  DONE
   Make sure the block has parents and allocation, and set the locked
   flag and the phase.
 - lafs_flush
   Given a datablock, wait for it to be written out.  This is needed
   before updating a block that is still locked in the previous phase.
 - lafs_inode_init
   Used when creating a new object/inode.
   Given a datablock which is to hold the inode, a type (Type*) and a
   mode, fill in the data block with appropriate data so that when
   lafs_import_inode looks at it, the right stuff happens.
 - phase_flip
 - lafs_prealloc
 - lafs_seg_ref
 - lafs_lock_inode

   lafs_dirty_dblock  lafs_cleaner_pause  lafs_dirty_inode
   lafs_seg_flush_all  lafs_write_all_super  lafs_quota_flush
   lafs_space_use  lafs_cluster_update_abort
   lafs_cluster_update_commit_buf  lafs_cluster_update_commit
   lafs_seg_apply_all  lafs_cluster_update_prepare
   lafs_inode_phase_check  lafs_seg_dup  lafs_dirty_block
   lafs_cluster_update_lock  lafs_checkpoint_unlock_wait
   lafs_orphan_drop  lafs_free_get  lafs_find_next

2nov2006
 - I need to know if a block is undergoing write-IO so that I can
   avoid modifying it in certain circumstances.  But I don't track
   this information.  Options:
    1/ track the info.  This means an extra field in the 'struct
       block', because I still need to know which wc has had a write.
    2/ For blocks that we care about, copy the data on write...
       But we care about all inodes and directory blocks.  That is a
       waste.
   I think we put the extra info in the block.  We need to know which
   wc was used (0,1,2) and which pending cluster in there (0-3), which
   comes to 4 bits.  But we only care about the block for wc=0, and we
   could include the which-pending in the b_end_io, or maybe put it
   all in the low bits of the block pointer....  Need max 4 bits.  Can
   only be sure of 2...
   Maybe: 'which' goes in the bottom two bits of bi_private
          'wc' goes in ->flags

4apr2007 (What a long gap !!)
 - lafs_cluster_update_*
   How do we prepare for a cluster update?  How do we lock it?
   The important thing is that the update can be written.
   That requires that there is space available.  So we need to
   preallocate space and then release it.  It is possible that each
   update might go in a different cluster, so maybe we need to
   preallocate one block per update.  That sounds a little expensive.
   After all, we aren't preallocating a cluster block for every data
   block that is dirty.
   So:  prepare does nothing
        lock preallocates the space - a full block.
        commit copies it in.
   For now at least.

24May2007
 - Can now create and delete lots of files.  This is cool.  But:
   Orphan slots just grow and grow - never to be reclaimed - why?
   After rm f*, 7 files remain.  But rm f* again and they go.
   FIXED - readdir wasn't returning them.
   The size of the directory remains large.
   And sometimes files become ghosts... (try just removing one after
   the first rm f*).
 TODO - process those orphans to clean up the directory.

20June2007 (Happy Birthday Dad)
 - Creating lots of files and then deleting them leaves 5 orphan slots
   for the directory busy, and one for inode 0??

 Directory handling uses the following orphans:
   CREATE: A new index block is created by splitting.  This needs to
           be linked in.
   DELETE: The dirent block we are deleting from.  If it becomes
           empty, it needs to go on the free list.
           The index block we are deleting from.  If it has lots of
           free space it might need to be rebalanced.
           The inode that was deleted.

 - When a file is fully deleted, we need to drop any orphan info...
   DONE
 - Need to do orphan handling of free blocks in the directory, and
   unmerged parents - but there doesn't seem much point as I am going
   to change the directory layout (again).

 So: writing to a file.  We need prepare_write, commit_write, and
 writepage.
   Prepare loads and links the page and checks there is space.
   Commit marks it as dirty so writeout is possible.
   writepage chooses a page to write out.

25June2007 - HACK week, thanks Novell!!
 - write - DONE
 - sync  Somewhat done.  Need to revise the process whereby async
   completion clears PageWriteback.  We need locking in there, and
   need to worry about 'which' wrapping too soon.  Need to not start
   IO before we set page writeback.
 - chmod  Maybe, but syncing to disk needs more thought.
 - 'df'   Partly done, need actual content.
 - mkdir  Can make a directory, but creating the first entry fails.
   - FIXED
 - symlink
 - readlink
 - new directory structure.

27Jun2007 - More HACK week :-)
 - new directory layout done - much easier!!
 - If I delete a file that was created, the blocks still have a
   ref-count and we crash.
 - mkdir doesn't increase the link count on the parent. - FIXED

 TODO:
   Orphan handling.
     Infrastructure to process orphans
     Handle specific cases
     flush orphans at key times.
     load orphans at roll-forward checkpoint
   Write out a checkpoint (when?)
   Make sure the refcount goes back to zero on blocks I write.
   Check on inode_phase_check and checkpoint_unlock and inode_dirty in
   all directory operations.

 FIX: Writing a small file leaves something non-dirty but due to be
      written, and lafs_cluster_allocate complains.
      - seems to work now.
 FIX: dir_handle_orphan doesn't lock the orphan transaction required.
 FIX: rm of a file with (small) content hangs waiting in sync_page in
      truncate_inode_pages.
 FIX: lafs_allocate hasn't been written!!!
 FIX: before updating any block in a depth=0 file, we must first load
      and 'lock' block 0.

29Jun2007 - still HACK week.
 Summary of how incorporation works.
 Each index block has a small table for unincorporated changes,
 i.e. block numbers and their addresses.  This supports efficient
 storage of extents, and is extensible by allocating more tables.
 This last is done rarely.
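To make that concrete, a rough sketch of what such a table might look
like.  The fields and the size of 8 come from the notes above;
everything else (names, types, the chaining pointer) is an assumption:

    #include <linux/types.h>

    #define UNINC_SIZE 8    /* 'small' per-indexblock table, as decided above */

    /* One unincorporated address: "file block f, for n blocks, now lives
     * at physical address p".  Sequential writes collapse into extents.
     */
    struct uninc_entry {
            u32     fileaddr;       /* first block number within the file */
            u64     physaddr;       /* device address the block was written to */
            u16     nblocks;        /* extent length, in blocks */
    };

    /* Embedded in each index block.  On a writable filesystem a full table
     * forces an incorporation; during roll-forward / read-only mounts we
     * chain another table via 'next' instead.
     */
    struct uninc_table {
            struct uninc_table *next;   /* extra tables: roll-forward only */
            int     count;              /* entries currently in use */
            struct uninc_entry ent[UNINC_SIZE];
    };

Lookups would then check the table(s) first and the children lists
second, exactly as the next paragraph says.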
When a block gets a new address, this is added to the table or, if
there is a phase mismatch, it is added to a list until a phase change
happens (so the whole block is pinned pending the phase change).
If the table is full then:
 - if the filesystem is read-only (including during roll-forward), a
   new table is allocated (else roll-forward fails).
 - otherwise we incorporate the table into the block, then add the new
   address to the (now empty) table.
If incorporation requires that we split the index block, we allocate
one from a pool.  If there are none in the pool, we wait.  As the
table is much smaller than a block, the incorporation into two blocks
will always succeed.  The 'uninc_next' and 'children' lists will then
need to be shared between the two blocks before the new address is
added to whichever table is appropriate.

When looking for a block address, we must always check the table and
then the children lists.  We do not need to check uninc_next as they
will always be children.

How do we ensure that the pool always has sufficient index blocks and
we don't deadlock?  We have two halves of the table, one for each
phase.  Before we allow a block to be dirty in a phase, we ensure that
the pool has adequate index blocks for that phase, e.g. twice the
depth of the block.  If it doesn't, we block the dirtying until space
becomes available.  For syscall writes this is easy, as we catch it in
prepare_write.
When we perform a phase change, we must be sure there are enough index
blocks for the deepest block that will stay dirty.  If there aren't,
we need to flush all dirty blocks, and unmap all writable mappings,
before starting the checkpoint.

FIX: need to work out lifetime rules so that inodes hang around while
     they have blocks.  Currently have an igrab that is never put.
FIX: Dirty isn't cleared until 'flush', but do_checkpoints requires
     'alloc' to clear it.

3Jul2007
 Checkpoint flushing is getting close.
 Current problem: InoIdx blocks are not changing phase.  Phase change
 should happen when all children have been incorporated, and then the
 write has been triggered, marking us clean.  For InoIdx blocks, we
 need to be marked clean when the data block completes.

5jul2007 - a week off
 Checkpoint flushing seems to work !!!!

 FIX: what should the filesize of a symlink be?  Other filesystems use
      len, but still zero-terminate for the vfs.

 Problem.  A chmod is followed immediately by an unlink, then a
 checkpoint.  The chmod update gets into the checkpoint cluster, but
 the unlink completes before the checkpoint is finished, so the new
 superblock sees the file as gone.  Roll-forward finds the update and
 wants to update a missing file.
 This isn't a big problem, but with slightly different details it
 could be.  One option is to ignore updates that precede the updated
 block.  That might be awkward with e.g. directory updates and
 checkpoints that cross multiple segments.  Another option might be to
 prohibit updates once a checkpoint has started, unless they are known
 to be after the phase change.

 FIX: unlink isn't punching a hole in the inode file.  The inode usage
      map isn't being updated.
      - FIXED (for create, not unlink).
 FIX: roll-forward does not pick up inodes, only data blocks.  But
      tiny files are synced to the inode, so they might not be picked
      up.  So we must process a level=0 inode like a data block.

6July2007
 Time for lots of clean up.
 DONE 1/ Index blocks to fill with 0 - use phys=0 to imply invalid.
 DONE 2/ rename 'lock' -> 'pin'
      3/ Review and fix up all locking/refcounts.  See locking.doc
 DONE 3a/ Make sure cluster_allocate can be called concurrently.  e.g.
          check B_Alloc inside the semaphore.
          Also lock the inode when copying in block 0, and probably
          when calling lafs_inode_fillblock (??)
 DONE 3b/ lafs_incorporate must take a copy of the table under a lock,
          so more allocations can come in at any time.
 NotYet 3c/ cluster_flush should start all writes before calling
          _allocate, as _allocate might block on incorporation/
          splitting.  No.  We really want _allocate to not block, but
          to queue...  I think this is too hard to get perfect just
          now, so I will leave it.
 DONE 3d/ introduce PinPending for data blocks.  Remove
          fs->phase_depth.
 LATER 3e/ Index needs a clean-lru on each filesystem, and a list of
          filesystems, so that locking of the lru doesn't have to be
          too global.
 DONE 3f/ change wc[]->hlhead to be a regular listhead as it is part
          of the lru system.
 DONE 3g/ revise refile lru handling based on new understanding.
      3h/ Utilise the WritePhase bit, to be cleared when the write
          completes.  In particular, find when to wait for Alloc to be
          cleared if WritePhase doesn't match Phase.
          - when about to perform an incorporation.
      3i/ make sure we don't re-cluster_allocate until the old-phase
          address has been recorded for incorporation.
      3j/ Check that index blocks cannot race when getting locked....
      3k/ Check what locking is needed to set PagePrivate exclusively.
 DONE 3l/ cluster_done needs to call refile, but is called in
          interrupt context.  We need to get it done in process
          context I think, and lock ->waiting access with fs->lock
          after changing it to ->lru.
 DONE 3m/ Need to know which blocks in a page are in writeback so we
          can clear writeback only when *all* have finished.
      3n/ on phase change, uninc_next blocks need to be shared out.
 NO   3o/ Make sure lafs_refile can be called from irq context.
      3p/ lock all lru accesses.
      3q/ Lock those index blocks!!!
      3r/ Can an inode data block be on leafs while the index isn't?
          What happens if we try to write it out...  FIXED
      Why are extent entries only grouped in 4s?
      If InoIdx doesn't exist, then write_inode must write the data
      block.
      4/ resolve the length of symlinks
 FIXED  - long symlink followed by 'sync' crashes.
 FIXED  - rollforward isn't calling 'allocated' on blocks, or something
 FIXED  - I cannot find 'bfile'. (inode isn't written)
 SEEMS OK...- Must flush the final segment of a cluster properly...
      5/ Review what does, and does not, need to be initialised in a
         new datablock.
      6/ document and review all guards against dirtying a block from
         a previous phase that is not yet safe on storage.  See
         lafs_dirty_dblock.
      7/ check for proper handling of error conditions
         a/ checkpoint_start might fail to start a thread!
         b/ if lafs_seg_dup fails in do_checkpoint, we need to abort
            the snapshot.
      8/ review the checkpoint loop.  Should anything be explicit, or
         will refile do whatever is needed?
      9/ Waiting.  What should checkpoint_unlock_wait wait for?  When
         do we need to wait for blocks to change state?  And how?
 DONE 10/ rebase on 2.6.current
 DONE   - use s_blocksize / s_blocksize_bits rather than fs->
     11/ load/dirty block0 before dirtying any other block in a
         depth=0 file.
     12/ Add a writecluster flag for old-phase updates.  Why is this
         needed?  Updates should always go in the new phase???
     13/ use kmem_cache for 'struct datablock'
     14/ indexblock allocation.
         use kmem_cache
         allocate the 'data' buffer late for InoIdx blocks.
         trigger flushing when space is tight.
         Understand exactly when make_iblock should be called, and
         make it so.
     15/ use a mempool for skippoints in cluster.c
     16/ Review the seg addressing code in cluster.c and make sure the
         comments are good.
 DONE 17/ Make sure create inherits uid etc from the process.
     18/ consider ranges of holes in pending_addr.
     20/ Implement the rest of "incorporate"
     21/ Implement staged truncate
         use for setattr and delete_inode
 DONE 22/ block usage counts.
     23/ review segment usage / youth handling and make a todo list.
         a/ Understand ref counting on segments and get it right.
     24/ Choose when to use VerifyNull and when to use VerifyNext2.
     25/ Store accesstime in a separate (non-logged) file.
     26/ quotas.  Make sure files are released on unmount.
     30/ cleaner.  Support 'peer' lists and peer_find. etc
     31/ subordinate filesystems:
         a/ ss[]->rootdir needs to be an array or list.
         b/ lafs_iget_fs needs to understand these.
     32/ review snapshots.
         How to create
         how they can fail / how to abort
         How to destroy
     33/ review unmount - need to clean up the checkpoint thread
         cleanly - be sure it has fully exited.
     34/ review roll-forward
         - make sure files with nlink=0 are handled well.
         - sanity check various values before trusting clusters.
     34/ Configure the index block hash_table at run time based on
         memory size??
     35/ striped layout.  Review everything that needs to handle
         laying out at cluster aligned for striping.
     36/ consider how to handle IO errors in detail, and implement it.
     37/ consider how to handle data corruption in indexing and
         directories and other metadata, and guard against problems
         (lots of -EIO I suspect).
      - check all uninc_table accesses are locked if needed.

 And more:
  1/ fs->pending_orphans and inode->orphans are largely unused!
  2/ If a datablock is memory mapped writeable, then when we write it
     out we need to fill up its credits again, or unmap it.
  3/ Need to handle orphans asynchronously.

---------
22nov2007
 Free index blocks are on two lists, both protected by the global
 hash_lock.
  1/ The per-inode free_index, so they can be destroyed with the inode.
  2/ The global freelist, so they can be freed by memory pressure.

11feb2008
 Where was I up to again?  Reviewing phase_flip and lafs_refile.
 UPTO Reading through modify.c, at 'add_indirect'.
 Plan to fix all this code.  Need to think about how index blocks
 really change: how old blocks get dis-counted from segment usage, and
 what optimisations are really good for re-incorporating index blocks.
 Operations to consider are: i) append new block, ii) truncate,
 iii) over-write, iv) fill-hole.
  i/   leaf block splits, index block gets a new entry at the end, and
       a replacement for the other entry.  Easy to handle.
  ii/  trailing entries are zeroed.  Should be easy, but isn't yet.
  iii/ probably caught in leafs.  May cause an internal split, so we
       add a new index address, which is easily handled if there is
       space.
  iv/  same as iii, though a split is more likely.
 What about merging index blocks?  That just makes addresses
 disappear, which we handle the slow way.
 Do we ever re-target index blocks?  Would need to be careful about
 that.  Make it look like a split where one block ends up empty as a
 hole.

 Need to write:
   grow_index_tree (DONE - untested)
     ib is a leaf inode that is getting full.  Copy addresses into
     'new', and make 'ib' an index block pointing at new.
   add_index/walk index (DONE - untested)
   end of do_incorporate (DONE - untested)
     new contains the early addresses.  Some remain in ib and/or ui.
     The buffers must be swapped, so ib has the early addresses.
     ui needs to be attached to new.
     return 2; - then the new uninc needs to be split.
   lafs_incorporate
     case 2 - horizontal split
     case 3 - vertical split

12feb2008
 Bother - uninc_table is a problem (again).  We can currently add at
 any time with just a spinlock.
 So when we split a block horizontally,
 Still need to:  share out children and uninc_table in do_incorporate
                 share out credits in do_incorporate

14feb2008
 Still need to do incorporate as above, but took a break to...
 Counting allocated blocks now works - stat shows the right info, and
 hopefully storage is correct too. - DONE
 next: truncate?  orphan thread?
 Then segment usage and the cleaner.

 Thoughts:
  truncate - removing blocks doesn't need to erase them...
  - nothing forces a cluster_flush promptly!!!  We need a timeout, or
    at least we need a flush before truncate_inode_pages...
  - in lafs_truncate we need to make the block an orphan and pin it
    all in a checkpoint.

21Feb2008 (Research morning)
 Discard the checkpoint thread created on demand in favour of a
 cleaner thread that runs all the time.  It cleans and checkpoints and
 orphans and scans.
 Want to: do a segment scan and get a real list of free segments and
 free-space info!

25Feb2008
 - segment usage scanning to count free blocks
 - fix up re-reading of erased blocks
 - FIX truncate can still block waiting for writeback to complete.
 - FIX allocations aren't failing when we run out of free space
 - FIX df doesn't agree with du.

 Problem: Truncate when an index block has addresses in uninc_table.
 The summary for the new address has already been performed.  We need
 to deallocate the new without disturbing the old.  However a simple
 allocation may not be possible.  I guess we can prune them all to
 zero, then incorporation can proceed.

 TOFIX: when truncating a recently created file, it is still depth=0,
 so nothing happens.  We really need to increase the depth to 1 as
 soon as we dirty any block, then reset back to 0 if it fits.

26Feb2008
 We have a file that we have written to, and the data blocks have been
 written out and the addresses stuck in uninc_table.  We then truncate
 the file.  Who releases the usage of those blocks?  And who removes
 them from uninc_table?

 OK, 'rm' returns all the blocks back now, so 'df' is almost the same
 as 'du'.  I really should make sure that inodes are getting freed
 properly and the inode map is clean and everything.

 BIG QUESTION: Do we reserve segment-usage blocks?  We cannot do it
 naively, as we get infinite recursion.  But we need it to be allowed
 to dirty the segment block.  But we cannot pin them to this phase as
 we want to write them out after this phase.
 This still needs more thought.  I avoided the recursion by setting
 SegRef before getting the ref.  But that isn't safe.

28Feb2008
 The table of cleanable segments is not working out.  Each segment
 appears multiple times, which wastes space and adds confusion.  We
 really want to be able to look up by dev/seg and also find the least.
 'Find least' sounds like we want a heap, but then we cannot discard
 the bottom half.
 We could have a skiplist for dev/segment lookup and do a merge-sort
 on a different link when we want to find the best segment.  We then
 remember the best number found since a sort, and re-sort if the top
 is worse than the best.
 We keep all this in a fixed size table.  Each entry has
 seg, dev, usage, weight, weight-sort-link, addr-sort-link and
 possibly some addr-sort-skip links.  This is 32+32+16+16+16+16 bits,
 or 16 bytes or bigger.  Say 16 bytes, 24 bytes, or 32 bytes (depth 8,
 which is plenty).
 One page of 16-byte entries (256 of them), 2/3 of a page of 24-byte
 entries, 1/3 of a page of 32-byte entries.  Total 2 pages, and
 256+113+43 = 412 entries.
 But deleting random elements is awkward... but not too awkward.
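For concreteness, a sketch of one such entry.  The field widths follow
the 32+32+16+16+16+16 layout above, with the sort links stored as
16-bit indexes into the same fixed table rather than pointers; that
detail and all the names are assumptions:

    #include <linux/types.h>

    /* One entry in the fixed-size table of cleanable/free segments.
     * The base entry is 16 bytes; the 24- and 32-byte variants carry
     * 4 or 8 extra addr-sort skip levels.
     */
    struct seg_entry {
            u32     seg;            /* segment number */
            u32     dev;            /* device it lives on */
            u16     usage;          /* live blocks in the segment */
            u16     weight;         /* cleanability weight (formula below) */
            u16     weight_link;    /* next entry in the weight-sorted list */
            u16     addr_link;      /* next entry in the dev/seg skip list */
            u16     skip[];         /* 0, 4 or 8 extra skip levels */
    };

With 4K pages this gives the 256 + 113 + 43 = 412 entries in two pages
counted above.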
 We can delete lots of entries by marking them as old, then performing
 a single pass of the skip list deleting them.
 We should keep free segments here too, on a separate list.

 So how about:
   2 pages of 16-byte entries
   1 page of 24
   1 page of 32
   The free list randomly threads through all.
   When using from 24 or 32, randomly choose a height of 2-5 or 2-9.
 Two lists run through the skiplist entries: one for cleanable, one
 for free.  Remember the nth element for some small n (10, but it
 decreases as we pull things off the front) and if we add something
 less than that, we trigger a mergesort the next time we want to
 clean.... maybe.
 Remember the end of the free list and add to there.  Maybe merge-sort
 the free list by addr occasionally.

 Questions: When can we clean, when can we free, wrt checkpoints?
  - we can clean a segment as soon as we have a checkpoint after it.
    So we record the youth of the segment holding the (start of the)
    checkpoint, and can clean any segment with a lower youth.
  - we can free a segment after the checkpoint after its usage has
    reached zero.  So if usage is zero and youth....
    We could offset the usage by one (say - for the first cluster
    header..), then when we find a segment with a usage of '1', we
    schedule an update to 0 in the next checkpoint...
 What about segments with different sizes?  They get different
 weights.  Need to divide by segment size:  usage * youth / size.

 TOFIX
  - It seems I sometimes fall off the end of the last segment !!!
    - FIXED (locking)
  - We seem to switch to a new segment when still 83 blocks remaining?
    - FIXED (delete did flush)
  - Lots of 'creates' makes lots of little clusters - need to
    optimise!  Or it could be deletes, as we currently cluster_flush
    for each delete.
    - I think this is fixed

29Feb2008
 Started looking at the cleaner.
 Need to understand how much to clean each checkpoint.
 Need to track free-space-in-active-sectors while scanning.

3Mar2008
 TOFIX
  - the cluster head is currently limited to one page.  This is not
    good.
  - Should the cleaner start before the scan is complete after a
    checkpoint?  Probably it can, but while the scan is still
    happening it might be best to be cautious ??

 STATE: try_clean is taking shape and has a few FIXMEs.
 Need to write async find_block code and get it to watch for blocks in
 a cleaning segment.

28Mar2008
 - where can padding appear in a cluster?  Between miniblocks?  At the
   end of device blocks?
 - need to track the phys block while parsing headers for cleaning..
   why?
 - determine rules for avoiding block lookup during cleaning, based on
   youth/snapshot age and truncate generation.
   We need to load the inode from each snapshot.  Can we optimise
   based on snapshot age?  Only if we know the block is newer than the
   snapshot.  So when we relocate blocks (cleaning) they must go in a
   segment that is marked as being old.  We cannot really guarantee
   that.  I guess blocks that are marked as 'new' can safely be
   skipped if the segment is newer than the snapshot.  This 'age' is
   not the youth, but is the cluster_head->seq which is stored in
   creation_age.
 - Store the rootdir for a filesystem in the metadata for the root
   inode.  Then 'struct snapshot' doesn't need rootdir.  It can have a
   root

30Jun2008
 Looking at lafs_find_block_async.
  Needs an async flag to make_iblock.  Check that.
  Can we block_adopt if there was an error?  iblock will exist.
  setparent has an async flag.
  lafs_leaf_find has an async flag.
  lafs_wait_block_async

 FIXME I wake up the cleaner every time an IO completes.  Do I really
 want that?  Maybe only when the number of async IOs hits half the
 recent maximum??
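On that last FIXME, a tiny sketch of the 'half the recent maximum'
idea.  The counters, the names, and where the wake-up actually goes
are all assumptions:

    #include <linux/spinlock.h>

    /* Track async cleaning IO and only ask for a cleaner wake-up when the
     * number in flight has drained to half of the recent peak, instead of
     * waking on every completion.
     */
    struct clean_io_counter {
            spinlock_t      lock;
            int             outstanding;    /* async IOs currently in flight */
            int             recent_max;     /* peak seen since the last wake */
    };

    static void clean_io_start(struct clean_io_counter *c)
    {
            unsigned long flags;

            spin_lock_irqsave(&c->lock, flags);
            if (++c->outstanding > c->recent_max)
                    c->recent_max = c->outstanding;
            spin_unlock_irqrestore(&c->lock, flags);
    }

    /* Returns 1 when the caller should wake the cleaner. */
    static int clean_io_done(struct clean_io_counter *c)
    {
            unsigned long flags;
            int wake;

            spin_lock_irqsave(&c->lock, flags);
            c->outstanding--;
            wake = (c->outstanding <= c->recent_max / 2);
            if (wake)
                    c->recent_max = c->outstanding;  /* start a new peak */
            spin_unlock_irqrestore(&c->lock, flags);
            return wake;
    }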
FIXME need to ensure that lafs_pin_dblock flushes committed B_Realloc
 blocks.
FIXME when we incorporate a dirty (non-realloc) address into an index
 block, we need to clear B_Realloc on the indexblock.
FIXME in lafs_incorporate we lafs_dirty_iblock 'new' without giving it
 any credits.  Where should they come from?
We don't seem to scan for free/cleanable segments often enough.
FIXME we shouldn't start a checkpoint while cleaning is happening.
FIXME need to be careful when cleaning about finding inodes that don't
 exist any more.
FIXME give credits to realloc blocks.
FIXME think about/document transitions between realloc and dirty, and
 what locking is needed.

2Jul2008
 Allowing for the FIXMEs above, the cleaner is now identifying blocks
 that need to be cleaned and marking them B_Realloc (I think).  We now
 need to gather these into a write cluster and write them.  They will
 all be on the clean_leafs list, so we can iterate that, allocating or
 incorporating as needed.  This will be similar to do_checkpoint.
 The important question is: when?  Ideally we would have some
 auto-flush mechanism.  The cleaner just keeps finding blocks to clean
 and when we start running out of resources we flush the cleaning
 queue.  However we will still want to flush the cleaner always before
 a checkpoint, so for now we can implement that bit and wait for a
 need for the other to arise.

 FIXME: cleaner lookup of 0/0/0 has interesting consequences as we
 don't record that location the same way.. how to handle?  Should
 check that 'adopt' doesn't do the wrong thing with this block.

 Realloc blocks need to be pinned.  That makes sense.  Only that way
 will they get onto the clean_leafs list.  When checkpointing we
 should probably examine clean_leafs to be on the safe side.

 Realloc and Dirty:  Both of these hold a Credit.  Both can be set at
 the same time.  The cleaner ignores Dirty and sets Realloc any time
 the block is in the wrong segment.  It also Pins the block.
 When the cleaner is flushing to the cleaning segment, it ignores
 Dirty blocks.  They get their Realloc cleared, but they remain
 pinned.  So they will get moved at the next checkpoint.
 How do we know whether an indexblock should be Dirty or Realloc?  The
 Dirty/Realloc bit is cleared before we get to incorporation.  Maybe
 we lafs_dirty_iblock the parent of any block we write out.  Then
 after incorporation, we set Realloc if it is not dirty.

 STATUS: I think I'm pinning cleaner blocks now.
  Need to make sure the dirty ones are dropped.  DONE
  Need to make sure the usage is transferred.
  Need to get free segments back into use.
  Need some more 'dump' options.  Maybe youth/usage files.  Maybe tree.
  Need to make sure scan etc are triggered often enough.

FIXME lafs_prealloc walks up ->parent without locking.  I think we
 want i_mapping->private_lock, like lafs_pin_iblock.

TODO:
 1/ a 'dump' option that triggers a scan and prints everything out.
 2/ scan must mark freeable segments as such, then subsequently free
    them.
 3/ Look at the code that decreases the usage of old segments.
 4/ Review lafs_cluster_wait_all and decide exactly how long we need
    to wait.
 5/ Review the 'FIXME that is gross' HZ/10 thing.
 6/ Review the 'wait for checkpoint to flush' msleep(500);  Maybe
    remove that altogether.

FIXME BUG_ON in grow_index_tree fires.
  sync - writepages - flush
FIXME BUG in lafs_allocated_block fired.
  from lafs_erase_dblock from invalidate_page from .. vmtruncate from
  lafs_setattr

Current problem:
 An inode data block is dirty and pinned, but the inoidx is no longer
 pinned.  Presumably it isn't dirty.
 Recheck what 'dirty' means on the two blocks and see how this can
 happen.

10july2008
 The tree gets very big!  Lots of 'Realloc' blocks that should be long
 gone.
 We are spinning in the cleaner again, and not in try_clean.

 Is it a problem that 'Pinned' is used for Realloc and dirty blocks?
 In general it shouldn't be.  The flush_cleaner process will remove
 the Realloc bits so the blocks fall off clean_leafs.  They then
 either go onto phase_leafs or get unpinned.
 But I currently have a problem with InoIdx/data.  The Pin is
 transferred to the Data block, but it doesn't go from the InoIdx
 block because it has a pincnt.  Now that is probably a bug, but what
 if it weren't?  What if, while we were cleaning, a block got dirtied?
 That would pin the whole tree.
 I guess the rule about not allocating an inodedata block while the
 InoIdx is pinned needs to be revised.  If the inodedata block is
 Realloc (and not Dirty) while the InoIdx is not Realloc, we can go
 ahead (in a cleaning segment).

 FIXME to check: adir/big1 is garbage....  big1 was removed, so why is
 it even there?  FIXED.

 echo tre > dump   # still too much stuff.

 Put cond_resched in checkpoint loops!

 Thoughts about cleaning and pinning.
 When cleaning we need to know how many dependent blocks are being
 cleaned, so that we know when *this* block can be written - i.e. when
 the count hits 0.  We cannot use the pincnt for this phase, because
 there may be dependent blocks which are dirty.  They, and therefore
 this, may get flushed at the next checkpoint, but they may not.  If
 we could be certain they would, we could just write to the
 clean-segment blocks, which can become unpinned.
 However if there is an index block being cleaned, and no dependent is
 being cleaned, but some are dirty but not pinned, then the checkpoint
 can go past without the block being moved....  but maybe we can
 detect that.

 Try this:
  We set B_Realloc precisely on blocks found in segments being
  cleaned.  We pin these blocks, and leafs which are Realloc go in
  clean_leafs.
  If a block is both Realloc and Dirty we clear Realloc but leave it
  pinned.  That way it gets written at the end of the checkpoint, but
  to the main cluster.
  When we incorporate Realloc blocks into an index block, it gets
  marked Realloc.  When we incorporate dirty blocks, mark it dirty.
  Then see above.
  On a checkpoint, we process both phase_leafs and clean_leafs.

FIXME do inode reads async better when cleaning...
FIXME if a realloc inode has been allocated to a cluster when we try
 to dirty it, confusion can ensue as the writeout won't mark it clean,
 but will use up the credits.  Maybe we need something similar to
 phasewait to not set PinPending...  But normal dirtying doesn't
 phasewait.  I think we just need to detect this case and wait for the
 clean-cluster to flush.  Messy...
FIXME make sure incorporate is doing the right thing with credits.
FIXME lafs_write_inode.  We need to be careful about clearing Dirty
 when making an update.  Need some sort of locking.  Need to review
 all inode dirty stuff and make sure we do the right thing no matter
 when it is called.
FIXME when blocks are attached to uninc_next, they don't have 'dirty'
 any more, so we don't know how to flag the index block.

2008jul13
 UPTO: unlink etc don't prealloc the inode that will be modified.
 And a WARN_ON at inode.c:579 is very noisy.

2008jul22
 FIXME: lafs_reserve_block uses CleanSpace if Realloc is set, but it
  doesn't get set until AFTER lafs_reserve_block is called.

 Here I am...  Cleaning cleans an InoIdx block, which schedules the
 data block.  Subsequently the InoIdx block gets pinned again.
 Now when we go to write the data block, we cannot, because InoIdx is
 pinned in the same phase.
 Maybe, given that the data block is pinned, we write it anyway...

 FIXME: when we realloc a block embedded in the inode, don't pluck it
  out and put it back in again.  Just realloc the inode.
 FIXME: when cleaning a directory that has shrunk, we think we have
  blocks that don't exist any more.
  FIXED - we thought '0' was in segment '0'.

2008jul23
 FIXME: lafs_dirty_iblock called from lafs_allocated_block in cluster
  flush finds no credit, for the InoIdx block of 8501.
 FIXME: do we do SEGREF on all the index blocks?  Do we need to?

2008jul24
 FIXME: seg usage for segment 0/5 isn't dropping to zero.  Part of a
  file got moved off, but the count is still there.
  FIXED - seg_move wasn't being called.
 FIXME: segusage file has inconsistent extents:
   Extent entries:
     0 -> 694 for 2
     1 -> 1291 for 1
     1 -> 15 for 1
  FIXED several bugs in walk_extent
 FIXME qphase: any locking between that changing and lafs_seg_move??
  I don't think so.  Just that seg_apply_all must be called after
  qphase is set.
 FIXME make sure we don't try to clean the current segment!!
 FIXME 'Available' goes negative!
  Creating a large file doesn't instantly reduce 'Used'.
  Deleting files plus sync doesn't increase Avail?
 FIXME a segment is in the table but doesn't print out!
 FIXME we don't cope with running out of free segments (not that we
  ever should).
 FIXME check all Credit usage and make sure credits are returned when
  ->parent is dropped.  Provide visibility into credit counts.
  Make sure we are keeping enough space for cleaning.  We should
  always have a few segments unallocatable.

2008jul25
 FIXME cannot do io completion in the cleaner thread as it can block
  on an i_mutex which might be waiting for completion.
  FIXED (keventd).
 FIXME as ->iblock isn't refcounted we need to be careful accessing
  it.  If we 'know' we have a reference, e.g. a child with a ->parent
  link, we can access it without locking.
  So: lafs_make_iblock should return a counted reference.
  If we own an (indirect?) reference to iblock, we can access both
  iblock and dblock for free... but iblock can change???
  If not, we need to get a reference to one or the other under a lock.
 FIXME block->inode should be a counted reference?
   lafs_make_iblock            OK
   lafs_leaf_find              OK
   lafs_inode_handle_orphan    OK
     inode_handle_orphan_loop  FIXED
   __lafs_find_next            OK
     find_block                FIXED
   __lafs_find_next            OK
   lafs_find_next              FIXED
   dir_lookup_blk
   dir_handle_orphan
   lafs_readdir
   lafs_inode_handle_orphan
   choose_free_inum
   find_block - FIXED
 FIXME root->iblock should always be refcounted.  Is it?
 FIXME walking siblings - what lock?

2008jul28
 FIXME several times we clear PinPending without refiling, in dir.c in
  particular.  That looks wrong.  FIXED
 Maybe lafs_new_inode should return a reference to the dblock.  Or pin
  it.  Or something.  FIXED  And pinned (when needed).
 FIXME lafs_inode_dblock might return a block without valid data...
  Need to get valid data, then load block 0 in find_block rather than
  load_block.  FIXED
 FIXME we really should own a reference to ->dblock before calling
  lafs_pin_inode.  We don't want IO during a pin request.  FIXED
 FIXME review use of PhysValid.  FIXED
 lafs_orphan_abort - what if lafs_orphan_pin was not called?  Or if
  'b' is NULL?  FIXED
 Do I need to clear PinPending when retrying??  Well, we need to be
  phase-locked when we set PinPending, so it must be Pinned to the
  current phase.  So when we unpin a datablock, we must clear
  PinPending.  FIXED  we now clear PinPending in do_checkpoint.
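A small sketch of that rule - PinPending only means anything while the
block is pinned to the current phase, so dropping the pin must drop it
too.  The flag names come from these notes; the struct layout and the
function itself are assumptions:

    #include <linux/bitops.h>

    enum { B_Pinned, B_PinPending };        /* assumed bit numbers */

    struct datablock {                      /* heavily abridged */
            unsigned long   flags;
            /* ... */
    };

    static void lafs_unpin_dblock(struct datablock *db)
    {
            clear_bit(B_PinPending, &db->flags);    /* no longer phase-locked */
            clear_bit(B_Pinned, &db->flags);
            /* then lafs_refile(db) so it drops off the leaf lists */
    }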
 Does phase_wait do the right thing when pinning an inoidx block for
  an inode?  FIXED

Pending:
 Need to understand and document the lifetime of a page with
 datablocks.  Who holds what refcount, and when can it be freed?
 Then fix up locking in lafs_refile, __putref.
 FIXME who keeps what refcount on orphan blocks/inodes??
 FIXME should dirty/pinned/etc hold a refcount?  They don't.

Later:
 FIXME make sure a failed (-EAGAIN) pinning triggers a checkpoint
  (eventually)
 FIXME make sure empty files have a depth of 1.
 FIXME Truncate proceeds lazily.  All data blocks need to be gone

26aug2008
 If I call lafs_erase_dblock while a write is underway, we have a
 problem.  We need to wait, potentially for a checkpoint to let go of
 the block and for a write to complete.  This should be done by
 waiting for PG_writeback on the page to disappear.  Check this out.

 When end_page_writeback is called, we must have dropped all
 references to the page.  When we commit to writing a block, we have
 to set PG_writeback on the page so that truncate et al can wait for
 it.  Before we have committed, truncate can just remove the page.
 Internally we differentiate by B_Alloc.  So before setting
 B_Allocated we need to test_set_page_writeback(page).  Be careful of
 races.

 I don't think we can ensure all references are dropped.  After all,
 that is the point of refcounts.  So the dblock array must exist
 without the page!
 But we need to ensure that we don't start a writeout after truncate
 has done wait_on_page_writeback.  This is done with the page locked,
 so when we want to write a page in a checkpoint, we need to lock the
 page first.  Once we have the lock, we check if the page is still
 dirty.  If it has been truncated it will be clean.
 But how do we safely reference the page if b->page can be cleared?
 How about: when we clear PagePrivate, we take a counted reference to
 the page for db->page.  This is dropped when the page is freed by
 lafs_refile.  But while it is held, it is still safe for db->page to
 be dereferenced.

 So before we commence writeout we have to lock the page and set
 PG_writeback.  After locking, we need to test if writeback is still
 appropriate.
 Maybe not.  I think we can submit blocks for writeout without setting
 the page to writeback.  If we do, then we need to be sure those
 writes finish before invalidatepage calls releasepage
 (block_invalidatepage calls discard_buffer, which calls lock_buffer,
 which waits).  In our case invalidatepage needs to make sure that no
 new write commences.  Maybe we should lafs_iolock_block before we
 allocate to a cluster, and check again if the block is dirty.

 So: lafs_cluster_allocate does:
       lafs_iolock_block
       check if still dirty.  If not, unlock and return.
       set the allocate flag
       allocate and write
       when the write completes, allocate is cleared.
       unlock the block
     invalidatepage does:
       lafs_iolock_block
       clear Valid, Dirty, Realloc
       lafs_iounlock_block
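And the same protocol sketched as code.  lafs_iolock_block(),
lafs_iounlock_block() and the B_* flags are names from these notes;
the struct layout and signatures are assumptions, and the functions
are renamed so as not to pretend to be the real lafs code:

    #include <linux/bitops.h>

    enum { B_Valid, B_Dirty, B_Realloc, B_Alloc };  /* assumed bit numbers */

    struct datablock {                              /* heavily abridged */
            unsigned long   flags;
            struct page     *page;
            /* ... */
    };

    void lafs_iolock_block(struct datablock *db);   /* from the notes above; */
    void lafs_iounlock_block(struct datablock *db); /* signatures assumed    */

    /* Called when a dirty block is chosen for a write cluster.  The io-lock
     * serialises against invalidatepage: if the block is still dirty once
     * we hold the lock, we may commit to writing it.
     */
    static int lafs_pick_for_cluster(struct datablock *db)
    {
            lafs_iolock_block(db);
            if (!test_bit(B_Dirty, &db->flags) &&
                !test_bit(B_Realloc, &db->flags)) {
                    /* truncate got here first - nothing to write */
                    lafs_iounlock_block(db);
                    return 0;
            }
            set_bit(B_Alloc, &db->flags);   /* committed to this write */
            /* ... allocate a physical address and submit the IO; completion
             * clears B_Alloc and unlocks the block ...
             */
            return 1;
    }

    /* Called from ->invalidatepage during truncate, under the page lock.
     * Taking the io-lock waits out any write that already committed.
     */
    static void lafs_invalidate_block(struct datablock *db)
    {
            lafs_iolock_block(db);
            clear_bit(B_Valid, &db->flags);
            clear_bit(B_Dirty, &db->flags);
            clear_bit(B_Realloc, &db->flags);
            lafs_iounlock_block(db);
    }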