2 So, let's try to write a kernel module that implements this filesystem.
3 It would be good to have a plan.
5 - Mount filesystem, providing empty root directory
6 o parse mount options - DONE
7 o find/load superblocks and stateblocks - DONE
8 o present empty directory - DONE
9 o Compile external module - DONE
12 - Mount filesystem read-only with no roll-forward
14 sync_page_io or bread? - not bread I think
15 o Index blocks management
16 o search cluster-header for root inode
18 o Directory lookup/read
21 - Support roll-forward for blocks, orphans, whatever
22 o manage segusage files
27 o cluster creation / block sorting
31 - Interface for snapshots and other admin
35 ------------------------
37 If a device is removed from the filesystem, we cannot reliably
38 tell from the other devices or state that this is so.
39 Maybe we need to update all devblocks with a new 'seq' number...
41 How do we specify mounting subordinate filesets?
42 What superblock do they have?
43 I suspect we do a -F lafs-sub mount from the original filesystem.
46 If mount fails, we seem to be leaving a super lying around,
47 and sync_supers dies on it. - DONE
50 Umount appears to work, but a sync_supers dies. - DONE
53 subordinate supers aren't being locked as much - is that a problem?
56 index pages never get put on an LRU - how is this supposed to work?
64 --------------------------
66 Inodes live in an address-space, much like a file. To load the
67 first inode, we need an address-space, so may as well have an
68 'struct inode' as we may want to expose it to user-space.
70 Loading an inode, need
71 fs (lafs filesystem structure)
72 which subfs (maybe a lafs inode)
73 which snapshot - this is implied by the subfs inode.
74 and fs can be obtained from inode, so just inode, inum
80 review block_leaf_find and make_iblock
81 need to do setparent and block_adopt next
84 need to resolve locking for ->siblings list
91 I can read a file.....!!!!!
92 Code review / tidy up.
93 resolve locking buffer vs page
95 Export on a web page somewhere??
98 (I spent a while getting large-directories to work again in prototype..
100 - Priority: clean mount and unmount
104 FIXME how do we record and handle write errors???
106 The iput in lafs_release - which is needed - is oopsing
110 Ok, I finally have a clean mount/unmount.
111 .. not quite. blocks being freed at unmount still have a refcnt, which is bad.
114 - make sure we can handle 'large' directories.
115 - make sure we can handle files with indexes
116 - handle filesystems that span devices.
119 Hurray - clean unmounts!!!
120 There is a nasty circular reference of the root inode which is stored in
121 a block that it manages. Maybe this should not happen, rather than having to be
122 explicitly broken - the root-block can live elsewhere, not in the inode.
124 Next multi-level index blocks.
126 But first, need to understand memory pressure and pageout.
127 How are dirty pages found to be cleaned?
128 How is pressure put on a filesystem to clean up?
129 How are clean pages reaped?
131 - call pagevec_lru_add{,_active}(pvec) to put the page on an LRU
132 lru_cache_add{,_active}(page) might be easier, but isn't exported.
133 - call mark_page_accessed(page) to keep the page 'active'.
136 - make sure indexes work...
140 eax,bx,cx,dx,s1 all zero
141 from block_leaf_find 203
143 ... OK, indexes seem to work.
144 But 'lafs' has problems creating some large files.
147 This is due to not handling errors properly.. fix it later FIXME
151 Must make sure the index address-space gets cleared up... I wonder
152 how we find all the pages to free. This might be one reason to keep them
153 in a radix tree. Though we should be able to walk our own data structures.
156 Then work on mounting a 2-device filesystem.
159 FIXME dir_next_ent always starts from the beginning rather than
160 remembering where it is up to... can this be fixed??
163 18mar2006 (Wedding anniversary, and Saturday ... during commonwealth games)
165 Mounting snapshot needs a way to identify that it is a snapshotmount
166 and which snapshot, and which filesystem.
167 We could use a different filesystem type, but that isn't really needed
169 mount -t lafs -o snapshot=name /original/mount/point /new
171 This grabs the named snapshot of /original/mount/point and places it at
173 The 'snapshot=' option is the trigger.
176 mount -t lafs -o control /original/mount/point /new
178 To grow a filesystem, we initialise a device (super/state blocks) and
179 mount -t lafs -o remount,new=/dev/name whatever /original/mount/point
181 as the dev_name isn't passed to remount
183 So, mount options are:
190 pairs matching what is exposed in the control filesystem
193 - factored out super-block finding preparatory to finding snapshots.
196 superblocks for snapshots and sub-ordinate filesystems do
197 not get stored in the 'state'. There is, however, a usage count so that
198 the prime filesystem cannot be unmounted until all snaps and subs are gone.
199 This should just refcount the prime_sb I suspect.
201 So: a snapshot sb points to the 'struct fs' but doesn't .... what???
204 - remove the super-block finding code by changing the layout to store
205 superblock locations explicitly :-)
207 - teach 'mount' to mount snapshots.
209 - need to audit for bad use of ss[0]
210 - need to find better way to map 'sb' to snapshot number.
211 - need to make unmount work.
213 01apr2006 (no, really!!)
214 - rewrite index to kmalloc index blocks and use a shrinker to free them.
215 This means that indexblock no longer has a 'page', which makes sense.
216 It also means they cannot live in highmem, which is sad, but could
219 Notes: superblocks and refcounts.
220 Each device holding the filesystem gets a superblock.
221 One of these (arbitrarily) is the 'prime' superblock and gets to
222 manage the whole filesystem.
223 Each snapshot also gets a superblock, as does each
224 subordinate filesystem. These are anon sbs - using anon dev.
225 Each anon sb takes a reference to the 'struct fs', and also to the
226 prime sb.... how about the reference relationship between fs and prime_sb???
230 - problem with getting parent superblock due to semaphores...
231 - when unmount, put_super isn't being called, so inode 0 isn't released!
234 (Took a week off to play with rt2500 wireless cards)
235 - Use different filesystem type for snapshots and subordinate filesystems.
236 This removes the semaphore problem
237 + OK, mount and unmount works for snapshots... what next?
238 - review index block - worry about highmem?
239 - review ss[0] usage - OK
240 - general code review
242 FIXME - what should leaf_lookup/index_lookup return on format error?
243 They currently return '0' which will quietly make an empty block.
244 Maybe '-1' would be better to make an error block.
245 FIXME check how other filesystems lock the setting of PagePrivate
246 Maybe just need to lock_page
247 FIXME combine find/load/wait into one operation
248 Review dir, super, roll, link
250 FIXME module refcount increases on failed mount!
253 I've been sick for too long, and not much has happened... However I think more than
254 the above comment says. I started looking at roll-forward and have the
255 basic block parsing in place so that it reports what it sees in the roll.
256 Also, the format has been changed a little: the address in the state block
257 is the CheckpointStart cluster, and we simply roll forward to the
258 CheckpointEnd, and then keep going beyond there - there is no longer any
259 walking back to find the start.
261 Next step is to start incorporating rolled elements into the filesystem
263 - data blocks: shouldn't be too hard. Don't need to update the
265 - inode updates: should be straightforward enough, but care is needed
266 as the data might be in multiple places
267 - directory updates: these are probably most interesting..
270 Question: how are symlinks created?
272 log the inode creation
274 log the directory update.
275 This allows the 'value' stored in the inode to appear after the directory
277 That might be OK for files (which are created empty and then extended)
278 but is bad for symlinks (which are created atomically).
280 - ensure inode is in a previous cluster to directory updates.
281 This slows things down too much I think
282 - log the content as well. This is awkward if it is big, certainly if more
283 than a block, which is possible.
284 - directory updates could be dependent on the inode being valid.
286 - log content if it is small, else write inode, flush, then create link.
288 So the fast option is:
289 log inode create, log content, log filename
290 and the slow/safe option is
291 log inode create, sync file, log filename
293 So on roll-forward if we see the inode we just save the data.
294 Saving the whole inode seems attractive, but we want minimal order
295 dependence: an inode update in the same cluster as the new inode should
296 still over-ride, even though it is earlier.
298 Ok, rollforward is proceeding slowly. I think I am now incorporating
299 new blocks into the tree properly, though the code probably won't compile.
300 It will be nice to test this and see the file have the right data.
302 Next step would be to include the index incorporation code.
310 - what exactly should happen when rollforward finds a file with a linkcount of 0?
311 Currently all updates get lost - I wonder if they are lost safely?
312 - rollforward is getting the size right, but not the content
313 - do I need to flag a block that ->phys is valid?
315 : Ok, roll-forward picks up new blocks in a file OK,
316 but umount has stopped working.
317 Presumably because there are pages attached to the inode which aren't
318 getting released. What do we want to do here?
319 Normally those pages, or their addresses need to be recorded before
320 they are lost. But on a read-only mount we don't care so much.
322 22jun2006 continuing above thought..
324 When we roll-forward and pick up the pieces of a file, we don't
325 want to allocate pages to hold those pieces (and definitely don't
326 want to read them all). We just want to attach the addresses
327 to the parent for incorporation. Similarly after writing
328 dirty blocks in a file we want to be able to release them
329 immediately rather than waiting for the addresses to be
330 incorporated (as incorporation can be more efficient when delayed).
332 We could just allow the page associated with a block to be released,
333 except that the page provides the indexing to find a block. We might
334 be able to live without the indexing, and hunt down the indexblock tree,
335 but living without the mutual-exclusion provided by block indexing would
337 And the 'struct datablock' still contains a lot more than is needed.
339 So maybe we should just have a completely separate structure attached to
340 the indexblock which lists fileaddr/physaddr. This could include
341 extent information. The trick would be guaranteeing allocation.
342 We could either allocate-late with a fallback of attaching the 'struct block'
343 or performing an immediate incorporation, or allocate-early and block
344 the dirtying of a page until there is space to record the new address.
345 This last is bound to be easiest.
347 So: what exactly do we use to store addresses?
348 Probably a linked list of tables.
349 Each table contains a link pointer and an array of
350 fileaddr/physaddr/extentlen
351 But we would need to allocate lots of these if there are hundreds of
352 dirty pages, but possibly only end up using a few if they made
353 extents very nicely. That might be wasteful.
355 Or we could allocate just one. When it is full we perform an
356 incorporation. But if that causes a page split we are in trouble.
357 We could have a spare page, split to it, write out one
358 and wait for the spare page to be written and free.
359 But we cannot just release the index page as it might still have
362 (I think I've been here before).
363 A worst-case scenario involves writing one block and that requires
364 splitting every index up the tree to the inode. This requires
365 arbitrarily many pages to be allocated. To accommodate this we either
366 pre-allocate a spare page at every level of the tree down to the data
367 block (a bit like storage space allocation) which seems very wasteful,
368 or we make sure we can release one of the split pages, which seems impossible.
370 I could decide not to worry about it. Have a pool of index pages and hope
371 it always works. After all, most pages are data pages, and they can be
372 freed successfully. We would only have a deadlock if all dirty memory were
373 index pages, and that seems unbelievably unlikely. If we trigger a
374 checkpoint when the count of locked-pages hits some limit we should be
377 So: Keep one table per index block. Use simple append and sequential search.
378 When table gets full, force an incorporation
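A minimal user-space sketch of the "one table, simple append, sequential search" idea. The names (uninc_table, uninc_add, uninc_lookup) and the slot count are hypothetical, not taken from the module, and treating phys 0 as "no address" is an assumption borrowed from the later clean-up list. The search runs newest-first so that a block written a second time overrides its older address.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define UNINC_SLOTS 8   /* hypothetical size; the notes only say "small" */

struct addr_ext {       /* one fileaddr/physaddr/extentlen entry */
    uint32_t fileaddr;
    uint64_t physaddr;
    uint32_t len;
};

struct uninc_table {
    int used;
    struct addr_ext ent[UNINC_SLOTS];
};

/* Append one block address.  The newest extent is extended when the block
 * continues it in both file and device order.  Returns 0 on success and
 * -1 when the table is full, i.e. the caller must force an incorporation.
 */
int uninc_add(struct uninc_table *t, uint32_t fileaddr, uint64_t physaddr)
{
    assert(t != NULL);
    if (t->used > 0) {
        struct addr_ext *last = &t->ent[t->used - 1];
        if (fileaddr == last->fileaddr + last->len &&
            physaddr == last->physaddr + last->len) {
            last->len++;
            return 0;
        }
    }
    if (t->used == UNINC_SLOTS)
        return -1;
    t->ent[t->used].fileaddr = fileaddr;
    t->ent[t->used].physaddr = physaddr;
    t->ent[t->used].len = 1;
    t->used++;
    return 0;
}

/* Sequential search, newest entry first.  Returns 0 when the block is
 * not in the table (assumption: phys 0 means "no address").
 */
uint64_t uninc_lookup(const struct uninc_table *t, uint32_t fileaddr)
{
    for (int i = t->used - 1; i >= 0; i--) {
        const struct addr_ext *e = &t->ent[i];
        if (fileaddr >= e->fileaddr && fileaddr - e->fileaddr < e->len)
            return e->physaddr + (fileaddr - e->fileaddr);
    }
    return 0;
}
```

With this shape a sequential write of any length collapses into a single entry, which is why a small table seems plausible.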
380 Do we allocate the table separately, or embed it in the indexblock??
382 Probably embed it. indexblocks that don't need it can be freed at any
383 time so that space waste hopefully isn't significant.
386 If the file is written sequentially, then everything should gather into
387 extents, and so it doesn't need to be enormous.
388 If the file is written randomly then the index block can be expected to
389 be 'indirect', so incorporation will be cheap.
390 So 'small' seems ok in both cases.
394 But wait a minute.....
395 On a checkpoint we can be getting phys updates for prev and next phases.
396 next-phase updates cannot be incorporated until the indexblock has passed
397 on to the next phase. So in that case, I think we still keep a linked
398 list of unincorporated blocks and live with the fact that we cannot
399 free them until the phase change passes. That shouldn't be a big problem
400 as it is a limited time frame - especially for data blocks..
402 But does this solve our initial problem??
403 During roll-forward we want to keep the addresses but not the blocks,
404 and we don't want to force incorporation. That means an arbitrary list
405 of addresses attached to an index block.
406 I guess we could possibly allow incorporation, but I would rather not
407 as I want the fs to be able to be read-only nicely.
408 So that means we need to have a list of address tables.
409 Maybe the normal approach is 'add a table if possible, else incorporate'?
411 OUCH... we may write a block a second time before incorporating the
412 new address, so when adding an address to the table we need to check
413 if it already exists. That could be expensive.
414 For index blocks might it even be a different address? I think
415 not but the vague possibility (in the future?) does complicate
416 things somewhat. Maybe we just keep things in chronological order and
417 don't worry about duplicates until incorporate time, when we have to
423 free_block must free tables DONE
426 Unmounting still doesn't work.
427 Problem is that an index block is holding a reference on parent,
428 and parent references aren't getting cleaned up.
429 On read-only unmount I guess we need to walk the list of leafs,
430 discard any address info, and unlock the blocks.
431 So that should be the first task for next time.
434 Leafs are locked blocks which have no locked children.
435 So any locked data block (non-inode) is a leaf
436 Any locked index block with lockcnt[phase] 0 is a leaf.
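The leaf rule above can be written as a small predicate. This is only a sketch: struct blk, enum blk_kind and is_leaf are hypothetical names, the two-entry lockcnt array assumes two phases, and the notes give no rule for inode data blocks, so the sketch simply reports them as "not a leaf".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum blk_kind { BLK_DATA, BLK_INODE_DATA, BLK_INDEX };   /* hypothetical */

struct blk {
    enum blk_kind kind;
    bool locked;
    int lockcnt[2];   /* locked children per phase; index blocks only */
};

/* A leaf is a locked block with no locked children in the given phase. */
bool is_leaf(const struct blk *b, int phase)
{
    assert(b != NULL && (phase == 0 || phase == 1));
    if (!b->locked)
        return false;
    switch (b->kind) {
    case BLK_DATA:
        return true;                    /* locked non-inode data block */
    case BLK_INDEX:
        return b->lockcnt[phase] == 0;  /* no locked children in this phase */
    default:
        return false;                   /* inode data block: rule not stated */
    }
}
```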
438 OK - fixed numerous bugs, but I can unmount now!!
439 I can even rmmod and insmod and all is cool.
443 - review refile and get all the code in there from prototype
445 - write a combined find/load/wait function and use it
447 - allocate inodes in single memcache and avoid generic_ip
448 HALF DONE. (still using kmalloc, not doing initonce well)
449 - review recording of new block addresses
450 + make sure we lookup there on index lookup - YES
451 + make sure ->uninc_next gets transferred to table at phase change.
452 + write incorporation code as it is tricky
453 - review how directory updates can be incorporated into a RO filesystem.
454 No, they cannot. We need to update the directory.
455 - write directory update code
456 - write cluster construction code
457 - make sure indexblocks with unincorporated addresses get on to inc_pending
458 ?? or is locking them enough?
461 INCORPORATION - ARgggghhhhh.
462 The current uninc_table doesn't really lend itself to building
463 index block... though maybe....
464 Question: what happens when an index block disappears? i.e. it has no
466 We clearly need to remove it from the parent. This should be trivial,
467 a direct operation on the parent index block, e.g. set some number to 0.
468 Then the next incorporation pass will simply lose that entry.
470 OK, that might be all well and good, but how do we sort unincorporated
471 addresses so we can merge them?
472 A linked-list merge sort is nice and open-ended, but does waste
473 quite a bit of space in pointers.
475 Or maybe I should just always do small-table incorporations.
476 Is there a way that a bad ordering of writes could force very bad
477 index layout in this case? i.e. cause a table split every time,
478 but new blocks go in the first (full) table.
479 OK Decision: always do small-table incorporation.
480 i.e. not a list of blocks: just a table of addresses.
482 FIXME check validity of index type when it is first read in,
483 and reject early if it cannot be recognised.
486 Took a break from incorporation.
487 Looking at directories.
488 Wrote dir.doc in module to sum lots of stuff up.
490 dir blocks have an info structure attached.
491 This includes a counted reference to the parent.
492 How long does this need to hang around for??
494 - when there is any orphan issue happening, it must stay, via
496 - when actually performing a dir op, we need to create and
499 When last ref of a dir block is dropped, should drop
500 the parent reference.
504 free list management mostly done.
506 create/delete prepare/commit/abort
508 dirty_block lock_block
511 FIXME should dir_new_block zero out the block?
512 How will commit_create know what to do with this block?
514 NOTE another type of directory orphan is a free leaf block which
515 is on the part-free list.
517 -------------------------------------------------------------
518 09sep2006 - on the plane to Frankfurt
519 Don't tell me I am rethinking preallocation again ???
522 dirty_inode needs to record the phase it is dirty in
523 inode_fillblock needs to check current phase and act accordingly.
525 Make sure the B_Orphan flag is set and used - or discard it.
527 How do we commit creating a symlink?
528 If it is a full block in size we cannot make an update record.
529 - maybe have two update records? We cannot guarantee they are in
531 ... but if we put the 'make dir entry' last it should work.
533 Change 'struct descriptor' definition
534 the 'block_type' aka 'length' 16-bit field becomes
535 0x0000 -> 0x8000 -> datablock, possibly a hole - up to 32K.
536 0x8001 -> 0xc000 -> miniblock up to 16K+
537 0xffff -> index block.
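A sketch of how the three ranges above might be told apart when parsing a descriptor. The enum and function names are hypothetical. Values 0xc001 to 0xfffe are not assigned in the notes, so they are reported as unassigned rather than guessed at.

```c
#include <assert.h>
#include <stdint.h>

enum desc_kind {
    DESC_DATABLOCK,    /* 0x0000..0x8000, possibly a hole */
    DESC_MINIBLOCK,    /* 0x8001..0xc000 */
    DESC_INDEX,        /* 0xffff */
    DESC_UNASSIGNED    /* 0xc001..0xfffe: not covered by the notes */
};

/* Classify the 16-bit 'block_type' aka 'length' field of a descriptor. */
enum desc_kind desc_classify(uint16_t field)
{
    if (field <= 0x8000)
        return DESC_DATABLOCK;
    if (field <= 0xc000)
        return DESC_MINIBLOCK;
    if (field == 0xffff)
        return DESC_INDEX;
    return DESC_UNASSIGNED;
}
```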
539 Need to write IO routines which decrease pending-block-count in
543 Thinks. a 1TB filesystem with 1K blocks and 4096 blocks/seg
544 gives 4Meg segments. That would be 256K segments which at 2 bytes per segment
545 - 512 segments per block - is 512 blocks in each seg usage file
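The arithmetic above, spelled out as a small helper so other geometries can be tried. The struct and function names are hypothetical; the 2 bytes per segment comes from the note itself.

```c
#include <assert.h>
#include <stdint.h>

struct seg_geometry {
    uint64_t seg_bytes;      /* bytes per segment */
    uint64_t seg_count;      /* whole segments on the device */
    uint64_t usage_blocks;   /* blocks needed by one seg usage file */
};

struct seg_geometry seg_geometry_calc(uint64_t dev_bytes, uint32_t block_bytes,
                                      uint32_t blocks_per_seg,
                                      uint32_t bytes_per_entry)
{
    struct seg_geometry g;

    assert(block_bytes > 0 && blocks_per_seg > 0);
    assert(bytes_per_entry > 0 && bytes_per_entry <= block_bytes);

    g.seg_bytes = (uint64_t)block_bytes * blocks_per_seg;
    g.seg_count = dev_bytes / g.seg_bytes;

    /* entries that fit in one block, then round the block count up */
    uint64_t per_block = block_bytes / bytes_per_entry;
    g.usage_blocks = (g.seg_count + per_block - 1) / per_block;
    return g;
}
```

For the 1TB case, seg_geometry_calc(1ULL << 40, 1024, 4096, 2) gives 4MiB segments, 262144 (256K) segments and 512 usage blocks, matching the figures above.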
549 - lafs_lock_{d,}block DONE
550 Make sure the block has parents and allocation and set the locked
554 Given a datablock, wait for it to be written out
555 This is needed before updating a block that is still locked in the
558 Used when creating a new object/inode
559 Given a datablock which is to hold the inode
560 and a type (Type*) and a mode,
561 Fill in the data block with appropriate data so that
562 when lafs_import_inode looks at it, the right stuff happens.
575 lafs_cluster_update_abort
576 lafs_cluster_update_commit_buf
577 lafs_cluster_update_commit
579 lafs_cluster_update_prepare
580 lafs_inode_phase_check
583 lafs_cluster_update_lock
584 lafs_checkpoint_unlock_wait
590 - I need to know if a block is undergoing write-io so that I can
591 avoid modifying it in certain circumstances. But I don't track
592 this information. Options:
593 1/ track the info. This means an extra field in the 'struct block'
594 because I still need to know which wc has had a write.
595 2/ For blocks that we care about copy the data on write...
596 But we care about all inodes and directory blocks. That is a waste.
597 I think we put extra info in the block.
598 We need to know which wc was used (0,1,2) and which pending cluster
599 in there (0-3) which comes to 4 bits.
600 But we only care about the block for wc=0. and we could include the
601 which-pending in the b_end_io, or maybe put it all in low bits
602 of the block pointer.... Need max 4 bits. Can only be sure of 2...
605 'which' goes in bottom two bits of bi_private
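A user-space sketch of carrying a 2-bit 'which' in the low bits of a pointer, as would be done with bi_private. The helper names are hypothetical, and the scheme assumes the pointer being stored is at least 4-byte aligned so its bottom two bits are always zero.

```c
#include <assert.h>
#include <stdint.h>

#define WHICH_MASK ((uintptr_t)3)

/* Combine an aligned pointer and a 2-bit 'which' (0-3) into one value. */
void *pack_private(void *ptr, unsigned which)
{
    uintptr_t p = (uintptr_t)ptr;

    assert((p & WHICH_MASK) == 0);   /* needs 4-byte alignment */
    assert(which <= 3);
    return (void *)(p | which);
}

/* Recover the original pointer by masking off the low two bits. */
void *private_ptr(void *priv)
{
    return (void *)((uintptr_t)priv & ~WHICH_MASK);
}

/* Recover 'which' from the low two bits. */
unsigned private_which(void *priv)
{
    return (unsigned)((uintptr_t)priv & WHICH_MASK);
}
```

The completion handler would call private_ptr() and private_which() on the stored value before touching the structure it points to.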
609 4apr2007 (What a long gap !!)
611 - lafs_cluster_update_*
612 How do we prepare for a cluster update? How do we lock it.
614 The important thing is that the update can be written. That
615 requires that there is space available. So we need to preallocate
616 space and then release it.
617 It is possible that each update might go in a different cluster, so maybe
618 we need to preallocate one block per update. That sounds a little expensive.
619 After all, we aren't preallocating a cluster block for every data block
621 So: prepare does nothing
622 lock preallocates the space - a full block.
628 - Can now create and delete lots of files. This is cool.
630 Orphan slots just grow and grow - never to be reclaimed - why?
631 After rm f*, 7 files remain. but rm f* again and they go.
632 FIXED - readdir wasn't returning them
633 Size of directory remains large.
634 And sometimes, files become ghosts... (try just removing one after first rm f*).
636 TODO - process those orphans to clean up the directory.
638 20June2007 (Happy Birthday Dad)
640 - Creating lots of files and then deleting them leaves 5 orphan slots
641 for the directory busy, and one for inode 0??
643 Directory handling uses the following orphans:
645 A new index block is created by splitting. This needs to be linked in.
647 The dirent block we are deleting from
648 If it becomes empty, it needs to go on free list
649 The index block we are deleting from
650 If it has lots of free space it might need to be rebalanced.
651 The inode that was deleted.
654 - When a file is fully deleted, we need to drop any orphan info... DONE
655 - Need to do orphan handling of free blocks in directory, and
656 unmerged parents - but there doesn't seem much point as I am going to
657 change the directory layout (again).
659 So: writing to a file.
660 We need prepare_write, commit_write, and writepage.
661 Prepare loads and links the page and checks there is space.
662 commit marks it as dirty so writeout is possible.
663 writepage chooses a page to write out
665 25June2007 - HACK week, thanks Novell!!
669 Need to revise the process whereby async completion
670 clears PageWriteback,
671 We need locking in there, and need to worry about
672 'which' wrapping too soon.
673 Need to not start IO before we set page writeback
675 Maybe, but syncing to disk needs more thought.
677 Partly done, need actual content.
679 Can make directory, but creating first entry fails. - FIXED
682 - new directory structure.
684 27Jun2007 - More HACK week :-)
686 - new directory layout done - much easier!!
687 - If I delete a file that was created, the blocks still have a ref-count
689 - mkdir doesn't increase link count on parent. - FIXED
693 Infrastructure to process orphans
694 Handle specific cases
695 flush orphans at key times.
696 load orphans at roll-forward
699 Write out a checkpoint (when?)
700 Make sure refcount goes back to zero on blocks I write.
702 Check on inode_phase_check and checkpoint_unlock and inode_dirty
703 in all directory operations.
705 FIX: Writing a small file leaves something non-dirty but
706 due to be written, and lafs_cluster_allocate complains.
709 FIX: dir_handle_orphan doesn't lock the orphan transaction required.
711 FIX: rm a file with (small) content hangs waiting in sync_page in truncate_inode_pages.
713 FIX: lafs_allocate hasn't been written!!!
715 FIX: before updating any block in a depth=0 file, we must first load
718 29Jun2007 - still HACK week.
719 Summary of how incorporation works.
721 Each index block has a small table for unincorporated changes. i.e.
722 block numbers and their addresses.
723 This supports efficient storage of extents, and is extensible by allocating
724 more tables. This last is done rarely.
726 When a block gets a new address, this is added to the table or, if
727 there is a phase mismatch, it is added to a list until a phase change
728 happens (so the whole block is pinned pending the phase change).
730 If the table is full then:
731 - if the filesystem is read-only (including during roll-forward),
732 a new table is allocated (else rollforward fails).
733 - otherwise we incorporate the table into the block, then add the new
734 address to the (now empty) table.
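The rules above for a new block address reduce to a three-way decision. A sketch, with hypothetical enum and function names (only uninc_next is a name from these notes):

```c
#include <assert.h>
#include <stdbool.h>

enum addr_action {
    ADD_TO_TABLE,           /* room in the table: just append */
    QUEUE_ON_UNINC_NEXT,    /* phase mismatch: hold until the phase change */
    ALLOC_EXTRA_TABLE,      /* table full, read-only or rolling forward */
    INCORPORATE_THEN_ADD    /* table full, writable: incorporate first */
};

/* Decide what to do when a block under this index block gets a new address. */
enum addr_action new_address_action(bool phase_matches, bool table_full,
                                    bool read_only)
{
    if (!phase_matches)
        return QUEUE_ON_UNINC_NEXT;
    if (!table_full)
        return ADD_TO_TABLE;
    return read_only ? ALLOC_EXTRA_TABLE : INCORPORATE_THEN_ADD;
}
```

If the extra table cannot be allocated in the read-only case, roll-forward fails, as noted above. The sketch leaves that error path out.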
736 If incorporation requires that we split the index block we allocate one
737 from a pool. If there are none in the pool, we wait.
739 As the table is much smaller than a block, the incorporation into
740 two blocks will always succeed.
741 The 'uninc_next' and 'children' lists will then need to be shared
742 between the two blocks before the new address is added to whichever
743 table is appropriate.
745 When looking for a block address, we must always check the table and
746 then children lists. We do not need to check uninc_next as they will always
749 How to ensure that the pool always has sufficient index blocks and we don't
751 We have two halves of the table, one for each phase. Before we allow
752 a block to be dirty in a phase, we ensure that the pool has adequate
753 index blocks for that phase. e.g. twice the depth of the block. If it
754 doesn't we block the dirtying until space becomes available.
755 For syscall writes, this is easy as we catch in prepare_write.
756 When we perform a phase change, we must be sure there are enough index
757 blocks for the deepest block that will stay dirty. If there aren't, we need
758 to flush all dirty blocks, and unmap all writable mappings before
759 starting the checkpoint.
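A sketch of the per-phase pool check described above. The names are hypothetical. "Twice the depth" is the example figure from the notes. Whether the blocks are actually set aside for each dirtied block, as done here, or the pool is merely checked, is an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct iblock_pool {
    int avail[2];   /* spare index blocks held for each phase */
};

/* Index blocks wanted before a block at this depth may be dirtied. */
int pool_need(int depth)
{
    assert(depth >= 0);
    return 2 * depth;
}

/* Returns true, and sets the blocks aside, if the pool can cover the
 * dirtying.  Returns false if the caller has to wait for space.
 */
bool pool_reserve(struct iblock_pool *p, int phase, int depth)
{
    assert(p != NULL && (phase == 0 || phase == 1));
    int need = pool_need(depth);

    if (p->avail[phase] < need)
        return false;
    p->avail[phase] -= need;
    return true;
}
```

In prepare_write a false return would mean sleeping until index blocks are returned to the pool. At a phase change it would mean flushing dirty blocks before starting the checkpoint.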
762 FIX: need to work out life time rules so that inodes hang around while they have blocks.
763 currently have an igrab that is never put.
765 FIX: Dirty isn't cleared until 'flush', but do_checkpoints requires 'alloc' to clear it.
768 Checkpoint flushing is getting close.
770 InoIdx blocks are not changing phase.
771 Phase change should happen when all children have been incorporated, and
772 then the write has been triggered marking us clean.
773 For InoIdx blocks, we need to be marked clean when the data block
776 5jul2007 - a week off
777 Checkpoint flushing seems to work !!!!
778 FIX: what should filesize of symlink be?
779 other filesystems use len, but still zero-terminate for vfs.
781 Problem. A chmod is followed immediately by an unlink then a checkpoint.
782 The chmod update gets into the checkpoint cluster, but the unlink completes
783 before the checkpoint is finished so the new superblock sees the file
784 as gone. Roll-forward finds the update and wants to update a missing file.
786 This isn't a big problem, but with slightly different details, it could be.
788 One option is to ignore updates that precede the updated block. That might
789 be awkward with e.g. directory updates and checkpoints that cross multiple
792 Another option might be to prohibit updates once a checkpoint has started
793 unless they are known to be after the phase change.
795 FIX: unlink isn't punching a hole in the inode file.
796 Inode usage map isn't being updated. - FIXED (For create, not unlink).
798 FIX: roll forward does not pick up inodes, only data blocks.
799 But tiny files are synced to inode, so they might not be picked up.
800 So we must process a level=0 inode like a data block.
803 Time for lots of clean up.
805 DONE 1/ Index blocks to fill with 0 - use phys=0 to imply invalid.
806 DONE 2/ rename 'lock' -> 'pin'
807 3/ Review and fix up all locking/refcounts. See locking.doc
808 DONE 3a/ Make sure cluster_allocate can be called concurrently. e.g. check
809 B_Alloc inside the semaphore
810 Also lock inode when copying in block 0 and probably
811 when calling lafs_inode_fillblock (??)
812 DONE 3b/ lafs_incorporate must take a copy of the table under a lock so
813 more allocations can come in at any time.
814 NotYet 3c/ cluster_flush should start all writes before calling _allocate
815 as _allocate might block on incorporation/splitting.
816 No. We really want _allocate to not block, but to queue...
817 I think this is too hard to get perfect just now, so I will leave it.
818 DONE 3d/ introduce PinPending for data blocks. remove fs->phase_depth.
819 LATER 3e/ Index needs a clean-lru on each filesystem, and a list of filesystems
820 so that locking of lru doesn't have to be too global
821 DONE 3f/ change wc[]->hlhead to be a regular listhead as it is part of the
823 DONE 3g/ revise refile lru handling based on new understanding
824 3h/ Utilise WritePhase bit, to be cleared when write completes.
825 In particular, find when to wait for Alloc to be cleared if
826 WritePhase doesn't match Phase.
827 - when about to perform an incorporation.
828 3i/ make sure we don't re-cluster_allocate until old-phase address has
829 been recorded for incorporation.
830 3j/ Check that index blocks cannot race when getting locked....
831 k/ Check what locking is needed to set PagePrivate exclusively.
832 DONE l/ cluster_done needs to call refile, but is called in interrupt context.
833 We need to get it done in process context I think and lock
834 ->waiting access with fs->lock after changing it to ->lru
835 DONE m/ Need to know which blocks in a page are in writeback so we can clear writeback
836 only when *all* have finished.
837 n/ on phase change, uninc_next blocks need to be shared out.
838 NO 3o/ Make sure lafs_refile can be called from irq context.
839 3p/ lock all lru accesses.
840 3q/ Lock those index blocks!!!
841 3r/ Can inode data block be on leafs while index isn't, what happens if we
842 try to write it out...
843 FIXED Why are extent entries only grouped in 4s?
844 If InoIdx doesn't exist, then write_inode must write the data block.
845 4/ resolve length of symlink
846 FIXED - long symlink followed by 'sync' crashes.
847 FIXED - rollforward isn't calling 'allocated' on blocks, or something
848 FIXED - I cannot find 'bfile'. (inode isn't written)
849 SEEMS OK...- Must flush final segment of a cluster properly...
850 5/ Review what does, and does not need to be initialised in a new datablock
851 6/ document and review all guards against dirtying a block from a previous phase
852 that is not yet safe on storage.
853 See lafs_dirty_dblock.
854 7/ check for proper handling of error conditions
855 a/ checkpoint_start might fail to start a thread!
856 b/ if lafs_seg_dup fails in do_checkpoint, we need to abort the snapshot.
857 8/ review checkpoint loop.
858 Should anything be explicit, or will refile do whatever is needed?
860 What should checkpoint_unlock_wait wait for?
861 When do we need to wait for blocks to change state. And how?
862 DONE 10/ rebase on 2.6.current
863 DONE - use s_blocksize / s_blocksize_bits rather than fs->
865 11/ load/dirty block0 before dirtying any other block in depth=0 file
866 12/ Add writecluster flag for old-phase updates.
867 Why is this needed? updates should always go in the new phase???
868 13/ use kmem_cache for 'struct datablock'
869 14/ indexblock allocation.
871 allocate the 'data' buffer late for InoIdx block.
872 trigger flushing when space is tight
873 Understand exactly when make_iblock should be called, and make it so.
874 15/ use a mempool for skippoints in cluster.c
875 16/ Review seg addressing code in cluster.c and make sure comments are good.
876 DONE 17/ Make sure create inherits uid etc from process.
877 18/ consider ranges of holes in pending_addr.
879 20/ Implement rest of "incorporate"
880 21/ Implement staged truncate
881 use for setattr and delete_inode
882 DONE 22/ block usage counts.
883 23/ review segment usage /youth handling and make a todo list.
884 a/ Understand ref counting on segments and get it right.
885 24/ Choose when to use VerifyNull and when to use VerifyNext2.
886 25/ Store accesstime in separate (non-logged) file.
888 make sure files are released on unmount.
891 Support 'peer' lists and peer_find. etc
892 31/ subordinate filesystems:
893 a/ ss[]->rootdir needs to be an array or list.
894 b/ lafs_iget_fs need to understand these.
895 32/ review snapshots.
897 how they can fail / how to abort
900 - need to clean up checkpoint thread cleanly - be sure it has fully exited.
901 34/ review roll-forward
902 - make sure files with nlink=0 are handled well.
903 - sanity check various values before trusting clusters.
905 34/ Configure index block hash_table at run time based on memory size??
907 Review everything that needs to handle laying out at cluster
908 aligned for striping.
910 36/ consider how to handle IO errors in detail, and implement it.
911 37/ consider how to handle data corruption in indexing and directories and
912 other metadata and guard against problems (lots of -EIO I suspect).
914 - check all uninc_table accesses are locked if needed.
917 1/ fs->pending_orphans and inode->orphans are largely unused!
918 2/ If a datablock is memory mapped writeable, then when we write it out,
919 we need to either fill up its credits again, or unmap it.
920 3/ Need to handle orphans asynchronously.
924 Free index blocks are on two lists, both protected by the global
926 1/ The per-inode free_index, so they can be destroyed with the inode
927 2/ The global freelist so they can be freed by memory pressure.
929 11feb2008. Where was I up to again?
930 reviewing phase_flip and lafs_refile.
933 Reading through modify.c, at 'add_indirect'. Plan to fix all this code.
934 Need to think about how index blocks really change. How old blocks get
935 dis-counted from segment usage, and what optimisations are really good
936 for re-incorporating index blocks.
937 Operations to consider are:
938 i)Append new block, ii)truncate, iii)over-write, iv)fill-hole.
939 i/ leaf block splits, index block gets new entry at end, and replacement
940 for other entry. Easy to handle
941 ii/ trailing entries are zeroed. Should be easy, but isn't yet.
942 iii/ probably caught in leafs. May cause internal split so we add new
943 index address, which is easily handled if there is space.
944 iv/ same as iii, though split more likely.
946 What about merging index blocks. That just makes addresses disappear, which
947 we handle the slow way.
948 Do we ever re-target index blocks? Would need to be careful about that.
949 Make it look like a split where one block ends up empty as a hole.
951 grow_index_tree (DONE - untested)
952 ib is a leaf inode that is getting full. Copy addresses
953 into 'new', and make 'ib' an index block pointing at new.
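A rough userspace sketch of the grow_index_tree step just described, only to pin down the shape of the operation. The struct, its fields, NADDR and the name grow_tree are all made up for illustration; the real index block layout is different, and 'nleaf' here plays the role of 'new'.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define NADDR 8			/* illustrative capacity only */

/* Hypothetical in-memory index block, not the real layout. */
struct iblk {
	int depth;		/* 1 = leaf holding block addresses */
	int used;		/* entries in use */
	uint64_t addr[NADDR];	/* leaf: physical addresses */
	struct iblk *child;	/* index: the single child in this sketch */
};

/*
 * 'ib' is a full leaf. Copy its addresses into a fresh block and
 * turn 'ib' into an index block that points at the copy.
 * Returns the new leaf, or NULL if allocation fails.
 */
static struct iblk *grow_tree(struct iblk *ib)
{
	struct iblk *nleaf = malloc(sizeof(*nleaf));

	if (!nleaf)
		return NULL;
	*nleaf = *ib;			/* addresses move down a level */
	nleaf->child = NULL;
	memset(ib->addr, 0, sizeof(ib->addr));
	ib->used = 1;			/* one entry: the new child */
	ib->depth = nleaf->depth + 1;
	ib->child = nleaf;
	return nleaf;
}
```

The point of doing it this way round is that 'ib' keeps its identity (and whatever refers to it), and only the contents move.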
955 add_index/walk index (DONE - untested)
957 end of do_incorporate (DONE - untested)
958 new contains the early addresses. Some remain in ib
960 the buffers must be swapped, so ib has the early addresses.
961 ui needs to be attached to new
962 return 2; - then new uninc needs to be split
965 case 2 - horizontal split
966 case 3 - vertical split
968 Bother - uninc_table is a problem (again).
969 We can currently add at any time with just a spinlock.
970 So when we split a block horizontally,
974 share out children and uninc_table in do_incorporate
975 share out credits in do_incorporate
978 Still need to do incorporate as above but took a break to...
980 Counting allocated blocks now works - stat shows the right info, hopefully
981 storage is correct too. - DONE
983 next: truncate? orphan thread?
984 Then segment usage and the cleaner.
988 truncate - removing blocks doesn't need to erase them...
989 - nothing forces a cluster_flush promptly!!! We need a timeout
990 or at least we need a flush before truncate_inode_pages...
992 - in lafs_truncate we need to make the block an orphan and pin in
995 21Feb2008 (Research morning)
996 Discard checkpoint thread created on demand in favour of a cleaner
997 thread that runs all the time. It cleans and checkpoints and
1001 do segment scan and get a real list of free segments and
1005 - segment usage scanning to count free blocks
1006 - fix up re-reading of erased blocks
1007 - FIX truncate can still block waiting for writeback to complete.
1008 - FIX allocations aren't failing when we run out of free space
1009 - FIX df doesn't agree with du.
1012 Truncate when an index block has addresses in uninc_table.
1013 The summary for the new address has already been performed.
1014 We need to deallocate the new without disturbing the old.
1015 However a simple allocation may not be possible.
1016 I guess we can prune them all to zero, then incorporation
1019 TOFIX: when truncating a recently created file, it is still depth=0 so
1021 We really need to increase the depth to 1 as soon as we dirty
1022 any block, then reset back to 0 if it fits.
1025 We have a file that we have written to, and the data blocks have been
1026 written out and the addresses stuck in uninc_table.
1027 We then truncate the file. Who releases the usage of those blocks?
1028 And who removes them from uninc_table?
1030 OK, 'rm' returns all the blocks back now so 'df' is almost the same as 'du'.
1031 I really should make sure that inodes are getting freed properly and the
1032 inode map is clean and everything.
1035 Do we reserve segment-usage blocks.
1036 We cannot do it naively as we get infinite recursion.
1037 But we need it to be allowed to dirty the segment block.
1038 But we cannot pin them to this phase as we want to write them out
1040 This still needs more thought. I avoided the recursion by setting SegRef
1041 before getting the ref. But that isn't safe.
1044 The table of cleanable segments is not working out. Each segment appears multiple
1045 times which wastes space and adds confusion.
1046 We really want to be able to lookup by dev/seg and also find the least.
1047 'Find least' sounds like we want a heap but then we cannot discard the bottom half.
1049 We could have a skiplist for dev/segment lookup and do a merge-sort on
1050 a different link when we want to find the best segment.
1051 We then remember the best number found since a sort, and re-sort if the top
1052 is worse than the best.
1054 We keep all this in a fixed size table. Each entry has
1055 seg,dev,usage,weight,weight-sort-link,addr-sort-link and possibly some
1056 addr-sort-skip links.
1057 This is 32+32+16+16+16+16 bits, or 16 bytes or bigger.
1058 Say 16bytes, 24bytes, or 32 bytes. (depth 8, which is plenty).
1059 One page of 16byte entries (256 of them)
1060 2/3 page of 24byte entries, 1/3 of 32byte entries.
1061 Total 2 pages, and 256+113+43 = 412 entries.
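Quick check of the sums above, as a sketch. It assumes 4096-byte pages (the page size isn't stated above) and the function names are invented for the example.

```c
#include <stddef.h>

/* Assumed page size; not stated in the notes. */
#define PAGE_BYTES 4096u

/* How many entries of 'entry_size' bytes fit in 'bytes' bytes. */
static size_t entries_in(size_t bytes, size_t entry_size)
{
	return bytes / entry_size;
}

/*
 * Two-page table: one full page of 16-byte entries, then a second
 * page holding 'n24' 24-byte entries with whatever is left over
 * used for 32-byte entries. Returns 0 if 'n24' does not fit.
 */
static size_t table_capacity(size_t n24)
{
	size_t left;

	if (n24 * 24 > PAGE_BYTES)
		return 0;
	left = PAGE_BYTES - n24 * 24;
	return entries_in(PAGE_BYTES, 16) + n24 + entries_in(left, 32);
}
```

With 113 entries of 24 bytes, 1384 bytes remain in the second page, which is room for 43 entries of 32 bytes, so the 412 total holds under that page-size assumption.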
1063 But deleting random elements is awkward... but not too awkward. We can delete
1064 lots of entries by marking them as old, then performing a single pass of the skip
1067 We should keep free segments here too, on a separate list.
1070 2 pages of 16byte entries
1074 free list randomly threads through all.
1076 When using from 24 or 32, randomly choose height of 2-5 or 2-9
1077 Two lists run through the skiplist entries. One for cleanable, one for free.
1078 Remember the nth element for some small n (10, but it decreases as we pull
1079 things off the front) and if we add something less than that, we trigger a
1080 mergesort on the next time we want to clean.... maybe.
1082 Remember end of free list and add to there. Maybe merge-sort the free list
1083 by addr occasionally.
1086 When can we clean, when can we free wrt checkpoints?
1087 - we can clean a segment as soon as we have a checkpoint after it.
1088 So we record the youth of the segment holding the (start of the)
1089 checkpoint, and can clean any segment with a lower youth.
1090 - we can free a segment after the checkpoint after its usage has reached
1091 zero. So if usage is zero and youth....
1092 We could offset the usage by one (say - for the first cluster header..)
1093 then when we find a segment with usage of '1', we schedule an update to
1094 0 in the next checkpoint...
1095 How about segments with different sizes - they get different weights.
1096 Need to divide by segment size: usage * youth / size.
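A small sketch of the two rules above: the size-normalised weight, and the "can clean once a checkpoint has started in a younger segment" test. The struct and function names are hypothetical, and it assumes a larger youth means more recently written.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-segment summary; field names are illustrative. */
struct seg_info {
	uint32_t usage;	/* blocks still in use in the segment */
	uint32_t youth;	/* assumed: larger = written more recently */
	uint32_t size;	/* segment size in blocks, must be non-zero */
};

/* usage * youth / size, with a 64-bit intermediate to avoid overflow. */
static uint64_t seg_weight(const struct seg_info *s)
{
	return (uint64_t)s->usage * s->youth / s->size;
}

/*
 * A segment can be cleaned once the segment holding the start of
 * the latest checkpoint is younger than it.
 */
static bool seg_can_clean(const struct seg_info *s, uint32_t checkpoint_youth)
{
	return s->youth < checkpoint_youth;
}
```

Dividing by size means a half-full small segment and a half-full large segment of the same youth get the same weight, which is what we want when devices have different segment sizes.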
1099 - It seems I sometimes fall off the end of the last segment !!! - FIXED (locking)
1100 - We seem to switch to a new segment when still 83 blocks remaining? - FIXED (delete did flush)
1102 - Lots of 'creates' makes lots of little clusters - need to optimise!
1103 Or it could be deletes as we currently cluster_flush for each
1105 - I think this is fixed
1108 Started looking at the cleaner.
1109 Need to understand how much to clean each checkpoint
1110 Need to track free-space-in-active-sectors while scanning.
1114 - the cluster head is currently limited to one page. This is not good.
1116 - Should the cleaner start before the scan is complete after a checkpoint?
1117 Probably it can, but while the scan is still happening it might be best
1121 try_clean is taking shape and has a few FIXMEs.
1122 need to write async find_block code and get it to watch for
1123 block in a cleaning segment.
1126 - where can padding appear in a cluster? between miniblocks? at
1127 end of device blocks?
1128 - need to track phys block while parsing headers for cleaning.. why?
1129 - determine rules for avoiding block lookup during cleaning
1130 based on youth/snapshot age, and truncate generation.
1131 We need to load the inode from each snapshot
1132 Can we optimise based on snapshot age?
1133 only if we know the block is newer than the snapshot.
1134 So when we relocate blocks (cleaning) they must go in a segment
1135 that is marked as being old. we cannot really guarantee that.
1136 I guess blocks that are marked as 'new' can safely be skipped if
1137 segment is newer than snapshot. This 'age' is not the youth, but
1138 is the cluster_head->seq which is stored in creation_age.
1140 - Store the rootdir for a filesystem in the metadata for the root inode.
1141 Then 'struct snapshot' doesn't need rootdir. It can have a root
1144 Looking at lafs_find_block_async.
1145 Needs async flag to make_iblock.
1146 Check that. Can we block_adopt if there was an error?
1148 setparent has async flag.
1149 lafs_leaf_find has async flag
1150 lafs_wait_block_async
1152 FIXME I wakeup the cleaner every time an IO completes.
1153 Do I really want that? Maybe only when number of async IOs hits
1154 half the recent maximum??
1156 FIXME need to ensure that lafs_pin_dblock flushed committed
1159 FIXME when we incorporate a dirty (non-realloc) address to an index block,
1160 we need to clear B_Realloc on the indexblock.
1162 FIXME in lafs_incorporate we lafs_dirty_iblock 'new' without
1163 giving it any credits. Where should they come from?
1165 We don't seem to scan for free/cleanable segments often enough.
1167 FIXME we shouldn't start a checkpoint while cleaning is happening.
1169 FIXME need to be careful when cleaning about finding inodes that
1170 don't exist any more.
1172 FIXME give credits to realloc blocks.
1174 FIXME think about/document transitions between realloc and dirty,
1175 and what locking is needed.
1178 Allowing for the FIXMEs above, the cleaner is now identifying
1179 blocks that need to be cleaned and marking them B_Realloc (I think).
1180 We now need to gather these into a write cluster and write them.
1181 They will all be on the clean_leafs list, so we can iterate that
1182 allocating or incorporating as needed. This will be similar to
1184 Important question is: when?
1185 Ideally we would have some auto-flush mechanism. The cleaner just
1186 keeps finding blocks to clean and when we start running out of
1187 resources we flush the cleaning queue.
1188 However we will still want to flush the cleaner always before a
1189 checkpoint, so for now we can implement that bit and wait for a
1190 need for the other to arise.
1193 FIXME: cleaner lookup of 0/0/0 has interesting consequences as we
1194 don't record that location the same way.. how to handle?
1195 Should check that 'adopt' doesn't do the wrong thing with this block.
1198 Realloc blocks need to be pinned. That makes sense. Only that way
1199 will they get onto the clean_leafs list.
1200 When checkpointing we should probably examine clean_leafs to be
1206 Both of these hold a Credit.
1207 Both can be set at the same time.
1208 Cleaner ignores Dirty and sets Realloc anytime the block is in
1209 the wrong segment. It also Pins the block.
1210 When the cleaner is flushing to the cleaning segment, it
1211 ignores Dirty blocks. They get their Realloc cleared, but
1212 they remain pinned. So they will get moved at the next checkpoint.
1213 How do we know whether an indexblock should be Dirty or Realloc?
1214 The Dirty/Realloc bit is cleared before we get to incorporation.
1215 Maybe we lafs_dirty_iblock the parent of any block we write
1216 out. Then after incorporation, we set Realloc if it is not
1220 I think I'm pinning cleaner blocks now.
1221 Need to make sure the dirty ones are dropped. DONE
1222 Need to make sure the usage is transferred
1223 Need to get free segments back into use
1224 Need some more 'dump' options. Maybe youth/usage files.
1226 Need to make sure scan etc are triggered often enough.
1228 FIXME lafs_prealloc walks up ->parent without locking
1229 I think we want i_mapping->private_lock like lafs_pin_iblock.
1232 1/ a 'dump' option that triggers a scan and prints everything out.
1233 2/ scan must mark freeable as such, then subsequently free them.
1234 3/ Look at code that decreases usage of old segments.
1235 4/ Review lafs_cluster_wait_all and decide exactly how long we need
1237 5/ Review 'FIXME that is gross' HZ/10 thing.
1238 6/ Review 'wait for checkpoint to flush' msleep(500);
1239 Maybe remove that altogether.
1241 FIXME BUG_ON in grow_index_tree fires. sync - writepages - flush
1242 FIXME BUG in lafs_allocated_block fired.
1243 from lafs_erase_dblock from invalidate_page from .. vmtruncate
1247 An inode data block is dirty and pinned, but the inoidx is no longer
1248 pinned. Presumably it isn't dirty.
1249 Recheck what 'dirty' means on the two blocks and see how this can happen.
1252 Tree gets very big! Lots of 'Realloc' blocks that should
1255 We are spinning in cleaner again, and not in try_clean.
1257 Is it a problem that 'Pinned' is used for Realloc and dirty blocks?
1258 In general it shouldn't be. The flush_cleaner process will remove
1259 the Realloc bits so the blocks fall off clean_leafs. They then either
1260 go onto phase_leafs or get unpinned.
1261 But I currently have a problem with InoIdx/data.
1262 The Pin is transferred to the Data block, but it doesn't go from the
1263 InoIdx block because it has a pincnt. Now that is probably a bug, but
1264 what if it weren't? What if, while we were cleaning, a block got dirtied.
1265 That would pin the whole tree.
1266 I guess the rule about not allocating an inodedata block while the
1267 InoIdx is pinned needs to be revised. If the inodedata block is
1268 Realloc (and not Dirty) while the InoIdx is not Realloc, we
1269 can go ahead (in a cleaning segment).
1272 adir/big1 is garbage.... big1 was removed, so why is it even there?
1274 echo tre > dump # still too much stuff.
1278 Put cond_resched in checkpoint loops!
1281 Thoughts about cleaning and pinning.
1283 When cleaning we need to know how many dependent blocks are being cleaned
1284 so that we know when *this* block can be written - i.e. when the count hits 0.
1285 We cannot use the pincnt for this phase because there may be dependent blocks
1286 which are dirty. They, and therefore this, may get flushed at next checkpoint,
1287 but they may not. If we could be certain they would, we could just write
1288 to the clean-segment blocks which can become unpinned. However if there
1289 is an index block being cleaned, and no dependant is being cleaned, but some
1290 are dirty but not pinned, then the checkpoint can go past without the block
1291 being moved.... but maybe we can detect that.
1294 We set B_Realloc precisely on blocks found in segments being cleaned.
1295 We pin these blocks and leafs which are Realloc go in clean_leafs.
1296 If a block is both Realloc and Dirty we clear Realloc but leave pinned.
1297 That way it gets written at end of checkpoint, but to main cluster.
1298 When we incorporate Realloc blocks into an index block, it gets marked
1299 Realloc. When we incorp dirty blocks, mark dirty. Then see above.
1300 On a checkpoint, we process both phase_leafs and clean_leafs
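The Realloc/Dirty rules above, written out as a sketch so the transitions can be checked. The flag names and helpers are stand-ins invented here (the real flags are the B_* bits), and putting a pinned non-Realloc leaf on phase_leafs is my reading of the earlier notes rather than something settled.

```c
/* Illustrative stand-ins for the real B_* block flags. */
enum {
	F_PINNED  = 1u << 0,
	F_DIRTY   = 1u << 1,
	F_REALLOC = 1u << 2,
};

enum leaf_list { LIST_NONE, LIST_PHASE_LEAFS, LIST_CLEAN_LEAFS };

/* Cleaner found the block in a segment being cleaned: mark and pin. */
static unsigned mark_for_cleaning(unsigned flags)
{
	flags |= F_REALLOC | F_PINNED;
	/* Both Realloc and Dirty: clear Realloc but leave it pinned. */
	if (flags & F_DIRTY)
		flags &= ~(unsigned)F_REALLOC;
	return flags;
}

/* Which leaf list a leaf belongs on under these rules. */
static enum leaf_list leaf_list_for(unsigned flags)
{
	if (!(flags & F_PINNED))
		return LIST_NONE;
	if (flags & F_REALLOC)
		return LIST_CLEAN_LEAFS;
	return LIST_PHASE_LEAFS;
}

/* New flags for an index block after incorporating one child. */
static unsigned incorporate_flags(unsigned index_flags, unsigned child)
{
	if (child & F_DIRTY)
		/* Dirty wins: mark dirty and drop Realloc. */
		index_flags = (index_flags | F_DIRTY) & ~(unsigned)F_REALLOC;
	else if ((child & F_REALLOC) && !(index_flags & F_DIRTY))
		index_flags |= F_REALLOC;
	return index_flags;
}
```

So a block that is both ends up written by the checkpoint to the main cluster, and an index block only stays Realloc while nothing dirty has been incorporated into it.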
1303 FIXME do inode reads async better when cleaning...
1305 FIXME if a realloc inode has been allocated to a cluster when we try
1306 to dirty it, confusion can ensue as the writeout won't mark it
1307 clean, but will use up the credits.
1308 Maybe we need something similar to phasewait to not set PinPending...
1309 But normal dirtying doesn't phasewait. I think we just need to
1310 detect this case and wait for the clean-cluster to flush.
1313 FIXME make sure incorporate is doing the right thing with credits.
1315 FIXME lafs_write_inode. We need to be careful about clearing Dirty
1316 when making an update. Need some sort of locking.
1317 Need to review all inode dirty stuff and make sure we do
1318 the right thing no matter when it is called.
1320 FIXME when blocks are attached to uninc_next, they don't have 'dirty'
1321 anymore so we don't know how to flag the index block.
1324 UPTO: unlink etc don't prealloc the inode that will be modified.
1325 And a warnon inode.c:579 is very noisy.
1328 FIXME: lafs_reserve_block uses CleanSpace if Realloc is set,
1329 but it doesn't get set until AFTER lafs_reserve_block is called.
1332 Cleaning cleans an InoIdx block which schedules the data block.
1333 Subsequently the InoIdx block gets pinned again.
1334 Now when we go to write the data block, we cannot because InoIdx is pinned
1336 Maybe given that data block is pinned, we write it anyway...
1338 FIXME: when we realloc a block embedded in the inode, don't pluck it out
1339 and put it back in again. Just realloc the inode.
1341 FIXME: when cleaning a directory that has shrunk, we think we have
1342 blocks that don't exist any more. FIXED - we thought '0' was in
1346 FIXME: lafs_dirty_iblock called from lafs_allocated_block in cluster
1347 flush finds no credit. for InoIdx block of 8501
1349 FIXME: do we do SEGREF on all the index blocks? do we need to?
1353 FIXME: seg usage for segment 0/5 isn't dropping to zero.
1354 Part of a file got moved off, but count is still there.
1355 FIXED - seg_move wasn't being called.
1356 FIXME: segusage file has inconsistent extents:
1361 FIXED several bugs in walk_extent
1363 FIXME qphase: any locking between that changing and lafs_seg_move??
1364 I don't think so. Just that seg_apply_all must be called after qphase is set.
1366 FIXME make sure we don't try to clean the current segment!!
1368 FIXME 'Available' goes negative!
1369 Creating large file doesn't instantly reduce 'Used'.
1370 Deleting files plus sync doesn't increase Avail?
1372 FIXME a segment is in the table but doesn't print out!
1374 FIXME we don't cope with running out of free segments (not that we ever should).
1376 FIXME check all Credit usage and make sure credits are returned when
1377 ->parent is dropped.
1378 provide visibility into credit counts.
1379 Make sure we are keeping enough space for cleaning. We should always
1380 have a few segments unallocatable.
1383 FIXME cannot do io completion in cleaner thread as it can block on
1384 a i_mutex which might be waiting for completion. FIXED (keventd).
1386 FIXME as ->iblock isn't refcounted we need to be careful accessing it.
1387 If we 'know' we have a reference, e.g. a child with a ->parent
1388 link, we can access it without locking.
1390 lafs_make_iblock should return a counted reference.
1392 If we own an (indirect?) reference to iblock, we can access
1393 both iblock and dblock for free... but iblock can change???
1394 If not, we need to get a reference to one or other under a lock.
1396 FIXME block->inode should be a counted reference?
1400 lafs_inode_handle_orphan OK
1401 inode_handle_orphan_loop FIXED
1405 lafs_find_next FIXED
1409 lafs_inode_handle_orphan
1413 FIXME root->iblock should always be refcounted. Is it?
1414 FIXME walking siblings - what lock?
1417 FIXME several times we clear PinPending without refiling, in dir.c in particular.
1418 that looks wrong. FIXED
1420 Maybe lafs_new_inode should return a reference to the dblock
1421 Or pin it. or something. FIXED And pinned (when needed).
1423 FIXME lafs_inode_dblock might return a block without valid data...
1424 Need to get valid data, then load block 0 in find_block rather than
1427 FIXME we really should own a reference to ->dblock before calling
1428 lafs_pin_inode. We don't want IO during a pin request.
1431 FIXME review use of PhysValid FIXED
1433 lafs_orphan_abort - what if lafs_orphan_pin not called?
1434 or if 'b' is NULL. FIXED
1436 Do I need to clear PinPending when retrying??
1437 Well, we need to be phase-locked when we set PinPending, so
1438 it must be Pinned to the current phase.
1439 So when we unpin a datablock, we must clear PinPending.
1440 FIXED we now clear PinPending in do_checkpoint.
1442 Does phase_wait do the right thing when pinning an inoidx block
1447 Need to understand and document the lifetime of a page with datablocks.
1448 who hold what refcount, and when can it be freed?
1449 Then fix up locking in lafs_refile, __putref.
1451 FIXME who keeps what refcount on orphan blocks/inodes??
1452 FIXME should dirty/pinned/etc hold a refcount? they don't.
1456 FIXME make sure a failed (-EAGAIN) pinning triggers a checkpoint (eventually)
1458 FIXME make sure empty files have depth of 1.
1460 FIXME Truncate proceeds lazily. All data blocks need to be gone
1463 If I call lafs_erase_dblock while a write is underway, we have a problem.
1464 We need to wait potentially for a checkpoint to let go of the block and
1465 a write to complete.
1466 This should be done with waiting for PG_writeback on the page to disappear.
1469 When end_page_writeback is called, we must have dropped all references to the
1471 When we commit to writing a block, we have to set PG_writeback on the page
1472 so that truncate et al can wait for it. Before we have committed, truncate
1473 can just remove the page. Internally we differentiate by B_Alloc.
1474 So before setting B_Allocated we need to test_set_page_writeback(page).
1475 Be careful of races.
1476 I don't think we can ensure all references are dropped. After all, that is
1477 the point of refcounts. So dblock array must exist without page!
1478 But we need to ensure that we don't start a writeout after truncate
1479 has done wait_on_page_writeback.
1480 This is done with the page locked so when we want to write a page
1481 in a checkpoint, we need to lock the page first. Once we have the lock,
1482 we check if the page is still dirty. If it has been truncated it
1484 But how do we safely reference the page if b->page can be cleared?
1486 When we clear PagePrivate, we take a counted reference to the page
1487 for db->page. This is dropped when the page is freed by lafs_refile.
1488 But while it is held, it is still safe for db->page to be dereferenced.
1489 So before we commence writeout we have to lock the page and set
1490 PG_writeback. After locking, we need to test if writeback is still
1493 Maybe not. I think we can submit blocks for writeout without setting the
1494 page to writeback. If we do, then we need to be sure those writes
1495 finish before invalidatepage calls releasepage (block_invalidatepage
1496 calls discard_buffer which calls lock_buffer which waits).
1497 In our case invalidatepage needs to make sure that no new write commences.
1498 Maybe we should lafs_iolock_block before we allocate to a cluster and check
1499 again if the block is dirty.
1502 lafs_cluster_allocate does:
1504 check if still dirty. If not, unlock and return
1507 when write completes, allocate is cleared.
1512 clear Valid,Dirty,Realloc
1517 2008 aug 28 - happy birthday.
1518 FIXME segsum_find calls lafs_reserve_block without a checkpoint lock.
1519 lafs_prealloc complains.
1521 mark_cleaning does too, but cleaning only happens well away from a checkpoint
1523 segsum_find is being called to reference a new segment when we flush a cluster.
1524 segment usage blocks are special. Their index information doesn't
1525 need to be written out in the current checkpoint. We can do that, but
1526 the backstop is to write just the data block in the tail of the
1527 checkpoint and write indexing information later.
1530 unlink is getting "No space left on device". This is when trying to
1531 pin the directory block, the physaddr is 0, so it looks like we want
1532 NewSpace. But we shouldn't even be trying to prealloc in that case because
1533 there should already be a prealloc on the block. i.e. there should be
1535 Hmmm. after multiple 'syncs' how can the block not be written out.
1536 Maybe it is embedded in the inode?
1537 When we pin a block that was embedded in the inode it isn't clear what to
1538 do. If we might grow the file so it doesn't fit any more, we need to
1539 allocate NewSpace. If we know it won't grow, we use Release.
1540 This still needs a proper fix.
1542 Cleaning seems to be working nicely. However we don't get all the space
1543 back that we should because lots of blocks still have credits that
1544 aren't being returned.
1546 So when should credits be returned?
1547 They are set when a block is pinned. It then gets dirtied which
1548 consumes a credit. Then gets unpinned. I guess if it isn't pinned,
1549 then it doesn't need any credits.
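The credit lifecycle as guessed at above, in a toy userspace form: pin reserves a credit, dirtying consumes it, and unpinning hands back whatever is unspent. Everything here (struct names, one credit per pin, the helper names) is an assumption made for the sketch, not how the real accounting is laid out.

```c
#include <stdbool.h>

/* Hypothetical free-space pool and block state for this sketch. */
struct space { long free_credits; };
struct blk { bool pinned; bool dirty; int credits; };

/* Pinning reserves one credit; fails if none are available. */
static bool blk_pin(struct space *sp, struct blk *b)
{
	if (b->pinned)
		return true;
	if (sp->free_credits < 1)
		return false;
	sp->free_credits--;
	b->credits++;
	b->pinned = true;
	return true;
}

/* Dirtying consumes the credit reserved by the pin. */
static bool blk_dirty(struct blk *b)
{
	if (b->dirty)
		return true;
	if (!b->pinned || b->credits < 1)
		return false;
	b->credits--;
	b->dirty = true;
	return true;
}

/* An unpinned block needs no credits: return any that are unspent. */
static void blk_unpin(struct space *sp, struct blk *b)
{
	sp->free_credits += b->credits;
	b->credits = 0;
	b->pinned = false;
}
```

If blocks are leaving the pinned state by some path that skips the equivalent of blk_unpin, that would account for credits never coming back.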
1552 It seems that cluster_flush is not always writing things in the correct
1553 order. Root gets written before some other things below it.
1554 Maybe they are temporarily out of the loop??
1555 No. There are dirty blocks which one checkpoint doesn't pick up, but
1556 they aren't holding the index block pinned. so they lose allocation.
1558 But they must hold the indexblock pinned, even though they aren't pinned
1559 themselves. We maybe do this just with the refcnt... maybe. That will cause
1560 it to phase-flip rather than drop pinning, which I think is right.
1562 So: too many credits remain allocated. Where are they? There are 1464
1563 outstanding credits. 290 are in the tree so 1200 or so are elsewhere??
1564 But things removed from the tree have credits removed.
1568 FIXME roll forward ignores inodes. But what about an inode that contains
1569 data. Should that be ignored? I think not.
1570 FIXME delete adir/big2 then delete adir and it cannot release:
1571 Cannot release [cee29000]74/0(0)r1:Pinned,Phase0,Valid,Dirty,SegRef,UninCredit,PhysValid,Prealloc
1572 presumably there is orphan processing or something to complete???
1573 FIXME when files are deleted, the space isn't returned!
1574 This seems to be mostly fixed - need to test.
1575 FIXME when I "rm [b-z]*" it waits for writeback on something???
1576 zfile again!!! OK, I think that is fixed.
1581 seg_apply_all dirties dblocks. When should they be reserved?
1582 They originally get reserved by a lafs_reserve_block call in
1583 segsum_find called from e.g. lafs_seg_ref which is called by lafs_reserve_block.
1584 However: that block might get written before *and* after a checkpoint.
1585 So we need N* Credits. These are usually only used for Index blocks.
1586 We can set these easily enough if inode type is TypeSegmentMap.
1587 We move them across to Credit in seg_apply_all.
1588 But when do we clear them if they aren't needed? I guess
1589 when we drop the last segref. Yes, we already do that.
1590 FIXME need to make sure these get flushed on next checkpoint
1591 if we cannot allocate new credits after a checkpoint.
1593 New Problem. The 'cleanable' table reports a size of 3, but it is empty!
1594 Think that is fixed.
1597 1/ see above: rm x/y; rmdir x -> BUG - FIXED
1598 2/ Spins on 'CURRENT=1' ??
1599 3/ if alloc_space gives EAGAIN while deleting, we don't survive.
1600 4/ When I create/delete a file, ablocks_used increments by one.
1601 The inode hasn't been allocated yet, so it seems the deallocation
1602 isn't adjusting ablocks_used??
1603 5/ open_namei (for dd) got caught on a mutex_lock.
1604 6/ When a large file is shrunk we don't reduce the level of the InoIdx block
1605 I'm not sure where we should and am not thinking very clearly.
1606 Will fudge something in flush_data_to_inode for now, but it MUST be fixed.
1607 7/ unlink (at least) can get stuck in iolock_block. Who could be holding
1608 the lock? Writeout that hasn't completed?
1609 Yes. writepage calls lafs_allocated_block without calling flush.
1610 So the block could be sitting waiting for a flush. How long do we
1612 8/ It seems that some datablocks can need NCredits. Make sure these
1613 are handled properly re flush-or-refill after checkpoint and
1614 flip_phase rather than unpin.
1615 9/ Maybe after lafs_writepage cluster_flush isn't getting called soon
1616 enough, and we lock up (see 7). Need to flush the first block
1617 straight away, and the next one as soon as the first finishes, etc.
1618 Or something like that. Then remove the comment from lafs_writepage.
1622 I seem to be getting only 4 blocks to a cluster at the moment.
1623 This is good as it motivates the code to handle block splitting in
1624 the Btree. But it shouldn't happen.
1627 Block splitting might work - it doesn't crash at least.
1629 After deleting all files, the tree is full of stuff.
1630 Lots of inode data/InoIdx blocks.
1631 Many but not all are Pinned. The others are OnFree
1632 The Pinned ones have outstanding references.
1636 Problem with the block splitting, when adding an index block.
1637 The index block is initially empty - we need to find things by looking
1638 at children. But we don't. We BUG_ON the iphys==0.
1639 In general, when we add a block below an index block and before we incorporate,
1640 the block must be found by finding the first indexed block and looking to
1641 see if there is a 'next' block that contains the address we need.
1644 But if we truncate a file while an index block is pinned and dirty,
1645 we spin on trying to incorporate it, which should make it empty.
1649 sync is trying to get lock in lafs_cluster_flush
1650 pdflush holds the lock and is stuck in cluster_flush_0xa40
1651 some wait_event I expect.
1652 Maybe we need an unplug ??
1654 - checkpoint/seg_apply_all/dirty_dblock doesn't have the credits.
1655 This is in clean_free. We try to update the 'youth' to mark
1656 the segment as free, and we don't have a reservation to do it.
1657 Maybe just reserve it there and then.
1661 When doing a lookup in an index block, we need to check the unincorp
1662 address list. It isn't enough to look for unincorp blocks as they
1663 might have disappeared.
1664 For INDIRECT and EXTENT this is easy enough as full information is in
1666 For INDEX it is a little tricky as we need to look at the full set of
1667 addresses to know where a particular address fits.
1668 We could force an incorporate first, but that has awkward implications
1669 if it requires a split.
1670 Maybe if we get from the lookup "start+range"....
1671 That is not enough as the 'start' might get zeroed by an update.
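For the easy (INDIRECT/EXTENT) case, the lookup rule above might look something like this sketch: consult the unincorporated address list first, then fall back to what is already incorporated. The structures are invented for illustration, and the list being ordered oldest-first with a physaddr of 0 meaning "now a hole" are both assumptions.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical unincorporated extent: 'count' blocks starting at
 * file address 'fileaddr' now live at 'physaddr' onward.
 * physaddr == 0 is assumed to mean the range became a hole.
 */
struct uninc_ext {
	uint32_t fileaddr;
	uint64_t physaddr;
	uint32_t count;
};

/* Hypothetical leaf index block with its pending address list. */
struct leaf_index {
	uint32_t start;			/* file address of phys[0] */
	size_t nphys;
	const uint64_t *phys;		/* incorporated addresses, 0 = hole */
	size_t nuninc;
	const struct uninc_ext *uninc;	/* assumed oldest first */
};

/* Return the physical address for 'addr', or 0 for a hole/miss. */
static uint64_t leaf_lookup(const struct leaf_index *ib, uint32_t addr)
{
	/* Newest unincorporated entry wins, so scan from the end. */
	for (size_t i = ib->nuninc; i-- > 0; ) {
		const struct uninc_ext *e = &ib->uninc[i];

		if (addr >= e->fileaddr && addr - e->fileaddr < e->count)
			return e->physaddr ?
				e->physaddr + (addr - e->fileaddr) : 0;
	}
	if (addr >= ib->start && addr - ib->start < ib->nphys)
		return ib->phys[addr - ib->start];
	return 0;
}
```

This works because each unincorporated entry carries the full file address range itself. It does not help for INDEX blocks, where placing an address needs the whole set.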
1674 rm adir/* doesn't work as readdir doesn't get all the entries
1676 Reason is that they are being put in the wrong block.
1677 lafs_find_next doesn't correctly find the 'next' block if it
1678 hasn't been incorporated yet.
1680 in index tree -- easy to find
1681 in uninc_table -- not too hard
1682 only in the ->children list, or attached to a page.
1683 It would be nice to use find_get_pages but that isn't exported so try
1684 something else for now.
1686 Look in index block for 'next
1689 FIXME when we split an index block, we need to hold a reference to
1690 the original so it doesn't disappear until the split-off copy is
1691 written. This is because we search from an index block to find
1693 [ note from Feb09. This should be OK now. Both will need
1694 incorporation, and we now hold on to blocks until they are
1700 - index block. What changes are allowed exactly.
1701 - splitting certainly makes sense.
1702 - merging two adjacent blocks is fine, of which a special case
1703 is finding that a block is empty and so removing it.
1704 - What about a 2->3 split which would require removing a block
1705 and adding another at the same time?
1706 or noticing that the first blocks addressed are all missing, so
1707 moving the index forward?
1708 In each case, searching down by indexes will find a block that
1709 has been replaced by a later address. We could manage that as
1710 long as the new block is attached after the replaced block.
1711 So we cannot move a block. We must delete and replace.
1713 - unincorporated index blocks..
1714 unincorporated data blocks are not pinned in memory. Once they have
1715 been written out, they can be freed. Their address is stored in the
1716 uninc-table. This means we can delay incorporation while many
1717 extents are written out and freed. When we come to incorporate, we
1718 may have many hundreds of addresses in a few extents that can be incorporated
1719 efficiently without holding all that data pinned in memory.
1720 The same scale doesn't apply to index blocks. An index block can
1721 reference only 102 blocks (for 1K block size). And the uninc table can
1722 hold far fewer so we will naturally incorporate more often.
1723 So keeping index/indirect/extent blocks pinned until they are incorporated
1724 is reasonable. And it makes lookup a lot easier, as we have
1725 guarantees about ordering of blocks in the children list that we
1726 don't have in the uninc table.
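A minimal sketch of why a small uninc table can stand in for many hundreds of written data blocks: consecutive addresses collapse into one extent. The struct and function names here are hypothetical and not the real LaFS ones, and the slot count of 8 is only assumed from the "8 blocks" remark in these notes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define UNINC_SLOTS 8   /* assumed table size, not taken from the real code */

/* One run of blocks that is contiguous both in the file and on the device. */
struct uninc_extent {
    uint32_t fileaddr;  /* first file address covered */
    uint64_t physaddr;  /* device address of that first block */
    uint32_t len;       /* number of consecutive blocks */
};

struct uninc_table_sketch {
    int used;                              /* slots currently in use */
    struct uninc_extent ext[UNINC_SLOTS];
};

/* Record the address of one block that has been written out.
 * Returns 1 on success, 0 if the table is full and the caller
 * would have to incorporate before adding more. */
int uninc_add(struct uninc_table_sketch *t, uint32_t fileaddr, uint64_t physaddr)
{
    int i;

    for (i = 0; i < t->used; i++) {
        struct uninc_extent *e = &t->ext[i];

        /* Extends an existing extent at its tail: no new slot needed. */
        if (fileaddr == e->fileaddr + e->len &&
            physaddr == e->physaddr + e->len) {
            e->len++;
            return 1;
        }
    }
    if (t->used == UNINC_SLOTS)
        return 0;
    t->ext[t->used].fileaddr = fileaddr;
    t->ext[t->used].physaddr = physaddr;
    t->ext[t->used].len = 1;
    t->used++;
    return 1;
}
```

A sequential write of several hundred blocks uses one slot here, so the data blocks themselves can be freed long before incorporation. Nothing comparable helps index blocks, whose addresses are not expected to form long runs.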
1728 Incorporation could have some atomicity issues. There is no
1729 concern about bad stuff appearing on disk as the phase-change
1730 process handles that. In memory it might be awkward if we split
1731 an index block before incorporating a block that would span them.
1732 That could conceivably happen if we only incorporate 8 blocks
1733 (size of uninc table) at a time.
1734 So maybe we should incorporate a full uninc list (not table) at
1736 This means quite different code paths for incorporating leaf
1737 and internal index blocks....
1740 - uninc_table lists are a real problem.
1741 They can only be created during roll-forward so they hardly ever
1743 But if the block is split while processing earlier things on the
1744 list, then splitting an uninc table would be very messy.
1745 Is there any way around this?
1746 Why not just do incorporation during roll-forward?
1747 We only need to incorporate leafs, not internal blocks because we
1748 don't use uninc_table for internal blocks any more.
1749 So during roll forward, all index blocks that are touched need to
1751 I think we live with that. If it ever becomes a problem, we will
1752 need to perform the roll-forward twice. The first time collects
1753 the usage information so that we know where we can start writing,
1754 then the second just applies all the changes to the rest of the
1759 uninc table only used for leaves, and has no linked list
1760 unincorporated index blocks are stored on a list, which we
1761 sort before applying.
1762 All uninc index blocks are therefore kept in the index tree.
1763 Their order on the children list allows us to find the correct
1764 index. Each block for which the fileaddr is in the parent is
1765 followed by any blocks that have been split off and end after
1766 this one starts. Blocks that have been emptied are Hole and are
1767 skipped over when looking for a block.
1769 When we split an internal block, the remaining uninc blocks
1770 must not start with a Hole.
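A rough sketch of the lookup that the children-list ordering is meant to allow. It uses an array in place of the real ->children list, and the struct, its fields and the function name are hypothetical. It also assumes that the blocks split off from one indexed block appear after it in ascending fileaddr order.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct child_sketch {
    uint32_t fileaddr;  /* first file address this block covers */
    int in_parent;      /* non-zero if the parent's index records this block */
    int hole;           /* non-zero if the block has been emptied */
};

/* 'start' is the position of the child that the parent's index selected
 * for 'target'.  Walk forward over the split-off blocks that follow it
 * and return the position of the last non-Hole block that starts at or
 * before 'target', or -1 if every candidate is a Hole. */
int find_child(const struct child_sketch *c, size_t n, size_t start,
               uint32_t target)
{
    int best = -1;
    size_t i;

    for (i = start; i < n; i++) {
        if (i > start && c[i].in_parent)
            break;          /* reached the next block the parent knows about */
        if (c[i].fileaddr > target)
            break;          /* relies on the assumed ascending order */
        if (!c[i].hole)
            best = (int)i;  /* Holes are skipped, never returned */
    }
    return best;
}
```

With children starting at 0 (indexed), 100 (split off), 200 (split off, Hole) and 300 (indexed), a search for address 250 that starts from the first child lands on the block at 100, because the Hole at 200 is skipped and the block at 300 belongs to the next index entry.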
1772 FIXME: what locking do I need around lafs_incorporate?
1773 i_mutex?? i_alloc_sem??
1774 i_alloc_sem is imposed by truncate (inode_setattr) and
1775 direct_io possibly. So it is really about adding/removing
1776 blocks. Not updating internals.
1777 Maybe our own mutex. Could even be per-index-block !!
1778 Whatever it is, we need to protect walking ->children too.
1782 "rm -r" problem from 12/dec/2008 fixed now.
1783 incorporate code got a make-over and is probably much better.
1785 New problems: After test runs, cannot create files due to no space
1786 on devices!! But directory tree is empty.
1789 free_blocks=3256 allocated=1425 max_seg=512 clean_reserved=0
1791 The problem is that we think 1425 has been allocated to data that
1792 might still need to be written, leaving not enough room for more.
1794 ====================414 credits ==============================
1795 which doesn't explain everything, but does explain a lot. There
1796 really should be nothing in the Index tree (except fs-root and
1799 Some inodes which are OnFree and hold no credits.
1800 0 DATA (1) 52 [0]ESegRef,Claimed,PhysValid
1801 52 1 (0) 0 [2564]{0,00000000}L on free Index(1),InoIdx,OnFree,PhysValid
1803 Some other inodes which are pinned with lots of credits and are
1804 on the phase_leaf list
1805 0 DATA (1) 299 [0]ESegRef,C,CI,Claimed,PhysValid
1806 299 1 (1) 0 [0]{0,00000000} [0, 0]L Leaf1(40) Index(1),Pinned,Phase1,InoIdx,Valid,SegRef,C,CI,CN,CNI,PhysValid,Prealloc
1808 And that is about it. some are not Valid, some are...
1809 checkpoint just wants to 'flip' them.
1810 They mostly have a refcnt of 1... I wonder who is holding that....
1811 The reference on the dblock is held by the iblock.
1812 But why is the iblock remaining? Who holds that reference?
1814 I restored some code to clean iblock, and now:
1815 free_blocks=3229 allocated=1277 max_seg=512 clean_reserved=0
1816 ====================244 credits ==============================
1817 which saved 130 credits. That helps.
1818 There seem to be many fewer of the many-credits blocks
1819 Lots of index blocks in tree are 'OnFree' and have a
1820 0 refcnt, but haven't been removed. Why?
1821 It seems that they have ->parent == NULL, so lafs_refile never
1822 bothers to remove them. I guess it should...
1823 OK, lots of InoIdx blocks have gone now with their DATA blocks.
1825 So, remaining blocks are pinned to their phase with lots of Credits,
1826 have no pincnt, mostly have physaddr==0.
1827 It is just the stray refcnt that keeps them there..
1828 inums are 40, 56, 62-73, 275-278, 280
1831 63-69 are directories 2/3/4/5/6/7/8/9
1832 70-73 are looooong symlinks
1834 276 is dfile - same as cfile but truncated.
1835 Then some nbfile-X that were big enough.
1837 So: what do they have in common:
1838 Several only use the in-inode data block, but
1841 Can it be that it is refcounted on the Leaf list, and so
1842 cannot get off?? Yes, I think so!
1843 We only unpin things that have a zero refcount.
1846 checkpoint takes it off the list, then flips the phase and puts it
1847 on the other list with refile. During that time it has a refcount
1848 so it doesn't lose the pinning.
1850 1/ Not have it on the list despite being pinned.
1851 2/ Drop the PIN despite the refcnt.
1852 3/ have refile do the phase_flip so it has a chance to
1853 notice the refcount has hit zero.
1855 2 isn't really an option. We need PIN to persist whenever we have
1856 a reference. We could possibly use PinPending for index blocks too,
1857 but that would require a lot of thinking.
1858 1 requires another criterion for being on the list. I suspect that would
1860 3 we used to do I think... But refile is in a big lock, and we
1861 cannot really do a phase_flip under that.. and phase flip calls
1862 refile anyway so we would get recursion.
1863 So:4 - get lafs_phase_flip to notice and de-pin rather than flip.
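A small sketch of what option 4 might look like. All names are hypothetical and this is not the real lafs_phase_flip. It assumes the caller (checkpoint) holds exactly one reference while the block is off the list, so a refcnt of 1 means nobody else wants the block. Whether further conditions, such as the block not being dirty, must also hold is not settled by these notes.

```c
#include <assert.h>

struct blk_sketch {
    int refcnt;   /* includes the reference held by the caller */
    int pinned;   /* non-zero while the block is pinned to a phase */
    int phase;    /* 0 or 1 */
};

/* Called with the block already removed from its phase_leaf list.
 * Returns 1 if the block was de-pinned, 0 if its phase was flipped. */
int flip_or_unpin(struct blk_sketch *b)
{
    assert(b->refcnt >= 1);     /* the caller's own reference */

    if (b->refcnt == 1) {
        /* Only the caller holds it: drop the pin instead of carrying
         * the block into the next phase. */
        b->pinned = 0;
        return 1;
    }
    b->phase = !b->phase;       /* still in use elsewhere: flip as before */
    return 0;
}
```

This keeps the decision out of refile, so there is no need to do a flip under the big lock and no recursion through refile.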
1865 FIXME use kzalloc where appropriate.
1867 FIXME Maybe test refcnt-!listempty in refile and de-pin if that is zero.
1871 Only 54 credits in Index Tree now.
1872 Inodes 1 2 8 16 are present. (fsroot, dirroot, inodemap, segusage)
1873 plus '74', which seems to be scheduled for deletion - root has uninc_table.
1874 ... and 'sync' got rid of that and left 44 credits.
1875 Also have data blocks for inode 50 55 72 73 74 with 2 credits of 74.
1881 These seem to be the files that used data-in-the-inode
1882 They still have a refcnt of 1 (or 2 for adir).
1883 ... OK, that's gone now. I found a refcount leak.
1885 So now: 42 Credits in Index Dump. No stray files.
1887 df: tot=4608 free=4597 avail=3045(4130-1085) cb=8 pb=0 ab=3
1888 So we still seem to have 1085 blocks allocated. 42 are accounted
1889 for, so 1043 still missing... either we lost the count, or lost the tree.
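The arithmetic behind "1043 still missing" can be checked mechanically against each df line. This is a throwaway user-space sketch: the struct and function names are made up, and 'base' is just a label for the first number inside the parentheses, which these notes do not name.

```c
#include <assert.h>
#include <stdio.h>

struct df_sketch {
    long tot;          /* tot= */
    long free_blocks;  /* free= */
    long avail;        /* avail=, the value before the parenthesis */
    long base;         /* first number inside the parentheses */
    long allocated;    /* second number: blocks counted as allocated */
};

/* Parse the leading fields of a "df: tot=... free=... avail=A(B-C) ..."
 * line.  Returns 1 if all five numbers were found, 0 otherwise. */
int df_parse(const char *line, struct df_sketch *d)
{
    return sscanf(line, "df: tot=%ld free=%ld avail=%ld(%ld-%ld)",
                  &d->tot, &d->free_blocks, &d->avail,
                  &d->base, &d->allocated) == 5;
}

/* Allocated blocks that the credits seen in the index dump do not explain. */
long df_unaccounted(const struct df_sketch *d, long tree_credits)
{
    return d->allocated - tree_credits;
}
```

For the line above, avail is base minus allocated (4130 - 1085 = 3045), and with 42 credits found in the tree the unaccounted remainder is 1085 - 42 = 1043.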
1891 create a tiny file, remove, and sync, now
1892 df: tot=4608 free=4597 avail=3018(4118-1100) cb=8 pb=0 ab=3
1894 so I lost 15, but now 48 are in tree. Let's try again...
1895 df: tot=4608 free=4597 avail=3006(4108-1102) cb=8 pb=0 ab=3
1898 df: tot=4608 free=4597 avail=2986(4098-1112) cb=8 pb=0 ab=3
1900 Definitely losing more than the difference in the tree.
1902 Try creating empty files...
1903 df: tot=4608 free=4597 avail=2986(4098-1112) cb=8 pb=0 ab=3
1904 df: tot=4608 free=4597 avail=2974(4088-1114) cb=8 pb=0 ab=3
1905 df: tot=4608 free=4597 avail=2954(4078-1124) cb=8 pb=0 ab=3
1906 df: tot=4608 free=4597 avail=2942(4068-1126) cb=8 pb=0 ab=3
1907 df: tot=4608 free=4597 avail=2922(4058-1136) cb=8 pb=0 ab=3
1908 df: tot=4608 free=4597 avail=2910(4048-1138) cb=8 pb=0 ab=3
1909 df: tot=4608 free=4597 avail=2890(4038-1148) cb=8 pb=0 ab=3
1911 very strong pattern there.
1912 What about 2 files at a time.
1913 df: tot=4608 free=4597 avail=2879(4028-1149) cb=8 pb=0 ab=3
1914 df: tot=4608 free=4597 avail=2860(4018-1158) cb=8 pb=0 ab=3
1915 df: tot=4608 free=4597 avail=2849(4008-1159) cb=8 pb=0 ab=3
1916 df: tot=4608 free=4597 avail=2830(3998-1168) cb=8 pb=0 ab=3
1917 df: tot=4608 free=4597 avail=2819(3988-1169) cb=8 pb=0 ab=3
1919 Slightly different pattern - not as bad.
1921 df: tot=4608 free=4597 avail=2802(3978-1176) cb=8 pb=0 ab=3
1922 df: tot=4608 free=4597 avail=2793(3968-1175) cb=8 pb=0 ab=3
1923 df: tot=4608 free=4597 avail=2776(3958-1182) cb=8 pb=0 ab=3
1924 df: tot=4608 free=4597 avail=2767(3948-1181) cb=8 pb=0 ab=3
1926 Strange, isn't it....
1928 Making sure we clear UnincCredit... result looks worse.
1931 I fixed up the credit accounting in 'incorporate' and then fixed a couple
1932 more little bugs. And now:
1936 ====================48 credits ==============================
1937 df: tot=4608 free=4597 avail=3172(3940-768) cb=10 pb=0 ab=1
1939 So we still have 720 allocated credits that aren't accounted for.
1940 But we are nicely under 100...
1945 ====================76 credits ==============================
1946 df: tot=4608 free=4256 avail=2160(2402-242) cb=350 pb=0 ab=2
1948 That is different. The count of missing blocks is way down,
1949 but there is some extra cruft in the index tree.
1951 0 DATA (1) 303 [0]L Leaf1(13) SegRef,Claimed,PhysValid
1952 0 DATA (1) 302 [0]L Leaf1(14) SegRef,Claimed,PhysValid
1954 0 DATA (2) 330 [0]L Leaf1(1) SegRef,C,CI,Claimed,PhysValid
1955 330 1 (1) 0 [0]{0,00000000} [0, 0]L Leaf1(0) Index(1),Pinned,Phase1,InoIdx,SegRef,C,CI,CN,CNI,PhysValid,Prealloc
1956 Time for a commit though....
1959 ====================46 credits ==============================
1960 df: tot=4608 free=4257 avail=4253(4458-205) cb=350 pb=0 ab=1
1962 so the strays in the index tree are gone, but still have 159 outstanding