2 So, let's try to write a kernel module that implements this filesystem.
3 It would be good to have a plan.
5 - Mount filesystem, providing empty root directory
6 o parse mount options - DONE
7 o find/load superblocks and stateblocks - DONE
8 o present empty directory - DONE
9 o Compile external module - DONE
12 - Mount filesystem read-only with no roll-forward
14 sync_page_io or bread? - not bread I think
15 o Index blocks management
16 o search cluster-header for root inode
18 o Directory lookup/read
21 - Support roll-forward for blocks, orphans, whatever
22 o manage segusage files
27 o cluster creation / block sorting
31 - Interface for snapshots and other admin
35 ------------------------
37 If a device is removed from the filesystem, we cannot reliably
38 tell from the other devices or state that this is so.
39 Maybe we need to update all devblocks with a new 'seq' number...
41 How do we specify mounting subordinate filesets?
42 What superblock do they have?
43 I suspect we do a -F lafs-sub mount from the original filesystem.
46 If mount fails, we seem to be leaving a super lying around,
47 and sync_supers dies on it. - DONE
50 Umount appears to work, but sync_supers dies. - DONE
53 subordinate supers aren't being locked as much - is that a problem?
56 index pages never get put on an LRU - how is this supposed to work?
59 --------------------------
61 Inodes live in an address-space, much like a file. To load the
62 first inode, we need an address-space, so may as well have an
63 'struct inode' as we may want to expose it to user-space.
65 Loading an inode, need
66 fs (lafs filesystem structure)
67 which subfs (maybe a lafs inode)
68 which snapshot - this is implied by the subfs inode.
69 and fs can be obtained from inode, so just inode, inum
75 review block_leaf_find and make_iblock
76 need to do setparent and block_adopt next
79 need to resolve locking for ->siblings list
86 I can read a file.....!!!!!
87 Code review / tidy up.
88 resolve locking buffer vs page
90 Export on a web page somewhere??
93 (I spent a while getting large-directories to work again in prototype..
95 - Priority: clean mount and unmount
99 FIXME how do we record and handle write errors???
101 The iput in lafs_release - which is needed - is oopsing
105 Ok, I finally have a clean mount/unmount.
106 .. not quite. blocks being freed at unmount still have a refcnt, which is bad.
109 - make sure we can handle 'large' directories.
110 - make sure we can handle files with indexes
111 - handle filesystems that span devices.
114 Hurray - clean unmounts!!!
115 There is a nasty circular reference of the root inode which is stored in
116 a block that it manages. Maybe this should not happen, rather than having to be
117 explicitly broken - the root-block can live elsewhere, not in the inode.
119 Next multi-level index blocks.
121 But first, need to understand memory pressure and pageout.
122 How are dirty pages found to be cleaned?
123 How is pressure put on a filesystem to clean up?
124 How are clean pages reaped?
126 - call pagevec_lru_add{,_active}(pvec) to put the page on an LRU
127 lru_cache_add{,_active}(page) might be easier, but isn't exported.
128 - call mark_page_accessed(page) to keep the page 'active'.
131 - make sure indexes work...
135 eax,bx,cx,dx,s1 all zero
136 from block_leaf_find 203
138 ... OK, indexes seem to work.
139 But 'lafs' has problems creating some large files.
142 This is due to not handling errors properly.. fix it later FIXME
146 Must make sure the index address-space gets cleared up... I wonder
147 how we find all the pages to free. This might be one reason to keep them
148 in a radix tree. Though we should be able to walk our own data structures.
151 Then work on mounting a 2-device filesystem.
154 FIXME dir_next_ent always starts from the beginning rather than
155 remembering where it is up to... can this be fixed??
158 18mar2006 (Wedding anniversary, and Saturday ... during commonwealth games)
160 Mounting snapshot needs a way to identify that it is a snapshotmount
161 and which snapshot, and which filesystem.
162 We could use a different filesystem type, but that isn't really needed
164 mount -t lafs -o snapshot=name /original/mount/point /new
166 This grabs the named snapshot of /original/mount/point and places it at
168 The 'snapshot=' option is the trigger.
171 mount -t lafs -o control /original/mount/point /new
173 To grow a filesystem, we initialise a device (super/state blocks) and
174 mount -t lafs -o remount,new=/dev/name whatever /original/mount/point
176 as the dev_name isn't passed to remount
178 So, mount options are:
185 pairs matching what is exposed in the control filesystem
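Sketch (userspace C, not the module code) of parsing the option
strings used in the mount commands above. Only the options visible
in these notes are handled (snapshot=, control, new=); the struct
and function names are made up for illustration, and 'remount' is
assumed to be consumed by mount/VFS before the filesystem sees the
string.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical result structure; field names are illustrative only. */
struct lafs_opts_sketch {
	char snapshot[64];	/* from "snapshot=name", empty if absent */
	char newdev[64];	/* from "new=/dev/name", empty if absent */
	int  control;		/* 1 if bare "control" was given */
};

/* Parse a comma-separated option string without modifying it.
 * Returns 0 on success, -1 on an unknown or over-long option. */
static int parse_opts_sketch(const char *opts, struct lafs_opts_sketch *o)
{
	memset(o, 0, sizeof(*o));	/* also NUL-terminates the strings */
	while (*opts) {
		size_t len = strcspn(opts, ",");

		if (len == 7 && strncmp(opts, "control", 7) == 0) {
			o->control = 1;
		} else if (len > 9 && strncmp(opts, "snapshot=", 9) == 0) {
			if (len - 9 >= sizeof(o->snapshot))
				return -1;
			memcpy(o->snapshot, opts + 9, len - 9);
		} else if (len > 4 && strncmp(opts, "new=", 4) == 0) {
			if (len - 4 >= sizeof(o->newdev))
				return -1;
			memcpy(o->newdev, opts + 4, len - 4);
		} else if (len != 0) {
			return -1;	/* unknown option, or empty value */
		}
		opts += len;
		if (*opts == ',')
			opts++;
	}
	return 0;
}
```

With this, "snapshot=monday" fills in the snapshot name, and an
unrecognised word is rejected rather than silently ignored.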
188 - factored out super-block finding preparatory to finding snapshots.
191 superblocks for snapshots and sub-ordinate filesystems do
192 not get stored in the 'state'. There is, however, a usage count so that
193 the prime filesystem cannot be unmounted until all snaps and subs are gone.
194 This should just refcount the prime_sb I suspect.
196 So: a snapshot sb points to the 'struct fs' but doesn't .... what???
199 - remove the super-block finding code by changing the layout to store
200 superblock locations explicitly :-)
202 - teach 'mount' to mount snapshots.
204 - need to audit for bad use of ss[0]
205 - need to find better way to map 'sb' to snapshot number.
206 - need to make unmount work.
208 01apr2006 (no, really!!)
209 - rewrite index to kmalloc index blocks and use a shrinker to free them.
210 This means that indexblock no longer has a 'page', which makes sense.
211 It also means they cannot live in highmem, which is sad, but could
214 Notes: superblocks and refcounts.
215 Each device holding the filesystem gets a superblock.
216 One of these (arbitrarily) is the 'prime' superblock and gets to
217 manage the whole filesystem.
218 Each snapshot also gets a superblock, as does each
219 subordinate filesystem. These are anon sbs - using anon dev.
220 Each anon sb takes a reference to the 'struct fs', and also to the
221 prime sb.... how about the reference relationship between fs and prime_sb???
225 - problem with getting parent superblock due to semaphores...
226 - when unmount, put_super isn't being called, so inode 0 isn't released!
229 (Took a week off to play with rt2500 wireless cards)
230 - Use different filesystem type for snapshots and subordinate filesystems.
231 This removes the semaphore problem
232 + OK, mount and unmount works for snapshots... what next?
233 - review index block - worry about himem?
234 - review ss[0] usage - OK
235 - general code review
237 FIXME - what should leaf_lookup/index_lookup return on format error?
238 They currently return '0' which will quietly make an empty block.
239 Maybe '-1' would be better to make an error block.
240 FIXME check how other filesystems lock the setting of PagePrivate
241 Maybe just need to lock_page
242 FIXME combine find/load/wait into one operation
243 Review dir, super, roll, link
245 FIXME module refcount increases on failed mount!
248 I've been sick for too long, and not much has happened... However I think more than
249 the above comment says. I started looking at roll-forward and have the
250 basic block parsing in place so that it reports what it sees in the roll.
251 Also, the format has been changed a little: the address in the state block
252 is the CheckpointStart cluster, and we simply roll forward to the
253 CheckpointEnd, and then keep going beyond there - there is no longer any
254 walking back to find the start.
256 Next step is to start incorporating rolled elements into the filesystem
258 - data blocks: shouldn't be too hard. Don't need to update the
260 - inode updates: should be straightforward enough, but care is needed
261 as the data might be in multiple places
262 - directory updates: these are probably most interesting..
265 Question: how are symlinks created?
267 log the inode creation
269 log the directory update.
270 This allows the 'value' stored in the inode to appear after the directory
272 That might be OK for files (which are created empty and then extended)
273 but is bad for symlinks (which are created atomically).
275 - ensure inode is in a previous cluster to directory updates.
276 This slows things down too much I think
277 - log the content as well. This is awkward if it is big, certainly if more
278 than a block, which is possible.
279 - directory updates could be dependent on the inode being valid.
281 - log content if it is small, else write inode, flush, then create link.
283 So the fast option is:
284 log inode create, log content, log filename
285 and the slow/safe option is
286 log inode create, sync file, log filename
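Sketch of the fast vs slow/safe choice described above. The
threshold (how much content is 'small' enough to log) is not fixed
anywhere in these notes, so it is a parameter here; all names are
hypothetical.

```c
#include <assert.h>
#include <stddef.h>

enum create_plan {
	PLAN_FAST,	/* log inode create, log content, log filename */
	PLAN_SLOW_SAFE	/* log inode create, sync file, log filename */
};

/* Pick the fast plan only when the content fits in an update record.
 * max_logged is an assumed tunable, not a value taken from the notes. */
static enum create_plan choose_create_plan(size_t content_len,
					   size_t max_logged)
{
	return content_len <= max_logged ? PLAN_FAST : PLAN_SLOW_SAFE;
}
```

Either way the filename is logged last, so the directory entry never
appears before the inode's content is recoverable.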
288 So on roll-forward if we see the inode we just save the data.
289 Saving the whole inode seems attractive, but we want minimal order
290 dependence: an inode update in the same cluster as the new inode should
291 still over-ride, even though it is earlier.
293 Ok, rollforward is proceeding slowly. I think I am now incorporating
294 new blocks into the tree properly, though the code probably won't compile.
295 It will be nice to test this and see the file have the right data.
297 Next step would be to include the index incorporation code.
305 - what exactly should happen when rollforward finds a file with a linkcount of 0?
306 Currently all updates get lost - I wonder if they are lost safely?
307 - rollforward is getting the size right, but not the content
308 - do I need to flag a block that ->phys is valid?
310 : Ok, roll-forward picks up new blocks in a file OK,
311 but umount has stopped working.
312 Presumably because there are pages attached to the inode which aren't
313 getting released. What do we want to do here?
314 Normally those pages, or their addresses need to be recorded before
315 they are lost. But on a read-only mount we don't care so much.
317 22jun2006 continuing above thought..
319 When we roll-forward and pick up the pieces of a file, we don't
320 want to allocate pages to hold those pieces (and definitely don't
321 want to read them all). We just want to attach the addresses
322 to the parent for incorporation. Similarly after writing
323 dirty blocks in a file we want to be able to release them
324 immediately rather than waiting for the addresses to be
325 incorporated (as incorporation can be more efficient when delayed).
327 We could just allow the page associated with a block to be released,
328 except that the page provides the indexing to find a block. We might
329 be able to live without the indexing, and hunt down the indexblock tree,
330 but living without the mutual-exclusion provided by block indexing would
332 And the 'struct datablock' still contains a lot more than is needed.
334 So maybe we should just have a completely separate structure attached to
335 the indexblock which lists fileaddr/physaddr. This could include
336 extent information. The trick would be guaranteeing allocation.
337 We could either allocate-late with a fallback of attaching the 'struct block'
338 or performing an immediate incorporation, or allocate-early and block
339 the dirtying of a page until there is space to record the new address.
340 This last is bound to be easiest.
342 So: what exactly do we use to store addresses?
343 Probably a linked list of tables.
344 Each table contains a link pointer and an array of
345 fileaddr/physaddr/extentlen
346 But we would need to allocate lots of these if there are hundreds of
347 dirty pages, but possibly only end up using a few if they made
348 extents very nicely. That might be wasteful.
350 Or we could allocate just one. When it is full we perform an
351 incorporation. But if that causes a page split we are in trouble.
352 We could have a spare page, split to it, write out one
353 and wait for the spare page to be written and free.
354 But we cannot just release the index page as it might still have
357 (I think I've been here before).
358 A worst-case scenario involves writing one block and that requires
359 splitting every index up the tree to the inode. This requires
360 arbitrarily many pages to be allocated. To accommodate this we either
361 pre-allocate a spare page at every level of the tree down to the data
362 block (a bit like storage space allocation) which seems very wasteful,
363 or we make sure we can release one of the split pages, which seems impossible.
365 I could decide not to worry about it. Have a pool of index pages and hope
366 it always works. After all, most pages are data pages, and they can be
367 freed successfully. We would only have a deadlock if all dirty memory were
368 index pages, and that seems unbelievably unlikely. If we trigger a
369 checkpoint when the count of locked-pages hits some limit we should be
372 So: Keep one table per index block. Use simple append and sequential search.
373 When table gets full, force an incorporation
375 Do we allocate the table separately, or embed it in the indexblock??
377 Probably embed it. indexblocks that don't need it can be freed at any
378 time so that space waste hopefully isn't significant.
381 If the file is written sequentially, then everything should gather into
382 extents, and so it doesn't need to be enormous.
383 If the file is written randomly then the index block can be expected to
384 be 'indirect', so incorporation will be cheap.
385 So 'small' seems ok in both cases.
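Sketch of the 'one small table, simple append, sequential search'
idea. The table size, field widths, the use of physaddr 0 for 'not
found' and the newest-first search order are all assumptions made
for the sketch, not the module's real layout.

```c
#include <assert.h>
#include <stdint.h>

#define TBL_MAX 8	/* arbitrary small size for the sketch */

struct addr_ent {
	uint32_t fileaddr;	/* first file block of the extent */
	uint64_t physaddr;	/* device address of that block */
	uint32_t len;		/* number of contiguous blocks */
};

struct addr_tbl {
	int used;
	struct addr_ent e[TBL_MAX];
};

/* Append one block address, extending the last extent when possible.
 * Returns 0 on success, -1 if the table is full and the caller must
 * force an incorporation. */
static int tbl_add(struct addr_tbl *t, uint32_t fileaddr, uint64_t physaddr)
{
	if (t->used > 0) {
		struct addr_ent *last = &t->e[t->used - 1];

		if (fileaddr == last->fileaddr + last->len &&
		    physaddr == last->physaddr + last->len) {
			last->len++;
			return 0;
		}
	}
	if (t->used == TBL_MAX)
		return -1;
	t->e[t->used++] = (struct addr_ent){ fileaddr, physaddr, 1 };
	return 0;
}

/* Sequential search, newest entry first. Returns 0 if not present. */
static uint64_t tbl_lookup(const struct addr_tbl *t, uint32_t fileaddr)
{
	for (int i = t->used - 1; i >= 0; i--) {
		const struct addr_ent *e = &t->e[i];

		if (fileaddr >= e->fileaddr && fileaddr - e->fileaddr < e->len)
			return e->physaddr + (fileaddr - e->fileaddr);
	}
	return 0;
}
```

A sequential write of 100 blocks collapses into a single entry,
while scattered writes fill the table after TBL_MAX entries and
trigger the forced incorporation.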
389 But wait a minute.....
390 On a checkpoint we can be getting phys updates for prev and next phases.
391 next-phase updates cannot be incorporated until the indexblock has passed
392 on to the next phase. So in that case, I think we still keep a linked
393 list of unincorporated blocks and live with the fact that we cannot
394 free them until the phase change passes. That shouldn't be a big problem
395 as it is a limited time frame - especially for data blocks..
397 But does this solve our initial problem??
398 During roll-forward we want to keep the addresses but not the blocks,
399 and we don't want to force incorporation. That means an arbitrary list
400 of addresses attached to an index block.
401 I guess we could possibly allow incorporation, but I would rather not
402 as I want the fs to be able to be read-only nicely.
403 So that means we need to have a list of address tables.
404 Maybe the normal approach is 'add a table if possible, else incorporate'?
406 OUCH... we may write a block a second time before incorporating the
407 new address, so when adding an address to the table we need to check
408 if it already exists. That could be expensive.
409 For index blocks might it even be a different address? I think
410 not but the vague possibility (in the future?) does complicate
411 things somewhat. Maybe we just keep things in chron order and
412 don't worry about duplicates until incorporate time, when we have to
418 free_block must free tables DONE
421 Unmounting still doesn't work.
422 Problem is that an index block is holding a reference on parent,
423 and parent references aren't getting cleaned up.
424 On read-only unmount I guess we need to walk the list of leafs,
425 discard any address info, and unlock the blocks.
426 So that should be the first task for next time.
429 Leafs are locked blocks which have no locked children.
430 So any locked data block (non-inode) is a leaf
431 Any locked index block with lockcnt[phase] 0 is a leaf.
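Sketch of the leaf test as stated above. The struct is a stand-in,
not the real 'struct block'. The notes do not say when an inode's
own data block counts as a leaf, so the sketch conservatively
reports false for it; that part is an assumption.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified block state for illustration only. */
struct blk_sketch {
	bool locked;
	bool is_index;
	bool is_inode;		/* data block that holds an inode */
	int  lockcnt[2];	/* locked children per phase (index blocks) */
};

/* A leaf is a locked block with no locked children in this phase. */
static bool is_leaf(const struct blk_sketch *b, int phase)
{
	if (!b->locked)
		return false;
	if (b->is_index)
		return b->lockcnt[phase] == 0;
	return !b->is_inode;	/* assumption: inode blocks handled elsewhere */
}
```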
433 OK - fixed numerous bugs, but I can unmount now!!
434 I can even rmmod and insmod and all is cool.
438 - review refile and get all the code in there from prototype
440 - write a combined find/load/wait function and use it
442 - allocate inodes in single memcache and avoid generic_ip
443 HALF DONE. (still using kmalloc, not doing initonce well)
444 - review recording of new block addresses
445 + make sure we lookup there on index lookup - YES
446 + make sure ->uninc_next gets transferred to table at phase change.
447 + write incorporation code as it is tricky
448 - review how directory updates can be incorporated into a RO filesystem.
449 No, they cannot. We need to update the directory.
450 - write directory update code
451 - write cluster construction code
452 - make sure indexblocks with unincorporated addresses get on to inc_pending
453 ?? or is locking them enough?
456 INCORPORATION - ARgggghhhhh.
457 The current uninc_table doesn't really lend itself to building
458 index block... though maybe....
459 Question: what happens when an index block disappears? i.e. it has no
461 We clearly need to remove it from the parent. This should be trivial,
462 a direct operation on the parent index block, e.g. set some number to 0.
463 Then the next incorporation pass will simply lose that entry.
465 OK, that might be all well and good, but how do we sort unincorporated
466 addresses so we can merge them?
467 A linked-list merge sort is nice and open-ended, but does waste
468 quite a bit of space in pointers.
470 Or maybe I should just always do small-table incorporations.
471 Is there a way that a bad ordering of writes could force very bad
472 index layout in this case? i.e. cause a table split every time,
473 but new blocks go in the first (full) table.
474 OK Decision: always do small-table incorporation.
475 i.e. not a list of blocks: just a table of addresses.
477 FIXME check validity of index type when it is first read in,
478 and reject early if it cannot be recognised.
481 Took a break from incorporation.
482 Looking at directories.
483 Wrote dir.doc in module to sum lots of stuff up.
485 dir blocks have an info structure attached.
486 This included a counted reference to the parent.
487 How long does this need to hang around for??
489 - when there is any orphan issue happening, it must stay, via
491 - when actually performing a dir op, we need to create and
494 When last ref of a dir block is dropped, should drop
495 the parent reference.
499 free list management mostly done.
501 create/delete prepare/commit/abort
503 dirty_block lock_block
506 FIXME should dir_new_block zero out the block?
507 How will commit_create know what to do with this block?
509 NOTE another type of directory orphan is a free leaf block which
510 is on the part-free list.
512 -------------------------------------------------------------
513 09sep2006 - on the plane to Frankfurt
514 Don't tell me I am rethinking preallocation again ???
517 dirty_inode needs to record the phase it is dirty in
518 inode_fillblock needs to check current phase and act accordingly.
520 Make sure the B_Orphan flag is set and used - or discard it.
522 How do we commit creating a symlink?
523 If it is a full block in size we cannot make an update record.
524 - maybe have two update records? We cannot guarantee they are in
526 ... but if we put the 'make dir entry' last it should work.
528 Change 'struct descriptor' definition
529 the 'block_type' aka 'length' 16 field becomes
530 0x0000 -> 0x8000 -> datablock, possibly a hole - up to 32K.
531 0x8001 -> 0xc000 -> miniblock up to 16K+
532 0xffff -> index block.
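Sketch of decoding the 16-bit field using the ranges listed above.
Taking the value itself as the datablock length, and (value - 0x8000)
as the miniblock length, is a guess that fits the '32K' and '16K'
limits; the notes do not spell it out. Values between 0xc001 and
0xfffe are treated as unknown.

```c
#include <assert.h>
#include <stdint.h>

enum desc_kind { DESC_DATA, DESC_MINI, DESC_INDEX, DESC_UNKNOWN };

/* Classify a descriptor 'block_type'/'length' value.
 * *len receives the assumed byte length (0 when not applicable). */
static enum desc_kind desc_classify(uint16_t v, uint32_t *len)
{
	*len = 0;
	if (v <= 0x8000) {
		*len = v;		/* assumed: length in bytes */
		return DESC_DATA;
	}
	if (v <= 0xc000) {
		*len = v - 0x8000u;	/* assumed: length offset by 0x8000 */
		return DESC_MINI;
	}
	if (v == 0xffff)
		return DESC_INDEX;
	return DESC_UNKNOWN;
}
```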
534 Need to write IO routines which decrease pending-block-count in
538 Thinks. a 1TB filesystem with 1K blocks and 4096 blocks/seg
539 gives 4Meg segments. That would be 256K segments which at 2 bytes per segment
540 - 512 segments per block - is 512 blocks in each seg usage file
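The arithmetic above, as a small checkable sketch (names are
illustrative): segment size, segment count, usage entries per block
and blocks per segusage file.

```c
#include <assert.h>
#include <stdint.h>

struct seg_geom {
	uint64_t seg_bytes;		/* bytes per segment */
	uint64_t nsegs;			/* segments in the filesystem */
	uint64_t segs_per_block;	/* usage entries per block */
	uint64_t usage_blocks;		/* blocks in one segusage file */
};

static struct seg_geom seg_geometry(uint64_t fs_bytes, uint64_t block_bytes,
				    uint64_t blocks_per_seg,
				    uint64_t bytes_per_entry)
{
	struct seg_geom g;

	g.seg_bytes = block_bytes * blocks_per_seg;
	g.nsegs = fs_bytes / g.seg_bytes;
	g.segs_per_block = block_bytes / bytes_per_entry;
	/* round up so a partial last block is still counted */
	g.usage_blocks = (g.nsegs + g.segs_per_block - 1) / g.segs_per_block;
	return g;
}
```

For 1TB, 1K blocks, 4096 blocks/seg and 2 bytes per entry this gives
4MB segments, 256K segments, 512 entries per block and 512 blocks.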
544 - lafs_lock_{d,}block DONE
545 Make sure the block has parents and allocation and set the locked
549 Given a datablock, wait for it to be written out
550 This is needed before updating a block that is still locked in the
553 Used when creating a new object/inode
554 Given a datablock which is to hold the inode
555 and a type (Type*) and a mode,
556 Fill in the data block with appropriate data so that
557 when lafs_import_inode looks at it, the right stuff happens.
570 lafs_cluster_update_abort
571 lafs_cluster_update_commit_buf
572 lafs_cluster_update_commit
574 lafs_cluster_update_prepare
575 lafs_inode_phase_check
578 lafs_cluster_update_lock
579 lafs_checkpoint_unlock_wait
585 - I need to know if a block is undergoing write-io so that I can
586 avoid modifying it in certain circumstances. But I don't track
587 this information. Options:
588 1/ track the info. This means an extra field in the 'struct block'
589 because I still need to know which wc has had a write.
590 2/ For blocks that we care about copy the data on write...
591 But we care about all inodes and directory blocks. That is a waste.
592 I think we put extra info in the block.
593 We need to know which wc was used (0,1,2) and which pending cluster
594 in there (0-3) which comes to 4 bits.
595 But we only care about the block for wc=0. and we could include the
596 which-pending in the b_end_io, or maybe put it all in low bits
597 of the block pointer.... Need max 4 bits. Can only be sure of 2...
600 'which' goes in bottom two bits of bi_private
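Sketch of stashing a 2-bit 'which' in the low bits of a pointer such
as bi_private. It relies on the pointed-to object being at least
4-byte aligned, which is the 'can only be sure of 2' bits above.
Helper names are made up.

```c
#include <assert.h>
#include <stdint.h>

/* Combine an aligned pointer with a value 0..3 in its low two bits. */
static void *pack_which(void *ptr, unsigned which)
{
	assert(((uintptr_t)ptr & 3) == 0 && which < 4);
	return (void *)((uintptr_t)ptr | which);
}

/* Recover the original pointer. */
static void *unpack_ptr(void *priv)
{
	return (void *)((uintptr_t)priv & ~(uintptr_t)3);
}

/* Recover the 2-bit 'which' value. */
static unsigned unpack_which(void *priv)
{
	return (unsigned)((uintptr_t)priv & 3);
}
```

The completion handler would unpack both halves before touching the
structure, since the packed value is not a valid pointer on its own.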
604 4apr2007 (What a long gap !!)
606 - lafs_cluster_update_*
607 How do we prepare for a cluster update? How do we lock it.
609 The important thing is that the update can be written. That
610 requires that there is space available. So we need to preallocate
611 space and then release it.
612 It is possible that each update might go in a different cluster, so maybe
613 we need to preallocate one block per update. That sounds a little expensive.
614 After all, we aren't preallocating a cluster block for every data block
616 So: prepare does nothing
617 lock preallocates the space - a full block.
623 - Can now create and delete lots of files. This is cool.
625 Orphan slots just grow and grow - never to be reclaimed - why?
626 After rm f*, 7 files remain. but rm f* again and they go.
627 FIXED - readdir wasn't returning them
628 Size of directory remains large.
629 And sometimes, files become ghosts... (try just removing one after first rm f*).
631 TODO - process those orphans to clean up the directory.
633 20June2007 (Happy Birthday Dad)
635 - Creating lots of files and then deleting them leaves 5 orphan slots
636 for the directory busy, and one for inode 0??
638 Directory handling uses the following orphans:
640 A new index block is created by splitting. This needs to be linked in.
642 The dirent block we are deleting from
643 If it becomes empty, it needs to go on free list
644 The index block we are deleting from
645 If it has lots of free space it might need to be rebalanced.
646 The inode that was deleted.
649 - When a file is fully deleted, we need to drop any orphan info... DONE
650 - Need to do orphan handling of free blocks in directory, and
651 unmerged parents - but there doesn't seem much point as I am going to
652 change the directory layout (again).
654 So: writing to a file.
655 We need prepare_write, commit_write, and writepage.
656 Prepare loads and links the page and checks there is space.
657 commit marks it as dirty so writeout is possible.
658 writepage chooses a page to write out
660 25June2007 - HACK week, thanks Novell!!
664 Need to revise the process whereby async completion
665 clears PageWriteback,
666 We need locking in there, and need to worry about
667 'which' wrapping too soon.
668 Need to not start IO before we set page writeback
670 Maybe, but syncing to disk needs more thought.
672 Partly done, need actual content.
674 Can make directory, but creating first entry fails. - FIXED
677 - new directory structure.
679 27Jun2007 - More HACK week :-)
681 - new directory layout done - much easier!!
682 - If I delete a file that was created, the blocks still have a ref-count
684 - mkdir doesn't increase link count on parent. - FIXED
688 Infrastructure to process orphans
689 Handle specific cases
690 flush orphans at key times.
691 load orphans at roll-forward
694 Write out a checkpoint (when?)
695 Make sure refcount goes back to zero on blocks I write.
697 Check on inode_phase_check and checkpoint_unlock and inode_dirty
698 in all directory operations.
700 FIX: Writing a small file leaves something non-dirty but
701 due to be written, and lafs_cluster_allocate complains.
704 FIX: dir_handle_orphan doesn't lock the orphan transaction required.
706 FIX: rm a file with (small) content hangs waiting in sync_page in truncate_inode_pages.
708 FIX: lafs_allocate hasn't been written!!!
710 FIX: before updating any block in a depth=0 file, we must first load
713 29Jun2007 - still HACK week.
714 Summary of how incorporation works.
716 Each index block has a small table for unincorporated changes. i.e.
717 block numbers and their addresses.
718 This supports efficient storage of extents, and is extensible by allocating
719 more tables. This last is done rarely.
721 When a block gets a new address, this is added to the table or, if
722 there is a phase mismatch, it is added to a list until a phase change
723 happens (so the whole block is pinned pending the phase change).
725 If the table is full then:
726 - if the filesystem is read-only (including during roll-forward),
727 a new table is allocated (else rollforward fails).
728 - otherwise we incorporate the table into the block, then add the new
729 address to the (now empty) table.
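The decision described above, written out as a sketch. The enum and
function names are hypothetical; only the branching follows the
notes.

```c
#include <assert.h>
#include <stdbool.h>

enum addr_action {
	QUEUE_UNTIL_PHASE_CHANGE,	/* phase mismatch: goes on a list */
	ADD_TO_TABLE,			/* room in the table */
	CHAIN_NEW_TABLE,		/* full and read-only/roll-forward */
	INCORPORATE_THEN_ADD		/* full and writable */
};

/* What to do when a block gets a new address. */
static enum addr_action on_new_address(bool phase_matches, bool table_full,
				       bool read_only)
{
	if (!phase_matches)
		return QUEUE_UNTIL_PHASE_CHANGE;
	if (!table_full)
		return ADD_TO_TABLE;
	return read_only ? CHAIN_NEW_TABLE : INCORPORATE_THEN_ADD;
}
```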
731 If incorporation requires that we split the index block we allocate one
732 from a pool. If there are none in the pool, we wait.
734 As the table is much smaller than a block, the incorporation into
735 two blocks will always succeed.
736 The 'uninc_next' and 'children' lists will then need to be shared
737 between the two blocks before the new address is added to whichever
738 table is appropriate.
740 When looking for a block address, we must always check the table and
741 then children lists. We do not need to check uninc_next as they will always
744 How to ensure that the pool always has sufficient index blocks and we don't
746 We have two halves of the table, one for each phase. Before we allow
747 a block to be dirty in a phase, we ensure that the pool has adequate
748 index blocks for that phase. e.g. twice the depth of the block. If it
749 doesn't we block the dirtying until space becomes available.
750 For syscall writes, this is easy as we catch in prepare_write.
751 When we perform a phase change, we must be sure there are enough index
752 blocks for the deepest block that will stay dirty. If there aren't, we need
753 to flush all dirty blocks, and unmap all writable mappings before
754 starting the checkpoint.
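Sketch of the per-phase reservation check. The 'twice the depth'
figure is the example from the text; treating the check as a
reservation that is subtracted and later given back is an assumption
of the sketch, as are all the names.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical pool with one count of spare index blocks per phase. */
struct iblock_pool {
	int avail[2];
};

/* Try to reserve enough index blocks to dirty a block at 'depth'.
 * Returns false when the caller must wait for space. */
static bool pool_reserve(struct iblock_pool *p, int phase, int depth)
{
	int need = 2 * depth;

	if (p->avail[phase] < need)
		return false;
	p->avail[phase] -= need;
	return true;
}

/* Give a reservation back once the block is clean again. */
static void pool_release(struct iblock_pool *p, int phase, int depth)
{
	p->avail[phase] += 2 * depth;
}
```

In prepare_write a false return would mean sleeping until a release
happens; at phase change it would mean flushing first.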
757 FIX: need to work out life time rules so that inodes hang around while they have blocks.
758 currently have an igrab that is never put.
760 FIX: Dirty isn't cleared until 'flush', but do_checkpoints requires 'alloc' to clear it.
763 Checkpoint flushing is getting close.
765 InoIdx blocks are not changing phase.
766 Phase change should happen when all children have been incorporated, and
767 then the write has been triggered marking us clean.
768 For InoIdx blocks, we need to be marked clean when the data block
771 5jul2007 - a week off
772 Checkpoint flushing seems to work !!!!
773 FIX: what should filesize of symlink be?
774 other filesystems use len, but still zero-terminate for vfs.
776 Problem. A chmod is followed immediately by an unlink then a checkpoint.
777 The chmod update gets into the checkpoint cluster, but the unlink completes
778 before the checkpoint is finished so the new superblock sees the file
779 as gone. Roll-forward finds the update and wants to update a missing file.
781 This isn't a big problem, but with slightly different details, it could be.
783 One option is to ignore updates that precede the updated block. That might
784 be awkward with e.g. directory updates and checkpoints that cross multiple
787 Another option might be to prohibit updates once a checkpoint has started
788 unless they are known to be after the phase change.
790 FIX: unlink isn't punching a hole in the inode file.
791 Inode usage map isn't being updated. - FIXED (For create, not unlink).
793 FIX: roll forward does not pick up inodes, only data blocks.
794 But tiny files are synced to inode, so they might not be picked up.
795 So we must process a level=0 inode like a data block.
798 Time for lots of clean up.
800 DONE 1/ Index blocks to fill with 0 - use phys=0 to imply invalid.
801 DONE 2/ rename 'lock' -> 'pin'
802 3/ Review and fix up all locking/refcounts. See locking.doc
803 DONE 3a/ Make sure cluster_allocate can be called concurrently. e.g. check
804 B_Alloc inside the semaphore
805 Also lock inode when copying in block 0 and probably
806 when calling lafs_inode_fillblock (??)
807 DONE 3b/ lafs_incorporate must take a copy of the table under a lock so
808 more allocations can come in at any time.
809 NotYet 3c/ cluster_flush should start all writes before calling _allocate
810 as _allocate might block on incorporation/splitting.
811 No. We really want _allocate to not block, but to queue...
812 I think this is too hard to get perfect just now, so I will leave it.
813 DONE 3d/ introduce PinPending for data blocks. remove fs->phase_depth.
814 LATER 3e/ Index needs a clean-lru on each filesystem, and a list of filesystems
815 so that locking of lru doesn't have to be too global
816 DONE 3f/ change wc[]->hlhead to be a regular listhead as it is part of the
818 DONE 3g/ revise refile lru handling based on new understanding
819 3h/ Utilise WritePhase bit, to be cleared when write completes.
820 In particular, find when to wait for Alloc to be cleared if
821 WritePhase doesn't match Phase.
822 - when about to perform an incorporation.
823 3i/ make sure we don't re-cluster_allocate until old-phase address has
824 been recorded for incorporation.
825 3j/ Check that index blocks cannot race when getting locked....
826 k/ Check what locking is needed to set PagePrivate exclusively.
827 DONE l/ cluster_done needs to call refile, but is called in interrupt context.
828 We need to get it done in process context I think and lock
829 ->waiting access with fs->lock after changing it to ->lru
830 DONE m/ Need to know which blocks in a page are in writeback so we can clear writeback
831 only when *all* have finished.
832 DONE n/ on phase change, uninc_next blocks need to be shared out.
833 NO 3o/ Make sure lafs_refile can be called from irq context.
834 3p/ lock all lru accesses.
835 3q/ Lock those index blocks!!!
836 3r/ Can inode data block be on leafs while index isn't, what happens if we
837 try to write it out...
838 FIXED Why are extent entries only grouped in 4s?
839 If InoIdx doesn't exist, then write_inode must write the data block.
840 4/ resolve length of symlink
841 FIXED - long symlink followed by 'sync' crashes.
842 FIXED - rollforward isn't calling 'allocated' on blocks, or something
843 FIXED - I cannot find 'bfile'. (inode isn't written)
844 SEEMS OK...- Must flush final segment of a cluster properly...
845 5/ Review what does, and does not need to be initialised in a new datablock
846 6/ document and review all guards against dirtying a block from a previous phase
847 that is not yet safe on storage.
848 See lafs_dirty_dblock.
849 7/ check for proper handling of error conditions
850 a/ checkpoint_start might fail to start a thread!
851 b/ if lafs_seg_dup fails in do_checkpoint, we need to abort the snapshot.
852 8/ review checkpoint loop.
853 Should anything be explicit, or will refile do whatever is needed?
855 What should checkpoint_unlock_wait wait for?
856 When do we need to wait for blocks to change state. And how?
857 DONE 10/ rebase on 2.6.current
858 DONE - use s_blocksize / s_blocksize_bits rather than fs->
860 11/ load/dirty block0 before dirtying any other block in depth=0 file
861 12/ Add writecluster flag for old-phase updates.
862 Why is this needed? updates should always go in the new phase???
863 13/ use kmem_cache for 'struct datablock'
864 14/ indexblock allocation.
866 allocate the 'data' buffer late for InoIdx block.
867 trigger flushing when space is tight
868 Understand exactly when make_iblock should be called, and make it so.
869 15/ use a mempool for skippoints in cluster.c
870 16/ Review seg addressing code in cluster.c and make sure comments are good.
871 DONE 17/ Make sure create inherits uid etc from process.
872 18/ consider ranges of holes in pending_addr.
874 DONE 20/ Implement rest of "incorporate"
875 DONE 21/ Implement staged truncate
876 DONE use for setattr and delete_inode
877 DONE 22/ block usage counts.
878 23/ review segment usage /youth handling and make a todo list.
879 a/ Understand ref counting on segments and get it right.
880 24/ Choose when to use VerifyNull and when to use VerifyNext2.
881 25/ Store accesstime in separate (non-logged) file.
883 make sure files are released on unmount.
886 Support 'peer' lists and peer_find. etc
887 31/ subordinate filesystems:
888 a/ ss[]->rootdir needs to be an array or list.
889 b/ lafs_iget_fs need to understand these.
890 32/ review snapshots.
892 how they can fail / how to abort
895 - need to clean up checkpoint thread cleanly - be sure it has fully exited.
896 34/ review roll-forward
897 - make sure files with nlink=0 are handled well.
898 - sanity check various values before trusting clusters.
900 34/ Configure index block hash_table at run time based on memory size??
902 Review everything that needs to handle laying out a cluster
903 aligned for striping.
905 36/ consider how to handle IO errors in detail, and implement it.
906 37/ consider how to handle data corruption in indexing and directories and
907 other metadata and guard against problems (lot of -EIO I suspect).
909 - check all uninc_table accesses are locked if needed.
912 1/ fs->pending_orphans and inode->orphans are largely unused!
913 2/ If a datablock is memory mapped writeable, then when we write it out,
914 we need to either fill up its credits again, or unmap it.
915 3/ Need to handle orphans asynchronously.
919 Free index blocks are on two lists, both protected by the global
921 1/ The per-inode free_index, so they can be destroyed with the inode
922 2/ The global freelist so they can be freed by memory pressure.
924 11feb2008. Where was I up to again?
925 reviewing phase_flip and lafs_refile.
928 Reading through modify.c, at 'add_indirect'. Plan to fix all this code.
929 Need to think about how index blocks really change. How old blocks get
930 dis-counted from segment usage, and what optimisations are really good
931 for re-incorporating index blocks.
932 Operations to consider are:
933 i)Append new block, ii)truncate, iii)over-write, iv)fill-hole.
934 i/ leaf block splits, index block gets new entry at end, and replacement
935 for other entry. Easy to handle
936 ii/ trailing entries are zeroed. Should be easy, but isn't yet.
937 iii/ probably caught in leafs. May cause internal split so we add new
938 index address, which is easily handled if there is space.
939 iv/ same as iii, though split more likely.
941 What about merging index blocks. That just makes addresses disappear, which
942 we handle the slow way.
943 Do we ever re-target index blocks? Would need to be careful about that.
944 Make it look like a split where one block ends up empty as a hole.
946 grow_index_tree (DONE - untested)
947 ib is a leaf inode that is getting full. Copy addresses
948 into 'new', and make 'ib' an index block pointing at new.
950 add_index/walk index (DONE - untested)
952 end of do_incorporate (DONE - untested)
953 new contains the early addresses. Some remain in ib
955 the buffers must be swapped, so ib has the early address.
956 ui needs to be attached to new
957 return 2; - then new uninc needs to be split
960 case 2 - horizontal split
961 case 3 - vertical split
963 Bother - uninc_table is a problem (again).
964 We can currently add at any time with just a spinlock.
965 So when we split a block horizontally,
969 share out children and uninc_table in do_incorporate
970 share out credits in do_incorporate
973 Still need to do incorporate as above but took a break to...
975 Counting allocated blocks now works - stat shows the right info, hopefully
976 storage is correct too. - DONE
978 next: truncate? orphan thread?
979 Then segment usage and the cleaner.
983 truncate - removing blocks doesn't need to erase them...
984 - nothing forces a cluster_flush promptly!!! We need a timeout
985 or at least we need a flush before truncate_inode_pages...
987 - in lafs_truncate we need to make the block an orphan and pin in
990 21Feb2008 (Research morning)
991 Discard checkpoint thread created on demand in favour of a cleaner
992 thread that runs all the time. It cleans and checkpoints and
996 do segment scan and get a real list of free segments and
1000 - segment usage scanning to count free blocks
1001 - fix up re-reading of erased blocks
1002 - FIX truncate can still block waiting for writeback to complete.
1003 - FIX allocations aren't failing when we run out of free space
1004 - FIX df doesn't agree with du.
1007 Truncate when an index block has addresses in uninc_table.
1008 The summary for the new address has already been performed.
1009 We need to deallocate the new without disturbing the old.
1010 However a simple allocation may not be possible.
1011 I guess we can prune them all to zero, then incorporation
1014 TOFIX: when truncating a recently created file, it is still depth=0 so
1016 We really need to increase the depth to 1 as soon as we dirty
1017 any block, then reset back to 0 if it fits.
1020 We have a file that we have written to, and the data blocks have been
1021 written out and the addresses stuck in uninc_table.
1022 We then truncate the file. Who releases the usage of those blocks?
1023 And who removes them from uninc_table?
1025 OK, 'rm' returns all the blocks back now so 'df' is almost the same as 'du'.
1026 I really should make sure that inodes are getting freed properly and the
1027 inode map is clean and everything.
1030 Do we reserve segment-usage blocks.
1031 We cannot do it naively as we get infinite recursion.
1032 But we need it to be allowed to dirty the segment block.
1033 But we cannot pin them to this phase as we want to write them out
1035 This still needs more thought. I avoided the recursion by setting SegRef
1036 before getting the ref. But that isn't safe.
1039 The table of cleanable segments is not working out. Each segment appears multiple
1040 times which wastes space and adds confusion.
1041 We really want to be able to lookup by dev/seg and also find the least.
1042 'Find least' sounds like we want a heap but then we cannot discard the bottom half.
1044 We could have a skiplist for dev/segment lookup and do a merge-sort on
1045 a different link when we want to find the best segment.
1046 We then remember the best number found since a sort, and re-sort if the top
1047 is worse than the best.
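The "merge-sort on a different link" idea can be sketched in user-space C as below. This is only an illustration under assumptions: the entry layout, the 16-bit index links, and the names `seg_ent`, `NIL`, `merge` and `sort_by_weight` are all hypothetical, and the skiplist used for dev/segment lookup is left out entirely.

```c
#include <stdint.h>

#define NIL 0xFFFF            /* hypothetical end-of-list marker */

struct seg_ent {              /* hypothetical table entry */
	uint32_t dev, seg;
	uint16_t weight;
	uint16_t weight_next; /* link used only by the weight-sorted list */
};

/* Merge two lists that are already sorted by weight; return the new head. */
static uint16_t merge(struct seg_ent *t, uint16_t a, uint16_t b)
{
	uint16_t head = NIL, *tail = &head;

	while (a != NIL && b != NIL) {
		if (t[a].weight <= t[b].weight) {
			*tail = a;
			tail = &t[a].weight_next;
			a = *tail;
		} else {
			*tail = b;
			tail = &t[b].weight_next;
			b = *tail;
		}
	}
	*tail = (a != NIL) ? a : b;
	return head;
}

/* Top-down merge sort of the list that starts at index 'head'. */
static uint16_t sort_by_weight(struct seg_ent *t, uint16_t head)
{
	uint16_t slow, fast, second;

	if (head == NIL || t[head].weight_next == NIL)
		return head;

	/* Find the middle with slow/fast cursors and cut the list there. */
	slow = head;
	fast = t[head].weight_next;
	while (fast != NIL && t[fast].weight_next != NIL) {
		slow = t[slow].weight_next;
		fast = t[t[fast].weight_next].weight_next;
	}
	second = t[slow].weight_next;
	t[slow].weight_next = NIL;

	return merge(t, sort_by_weight(t, head), sort_by_weight(t, second));
}
```

Because only `weight_next` is rewritten, any address-ordered links in the same entries would be left alone, which is the point of sorting on a separate link.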
1049 We keep all this in a fixed size table. Each entry has
1050 seg,dev,usage,weight,weight-sort-link,addr-sort-link and possibly some
1051 addr-sort-skip links.
1052 This is 32+32+16+16+16+16 bits, or 16 bytes or bigger.
1053 Say 16bytes, 24bytes, or 32 bytes. (depth 8, which is plenty).
1054 One page of 16byte entries (256 of them)
1055 2/3 page of 24byte entries, 1/3 of 32byte entries.
1056 Total 2 pages, and 256+113+43 = 412 entries.
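The entry counts above can be checked with a small calculation. This sketch assumes a 4096-byte page and assumes that the 32-byte entries get whatever is left of the second page after the 24-byte entries are placed; both are my reading of the figures, not something the notes state. The names `table_layout` and `cleanable_layout` are hypothetical.

```c
#include <stddef.h>

#define PAGE_BYTES 4096   /* assumed page size */

struct table_layout {
	size_t n16;   /* 16-byte entries: one whole page */
	size_t n24;   /* 24-byte entries: about 2/3 of the second page */
	size_t n32;   /* 32-byte entries: the rest of the second page */
};

static struct table_layout cleanable_layout(void)
{
	struct table_layout l;

	l.n16 = PAGE_BYTES / 16;
	l.n24 = (PAGE_BYTES * 2 / 3) / 24;
	/* Assumption: 32-byte entries use the space the 24-byte ones leave. */
	l.n32 = (PAGE_BYTES - l.n24 * 24) / 32;
	return l;
}
```

With these assumptions the three counts come to 256, 113 and 43, matching the total of 412.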
1058 But deleting random elements is awkward... but not too awkward. We can delete
1059 lots of entries by marking them as old, then performing a single pass of the skip
1062 We should keep free segments here too, on a separate list.
1065 2 pages of 16byte entries
1069 free list randomly threads through all.
1071 When using from 24 or 32, randomly choose height of 2-5 or 2-9
1072 Two lists run through the skiplist entries. One for cleanable, one for free.
1073 Remember the nth element for some small n (10, but it decreases as we pull
1074 things off the front) and if we add something less than that, we trigger a
1075 mergesort on the next time we want to clean.... maybe.
1077 Remember end of free list and add to there. Maybe merge-sort the free list
1078 by addr occasionally.
1081 When can we clean, when can we free wrt checkpoints?
1082 - we can clean a segment as soon as we have a checkpoint after it.
1083 So we record the youth of the segment holding the (start of the)
1084 checkpoint, and can clean any segment with a lower youth.
1085 - we can free a segment after the checkpoint after its usage has reached
1086 zero. So if usage is zero and youth....
1087 We could offset the usage by one (say - for the first cluster header..)
1088 then when we find a segment with usage of '1', we schedule an update to
1089 0 in the next checkpoint...
1090 How about segments with different sizes - they get different weights.
1091 Need to divide by segment size: usage * youth / size.
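A minimal sketch of the weight and the "can clean" test described above. The struct, field widths and function names are hypothetical; only the formula usage * youth / size and the "lower youth than the checkpoint's segment" rule come from the notes.

```c
#include <stdint.h>
#include <stdbool.h>

struct seg_info {         /* hypothetical per-segment summary */
	uint32_t usage;   /* blocks still in use */
	uint32_t youth;   /* larger means more recently written */
	uint32_t size;    /* blocks in the segment, must be non-zero */
};

/* Weight normalised by segment size so differently sized segments compare. */
static uint64_t seg_weight(const struct seg_info *s)
{
	return (uint64_t)s->usage * s->youth / s->size;
}

/* A segment may be cleaned once a checkpoint has started in a younger one. */
static bool seg_can_clean(const struct seg_info *s, uint32_t checkpoint_youth)
{
	return s->youth < checkpoint_youth;
}
```

The multiplication is done in 64 bits so that usage * youth cannot overflow before the division.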
1094 - It seems I sometimes fall off the end of the last segment !!! - FIXED (locking)
1095 - We seem to switch to a new segment when still 83 blocks remaining? - FIXED (delete did flush)
1097 - Lots of 'creates' makes lots of little clusters - need to optimise!
1098 Or it could be deletes as we currently cluster_flush for each
1100 - I think this is fixed
1103 Started looking at the cleaner.
1104 Need to understand how much to clean each checkpoint
1105 Need to track free-space-in-active-sectors while scanning.
1109 - the cluster head is currently limited to one page. This is not good.
1111 - Should the cleaner start before the scan is complete after a checkpoint?
1112 Probably it can, but while the scan is still happening it might be best
1116 try_clean is taking shape and has a few FIXMEs.
1117 need to write async find_block code and get it to watch for
1118 block in a cleaning segment.
1121 - where can padding appear in a cluster? between miniblocks? at
1122 end of device blocks?
1123 - need to track phys block while parsing headers for cleaning.. why?
1124 - determine rules for avoiding block lookup during cleaning
1125 based on youth/snapshot age, and truncate generation.
1126 We need to load the inode from each snapshot
1127 Can we optimise based on snapshot age?
1128 only if we know the block is newer than the snapshot.
1129 So when we relocate blocks (cleaning) they must go in a segment
1130 that is marked as being old. We cannot really guarantee that.
1131 I guess blocks that are marked as 'new' can safely be skipped if
1132 segment is newer than snapshot. This 'age' is not the youth, but
1133 is the cluster_head->seq which is stored in creation_age.
1135 - Store the rootdir for a filesystem in the metadata for the root inode.
1136 Then 'struct snapshot' doesn't need rootdir. It can have a root
1139 Looking at lafs_find_block_async.
1140 Needs async flag to make_iblock.
1141 Check that. Can we block_adopt if there was an error?
1143 setparent has async flag.
1144 lafs_leaf_find has async flag
1145 lafs_wait_block_async
1147 FIXME I wakeup the cleaner every time an IO completes.
1148 Do I really want that? Maybe only when number of async IOs hits
1149 half the recent maximum??
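One way the "half the recent maximum" idea could look, as a user-space sketch. Everything here is hypothetical (`async_io_stats`, `io_submitted`, `io_completed`), there is no locking, and how the "recent" maximum should decay over time is left open.

```c
#include <stdbool.h>

struct async_io_stats {
	unsigned int in_flight;   /* async IOs currently outstanding */
	unsigned int recent_max;  /* highest in_flight seen recently */
};

static void io_submitted(struct async_io_stats *s)
{
	s->in_flight++;
	if (s->in_flight > s->recent_max)
		s->recent_max = s->in_flight;
}

/*
 * Called once per completed IO, always paired with io_submitted().
 * Returns true when the cleaner should be woken: when the number in
 * flight drops to half the recent maximum, and again when it reaches 0.
 */
static bool io_completed(struct async_io_stats *s)
{
	s->in_flight--;
	return s->in_flight == s->recent_max / 2 || s->in_flight == 0;
}
```

With eight IOs in flight this wakes the cleaner twice (at 4 and at 0) instead of eight times.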
1151 FIXME need to ensure that lafs_pin_dblock flushed committed
1154 FIXME when we incorporate a dirty (non-realloc) address to an index block,
1155 we need to clear B_Realloc on the indexblock.
1157 FIXME in lafs_incorporate we lafs_dirty_iblock 'new' without
1158 giving it any credits. Where should they come from?
1160 We don't seem to scan for free/cleanable segments often enough.
1162 FIXME we shouldn't start a checkpoint while cleaning is happening.
1164 FIXME need to be careful when cleaning about finding inodes that
1165 don't exist any more.
1167 FIXME give credits to realloc blocks.
1169 FIXME think about/document transitions between realloc and dirty,
1170 and what locking is needed.
1173 Allowing for the FIXMEs above, the cleaner is now identifying
1174 blocks that need to be cleaned and marking them B_Realloc (I think).
1175 We now need to gather these into a write cluster and write them.
1176 They will all be on the clean_leafs list, so we can iterate that
1177 allocating or incorporating as needed. This will be similar to
1179 Important question is: when?
1180 Ideally we would have some auto-flush mechanism. The cleaner just
1181 keeps finding blocks to clean and when we start running out of
1182 resources we flush the cleaning queue.
1183 However we will still want to flush the cleaner always before a
1184 checkpoint, so for now we can implement that bit and wait for a
1185 need for the other to arise.
1188 FIXME: cleaner lookup of 0/0/0 has interesting consequences as we
1189 don't record that location the same way.. how to handle?
1190 Should check that 'adopt' doesn't do the wrong thing with this block.
1193 Realloc blocks need to be pinned. That makes sense. Only that way
1194 will they get onto the clean_leafs list.
1195 When checkpointing we should probably examine clean_leafs to be
1201 Both of these hold a Credit.
1202 Both can be set at the same time.
1203 Cleaner ignores Dirty and sets Realloc anytime the block is in
1204 the wrong segment. It also Pins the block.
1205 When the cleaner is flushing to the cleaning segment, it
1206 ignores Dirty blocks. They get their Realloc cleared, but
1207 they remain pinned. So they will get moved at the next checkpoint.
1208 How do we know whether an indexblock should be Dirty or Realloc?
1209 The Dirty/Realloc bit is cleared before we get to incorporation.
1210 Maybe we lafs_dirty_iblock the parent of any block we write
1211 out. Then after incorporation, we set Realloc if it is not
1215 I think I'm pinning cleaner blocks now.
1216 Need to make sure the dirty ones are dropped. DONE
1217 Need to make sure the usage is transferred
1218 Need to get free segments back into use
1219 Need some more 'dump' options. Maybe youth/usage files.
1221 Need to make sure scan etc are triggered often enough.
1223 FIXME lafs_prealloc walks up ->parent without locking
1224 I think we want i_mapping->private_lock like lafs_pin_iblock.
1227 1/ a 'dump' option that triggers a scan and prints everything out.
1228 2/ scan must mark freeable as such, then subsequently free them.
1229 3/ Look at code that decreases usage of old segments.
1230 4/ Review lafs_cluster_wait_all and decide exactly how long we need
1232 5/ Review 'FIXME that is gross' HZ/10 thing.
1233 6/ Review 'wait for checkpoint to flush' msleep(500);
1234 Maybe remove that altogether.
1236 FIXME BUG_ON in grow_index_tree fires. sync - writepages - flush
1237 FIXME BUG in lafs_allocated_block fired.
1238 from lafs_erase_dblock from invalidate_page from .. vmtruncate
1242 An inode data block is dirty and pinned, but the inoidx is no longer
1243 pinned. Presumably it isn't dirty.
1244 Recheck what 'dirty' means on the two blocks and see how this can happen.
1247 Tree gets very big! Lots of 'Realloc' blocks that should
1250 We are spinning in cleaner again, and not in try_clean.
1252 Is it a problem that 'Pinned' is used for Realloc and dirty blocks?
1253 In general it shouldn't be. The flush_cleaner process will remove
1254 the Realloc bits so the blocks fall off clean_leafs. They then either
1255 go onto phase_leafs or get unpinned.
1256 But I currently have a problem with InoIdx/data.
1257 The Pin is transferred to the Data block, but it doesn't go from the
1258 InoIdx block because it has a pincnt. Now that is probably a bug, but
1259 what if it weren't? What if, while we were cleaning, a block got dirtied.
1260 That would pin the whole tree.
1261 I guess the rule about not allocating an inodedata block while the
1262 InoIdx is pinned needs to be revised. If the inodedata block is
1263 Realloc (and not Dirty) while the InoIdx is not Realloc, we
1264 can go ahead (in a cleaning segment).
1267 adir/big1 is garbage.... big1 was removed, so why is it even there?
1269 echo tre > dump # still too much stuff.
1273 Put cond_resched in checkpoint loops!
1276 Thoughts about cleaning and pinning.
1278 When cleaning we need to know how many dependent blocks are being cleaned
1279 so that we know when *this* block can be written - i.e. when the count hits 0.
1280 We cannot use the pincnt for this phase because there may be dependent blocks
1281 which are dirty. They, and therefore this, may get flushed at next checkpoint,
1282 but they may not. If we could be certain they would, we could just write
1283 to the clean-segment blocks which can become unpinned. However if there
1284 is an index block being cleaned, and no dependent is being cleaned, but some
1285 are dirty but not pinned, then the checkpoint can go past without the block
1286 being moved.... but maybe we can detect that.
1289 We set B_Realloc precisely on blocks found in segments being cleaned.
1290 We pin these blocks and leafs which are Realloc go in clean_leafs.
1291 If a block is both Realloc and Dirty we clear Realloc but leave pinned.
1292 That way it gets written at end of checkpoint, but to main cluster.
1293 When we incorporate Realloc blocks into an index block, it gets marked
1294 Realloc. When we incorp dirty blocks, mark dirty. Then see above.
1295 On a checkpoint, we process both phase_leafs and clean_leafs
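The Realloc/Dirty rules above can be written down as a tiny state sketch. The flag values and helper names are hypothetical stand-ins (the notes only name B_Realloc, Dirty, Pinned, phase_leafs and clean_leafs), and this models leaf blocks only.

```c
enum {                       /* hypothetical flag bits */
	B_DIRTY   = 1u << 0,
	B_REALLOC = 1u << 1,
	B_PINNED  = 1u << 2,
};

enum leaf_list { LIST_NONE, LIST_PHASE_LEAFS, LIST_CLEAN_LEAFS };

/* Cleaner found this block in a segment being cleaned: Realloc and pin. */
static unsigned int cleaner_mark(unsigned int flags)
{
	return flags | B_REALLOC | B_PINNED;
}

/* A block that is both Realloc and Dirty loses Realloc but stays pinned. */
static unsigned int resolve_realloc(unsigned int flags)
{
	if ((flags & B_DIRTY) && (flags & B_REALLOC))
		flags &= ~(unsigned int)B_REALLOC;
	return flags;
}

/* Which leaf list a pinned leaf belongs on under these rules. */
static enum leaf_list leaf_list_for(unsigned int flags)
{
	if (!(flags & B_PINNED))
		return LIST_NONE;
	if (flags & B_REALLOC)
		return LIST_CLEAN_LEAFS;
	return LIST_PHASE_LEAFS;
}
```

A dirty block that the cleaner touches therefore ends up pinned on phase_leafs and is written to the main cluster at the checkpoint, while a clean one goes to clean_leafs.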
1298 FIXME do inode reads async better when cleaning...
1300 FIXME if a realloc inode has been allocated to a cluster when we try
1301 to dirty it, confusion can ensue as the writeout won't mark it
1302 clean, but will use up the credits.
1303 Maybe we need something similar to phasewait to not set PinPending...
1304 But normal dirtying doesn't phasewait. I think we just need to
1305 detect this case and wait for the clean-cluster to flush.
1308 FIXME make sure incorporate is doing the right thing with credits.
1310 FIXME lafs_write_inode. We need to be careful about clearing Dirty
1311 when making an update. Need some sort of locking.
1312 Need to review all inode dirty stuff and make sure we do
1313 the right thing no matter when it is called.
1315 FIXME when blocks are attached to uninc_next, they don't have 'dirty'
1316 anymore so we don't know how to flag the index block.
1319 UPTO: unlink etc don't prealloc the inode that will be modified.
1320 And a warnon inode.c:579 is very noisy.
1323 FIXME: lafs_reserve_block uses CleanSpace if Realloc is set,
1324 but it doesn't get set until AFTER lafs_reserve_block is called.
1327 Cleaning cleans an InoIdx block which schedules the data block.
1328 Subsequently the InoIdx block gets pinned again.
1329 Now when we go to write the data block, we cannot because InoIdx is pinned
1331 Maybe given that data block is pinned, we write it anyway...
1333 FIXME: when we realloc a block embedded in the inode, don't pluck it out
1334 and put it back in again. Just realloc the inode.
1336 FIXME: when cleaning a directory that has shrunk, we think we have
1337 blocks that don't exist any more. FIXED - we thought '0' was in
1341 FIXME: lafs_dirty_iblock called from lafs_allocated_block in cluster
1342 flush finds no credit. for InoIdx block of 8501
1344 FIXME: do we do SEGREF on all the index blocks? do we need to?
1348 FIXME: seg usage for segment 0/5 isn't dropping to zero.
1349 Part of a file got moved off, but count is still there.
1350 FIXED - seg_move wasn't being called.
1351 FIXME: segusage file has inconsistent extents:
1356 FIXED several bugs in walk_extent
1358 FIXME qphase: any locking between that changing and lafs_seg_move??
1359 I don't think so. Just that seg_apply_all must be called after qphase is set.
1361 FIXME make sure we don't try to clean the current segment!!
1363 FIXME 'Available' goes negative!
1364 Creating large file doesn't instantly reduce 'Used'.
1365 Deleting files plus sync doesn't increase Avail?
1367 FIXME a segment is in the table but doesn't print out!
1369 FIXME we don't cope with running out of free segments (not that we ever should).
1371 FIXME check all Credit usage and make sure credits are returned when
1372 ->parent is dropped.
1373 provide visibility into credit counts.
1374 Make sure we are keeping enough space for cleaning. We should always
1375 have a few segments unallocatable.
1378 FIXME cannot do io completion in cleaner thread as it can block on
1379 a i_mutex which might be waiting for completion. FIXED (keventd).
1381 FIXME as ->iblock isn't refcounted we need to be careful accessing it.
1382 If we 'know' we have a reference, e.g. a child with a ->parent
1383 link, we can access it without locking.
1385 lafs_make_iblock should return a counted reference.
1387 If we own an (indirect?) reference to iblock, we can access
1388 both iblock and dblock for free... but iblock can change???
1389 If not, we need to get a reference to one or other under a lock.
1391 FIXME block->inode should be a counted reference?
1395 lafs_inode_handle_orphan OK
1396 inode_handle_orphan_loop FIXED
1400 lafs_find_next FIXED
1404 lafs_inode_handle_orphan
1408 FIXME root->iblock should always be refcounted. Is it?
1409 FIXME walking siblings - what lock?
1412 FIXME several times we clear PinPending without refiling, in dir.c in particular.
1413 that looks wrong. FIXED
1415 Maybe lafs_new_inode should return a reference to the dblock
1416 Or pin it. or something. FIXED And pinned (when needed).
1418 FIXME lafs_inode_dblock might return a block without valid data...
1419 Need to get valid data, then load block 0 in find_block rather than
1422 FIXME we really should own a reference to ->dblock before calling
1423 lafs_pin_inode. We don't want IO during a pin request.
1426 FIXME review use of PhysValid FIXED
1428 lafs_orphan_abort - what if lafs_orphan_pin not called?
1429 or if 'b' is NULL. FIXED
1431 Do I need to clear PinPending when retrying??
1432 Well, we need to be phase-locked when we set PinPending, so
1433 it must be Pinned to the current phase.
1434 So when we unpin a datablock, we must clear PinPending.
1435 FIXED we now clear PinPending in do_checkpoint.
1437 Does phase_wait do the right thing when pinning an inoidx block
1442 Need to understand and document the lifetime of a page with datablocks.
1443 Who holds what refcount, and when can it be freed?
1444 Then fix up locking in lafs_refile, __putref.
1446 FIXME who keeps what refcount on orphan blocks/inodes??
1447 FIXME should dirty/pinned/etc hold a refcount? they don't.
1451 FIXME make sure a failed (-EAGAIN) pinning triggers a checkpoint (eventually)
1453 FIXME make sure empty files have depth of 1.
1455 FIXME Truncate proceeds lazily. All data blocks need to be gone
1458 If I call lafs_erase_dblock while a write is underway, we have a problem.
1459 We need to wait potentially for a checkpoint to let go of the block and
1460 a write to complete.
1461 This should be done with waiting for PG_writeback on the page to disappear.
1464 When end_page_writeback is called, we must have dropped all references to the
1466 When we commit to writing a block, we have to set PG_writeback on the page
1467 so that truncate et al can wait for it. Before we have committed, truncate
1468 can just remove the page. Internally we differentiate by B_Alloc.
1469 So before setting B_Allocated we need to test_set_page_writeback(page).
1470 Be careful of races.
1471 I don't think we can ensure all references are dropped. After all, that is
1472 the point of refcounts. So dblock array must exist without page!
1473 But we need to ensure that we don't start a writeout after truncate
1474 has done wait_on_page_writeback.
1475 This is done with the page locked so when we want to write a page
1476 in a checkpoint, we need to lock the page first. Once we have the lock,
1477 we check if the page is still dirty. If it has been truncated it
1479 But how do we safely reference the page if b->page can be cleared?
1481 When we clear PagePrivate, we take a counted reference to the page
1482 for db->page. This is dropped when the page is freed by lafs_refile.
1483 But while it is held, it is still safe for db->page to be dereferenced.
1484 So before we commence writeout we have to lock the page and set
1485 PG_writeback. After locking, we need to test if writeback is still
1488 Maybe not. I think we can submit blocks for writeout without setting the
1489 page to writeback. If we do, then we need to be sure those writes
1490 finish before invalidatepage calls releasepage (block_invalidatepage
1491 calls discard_buffer which calls lock_buffer which waits).
1492 In our case invalidatepage needs to make sure that no new write commences.
1493 Maybe we should lafs_iolock_block before we allocate to a cluster and check
1494 again if the block is dirty.
1497 lafs_cluster_allocate does:
1499 check if still dirty. If not, unlock and return
1502 when write completes, allocate is cleared.
1507 clear Valid,Dirty,Realloc
1512 2008 aug 28 - happy birthday.
1513 FIXME segsum_find calls lafs_reserve_block without a checkpoint lock.
1514 lafs_prealloc complains.
1516 mark_cleaning does too, but cleaning only happens well away from a checkpoint
1518 segsum_find is being called to reference a new segment when we flush a cluster.
1519 segment usage blocks are special. Their index information doesn't
1520 need to be written out in the current checkpoint. We can do that, but
1521 the backstop is to write just the data block in the tail of the
1522 checkpoint and write indexing information later.
1525 unlink is getting "No space left on device". This is when trying to
1526 pin the directory block, the physaddr is 0, so it looks like we want
1527 NewSpace. But we shouldn't even be trying to prealloc in that case because
1528 there should already be a prealloc on the block. i.e. there should be
1530 Hmmm. after multiple 'syncs' how can the block not be written out.
1531 Maybe it is embedded in the inode?
1532 When we pin a block that was embedded in the inode it isn't clear what to
1533 do. If we might grow the file so it doesn't fit any more, we need to
1534 allocate NewSpace. If we know it won't grow, we use Release.
1535 This still needs a proper fix.
1537 Cleaning seems to be working nicely. However we don't get all the space
1538 back that we should because lots of blocks still have credits that
1539 aren't being returned.
1541 So when should credits be returned?
1542 They are set when a block is pinned. It then gets dirtied which
1543 consumes a credit. Then gets unpinned. I guess if it isn't pinned,
1544 then it doesn't need any credits.
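The credit lifecycle guessed at above (pin takes a credit, dirtying consumes it, unpinning returns whatever is left) could be modelled like this. It is a deliberately simplified sketch: one credit per block, no phases, no locking, and all names (`blk`, `credit_pool`, `blk_pin`, `blk_dirty`, `blk_unpin`) are hypothetical.

```c
struct credit_pool { int free_credits; };

struct blk {
	int pinned;
	int dirty;
	int credits;   /* credits held by this block: 0 or 1 in this sketch */
};

/* Pinning reserves a credit from the pool. Returns -1 if none is free. */
static int blk_pin(struct credit_pool *p, struct blk *b)
{
	if (!b->credits) {
		if (p->free_credits == 0)
			return -1;
		p->free_credits--;
		b->credits = 1;
	}
	b->pinned = 1;
	return 0;
}

/* Dirtying consumes the block's credit. Returns -1 if it has none. */
static int blk_dirty(struct blk *b)
{
	if (b->dirty)
		return 0;
	if (!b->pinned || b->credits == 0)
		return -1;
	b->credits--;
	b->dirty = 1;
	return 0;
}

/* An unpinned block needs no credits, so hand any unused one back. */
static void blk_unpin(struct credit_pool *p, struct blk *b)
{
	p->free_credits += b->credits;
	b->credits = 0;
	b->pinned = 0;
}
```

Under this model the space only "leaks" if a block is dropped while still holding a credit without going through the unpin path, which is the kind of place worth auditing.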
1547 It seems that cluster_flush is not always writing things in the correct
1548 order. Root gets written before some other things below it.
1549 Maybe they are temporarily out of the loop??
1550 No. There are dirty blocks which one checkpoint doesn't pick up, but
1551 they aren't holding the index block pinned. so they lose allocation.
1553 But they must hold the indexblock pinned, even though they aren't pinned
1554 themselves. We maybe do this just with the refcnt... maybe. That will cause
1555 it to phase-flip rather than drop pinning, which I think is right.
1557 So: too many credits remain allocated. Where are they? There are 1464
1558 outstanding credits. 290 are in the tree so 1200 or so are elsewhere??
1559 But things removed from the tree have credits removed.
1563 FIXME roll forward ignores inodes. But what about an inode that contains
1564 data. Should that be ignored? I think not.
1565 FIXME delete adir/big2 then delete adir and it cannot release:
1566 Cannot release [cee29000]74/0(0)r1:Pinned,Phase0,Valid,Dirty,SegRef,UninCredit,PhysValid,Prealloc
1567 presumably there is orphan processing or something to complete???
1568 FIXME when files are deleted, the space isn't returned!
1569 This seems to be mostly fixed - need to test.
1570 FIXME when I "rm [b-z]*" it waits for writeback on something???
1571 zfile again!!! OK, I think that is fixed.
1576 seg_apply_all dirties dblocks. When should they be reserved?
1577 They originally get reserved by a lafs_reserve_block call in
1578 segsum_find called from e.g. lafs_seg_ref which is called by lafs_reserve_block.
1579 However: that block might get written before *and* after a checkpoint.
1580 So we need N* Credits. These are usually only used for Index blocks.
1581 We can set these easily enough if inode type is TypeSegmentMap.
1582 We move them across to Credit in seg_apply_all.
1583 But when do we clear them if they aren't needed? I guess
1584 when we drop the last segref. Yes, we already do that.
1585 FIXME need to make sure these get flushed on next checkpoint
1586 if we cannot allocate new credits after a checkpoint.
1588 New Problem. The 'cleanable' table reports a size of 3, but it is empty!
1589 Think that is fixed.
1592 1/ see above: rm x/y; rmdir x -> BUG - FIXED
1593 2/ Spins on 'CURRENT=1' ??
1594 3/ if alloc_space gives EAGAIN while deleting, we don't survive.
1595 4/ When I create/delete a file, ablocks_used increments by one.
1596 The inode hasn't been allocated yet, so it seems the deallocation
1597 isn't adjusting ablocks_used??
1598 5/ open_namei (for dd) got caught on a mutex_lock.
1599 6/ When a large file is shrunk we don't reduce the level of the InoIdx block
1600 I'm not sure where we should and am not thinking very clearly.
1601 Will fudge something in flush_data_to_inode for now, but it MUST be fixed.
1602 7/ unlink (at least) can get stuck in iolock_block. Who could be holding
1603 the lock? Writeout that hasn't completed?
1604 Yes. writepage calls lafs_allocated_block without calling flush.
1605 So the block could be sitting waiting for a flush. How long do we
1607 8/ It seems that some datablocks can need NCredits. Make sure these
1608 are handled properly re flush-or-refill after checkpoint and
1609 flip_phase rather than unpin.
1610 9/ Maybe after lafs_writepage cluster_flush isn't getting called soon
1611 enough, and we lock up (see 7). Need to flush the first block
1612 straight away, and the next one as soon as the first finishes, etc.
1613 Or something like that. Then remove the comment from lafs_writepage.
1617 I seem to be getting only 4 blocks to a cluster at the moment.
1618 This is good as it motivates the code to handle block splitting in
1619 the Btree. But it shouldn't happen.
1622 Block splitting might work - it doesn't crash at least.
1624 After deleting all files, the tree is full of stuff.
1625 Lots of inode data/InoIdx blocks.
1626 Many but not all are Pinned. The others are OnFree
1627 The Pinned ones have outstanding references.
1631 Problem with the block splitting, when adding an index block.
1632 The index block is initially empty - we need to find things by looking
1633 at children. But we don't. We BUG_ON the iphys==0.
1634 In general, when we add a block below an index block and before we incorporate,
1635 the block must be found by finding the first indexed block and looking to
1636 see if there is a 'next' block that contains the address we need.
1639 But if we truncate a file while an index block is pinned and dirty,
1640 we spin on trying to incorporate it, which should make it empty.
1644 sync is trying to get lock in lafs_cluster_flush
1645 pdflush holds the lock and is stuck in cluster_flush_0xa40
1646 some wait_event I expect.
1647 Maybe we need an unplug ??
1649 - checkpoint/seg_apply_all/dirty_dblock doesn't have the credits.
1650 This is in clean_free. We try to update the 'youth' to mark
1651 the segment as free, and we don't have a reservation to do it.
1652 Maybe just reserve it there and then.
1656 When doing a lookup in an index block, we need to check the unincorp
1657 address list. It isn't enough to look for unincorp blocks as they
1658 might have disappeared.
1659 For INDIRECT and EXTENT this is easy enough as full information is in
1661 For INDEX it is a little tricky as we need to look at the full set of
1662 addresses to know where a particular address fits.
1663 We could force an incorporate first, but that has awkward implications
1664 if it requires a split.
1665 Maybe if we get from the lookup "start+range"....
1666 That is not enough as the 'start' might get zeroed by an update.
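For the easy (indirect-style) case, a lookup that consults the unincorporated addresses before the incorporated ones might look like the sketch below. The structures and names are hypothetical and much simpler than a real index block; the harder INDEX case, where the whole set of addresses matters, is not covered.

```c
#include <stdint.h>
#include <stddef.h>

struct uninc_addr {               /* hypothetical pending address */
	uint32_t fileaddr;
	uint64_t physaddr;        /* 0 means the block was removed */
};

struct indirect_blk {             /* hypothetical in-memory view */
	uint32_t start;           /* first file address covered */
	size_t nphys;
	const uint64_t *phys;     /* incorporated addresses, 0 = hole */
	size_t nuninc;
	const struct uninc_addr *uninc; /* pending updates, oldest first */
};

/* Return the current physical address for fileaddr, or 0 for a hole. */
static uint64_t lookup_phys(const struct indirect_blk *ib, uint32_t fileaddr)
{
	size_t i;

	/* The newest pending address wins, so scan from the end. */
	for (i = ib->nuninc; i-- > 0; )
		if (ib->uninc[i].fileaddr == fileaddr)
			return ib->uninc[i].physaddr;

	if (fileaddr >= ib->start && fileaddr - ib->start < ib->nphys)
		return ib->phys[fileaddr - ib->start];
	return 0;
}
```

This works even when the unincorporated blocks themselves have been freed from memory, since only their recorded addresses are needed.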
1669 rm adir/* doesn't work as readdir doesn't get all the entries
1671 Reason is that they are being put in the wrong block.
1672 lafs_find_next doesn't correctly find the 'next' block if it
1673 hasn't been incorporated yet.
1675 in index tree -- easy to find
1676 in uninc_table -- not too hard
1677 only in the ->children list, or attached to a page.
1678 It would be nice to use find_get_pages but that isn't exported so try
1679 something else for now.
1681 Look in index block for 'next
1684 FIXME when we split an index block, we need to hold a reference to
1685 the original so it doesn't disappear until the split-off copy is
1686 written. This is because we search from an index block to find
1688 [ note from Feb09. This should be OK now. Both will need
1689 incorporation, and we now hold on to blocks until they are
1695 - index block. What changes are allowed exactly.
1696 - splitting certainly makes sense.
1697 - merging two adjacent blocks is fine, of which a special case
1698 is finding that a block is empty and so removing it.
1699 - What about a 2->3 split which would require removing a block
1700 and adding another at the same time?
1701 or noticing that the first blocks addressed are all missing, so
1702 moving the index forward?
1703 In each case, searching down by indexes will find a block that
1704 has been replaced by a later address. We could manage that as
1705 long as the new block is attached after the replaced block.
1706 So we cannot move a block. We must delete and replace.
1708 - unincorporated index blocks..
1709 unincorporated data blocks are not pinned in memory. Once they have
1710 been written out, they can be freed. Their address is stored in the
1711 uninc-table. This means we can delay incorporation while many
1712 extents are written out and freed. When we come to incorporate, we
1713 may have many hundreds of addresses in a few extents that can be incorporated
1714 efficiently without holding all that data pinned in memory.
1715 The same scale doesn't apply to index blocks. An index block can
1716 reference only 102 blocks (for 1K block size). And the uninc table can
1717 hold far fewer so we will naturally incorporate more often.
1718 So keeping index/indirect/extent blocks pinned until they are incorporated
1719 is reasonable. And it makes lookup a lot easier, as we have
1720 guarantees about ordering of blocks in the children list that we
1721 don't have in the uninc table.
1723 Incorporation could have some atomicity issues. There is no
1724 concern about bad stuff appearing on disk as the phase-change
1725 process handles that. In memory it might be awkward if we split
1726 an index block before incorporating a block that would span them.
1727 That could conceivably happen if we only incorporate 8 blocks
1728 (size of uninc table) at a time.
1729 So maybe we should incorporate a full uninc list (not table) at
1731 This means quite different code paths for incorporating leaf
1732 and internal index blocks....
1735 - uninc_table lists are a real problem.
1736 They can only be created during roll-forward so they hardly ever
1738 But if the block is split while processing earlier things on the
1739 list, then splitting an uninc table would be very messy.
1740 Is there any way around this?
1741 Why not just do incorporation during roll-forward?
1742 We only need to incorporate leafs, not internal blocks because we
1743 don't use uninc_table for internal blocks any more.
1744 So during roll forward, all index blocks that are touched need to
1746 I think we live with that. If it ever becomes a problem, we will
1747 need to perform the roll-forward twice. The first time collects
1748 the usage information so that we know where we can start writing,
1749 then the second just applies all the changes to the rest of the
1754 uninc table only used for leaves, and has no linked list
1755 unincorporated index blocks are stored on a list, which we
1756 sort before applying.
1757 All uninc index blocks are therefore kept in the index tree.
1758 Their order on the children list allows us to find the correct
1759 index. Each block for which the fileaddr is in the parent is
1760 followed by any blocks that have been split off and end after
1761 this one starts. Blocks that have been emptied are Hole and are
1762 skipped over when looking for a block.
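
A rough sketch of that lookup over the children list, in plain C outside
the kernel. The struct, its field names and find_child() are all made up
for illustration, and a flat array stands in for the ->children list.
Treating a skipped Hole as "fall back to the previous block" is my
assumption here, not something the real code is known to do.

```c
#include <stddef.h>
#include <stdbool.h>

/* Hypothetical stand-in for an entry on the ->children list. */
struct child {
    unsigned long fileaddr;  /* first file address this block covers */
    bool in_parent;          /* fileaddr is already recorded in the parent */
    bool hole;               /* block has been emptied; skip it */
};

/*
 * Return the position of the child that should hold 'addr', or -1.
 * 'c' is in children-list order: each incorporated block is followed
 * by the blocks that were split off from it.
 */
static int find_child(const struct child *c, size_t n, unsigned long addr)
{
    size_t i, start = n;
    int best = -1;

    /* Step 1: what the parent's index gives us, i.e. the last
     * incorporated child that starts at or before addr. */
    for (i = 0; i < n; i++)
        if (c[i].in_parent && c[i].fileaddr <= addr)
            start = i;
    if (start == n)
        return -1;

    /* Step 2: walk the split-off blocks that follow it, looking
     * for a later 'next' block that still starts at or before addr. */
    for (i = start; i < n; i++) {
        if (i > start && c[i].in_parent)
            break;              /* reached the next indexed block */
        if (c[i].fileaddr > addr)
            break;              /* started past the address we want */
        if (!c[i].hole)
            best = (int)i;      /* Holes are skipped over */
    }
    return best;
}
```

With children starting at 0 (indexed), 50 (split off), 80 (split off, Hole)
and 100 (indexed), a search for address 60 or 90 lands on the block at 50.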
1764 When we split an internal block, the remaining uninc blocks
1765 must not start with a Hole.
1767 FIXME: what locking do I need around lafs_incorporate?
1768 i_mutex?? i_alloc_sem??
1769 i_alloc_sem is imposed by truncate (inode_setattr) and
1770 direct_io possibly. So it is really about adding/removing
1771 blocks. Not updating internals.
1772 Maybe our own mutex. Could even be per-index-block !!
1773 Whatever it is, we need to protect walking ->children too.
1777 "rm -r" problem from 12/dec/2008 fixed now.
1778 incorporate code got a make-over and is probably much better.
1780 New problems: After test runs, cannot create files due to no space
1781 on devices!! But directory tree is empty.
1784 free_blocks=3256 allocated=1425 max_seg=512 clean_reserved=0
1786 The problem is that we think 1425 has been allocated to data that
1787 might still need to be written, leaving not enough room for more.
1789 ====================414 credits ==============================
1790 which doesn't explain everything, but does explain a lot. There
1791 really should be nothing in the Index tree (except fs-root and
1794 Some inodes which are OnFree and hold no credits.
1795 0 DATA (1) 52 [0]ESegRef,Claimed,PhysValid
1796 52 1 (0) 0 [2564]{0,00000000}L on free Index(1),InoIdx,OnFree,PhysValid
1798 Some other inodes which are pinned with lots of credits and are
1799 on the phase_leaf list
1800 0 DATA (1) 299 [0]ESegRef,C,CI,Claimed,PhysValid
1801 299 1 (1) 0 [0]{0,00000000} [0, 0]L Leaf1(40) Index(1),Pinned,Phase1,InoIdx,Valid,SegRef,C,CI,CN,CNI,PhysValid,Prealloc
1803 And that is about it. some are not Valid, some are...
1804 checkpoint just wants to 'flip' them.
1805 They mostly have a refcnt of 1... I wonder who is holding that....
1806 The reference on the dblock is held by the iblock.
1807 But what is the iblock remaining? Who holds that reference?
1809 I restored some code to clean iblock, and now:
1810 free_blocks=3229 allocated=1277 max_seg=512 clean_reserved=0
1811 ====================244 credits ==============================
1812 which saved 130 credits. That helps.
1813 There seem to be many fewer of the many-credits blocks
1814 Lots of index blocks in tree are 'OnFree' and have a
1815 0 refcnt, but haven't been removed. Why?
1816 It seems that they have ->parent == NULL, so lafs_refile never
1817 bothers to remove them. I guess it should...
1818 OK, lots of InoIdx blocks have gone now with their DATA blocks.
1820 So, remaining blocks are pinned to their phase with lots of Credits,
1821 have no pincnt, mostly have physaddr==0.
1822 It is just the stray refcnt that keeps them there..
1823 inums are 40, 56, 62-73, 275-278, 280
1826 63-69 are directories 2/3/4/5/6/7/8/9
1827 70-73 are looooong symlinks
1829 276 is dfile - same as cfile but truncated.
1830 Then some nbfile-X that were big enough.
1832 So: what do they have in common:
1833 Several only use the in-inode data block, but
1836 Can it be that it is refcounted on the Leaf list, and so
1837 cannot get off?? Yes, I think so!
1838 We only unpin things that have a zero refcount.
1841 checkpoint takes it off the list, then flips the phase and puts it
1842 on the other list with refile. During that time it has a refcount
1843 it doesn't lose the pinning.
1845 1/ Not have it on the list despite being pinned.
1846 2/ Drop the PIN despite the refcnt.
1847 3/ have refile do the phase_flip so it has a chance to
1848 notice the refcount has hit zero.
1850 2 isn't really an option. We need PIN to persist whenever we have
1851 a reference. We could possibly use PinPending for index blocks too,
1852 but that would require a lot of thinking.
1853 1 requires another criterion for being on the list. I suspect that would
1855 3 we used to do I think... But refile is in a big lock, and we
1856 cannot really do a phase_flip under that.. and phase flip calls
1857 refile anyway so we would get recursion.
1858 So:4 - get lafs_phase_flip to notice and de-pin rather than flip.
1860 FIXME use kzalloc where appropriate.
1862 FIXME Maybe test refcnt-!listempty in refile and de-pin if that is zero.
1866 Only 54 credits in Index Tree now.
1867 Inodes 1 2 8 16 are present. (fsroot, dirroot, inodemap, segusage)
1868 plus '74', which seems to be scheduled for deletion - root has uninc_table.
1869 ... and 'sync' got rid of that and left 44 credits.
1870 Also have data blocks for inode 50 55 72 73 74 with 2 credits of 74.
1876 These seem to be the files that used data-in-the-inode
1877 They still have a refcnt of 1 (or 2 for adir).
1878 ... OK, that's gone now. I found a refcount leak.
1880 So now: 42 Credits in Index Dump. No stray files.
1882 df: tot=4608 free=4597 avail=3045(4130-1085) cb=8 pb=0 ab=3
1883 So we still seem to have 1085 blocks allocated. 42 are accounted
1884 for, so 1043 still missing... either we lost the count, or lost the tree.
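
The arithmetic behind that, spelled out as a small sketch. I am reading
the df line as avail = (base - allocated); the function names are mine
and are not anything in the module.

```c
/* avail as printed by df: the figure in brackets is base - allocated. */
static long df_avail(long base, long allocated)
{
    return base - allocated;
}

/* Allocated blocks that the index-tree dump does not explain. */
static long unaccounted(long allocated, long credits_in_tree)
{
    return allocated - credits_in_tree;
}
```

So df_avail(4130, 1085) gives the 3045 shown, and unaccounted(1085, 42)
gives the 1043 that are still missing.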
1886 create a tiny file, remove, and sync, now
1887 df: tot=4608 free=4597 avail=3018(4118-1100) cb=8 pb=0 ab=3
1889 so I lost 15, but now 48 are in tree. Let's try again...
1890 df: tot=4608 free=4597 avail=3006(4108-1102) cb=8 pb=0 ab=3
1893 df: tot=4608 free=4597 avail=2986(4098-1112) cb=8 pb=0 ab=3
1895 Definitely losing more than the difference in the tree.
1897 Try creating empty files...
1898 df: tot=4608 free=4597 avail=2986(4098-1112) cb=8 pb=0 ab=3
1899 df: tot=4608 free=4597 avail=2974(4088-1114) cb=8 pb=0 ab=3
1900 df: tot=4608 free=4597 avail=2954(4078-1124) cb=8 pb=0 ab=3
1901 df: tot=4608 free=4597 avail=2942(4068-1126) cb=8 pb=0 ab=3
1902 df: tot=4608 free=4597 avail=2922(4058-1136) cb=8 pb=0 ab=3
1903 df: tot=4608 free=4597 avail=2910(4048-1138) cb=8 pb=0 ab=3
1904 df: tot=4608 free=4597 avail=2890(4038-1148) cb=8 pb=0 ab=3
1906 very strong pattern there.
1907 What about 2 files at a time.
1908 df: tot=4608 free=4597 avail=2879(4028-1149) cb=8 pb=0 ab=3
1909 df: tot=4608 free=4597 avail=2860(4018-1158) cb=8 pb=0 ab=3
1910 df: tot=4608 free=4597 avail=2849(4008-1159) cb=8 pb=0 ab=3
1911 df: tot=4608 free=4597 avail=2830(3998-1168) cb=8 pb=0 ab=3
1912 df: tot=4608 free=4597 avail=2819(3988-1169) cb=8 pb=0 ab=3
1914 Slightly different pattern - not as bad.
1916 df: tot=4608 free=4597 avail=2802(3978-1176) cb=8 pb=0 ab=3
1917 df: tot=4608 free=4597 avail=2793(3968-1175) cb=8 pb=0 ab=3
1918 df: tot=4608 free=4597 avail=2776(3958-1182) cb=8 pb=0 ab=3
1919 df: tot=4608 free=4597 avail=2767(3948-1181) cb=8 pb=0 ab=3
1921 Strange, isn't it....
1923 Making sure we clear UnincCredit... result looks worse.
1926 I fixed up the credit accounting 'incorporate' and then fixed a couple
1927 more little bugs. And now:
1931 ====================48 credits ==============================
1932 df: tot=4608 free=4597 avail=3172(3940-768) cb=10 pb=0 ab=1
1934 So we still have 720 allocated credits that aren't accounted for.
1935 But we are nicely under 100...
1940 ====================76 credits ==============================
1941 df: tot=4608 free=4256 avail=2160(2402-242) cb=350 pb=0 ab=2
1943 That is different. The count of missing blocks is way down,
1944 but there is some extra cruft in the index tree.
1946 0 DATA (1) 303 [0]L Leaf1(13) SegRef,Claimed,PhysValid
1947 0 DATA (1) 302 [0]L Leaf1(14) SegRef,Claimed,PhysValid
1949 0 DATA (2) 330 [0]L Leaf1(1) SegRef,C,CI,Claimed,PhysValid
1950 330 1 (1) 0 [0]{0,00000000} [0, 0]L Leaf1(0) Index(1),Pinned,Phase1,InoIdx,SegRef,C,CI,CN,CNI,PhysValid,Prealloc
1951 Time for a commit though....
1954 ====================46 credits ==============================
1955 df: tot=4608 free=4257 avail=4253(4458-205) cb=350 pb=0 ab=1
1957 so the strays in the index tree are gone, but still have 159 outstanding
1960 ====================36 credits ==============================
1961 df: tot=4608 free=4256 avail=3787(3885-98) cb=350 pb=0 ab=2
1964 That is a little weird...
1966 ====================48 credits ==============================
1967 df: tot=4608 free=4257 avail=4247(4458-211) cb=350 pb=0 ab=1
1970 ====================34 credits ==============================
1971 df: tot=4608 free=4257 avail=3176(3373-197) cb=350 pb=0 ab=1
1973 It seems that the unaccounted blocks are (or can be) created by
1974 writing to a file then removing the file without a sync.
1975 ..but why is cb (cblocks_used) so high?
1979 Got onto a bit of a tangent...
1980 What happens if we truncate a block while it is on a list to
1981 be cleaned? Clearly we want the cleaner to drop it ASAP.
1982 But what if invalidate_page wants to drop it *now*
1983 Hopefully it is either still on clean_leafs and we can remove it,
1984 or it is now iolocked and we can wait for it. So should be OK.
1986 I keep getting caught in "looping on..."
1987 We are truncating an inode and some index block which is now empty
1988 is not getting removed from the tree because there is an outstanding
1989 reference.... 327/0 depth=1. I guess I turn on the tracing.
1991 ... and it seems that it is in the process of checkpointing.
1992 I guess I need to lock against that ... maybe with the iolock.
1995 ib = [ce814e40]328/0(2552)r3:Index(1),Pinned,Phase1,Valid,Dirty,CI,CN,CNI,UninCredit,PhysValid,Uninc{0,0}[0]
1996 ------------[ cut here ]------------
1997 kernel BUG at /home/neilb/work/nfsbrick/fs/module/segments.c:371!
2000 Every time I create/delete a file, I get an extra 'ab' which disappears
2003 decremented when +ve summary_update on non-index
2004 increased on lafs_summary_allocate... should not be done for index blocks.
2006 OK: after test run, filesystem is empty, but cblocks_used is around 360.
2008 is loaded at mount time
2009 collects pblocks_used on a phase flip
2010 is updated in lafs_summary_update (unless pblocks is)
2011 So we must be missing a lafs_summary_update when phys->0
2015 truncating big (multi-level index) seems to be bad
2016 Leaves 'pb-338 !!! and cb+689, even after sync.
2017 still 'looping on' occasionally
2018 Haven't found cblocks_used leak yet.
2019 Occasionally non-B_Valid blocks are acted on.
2020 I think I need to improve io locking.
2024 Need some improvements to iolock locking.
2025 We use this lock to wait for a block to be written out (if that is happening)
2026 before we allow lafs_invalidate_page to complete.
2027 It is also used in lafs_erase_{d,i}block (Similar purpose)
2028 We take the lock in lafs_cluster_allocate, and then make sure the block is
2031 Also lock in lafs_new_inode as initing the inode is a form of IO ??
2032 load_block takes the lock
2033 We only clear_bit(B_Valid, ) under this lock.
2035 So the issue is this:
2036 A block that is going to be written is passed to lafs_cluster_allocate.
2037 This happens either after taking it off a _leafs list, or when
2038 lafs_writepage requests the write.
2040 lafs_invalidate_page needs to be able to release the page, so there needs to
2041 be no transient references. In particular, once the block has been
2042 removed from a _leafs list it must already be iolocked.
2043 Invalidate_page can then either remove from that list and erase the block,
2044 or use io_lock_block to wait for the IO to complete.
2045 So when a datablock comes out of get_flushable it must be iolocked, and must
2046 remain iolocked until after Dirty and Alloc are clear
2047 Index blocks belong entirely to the fs, so we can be more relaxed with them.
2048 If get_flushable finds the block already iolocked, it is either being invalidated
2049 or already has IO pending, so it can be dropped.
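
A user-space sketch of that rule, to make the ordering explicit: take the
iolock before the block leaves the list, and just drop the block if the
lock is already held. The atomic_flag stands in for the B_IOLock bit and
a plain array stands in for the _leafs list. This get_flushable() is a
toy with an invented signature, not the real one.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy block: the flag plays the part of the B_IOLock bit. */
struct blk {
    atomic_flag iolock;
};

/* Try to take the iolock without waiting; true if we now hold it. */
static bool iolock_try(struct blk *b)
{
    return !atomic_flag_test_and_set(&b->iolock);
}

static void iolock_release(struct blk *b)
{
    atomic_flag_clear(&b->iolock);
}

/*
 * Take the next writable block off a toy leaf list (NULL = empty slot).
 * A block is only returned with the iolock held, so nobody ever sees
 * it off the list and unlocked.
 */
static struct blk *get_flushable(struct blk **leafs, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        struct blk *b = leafs[i];

        if (!b)
            continue;
        if (iolock_try(b)) {
            leafs[i] = NULL;    /* off the list, already iolocked */
            return b;
        }
        /* Already iolocked: it is being invalidated or has IO
         * pending, so drop it from the list and move on. */
        leafs[i] = NULL;
    }
    return NULL;
}
```

The caller would keep the lock until Dirty and Alloc are clear and then
call iolock_release().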
2054 FIXME When we sync a small file, we just write out the inode.
2055 rollforward currently ignores data in inodes I think.
2056 That needs to be fixed to ensure this data is safe.
2058 - stop iblock from disappearing so much.
2061 While cleaning a file, I truncate it. This makes it appear
2062 to fit in the inode but it is very big and we get confused.
2063 We cannot allocate block 0 until all the others have been
2064 allocated to 0 and forgotten.
2065 But what if we truncate a file to 10 bytes, then fsync?
2066 We need to write the data promptly, but we like doing truncate
2068 When we extend a file we already need to wait for truncation
2069 to complete (FIXME do we do that?) We could wait on fsync too.
2070 We cannot just delay block0 as it might be part of a checkpoint
2071 that has to complete promptly while truncation can take a long time.
2072 i.e. we have a very large file. We update the first byte, then
2073 truncate to 2 bytes.... we don't need to write until fsync which will wait...
2074 Directory?? delete lots of entries so it shrinks to one block?
2075 There is no delayed truncate there.
2076 ?? Never clean an I_Trunc file.
2077 If we try to allocate a file with other indexes:
2079 if Dirty and Pinned, just do normal alloc
2080 if Dirty and not pinned, skip.
2083 Sometimes I run out of credits while truncating a file.
2084 I need credits - maybe only briefly - to dirty the index blocks.
2087 An indexblock remains pinned while the refcount is non-zero.
2088 A pinned index block can be on a _leaf lru
2089 The _leaf lru holds a refcount.
2090 This is an awkward referential loop.
2091 We break it at checkpoint time with special code in phase-flip.
2092 But there are other awkward times such as truncate.
2094 We cannot use PinPending like we do with data blocks because there
2095 could be multiple pending Pins (from different children).
2097 We could possibly treat checkpoint_lock like pinpending, but that
2100 We could not count the _leaf lru, but that might just make the race
2103 I think we want to explicitly drop the pin when we truncate a block.
2104 Normally, once we Pin an index block it will become dirty so we don't
2105 want to de-pin before a checkpoint anyway...
2107 Just to clarify: an index block gets dePinned:
2108 - during checkpoint on a phase_flip if it is no longer dirty etc
2109 - on truncation when we erase it
2110 - during pre-emptive write-out which is a bit like an early phase_flip
2111 not sure that we implement that one yet.
2115 - checkpoint calls incorporate call erase_iblock calls iolock_block
2116 - rm calls orphan_pin calls phase_wait
2117 The problem is in lafs_incorporate. It expects the block to be iolocked,
2118 but can call erase_iblock which tries to get an iolock itself...
2119 ...fixed that and it still happens.
2120 checkpoint calls phase_flip calls allocated_block (on uninc list) calls
2121 iolock_block before calling incorporate
2122 Maybe all of these should assume an IO lock.
2124 FIXME truncate assumes truncate-to-zero. We need proper ftruncate support.
2128 - sort out individual patches and review DONE
2129 - allow compilation without refcount tracking DONE
2130 - don't hold a 'leaf' reference. NO
2131 - clean up *ref calls - differentiate those that can be called when zero DONE
2132 - use enum for B_* DONE
2133 - support truncate to non-zero offset DONE
2134 - "looping on" found an 'OnFree' block!
2135 - clean out lot of debugging
2138 rmdir is holding i_mutex and waiting for a phase change to pin a dblock.
2139 checkpoint is also holding i_mutex.. or is trying to get one in lafs_cluster_allocate.
2140 Not cool. i_mutex must not be taken by checkpoint
2141 Fixed that, though it is a bit of a hack....
2143 New deadlock: checkpoint calls phase_flip which calls allocate_block,
2144 to move the uninc_next across, and that tries to iolock the parent to
2145 perform a partial incorporation. But that seems to be iolocked.
2146 Generally that is ugly as ->uninc_next might be very long and require
2147 multiple splits, and direct-driving that from phase_flip is bad.
2148 I should just move the list across
2152 Spent too long trying to remove refcount held by *_leaf lists.
2153 This leaves InoIdx blocks with zero refcount so Data blocks can get
2154 lost and bad things happen.
2155 I might be able to fix it up, but it is probably better to try the
2156 checkpoint_lock approach if I can only remember what that is.
2169 ->lru when on freelist
2177 ->children / ->parent within an inode
2185 segsummary counters (in blocks)
2189 ->pending_blocks lru - should this be wc->lock ??.. not in 'bh'
2190 Pinned consistent with lru
2191 ->checkpointing / ->phase_locked
2193 ->uninc and ->chain ?? Should use parent->B_IOLock ??
2194 uninc_table - should use B_IOLock
2195 free list / clean list segtrack
2200 wc[0] .. something in prepare_checkpoint
2218 Initialising new inode
2220 IOLock across a page
2223 --------------------
2224 This is a list from 18 months ago, with updates
2226 - Understand how superblock 'version' should be used.
2228 - Review and fix up all locking/refcounts. See locking.doc
2229 Also lock inode when copying in block 0 and probably
2230 when calling lafs_inode_fillblock (??)
2231 - lafs_incorporate must take a copy of the table under a lock so
2232 more allocations can come in at any time.
2234 - We don't want _allocated to block during cluster flush. So have
2235 a no-block version and queue blocks on ->uninc if we cannot
2236 allocate quickly. Find some way to process those ->uninc blocks.
2238 - Use above for phase_flip so that we don't need to _allocated there.
2240 - Utilise WritePhase bit, to be cleared when write completes.
2241 In particular, find when to wait for Alloc to be cleared if
2242 WritePhase doesn't match Phase.
2243 - when about to perform an incorporation.
2244 - make sure we don't re-cluster_allocate until old-phase address has
2245 been recorded for incorporation.
2247 - allocate multiple WAIT_QUEUE_HEADS for 'block_wait'
2249 - Can inode data block be on leafs while index isn't, what happens if we
2250 try to write it out...
2252 - If InoIdx doesn't exist, then write_inode must write the data block.
2254 - document and review all guards against dirtying a block from a previous phase
2255 that is not yet safe on storage.
2256 See lafs_dirty_dblock.
2257 - check for proper handling of error conditions
2258 b/ if lafs_seg_dup fails in do_checkpoint, we need to abort the snapshot.
2259 - review checkpoint loop.
2260 Should anything be explicit, or will refile do whatever is needed?
2262 What should checkpoint_unlock_wait wait for?
2263 When do we need to wait for blocks to change state. And how?
2265 - load/dirty block0 before dirtying any other block in depth=0 file
2267 - use kmem_cache for 'struct datablock'
2268 - indexblock allocation.
2270 allocate the 'data' buffer late for InoIdx block.
2271 trigger flushing when space is tight
2272 Understand exactly when make_iblock should be called, and make it so.
2273 - use a mempool for skippoints in cluster.c
2274 - Review seg addressing code in cluster.c and make sure comments are good.
2275 - consider ranges of holes in pending_addr.
2277 - review correct placement of state block given issues with stripes.
2279 - review segment usage /youth handling and make a todo list.
2280 a/ Understand ref counting on segments and get it right.
2281 - Choose when to use VerifyNull and when to use VerifyNext2.
2282 - implement non-logged files
2283 - Store accesstime in separate (non-logged) file.
2285 make sure files are released on unmount.
2288 Support 'peer' lists and peer_find. etc
2289 - subordinate filesystems:
2290 a/ ss[]->rootdir needs to be an array or list.
2291 b/ lafs_iget_fs needs to understand these.
2294 how they can fail / how to abort
2297 - need to clean up checkpoint thread cleanly - be sure it has fully exited.
2298 - review roll-forward
2299 - make sure files with nlink=0 are handled well.
2300 - sanity check various values before trusting clusters.
2302 - Configure index block hash_table at run time based on memory size??
2304 Review everything that needs to handle laying out at cluster
2305 aligned for striping.
2307 - consider how to handle IO errors in detail, and implement it.
2308 - consider how to handle data corruption in indexing and directories and
2309 other metadata and guard against problems (lot of -EIO I suspect).
2311 - check all uninc_table accesses are locked if needed.
2313 - If a datablock is memory mapped writeable, then when we write it out,
2314 we need to either fill up its credits again, or unmap it.
2315 - Need to handle orphans asynchronously.
2318 - implement 'write_super' ??
2320 - pin_all_children has horrible gotos - remove them.
2322 - perform consistency check on all metadata blocks read from disk
2323 e.g. don't assume index blocks are type 1 or 2.
2326 + looking at cleanup for unmount.
2327 - various more refcounts fixed up
2328 - B_SegRef is never dropped! and we take a ref on a segment when
2329 we start a cluster on it, but never drop that reference.
2330 THIS is next thing - review all setting and clearing of B_SegRef.
2333 - SegRef and lafs_reserve_block...
2334 There is room for recursion here, I need to be careful.
2335 To dirty a data block, all parent index blocks must be Pinned and must
2336 be able to be written. That means their segusage blocks must be
2337 available for update. And Pinning a segusage block for update requires
2338 all its parents. So the segment for the block, the indexes, and the
2339 segusage and indexes and so-on must all be pinned.
2340 When we pin a block, we do it from the root down to avoid recursion.
2341 We probably want whatever reserve_block calls, to return an unreserved
2342 block rather than call reserve_block itself.
2344 When do we clear SegRef?? We set it when Pinning, so I guess we
2345 clear it when unpinning.
2346 pin_dblock, mark_cleaning, prepare_write, truncate
2348 Well, it is really when Pinning, or Dirtying or Reallocing.
2349 So we clear when unpinning, or when a dblock gets written...
2350 Maybe just when we lose ->parent
2353 - sometimes segsum counter goes zero for random data block
2354 Something is going wrong in roll-forward. The block looks transiently valid
2355 so doesn't get read, but has no good data in it.
2356 - After deleting a directory, the block might still have incorporation
2357 to happen, but is not marked dirty
2358 - at unmount, there are various blocks that are still dirty.
2359 - sometimes hit BUG_ON(credits==0) line 1196 in cluster.c(cluster_flush)
2362 - that rollforward problem above:
2363 When rolling the checkpoint, if we find segusage blocks we want to include
2364 them directly into file. But by pinning the block we might preread a
2365 segusage block.. but we must be sure not to update it.
2366 So during the early stages of rollforward while still in the checkpoint,
2367 seg_inc must be called with in_phase == 0.
2368 so seg_move is called with phase != qphase.
2369 ditto for summary update.
2370 So the block must be pinned to the previous phase...
2371 Normally 'phase' changes at checkpoint-start,
2372 qphase changes at checkpoint-end
2373 So we probably want to start with qphase being 0 and phase being 1.
2374 When we reach the end of the checkpoint, we flip qphase to 1.
2376 - blocks still in phase_leafs at unmount:
2377 After we force a final checkpoint we still have Pinned:
2379 ino==8 InoIdx due to Dirty block0
2380 ino=16 InoIdx due to dirty block0
2382 inode block 1, inode usage map
2387 inode blocks dirty but not pinned? No InoIdx...
2388 Segusage dirty - probably by seg_apply_all - disable that at umount
2389 orphan dirty ??... but not pinned!
2390 This is possible - we don't pin for clearing entries, just for setting.
2391 The inode problem stems from the datablock being dirty while the
2392 InoIdx block isn't. That is, at best, confusing.
2395 segusage blocks aren't being pinned
2396 They need to be pinned whenever dirty.
2397 and youth blocks aren't even made dirty some times. They need to be
2398 pre-pinned in many cases.
2400 So: segusage gets changed when we write out a cluster, and when we
2401 delete/relocate blocks.
2402 In the first case we pin the block when it becomes part of the free list,
2403 and need to keep it pinned across checkpoint changes.
2404 In the second, we pin when the block is dirtied and again must keep it pinned.
2405 Youth gets changed when a segment becomes free and again when we allocate
2408 Keeping a datablock pinned across checkpoints is awkward - we currently need
2409 to repin for each dirty... I guess we can re-pin for each checkpoint
2410 in lafs_seg_apply_all. That might work for segusage, but not for youth!
2411 If segsum for ssnum==0 held a reference to the youth block, that might
2412 help. Segstat on 'clean' or 'free' would imply a reference to that segsum.
2414 Is it OK to keep all youth/usage blocks for free/clean blocks
2415 pinned? We can currently have 810 entries. Only half will be clean/free.
2416 For each entry there can be two blocks, youth and usage. So that could be
2417 810 blocks. 1Meg? Normally much less. If it became a problem we could
2418 reduce the number dynamically I guess.
2420 maybe segusage blocks need to get phase_flipped, as other blocks do
2421 depend on them, pin_all_children wouldn't be able to find them though..
2423 1/ Any address on 'clean' or 'free' segtrack implies a refcount on the
2427 I think I want to link dirty block to the space in free segments that we
2428 actually know about. Each of those segments has youth and usage blocks
2429 pinned (at least parent pointer is active). So we have everything we need
2430 to write everything that is dirty. So 'free' or 'clean' implies
2431 a segsum reference which holds youth block.
2433 When we get low on space, we wait for cleaning/finding to progress.
2434 This would limit us to 400 segments, say 16Meg each, so 6Gig of dirty
2435 memory. I guess that we need to scale the 'free' list based on available
2438 When cleaning needs a segment, it needs to load the usage blocks for other
2441 When cleaning in the presence of snapshot we need to be careful never to
2442 duplicate a block that is shared. To allow for v.many snapshots, we don't
2443 even want to duplicate in memory.
2444 So we need to choose a 'primary' copy - probably first one found - and
2445 follow the peers link when possible...
2450 So clean and free segments in the list carry a SegRef. But it could be
2451 excessive if all of them did - we shouldn't be required to pin more
2453 So for segments with a usage of 0, we use the score to record if a
2454 segref is held. 0 means 'no', 1 means 'yes'.
2455 When space_alloc wants more space we need to find an entry and
2456 segref it. Maybe we want free lists - reffed and not-reffed.
2458 Then again, SegRefs are fairly cheap as they are heavily shared.
2459 maybe 512 to a block. If we hold 400 refs they could easily all be
2460 in one block. We could possibly encourage this by sorting the list
2461 and discarding from one end if it is too full.
2462 Sorting is a good idea definitely. It keeps youth/usage updates
2465 Just check the numbers.
2466 a 1TB device with 1K blocks might have 32MB segments, of which there
2467 would be 32768. 512 per block means 64 blocks or 16 pages (64K).
2468 So total segusage files is 128K plus snapshots. Not worth worrying
2470 For 16TB, that is 2Meg plus snapshots.
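
Checking those numbers with a few lines of C. The helper names are mine.
I am assuming 32MB segments, 512 entries per 1K block, and that the 128K
total means two tables (youth and usage) of 64K each.

```c
#include <stdint.h>

/* Number of whole segments on a device. */
static uint64_t seg_count(uint64_t dev_bytes, uint64_t seg_bytes)
{
    return dev_bytes / seg_bytes;
}

/* Bytes taken by one per-segment table, rounded up to whole blocks. */
static uint64_t table_bytes(uint64_t segs, uint64_t entries_per_block,
                            uint64_t block_bytes)
{
    uint64_t blocks = (segs + entries_per_block - 1) / entries_per_block;

    return blocks * block_bytes;
}

/* Youth plus usage: two tables of the same size (my assumption). */
static uint64_t youth_and_usage_bytes(uint64_t dev_bytes, uint64_t seg_bytes,
                                      uint64_t entries_per_block,
                                      uint64_t block_bytes)
{
    uint64_t segs = seg_count(dev_bytes, seg_bytes);

    return 2 * table_bytes(segs, entries_per_block, block_bytes);
}
```

For 1TB that is 32768 segments, 64 blocks (64K) per table and 128K for
both. For 16TB it comes to 2Meg, matching the figures above.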
2473 - keep a SegRef for all free and clean blocks.
2474 This must include a youthblk reference.
2475 - sort the free list when 'clean' is merged or when a pass
2479 merge as many as fit into free
2482 How is the code flow...
2483 add_cleanable is called during the periodic scan. It could hold
2485 add_cleanable calls add_clean as does lafs_get_cleanable during
2486 clean. That might block getting a segref, might even
2488 add_free is also called by seg_scan
2490 So seg_scan should get a segref and leave it with everything!
2493 A SegRef implies a 'struct segsum' for each segment. We don't
2494 want to allocate one of these for every segment in the table.
2495 We only want a reference to the youth and segusage block, which
2498 But these blocks need to be Pinned and SegReffed etc so we can
2499 write them at any time.
2502 The refcount held by the 'leaf' lru is a problem.
2503 While it holds a count we do not unpin an index block, so it cannot
2504 be removed from the list.
2505 Thus we can only remove from the leaf lru on a phase change.....
2506 Or when doing lru based flushing... Maybe we can remove from the
2507 lru while holding the checkpoint lock.
2508 This happens when truncating..
2510 No, that is just too messy as it is too easy to get put back on the list.
2512 Maybe the leaf lru should not imply a reference count ... or maybe
2513 we need to split the refcount: 'inuse' and 'active'.....
2514 How about we test refcnt against list_empty(->lru)...
2518 During truncate, we need each index block to get unpinned so they can
2520 But the InoIdx block is held pinned by the inode block being dirty.
2521 In this particular case, the InoIdx block is Invalid as the file is empty.
2522 But.... InoIdx should always be valid until after Inode is destroyed??
2526 I need to stop the cleaner and flush everything before trying to
2529 This is awkward though.
2530 The 'sync' of umount is done by kill_block_super, but I call
2531 that rather late, after checking that the tree is empty.
2532 There are pinned/dirty bits left after sync that we want to magically
2535 - segusage/youth blocks. Maybe if we don't seg_apply_all...
2536 - orphan block. Maybe don't mark it dirty when we remove things?
2537 - inode map?? why is that dirty
2539 - root directory is dirty still?? But it has been erased.
2540 InoIdx is valid-but-empty. Inode Data is dirty
2541 Data block 0 is Dirty at block 0.
2544 Ahh... need to mark page dirty when block is marked dirty !!
2546 The seg usage blocks are now flushed out but not incorporated.
2547 I feel that might be correct - we don't want to care about
2548 incorporation as we will never use it.
2549 For this, segusage and quota are very special cases.
2551 Inode map is no longer dirty, but is pinned
2552 Orphan does have a dirty block still
2553 The orphan table contains the root directory.
2554 root is now clean and gone
2556 Segusage doesn't get incorporated after last checkpoint now
2558 But now we have a circular reference for SegRef. This should not
2559 be surprising given the circular problems we had setting SegRef.
2560 I guess we just erase the references in the segsum table...
2563 Hurray!!! I can unmount without crashing!
2564 Now I need to sort through all the fixes required to achieve that
2565 and make discrete patches, and be sure it is all OK.
2567 DONE - (block.c) lafs_get_block should not have to lock that page just to do a lookup.
2568 DONE - (block.c) Mark page dirty when block becomes dirty
2569 DONE - (checkpoint.c) print orphan_slot with Orphan flag
2570 DONE - Don't incorporate segcount etc after final checkpoint
2571 DONE - Don't apply seg changes after final checkpoint.
2572 DONE - Don't start opportunistic checkpoint after final.
2573 DONE - (checkpoint) if InoIdx isn't dirty but InodeData is, then still allocate
2574 DONE - (checkpoint) when waiting, wait for checkpointneeded to get cleared
2575 DONE - (cluster) be more flexible about credit usage when flushing InoIdx
2576 DONE - (dir) do add_orphan when we abort as well as on success
2577 DONE - use inode_dec_link_count, not i_nlink--
2578 DONE - (file.c) lafs_writepage: remove from leafs when we cluster_allocate
2579 DONE - change %d/%d to strblk
2580 DONE - (index.c) refile: if B_IOLOCK, then it isn't on LRU
2581 DONE - (index) refile: when unpinning, remove from lru
2582 - lafs_refile: ->iblock can be non-null for inode 0.
2583 DONE - Make sure I_Deleting gets cleared when deleting finished.
2584 DONE - phase_flip should have something separate to call, not lafs_allocated_block
2585 - inode.c: lafs_dirty_inode: getref_lock used to get dblock
2586 NONO - ?? getref_locked allowed if PagePrivate
2587 DONE - segment: lafs_seg_put_all needed at unmount
2588 DONE - segdelete_all: need to put intable references
2589 DONE - lafs_free_get: put the intable references
2590 DONE - lafs_get_cleanable: put the intable references
2591 DONE - fix sort splitting in add_cleanable
2592 DONE - add lafs_empty_segment_table for unmount
2593 DONE - lafs_release: flush all dirty blocks
2594 DONE - lafs_release: force a final checkpoint
2595 DONE - lafs_release: move kill_block_super before final check
2596 DONE - lafs_put_super: release orphans and segsum files.
2597 DONE - lafs_destroy_inode: putref should be 'iblock'
2598 - lafs_destroy_inode: allow for iblock to be present but no ref held....
2599 DONE - can roll forward call lafs_allocated_block without dirty???
2602 - I've re-arranged lafs_release so that the flush is all done in
2603 generic_shutdown_super. However it calls invalidate_inodes, and that has
2604 problems with pinned inodes. So we need for fsync_super to checkpoint
2605 out all inodes that we don't hold our own reference to.
2606 If we do hold a reference, then invalidate_inodes will skip them,
2607 and ->put_super can be used to drop the references and perform the final
2609 fsync_super calls ->sync_fs after syncing all files. Maybe I can
2610 do some sort of checkpoint there...
2611 There almost is a checkpoint in there.... But only when called without
2613 I need to understand 's_dirt'.
2614 This is controlled entirely by the filesystem, common code only examines it.
2616 file_fsync (the generic 'fsync' method) will call ->write_super
2617 fsync_super will call write_super
2618 generic_shutdown_super will call write_super
2619 sync_supers will call write_super
2620 sync_filesystems(0) will call ->sync_fs
2622 twice from 'sync', once with '0', once with '1' for 'wait'.
2623 (though in emergency_sync, both are '0').
2624 once from unmount and remount with 'wait' set to '1'.
2625 We don't want two checkpoints for a 'sync', but we want to start
2627 Maybe if we get called with '0', we set a flag and treat the '1'
2628 differently.. There is no locking to make this really safe, but
2629 it will probably be OK... I could take a process_id, but then
2630 parallel 'sync's could race.
2631 write_super is called before the syncs. So it could start the checkpoint,
2632 and sync could wait for it.
2633 write_super is called multiple times at shutdown. We really need
2634 to utilise s_dirt to avoid some of these.
2635 We set s_dirt to 0 when we set CheckpointNeeded, and set it to 1:
2636 - when we pin a dblock or dirty a this-phase iblock.
2639 at unmount, we iput the root inode which de-references the dblock
2640 before clearing ->iblock, which fails an assertion ... why?
2641 Apart from the shrinker, ->iblock is only set to NULL in refile
2642 when we find an I_Destroyed inode... I guess the root block isn't
2643 getting Destroyed...
2644 The protocol for freeing iblocks is bad. Should be:
2645 - it only gets freed by the shrinker
2646 - when inode dies, set ->inode to NULL
2647 - when InoIdx iblock dies, set ->iblock to NULL
2650 So, what exactly is the protocol?
2651 - index blocks live either in the parent/sibling tree, or
2652 on the inode's free_index list
2653 - when refcnt is 0, they live on 'freelist.lru'. When refcount
2654 is elevated they stay on lru until they need to be
2655 added to some other lru (leafs or cluster)
2656 - when shrinker finds block on freelist.lru with non-zero refcnt,
2657 it just removes from lru
2658 - when shrinker finds free block, it removes from free_index and discards
2659 the block FIXME can refcnt=0 still have Pinned,Uninc,Realloc,Dirty ??
2660 I think not as such would either have children or be on an lru
2661 - When we destroy an inode, all index blocks get disconnected from the
2662 inode and freed. This must include the ->iblock
2663 - When an index block becomes free due to index tree shrinkage,
2664 we set the ->depth to -1 so that it cannot be found by mistake,
2665 and leave it for shrinker or inode destruction.
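The protocol above can be sketched as a tiny model. This is not the module's code: the struct, its fields and the function names are all hypothetical stand-ins, and locking is ignored entirely.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified view of an index block. */
struct iblk_model {
    int  refcnt;         /* active references */
    bool on_free_lru;    /* linked on freelist.lru */
    bool on_free_index;  /* linked on the inode's free_index list */
    int  depth;          /* -1 once dropped from the index tree */
};

enum shrink_result { SHRINK_SKIP, SHRINK_UNLINKED, SHRINK_DISCARDED };

/* What the shrinker does with one block it finds on freelist.lru. */
static enum shrink_result shrink_one(struct iblk_model *b)
{
    if (!b->on_free_lru)
        return SHRINK_SKIP;
    b->on_free_lru = false;
    if (b->refcnt > 0)
        return SHRINK_UNLINKED;   /* in use: only remove from the lru */
    b->on_free_index = false;     /* free: unhook and discard */
    return SHRINK_DISCARDED;
}

/* Index tree shrinkage: hide the block so a lookup cannot find it,
 * then leave it for the shrinker or inode destruction. */
static void index_block_dropped(struct iblk_model *b)
{
    b->depth = -1;
}
```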
2667 Confused about inode<->dblock dependence.
2668 We don't want the inode to refcnt the dblock as that wastes space.
2669 We don't want the dblock to refcnt the inode as that stops it from being freed.
2670 So each must disconnect from other when freed.
2672 inode takes private_lock, then checks dblock
2673 dblock cannot take private_lock before checking ->my_inode..
2674 Maybe: destroy_inode takes ref on dblock, then sets I_Destroyed, then
2678 Tracking down the 'credit' count and making sure it stays correct.
2679 It seems that I have a Dirty InoIdx block which is not pinned.
2680 Due to this it has no refcount and so the data block disappears so
2681 the InoIdx block is not visible in the tree. This isn't a definite bug
2682 but it means I cannot count credits properly.
2683 And surely Dirty index blocks must always be pinned!!??
2685 When a small file is flushed to the inode we were dirtying the
2686 iblock. That seems wrong - should dirty the dblock? Need to
2689 I got a hang in 'rm adir/4'.
2690 rm is in lafs_cluster_update_commit_both
2692 cleaner is in lafs_do_checkpoint+0xe4
2693 pdflush is in writepage/lafs_cluster_flush waiting on a lock
2694 so I guess cleaner is holding a mutex and waiting for something
2698 Hang again at 'seq 1 200' in 'cd /mnt/1/adir'.
2699 cleaner is at some point, holding a mutex to stop 'sh'.
2702 ahh.. prepare checkpoint holds wc[0].lock while waiting for checkpoint
2704 So when something locks the checkpoint and needs to flush, we have problems....
2707 I seem to have fixed the above. Now:
2708 Free space is a real problem. When I remount after the successful unmount,
2709 we find a usage pattern like:
2710 CLEANABLE: 0/0 y=10 u=34179
2711 CLEANABLE: 0/1 y=0 u=65144
2712 CLEANABLE: 0/2 y=0 u=65535
2713 CLEANABLE: 0/3 y=32773 u=32910
2714 CLEANABLE: 0/4 y=32772 u=149
2715 CLEANABLE: 0/5 y=0 u=0
2716 CLEANABLE: 0/6 y=32770 u=16529
2717 CLEANABLE: 0/7 y=32769 u=35084
2718 CLEANABLE: 0/8 y=32768 u=31877
2720 Which is ridiculous.
2721 Better fix up what I have first...
2724 In rm /mnt/1/nbfile* we hang..
2725 rm is in lafs_phase_Wait from pin_dblock in unlink
2726 wait for [ce5c2d20]277/0(0)r2F:Index(1),Pinned,Phase0,InoIdx,Valid,Dirty,CN,CNI,UninCredit,PhysValid{0,0}[8] pindb(1) leaf(1)
2728 cleaner is in lafs_iolock_block from add_block_address in phase_flip
2729 iowait for [ce5c33b0]286/0(0)r6E:Index(1),Pinned,Phase1,WPhase0,InoIdx,Valid,Dirty,CN,CNI,UninCredit,IOLock,PhysValid{0,0}[1] child(5) leaf(1)
2731 So cleaner is probably deadlocking against itself via iolock_block.
2733 - in lafs_invalidate_page just to wait for any io - it isn't held long
2734 - in lafs_erase_dblock while we erase and 'allocated_block'
2735 - in lafs_get_flushable to protect blocks being checkpointed
2736 - in lafs_writepage to call cluster_allocate (which releases), both for
2737 data block or for inode when data was flushed there.
2738 - lafs_add_block_address to process pending incorporations to make room.
2739 This is what is trapping the cleaner.
2740 - lafs_inode_handle_orphan when truncate finishes to erase_iblock
2741 - lafs_inode_handle_orphan again to incorporate all removal
2742 - and again to erase_iblock
2743 - and for partial truncate to incorporate some removals
2745 - lafs_new_inode to keep it from being cleaned while being created
2746 - roll_block to add addresses
2747 - lafs_load_block during IO
2749 So: who holds it?.... let's use the code to find out...
2750 And the answer is : lafs_get_flushable.
2751 So get_flushable iolocks the block then calls phase_flip which tries to
2752 incorporate other-phase children which try to iolock the block. Deadlock.
2753 Do we need to hold iolock during phase_flip ??. Not for all of it..
2756 FIXME When erasing a block, do I need an uninc credit? I usually don't
2757 have one and the need certainly isn't as great...
2759 Now... let's try to get free space accounting right.
2761 - unlink sometimes failed with ENOSPC
2762 - usage scan shows segments with enormous usage - 23039!!
2764 no credits: [ce9a55cc]16/1(2651)r11E:Pinned,Phase1,WPhase1,Valid,Dirty,SegRef,C,CI,CN,CNI,UninCredit,IOLock,PhysValid intable(7) ss(3) cluster(1)
2765 no credits: [cfb695cc]16/1(1840)r12E:Pinned,Phase0,WPhase0,Valid,Dirty,SegRef,C,CI,CN,CNI,UninCredit,IOLock,PhysValid intable(7) ss(4) cluster(1)
2767 no uninc credit [cfb695cc]16/1(2019)r13F:Pinned,Phase0,WPhase0,Valid,Dirty,SegRef,C,CI,CN,CNI,IOLock,PhysValid intable(7) ss(5) cluster(1)
2770 after umount/remount df says "4608 7 1544" but cannot
2772 df: tot=4608 free=4601 avail=1544(1564-20) cb=7 pb=0 ab=0
2773 ============= Cleanable table (7) =================
2774 pos: dev/seg usage score
2790 --------------- Free table (1) ---------------
2792 --------------- Clean table (0) ---------------
2793 CLEANABLE: 0/0 y=10 u=1
2794 CLEANABLE: 0/1 y=32775 u=3
2795 CLEANABLE: 0/2 y=32774 u=2
2796 CLEANABLE: 0/3 y=32773 u=1
2797 CLEANABLE: 0/4 y=0 u=0
2798 CLEANABLE: 0/5 y=32771 u=1
2799 CLEANABLE: 0/6 y=32770 u=6
2800 CLEANABLE: 0/7 y=32769 u=2
2801 CLEANABLE: 0/8 y=32768 u=3
2806 FIXED 1/ Data 16/1 is being Reallocated, but is Dirty, not Realloc
2807 Gone,presume FIXED 2/ Data 16/1 has no uninc credit in cluster_flush
2808 3/ in cleaner, ->dblock is uninitialised.... actually inode has been freed.
2809 4/ invalidate_page find Realloc set, even after iolock ..
2810 This is during umount in generic_shutdown/lafs_put_super/iput
2815 If we flag a block for Realloc then Dirty before it is allocated,
2817 But if we have already allocated to a cleaning cluster... what happens?
2818 We need to treat this like it was dirtied after being written, so
2819 it gets written to a regular cluster as well.
2820 As we only have one uninc bit for both Dirty and Realloc, we need
2821 to *not* incorporate the Realloc update if the block is still dirty.
2823 - block gets chosen for cleaning and allocated to a clean-cluster
2824 - block gets marked dirty. This must not clear Realloc
2825 - cluster is flushed, block is dirty, so don't call lafs_allocated_block
2826 - Return the Realloc credit, but keep dirty and Uninc.
2827 Is there a race if Dirty is set after we enter lafs_allocated_block?
2828 As long as the index block gets marked Dirty, not Realloc we might
2829 be safe... though it gets awkward if the Dirty writeout falls in to
2830 the next phase. But reserve_block will have provided NCredits for that.
2832 1/ don't clear Realloc when setting Dirty
2833 2/ do clear Realloc if cleaner finds the block is Dirty
2834 3/ avoid calling lafs_allocated_block when cleaning a dirty block.
2835 This is an optimisation.
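The three rules can be written down as flag operations. A minimal sketch only: F_DIRTY and F_REALLOC stand in for B_Dirty and B_Realloc, the function names are invented, and credit accounting is left out.

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-ins for B_Dirty and B_Realloc. */
#define F_DIRTY   (1u << 0)
#define F_REALLOC (1u << 1)

/* Rule 1: setting Dirty does not clear Realloc. */
static unsigned mark_dirty(unsigned flags)
{
    return flags | F_DIRTY;
}

/* Rules 2 and 3: when the cleaning cluster is flushed, Realloc is
 * cleared either way, but a Dirty block skips the allocated-block
 * step and waits for its regular writeout.  Returns true when the
 * new address should be incorporated now. */
static bool cleaner_flush(unsigned *flags)
{
    bool was_realloc = (*flags & F_REALLOC) != 0;

    *flags &= ~F_REALLOC;
    if (*flags & F_DIRTY)
        return false;
    return was_realloc;
}
```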
2837 Almost... A B_Realloc block no longer has B_Credit so B_Dirty cannot be
2842 When cleaning blocks we hold no reference to the inode and it can disappear.
2843 We don't want to hold the inode active, but need a reference much like
2844 the truncate code has.
2845 I think we need a subordinate refcount for both cleaning and truncate.
2846 These hold inode present but not active.
2847 Maybe every block->inode should be counted like this.
2848 And this might simplify the my_inode->dblock inter-relationship.
2850 We need to ensure that if a new iget is called on an inode that still
2851 exists, we don't allocate a new one but just reuse the old.
2852 But that won't work as we cannot add an inode back into the hash table.
2853 So I think when cleaning a block we need to ref the inode.
2854 i.e. B_Realloc implies an i_grab
2857 So I have a problem with the cleaner wanting to hold an inode that
2858 the VFS is destroying.
2859 I don't want the cleaner to hold i_count as that delays truncate etc.
2860 So we need a second counter subordinate to i_count.
2861 This is held by the cleaner and by delayed truncate, and by i_count.
2862 Possibly ->my_inode holds this, which means it can be a single bit...
2864 When a lookup wants an inode, we need to load the inode data block and
2865 see if it has my_inode. If it does, we insert that inode in to the
2866 hash table. If not we fall back to regular inode creation....
2868 On reflection, that is too complicated and hard and error prone.
2869 When relocating a file we need the data so it had best be in the page
2870 cache so the filesystem really needs to know that the inode is still
2872 So cleaning needs to keep a reference to the inode.
2873 The cost of this is that if an inode is being deleted while it is
2874 being cleaned the truncate cannot happen until the cleaning
2875 completes. This means that space usage will be wrong.
2876 When nlink becomes zero we can drop the cleaner reference. When
2877 the inode is dropped/destroyed we can tie the cleaning in with the
2878 delayed truncate so that the final destruction doesn't happen until
2879 the cleaner has let go.
2881 So: how to track that the cleaner has a reference to the inode?
2882 Maybe every B_Realloc block owns a ref on the inode.... but dropping
2883 those references when i_nlink hits zero would be difficult.
2884 They could hold a secondary refcount which, if non-zero, implies a
2888 - Set B_Cleaning when we look at a block for cleaning, and clear
2889 it when we find Realloc clear and ....????
2890 - Whenever a block has B_Cleaning set, it holds a counted reference
2891 on LAFSI(b->inode)->cleaner_ref
2892 - When cleaner_ref is non-zero and I_Deleting is not set, we hold
2893 a reference on the inode (i_grab).
2894 - when i_nlink hits zero, set I_Deleting and drop any reference
2895 held by the cleaner.
2896 DONE - cleaner must be careful not to process any block that has been
2897 truncated, or file that is dead.
2898 DONE - Make sure the cleaner doesn't start up after the FinalCheckpoint.
2899 - What about filesystem inode... how do they fit in??
2902 Question. When are the index blocks for an inode flushed?
2903 We need to have them gone when the inode disappears.
2904 For deleted inodes, this happens in background truncate.
2905 For memory-pressure inodes it will hopefully happen well in advance,
2906 but we need to make sure in destroy_inode that everything is
2910 Thinking again about B_Cleaning, any B_Realloc block will hold a
2911 reference through to InoIdx and so dblock will be present and the
2912 inode won't be freed. So we only need an extra reference during
2913 the first little phase of cleaning when we are collecting blocks.
2914 After that a reference can be useful as it will delay flushing so it
2915 can be more efficient...
2917 Maybe this is all much simpler than I thought.
2918 If we hold a ref on the inode whenever the InoIdx block is Pinned
2919 and i_nlink is non-zero, then we won't be forgotten until all
2920 index blocks are written. We may still be deleted, but as that
2921 is one-way we can hold on to the inode at little cost.
2923 getting/putting that ref at exactly those times turns out to be
2925 It might be best to have a flag to say "We hold an extra ref".
2926 Then we occasionally call a function that validates the setting.
2927 It is most important to drop the count at the right time, so
2928 after unlink/rmdir/rename and when B_Pinned is dropped.
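A sketch of the "flag plus validate" idea, with a made-up inode model. The field names and validate_extra_ref are hypothetical, and the real thing would need locking around the check.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the state that matters here. */
struct inode_model {
    int  i_count;        /* reference count */
    int  i_nlink;        /* link count */
    bool inoidx_pinned;  /* InoIdx block has B_Pinned */
    bool extra_ref;      /* "we hold an extra ref" flag */
};

/* Bring the extra reference in line with the rule: held exactly
 * when the InoIdx block is pinned and i_nlink is non-zero. */
static void validate_extra_ref(struct inode_model *ino)
{
    bool want = ino->inoidx_pinned && ino->i_nlink > 0;

    if (want && !ino->extra_ref) {
        ino->i_count++;
        ino->extra_ref = true;
    } else if (!want && ino->extra_ref) {
        assert(ino->i_count > 0);
        ino->i_count--;
        ino->extra_ref = false;
    }
}
```

Calling this after unlink/rmdir/rename and whenever B_Pinned is dropped would cover the cases where dropping the count matters most.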
2931 set_phase which is called from:
2932 lafs_cluster_allocated when moving 'pin' across to data block
2933 so don't need checkpin
2935 only need check_pin if dropping spinlock
2937 only pins data blocks (Index are already pinned if relevant).
2939 where "inoidx block pinning" doesn't change
2942 do_incorporate_internal
2944 So only need check in lafs_pin_block_ph and maybe pin_all_children...
2947 - credits get out of sync from
2948 lafs_incorporate->refile->space_return from checkpoint.
2949 counter is one more than we can find.
2951 i [cfb9aaf0]327/0(2261)r1E:Index(1),Valid,PhysValid[0] NP
2952 Note it is an Index but not InoIdx. The parent is still in the tree.
2956 delete_inode -> truncate -> invalidate_page->erase_dblock->space_return
2959 - BUG credits<0 in space_return from lafs_incorporate from add_block_address
2961 Just Grew [cfbb5c70]331/0(NoPhysAddr)r2E:Index(2),Pinned,Phase1,InoIdx,Valid,Dirty,UninCredit{0,1}[0] child(1) inc(1)
2962 from [cfbb5a40]331/0(0)r205F:Index(1),Pinned,Phase1,WPhase0,Valid,Dirty,CN,CNI,UninCredit,IOLock,PhysValid{0,0}[0] child(203) leaf(2)
2963 msg: (1,3,1)(1,1,-1)
2965 ib = [cfbb5a40]331/0(0)r205F:Index(1),Pinned,Phase1,WPhase0,Valid,Dirty,CN,CNI,UninCredit,IOLock,PhysValid{0,0}[0] child(203) leaf(2)
2967 This is a predicted but not handled problem.
2968 The answer is that not all blocks need ICredit/UnincCredit.
2969 The purpose of this credit is to allow for a split in the parent.
2970 pre-existing index blocks can never split the parent themselves
2971 If an index block becomes full, it will split and this might split
2973 If an index block has free space, then it will only over flow if it
2974 gets multiple child updates and this will provide multiple credits.
2975 So an index block with space for 3 or more new addresses does not need
2976 an ICredit/UnincCredit. So when we split we don't need to provide an
2979 When we have a full InoIdx block and a single new child with 1 UnincCredit,
2980 each block already is either 'Dirty' or has a 'Credit', and the InoIdx has
2981 an ICredit, then create a new intermediate such that
2982 InoIdx is Dirty and has an ICredit
2983 New Index is Dirty with no ICredit - it used the UnincCredit
2984 New child loses its UnincCredit
2985 When another block in the new index arrives, its UnincCredit is used to
2988 When a leaf block cannot fit a single address it will have ICredit.
2989 The block is split so that each has 3 spaces and so do not need ICredit,
2990 but as soon as ICredit is available, they take it.
2992 Worst case is that every ancestor is full and the leaf is split
2993 We then get two full branches, each block half empty so not needing ICredit.
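The rule of thumb for when a block needs the credit, and the effect of a split, can be sketched like this. The threshold of 3 comes from the notes above; the capacity parameter and both function names are made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* A block with room for 3 or more new addresses needs no
 * ICredit/UnincCredit of its own. */
static bool index_needs_icredit(int free_slots)
{
    assert(free_slots >= 0);
    return free_slots < 3;
}

/* Free slots in each half after splitting a full block that
 * holds 'capacity' addresses. */
static void split_free_slots(int capacity, int *left_free, int *right_free)
{
    int left_used = capacity / 2;
    int right_used = capacity - left_used;

    *left_free = capacity - left_used;
    *right_free = capacity - right_used;
}
```

For any capacity of 6 or more, both halves come out with at least 3 free slots, which is the "half empty so not needing ICredit" case.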
2997 free data being used in lafs_refile from cleaner.
2998 b->inode->i_sb is 0x6b6b6b6b, so inode has been freed before cleaner frees it.
2999 Answer: lafs_refile was dereferencing ->inode when it wasn't safe.
3000 Need to at least have a parent before it is safe.
3003 soft lockup cleaner->lafs_iget->ifind_fast ....
3004 Then (may be caused)
3005 Oh dear: [cfb63670]284/0(0)r1E:IOLock,PhysValid cleaning(1)
3006 .......: [cfb2ad20]284/0(2198)r2F:Index(1),Pinned,Phase1,InoIdx,Valid,PhysValid{0,0}[0] child(1) leaf(1)
3007 Why have I no credits? [cfb2ad20]284/0(2198)r2F:Index(1),Pinned,Phase1,InoIdx,Valid,Dirty,PhysValid{0,0}[0] child(1) leaf(1)
3008 ------------[ cut here ]------------
3009 kernel BUG at /home/neilb/work/nfsbrick/fs/module/block.c:656!
3011 It seems the cleaner gets confused and goes spinning.
3015 After the run, we have -14 used and 2055 available (of 4608), and
3016 cannot create anything.
3017 4 segments are free, one is cleanable.
3018 free_blocks=2103 allocated=56 max_seg=512 clean_reserved=0
3020 free_blocks=1722 allocated=64 max_seg=512 clean_reserved=0
3022 df: tot=4608 free=4630 avail=977(1033-56) cb=10 pb=0 ab=-32
3023 free_blocks=1033 allocated=56 max_seg=512 clean_reserved=0
3024 and very little free
3026 ablocks_used is going negative - why?
3027 Probably we erase a dblock without clearing Prealloc.
3028 Then when Prealloc later gets cleared, ablocks_used is
3029 wrongly decremented.... no...
3032 10aug2009 (don't forget above problems)
3034 read->touch_atime->dirty_inode->inode_fillblock->dirty_dblock
3035 getiref_lock triggers BUG.
3036 This is presumably because I have just fixed it to get the correct
3037 iblock and not the iblock of the filesystem.
3039 FIXME I hacked around this but I'm not sure the result is right.
3040 The question is about when the InoIdx should be dirty and when
3041 the inode data block should be dirty.
3042 In this particular case we are writing a page of a small file.
3043 cluster_allocate calls flush_data_to_inode which tried to dirty
3044 the inode dblock but finds that iblock is not pinned...
3045 When we dirty a data page we aren't pinning the parent!
3046 That might be OK - we only need to count and reserve the parent.
3047 We don't need to pin it until it becomes dirty.
3049 Still need to resolve when which block gets to be dirty, and also
3050 exactly when an index block needs to be pinned. And how does that
3051 relate to holding a ref on the inode when the inoidx is pinned.
3052 Maybe it should be when the inoidx is referenced.
3056 Another problem. unlink->handle_orphans->erase_dblock->allocated_block
3057 and get a zero from lafs_add_block_address but parent is not pinned.
3058 And... On unmount, orphan file still has pinned blocks so the inode
3060 And ... root still old phase after lots of 'rm' then sync.
3061 Inode 244 has pinned inode block held by writepage0 and writepage
3065 - lots of bugs introduced by change to marking inode blocks dirty:
3066 writepage/cluster_allocate wants to Dirty inode data block with no credits.
3067 because I put credit in iblock!
3069 - ohhh.... The phase contour is broken. When a block is added to a
3070 cluster for allocation it isn't in the phaseleafs any more, but prevents
3071 its parent from joining. So we cannot assume that if dblock is on
3072 list then iblock or a child will be too.
3073 So when we find dblock we do need to remove it.... done that.
3075 - root not changing because Data 1/0 is Pinned and IOPending
3076 and held by writepage!!
3077 Problem is that IOPending blocks aren't put back on lru.
3078 But that should only be blocks on the cluster list.....
3079 But that is where I am putting it.
3080 Maybe I need exclusion between checkpointing and any other
3081 code that writes to checkpoint so checkpoint can wait
3082 for that ... can we use wc->lock?? That doesn't lock
3083 against cleaner, but that isn't a problem...
3084 But now 0/228 is still pinned and in writepage and IOPending
3085 So there is more to it than that.
3086 When checkpoint finds an IOLocked block, it might be about to
3087 join a cluster, in which case we don't really want to wait, or it
3088 might be undergoing incorporation in which case we want to wait.
3089 or it could be being erased, so wait..
3090 Maybe I wait until it appears on some list.... yes.
3093 At unmount Index 8/0 with child and leaf is still pinned
3094 This was pinned: [cfb29810]8/0(9)r3F:Index(1),Pinned,Phase0,InoIdx,Valid,SegRef,C,CI,CN,CNI,PhysValid{0,0}[0] release(1) child(1) leaf(1)
3098 A problem is that something goes wrong in the erase process.
3099 We find new children after we erase the inoidx block!
3101 This was pinned: [cfb3d810]8/0(9)r284016F:Index(1),Pinned,Phase0,InoIdx,Valid,SegRef,C,CI,CN,CNI,PhysValid{0,0}[0] release(1) child(1) leaf(284014)
3103 When/how do we erase indexblock and particularly inoidx blocks?
3104 Does an invalid InoIdx simply mean there is no indexing and does not
3105 reflect on the Data block?
3107 .xlooping on [cfbe28c0]331/0(0)r2F:Index(1),Pinned,Phase1,InoIdx,Valid,Dirty,SegRef,C,CI,CN,CNI,UninCredit,PhysValid{0,0}[0] inode_handle_orphan(1) leaf(1)
3112 VFS: Busy inodes after unmount of hdb. Self-destruct in 5 seconds. Have a nice day...
3113 This was pinned: [cfb2ad20]0/0(13)r5F:Index(1),Pinned,Phase1,InoIdx,Valid,Root,C,CI,CN,CNI,PhysValid{0,0}[0] NP release(1) child(3) leaf(1)
3114 [cfb2ad20]0/0(13)r5F:Index(1),Pinned,Phase1,InoIdx,Valid,Root,C,CI,CN,CNI,PhysValid{0,0}[0] NP release(1) child(3) leaf(1) Leaf1(0)
3115 [cfa5374c]0/1(772)r0E:Valid,Dirty,UninCredit,PhysValid
3116 [cfb54430]0/8(775)r0E:Valid,Dirty,SegRef,UninCredit,PhysValid
3117 [cfb54c90]0/16(777)r0E:Valid,Dirty,SegRef,C,CI,UninCredit,PhysValid
3122 VFS: Busy inodes after unmount of hdb. Self-destruct in 5 seconds. Have a nice day...
3123 This was pinned: [cfb273b0]8/0(9)r3F:Index(1),Pinned,Phase1,InoIdx,Valid,SegRef,C,CI,CN,CNI,PhysValid{0,0}[0] release(1) child(1) leaf(1)
3124 [cfb273b0]8/0(9)r3F:Index(1),Pinned,Phase1,InoIdx,Valid,SegRef,C,CI,CN,CNI,PhysValid{0,0}[0] release(1) child(1) leaf(1) Leaf1(0)
3125 [cfb74c90]8/0(2404)r1E:Valid,SegRef,CN,PhysValid orphan(1)
3126 badcnt 0 0 [cfb268c0]0/0(13)r4E:Index(1),Pinned,Phase1,InoIdx,Valid,Root,C,CI,CN,CNI,PhysValid{0,1}[0] NP child(4)
3129 erase Inoidx [ce5ab3b0]172/0(402)r1F:Index(1),InoIdx,Valid,IOLock,OnFree,PhysValid[0] inode_handle_orphan(1)
3130 erase Inoidx [ce5ab5e0]74/0(0)r2F:Index(1),Pinned,Phase0,WPhase0,InoIdx,Valid,Dirty,SegRef,C,CI,CN,CNI,UninCredit,IOLock,PhysValid{0,0}[0] child(1) leaf(1)
3131 ------------[ cut here ]------------
3132 WARNING: at /home/neilb/work/nfsbrick/fs/module/block.c:579 lafs_erase_iblock+0x
3133 unlink/orphan/erase_dblock_allocated_block
3134 ---[ end trace 61b8bd59512ea4da ]---
3135 zz [ce50d6a8]74/1059005010(0)r1E:SegRef,C,CI,UninCredit,IOLock,PhysValid,Orphan(0) orphan(1)
3136 [ce5ab5e0]74/0(0)r1E:Index(1),InoIdx,SegRef,C,CI,CN,CNI,UninCredit,PhysValid[0] child(1)
3137 [ce5ab5e0]74/0(0)r1E:Index(1),InoIdx,SegRef,C,CI,CN,CNI,UninCredit,PhysValid[0] child(1)
3138 ------------[ cut here ]------------
3139 kernel BUG at /home/neilb/work/nfsbrick/fs/module/index.c:1955!
3141 BINGO. When we remove last entry from directory we erase the InoIdx block,
3142 then when we add entries, we hit problems.
3150 VFS: Busy inodes after unmount of hdb. Self-destruct in 5 seconds. Have a nice day...
3152 This was pinned: [cfb2caf0]16/0(10)r3F:Index(1),Pinned,Phase1,InoIdx,Valid,SegRef,C,CI,CN,CNI,PhysValid{0,0}[0] release(1) child(1) leaf(1)
3153 [cfb2caf0]16/0(10)r3F:Index(1),Pinned,Phase1,InoIdx,Valid,SegRef,C,CI,CN,CNI,PhysValid{0,0}[0] release(1) child(1) leaf(1) Leaf1(0)
3154 [ce9794f0]16/0(2200)r1E:Valid,SegRef,CN,CNI,PhysValid cleaning(1)
3156 This was pinned: [cfb2d3b0]8/0(9)r3F:Index(1),Pinned,Phase1,InoIdx,Valid,SegRef,C,CI,CN,CNI,PhysValid{0,0}[0] release(1) child(1) leaf(1)
3157 [cfb2d3b0]8/0(9)r3F:Index(1),Pinned,Phase1,InoIdx,Valid,SegRef,C,CI,CN,CNI,PhysValid{0,0}[0] release(1) child(1) leaf(1) Leaf1(0)
3158 [ce968c90]8/0(2175)r3E:Valid,SegRef,C,CI,CN,CNI,PhysValid orphan(3)
3160 This was pinned: [cfb2d180]1/0(8)r3F:Index(1),Pinned,Phase1,InoIdx,Valid,SegRef,C,CI,CN,CNI,PhysValid{0,0}[0] release(1) child(1) leaf(1)
3161 [cfb2d180]1/0(8)r3F:Index(1),Pinned,Phase1,InoIdx,Valid,SegRef,C,CI,CN,CNI,PhysValid{0,0}[0] release(1) child(1) leaf(1) Leaf1(0)
3162 [ce968430]1/0(2174)r1E:Valid,SegRef,C,CI,PhysValid cleaning(1)
3164 We have stray 'cleaning' references.
3166 on a data block that was in a to-clean segment
3167 at which point we igrab the inode
3168 the block is put on the ->cleaning list.
3170 when we get an error finding the block
3171 when we find that it isn't in the segment
3172 when an error occurs loading the block-to-be-relocated
3173 and when we mark that block for cleaning.
3174 i.e. always unless we got EAGAIN or some space error.
3175 If we still hold some blocks, try_clean returns 0.
3177 VFS: Busy inodes after unmount of hdb. Self-destruct in 5 seconds. Have a nice day...
3178 This was pinned: [cfb253b0]0/0(13)r5F:Index(1),Pinned,Phase1,InoIdx,Valid,Root,C,CI,CN,CNI,PhysValid{0,0}[0] NP release(1) child(3) leaf(1)
3179 [cfb253b0]0/0(13)r5F:Index(1),Pinned,Phase1,InoIdx,Valid,Root,C,CI,CN,CNI,PhysValid{0,0}[0] NP release(1) child(3) leaf(1) Leaf1(0)
3180 [cfa57b7c]0/1(2228)r0E:Valid,Dirty,UninCredit,PhysValid
3181 [ce5a4430]0/8(2231)r0E:Valid,Dirty,UninCredit,PhysValid
3182 [ce5a4c90]0/16(1028)r0E:Valid,Dirty,SegRef,C,CI,UninCredit,PhysValid
3184 NOTE these inode data blocks are not pinned and so did not get written!!
3186 FIXME I should wait for the checkpoint to finish
3190 VFS: Busy inodes after unmount of hdb. Self-destruct in 5 seconds. Have a nice day...
3191 This was pinned: [cfb27810]8/0(9)r3F:Index(1),Pinned,Phase0,InoIdx,Valid,SegRef,C,CI,CN,CNI,PhysValid{0,0}[0] release(1) child(1) leaf(1)
3192 [cfb27810]8/0(9)r3F:Index(1),Pinned,Phase0,InoIdx,Valid,SegRef,C,CI,CN,CNI,PhysValid{0,0}[0] release(1) child(1) leaf(1) Leaf0(0)
3193 [ce564c90]8/0(3983)r1E:Valid,SegRef,PhysValid orphan(1)
3196 When I clean and find an inode that is already deleted, I need to be
3197 very careful not to resurrect anything.. I wonder if I am.... Yes, I seem
3198 to be. lafs_delete_inode gets called a lot, but mostly for dead inodes.
3201 FIXED orphans don't get cleaned up. It seems a 'create' fails and leaves
3202 an orphan block un-released.
3203 - sometimes inodes 1,8,16 didn't get written out as they were dirty but not pinned
3204 - Not sure that we handle complete truncation, then adding blocks properly.
3205 - what should the state of the InoIdx block be?
3206 - On remount, the filesystem contains rubbish.
3207 - create fails even when there should be free space.
3208 - sometimes BUG in checkpoint.c - not finishing checkpoint properly...
3209 - iblock not valid for inode 327 under cluster_flush/lafs_allocated_block
3210 and 74 has similar issue
3211 327 = adir/big1 74=adir
3215 Segusage blocks aren't always Pinned when we make them dirty.
3216 Yes. That is correct. They are not forced out by phase change but by
3217 lafs_seg_flush_all at the end of a checkpoint. So they need to be
3218 preallocated, but not Pinned.
3219 But, once we have finished the last checkpoint we don't want to
3220 dirty Segusage blocks any more.. I wonder if we are.
3221 No, but we were Pinning inodes without PinPending and they
3222 lost the pinning straight away!
3224 OK, other annoyance.
3225 InoIdx block and similar are getting erased at the wrong
3227 We can only safely erase them when they have no children.
3228 I guess what we really want is that incorporation leaves them
3229 existing but empty, and when we go to write them out, if they
3230 are empty we register an address of 0.
3231 When we drop the ->parent pointer of an Index block it
3234 When incorporate or truncate produces an empty index block
3235 it simply clears B_Valid.
3236 When incorporate wants to add to an index block, we set B_Valid.
3237 When cluster_allocate gets a non-Valid index block it calls
3238 block_allocated with phys of 0.
3240 Yes, that seems to work. Mostly
3243 On remount, check_credits dies: 16/20-0
3244 In pin_dblock/reserve/seg_ref/prealloc/space_return ?? in lafs_mount.
3247 OK, this index block clearing is a mess. There must be a neat model I can
3248 follow that will make it "just work".
3249 The key seems to be children. If an index block has children, then it
3250 really must exist. If it has no children and no content, then it can
3251 be discarded, in which case it needs to be unlinked from its sibling list.
3252 What locking do we use here? Probably IOLock on the parent index block.
3253 So we need iolock while looking in a parent for children, and we take
3254 IOLock while incorporating or pruning.
3255 Once the empty index block has dropped out it will never be found again.
3256 When we incorporate the zero address, the index block becomes invisible
3257 unless it is shortly after its predecessor in the sibling list. But
3258 that is hard to ensure, especially if the first child is the one that
3259 is being erased. So if an index block is erased, then it must be
3260 discarded quickly and any children need to be relocated...
3261 Or maybe not.... maybe if there are children, we just write an empty block?
3264 We need better locking of the index information.
3265 It seems best to use IOLock as that is already held during incorporation.
3266 So any code that accesses or updates an index block must hold IOLock.
3267 This might be a bit of a restriction if we try to do a lookup while
3268 writeout is happening.... Maybe we need a separate writeback flag for that.
3269 But I think it is good to use IOLock for now.
3270 Places we need this are:
3271 flush_data_to_inode needs to lock the InoIdx block
3273 lafs_leaf_find as it recurses down. This should return a locked leaf.
3275 callers of clear_index
3276 erase_dblock for depth=0??
3278 incorporate should lock new blocks for consistency
3281 Locking dependency rule is that if we hold a lock, we are allowed to
3282 lock a child index block, but not a parent. If we hold a data block,
3283 we are allowed to lock an index block.
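The dependency rule above can be written down as a small check. This is only a model sketch in Python, not kernel code: the `Block` class, `is_ancestor` and `may_lock` are hypothetical names, and reading "child" as "any descendant" is an assumption.

```python
class Block:
    """A block in the index tree; `kind` is "index" or "data"."""

    def __init__(self, name, kind, parent=None):
        self.name = name
        self.kind = kind
        self.parent = parent


def is_ancestor(a, b):
    """Return True if `a` is a strict ancestor of `b` via ->parent links."""
    p = b.parent
    while p is not None:
        if p is a:
            return True
        p = p.parent
    return False


def may_lock(held, target):
    """Apply the dependency rule while already holding a lock on `held`."""
    if target.kind != "index":
        return False          # the rule only grants locks on index blocks
    if held.kind == "data":
        return True           # holding a data block: an index block is allowed
    # Holding an index block: only downward (child) locking is allowed.
    # Assumption: "child" is taken to mean any descendant.
    return is_ancestor(held, target)
```

With an InoIdx root, one leaf index block under it and one data block under the leaf, `may_lock(root, leaf)` holds while `may_lock(leaf, root)` does not.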
3286 The read/write completion seems all wrong. It unlocks if the page was locked,
3287 and that isn't really safe, because it might not have been locked for read..
3288 We need to flag block0 to say if lock or writeback need to be cleared.
3289 Given that, I don't need IOPending any more:
3290 Read: We submit all reads, then set 'do_unlock', then check if we should unlock.
3291 Write: We queue all writes, then set 'do_clear_writeback', then check.
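A model of the "submit everything, then set the flag, then check" pattern for the read side. This is a sketch under assumptions: the class name, the `outstanding` counter and the method names are hypothetical, and only `do_unlock` comes from the note above. The write side would be the same shape with 'do_clear_writeback'.

```python
class Block0Flags:
    """Models the state kept on block0 of a page for read completion."""

    def __init__(self):
        self.outstanding = 0      # assumed count of reads still in flight
        self.do_unlock = False    # set only after all reads are submitted
        self.page_locked = True   # the page was locked for read

    def submit_read(self):
        self.outstanding += 1

    def all_reads_submitted(self):
        # Submitter side: set the flag, then check whether to unlock.
        self.do_unlock = True
        self._maybe_unlock()

    def read_completed(self):
        # Completion side: runs once per finished read.
        self.outstanding -= 1
        self._maybe_unlock()

    def _maybe_unlock(self):
        # Unlock only when the flag is set and nothing is in flight, so an
        # early completion cannot unlock a page that is still being set up.
        if self.do_unlock and self.outstanding == 0:
            self.page_locked = False
```

The point of the flag is that completions arriving before the last submit never unlock the page; whichever side notices "flag set and nothing outstanding" last does the unlock.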
3293 Now... can we use a writeback flag to avoid waiting to read while writeout
3294 is happening? We would need:
3295 set writeback in cluster_allocate
3296 wait_writeback after some lock_block
3297 clear_writeback when writeout finishes.
3298 Extra checks where we already check for IOLock
3302 Lots of progress but....
3303 cluster_flush calls cluster_done calls refile calls iput calls
3304 drop_inode calls write_inode_now calls writepage calls cluster_flush
3305 and we get a locking loop.
3306 I think we need to run that cluster_done from a different thread.
3309 We seem to have a refcnt problem with segsum.
3312 Lots more progress but.....
3314 orphan_release is finding that the orphan block has no credits.
3315 We can allocate credits and simply not do the update if they
3316 are not available: having an extra entry in the orphan file isn't
3317 a problem. However we need some mechanism to clean up other than
3318 waiting for a remount..
3319 I think we leave that until we redo orphan handling.
3321 and: adir sometimes loses one block so it and the contents don't get
3324 and: it seems we sometimes try to clean the segment being written
3325 to. We must avoid that.
3328 FIXME When pin fails, we need to remove PinPending from everything!!!
3329 and never followed up ... I wonder?
3334 Every orphan block goes on a per-fs list and gets removed only
3335 if the B_Orphan bit is clear.
3336 There are two times when we want to expedite orphan handling.
3337 1/ on rmdir we need to know if the directory is really empty.
3338 This requires that we expedite the orphan handling of all
3339 blocks. As soon as we find a non-orphan, we can give up.
3340 Then we need to make sure the index tree has collapsed. We
3341 can borrow that code from truncate.
3343 2/ When writing past Trunc_next. We just pass the block to
3344 special orphan handling.
3346 This requires that orphan handling is re-entrant.
3347 For dir, that is protected by i_mutex, but rmdir needs to come
3349 For trunc, the iolock on the index blocks should be enough.
3350 I wonder if IOLock can be used on dir as well... allowing
3351 parallel orphan handling in the one dir even!!.
3353 We need to ensure exclusion of orphan handling, including:
3354 - only one orphan handler at a time
3355 - don't run orphan handler while still processing action
3356 that makes it an orphan.
3357 Maybe if we just use IOLock for that? Does that work? Maybe
3358 but it gets messy for directories (on first attempt anyway).
3359 For directories we can just use i_mutex.
3360 Maybe i_mutex for files as well?
3363 Orphan handling is going well... but not perfect.
3364 I'm using IOLock to ensure exclusion for orphan handling.
3366 I'm not really implementing that on directories
3367 Inodes go bad because lafs_erase_dblock needs the lock too.
3368 The call from rmdir will always fail because we hold i_mutex.
3370 Bigger problem. I'm IOLocking inodes across checkpoints to preserve
3371 Orphan status. But that might stop the checkpoint proceeding.
3372 .. so use i_mutex, not IOLock - fine.
3374 Now... it seems I've confused myself. Orphans don't get handled
3375 immediately. In particular, inodes should not be handled until
3376 their final delete_inode. So setting the B_Orphan flag and putting
3377 on the list are two separate events. The flag must come first,
3378 but the list may come much later. So some of that mucking around
3379 with i_mutex is pointless.
3381 make_orphan makes sure it is in orphan file, sets bit, and removes
3382 from list (if present).
3383 add_orphan puts it on the list for handling.
3385 For inodes: lafs_new_inode sets the bit and delete_inode puts on queue,
3386 as does any unlink/rmdir/rename that fails.
3388 For directories: put it on list in commit/abort.
3392 I hit the BUG where find_leaf wants an address of 0.
3393 If an index block gets cleaned out it doesn't disappear
3394 immediately.. there is no leaf to find in that direction.
3395 We probably need to avoid non-Valid blocks or something...
3397 Orphans 0/299 to 0/329 and 0/280 are still on the list
3398 but are not orphans.
3399 Maybe I need to catch mutex_unlock to run the orphans??
3401 We underflow a segment through orphans at unmount.
3402 We are cleaning and truncating at the same time.
3403 The same block gets allocated to 0 and to 1225
3404 in quick succession.
3405 Problem is that we apply new address while in writeback
3406 so a new lafs_allocated_block
3410 Review of inodes in orphan list:
3411 lafs_new_inode makes an orphan for a non-existent inode.
3412 If the inode cannot be created, orphan_release is called.
3413 If it can, a 'struct inode' is filled in with valid type
3414 and nlink==1 (!!) and attached. The inode will only be
3415 detached when the refcnt hits 0, and the orphan list implies
3416 a refcount, so if we ever find something on the orphan list
3417 with a NULL my_inode, it must be very new and can be ignored.
3419 When we find an inode block with a my_inode there are a few options:
3420 if I_Trunc is set, we must progress truncation providing we can
3422 else if I_Deleting we must delete the inode
3423 else if nlink is 0, we remove from the list
3424 else nlink > 0 and we must remove orphan status.
3425 This means that if nlink is elevated, we need to be holding the mutex...
3426 So don't elevate nlink any more...
3428 When nlink becomes non-zero the block needs to be put back on the
3429 orphan list (it must already be an orphan). Also when we set
3430 I_Deleting or I_Trunc it must go on the list.
3431 .. OK, I think I have all of that.
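The review above amounts to a small decision table for an inode block found on the orphan list. A model sketch in Python, with the caveat that `OrphanInode` and `orphan_action` are hypothetical names and the returned strings just paraphrase the notes:

```python
class OrphanInode:
    """Just the fields the orphan handler looks at (model only)."""

    def __init__(self, nlink, trunc=False, deleting=False):
        self.nlink = nlink
        self.I_Trunc = trunc
        self.I_Deleting = deleting


def orphan_action(my_inode):
    """Pick what to do with an inode block found on the orphan list."""
    if my_inode is None:
        # No 'struct inode' attached: must be very new, so ignore it.
        return "ignore"
    if my_inode.I_Trunc:
        return "progress truncation"
    if my_inode.I_Deleting:
        return "delete inode"
    if my_inode.nlink == 0:
        return "remove from list"
    # nlink > 0: the inode is linked, so it is no longer an orphan.
    return "remove orphan status"
```

The order of the tests matters: I_Trunc is checked before I_Deleting, and nlink is only consulted when neither flag is set.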
3435 I have some weirdness that seems to be caused by the orphan stuff,
3436 probably due to it all being async now.
3437 - A deleted inode clears I_Trunc and then sets it again. The only
3438 explanation seems to be that delete_inode is being called again,
3439 so I must be igrabbing it again, maybe from cleaning.
3440 - bits of directories aren't getting deleted. Sometimes single
3441 blocks, though the referred files are deleted. Sometimes
3442 the whole directory... More interestingly, those blocks then
3443 don't get cleaned, so something about them means that they
3444 don't get deleted and don't get cleaned either.
3446 Even weirder... I just had a case where file 331 had a different
3447 index block for every 4 data blocks...
3451 - What stops pinned blocks from being flushed by bdflush in middle
3452 of operation and so losing allocation? Must make sure to set
3453 them dirty very late.
3454 - orphan_release can fail, so must make sure we can always call
3455 it, even if my_inode is NULL.... but how?
3458 - make_orphan could fail due to lack of space, which is not OK.
3459 I made it loop, but I'm not 100% sure that is right... it isn't.
3460 I need to pass down the 'I'm freeing space' flag, and I need to
3461 not require Credit of Dirty is set, etc.
3464 - I seem to have a deadlock at unmount.
3465 umount is waiting for lafs_checkpoint_lock_wait in
3467 pdflush is in down_read in sync_supers
3468 lafs_cleaner is iget_locked/ifind_fast/inode_wait
3469 This is waiting for I_LOCK to be clear.
3473 - When a file shrinks and becomes level-0, make sure
3474 old addresses get deallocated. I seem to have
3475 a directory where they didn't.
3477 - Due to the fact that we over-preallocate, we really shouldn't
3478 return ENOSPC until we have flushed dirty data and performed
3482 - When I removed the last index from an inode
3483 (Indirect type) it seems that I didn't write
3484 out the corrected block..??
3487 I ran my simple test run repeatedly overnight.
3488 It ran 208 times before I stopped it.
3489 There are 3 possible failure modes:
3490 1/ didn't complete within 500 seconds
3492 3/ appeared to complete, but the number of blocks
3493 in use was not the correct '7'.
3495 74 (35%) did not fail!
3496 31 () did not complete
3497 40 () triggered a BUG
3498 2 did not complete but did not trigger a bug
3500 94 of those that failed did not have a BUG
3501 92 actually completed. Of these:
3513 1 BUG: sleeping function called from invalid context at kernel/nsproxy.c:217
3514 1 BUG: spinlock lockup on CPU#0, rm/1330, cfb2dae4
3515 1 kernel BUG at /home/neilb/work/nfsbrick/fs/module/block.c:485!
3516 1 kernel BUG at /home/neilb/work/nfsbrick/fs/module/cluster.c:1219!
3517 1 kernel BUG at /home/neilb/work/nfsbrick/fs/module/inode.c:821!
3518 2 BUG: soft lockup - CPU#0 stuck for 61s! [lafs_cleaner:1177]
3519 3 kernel BUG at /home/neilb/work/nfsbrick/fs/module/segments.c:1028!
3520 3 kernel BUG at /home/neilb/work/nfsbrick/fs/module/segments.c:351!
3521 5 kernel BUG at /home/neilb/work/nfsbrick/fs/module/lafs.h:276!
3522 6 kernel BUG at /home/neilb/work/nfsbrick/fs/module/block.c:529!
3523 7 BUG: unable to handle kernel paging request at 6b6b6bfb
3524 11 kernel BUG at /home/neilb/work/nfsbrick/fs/module/super.c:655!
3527 super.c:655 is "block is still pinned" at unmount time.
3528 The block was always an InoIdx with a child.
3529 Either inode 0 or 16.
3530 child is held by various things:
3531 [cfb555cc]16/1(2098)r131E:Valid,Async,SegRef,CN,CNI,UninCredit,PhysValid async(1) clean2(130)
3532 [cfb554f0]16/0(1050)r25E:Valid,SegRef,CN,CNI,PhysValid clean2(25)
3533 [cfa57c58]0/2(3676)r0E:Valid,Dirty,UninCredit,PhysValid
3534 [cfa5bc58]0/2(3110)r0E:Valid,Dirty,UninCredit,PhysValid
3535 [ce5b94f0]16/0(519)r1E:Valid,Async,SegRef,CN,CNI,PhysValid async(1)
3536 [cfb4d4f0]16/0(4249)r1E:Valid,Async,SegRef,CN,CNI,PhysValid async(1)
3537 [ce5ad4f0]16/0(612)r1E:Valid,Async,SegRef,CN,CNI,PhysValid async(1)
3538 [ce5c2fc8]0/74(0)r129E:SegRef,C,Claimed,PhysValid clean2(129)
3539 [cfa57c58]0/2(1895)r0E:Valid,Dirty,UninCredit,PhysValid
3540 [cfb4d5cc]16/1(4543)r105E:Valid,SegRef,CN,CNI,UninCredit,PhysValid clean2(105)
3541 [ce5754f0]16/0(1290)r178E:Valid,SegRef,CN,CNI,PhysValid clean2(178)
3543 The "unable to handle kernel paging request" is always in
3545 invalidate_inode_buffers(26/46)/lock_acquire
3549 This is iblock valid when erasing a block
3550 The block we are erasing is always 0/327 or 0/328. It is
3551 an orphan we are handling, iolocked but not always pinned
3554 Map an iblock which is not IOLocked
3555 always in lafs_clear_index for the InoIdx block for a directory
3556 which is in Writeback.
3557 Call is in lafs_allocated_block from cluster_flush.
3560 seg_inc reduces seg usage below 0
3561 - lots of blocks (inode 327) that were cleaned, were then erased twice.
3562 - 2 blocks (inode 328) were erased twice, both from prune
3566 The free list is empty.... odd as only first segment is currently
3570 Still orphan: 0/328 Index(1) is in Writeback and Dirty
3571 again inode_handle_orphan2 is in Writeback
3574 inode_handle_orphan at end, child list is not empty.
3575 The children seem to be in Realloc - cleaner needs to let go.
3578 my_inode is null while cluster_flush an inode and want to set
3583 no ICredit for unincredit in dirty_dblock from dir_delete_commit
3587 spinlock lockup is subsequent to real bug
3588 ditto for sleeping function.
3590 Of the '44' which claimed final blocks of 9, 14 really had 7, and 4
3591 appear to have other strange values....
3593 A selected '9' has two extra blocks for the directory '74'.
3594 But that directory is long gone.
3595 These dir blocks are currently fully populated with numbers.
3596 This seems to be the pattern with all non-7 blocks.
3600 Found a problem, possibly related to the dir blocks not being
3602 When lafs_incorporate sets ->depth to 1 it doesn't dirty the inode,
3603 so that fact is never copied in to the datablock.
3604 On further exploration, the I_Dirty bit is set but never used, which
3606 So: exactly when do we copy inode into datablock, and what do we do
3607 when dirty_inode is called (if anything).
3608 We could just set I_Dirty when dirty_inode is called, checking that
3609 the block is Pinned which it usually will be.
3610 Then we copy inode to data just before writing data block.
3611 However that defeats transactional properties. We need to copy in the
3612 same transaction, and that means either straight away, or when
3613 the data block's phase changes.
3614 So dirty_inode either copies to the block, or sets I_Dirty.
3615 When lafs_refile unpins an inode data block, it needs to check
3616 I_Dirty and possibly re-dirty it.
3618 To redirty it we must steal the NCredits. Any further dirty attempt
3619 will have to allocate more.
3620 The stealing is done automatically by dirty_dblock, so we just flip
3621 the phase and call dirty_inode ... making sure it doesn't try to
3624 Need to review when inodes get dirtied.
3625 - commit_write only sets I_Dirty !
3627 We call lafs_dirty_inode:
3628 dir_create_commit - a child of inode is PinPending
3630 lafs_link - before dir_create_commit
3631 lafs_unlink, lafs_rmdir - data block is pinned
3632 lafs_symlink - before create_commit
3633 lafs_mkdir - before create_commit, or block pinned
3634 lafs_mknod - before create_commit
3635 lafs_rename - (moved to) before create_commit/update_commit
3636 or data block is pinned
3637 lafs_dir_handle_orphan - (assured that) child is pinned.
3638 choose_free_inum - child is pinned
3639 lafs_incorporate - block is pinned
3641 So either the data block is pinned, or the index block is pinned.
3642 In either case it is OK to set something to Dirty.
3644 (the new) lafs_dirty_vfs_inode gets called by mark_inode_dirty{,_sync}
3645 this is called from:
3646 inode_inc_link_count
3647 inode_dec_link_count
3648 ..various quota ops...
3650 __set_page_dirty (Which we don't use)
3652 other quota stuff we won't use
3657 only the time updates are interesting. Others we have locking
3659 file_update_time is called from generic_file_aio_write_nolock etc
3660 before ->prepare_write/->commit_write. So they can pick up the
3662 Similarly before set_page_dirty is called.
3663 touch_atime is called from do_follow_link and readlink and
3664 file_accessed which is called all over the place.
3667 If block is pinned, then dirty it to ensure writeout.
3668 If not, don't. But copy data in any case.
3673 OK, I've decided that I don't like clearing B_Valid when an index
3674 block contains no indexes. The final straw was that I seemed
3675 to need to initialise the index block when I didn't hold IOLock.
3676 That was probably fixable, but I'm sure more problems were coming.
3678 So: what to do instead?
3679 One issue that must be resolved is that an index block can still
3680 have valid children even when it becomes empty.
3681 This can happen if we erase blocks from a file, then add them back
3682 after a checkpoint, and so in the next phase.
3683 The checkpoint writeout could need to show an empty index block,
3684 but the next phase will see real addresses.
3685 We cannot easily avoid this, so we must handle it.
3686 This interacts badly with the index lookup algorithm that finds
3687 the best index block currently in the parent, and then scans
3688 the children. If there is no index block in the parent, we
3689 cannot find any children.
3690 This could be handled by responding to an empty index block by
3691 scanning all children. But that isn't a full solution as if
3692 just one index block got erased, its unincorporated siblings
3693 would still be lost.
3694 We could treat empty index blocks like orphans. i.e. don't
3695 discard them immediately but leave them with possibly real
3696 addresses. Then when they have no children we allocate the
3698 But we still need to ensure that index blocks off which siblings
3699 have been split but not yet incorporated remain present in the
3700 tree to mark the place for their siblings.
3701 There is another problem. A horizontal split could leave the
3702 new block with no addresses and everything in the uninc list.
3703 Nothing can be found in there.
3705 So maybe we need to revise the lookup mechanism.
3706 The goal is to find an index block that starts at or before
3707 the target and contains an address at or after the target.
3708 Then our search can stop.
3712 I thought about this more over the weekend and think I have an answer.
3713 We need to treat internal and leaf index blocks somewhat differently.
3715 An internal index block must never be empty (while unlocked).
3716 Any child block which has not had its address incorporated must be
3717 attached (simply in the sibling list) to a block which has been
3718 incorporated. This will be the block that it was split off.
3719 The uninc block needs to hold a reference so that the primary isn't
3721 When a 'primary' becomes empty it cannot be discarded, so the
3722 addresses in the first dependent index block must be copied
3723 across. This is awkward for indirect blocks so they might be
3724 allowed to be empty (they aren't internal so don't violate the
3726 When a horizontal split breaks a sequence of dependent blocks
3727 between two parents, the second parent must be incorporated
3728 immediately so that the first block in the second half of the
3729 sequence is incorporated.
3730 If an internal index block does become empty and it has no
3731 dependent blocks to fill from, it must be invalidated immediately.
3732 It cannot have any children - even in next phase - as at least one
3733 would have to be incorporated and so the block would not be empty.
3734 Invalidating involves allocating to address 0.
3735 If index lookup finds a block with PhysValid address of 0, it
3736 must look to the previous index block. If there was none .... it
3739 Leaf index blocks can become empty, but we try to avoid it.
3740 If a leaf has blocks which have been created in the next phase,
3741 and others which have been deleted in this phase, it can be empty
3742 but still have children. In this case we just treat it as a real
3743 index block that doesn't actually have any addresses. We still
3744 write it out even though that is a waste of space.
3746 We have been working on the assumption that every address always
3747 has a corresponding leaf index block. It is the leaf with the
3748 highest index at or below the target address.
3749 However this requires that every internal index block has a child
3750 with the same address as the parent.
3751 Preserving this requirement when the first child of an internal
3752 becomes empty requires either:
3753 - loading the 'next' child and reassigning this to the start
3754 - changing the address of the parent to match the first child.
3755 The former requires possibly reading a block from storage.
3756 The latter only involves modifying blocks that are due to be
3757 written out anyway, but makes block look up slightly interesting.
3758 When lookup finds an invalid block that is 'first', it needs to
3759 start again from the top.
3760 When incorporation creates an invalid block that is first, it
3761 needs to walk down from the top and any index block at the same
3762 address needs to be relocated/rehashed. If the block is
3763 incorporated, the incorporated address needs to be updated.
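The working assumption in this paragraph, that the leaf for an address is the one with the highest start at or below the target, can be modelled with a sorted list of leaf start addresses. A sketch only: `find_leaf_start` is a hypothetical name and a flat sorted list stands in for the real index tree.

```python
from bisect import bisect_right


def find_leaf_start(leaf_starts, target):
    """Return the start address of the leaf covering `target`.

    `leaf_starts` is a sorted list of leaf index block start addresses.
    The covering leaf is the one with the highest start <= target.
    Returns None when no leaf starts at or below the target.
    """
    i = bisect_right(leaf_starts, target)
    if i == 0:
        return None
    return leaf_starts[i - 1]
```

The None case is exactly what the "child with the same address as the parent" requirement rules out: if the first leaf always starts at the parent's own address, every target inside the parent's range finds a leaf.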
3765 - flag for unincorporated index blocks which implies a reference
3767 - after split, immediately incorporate second block
3768 - change lookup to retry when finding invalid block
3769 - When internal block becomes empty, either merge with
3770 first dependent or invalidate. If first in parent,
3771 update address and parent and recurse.
3772 Need some 'clever' locking here.
3773 Before unlocking the invalidated block, we take i_alloc_sem,
3774 then walk up the ->parent tree locking blocks as
3776 The index lookup, when it finds an invalid block will take
3777 i_alloc_sem, then drop it, then start again.
3778 Or maybe some other lock than i_alloc_sem...
3779 - When leaf becomes empty, invalidate only if it has no children.
3780 When internal leaf becomes unpinned, check if empty.
3783 That locking doesn't look like it will work, and we can never 'merge
3784 with first dependent' as it is not valid to have an index block
3785 where the first child is at a different address.
3786 And we cannot always change the parent address, particularly if it
3787 is zero - increasing it then cannot work.
3788 And there is no need to load a block if we are just going to change
3789 its start address (not internal index blocks anyway).
3790 Let's drop the idea of relocating the parent.
3791 If an internal index block becomes empty:
3792 If it is last in parent, no loss, just discard
3793 If parent would be empty, need to recurse up.
3794 If it is not last relocate the next sibling to this location,
3795 rehashing it and updating the parent.
3796 If a leaf index block becomes empty we cannot just delegate to
3797 next as it might be indirect... not a problem if address is
3798 stored. But that requires a format change... now might be a
3803 If we hold an index block locked and it becomes empty and we choose
3804 to invalidate it, we need to ensure that doing so does not
3805 break any indexing paths.
3806 So we take a separate lock (i_alloc_sem??) and flag the block as invalid
3807 by setting physaddr to 0 while PhysValid is set, and unlock the block.
3808 Any lookup that finds such a block must take and release i_alloc_sem,
3809 and then restart from the top.
3810 - If the block was not incorporated, we just remove from sibling list
3811 and all is done - the space is implicitly included in
3813 - If the block has a different fileaddr than the parent then update
3814 the parent directly, either removing the entry, or changing it to
3815 point to the first unincorporated sibling (if there is one).
3816 This requires taking the lock on the parent of course. That is
3817 why we dropped the lock on the child.
3819 - If the block has the same address as the parent we need to find
3820 a 'next block' to relocate to the start of the parent.
3821 It is either the first unincorporated sibling, or the next
3822 block in the index block, or nothing, meaning the parent is
3823 about to become empty.
3824 We lock the parent (still holding i_alloc_sem), and rehash the
3825 chosen child. If it doesn't exist, or is not dirty, we need
3826 to update the phys address directly in the
3827 accordingly, erasing or replacing the first address.
3828 Then we need to rehash the index block, but we need to lock
3829 the parent for that.
3830 So set a 'busy' flag on the block, unlock it, lock parent,
3831 rehash, clear busy flag, and repeat.
3832 - We can never relocate a block with fileaddr of zero, as the
3833 InoIdx block cannot be relocated. So leaf index block 0
3834 must never be erased unless the file is empty. So
3838 We store the start address of an indirect block in the block.
3839 This means that the meaning of any index block is completely
3840 independent of the location of the block, so we can change the location
3841 easily and without touching the block.
3842 So if a block becomes empty, we simply move the next block back to
3844 i.e. when an index block becomes truly empty (i.e. no children)
3845 - if it wasn't incorporated, simply remove it
3847 - if there is a dependent block, rehash it to take my address
3848 - if there is a next block that is dirty, rehash it
3849 - if there is a next block that is not dirty,
3850 update parent to merge my entry with next, and rehash next
3852 - if there is no next block but we are not first, just update
3854 - if no next block and we are first, parent becomes empty,
3858 - too long, I've forgotten what I was up to..
3859 + I've changed the format of indirect blocks to store an address.
3860 + I've handled incorporation of an empty block
3861 So now internal index blocks can never be empty - they get immediately
3862 unlinked if they are.
3863 Leaf index blocks can be empty while they have children. We don't
3864 flag them as empty, but rather wait until another child gets incorporated.
3865 But I don't think I really like that. It is an external ugliness based
3866 entirely on internal implementation details. Empty index blocks should
3867 not get written out. We need some way to reliably find an empty index
3868 block. The address won't appear in the parent so a lookup will find the
3869 previous block which we cannot link to now as it may not exist yet.
3870 Worse - if first index block goes empty, we can only unlink it by moving
3871 the parent to start at the next block. That would make this index block
3873 So I think we have to stick with writing out empty index blocks very
3874 rarely. So we need to be sure they disappear properly.
3875 The difficult case is if an index block becomes empty while it has some
3876 children which don't end up getting dirtied. e.g. an update aborts.
3877 We need to leave the block with enough credits to be written out.
3878 I guess the Ncredit should be enough...
3879 Maybe worry about that later.
3881 - what about InoIdx blocks when they become empty? It would be helpful
3882 to flag them so that inode deletion can check....
3883 Maybe just set depth to 0..
3885 ARRGGG... I've completely lost it. I need another ITO week.
3886 I just got a bug in summary.c:71!!
3890 ablocks_used has hit zero too soon.
3891 This should be the count of blocks for which space has been allocated
3892 (B_Prealloc is set) but have not been given a phys address yet - at which
3893 point the usage count is moved to cblocks_used or pblocks_used.
3894 The last block (which may not be the cause of the problem) does not have
3895 B_Prealloc set, yet physaddr == 0.
3896 The block is 0/1, so the inode for the inode usage map. This should have
3898 We did find 8, then changed to 73, but then changed to 0!
3899 Ahhh... recent fix exposed a subtle bug ... fixed.
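The ablocks_used accounting described above can be modelled to show where an underflow like this one would be caught. This is a sketch under assumptions: `SpaceCounters`, `prealloc` and `give_phys` are hypothetical names, a plain dict stands in for a block, and clearing Prealloc when the phys address is given is an assumption rather than something the notes state.

```python
class SpaceCounters:
    """Model of the three usage counters named in the notes."""

    def __init__(self):
        self.ablocks_used = 0   # Prealloc set, no phys address yet
        self.cblocks_used = 0
        self.pblocks_used = 0

    def prealloc(self, block):
        # Space is counted once per block, however often it is requested.
        if not block.get("Prealloc"):
            block["Prealloc"] = True
            self.ablocks_used += 1

    def give_phys(self, block, physaddr, counter):
        # Move the usage from ablocks_used to the named counter.
        if counter not in ("cblocks_used", "pblocks_used"):
            raise ValueError(counter)
        if not block.get("Prealloc") or self.ablocks_used == 0:
            # This is the "hit zero too soon" condition.
            raise RuntimeError("ablocks_used would underflow")
        block["Prealloc"] = False      # assumption: cleared on allocation
        block["physaddr"] = physaddr
        self.ablocks_used -= 1
        setattr(self, counter, getattr(self, counter) + 1)
```

In this model a block that gets a phys address twice without a fresh prealloc in between raises, which is the same shape as a block with no B_Prealloc being handed an address.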
3901 Now cluster.c:619: [ce9233f8]0/282(0)r2F:Pinned,Phase1,PinPending,CN,CNI,Claimed,PhysValid iblock(1) leaf(1)
3902 cluster.c:619: [ce570a18]0/286(0)r2F:Pinned,Phase1,PinPending,SegRef,CN,CNI,Claimed,PhysValid iblock(1) leaf(1)
3903 cluster.c:619: [ce588d6c]0/17(0)r2F:Pinned,Phase0,PinPending,CN,CNI,Claimed,PhysValid iblock(1) leaf(1)
3904 cluster.c:619: [ce51dfe4]0/283(0)r2F:Pinned,Phase0,PinPending,CN,CNI,Claimed,PhysValid iblock(1) leaf(1)
3905 cluster.c:619: [cfbb8430]0/328(0)r2F:Pinned,Phase1,PinPending,CN,CNI,Claimed,PhysValid iblock(1) leaf(1)
3906 We are allocating an InoIdx block, but data block is not valid??
3908 That isn't very reproducible so I'll have to leave it for now...
3909 erase_dblock had been called on the data block .. inode 17??
3911 Problem is that I keep changing the rules.
3912 I don't erase the InoIdx block any more.
3913 I used to, then changed it to iolock_block/cluster_allocate->0
3915 Problem: When all files are removed, usage is still quite high, two
3916 segments have over 400 blocks (out of 512). Cleaning keeps running and
3917 not making much progress.
3918 segment 6 has usage of 484.
3919 'cluster 3072' shows: cluster 3072, 3085, 3086 3092
3920 Inode 0: blocks 267 272 276
3921 Inode 277: blocks 0/4 6/2
3922 Inode 0: blocks 0/2 8 16
3923 Inode 0: block 16 70/2 131/3 135/4 140/9 150/2 ... 296/7
3929 All 'old', so must be the product of cleaning, as you would expect.
3930 All (most) of this has been deleted though, but count didn't drop.
3931 'Count' adds to 508, plus the 4 cluster heads makes 512 - good.
3932 lafs_seg_move definitely isn't being called on these blocks.
3933 it is only called from lafs_summary_update
3934 cblocks_used "exactly" matches the number of un-removed blocks.
3938 bad [ce5bcf50]301/0(0)r1E:Index(0),Pinned,Phase1,WPhase0,InoIdx,Valid,SegRef,C,CN,CNI,UninCredit,IOLock,PhysValid{0,0}[0] leaf(1)
3939 /home/neilb/work/nfsbrick/fs/module/modify.c:1652: [ce5bcf50]301/0(0)r1E:Index(0),Pinned,Phase1,WPhase0,InoIdx,Valid,SegRef,C,CN,CNI,UninCredit,IOLock,PhysValid{0,0}[0] leaf(1)
3940 bad [cfbf6000]327/0(0)r1E:Index(0),Pinned,Phase1,WPhase0,InoIdx,Valid,SegRef,C,CN,CNI,UninCredit,IOLock,PhysValid{0,0}[0] leaf(1)
3941 /home/neilb/work/nfsbrick/fs/module/modify.c:1656: [cfbf6000]327/0(0)r1E:Index(0),Pinned,Phase1,WPhase0,InoIdx,Valid,SegRef,C,CN,CNI,UninCredit,IOLock,PhysValid{0,0}[0] leaf(1)
3942 bad [cfb62d20]291/0(0)r1E:Index(0),Pinned,Phase1,WPhase1,InoIdx,Valid,SegRef,C,CN,CNI,UninCredit,IOLock,PhysValid{0,0}[0] leaf(1)
3943 /home/neilb/work/nfsbrick/fs/module/modify.c:1656: [cfb62d20]291/0(0)r1E:Index(0),Pinned,Phase1,WPhase1,InoIdx,Valid,SegRef,C,CN,CNI,UninCredit,IOLock,PhysValid{0,0}[0] leaf(1)
3946 free_blocks=1842 allocated=449 max_seg=512 clean_reserved=0
3949 ------------[ cut here ]------------
3950 kernel BUG at /home/neilb/work/nfsbrick/fs/module/segments.c:1028!
3951 free list is empty - that should not be.
3954 /home/neilb/work/nfsbrick/fs/module/modify.c:1219: [ce9893b0]74/0(0)r1E:Index(1),Pinned,Phase0,WPhase1,InoIdx,Valid,Dirty,CN,CNI,UninCredit,IOLock,PhysValid{0,0}[0] leaf(1)
3955 /home/neilb/work/nfsbrick/fs/module/modify.c:1219: [ce5ba690]74/0(0)r1E:Index(1),Pinned,Phase1,WPhase0,InoIdx,Valid,Dirty,CN,CNI,UninCredit,IOLock,PhysValid{0,0}[0] leaf(1)
3956 [<d0a57bc8>] ? lafs_get_flushable+0x131/0x191 [lafs]
3957 [<d0a5856d>] ? lafs_do_checkpoint+0x1b3/0x3a2 [lafs]
3958 [<d0a5fe7e>] ? cleaner+0x105/0x1426 [lafs]
3959 [<c02256bf>] ? autoremove_wake_function+0x0/0x33
3960 [<d0a5fd79>] ? cleaner+0x0/0x1426 [lafs]
3964 Weirdness with truncating.
3965 The cleaner relocates a file resulting in the InoIdx block being
3966 Maybe-dirty and phys_addr == 0.
3967 Then truncate doesn't prune but just incorporates, finding
3968 something weird there..
3969 file 278, blocks around 4100
3970 seem to find 1949 instead??
3972 Note: When a non-InoIdx block is erased we set PhysValid
3973 and physaddr == 0 to record the fact because it will not be stored...
3975 modify.c:1654: [ce5b4460]327/336(16)r4F:Index(1),Pinned,Phase0,WPhase1,Valid,Async,SegRef,C,CI,CN,CNI,IOLock,PhysValid{0,0}[0] leaf(1) inode_handle_orphan2(1) async(1) inode_handle_orphan3(1)
3977 modify.c:1657: [cfb90690]327/340(787)r4F:Index(1),Pinned,Phase1,WPhase0,Valid,Async,SegRef,C,CI,CN,CNI,IOLock,PhysValid{0,0}[0] leaf(1) inode_handle_orphan2(1) async(1) inode_handle_orphan3(1)
3978 Still Async ... wonder what it means.
3980 - directory block got corrupted. Maybe conversion to indexed??
3983 Getting bug in remove_from_index because the addr isn't
3984 there, possibly block is empty. But incorporation is
3985 ??? instant? No it isn't.
3986 If an index block hasn't been incorporated it has B_PrimaryRef
3987 set as it holds a ref to some earlier index.
3988 But what if nothing is incorporated?
3991 Allocated [ce402230]328/340(0)r5F:Index(1),Pinned,Phase0,WPhase0,Valid,Dirty,Async,SegRef,CN,CNI,UninCredit,IOLock,PhysValid,Uninc{0,0}[0] inode_handle_orphan2(1) leaf(1) uninc(1) async(1) inode_handle_orphan3(1) -> 0
3992 looping on [ce402230]328/340(0)r5F:Index(1),Pinned,Phase0,Valid,Dirty,Writeback,Async,SegRef,CN,CNI,UninCredit,PhysValid,Uninc{0,0}[0] inode_handle_orphan2(1) cluster(1) uninc(1) async(1) inode_handle_orphan3(1)
3994 Then spin in a soft-lockup in lafs_inode_handle_orphan
3998 - grow_index_tree needs to do initial incorporation so things can be found.
3999 just like end of do_incorporate_internal.
4000 NO - cannot incorp yet as do not have phys addr. Don't need to as
4001 lafs_leaf_find explicitly handles this.
4002 For truncate case we don't use the stored address, but ensure all
4003 leaf indexes must be dirty (or gone) so whole tree must be
4004 accessible for walking around.
4005 - do_incorporate_internal needs to set B_PrimaryRef and take the ref
4006 - when we remove a B_PrimaryRef without incorporating it, we need to
4007 drop a ref if the *next* in the list is B_PrimaryRef
4008 - need to use a constant to identify 'async' calls etc.
4009 - maybe I need other iolock_block in truncate ?? to ensure it is Valid so
4010 it isn't found as async....
4013 STILL struggling with incorporation.
4014 We have a premise that any file address is covered by precisely
4015 one leaf index block. Every leaf index has an implicit address
4016 and it covers all addresses from there to the next leaf. The last
4018 So there must always be a leaf at address 0.
4019 This applies within the tree from an internal index block too.
4020 Beneath an internal index block there must be a leaf covering every
4021 address up to the next internal index block. So there must be
4022 a first. So storing the first address is pointless. And harmful.
4023 When an index block becomes empty and disappears its coverage is
4024 included in the previous block unless there is none, in which case
4025 the next index block must be re-addressed. If there is no 'next',
4026 this index block must be empty and so must disappear.
4028 BUT if we re-address an index block, we implicitly re-address the
4029 first child - recursively - so we need to move/rehash them all
4030 or lose them... or record where they are. Or do lookup not by
4032 I think just rehashing them all - with an iolock - is simple
4033 and safe. So just do that.
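The coverage premise above can be sketched as a lookup. This is only an
illustration: 'struct leaf' and covering_leaf() are hypothetical names, not
code from the module, and a flat sorted array stands in for the real tree.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a leaf index block: only its implicit
 * start address matters for this sketch. */
struct leaf {
	uint32_t start;	/* first file address this leaf covers */
};

/*
 * Return the position of the leaf covering 'addr'.
 * 'leaves' must be sorted by ->start and leaves[0].start must be 0,
 * so every address is covered by exactly one leaf: the last one whose
 * start is <= addr.
 */
size_t covering_leaf(const struct leaf *leaves, size_t n, uint32_t addr)
{
	size_t lo = 0, hi = n;	/* the answer lies in [lo, hi) */

	assert(n > 0 && leaves[0].start == 0);
	while (hi - lo > 1) {
		size_t mid = lo + (hi - lo) / 2;

		if (leaves[mid].start <= addr)
			lo = mid;
		else
			hi = mid;
	}
	return lo;
}
```

Dropping a non-first entry from the array makes the previous leaf cover its
range, which matches "its coverage is included in the previous block".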
4036 So: I cleaned up index handling and truncation somewhat.
4037 Now running looptest to see what patterns emerge:
4039 block.c:197 (*9+1) During umount, the Root datablock is
4041 Maybe just need for cleaner to become inactive
4042 during umount - hope that doesn't deadlock
4043 didn't even work...
4044 block.c:529 (*4+1) erase dblock while iblock depth > 0
4045 When pruning InoIdx we want to set depth to 0.
4046 FIXME is this really what I want, or is depth=0
4047 only for data-inode ... FIXME
4048 cluster.c:533 (*2) cluster_allocate on invalid block
4049 Block is 8/0 in writepage from sync_inodes
4050 This is the orphan file.
4052 I guess the file gets truncated while we wait for it.
4053 Just need to re-test.
4054 index.c:1936 (*2). An index block is Root - FIXED??
4055 modify.c:1056 - secondary bug, ignore for now.
4056 modify.c:1650 update_index fails to find target.
4057 second call, phys==0
4058 Code was bad ... may not be the cause though.
4059 modify.c:1696 (*4) lafs_incorporate gets non-dirty Index(1) block
4060 from orphan handler.
4061 Maybe just change the do/while back to 'do'.
4062 modify.c:1704: (*2) lafs_inc gets leaf with uninc list???
4065 uninc list gets set in lafs_add_block_address (parent of iblk),
4066 do_incorporate_internal,
4067 Maybe the InoIdx still had children.
4068 segments.c:1028. (*4) The free list becomes empty.
4069 super.c:655 (*3) Busy inodes after umount, and root InoIdx block
4070 is still pinned as inode 16 data block was still dirty.
4071 segusage slow. Maybe same as block.c:197 ??
4072 invalid address 6b6b6bfb: invalidate_inode_buffers in shutdown
4074 presumably the inode was freed before being invalidated.
4075 spin on writeback during truncate (r3a) 8 times. now 10
4076 Probably because writeback cannot proceed while
4077 orphan processing keeps looping.
4078 kmalloc-1024 problems - (*2)
4079 A block - should be start of page - isn't what it appears...
4081 Others complete with 'cb' ranging from 202 to 715
4086 Looking at segments.c:1028
4087 We run a seg_scan every checkpoint, so that should keep free segments
4089 Ahh.. do_checkpoint is looping because root isn't changing phase.
4091 Lowest block pinned to old phase is
4092 [cfb7df08]0/74(4253)r0E:Pinned,Phase1,WPhase0,Valid,Dirty,CN,CNI,UninCredit,IOLock,Claimed,PhysValid
4093 which is not on leaf list because it has IOLock
4094 With more debugging:
4095 [ce5c5f08]0/74(4250)r0E:Pinned,Phase1,WPhase0,Valid,Dirty,Realloc,SegRef,CN,CNI,UninCredit,IOLock,Claimed,PhysValid</io.c:368>
4096 or better (that was in lafs_iolock_written)
4097 [ce5c05e8]0/74(4257)r0E:Pinned,Phase0,WPhase0,Valid,Realloc,SegRef,C,CN,CNI,UninCredit,IOLock,Claimed,PhysValid</file.c:247>
4098 FIXED - I didn't unlock if it wasn't dirty any more.
4099 Well almost - it occurs much less now.
4101 8 BUG: soft lockup - CPU#0 stuck for 61s! [lafs_cleaner:1180]
4102 1 BUG: unable to handle kernel NULL pointer dereference at 000001b4
4103 2 BUG: unable to handle kernel paging request at 6b6b6bfb
4104 3 kernel BUG at /home/neilb/work/nfsbrick/fs/module/block.c:197!
4105 2 kernel BUG at /home/neilb/work/nfsbrick/fs/module/block.c:529!
4106 1 kernel BUG at /home/neilb/work/nfsbrick/fs/module/modify.c:1650!
4107 2 kernel BUG at /home/neilb/work/nfsbrick/fs/module/modify.c:1696!
4108 4 kernel BUG at /home/neilb/work/nfsbrick/fs/module/segments.c:1028!
4109 1 kernel BUG at /home/neilb/work/nfsbrick/fs/module/segments.c:332!
4111 So we now have 1/12 rather than 2/3.
4112 a/ pinned by IOLock from file.c:220 - FIXED
4114 c/ Root is pinned by 4 children
4115 328/0 with 196 of data blocks in writeback/realloc, in a cluster
4116 0/1, 74/0, 0/8 all in a cluster waiting writeout.
4117 Don't understand this.
4120 Of the 48, 11 ran to completion leaving blocks from 286 to 899
4123 Looking at the loss of blocks when truncating.
4124 tracing shows a small number of files with remaining blocks at delete.
4125 sum is 26+22+14+272+11+2 == 347 cf df shows cb=457
4126 next attempt: 14+24+26*11 =324 cf cb=1124
4127 next attempt 26+6+15+68+29 == 144 cf cb=383
4128 26+18+14+19+284 = 361 cf 379
4129 files are (in order)
4138 Thinking about truncate and index blocks becoming empty while
4139 they still have children.
4140 For leaf indexes, we need to leave the block in place in case
4141 the children get written. We need to find a time to ultimately
4143 For internal indexes,.... uhm, it just works, OK??
4145 When I drop an uninc block, I need to remove it from the
4146 uninc list, and from phase_leafs
4147 clearing dirty and refiling should remove from leafs.
4149 When we recurse to a parent, we need to remove
4150 *this* block from the uninc list for said parent.
4151 It should be the only thing in the list.
4152 But even when we don't recurse, the fact that we have
4153 incorporated means that we should tidy up the ->uninc
4159 unmount hung after lafs_run_orphans from lafs_put_super
4160 There are two orphans in Writeback which cannot progress
4161 until the current cluster is written...
4162 But they keep getting re-written!
4163 Other time, one orphan, index block is Dirty on a leaf ???
4165 orph=[cfbdcf24]0/331(3780)r2E:Valid,SegRef,C,CI,CN,CNI,Claimed,PhysValid,Orphan(0) orphan_list(1) iblock(1)
4166 [cfb8e460]331/0(NoPhysAddr)r1F:Index(1),Pinned,Phase0,InoIdx,Valid,Dirty,SegRef,CN,CNI,UninCredit{0,0}[0] leaf(1) Leaf0(1)
4167 LAFS_cluster_flush 1
4170 orph=[ce5c9bb4]0/327(3317)r2E:Valid,SegRef,C,CI,CN,CNI,Claimed,PhysValid,Orphan(0) iblock(1) orphan_list(1)
4171 [cfbe3a40]327/0(NoPhysAddr)r1F:Index(1),Pinned,Phase0,InoIdx,Valid,Dirty,SegRef,CN,CNI,UninCredit{0,0}[0] leaf(1) Leaf0(0)
4173 OK, problem is that when we truncate and remove an index block, the
4174 next index block expands backwards to fill the space.
4175 Then we apply prune_some, but don't check if anything was done.
4176 We always mark it dirty, so it has to be written and then
4177 we loop through again...
4178 So need to check if prune_some did anything.
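A minimal sketch of that check, assuming prune_some can report how much it
removed. 'struct iblock', prune_some_sketch() and truncate_pass() are
hypothetical stand-ins; the real prune_some signature is not shown in these
notes.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical miniature of an index block being truncated. */
struct iblock {
	int nr_addrs;	/* addresses still stored in the block */
	bool dirty;	/* needs to be written out */
};

/* Remove up to 'batch' addresses and return how many were removed. */
int prune_some_sketch(struct iblock *ib, int batch)
{
	int n;

	assert(batch > 0);
	n = ib->nr_addrs < batch ? ib->nr_addrs : batch;
	ib->nr_addrs -= n;
	return n;
}

/*
 * One orphan-handling pass.  The block is only marked dirty when
 * pruning actually changed it, so an already-empty block is not
 * rewritten and the caller has no reason to loop again.
 * Returns true if this pass changed the block.
 */
bool truncate_pass(struct iblock *ib, int batch)
{
	int done = prune_some_sketch(ib, batch);

	if (done == 0)
		return false;	/* nothing changed: no write, no loop */
	ib->dirty = true;
	return true;
}
```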
4181 - prune_some needs to get more done at a time
4182 - let cleaner finish up before umount
4183 - use early segments first ??
4184 - look at write-clusters and check OK
4185 - check that df:cb= drops properly.
4188 1 BUG: spinlock lockup on CPU#0, sh/1168, c0441170 - SECONDARY BUG
4189 1 BUG: unable to handle kernel NULL pointer dereference at 000001b4
4190 3 BUG: unable to handle kernel paging request at 00100104
4191 5 BUG: unable to handle kernel paging request at 6b6b6bfb
4192 1 BUG: unable to handle kernel paging request at 7fffffff
4193 7 kernel BUG at /home/neilb/work/nfsbrick/fs/module/block.c:197!
4194 9 kernel BUG at /home/neilb/work/nfsbrick/fs/module/block.c:479!
4195 2 kernel BUG at /home/neilb/work/nfsbrick/fs/module/block.c:529!
4196 2 kernel BUG at /home/neilb/work/nfsbrick/fs/module/block.c:67!
4197 2 kernel BUG at /home/neilb/work/nfsbrick/fs/module/inode.c:828!
4198 4 kernel BUG at /home/neilb/work/nfsbrick/fs/module/inode.c:843!
4199 1 kernel BUG at /home/neilb/work/nfsbrick/fs/module/modify.c:1708!
4200 7 kernel BUG at /home/neilb/work/nfsbrick/fs/module/segments.c:1028!
4201 2 kernel BUG at /home/neilb/work/nfsbrick/fs/module/segments.c:332!
4202 30 kernel BUG at /home/neilb/work/nfsbrick/fs/module/super.c:655!
4207 Pinned block in lafs_release:
4208 0/2 is Dirty with plenty of credits, so it is a child
4209 0/16 is Dirty/Realloc, or once Async
4210 Dirty, but not on a leaf list, not pinned
4213 seg_deref with refcnt , 2 in lafs_seg_put_all
4216 No free segments - no real pattern.
4219 lafs_incorporate on non-dirty/realloc block
4220 328/0 Index(1). 1 in uninc_table - probably during truncate.
4221 Either we add uninc while not dirty
4222 Or we clear Dirty while uninc present
4223 or there is a race between the two.
4225 Don't know: add a bugon
4226 Bugon in get_flushable didn't fire.
4229 children present in truncate after final incorp...
4230 328/0. 64 children, no uninc list. Maybe we ran the orphans too early??
4231 or invalidate_page isn't removing the children.
4232 Might want print_tree here?- added that.
4233 Answer: all the children are in Realloc on Clean_leafs
4234 Maybe erase_page needs to disconnect from cleaner too??
4237 Orphan handling - uninc but not dirty: is Realloc (sometimes)
4241 delref 'primary' from modify.c:2063 in the q2 branch.
4242 nxt has PrimaryRef... Maybe move earlier, but that shouldn't make a diff.
4243 ditto at modify.c:2035 nxt is primary as was I, so drop mine.
4244 Don't know - looks like sibling list got broken.
4245 Tidied up a bit and added a print-tree.
4246 v.interesting result. Lots of consecutive index blocks all holding primary-ref
4247 on single primary - which is wrong.
4248 1/ When setting PrimaryRef, if next holds PrimaryRef, then must take reference
4249 on self, as we are being inserted into the chain
4250 2/ When splitting, new block must be addressed as first block which cannot
4251 fit, not first block which doesn't fit. Else incorping in reverse order
4252 can make lots of tiny index blocks.
4255 erase with index depth > 1.
4256 0/328 in orphan handling. Still have 8 or 15 blocks registered!
4257 Maybe caused by index block errors. Added some printks.
4260 not enough credits to dirty block 2/0 in dir_delete_commit for unlink.
4262 16/1 in seg_inc/seg_move...allocated_block/cluster_flush
4264 - writepage wrote the page??
4265 - checkpoint wrote it and didn't replenish the credits?
4268 invalidated pages finds dirty block after EOF, after iolock_written
4269 0/0 Dirty/Realloc in unmount - all Realloc!
4270 Need to wait for cleaner etc to finish at unmount time.
4272 NULL deref in 1b4 YY
4273 cleaner->cluster_flush->count_credits->lock??
4274 Trying to get a lock on an inode that has since been freed??
4275 spin_lock(&dblk(b)->my_inode->i_data.private_lock);
4279 generic_drop_inode -- extra iput?? in lafs_inode_checkpin from refile
4281 invalidate_inode_buffers!! in kill. use-after-free
4284 seginsert from scan_seg
4285 MAX/number-elements confusion. Worked around for now.
4289 After a couple of fixes:
4290 1 BUG: unable to handle kernel NULL pointer dereference at 000001b4
4291 1 BUG: unable to handle kernel paging request at 00100104
4292 5 BUG: unable to handle kernel paging request at 6b6b6bfb
4293 4 kernel BUG at /home/neilb/work/nfsbrick/fs/module/block.c:209!
4294 4 kernel BUG at /home/neilb/work/nfsbrick/fs/module/block.c:496!
4295 3 kernel BUG at /home/neilb/work/nfsbrick/fs/module/block.c:67!
4296 2 kernel BUG at /home/neilb/work/nfsbrick/fs/module/cluster.c:531!
4297 16 kernel BUG at /home/neilb/work/nfsbrick/fs/module/inode.c:601!
4298 1 kernel BUG at /home/neilb/work/nfsbrick/fs/module/inode.c:852!
4299 Realloc blocks confusing truncate
4300 2 kernel BUG at /home/neilb/work/nfsbrick/fs/module/modify.c:118!
4301 1 kernel BUG at /home/neilb/work/nfsbrick/fs/module/modify.c:1699!
4302 7 kernel BUG at /home/neilb/work/nfsbrick/fs/module/segments.c:1033!
4303 19 kernel BUG at /home/neilb/work/nfsbrick/fs/module/super.c:655!
4307 - truncate gets confused by blocks being cleaned.
4308 Need to flush cleaner, or just remove the blocks.
4309 - when add PrimaryRef in middle of list, take the right ref.
4310 - fix up wait-for-cleaner at unmount time.
4314 3 BUG: unable to handle kernel paging request at 6b6b6bfb
4315 5 kernel BUG at /home/neilb/work/nfsbrick/fs/module/block.c:209!
4316 5 kernel BUG at /home/neilb/work/nfsbrick/fs/module/index.c:1890!
4317 22 kernel BUG at /home/neilb/work/nfsbrick/fs/module/inode.c:601!
4318 2 kernel BUG at /home/neilb/work/nfsbrick/fs/module/inode.c:835!
4319 3 kernel BUG at /home/neilb/work/nfsbrick/fs/module/inode.c:852!
4320 9 kernel BUG at /home/neilb/work/nfsbrick/fs/module/segments.c:1033!
4321 17 kernel BUG at /home/neilb/work/nfsbrick/fs/module/super.c:656!
4322 251 SysRq : Resetting
4323 3 SysRq : Show State
4325 - We can erase a dblock while it is in the uninc_pending or
4326 uninc_next - need to be careful
4327 - At umount, 0/2 is Dirty but not Pinned, so not written out
4329 16/0 sometimes is Async
4330 16/0 Async might be from the segment scan - so wait for that.
4331 Dirty but not pinned can happen when InoIdx is pinned.
4333 - I think the uninc_next list (At least) should be sorted before
4336 - root block dirty/realloc/leaf in final iput
4337 Could be it was changed during last checkpoint so
4338 pushed in to next phase? But why Realloc?
4339 Maybe still issue with losing inode data block.
4341 20 June 2010 Happy Birthday Dad!!
4344 4 BUG: unable to handle kernel paging request at 6b6b6bfb
4345 26 kernel BUG at /home/neilb/work/nfsbrick/fs/module/block.c:209!
4346 87 kernel BUG at /home/neilb/work/nfsbrick/fs/module/inode.c:601!
4347 1 kernel BUG at /home/neilb/work/nfsbrick/fs/module/inode.c:839!
4348 4 kernel BUG at /home/neilb/work/nfsbrick/fs/module/inode.c:856!
4349 1 kernel BUG at /home/neilb/work/nfsbrick/fs/module/modify.c:1719!
4350 12 kernel BUG at /home/neilb/work/nfsbrick/fs/module/segments.c:1033!
4351 2 kernel BUG at /home/neilb/work/nfsbrick/fs/module/super.c:656!
4354 - inode in i_sb_list has been freed.
4355 - block 0/0 is dirty/realloc/leaf after final iput
4356 - not all blocks freed by truncate
4357 - Index block with uninc is not dirty - not FIXED: more iolock in phase_flip
4358 - still children when truncate should have finished.
4360 Maybe inode has become unhashed and we re-load it??
4361 it is invalid after all!!
4362 - Index block not dirty when incorp - has uninc. ??
4363 - didn't wait for free segments
4364 - Data 16/0 is dirty but not pinned after final checkpoint - FIXED
4367 watch -d 'awk -f checkseg /tmp/log; echo ====== ; grep -h -E "(blocked for more|BUG|SysRq : )" /var/tmp/lafs-logs/log* | sort | uniq -c ; echo === ; ls /var/tmp/lafs-logs/log* | wc -l'
4368 watch -d 'echo ====== ; grep -h -E "(blocked for more|BUG|Busy inodes after|SysRq : )" /var/tmp/lafs-logs/log* | sort | uniq -c ; echo === ; ls /var/tmp/lafs-logs/log* | wc -l'
4371 Unclear on dirtying index blocks.
4372 We normally mark it dirty first, then add the address to the uninc list.
4373 Note that this is the reverse of data blocks which are changed first, then
4374 dirtied. So maybe we should mark dirty afterwards. We then need to
4375 avoid incorporation while we are adding addresses else we might find it
4376 has addresses but is not dirty. Only try if dirty?
4377 Maybe we should iolock the parent. We need to do that anyway to flush
4378 incorporations when the table is full. Yes, that fits the VM model
4379 better. Always lock while updating and preparing to write. Set
4380 writeback once write has started, then unlock. Cool.
4381 But only the block itself is iolocked when we allocate (to 0), so we cannot lock the parent..
4384 Apart from tracking down the remaining bugs, I need to:
4385 1/ Decide on locking for incorporation and attaching new address to a block
4387 In particular we need to not lose the Dirty flag before the update is done.
4388 2/ Resolve handling of pinned inode data/index blocks
4389 3/ Correct handling of empty index blocks, particularly when parent is in
4390 different phase. Make lookup be more careful?
4391 4/ Wait for there to be enough free segments before allowing allocation.
4393 2: Problem is that we cannot handle a pinned inode-data block while the
4394 InoIdx block is pinned in the same phase.
4395 We currently unpin it so it drops off the leaf list. But then we
4396 need to re-pin it when the InoIdx is unpinned or phase_flipped, and that
4397 gets ugly. Possible though.
4398 An alternate is to treat it like a parent and keep it off the list
4399 while the InoIdx is pinned/same-phase. So we would need to
4400 re-assess it after unpinning or flipping the InoIdx. That is probably
4401 a lot easier than re-pinning it.
4403 1: We would normally set 'dirty' after changing the block. But we need
4404 to differentiate Dirty from Realloc, so we set before adding addresses.
4405 This requires that we are careful not to write an index block while there
4406 are pending changes. The fact that pinned children stop any writing,
4407 as do pending addresses in a list should ensure this.
4409 3: When an index block becomes empty we need to make sure that
4410 future lookup doesn't get confused by it. Specifically future
4411 index lookup must avoid the block so nothing new gets added.
4412 Possibly a previous block will split again, but this block must remain
4414 However we cannot update the parent block immediately as it might
4415 be in a different phase.
4416 So we must record both "don't touch this" and "where to look instead"
4417 elsewhere - in children.
4418 If the block being deleted is *not* the first child in the parent,
4419 then we direct index lookup to the earlier block.
4420 If the block being deleted *is* the first child in the parent,
4421 then redirect to the second child if there is one and we weren't just there.
4422 If there is no other block we flag the parent as empty and retry
4424 We flag a parent as empty with B_EmptyIndex.
4426 What locks do we need to walk around the sibling list?
4427 the inode private_lock is minimal, but we cannot hold that to take a
4428 iolock - just to get a reference.
4431 - try to find a good block using private_lock
4432 - get a ref and wait for it.
4433 - check if it is still a good block. If not, start again
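The three steps above form a retry loop. A single-threaded sketch, with the
"wait" modelled as a state change that may invalidate the candidate;
'struct sib' and find_good() are hypothetical names, and real code would take
the inode private_lock around step 1 only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical sibling-list entry. */
struct sib {
	bool good;		/* looks usable right now */
	bool good_after_wait;	/* what it looks like once the wait ends */
	int refcnt;
};

struct sib *find_good(struct sib *s, int n)
{
	assert(s != NULL || n == 0);
	for (;;) {
		struct sib *cand = NULL;
		int i;

		/* 1: pick a candidate (under private_lock in real code) */
		for (i = 0; i < n; i++)
			if (s[i].good) {
				cand = &s[i];
				break;
			}
		if (!cand)
			return NULL;

		/* 2: take a reference, then wait without the spinlock;
		 *    the block may change while we sleep */
		cand->refcnt++;
		cand->good = cand->good_after_wait;

		/* 3: re-check; if it went bad, drop the ref and restart */
		if (cand->good)
			return cand;
		cand->refcnt--;
	}
}
```

The loop terminates in this model because a rejected candidate is no longer
'good' on the next scan.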
4435 If we find an EmptyIndex block, it must be directly addressed by parent.
4436 It will never be followed by a PrimaryRef block because if there were
4437 such a block, we would have readdressed it back and hidden the EmptyIndex.
4438 So we need to look around for an address in the parent that leads to
4439 a non-EmptyIndex block.
4441 If all children are empty, we need to make the parent empty. But
4442 what if it is InoIdx?
4443 Maybe I am making this too hard. I could just use i_alloc_sem to
4444 block lookups while truncate is happening. That doesn't address
4445 single block removal e.g. from directories.
4446 So I need to be able to wait for incorporation to happen on an
4447 empty index block. We hold iolock on the parent. If there are blocks
4448 on ->uninc, we just process them immediately. If there are blocks on
4449 ->uninc_next, we wait for the checkpoint to complete
4451 What does lafs_incorporate actually do with EmptyIndex blocks?
4452 Provided that they match currently incorporated addresses, they just cause
4453 those addresses to disappear.
4455 If a block is in the uninc list for its parent, then is phase_flipped
4456 and changed and written out it could get a new physaddr before
4458 I guess we never allocate a B_Uninc block which is in a different phase
4459 to the parent. Currently we wouldn't do that anyway except in truncate
4460 though memory pressure on index blocks might one day??
4461 Truncate? We cannot allocate directly in lafs_incorporate.
4462 We should get lafs_cluster_allocate to notice and DTRT.
4464 Only hash index blocks when they are incorporated. Not needed before then.
4465 When processing an uninc list, if an address appears twice, prefer the one
4466 that isn't EmptyIndex...
4469 I need a clear picture of the "Steady state" for an internal index block
4471 The internal index block contains 1 or more addresses. For each address there
4472 may be a child index block. If there is, it may be the head of a list of
4473 blocks with B_PrimaryRef set thus holding the whole list in place until
4474 incorporation happens.
4475 Each of these children can be on either ->uninc_list or ->uninc_next,
4476 or possibly neither if they haven't been queued for writing yet. Any
4477 PrimaryRef block will be Pinned.
4479 When a child is incorporated and found to be Empty it is flagged as such
4480 and then must never be returned by index lookup. Index lookup will either
4481 add a block to a leaf index so it doesn't appear empty, or will hit an EmptyIndex
4482 block and so have to start again from the top.
4484 When a PrimaryRef block becomes empty it is simply removed from the
4485 PrimaryRef chain so it cannot be found. The space now belongs to the
4487 When a non-PrimaryRef block which isn't the first becomes empty it is
4488 flagged and left in place so that following blocks can be found. The
4489 address space now belongs to the previous block.
4490 When the first child (fileaddr matches parent) becomes empty - what?
4491 We could re-address first child but that forces early address change -
4492 old might not be incorp yet
4493 We could re-address the parent, but that doesn't work for InoIdx
4494 We could leave it there with physaddr == 0
4496 Last sounds promising. So we never re-address an index block.
4500 Index blocks, Indirect blocks, extent blocks each have an address
4502 When a block becomes over-full it splits - a new block appears with
4503 a new address thus implicitly limiting the address space covered
4506 When an index block becomes empty and has no pinned children it is
4507 marked as EmptyIndex (under IOLock).
4508 When an EmptyIndex is allocated it goes to phys==0
4509 An EmptyIndex which is not first (->fileaddr != ->parent->fileaddr)
4510 is never used again. Its address space is ceded to the previous
4511 index block - which could split several times...
4512 An EmptyIndex which is first can be re-used. Once it gets pinned
4513 children the EmptyIndex is cleared.
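The EmptyIndex rules above, as a small sketch. 'struct iblk' and the helper
names are hypothetical; only the flag name and the
->fileaddr != ->parent->fileaddr test come from the notes, and the IOLock is
left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical cut-down index block. */
struct iblk {
	uint32_t fileaddr;
	struct iblk *parent;
	bool empty_index;	/* stands in for B_EmptyIndex */
	int pinned_children;
};

/* "First" means the block shares its parent's file address. */
bool is_first_child(const struct iblk *b)
{
	return b->parent && b->fileaddr == b->parent->fileaddr;
}

/* A non-first EmptyIndex is never used again; a first one may be. */
bool lookup_may_use(const struct iblk *b)
{
	return !b->empty_index || is_first_child(b);
}

/* Gaining a pinned child clears EmptyIndex. */
void add_pinned_child(struct iblk *b)
{
	assert(lookup_may_use(b));
	b->pinned_children++;
	b->empty_index = false;
}
```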
4515 An Index block always has an entry for the first address. It might
4516 be implicit to phys==0. Loading such a block creates an empty
4519 InoIdx doesn't get EmptyIndex, rather it gets ->depth=1
4521 Indirect *doesn't* store the first address any more.
4524 DONE - remove forcestart from layoutinfo
4525 DONE - remove start-address from Indirect blocks
4526 DONE - only hash index blocks when they are known to be incorporated.
4527 DONE - when incorporating an uninc list, ignore phys==0 if also a block with
4528 same fileaddr and phys!=0. so sort phys==0 first
4529 DONE - Create EmptyIndex flag
4530 DONE - Clear the flag when adding child pin to index block
4531 DONE - avoid EmptyIndex non-start blocks during index lookup
4532 DONE - allow index blocks to be loaded with ->phys==0
4533 DONE - allow EmptyIndex index block to be "written" to phys 0
4534 DONE - ensure index lookup finds implicit start address, possibly 0
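The "sort phys==0 first" item can be sketched with qsort. 'struct addr_ent'
and dedup_uninc() are hypothetical; the real uninc list is a list of blocks,
not a flat array.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical flattened uninc entry. */
struct addr_ent {
	uint32_t fileaddr;
	uint64_t phys;
};

/* Order by fileaddr; within one fileaddr, phys==0 sorts first. */
int cmp_ent(const void *a, const void *b)
{
	const struct addr_ent *x = a, *y = b;

	if (x->fileaddr != y->fileaddr)
		return x->fileaddr < y->fileaddr ? -1 : 1;
	if ((x->phys == 0) != (y->phys == 0))
		return x->phys == 0 ? -1 : 1;
	return 0;
}

/*
 * Sort the entries, then drop any phys==0 entry that is directly
 * followed by an entry with the same fileaddr and phys!=0.
 * Returns the number of entries kept.
 */
size_t dedup_uninc(struct addr_ent *e, size_t n)
{
	size_t i, out = 0;

	assert(e != NULL || n == 0);
	if (n == 0)
		return 0;
	qsort(e, n, sizeof(*e), cmp_ent);
	for (i = 0; i < n; i++) {
		if (e[i].phys == 0 && i + 1 < n &&
		    e[i + 1].fileaddr == e[i].fileaddr && e[i + 1].phys != 0)
			continue;	/* superseded by a real address */
		e[out++] = e[i];
	}
	return out;
}
```

Because phys==0 sorts first, the superseding entry is always the immediate
neighbour, so one forward pass is enough.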
4536 So now after 36 runs
4537 3 kernel BUG at /home/neilb/work/nfsbrick/fs/module/index.c:1939!
4538 1 kernel BUG at /home/neilb/work/nfsbrick/fs/module/index.c:403!
4539 10 kernel BUG at /home/neilb/work/nfsbrick/fs/module/inode.c:605!
4540 14 kernel BUG at /home/neilb/work/nfsbrick/fs/module/segments.c:1034!
4541 4 kernel BUG at /home/neilb/work/nfsbrick/fs/module/segments.c:624!
4542 1 kernel BUG at /home/neilb/work/nfsbrick/fs/module/super.c:657!
4547 block 0/2 is Realloc and being allocated from cluster_flush while
4548 parent is not Realloc or dirty
4549 That is bad as Realloc gets set in lafs_allocated_block ... except
4550 that the code was bad. FIXED.
4553 cleaner is pinning a block (299/25) which is not Realloc,
4554 and phase isn't locked. We are only meant to pin data blocks
4555 for updates while holding a phase lock.
4556 Ahhh - bad code again. FIXED
4559 Truncate doesn't clean up properly.
4564 No sign of any children.
4566 Very weird. Sign of incorporation going wrong.
4567 Added more debugging.
4569 Found 4084 4 12 at 890
4571 Found 4089 4 16 at 878
4573 Found 4094 2 20 at 866
4575 Found 2561 2 22 at 854
4577 Found 2564 4 24 at 842
4578 Found 2569 2 28 at 830
4581 Why are 2564 etc lost? No sign of alloc-to-0
4584 no free segments - need to wait somewhere.
4587 allocated_blocks has gone over free_blocks!
4588 in lafs_prealloc/reserve_block/free_get/ss_put/new_segment.../checkpoint.
4589 Wanted CleanSpace to reserve the youthblk
4590 Maybe related to not waiting - ignore for now.
4593 block 0/2 was dirty but not pinned. Should not happen to inodes.
4594 block 0/0 was Pinned because it had a child - as above.
4596 Maybe we don't carry the pin across when we collapse dir
4597 into inode??... looks quite likely
4603 1 BUG: unable to handle kernel paging request at 6b6b6bfb
4604 1 kernel BUG at /home/neilb/work/nfsbrick/fs/module/block.c:497!
4605 3 kernel BUG at /home/neilb/work/nfsbrick/fs/module/dir.c:710!
4606 7 kernel BUG at /home/neilb/work/nfsbrick/fs/module/inode.c:606!
4607 61 kernel BUG at /home/neilb/work/nfsbrick/fs/module/segments.c:1034!
4608 1 kernel BUG at /home/neilb/work/nfsbrick/fs/module/super.c:657!
4609 42 SysRq : Resetting
4613 invalidate_inode_buffers called on at shutdown.
4617 block 16/1 is not dirty with no credits.
4618 Maybe writepage got to it?
4621 ouch! dir lookup failed in unlink.
4622 No real hints. Must be hash based - some off-by-one probably.
4623 Need to stare at the code.
4626 Blocks still present after truncate.
4627 typically about 60, but in 1 case '4'. No index blocks.
4628 So probably content of second index block.
4629 Yes, lafs_leaf_next was doing the wrong thing for addresses
4630 before start of block.
4636 dir inode 0/2 is still Dirty but not pinned.
4637 Maybe lafs_dirty_inode should be pinning the block
4639 But now this triggers for 16/X still dirty.
4642 How and when to write blocks in a SegmentMap file?
4643 - We don't want normal write-back to write them unless they have
4645 - We need to write them in tail of checkpoint, and index info must
4646 follow in the next checkpoint.
4648 lafs_space_alloc is called from
4649 - mark_cleaning: always CleanSpace, failure is OK
4650 - lafs_cluster_update_pin: ReleaseSpace. -EAGAIN is OK (CHECK THIS) but failure
4651 is not - or shouldn't be.
4652 - lafs_allocated_block: CleanSpace, checking if parent of Realloc block
4653 can be saved separately from any Dirty version. Failure OK, blocking not.
4654 - lafs_prealloc - general space allocation.
4656 lafs_cluster_update_pin is called from:
4657 - lafs_create, lafs_link, lafs_unlink, lafs_rmdir, lafs_symlink, lafs_mkdir
4658 lafs_mknod, lafs_rename,
4660 So best to return -EAGAIN, and it should be handled adequately.
4662 lafs_prealloc is called from:
4663 - lafs_reserve_block, after modifying the alloc_type extensively.
4664 - lafs_phase_flip to re-fill the 'next' credits. If they aren't available
4665 we simply pin all children so they aren't needed.
4667 - lafs_seg_ref_block: getting CleanSpace to save segusage blocks.
4668 If this fails .. what?? lafs_reserve_block fails. so...
4670 lafs_reserve_block is called from
4671 - mark_cleaning - CleanSpace
4672 - lafs_pin_dblock - type is passed in...
4673 - lafs_prepare_write - on failure write will fail or retry after checkpoint
4674 - lafs_inode_handle_orphan - to help with delete. On failure we allow
4676 - lafs_seg_move - should be elsewhere. Failure BAD !
4677 - lafs_free_get - as above, failure BAD
4678 - clean_free - update youth for new clean blocks - Failure BAD
4680 lafs_pin_dblock is called from
4681 - dir_create_pin - fail or again handled
4685 - lafs_dir_handle_orphan
4690 - lafs_orphan_release !! cannot handle failure
4691 - roll_block should use AccountSpace
4693 So: It seems we need a new allocation class that will never fail.
4694 Maybe it is allowed to BUG though?
4695 AccountSpace - i.e. space needed to account for the use of space.
4696 Must never ever fail.
4698 Then we must ask where blocking should happen on -EAGAIN.
4699 dir.c does "lafs_checkpoint_unlock_wait", then tries again.
4700 prepare_write does too.
4702 For that to work we must start a checkpoint on returned EAGAIN.... Don't
4703 we want to wait for some cleaning to happen first though? Maybe an extra
4704 flag, and a count of the number of empty (but not clean) blocks.
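A minimal sketch of the retry-on-EAGAIN pattern described above, written as a user-space toy rather than kernel code. Everything here (struct toy_fs, toy_pin, toy_checkpoint_wait) is hypothetical and only stands in for the real pin and lafs_checkpoint_unlock_wait paths. Giving up with -ENOSPC when no clean segments exist is an assumption of the sketch, not something these notes have decided.

```c
#include <errno.h>

/* Toy state standing in for the filesystem; all names are hypothetical. */
struct toy_fs {
	int free_segments;   /* segments usable for new writes */
	int clean_segments;  /* cleaned, but not free until a checkpoint */
	int checkpoints;     /* number of checkpoints taken so far */
};

/* An operation that needs one free segment, else asks the caller to retry. */
static int toy_pin(struct toy_fs *fs)
{
	if (fs->free_segments == 0)
		return -EAGAIN;
	fs->free_segments--;
	return 0;
}

/* Stand-in for waiting on a checkpoint: clean segments become free. */
static void toy_checkpoint_wait(struct toy_fs *fs)
{
	fs->free_segments += fs->clean_segments;
	fs->clean_segments = 0;
	fs->checkpoints++;
}

/* Retry loop: on -EAGAIN wait for a checkpoint, then try again.
 * Assumption: if there is nothing clean to free, report -ENOSPC. */
static int toy_pin_retry(struct toy_fs *fs)
{
	int err;

	while ((err = toy_pin(fs)) == -EAGAIN) {
		if (fs->clean_segments == 0)
			return -ENOSPC;
		toy_checkpoint_wait(fs);
	}
	return err;
}
```

The point of the toy is that the checkpoint only helps if cleaning has already produced clean segments, which is why a count of clean-but-not-free segments seems useful.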
4706 - Should I skip orphan handling when tight on space? Probably not. It will
4707 just keep failing while we keep cleaning...
4708 - roll_block should use account_space .. or not
4710 - lafs_space_alloc simply allocates space, or fails. 'why' is used to
4711 guide watermark choice.
4712 - lafs_prealloc allocates space to a block and all its parents based on
4713 'why' for watermarks. It either succeeds or fails.
4715 - lafs_cluster_update_pin and lafs_reserve_block decide whether to respond
4716 to failure as -ENOSPC or -EAGAIN based on 'why'.
4718 - lafs_pin_dblock simply passes on the failure, which must be handled.
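One possible shape for the layering above, as a user-space sketch only. The class names NewSpace, CleanSpace and AccountSpace come from these notes; the watermark values, the helper names space_alloc and reserve, and the exact errno chosen per class are assumptions made for illustration.

```c
#include <assert.h>
#include <errno.h>

/* Allocation classes; only the names come from the notes. */
enum space_why { NewSpace, CleanSpace, AccountSpace };

/* Stand-in for the allocator: succeed only while enough blocks remain
 * above the watermark chosen by 'why'. Watermark values are invented. */
static int space_alloc(long *free_blocks, long want, enum space_why why)
{
	long watermark = (why == NewSpace) ? 16 : (why == CleanSpace) ? 4 : 0;

	if (*free_blocks - want < watermark)
		return 0;
	*free_blocks -= want;
	return 1;
}

/* Map an allocation failure to an errno based on 'why'.
 * Assumed policy: NewSpace gets -ENOSPC, CleanSpace gets -EAGAIN. */
static int reserve(long *free_blocks, long want, enum space_why why)
{
	if (space_alloc(free_blocks, want, why))
		return 0;
	/* AccountSpace must never fail; here a failure is treated as a bug. */
	assert(why != AccountSpace);
	return why == NewSpace ? -ENOSPC : -EAGAIN;
}
```

With a zero watermark AccountSpace can use the last block, so the assert only fires if the reserve held back for accounting was itself too small.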
4720 So: What to do when we return -EAGAIN?
4721 We need to wait until there are *enough* clean segments, then cause a checkpoint
4722 so they become free.
4723 So a flag that says 'waiting for free space' and a count of segments
4726 But how do we differentiate ENOSPC and EAGAIN for NewSpace requests?
4727 Maybe we don't ?? Or do it later.
4730 - Audit all AccountSpace and justify them
4731 + lafs_seg_move is probably wrong. Should have allocated when the
4732 free segment was allocated
4733 - lafs_orphan_release called lafs_pin_dblock but cannot handle failure
4734 - Need to wait not just for "enough space" but for "enough clean segments".
4736 - how is 'free_blocks' set - what does this tell us??
4738 free_blocks is the sum of known-clean segments.
4741 remainder for each active segment
4742 then reserve some segments for cleaning.
4743 And separate 'allocated_block' for each ?
4746 segments.c:647 fired: AccountSpace had no space available.
4747 Reserving space to write the segusage of youth block for a newly
4750 0/2 is Dirty but not Pinned Maybe we need PinPending
4753 Maybe I need cond_resched??
4755 Maybe I want two separate 'free_blocks' counters.
4756 One that includes all free blocks for use in 'df' etc.
4757 One that only includes completely free segments for use in allocation...
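A small sketch of the two-counter idea, assuming a hypothetical struct toy_space and a per-segment accounting call. Neither name exists in the module; the sketch only shows how the two totals would diverge.

```c
/* Hypothetical pair of counters, as suggested above. */
struct toy_space {
	long free_total;  /* every free block: what 'df' would report */
	long free_alloc;  /* blocks in completely free segments only */
};

/* Account one segment of 'seg_size' blocks with 'used' blocks in use. */
static void seg_account(struct toy_space *sp, long seg_size, long used)
{
	sp->free_total += seg_size - used;
	if (used == 0)
		sp->free_alloc += seg_size;
}
```

A partly used segment adds to the 'df' figure but contributes nothing to what the allocator may hand out.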
4762 Something is wrong with cleaning and segment tracking
4763 We have 5 free segments and we get them all without writing
4764 anything! We consume them all with cluster_flush!
4765 It seems that the root inode is not changing phase!
4766 Nothing is on the phase leafs.
4767 Most children are in Writeback on cluster. and are Realloc
4768 Others have pinned children.
4769 They are all in 'cluster', but 'flush' doesn't flush them,
4770 so they must be in a different cluster??? Is the cleaner still
4771 cleaning? Yes, they are on the cleaner 'wc' list so they are
4772 queued but not flushed for the cleaner.
4775 At last it looks like I nearly have a working FS. Out of 361 test
4776 runs, 9 triggered BUGS and one hung at umount.
4778 I need a new TODO list, starting with 6 jul 2007(!) and adding any
4781 DONE 0/ start TODO list
4782 DONE 1/ document new bugs
4783 DONE 2/ Tidy up all recent changes as individual commits.
4784 DONE 3/ clean up the various 'scratch' patches discarding any tracing that
4785 I don't think I need, and making the rest 'dprintk' etc.
4786 DONE 4/ check in this README file
4787 DONE 5/ Write rest of the TODO list
4789 DONE 5a/ index.c:1982. Data block with Phys and no UnincCredit
4790 It is Dirty but only has *N credits.
4793 DONE 5b/ phase_flip/pin_all_children/lafs_refile finds refcnt == 0;
4794 I guess we should getref/putref.
4796 DONE 5c/ dirty_inode might find InoIdx is allocated but datablock not
4797 and doesn't cope well.
4799 DONE 5d/ At unmount, 16/1 is still pinned.
4801 6/ soft lockup in unlink call.
4802 EIP is at lafs_hash_name+0xa5/0x10f [lafs]
4803 [<d0a56283>] hash_piece+0x18/0x65 [lafs]
4804 [<d0a564c3>] lafs_dir_del_ent+0x4e/0x404 [lafs]
4805 [<d0a56256>] ? lafs_hash_name+0xfa/0x10f [lafs]
4806 [<d0a4b35c>] dir_delete_commit+0xdb/0x187 [lafs]
4807 [<d0a4be3f>] lafs_unlink+0x144/0x1f4 [lafs]
4808 [<c02602c1>] vfs_unlink+0x4e/0x92
4810 Don't know. Looks like cleaning up a chain in dir_delete_commit.
4813 Would we be spinning on -EAGAIN ?? 4 empty segments are present.
4815 6a/ index.c:1947 - lafs_add_block_address of index block where parent
4817 looping on [cfbd4690]327/336(0)r3F:Index(1),Pinned,Phase0,Valid,SegRef,CI,CN,CNI,UninCredit,PhysValid,PrimaryRef,EmptyIndex,Uninc{0,0}[0] uninc(1) inode_handle_orphan2(1) leaf(1)
4818 /home/neilb/work/nfsbrick/fs/module/index.c:1947: [cfbd5c70]327/0(0)r2F:Index(1),Pinned,Phase0,Valid,Dirty,Writeback,SegRef,CI,CN,CNI,UninCredit,PhysValid,EmptyIndex,Uninc{0,0}[0] inode_handle_orphan2(1) leaf(1)
4820 6b/ check_seg_cnt seems to be spinning on the 3rd section
4821 the clean list has no end!
4823 CLEANABLE: 0/0 y=0 u=0 cpy=32773
4824 CLEANABLE: 0/1 y=0 u=0 cpy=32773
4825 CLEANABLE: 0/2 y=0 u=0 cpy=32773
4826 CLEANABLE: 0/3 y=32773 u=6 cpy=32773
4827 CLEANABLE: 0/4 y=32772 u=124 cpy=32773
4828 CLEANABLE: 0/5 y=32771 u=273 cpy=32773
4829 CLEANABLE: 0/6 y=32770 u=0 cpy=32773
4843 6c/ at shut down, some simple orphans remain
4846 DONE 7/ block.c:624 in lafs_dirty_iblock - no pin, no credits
4847 truncate -> lafs_invalidate_page -> lafs_erase_dblock -> lafs_allocated_block / lafs_dirty_iblock
4848 Allocated [ce44f240]327/144(1499)r2E:Writeback,PhysValid clean2(1) cleaning(1) -> 0
4850 Oh dear: [ce44f240]327/144(0)r2E:Writeback,PhysValid clean2(1) cleaning(1)
4851 .......: [cfb69180]327/0(349)r2F:Index(1),Pinned,Phase0,Valid,PhysValid{0,0}[0] child(1) leaf(1)
4852 Why have I no credits?
4853 /home/neilb/work/nfsbrick/fs/module/block.c:624: [cfb69180]327/0(349)r2F:Index(1),Pinned,Phase0,Valid,Dirty,PhysValid{0,0}[0] child(1) leaf(1)
4855 Cleaning is racing with truncate, and that cannot happen!!
4856 Actually it could - if i_size changed at the wrong time.
4858 DONE 7a/ block.c:507 in lafs_dirty_dblock - no credits for 0/2
4859 block.c:507: [cfa63c58]0/2(4348)r2F:Valid,Dirty,Writeback,PhysValid cluster(1) iblock(1)
4860 in touch_atime. I think I know this one.
4862 7b/ soft lockup in cleaner between 0x5e6, then 0x799-7f6 then 0x990 of 0x1502
4863 i.e. 1510, 1945-2038, 2448 of 5378
4864 Appears to be looping in first loop of try_clean, maybe
4865 group_size_words == 0 ??
4868 DONE 7c/ NULL pointer deref - 000001b4
4869 Could be cluster_flush finds inode dblock without inode.
4870 Have a BUG_ON of this now.
4872 DONE 7d/ paging request at 6b6b6bfb.
4873 invalidate_inode_buffers called, so inode_has_buffers,
4874 so private_list is not empty. So presumably use-after-free.
4875 But is on s_inodes list.
4876 Probably cleaner is still active (if this is first call to
4877 invalidate_inodes in generic_shutdown_super) so list gets broken.
4878 We need locking or earlier flush.
4880 DONE 7e/ Remove BUG block.c:273 as cleaner can cause this.
4881 Check for Realloc too.
4883 PRESUME-FIXED 7f/ index.c:2024 no uninc credit
4884 [ce532338]0/306(2996)r1F:Pinned,Phase0,Valid,Dirty,Writeback,SegRef,Claimed,PhysValid cluster(1)
4885 found during checkpoint. Maybe inode credit problem.
4887 PRESUME-FIXED 7g/ inode.c:831 InoIdx 283/0 is Realloc, not dirty, and has
4888 ->uninc blocks. This is during truncate. Need some
4889 interlock with cleaner maybe?
4890 Probably the same race between cleaner and truncate.
4892 DONE 7h/ inode.c:845 truncate finds children - Realloc on clean-leafs
4894 NOLONGERRELEVANT 7j/ resolve space allocation issues.
4895 Understand why CleanSpace can be tried and failed 1000
4896 times before there is any change.
4898 DONE 7k/ use B_Async for all async waits, don't depend on B_Orphan to do
4900 write lafs_iolock_written_async.
4902 DONE 7l/ make sure i_blocks is correct.
4903 set on 'import_inode'
4904 decreased when lafs_summary_update assigned block to '0'
4905 changed when lafs_summary_allocate changes e.g. quota.
4907 lafs_summary_update is called when a block is assigned to a location,
4908 or to zero. It is real usage.
4909 lafs_summary_allocate is called when we set Prealloc on phys==0 or
4910 clear Prealloc on phys==0
4911 So allocate must be followed exactly.
4912 update is already counted for setting !=0, so only dec on ==0.
4914 What about quota? - hidden in quota_allocate / qcommit
4916 7m/ delete inode could not progress through inode_map_free, so
4917 ORPH: [ce545f08]0/290(1663)r4E:Valid,Dirty,Async,UninCredit,Claimed,PhysValid,Orphan(0) async(1) delete_inode(1) orphan_list(1) drop(1)
4918 was permanently an orphan.
4920 DONE 8/ looping in do_checkpoint
4921 root is still in Phase1 because 0/2 is in Phase 1
4922 [cfa57c58]0/2(2078)r1E:Pinned,Phase1,WPhase0,Valid,Dirty,C,CI,CN,CNI,UninCredit,IOLock,PhysValid</file.c:269> writepageflush(1)
4923 Seems to be waiting for writeback, but writeback is clear.
4924 Need to call lafs_io_wake in lafs_iocheck_writeback for when
4925 it is called by lafs_writepage
4927 DONE 9/ cluster.c:478
4928 flush_data_to_inode finds Realloc (not dirty) block
4929 and InoIdx block is not Valid.
4930 [cfb5ef50]2/0(3)r1F:Index(0),Pinned,Phase1,InoIdx,SegRef,C,CI,CN,CNI,IOLock,OnFree,PhysValid{0,1}[0]</cluster.c:435> child(1)
4931 I wonder if it was PinPending, or where it was IOLocked (or if).
4933 I guess we truncated, then added data, then tried to clean.
4934 Probably just a bad 'bug' given recent changes.
4935 No, I think it is the race between truncate and clean which is now fixed.
4937 SEEMS TO BE GONE 10/ inode.c:606
4938 Deleting inode 328: 2+0+0 1+0
4941 first index at level 1 was full and pruned properly.
4942 Nothing else found empty.
4943 Somehow the second index block and contents were lost.
4945 ASSUME_DONE 11/ super.c:657
4946 Root still pinned at unmount.
4947 0/2 is Dirty: [cfa53c58]0/2(1750)r0E:Valid,Dirty,CN,CNI,UninCredit,PhysValid
4948 [cfa5fc58]0/2(2852)r0E:Valid,Dirty,SegRef,CN,CNI,UninCredit,PhysValid
4949 [cfa53c58]0/2(3570)r0E:Valid,Dirty,CN,CNI,UninCredit,PhysValid
4950 [cfa53828]0/2(2969)r0E:Valid,Dirty,CN,CNI,UninCredit,PhysValid
4951 [cfa75c58]0/2(579)r0E:Valid,Dirty,UninCredit,PhysValid
4952 maybe dir-orphan handling stuffed up
4953 Or maybe it is the I_Dirty issue. Assume fixed.
4956 ASSUME_DONE 12/ timeout/showstate in unmount
4957 umount is in sync_inodes / do_writepages / lafs_writepage / lafs_iolock_written
4958 That looks similar to 8
4960 DONE 13/ delete_inode should wait for pending truncate to complete.
4961 Document I_Trunc somewhere - including that i_mutex is needed to set it.
4962 Verify that assertion.
4963 Actually it requires i_alloc_sem, or the inode to be deleted.
4966 DONE 14/ Review writepage and flush and make sure we flush often enough but
4968 Probably just remove the cluster_flush from write-page as lafs_flush
4970 But leave for now as it encourages heavy indexing.
4972 DONE 14a/ use bio_add_page to write clusters.
4974 DONE 14b/ Figure out what backing_dev to present for the filesystem.
4976 DONE 15/ The inode map file lost some credits. I think it lost a PinPending because
4977 it isn't locked properly. Don't clear PinPending if someone else might
4980 DONE 15a/ Find all FIXMEs and add them here.
4983 DONE 15b/ Report directory size less confusingly
4985 DONE 15c/ roll-forward should not update index if physaddr hasn't changed (roll_block)
4987 DONE 15d/ What does I_Dirty mean - and implement it.
4989 FIXED 15e/ setattr should queue an update for the inode metadata.
4990 and clean up lafs_write_inode at the same time (it shouldn't do an update).
4991 and confirm when s_dirt should be set. It causes fsync to run a
4994 15f/ include timestamp in cluster_head to set mtime/ctime properly on roll-forward?
4995 ## Items from 6 jul 2007.
4997 15g/ test directories with non-random sequential hash.
4999 DONE 15h/ orphan deadlock
5000 lafs_run_orphans- lafs_orphan_release can block waiting for written
5001 in erase_dblock, but that won't complete until cleaner gets to run,
5002 but this is the cleaner blocked on orphans.
5005 DONE 15i/ separate thread management from 'cleaner' name.
5007 DONE 15j/ review rules in getref_locked - and document them
5009 DONE - fix accesses to iblock
5011 DONE 15k/ newblocks should probably be a count of segments. Review that.
5013 DONE 15l/ make sure checkpoint_youth is decayed properly. Review youth decay.
5015 DONE 15m/ consider combining .orphans and .cleaning lists. If something is an
5016 orphan, we probably don't want to clean it just now(?).
5018 DONE 15n/ consider if lafs_pin_dblock should check for iolock. Maybe
5019 iolock or PinPending (which must be set under iolock).
5020 Just require PinPending and always get iolock_written for that
5021 except in special cases.
5023 DONE 15o/ Can there be async blocks when checkpoint starts? Could they
5024 pin blocks in old phase? Do I need to check for them?
5026 DONE 15p/ Review and remove the 'if cleaner is active then don't checkpoint just
5027 yet' thing - or somehow avoid the yuckiness.
5029 DONE 15q/ check checksums when reading cluster_header for cleaner
5030 This is already done!
5032 DONE 15r/ consider further optimisation in cleaner to avoid lookups.
5034 DONE 15s/ memory barrier for i_size check in cleaner???
5036 DONE 15t/ review usable-space calculations in clean.
5038 DONE 15u/ Do I need a SegRef when pin-dblock-by-hand in flush_data_to_inode
5040 DONE 15v/ tidy up all code that fiddles bits and credits - maybe make some
5043 DONE 15w/ review cluster updates and make sure space used is accounted properly.
5045 DONT BOTHER 15x/ Consider caching result of a failed dir lookup in case we immediately
5046 try to create it. Would this actually save anything significant?
5048 DONE 15y/ Don't make dir blocks into orphans if it cannot be needed?
5050 DONE 15z/ make sure symlink creation is safe - do I need to log the body??
5052 DONE 15aa/ lafs_rename should flush orphans just like lafs_rmdir does.
5054 DONE 15ab/ Does writepage need to recheck if my_inode and/or iblock have appeared
5055 after lock is taken on block?
5057 DONE 15ac/ if lafs_shrinker cannot reclaim enough index blocks, trigger some
5060 DONE 15ad/ review lafs_phase_flip's call to lafs_add_block_address and wonder
5063 DONE 15ae/ refile wonders about a race with cluster_allocate which gets IOLock
5064 before removing from lru.
5066 DONE 15af/ Review all locking in lafs_refile
5068 DONE 15ag/ Don't allocate data part of InoIdx block.
5070 DONE 15ah/ Is there a problem with lafs_allocated_block putting an
5071 about-to-be-truncated block on an uninc list?
5073 DONE 15ai/ When allocating a new segment during checkpoint, delay the
5074 youth-block update until after the checkpoint
5076 DONE 15aj/ When roll-forward finds a new segment, make sure youth number is
5079 DONE 15ak/ Load orphan file during roll-forward and make every block an
5082 DONE 15al/ set filesystem update_time somewhere.
5084 DONE 15am/ filesystem 'name' needs to be handled uniformly.
5086 DONE 15an/ can we be sure 'b' will be non-null in delete_inode?
5088 DONE 15ao/ determine what locking is needed to walk the children list
5089 in lafs_inode_handle_orphan. Probably the address_space private lock.
5091 15ap/ Make sure write_inode has been cleaned up. See if this applies to
5092 rollforward of a symlink (see FIXME)
5094 DONE 15aq/ change inode map to be little-endian, not host-endian
5096 DONE 15ar/ understand what to do about errors in lafs_truncate
5098 15as/ handle errors from lafs_write_super ???
5100 DONE 15at/ More wait_queues to wait for different blocks.
5101 just use wait_on_bit / wake_bit
5103 DONE 15au/ How should iocheck_block set the page error?
5104 and block_loaded <- this gets it right.
5106 15av/ ditto for write errors?
5108 DONE 15aw/ when lafs_incorporate makes a new block where the
5109 old is Realloc, the new should be Realloc too.
5111 15aw2 / When a block is a snapshot block it can never be dirty
5112 so we only need credits for realloc...
5114 DONE 15ax/ Think about what happens when we relocate a block
5115 in the orphan list (lafs_orphan_release), particularly
5116 if the block isn't actually loaded.
5117 FIXME still need to make sure errors while loading the orphan
5118 file are handled correctly - I guess we mark all bad orphans as
5119 type==0 and when we find those during release, reduce the size
5122 DONE 15ay/ Wonder if there is any way for run_orphans to get a wakeup
5123 when an inode or dir mutex is released.
5126 DONE 15az/ Sanity check all values in cluster head during roll-forward
5127 i.e. in roll_valid. If the head isn't complete, we can still
5128 use this to commit some previous checkpoints.
5130 DONE 15ba/ roll forward should not BUG on bad data like inodefile in
5131 non-primary filesystem.
5133 DONE 15bb/ Do I need to sync something before copying an update over part
5134 of an inode, then reloading the inode.
5136 DONE 15bc/ Handle DescHole in roll forward.
5138 DONE 15bd/ Call lafs_add_block_address from writeback rather than iolock
5139 in roll forward, just for consistency.
5141 DONE 15be/ Confirm various files loaded at mount time (segusage, orphan ...)
5142 are actually the correct type.
5144 DONE 15bf/ Avoid quadratics in lafs_seg_put_all - nothing else should be doing
5145 a lookup - or at least we can test for that.
5146 lafs_seg_apply_all has similar problems and needs a good solution.
5148 DONE 15bg/ lafs_seg_ref_block is worried about losing implicit ref on parent
5149 if parent splits. See what to do about that.
5151 DONE 15bh/ after roll-forward, check that free_blocks hasn't gone negative.
5152 or handle if it has.
5154 DONE 15bi/ Set EmergencyClean a bit later - need at least one checkpoint first.
5157 DONE 15bj/ Make sure .last link in segtracker is kept uptodate, particularly in
5160 DONE 15bk/ make sure get_cleanable doesn't lose a race before calling add_clean
5162 DONE 15bl/ better checks for 'valid state block address' in valid_devblock
5163 include that segment_count is credible
5164 also in valid_stateblock
5166 15bm/ make sure everything gets free properly on error during mount / lafs_load
5168 15bn/ How does refcounting of 'struct fs' work with multiple filesets?
5170 DONE 15bo/ use put_super to drop last reference to superblocks
5172 DONE 15bp/ review all superblocks - maybe use more anon??
5174 15bq/ check readonly status in lafs_get_sb
5176 DONE 15br/ sync_fs should probably wait for something if 'wait'.
5178 DONE 15bs/ set f_fsid properly in lafs_statfs
5180 DONE - use new write_begin / write_end
5182 15bt/ - review how we ensure that credits remain with block.
5184 15ca/ When pin inode data block, pin it as well as index block I think
5185 It is still kept off the leaf list until the index block is done with
5188 15cb/ Layout issues:
5189 DONE - subset filesys still needs a parent pointer
5190 DONE - cluster head needs mtime/ctime to log these.
5191 - need better tracking of which devices are in this array??
5192 Need to be able to have read-only devices that are shared
5194 DONE - need multiple parallel write-clusters to allow parallel writes.
5195 - record tuning in state block:
5197 DONE - use crc or something, not toy checksum (e.g. cluster - state already has)
5198 - flags for inconsistencies found, at layout/fileset/file levels(?) (see 60)
5199 - policies of whether old or new data is allowed on each device
5200 - policies of how much duplication of metadata is required
5201 DONE - inode map - not host-endian
5202 DONE - segments > 16bit:
5203 segusage file - what about youth?
5204 cluster_head Clength
5206 15cc/ free any stray B_Async block found in destroy_inode
5208 15cd/ Some code assumes a cluster header does not exceed 1 page.
5209 Is this safe? Is it true? Is it enforced?
5210 roll-forward now handles large cluster_head.
5211 Need cleaner to handle it, and need to possibly write large
5212 cluster head when making new clusters.
5214 15ce/ classify BUGs as
5215 - internal logic errors
5217 - unusual conditions I want a warning of
5218 - data corruption errors
5220 DONE 15cf/ lafs_iget_fs needs to sometimes do in-kernel mounts for subset filesystems
5221 This is needed for the cleaner - the cleaner needs to hold a ref somehow.
5223 15cg/ lafs_sync_inode is weird - why the lafs_checkpoint_start and update_cluster
5226 15ch/ Review values of youth and checkpoint_youth and think about off-by-one
5229 15da/ Replace directory updates!!!!!
5231 15db/ Decide how version string will be used.
5233 15dc/ resolve table_size - it should be stored in the segusage file and validated
5234 based on device geometry.
5236 15ea/ rollforward should recognise VerifyDevNext{,2} to allow next
5237 cluster on same device to verify previous.
5239 15eb/ When multiple devices and lots to do and plenty of free space,
5240 allow multiple segments, one per device, to be open at once,
5241 and possibly be writing multiple clusters at once using
5244 15ec/ Implement i_version tracking. This should be a 64bit number
5245 that appears to change every time the file changes. We only
5246 need a new number when someone looks at the value with
5248 We could simply use mtime with the sub-millisecond part being
5249 a counter of times that getattr sees a change in the same
5251 However as mtime can go backwards we might get i_version going
5252 backwards, which is awkward. I wonder if I care.
5253 Otherwise, leave for an inode extension later.
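A rough sketch of the mtime-plus-counter idea, in user-space C. It assumes the counter replaces the sub-millisecond part of a nanosecond mtime and is bumped only when getattr notices a change within the same millisecond. The struct, the 'changed' flag and both function names are hypothetical; none of this is what the module does today.

```c
#include <stdint.h>

#define NS_PER_MS 1000000ULL

/* Hypothetical per-inode state for the sketch. */
struct toy_inode {
	uint64_t mtime_ns;  /* last modification time, in nanoseconds */
	uint64_t seen_ms;   /* millisecond of mtime when last reported */
	uint32_t counter;   /* changes getattr has seen within seen_ms */
	int changed;        /* set by writers, cleared by getattr */
};

/* A write updates mtime and notes that the file changed. */
static void toy_write(struct toy_inode *ino, uint64_t now_ns)
{
	ino->mtime_ns = now_ns;
	ino->changed = 1;
}

/* Only pick a new number when someone actually looks at the value. */
static uint64_t toy_getattr_version(struct toy_inode *ino)
{
	uint64_t ms = ino->mtime_ns / NS_PER_MS;

	if (ino->changed) {
		if (ms == ino->seen_ms) {
			/* Same millisecond: bump the counter, but never let
			 * it spill into the millisecond part. */
			if (ino->counter < NS_PER_MS - 1)
				ino->counter++;
		} else {
			ino->seen_ms = ms;
			ino->counter = 0;
		}
		ino->changed = 0;
	}
	return ms * NS_PER_MS + ino->counter;
}
```

As noted above, nothing here stops the value going backwards if mtime itself is set backwards.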
5255 16/ Update locking.doc
5257 17/ cluster_flush calls lafs_cluster_allocate calls lafs_add_block_address
5258 calls lafs_iolock_written. How do we know that won't block on cluster_flush?
5260 18/ See if per-fs shrinker is available yet and consider it for index blocks.
5262 19/ Review WritePhase and make sure it is used properly.
5264 20/ Review places where we update blocks and be sure they are not in writeout
5265 or in a different phase.
5267 21/ Review and document all lru uses (locking.doc) and make sure they are
5268 all locked properly.
5270 22/ Check possible failures:
5273 - reading critical metadata
5276 23/ Rebase on 2.6.latest. Done for .38
5278 24/ load/dirty block0 before dirtying any other block in depth=0 file,
5279 else we might lose block0
5281 25/ use kmem_cache for
5283 indexblock - probably a mempool because we cannot allow failure when
5284 splitting an index block.
5285 skippoint (mempool?)
5289 26/ Review seg addressing code for 2-D geometries.
5291 27/ Allow ranges of holes in pending_addr so partial truncate can be more efficient.
5293 28/ Make sure youth blocks are always referenced properly.
5295 29/ Make sure new segments are referenced properly. I think there might be
5296 some double referencing.
5298 30/ Decide when to use VerifyNULL or VerifyNext2
5300 31/ Implement non-logged files
5302 DONE 32a/ Store access time in a file
5303 32b/ Make it a non-logged file
5304 32c/ Avoid writing out dirty atime file blocks when not necessary.
5305 i.e. keep the page clean and active, and trigger 'write'
5308 33/ Support quota : group / user / tree
5310 34/ handle subordinate filesystems:
5311 ss[]->rootdir needs to be array or list
5312 lafs_iget_fs needs to understand this
5314 35/ review snapshots:
5315 - peer lists and cleaning
5320 36/ review roll-forward
5322 DONE 36a/ make sure files with nlink == 0 are handled well
5323 DONE 36b/ sanity check before trusting clusters
5324 DONE 36c/ handle miniblocks which create new inodes.
5325 DONE 36d/ Handle DescHole in roll_block
5326 DONE 36e/ When dirtying a block in roll_block, maybe use writeback rather
5327 than just iolock, for consistency...
5328 DONE 36f/ What to do if table becomes full when add_block_address in
5330 DONE 36g/ Write roll_mini for directories.
5331 DONE 36h/ In roll_one, use the cluster counting code to find block number and
5332 make sure we don't exceed the segment.
5333 DONE 36i/ add more general error checking to lafs_mount -
5334 lafs_iget orphans and segsum. Check type is correct.
5335 errors from lafs_count_orphans or lafs_add_orphans.
5336 alloc_page failure for chead - maybe allocate something bigger??
5338 37/ Configure index block hash_table at run time based on mem size??
5341 review everything needed for safe RAID5
5343 39/ How to handle all different IO errors
5345 40/ Guard against data corruption at every level.
5347 41/ Add checksums on index blocks and dir blocks and Inodes and ???
5349 42/ Store duplicates of some blocks. At least index and inode.
5351 43/ Handle writepage on mem-mapped page, adding new credits or unmapping.
5352 Make sure ->page_mkwrite sets up credits properly
5354 44/ Examine created filesystem and make sure everything looks good.
5360 47/ Write good documentation
5362 48/ Review all code, improve all comments, remove all bugs.
5364 49/ measure performance
5366 50/ Support O_DIRECT
5368 51/ Check support for multiple devices
5369 - add a device to a live array
5370 - remove a device from a live array
5374 53/ 'overlay' support
5375 So I mount one device read-only and another device
5376 writable which gets all the updates. metadata on first
5379 54/ cluster support - is this possible?
5381 55/ is any useful variant of reflink possible?
5383 56/ Review roll-forward completely.
5385 57/ learn about FS_HAS_SUBTYPE and document it.
5386 This is for fuse in particular so users can know the real type
5388 58/ Consider embedding symlinks and device files in directory.
5389 Need owner/group/perm for device file, but not for symlink.
5390 Can we create unique inode numbers?
5391 hard links for dev-files would be problematic.
5392 What do we gain? Maybe something for short symlinks.
5393 40 seems a good length to get 70% of symlinks.
5395 59/ Fix NeedFlush handling so we don't drop-then-retake
5396 a mutex as that isn't sensible.
5398 60/ Introduce some fs state recording that fsck is needed and possibly
5399 identifying what sort of fsck.
5401 61/ Try to make the inode struct smaller - maybe move some of the
5402 fs metadata into a separately-allocated struct.
5404 62/ System/trusted extended attributes:
5408 63/ user extended attributes.
5410 64/ wonder if index blocks can be flushed out by memory pressure somehow.
5411 e.g. if a data block is written by reclaim, flag the index block.
5412 When a flagged index block has no children, it is incorporated and written.
5415 65/ review why lafs_allocated_block needs the new_parent label. Should not
5416 lafs_incorporate leave all parents dirty? Maybe it is just the need for
5417 B_Realloc - so maybe lafs_incorporate should leave the new block either
5418 realloc or dirty rather than lafs_allocated_block doing it.?
5419 See also 15ad above.
5421 66/ Delay writeout of directory updates until an fsync. If a checkpoint happens
5422 first, discard the updates (and fsync waits for checkpoint to complete).
5423 If a cross-directory rename happens care is needed: either flush updates
5424 first or ensure that a flush does happen before the cross-directory
5426 Note that if the target of a rename is a directory, it must also be fully
5427 flushed before the rename can proceed.
5432 Normal sequence is to surrender UnincCredit, then to clear Dirty,
5433 then to write. If anyone re-dirties after Dirty is clear, they
5434 will naturally have to add an UnincCredit having reserved space first.
5435 However it seems that the Cleaner gets in the way as the block in question
5436 has just previously been cleaned, which consumed the UnincCredit
5437 Do we need ReallocUnincCredit?? I hope not.
5438 We generally need a way to say "I might want to write to this" so cleaner
5439 doesn't write it early.
5440 For index blocks that is pincnt. For data it is 'PinPending'.
5441 This keeps index blocks off clean_leafs until they are ready, but
5443 And in any case, TypeSegmentMap blocks don't get PinPending as they
5444 get written *after* the checkpoint. That is a rather ugly exception.
5445 Maybe we make their different handling more explicit. We put them on
5446 a separate list unpinned so the rest of the checkpoint can complete.
5447 Then we flush that list?
5448 Then PinPending keeps them off the clean_leafs list.
5450 So to clarify the plan: If a block is already Pinned to this phase,
5451 we can "clean" it by marking it Dirty rather than Realloc. This is
5452 appropriate for blocks that are likely to change soon (as blocks written
5453 to the cleaner segment are not likely to change soon).
5454 For data blocks we take "PinPending" to say "might change soon". For
5455 index blocks ... we don't know if it is pinned by Realloc or Dirty or
5456 PinPending children. So we set Realloc and wait for any children to
5457 be unpinned for whatever reason. If it is only pinned by Realloc blocks,
5458 it will end up on clean_leafs and be processed to the cleaner segment.
5459 If it is pinned by anything else it will be found by the checkpoint and
5460 processed to the new-data segment.
5462 So Index blocks always get Realloc, PinPending blocks get Dirty,
5463 Other data blocks get Realloc. Good.
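The rule just stated can be written down as a tiny decision function. This is only a sketch: struct toy_block and the MARK_* names are made up, and the real blocks carry far more state than two flags.

```c
/* What the cleaner does to a block it wants to relocate. */
enum clean_mark { MARK_REALLOC, MARK_DIRTY };

/* Hypothetical, cut-down view of a block. */
struct toy_block {
	int is_index;     /* index block rather than data block */
	int pin_pending;  /* data block that "might change soon" */
};

/* Index blocks always get Realloc, PinPending data blocks get Dirty,
 * other data blocks get Realloc. */
static enum clean_mark cleaner_mark(const struct toy_block *b)
{
	if (b->is_index)
		return MARK_REALLOC;
	return b->pin_pending ? MARK_DIRTY : MARK_REALLOC;
}
```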
5465 Must review PinPending usage... always set, then maybe-dirty inside
5466 checkpoint lock. In cases of unlocked usage (inode map) we don't clear
5467 PinPending until checkpoint so it has longer exposure to Realloc->Dirty.
5468 It is likely to be changing though, so not a big cost. Even good.
5470 Could make the distinction later. PinPending blocks don't go on
5471 clean_leafs. So if they are still realloc at the checkpoint, we Realloc
5472 to the new-data segment. This has the same net effect but is arguably
5473 cleaner. It means that if a realloc block gets pinpending set, it
5474 immediately stops being a clean leaf and so is safe.
5475 So: just keep PinPending blocks off clean_leafs. Keep them on phase_leafs.
5476 However there is no mechanism for moving things from phase_leafs to clean_leafs.
5477 So maybe they stay on clean_leafs, but when the cleaner gets to them, it
5478 dirties them and drops them.... that would work.
5480 So: if cleaner finds a block (on clean_leafs during cleaner-flush) which is
5481 Dirty or PinPending, it makes sure it is Dirty and drops it for phase_leafs
5484 BUT: Does this work for TypeSegmentMap blocks? They aren't PinPending.
5486 We could treat them specially in the cleaner. Or we could set PinPending
5487 and pin them to the phase, but treat them differently in checkpoint.
5488 If we gathered them onto a separate list, then flush the list after
5489 the phase had changed, it might be quite neat. No more getting writepages
5490 to do our work for us.
5491 They would need to be re-pinned to the next phase, then written out.
5492 Or just unpinned, and let seg_inc re-pin as appropriate... except that
5493 seg_inc is too late to pin. It dirties. We need to pin when we get
5494 SegRef. We currently reserve but we don't pin.
5495 We really do need to phase_flip these segmentmap blocks. But that requires
5496 getting extra credits, and Pinning everything if new credits are not available.
5497 And we don't really have a good list of 'everything' that depends on a segment.
5498 But seeing that space_alloc never fails for these...
5499 So Pin them, and flip them with AccountSpace
5502 - split out common 'flip' code
5503 - add 'flip' for data blocks
5504 - create list of accounting blocks and flip accounting file blocks onto
5505 that list during checkpoint
5506 Flush should write that list, not the files.
5507 - Get cleaner to ignore pinpending blocks, marking them dirty.
5508 - pin segusage blocks while ref on them is held.
5509 - writepage no longer needs special case for TypeSegmentMap, just PinPending
5510 - lafs_prealloc just tests PinPending
5513 [[aside: quota files seem to be handled like segmentmap files. Is that
5515 We only track usage of data blocks based on various 'owners' of the file.
5516 We need to know if a block was written in one phase or the next, and
5517 only count blocks written/allocated in the one.
5518 Data blocks can slip into 'this' phase quite late - any time before the
5519 parent is finally incorporated. So we don't write quota blocks
5520 until checkpoint is done. So yes, they are like SegmentMap
5525 If there are hundreds of snapshots, then a block being cleaned (whether to
5526 cleaner segment or new-data segment) could affect hundreds of segment
5527 usage counters. That would be clumsy to work with. Every block in the
5528 free table would need to hold references to hundreds of blocks. This
5529 is do-able and might not be a big waste of space, but is still clumsy.
5530 I could change the arrangement for accounting per-snapshot usage by having
5531 a limited number of snapshots and having all the counters for one segment
5532 in the one block. So a 1024-byte block could hold 512 counters (youth plus
5533 base plus 510 snapshots). Half that if I go to 4-byte counters.
5534 In the more common case of 32 snapshots, could fit counters for 8 segments in
5535 a block. This means using space/io for all possible snapshots rather than
5536 all active snapshots. It would also mean having a fairly fixed upper limit.
5537 I wonder what NILFS does....
5538 Worry about this later.
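A quick check of the counter arithmetic above, as a sketch only. The helper names are made up for illustration, and the "8 segments" figure is reproduced here by assuming 32 four-byte counter slots per segment, which is one possible reading of the note.

```c
#include <stddef.h>

/* Number of fixed-size usage counters that fit in one block. */
static size_t counters_per_block(size_t block_size, size_t counter_size)
{
	return block_size / counter_size;
}

/*
 * Number of segments whose counters fit in one block when every segment
 * reserves 'slots' counters (hypothetical: youth, base and snapshots
 * all share the same slot size).
 */
static size_t segments_per_block(size_t block_size, size_t counter_size,
				 size_t slots)
{
	return counters_per_block(block_size, counter_size) / slots;
}
```

With 2-byte counters a 1024-byte block gives 512 slots, with 4-byte counters 256; reserving 32 four-byte slots per segment leaves room for 8 segments per block.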
5540 Still trying to get pinning of SegmentMap blocks right.
5541 Normally we need a phase-lock when pinning a data block so that we
5542 don't lose the pinning before we dirty. But as we phase_flip
5543 these it doesn't matter... So just add that to the test??
5546 Reflecting on 5c - dirty_inode might find InoIdx pre-allocated but
5547 datablock not, and doesn't cope.
5548 We either prealloc both, which seems clumsy, or always defer
5549 to InoIdx if it is present and pinned.
5550 lafs_prealloc does both Index and Data blocks for inode.
5551 But Data could lose at writeout while index will replenish at
5552 phase_flip, so maybe not a good idea.
5553 If lafs_allocate_cluster finds a Dirty InoIdx it will copy the Dirty
5554 credits across to the data block (on non-cleaning segments) so the
5555 Data block doesn't need to have credits.
5557 dirty_inode gets called:
5558 {__,}mark_inode_dirty{,_sync}
5559 inode_{inc,dec}_link_count
5560 [[various quota ops]]
5565 generic_file_...write
5568 updates through inode_setattr go to lafs_setattr so the
5569 data block will be pinpending and the checkpoint lock will be held.
5571 updates through inode_*_link_count happen in filesystem and the inode data
5572 block is PinPending, or a block in the file is pinned and will be
5573 dirty, so it will get written.
5575 updates through touch_atime or file_update_time are unexpected and
5576 cannot be prepared for. file_update_time changes will be caught by
5577 normal file writeout. atime changes will be lost until we get the
5581 dirty_inode cannot change the block as it might be in writeout, and
5582 it cannot lock anything as it might be in touch_atime which shouldn't
5583 block and cannot fail.
5584 So just set I_Dirty and use that to flush inode to db at writeout.
5585 Any changes which must be in the next phase will come via setattr and
5586 so will wait for incompatible changes to be written out.
5588 Reflecting on 7c - cluster_flush might find ->my_inode is NULL.
5591 iget and mount-time stuff
5595 When I_Destroyed is set and the last ref on the block is dropped
5596 When inode_map_new_prepare claims an inodeblock
5598 So we could easily not have a my_inode - e.g. just cleaning the data block.
5599 ->my_inode cannot disappear while we hold the block, so a test is safe.
5602 ----------------------------------------------
5603 Space reservation and file-system-full conditions.
5605 Space is needed for everything we write.
5606 Some things we can reject if the fs is too full
5607 Some things we can delay when space is tight
5608 Some things we need to write in order to free up space.
5609 Others absolutely must be written so we need to always have
5612 The things that must be written are
5613 - cluster header - which we never allocate
5614 - some seg-usage and youth blocks - and quota blocks
5615 These continually have credit attached - it is a bug if there
5616 are not enough. (We hit this bug)
5618 Things that we need to write to free up space are
5619 any block - data or index - that the cleaner finds.
5621 Things that we can delay, but not fail, are any change to a block that
5622 has already been written or allocated.
5624 When space is needed it can come from one of three places.
5625 - the remainder of the current main segment
5626 - the remainder of the current cleaner segment
5629 Only Realloc blocks can go to the cleaner segment, so the
5630 'must write' blocks cannot go there, so unused + main must have enough
5631 space for all those.
5632 Realloc blocks can go anywhere - we don't need a cleaner segment if things
5635 When we run out of space there are several things we can do to get more:
5636 - incorporate index blocks. This tends to free up uninc-credits which
5637 are normally over-allocated for safety.
5638 - cluster_allocate/cluster_flush so more blocks get allocated and so
5639 more can be incorporated. See above. This is probably most helpful
5641 - clean several segments into whole cleaner segments or into the main segment.
5642 Much of this happens by triggering a snapshot, however we should only do that
5643 when we have full cleaner-segments (or zero cleaner segments).
5645 When cleaning we don't want to over-clean. i.e. we don't want to commit
5646 any blocks from a second segment if that will stop us from committing blocks
5647 from the first segment. Otherwise we might use one cleaning segment up by
5648 making 4 half-clean. This doesn't help.
5651 So: we reserve multiple segments for the cleaner, possibly zero.
5653 We clean up to that many segments at a time, though if that many is zero,
5654 we clean one segment at a time.
5655 lafs_cluster_allocate only succeeds if there was room in an allocated segment.
5656 If allocating a new segment fails, the cluster_allocate must fail. This
5657 will push extra cleaning into the main segment where allocations must not
5660 The last 3(?) [adjusted for number of snapshots] segments can only be allocated
5661 to the main segment, and this space can only be used for cleaning.
5662 Once the "free_space - allocated_space" drops below one segment, we
5663 force a checkpoint. This should free up at least one segment.
5665 We need some point at which we stop cleaning because the chance of finding
5666 something to clean is too low. At that point all 'new' requests definitely
5667 become failures. They might do earlier too.
5668 Possibly at some point we start discounting youth from new usage scores so
5669 that the list becomes sorted by usage.
5673 cut-off point for free_seg where we don't allow cleaner to use segments
5676 event when we start using fixed '0x8000' youth for new segment scores.
5677 Maybe when we clean a segment with usage gap below 16 or 1/128
5678 event when we stop doing that.
5679 Maybe when free_segs cross some number - 8?
5681 point when alloc failure for NewSpace becomes ENOSPC
5684 point when we don't bother cleaning
5685 no cleaner segments can be allocated, and checkpoint did not increase
5686 number of clean segments (used as many as freed).
5687 Clear this state when something is deleted.
5690 Allocations come out of free_blocks which does not include those
5691 segments that have been promised to the cleaner.
5692 CleanSpace and AccountSpace cannot fail.
5693 We *know* not to ask for too many - cleaner knows when to stop.
5694 ReleaseSpace fails (to be retried) if available is below a threshold,
5695 providing the cleaner hasn't been stopped.
5696 NewSpace fails if below a somewhat higher threshold. If we have entered
5697 emergency cleaning mode, these requests fail -ENOSPC, else -EAGAIN.
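The failure rules for the four kinds of space request can be sketched as a small user-space model. This is not the real allocator: the struct, its field names and both thresholds are hypothetical, and it assumes that emergency cleaning mode is what turns a NewSpace failure into -ENOSPC (as the to-do item about emergency cleaning mode says), with -EAGAIN otherwise.

```c
#include <errno.h>
#include <stdbool.h>

enum space_why { CleanSpace, AccountSpace, ReleaseSpace, NewSpace };

/* Hypothetical snapshot of allocator state; not the real 'struct fs'. */
struct space_state {
	long available;         /* free blocks not promised to the cleaner */
	long release_threshold; /* ReleaseSpace is refused below this */
	long new_threshold;     /* NewSpace is refused below this (higher) */
	bool cleaner_stopped;   /* cleaner has given up until a delete */
	bool emergency_clean;   /* emergency cleaning mode entered */
};

/* Returns 0 if the request may proceed, else a negative errno. */
static int space_request(const struct space_state *s, enum space_why why)
{
	switch (why) {
	case CleanSpace:
	case AccountSpace:
		/* These cannot fail: callers know not to ask for too much. */
		return 0;
	case ReleaseSpace:
		/* Retry later, unless the cleaner has already been stopped. */
		if (s->available < s->release_threshold && !s->cleaner_stopped)
			return -EAGAIN;
		return 0;
	case NewSpace:
		if (s->available < s->new_threshold)
			return s->emergency_clean ? -ENOSPC : -EAGAIN;
		return 0;
	}
	return 0;
}
```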
5700 Possibly limit some 'cleaner' segments to data only??
5704 - change CleanSpace to never fail, but cluster_allocate new_segment
5705 can for cleaner segment. This is propagated through lafs_cluster_alloc
5706 - cleaner pre-allocates cleaner segments (for new_segment to use)
5707 and only cleans that many segments at a time.
5708 - introduce emergency cleaning mode which causes ENOSPC to be returned
5709 and ignores 'youth' on score.
5710 - pause cleaner when we are so short of space that there is no point
5711 trying until something is deleted.
5714 notes on current issue with checkpoint misbehaving and running out of
5717 1/ don't want to cluster-flush too early. Ideally wait until segment is
5718 full, but we currently hold writeback on everything so we cannot delay
5720 2/ row goes negative!! let's see...
5722 seg_remainder doesn't change the set, but just returns
5723 the remaining rows times the width
5725 seg_step moves nxt_* to *, stepping to the next ... row?
5726 save current as 'st_*'
5728 seg_setsize - allocate space in the segment for 'size' blocks plus
5729 a bit to round off to a whole number of table/rows
5732 seg_setpos initialises the seg to a location and makes it empty,
5733 st_ and nxt_ are the same
5735 seg_next reports address of next block, and moves forward.
5737 seg_addr simply reports address of next block
5739 So the sequence should be:
5741 seg_setpos to initialise
5742 seg_remainder as much as you want
5743 seg_setsize when we start a cluster
5744 seg_next up to seg_remainder times
5745 seg_step to go to next cluster (when not seg_setpos).
5746 or maybe just before seg_setpos
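To keep the intended call sequence straight, here is a toy single-device model of the cursor. It is a sketch under heavy assumptions: only rows and a width (no tables or multiple devices), made-up field names apart from the st_/nxt_ idea, and block addresses simply counted from the start of the segment.

```c
/* Toy segment cursor; layout and field names are hypothetical. */
struct segpos {
	int width;    /* blocks per row */
	int rows;     /* rows in the segment */
	int st_row;   /* first row of the current cluster */
	int row, col; /* next block to hand out */
	int nxt_row;  /* first row after the current cluster */
};

/* Initialise to the start of an empty segment: st_ and nxt_ coincide. */
static void seg_setpos(struct segpos *s, int width, int rows)
{
	s->width = width;
	s->rows = rows;
	s->st_row = s->row = s->nxt_row = 0;
	s->col = 0;
}

/* Remaining rows times the width; does not change the cursor. */
static int seg_remainder(const struct segpos *s)
{
	return (s->rows - s->nxt_row) * s->width;
}

/* Reserve room for 'size' blocks, rounded up to whole rows. */
static void seg_setsize(struct segpos *s, int size)
{
	s->nxt_row = s->st_row + (size + s->width - 1) / s->width;
}

/* Address of the next block, without moving. */
static int seg_addr(const struct segpos *s)
{
	return s->row * s->width + s->col;
}

/* Address of the next block, then move forward. */
static int seg_next(struct segpos *s)
{
	int addr = seg_addr(s);

	if (++s->col == s->width) {
		s->col = 0;
		s->row++;
	}
	return addr;
}

/* Move nxt_ to the current position and remember it as st_. */
static void seg_step(struct segpos *s)
{
	s->row = s->nxt_row;
	s->col = 0;
	s->st_row = s->row;
}
```

In this model a 5-block cluster in a 4-wide segment consumes two whole rows, so the next cluster starts at address 8 after seg_step.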
5748 Need cluster_reset to be called after new_segment, or after we
5749 flush a cluster but don't need a new_segment.
5751 I think I'm cleaning too early ... I am even cleaning
5752 the current main segment!!!!
5754 OK, I got rid of the worst bugs. Now it just keeps cleaning
5755 the same blocks in the current segment over and over.
5757 1/ it cleans a segment that it should not touch
5758 We need to avoid cleaner segment increasing the
5759 checkpoint youth number.
5760 2/ it has 6 free segments and doesn't use them
5762 clean_reserved is 3 segments, < 4, so free_block <= allocated+ watermark
5763 watermark is 4 segs, so free < 4. So we have 3 allocated to cleaner,
5764 3 in reserve and so nothing much to clean!
5766 The heuristic for returning ENOSPC is not working. Need something more
5767 directly related to what is happening.
5768 Maybe if cleaning doesn't actually increase free space.
5770 !Need to leave segments in the table until we have finished
5771 writing to them, so they cannot be cleanable. - DONE
5773 WAIT - problem. If cleaner segment is part-used, the alloc_cleaner_segs
5774 doesn't count that. Bad?
5776 When nearly full we keep checkpointing even though it cannot help.
5777 Need clearer rules on when there is any point pushing forward.
5778 Need to know when to fail requests.
5782 I am wasting lots of space creating snapshots that don't serve any
5784 The reasons for creating a snapshot are:
5785 - turn clean segments into free segments
5786 - reduce size of required roll-forward
5787 - possibly flush all inode updates for 'sync'.
5789 We currently force one when
5790 newblocks > max_newblocks
5791 max is 1000 , newblocks is never reset!
5792 probably make that a number of segments.
5793 lafs_checkpoint_start is called
5794 when cleaner blocks, and space is available
5796 on write_super if s_dirt
5797 __fsync_super before ->sync_fs
5802 generic_shutdown_super before put_super if s_dirt
5803 sync_supers if s_dirt
5805 file_sync !!! if s_dirt
5807 I think I should move checkpoint_start to
5812 - blocks remaining after truncate - one index and 1-4 data
5813 - truncate finds blocks being cleaned
5814 FIXED - move setting of I_Trunc
5815 - orphans aren't being cleaned up sometimes.
5816 Hacked by forcing the thread to run.
5817 - parent of index block has depth==1
5818 Don't reduce depth while dirty children.
5819 Probably don't want uninc either?
5821 - some sort of deadlock? lafs_cluster_update_commit_both
5822 has got the wc lock and wants to flush
5823 writepage also is flushed.
5824 Not sure what the blockage is.
5825 I think the writepage is the one in cluster_flush, and it
5828 - Async is keeping 16/0 pinned during shutdown
5831 Testing overnight with 250 runs produced:
5832 - blocked for more than 120 seconds
5833 Cleaner tries to get an inode that is being deleted
5834 and blocks, so inode_map_free is blocked waiting for
5835 checkpoint to finish - deadlock.
5836 Need to create a ->drop_inode which provides interlock with
5839 But this is hard to get right.
5840 generic_forget_inode needs to write_inode_now and flush all changes
5841 out and then truncate the pages off so the inode will be
5842 empty and can be freed. But flushing needs the cleaner thread
5843 which can block on the inode lookup.
5844 Ahh.... I can abuse iget5_locked.
5845 If test sees I_WILL_FREE or similar, it fails and sets a flag.
5846 if the flag was set, then 'set' fails
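The iget5_locked trick can be modelled in user space to check the logic. This is only a sketch: the toy_* types and the TOY_WILL_FREE flag are stand-ins, not kernel definitions; the point is just the handshake between the 'test' and 'set' callbacks through the shared lookup data.

```c
#include <errno.h>

#define TOY_WILL_FREE 0x1 /* stand-in for I_WILL_FREE or similar */

/* Minimal stand-in for the parts of an inode the callbacks look at. */
struct toy_inode {
	unsigned long ino;
	unsigned int state;
};

/* The 'data' argument shared by both callbacks (hypothetical). */
struct lookup_key {
	unsigned long ino;
	int saw_dying; /* set by test, checked by set */
};

/* 'test': never match an inode that is on its way out, but note it. */
static int toy_test(struct toy_inode *inode, void *data)
{
	struct lookup_key *k = data;

	if (inode->ino != k->ino)
		return 0;
	if (inode->state & TOY_WILL_FREE) {
		k->saw_dying = 1;
		return 0;
	}
	return 1;
}

/* 'set': refuse to create a second copy while the old one is dying. */
static int toy_set(struct toy_inode *inode, void *data)
{
	struct lookup_key *k = data;

	if (k->saw_dying)
		return -EBUSY;
	inode->ino = k->ino;
	inode->state = 0;
	return 0;
}
```

When 'set' fails the lookup yields no inode, so the caller (the cleaner here) gets a failure to handle instead of blocking on an inode that is being deleted.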
5849 - block.c:504 DONE (I think).
5850 unlink/delete_commit dirties a block without credits
5851 It could have been just cleaned..
5852 It looks like it was in Writeback for the cleaner when
5853 unlink pinned and allocated it....
5854 or maybe it was on a cluster (due to writepage) when
5855 it was pinned. Then cluster_flush cleared dirty ... but
5856 it should still have a Credit.
5857 Maybe I should iolock the block ??
5859 On reflection it wasn't cleaning, just tiny clusters
5860 of recent changes which were originally written as tiny
5861 checkpoints. Maybe lots of directory updates triggered the clusters.
5862 I guess writepage is being called to sync the directory???
5863 Or maybe the checkpoint was pushed by s_dirt being set.
5865 So use PinPending and iolock to protect dir blocks from writepage.
5868 dir handle orphan find a block (74/0) which is not
5870 This can happen if orphan_release failed to reserve a block.
5871 We need to retry the release.
5873 index block and some data blocks still accounted to deleted file.
5875 No theory on this yet. Always one index block and a small number
5876 of data blocks. Maybe the index block looked dirty, but was then
5877 incorporated with something that was missed from the children list...
5878 Or maybe I_Trunc is cleared a bit early...
5879 Or trunc_next advanced too far?? or too soon
5882 - segments.c:640 DONE
5883 prealloc in the cleaner finds all 2315 free blocks allocated.
5885 Need to be able to fail CleanSpace requests when cleaner_reserve
5888 or just slow down the cleaner to one segment per checkpoint when
5889 we are tight.. Hope that works.
5891 async flag on 16/0 keeping block pinned
5892 Maybe clear Async flag during checkpoint. Cleaner won't need it
5893 No, just ensure to clear Async on all successful async calls.
5895 orphan file 8/0 has orphan reference keeping parent pinned
5896 [cfb64c90]8/0(1782)r1E:Valid,SegRef,PhysValid orphan(1)
5897 Orphan handling is failing to get a reservation to write out the
5898 orphan file block? Not convincing as there should be lots of space
5899 at unmount, and 'orphan sleeping' has become empty.
5902 orphan inode blocked by leaf index stuck in writeback:
5903 [cfb68460]331/0(NoPhysAddr)r2F:Index(1),Pinned,Phase1,Valid,Dirty,SegRef,CI,CN,CNI,UninCredit,EmptyIndex{0,0}[0] primary(1) leaf(1) Leaf1(5)
5904 [cfb28d20]331/336(NoPhysAddr)r2F:Index(1),Pinned,Phase1,Valid,Dirty,Writeback,Async,UninCredit,PrimaryRef{0,0}[0] async(1) cluster(1) wc[0][0]
5906 This is in the write-cluster waiting to be flushed
5911 If a thread wants async something, it
5913 - checks if it can have what it wants.
5915 + if so, clear B_Async and succeed
5917 If a thread releases something that might be requested Async,
5918 it doesn't clear Async, but wakes up *the*thread*.
5921 IOLock - iolock_block
5922 Writeback - writeback_done, iolock_written
5923 Valid - erase_dblock, wait_block
5924 inode I_* - iget / drop_inode
5926 orphan handler, cleaner, segscan - all in the cleaner thread.
5929 2 hit 'Show State' with a blocked orphan inode.
5930 Two children, one EmptyIndex, one PrimaryRef, Async,Writeback
5933 Several runs blocked in cluster_flush or waiting for writeback.
5935 - first case: looks like cluster flush should run but doesn't.
5937 checkpoint, cleaner, cluster_allocate when full, update,
5938 writepage, sync_page
5939 So we have no timeout or other flush.
5940 I guess if we are waiting for writeback, we need to trigger a
5943 - other case - cluster_flush was called but is waiting for pending count
5945 Looks like cluster_reset shouldn't be changing pending_next
5947 New hang. Orphans not being processed:
5948 inode, because InoIdx is on leaf and checkpoint isn't pushing
5950 dir block 0 is Dirty leaf
5952 Maybe we failed to get a mutex, and mutex_unlock doesn't wake us.
5955 Over night it looks *very* good.
5956 Have one infinite loop with 31770 repeats of
5957 ORPH: [cfbe0000]0/328(2326)r4E:Valid,Dirty,Async,UninCredit,Claimed,PhysValid,
5958 Orphan(0) async(1) delete_inode(1) orphan_list(1) drop(1)
5960 So either stuck in truncate_inode_pages, lafs_add_orphan, or inode_map_free
5961 lafs_add_orphan too short.
5962 tracing shows after truncate_inode_pages.
5963 must be blocked in inode_map_free - maybe use AccountSpace??
5964 But why isn't the truncate progressing?
5965 Probably same reason: No ReleaseSpace available.
5966 Maybe we aren't cleaning because there is a free segment, and
5967 we aren't checkpointing because there aren't enough yet...
5969 Probably the cleaner has halted while CleanerBlocks - fix that.
5971 - 0/74 is a stuck orphan because 74/0 is a dirty leaf going nowhere..
5972 Need a checkpoint to release the orphan?
5973 ditto for 0/331 - 331/0
5976 VFS: Busy inodes after unmount of hdb. Self-destruct in 5 seconds. Have a nice
5978 This was pinned: [ce5914f0]16/0(2)r8F:Pinned,Phase0,PinPending,Valid,C,CI,CN,CNI
5979 ,UninCredit,PhysValid leaf(1) intable(6) release(1)
5980 [ce5914f0]16/0(2)r8F:Pinned,Phase0,PinPending,Valid,C,CI,CN,CNI,UninCredit,Phys
5981 Valid leaf(1) intable(6) release(1) Leaf0(0)
5982 ------------[ cut here ]------------
5983 kernel BUG at /home/neilb/work/nfsbrick/fs/module/super.c:698!
5986 724 != 7 (st->free.cnt after segdelete, close_segment, close_all)
5987 ------------[ cut here ]------------
5988 WARNING: at /home/neilb/work/nfsbrick/fs/module/segments.c:844 lafs_check_seg_cn
5990 we called segdelete on something that was on the freelist.
5991 This happens when the final cluster starts a new segment.
5992 Need to improve the fix though.
5995 lafs_inode_handle_orphan can make progress without leaving
5996 anything async. Maybe we need a return status:
5997 -EAGAIN - try after async
5998 -ENOMEM - try some time soon - hope memory will be better
5999 0 we called orphan_release
6000 anything else loops.
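The proposed return convention for lafs_inode_handle_orphan could be consumed by the caller roughly like this. A sketch only: the enum and function names are invented for illustration and do not come from the real code.

```c
#include <errno.h>

/* What the orphan loop should do next (hypothetical names). */
enum orphan_next {
	ORPHAN_WAIT_ASYNC, /* -EAGAIN: try again after the async event */
	ORPHAN_RETRY_SOON, /* -ENOMEM: try again soon, memory may improve */
	ORPHAN_DONE,       /* 0: orphan_release was called */
	ORPHAN_LOOP        /* anything else: progress made, go round again */
};

/* Map a handler status onto the next step for the orphan thread. */
static enum orphan_next orphan_next_step(int status)
{
	switch (status) {
	case -EAGAIN:
		return ORPHAN_WAIT_ASYNC;
	case -ENOMEM:
		return ORPHAN_RETRY_SOON;
	case 0:
		return ORPHAN_DONE;
	default:
		return ORPHAN_LOOP;
	}
}
```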
6003 - we allocate a segment in last checkpoint we don't
6004 take references properly.
6006 - orphan handle spinning on:
6008 ORPH: [ce545f08]0/290(1663)r4E:Valid,Dirty,Async,UninCredit,Claimed,PhysValid,Orphan(0) async(1) delete_inode(1) orphan_list(1) drop(1)
6010 stuck in delete_inode?? ?
6013 never-ending cleaning? Maybe just computer slow ??
6015 11July2010 - on plane to Prague.
6016 How can we safely access ->iblock?
6017 normally iolock, but how do we get iolock?
6018 - flush data to inode
6019 - cluster flush takes private_lock
6020 - private_lock is used to set to null.
6021 I guess we use private_lock to get a reference
6022 then iolock and revalidate
6023 but I can probably test for NULL at any time? though that can change under private_lock
6024 If we own a reference to a child with a parent, then we can use
6025 rcu_dereference to get a ref which might change
6029 ->write_inode is called by write_inode() called by __sync_single_inode
6030 to handle I_DIRTY_SYNC|I_DIRTY_DATASYNC after do_writepages
6033 change to address we already handle with checkpoints
6034 change due to setattr we can handle directly if we want
6035 that just cleans mtime/ctime and atime.
6036 mtime/ctime calls ->dirty_inode
6040 getattr changes set I_Dirty so that when cluster_allocate
6041 happens all the changes get saved.
6043 when dirty_inode is called, we set I_Dirty but don't dirty
6045 If anything happened to justify an inode write, it will
6046 be dirty anyway. If it isn't, this is just atime
6048 So on dirty_inode we check if atime has changed and if so
6049 we schedule change to atime file
6051 sync_inode should write an update for the inode if I_Dirty
6052 but sync_filesystems should not
6054 Simple. fsync calls ->fsync. We get that to write an
6055 inode update, but nothing else does.
6057 Possibly all directory updates could be chained onto a
6058 directory and only written when fsync is requested before
6060 both sides of a rename ??
6061 leave that for later.
6063 WritePhase - what is that all about?
6064 We must not change a block while it is being written to previous
6065 phase, else we corrupt causality.
6066 But we probably don't want to change it any way as that would
6067 mess up any checksum or duplication.
6069 So we want to ignore WritePhase - scrap it.
6070 Before changing a block, we must iolock_written
6072 - inode update in fsync
6077 But what about regular data. If prepare_write finds a block in
6078 writeback, do I need to wait, or can I just mark it dirty in
6079 commit_write? If no checksum and no duplication applies, this should
6083 BUT e.g. dir operations are in particular phases. If the dirblock
6084 is pinned to the old phase, we need to flush it, then wait for io
6085 to complete. So we need lafs_phase_wait as well as iolock_written.
6086 This is already done by pin_dblock.
6087 I wonder if we need a way to accelerate pinned blocks that are being
6088 waited for - probably not, they should be done early.
6090 So we probably want to iolock after phase_wait in pin_dblock.
6091 Though dir.c pins early.
6092 I need to review all of this and get it right.
6095 - we aren't allowed to block much holding checkpoint_lock as
6096 checkpoint_start waits for that. However phase_wait will only
6097 block if a new checkpoint has started already, so there is no
6098 chance of phase_wait ever blocking checkpoint_start.
6099 So it is safe to call phase_wait in checkpoint_lock.
6100 phase_wait will wait until block is written, added back to
6101 the lru clean, then found and flipped... I wonder if that is
6102 good - it keeps parent from being a leaf, and so written, until
6103 child write has completed.
6104 We want to phase-flip a block as soon as it is allocated by cluster_flush.
6106 With directory blocks, i_mutex stops other changes, so an early iolock_written
6107 will leave the block clean and phase won't be an issue.
6109 With inode-map blocks.. we:
6110 set B_Pinned to ensure no-one writes except for phase change
6111 do that after lock_written so it starts safe.
6112 once we have checkpointlock, wait for phase if needed.
6113 then lock_written again which should be instant but ensures
6114 that block is locked while we change it...
6117 - refile to call phase flip if index is not dirty and is in wrong phase
6118 and has no pinned children in that phase.
6119 - Only clear PinPending if we have i_mutex or refcnt == 0
6120 - before transaction:
6121 lock_written / set PinPending / unlock
6122 then inside cluster_lock
6123 lock_written pin / change / dirty / unlock
6124 it will only wait for writeout if phase changed.
6125 so don't need phase_wait
6126 but want pre-pin then pindblock
6128 dir create/delete/update - DONE
6129 inode allocate/deallocate - on inode map DONE
6131 orphan set/change/discard
6133 Orphans are a little different as when we compact the
6134 file, the orphan file block 'owned' by the orphan block
6135 can change. As long as we keep them all PinPending it
6136 should be fine though.
6137 I think that every block in the orphan file will always be
6140 OK - done most of that.
6141 Early phase_flip is awkward. We need an iolock to phase_flip,
6142 and we don't have one. The phase_flip could cause incorporation
6143 which cannot happen until the write completes. So I guess
6144 we leave it as it is.
6147 FIXME what about inode data block - cluster_allocate is removing
6148 PinPending after making them dirty from the index block..
6150 If all free inode numbers are B_Claimed, don't think we allocate
6151 a new block... yes we do, as 'restarted' is local to caller.
6154 each device has a number of flags
6155 - new metadata can go here
6156 - new data can go here
6157 - clean data can go here
6158 - clean metadata can go here
6159 - non-logged segments allowed
6160 - priority clean - any segment can be cleaned
6161 - dev is shared and read-only - no state-block updates
6163 state block needs a uuid for an ro-filesystem that this is
6166 Is metadata an issue?
6167 We might want it on a faster device, but ditto for directories
6168 and for some data. So probably skip that.
6170 Have separate segment tables for:
6172 - can have clean data but not new. (this often empty)
6174 Clean data can go to new-not-clean if nothing else
6175 new data can go to clean-not-new ?? if not sync??
6176 Maybe call them 'prefer clean' and 'prefer new'
6179 'no sync new' - don't write new data, unless it is in big chunks and
6180 can wait for checkpoint to be 'synced'
6181 'no write' - never write anything - this is readonly.
6182 used for removing a device from the fs.
6184 A 'no sync new' device can have single-block segments.
6185 This doesn't allow compression, but avoids any need to clean
6186 In this case we don't store youth and the segusage is 32 bits per segment.
6187 That means - for 1K block size - 0.5% of devices used for segusage. That
6188 feels high. For 4K, 1/1024 so a gigabyte per terabyte.
6189 Then limited to 29 snapshots plus base fs, and 2 bits to record bad blocks.
6191 Other segusage for 29 snaps is 1/million of space used.
6192 So we 'waste' 0.1% of device for no secondary cleaning.
6193 Can still do defrag though.
6195 clearing a snapshot on a 1TB device writes 1GB of data!! potentially.
6196 as does creating a snapshot.
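The overhead figures for single-block segments can be checked with a couple of helpers. This is a sketch: the function name and the bit assignments are hypothetical, chosen only so that 1 base bit + 29 snapshot bits + 2 bad-block bits fill the 32-bit entry described above.

```c
#include <stdint.h>

/*
 * Size of the segusage table when every segment is one block and each
 * segment needs 'entry_bytes' of usage information.
 */
static uint64_t segusage_bytes(uint64_t device_bytes, uint32_t block_size,
			       uint32_t entry_bytes)
{
	return device_bytes / block_size * entry_bytes;
}

/* Hypothetical layout of one 32-bit entry. */
#define SEGUSE_BASE      (1u << 0)         /* block in use by the base fs */
#define SEGUSE_SNAP(n)   (1u << (1 + (n))) /* snapshot n, n in 0..28 */
#define SEGUSE_BAD_SHIFT 30                /* top 2 bits: bad-block state */
```

For a 1TB device with 4K blocks this gives 1GB of segusage, matching the "gigabyte per terabyte" figure; with 1K blocks it is 4GB, a little under 0.4% of the device.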
6199 If lafs were cluster enabled we would want multiple checkpoint clusters,
6200 one for each node. When a node crashes some node would need to find and
6201 roll-forward. For single node failure, it is enough to broadcast cluster
6202 address to all others. For whole-cluster failure, need to either list all
6203 in superblock or link from main write cluster.
6205 When writing to multiple devices we may want multiple write clusters
6206 active for new data. These all need to be findable from checkpoint cluster
6207 so linking sounds good.
6208 Having a single 'fork' link in cluster head might work but doesn't scale to a large
6209 cluster. It doesn't need to be committed to other nor does checkpoint end, so
6211 Could have a special group_head to list other clusters for roll forward.
6212 If we put fsnum first, a large value - 0xffffffff - could easily mean
6215 Or every cluster head could point to an alternate stream, and if we want many
6216 quickly, each simply points to another, so we create a chain across all writers.
6220 When we 'sync' we don't wait for blocks until after the checkpoint is started,
6221 and we know that will be driven through to CheckpointEnd which will commit and
6223 However 'fsync' doesn't have the same guarantee. The sync_page call will ensure
6224 the data has been written, but we don't know it is safe until the next
6225 header is written. So we need to push out the next cluster promptly.
6227 So if sync_page is called on a page in writeback, then we mark the cluster as
6228 synchronous. When a sync cluster completes, the next (or even next+1) clusters
6229 are flushed out promptly. Hopefully they won't be empty on a reasonably busy system,
6230 but it is OK if they are.
6232 If a block is writeback for the cleaner.. then as the cluster is VerifyNone, as soon
6233 as the write completes the block will be released.
6235 So: to clarify sync_page:
6236 This can be called when page is in writeback or locked.
6237 If locked there is nothing we can do except maybe unplug the read queue.
6238 If page is in writeback and block is dirty, then it is probably in
6239 a cluster queue and we should flush the cluster and the next.
6240 If page is in writeback and block is not dirty, but is writeback,
6241 just flush one cluster.
6242 But we don't want these cluster flushes to start while the previous is
6243 still outstanding else we stop new requests from being added.
6244 So as soon as the cluster can be flushed we flush, but no sooner.
6245 I guess we use FlushNeeded and make that be less hasty.
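The sync_page cases above reduce to a small decision function. A sketch with invented names: it only answers how many clusters want flushing, and leaves out the "no sooner than the previous flush completes" timing that FlushNeeded would have to handle.

```c
#include <stdbool.h>

/*
 * How many write clusters sync_page should ask to have flushed.
 * 0 means there is nothing to flush (e.g. the page is merely locked,
 * where at most the read queue could be unplugged).
 */
static int sync_page_flush_count(bool page_writeback, bool block_dirty,
				 bool block_writeback)
{
	if (!page_writeback)
		return 0;
	if (block_dirty)
		return 2; /* still queued: flush this cluster and the next */
	if (block_writeback)
		return 1; /* already written: the next header makes it safe */
	return 0;
}
```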
6250 We currently have a superblock for each device.
6251 I cannot see a good reason for that.
6252 We can just bdev_claim for 'this' filesystem.
6253 Rather we should have a number of anon superblocks,
6254 one for each fileset, then one for each snapshot.
6255 Do we use different fs types? probably yes
6256 lafs - main filesystem made from devices
6257 lafs_subset - subordinate fileset, given a path to fileset object
6258 can have 'create' option when given an empty directory.
6259 lafs_snap - snapshot - given a path to filesys and textname.
6261 Cannot create a snap of a subset, only of the whole filesystem
6262 Is it OK to mount either snap of subset or subset of snap?
6263 It probably is, so need to use the same filesystem type for both.
6264 Maybe lafs_sub or sublafs. Needs path to directory.
6265 can be given 'snap=foo'.
6266 No: a given filesystem may not exist in a snapshot. You need to
6267 mount the snapshot first, then the subset of the snapshot.
6268 So we have three types as above. All subsets as 'lafs_subset',
6269 whether they are subset of main or of snapshot.
6271 Should we be able to create a snapshot or subset without mounting it?
6272 It doesn't really seem necessary but might be elegant..
6274 remount doesn't seem the right way to edit a filesystem as it forces
6275 some cache flushing.
6276 What do we want to edit?
6277 - add device, remove device
6278 - add/remove snapshot by name
6279 - add/remove subset? Not needed, just mkdir/rmdir and mount to convert
6280 empty dir to subset.
6281 - change cleaner settings??
6282 Could have remount as an option. If problem find other option.
6284 While cleaning (which is always) we potentially need all superblocks
6285 available as we might need to load blocks in those filesystems to
6287 Unfortunately each super needs to be in a global list so there is a cost
6288 in having them appear and disappear. I guess that is not a big deal. They
6289 are refcounted and will disappear cleanly when the count hits zero.
6292 DONE - change all prime_sb->blocksize refs to fs->blocksize
6293 DONE - create an anon sb for the main filesystem
6294 DONE - discard the device sbs, just bd_claim the devices and add to list
6295 - use lafs_subset for creating/mounting subsets.
6297 Changed s_fs_info to point to the TypeInodeFile for the super, but
6298 for root/snapshot that doesn't exist early enough to differentiate the
6300 So we make an inode before the super exists and attach it after.
6301 Need to do all that get_new_inode does.
6302 inode_stat.nr_inodes++ - just don't generic_forget the inode
6303 add to inode_in_use - seems pointless - just set i_list to something
6304 add to sb->s_inodes - if we don't it won't flush - maybe that is good?
6305 add to hash - don't want
6306 i_state == lock|new - only really needed if hashed.
6307 but there is lots of initialisation in alloc_inode that we cannot access!!
6309 Problem is that we need s_fs_info to uniquely identify the fs with something
6310 that can be set in the spinlock, so allocating an inode is out.
6311 And also to get to the filesystem metadata which is in the inode.
6312 I guess we allocate a little something that stores identifier and later inode.
6313 for lafs we use uuid
6314 for subset we use just the inode
6315 for snapshot we use fs and number
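One possible shape for that "little something" is a tagged key plus a slot for the inode that arrives later. This is a user-space sketch with hypothetical names and opaque pointers; the real thing would hang off s_fs_info and be compared by the 'test' function passed to sget.

```c
#include <stdint.h>
#include <string.h>

enum sb_kind { SB_LAFS, SB_SUBSET, SB_SNAPSHOT };

/* Identifier that can be compared under a spinlock, plus later inode. */
struct sb_key {
	enum sb_kind kind;
	union {
		uint8_t uuid[16];         /* lafs: filesystem uuid */
		const void *subset_inode; /* subset: just the inode */
		struct {
			const void *fs;   /* snapshot: owning fs ... */
			int num;          /* ... and snapshot number */
		} snap;
	} u;
	void *root_inode; /* attached once the super exists */
};

/* Non-zero if both keys name the same superblock. */
static int sb_key_match(const struct sb_key *a, const struct sb_key *b)
{
	if (a->kind != b->kind)
		return 0;
	switch (a->kind) {
	case SB_LAFS:
		return memcmp(a->u.uuid, b->u.uuid, sizeof(a->u.uuid)) == 0;
	case SB_SUBSET:
		return a->u.subset_inode == b->u.subset_inode;
	case SB_SNAPSHOT:
		return a->u.snap.fs == b->u.snap.fs &&
		       a->u.snap.num == b->u.snap.num;
	}
	return 0;
}
```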
6320 - sget gives us an active super_block. We need to attach to a vfsmnt
6321 using simple_set_mnt, or call deactivate_locked_super.
6322 - sget's set should call set_anon_super
6323 - kill_sb (called by deactivate_super) should then call kill_anon_super
6325 If we have a vfsmnt, we have an active reference, so we can atomic_inc
6326 s_active safely. So use this to allow snapshots and subsets to hold a
6327 ref on the prime_sb and thence on the 'fs'.
6330 - DONE need to set MS_ACTIVE somewhere!!
6331 - FIXME if an inode is being dropped when iget comes in, it gets confused
6332 and the inode appears to be deleted.
6334 We cannot really break the dblock <-> inode link until after write_inode_now,
6335 but there is no call-back before generic_detach_inode is complete.
6336 The last is write_inode which is only called if I_DIRTY_something.
6337 Maybe when writeback completes on an inode dblock, we should check if
6338 the inode is I_WILL_FREE and if so, we break the link...
6340 Or maybe when we find my_inode set we can check the block and if it isn't
6341 dirty or being deleted we break the link directly... That makes more sense.
6343 So... what is the deal with freeing inodes???
6344 ->iblock is like a hashtable reference. It is not refcounted
6345 It gets set under private_lock
6346 iblock is freed by memory pressure or lafs_release_index from
6348 when refcount of iblock is non-zero, ->dblock ref is counted,
6350 dblock is set to NULL if I_Destroyed, or when dblock is discarded,
6351 (under lafs_hash_lock)
6352 and set to 'b' in lafs_iget and lafs_inode_dblock
6354 We can drop the dblock link as soon as iblock has no reference
6356 probably get clear_inode to break the link if possible, which it should
6357 be on 'forget_inode'. Then lafs_iget can wait on the bit_waitqueue.
6358 or maybe do clear_inode itself
6360 FIXME when we drop dblock we must clear iblock! as getiref iblock assumes
6364 So: ->dblock and ->my_inode need to be clarified.
6366 Neither is a counted reference - the idea is that either can be freed and
6367 will destroy the pointer at the time so if the pointer is there, the
6368 object must be ... but we need locking for that.
6369 ->dblock is reasonably protected by private_lock, though if ->iblock exists
6370 we hold a ref of ->dblock so we can access it more safely.
6372 Need to check getiref_locked knows ->dblock exists when called on iblock
6373 and lafs_inode_fillblock
6376 But ->my_inode needs locking too so the inode can safely disappear without
6377 having to wait for the data block to go. After all data blocks come in sets,
6378 and one shouldn't keep others with inodes.
6379 So something light-weight like rcu might work.
6380 We use call_rcu to free the inode and rcu_readlock to access ->my_inode
6382 Yes, that will work. Occasionally we will want an igrab too, but not
6384 Should look into rcu for index hash table and ->iblock as well.
6385 Current ->iblock is only cleared when the block is freed .. I guess that is fine...
6389 rcu protection of ->my_inode
6390 A/ orphan inodes - are they protected?
6391 B/ orphan blocks - are the inodes of those protected? Probably...
6393 inodes are 'orphan' for two reasons
6394 1/ a truncate is in progress
6395 2/ there are no remaining links, so inode should be truncated/deleted
6398 The second precludes us from holding a refcount on any orphan inode,
6399 else it would never get deleted.
6400 So we must assert that an inode with I_Deleting or I_Trunc has an implied
6401 reference and so delete must be delayed... not quite.
6402 If we set I_Trunc but not I_Deleting, then we igrab the inode until
6403 I_Trunc is cleared. While we hold the igrab, I_Deleting cannot possibly
6404 be set as that is set when last ref is dropped.
6407 FIXME lafs_pin_dblock in lafs_dir_handle_orphan needed to be ASYNC.
6408 .. and in lafs_orphan_release
6409 Well... only iolock_written can be a problem, and our rules require that
6410 only phase-change writeout can set writeback. So the cleaner can never
6411 wait for writeout here. Maybe it can wait for a lock, and maybe we don't
6412 really need a lock, just 'wait_writeback'.
6414 So cleaner is in run_orphans, dir_handle_orphan pin_dblock iolock_written
6415 It is writeback waiting on 74/BIGNUM from file.c:329. So writepage
6416 tried to write a block in a directory .. but it is PinPending so that
6417 must have been set after writepage got it...
6418 lafs_dir_handle_orphan gets an async lock, then sets PinPending.
6419 If write_page is before that, it will have the lock and dir_handle will try later.
6420 If write_page is after it will block on the lock, or see PinPending and
6422 So someone else must be clearing PinPending!
6423 - checkpoint clears and re-sets under the lock, so that is safe
6424 - dir.c clears under i_mutex
6425 dir_handle_orphans always hold i_mutex ... or does it.
6426 - refile drops when the last non-lru reference goes.
6427 - inode_map_new_abort clears for inode
6428 No, not that - just bad test on result of iolock_written_async ;-(
6430 Now have an interesting deadlock.
6431 rm in lafs_delete_inode in inode_map_free is waiting for the block to
6432 flush which requires the cleaner.
6433 The cleaner thread in inode_handle_orphan is calling erase_dblock
6434 on the same inode which blocks while inode_map_free has it locked....
6435 no, not same block - just waiting for writeout which requires cleaner.
6436 lafs_erase_dblock from inode_map_free must be async!
6437 pin_dblock in lafs_orphan_release must too.... no - only the setting of
6438 PinPending needs to be async or outside of cleaner, which it is.
6440 Ok, got that fixed. All seems happy again, time for a commit.
6444 14b/ What backing-dev to show the filesystem.
6449 throughput measurements
6451 Much of that is for generic code to use. We need to:
6452 - provide an unplug function that unplugs all devices
6453 - provide a congested function which checks all devices,
6454 or for 'write' - at least the device we are writing to.
6456 How do we set the backing device?
6457 The 'struct address_space' points to one, as does struct super_block.
6458 set_anon_super establishes a null bdi, set_bdev_super gets it from the
6461 We need to bdi_init and bdi_register (if no error) our bdi.
6462 bdi_destroy calls unregister and reverses bdi_init
6463 or just bdi_setup_and_register
6464 but bdi_register_dev gives a better name - isn't this sick!!!
6466 Partly done ... but I'm hitting more bugs :-(
6468 -Checkpoint cannot complete because...
6469 Lots of dirty inodes that are orphans are not pinned!! I
6470 guess the InoIdx is ??
6471 Most of them don't have InoIdx(?) Only '8' does.
6472 8/0 is also an orphan and is on wc[0]
6474 It seems that this block keeps getting re-written and stays in
6476 Is that because it is a data block with PinPending.. No, that works
6477 as long as it becomes un-dirty: we drop pinpending, refile, and set again
6479 It is being dirtied again during writeout for the checkpoint
6480 so it doesn't get to changed phase when we lift PinPending.
6481 I guess we mustn't dirty it if it is in the old phase.
6483 -And twice inode 17 is deleted without B_Orphan being set!
6484 That is the only file that exists before we mount.
6485 Problem was orphan_release instead of orphan_forget
6486 I wonder why it only affected 17...
6488 -at shutdown we drop an inode and try to invalidate pages, but
6489 root inode is still dirty - I wonder why.
6490 The dblock is in a different phase to the iblock.
6491 In checkpoint we wait until root iblock changes phase, but
6496 I'm testing subordinate filesystems, which don't work yet.
6497 I need to create the root directory and inode map.
6498 Obviously I cannot record the inode map file in the inode map....
6499 inode_map should ignore everything less than 16? 8? 2?
6500 Need to make sure creating with a given inode number works.
6501 Need to make sure auto-allocate inum is never less than 16.
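
A small sketch of an allocator with that floor, assuming a simple bitmap where a set bit means "in use" and bit i of byte i/8 stands for inode number i. The name choose_free_inum_sketch, the bit order and the value 16 are assumptions for illustration (the notes still leave 16, 8 or 2 open); the real choose_free_inum may look quite different.

```c
#include <stdint.h>

#define FIRST_FREE_INUM 16	/* assumed floor, see the question above */

/* Return the lowest clear bit at or above FIRST_FREE_INUM, or -1 if
 * every number in range is taken.  A set bit means "inum in use". */
long choose_free_inum_sketch(const uint8_t *map, long nbits)
{
	for (long i = FIRST_FREE_INUM; i < nbits; i++)
		if (!(map[i / 8] & (1u << (i % 8))))
			return i;
	return -1;
}

/* Numbers below the floor are never tracked in the map, so creating an
 * inode with a given low number must bypass the map entirely. */
int inum_is_mapped(long inum)
{
	return inum >= FIRST_FREE_INUM;
}
```

With this shape the inode map file itself, and the root directory, can live below the floor without needing an entry in the map.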
6504 How to map from filesys inode to superblock?
6507 choose_free_inum - to get inode-1
6508 ditto in inode_map_free
6509 lafs_put_super has something odd with i_sb
6511 Could do an sget search..
6512 Or could just store it in the inode (but not in i_sb!!)
6513 inode already a bit large though.
6514 Do it for now, but make a note to trim the fs_md part of inode
6515 into a separate allocation.
6517 lafs_new_inode should take an 'sb' not a 'filesys'.
6518 In fact, get rid of filesys. It is
6519 MAP(i->i_sb->s_fs_info)->root.
6521 15f - timestamps for roll-forward.
6522 The writeout can be much later, but logging the mtime is fairly
6523 boring ... we could log mtime in the group head, which might be cheap
6524 enough. How much precision is needed, and against what base?
6525 probably mtime of last checkpoint from superblock. That should
6526 be not more than 2048 seconds ago, so 16 bits gets us about 30msec...
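
The arithmetic can be sketched as follows: 2048 seconds over 65536 values is 1/32 of a second per unit, i.e. 31.25msec. The function names and the choice of milliseconds for the inputs are assumptions for illustration only.

```c
#include <stdint.h>

/* Units of 1/32 second: 65536 units span exactly 2048 seconds. */
#define MTIME_UNITS_PER_SEC 32

/* Encode mtime as a 16-bit offset from the last checkpoint time.
 * Both times are in milliseconds.  Returns 0 on success, or -1 if
 * mtime falls outside the 2048 second window. */
int mtime_encode(uint64_t checkpoint_ms, uint64_t mtime_ms, uint16_t *out)
{
	uint64_t units;

	if (mtime_ms < checkpoint_ms)
		return -1;
	units = (mtime_ms - checkpoint_ms) * MTIME_UNITS_PER_SEC / 1000;
	if (units > UINT16_MAX)
		return -1;
	*out = (uint16_t)units;
	return 0;
}

/* Recover the (rounded down) mtime in milliseconds. */
uint64_t mtime_decode(uint64_t checkpoint_ms, uint16_t enc)
{
	return checkpoint_ms + (uint64_t)enc * 1000 / MTIME_UNITS_PER_SEC;
}
```

An mtime more than 2048 seconds past the checkpoint cannot be represented, so a caller would have to force a checkpoint or clamp in that case.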
6529 15l - decay youth info.
6531 youth_next and checkpoint_youth in 'struct fs'
6532 all blocks in youth files on storage
6533 all scores in seg-tracker.
6534 - not needed, they'll get updated in normal progress
6535 and being wrong for a while is no cost.
6536 ensure correct youth is stored in lafs_free_get
6537 check little-endian conversion of all youth accesses
6539 checkpoint_youth only used by thread, so no locking needed
6540 youth_next protected by fs->lock
6542 15m - share orphans and cleaning list_heads in datablock
6543 It certainly is possible to clean an orphan but it is very unlikely
6544 as it will have changed recently, or be changing soon.
6545 The cleaner could just dirty any B_Orphan it finds.
6546 But if orphan finds a block on the list, it must be careful...
6547 I guess when cleaner drops a cleaning ref, it should check if the block
6548 is an orphan, and re-queue if it is.
6550 15o - async blocks just have an extra refcount.
6552 - keep PinPending set
6553 - keep an index block pinned - will phase-flip
6554 - keep ->parent link
6555 not get in the way of a checkpoint.
6557 Should we clear any that we find though?
6558 Normally async is only used by cleaner, orphan processing, or segscan
6559 So it should all be finished when we do a checkpoint.
6561 So if checkpoint, or release_page, finds an async block, drop it.
6563 15r - further optimisations in cleaner to avoid lookups.
6564 We have fsnum,inum,blocknum and cluster seq number and trunc num.
6566 I want to introduce more async though. Currently it only loads
6567 one inode at a time.
6568 To do more, I need to mark inodes as 'done' when they are and always
6569 restart from the start of the cluster (only do one cluster at a time
6571 So if we get all the way through a cluster with no 'EAGAIN' we finish
6574 15y - when could a directory block become an orphan?
6575 - when deleting that last entry - we don't know if it can be fully
6576 deleted until we look in next block
6577 - when deleting an entry follows a chain back to the first block
6578 - when deleting the last entry in the block.
6580 So it could be an orphan if the entry found:
6581 - is at end of block
6584 or first entry is already deleted.
6587 looking at flushing etc when run out of space.
6588 We often force a checkpoint when it won't do any good as
6589 nothing has been cleaned.
6590 In fact we write lots of dead checkpoints to 0/0 until it is full,
6591 then move on, clean 0/0 and suddenly have space.
6592 We shouldn't do that. sync should be what pushes us forwards.
6593 Maybe that is fixed..
6595 InoIdx blocks still cause confusion. Should they ever have credits?
6596 or do only the data block have those? Certainly they cannot have
6598 And there is confusion in my mind whether data blocks can be pinned
6599 while the InoIdx block is - need to clarify that.
6602 13Sep2010 - now, where was I...
6603 - I've just been dropping the use of SegRef on InoIdx blocks, where it makes no sense.
6604 - test run: block.c:660 - no credits available while dirtying an InoIdx block during
6605 orphan handling. lafs_reserve_block (under checkpoint lock) should have set credit.
6606 Only I just changed reserve_block to do that dblock instead - I wonder why.
6607 OK, I think I cleaned that up...
6610 - make_orphan is hanging in checkpoint_unlock_wait. So orphan_pin returned -EAGAIN
6611 so pin_dblock did too. So reserve_block did too, so prealloc or summary_alloc or seg_ref_block
6613 Problem is that we don't push a checkpoint when cleaner runs out of things to do.
6614 But we don't want to go back to pushing a checkpoint too often.
6615 Maybe the problem is that we only force the checkpoint when we have enough space to do
6616 new allocations, but we need to force it earlier if nothing new can be cleaned.
6618 Once we set EmergencyClean, lafs_reserve_block will stop returning EAGAIN for newspace, so
6619 we need to wake 'checkpoint_wait' then.
6620 But for ReleaseSpace we want to wake on every checkpoint... we probably do anyway.
6621 ...anyway, that is sorted now at commit 95b6b05e460
6625 - These never get SegRef as that is meaningless - done.
6626 - These can have credits. It possibly isn't necessary but it makes things
6627 easier. They are 'written' by transferring the credits to the data block, or discarding them.
6628 - I think dblock and iblock can both be pinned
6629 The problem this caused was that the dblock might get processed as a leaf before iblock.
6630 We now have lafs_is_leaf which causes dblock not to be a leaf even if it is pinned, if the iblock
6631 is pinned to the same phase.
6632 lafs_phase_flip refiles the dblock so that it goes back on the leaf list as does lafs_refile when
6634 So lafs_pin_dblock doesn't need to pin the inode instead.
6635 OK, that is fixed. - commit f1c05293bfd Mon Sep 13 15:07:27 2010 +1000
6637 15u - I don't need to get a segref there, but I need to have one from the original dirty block,
6638 so fix that up - commit Mon Sep 13 15:28:08 2010 +100
6640 15v - What do we have?
6641 lafs_dirty_dblock: set Dirty, clear Credit clear NCredit
6642 set Uninc, clear Icredit clear NICredit
6643 lafs_dirty_iblock: set dirty, clear credit
6644 test uninc, clear ICredit, set Unincredit - not essential
6645 mark_cleaning: test realloc, / alloc / set realloc
6646 test dirty / clear realloc/ set credit
6647 set uninc clear icredit
6648 cleaner_flush: set dirty, clear realloc, clear credit
6649 test dirty, clear realloc set credit
6650 flush_data_to_inode:
6651 lafs_cluster_allocate - there is some odd code there!!
6653 lafs_allocated_block
6655 all rather different really.
6656 Just do some tiny tidyup in lafs_cluster_allocate when dirtying dblock
6658 15w/ Space used by cluster updates??
6659 It is all fine - just some confusion of function names.
6661 15z/ logging symlink creation.
6662 Do I need to log the content? It needs to be safe on a dir sync, and you cannot sync the
6663 symlink itself. So I guess we queue the block for writeout so it will go with the
6665 Yes, that works: Mon Sep 13 17:33:54 2010 +100
6667 15ab/ already did that in commit f90959e6f492b6
6670 15ac/ How can we trigger write-out of dirty index blocks which have no pin-count, thus allowing them to
6671 be freed after the write completes? A checkpoint could do it, but that would write out index blocks
6672 that cannot be freed too. A checkpoint would only be good after lots of data pages had been written.
6673 We could just wait and let other processes kick in..
6675 I don't think we need to do anything. lafs_shrinker doesn't really know how tight memory
6676 is, and periodic checkpoint will free up any memory that we are pinning.
6678 .... but something is needed. We need some trigger to write dirty index blocks
6680 - a timeout on checkpoints - every dirty_expire_interval - but that isn't exported.
6683 Not sure this is a complete solution. I might want to incorp/flush index blocks when they
6684 have no dirty children, but I'm not sure about that.
6687 15ad - lafs_add_block_address call from lafs_phase_flip - do I handle failure correctly?
6688 failure happens when b2 is data block and uninc table is full so we called incorporate on the parent.
6689 This could split the parent which means the block could have been re-parented - it would have been in the
6690 child list and so found and fixed.
6691 lafs_allocated_block, when this happens, checks that the parent is dirty/realloc as appropriate.
6692 In this case, realloc isn't an issue, only dirty. lafs_incorporate must have made it dirty and
6693 it won't get written while it has these in-phase children, so all is happy.
6695 15ae - refile race? Someone might set B_IOLock before removing from lru, so
6696 onlru is 0 and refcnt is elevated so it doesn't seem to be unused.
6697 But then whoever has the ref will refile again when dropping it and
6698 so the right thing will be done.
6699 But more generally, do we really want the lru etc to own a counted reference?
6701 - we would need to refile when removing from any list
6702 - we would need to get a ref when removing from list.
6706 clear PinPending if refcnt is low
6707 unpin if not PinPending, or dirty etc and data or refcnt is low
6708 place on leaf list - if pinned etc - this can be earlier
6709 drop parent link if refcnt is low, and not pinned etc
6710 handle dblock issues
6712 if lru was not refcounted, then the only things we might do when refcnt isn't zero are:
6713 unpin a dblock once it is not dirty
6716 But if we don't count lru, then we can lose the refcount on dblock
6718 Hmmm - we cannot leave things on the leaf list forever as they thus hold a reference and
6721 I think I want things on 'leafs' list to not hold a counted reference.
6722 Things *only* get removed while walking the list.
6723 InoIdx blocks hold a ref on the dblock both when counted and some other time. Possibly
6724 when pinned. This ensures they are held while the InoIdx is a real leaf.
6725 But: When we take that first ref, how do we know the dblock even exists?
6727 What is the lifetime of ->dblock?
6728 removed when page is released
6729 set by lafs_import_inode
6730 set by lafs_inode_dblock
6731 removed by clear_inode
6732 So if I don't hold a ref, I always need to be ready to call lafs_inode_dblock
6733 This is currently callers of getiref_locked
6734 - erase_dblock_locked ?? shouldn't need a lock
6735 - ihash_lookup - never on InoIdx
6736 - lafs_make_iblock - already have dblock
6737 So none of those really need lafs_inode_dblock
6738 What about when we set Pinned
6739 only really from set_phase ... messy.
6740 What about when we set ->parent
6741 grow index tree - not relevant
6742 ditto do_incorporate_*
6744 Can be called on InoIdx from:
6745 lafs_make_iblock only!!
6749 I have tidied lafs_refile up a lot but I need to make locking a lot cleaner.
6750 In particular I want a single lock I can take when the refcnt hits zero which will ensure no ref
6751 is taken until I have finished my cleanup. I suspect the inode private_lock is the one to use.
6752 I also need to clean up getiref_locked and getref_locked - having both is awkward.
6754 So: when are they called?
6757 lafs_get_flushable - hold fs->lock
6758 first_in_seg - holds private_lock, but shouldn't need _locked as hold a ref through child.
6760 pin_all_children - hold private_lock
6761 find_better - private_lock
6763 lafs_invalidate_page - to get a ref on each block to either erase or invalidate it
6764 presumably page is locked
6765 lafs_get_block - holds private_lock - plus once with only page_lock
6766 lafs_release_page - holds private_lock
6767 (getiref_locked on dblock) - no locking
6768 lafs_inode_dblock - private_lock of my_inode...
6769 lafs_delete_inode - private_lock of my_inode
6770 lafs_destroy_inode - ditto
6771 lafs_drop_inode - ditto
6773 erase_dblock_locked - private_lock
6774 lafs_get_flushable - fs->lock
6775 ihash_lookup - lafs_hash_lock
6776 lafs_make_iblock - private_lock
6778 So private_lock looks like a good choice. Issues are:
6779 - what is the story with dblock on my_inode->private_lock
6780 - what is the lock ordering
6781 - what can refile negate that we need to be careful of.
6782 i.e. we want to keep things stable while refile does its tests, but what do we need to keep
6784 + we break the parent link?? and so the siblings link
6785 + move things to freelist
6787 + free dblock if not page_private
6789 Lock_ordering. private_lock, then fs->lock, then lafs_hash_lock
6790 So if we have to hold lafs_hash_lock, we increment refcnt, drop the lock, get/drop private_lock
6792 This is getting messy - I need something nice and clear.
6795 If Pinned, either has references or is on a leaf list - possibly both
6796 If no references and not pinned then not on leaf list, so can be on free list
6798 Pinned can only be set when there are references, and can only be cleared under private_lock
6799 This is violated by phase_flip, which badly reads refcnt
6800 If refcnt is zero and not pinned, then can be moved to free_list
6801 If on freelist and refcnt is zero under hash_lock, can be freed
6803 So if lafs_get_flushable finds a block that is not pinned, then we can delete and ignore.
6804 Someone else must hold a ref and will put it and it will refile. but that is pointless as
6805 it could immediately be cleared after we test Pinned.
6807 lafs_get_flushable should get a reference before deleting from list. This ensures it won't be freed
6808 by lafs_shrinker, though it could be on the free list. If it is, then it isn't pinned so it is not
6813 These are removed from lru when freed - we just need the extra refcnt check after removing from list.
6814 No we don't - these are only pinned while refcnt or dirty and can only lose dirty while refcnt
6815 so they cannot disappear
6817 What is the story with my_inode->private_lock though? This is used to protect ->dblock accesses.
6818 I guess we need to get or hold the other lock .... look at what the race is - what else is checked when dblock is cleared?
6819 dblock is cleared in refile for the dblock,
6820 or in clear_inode under the inode private lock.
6823 There are various places that hold a non-counted reference to a block.
6825 - index hash table lafs_hash_lock
6826 - index free list lafs_hash_lock
6827 - phase_leafs / clean_leafs fs->lock only if pinned
6828 - inode->iblock lafs_hash_lock
6829 - inode->dblock inode->i_data.private_lock
6831 Each of these is protected by its own lock, but not all the same lock.
6832 When we turn one of these into a counted reference, we increment refcnt under the local lock,
6833 then after dropping that lock we take and drop b->inode->i_data.private_lock to ensure refile has
6834 finished. This must be done before changing/using the block in any way.
6835 To free an index block it must first be removed from _leafs list. Then if the refcount is still
6836 zero it can be freed - or put on freelist and subsequently freed.
6837 An InoIdx block - we need to hold hash_lock as well as private_lock to take a reference.
6838 To free a data block we similarly need to recheck refcnt after removing from leaf list.
6839 If it is in an inode file we also take that inode's private_lock to clear dblock.
6840 We use rcu to get the inode, then lock it, then clear dblock if refcnt is still zero.
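
The rules above can be sketched in single-threaded userspace C, assuming the ordering private_lock, then fs->lock, then lafs_hash_lock. Nothing here is real locking: lock_ctx, ctx_lock and get_ref_from_hash are hypothetical names, and the "locks" are just flags that let the ordering be checked.

```c
#include <stdbool.h>

/* Ranks encode the order in which locks may be taken (lower first). */
enum lock_rank { RANK_NONE = 0, RANK_PRIVATE = 1, RANK_FS = 2, RANK_HASH = 3 };

struct lock_ctx { bool held[4]; };	/* indexed by lock_rank */

/* Highest-ranked lock currently held, or RANK_NONE. */
enum lock_rank ctx_highest(const struct lock_ctx *c)
{
	for (int r = RANK_HASH; r > RANK_NONE; r--)
		if (c->held[r])
			return (enum lock_rank)r;
	return RANK_NONE;
}

/* Take lock r; refuse (return false) if that would violate the order. */
bool ctx_lock(struct lock_ctx *c, enum lock_rank r)
{
	if (r <= ctx_highest(c))
		return false;
	c->held[r] = true;
	return true;
}

void ctx_unlock(struct lock_ctx *c, enum lock_rank r)
{
	c->held[r] = false;
}

struct block { int refcnt; };

/* Turn the uncounted hash-table reference into a counted one:
 * bump refcnt under the local lock, drop it, then take and drop
 * private_lock so that any refile in progress has finished. */
bool get_ref_from_hash(struct lock_ctx *c, struct block *b)
{
	if (!ctx_lock(c, RANK_HASH))
		return false;
	b->refcnt++;
	ctx_unlock(c, RANK_HASH);

	if (!ctx_lock(c, RANK_PRIVATE))
		return false;
	ctx_unlock(c, RANK_PRIVATE);
	return true;	/* only now may the block be used or changed */
}
```

The point of dropping the hash lock first is that private_lock ranks below it, so it cannot be taken while the hash lock is still held.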
6843 review lafs_refile - are some of those tests redundant? - yes, one is gone.
6846 15ah - What about truncated blocks sitting on an uninc chain?
6847 I don't see the problem. It will eventually get incorporated and do the right thing...
6849 15ai - We don't want to touch the youth block during a checkpoint else it is awkward to write it out in
6851 No, I don't think that is really a problem. It only gets written out in the tail of the checkpoint after
6852 the root. I guess it could then get a youth number for a segment that it has no count for, if the root is
6853 written at the end of one segment and the segusage/youth written at the start of the next.
6855 But I think roll-forward is missing something. Blocks in the next phase need to be counted into segusage.
6856 Are they? oh, yes - they are. - cleaned and index blocks are ignored so there might be some wasted space,
6857 but the important blocks picked up by the roll-forward are handled.
6861 A checkpoint could cover multiple segments. We need to be sure these each get a valid youth number.
6862 Probably most of them will, but we need a consistent approach to be sure.
6863 They don't need to be added to the segtracker, except the last needs to be active, and it already is.
6864 So as we find a new segment we want to do much like what lafs_free_get does: youth_update.
6865 But the data block - isn't that youthblk? When is that set?
6866 segsum_find sets it if ssnum == 0
6869 15ak - run the orphan file at mount time.
6870 After roll-forward when we have a working filesystem, we need to read the orphan file, load each block
6871 mentioned, and register each as an orphan.
6873 - setting the orphan_slot
6876 Just like at the start of orphan_commit
6877 We also need to initialise nextfree and possibly 'reserved'.
6878 But: can orphans be created during roll-forward? They certainly can. We currently hide that in a re-use of
6879 the orphan list.. But directory updates are possible too, and not handled.
6881 I guess we should examine the file as soon as root is loaded and before roll-forward, as roll-forward cannot
6882 change the orphan file. Then after roll-forward, we read the original part of the file and set up
6883 any orphans that aren't yet.
6884 So we want to read once to get the size. Then read again to process content up to that size.
6886 15am - filesystem name.
6887 This is only used for identifying snapshots
6890 - mkfs is done to an initial version of lafs-utils. !!!
6892 So: 15am - filesystem name - used to identify snapshots
6893 So the name is pointless in subordinate filesets. So I could just shrink
6894 the metadata. The primary metadata needs to be big enough to get a name
6898 When cleaning we have a separate credit bit 'B_Realloc' from 'B_Dirty'.
6899 But we have the same B_UnincCredit bit for both. Is that safe?
6900 Processing the cleaner could absorb the UnincCredit while the block is
6901 reserved but not dirty. Then when it gets dirtied, there may be not
6902 enough credits to split.
6903 We set Dirty from Credit, and use ICredit for UnincCredit.
6904 But when only Realloc (not dirty) we don't use those bits. We allocate
6905 fresh credits or set Dirty if that fails.
6908 Need lafs_iget_fs to work on other filesystems. And other snapshots?
6910 in cleaner when parsing cluster head
6911 in orphan handler when loading orphan file or when rearranging it.
6914 Each of these might need to kern-mount the fs - so we need to hold the ref
6916 Cleaner also needs to explore snapshots.
6918 Don't want kern_mount - that is too heavy weight and includes a vfsmnt.
6919 Just split up lafs_get_subset and use sget etc. so we get an 'sb' that we need
6921 Similarly for snapshots. Cleaner needs to consider all snapshots, so they
6922 all need to be mounted.
6924 So snapshot 'sb's are referenced by cleaner, and de-reffed when cleaner stops.
6925 Subset 'sb's can be attached to the parent inode and then only dropped when
6926 the inode goes... only sb currently references inode.
6927 So maybe the first ref to an sb doesn't ref the inode but others do - is that
6928 possible? No, as we don't see them being dropped.
6929 Every inode in the subset could ref the filesys inode. That would keep it active
6930 the right amount of time, but release/destroy could still be racy.
6932 I guess cleaner/orphan/roll need to explicitly ref the fs.
6933 cleaner already refs inode when B_Cleaning, so hold fs too.
6934 B_Orphan seems to own an inode ref too.
6937 lafs_iget_fs gets a ref on the inode and the sb.
6938 need lafs_iput_fs to drop both references
6939 B_Cleaning, B_Orphan, I_Pinned and I_Trunc all hold this double ref.
6941 cleaner holds refs on all snapshots
6943 FIXME I probably need to hold inode/fs for B_Async too.
6944 No. Async only refs the block, not the inode or fs.
6945 Something else would normally ref the inode - e.g. cleaner.
6946 When the inode is free, the page invalidation will notice the
6947 B_Async flag and release it.
6949 So that is all done now, except I don't hold refs on snapshots in the cleaner
6954 - When is this used? directory etc don't need it.
6955 - a regular file might, but there is no API to punch
6956 a hole.... yet I guess.
6957 - So we just want to allocate these blocks to 0.
6959 15oct2010 - happy birthday Daniel...
6961 a/ files with nlink==0;
6962 If we happen to find them, we hold a reference until all roll-forward
6963 is done, in case a name is found - it is important not to start deletion
6967 36g - write roll_mini for directories.
6968 We get a name, an inode number, and one of:
6969 LINK UNLINK REN_SOURCE REN_NEW_TARGET REN_OLD_TARGET
6971 The REN_SOURCE is linked with a REN_*_TARGET which could be in a
6972 different directory, so we need to stash the SOURCE until the TARGET
6974 We simply impose the implied change on the directory and update the
6975 link count in the target inode.
6978 possibly record REN_SOURCE for later
6980 calls prepare/pin/commit as appropriate.
6981 Put the inode on orphan list if appropriate - needs care
6982 as we retarget orphan list.
6983 update inode link count.
6986 Just a refresh on the purpose of these updates.
6987 1/ They allow us to fsync a directory without performing a full checkpoint.
6988 As directory blocks are not processed in roll-forward we need the update
6989 for data to be safe. As fsync of directories is rare in some common
6990 situations we could avoid actually writing these. Simply queue them
6991 internally and discard them on a checkpoint. If an fsync comes before the
6992 checkpoint, only then do we write them out. If there are any cross-directory
6993 renames then the preceding updates in both directories need to be flushed
6994 before the cross-directory rename. It might be easier to always flush on
6995 a cross-directory rename.
6996 2/ They ensure consistency of inode link-count wrt to names in the filesystem,
6997 but as link count is only updated by these (or a checkpoint) there is no
6998 problem with delaying.
7000 So: when replaying these we must update the directory content and the inode
7002 It is OK to delay the write-out of these until an fsync, and not bother
7003 if a checkpoint happens.
7004 So add that to the TODO list - item 66.
7007 - roll forward directory updates ... I wonder if I got it right :-)(untested).
7010 I don't seem to have easy-access notes about the various meanings of
7011 'width' and 'stride'
7013 width: The number of independent devices across which the (virtual) device
7014 is placed. The normal goal is to write 'width' blocks on every single write.
7015 On a RAID4/5/6 this will avoid the need to pre-read for parity calculations,
7016 and it will keep all devices equally busy with writes.
7017 The 'width' blocks probably aren't consecutive.
7019 There are two different layouts - one with width*stride <= segment_size
7020 and one with width*stride > segment_size.
7022 width*stride <= segment_size
7023 This is a traditional striped layout like RAID0/4/5/6.
7024 The 'stride' is the chunk size, so 'width*stride' is the stripe size,
7025 and segment_size must be a multiple of this.
7026 In this case all addresses in a single segment are contiguous. We don't
7027 necessarily write them in order if we want to write less than one stripe.
7028 segment_offset will normally be a multiple of width*stride though this isn't
7029 enforced as one could have a partition with a non-aligned start.
7031 width*stride > segment_size
7032 This implies a concatenated layout. If parity-redundancy is in use then the
7033 blocks which combine to form a stripe are 'stride' blocks apart.
7034 The benefit of this layout is that an extra drive can be added by simply
7035 zeroing it and joining it to the array - no re-stripe needed.
7036 This will make all stripes slightly larger so at first the space will not
7037 be available. As cleaning happens the space will gradually become
7038 available. This still requires restriping, but unlike a normal
7039 raid5 restripe, the space becomes available in small amounts immediately,
7040 when there is no demand for more space, the re-striping (cleaning) can happen
7041 at a very low priority with no cost.
7043 In this case the blocks in a segment are not contiguous.
7044 'segment_size/width' are, then there is a large gap (in virtual address
7045 space) to the next chunk.
7047 The segment_offset is an amount of space which is free at the start of
7048 each device. 0..segment_offset and stride..stride+segment_offset etc
7049 do not contain data and can be used for metadata.
7051 When width > 1 it makes sense to replicate each state block across
7052 every device - as we want to write the whole stripe anyway.
7053 For now we only write and read the first two copies at the beginning, and
7054 the last two at the end...
7056 Question: what do we want to do about metadata on flash devices? We really
7057 don't want a small number of locations to store the metadata, but a large
7058 number that we search through - possibly a binary search.
7059 These could be all at start/end or scattered throughout the device.
7060 The latter would make it impossible to find efficiently - there is no way to
7061 create useful linkage without writing something else at start or end.
7062 As many devices optimise for random writes where the FAT table would be,
7063 it makes sense to just put the metadata there and not at the end.
7064 We should allow one 'page' for each metadatum, which probably means
7066 So we should allow all state blocks to be near the start.
7068 01mar2011 - Autumn arrives.
7070 Time to add handling of 'atime' and non-logged files.
7072 The idea is to have a separate file for storing only 'atime'
7073 This is separate from the inode file because the volatility of the data
7074 is very different and one of the principles of log-structured-fs is that
7075 differently volatile data should be kept separate.
7077 This does mean that an inode lookup requires getting data from two files,
7078 but it is hoped that the 'atime' file will mostly be in cache as each
7079 block contains the atime for lots of different inodes.
7081 The atime file contains 2 bytes for each inode, so with a block size of 4K,
7082 each block would hold info for 2048 inodes. 1 million inodes would require
7085 The 16bits are treated as a positive floating point number which
7086 gets added to the atime stored in the inode. The lower 5 bits are
7087 the exponent, the remaining 11 bits are mantissa. Though there is a
7088 little complexity in interpreting the exponent.
7089 If the exponent is 0, the mantissa is used as milliseconds -
7090 so shift right 5 and multiply by 1000000 for nanoseconds.
7091 The smallest change that can be recorded is 1 millisecond,
7092 and values up to (2^11-1) milliseconds - or 2 seconds - can be stored.
7093 If the exponent is 1 to 10, the mantissa has a '1' prepended as a
7094 new msb, and is shifted left by exponent-1 and then treated as milliseconds.
7095 This ranges up to 2^(12+9) milliseconds or 30 minutes, where
7096 the granularity will be 2^9 millisecs or 0.5 seconds
7098 For exponents from 11 up to 31 we add the 1 msb and treat
7099 the number as seconds after shifting (e-11). So at e==31,
7100 we shift a number that is
7101 up to 4095 by 20 to get nearly 2^32 seconds or 136 years.
7102 At this point the granularity is 2^20 seconds or 12 days.
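The three exponent ranges can be checked with a small decoder. This is a sketch in Python rather than the kernel code; the function name is made up, and returning everything in milliseconds is a choice made here for simplicity.

```python
def decode_atime_delta(value):
    """Decode a 16-bit atime-file entry into a delta in milliseconds."""
    exponent = value & 0x1F           # lower 5 bits
    mantissa = (value >> 5) & 0x7FF   # remaining 11 bits
    if exponent == 0:
        # Plain milliseconds, 0..2047.
        return mantissa
    full = mantissa | 0x800           # add the implicit '1' as a new msb
    if exponent <= 10:
        # Still milliseconds, shifted by exponent-1.
        return full << (exponent - 1)
    # Exponents 11..31: seconds, shifted by exponent-11.
    return (full << (exponent - 11)) * 1000
```

The all-ones entry 0xFFFF decodes to 4095<<20 seconds, which is the "nearly 2^32 seconds" limit mentioned above.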
7105 So overall we can update the atime for 136 years without needing to
7106 update the inode, and can record differences of 1msec for the first
7107 couple of seconds, then gradually less granularity until we are
7108 down to one second an hour after the last change, and 4 hours a
7111 To convert a number of seconds to this format:
7113 If >= 2048 seconds, we shift down until less than 4096 seconds
7114 counting the shift. We add 11 to that number to form exponent,
7115 and shift the resulting mantissa up 5, or with exponent, and mask
7118 Otherwise we convert to milliseconds (divide nano by 1000000 and
7119 multiply seconds by 1000, and add). Then if < 2048, we shift up by
7120 5 leaving a zero exponent and use that.
7122 Otherwise we shift down until < 4096 counting shifts, add 1 to the
7123 shift to form an exponent, and combine with mantissa as above.
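The conversion steps above, written out as a sketch in Python. The function name is made up, the 16-bit mask is what drops the implicit msb, and raising an error for deltas beyond the format is an assumption made here (negative deltas are not handled at all).

```python
def encode_atime_delta(seconds, nanoseconds=0):
    """Encode a non-negative time delta into the 16-bit atime format."""
    if seconds >= 2048:
        # Seconds range: shift down until the value fits in 12 bits.
        shift = 0
        while (seconds >> shift) >= 4096:
            shift += 1
        exponent = shift + 11
        if exponent > 31:
            raise ValueError("delta too large for the 16-bit format")
        mantissa = seconds >> shift
    else:
        ms = seconds * 1000 + nanoseconds // 1_000_000
        if ms < 2048:
            # Fits directly: zero exponent, mantissa is milliseconds.
            return ms << 5
        shift = 0
        while (ms >> shift) >= 4096:
            shift += 1
        exponent = shift + 1
        mantissa = ms >> shift
    # Masking to 16 bits removes the implicit leading '1' (bit 11 of
    # the mantissa lands on bit 16 after the shift by 5).
    return ((mantissa << 5) | exponent) & 0xFFFF
```

So 3 seconds becomes mantissa 3000 with exponent 1, and 10000 seconds becomes mantissa 2500 with exponent 13 (2500<<2 == 10000).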
7125 So that is the format - how do we implement it?
7127 We don't want to expose to user-space numbers that we cannot store.
7128 So any 'utimes' call that updates the inode directly can clear the
7129 value in the atime file. Only updates due to accesses go to the atimes
7131 We define a 'getattr' function which looks at the atime stored in
7132 the vfs inode and if it has changed we need to deal with it.
7133 - if the inode is still dirty we simply update the lafs inode
7134 and use the number as-is, clearing the atimes entry
7135 - else we subtract the stored atime from the new atime. If this
7136 is negative or exceeds 136 years we mark the inode dirty and
7137 store it there. If we cannot mark the inode dirty for some
7138 reason we just store all 1s in the atime file.
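The getattr decision above, as a sketch in Python working in whole seconds. The function and its return values are hypothetical; in particular, clearing the atimes entry after storing the time in a newly dirtied inode is an assumption.

```python
MAX_DELTA_SECONDS = 4095 << 20   # largest delta the 16-bit format can hold


def resolve_atime(inode_atime, vfs_atime, inode_is_dirty, can_mark_dirty=True):
    """Return (atime to keep in the lafs inode, action for the atime file)."""
    if inode_is_dirty:
        # Inode will be written anyway: store the time as-is.
        return vfs_atime, "clear"
    delta = vfs_atime - inode_atime
    if 0 <= delta <= MAX_DELTA_SECONDS:
        # Representable: inode untouched, only the atime file changes.
        return inode_atime, ("delta", delta)
    if can_mark_dirty:
        # Negative or too large: fall back to storing it in the inode.
        return vfs_atime, "clear"
    # Last resort: saturate the atime-file entry.
    return inode_atime, "all-ones"
```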
7140 The same operation is needed when dirty_inode is called to make
7141 sure atime updates get saved even when no getattr is called.
7143 As we always need to be able to update the atime file, it needs to
7144 be permanently pinned whenever an inode is read in. For
7145 non-logged files this should be cheap but we must do it anyway as
7146 the file might not be non-logged.
7147 So we need to keep a permanent reference to each block while the
7148 inode is loaded. That can keep it pinned.
7151 We don't want updates to the atime file to be flushed in any great
7152 hurry, especially if it is a logged file. We would be quite happy
7153 to only write at 'unmount' and probably 'sync'.
7154 So we want to stop the pages from appearing dirty in the page
7155 cache (PAGECACHE_TAG_DIRTY), and the inode from appearing dirty
7157 We can still keep them dirty in lafs metadata so if release_page
7158 is called we can schedule a write out then.
7163 1/ load atime file at mount time - there is one for each
7164 filesystem. It has inum of 3 and type of TypeAccesstime (6).
7165 Also release it on unmount.
7167 2/ loading an inode must take a ref to the block in the atime file
7168 if it exists. A new inode flag records if this has happened.
7169 Unless mounted noatime, we pin the block and reserve space.
7171 3/ getattr and dirty_inode must resolve any issues with the
7172 atime. So lafs_inode probably needs an extra field to be able
7173 to check for changes
7177 Hmm.. this is getting confusing...
7178 When atime is changed the only way we find out is by ->dirty_inode
7179 being called. But that is called when anything is changed.
7180 Filtering out whether or not we need to update the inode itself
7181 is awkward... maybe there is some context we can use.
7182 ->dirty_inode is called by mark_inode_dirty which is called:
7183 - by touch_atime, if something changed
7184 - file_update_time - at which time we also update iversion
7185 - setattr ... which has changed recently (2.6.37ish)
7187 - generic_file_direct_write - which increases the size of the inode
7188 - set_page_dirty_nobuffers
7190 So either the inode is pinned, or it isn't.
7191 If it isn't, then this *must* be an atime-only update.
7192 If it is, then it could be anything, but in any case we update the
7194 So: dirty_inode should try to get dblock and check if it is pinned.
7195 If it is pinned, then update the atime immediately and the offset
7196 in the atime file too.
7197 If not, just update the offset
7201 ARGggg... checkpin is interfering with unmount - it keeps an
7202 s_active count so unmount 'works' but doesn't release anything.
7204 checkpin is needed to ensure that inodes remain safe while
7205 we are cleaning. Particularly, while the inode index block is
7206 pinned, we keep the inode and fs referenced as well. I guess the
7207 theory is that they won't stay pinned for long - but they do.
7208 e.g. segusage blocks are permanently pinned.
7211 We could have a rule about the prime filesystem always being mounted.
7212 Then we don't need refcounts, but kill off the cleaner before
7213 unmount... which we sort-of do..
7215 All subordinate filesystems have references on the prime_sb so the
7216 prime_sb must be the last one to go. When it goes it kills
7218 So we don't need checkpin to take a ref on the prime_sb.
7220 There might be still an issue with files in subset filesystems
7221 being permanently pinned so they stay around longer than they
7222 should... need to check on that somehow.
7223 The idea is that a quota file block is permanently pinned so it
7224 will keep the fs pinned. That in turn will keep everything else
7225 pinned... Worry about that when we implement quotas FIXME
7228 I really need to sort this out, and it isn't easy...
7229 We really want to know when "all" filesystems have been unmounted
7230 so the block device(s) can be released and the cleaner stopped.
7231 But we don't have a count for that. We could if that was all
7232 we counted - but that would mean that we only have a single
7233 struct super_block for all filesystems.
7235 So that is what I have to do. A single super_block for all parts
7236 of the filesystem. I probably still need to allocate other
7237 dev numbers for stat->dev, but I don't need to use them internally.
7238 Maybe I even allocate superblocks... Yes - we need to use
7239 set_anon_super and kill_anon_super to allocate the numbers.
7240 lafs_inode will need a pointer to the filesystem - we use that
7246 bug at block.c:658. Block not B_Valid in lafs_dirty_iblock from
7247 lafs_allocate_block from cluster_flush.
7248 Block is 74/0: InoIdx block of a newly created file I think.
7249 '74' was /f23, then /mnt/1/adir. We are creating file in that
7251 This is a depth=0 InoIdx block - i.e. the data is in the
7252 dblock, so there is no index info, so it kind-a makes sense for the
7253 index block to not be Valid.
7254 yes- commit d268a566605bf006cf33c confirms that.
7256 So why are we trying to dirty it?..
7259 We create a couple of directory entries, then flush and end up
7260 with an in-line data block.
7261 Then we add more, flush again and so try to dirty parent...
7262 Where do we turn depth=0 inodes into depth=1??
7263 - erase_dblock_locked - don't want that
7265 So I guess the 'bug' is in error - it is OK to mark that invalid
7269 So - back to the super_block reworking. We want only one
7271 So we use the TypeInodeFile inodes a bit more to hold the details
7272 of different filesystems. We need to store a unique 'dev' number in
7273 there: use set_anon_super/kill_anon_super on a local 'struct
7274 super_block' and copy s_dev in/out.
7276 As we only have one sb, we can only have one fstype, so we cannot
7277 use the fstype to choose what to do.
7278 - if dev_name is a block device we try a normal mount
7279 - if dev_name is an Inode file, we perform a subset mount
7280 - if dev_name is a lafs dir and '-o snapshot=name', we mount that
7282 - if dev_name is a lafs dir in root with perm zero and
7283 '-o subset=MAXSIZE', create a subset filesystem.
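With a single fstype, the choice has to be driven by what dev_name turns out to be. A sketch of that dispatch in Python; the kind strings, the boolean flag and the returned labels are all made up for illustration and are not names from the lafs code.

```python
def choose_mount_action(dev_kind, options, empty_perm_dir_in_root=False):
    """Pick a mount behaviour from the kind of object dev_name refers to.

    dev_kind is one of "block_device", "inode_file" or "lafs_dir";
    options is a dict of parsed '-o' options.
    """
    if dev_kind == "block_device":
        return "normal mount"
    if dev_kind == "inode_file":
        return "subset mount"
    if dev_kind == "lafs_dir":
        if "snapshot" in options:
            return "snapshot mount"
        # Creating a subset needs a directory in the root with perm zero.
        if "subset" in options and empty_perm_dir_in_root:
            return "create subset"
    raise ValueError("don't know how to mount this")
```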
7285 - lafs_iget needs an inode rather than a superblock
7286 ditto for lafs_new_inode, lafs_inode_inuse, inode_map_free,
7287 choose_free_inum, inode_map_new_prepare
7288 - lafs_iput_fs,lafs_igrab_fs, ino_from_sb
7290 - NFS filehandles need careful thought
7291 They are 'per-super-block', not 'per-vfsmnt' which might be
7293 We could change that but.....
7294 For non-snapshot files it is easy - just record two inodes, the
7296 For snapshots there is nothing that is really stable.
7297 Maybe we could have different superblocks for snapshots.
7298 The snapshot doesn't need the cleaner as it is read-only, though
7299 the cleaner can need the snapshot...
7301 So the cleaner might automagically mount a snapshot, but a
7302 snapshot will never invoke the cleaner or any other thread stuff.
7304 So I guess we want one superblock for the fs and one for each
7306 The filehandle is then either inum+gen or inum+inum+gen where first
7307 inum must be TypeInodeFile
7310 ... though I could just put a snapshot number and partial timestamp
7315 This isn't a new to-do list, it is a list of the main features that are
7316 still not implemented:
7318 + at very least I don't pad with zeros yet
7319 + if stripe size were multiple of 3*3*5*7*2^N, then changing
7320 width might be manageable.
7321 e.g. stripe size: 40320 blocks.. But with megabyte chunksizes,
7322 we really want 32bit segsizes and 322560 block segments.
7323 - non-logged files - with interface to request access-time file
7325 - snapshots: particularly cleaning
7327 - metadata (inode/directory/etc) CRCs and duplication
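On the stripe-size idea in the list above: one reading is that a stripe size with many small factors lets the width change while each device still gets a whole number of blocks per stripe. A quick check of the arithmetic in Python; the function name and the "width must divide the stripe size" test are assumptions about what "manageable" means.

```python
def usable_widths(stripe_blocks, max_width):
    """Widths for which the stripe divides into equal whole-block chunks."""
    return [w for w in range(1, max_width + 1) if stripe_blocks % w == 0]


# 40320 is 3*3*5*7*2^7, and 322560 is eight of those stripes.
assert 3 * 3 * 5 * 7 * 2**7 == 40320
assert 40320 * 8 == 322560
```

usable_widths(40320, 12) gives every width from 1 to 12 except 11.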
7332 - locate and validate device and state blocks.
7333 - locate and validate checkpoint cluster.
7334 - locate and validate filesystem root
7335 - roll forward to collect segusage and quota blocks.
7336 - load inode map, read inode file, validate each inode and make sure
7338 - explore each file, following all indexing, count segusage for each
7339 segment and make sure segusage file is consistent.
7340 - check no block is allocated twice. This might require multiple passes,
7341 each time we examine a different collection of segments.
7343 - checking a file requires:
7344 - checking inode is consistent
7345 - checking index blocks are consistent with depth
7346 - checking index/extent blocks are sorted with no overlaps
7347 - checking block/iblock counts are correct.
7348 - checking all cluster headers in the current segment to ensure they
7349 look consistent and agree with file information. i.e. if cluster_header
7350 identifies a block, the block must live there, or later in the segment.
7352 - scan all directories looking for consistency of hash etc. Count links
7353 for all inodes. This might need to be multi-pass too.
7354 Could use a bitmap for single-link files, and table for others.
7357 - First must find segments which are not in use according to segusage file
7358 or according to block search.
7359 If there are none, require a new device be provided.
7360 - If anything looks incorrect, write corrected version to new segment
7361 Then write out new segusage files
7363 In some cases we might need to search all write-clusters for missing blocks??
7364 That could take a very long time!
7367 What do I really want to do about CRCs and hashes.
7368 It might be nice to store a hash for each block in the index block.
7369 But that wastes precious index-block space.
7370 If I store a CRC together with address info in the block, then I could
7371 be fairly sure it is the right block. So e.g. inodes store the inode number,
7372 Index blocks could hold inode+depth+address.
7373 Last 8 bytes of each block could be a 4byte CRC and a 4byte identity.
7374 identity is XOR of fsinum inum blocknum generation - or a CRC of these.
7376 Actually, we don't need to store the identity info - we just need to
7377 include it in the CRC. That either saves space, or allows more bits to
7378 be used for the CRC, which is probably the best use of bits for detecting
7380 Though it might be nice to store phys-addr in the CRC too, we cannot as
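A sketch of the "include the identity in the CRC but don't store it" idea, in Python using zlib's CRC-32. The field widths, byte order and the choice of CRC polynomial are assumptions for illustration only; nothing here is the on-disk format.

```python
import struct
import zlib


def block_crc(data, fsinum, inum, blocknum, generation):
    """CRC over the block's identity followed by its contents.

    The identity is fed into the CRC but never written to the block,
    so a block read from the wrong place fails the check.
    """
    ident = struct.pack("<IIII", fsinum, inum, blocknum, generation)
    return zlib.crc32(data, zlib.crc32(ident))


def block_is_ours(data, stored_crc, fsinum, inum, blocknum, generation):
    """Verify a block against the identity we expected to find."""
    return block_crc(data, fsinum, inum, blocknum, generation) == stored_crc
```

The same bytes checked against a different inum or blocknum give a different CRC, which is the whole point.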
7383 My short-term todo list is:
7384 DONE - get 'lafs' to the stage where I can create an fs requiring roll-forward
7385 DONE - use 'lafs' to create images for testing, so I don't need 'fred.safe' any more.
7386 DONE - Make lots of 'layout' changes - see 15cb
7389 - 'run' goes to completion, but segusage isn't updated in the final cluster
7390 and the number left over from before looks wrong.
7391 DONE - 'ls -l' on a subset file gets confused.
7392 - fs created by 'lafs' has wrong Blocks and Inodes counts
7393 - we lose a ref to a segsum and sometimes put it too often.
7394 REFCNT 1 [ce0ffc48]0/182(2535)r0E:Valid,Claimed,PhysValid NP
7395 REFCNT 1 [ce055b9c]0/187(2535)r0E:Valid,Claimed,PhysValid NP
7396 REFCNT 1 [ce0445d8]0/182(2535)r0E:Valid,Claimed,PhysValid NP
7400 Once I have these bugs sorted out I want to make some format changes.
7402 DONE - fs_metadata need a 'parent' link
7403 rename needs to be careful about what is updated!
7405 lafs_get_parent needs some thought.
7407 DONE - roll-forward should get exact mtime stamps, and ctime.
7408 So each data block must have an exact timestamp
7409 of when the change actually happened. Or the group_head
7410 has a timestamp for the most recent update to the file
7411 As we use nanosecond timestamps (pointless though they are)
7412 we need 30 bits for the nanoseconds and at least 11 for the seconds.
7413 So 48 bits (6 bytes) is plenty.
7414 So include a 64bit timestamp in the cluster_head and 48bit
7415 number to subtract in the group_head
7416 But saving 2 bytes per file isn't really worth it, and we may
7417 well lose it in padding. So just store a 64bit timestamp in
7420 DONE - use CRC in place of all checksums - lafs_calc_cluster_csum
7422 DONE - state block flags for inconsistencies found
7423 If any inconsistency found, fsck is advised.
7424 For some it may be imperative.
7425 Things that can be wrong include:
7426 - generic read error
7428 - index block incoherent
7429 - dir block incoherent
7430 - link count negative
7431 - cluster header incoherent
7433 64 bits should be adequate and simple for this.
7434 Any unknown bit requires a full fsck.
7436 DONE - 32bit segment size
7437 With 16bit at 4K blocks we are limited to 256Meg segments.
7438 64Meg with 1k blocks. This takes about 1 second to write on
7439 a modern drive. On an array it will take even less time.
7440 24bits gives 16 to 64 gigabytes which is plenty.
7441 However 24bits is awkward to access. a 1K block holds 341 1/3.
7442 A 4K block holds 1365 1/3.
7443 But this wastes less space than 256 or 1024 and so causes less IO.
7444 But then we probably want to size segments to be very big.
7445 A few thousand segments should be OK, which is tens of blocks.
7446 I don't think the savings with 24bits are worth it, and I do
7447 think v.big segments could be useful, so let's go with 32bit segments.
7449 Youth is currently tuned to 16bits. Let's leave it there and
7450 maybe waste some space.
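The size limits quoted for the 16bit and 24bit options can be re-derived in a couple of lines of Python. The helper names are made up; the 3-byte entry is the 24bit counter discussed above.

```python
def max_segment_bytes(bits, block_size):
    """Largest segment when the block count is stored in 'bits' bits."""
    return (1 << bits) * block_size


def entries_per_block(block_size, entry_bytes):
    """Whole counters of 'entry_bytes' bytes that fit in one block."""
    return block_size // entry_bytes


MEG = 1 << 20
GIG = 1 << 30
```

16 bits gives 64Meg or 256Meg segments (1K or 4K blocks), 24 bits gives 16 to 64 gigabytes, and a 3-byte counter packs 341 to a 1K block with a byte left over.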
7453 - parallel new-data write clusters.
7454 I think it is sufficient to include a second 'next_addr' in the
7455 cluster_head - or maybe two. alt_next_addr[2].
7456 When a thread wants to start a new stream of clusters it allocates
7457 the segments then attaches to the next outgoing write cluster.
7458 Once that is written everything in the new cluster is safe.
7459 On a checkpoint every stream writes at least one checkpoint cluster
7460 and these are linked together through alt_next_addr.
7461 The 'next' cluster for each must be the checkpoint cluster and must
7462 carry linkage but unlike with first-link, there is no need to wait
7463 The data is already safe as long as the state block isn't updated
7464 until every cluster_end block is written.
7465 So really, one is enough. I had thought 2 would enable quick fan-out
7466 but there is no real need for that.
7468 As 0 is a valid write-cluster address we use 'this_address' to signify
7469 that there is no alt-next.
7471 It is possible that a block of a file could be written to two
7472 different streams at different points in time between two checkpoints.
7473 We need to ensure that roll-forward gets these in the right order.
7474 'seq' can be the same in two different streams so we cannot use that.
7475 timestamp could possibly be used, but as times can go backwards it
7478 NEW IDEA. Just use one stream of clusters. However it can
7479 bounce from one device to another easily. So two different
7480 threads can be building up two different write clusters at the
7481 same time as long as they synchronise at some point to pass
7482 addresses around. They also need some other Verify mode as
7483 VerifyNext or VerifyNext2 will destroy any parallelism.
7484 As the point of this is to write to multiple devices in
7485 parallel, maybe VerifyDevNext{,2} meaning the next header on
7486 the same device serves to verify this.
7490 maximum number of segments written between checkpoints
7491 whether data can be cleaned to a particular device
7492 whether a device can receive new data
7493 whether metadata duplication is needed
7494 whether an RO device from a different array is allowed.
7495 Some of these are per-device policies. Some are per-array.
7497 The 'RO Device' thing is special. I think I want an alt_uuid.
7498 It works like this: You assemble the RO array when you
7499 mount a new filesystem identifying the old as a component.
7500 So that 'state' block on the new devices must identify the alt_uuid
7501 and state seq number.
7503 Do we want to record more info about which devices are in the
7504 array? Currently we just record how many. If we find enough
7505 with the right UUID/seq, they must be it.. what else would we
7508 For all the other policy statements it is probably simplest to
7509 allow a set of simple strings. e.g. "noclean", "nonew",
7511 devblock currently uses 146 bytes, so room for 878
7512 stateblock uses 112 plus some for snapshots, so much the same.
7513 We currently don't use 'version' and have no concrete plans.
7514 The vague idea is to allow lafs to *know* that it cannot mount
7515 the array, so any incompatible feature gets set.
7516 We could keep those in the policy sets. From that perspective
7517 there are 3 types of things.
7518 - if you don't understand, don't worry
7519 - if you don't understand, don't try to write
7520 - if you don't understand, you cannot even read.
7522 That last is really best avoided. We have version info
7523 elsewhere in the tree so that a new index style will simply
7524 make that block unreadable.
7525 So I think make the dev and state blocks a simple incrementing
7526 version number which apply to that block, and have "don't
7527 worry" and "don't write" policies distinguished by first
7529 Capital is "If you don't understand, don't write"
7530 Lower is "if you don't understand, don't worry".
7532 These are space separated strings
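How a mount might act on such a policy field, as a sketch in Python. It assumes the distinction is the case of the first letter (Capital: don't write, lower: don't worry); the function name, the "rw"/"ro" results and the "Fancyindex" policy are invented for the example.

```python
def mount_mode(policy_field, understood):
    """Decide whether the array may be written, given its policy strings.

    policy_field is the space separated list from the dev/state block;
    understood is the set of policy names this implementation knows.
    """
    read_only = False
    for word in policy_field.split():
        if word in understood:
            continue
        if word[0].isupper():
            # Unknown Capital policy: don't try to write.
            read_only = True
        # Unknown lower-case policy: don't worry.
    return "ro" if read_only else "rw"
```

So an unknown "nonew" is ignored, while an unknown "Fancyindex" forces a read-only mount.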
7536 - what about i_version? Include in timestamp?