2 So, let's try to write a kernel module that implements this filesystem.
3 It would be good to have a plan.
5 - Mount filesystem, providing empty root directory
6 o parse mount options - DONE
7 o find/load superblocks and stateblocks - DONE
8 o present empty directory - DONE
9 o Compile external module - DONE
12 - Mount filesystem read-only with no roll-forward
14 sync_page_io or bread? - not bread I think
15 o Index blocks management
16 o search cluster-header for root inode
18 o Directory lookup/read
21 - Support roll-forward for blocks, orphans, whatever
22 o manage segusage files
27 o cluster creation / block sorting
31 - Interface for snapshots and other admin
35 ------------------------
37 If a device is removed from the filesystem, we cannot reliably
38 tell from the other devices or state that this is so.
39 Maybe we need to update all devblocks with a new 'seq' number...
41 How do we specify mounting subordinate filesets?
42 What superblock do they have?
43 I suspect we do a -F lafs-sub mount from the original filesystem.
46 If mount fails, we seem to be leaving a super lying around,
47 and sync_supers dies on it. - DONE
50 Umount appears to work, but sync_supers dies. - DONE
53 subordinate supers aren't being locked as much - is that a problem?
56 index pages never get put on an LRU - how is this supposed to work?
64 --------------------------
66 Inodes live in an address-space, much like a file. To load the
67 first inode, we need an address-space, so may as well have an
68 'struct inode' as we may want to expose it to user-space.
70 Loading an inode, need
71 fs (lafs filesystem structure)
72 which subfs (maybe a lafs inode)
73 which snapshot - this is implied by the subfs inode.
74 and fs can be obtained from inode, so just inode, inum
80 review block_leaf_find and make_iblock
81 need to do setparent and block_adopt next
84 need to resolve locking for ->siblings list
91 I can read a file.....!!!!!
92 Code review / tidy up.
93 resolve locking buffer vs page
95 Export on a web page somewhere??
98 (I spent a while getting large-directories to work again in prototype..
100 - Priority: clean mount and unmount
104 FIXME how do we record and handle write errors???
106 The iput in lafs_release - which is needed - is oopsing
110 Ok, I finally have a clean mount/unmount.
111 .. not quite. blocks being freed at unmount still have a refcnt, which is bad.
114 - make sure we can handle 'large' directories.
115 - make sure we can handle files with indexes
116 - handle filesystems that span devices.
119 Hurray - clean unmounts!!!
120 There is a nasty circular reference of the root inode which is stored in
121 a block that it manages. Maybe this should not happen, rather than having to be
122 explicitly broken - the root-block can live elsewhere, not in the inode.
124 Next multi-level index blocks.
126 But first, need to understand memory pressure and pageout.
127 How are dirty pages found to be cleaned?
128 How is pressure put on a filesystem to clean up?
129 How are clean pages reaped?
131 - call pagevec_lru_add{,_active}(pvec) to put the page on an LRU
132 lru_cache_add{,_active}(page) might be easier, but isn't exported.
133 - call mark_page_accessed(page) to keep the page 'active'.
136 - make sure indexes work...
140 eax,bx,cx,dx,s1 all zero
141 from block_leaf_find 203
143 ... OK, indexes seem to work.
144 But 'lafs' has problems creating some large files.
147 This is due to not handling errors properly.. fix it later FIXME
151 Must make sure the index address-space gets cleared up... I wonder
152 how we find all the pages to free. This might be one reason to keep them
153 in a radix tree. Though we should be able to walk our own data structures.
156 Then work on mounting a 2-device filesystem.
159 FIXME dir_next_ent always starts from the beginning rather than
160 remembering where it is up to... can this be fixed??
163 18mar2006 (Wedding anniversary, and Saturday ... during commonwealth games)
165 Mounting snapshot needs a way to identify that it is a snapshotmount
166 and which snapshot, and which filesystem.
167 We could use a different filesystem type, but that isn't really needed
169 mount -t lafs -o snapshot=name /original/mount/point /new
171 This grabs the named snapshot of /original/mount/point and places it at
173 The 'snapshot=' option is the trigger.
176 mount -t lafs -o control /original/mount/point /new
178 To grow a filesystem, we initialise a device (super/state blocks) and
179 mount -t lafs -o remount,new=/dev/name whatever /original/mount/point
181 as the dev_name isn't passed to remount
183 So, mount options are:
190 pairs matching what is exposed in the control filesystem
193 - factored out super-block finding preparatory to finding snapshots.
196 superblocks for snapshots and sub-ordinate filesystems do
197 not get stored in the 'state'. There is, however, a usage count so that
198 the prime filesystem cannot be unmounted until all snaps and subs are gone.
199 This should just refcount the prime_sb I suspect.
201 So: a snapshot sb points to the 'struct fs' but doesn't .... what???
204 - remove the super-block finding code by changing the layout to store
205 superblock locations explicitly :-)
207 - teach 'mount' to mount snapshots.
209 - need to audit for bad use of ss[0]
210 - need to find better way to map 'sb' to snapshot number.
211 - need to make unmount work.
213 01apr2006 (no, really!!)
214 - rewrite index to kmalloc index blocks and use a shrinker to free them.
215 This means that indexblock no longer has a 'page', which makes sense.
216 It also means they cannot live in highmem, which is sad, but could
219 Notes: superblocks and refcounts.
220 Each device holding the filesystem gets a superblock.
221 One of these (arbitrarily) is the 'prime' superblock and gets to
222 manage the whole filesystem.
223 Each snapshot also gets a superblock, as does each
224 subordinate filesystem. These are anon sbs - using anon dev.
225 Each anon sb takes a reference to the 'struct fs', and also to the
226 prime sb.... how about the reference relationship between fs and prime_sb???
230 - problem with getting parent superblock due to semaphores...
231 - when unmount, put_super isn't being called, so inode 0 isn't released!
234 (Took a week off to play with rt2500 wireless cards)
235 - Use different filesystem type for snapshots and subordinate filesystems.
236 This removes the semaphore problem
237 + OK, mount and unmount works for snapshots... what next?
238 - review index block - worry about himem?
239 - review ss[0] usage - OK
240 - general code review
242 FIXME - what should leaf_lookup/index_lookup return on format error?
243 They currently return '0' which will quietly make an empty block.
244 Maybe '-1' would be better to make an error block.
245 FIXME check how other filesystems lock the setting of PagePrivate
246 Maybe just need to lock_page
247 FIXME combine find/load/wait into one operation
248 Review dir, super, roll, link
250 FIXME module refcount increases on failed mount!
253 I've been sick for too long, and not much has happened... However I think more than
254 the above comment says. I started looking at roll-forward and have the
255 basic block parsing in place so that it reports what it sees in the roll.
256 Also, the format has been changed a little: the address in the state block
257 is the CheckpointStart cluster, and we simply roll forward to the
258 CheckpointEnd, and then keep going beyond there - there is no longer any
259 walking back to find the start.
261 Next step is to start incorporating rolled elements into the filesystem
263 - data blocks: shouldn't be too hard. Don't need to update the
265 - inode updates: should be straightforward enough, but care is needed
266 as the data might be in multiple places
267 - directory updates: these are probably most interesting..
270 Question: how are symlinks created?
272 log the inode creation
274 log the directory update.
275 This allows the 'value' stored in the inode to appear after the directory
277 That might be OK for files (which are created empty and then extended)
278 but is bad for symlinks (which are created atomically).
280 - ensure inode is in a previous cluster to directory updates.
281 This slows things down too much I think
282 - log the content as well. This is awkward if it is big, certainly if more
283 than a block, which is possible.
284 - directory updates could be dependent on the inode being valid.
286 - log content if it is small, else write inode, flush, then create link.
288 So the fast option is:
289 log inode create, log content, log filename
290 and the slow/safe option is
291 log inode create, sync file, log filename
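
A minimal sketch of choosing between the fast and the slow/safe ordering, assuming the choice is made purely on content size. The names (symlink_plan, inline_limit, the STEP_* values) are hypothetical and not taken from the module; the real threshold for "small" is not decided in these notes.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical step names for the two orderings described above. */
enum log_step {
    STEP_LOG_INODE_CREATE,
    STEP_LOG_CONTENT,   /* fast path: content goes into the log */
    STEP_SYNC_FILE,     /* slow/safe path: write and flush the file first */
    STEP_LOG_FILENAME
};

#define PLAN_STEPS 3

/*
 * Fill plan[] with the three steps for creating an object with
 * content_len bytes of initial content (e.g. a symlink target).
 * inline_limit is an assumed cut-off for "small enough to log".
 */
static void symlink_plan(size_t content_len, size_t inline_limit,
                         enum log_step plan[PLAN_STEPS])
{
    assert(plan != NULL);
    plan[0] = STEP_LOG_INODE_CREATE;
    plan[1] = content_len <= inline_limit ? STEP_LOG_CONTENT
                                          : STEP_SYNC_FILE;
    /* The directory entry always comes last, so the name never
     * becomes visible before the content is safe. */
    plan[2] = STEP_LOG_FILENAME;
}
```

Either way the filename is logged last, which is the property the two options share.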
293 So on roll-forward if we see the inode we just save the data.
294 Saving the whole inode seems attractive, but we want minimal order
295 dependence: an inode update in the same cluster as the new inode should
296 still over-ride, even though it is earlier.
298 Ok, rollforward is proceeding slowly. I think I am now incorporating
299 new blocks into the tree properly, though the code probably won't compile.
300 It will be nice to test this and see the file have the right data.
302 Next step would be to include the index incorporation code.
310 - what exactly should happen when rollforward finds a file with a linkcount of 0?
311 Currently all updates get lost - I wonder if they are lost safely?
312 - rollforward is getting the size right, but not the content
313 - do I need to flag a block that ->phys is valid?
315 : Ok, roll-forward picks up new blocks in a file OK,
316 but umount has stopped working.
317 Presumably because there are pages attached to the inode which aren't
318 getting released. What do we want to do here?
319 Normally those pages, or their addresses need to be recorded before
320 they are lost. But on a read-only mount we don't care so much.
322 22jun2006 continuing above thought..
324 When we roll-forward and pick up the pieces of a file, we don't
325 want to allocate pages to hold those pieces (and definitely don't
326 want to read them all). We just want to attach the addresses
327 to the parent for incorporation. Similarly after writing
328 dirty blocks in a file we want to be able to release them
329 immediately rather than waiting for the addresses to be
330 incorporated (as incorporation can be more efficient when delayed).
332 We could just allow the page associated with a block to be released,
333 except that the page provides the indexing to find a block. We might
334 be able to live without the indexing, and hunt down the indexblock tree,
335 but living without the mutual-exclusion provided by block indexing would
337 And the 'struct datablock' still contains a lot more than is needed.
339 So maybe we should just have a completely separate structure attached to
340 the indexblock which lists fileaddr/physaddr. This could include
341 extent information. The trick would be guaranteeing allocation.
342 We could either allocate-late with a fallback of attaching the 'struct block'
343 or performing an immediate incorporation, or allocate-early and block
344 the dirtying of a page until there is space to record the new address.
345 This last is bound to be easiest.
347 So: what exactly do we use to store addresses?
348 Probably a linked list of tables.
349 Each table contains a link pointer and an array of
350 fileaddr/physaddr/extentlen
351 But we would need to allocate lots of these if there are hundreds of
352 dirty pages, but possibly only end up using a few if they made
353 extents very nicely. That might be wasteful.
355 Or we could allocate just one. When it is full we perform an
356 incorporation. But if that causes a page split we are in trouble.
357 We could have a spare page, split to it, write out one
358 and wait for the spare page to be written and free.
359 But we cannot just release the index page as it might still have
362 (I think I've been here before).
363 A worst-case scenario involves writing one block and that requires
364 splitting every index up the tree to the inode. This requires
365 arbitrarily many pages to be allocated. To accommodate this we either
366 pre-allocate a spare page at every level of the tree down to the data
367 block (a bit like storage space allocation) which seems very wasteful,
368 or we make sure we can release one of the split pages, which seems impossible.
370 I could decide not to worry about it. Have a pool of index pages and hope
371 it always works. After all, most pages are data pages, and they can be
372 freed successfully. We would only have a deadlock if all dirty memory were
373 index pages, and that seems unbelievably unlikely. If we trigger a
374 checkpoint when the count of locked-pages hits some limit we should be
377 So: Keep one table per index block. Use simple append and sequential search.
378 When table gets full, force an incorporation
380 Do we allocate the table separately, or embed it in the indexblock??
382 Probably embed it. indexblocks that don't need it can be freed at any
383 time so that space waste hopefully isn't significant.
386 If the file is written sequentially, then everything should gather into
387 extents, and so it doesn't need to be enormous.
388 If the file is written randomly then the index block can be expected to
389 be 'indirect', so incorporation will be cheap.
390 So 'small' seems ok in both cases.
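
A rough sketch of the per-indexblock table with simple append and sequential search, to show how sequential writes collapse into one extent. Everything here is an assumption for illustration: the names, the field widths, the slot count (the notes only say "small"), and searching newest-first so a later rewrite of the same block wins.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define UNINC_SLOTS 8   /* hypothetical; the notes only require "small" */

struct uninc_ent {
    uint32_t fileaddr;  /* first file block covered by this extent */
    uint64_t physaddr;  /* device address of that first block */
    uint16_t len;       /* number of consecutive blocks */
};

struct uninc_table {
    int used;
    struct uninc_ent ent[UNINC_SLOTS];
};

/*
 * Append one (fileaddr, physaddr) pair.  If it directly continues the
 * most recent extent it is merged into it.  Returns false when the
 * table is full, i.e. the caller has to force an incorporation.
 */
static bool uninc_add(struct uninc_table *t, uint32_t fileaddr,
                      uint64_t physaddr)
{
    assert(t->used >= 0 && t->used <= UNINC_SLOTS);
    if (t->used > 0) {
        struct uninc_ent *last = &t->ent[t->used - 1];
        if (last->len < UINT16_MAX &&
            fileaddr == last->fileaddr + last->len &&
            physaddr == last->physaddr + last->len) {
            last->len++;
            return true;
        }
    }
    if (t->used == UNINC_SLOTS)
        return false;
    t->ent[t->used].fileaddr = fileaddr;
    t->ent[t->used].physaddr = physaddr;
    t->ent[t->used].len = 1;
    t->used++;
    return true;
}

/* Sequential search, newest entry first. */
static bool uninc_lookup(const struct uninc_table *t, uint32_t fileaddr,
                         uint64_t *physaddr)
{
    for (int i = t->used - 1; i >= 0; i--) {
        const struct uninc_ent *e = &t->ent[i];
        if (fileaddr >= e->fileaddr && fileaddr - e->fileaddr < e->len) {
            *physaddr = e->physaddr + (fileaddr - e->fileaddr);
            return true;
        }
    }
    return false;
}
```

With this shape a long sequential write uses a single slot, while random writes use one slot each and fill the table quickly.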
394 But wait a minute.....
395 On a checkpoint we can be getting phys updates for prev and next phases.
396 next-phase updates cannot be incorporated until the indexblock has passed
397 on to the next phase. So in that case, I think we still keep a linked
398 list of unincorporated blocks and live with the fact that we cannot
399 free them until the phase change passes. That shouldn't be a big problem
400 as it is a limited time frame - especially for data blocks..
402 But does this solve our initial problem??
403 During roll-forward we want to keep the addresses but not the blocks,
404 and we don't want to force incorporation. That means an arbitrary list
405 of addresses attached to an index block.
406 I guess we could possibly allow incorporation, but I would rather not
407 as I want the fs to be able to be read-only nicely.
408 So that means we need to have a list of address tables.
409 Maybe the normal approach is 'add a table if possible, else incorporate'?
411 OUCH... we may write a block a second time before incorporating the
412 new address, so when adding an address to the table we need to check
413 if it already exists. That could be expensive.
414 For index blocks might it even be a different address? I think
415 not but the vague possibility (in the future?) does complicate
416 things somewhat. Maybe we just keep things in chronological order and
417 don't worry about duplicates until incorporate time, when we have to
423 free_block must free tables DONE
426 Unmounting still doesn't work.
427 Problem is that an index block is holding a reference on parent,
428 and parent references aren't getting cleaned up.
429 On read-only unmount I guess we need to walk the list of leafs,
430 discard any address info, and unlock the blocks.
431 So that should be the first task for next time.
434 Leafs are locked blocks which have no locked children.
435 So any locked data block (non-inode) is a leaf
436 Any locked index block with lockcnt[phase] 0 is a leaf.
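
The leaf rule above, written out as a predicate. This is only a sketch: struct blk and its fields are hypothetical stand-ins for the module's block structures, and the treatment of inode data blocks (which the rule does not cover) is an assumption.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified view of a block. */
enum blk_kind { BLK_DATA, BLK_INODE_DATA, BLK_INDEX };

struct blk {
    enum blk_kind kind;
    bool locked;
    int lockcnt[2];     /* locked children per phase (index blocks) */
};

/* A leaf is a locked block which has no locked children. */
static bool blk_is_leaf(const struct blk *b, int phase)
{
    assert(phase == 0 || phase == 1);
    if (!b->locked)
        return false;
    switch (b->kind) {
    case BLK_DATA:
        return true;    /* non-inode data blocks never have children */
    case BLK_INDEX:
        return b->lockcnt[phase] == 0;
    case BLK_INODE_DATA:
        /* Not covered by the rule above; assumed non-leaf here. */
        return false;
    }
    return false;
}
```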
438 OK - fixed numerous bugs, but I can unmount now!!
439 I can even rmmod and insmod and all is cool.
443 - review refile and get all the code in there from prototype
445 - write a combined find/load/wait function and use it
447 - allocate inodes in single memcache and avoid generic_ip
448 HALF DONE. (still using kmalloc, not doing initonce well)
449 - review recording of new block addresses
450 + make sure we lookup there on index lookup - YES
451 + make sure ->uninc_next gets transferred to table at phase change.
452 + write incorporation code as it is tricky
453 - review how directory updates can be incorporated into a RO filesystem.
454 No, they cannot. We need to update the directory.
455 - write directory update code
456 - write cluster construction code
457 - make sure indexblocks with unincorporated addresses get on to inc_pending
458 ?? or is locking them enough?
461 INCORPORATION - ARgggghhhhh.
462 The current uninc_table doesn't really lend itself to building
463 index block... though maybe....
464 Question: what happens when an index block disappears? i.e. it has no
466 We clearly need to remove it from the parent. This should be trivial,
467 a direct operation on the parent index block, e.g. set some number to 0.
468 Then the next incorporation pass will simply lose that entry.
470 OK, that might be all well and good, but how do we sort unincorporated
471 addresses so we can merge them?
472 A linked-list merge sort is nice and open-ended, but does waste
473 quite a bit of space in pointers.
475 Or maybe I should just always do small-table incorporations.
476 Is there a way that a bad ordering of writes could force very bad
477 index layout in this case? i.e. cause a table split every time,
478 but new blocks go in the first (full) table.
479 OK Decision: always do small-table incorporation.
480 i.e. not a list of blocks: just a table of addresses.
482 FIXME check validity of index type when it is first read in,
483 and reject early if it cannot be recognised.
486 Took a break from incorporation.
487 Looking at directories.
488 Wrote dir.doc in module to sum lots of stuff up.
490 dir blocks have an info structure attached.
491 This includes a counted reference to the parent.
492 How long does this need to hang around for??
494 - when there is any orphan issue happening, it must stay, via
496 - when actually performing a dir op, we need to create and
499 When last ref of a dir block is dropped, should drop
500 the parent reference.
504 free list management mostly done.
506 create/delete prepare/commit/abort
508 dirty_block lock_block
511 FIXME should dir_new_block zero out the block?
512 How will commit_create know what to do with this block?
514 NOTE another type of directory orphan is a free leaf block which
515 is on the part-free list.
517 -------------------------------------------------------------
518 09sep2006 - on the plane to Frankfurt
519 Don't tell me I am rethinking preallocation again ???
522 dirty_inode needs to record the phase it is dirty in
523 inode_fillblock needs to check current phase and act accordingly.
525 Make sure the B_Orphan flag is set and used - or discard it.
527 How do we commit creating a symlink?
528 If it is a full block in size we cannot make an update record.
529 - maybe have two update records? We cannot guarantee they are in
531 ... but if we put the 'make dir entry' last it should work.
533 Change 'struct descriptor' definition
534 the 'block_type' aka 'length' 16 field becomes
535 0x0000 -> 0x8000 -> datablock, possibly a hole - up to 32K.
536 0x8001 -> 0xc000 -> miniblock up to 16K+
537 0xffff -> index block.
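
A small sketch of decoding the 16-bit field using the ranges listed above. The enum and function names are hypothetical, and treating the unlisted values 0xc001..0xfffe as invalid is an assumption, since the notes do not assign them.

```c
#include <assert.h>
#include <stdint.h>

enum desc_kind {
    DESC_DATABLOCK,     /* 0x0000..0x8000, possibly a hole */
    DESC_MINIBLOCK,     /* 0x8001..0xc000 */
    DESC_INDEXBLOCK,    /* 0xffff */
    DESC_INVALID        /* anything else (assumed unused) */
};

static enum desc_kind desc_classify(uint16_t block_type)
{
    if (block_type <= 0x8000)
        return DESC_DATABLOCK;
    if (block_type <= 0xc000)
        return DESC_MINIBLOCK;
    if (block_type == 0xffff)
        return DESC_INDEXBLOCK;
    return DESC_INVALID;
}

/* Boundary checks for the ranges above. */
static void desc_selftest(void)
{
    assert(desc_classify(0x0000) == DESC_DATABLOCK);
    assert(desc_classify(0x8000) == DESC_DATABLOCK);
    assert(desc_classify(0x8001) == DESC_MINIBLOCK);
    assert(desc_classify(0xc000) == DESC_MINIBLOCK);
    assert(desc_classify(0xffff) == DESC_INDEXBLOCK);
}
```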
539 Need to write IO routines which decrease pending-block-count in
543 Thinks. a 1TB filesystem with 1K blocks and 4096 blocks/seg
544 gives 4Meg segments. That would be 256K segments which at 2 bytes per segment
545 - 512 segments per block - is 512 blocks in each seg usage file
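
The same arithmetic as a quick check, so other geometries can be tried. The struct and function names are made up for this sketch; the only inputs taken from the notes are 1TB, 1K blocks, 4096 blocks per segment and 2 bytes per segment.

```c
#include <assert.h>
#include <stdint.h>

struct segusage_geom {
    uint64_t segment_bytes;      /* size of one segment */
    uint64_t segments;           /* whole segments in the filesystem */
    uint64_t entries_per_block;  /* usage entries that fit in one block */
    uint64_t usage_blocks;       /* blocks needed by one seg usage file */
};

static struct segusage_geom segusage_geometry(uint64_t fs_bytes,
                                              uint32_t block_bytes,
                                              uint32_t blocks_per_seg,
                                              uint32_t bytes_per_entry)
{
    struct segusage_geom g;

    assert(block_bytes > 0 && blocks_per_seg > 0 && bytes_per_entry > 0);
    assert(bytes_per_entry <= block_bytes);

    g.segment_bytes = (uint64_t)block_bytes * blocks_per_seg;
    g.segments = fs_bytes / g.segment_bytes;
    g.entries_per_block = block_bytes / bytes_per_entry;
    /* Round up so a partly filled last block is still counted. */
    g.usage_blocks = (g.segments + g.entries_per_block - 1)
                     / g.entries_per_block;
    return g;
}
```

For the example above: 1K * 4096 = 4M per segment, 2^40 / 2^22 = 2^18 = 256K segments, 1024 / 2 = 512 entries per block, and 256K / 512 = 512 blocks.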
549 - lafs_lock_{d,}block DONE
550 Make sure the block has parents and allocation and set the locked
554 Given a datablock, wait for it to be written out
555 This is needed before updating a block that is still locked in the
558 Used when creating a new object/inode
559 Given a datablock which is to hold the inode
560 and a type (Type*) and a mode,
561 Fill in the data block with appropriate data so that
562 when lafs_import_inode looks at it, the right stuff happens.
575 lafs_cluster_update_abort
576 lafs_cluster_update_commit_buf
577 lafs_cluster_update_commit
579 lafs_cluster_update_prepare
580 lafs_inode_phase_check
583 lafs_cluster_update_lock
584 lafs_checkpoint_unlock_wait
590 - I need to know if a block is undergoing write-io so that I can
591 avoid modifying it in certain circumstances. But I don't track
592 this information. Options:
593 1/ track the info. This means an extra field in the 'struct block'
594 because I still need to know which wc has had a write.
595 2/ For blocks that we care about copy the data on write...
596 But we care about all inodes and directory blocks. That is a waste.
597 I think we put extra info in the block.
598 We need to know which wc was used (0,1,2) and which pending cluster
599 in there (0-3) which comes to 4 bits.
600 But we only care about the block for wc=0. and we could include the
601 which-pending in the b_end_io, or maybe put it all in low bits
602 of the block pointer.... Need max 4 bits. Can only be sure of 2...
605 'which' goes in bottom two bits of bi_private
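
A userspace sketch of stashing 'which' in the bottom two bits of a pointer, as would be done with bi_private. The helper names are hypothetical; it relies on the pointed-to object being at least 4-byte aligned, which is why only 2 bits can be counted on.

```c
#include <assert.h>
#include <stdint.h>

#define WHICH_MASK ((uintptr_t)3)   /* bottom two bits */

/* Combine an aligned pointer and a 2-bit 'which' into one word. */
static void *pack_private(void *ptr, unsigned which)
{
    assert(((uintptr_t)ptr & WHICH_MASK) == 0);  /* needs 4-byte alignment */
    assert(which <= 3);
    return (void *)((uintptr_t)ptr | which);
}

/* Recover the original pointer. */
static void *private_ptr(void *priv)
{
    return (void *)((uintptr_t)priv & ~WHICH_MASK);
}

/* Recover the 'which' value. */
static unsigned private_which(void *priv)
{
    return (unsigned)((uintptr_t)priv & WHICH_MASK);
}
```

The completion handler would call private_ptr() and private_which() on the stored value before touching the structure.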
609 4apr2007 (What a long gap !!)
611 - lafs_cluster_update_*
612 How do we prepare for a cluster update? How do we lock it.
614 The important thing is that the update can be written. That
615 requires that there is space available. So we need to preallocate
616 space and then release it.
617 It is possible that each update might go in a different cluster, so maybe
618 we need to preallocate one block per update. That sounds a little expensive.
619 After all, we aren't preallocating a cluster block for every data block
621 So: prepare does nothing
622 lock preallocates the space - a full block.
628 - Can now create and delete lots of files. This is cool.
630 Orphan slots just grow and grow - never to be reclaimed - why?
631 After rm f*, 7 files remain. but rm f* again and they go.
632 FIXED - readdir wasn't returning them
633 Size of directory remains large.
634 And sometimes, files become ghosts... (try just removing one after first rm f*).
636 TODO - process those orphans to clean up the directory.
638 20June2007 (Happy Birthday Dad)
640 - Creating lots of files and then deleting them leaves 5 orphan slots
641 for the directory busy, and one for inode 0??
643 Directory handling uses the following orphans:
645 A new index block is created by splitting. This needs to be linked in.
647 The dirent block we are deleting from
648 If it becomes empty, it needs to go on free list
649 The index block we are deleting from
650 If it has lots of free space it might need to be rebalanced.
651 The inode that was deleted.
654 - When a file is fully deleted, we need to drop any orphan info... DONE
655 - Need to do orphan handling of free blocks in directory, and
656 unmerged parents - but there doesn't seem much point as I am going to
657 change the directory layout (again).
659 So: writing to a file.
660 We need prepare_write, commit_write, and writepage.
661 Prepare loads and links the page and checks there is space.
662 commit marks it as dirty so writeout is possible.
663 writepage chooses a page to write out
665 25June2007 - HACK week, thanks Novell!!
669 Need to revise the process whereby async completion
670 clears PageWriteback,
671 We need locking in there, and need to worry about
672 'which' wrapping too soon.
673 Need to not start IO before we set page writeback
675 Maybe, but syncing to disk needs more thought.
677 Partly done, need actual content.
679 Can make directory, but creating first entry fails. - FIXED
682 - new directory structure.
684 27Jun2007 - More HACK week :-)
686 - new directory layout done - much easier!!
687 - If I delete a file that was created, the blocks still have a ref-count
689 - mkdir doesn't increase link count on parent. - FIXED
693 Infrastructure to process orphans
694 Handle specific cases
695 flush orphans at key times.
696 load orphans at roll-forward
699 Write out a checkpoint (when?)
700 Make sure refcount goes back to zero on blocks I write.
702 Check on inode_phase_check and checkpoint_unlock and inode_dirty
703 in all directory operations.
705 FIX: Writing a small file leaves something non-dirty but
706 due to be written, and lafs_cluster_allocate complains.
709 FIX: dir_handle_orphan doesn't lock the orphan transaction required.
711 FIX: rm a file with (small) content hangs waiting in sync_page in truncate_inode_pages.
713 FIX: lafs_allocate hasn't been written!!!
715 FIX: before updating any block in a depth=0 file, we must first load
718 29Jun2007 - still HACK week.
719 Summary of how incorporation works.
721 Each index block has a small table for unincorporated changes. i.e.
722 block numbers and their addresses.
723 This supports efficient storage of extents, and is extensible by allocating
724 more tables. This last is done rarely.
726 When a block gets a new address, this is added to the table or, if
727 there is a phase mismatch, it is added to a list until a phase change
728 happens (so the whole block is pinned pending the phase change).
730 If the table is full then:
731 - if the filesystem is read-only (including during roll-forward),
732 a new table is allocated (else rollforward fails).
733 - otherwise we incorporate the table into the block, then add the new
734 address to the (now empty) table.
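
A sketch of the table-full policy just described, with the phase-mismatch list left out. All names are hypothetical, the slot count is arbitrary, and incorporate() is only a stand-in that counts addresses instead of rewriting an index block.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

#define TABLE_SLOTS 4   /* hypothetical; "much smaller than a block" */

struct addr_table {
    struct addr_table *next;    /* extra tables, allocated rarely */
    int used;
    unsigned long fileaddr[TABLE_SLOTS];
    unsigned long long physaddr[TABLE_SLOTS];
};

struct iblock {
    bool read_only;             /* read-only mount or roll-forward */
    struct addr_table table;    /* embedded first table */
    int incorporated;           /* stand-in for the real index content */
};

/* Stand-in for real incorporation: empty every table, free extras. */
static void incorporate(struct iblock *ib)
{
    struct addr_table *t = ib->table.next;

    ib->incorporated += ib->table.used;
    ib->table.used = 0;
    ib->table.next = NULL;
    while (t) {
        struct addr_table *next = t->next;
        ib->incorporated += t->used;
        free(t);
        t = next;
    }
}

/* Record a new address.  Returns 0, or -1 if a table was needed
 * on a read-only filesystem and could not be allocated. */
static int record_address(struct iblock *ib, unsigned long fileaddr,
                          unsigned long long physaddr)
{
    struct addr_table *t;

    assert(ib != NULL);
    t = &ib->table;
    while (t->used == TABLE_SLOTS && t->next)
        t = t->next;
    if (t->used == TABLE_SLOTS) {
        if (ib->read_only) {
            /* Must not modify the block: chain another table. */
            struct addr_table *n = calloc(1, sizeof(*n));
            if (!n)
                return -1;      /* roll-forward would fail here */
            t->next = n;
            t = n;
        } else {
            incorporate(ib);
            t = &ib->table;     /* now empty */
        }
    }
    t->fileaddr[t->used] = fileaddr;
    t->physaddr[t->used] = physaddr;
    t->used++;
    return 0;
}
```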
736 If incorporation requires that we split the index block we allocate one
737 from a pool. If there are none in the pool, we wait.
739 As the table is much smaller than a block, the incorporation into
740 two blocks will always succeed.
741 The 'uninc_next' and 'children' lists will then need to be shared
742 between the two blocks before the new address is added to whichever
743 table is appropriate.
745 When looking for a block address, we must always check the table and
746 then children lists. We do not need to check uninc_next as they will always
749 How to ensure that the pool always has sufficient index blocks and we don't
751 We have two halves of the table, one for each phase. Before we allow
752 a block to be dirty in a phase, we ensure that the pool has adequate
753 index blocks for that phase. e.g. twice the depth of the block. If it
754 doesn't we block the dirtying until space becomes available.
755 For syscall writes, this is easy as we catch in prepare_write.
756 When we perform a phase change, we must be sure there are enough index
757 blocks for the deepest block that will stay dirty. If there aren't, we need
758 to flush all dirty blocks, and unmap all writable mappings before
759 starting the checkpoint.
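
A very rough sketch of the per-phase reservation idea, assuming the "twice the depth" figure and a simple counter per phase. The names are hypothetical, and where the real code would block the dirtying, this version just returns false.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical pool accounting: free blocks plus per-phase reservations. */
struct iblock_pool {
    int free;
    int reserved[2];
};

/*
 * Try to reserve index blocks before a block at the given depth may
 * become dirty in 'phase'.  A false return is where the caller would
 * wait (e.g. in prepare_write) until blocks are returned to the pool.
 */
static bool pool_reserve(struct iblock_pool *p, int phase, int depth)
{
    int need = 2 * depth;   /* "e.g. twice the depth of the block" */

    assert(phase == 0 || phase == 1);
    assert(depth >= 0);
    if (p->free < need)
        return false;
    p->free -= need;
    p->reserved[phase] += need;
    return true;
}

/* Give back reserved blocks that turned out not to be needed. */
static void pool_release(struct iblock_pool *p, int phase, int count)
{
    assert(phase == 0 || phase == 1);
    assert(count >= 0 && count <= p->reserved[phase]);
    p->reserved[phase] -= count;
    p->free += count;
}
```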
762 FIX: need to work out life time rules so that inodes hang around while they have blocks.
763 currently have an igrab that is never put.
765 FIX: Dirty isn't cleared until 'flush', but do_checkpoints requires 'alloc' to clear it.
768 Checkpoint flushing is getting close.
770 InoIdx blocks are not changing phase.
771 Phase change should happen when all children have been incorporated, and
772 then the write has been triggered marking us clean.
773 For InoIdx blocks, we need to be marked clean when the data block
776 5jul2007 - a week off
777 Checkpoint flushing seems to work !!!!
778 FIX: what should filesize of symlink be?
779 other filesystems use len, but still zero-terminate for vfs.
781 Problem. A chmod is followed immediately by an unlink then a checkpoint.
782 The chmod update gets into the checkpoint cluster, but the unlink completes
783 before the checkpoint is finished so the new superblock sees the file
784 as gone. Roll-forward finds the update and wants to update a missing file.
786 This isn't a big problem, but with slightly different details, it could be.
788 One option is to ignore updates that precede the updated block. That might
789 be awkward with e.g. directory updates and checkpoints that cross multiple
792 Another option might be to prohibit updates once a checkpoint has started
793 unless they are known to be after the phase change.
795 FIX: unlink isn't punching a hole in the inode file.
796 Inode usage map isn't being updated. - FIXED (For create, not unlink).
798 FIX: roll forward does not pick up inodes, only data blocks.
799 But tiny files are synced to inode, so they might not be picked up.
800 So we must process a level=0 inode like a data block.
803 Time for lots of clean up.
805 DONE 1/ Index blocks to fill with 0 - use phys=0 to imply invalid.
806 DONE 2/ rename 'lock' -> 'pin'
807 3/ Review and fix up all locking/refcounts. See locking.doc
808 DONE 3a/ Make sure cluster_allocate can be called concurrently. e.g. check
809 B_Alloc inside the semaphore
810 Also lock inode when copying in block 0 and probably
811 when calling lafs_inode_fillblock (??)
812 DONE 3b/ lafs_incorporate must take a copy of the table under a lock so
813 more allocations can come in at any time.
814 NotYet 3c/ cluster_flush should start all writes before calling _allocate
815 as _allocate might block on incorporation/splitting.
816 No. We really want _allocate to not block, but to queue...
817 I think this is too hard to get perfect just now, so I will leave it.
818 DONE 3d/ introduce PinPending for data blocks. remove fs->phase_depth.
819 LATER 3e/ Index needs a clean-lru on each filesystem, and a list of filesystems
820 so that locking of lru doesn't have to be too global
821 DONE 3f/ change wc[]->hlhead to be a regular listhead as it is part of the
823 DONE 3g/ revise refile lru handling based on new understanding
824 3h/ Utilise WritePhase bit, to be cleared when write completes.
825 In particular, find when to wait for Alloc to be cleared if
826 WritePhase doesn't match Phase.
827 - when about to perform an incorporation.
828 3i/ make sure we don't re-cluster_allocate until old-phase address has
829 been recorded for incorporation.
830 3j/ Check that index blocks cannot race when getting locked....
831 k/ Check what locking is needed to set PagePrivate exclusively.
832 DONE l/ cluster_done needs to call refile, but is called in interrupt context.
833 We need to get it done in process context I think and lock
834 ->waiting access with fs->lock after changing it to ->lru
835 DONE m/ Need to know which blocks in a page are in writeback so we can clear writeback
836 only when *all* have finished.
837 DONE n/ on phase change, uninc_next blocks need to be shared out.
838 NO 3o/ Make sure lafs_refile can be called from irq context.
839 3p/ lock all lru accesses.
840 3q/ Lock those index blocks!!!
841 3r/ Can inode data block be on leafs while index isn't, what happens if we
842 try to write it out...
843 FIXED Why are extent entries only grouped in 4s?
844 If InoIdx doesn't exist, then write_inode must write the data block.
845 4/ resolve length of symlink
846 FIXED - long symlink followed by 'sync' crashes.
847 FIXED - rollforward isn't calling 'allocated' on blocks, or something
848 FIXED - I cannot find 'bfile'. (inode isn't written)
849 SEEMS OK...- Must flush final segment of a cluster properly...
850 5/ Review what does, and does not need to be initialised in a new datablock
851 6/ document and review all guards against dirtying a block from a previous phase
852 that is not yet safe on storage.
853 See lafs_dirty_dblock.
854 7/ check for proper handling of error conditions
855 a/ checkpoint_start might fail to start a thread!
856 b/ if lafs_seg_dup fails in do_checkpoint, we need to abort the snapshot.
857 8/ review checkpoint loop.
858 Should anything be explicit, or will refile do whatever is needed?
860 What should checkpoint_unlock_wait wait for?
861 When do we need to wait for blocks to change state. And how?
862 DONE 10/ rebase on 2.6.current
863 DONE - use s_blocksize / s_blocksize_bits rather than fs->
865 11/ load/dirty block0 before dirtying any other block in depth=0 file
866 12/ Add writecluster flag for old-phase updates.
867 Why is this needed? updates should always go in the new phase???
868 13/ use kmem_cache for 'struct datablock'
869 14/ indexblock allocation.
871 allocate the 'data' buffer late for InoIdx block.
872 trigger flushing when space is tight
873 Understand exactly when make_iblock should be called, and make it so.
874 15/ use a mempool for skippoints in cluster.c
875 16/ Review seg addressing code in cluster.c and make sure comments are good.
876 DONE 17/ Make sure create inherits uid etc from process.
877 18/ consider ranges of holes in pending_addr.
879 DONE 20/ Implement rest of "incorporate"
880 DONE 21/ Implement staged truncate
881 DONE use for setattr and delete_inode
882 DONE 22/ block usage counts.
883 23/ review segment usage /youth handling and make a todo list.
884 a/ Understand ref counting on segments and get it right.
885 24/ Choose when to use VerifyNull and when to use VerifyNext2.
886 25/ Store accesstime in separate (non-logged) file.
888 make sure files are released on unmount.
891 Support 'peer' lists and peer_find. etc
892 31/ subordinate filesystems:
893 a/ ss[]->rootdir needs to be an array or list.
894 b/ lafs_iget_fs need to understand these.
895 32/ review snapshots.
897 how they can fail / how to abort
900 - need to clean up checkpoint thread cleanly - be sure it has fully exited.
901 34/ review roll-forward
902 - make sure files with nlink=0 are handled well.
903 - sanity check various values before trusting clusters.
905 34/ Configure index block hash_table at run time based on memory size??
907 Review everything that needs to handle laying out at cluster
908 aligned for striping.
910 36/ consider how to handle IO errors in detail, and implement it.
911 37/ consider how to handle data corruption in indexing and directories and
912 other metadata and guard against problems (lot of -EIO I suspect).
914 - check all uninc_table accesses are locked if needed.
917 1/ fs->pending_orphans and inode->orphans are largely unused!
918 2/ If a datablock is memory mapped writeable, then when we write it out,
919 we need to either fill up its credits again, or unmap it.
920 3/ Need to handle orphans asynchronously.
924 Free index blocks are on two lists, both protected by the global
926 1/ The per-inode free_index, so they can be destroyed with the inode
927 2/ The global freelist so they can be freed by memory pressure.
929 11feb2008. Where was I up to again?
930 reviewing phase_flip and lafs_refile.
933 Reading through modify.c, at 'add_indirect'. Plan to fix all this code.
934 Need to think about how index blocks really change. How old blocks get
935 dis-counted from segment usage, and what optimisations are really good
936 for re-incorporating index blocks.
937 Operations to consider are:
938 i)Append new block, ii)truncate, iii)over-write, iv)fill-hole.
939 i/ leaf block splits, index block gets new entry at end, and replacement
940 for other entry. Easy to handle
941 ii/ trailing entries are zeroed. Should be easy, but isn't yet.
942 iii/ probably caught in leafs. May cause internal split so we add new
943 index address, which is easily handled if there is space.
944 iv/ same as iii, though split more likely.
946 What about merging index blocks. That just makes addresses disappear, which
947 we handle the slow way.
948 Do we ever re-target index blocks? Would need to be careful about that.
949 Make it look like a split where one block ends up empty as a hole.
951 grow_index_tree (DONE - untested)
952 ib is a leaf inode that is getting full. Copy addresses
953 into 'new', and make 'ib' an index block pointing at new.
955 add_index/walk index (DONE - untested)
957 end of do_incorporate (DONE - untested)
958 new contains the early addresses. Some remain in ib
960 the buffers must be swapped, so ib has the early address.
961 ui needs to be attached to new
962 return 2; - then new uninc needs to be split
965 case 2 - horizontal split
966 case 3 - vertical split
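The grow_index_tree step noted above (copy the addresses out of a full leaf 'ib' into 'new', then make 'ib' an index block pointing at 'new') can be sketched in user-space C. This is only an illustration under assumptions: `toy_block`, `TOY_MAX_ADDRS` and `toy_grow_index_tree` are hypothetical names, and the real code works on lafs index blocks whose on-disk format, credits and uninc_table handling are not modelled here.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_MAX_ADDRS 4	/* hypothetical capacity of a toy leaf */

enum toy_kind { TOY_LEAF, TOY_INDEX };

struct toy_block {
	enum toy_kind kind;
	size_t naddrs;				/* addresses held when a leaf */
	unsigned long addrs[TOY_MAX_ADDRS];
	struct toy_block *child;		/* only child when an index */
};

/*
 * Move every address from the leaf 'ib' into 'new_blk', then turn
 * 'ib' into an index block whose single entry is 'new_blk'.
 */
void toy_grow_index_tree(struct toy_block *ib, struct toy_block *new_blk)
{
	assert(ib->kind == TOY_LEAF);

	new_blk->kind = TOY_LEAF;
	new_blk->naddrs = ib->naddrs;
	memcpy(new_blk->addrs, ib->addrs, sizeof(ib->addrs));
	new_blk->child = NULL;

	ib->kind = TOY_INDEX;
	ib->naddrs = 0;
	memset(ib->addrs, 0, sizeof(ib->addrs));
	ib->child = new_blk;
}
```

The point of doing it this way round is that 'ib' keeps its identity (and its place in the inode), so nothing above it needs to change when the tree gets one level deeper.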
968 Bother - uninc_table is a problem (again).
969 We can currently add at any time with just a spinlock.
970 So when we split a block horizontally,
974 share out children and uninc_table in do_incorporate
975 share out credits in do_incorporate
978 Still need to do incorporate as above but took a break to...
980 Counting allocated blocks now works - stat shows the right info, hopefully
981 storage is correct too. - DONE
983 next: truncate? orphan thread?
984 Then segment usage and the cleaner.
988 truncate - removing blocks doesn't need to erase them...
989 - nothing forces a cluster_flush promptly!!! We need a timeout
990 or at least we need a flush before truncate_inode_pages...
992 - in lafs_truncate we need to make the block an orphan and pin in
995 21Feb2008 (Research morning)
996 Discard checkpoint thread created on demand in favour of a cleaner
997 thread that runs all the time. It cleans and checkpoints and
1001 do segment scan and get a real list of free segments and
1005 - segment usage scanning to count free blocks
1006 - fix up re-reading of erased blocks
1007 - FIX truncate can still block waiting for writeback to complete.
1008 - FIX allocations aren't failing when we run out of free space
1009 - FIX df doesn't agree with du.
1012 Truncate when an index block has addresses in uninc_table.
1013 The summary for the new address has already been performed.
1014 We need to deallocate the new without disturbing the old.
1015 However a simple allocation may not be possible.
1016 I guess we can prune them all to zero, then incorporation
1019 TOFIX: when truncating a recently created file, it is still depth=0 so
1021 We really need to increase the depth to 1 as soon as we dirty
1022 any block, then reset back to 0 if it fits.
1025 We have a file that we have written to, and the data blocks have been
1026 written out and the addresses stuck in uninc_table.
1027 We then truncate the file. Who releases the usage of those blocks?
1028 And who removes them from uninc_table?
1030 OK, 'rm' returns all the blocks back now so 'df' is almost the same as 'du'.
1031 I really should make sure that inodes are getting freed properly and the
1032 inode map is clean and everything.
1035 Do we reserve segment-usage blocks.
1036 We cannot do it naively as we get infinite recursion.
1037 But we need it to be allowed to dirty the segment block.
1038 But we cannot pin them to this phase as we want to write them out
1040 This still needs more thought. I avoided the recursion by setting SegRef
1041 before getting the ref. But that isn't safe.
1044 The table of cleanable segments is not working out. Each segment appears multiple
1045 times which wastes space and adds confusion.
1046 We really want to be able to lookup by dev/seg and also find the least.
1047 'Find least' sounds like we want a heap but then we cannot discard the bottom half.
1049 We could have a skiplist for dev/segment lookup and do a merge-sort on
1050 a different link when we want to find the best segment.
1051 We then remember the best number found since a sort, and re-sort if the top
1052 is worse than the best.
1054 We keep all this in a fixed size table. Each entry has
1055 seg,dev,usage,weight,weight-sort-link,addr-sort-link and possibly some
1056 addr-sort-skip links.
1057 This is 32+32+16+16+16+16 bits, or 16 bytes or bigger.
1058 Say 16bytes, 24bytes, or 32 bytes. (depth 8, which is plenty).
1059 One page of 16byte entries (256 of them)
1060 2/3 page of 24byte entries, 1/3 of 32byte entries.
1061 Total 2 pages, and 256+113+43 = 412 entries.
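The entry counts above can be checked with a few lines of C. This is a sketch under assumptions: a 4096-byte page, and the reading that the 32-byte entries take whatever the 113 24-byte entries leave in the second page (a strict third of a page would only hold 42). `toy_table_layout` and the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_PAGE_SIZE 4096u	/* assumption: 4K pages */

struct toy_table_layout {
	size_t n16;	/* 16-byte entries, first page */
	size_t n24;	/* 24-byte entries, about 2/3 of second page */
	size_t n32;	/* 32-byte entries, rest of second page */
};

struct toy_table_layout toy_layout_two_pages(void)
{
	struct toy_table_layout l;
	size_t bytes24 = (TOY_PAGE_SIZE * 2) / 3;

	l.n16 = TOY_PAGE_SIZE / 16;
	l.n24 = bytes24 / 24;
	/* 32-byte entries use the space the 24-byte entries leave over */
	l.n32 = (TOY_PAGE_SIZE - l.n24 * 24) / 32;
	return l;
}

size_t toy_layout_total(struct toy_table_layout l)
{
	return l.n16 + l.n24 + l.n32;
}
```

With these assumptions the layout comes out as 256 + 113 + 43 entries, matching the total of 412 in the note.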
1063 But deleting random elements is awkward... but not too awkward. We can delete
1064 lots of entries by marking them as old, then performing a single pass of the skip
1067 We should keep free segments here too, on a separate list.
1070 2 pages of 16byte entries
1074 free list randomly threads through all.
1076 When using from 24 or 32, randomly choose height of 2-5 or 2-9
1077 Two lists run through the skiplist entries. One for cleanable, one for free.
1078 Remember the nth element for some small n (10, but it decreases as we pull
1079 things off the front) and if we add something less than that, we trigger a
1080 mergesort on the next time we want to clean.... maybe.
1082 Remember end of free list and add to there. Maybe merge-sort the free list
1083 by addr occasionally.
1086 When can we clean, when can we free wrt checkpoints?
1087 - we can clean a segment as soon as we have a checkpoint after it.
1088 So we record the youth of the segment holding the (start of the)
1089 checkpoint, and can clean any segment with a lower youth.
1090 - we can free a segment after the checkpoint after its usage has reached
1091 zero. So if usage is zero and youth....
1092 We could offset the usage by one (say - for the first cluster header..)
1093 then when we find a segment with usage of '1', we schedule an update to
1094 0 in the next checkpoint...
1095 How about segments with different sizes - they get different weights.
1096 Need to divide by segment size: usage * youth / size.
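A minimal sketch of the two rules above: the weight used to pick a segment (usage * youth / size, lowest is the best candidate) and the test for whether a segment may be cleaned at all (its youth is lower than that of the segment holding the start of the last checkpoint). The struct layout, field widths and names are assumptions made for the example, not the lafs definitions.

```c
#include <assert.h>
#include <stdint.h>

struct toy_seg {
	uint32_t usage;	/* blocks still in use in the segment */
	uint16_t youth;	/* lower means older */
	uint32_t size;	/* segment size in blocks */
};

/* Weight normalised by segment size, so differently sized segments compare fairly. */
uint64_t toy_seg_weight(const struct toy_seg *s)
{
	assert(s->size > 0);
	return ((uint64_t)s->usage * s->youth) / s->size;
}

/* A segment is cleanable once a checkpoint has started in a younger segment. */
int toy_can_clean(const struct toy_seg *s, uint16_t checkpoint_youth)
{
	return s->youth < checkpoint_youth;
}
```

The multiplication is done in 64 bits before dividing so that large segments do not overflow and small weights do not all round to zero quite so quickly.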
1099 - It seems I sometimes fall off the end of the last segment !!! - FIXED (locking)
1100 - We seem to switch to a new segment when still 83 blocks remaining? - FIXED (delete did flush)
1102 - Lots of 'creates' makes lots of little clusters - need to optimise!
1103 Or it could be deletes as we currently cluster_flush for each
1105 - I think this is fixed
1108 Started looking at the cleaner.
1109 Need to understand how much to clean each checkpoint
1110 Need to track free-space-in-active-sectors while scanning.
1114 - the cluster head is currently limited to one page. This is not good.
1116 - Should the cleaner start before the scan is complete after a checkpoint?
1117 Probably it can, but while the scan is still happening it might be best
1121 try_clean is taking shape and has a few FIXMEs.
1122 need to write async find_block code and get it to watch for
1123 block in a cleaning segment.
1126 - where can padding appear in a cluster? between miniblocks? at
1127 end of device blocks?
1128 - need to track phys block while parsing headers for cleaning.. why?
1129 - determine rules for avoiding block lookup during cleaning
1130 based on youth/snapshot age, and truncate generation.
1131 We need to load the inode from each snapshot
1132 Can we optimise based on snapshot age?
1133 only if we know the block is newer than the snapshot.
1134 So when we relocate blocks (cleaning) they must go in a segment
1135 that is marked as being old. We cannot really guarantee that.
1136 I guess blocks that are marked as 'new' can safely be skipped if
1137 segment is newer than snapshot. This 'age' is not the youth, but
1138 is the cluster_head->seq which is stored in creation_age.
1140 - Store the rootdir for a filesystem in the metadata for the root inode.
1141 Then 'struct snapshot' doesn't need rootdir. It can have a root
1144 Looking at lafs_find_block_async.
1145 Needs async flag to make_iblock.
1146 Check that. Can we block_adopt if there was an error?
1148 setparent has async flag.
1149 lafs_leaf_find has async flag
1150 lafs_wait_block_async
1152 FIXME I wakeup the cleaner every time an IO completes.
1153 Do I really want that? Maybe only when number of async IOs hits
1154 half the recent maximum??
1156 FIXME need to ensure that lafs_pin_dblock flushed committed
1159 FIXME when we incorporate a dirty (non-realloc) address to an index block,
1160 we need to clear B_Realloc on the indexblock.
1162 FIXME in lafs_incorporate we lafs_dirty_iblock 'new' without
1163 giving it any credits. Where should they come from?
1165 We don't seem to scan for free/cleanable segments often enough.
1167 FIXME we shouldn't start a checkpoint while cleaning is happening.
1169 FIXME need to be careful when cleaning about finding inodes that
1170 don't exist any more.
1172 FIXME give credits to realloc blocks.
1174 FIXME think about/document transitions between realloc and dirty,
1175 and what locking is needed.
1178 Allowing for the FIXMEs above, the cleaner is now identifying
1179 blocks that need to be cleaned and marking them B_Realloc (I think).
1180 We now need to gather these into a write cluster and write them.
1181 They will all be on the clean_leafs list, so we can iterate that
1182 allocating or incorporating as needed. This will be similar to
1184 Important question is: when?
1185 Ideally we would have some auto-flush mechanism. The cleaner just
1186 keeps finding blocks to clean and when we start running out of
1187 resources we flush the cleaning queue.
1188 However we will still want to flush the cleaner always before a
1189 checkpoint, so for now we can implement that bit and wait for a
1190 need for the other to arise.
1193 FIXME: cleaner lookup of 0/0/0 has interesting consequences as we
1194 don't record that location the same way.. how to handle?
1195 Should check that 'adopt' doesn't do the wrong thing with this block.
1198 Realloc blocks need to be pinned. That makes sense. Only that way
1199 will they get onto the clean_leafs list.
1200 When checkpointing we should probably examine clean_leafs to be
1206 Both of these hold a Credit.
1207 Both can be set at the same time.
1208 Cleaner ignores Dirty and sets Realloc anytime the block is in
1209 the wrong segment. It also Pins the block.
1210 When the cleaner is flushing to the cleaning segment, it
1211 ignores Dirty blocks. They get their Realloc cleared, but
1212 they remain pinned. So they will get moved at the next checkpoint.
1213 How do we know whether an indexblock should be Dirty or Realloc?
1214 The Dirty/Realloc bit is cleared before we get to incorporation.
1215 Maybe we lafs_dirty_iblock the parent of any block we write
1216 out. Then after incorporation, we set Realloc if it is not
1220 I think I'm pinning cleaner blocks now.
1221 Need to make sure the dirty ones are dropped. DONE
1222 Need to make sure the usage is transferred
1223 Need to get free segments back into use
1224 Need some more 'dump' options. Maybe youth/usage files.
1226 Need to make sure scan etc are triggered often enough.
1228 FIXME lafs_prealloc walks up ->parent without locking
1229 I think we want i_mapping->private_lock like lafs_pin_iblock.
1232 1/ a 'dump' option that triggers a scan and prints everything out.
1233 2/ scan must mark freeable as such, then subsequently free them.
1234 3/ Look at code that decreases usage of old segments.
1235 4/ Review lafs_cluster_wait_all and decide exactly how long we need
1237 5/ Review 'FIXME that is gross' HZ/10 thing.
1238 6/ Review 'wait for checkpoint to flush' msleep(500);
1239 Maybe remove that altogether.
1241 FIXME BUG_ON in grow_index_tree fires. sync - writepages - flush
1242 FIXME BUG in lafs_allocated_block fired.
1243 from lafs_erase_dblock from invalidate_page from .. vmtruncate
1247 An inode data block is dirty and pinned, but the inoidx is no longer
1248 pinned. Presumably it isn't dirty.
1249 Recheck what 'dirty' means on the two blocks and see how this can happen.
1252 Tree gets very big! Lots of 'Realloc' blocks that should
1255 We are spinning in cleaner again, and not in try_clean.
1257 Is it a problem that 'Pinned' is used for Realloc and dirty blocks?
1258 In general it shouldn't be. The flush_cleaner process will remove
1259 the Realloc bits so the blocks fall off clean_leafs. They then either
1260 go onto phase_leafs or get unpinned.
1261 But I currently have a problem with InoIdx/data.
1262 The Pin is transferred to the Data block, but it doesn't go from the
1263 InoIdx block because it has a pincnt. Now that is probably a bug, but
1264 what if it weren't? What if, while we were cleaning, a block got dirtied.
1265 That would pin the whole tree.
1266 I guess the rule about not allocating an inodedata block while the
1267 InoIdx is pinned needs to be revised. If the inodedata block is
1268 Realloc (and not Dirty) while the InoIdx is not Realloc, we
1269 can go ahead (in a cleaning segment).
1272 adir/big1 is garbage.... big1 was removed, so why is it even there?
1274 echo tre > dump # still too much stuff.
1278 Put cond_resched in checkpoint loops!
1281 Thoughts about cleaning and pinning.
1283 When cleaning we need to know how many dependent blocks are being cleaned
1284 so that we know when *this* block can be written - i.e. when the count hits 0.
1285 We cannot use the pincnt for this phase because there may be dependant blocks
1286 which are dirty. They, and therefore this, may get flushed at next checkpoint,
1287 but they may not. If we could be certain they would, we could just write
1288 to the clean-segment blocks which can become unpinned. However if there
1289 is an index block being cleaned, and no dependant is being cleaned, but some
1290 are dirty but not pinned, then the checkpoint can go past without the block
1291 being moved.... but maybe we can detect that.
1294 We set B_Realloc precisely on blocks found in segments being cleaned.
1295 We pin these blocks and leafs which are Realloc go in clean_leafs.
1296 If a block is both Realloc and Dirty we clear Realloc but leave pinned.
1297 That way it gets written at end of checkpoint, but to main cluster.
1298 When we incorporate Realloc blocks into an index block, it gets marked
1299 Realloc. When we incorp dirty blocks, mark dirty. Then see above.
1300 On a checkpoint, we process both phase_leafs and clean_leafs
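The Realloc/Dirty rules above can be written down as a small state function. This is a sketch of the rules as stated in these notes, not the lafs code: the `TOY_*` flags stand in for the real block flags, and treating every pinned non-Realloc leaf as belonging on phase_leafs is a simplifying assumption.

```c
#include <assert.h>

#define TOY_PINNED  (1u << 0)
#define TOY_DIRTY   (1u << 1)
#define TOY_REALLOC (1u << 2)

enum toy_list { TOY_NO_LIST, TOY_PHASE_LEAFS, TOY_CLEAN_LEAFS };

/* A block that is both Dirty and Realloc loses Realloc but stays pinned. */
unsigned toy_normalise(unsigned flags)
{
	if ((flags & TOY_DIRTY) && (flags & TOY_REALLOC))
		flags &= ~TOY_REALLOC;
	return flags;
}

/* Which leaf list a block belongs on, given its flags. */
enum toy_list toy_leaf_list(unsigned flags)
{
	flags = toy_normalise(flags);
	if (!(flags & TOY_PINNED))
		return TOY_NO_LIST;
	if (flags & TOY_REALLOC)
		return TOY_CLEAN_LEAFS;	/* goes to the cleaning segment */
	return TOY_PHASE_LEAFS;		/* written to the main cluster */
}

/* Flags of an index block after incorporating a child written with child_flags. */
unsigned toy_incorporate(unsigned index_flags, unsigned child_flags)
{
	index_flags |= child_flags & (TOY_DIRTY | TOY_REALLOC);
	return toy_normalise(index_flags);
}
```

Under this reading, Dirty always wins: a block that was only being relocated but then gets dirtied is written with the rest of the checkpoint rather than to the cleaning segment.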
1303 FIXME do inode reads async better when cleaning...
1305 FIXME if a realloc inode has been allocated to a cluster when we try
1306 to dirty it, confusion can ensue as the writeout won't mark it
1307 clean, but will use up the credits.
1308 Maybe we need something similar to phasewait to not set PinPending...
1309 But normal dirtying doesn't phasewait. I think we just need to
1310 detect this case and wait for the clean-cluster to flush.
1313 FIXME make sure incorporate is doing the right thing with credits.
1315 FIXME lafs_write_inode. We need to be careful about clearing Dirty
1316 when making an update. Need some sort of locking.
1317 Need to review all inode dirty stuff and make sure we do
1318 the right thing no matter when it is called.
1320 FIXME when blocks are attached to uninc_next, they don't have 'dirty'
1321 anymore so we don't know how to flag the index block.
1324 UPTO: unlink etc don't prealloc the inode that will be modified.
1325 And a warnon inode.c:579 is very noisy.
1328 FIXME: lafs_reserve_block uses CleanSpace if Realloc is set,
1329 but it doesn't get set until AFTER lafs_reserve_block is called.
1332 Cleaning cleans an InoIdx block which schedules the data block.
1333 Subsequently the InoIdx block gets pinned again.
1334 Now when we go to write the data block, we cannot because InoIdx is pinned
1336 Maybe given that data block is pinned, we write it anyway...
1338 FIXME: when we realloc a block embedded in the inode, don't pluck it out
1339 and put it back in again. Just realloc the inode.
1341 FIXME: when cleaning a directory that has shrunk, we think we have
1342 blocks that don't exist any more. FIXED - we thought '0' was in
1346 FIXME: lafs_dirty_iblock called from lafs_allocated_block in cluster
1347 flush finds no credit. for InoIdx block of 8501
1349 FIXME: do we do SEGREF on all the index blocks? do we need to?
1353 FIXME: seg usage for segment 0/5 isn't dropping to zero.
1354 Part of a file got moved off, but count is still there.
1355 FIXED - seg_move wasn't being called.
1356 FIXME: segusage file has inconsistent extents:
1361 FIXED several bugs in walk_extent
1363 FIXME qphase: any locking between that changing and lafs_seg_move??
1364 I don't think so. Just that seg_apply_all must be called after qphase is set.
1366 FIXME make sure we don't try to clean the current segment!!
1368 FIXME 'Available' goes negative!
1369 Creating large file doesn't instantly reduce 'Used'.
1370 Deleting files plus sync doesn't increase Avail?
1372 FIXME a segment is in the table but doesn't print out!
1374 FIXME we don't cope with running out of free segments (not that we ever should).
1376 FIXME check all Credit usage and make sure credits are returned when
1377 ->parent is dropped.
1378 provide visibility into credit counts.
1379 Make sure we are keeping enough space for cleaning. We should always
1380 have a few segments unallocatable.
1383 FIXME cannot do io completion in cleaner thread as it can block on
1384 a i_mutex which might be waiting for completion. FIXED (keventd).
1386 FIXME as ->iblock isn't refcounted we need to be careful accessing it.
1387 If we 'know' we have a reference, e.g. a child with a ->parent
1388 link, we can access it without locking.
1390 lafs_make_iblock should return a counted reference.
1392 If we own an (indirect?) reference to iblock, we can access
1393 both iblock and dblock for free... but iblock can change???
1394 If not, we need to get a reference to one or other under a lock.
1396 FIXME block->inode should be a counted reference?
1400 lafs_inode_handle_orphan OK
1401 inode_handle_orphan_loop FIXED
1405 lafs_find_next FIXED
1409 lafs_inode_handle_orphan
1413 FIXME root->iblock should always be refcounted. Is it?
1414 FIXME walking siblings - what lock?
1417 FIXME several times we clear PinPending without refiling, in dir.c in particular.
1418 that looks wrong. FIXED
1420 Maybe lafs_new_inode should return a reference to the dblock
1421 Or pin it. or something. FIXED And pinned (when needed).
1423 FIXME lafs_inode_dblock might return a block without valid data...
1424 Need to get valid data, then load block 0 in find_block rather than
1427 FIXME we really should own a reference to ->dblock before calling
1428 lafs_pin_inode. We don't want IO during a pin request.
1431 FIXME review use of PhysValid FIXED
1433 lafs_orphan_abort - what if lafs_orphan_pin not called?
1434 or if 'b' is NULL. FIXED
1436 Do I need to clear PinPending when retrying??
1437 Well, we need to be phase-locked when we set PinPending, so
1438 it must be Pinned to the current phase.
1439 So when we unpin a datablock, we must clear PinPending.
1440 FIXED we now clear PinPending in do_checkpoint.
1442 Does phase_wait do the right thing when pinning an inoidx block
1447 Need to understand and document the lifetime of a page with datablocks.
1448 who hold what refcount, and when can it be freed?
1449 Then fix up locking in lafs_refile, __putref.
1451 FIXME who keeps what refcount on orphan blocks/inodes??
1452 FIXME should dirty/pinned/etc hold a refcount? they don't.
1456 FIXME make sure a failed (-EAGAIN) pinning triggers a checkpoint (eventually)
1458 FIXME make sure empty files have depth of 1.
1460 FIXME Truncate proceeds lazily. All data blocks need to be gone
1463 If I call lafs_erase_dblock while a write is underway, we have a problem.
1464 We need to wait potentially for a checkpoint to let go of the block and
1465 a write to complete.
1466 This should be done with waiting for PG_writeback on the page to disappear.
1469 When end_page_writeback is called, we must have dropped all references to the
1471 When we commit to writing a block, we have to set PG_writeback on the page
1472 so that truncate et al can wait for it. Before we have committed, truncate
1473 can just remove the page. Internally we differentiate by B_Alloc.
1474 So before setting B_Allocated we need to test_set_page_writeback(page).
1475 Be careful of races.
1476 I don't think we can ensure all references are dropped. After all, that is
1477 the point of refcounts. So dblock array must exist without page!
1478 But we need to ensure that we don't start a writeout after truncate
1479 has done wait_on_page_writeback.
1480 This is done with the page locked so when we want to write a page
1481 in a checkpoint, we need to lock the page first. Once we have the lock,
1482 we check if the page is still dirty. If it has been truncated it
1484 But how do we safely reference the page if b->page can be cleared?
1486 When we clear PagePrivate, we take a counted reference to the page
1487 for db->page. This is dropped when the page is freed by lafs_refile.
1488 But while it is held, it is still safe for db->page to be dereferenced.
1489 So before we commence writeout we have to lock the page and set
1490 PG_writeback. After locking, we need to test if writeback is still
1493 Maybe not. I think we can submit blocks for writeout without setting the
1494 page to writeback. If we do, then we need to be sure those writes
1495 finish before invalidatepage calls releasepage (block_invalidatepage
1496 calls discard_buffer which calls lock_buffer which waits).
1497 In our case invalidatepage needs to make sure that no new write commences.
1498 Maybe we should lafs_iolock_block before we allocate to a cluster and check
1499 again if the block is dirty.
1502 lafs_cluster_allocate does:
1504 check if still dirty. If not, unlock and return
1507 when write completes, allocate is cleared.
1512 clear Valid,Dirty,Realloc
1517 2008 aug 28 - happy birthday.
1518 FIXME segsum_find calls lafs_reserve_block without a checkpoint lock.
1519 lafs_prealloc complains.
1521 mark_cleaning does too, but cleaning only happens well away from a checkpoint
1523 segsum_find is being called to reference a new segment when we flush a cluster.
1524 segment usage blocks are special. Their index information doesn't
1525 need to be written out in the current checkpoint. We can do that, but
1526 the backstop is to write just the data block in the tail of the
1527 checkpoint and write indexing information later.
1530 unlink is getting "No space left on device". This is when trying to
1531 pin the directory block, the physaddr is 0, so it looks like we want
1532 NewSpace. But we shouldn't even be trying to prealloc in that case because
1533 there should already be a prealloc on the block. i.e. there should be
1535 Hmmm. after multiple 'syncs' how can the block not be written out.
1536 Maybe it is embedded in the inode?
1537 When we pin a block that was embedded in the inode it isn't clear what to
1538 do. If we might grow the file so it doesn't fit any more, we need to
1539 allocate NewSpace. If we know it won't grow, we use Release.
1540 This still needs a proper fix.
1542 Cleaning seems to be working nicely. However we don't get all the space
1543 back that we should because lots of blocks still have credits that
1544 aren't being returned.
1546 So when should credits be returned?
1547 They are set when a block is pinned. It then gets dirtied which
1548 consumes a credit. Then gets unpinned. I guess if it isn't pinned,
1549 then it doesn't need any credits.
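The credit life cycle guessed at above (reserve on pin, consume on dirty, hand back whatever is left on unpin) could look like the following. This is a toy model under assumptions: one credit per block, a single global pool, and hypothetical names throughout. It does not reflect the several credit kinds that lafs actually tracks.

```c
#include <assert.h>

struct toy_pool {
	int free_credits;	/* space not yet promised to any block */
};

struct toy_blk {
	int pinned;
	int dirty;
	int credits;		/* credits currently held by this block */
};

/* Pinning reserves one credit from the pool; returns -1 if none is left. */
int toy_pin(struct toy_pool *p, struct toy_blk *b)
{
	if (b->pinned)
		return 0;
	if (p->free_credits < 1)
		return -1;
	p->free_credits--;
	b->credits++;
	b->pinned = 1;
	return 0;
}

/* Dirtying consumes the credit that was reserved at pin time. */
void toy_dirty(struct toy_blk *b)
{
	assert(b->pinned);
	if (!b->dirty) {
		assert(b->credits > 0);
		b->credits--;
		b->dirty = 1;
	}
}

/* Unpinning returns any credit that was never consumed to the pool. */
void toy_unpin(struct toy_pool *p, struct toy_blk *b)
{
	p->free_credits += b->credits;
	b->credits = 0;
	b->pinned = 0;
}
```

In this model a block that is pinned and then unpinned without being dirtied costs nothing, which is the property the note is after: an unpinned block holds no credits.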
1552 It seems that cluster_flush is not always writing things in the correct
1553 order. Root gets written before some other things below it.
1554 Maybe they are temporarily out of the loop??
1555 No. There are dirty blocks which one checkpoint doesn't pick up, but
1556 they aren't holding the index block pinned. so they lose allocation.
1558 But they must hold the indexblock pinned, even though they aren't pinned
1559 themselves. We maybe do this just with the refcnt... maybe. That will cause
1560 it to phase-flip rather than drop pinning, which I think is right.
1562 So: too many credits remain allocated. Where are they? There are 1464
1563 outstanding credits. 290 are in the tree so 1200 or so are elsewhere??
1564 But things removed from the tree have credits removed.
1568 FIXME roll forward ignores inodes. But what about an inode that contains
1569 data. Should that be ignored? I think not.
1570 FIXME delete adir/big2 then delete adir and it cannot release:
1571 Cannot release [cee29000]74/0(0)r1:Pinned,Phase0,Valid,Dirty,SegRef,UninCredit,PhysValid,Prealloc
1572 presumably there is orphan processing or something to complete???
1573 FIXME when files are deleted, the space isn't returned!
1574 This seems to be mostly fixed - need to test.
1575 FIXME when I "rm [b-z]*" it waits for writeback on something???
1576 zfile again!!! OK, I think that is fixed.
1581 seg_apply_all dirties dblocks. When should they be reserved?
1582 They originally get reserved by a lafs_reserve_block call in
1583 segsum_find called from e.g. lafs_seg_ref which is called by lafs_reserve_block.
1584 However: that block might get written before *and* after a checkpoint.
1585 So we need N* Credits. These are usually only used for Index blocks.
1586 We can set these easily enough if inode type is TypeSegmentMap.
1587 We move them across to Credit in seg_apply_all.
1588 But when do we clear them if they aren't needed? I guess
1589 when we drop the last segref. Yes, we already do that.
1590 FIXME need to make sure these get flushed on next checkpoint
1591 if we cannot allocate new credits after a checkpoint.
1593 New Problem. The 'cleanable' table reports a size of 3, but it is empty!
1594 Think that is fixed.
1597 1/ see above: rm x/y; rmdir x -> BUG - FIXED
1598 2/ Spins on 'CURRENT=1' ??
1599 3/ if alloc_space gives EAGAIN while deleting, we don't survive.
1600 4/ When I create/delete a file, ablocks_used increments by one.
1601 The inode hasn't been allocated yet, so it seems the deallocation
1602 isn't adjusting ablocks_used??
1603 5/ open_namei (for dd) got caught on a mutex_lock.
1604 6/ When a large file is shrunk we don't reduce the level of the InoIdx block
1605 I'm not sure where we should and am not thinking very clearly.
1606 Will fudge something in flush_data_to_inode for now, but it MUST be fixed.
1607 7/ unlink (at least) can get stuck in iolock_block. Who could be holding
1608 the lock? Writeout that hasn't completed?
1609 Yes. writepage calls lafs_allocated_block without calling flush.
1610 So the block could be sitting waiting for a flush. How long do we
1612 8/ It seems that some datablocks can need NCredits. Make sure these
1613 are handled properly re flush-or-refill after checkpoint and
1614 flip_phase rather than unpin.
1615 9/ Maybe after lafs_writepage cluster_flush isn't getting called soon
1616 enough, and we lock up (see 7). Need to flush the first block
1617 straight away, and the next one as soon as the first finishes, etc.
1618 Or something like that. Then remove the comment from lafs_writepage.
1622 I seem to be getting only 4 blocks to a cluster at the moment.
1623 This is good as it motivates the code to handle block splitting in
1624 the Btree. But it shouldn't happen.
1627 Block splitting might work - it doesn't crash at least.
1629 After deleting all files, the tree is full of stuff.
1630 Lots of inode data/InoIdx blocks.
1631 Many but not all are Pinned. The others are OnFree
1632 The Pinned ones have outstanding references.
1636 Problem with the block splitting, when adding an index block.
1637 The index block is initially empty - we need to find things by looking
1638 at children. But we don't. We BUG_ON the iphys==0.
1639 In general, when we add a block below an index block and before we incorporate,
1640 the block must be found by finding the first indexed block and looking to
1641 see if there is a 'next' block that contains the address we need.
1644 But if we truncate a file while an index block is pinned and dirty,
1645 we spin on trying to incorporate it, which should make it empty.
1649 sync is trying to get lock in lafs_cluster_flush
1650 pdflush holds the lock and is stuck in cluster_flush_0xa40
1651 some wait_event I expect.
1652 Maybe we need an unplug ??
1654 - checkpoint/seg_apply_all/dirty_dblock doesn't have the credits.
1655 This is in clean_free. We try to update the 'youth' to mark
1656 the segment as free, and we don't have a reservation to do it.
1657 Maybe just reserve it there and then.
1661 When doing a lookup in an index block, we need to check the unincorp
1662 address list. It isn't enough to look for unincorp blocks as they
1663 might have disappeared.
1664 For INDIRECT and EXTENT this is easy enough as full information is in
1666 For INDEX it is a little tricky as we need to look at the full set of
1667 addresses to know where a particular address fits.
1668 We could force an incorporate first, but that has awkward implications
1669 if it requires a split.
1670 Maybe if we get from the lookup "start+range"....
1671 That is not enough as the 'start' might get zeroed by an update.
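For the INDIRECT/EXTENT case described above, a lookup that consults the unincorporated address list before the incorporated addresses might be sketched as follows. Everything here is an assumption made for illustration: the `toy_uninc` extent layout, the rule that later uninc entries override earlier ones, and the use of a physical address of 0 to mean a hole. The harder INDEX case is deliberately not covered.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One unincorporated extent: 'cnt' blocks starting at fileaddr/physaddr. */
struct toy_uninc {
	uint32_t fileaddr;
	uint64_t physaddr;	/* 0 means the range became a hole */
	uint32_t cnt;
};

struct toy_leaf {
	uint32_t start;			/* first file address covered */
	size_t nincorp;
	const uint64_t *incorp;		/* incorporated addresses, 0 = hole */
	size_t nuninc;
	const struct toy_uninc *uninc;	/* oldest entry first */
};

/* Return the physical address for fileaddr, or 0 if there is none. */
uint64_t toy_lookup(const struct toy_leaf *l, uint32_t fileaddr)
{
	size_t i;

	/* Newest unincorporated entry wins, so scan from the end. */
	for (i = l->nuninc; i-- > 0; ) {
		const struct toy_uninc *u = &l->uninc[i];

		if (fileaddr >= u->fileaddr && fileaddr - u->fileaddr < u->cnt)
			return u->physaddr ?
				u->physaddr + (fileaddr - u->fileaddr) : 0;
	}
	/* Fall back to what has already been incorporated. */
	if (fileaddr >= l->start && fileaddr - l->start < l->nincorp)
		return l->incorp[fileaddr - l->start];
	return 0;
}
```

Because each uninc entry carries its own file address and length, the lookup never needs the block itself to still be in memory, which is the situation the note is worried about.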
1674 rm adir/* doesn't work as readdir doesn't get all the entries
1676 Reason is that they are being put in the wrong block.
1677 lafs_find_next doesn't correctly find the 'next' block if it
1678 hasn't been incorporated yet.
1680 in index tree -- easy to find
1681 in uninc_table -- not too hard
1682 only in the ->children list, or attached to a page.
1683 It would be nice to use find_get_pages but that isn't exported so try
1684 something else for now.
1686 Look in index block for 'next
1689 FIXME when we split an index block, we need to hold a reference to
1690 the original so it doesn't disappear until the split-off copy is
1691 written. This is because we search from an index block to find
1693 [ note from Feb09. This should be OK now. Both will need
1694 incorporation, and we now hold on to blocks until they are
1700 - index block. What changes are allowed exactly.
1701 - splitting certainly makes sense.
1702 - merging two adjacent blocks is fine, of which a special case
1703 is finding that a block is empty and so removing it.
1704 - What about a 2->3 split which would require removing a block
1705 and adding another at the same time?
1706 or noticing that the first blocks addressed are all missing, so
1707 moving the index forward?
1708 In each case, searching down by indexes will find a block that
1709 has been replaced by a later address. We could manage that as
1710 long as the new block is attached after the replaced block.
1711 So we cannot move a block. We must delete and replace.
1713 - unincorporated index blocks..
1714 unincorporated data blocks are not pinned in memory. Once they have
1715 been written out, they can be freed. Their address is stored in the
1716 uninc-table. This means we can delay incorporation while many
1717 extents are written out and freed. When we come to incorporate, we
1718 may have many hundreds of addresses in a few extents that can be incorporated
1719 efficiently without holding all that data pinned in memory.
1720 The same scale doesn't apply to index blocks. An index block can
1721 reference only 102 blocks (for 1K block size). And the uninc table can
1722 hold far fewer so we will naturally incorporate more often.
1723 So keeping index/indirect/extent blocks pinned until they are incorporated
1724 is reasonable. And it makes lookup a lot easier, as we have
1725 guarantees about ordering of blocks in the children list that we
1726 don't have in the uninc table.
1728 Incorporation could have some atomicity issues. There is no
1729 concern about bad stuff appearing on disk as the phase-change
1730 process handles that. In memory it might be awkward if we split
1731 an index block before incorporating a block that would span them.
1732 That could conceivably happen if we only incorporate 8 blocks
1733 (size of uninc table) at a time.
1734 So maybe we should incorporate a full uninc list (not table) at
1736 This means quite different code paths for incorporating leaf
1737 and internal index blocks....
1740 - uninc_table lists are a real problem.
1741 They can only be created during roll-forward so they hardly ever
1743 But if the block is split while processing earlier things on the
1744 list, then splitting an uninc table would be very messy.
1745 Is there any way around this?
1746 Why not just do incorporation during roll-forward?
1747 We only need to incorporate leafs, not internal blocks because we
1748 don't use uninc_table for internal blocks any more.
1749 So during roll forward, all index blocks that are touched need to
1751 I think we live with that. If it ever becomes a problem, we will
1752 need to perform the roll-forward twice. The first time collects
1753 the usage information so that we know where we can start writing,
1754 then the second just applies all the changes to the rest of the
1759 uninc table only used for leaves, and has no linked list
1760 unincorporated index blocks are stored on a list, which we
1761 sort before applying.
1762 All uninc index blocks are therefore kept in the index tree.
1763 Their order on the children list allows us to find the correct
1764 index. Each block for which the fileaddr is in the parent is
1765 followed by any blocks that have been split off and end after
1766 this one starts. Blocks that have been emptied are Hole and are
1767 skipped over when looking for a block.
1769 When we split an internal block, the remaining uninc blocks
1770 must not start with a Hole.
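The ordering rule above suggests a simple walk over the ->children list.
A rough sketch as a toy Python model (not LaFS code: `Child` and
`find_child` are made-up names, and it assumes the list is kept sorted by
starting file address with emptied blocks flagged as Hole):

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Child:
    fileaddr: int        # first file address this block covers
    hole: bool = False   # True once the block has been emptied


def find_child(children: List[Child], target: int) -> Optional[Child]:
    """Return the last non-Hole child starting at or before target."""
    best = None
    for c in children:
        if c.fileaddr > target:
            break            # list is sorted, nothing later can match
        if not c.hole:
            best = c         # Hole blocks are skipped over
    return best
```

With children starting at 0, 100, 150 (Hole) and 300, a lookup of address
200 lands on the block at 100, because the Hole at 150 is passed over.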
1772 FIXME: what locking do I need around lafs_incorporate?
1773 i_mutex?? i_alloc_sem??
1774 i_alloc_sem is imposed by truncate (inode_setattr) and
1775 direct_io possibly. So it is really about adding/removing
1776 blocks. Not updating internals.
1777 Maybe our own mutex. Could even be per-index-block !!
1778 Whatever it is, we need to protect walking ->children too.
1782 "rm -r" problem from 12/dec/2008 fixed now.
1783 incorporate code got a make-over and is probably much better.
1785 New problems: After test runs, cannot create files due to no space
1786 on devices!! But directory tree is empty.
1789 free_blocks=3256 allocated=1425 max_seg=512 clean_reserved=0
1791 The problem is that we think 1425 has been allocated to data that
1792 might still need to be written, leaving not enough room for more.
1794 ====================414 credits ==============================
1795 which doesn't explain everything, but does explain a lot. There
1796 really should be nothing in the Index tree (except fs-root and
1799 Some inodes which are OnFree and hold no credits.
1800 0 DATA (1) 52 [0]ESegRef,Claimed,PhysValid
1801 52 1 (0) 0 [2564]{0,00000000}L on free Index(1),InoIdx,OnFree,PhysValid
1803 Some other inodes which are pinned with lots of credits and are
1804 on the phase_leaf list
1805 0 DATA (1) 299 [0]ESegRef,C,CI,Claimed,PhysValid
1806 299 1 (1) 0 [0]{0,00000000} [0, 0]L Leaf1(40) Index(1),Pinned,Phase1,InoIdx,Valid,SegRef,C,CI,CN,CNI,PhysValid,Prealloc
1808 And that is about it. some are not Valid, some are...
1809 checkpoint just wants to 'flip' them.
1810 They mostly have a refcnt of 1... I wonder who is holding that....
1811 The reference on the dblock is held by the iblock.
1812 But why is the iblock remaining? Who holds that reference?
1814 I restored some code to clean iblock, and now:
1815 free_blocks=3229 allocated=1277 max_seg=512 clean_reserved=0
1816 ====================244 credits ==============================
1817 which saved 130 credits. That helps.
1818 There seem to be many fewer of the many-credits blocks
1819 Lots of index blocks in tree are 'OnFree' and have a
1820 0 refcnt, but haven't been removed. Why?
1821 It seems that they have ->parent == NULL, so lafs_refile never
1822 bothers to remove them. I guess it should...
1823 OK, lots of InoIdx blocks have gone now with their DATA blocks.
1825 So, remaining blocks are pinned to their phase with lots of Credits,
1826 have no pincnt, mostly have physaddr==0.
1827 It is just the stray refcnt that keeps them there..
1828 inums are 40, 56, 62-73, 275-278, 280
1831 63-69 are directories 2/3/4/5/6/7/8/9
1832 70-73 are looooong symlinks
1834 276 is dfile - same as cfile but truncated.
1835 Then some nbfile-X that were big enough.
1837 So: what do they have in common:
1838 Several only use the in-inode data block, but
1841 Can it be that it is refcounted on the Leaf list, and so
1842 cannot get off?? Yes, I think so!
1843 We only unpin things that have a zero refcount.
1846 checkpoint takes it off the list, then flips the phase and puts it
1847 on the other list with refile. During that time it has a refcount
1848 it doesn't lose the pinning.
1850 1/ Not have it on the list despite being pinned.
1851 2/ Drop the PIN despite the refcnt.
1852 3/ have refile do the phase_flip so it has a chance to
1853 notice the refcount has hit zero.
1855 2 isn't really an option. We need PIN to persist whenever we have
1856 a reference. We could possibly use PinPending for index blocks too,
1857 but that would require a lot of thinking.
1858 1 requires another criterion for being on the list. I suspect that would
1860 3 we used to do I think... But refile is in a big lock, and we
1861 cannot really do a phase_flip under that.. and phase flip calls
1862 refile anyway so we would get recursion.
1863 So:4 - get lafs_phase_flip to notice and de-pin rather than flip.
1865 FIXME use kzalloc where appropriate.
1867 FIXME Maybe test refcnt-!listempty in refile and de-pin if that is zero.
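The "refcnt - !list_empty" test can be shown with a toy Python model.
This is only a sketch of the idea, not the real refile: `IndexBlock`, its
fields and this `refile` are invented here, and it assumes the leaf list
contributes exactly one reference.

```python
from dataclasses import dataclass


@dataclass
class IndexBlock:
    refcnt: int = 0
    on_leaf_list: bool = False
    pinned: bool = False


def refile(b: IndexBlock) -> None:
    """De-pin when the only reference left is the leaf list's own."""
    external = b.refcnt - (1 if b.on_leaf_list else 0)
    if b.pinned and external == 0:
        b.pinned = False
        if b.on_leaf_list:
            # Leaving the list drops the reference the list was holding.
            b.on_leaf_list = False
            b.refcnt -= 1
```

A block with refcnt 1 that is only on the leaf list gets de-pinned and
drops to refcnt 0; a block with a second holder stays pinned.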
1871 Only 54 credits in Index Tree now.
1872 Inodes 1 2 8 16 are present. (fsroot, dirroot, inodemap, segusage)
1873 plus '74', which seems to be scheduled for deletion - root has uninc_table.
1874 ... and 'sync' got rid of that and left 44 credits.
1875 Also have data blocks for inode 50 55 72 73 74 with 2 credits of 74.
1881 These seem to be the files that used data-in-the-inode
1882 They still have a refcnt of 1 (or 2 for adir).
1883 ... OK, that's gone now. I found a refcount leak.
1885 So now: 42 Credits in Index Dump. No stray files.
1887 df: tot=4608 free=4597 avail=3045(4130-1085) cb=8 pb=0 ab=3
1888 So we still seem to have 1085 blocks allocated. 42 are accounted
1889 for, so 1043 still missing... either we lost the count, or lost the tree.
1891 create a tiny file, remove, and sync, now
1892 df: tot=4608 free=4597 avail=3018(4118-1100) cb=8 pb=0 ab=3
1894 so I lost 15, but now 48 are in tree. Let's try again...
1895 df: tot=4608 free=4597 avail=3006(4108-1102) cb=8 pb=0 ab=3
1898 df: tot=4608 free=4597 avail=2986(4098-1112) cb=8 pb=0 ab=3
1900 Definitely losing more than the difference in the tree.
1902 Try creating empty files...
1903 df: tot=4608 free=4597 avail=2986(4098-1112) cb=8 pb=0 ab=3
1904 df: tot=4608 free=4597 avail=2974(4088-1114) cb=8 pb=0 ab=3
1905 df: tot=4608 free=4597 avail=2954(4078-1124) cb=8 pb=0 ab=3
1906 df: tot=4608 free=4597 avail=2942(4068-1126) cb=8 pb=0 ab=3
1907 df: tot=4608 free=4597 avail=2922(4058-1136) cb=8 pb=0 ab=3
1908 df: tot=4608 free=4597 avail=2910(4048-1138) cb=8 pb=0 ab=3
1909 df: tot=4608 free=4597 avail=2890(4038-1148) cb=8 pb=0 ab=3
1911 very strong pattern there.
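To make the pattern explicit, a throwaway Python script can pull the
allocated count out of each df line and print the step between runs. The
regex and the names `allocated_series` / `deltas` are made up for this
note, and reading the bracket as avail = (pool - allocated) is my
interpretation of the df output.

```python
import re

DF = re.compile(r"avail=(\d+)\((\d+)-(\d+)\)")

LOG = [
    "df: tot=4608 free=4597 avail=2986(4098-1112) cb=8 pb=0 ab=3",
    "df: tot=4608 free=4597 avail=2974(4088-1114) cb=8 pb=0 ab=3",
    "df: tot=4608 free=4597 avail=2954(4078-1124) cb=8 pb=0 ab=3",
    "df: tot=4608 free=4597 avail=2942(4068-1126) cb=8 pb=0 ab=3",
    "df: tot=4608 free=4597 avail=2922(4058-1136) cb=8 pb=0 ab=3",
    "df: tot=4608 free=4597 avail=2910(4048-1138) cb=8 pb=0 ab=3",
    "df: tot=4608 free=4597 avail=2890(4038-1148) cb=8 pb=0 ab=3",
]


def allocated_series(lines):
    """Extract the 'allocated' figure, checking avail == pool - allocated."""
    out = []
    for line in lines:
        m = DF.search(line)
        if not m:
            continue
        avail, pool, alloc = map(int, m.groups())
        if avail != pool - alloc:
            raise ValueError("inconsistent df line: " + line)
        out.append(alloc)
    return out


def deltas(xs):
    """Differences between consecutive samples."""
    return [b - a for a, b in zip(xs, xs[1:])]
```

On the seven lines above, `deltas(allocated_series(LOG))` gives
2, 10, 2, 10, 2, 10: the allocated count alternates between a small and a
large step on successive empty-file runs.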
1912 What about 2 files at a time.
1913 df: tot=4608 free=4597 avail=2879(4028-1149) cb=8 pb=0 ab=3
1914 df: tot=4608 free=4597 avail=2860(4018-1158) cb=8 pb=0 ab=3
1915 df: tot=4608 free=4597 avail=2849(4008-1159) cb=8 pb=0 ab=3
1916 df: tot=4608 free=4597 avail=2830(3998-1168) cb=8 pb=0 ab=3
1917 df: tot=4608 free=4597 avail=2819(3988-1169) cb=8 pb=0 ab=3
1919 Slightly different pattern - not as bad.
1921 df: tot=4608 free=4597 avail=2802(3978-1176) cb=8 pb=0 ab=3
1922 df: tot=4608 free=4597 avail=2793(3968-1175) cb=8 pb=0 ab=3
1923 df: tot=4608 free=4597 avail=2776(3958-1182) cb=8 pb=0 ab=3
1924 df: tot=4608 free=4597 avail=2767(3948-1181) cb=8 pb=0 ab=3
1926 Strange, isn't it....
1928 Making sure we clear UnincCredit... result looks worse.
1931 I fixed up the credit accounting 'incorporate' and then fixed a couple
1932 more little bugs. And now:
1936 ====================48 credits ==============================
1937 df: tot=4608 free=4597 avail=3172(3940-768) cb=10 pb=0 ab=1
1939 So we still have 720 allocated credits that aren't accounted for.
1940 But we are nicely under 100...
1945 ====================76 credits ==============================
1946 df: tot=4608 free=4256 avail=2160(2402-242) cb=350 pb=0 ab=2
1948 That is different. The count of missing blocks is way down,
1949 but there is some extra cruft in the index tree.
1951 0 DATA (1) 303 [0]L Leaf1(13) SegRef,Claimed,PhysValid
1952 0 DATA (1) 302 [0]L Leaf1(14) SegRef,Claimed,PhysValid
1954 0 DATA (2) 330 [0]L Leaf1(1) SegRef,C,CI,Claimed,PhysValid
1955 330 1 (1) 0 [0]{0,00000000} [0, 0]L Leaf1(0) Index(1),Pinned,Phase1,InoIdx,SegRef,C,CI,CN,CNI,PhysValid,Prealloc
1956 Time for a commit though....
1959 ====================46 credits ==============================
1960 df: tot=4608 free=4257 avail=4253(4458-205) cb=350 pb=0 ab=1
1962 so the strays in the index tree are gone, but still have 159 outstanding
1965 ====================36 credits ==============================
1966 df: tot=4608 free=4256 avail=3787(3885-98) cb=350 pb=0 ab=2
1969 That is a little weird...
1971 ====================48 credits ==============================
1972 df: tot=4608 free=4257 avail=4247(4458-211) cb=350 pb=0 ab=1
1975 ====================34 credits ==============================
1976 df: tot=4608 free=4257 avail=3176(3373-197) cb=350 pb=0 ab=1
1978 It seems that the unaccounted blocks are (or can be) created by
1979 writing to a file then removing the file without a sync.
1980 ..but why is cb (cblocks_used) so high?
1984 Got onto a bit of a tangent...
1985 What happens if we truncate a block while it is on a list to
1986 be cleaned? Clearly we want the cleaner to drop it ASAP.
1987 But what if invalidate_page wants to drop it *now*
1988 Hopefully it is either still on clean_leafs and we can remove it,
1989 or it is now iolocked and we can wait for it. So should be OK.
1991 I keep getting caught in "looping on..."
1992 We are truncating an inode and some index block which is now empty
1993 is not getting removed from the tree because there is an outstanding
1994 reference.... 327/0 depth=1. I guess I turn on the tracing.
1996 ... and it seems that it is in the process of checkpointing.
1997 I guess I need to lock against that ... maybe with the iolock.
2000 ib = [ce814e40]328/0(2552)r3:Index(1),Pinned,Phase1,Valid,Dirty,CI,CN,CNI,UninCredit,PhysValid,Uninc{0,0}[0]
2001 ------------[ cut here ]------------
2002 kernel BUG at /home/neilb/work/nfsbrick/fs/module/segments.c:371!
2005 Every time I create/delete a file, I get an extra 'ab' which disappears
2008 decremented when +ve summary_update on non-index
2009 increased on lafs_summary_allocate... should not be done for index blocks.
2011 OK: after test run, filesystem is empty, but cblocks_used is around 360.
2013 is loaded at mount time
2014 collects pblocks_used on a phase flip
2015 is updated in lafs_summary_update (unless pblocks is)
2016 So we must be missing a lafs_summary_update when phys->0
2020 truncating big (multi-level index) seems to be bad
2021 Leaves 'pb-338 !!! and cb+689, even after sync.
2022 still 'looping on' occasionally
2023 Haven't found cblocks_used leak yet.
2024 Occasionally non-B_Valid blocks are acted on.
2025 I think I need to improve io locking.
2029 Need some improvements to iolock locking.
2030 We use this lock to wait for a block to be written out (if that is happening)
2031 before we allow lafs_invalidate_page to complete.
2032 It is also used in lafs_erase_{d,i}block (Similar purpose)
2033 We take the lock in lafs_cluster_allocate, and then make sure the block is
2036 Also lock in lafs_new_inode as initing the inode is a form of IO ??
2037 load_block takes the lock
2038 We only clear_bit(B_Valid, ) under this lock.
2040 So the issue is this:
2041 A block that is going to be written is passed to lafs_cluster_allocate.
2042 This happens either after taking it off a _leafs list, or when
2043 lafs_writepage requests the write.
2045 lafs_invalidate_page needs to be able to release the page, so there needs to
2046 be no transient references. In particular, once the block has been
2047 removed from a _leafs list it must already be iolocked.
2048 Invalidate_page can then either remove from that list and erase the block,
2049 or use io_lock_block to wait for the IO to complete.
2050 So when a datablock comes out of get_flushable it must be iolocked, and must
2051 remain iolocked until after Dirty and Alloc are clear
2052 Index blocks belong entirely to the fs, so we can be more relaxed with them.
2053 If get_flushable finds the block already iolocked, it is either being invalidated
2054 or already has IO pending, so it can be dropped.
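The get_flushable rule is essentially a trylock. A minimal Python sketch,
using threading.Lock as a stand-in for the iolock (the `Block` class and
this `get_flushable` are illustrative only, not the real function):

```python
import threading
from collections import deque


class Block:
    def __init__(self, name):
        self.name = name
        self.iolock = threading.Lock()


def get_flushable(leafs):
    """Pop blocks until one can be iolocked; return it locked, or None."""
    while leafs:
        b = leafs.popleft()
        if b.iolock.acquire(blocking=False):
            return b      # caller now holds the iolock
        # Already iolocked: being invalidated or IO is pending, so drop it.
    return None
```

The block is locked at the moment it leaves the list, so nobody can ever
see it off the list and unlocked.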
2059 FIXME When we sync a small file, we just write out the inode.
2060 rollforward currently ignores data in inodes I think.
2061 That needs to be fixed to ensure this data is safe.
2063 - stop iblock from disappearing so much.
2066 While cleaning a file, I truncate it. This makes it appear
2067 to fit in the inode but it is very big and we get confused.
2068 We cannot allocate block 0 until all the others have been
2069 allocated to 0 and forgotten.
2070 But what if we truncate a file to 10 bytes, then fsync?
2071 We need to write the data promptly, but we like doing truncate
2073 When we extend a file we already need to wait for truncation
2074 to complete (FIXME do we do that?) We could wait on fsync too.
2075 We cannot just delay block0 as it might be part of a checkpoint
2076 that has to complete promptly while truncation can take a long time.
2077 i.e. we have a very large file. We update the first byte, then
2078 truncate to 2 bytes.... we don't need to write until fsync which will wait...
2079 Directory?? delete lots of entries so it shrinks to one block?
2080 There is no delayed truncate there.
2081 ?? Never clean an I_Trunc file.
2082 If we try to allocate a file with other indexes:
2084 if Dirty and Pinned, just do normal alloc
2085 if Dirty and not pinned, skip.
2088 Sometimes I run out of credits while truncating a file.
2089 I need credits - maybe only briefly - to dirty the index blocks.
2092 An indexblock remains pinned while the refcount is non-zero.
2093 A pinned index block can be on a _leaf lru
2094 The _leaf lru holds a refcount.
2095 This is an awkward referential loop.
2096 We break it at checkpoint time with special code in phase-flip.
2097 But there are other awkward times such as truncate.
2099 We cannot use PinPending like we do with data blocks because there
2100 could be multiple pending Pins (from different children).
2102 We could possibly treat checkpoint_lock like pinpending, but that
2105 We could not count the _leaf lru, but that might just make the race
2108 I think we want to explicitly drop the pin when we truncate a block.
2109 Normally, once we Pin an index block it will become dirty so we don't
2110 want to de-pin before a checkpoint anyway...
2112 Just to clarify: an index block gets dePinned:
2113 - during checkpoint on a phase_flip if it is no longer dirty etc
2114 - on truncation when we erase it
2115 - during pre-emptive write-out which is a bit like an early phase_flip
2116 not sure that we implement that one yet.
2120 - checkpoint calls incorporate call erase_iblock calls iolock_block
2121 - rm calls orphan_pin calls phase_wait
2122 The problem is in lafs_incorporate. It expects the block to be iolocked,
2123 but can call erase_iblock which tries to get an iolock itself...
2124 ...fixed that and it still happens.
2125 checkpoint calls phase_flip calls allocated_block (on uninc list) calls
2126 iolock_block before calling incorporate
2127 Maybe all of these should assume an IO lock.
2129 FIXME truncate assumes truncate-to-zero. We need proper ftruncate support.
2133 - sort out individual patches and review DONE
2134 - allow compilation without refcount tracking DONE
2135 - don't hold a 'leaf' reference. NO
2136 - clean up *ref calls - differentiate those that can be called when zero DONE
2137 - use enum for B_* DONE
2138 - support truncate to non-zero offset DONE
2139 - "looping on" found an 'OnFree' block!
2140 - clean out lots of debugging
2143 rmdir is holding i_mutex and waiting for a phase change to pin a dblock.
2144 checkpoint is also holding i_mutex.. or is trying to get one in lafs_cluster_allocate.
2145 Not cool. i_mutex must not be taken by checkpoint
2146 Fixed that, though it is a bit of a hack....
2148 New deadlock: checkpoint calls phase_flip which calls allocate_block,
2149 to move the uninc_next across, and that tries to iolock the parent to
2150 perform a partial incorporation. But that seems to be iolocked.
2151 Generally that is ugly as ->uninc_next might be very long and require
2152 multiple splits, and direct-driving that from phase_flip is bad.
2153 I should just move the list across
2157 Spent too long trying to remove refcount held by *_leaf lists.
2158 This leaves InoIdx blocks with zero refcount so Data blocks can get
2159 lost and bad things happen.
2160 I might be able to fix it up, but it is probably better to try the
2161 checkpoint_lock approach if I can only remember what that is.
2174 ->lru when on freelist
2182 ->children / ->parent within an inode
2190 segsummary counters (in blocks)
2194 ->pending_blocks lru - should this be wc->lock ??.. not in 'bh'
2195 Pinned consistent with lru
2196 ->checkpointing / ->phase_locked
2198 ->uninc and ->chain ?? Should use parent->B_IOLock ??
2199 uninc_table - should use B_IOLock
2200 free list / clean list segtrack
2205 wc[0] .. something in prepare_checkpoint
2223 Initialising new inode
2225 IOLock across a page
2228 --------------------
2229 This is a list from 18 months ago, with updates
2231 - Understand how superblock 'version' should be used.
2233 - Review and fix up all locking/refcounts. See locking.doc
2234 Also lock inode when copying in block 0 and probably
2235 when calling lafs_inode_fillblock (??)
2236 - lafs_incorporate must take a copy of the table under a lock so
2237 more allocations can come in at any time.
2239 - We don't want _allocated to block during cluster flush. So have
2240 a no-block version and queue blocks on ->uninc if we cannot
2241 allocate quickly. Find some way to process those ->uninc blocks.
2243 - Use above for phase_flip so that we don't need to _allocated there.
2245 - Utilise WritePhase bit, to be cleared when write completes.
2246 In particular, find when to wait for Alloc to be cleared if
2247 WritePhase doesn't match Phase.
2248 - when about to perform an incorporation.
2249 - make sure we don't re-cluster_allocate until old-phase address has
2250 been recorded for incorporation.
2252 - allocate multiple WAIT_QUEUE_HEADS for 'block_wait'
2254 - Can inode data block be on leafs while index isn't, what happens if we
2255 try to write it out...
2257 - If InoIdx doesn't exist, then write_inode must write the data block.
2259 - document and review all guards against dirtying a block from a previous phase
2260 that is not yet safe on storage.
2261 See lafs_dirty_dblock.
2262 - check for proper handling of error conditions
2263 b/ if lafs_seg_dup fails in do_checkpoint, we need to abort the snapshot.
2264 - review checkpoint loop.
2265 Should anything be explicit, or will refile do whatever is needed?
2267 What should checkpoint_unlock_wait wait for?
2268 When do we need to wait for blocks to change state? And how?
2270 - load/dirty block0 before dirtying any other block in depth=0 file
2272 - use kmem_cache for 'struct datablock'
2273 - indexblock allocation.
2275 allocate the 'data' buffer late for InoIdx block.
2276 trigger flushing when space is tight
2277 Understand exactly when make_iblock should be called, and make it so.
2278 - use a mempool for skippoints in cluster.c
2279 - Review seg addressing code in cluster.c and make sure comments are good.
2280 - consider ranges of holes in pending_addr.
2282 - review correct placement of state block given issues with stripes.
2284 - review segment usage /youth handling and make a todo list.
2285 a/ Understand ref counting on segments and get it right.
2286 - Choose when to use VerifyNull and when to use VerifyNext2.
2287 - implement non-logged files
2288 - Store accesstime in separate (non-logged) file.
2290 make sure files are released on unmount.
2293 Support 'peer' lists and peer_find. etc
2294 - subordinate filesystems:
2295 a/ ss[]->rootdir needs to be an array or list.
2296 b/ lafs_iget_fs need to understand these.
2299 how they can fail / how to abort
2302 - need to clean up checkpoint thread cleanly - be sure it has fully exited.
2303 - review roll-forward
2304 - make sure files with nlink=0 are handled well.
2305 - sanity check various values before trusting clusters.
2307 - Configure index block hash_table at run time base on memory size??
2309 Review everything that needs to handle laying out at cluster
2310 aligned for striping.
2312 - consider how to handle IO errors in detail, and implement it.
2313 - consider how to handle data corruption in indexing and directories and
2314 other metadata and guard against problems (lot of -EIO I suspect).
2316 - check all uninc_table accesses are locked if needed.
2318 - If a datablock is memory mapped writeable, then when we write it out,
2319 we need to either fill up its credits again, or unmap it.
2320 - Need to handle orphans asynchronously.
2323 - implement 'write_super' ??
2325 - pin_all_children has horrible gotos - remove them.
2327 - perform consistency check on all metadata blocks read from disk
2328 e.g. don't assume index blocks are type 1 or 2.
2331 + looking at cleanup for unmount.
2332 - various more refcounts fixed up
2333 - B_SegRef is never dropped! and we take a ref on a segment when
2334 we start a cluster on it, but never drop that reference.
2335 THIS is next thing - review all setting and clearing of B_SegRef.
2338 - SegRef and lafs_reserve_block...
2339 There is room for recursion here, I need to be careful.
2340 To dirty a data block, all parent index blocks must be Pinned and must
2341 be able to be written. That means their segusage blocks must be
2342 available for update. And Pinning a segusage block for update requires
2343 all its parents. So the segment for the block, the indexes, and the
2344 segusage and indexes and so-on must all be pinned.
2345 When we pin a block, we do it from the root down to avoid recursion.
2346 We probably want whatever reserve_block calls, to return an unreserved
2347 block rather than call reserve_block itself.
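Pinning from the root down can be done with a loop rather than recursion.
A small Python sketch of that shape (the `Block` class, `pin_from_root`
and the `log` argument are hypothetical, and a single ->parent chain is
assumed):

```python
class Block:
    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.pinned = False


def pin_from_root(block, log=None):
    """Pin every ancestor first, then the block itself, with no recursion."""
    chain = []
    b = block
    while b is not None:       # walk up and collect the path to the root
        chain.append(b)
        b = b.parent
    for b in reversed(chain):  # then pin walking back down from the root
        if not b.pinned:
            b.pinned = True
            if log is not None:
                log.append(b.name)
```

Blocks that are already pinned are left alone, so pinning a second child
under the same parent only touches the new child.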
2349 When do we clear SegRef?? We set it when Pinning, so I guess we
2350 clear it when unpinning.
2351 pin_dblock, mark_cleaning, prepare_write, truncate
2353 Well, it is really when Pinning, or Dirtying or Reallocing.
2354 So we clear when unpinning, or when a dblock gets written...
2355 Maybe just when we lose ->parent
2358 - sometimes segsum counter goes zero for random data block
2359 Something is going wrong in roll-forward. The block looks transiently valid
2360 so doesn't get read, but has no good data in it.
2361 - After deleting a directory, the block might still have incorporation
2362 to happen, but is not marked dirty
2363 - at unmount, there are various blocks that are still dirty.
2364 - sometimes hit BUG_ON(credits==0) line 1196 in cluster.c(cluster_flush)
2367 - that rollforward problem above:
2368 When rolling the checkpoint, if we find segusage blocks we want to include
2369 them directly into file. But by pinning the block we might preread a
2370 segusage block.. but we must be sure not to update it.
2371 So during the early stages of rollforward while still in the checkpoint,
2372 seg_inc must be called with in_phase == 0.
2373 so seg_move is called with phase != qphase.
2374 ditto for summary update.
2375 So the block must be pinned to the previous phase...
2376 Normally 'phase' changes at checkpoint-start,
2377 qphase changes at checkpoint-end
2378 So we probably want to start with qphase being 0 and phase being 1.
2379 When we reach the end of the checkpoint, we flip qphase to 1.
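The phase/qphase relationship is a pair of one-bit counters. A toy Python
model, under the assumption that "in a checkpoint" just means
phase != qphase (the `Phases` class and its method names are invented
here):

```python
class Phases:
    """Roll-forward starts as if already inside a checkpoint."""

    def __init__(self):
        self.phase = 1
        self.qphase = 0

    def in_checkpoint(self):
        return self.phase != self.qphase

    def checkpoint_start(self):
        self.phase ^= 1            # 'phase' flips at checkpoint start

    def checkpoint_end(self):
        self.qphase = self.phase   # 'qphase' catches up at checkpoint end
```

Starting at phase=1, qphase=0 means the first call to checkpoint_end()
flips qphase to 1, matching the end of the checkpoint being rolled.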
2381 - blocks still in phase_leafs at unmount:
2382 After we force a final checkpoint we still have Pinned:
2384 ino==8 InoIdx due to Dirty block0
2385 ino=16 InoIdx due to dirty block0
2387 inode block 1, inode usage map
2392 inode blocks dirty but not pinned? No InoIdx...
2393 Segusage dirty - probably by seg_apply_all - disable that at umount
2394 orphan dirty ??... but not pinned!
2395 This is possible - we don't pin for clearing entries, just for setting.
2396 The inode problem stems from the datablock being dirty while the
2397 InoIdx block isn't. That is, at best, confusing.
2400 segusage blocks aren't being pinned
2401 They need to be pinned whenever dirty.
2402 and youth blocks aren't even made dirty some times. They need to be
2403 pre-pinned in many cases.
2405 So: segusage gets changed when we write out a cluster, and when we
2406 delete/relocate blocks.
2407 In the first case we pin the block when it becomes part of the free list,
2408 and need to keep it pinned across checkpoint changes.
2409 In the second, we pin when the block is dirtied and again must keep it pinned.
2410 Youth gets changed when a segment becomes free and again when we allocate
2413 Keeping a datablock pinned across checkpoints is awkward - we currently need
2414 to repin for each dirty... I guess we can re-pin for each checkpoint
2415 in lafs_seg_apply_all. That might work for segusage, but not for youth!
2416 If segsum for ssnum==0 held a reference to the youth block, that might
2417 help. Segstat on 'clean' or 'free' would imply a reference to that segsum.
2419 Is it OK to keep all youth/usage blocks for free/clean blocks
2420 pinned? We can currently have 810 entries. Only half will be clean/free.
2421 For each entry there can be two blocks, youth and usage. So that could be
2422 810 blocks. 1Meg? Normally much less. If it became a problem we could
2423 reduce the number dynamically I guess.
2425 maybe segusage blocks need to get phase_flipped, as other blocks do
2426 depend on them, pin_all_children wouldn't be able to find them though..
2428 1/ Any address on 'clean' or 'free' segtrack implies a refcount on the
2432 I think I want to link dirty block to the space in free segments that we
2433 actually know about. Each of those segments has youth and usage blocks
2434 pinned (at least parent pointer is active). So we have everything we need
2435 to write everything that is dirty. So 'free' or 'clean' implies
2436 a segsum reference which holds youth block.
2438 When we get low on space, we wait for cleaning/finding to progress.
2439 This would limit us to 400 segments, say 16Meg each, so 6Gig of dirty
2440 memory. I guess that we need to scale the 'free' list based on available
2443 When cleaning needs a segment, it needs to load the usage blocks for other
2446 When cleaning in the presence of snapshot we need to be careful never to
2447 duplicate a block that is shared. To allow for v.many snapshots, we don't
2448 even want to duplicate in memory.
2449 So we need to choose a 'primary' copy - probably first one found - and
2450 follow the peers link when possible...
2455 So clean and free segments in the list carry a SegRef. But it could be
2456 excessive if all of them did - we shouldn't be required to pin more
2458 So for segments with a usage of 0, we use the score to record if a
2459 segref is held. 0 means 'no', 1 means 'yes'.
2460 When space_alloc wants more space we need to find an entry and
2461 segref it. Maybe we want free lists - reffed and not-reffed.
2463 Then again, SegRefs are fairly cheap as they are heavily shared.
2464 maybe 512 to a block. If we hold 400 refs they could easily all be
2465 in one block. We could possibly encourage this by sorting the list
2466 and discarding from one end if it is too full.
2467 Sorting is a good idea definitely. It keeps youth/usage updates
2470 Just check the numbers.
2471 a 1TB device with 1K blocks might have 32MB segments of which there
2472 would be 32768. 512 per block means 64 blocks or 16 pages (64K).
2473 So total segusage files is 128K plus snapshots. Not worth worrying
2475 For 16TB, that is 2Meg plus snapshots.
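The numbers can be rechecked with a few lines of Python. The function
name and parameters are made up, and the factor of two (`tables=2`) is my
reading of "128K" as two tables of 64K each:

```python
KiB, MiB, TiB = 1 << 10, 1 << 20, 1 << 40


def segusage_bytes(device, segment=32 * MiB, block=1 * KiB,
                   per_block=512, tables=2):
    """Return (segments, blocks per table, total bytes over all tables)."""
    segments = device // segment
    blocks = -(-segments // per_block)   # ceiling division
    return segments, blocks, blocks * block * tables
```

segusage_bytes(TiB) comes out as 32768 segments, 64 blocks per table and
128K in total; segusage_bytes(16 * TiB) gives 2Meg in total.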
2478 - keep a SegRef for all free and clean blocks.
2479 This must include a youthblk reference.
2480 - sort the free list when 'clean' is merged or when a pass
2484 merge as many as fit into free
2487 How is the code flow...
2488 add_cleanable is called during the periodic scan. It could hold
2490 add_cleanable calls add_clean as does lafs_get_cleanable during
2491 clean. That might block getting a segref, might even
2493 add_free is also called by seg_scan
2495 So seg_scan should get a segref and leave it with everything!
2498 A SegRef implies a 'struct segsum' for each segment. We don't
2499 want to allocate one of these for every segment in the table.
2500 We only want a reference to the youth and segusage block, which
2503 But these blocks need to be Pinned and SegReffed etc so we can
2504 write them at any time.
2507 The refcount held by the 'leaf' lru is a problem.
2508 While it holds a count we do not unpin an index block, so it cannot
2509 be removed from the list.
2510 Thus we can only remove from the leaf lru on a phase change.....
2511 Or when doing lru based flushing... Maybe we can remove from the
2512 lru while holding the checkpoint lock.
2513 This happens when truncating..
2515 No, that is just too messy as it is too easy to get put back on the list.
2517 Maybe the leaf lru should not imply a reference count ... or maybe
2518 we need to split the refcount: 'inuse' and 'active'.....
2519 How about we test refcnt against list_empty(->lru)...
2523 During truncate, we need each index block to get unpinned so they can
2525 But the InoIdx block is held pinned by the inode block being dirty.
2526 In this particular case, the InoIdx block is Invalid as the file is empty.
2527 But.... InoIdx should always be valid until after Inode is destroyed??
2531 I need to stop the cleaner and flush everything before trying to
2534 This is awkward though.
2535 The 'sync' of umount is done by kill_block_super, but I call
2536 that rather late, after checking that the tree is empty.
2537 There are pinned/dirty bits left after sync that we want to magically
2540 - segusage/youth blocks. Maybe if we don't seg_apply_all...
2541 - orphan block. Maybe don't mark it dirty when we remove things?
2542 - inode map?? why is that dirty
2544 - root directory is dirty still?? But it has been erased.
2545 InoIdx is valid-but-empty. Inode Data is dirty
2546 Data block 0 is Dirty at block 0.
2549 Ahh... need to mark page dirty when block is marked dirty !!
2551 The seg usage blocks are now flushed out but not incorporated.
2552 I feel that might be correct - we don't want to care about
2553 incorporation as we will never use it.
2554 For this, segusage and quota are very special cases.
2556 Inode map is no longer dirty, but is pinned
2557 Orphan does have a dirty block still
2558 The orphan table contains the root directory.
2559 root is now clean and gone
2561 Segusage doesn't get incorporated after last checkpoint now
2563 But now we have a circular reference for SegRef. This should not
2564 be surprising given the circular problems we had setting SegRef.
2565 I guess we just erase the references in the segsum table...
2568 Hurray!!! I can unmount without crashing!
2569 Now I need to sort through all the fixes required to achieve that
2570 and make discrete patches, and be sure it is all OK.
2572 DONE - (block.c) lafs_get_block should not have to lock that page just to do a lookup.
2573 DONE - (block.c) Mark page dirty when block becomes dirty
2574 DONE - (checkpoint.c) print orphan_slot with Orphan flag
2575 DONE - Don't incorporate segcount etc after final checkpoint
2576 DONE - Don't apply seg changes after final checkpoint.
2577 DONE - Don't start opportunistic checkpoint after final.
2578 DONE - (checkpoint) if InoIdx isn't dirty but InodeData is, then still allocate
2579 DONE - (checkpoint) when waiting, wait for checkpointneeded to get cleared
2580 DONE - (cluster) be more flexible about credit usage when flushing InoIdx
2581 DONE - (dir) do add_orphan when we abort as well as on success
2582 DONE - use inode_dec_link_count, not i_nlink--
2583 DONE - (file.c) lafs_writepage: remove from leafs when we cluster_allocate
2584 DONE - change %d/%d to strblk
2585 DONE - (index.c) refile: IF B_IOLOCK, then it isn't on LRU
2586 DONE - (index) refile: when unpinning, remove from lru
2587 - lafs_refile: ->iblock can be non-null for inode 0.
2588 DONE - Make sure I_Deleting gets cleared when deleting finished.
2589 DONE - phase_flip should have something separate to call, not lafs_allocated_block
2590 - inode.c: lafs_dirty_inode: getref_lock used to get dblock
2591 NONO - ?? getref_locked allowed if PagePrivate
2592 DONE - segment: lafs_seg_put_all needed at unmount
2593 DONE - segdelete_all: need to put intable references
2594 DONE - lafs_free_get: put the intable references
2595 DONE - lafs_get_cleanable: put the intable references
2596 DONE - fix sort splitting in add_cleanable
2597 DONE - add lafs_empty_segment_table for unmount
2598 DONE - lafs_release: flush all dirty blocks
2599 DONE - lafs_release: force a final checkpoint
2600 DONE - lafs_release: move kill_block_super before final check
2601 DONE - lafs_put_super: release orphans and segsum files.
2602 DONE - lafs_destroy_inode: putref should be 'iblock'
2603 - lafs_destroy_inode: allow for iblock to be present but no ref held....
2604 DONE - can roll forward call lafs_allocated_block without dirty???
2607 - I've re-arranged lafs_release so that the flush is all done in
2608 generic_shutdown_super. However it calls invalidate_inodes, and that has
2609 problems with pinned inodes. So we need for fsync_super to checkpoint
2610 out all inodes that we don't hold our own reference to.
2611 If we do hold a reference, then invalidate_inodes will skip them,
2612 and ->put_super can be used to drop the references and perform the final
2614 fsync_super calls ->sync_fs after syncing all files. Maybe I can
2615 do some sort of checkpoint there...
2616 There almost is a checkpoint in there.... But only when called without
2618 I need to understand 's_dirt'.
2619 This is controlled entirely by the filesystem, common code only examines it.
2621 file_fsync (the generic 'fsync' method) will call ->write_super
2622 fsync_super will call write_super
2623 generic_shutdown_super will call write_super
2624 sync_supers will call write_super
2625 sync_filesystems(0) will call ->sync_fs
2627 twice from 'sync', once with '0', once with '1' for 'wait'.
2628 (though in emergency_sync, both are '0').
2629 once from unmount and remount with 'wait' set to '1'.
2630 We don't want two checkpoints for a 'sync', but we want to start
2632 Maybe if we get called with '0', we set a flag and treat the '1'
2633 differently.. There is no locking to make this really safe, but
2634 it will probably be OK... I could take a process_id, but then
2635 parallel 'sync's could race.
2636 write_super is called before the syncs. So it could start the checkpoint,
2637 and sync could wait for it.
2638 write_super is called multiple times at shutdown, We really need
2639 to utilise s_dirt to avoid some of these.
2640 We set s_dirt to 0 when we set CheckpointNeeded, and set it to 1:
2641 - when we pin a dblock or dirty a this-phase iblock.
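The "set a flag on the '0' call, treat the '1' call differently" idea for ->sync_fs might look roughly like the toy model below. It is a user-space sketch only: `toy_fs`, `toy_sync_fs` and the counters are hypothetical, and, as noted above, there is no locking, so parallel syncs could still race.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model, not kernel code. All names are hypothetical. */
struct toy_fs {
	bool checkpoint_started; /* set by the wait==0 call */
	int checkpoints_run;     /* how many checkpoints were started */
	int waits;               /* how many times we waited for one */
};

static void toy_start_checkpoint(struct toy_fs *fs)
{
	fs->checkpoints_run++;
}

static void toy_sync_fs(struct toy_fs *fs, int wait)
{
	if (!wait) {
		/* First call from 'sync': start the checkpoint, don't wait. */
		toy_start_checkpoint(fs);
		fs->checkpoint_started = true;
		return;
	}
	/* Second call: only start a checkpoint if the first call didn't. */
	if (!fs->checkpoint_started)
		toy_start_checkpoint(fs);
	fs->checkpoint_started = false;
	fs->waits++; /* stands in for waiting for the checkpoint */
}
```

A 'sync' (calls with 0 then 1) then costs one checkpoint, while unmount or remount (a lone call with 1) still gets its own.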
2644 at unmount, we iput the root inode which de-references the dblock
2645 before clearing ->iblock, which fails an assertion ... why?
2646 Apart from the shrinker, ->iblock is only set to NULL in refile
2647 when we find an I_Destroyed inode... I guess the root block isn't
2648 getting Destroyed...
2649 The protocol for freeing iblocks is bad. Should be:
2650 - it only gets freed by the shrinker
2651 - when inode dies, set ->inode to NULL
2652 - when InoIdx iblock dies, set ->iblock to NULL
2655 So, what exactly is the protocol?
2656 - index blocks live either in the parent/sibling tree, or
2657 on the inode's free_index list
2658 - when refcnt is 0, they live on 'freelist.lru'. When refcount
2659 is elevated they stay on lru until they need to be
2660 added to some other lru (leafs or cluster)
2661 - when shrinker finds block on freelist.lru with non-zero refcnt,
2662 it just removes from lru
2663 - when shrinker finds free block, it removes from free_index and discards
2664 the block FIXME can refcnt=0 still have Pinned,Uninc,Realloc,Dirty ??
2665 I think not as such would either have children or be on an lru
2666 - When we destroy an inode, all index blocks get disconnected from the
2667 inode and freed. This must include the ->iblock
2668 - When an index block becomes free due to index tree shrinkage,
2669 we set the ->depth to -1 so that it cannot be found by mistake,
2670 and leave it for shrinker or inode destruction.
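The shrinker side of this protocol can be modelled in a few lines. This is a toy sketch under simplifying assumptions: list membership is reduced to booleans, and `toy_iblock` / `toy_shrink` are hypothetical names rather than anything in index.c.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model of an index block; list membership is just a flag here. */
struct toy_iblock {
	int refcnt;
	bool on_free_lru;   /* on 'freelist.lru' */
	bool on_free_index; /* on the inode's free_index list */
	bool discarded;
};

/* Scan the blocks as the shrinker would; returns how many were discarded. */
static int toy_shrink(struct toy_iblock *blocks, size_t n)
{
	int freed = 0;

	for (size_t i = 0; i < n; i++) {
		struct toy_iblock *b = &blocks[i];

		if (!b->on_free_lru)
			continue;
		/* Either way the block leaves freelist.lru. */
		b->on_free_lru = false;
		if (b->refcnt > 0)
			continue; /* in use again: just drop it from the lru */
		/* Really free: remove from free_index and discard. */
		b->on_free_index = false;
		b->discarded = true;
		freed++;
	}
	return freed;
}
```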
2672 Confused about inode<->dblock dependence.
2673 We don't want the inode to refcnt the dblock as that wastes space.
2674 We don't want the dblock to refcnt the inode as that stops it from being freed.
2675 So each must disconnect from other when freed.
2677 inode takes private_lock, then checks dblock
2678 dblock cannot take private_lock before checking ->my_inode..
2679 Maybe: destroy_inode takes ref on dblock, then sets I_Destroyed, then
2683 Tracking down the 'credit' count and making sure it stays correct.
2684 It seems that I have a Dirty InoIdx block which is not pinned.
2685 Due to this it has no refcount and so the data block disappears so
2686 the InoIdx block is not visible in the tree. This isn't a definite bug
2687 but it means I cannot count credits properly.
2688 And surely Dirty index blocks must always be pinned!!??
2690 When a small file is flushed to the inode we were dirtying the
2691 iblock. That seems wrong - should dirty the dblock? Need to
2694 I got a hang in 'rm adir/4'.
2695 rm is in lafs_cluster_update_commit_both
2697 cleaner is in lafs_do_checkpoint+0xe4
2698 pdflush is in writepage/lafs_cluster_flush waiting on a lock
2699 so I guess cleaner is holding a mutex and waiting for something
2703 Hang again at 'seq 1 200' in 'cd /mnt/1/adir'.
2704 cleaner is at some point, holding a mutex to stop 'sh'.
2707 ahh.. prepare checkpoint holds wc[0].lock while waiting for checkpoint
2709 So when something locks the checkpoint and needs to flush, we have problems....
2712 I seem to have fixed the above. Now:
2713 Free space is a real problem. When I remount after the successful unmount,
2714 we find a usage pattern like:
2715 CLEANABLE: 0/0 y=10 u=34179
2716 CLEANABLE: 0/1 y=0 u=65144
2717 CLEANABLE: 0/2 y=0 u=65535
2718 CLEANABLE: 0/3 y=32773 u=32910
2719 CLEANABLE: 0/4 y=32772 u=149
2720 CLEANABLE: 0/5 y=0 u=0
2721 CLEANABLE: 0/6 y=32770 u=16529
2722 CLEANABLE: 0/7 y=32769 u=35084
2723 CLEANABLE: 0/8 y=32768 u=31877
2725 Which is ridiculous.
2726 Better fix up what I have first...
2729 In rm /mnt/1/nbfile* we hang..
2730 rm is in lafs_phase_Wait from pin_dblock in unlink
2731 wait for [ce5c2d20]277/0(0)r2F:Index(1),Pinned,Phase0,InoIdx,Valid,Dirty,CN,CNI,UninCredit,PhysValid{0,0}[8] pindb(1) leaf(1)
2733 cleaner is in lafs_iolock_block from add_block_address in phase_flip
2734 iowait for [ce5c33b0]286/0(0)r6E:Index(1),Pinned,Phase1,WPhase0,InoIdx,Valid,Dirty,CN,CNI,UninCredit,IOLock,PhysValid{0,0}[1] child(5) leaf(1)
2736 So cleaner is probably deadlocking against itself via iolock_block.
2738 - in lafs_invalidate_page just to wait for any io - it isn't held long
2739 - in lafs_erase_dblock while we erase and 'allocated_block'
2740 - in lafs_get_flushable to protect blocks being checkpointed
2741 - in lafs_writepage to call cluster_allocate (which releases), both for
2742 data block or for inode when data was flushed there.
2743 - lafs_add_block_address to process pending incorporations to make room.
2744 This is what is trapping the cleaner.
2745 - lafs_inode_handle_orphan when truncate finishes to erase_iblock
2746 - lafs_inode_handle_orphan again to incorporate all removal
2747 - and again to erase_iblock
2748 - and for partial truncate to incorporate some removals
2750 - lafs_new_inode to keep it from being cleaned while being created
2751 - roll_block to add addresses
2752 - lafs_load_block during IO
2754 So: who holds it?.... let's use the code to find out...
2755 And the answer is : lafs_get_flushable.
2756 So get_flushable iolocks the block then calls phase_flip which tries to
2757 incorporate other-phase children which try to iolock the block. Deadlock.
2758 Do we need to hold iolock during phase_flip ??. Not for all of it..
2761 FIXME When erasing a block, do I need an uninc credit? I usually don't
2762 have one and the need certainly isn't as great...
2764 Now... let's try to get free space accounting right.
2766 - unlink sometimes failed with ENOSPC
2767 - usage scan shows segments with enormous usage - 23039!!
2769 no credits: [ce9a55cc]16/1(2651)r11E:Pinned,Phase1,WPhase1,Valid,Dirty,SegRef,C,CI,CN,CNI,UninCredit,IOLock,PhysValid intable(7) ss(3) cluster(1)
2770 no credits: [cfb695cc]16/1(1840)r12E:Pinned,Phase0,WPhase0,Valid,Dirty,SegRef,C,CI,CN,CNI,UninCredit,IOLock,PhysValid intable(7) ss(4) cluster(1)
2772 no uninc credit [cfb695cc]16/1(2019)r13F:Pinned,Phase0,WPhase0,Valid,Dirty,SegRef,C,CI,CN,CNI,IOLock,PhysValid intable(7) ss(5) cluster(1)
2775 after umount/remount df says "4608 7 1544" but cannot
2777 df: tot=4608 free=4601 avail=1544(1564-20) cb=7 pb=0 ab=0
2778 ============= Cleanable table (7) =================
2779 pos: dev/seg usage score
2795 --------------- Free table (1) ---------------
2797 --------------- Clean table (0) ---------------
2798 CLEANABLE: 0/0 y=10 u=1
2799 CLEANABLE: 0/1 y=32775 u=3
2800 CLEANABLE: 0/2 y=32774 u=2
2801 CLEANABLE: 0/3 y=32773 u=1
2802 CLEANABLE: 0/4 y=0 u=0
2803 CLEANABLE: 0/5 y=32771 u=1
2804 CLEANABLE: 0/6 y=32770 u=6
2805 CLEANABLE: 0/7 y=32769 u=2
2806 CLEANABLE: 0/8 y=32768 u=3
2811 FIXED 1/ Data 16/1 is being Reallocated, but is Dirty, not Realloc
2812 Gone,presume FIXED 2/ Data 16/1 has no uninc credit in cluster_flush
2813 3/ in cleaner, ->dblock is uninitialised.... actually inode has been freed.
2814 4/ invalidate_page find Realloc set, even after iolock ..
2815 This is during umount in generic_shutdown/lafs_put_super/iput
2820 If we flag a block for Realloc then Dirty before it is allocated,
2822 But if we have already allocated to a cleaning cluster... what happens?
2823 We need to treat this like it was dirtied after being written, so
2824 it gets written to a regular cluster as well.
2825 As we only have one uninc bit for both Dirty and Realloc, we need
2826 to *not* incorporate the Realloc update if the block is still dirty.
2828 - block gets chosen for cleaning and allocated to a clean-cluster
2829 - block gets marked dirty. This must not clear Realloc
2830 - cluster is flushed, block is dirty, so don't call lafs_allocated_block
2831 - Return the Realloc credit, but keep dirty and Uninc.
2832 Is there a race if Dirty is set after we enter lafs_allocated_block?
2833 As long as the index block gets marked Dirty, not Realloc we might
2834 be safe... though it gets awkward if the Dirty writeout falls in to
2835 the next phase. But reserve_block will have provided NCredits for that.
2837 1/ don't clear Realloc when setting Dirty
2838 2/ do clear Realloc if cleaner finds the block is Dirty
2839 3/ avoid calling lafs_allocated_block when cleaning a dirty block.
2840 This is an optimisation.
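The three rules, plus the "cluster is flushed while the block is dirty" step, can be written down as a tiny flag model. This is only a sketch: the `TOY_*` flags and helper names are hypothetical stand-ins for B_Dirty, B_Realloc and the real code paths, and credits are not modelled.

```c
#include <assert.h>
#include <stdbool.h>

enum { TOY_DIRTY = 1 << 0, TOY_REALLOC = 1 << 1 };

/* Rule 1: setting Dirty must not clear Realloc. */
static unsigned toy_mark_dirty(unsigned flags)
{
	return flags | TOY_DIRTY;
}

/* Rule 2: if the cleaner finds the block Dirty it clears Realloc,
 * otherwise it flags the block for reallocation. */
static unsigned toy_cleaner_visit(unsigned flags)
{
	if (flags & TOY_DIRTY)
		return flags & ~TOY_REALLOC;
	return flags | TOY_REALLOC;
}

/* Rule 3: when a cleaning cluster is flushed, only incorporate the new
 * address if the block is not Dirty. Returns true when the caller
 * should go on to record the new address. Realloc is dropped either
 * way; Dirty is kept so a regular cluster still writes the block. */
static bool toy_clean_cluster_flushed(unsigned *flags)
{
	bool incorporate = !(*flags & TOY_DIRTY);

	*flags &= ~TOY_REALLOC;
	return incorporate;
}
```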
2842 Almost... A B_Realloc block no longer has B_Credit so B_Dirty cannot be
2847 When cleaning blocks we hold no reference to the inode and it can disappear.
2848 We don't want to hold the inode active, but need a reference much like
2849 the truncate code has.
2850 I think we need a subordinate refcount for both cleaning and truncate.
2851 These hold inode present but not active.
2852 Maybe every block->inode should be counted like this.
2853 And this might simplify the my_inode->dblock inter-relationship.
2855 We need to ensure that if a new iget is called on an inode that still
2856 exists, we don't allocate a new one but just reuse the old.
2857 But that won't work as we cannot add an inode back into the hash table.
2858 So I think when cleaning a block we need to ref the inode.
2859 i.e. B_Realloc implies an i_grab
2862 So I have a problem with the cleaner wanting to hold an inode that
2863 the VFS is destroying.
2864 I don't want the cleaner to hold i_count as that delays truncate etc.
2865 So we need a second counter subordinate to i_count.
2866 This is held by the cleaner and by delayed truncate, and by i_count.
2867 Possibly ->my_inode holds this, which means it can be a single bit...
2869 When a lookup wants an inode, we need to load the inode data block and
2870 see if it has my_inode. If it does, we insert that inode in to the
2871 hash table. If not we fall back to regular inode creation....
2873 On reflection, that is too complicated and hard and error prone.
2874 When relocating a file we need the data so it had best be in the page
2875 cache so the filesystem really needs to know that the inode is still
2877 So cleaning needs to keep a reference to the inode.
2878 The cost of this is that if an inode is being deleted while it is
2879 being cleaned the truncate cannot happen until the cleaning
2880 completes. This means that space usage will be wrong.
2881 When nlink becomes zero we can drop the cleaner reference. When
2882 the inode is dropped/destroyed we can tie the cleaning in with the
2883 delayed truncate so that the final destruction doesn't happen until
2884 the cleaner has let go.
2886 So: how to track that the cleaner has a reference to the inode?
2887 Maybe every B_Realloc block owns a ref on the inode.... but dropping
2888 those references when i_nlink hits zero would be difficult.
2889 They could hold a secondary refcount which, if non-zero, implies a
2893 - Set B_Cleaning when we look at a block for cleaning, and clear
2894 it when we find Realloc clear and ....????
2895 - Whenever a block has B_Cleaning set, it holds a counted reference
2896 on LAFSI(b->inode)->cleaner_ref
2897 - When cleaner_ref is non-zero and I_Deleting is not set, we hold
2898 a reference on the inode (i_grab).
2899 - when i_nlink hits zero, set I_Deleting and drop any reference
2900 held by the cleaner.
2901 DONE - cleaner must be careful not to process any block that has been
2902 truncated, or file that is dead.
2903 DONE - Make sure the cleaner doesn't start up after the FinalCheckpoint.
2904 - What about filesystem inode... how do they fit in??
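A rough model of the cleaner_ref idea from the list above, assuming the cleaner's hold on the inode is a single i_count reference that is taken or dropped whenever the condition "cleaner_ref is non-zero and I_Deleting is not set" changes. `toy_inode` and the helpers are hypothetical; real code would need locking around all of this.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy inode: i_count stands in for the VFS reference count. */
struct toy_inode {
	int i_count;
	int cleaner_ref;    /* number of blocks with B_Cleaning set */
	bool deleting;      /* I_Deleting */
	bool cleaner_holds; /* cleaner currently owns one i_count ref */
};

/* Make the cleaner's hold match the rule: held iff
 * cleaner_ref != 0 and I_Deleting is not set. */
static void toy_fix_cleaner_hold(struct toy_inode *ino)
{
	bool want = ino->cleaner_ref > 0 && !ino->deleting;

	if (want && !ino->cleaner_holds) {
		ino->i_count++;
		ino->cleaner_holds = true;
	} else if (!want && ino->cleaner_holds) {
		ino->i_count--;
		ino->cleaner_holds = false;
	}
}

static void toy_set_cleaning(struct toy_inode *ino)
{
	ino->cleaner_ref++;
	toy_fix_cleaner_hold(ino);
}

static void toy_clear_cleaning(struct toy_inode *ino)
{
	ino->cleaner_ref--;
	toy_fix_cleaner_hold(ino);
}

/* i_nlink hit zero: set I_Deleting and drop the cleaner's reference. */
static void toy_nlink_zero(struct toy_inode *ino)
{
	ino->deleting = true;
	toy_fix_cleaner_hold(ino);
}
```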
2907 Question. When are the index blocks for an inode flushed?
2908 We need to have them gone when the inode disappears.
2909 For deleted inodes, this happens in background truncate.
2910 For memory-pressure inodes it will hopefully happen well in advance,
2911 but we need to make sure in destroy_inode that everything is
2915 Thinking again about B_Cleaning, any B_Realloc block will hold a
2916 reference through to InoIdx and so dblock will be present and the
2917 inode won't be freed. So we only need an extra reference during
2918 the first little phase of cleaning when we are collecting blocks.
2919 After that a reference can be useful as it will delay flushing so it
2920 can be more efficient...
2922 Maybe this is all much simpler than I thought.
2923 If we hold a ref on the inode whenever the InoIdx block is Pinned
2924 and i_nlink is non-zero, then we won't be forgotten until all
2925 index blocks are written. We may still be deleted, but as that
2926 is one-way we can hold on to the inode at little cost.
2928 getting/putting that ref at exactly those times turns out to be
2930 It might be best to have a flag to say "We hold an extra ref".
2931 Then we occasionally call a function that validates the setting.
2932 It is most important to drop the count at the right time, so
2933 after unlink/rmdir/rename and when B_Pinned is dropped.
2936 set_phase which is called from:
2937 lafs_cluster_allocated when moving 'pin' across to data block
2938 so don't need checkpin
2940 only need check_pin if dropping spinlock
2942 only pins data blocks (Index are already pinned if relevant).
2944 where "inoidx block pinning" doesn't change
2947 do_incorporate_internal
2949 So only need check in lafs_pin_block_ph and maybe pin_all_children...
2952 - credits get out of sync from
2953 lafs_incorporate->refile->space_return from checkpoint.
2954 counter is one more than we can find.
2956 i [cfb9aaf0]327/0(2261)r1E:Index(1),Valid,PhysValid[0] NP
2957 Note it is an Index but not InoIdx. The parent is still in the tree.
2961 delete_inode -> truncate -> invalidate_page->erase_dblock->space_return
2964 - BUG credits<0 in space_return from lafs_incorporate from add_block_address
2966 Just Grew [cfbb5c70]331/0(NoPhysAddr)r2E:Index(2),Pinned,Phase1,InoIdx,Valid,Dirty,UninCredit{0,1}[0] child(1) inc(1)
2967 from [cfbb5a40]331/0(0)r205F:Index(1),Pinned,Phase1,WPhase0,Valid,Dirty,CN,CNI,UninCredit,IOLock,PhysValid{0,0}[0] child(203) leaf(2)
2968 msg: (1,3,1)(1,1,-1)
2970 ib = [cfbb5a40]331/0(0)r205F:Index(1),Pinned,Phase1,WPhase0,Valid,Dirty,CN,CNI,UninCredit,IOLock,PhysValid{0,0}[0] child(203) leaf(2)
2972 This is a predicted but not handled problem.
2973 The answer is that not all blocks need ICredit/UnincCredit.
2974 The purpose of this credit is to allow for a split in the parent.
2975 pre-existing index blocks can never split the parent themselves
2976 If an index block becomes full, it will split and this might split
2978 If an index block has free space, then it will only over flow if it
2979 gets multiple child updates and this will provide multiple credits.
2980 So an index block with space for 3 or more new addresses does not need
2981 an ICredit/UnincCredit. So when we split we don't need to provide an
2984 When we have a full InoIdx block and a single new child with 1 UnincCredit,
2985 each block already is either 'Dirty' or has a 'Credit', and the InoIdx has
2986 an ICredit, then create a new intermediate such that
2987 InoIdx is Dirty and has an ICredit
2988 New Index is Dirty with no ICredit - it used the UnincCredit
2989 New child loses its UnincCredit
2990 When another block in the new index arrives, its UnincCredit is used to
2993 When a leaf block cannot fit a single address it will have ICredit.
2994 The block is split so that each has 3 spaces and so do not need ICredit,
2995 but as soon as ICredit is available, they take it.
2997 Worst case is that every ancestor is full and the leaf is split
2998 We then get two full branches, each block half empty so not needing ICredit.
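The "3 or more free slots means no ICredit is needed" rule and the effect of a split can be sketched as below. The threshold of 3 comes from the text; the even split and the helper names are assumptions made for illustration only.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_MIN_FREE_SLOTS 3 /* threshold taken from the notes above */

/* An index block needs an ICredit/UnincCredit only when it has room
 * for fewer than 3 new addresses. */
static bool toy_needs_icredit(int capacity, int used)
{
	return capacity - used < TOY_MIN_FREE_SLOTS;
}

/* Split a block evenly (an assumption): returns the entries kept in
 * the original block and stores the number moved to the new block. */
static int toy_split(int used, int *moved)
{
	*moved = used / 2;
	return used - *moved;
}
```

For a block with room for 10 addresses, a full block needs the credit, but after an even split both halves hold 5 and neither needs one until they fill up again.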
3002 free data being used in lafs_refile from cleaner.
3003 b->inode->i_sb is 0x6b6b6b6b, so inode has been freed before cleaner frees it.
3004 Answer: lafs_refile was dereferencing ->inode when it wasn't safe.
3005 Need to at least have a parent before it is safe.
3008 soft lockup cleaner->lafs_iget->ifind_fast ....
3009 Then (may be caused)
3010 Oh dear: [cfb63670]284/0(0)r1E:IOLock,PhysValid cleaning(1)
3011 .......: [cfb2ad20]284/0(2198)r2F:Index(1),Pinned,Phase1,InoIdx,Valid,PhysValid{0,0}[0] child(1) leaf(1)
3012 Why have I no credits? [cfb2ad20]284/0(2198)r2F:Index(1),Pinned,Phase1,InoIdx,Valid,Dirty,PhysValid{0,0}[0] child(1) leaf(1)
3013 ------------[ cut here ]------------
3014 kernel BUG at /home/neilb/work/nfsbrick/fs/module/block.c:656!
3016 It seems the cleaner gets confused and goes spinning.
3020 After the run, we have -14 used and 2055 available (of 4608), and
3021 cannot create anything.
3022 4 segments are free, one is cleanable.
3023 free_blocks=2103 allocated=56 max_seg=512 clean_reserved=0
3025 free_blocks=1722 allocated=64 max_seg=512 clean_reserved=0
3027 df: tot=4608 free=4630 avail=977(1033-56) cb=10 pb=0 ab=-32
3028 free_blocks=1033 allocated=56 max_seg=512 clean_reserved=0
3029 and very little free
3031 ablocks_used is going negative - why?
3032 Probably we erase a dblock without clearing Prealloc.
3033 Then when Prealloc later gets cleared, ablocks_used is
3034 wrongly decremented.... no...
3037 10aug2009 (don't forget above problems)
3039 read->touch_atime->dirty_inode->inode_fillblock->dirty_dblock
3040 getiref_lock triggers BUG.
3041 This is presumably because I have just fixed it to get the correct
3042 iblock and not the iblock of the filesystem.
3044 FIXME I hacked around this but I'm not sure the result is right.
3045 The question is about when the InoIdx should be dirty and when
3046 the inode data block should be dirty.
3047 In this particular case we are writing a page of a small file.
3048 cluster_allocate calls flush_data_to_inode which tried to dirty
3049 the inode dblock but finds that iblock is not pinned...
3050 When we dirty a data page we aren't pinning the parent!
3051 That might be OK - we only need to count and reserve the parent.
3052 We don't need to pin it until it becomes dirty.
3054 Still need to resolve when which block gets to be dirty, and also
3055 exactly when an index block needs to be pinned. And how does that
3056 relate to holding a ref on the inode when the inoidx is pinned.
3057 Maybe it should be when the inoidx is referenced.
3061 Another problem. unlink->handle_orphans->erase_dblock->allocated_block
3062 and get a zero from lafs_add_block_address but parent is not pinned.
3063 And... On unmount, orphan file still has pinned blocks so the inode
3065 And ... root still old phase after lots of 'rm' then sync.
3066 Inode 244 has pinned inode block held by writepage0 and writepage
3070 - lots of bugs introduced by change to marking inode blocks dirty:
3071 writepage/cluster_allocate wants to Dirty inode data block with no credits.
3072 because I put credit in iblock!
3074 - ohhh.... The phase contour is broken. When a block is added to a
3075 cluster for allocation it isn't in the phaseleafs any more, but prevents
3076 its parent from joining. So we cannot assume that if dblock is on
3077 list then iblock or a child will be too.
3078 So when we find dblock we do need to remove it.... done that.
3080 - root not changing because Data 1/0 is Pinned and IOPending
3081 and held by writepage!!
3082 Problem is that IOPending blocks aren't put back on lru.
3083 But that should only be blocks on the cluster list.....
3084 But that is where I am putting it.
3085 Maybe I need exclusion between checkpointing and any other
3086 code that writes to checkpoint so checkpoint can wait
3087 for that ... can we use wc->lock?? That doesn't lock
3088 against cleaner, but that isn't a problem...
3089 But now 0/228 is still pinned and in writepage and IOPending
3090 So there is more to it than that.
3091 When checkpoint finds an IOLocked block, it might be about to
3092 join a cluster, in which case we don't really want to wait, or it
3093 might be undergoing incorporation in which case we want to wait.
3094 or it could be being erased, so wait..
3095 Maybe I wait until it appears on some list.... yes.
3098 At unmount Index 8/0 with child and leaf is still pinned
3099 This was pinned: [cfb29810]8/0(9)r3F:Index(1),Pinned,Phase0,InoIdx,Valid,SegRef,C,CI,CN,CNI,PhysValid{0,0}[0] release(1) child(1) leaf(1)
3103 A problem is that something goes wrong in the erase process.
3104 We find new children after we erase the inoidx block!
3106 This was pinned: [cfb3d810]8/0(9)r284016F:Index(1),Pinned,Phase0,InoIdx,Valid,SegRef,C,CI,CN,CNI,PhysValid{0,0}[0] release(1) child(1) leaf(284014)
3108 When/how do we erase indexblock and particularly inoidx blocks?
3109 Does an Invalid InoIdx simply mean there is no indexing and does not
3110 reflect on the Data block?
3112 .xlooping on [cfbe28c0]331/0(0)r2F:Index(1),Pinned,Phase1,InoIdx,Valid,Dirty,SegRef,C,CI,CN,CNI,UninCredit,PhysValid{0,0}[0] inode_handle_orphan(1) leaf(1)
3117 VFS: Busy inodes after unmount of hdb. Self-destruct in 5 seconds. Have a nice day...
3118 This was pinned: [cfb2ad20]0/0(13)r5F:Index(1),Pinned,Phase1,InoIdx,Valid,Root,C,CI,CN,CNI,PhysValid{0,0}[0] NP release(1) child(3) leaf(1)
3119 [cfb2ad20]0/0(13)r5F:Index(1),Pinned,Phase1,InoIdx,Valid,Root,C,CI,CN,CNI,PhysValid{0,0}[0] NP release(1) child(3) leaf(1) Leaf1(0)
3120 [cfa5374c]0/1(772)r0E:Valid,Dirty,UninCredit,PhysValid
3121 [cfb54430]0/8(775)r0E:Valid,Dirty,SegRef,UninCredit,PhysValid
3122 [cfb54c90]0/16(777)r0E:Valid,Dirty,SegRef,C,CI,UninCredit,PhysValid
3127 VFS: Busy inodes after unmount of hdb. Self-destruct in 5 seconds. Have a nice day...
3128 This was pinned: [cfb273b0]8/0(9)r3F:Index(1),Pinned,Phase1,InoIdx,Valid,SegRef,C,CI,CN,CNI,PhysValid{0,0}[0] release(1) child(1) leaf(1)
3129 [cfb273b0]8/0(9)r3F:Index(1),Pinned,Phase1,InoIdx,Valid,SegRef,C,CI,CN,CNI,PhysValid{0,0}[0] release(1) child(1) leaf(1) Leaf1(0)
3130 [cfb74c90]8/0(2404)r1E:Valid,SegRef,CN,PhysValid orphan(1)
3131 badcnt 0 0 [cfb268c0]0/0(13)r4E:Index(1),Pinned,Phase1,InoIdx,Valid,Root,C,CI,CN,CNI,PhysValid{0,1}[0] NP child(4)
3134 erase Inoidx [ce5ab3b0]172/0(402)r1F:Index(1),InoIdx,Valid,IOLock,OnFree,PhysValid[0] inode_handle_orphan(1)
3135 erase Inoidx [ce5ab5e0]74/0(0)r2F:Index(1),Pinned,Phase0,WPhase0,InoIdx,Valid,Dirty,SegRef,C,CI,CN,CNI,UninCredit,IOLock,PhysValid{0,0}[0] child(1) leaf(1)
3136 ------------[ cut here ]------------
3137 WARNING: at /home/neilb/work/nfsbrick/fs/module/block.c:579 lafs_erase_iblock+0x
3138 unlink/orphan/erase_dblock_allocated_block
3139 ---[ end trace 61b8bd59512ea4da ]---
3140 zz [ce50d6a8]74/1059005010(0)r1E:SegRef,C,CI,UninCredit,IOLock,PhysValid,Orphan(0) orphan(1)
3141 [ce5ab5e0]74/0(0)r1E:Index(1),InoIdx,SegRef,C,CI,CN,CNI,UninCredit,PhysValid[0] child(1)
3142 [ce5ab5e0]74/0(0)r1E:Index(1),InoIdx,SegRef,C,CI,CN,CNI,UninCredit,PhysValid[0] child(1)
3143 ------------[ cut here ]------------
3144 kernel BUG at /home/neilb/work/nfsbrick/fs/module/index.c:1955!
3146 BINGO. When we remove last entry from directory we erase the InoIdx block,
3147 then when we add entries, we hit problems.
3155 VFS: Busy inodes after unmount of hdb. Self-destruct in 5 seconds. Have a nice day...
3157 This was pinned: [cfb2caf0]16/0(10)r3F:Index(1),Pinned,Phase1,InoIdx,Valid,SegRef,C,CI,CN,CNI,PhysValid{0,0}[0] release(1) child(1) leaf(1)
3158 [cfb2caf0]16/0(10)r3F:Index(1),Pinned,Phase1,InoIdx,Valid,SegRef,C,CI,CN,CNI,PhysValid{0,0}[0] release(1) child(1) leaf(1) Leaf1(0)
3159 [ce9794f0]16/0(2200)r1E:Valid,SegRef,CN,CNI,PhysValid cleaning(1)
3161 This was pinned: [cfb2d3b0]8/0(9)r3F:Index(1),Pinned,Phase1,InoIdx,Valid,SegRef,C,CI,CN,CNI,PhysValid{0,0}[0] release(1) child(1) leaf(1)
3162 [cfb2d3b0]8/0(9)r3F:Index(1),Pinned,Phase1,InoIdx,Valid,SegRef,C,CI,CN,CNI,PhysValid{0,0}[0] release(1) child(1) leaf(1) Leaf1(0)
3163 [ce968c90]8/0(2175)r3E:Valid,SegRef,C,CI,CN,CNI,PhysValid orphan(3)
3165 This was pinned: [cfb2d180]1/0(8)r3F:Index(1),Pinned,Phase1,InoIdx,Valid,SegRef,C,CI,CN,CNI,PhysValid{0,0}[0] release(1) child(1) leaf(1)
3166 [cfb2d180]1/0(8)r3F:Index(1),Pinned,Phase1,InoIdx,Valid,SegRef,C,CI,CN,CNI,PhysValid{0,0}[0] release(1) child(1) leaf(1) Leaf1(0)
3167 [ce968430]1/0(2174)r1E:Valid,SegRef,C,CI,PhysValid cleaning(1)
3169 We have stray 'cleaning' references.
3171 on a data block that was in a to-clean segment
3172 at which point we igrab the inode
3173 the block is put on the ->cleaning list.
3175 when we get an error finding the block
3176 when we find that it isn't in the segment
3177 when an error occurs loading the block-to-be-relocated
3178 and when we mark that block for cleaning.
3179 i.e. always unless we got EAGAIN or some space error.
3180 If we still hold some blocks, try_clean returns 0.
3182 VFS: Busy inodes after unmount of hdb. Self-destruct in 5 seconds. Have a nice day...
3183 This was pinned: [cfb253b0]0/0(13)r5F:Index(1),Pinned,Phase1,InoIdx,Valid,Root,C,CI,CN,CNI,PhysValid{0,0}[0] NP release(1) child(3) leaf(1)
3184 [cfb253b0]0/0(13)r5F:Index(1),Pinned,Phase1,InoIdx,Valid,Root,C,CI,CN,CNI,PhysValid{0,0}[0] NP release(1) child(3) leaf(1) Leaf1(0)
3185 [cfa57b7c]0/1(2228)r0E:Valid,Dirty,UninCredit,PhysValid
3186 [ce5a4430]0/8(2231)r0E:Valid,Dirty,UninCredit,PhysValid
3187 [ce5a4c90]0/16(1028)r0E:Valid,Dirty,SegRef,C,CI,UninCredit,PhysValid
3189 NOTE these inode data blocks are not pinned and so did not get written!!
3191 FIXME I should wait for the checkpoint to finish
3195 VFS: Busy inodes after unmount of hdb. Self-destruct in 5 seconds. Have a nice day...
3196 This was pinned: [cfb27810]8/0(9)r3F:Index(1),Pinned,Phase0,InoIdx,Valid,SegRef,C,CI,CN,CNI,PhysValid{0,0}[0] release(1) child(1) leaf(1)
3197 [cfb27810]8/0(9)r3F:Index(1),Pinned,Phase0,InoIdx,Valid,SegRef,C,CI,CN,CNI,PhysValid{0,0}[0] release(1) child(1) leaf(1) Leaf0(0)
3198 [ce564c90]8/0(3983)r1E:Valid,SegRef,PhysValid orphan(1)
3201 When I clean and find an inode that is already deleted, I need to be
3202 very careful not to resurrect anything.. I wonder if I am.... Yes, I seem
3203 to be. lafs_delete_inode gets called a lot, but mostly for dead inodes.
3206 FIXED orphans don't get cleaned up. It seems a 'create' fails and leaves
3207 an orphan block un-released.
3208 - sometimes inodes 1,8,16 didn't get written out as they were dirty but not pinned
3209 - Not sure that we handle complete truncation, then adding blocks properly.
3210 - what should the state of the InoIdx block be?
3211 - On remount, the filesystem contains rubbish.
3212 - create fails even when there should be free space.
3213 - sometimes BUG in checkpoint.c - not finishing checkpoint properly...
3214 - iblock not valid for inode 327 under cluster_flush/lafs_allocated_block
3215 and 74 has a similar issue
3216 327 = adir/big1 74=adir
3220 Segusage blocks aren't always Pinned when we make them dirty.
3221 Yes. That is correct. They are not forced out by phase change but by
3222 lafs_seg_flush_all at the end of a checkpoint. So they need to be
3223 preallocated, but not Pinned.
3224 But, once we have finished the last checkpoint we don't want to
3225 dirty Segusage blocks any more.. I wonder if we are.
3226 No, but we were Pinning inodes without PinPending and they
3227 lost the pinning straight away!
3229 OK, other annoyance.
3230 InoIdx block and similar are getting erased at the wrong
3232 We can only safely erase them when they have no children.
3233 I guess what we really want is the incorporation leaves them
3234 existing but empty, and when we go to write them out, if they
3235 are empty we register an address of 0.
3236 When we drop the ->parent pointer of an Index block it
3239 When incorporate or truncate produces an empty index block
3240 it simply clears B_Valid.
3241 When incorporate wants to add to an index block, we set B_Valid.
3242 When cluster_allocate gets a non-Valid index block it calls
3243 block_allocated with phys of 0.
3245 Yes, that seems to work. Mostly
3248 On remount, check_credits dies: 16/20-0
3249 In pin_dblock/reserve/seg_ref/prealloc/space_return ?? in lafs_mount.
3252 OK, this index block clearing is a mess. There must be a neat model I can
3253 follow that will make it "just work".
3254 The key seems to be children. If an index block has children, then it
3255 really must exist. If it has no children and no content, then it can
3256 be discarded, in which case it needs to be unlinked from its sibling list.
3257 What locking do we use here? Probably IOLock on the parent index block.
3258 So we need iolock while looking in a parent for children, and we take
3259 IOLock while incorporating or pruning.
3260 Once the empty index block has dropped out it will never be found again.
3261 When we incorporate the zero address, the index block becomes invisible
3262 unless it is shortly after its predecessor in the sibling list. But
3263 that is hard to ensure, especially if the first child is the one that
3264 is being erased. So if an index block is erased, then it must be
3265 discarded quickly and any children need to be relocated...
3266 Or maybe not.... maybe if there are children, we just write an empty block?
3269 We need better locking of the index information.
3270 It seems best to use IOLock as that is already held during incorporation.
3271 So any code that accesses or updates an index block must hold IOLock.
3272 This might be a bit of a restriction if we try to do a lookup while
3273 writeout is happening.... Maybe we need a separate writeback flag for that.
3274 But I think it is good to use IOLock for now.
3275 Places we need this are:
3276 flush_data_to_inode needs to lock the InoIdx block
3278 lafs_leaf_find as it recurses down. This should return a locked leaf.
3280 callers of clear_index
3281 erase_dblock for depth=0??
3283 incorporate should lock new blocks for consistency
3286 Locking dependency rule is that if we hold a lock, we are allowed to
3287 lock a child index block, but not a parent. If we hold a data block,
3288 we are allowed to lock an index block.
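
A minimal sketch of the dependency rule above, written as a small Python model rather than kernel code. The `Block` class, `is_descendant` and `may_lock` are hypothetical names for illustration only. Two choices are assumptions not stated in the notes: "child" is read as "any index block below the held one", and locking a data block while holding something else is treated as not permitted.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(eq=False)
class Block:
    """Hypothetical stand-in for a block; only the fields the rule needs."""
    name: str
    is_index: bool
    parent: Optional["Block"] = None  # parent index block (None for InoIdx)


def is_descendant(block: Block, ancestor: Block) -> bool:
    """True if 'ancestor' is reachable from 'block' by following ->parent."""
    p = block.parent
    while p is not None:
        if p is ancestor:
            return True
        p = p.parent
    return False


def may_lock(held: Block, wanted: Block) -> bool:
    """May we take the lock on 'wanted' while already holding 'held'?"""
    if not wanted.is_index:
        return False  # assumption: the rule only covers locking index blocks
    if not held.is_index:
        return True   # holding a data block: an index block may be locked
    # Holding an index block: only blocks below it, never a parent.
    return is_descendant(wanted, held)
```

A checker like this could be called from a debug assertion wherever a second lock is taken, so that a parent-after-child attempt is caught early instead of showing up as a deadlock.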
3291 The read/write completion seems all wrong. It unlocks if the page was locked,
3292 and that isn't really safe, because it might not have been locked for read..
3293 We need to flag block0 to say if lock or writeback need to be cleared.
3294 Given that, I don't need IOPending any more:
3295 Read: We submit all reads, then set 'do_unlock', then check if we should unlock.
3296 Write: We queue all writes, then set 'do_clear_writeback', then check.
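
A toy model of the read side of this scheme, assuming (this is not in the notes) that block0 also carries a count of I/Os still in flight. `PageIO`, `outstanding` and the method names are hypothetical. The point it tries to show is that the unlock happens exactly once, whichever of "last completion" and "all reads submitted" comes last.

```python
class PageIO:
    """Toy per-page state kept on 'block0' (all names are hypothetical)."""

    def __init__(self):
        self.outstanding = 0    # assumed counter of reads still in flight
        self.do_unlock = False  # set once every read has been submitted
        self.locked = True      # the page was locked for this read
        self.unlock_calls = 0   # how many times we actually unlocked

    def submit_read(self):
        self.outstanding += 1

    def _maybe_unlock(self):
        # Unlock only when submission is finished AND nothing is in flight.
        if self.do_unlock and self.outstanding == 0 and self.locked:
            self.locked = False
            self.unlock_calls += 1

    def read_done(self):
        """Completion handler for one read."""
        self.outstanding -= 1
        self._maybe_unlock()

    def all_submitted(self):
        """Called by the submitter after the last submit_read()."""
        self.do_unlock = True
        self._maybe_unlock()
```

The write side would look the same with 'do_clear_writeback' in place of 'do_unlock'. In real code the counter and flag would need to be updated atomically or under a lock; the model ignores that.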
3298 Now... can we use a writeback flag to avoid waiting to read while writeout
3299 is happening? We would need:
3300 set writeback in cluster_allocate
3301 wait_writeback after some lock_block
3302 clear_writeback when writeout finishes.
3303 Extra checks where we already check for IOLock
3307 Lots of progress but....
3308 cluster_flush calls cluster_done calls refile calls iput calls
3309 drop_inode calls write_inode_now calls writepage calls cluster_flush
3310 and we get a locking loop.
3311 I think we need to run that cluster_done from a different thread.
3314 We seem to have a refcnt problem with segsum.
3317 Lots more progress but.....
3319 orphan_release is finding that the orphan block has no credits.
3320 We can allocate credits and simply not do the update if they
3321 are not available: having an extra entry in the orphan file isn't
3322 a problem. However we need some mechanism to clean up other than
3323 waiting for a remount..
3324 I think we leave that until we redo orphan handling.
3326 and: adir sometimes loses one block so it and the contents don't get
3329 and: it seems we sometimes try to clean the segment being written
3330 to. We must avoid that.
3333 FIXME When pin fails, we need to remove PinPending from everything!!!
3334 and never followed up ... I wonder?
3339 Every orphan block goes on a per-fs list and gets removed only
3340 if the B_Orphan bit is clear.
3341 There are two times when we want to expedite orphan handling.
3342 1/ on rmdir we need to know if the directory is really empty.
3343 This requires that we expedite the orphan handling of all
3344 blocks. As soon as we find a non-orphan, we can give up.
3345 Then we need to make sure the index tree has collapsed. We
3346 can borrow that code from truncate.
3348 2/ When writing past Trunc_next. We just pass the block to
3349 special orphan handling.
3351 This requires that orphan handling is re-entrant.
3352 For dir, that is protected by i_mutex, but rmdir needs to come
3354 For trunc, the iolock on the index blocks should be enough.
3355 I wonder if IOLock can be used on dir as well... allowing
3356 parallel orphan handling in the one dir even!!.
3358 We need to ensure exclusion of orphan handling, including:
3359 - only one orphan handler at a time
3360 - don't run orphan handler while still processing action
3361 that makes it an orphan.
3362 Maybe if we just use IOLock for that? Does that work? Maybe
3363 but it gets messy for directories (on first attempt anyway).
3364 For directories we can just use i_mutex.
3365 Maybe i_mutex for files as well?
3368 Orphan handling is going well... but not perfect.
3369 I'm using IOLock to ensure exclusion for orphan handling.
3371 I'm not really implementing that on directories
3372 Inodes go bad because lafs_erase_dblock needs the lock too.
3373 The call from rmdir will always fail because we hold i_mutex.
3375 Bigger problem. I'm IOLocking inodes across checkpoints to preserve
3376 Orphan status. But that might stop the checkpoint proceeding.
3377 .. so use i_mutex, not IOLock - fine.
3379 Now... it seems I've confused myself. Orphans don't get handled
3380 immediately. In particular, inodes should not be handled until
3381 their final delete_inode. So setting the B_Orphan flag and putting
3382 on the list are two separate events. The flag must come first,
3383 but the list may come much later. So some of that mucking around
3384 with i_mutex is pointless.
3386 make_orphan makes sure it is in orphan file, sets bit, and removes
3387 from list (if present).
3388 add_orphan puts it on the list for handling.
3390 For inodes: lafs_new_inode sets the bit and delete_inode puts on queue,
3391 as does any unlink/rmdir/rename that fails.
3393 For directories: put it on list in commit/abort.
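
The split between setting the flag and queueing for handling can be modelled in a few lines. This is only a sketch: `OrphanState` and its fields are hypothetical, and raising an error when a block is queued before it is flagged is my reading of "the flag must come first", not something the notes specify.

```python
class OrphanState:
    """Toy model of the two separate orphan events (names hypothetical)."""

    def __init__(self):
        self.orphan_file = set()  # blocks recorded in the orphan file
        self.flagged = set()      # blocks with B_Orphan set
        self.pending = []         # per-fs list of blocks awaiting handling

    def make_orphan(self, block):
        # Make sure it is in the orphan file, set the bit,
        # and take it off the handling list if it was there.
        self.orphan_file.add(block)
        self.flagged.add(block)
        if block in self.pending:
            self.pending.remove(block)

    def add_orphan(self, block):
        # Queue for handling; only valid once the bit has been set.
        if block not in self.flagged:
            raise ValueError("B_Orphan must be set before queueing")
        if block not in self.pending:
            self.pending.append(block)
```

In this model lafs_new_inode would correspond to make_orphan() and delete_inode (or a failed unlink/rmdir/rename) to add_orphan().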
3397 I hit the BUG where find_leaf wants and address of 0.
3398 If an index block gets cleaned out it doesn't disappear
3399 immediately.. there is no leaf to find in that direction.
3400 We probably need to avoid non-Valid blocks or something...
3402 Orphans 0/299 to 0/329 and 0/280 are still on the list
3403 but are not orphans.
3404 Maybe I need to catch mutex_unlock to run the orphans??
3406 We underflow a segment through orphans at unmount.
3407 We are cleaning and truncating at the same time.
3408 The same block gets allocated to 0 and to 1225
3409 in quick succession.
3410 Problem is that we apply new address while in writeback
3411 so a new lafs_allocated_block
3415 Review of inodes in orphan list:
3416 lafs_new_inode makes an orphan for a non-existent inode.
3417 If the inode cannot be created, orphan_release is called.
3418 If it can, a 'struct inode' is filled in with valid type
3419 and nlink==1 (!!) and attached. The inode will only be
3420 detached when the refcnt hits 0, and the orphan list implies
3421 a refcount, so if we ever find something on the orphan list
3422 with a NULL my_inode, it must be very new and can be ignored.
3424 When we find an inode block with a my_inode there are a few options:
3425 if I_Trunc is set, we must progress truncation providing we can
3427 else if I_Deleting we must delete the inode
3428 else if nlink is 0, we remove from the list
3429 else nlink > 0 and we must remove orphan status.
3430 This means that if nlink is elevated, we need to be holding the mutex...
3431 So don't elevate nlink any more...
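
The options above form a simple priority chain, sketched here as a Python function. `Action` and `orphan_action` are hypothetical names. The condition attached to the I_Trunc case ("providing we can ...") is cut off in the notes, so the sketch does not try to model it.

```python
from enum import Enum


class Action(Enum):
    IGNORE = "very new, ignore"
    TRUNCATE = "progress truncation"
    DELETE = "delete the inode"
    DROP_FROM_LIST = "remove from the list"
    CLEAR_ORPHAN = "remove orphan status"


def orphan_action(has_my_inode: bool, i_trunc: bool,
                  i_deleting: bool, nlink: int) -> Action:
    """Pick what the orphan handler does with an inode block on the list."""
    if not has_my_inode:
        return Action.IGNORE          # NULL my_inode: must be very new
    if i_trunc:
        return Action.TRUNCATE        # (extra precondition elided in notes)
    if i_deleting:
        return Action.DELETE
    if nlink == 0:
        return Action.DROP_FROM_LIST  # still an orphan, nothing to do yet
    return Action.CLEAR_ORPHAN        # nlink > 0: no longer an orphan
```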
3433 When nlink becomes non-zero the block needs to be put back on the
3434 orphan list (it must already be an orphan). Also when we set
3435 I_Deleting or I_Trunc it must go on the list.
3436 .. OK, I think I have all of that.
3440 I have some weirdness that seems to be caused by the orphan stuff,
3441 probably due to it all being async now.
3442 - A deleted inode clears I_Trunc and then sets it again. The only
3443 explanation seems to be that delete_inode is being called again,
3444 so I must be igrabbing it again, maybe from cleaning.
3445 - bits of directories aren't getting deleted. Sometimes single
3446 blocks, though the referred files are deleted. Sometimes
3447 the whole directory... More interestingly, those blocks then
3448 don't get cleaned, so something about them means that they
3449 don't get deleted and don't get cleaned either.
3451 Even weirder... I just had a case where file 331 had a different
3452 index block for every 4 data blocks...
3456 - What stops pinned blocks from being flushed by bdflush in middle
3457 of operation and so losing allocation? Must make sure to set
3458 them dirty very late.
3459 - orphan_release can fail, so must make sure we can always call
3460 it, even if my_inode is NULL.... but how?
3463 - make_orphan could fail due to lack of space, which is not OK.
3464 I made it loop, but I'm not 100% sure that is right... it isn't.
3465 I need to pass down the 'I'm freeing space' flag, and I need to
3466 not require that Credit or Dirty is set, etc.
3469 - I seem to have a deadlock at unmount.
3470 umount is waiting for lafs_checkpoint_lock_wait in
3472 pdflush is in down_read in sync_supers
3473 lafs_cleaner is iget_locked/ifind_fast/inode_wait
3474 This is waiting for I_LOCK to be clear.
3478 - When a file shrinks and becomes level-0, make sure
3479 old addresses get deallocated. I seem to have
3480 a directory where they didn't.
3482 - Due to the fact that we over-preallocate, we really shouldn't
3483 return ENOSPC until we have flushed dirty data and performed
3487 - When I removed the last index from an inode
3488 (Indirect type) it seems that I didn't write
3489 out the corrected block..??
3492 I ran my simple test run repeatedly overnight.
3493 It ran 208 times before I stopped it.
3494 There are 3 possible failure modes:
3495 1/ didn't complete within 500 seconds
3497 3/ appeared to complete, the number of blocks
3498 in use was not the correct '7'.
3500 74 (35%) did not fail!
3501 31 () did not complete
3502 40 () triggered a BUG
3503 2 did not complete but did not trigger a bug
3505 94 of those that failed did not have a BUG
3506 92 actually completed. Of these:
3518 1 BUG: sleeping function called from invalid context at kernel/nsproxy.c:217
3519 1 BUG: spinlock lockup on CPU#0, rm/1330, cfb2dae4
3520 1 kernel BUG at /home/neilb/work/nfsbrick/fs/module/block.c:485!
3521 1 kernel BUG at /home/neilb/work/nfsbrick/fs/module/cluster.c:1219!
3522 1 kernel BUG at /home/neilb/work/nfsbrick/fs/module/inode.c:821!
3523 2 BUG: soft lockup - CPU#0 stuck for 61s! [lafs_cleaner:1177]
3524 3 kernel BUG at /home/neilb/work/nfsbrick/fs/module/segments.c:1028!
3525 3 kernel BUG at /home/neilb/work/nfsbrick/fs/module/segments.c:351!
3526 5 kernel BUG at /home/neilb/work/nfsbrick/fs/module/lafs.h:276!
3527 6 kernel BUG at /home/neilb/work/nfsbrick/fs/module/block.c:529!
3528 7 BUG: unable to handle kernel paging request at 6b6b6bfb
3529 11 kernel BUG at /home/neilb/work/nfsbrick/fs/module/super.c:655!
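
As a cross-check on the run counts quoted in the summary before this tally (208 runs, 74 without failure, 40 with a BUG, 2 hung without a BUG), the arithmetic can be redone in a few lines. The variable names are mine; the numbers are taken from the notes.

```python
runs = 208
no_failure = 74
triggered_bug = 40
hung_without_bug = 2

failed = runs - no_failure                    # 134
failed_without_bug = failed - triggered_bug   # 94, matches the notes
completed = failed_without_bug - hung_without_bug  # 92, matches the notes


def percent(n: int) -> int:
    """Whole-number percentage of all runs, truncated like the '35%' above."""
    return 100 * n // runs
```

With truncation, percent(74) gives the 35 quoted in the notes; the same function can be applied to the counts whose percentages were left blank.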
3532 super.c:655 is "block is still pinned" at unmount time.
3533 The block was always an InoIdx with a child.
3534 Either inode 0 or 16.
3535 child is held by various things:
3536 [cfb555cc]16/1(2098)r131E:Valid,Async,SegRef,CN,CNI,UninCredit,PhysValid async(1) clean2(130)
3537 [cfb554f0]16/0(1050)r25E:Valid,SegRef,CN,CNI,PhysValid clean2(25)
3538 [cfa57c58]0/2(3676)r0E:Valid,Dirty,UninCredit,PhysValid
3539 [cfa5bc58]0/2(3110)r0E:Valid,Dirty,UninCredit,PhysValid
3540 [ce5b94f0]16/0(519)r1E:Valid,Async,SegRef,CN,CNI,PhysValid async(1)
3541 [cfb4d4f0]16/0(4249)r1E:Valid,Async,SegRef,CN,CNI,PhysValid async(1)
3542 [ce5ad4f0]16/0(612)r1E:Valid,Async,SegRef,CN,CNI,PhysValid async(1)
3543 [ce5c2fc8]0/74(0)r129E:SegRef,C,Claimed,PhysValid clean2(129)
3544 [cfa57c58]0/2(1895)r0E:Valid,Dirty,UninCredit,PhysValid
3545 [cfb4d5cc]16/1(4543)r105E:Valid,SegRef,CN,CNI,UninCredit,PhysValid clean2(105)
3546 [ce5754f0]16/0(1290)r178E:Valid,SegRef,CN,CNI,PhysValid clean2(178)
3548 The "unable to handle kernel paging request" is always in
3550 invalidate_inode_buffers(26/46)/lock_acquire
3554 This is iblock valid when erasing a block
3555 The block we are erasing is always 0/327 or 0/328. It is
3556 an orphan we are handling, iolocked but not always pinned
3559 Map an iblock which is not IOLocked
3560 always in lafs_clear_index for the InoIdx block for a directory
3561 which is in Writeback.
3562 Call is in lafs_allocated_block from cluster_flush.
3565 seg_inc reduces seg usage below 0
3566 - lots of blocks (inode 327) that were cleaned, were then erased twice.
3567 - 2 blocks (inode 328) were erased twice, both from prune
3571 The free list is empty.... odd as only first segment is currently
3575 Still orphan: 0/328 Index(1) is in Writeback and Dirty
3576 again inode_handle_orphan2 is in Writeback
3579 inode_handle_orphan at end, child list is not empty.
3580 The children seem to be in Realloc - cleaner needs to let go.
3583 my_inode is null while cluster_flush an inode and want to set
3588 no ICredit for unincredit in dirty_dblock from dir_delete_commit
3592 spinlock lockup is subsequent to the real bug
3593 ditto for sleeping function.
3595 Of the '44' which claimed final blocks of 9, 14 really had 7, and 4
3596 appear to have other strange values....
3598 A selected '9' has two extra blocks for the directory '74'.
3599 But that directory is long gone.
3600 These dir blocks are currently fully populated with numbers.
3601 This seems to be the pattern with all non-7 blocks.
3605 Found a problem, possibly related to the dir blocks not being
3607 When lafs_incorporate sets ->depth to 1 it doesn't dirty the inode,
3608 so that fact is never copied in to the datablock.
3609 On further exploration, the I_Dirty bit is set but never used, which
3611 So: exactly when do we copy inode into datablock, and what do we do
3612 when dirty_inode is called (if anything).
3613 We could just set I_Dirty when dirty_inode is called, checking that
3614 the block is Pinned which it usually will be.
3615 Then we copy inode to data just before writing data block.
3616 However that defeats transactional properties. We need to copy in the
3617 same transaction, and that means either straight away, or when
3618 the data block's phase changes.
3619 So dirty_inode either copies to the block, or sets I_Dirty.
3620 When lafs_refile unpins an inode data block, it needs to check
3621 I_Dirty and possibly re-dirty it.
3623 To redirty it we must steal the NCredits. Any further dirty attempt
3624 will have to allocate more.
3625 The stealing is done automatically by dirty_dblock, so we just flip
3626 the phase and call dirty_inode ... making sure it doesn't try to
3629 Need to review when inodes get dirtied.
3630 - commit_write only sets I_Dirty !
3632 We call lafs_dirty_inode:
3633 dir_create_commit - a child of inode is PinPending
3635 lafs_link - before dir_create_commit
3636 lafs_unlink, lafs_rmdir - data block is pinned
3637 lafs_symlink - before create_commit
3638 lafs_mkdir - before create_commit, or block pinned
3639 lafs_mknod - before create_commit
3640 lafs_rename - (moved to) before create_commit/update_commit
3641 or data block is pinned
3642 lafs_dir_handle_orphan - (assured that) child is pinned.
3643 choose_free_inum - child is pinned
3644 lafs_incorporate - block is pinned
3646 So either the data block is pinned, or the index block is pinned.
3647 In either case it is OK to set something to Dirty.
3649 (the new) lafs_dirty_vfs_inode gets called by mark_inode_dirty{,_sync}
3650 this is called from:
3651 inode_inc_link_count
3652 inode_dec_link_count
3653 ..various quota ops...
3655 __set_page_dirty (Which we don't use)
3657 other quota stuff we won't use
3662 only the time updates are interesting. Others we have locking
3664 file_update_time is called from generic_file_aio_write_nolock etc
3665 before ->prepare_write/->commit_write. So they can pick up the
3667 Similarly before set_page_dirty is called.
3668 touch_atime is called from do_follow_link and readlink and
3669 file_accessed which is called all over the place.
3672 If block is pinned, then dirty it to ensure writeout.
3673 If not, don't. But copy data in any case.
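
The rule in the two lines above is small enough to state as code. This is a sketch only: `InodeBlock` and `dirty_vfs_inode` are hypothetical names, and the inode fields are modelled as a plain dict.

```python
from dataclasses import dataclass, field


@dataclass
class InodeBlock:
    """Toy inode data block: pin state, dirty flag, and on-disk fields."""
    pinned: bool
    dirty: bool = False
    data: dict = field(default_factory=dict)


def dirty_vfs_inode(inode_fields: dict, block: InodeBlock) -> None:
    """Copy the in-memory inode into its data block; dirty only if pinned."""
    block.data.update(inode_fields)  # copy data in any case
    if block.pinned:
        block.dirty = True           # pinned: make sure it gets written out
```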
3678 OK, I've decided that I don't like clearing B_Valid when an index
3679 block contains no indexes. The final straw was that I seemed
3680 to need to initialise the index block when I didn't hold IOLock.
3681 That was probably fixable, but I'm sure more problems were coming.
3683 So: what to do instead?
3684 One issue that must be resolved is that an index block can still
3685 have valid children even when it becomes empty.
3686 This can happen if we erase blocks from a file, then add them back
3687 after a checkpoint, and so in the next phase.
3688 The checkpoint writeout could need to show an empty index block,
3689 but the next phase will see real addresses.
3690 We cannot easily avoid this, so we must handle it.
3691 This interacts badly with the index lookup algorithm that finds
3692 the best index block currently in the parent, and then scans
3693 the children. If there is no index block in the parent, we
3694 cannot find any children.
3695 This could be handled by responding to an empty index block by
3696 scanning all children. But that isn't a full solution as if
3697 just one index block got erased, its unincorporated siblings
3698 would still be lost.
3699 We could treat empty index blocks like orphans. i.e. don't
3700 discard them immediately but leave them with possibly real
3701 addresses. Then when they have no children we allocate the
3703 But we still need to ensure that index blocks off which siblings
3704 have been split but not yet incorporated remain present in the
3705 tree to mark the place for their siblings.
3706 There is another problem. A horizontal split could leave the
3707 new block with no addresses and everything in the uninc list.
3708 Nothing can be found in there.
3710 So maybe we need to revise the lookup mechanism.
3711 The goal is to find an index block that starts at or before
3712 the target and contains an address at or after the target.
3713 Then our search can stop.
3717 I thought about this more over the weekend and think I have an answer.
3718 We need to treat internal and leaf index blocks somewhat differently.
3720 An internal index block must never be empty (while unlocked).
3721 Any child block which has not had its address incorporated must be
3722 attached (simply in the sibling list) to a block which has been
3723 incorporated. This will be the block that it was split off.
3724 The uninc block needs to hold a reference so that the primary isn't
3726 When a 'primary' becomes empty it cannot be discarded, so the
3727 addresses in the first dependent index block must be copied
3728 across. This is awkward for indirect blocks so they might be
3729 allowed to be empty (they aren't internal so don't violate the
3731 When a horizontal split breaks a sequence of dependent blocks
3732 between two parents, the second parent must be incorporated
3733 immediately so that the first block in the second half of the
3734 sequence is incorporated.
3735 If an internal index block does become empty and it has no
3736 dependent blocks to fill from, it must be invalidated immediately.
3737 It cannot have any children - even in next phase - as at least one
3738 would have to be incorporated and so the block would not be empty.
3739 Invalidating involves allocating to address 0.
3740 If index lookup finds a block with PhysValid address of 0, it
3741 must look to the previous index block. If there was none .... it
3744 Leaf index blocks can become empty, but we try to avoid it.
3745 If a leaf has blocks which have been created in the next phase,
3746 and others which have been deleted in this phase, it can be empty
3747 but still have children. In this case we just treat it as a real
3748 index block that doesn't actually have any addresses. We still
3749 write it out even though that is a waste of space.
3751 We have been working on the assumption that every address always
3752 has a corresponding leaf index block. It is the leaf with the
3753 highest index at or below the target address.
3754 However this requires that every internal index block has a child
3755 with the same address as the parent.
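
The "highest index at or below the target" rule is an ordinary predecessor search. The sketch below models the leaves of one parent as a sorted list of start addresses; `find_leaf` is a hypothetical helper, not the real lookup code. It also shows why the first child has to start at the parent's own address: without that, low targets have no leaf at all.

```python
from bisect import bisect_right
from typing import List, Optional


def find_leaf(leaf_starts: List[int], target: int) -> Optional[int]:
    """Return the start address of the leaf covering 'target'.

    leaf_starts must be sorted ascending.  The covering leaf is the one
    with the highest start address that is <= target.
    """
    i = bisect_right(leaf_starts, target)
    if i == 0:
        return None  # no leaf starts at or below the target
    return leaf_starts[i - 1]
```

For a parent starting at 0 with leaves at [0, 100, 250], every target finds a leaf. If the first leaf is removed, leaving [100, 250], a lookup for address 50 returns None, which is the case the requirement above exists to prevent.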
3756 Preserving this requirement when the first child of an internal
3757 block becomes empty requires either:
3758 - loading the 'next' child and reassigning this to the start
3759 - changing the address of the parent to match the first child.
3760 The former requires possibly reading a block from storage.
3761 The latter only involves modifying blocks that are due to be
3762 written out anyway, but makes block look up slightly interesting.
3763 When lookup finds an invalid block that is 'first', it needs to
3764 start again from the top.
3765 When incorporation creates an invalid block that is first, it
3766 needs to walk down from the top and any index block at the same
3767 address needs to be relocated/rehashed. If the block is
3768 incorporated, the incorporated address needs to be updated.
3770 - flag for unincorporated index blocks which implies a reference
3772 - after split, immediately incorporate second block
3773 - change lookup to retry when finding invalid block
3774 - When internal block becomes empty, either merge with
3775 first dependent or invalidate. If first in parent,
3776 update address and parent and recurse.
3777 Need some 'clever' locking here.
3778 Before unlocking the invalidated block, we take i_alloc_sem,
3779 then walk up the ->parent tree locking blocks as
3781 The index lookup, when it finds an invalid block will take
3782 i_alloc_sem, then drop it, then start again.
3783 Or maybe some other lock than i_alloc_sem...
3784 - When leaf becomes empty, invalidate only if it has no children.
3785 When internal leaf becomes unpinned, check if empty.
3788 That locking doesn't look like it will work, and we can never 'merge
3789 with first dependent' as it is not valid to have an index block
3790 where the first child is at a different address.
3791 And we cannot always change the parent address, particularly if it
3792 is zero - increasing it then cannot work.
3793 And there is no need to load a block if we are just going to change
3794 its start address (not internal index blocks anyway).
3795 Let's drop the idea of relocating the parent.
3796 If an internal index block becomes empty:
3797 If it is last in parent, no loss, just discard
3798 If parent would be empty, need to recurse up.
3799 If it is not last relocate the next sibling to this location,
3800 rehashing it and updating the parent.
3801 If a leaf index block becomes empty we cannot just delegate to
3802 next as it might be indirect... not a problem if address is
3803 stored. But that requires a format change... now might be a
3808 If we hold an index block locked and it becomes empty and we choose
3809 to invalidate it, we need to ensure that doing so does not
3810 break any indexing paths.
3811 So we take a separate lock (i_alloc_sem??) and flag the block as invalid
3812 by setting physaddr to 0 while PhysValid is set, and unlock the block.
3813 Any lookup that finds such a block must take and release i_alloc_sem,
3814 and then restart from the top.
3815 - If the block was not incorporated, we just remove from sibling list
3816 and all is done - the space is implicitly included in
3818 - If the block has a different fileaddr than the parent then update
3819 the parent directly, either removing the entry, or changing it to
3820 point to the first unincorporated sibling (if there is one).
3821 This requires taking the lock on the parent of course. That is
3822 why we dropped the lock on the child.
3824 - If the block has the same address as the parent we need to find
3825 a 'next block' to relocate to the start of the parent.
3826 It is either the first unincorporated sibling, or the next
3827 block in the index block, or nothing, meaning the parent is
3828 about to become empty.
3829 We lock the parent (still holding i_alloc_sem), and rehash the
3830 chosen child. If it doesn't exist, or is not dirty, we need
3831 to update the phys address directly in the
3832 accordingly, erasing or replacing the first address.
3833 Then we need to rehash the index block, but we need to lock
3834 the parent for that.
3835 So set a 'busy' flag on the block, unlock it, lock parent,
3836 rehash, clear busy flag, and repeat.
3837 - We can never relocate a block with fileaddr of zero, as the
3838 InoIdx block cannot be relocated. So leaf index block 0
3839 must never be erased unless the file is empty. So
3843 We store the start address of an indirect block in the block.
3844 This means that the meaning of any index block is completely
3845 independent of the location of the block, so we can change the location
3846 easily and without touching the block.
3847 So if a block becomes empty, we simply move the next block back to
3849 i.e. when an index block becomes truly empty (i.e. no children)
3850 - if it wasn't incorporated, simply remove it
3852 - if there is a dependent block, rehash it to take my address
3853 - if there is a next block that is dirty, rehash it
3854 - if there is a next block that is not dirty,
3855 update parent to merge my entry with next, and rehash next
3857 - if there is no next block but we are not first, just update
3859 - if no next block and we are first, parent becomes empty,
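
The cases listed above can be read as a decision chain, tried in the order given. The sketch below only names the outcome of each case; `EmptyAction` and `on_truly_empty` are hypothetical, and where a case is cut off in the notes ("just update ...", "parent becomes empty, ...") the sketch stops at the same point rather than guessing the rest.

```python
from enum import Enum, auto
from typing import Optional


class EmptyAction(Enum):
    REMOVE = auto()            # was never incorporated: simply remove it
    REHASH_DEPENDENT = auto()  # dependent block takes over my address
    REHASH_NEXT = auto()       # dirty next block is rehashed
    MERGE_WITH_NEXT = auto()   # update parent to merge my entry, rehash next
    UPDATE_ONLY = auto()       # no next block, not first: "just update"
    PARENT_EMPTY = auto()      # no next block, first: parent becomes empty


def on_truly_empty(incorporated: bool, has_dependent: bool,
                   next_dirty: Optional[bool], is_first: bool) -> EmptyAction:
    """next_dirty is None when there is no next block in the parent."""
    if not incorporated:
        return EmptyAction.REMOVE
    if has_dependent:
        return EmptyAction.REHASH_DEPENDENT
    if next_dirty is True:
        return EmptyAction.REHASH_NEXT
    if next_dirty is False:
        return EmptyAction.MERGE_WITH_NEXT
    if not is_first:
        return EmptyAction.UPDATE_ONLY
    return EmptyAction.PARENT_EMPTY
```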
3863 - too long, I've forgotten what I was up to..
3864 + I've changed the format of indirect blocks to store an address.
3865 + I've handled incorporation of an empty block
3866 So now internal index blocks can never be empty - they get immediately
3867 unlinked if they are.
3868 Leaf index blocks can be empty while they have children. We don't
3869 flag them as empty, but rather wait until another child gets incorporated.
3870 But I don't think I really like that. It is an external ugliness based
3871 entirely on internal implementation details. Empty index blocks should
3872 not get written out. We need some way to reliably find an empty index
3873 block. The address won't appear in the parent so a lookup will find the
3874 previous block which we cannot link to now as it may not exist yet.
3875 Worse - if first index block goes empty, we can only unlink it by moving
3876 the parent to start at the next block. That would make this index block
3878 So I think we have to stick with writing out empty index blocks very
3879 rarely. So we need to be sure they disappear properly.
3880 The difficult case is if an index block becomes empty while it has some
3881 children which don't end up getting dirtied. e.g. an update aborts.
3882 We need to leave the block with enough credits to be written out.
3883 I guess the Ncredit should be enough...
3884 Maybe worry about that later.
3886 - what about InoIdx blocks when they become empty? It would be helpful
3887 to flag them so that inode deletion can check....
3888 Maybe just set depth to 0..
3890 ARRGGG... I've completely lost it. I need another ITO week.
3891 I just got a bug in summary.c:71!!
3895 ablocks_used has hit zero too soon.
3896 This should be the count of blocks for which space has been allocated
3897 (B_Prealloc is set) but have not been given a phys address yet - at which
3898 point the usage count is moved to cblocks_used or pblocks_used.
3899 The last block (which may not be the cause of the problem) does not have
3900 B_Prealloc set, yet physaddr == 0.
3901 The block is 0/1, so the inode for the inode usage map. This should have
3903 We did find 8, then changed to 73, but then changed to 0!
3904 Ahhh... recent fix exposed a subtle bug ... fixed.
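
The accounting invariant described above (ablocks_used counts blocks that are preallocated but have no physical address yet) can be modelled as below. `SpaceAccounting` is a hypothetical class, blocks are plain dicts, and clearing B_Prealloc when the address is assigned is an assumption of the sketch. Which of cblocks_used or pblocks_used receives the count is left to the caller, since the notes do not say how that is chosen.

```python
class SpaceAccounting:
    """Toy model of the three usage counters named in the notes."""

    def __init__(self):
        self.ablocks_used = 0  # preallocated, no phys address yet
        self.cblocks_used = 0
        self.pblocks_used = 0

    def prealloc(self, block: dict) -> None:
        """Reserve space for a block once; repeated calls are no-ops."""
        if not block.get("B_Prealloc"):
            block["B_Prealloc"] = True
            self.ablocks_used += 1

    def assign_phys(self, block: dict, physaddr: int, counter: str) -> None:
        """Give the block an address and move its usage to 'counter'."""
        if counter not in ("cblocks_used", "pblocks_used"):
            raise ValueError(counter)
        if not block.get("B_Prealloc"):
            # This is the situation that would drive ablocks_used
            # to zero too soon: nothing was reserved for this block.
            raise RuntimeError("block has no preallocation to move")
        block["B_Prealloc"] = False  # assumption: cleared on assignment
        block["physaddr"] = physaddr
        self.ablocks_used -= 1
        setattr(self, counter, getattr(self, counter) + 1)
```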
3906 Now cluster.c:619: [ce9233f8]0/282(0)r2F:Pinned,Phase1,PinPending,CN,CNI,Claimed,PhysValid iblock(1) leaf(1)
3907 cluster.c:619: [ce570a18]0/286(0)r2F:Pinned,Phase1,PinPending,SegRef,CN,CNI,Claimed,PhysValid iblock(1) leaf(1)
3908 cluster.c:619: [ce588d6c]0/17(0)r2F:Pinned,Phase0,PinPending,CN,CNI,Claimed,PhysValid iblock(1) leaf(1)
3909 cluster.c:619: [ce51dfe4]0/283(0)r2F:Pinned,Phase0,PinPending,CN,CNI,Claimed,PhysValid iblock(1) leaf(1)
3910 cluster.c:619: [cfbb8430]0/328(0)r2F:Pinned,Phase1,PinPending,CN,CNI,Claimed,PhysValid iblock(1) leaf(1)
3911 We are allocating an InoIdx block, but data block is not valid??
3913 That isn't very reproducible so I'll have to leave it for now...
3914 erase_dblock had been called on the data block .. inode 17??
3916 Problem is that I keep changing the rules.
3917 I don't erase the InoIdx block any more.
3918 I used to, then changed it to iolock_block/cluster_allocate->0
3920 Problem: When all files are removed, usage is still quite high, two
3921 segments have over 400 blocks (out of 512). Cleaning keeps running and
3922 not making much progress.
3923 segment 6 has usage of 484.
3924 'cluster 3072' shows: cluster 3072, 3085, 3086 3092
3925 Inode 0: blocks 267 272 276
3926 Inode 277: blocks 0/4 6/2
3927 Inode 0: blocks 0/2 8 16
3928 Inode 0: block 16 70/2 131/3 135/4 140/9 150/2 ... 296/7
3934 All 'old', so must be the product of cleaning, as you would expect.
3935 All (most) of this has been deleted though, but count didn't drop.
3936 'Count' adds to 508, plus the 4 cluster heads makes 512 - good.
3937 lafs_seg_move definitely isn't being called on these blocks.
3938 it is only called from lafs_summary_update
3939 cblocks_used "exactly" matches the number of un-removed blocks.
3943 bad [ce5bcf50]301/0(0)r1E:Index(0),Pinned,Phase1,WPhase0,InoIdx,Valid,SegRef,C,CN,CNI,UninCredit,IOLock,PhysValid{0,0}[0] leaf(1)
3944 /home/neilb/work/nfsbrick/fs/module/modify.c:1652: [ce5bcf50]301/0(0)r1E:Index(0),Pinned,Phase1,WPhase0,InoIdx,Valid,SegRef,C,CN,CNI,UninCredit,IOLock,PhysValid{0,0}[0] leaf(1)
3945 bad [cfbf6000]327/0(0)r1E:Index(0),Pinned,Phase1,WPhase0,InoIdx,Valid,SegRef,C,CN,CNI,UninCredit,IOLock,PhysValid{0,0}[0] leaf(1)
3946 /home/neilb/work/nfsbrick/fs/module/modify.c:1656: [cfbf6000]327/0(0)r1E:Index(0),Pinned,Phase1,WPhase0,InoIdx,Valid,SegRef,C,CN,CNI,UninCredit,IOLock,PhysValid{0,0}[0] leaf(1)
3947 bad [cfb62d20]291/0(0)r1E:Index(0),Pinned,Phase1,WPhase1,InoIdx,Valid,SegRef,C,CN,CNI,UninCredit,IOLock,PhysValid{0,0}[0] leaf(1)
3948 /home/neilb/work/nfsbrick/fs/module/modify.c:1656: [cfb62d20]291/0(0)r1E:Index(0),Pinned,Phase1,WPhase1,InoIdx,Valid,SegRef,C,CN,CNI,UninCredit,IOLock,PhysValid{0,0}[0] leaf(1)
3951 free_blocks=1842 allocated=449 max_seg=512 clean_reserved=0
3954 ------------[ cut here ]------------
3955 kernel BUG at /home/neilb/work/nfsbrick/fs/module/segments.c:1028!
3956 free list is empty - that should not be.
3959 /home/neilb/work/nfsbrick/fs/module/modify.c:1219: [ce9893b0]74/0(0)r1E:Index(1),Pinned,Phase0,WPhase1,InoIdx,Valid,Dirty,CN,CNI,UninCredit,IOLock,PhysValid{0,0}[0] leaf(1)
3960 /home/neilb/work/nfsbrick/fs/module/modify.c:1219: [ce5ba690]74/0(0)r1E:Index(1),Pinned,Phase1,WPhase0,InoIdx,Valid,Dirty,CN,CNI,UninCredit,IOLock,PhysValid{0,0}[0] leaf(1)
3961 [<d0a57bc8>] ? lafs_get_flushable+0x131/0x191 [lafs]
3962 [<d0a5856d>] ? lafs_do_checkpoint+0x1b3/0x3a2 [lafs]
3963 [<d0a5fe7e>] ? cleaner+0x105/0x1426 [lafs]
3964 [<c02256bf>] ? autoremove_wake_function+0x0/0x33
3965 [<d0a5fd79>] ? cleaner+0x0/0x1426 [lafs]
3969 Weirdness with truncating.
3970 The cleaner relocates a file resulting in the InoIdx block being
3971 Maybe-dirty and phys_addr == 0.
3972 Then truncate doesn't prune but just incorporates, finding
3973 something weird there..
3974 file 278, blocks around 4100
3975 seem to find 1949 instead??
3977 Note: When a non-InoIdx block is erased we set PhysValid
3978 and physaddr == 0 to record the fact because it will not be stored...
3980 modify.c:1654: [ce5b4460]327/336(16)r4F:Index(1),Pinned,Phase0,WPhase1,Valid,Async,SegRef,C,CI,CN,CNI,IOLock,PhysValid{0,0}[0] leaf(1) inode_handle_orphan2(1) async(1) inode_handle_orphan3(1)
3982 modify.c:1657: [cfb90690]327/340(787)r4F:Index(1),Pinned,Phase1,WPhase0,Valid,Async,SegRef,C,CI,CN,CNI,IOLock,PhysValid{0,0}[0] leaf(1) inode_handle_orphan2(1) async(1) inode_handle_orphan3(1)
3983 Still Async ... wonder what it means.
3985 - directory block got corrupted. Maybe conversion to indexed??
3988 Getting bug in remove_from_index because the addr isn't
3989 there, possibly block is empty. But incorporation is
3990 ??? instant? No it isn't.
3991 If an index block hasn't been incorporated it has B_PrimaryRef
3992 set as it holds a ref to some earlier index.
3993 But what if nothing is incorporated?
3996 Allocated [ce402230]328/340(0)r5F:Index(1),Pinned,Phase0,WPhase0,Valid,Dirty,Async,SegRef,CN,CNI,UninCredit,IOLock,PhysValid,Uninc{0,0}[0] inode_handle_orphan2(1) leaf(1) uninc(1) async(1) inode_handle_orphan3(1) -> 0
3997 looping on [ce402230]328/340(0)r5F:Index(1),Pinned,Phase0,Valid,Dirty,Writeback,Async,SegRef,CN,CNI,UninCredit,PhysValid,Uninc{0,0}[0] inode_handle_orphan2(1) cluster(1) uninc(1) async(1) inode_handle_orphan3(1)
3999 Then spin in a soft-lockup in lafs_inode_handle_orphan
4003 - grow_index_tree needs to do initial incorporation so things can be found.
4004 just like end of do_incorporate_internal.
4005 NO - cannot incorp yet as do not have phys addr. Don't need to as
4006 lafs_leaf_find explicitly handles this.
4007 For truncate case we don't use the stored address, but ensure all
4008 leaf indexes must be dirty (or gone) so whole tree must be
4009 accessible for walking around.
4010 - do_incorporate_internal needs to set B_PrimaryRef and take the ref
4011 - when we remove a B_PrimaryRef without incorporating it, we need to
4012 drop a ref if the *next* in the list is B_PrimaryRef
4013 - need to use a constant to identify 'async' calls etc.
4014 - maybe I need other iolock_block in truncate ?? to ensure it is Valid so
4015 it isn't found as async....
4018 STILL struggling with incorporation.
4019 We have a premise that any file address is covered by precisely
4020 one leaf index block. Every leaf index has an implicit address
4021 and it covers all addresses from there to the next leaf. The last
4023 So there must always be a leaf at address 0.
4024 This applies within the tree from an internal index block too.
4025 Beneath an internal index block there must be a leaf covering every
4026 address up to the next internal index block. So there must be
4027 a first. So storing the first address is pointless. And harmful.
4028 When an index block becomes empty and disappears its coverage is
4029 included in the previous block unless there is none, in which case
4030 the next index block must be re-addressed. If there is no 'next',
4031 this index block must be empty and so must disappear.
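A toy model of the coverage rule above, to make the three removal cases
concrete. This is not the lafs code: the class, its method names and the
`base` parameter (the implicit first address under a parent) are all
assumptions made for the sketch. Leaves are represented only by their
start addresses.

```python
import bisect


class LeafCoverage:
    """Sorted start addresses of the leaf index blocks under one parent."""

    def __init__(self, starts, base=0):
        self.starts = sorted(starts)
        # The first leaf must sit at the implicit base address (0 for a whole file).
        if not self.starts or self.starts[0] != base:
            raise ValueError("there must always be a leaf at the base address")

    def find_leaf(self, addr):
        """Start address of the leaf that covers addr."""
        i = bisect.bisect_right(self.starts, addr) - 1
        return self.starts[i]

    def remove_empty(self, start):
        """Drop an empty leaf.

        Returns (old_addr, new_addr) when the next leaf had to be
        re-addressed, otherwise None.
        """
        i = self.starts.index(start)
        if i > 0:
            # A previous leaf exists: it silently absorbs the range.
            del self.starts[i]
            return None
        if len(self.starts) == 1:
            # No previous and no next: the parent is now empty too.
            del self.starts[0]
            return None
        # No previous leaf: the next leaf takes over this start address.
        old_next = self.starts[1]
        del self.starts[1]
        return (old_next, start)


cov = LeafCoverage([0, 100, 250])
print(cov.find_leaf(99), cov.find_leaf(100), cov.find_leaf(1000))
print(cov.remove_empty(100), cov.find_leaf(150))
print(cov.remove_empty(0), cov.starts)
```

Only the third case (first leaf removed while a next one exists) changes
the address of a surviving block, which is where the trouble starts.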
4033 BUT if we re-address an index block, we implicitly re-address the
4034 first child - recursively - so we need to move/rehash them all
4035 or lose them... or record where they are. Or do lookup not by
4037 I think just rehashing them all - with an iolock - is simple
4038 and safe. So just do that.
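Roughly what "rehash them all, with an iolock" means, as a sketch.
Everything here is hypothetical: the `IndexBlock` class, the
`(depth, addr)` hash key and the use of a plain mutex standing in for the
iolock are assumptions for illustration, not the real block hash.

```python
import threading


class IndexBlock:
    def __init__(self, depth, addr, children=()):
        self.depth = depth
        self.addr = addr
        self.children = list(children)  # child index blocks, sorted by address
        self.iolock = threading.Lock()  # stand-in for the block's iolock


def readdress(block, new_addr, table):
    """Move block to new_addr in the hash table, then do the same for its
    first child, recursively, since the first child shares the address."""
    with block.iolock:
        del table[(block.depth, block.addr)]
        block.addr = new_addr
        table[(block.depth, new_addr)] = block
        if block.children:
            readdress(block.children[0], new_addr, table)


leaf_a = IndexBlock(1, 100)
leaf_b = IndexBlock(1, 180)
internal = IndexBlock(2, 100, [leaf_a, leaf_b])
table = {(b.depth, b.addr): b for b in (internal, leaf_a, leaf_b)}

readdress(internal, 40, table)
print(sorted(table))
```

Only the chain of first children moves; later siblings such as `leaf_b`
keep their keys.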
4041 So: I cleaned up index handling and truncation somewhat.
4042 Now running looptest to see what patterns emerge:
4044 block.c:197 (*9+1) During umount, the Root datablock is
4046 Maybe just need for cleaner to become inactive
4047 during umount - hope that doesn't deadlock
4048 didn't even work...
4049 block.c:529 (*4+1) erase dblock while iblock depth > 0
4050 When pruning InoIdx we want to set depth to 0.
4051 FIXME is this really what I want, or is depth=0
4052 only for data-inode ... FIXME
4053 cluster.c:533 (*2) cluster_allocate on invalid block
4054 Block is 8/0 in writepage from sync_inodes
4055 This is the orphan file.
4057 I guess the file gets truncated while we wait for it.
4058 Just need to re-test.
4059 index.c:1936 (*2). An index block is Root - FIXED??
4060 modify.c:1056 - secondary bug, ignore for now.
4061 modify.c:1650 update_index fails to find target.
4062 second call, phys==0
4063 Code was bad ... may not be the cause though.
4064 modify.c:1696 (*4) lafs_incorporate gets non-dirty Index(1) block
4065 from orphan handler.
4066 Maybe just change the do/while back to 'do'.
4067 modify.c:1704: (*2) lafs_inc gets leaf with uninc list???
4070 uninc list gets set in lafs_add_block_address (parent of iblk),
4071 do_incorporate_internal,
4072 Maybe the InoIdx still had children.
4073 segments.c:1028. (*4) The free list becomes empty.
4074 super.c:655 (*3) Busy inodes after umount, and root InoIdx block
4075 is still pinned as inode 16 data block was still dirty.
4076 segusage slow. Maybe same as block.c:197 ??
4077 invalid address 6b6b6bfb: invalidate_inode_buffers in shutdown
4079 presumably the inode was freed before being invalidated.
4080 spin on writeback during truncate (r3a) 8 times. now 10
4081 Probably because writeback cannot proceed while
4082 orphan processing keeps looping.
4083 kmalloc-1024 problems - (*2)
4084 A block - should be start of page - isn't what it appears...
4086 Others complete with 'cb' ranging from 202 to 715
4091 Looking at segments.c:1028
4092 We run a seg_scan every checkpoint, so that should keep free segments
4094 Ahh.. do_checkpoint is looping because root isn't changing phase.
4096 Lowest block pinned to old phase is
4097 [cfb7df08]0/74(4253)r0E:Pinned,Phase1,WPhase0,Valid,Dirty,CN,CNI,UninCredit,IOLock,Claimed,PhysValid
4098 which is not on leaf list because it has IOLock
4099 With more debugging:
4100 [ce5c5f08]0/74(4250)r0E:Pinned,Phase1,WPhase0,Valid,Dirty,Realloc,SegRef,CN,CNI,UninCredit,IOLock,Claimed,PhysValid</io.c:368>
4101 or better (that was in lafs_iolock_written)
4102 [ce5c05e8]0/74(4257)r0E:Pinned,Phase0,WPhase0,Valid,Realloc,SegRef,C,CN,CNI,UninCredit,IOLock,Claimed,PhysValid</file.c:247>
4103 FIXED - I didn't unlock if it wasn't dirty any more.
4104 Well almost - it occurs much less now.
4106 8 BUG: soft lockup - CPU#0 stuck for 61s! [lafs_cleaner:1180]
4107 1 BUG: unable to handle kernel NULL pointer dereference at 000001b4
4108 2 BUG: unable to handle kernel paging request at 6b6b6bfb
4109 3 kernel BUG at /home/neilb/work/nfsbrick/fs/module/block.c:197!
4110 2 kernel BUG at /home/neilb/work/nfsbrick/fs/module/block.c:529!
4111 1 kernel BUG at /home/neilb/work/nfsbrick/fs/module/modify.c:1650!
4112 2 kernel BUG at /home/neilb/work/nfsbrick/fs/module/modify.c:1696!
4113 4 kernel BUG at /home/neilb/work/nfsbrick/fs/module/segments.c:1028!
4114 1 kernel BUG at /home/neilb/work/nfsbrick/fs/module/segments.c:332!!
4116 So we now have 1/12 rather than 2/3.
4117 a/ pinned by IOLock from file.c:220 - FIXED
4119 c/ Root is pinned by 4 children
4120 328/0 with 196 of data blocks in writeback/realloc, in a cluster
4121 0/1, 74/0, 0/8 all in a cluster waiting writeout.
4122 Don't understand this.
4125 Of the 48, 11 ran to completion leaving blocks from 286 to 899
4128 Looking at the loss of blocks when truncating.
4129 tracing shows a small number of files with remaining blocks at delete.
4130 sum is 26+22+14+272+11+2 == 347 cf df shows cb=457
4131 next attempt: 14+24+26*11 =324 cf cb=1124
4132 next attempt 26+6+15+68+29 == 144 cf cb=383
4133 26+18+14+19+284 = 361 cf 379
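The same sums, with the gap between the traced leftovers and the reported
count made explicit. Two assumptions here: that the "26*11" term means
eleven files with 26 blocks each, and that the last "cf 379" is also a
cb= value from df. The function name is made up.

```python
# (per-file leftover block counts from tracing, cb= value reported by df)
attempts = [
    ([26, 22, 14, 272, 11, 2], 457),
    ([14, 24] + [26] * 11, 1124),
    ([26, 6, 15, 68, 29], 383),
    ([26, 18, 14, 19, 284], 379),
]


def unexplained(leftovers, cb):
    """Blocks still counted in cb that tracing did not attribute to any file."""
    return cb - sum(leftovers)


for leftovers, cb in attempts:
    print(sum(leftovers), cb, unexplained(leftovers, cb))
```

The traced leftovers never account for all of cb, and the size of the gap
varies a lot between runs.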
4134 files are (in order)
4143 Thinking about truncate and index blocks becoming empty while
4144 they still have children.
4145 For leaf indexes, we need to leave the block in place in case
4146 the children get written. We need to find a time to ultimately
4148 For internal indexes,.... uhm, it just works, OK??
4150 When I drop an uninc block, I need to remove it from the
4151 uninc list, and from phase_leafs
4152 clearing dirty and refiling should remove from leafs.
4154 When we recurse to a parent, we need to remove
4155 *this* block from the uninc list for said parent.
4156 It should be the only thing in the list.
4157 But even when we don't recurse, the fact that we have
4158 incorporated means that we should tidy up the ->uninc
4164 unmount hung after lafs_run_orphans from lafs_put_super
4165 There are two orphans in Writeback which cannot progress
4166 until the current cluster is written...
4167 But they keep getting re-written!
4168 Other time, one orphan, index block is Dirty on a leaf ???
4170 orph=[cfbdcf24]0/331(3780)r2E:Valid,SegRef,C,CI,CN,CNI,Claimed,PhysValid,Orphan(0) orphan_list(1) iblock(1)
4171 [cfb8e460]331/0(NoPhysAddr)r1F:Index(1),Pinned,Phase0,InoIdx,Valid,Dirty,SegRef,CN,CNI,UninCredit{0,0}[0] leaf(1) Leaf0(1)
4172 LAFS_cluster_flush 1
4175 orph=[ce5c9bb4]0/327(3317)r2E:Valid,SegRef,C,CI,CN,CNI,Claimed,PhysValid,Orphan(0) iblock(1) orphan_list(1)
4176 [cfbe3a40]327/0(NoPhysAddr)r1F:Index(1),Pinned,Phase0,InoIdx,Valid,Dirty,SegRef,CN,CNI,UninCredit{0,0}[0] leaf(1) Leaf0(0)
4178 OK, problem is that when we truncate and remove an index block, the
4179 next index block expands backwards to fill the space.
4180 Then we apply prune_some, but don't check if anything was done.
4181 We always mark it dirty, so it has to be written and then
4182 we loop through again...
4183 So need to check if prune_some did anything.
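A small simulation of that loop, to show why checking the result matters.
This is a toy, not the orphan handler: `prune_some` here is a stand-in
with an invented signature and batch size, the block is a plain dict, and
`max_passes` only exists so the broken variant terminates.

```python
def prune_some(entries, limit, batch=2):
    """Remove up to `batch` addresses >= limit; return how many were removed."""
    doomed = [a for a in entries if a >= limit][:batch]
    for a in doomed:
        entries.remove(a)
    return len(doomed)


def handle_orphan_pass(block, limit, check_progress):
    """One truncate pass over an index block; True if it must be written out."""
    removed = prune_some(block["entries"], limit)
    if check_progress and removed == 0:
        return False  # nothing changed: leave the block clean, stop looping
    block["dirty"] = True
    return True


def count_writes(block, limit, check_progress, max_passes=10):
    writes = 0
    for _ in range(max_passes):
        if not handle_orphan_pass(block, limit, check_progress):
            break
        block["dirty"] = False  # pretend the write-out completed
        writes += 1
    return writes


# Nothing at or beyond the truncation point: the old code still rewrites every pass.
print(count_writes({"entries": [1, 2, 3], "dirty": False}, 10, check_progress=False))
print(count_writes({"entries": [1, 2, 3], "dirty": False}, 10, check_progress=True))
```

Without the check the write count just runs up to the pass limit; with it
the block is only rewritten while prune_some is still removing something.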
4186 - prune_some needs to get more done at a time
4187 - let cleaner finish up before umount
4188 - use early segments first ??
4189 - look at write-clusters and check OK
4190 - check that df:cb= drops properly.
4193 1 BUG: spinlock lockup on CPU#0, sh/1168, c0441170 - SECONDARY BUG
4194 1 BUG: unable to handle kernel NULL pointer dereference at 000001b4
4195 3 BUG: unable to handle kernel paging request at 00100104
4196 5 BUG: unable to handle kernel paging request at 6b6b6bfb
4197 1 BUG: unable to handle kernel paging request at 7fffffff
4198 7 kernel BUG at /home/neilb/work/nfsbrick/fs/module/block.c:197!
4199 9 kernel BUG at /home/neilb/work/nfsbrick/fs/module/block.c:479!
4200 2 kernel BUG at /home/neilb/work/nfsbrick/fs/module/block.c:529!
4201 2 kernel BUG at /home/neilb/work/nfsbrick/fs/module/block.c:67!!
4202 2 kernel BUG at /home/neilb/work/nfsbrick/fs/module/inode.c:828!
4203 4 kernel BUG at /home/neilb/work/nfsbrick/fs/module/inode.c:843!
4204 1 kernel BUG at /home/neilb/work/nfsbrick/fs/module/modify.c:1708!
4205 7 kernel BUG at /home/neilb/work/nfsbrick/fs/module/segments.c:1028!
4206 2 kernel BUG at /home/neilb/work/nfsbrick/fs/module/segments.c:332!
4207 30 kernel BUG at /home/neilb/work/nfsbrick/fs/module/super.c:655!
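A tally like the one above can be produced from the collected console
logs with something along these lines. This is only a sketch of one way
to do it, not necessarily how the list was generated; the regular
expression and the function names are assumptions, and the sample lines
are shortened for the example.

```python
import re
from collections import Counter

# Either a "kernel BUG at <file>:<line>!" line or a "BUG: ..." oops line.
BUG_RE = re.compile(r"(kernel BUG at \S+|BUG: .*)")


def tally(lines):
    """Count how often each distinct BUG signature appears."""
    counts = Counter()
    for line in lines:
        m = BUG_RE.search(line)
        if m:
            counts[m.group(1).strip()] += 1
    return counts


def report(counts):
    """Format like `sort | uniq -c`: count first, then the signature."""
    return ["%7d %s" % (n, sig) for sig, n in sorted(counts.items())]


sample = [
    "kernel BUG at fs/module/super.c:655!",
    "some unrelated console noise",
    "kernel BUG at fs/module/super.c:655!",
    "BUG: unable to handle kernel paging request at 6b6b6bfb",
]
for row in report(tally(sample)):
    print(row)
```

Signatures that embed varying values (the soft-lockup duration, for
instance) would end up as separate rows unless they are normalised first.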
4212 Pinned block in lafs_release:
4213 0/2 is Dirty with plenty of credits, so it is a child
4214 0/16 is Dirty/Realloc, or once Async
4217 seg_deref with refcnt , 2 in lafs_seg_put_all
4220 No free segments - no real pattern.
4223 lafs_incorporate on non-dirty/realloc block
4224 328/0 Index(1). 1 in uninc_table - probably during truncate.
4227 children present in truncate after final incorp...
4228 328/0. 64 children, no uninc list. Maybe we ran the orphans too early??
4229 or invalidate_page isn't removing the children.
4230 Might want print_tree here?
4233 Orphan handling - uninc but not dirty: is Realloc (sometimes)
4237 delref 'primary' from modify.c:2063 in the q2 branch.
4238 nxt has PrimaryRef... Maybe move earlier, but that shouldn't make a diff.
4239 ditto at modify.c:2035 nxt is primary as was I, so drop mine.
4242 erase with index depth > 1.
4243 0/328 in orphan handling. Still have 8 or 15 blocks registered!
4246 not enough credits to dirty block 2/0 in dir_delete_commit for unlink.
4248 16/1 in seg_inc/seg_move...allocated_block/cluster_flush
4251 invalidated pages finds dirty block after EOF, after iolock_written
4252 0/0 Dirty/Realloc in unmount - all Realloc!
4255 cleaner->cluster_flush->count_credits->lock??
4258 generic_drop_inode -- extra iput?? in lafs_inode_checkpin from refile
4259 6b6b6b invalidate_inode_buffers!! in kill. use-after-free
4262 seginsert from scan_seg