README update

author NeilBrown <neilb@suse.de>

Fri, 4 Mar 2011 01:47:31 +0000 (12:47 +1100)

committer NeilBrown <neilb@suse.de>

Fri, 4 Mar 2011 01:47:31 +0000 (12:47 +1100)
diff --git a/README b/README
index a5e15f48c4d5dfaf62c05644f147ecdac558a477..6c3d454ebff43bc952ee8a1ed22fb8e395c0987c 100644
--- a/README
+++ b/README
@@ -5127,39 +5127,39 @@ DONE 15az/ Sanity check all values in cluster head during roll-forward
        i.e. in roll_valid.  If the head isn't complete, we can still
        use this to commit some previous checkpoints.
  
-15ba/ roll forward should not BUG on bad data like inodefile in
+DONE 15ba/ roll forward should not BUG on bad data like inodefile in
      non-primary filesystem.
  
-15bb/ Do I need to sync something before copying an update over part
+DONE 15bb/ Do I need to sync something before copying an update over part
      of an inode, then reloading the inode.
  
-15bc/ Handle DescHole in roll forward.
+DONE 15bc/ Handle DescHole in roll forward.
  
-15bd/ Call lafs_add_block_address from writeback rather than iolock
+DONE 15bd/ Call lafs_add_block_address from writeback rather than iolock
      in roll forward, just for consistency.
  
-15be/ Confirm various files loaded at mount time (segusage, orphan ...)
+DONE 15be/ Confirm various files loaded at mount time (segusage, orphan ...)
      are actually the correct type.
  
-15bf/ Avoid quadratics in lafs_seg_put_all - nothing else should be doing
+DONE 15bf/ Avoid quadratics in lafs_seg_put_all - nothing else should be doing
     a lookup - or at least we can test for that.
     lafs_seg_apply_all has similar problems and needs a good solution.
  
-15bg/ lafs_seg_ref_block is worried about losing implicit ref on parent
+DONE 15bg/ lafs_seg_ref_block is worried about losing implicit ref on parent
      if parent splits.  See what to do about that.
  
-15bh/ after roll-forward, check that free_blocks hasn't gone negative.
+DONE 15bh/ after roll-forward, check that free_blocks hasn't gone negative.
    or handle if it has.
  
  DONE 15bi/ Set EmergencyClean a bit later - need at least one checkpoint first.
    to twostage.
  
-15bj/ Make sure .last link in segtracker is kepts uptodate, particularly in
+DONE 15bj/ Make sure .last link in segtracker is kept uptodate, particularly in
     segdelete.
  
-15bk/ make sure get_cleanable doesn't lose a race before calling add_clean
+DONE 15bk/ make sure get_cleanable doesn't lose a race before calling add_clean
  
-15bl/ better checks for 'valid state block address' in valid_devblock
+DONE 15bl/ better checks for 'valid state block address' in valid_devblock
      include that segment_count is credible
      also in valid_stateblock
  
@@ -5173,12 +5173,13 @@ DONE 15bp/ review all superblocks - maybe use more anon??
  
  15bq/ check readonly status in lafs_get_sb
  
-15br/ sync_fs should probably wait for something if 'wait'.
+DONE 15br/ sync_fs should probably wait for something if 'wait'.
  
-15bs/ set f_fsid properly in lafs_statfs
+DONE 15bs/ set f_fsid properly in lafs_statfs
  
- - use new write_begin / write_end
-    - review how we ensure that credit remain with block.
+DONE  - use new write_begin / write_end
+
+15bt/    - review how we ensure that credit remain with block.
  
  15ca/ When pin inode data block, pin it as well as index block I think
      It is still kept of the leaf list until the index block is done with
@@ -5205,7 +5206,10 @@ DONE 15bp/ review all superblocks - maybe use more anon??
  15cc/ free any stray B_ASync block found in destroy_inode
  
  15cd/ Some code assumes a cluster header does not exceed 1 page.
-     Is this safe?  Is in true? Is it enforced?
+     Is this safe?  Is in true? Is it enforced?p
+     roll-forward now handles large cluster_head.
+     Need cleaner to handle it, and need to possibly write large
+     cluster head when making new clusters.
  
  15ce/ classify BUGs as
          - internal logic errors
@@ -5300,7 +5304,7 @@ DONE 36e/ When dirtying a block in roll_block, maybe use writeback rather
       than just iolock, for consistency...
  DONE 36f/ What to do if table becomes full when add_block_address in
       roll_block ??
-36g/ Write roll_mini for directories.
+DONE 36g/ Write roll_mini for directories.
  DONE 36h/ In roll_one, use the cluster counting code to find block number and
       make sure we don't exceed the segment.
  DONE 36i/ add more general error checking to lafs_mount - 
@@ -5362,8 +5366,8 @@ DONE 52/ NFS export
      Need owner/group/perm for device file, but not for symlink.
      Can we create unique inode numbers?
      hard links for dev-files would be problematic.
-    What do we gain?  Maybe something for sort symlinks.
-    40 seems a ood length to et 70% of symlinks.
+    What do we gain?  Maybe something for short symlinks.
+    40 seems a good length to get 70% of symlinks.
  
  59/ Fix NeedFlush handling so we don't drop-then-retake
      a mutex as that isn't sensible.
@@ -5396,6 +5400,8 @@ DONE 52/ NFS export
     If a cross-directory rename happens care is needed:  either flush updates
     first or ensure that a flush does happen before the cross-directory
     update is flushed.
+   Note that if the target of a rename is a directory, it must also be fully
+   flushed before the rename can proceed.
  
  26June2010
   Investigating 5a
@@ -6973,3 +6979,65 @@ WritePhase - what is that all about?
     It is OK to delay the write-out of these until an fsync, and not bother
     if a checkpoint happens.
     So add that to th TODO list - item 66.
+
+28feb2010
+  - roll forward directory updates ... I wonder if I got it right :-)(untested).
+
+
+  I don't seem to have easy-access notes about the various meanings of
+  'width' and 'stride'
+
+  width:  The number of independent devices across which the (virtual) device
+    is placed.  The normal goal is to write 'width' blocks on every single write.
+    On a RAID4/5/6 this will avoid the need to pre-read for parity calculations,
+    and it will keep all devices equally busy with writes.
+    The 'width' blocks probably aren't consecutive.
+
+    There are two different layouts - one with width*stride <= segment_size
+    and one with width*stride > segment_size.
+
+  width*stride <= segment_size
+     This is a traditional striped layout like RAID0/4/5/6.
+     The 'stride' is the chunk size, so 'width*stride' is the stripe size,
+     and segment_size must be a multiple of this.
+     In this case all addresses in a single segment are contiguous.  We don't
+     necessarily write them in order if we want to write less than one stripe.
+     segment_offset will normally be a multiple of width*stride though this isn't
+     enforced as one could have a partition with a non-aligned start.
+
+  width*stride > segment_size
+     This implies a concatenated layout.  If parity-redundancy is in use then the
+     blocks which combine to form a stripe are 'stride' blocks apart.
+     The benefit of this layout is that an extra drive can be added by simply
+     zeroing it and joining it to the array - no re-stripe needed.
+     This will make all stripes slightly larger so at first the space will not
+     be available.  As cleaning happens the space will gradually become
+     available.  This still requires restriping, but unlike a normal
+     raid5 restripe, the space becomes available in small amounts immediately.
+     When there is no demand for more space, the re-striping (cleaning) can happen
+     at a very low priority with no cost.
+
+     In this case the blocks in a segment are not contiguous.
+      Runs of 'segment_size/width' blocks are, then there is a large gap (in virtual
+      address space) to the next chunk.
+
+     The segment_offset is an amount of space which is free at the start of
+     each device.  0..segment_offset and stride..stride+segment_offset etc
+     do not contain data and can be used for metadata.
+
+  When width > 1 it makes sense to replicate each state block across
+     every device - as we want to write the whole stripe anyway.
+  For now we only write and read the first two copies at the beginning, and
+  the last two at the end...
+
+  Question:  what do we want to do about metadata on flash devices?  We really
+   don't want a small number of locations to store the metadata, but a large
+   number that we search through - possibly a binary search. 
+   These could be all at start/end or scattered throughout the device.
+   The latter would make it impossible to find efficiently - there is no way to
+   create useful linkage without writing something else at start or end.
+   As many devices optimise for random writes where the FAT table would be,
+   it makes sense to just put the metadata there and not at the end.
+   We should allow one 'page' for each metadatum, which probably means
+   32K.
+   So we should allow all state blocks to be near the start.
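
Illustration (not part of the commit): the width/stride notes added to the
README distinguish two layouts by comparing width*stride with segment_size.
The sketch below is one possible reading of those notes, written in Python
purely for illustration.  The function names and the exact address
arithmetic are assumptions; they are not taken from the LaFS source.  In
particular it assumes that in the concatenated layout device N starts at
virtual address N*stride, and that data begins segment_offset blocks into
each device.

```python
def layout_kind(width, stride, segment_size):
    """Classify the layout as described in the notes (names are hypothetical)."""
    return "striped" if width * stride <= segment_size else "concatenated"


def segment_blocks(seg, width, stride, segment_size, segment_offset=0):
    """Return the virtual block addresses assumed to make up segment 'seg'.

    Striped: one contiguous run of segment_size blocks.
    Concatenated: 'width' runs of segment_size/width blocks, with
    consecutive runs 'stride' blocks apart.
    This mapping is an assumption made for the example only.
    """
    if layout_kind(width, stride, segment_size) == "striped":
        start = segment_offset + seg * segment_size
        return list(range(start, start + segment_size))

    per_dev = segment_size // width  # contiguous run on each device
    blocks = []
    for dev in range(width):
        # Skip the reserved area at the start of each device, then index
        # into that device's share of the segments.
        start = dev * stride + segment_offset + seg * per_dev
        blocks.extend(range(start, start + per_dev))
    return blocks


if __name__ == "__main__":
    # 4 devices, 1000 blocks each, 8-block segments, 10 reserved blocks.
    print(layout_kind(4, 1000, 8))
    print(segment_blocks(0, 4, 1000, 8, segment_offset=10))
```

Under these assumptions the reserved ranges 0..segment_offset and
stride..stride+segment_offset never appear in any segment, which matches
the note that they can be used for metadata.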