dir.doc

   1 [[[
   2  I want to use "find next block" rather than "find previous block".
   3  It has the advantage that we could up until we hit the block number.
   4  But I want to start in '0'.
   5  So:
   6    Entry with hash X is stored in first block at or after X+1.
   7    If there is no such block, it is stored in '0'.
   8  Thus when we walk for readdir, we start at block 1, and finish at
   9  block 0.
  10
  11 ]]]
  12 Directory format.
  13
  14 POSIX (and NFS) requires that we can find an entry by name or by a
  15 fixed length key.
  16
  17 So, we treat the directory file like a hash table with internal
  18 chaining.  The chain function is simple sequencing. i.e. if the
  19 hash key we want is in use, increment and try again.
  20
  21 We store a random secret in the directory inode so that knowledge of
  22 the hash values cannot be used to generate a large number of colliding
  23 entries (except for testing).
  24
  25 The hash for a name is then generated from the secret and the name,
  26 probably using 'tea' or whatever ext3 does.
  27
  28 The directory file is sparse.  The entry with a given key will be
  29 stored in the allocated block (non-hole) with the highest index that
  30 is not greater than the key.
  31
  32 A block can contain multiple entries.  They are arranged in an AVL
  33 tree.  Each entry is aligned on 1/256 of the block size and contains:
  34   inode (4 bytes)
  35   fore/back pointers (1 byte each)
  36   type (4 bits)
  37   hash-chain info (2bits +).
  38   AVL 'longer' flags (2 bits).
  39   name len (1 byte)
  40   filename
  41   optional hash offset.
  42
  43 The hash value for an entry is computed by calculating the hash for
  44 the name and adding some number from the hash-chain info:
  45   If this is 0-1 - that number is added
  46   If 2 - a single byte follows which is added
  47   If 3 - 4 bytes follow to be added.
  48
  49 An entry may be kept for a deleted name to ensure that internal
  50 chains are not broken. In this case the inode will be 0 and possibly
  51 the len will be zero and the hash offset will be 4 bytes.
  52
  53 Each directory block is divided into 256 'pieces'.  An entry starts
  54 at the start of a piece,  This one byte can index an entry.  The first
  55 piece is used for a header, so indexes are always non-zero.
  56 The header contains
  57  - head of AVL tree (one byte)
  58  - last piece in use (one byte)
  59  - number of pieces used by removed entries (one byte)
  60
  61
  62 Hash:
  63   We use a 31bit hash.  This still allows for 2billion entries,
  64   and means we can add '.' and '..' with indexes without pushing
  65   beyond 32bits.
  66
  67 Lookup:
  68   To find a name is a directory we calculate the hash and probe for
  69   that value.
  70   To "probe" means to find the last allocated block with an index not
  71   greater than the value, then search within that block.
  72   If the value isn't found, the name does not exist.
  73   If the value is found and the entry has the right name, the name
  74   does exist.
  75   Otherwise we increment that hash value and probe again.
  76
  77 Readdir:
  78   Readdir walks the blocks of the file in order, and returns
  79   (non-deleted) entries in hash order.  '.' is 0, '..' is 1, others
  80   has 2 added to them.
  81
  82 Create:
  83   To add an entry, repeat the probe pattern of lookup, stopping when
  84   a missing hash or a deleted entry is found.  Use that hash number,
  85   removing the 'deleted' entry if found.
  86   If there is room in the block, add the new entry in the same block.
  87   If not, we try repacking the directory block first.  If there
  88   is still not enough space, move half the entries into a new block
  89   and write it at an appropriate index.  This will involve copying all
  90   entries which may free up quite a bit of space.
  91
  92 Remove:
  93   To remove any entry we first find it, then find the entry with a
  94   hash one less.  If this entry does not exist, we can remove the
  95   target entry, otherwise we simply mark it as deleted.
  96   If 'one less' is in a separate block, we do not go looking for it,
  97   but instead schedule the block for orphan handling.
  98   If the entry is removed, we look up the next index.  If it is marked
  99   as deleted, we remove it and try again.  If the 'next' index is
 100   in a separate block, we similarly schedule orphan handling.
 101
 102 Orphan handling:
 103   If a block is processed as an orphan, there are two tasks required:
 104
 105     1/ lookup the index one less than the block number.  If that
 106        does not exist, then remove any leading 'deleted' entries in
 107        the block.
 108     2/ load the 'next' block and if the leading entry is 'deleted',
 109        schedule orphan processing on it.
 110
 111 ----------------------------------------------------
 112
 113 Commit:
 114  Any update is committed by a transaction that records the operation.
 115  Each transaction has a name, an inode number, a type, and optionally
 116  a blocknumber (well, it is always there but often ignored).
 117  Types are:
 118
 119   LINK - create the name as a new link to the inode
 120   UNLINK - remove the name.  If it is a directory, take care of
 121       the linkcount in the parent as well.
 122   REN_SOURCE - The name is about to be moved. Record the name
 123          and record the bnum to track 'this' rename
 124   REN_NEW_TARGET - This name didn't exist before but is collecting
 125          a previos REN_SOURCE.  Effect the rename at this point.
 126   REN_OLD_TARGET - remove this name, and replace it with the matching
 127          REN_SOURCE which gets unlinked.
 128
 129
 130 Locking:
 131    While a directory update is happening, the directory must be locked
 132    against other updates (sorry).
 133    Then we:
 134       Perform a lookup to find where changes might happen,
 135       preallocate all memory and reserve space in the filesystem.
 136
 137    If that works for all updates we:
 138       lock against a checkpoint
 139       log the updates
 140       complete the require commit steps
 141       including registering orphans
 142       unlock the checkpoint
 143       schedule the orphan handling
 144
 145    If that doesn't work we
 146       return any free blocks
 147
 148    Then unlock the directory.
 149