inode.doc

   1
   2 Creating a file:
   3 We create a file separately from creating a name.
   4 We create an orphan entry to ensure that the
   5 file gets cleaned up after a restart if no
   6 name exists.
   7
   8 The commit for a file creation involves
   9  - writing an update block for the inode
  10 The checkpoint for file creation involves
  11  - clearing the bit in the inode bitmap
  12  - recording the inode as an orphan
  13  - writing the inode block.
  14
  15 We need to allocate the inode number outside the
  16 checkpoint lock (do we?) but clear the bit inside.
  17
  18 To avoid races implicit in this (two threads allocate
  19 an inum before either clears the bit) we keep track of the
  20 last allocated inode number to avoid quick reallocation.
  21 We guarantee exclusion by flagging the inode when
  22 we create it... or something. More later
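
A minimal sketch of the "remember the last allocated inode number" idea,
assuming a single 64-entry map where a set bit means "free" (matching the
note that creation clears the bit).  The names inode_map, inum_pick and
inum_commit are hypothetical, not taken from the real code.  Picking happens
outside the checkpoint lock and leaves the bit alone; the bit is only
cleared in the commit step.

```c
#include <assert.h>
#include <stdint.h>

#define NINODES 64u   /* hypothetical size of one inode-map block */

struct inode_map {
    uint64_t free_bits;   /* bit set => inode number is free */
    unsigned last_alloc;  /* most recently handed-out number */
};

/* Pick a free inode number, searching from just after the last one
 * handed out, so that two threads racing before either clears its bit
 * are unlikely to receive the same number.  The bit is NOT cleared
 * here.  Returns -1 if nothing is free. */
static int inum_pick(struct inode_map *m)
{
    for (unsigned i = 1; i <= NINODES; i++) {
        unsigned n = (m->last_alloc + i) % NINODES;
        if (m->free_bits & (UINT64_C(1) << n)) {
            m->last_alloc = n;
            return (int)n;
        }
    }
    return -1;
}

/* Done later, inside the checkpoint lock: mark the number as in use. */
static void inum_commit(struct inode_map *m, unsigned n)
{
    assert(n < NINODES);
    m->free_bits &= ~(UINT64_C(1) << n);
}
```

This only makes quick reallocation unlikely; with a single free number left
the same inum is handed out twice, so some flag on the inode is still needed
for real exclusion.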
  23
  24
  25 Hmmm... this isn't inode specific but:
  26  The update block must be written, or scheduled
  27 at least, in the same checkpoint which holds
  28 the transaction.
  29 This means we have to write the update inside
  30 a checkpoint lock... but how long can we wait?
  31 Maybe we want an 'update-and-lock' operation..
  32 No. just pre-allocate/lock/commit as with everything
  33 else.
  34 So we want to be able to pre-allocate log space.
  35 How does that work?  We cannot block writes until
  36 the space is used...  What are we really preallocating?
  37
  38 This is like locking a block.  We need to be sure there
  39 is enough space in the log to write it, and we keep
  40 that buffer between us and 'full' until the write happens.
  41 If there are lots of concurrent reservations, that
  42 decreases the free space we have to work with and
  43 pushes checkpoints more often.
  44 The 'reserve' will block until there is enough free
  45 space to allow it, either because space has been returned,
  46 or has been freed up by a clean/checkpoint.
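
A rough sketch of the reservation bookkeeping described in this paragraph,
under the assumption that space is counted in whole blocks.  log_space,
log_reserve, log_use and log_unreserve are made-up names.  Where the real
'reserve' would block, this version just reports failure so the caller can
wait and retry.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct log_space {
    size_t free;      /* blocks neither used nor promised */
    size_t reserved;  /* blocks promised to writes that have not happened yet */
};

/* Try to set aside n blocks.  A real implementation would sleep here
 * until space is returned or a clean/checkpoint frees some. */
static bool log_reserve(struct log_space *ls, size_t n)
{
    if (ls->free < n)
        return false;
    ls->free -= n;
    ls->reserved += n;
    return true;
}

/* The reserved write happened: the space is now genuinely used. */
static void log_use(struct log_space *ls, size_t n)
{
    assert(ls->reserved >= n);
    ls->reserved -= n;
}

/* The reservation was not needed after all: hand the space back. */
static void log_unreserve(struct log_space *ls, size_t n)
{
    assert(ls->reserved >= n);
    ls->reserved -= n;
    ls->free += n;
}
```

Every outstanding reservation lowers 'free', which is what pushes
checkpoints to happen more often when there are many concurrent reservers.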
  47
  48 Uhm... no.  Something wrong here.
  49 Sometimes we want to lock a block and wait if there isn't
  50 enough room. But we cannot wait while we have a checkpoint lock.
  51 So we need to lock outside the checkpoint lock.  But currently
  52 we don't think that is OK for data blocks.
  53 Why? because they might be written too early?
  54 Index blocks stay permanently locked while they have
  55 a dirty or otherwise active child.  They get written iff
  56 they are dirty.
  57 Datablocks should stay locked until the refcount hits zero.
  58 So we lock them outside the checkpoint, which reserves space
  59 etc.  When a checkpoint happens it gets written if it is dirty.
  60 So it goes like this:
  61   - lock the block getting all reservations needed
  62   - take checkpoint lock
  63   - wait for block to be clean or in this phase
  64   - update block
  65   - mark block dirty.
  66
  67 Block stays locked while there is a ref count.
  68
  69 See writeout.doc - I revised the above.
  70
  71 ---------------------------------------------------
  72 When is the in-core inode flushed into the datablock?
  73 For regular file metadata (mode, owner etc) this probably
  74 isn't important as there are no interdependencies.
  75
  76 For internal metadata such as
  77  - head for directory btree
  78  - start of directory free-list
  79  - size of inode map block
  80 this is important and the related block changes must happen
  81 in the same checkpoint.
  82 I suspect we need a phase number that gets set when
  83 the inode is synced to the block.  This can be 0, 1
  84 or unset.
  85 When flushing causes an inode to be written, if the
  86 phase number is set to the next phase, we skip the
  87 sync, otherwise we sync and clear the phase flag.
  88
  89 When we want to write something that has to be in this phase,
  90 we first check the phase flag; if it is set to the wrong phase,
  91 we sync incore to block. Then we set the flag and update incore.
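
One possible reading of the phase-flag rules above, as a sketch only.  The
struct and function names (sk_inode, flush_sync, update_in_phase) are
invented, and single ints stand in for the structural metadata held in core
and for its copy in the data block.

```c
#include <assert.h>
#include <stdbool.h>

enum { PHASE_UNSET = -1 };   /* otherwise the flag is 0 or 1 */

struct sk_inode {
    int sync_phase;   /* PHASE_UNSET, 0 or 1 */
    int incore;       /* stand-in for structural metadata held in core */
    int in_block;     /* stand-in for the copy in the inode's data block */
};

static void sync_to_block(struct sk_inode *ino)
{
    ino->in_block = ino->incore;
}

/* Flush path: the block is about to be written in 'writing_phase'.
 * Returns true if the in-core state was copied into the block. */
static bool flush_sync(struct sk_inode *ino, int writing_phase)
{
    int next_phase = !writing_phase;

    if (ino->sync_phase == next_phase)
        return false;             /* in-core state belongs to the next phase */
    sync_to_block(ino);
    ino->sync_phase = PHASE_UNSET;
    return true;
}

/* Update path: change structural metadata that must land in 'phase'. */
static void update_in_phase(struct sk_inode *ino, int phase, int value)
{
    assert(phase == 0 || phase == 1);
    if (ino->sync_phase != PHASE_UNSET && ino->sync_phase != phase)
        sync_to_block(ino);       /* push out the other phase's state first */
    ino->sync_phase = phase;
    ino->incore = value;
}
```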
  92
  93 ----------------
  94 When ->write_inode is called, we only want to write non-structural info.
  95 This is the section of file metadata from flags to atime.
  96
  97
  98 So:
  99  mark_inode_dirty
 100    checks that the block is locked.  It must be
 101      as we only call this after something that has
 102      had a chance to lock, or fail.
 103    copies info into the buffer
 104    marks the block dirty
 105
 106 write_inode just writes a commit record if mark_inode_dirty has been called.
 107 If a commit record cannot be allocated, force a checkpoint.
 108
 109 Checkpoint will write out the whole inode eventually.
 110
 111 Size updates use a different commit system.
 112
 113 Hmmmm...
 114  If we are doing a checkpoint, and this inode hasn't been written in the
 115  old phase yet, then we cannot copy data from inode to block.
 116  So we mark the inode as dirty. When phase flips, if it is dirty, we copy
 117  info
 118 -----------------------------
 119 Whenever we update metadata in an inode, we immediately
 120 (via ->dirty_inode) sync it with the block buffer, unless
 121 that is still in the old phase, in which case we flag the inode as
 122 dirty so that after a phase change, it gets synced.
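
The deferred sync described here could look roughly like this.  It is a
sketch under assumptions: sk_block, sk_ino, sk_dirty_inode and
sk_phase_flipped are hypothetical names, one int stands in for all the
metadata, and locking is ignored.

```c
#include <assert.h>
#include <stdbool.h>

struct sk_block {
    int phase;   /* phase the block buffer currently belongs to */
    int copy;    /* stand-in for the on-disk inode image in the buffer */
};

struct sk_ino {
    struct sk_block *db;   /* the inode's data block */
    int  meta;             /* stand-in for in-core metadata */
    bool sync_pending;     /* set when a sync had to be deferred */
};

/* Called on every metadata update, in the role of ->dirty_inode. */
static void sk_dirty_inode(struct sk_ino *ino, int current_phase)
{
    assert(ino->db);
    if (ino->db->phase != current_phase) {
        ino->sync_pending = true;   /* block is still in the old phase */
        return;
    }
    ino->db->copy = ino->meta;      /* sync immediately */
}

/* Called once the block has moved over to the new phase. */
static void sk_phase_flipped(struct sk_ino *ino, int new_phase)
{
    ino->db->phase = new_phase;
    if (ino->sync_pending) {
        ino->db->copy = ino->meta;
        ino->sync_pending = false;
    }
}
```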
 123
 124
 125 -----------------------------------------------
 126
 127 When an inode is deleted, we need to preserve the inode until
 128 truncation is complete, then clean everything up.
 129 This means that our destroy_inode should not actually destroy it,
 130 but rather mark it so that when the InoIdx block goes away, the
 131 inode is destroyed.
 132 But we need to be careful of inode number reuse...
 133
 134 So: when a file is finally deleted (lafs_delete_inode) we need to:
 135  - release all the data pages - this is really just updating lots of
 136    segment usage counts.
 137  - create a hole in the inode file
 138  - set the bit in the inode usage map
 139 This should be done without waiting in lafs_delete_inode, which means
 140 that lafs_destroy_inode should not free the inode if it is being deleted.
 141 The truncation will happen from the 'orphan handler' thread.
 142 We need an I_Flag for "Deleting".  If set, destroy won't free it but
 143 instead will set I_Destroyed.
 144 When the hole is created and the datablock is freed, we clear
 145 Deleting and then if Destroyed is set, we free.
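
The Deleting/Destroyed handshake, as a small sketch.  The flag values and
the names del_inode, del_start, del_destroy and del_finished are
assumptions for illustration; 'freed' stands in for actually releasing the
in-memory inode.

```c
#include <assert.h>
#include <stdbool.h>

enum { I_DELETING = 1 << 0, I_DESTROYED = 1 << 1 };  /* hypothetical bit values */

struct del_inode {
    unsigned flags;
    bool     freed;   /* stands in for really freeing the inode */
};

/* Deletion begins: from now on destroy must not free the inode. */
static void del_start(struct del_inode *ino)
{
    ino->flags |= I_DELETING;
}

/* In the role of lafs_destroy_inode. */
static void del_destroy(struct del_inode *ino)
{
    if (ino->flags & I_DELETING)
        ino->flags |= I_DESTROYED;   /* remember the request for later */
    else
        ino->freed = true;
}

/* The hole has been created and the data block freed. */
static void del_finished(struct del_inode *ino)
{
    assert(ino->flags & I_DELETING);
    ino->flags &= ~(unsigned)I_DELETING;
    if (ino->flags & I_DESTROYED)
        ino->freed = true;
}
```

Whichever of destroy and end-of-truncation comes last does the free, so the
two can happen in either order.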
 146
 147
 148 Deleting an inode is committed by the unlink that removes the last link.
 149 This makes it an orphan.  It stays an orphan until truncate finishes with it.
 150 So lafs_delete_inode just starts the truncate process.
 151 It sets trunc_next to 0.  When trunc_next becomes MAX we can tear down
 152 the index tree, set the inode_map bit, and set the data block to be a Hole.
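
A toy model of how trunc_next might walk from 0 to MAX.  Only trunc_next
comes from the notes; the batch size, the nblocks field and the function
names are invented, and the tear-down itself is reduced to a flag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TRUNC_MAX UINT32_MAX   /* hypothetical "nothing left to truncate" marker */

struct tr_inode {
    uint32_t trunc_next;   /* next file block the truncate must visit */
    uint32_t nblocks;      /* hypothetical: number of blocks the file spans */
    bool     torn_down;    /* index tree gone, map bit set, block is a Hole */
};

/* In the role of lafs_delete_inode: just start the truncate. */
static void tr_delete_start(struct tr_inode *ino)
{
    ino->trunc_next = 0;
    ino->torn_down = false;
}

/* One pass of the orphan handler: release up to 'batch' blocks.
 * Returns true once the inode has been torn down. */
static bool tr_orphan_step(struct tr_inode *ino, uint32_t batch)
{
    if (ino->trunc_next != TRUNC_MAX) {
        assert(ino->trunc_next <= ino->nblocks);
        uint32_t left = ino->nblocks - ino->trunc_next;
        if (left <= batch)
            ino->trunc_next = TRUNC_MAX;    /* everything released */
        else
            ino->trunc_next += batch;
    }
    if (ino->trunc_next == TRUNC_MAX && !ino->torn_down)
        ino->torn_down = true;   /* tear down tree, set map bit, make Hole */
    return ino->torn_down;
}
```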
 153
 154 EEKK delete needs lots of thought/work.
 155
 156 As the inode has just been changed, it might be dirty etc.  We need to
 157 wait for the index block to naturally get cleaned up before throwing it
 158 away and creating the hole etc...
 159 So: how do the data block, then index blocks, get dealt with?
 160 Need to understand both nondelete and delete dropping of an inode.
 161
 162 inode holds a reference on the dblock.  It needs to be able to drop
 163 that when it is clean etc...
 164
 165 The key issue here is blocks disappearing from the index tree.
 166  - Datablocks are removed when their refcount becomes 0, and they are clean.
 167    truncate_pages should achieve this easily.
 168  - Index blocks are removed as they too become clean.  This should also
 169    happen promptly, though the truncate operation will need to find all
 170    of them eventually.
 171  - The InoIdx block is referenced from the inode so won't go away in a hurry.
 172    But really, the reference from the inode should not stop the parent link
 173    from going away.  Maybe that reference should be counted differently.
 174    Maybe the iblock/dblock references shouldn't be counted at all.
 175    When the inode is dropped, we drop the reference from the dblock
 176    When the dblock is dropped, we clear the reference from the inode.
 177    ->my_inode->dblock
 178    InoIdx: ->inode->iblock are loops.
 179
 180 So flushing or checkpointing after truncation should result in all index
 181 blocks being allocated, found empty and a Hole Punched.
 182 So for Index blocks, we run incorporation and if it is empty, we might punch
 183 a Hole.  For the InoIdx block we only punch the real hole if I_Deleting.
 184
 185 butbutbut can we drop ->dblock?   A: only when the inode is clean and not pinned.
 186 So: how does deletion work?
 187   We mark the inode as an orphan (Should already have happened), set
 188   the "trunc_next" pointer to the start of the file and schedule the
 189   orphan handler. This ultimately allocates some 0s to the InoIdx block
 190   in which incorporation makes it dirty.
 191   This needs to be detected in incorporate and instead of the inode being
 192   written out, we punch a hole in the inode file and set the inode-available bit.
 193
 194   We write out the inode map update in the same checkpoint that the
 195   delete starts.  Until the delete completes, the inode will have B_Claimed set
 196   so that others won't try to use the same inode number.
 197
 198
 199 --------------------------------------------------
 200
 201 The duality of the inode index block and data block is becoming a bore.
 202 We need to often update them both which is awkward.
 203 And getting from one to the other is awkward.
 204 But we need them to be different because:
 205   1/ they contain different sorts of information
 206   2/ When an indexing tree grows, we need a new InoIdx block but not new data
 207   3/ ...
 208 Maybe we just need clearer rules on what gets updated when.
 209 Current uglinesses include:
 210
 211   cluster_allocate
 212         skip data block if InoIdx is pending
 213         use datablock when trying to do indexblock
 214
 215   cluster_done
 216         clear B_Alloc on InoIdx when clearing on data
 217
 218   cluster_flush
 219         clear Dirty on InoIdx when clearing on data
 220
 221   lafs_refile
 222         don't clear Pinned on data when InoIdx is pinned
 223         only put data on phase_leafs when index not pinned
 224       ??Destroy inode when data block gets freed (need better test)
 225
 226   inode_fill
 227         need to dirty both blocks
 228
 229   pin_iblock
 230         also pins dblock for an inode
 231
 232   phase_flip
 233         also flips dblock phase
 234   make_iblock
 235         copy Dirty flag across
 236
 237
 238 An issue is that we need to keep the flags correct so that 'Pinned' and 'parent'
 239 etc hang around as required.
 240
 241 An ugliness is that normally data blocks don't flip-phase. They just drop out and
 242 maybe get pinned again in the next phase.
 243 But inode datablocks seem to need to.  Maybe they shouldn't.
 244 Maybe the InoIdx being pinned should be enough.  While it is pinned it holds
 245 a reference on the data block(??) and when we want to force a write, we pass
 246 credits over to data block.. We could set Alloc and clear Dirty on InoIdx then
 247 start processing the data block.
 248
 249 Hmmm... maybe just tidy up the code, and leave the functionality as it is.
 250
 251 ----------
 252 When a file is small, the data is in the inode.
 253 The InoIdx block is still an index block though, and there is a separate
 254 data block which the data gets copied into.
 255 So the one data block is a child of the InoIdx block which is a twin of the
 256 inode data block.
 257 When the file grows and the inode must contain real information, we simply update
 258 the content of the inode to be index info and make sure to write out the
 259 new data block.
 260 When the file grows again, the InoIdx block must be demoted to a regular
 261 Index block, and a new InoIdx block created to parent that Index block.
 262 As the InoIdx block doesn't have its own buffer, it must steal the buffer from
 263 the new block that will become the new InoIdx block.
 264
 265 -------------------------------
 266 Block usage.
 267
 268 The Inode on disk records 'data_blocks' and 'index_blocks' which record
 269 how many blocks are in the index tree below the inode.
 270
 271 The in-memory inode records:
 272    cblocks.  This counts the blocks the inode would record if we flushed it out
 273         now.  It includes all blocks stored plus any that have been allocated
 274         to disk as new blocks in the current phase.
 275         It is also reduced when "holes" are registered for incorporation.
 276         They are "committed" blocks.
 277    pblocks.  This counts changes that have been registered for incorporation in
 278         the *next* phase.  On a phase change, pblocks is added to
 279         cblocks, and zeroed.
 280         They are "phased" blocks.
 281    ablocks.  This counts new blocks that have been dirtied but have not yet
 282         been registered for incorporation - i.e. they haven't been written
 283         to storage.
 284         They are "allocated" blocks.
 285         We add to ablocks when we set B_Prealloc on a data block with physaddr==0
 286         we remove from ablocks when we clear B_Prealloc on such a block
 287         we also remove when we set physaddr - prealloc must be set
 288
 289 So:
 290    'cblocks' is loaded from the inode, and written out.
 291    When a block which didn't have an address is "lafs_allocated_block"
 292    with an address, then we increment 'cblocks' or 'pblocks' depending
 293    on whether the index block is in-phase with the inode.
 294    We also decrement 'ablocks'.
 295    When we dirty a block with no physical address, we increment 'ablocks'.
 296    When we remove a block (lafs_allocated_block with phys==0) or otherwise
 297    in truncate, we decrement 'cblocks' or 'pblocks', and also 'ablocks'
 298    if the block was dirty but had phys=0 already.
 299
 300
 301    For index blocks, we similarly have ciblocks and piblocks.
 302    No aiblocks is needed as we don't pre-allocate index blocks.
 303
 304    The value returned for getattr is the sum of cblocks, pblocks, ablocks.
 305    I wonder if ciblocks and piblocks should be added.
 306    The counters are protected by .... i_lock
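
The counter rules above can be sketched as follows.  The field names come
from the notes; the function names are hypothetical, and the i_lock
protection is left out.  The case of removing a block that was dirty but
still had phys==0 is deliberately omitted, since the notes leave it
ambiguous.

```c
#include <assert.h>

struct blk_counts {
    long cblocks;  /* "committed": what the on-disk inode would record */
    long pblocks;  /* "phased": changes registered for the next phase */
    long ablocks;  /* "allocated": dirtied, but no address yet */
};

/* A block with no physical address has been dirtied. */
static void cnt_dirty_new(struct blk_counts *c)
{
    c->ablocks++;
}

/* A block that had no address was given one.  'in_phase' says whether
 * the index block is in the same phase as the inode. */
static void cnt_got_address(struct blk_counts *c, int in_phase)
{
    assert(c->ablocks > 0);
    if (in_phase)
        c->cblocks++;
    else
        c->pblocks++;
    c->ablocks--;
}

/* A block that had an address was removed (hole or truncate). */
static void cnt_removed(struct blk_counts *c, int in_phase)
{
    if (in_phase)
        c->cblocks--;
    else
        c->pblocks--;
}

/* Phase change: pending changes become committed. */
static void cnt_phase_flip(struct blk_counts *c)
{
    c->cblocks += c->pblocks;
    c->pblocks = 0;
}

/* Value reported through getattr. */
static long cnt_getattr_blocks(const struct blk_counts *c)
{
    return c->cblocks + c->pblocks + c->ablocks;
}
```

Note that pblocks is signed here: removals registered for the next phase
can make it negative until the phase flips.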
 307
 308    We need to implement this in:
 309       lafs_allocated_block
 310       lafs_dirty_dblock
 311       hole handling
 312       truncate
 313
 314 ---------------------------------------
 315 InoIdx and Data - when is which dirty?
 316
 317 Currently when we try to dirty an inode data block, we actually
 318   dirty the InoIdx block instead if it exists.  And is pinned.
 319   But I'm not sure that is correct.
 320  In cluster_allocated we don't handle a non-pinned inode data block
 321    if the InoIdx block is Pinned  to the same phase... which doesn't
 322    make sense because the data block isn't pinned.
 323  but when the InoIdx block is ready, we pin the data block which allows
 324  it to be written...
 325
 326  There are two times when we want to write an inode block.
 327   1/ when writepage writes an embedded page 0.  In this case there are
 328    no other index blocks to be pinning the InoIdx.  So when the data
 329    block is successfully allocated we should immediately allocate
 330    the inoidx and thus the data, clearing all the 'Dirty' bits.
 331   2/ when a checkpoint gets to it.  The InoIdx will be pinned but the
 332    data block not.  Once we allocate the InoIdx, that will 'pin' the
 333    data block so it will then be processed.
 334
 335   If we perform metadata operations on an inode while no inoidx is present
 336   (as no data is being accessed) we want to Pin the data block to ensure it
 337   gets written... Or we could just write it and ensure roll-forward picks up
 338   the right bits.
 339   In that case an inode block is just like any data block.  We make changes
 340   and dirty it.  Eventually it gets written.  It can be pinned if we want,
 341   but there is no point pinning it if the InoIdx is pinned to the same phase.
 342   Roll-forward must pick up all content except index information.
 343
 344   Mark_inode_dirty is called for atime, ctime, size...
 345   Most of those we can control and see, just not atime.
 346   So only really write the inode if we know of an important change...
 347
 348 So:
 349   An inode-data-block can be dirtied:
 350    - in flush_data_to_inode when we copy from block 0
 351    - in lafs_cluster_allocate when the index block is being allocated
 352    - in lafs_inode_init
 353    - in lafs_inode_fillblock
 354    - in inode_map_new_commit ???  Not needed, init does it already.
 355   An InoIdx block can be dirtied:
 356    - in lafs_dirty_dblock when dirtying the data block - don't do this!
 357    - prior to incorporating changes or adding addresses
 358           i.e. before lafs_add_block_address
 359
 360   If a db is pinned while the ib is pinned to the same phase, drop
 361    that pinning, it isn't needed.
 362   When an ib is allocated, pin the db and make sure it is dirty.
 363
 364   Writepage on an inodefile page should fail if any inodes have pinned
 365   InoIdx.  Otherwise it can succeed.