We create a file separately from creating a name.
We create an orphan entry to ensure that the
file gets cleaned up after a restart if no

The commit for a file creation involves
- writing an update block for the inode
The checkpoint for file creation involves
- clearing the bit in the inode bitmap
- recording the inode as an orphan
- writing the inode block.

We need to allocate the inode number outside the
checkpoint lock (do we?) but clear the bit inside.

To avoid the races implicit in this (two threads may allocate
the same inum before either clears the bit) we keep track of the
last allocated inode number to avoid quick reallocation.
We guarantee exclusion by flagging the inode when
we create it... or something. More later.
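
A rough sketch of the "remember the last allocated inum" idea, just to
make the race concrete. Everything here is hypothetical: the names
inode_map, inum_pick and inum_commit, the 64-entry map, and the choice
that a set bit means "free" (picked so that creating a file clears the
bit, as above). It is not the real allocator.

```c
#include <assert.h>
#include <stdint.h>

#define NR_INUMS 64u   /* hypothetical map size, for the sketch only */

/* Hypothetical in-memory inode map.  A set bit means the number is
 * free, so file creation clears the bit. */
struct inode_map {
    uint64_t free_bits;   /* bit i set => inum i is free */
    unsigned last_alloc;  /* most recently handed-out inum */
};

/* Pick a free inum, searching from just after the last one handed out.
 * This runs outside the checkpoint lock and does NOT clear the bit.
 * Returns -1 if nothing is free. */
int inum_pick(struct inode_map *m)
{
    for (unsigned i = 1; i <= NR_INUMS; i++) {
        unsigned cand = (m->last_alloc + i) % NR_INUMS;
        if (m->free_bits & (UINT64_C(1) << cand)) {
            m->last_alloc = cand;
            return (int)cand;
        }
    }
    return -1;
}

/* Clear the bit.  In the scheme above this happens later, inside the
 * checkpoint lock. */
void inum_commit(struct inode_map *m, unsigned inum)
{
    assert(inum < NR_INUMS);
    m->free_bits &= ~(UINT64_C(1) << inum);
}
```

Two picks in a row get different numbers even though neither bit has
been cleared yet. This only makes quick reallocation unlikely; if the
search wraps around before the bit is cleared the same number can come
back, so some form of flagging is still needed for real exclusion.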

Hmmm... this isn't inode specific but:
The update block must be written, or scheduled
at least, in the same checkpoint which holds

This means we have to write the update inside
a checkpoint lock... but how long can we wait?
Maybe we want an 'update-and-lock' operation..
No. Just pre-allocate/lock/commit as with everything

So we want to be able to pre-allocate log space.
How does that work? We cannot block writes until
the space is used... What are we really preallocating?

This is like locking a block. We need to be sure there
is enough space in the log to write it, and we keep
that buffer between us and 'full' until the write happens.
If there are lots of concurrent reservations, that
decreases the free space we have to work with and
pushes checkpoints more often.
The 'reserve' will block until there is enough free
space to allow it, either because space has been returned,
or has been freed up by a clean/checkpoint.
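
A minimal sketch of the reservation accounting, assuming a simple pair
of counters. The struct and function names are made up for
illustration. The real 'reserve' would block; here it just reports
failure, and the caller is imagined to sleep and retry once space has
been returned or freed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical log-space accounting.  Invariant: reserved <= free_blocks. */
struct log_space {
    size_t free_blocks;  /* unwritten blocks left in the log */
    size_t reserved;     /* blocks promised to writes not yet done */
};

/* Try to keep n blocks "between us and full".  Non-blocking stand-in
 * for the blocking reserve described above. */
bool log_try_reserve(struct log_space *ls, size_t n)
{
    if (ls->free_blocks - ls->reserved < n)
        return false;
    ls->reserved += n;
    return true;
}

/* The reserved write finally happens: the space is really consumed. */
void log_write_reserved(struct log_space *ls, size_t n)
{
    assert(n <= ls->reserved && n <= ls->free_blocks);
    ls->reserved -= n;
    ls->free_blocks -= n;
}

/* The reservation is returned unused. */
void log_unreserve(struct log_space *ls, size_t n)
{
    assert(n <= ls->reserved);
    ls->reserved -= n;
}

/* A clean/checkpoint has freed up n blocks. */
void log_space_freed(struct log_space *ls, size_t n)
{
    ls->free_blocks += n;
}
```

Outstanding reservations shrink what is available to everybody else,
which is the "pushes checkpoints more often" effect.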

Uhm... no. Something wrong here.
Sometimes we want to lock a block and wait if there isn't
enough room. But we cannot wait while we have a checkpoint lock.
So we need to lock outside the checkpoint lock. But currently
we don't think that is OK for data blocks.
Why? Because they might be written too early?
Index blocks stay permanently locked while they have
a dirty or otherwise active child. They get written iff

Datablocks should stay locked until the refcount hits zero.
So we lock them outside the checkpoint, which reserves space
etc. When a checkpoint happens it gets written if it is dirty.

- lock the block getting all reservations needed
- take checkpoint lock
- wait for block to be clean or in this phase

Block stays locked while there is a ref count.

See writeout.doc - I revised the above.

---------------------------------------------------
When is the in-core inode flushed into the datablock?
For regular file metadata (mode, owner etc) this probably
isn't important as there are no interdependencies.

For internal metadata such as
- head for directory btree
- start of directory free-list
- size of inode map block
this is important and the related block changes must happen
in the same checkpoint.
I suspect we need a phase number that gets set when
the inode is synced to the block. This can be 0, 1

When flushing causes an inode to be written, if the
phase number is set to the next phase, we skip the
sync, otherwise we sync and clear the phase flag.

When we want to write something that has to be in this phase,
we first check the phase flag; if it is set to the wrong phase,
we sync incore to block. Then we set the flag and update incore.
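
The two rules above (flush side and update side) could be sketched
like this. It is only a guess at the shape: the struct, the function
names, and in particular the PHASE_UNSET value for a "cleared" flag
are assumptions, and a single int stands in for the internal metadata.

```c
#include <assert.h>

enum { PHASE_UNSET = -1 };  /* assumption: how a "cleared" flag is encoded */

/* Hypothetical stand-in for an inode with internal metadata. */
struct sketch_inode {
    int incore;      /* in-core copy of the metadata */
    int block;       /* copy held in the inode's data block */
    int phase_flag;  /* PHASE_UNSET, 0 or 1 */
};

static void sync_to_block(struct sketch_inode *ino)
{
    ino->block = ino->incore;
}

/* Flushing wants to write the inode as part of phase writing_phase. */
void inode_flush(struct sketch_inode *ino, int writing_phase)
{
    int next_phase = !writing_phase;

    assert(writing_phase == 0 || writing_phase == 1);
    if (ino->phase_flag == next_phase)
        return;  /* in-core already holds next-phase changes: skip the sync */
    sync_to_block(ino);
    ino->phase_flag = PHASE_UNSET;
}

/* We want to make a change that has to land in the given phase. */
void inode_update(struct sketch_inode *ino, int phase, int value)
{
    if (ino->phase_flag != PHASE_UNSET && ino->phase_flag != phase)
        sync_to_block(ino);  /* push out the other phase's state first */
    ino->phase_flag = phase;
    ino->incore = value;
}
```

The point is that the block copy never picks up a next-phase change
while the old phase is still being written.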

When ->write_inode is called, we only want to write non-structural info.
This is the section of file metadata from flags to atime.

checks that the block is locked. It must be
as we only call this after something that has
had a chance to lock, or fail.
copies info into the buffer
marks the block dirty

write_inode just writes a commit record if mark_inode_dirty has been called.
If a commit record cannot be allocated, force a checkpoint.

Checkpoint will write out the whole inode eventually.

Size updates use a different commit system.

If we are doing a checkpoint, and this inode hasn't been written in the
old phase yet, then we cannot copy data from inode to block.
So we mark the inode as dirty. When phase flips, if it is dirty, we copy

-----------------------------
Whenever we update metadata in an inode, we immediately
(via ->dirty_inode) sync it with the block buffer, unless
that is still in the old phase, in which case we flag the inode as
dirty so that after a phase change, it gets synced.

-----------------------------------------------

When an inode is deleted, we need to preserve the inode until
truncation is complete, then clean everything up.
This means that our destroy_inode should not actually destroy it,
but rather mark it so that when the InoIdx block goes away, the

But we need to be careful of inode number reuse...

So: when a file is finally deleted (lafs_delete_inode) we need to:
- release all the data pages - this is really just updating lots of
segment usage counts.
- create a hole in the inode file
- set the bit in the inode usage map
This should be done without waiting in lafs_delete_inode, which means
that lafs_destroy_inode should not free the inode if it is being deleted.
The truncation will happen from the 'orphan handler' thread.
We need an I_Flag for "Deleting". If set, destroy won't free it but
instead will set I_Destroyed.
When the hole is created and the datablock is freed, we clear
Deleting and then if Destroyed is set, we free.
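
The Deleting/Destroyed handshake might look roughly like this. The
sketch_* names, the flag values and the 'freed' marker are invented
for illustration; only the ordering rules come from the notes above.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag bits standing in for I_Deleting / I_Destroyed. */
enum { SK_DELETING = 1u << 0, SK_DESTROYED = 1u << 1 };

struct sketch_inode {
    unsigned flags;
    bool freed;   /* stands in for actually releasing the memory */
};

static void sketch_free(struct sketch_inode *ino)
{
    assert(!ino->freed);   /* must never free twice */
    ino->freed = true;
}

/* Start of deletion: mark it and return without waiting.  The orphan
 * handler thread is imagined to do the truncate later. */
void sketch_delete_inode(struct sketch_inode *ino)
{
    ino->flags |= SK_DELETING;
}

/* Destroy must not free an inode that is still being deleted. */
void sketch_destroy_inode(struct sketch_inode *ino)
{
    if (ino->flags & SK_DELETING) {
        ino->flags |= SK_DESTROYED;
        return;
    }
    sketch_free(ino);
}

/* Called once the hole is created and the datablock is freed. */
void sketch_truncate_done(struct sketch_inode *ino)
{
    ino->flags &= ~(unsigned)SK_DELETING;
    if (ino->flags & SK_DESTROYED)
        sketch_free(ino);
}
```

Whichever of destroy and truncate-completion comes last does the free,
and it happens exactly once. A real version would need the flag
updates to be atomic or done under a lock.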

Deleting an inode is committed by the unlink that removes the last link.
This makes it an orphan. It stays an orphan until truncate finishes with it.
So lafs_delete_inode just starts the truncate process.
It sets trunc_next to 0. When trunc_next becomes MAX we can tear down
the index tree, set the inode_map bit, and set the data block to be a Hole.

EEKK delete needs lots of thought/work.

As the inode has just been changed, it might be dirty etc. We need to
wait for the index block to naturally get cleaned up before throwing it
away and creating the hole etc...
So: how do the data block, then index blocks, get dealt with?
Need to understand both nondelete and delete dropping of an inode.

inode holds a reference on the dblock. It needs to be able to drop
that when it is clean etc...

The key issue here is blocks disappearing from the index tree.
- Datablocks are removed when their refcount becomes 0, and they are clean.
truncate_pages should achieve this easily.
- Index blocks are removed as they too become clean. This should also
happen promptly, though the truncate operation will need to find all
- The InoIdx block is referenced from the inode so won't go away in a hurry.
But really, the reference from the inode should not stop the parent link
from going away. Maybe that reference should be counted differently.
Maybe the iblock/dblock references shouldn't be counted at all.
When the inode is dropped, we drop the reference from the dblock.
When the dblock is dropped, we clear the reference from the inode.

InoIdx: ->inode->iblock are loops.

So flushing or checkpointing after truncation should result in all index
blocks being allocated, found empty and a Hole Punched.
So for Index blocks, we run incorporation and if it is empty, we might punch
a Hole. For the InoIdx block we only punch the real hole if I_Deleting.

butbutbut can we drop ->dblock? A: only when the inode is clean and not pinned.
So: how does deletion work?
We mark the inode as an orphan (should already have happened), set
the "trunc_next" pointer to the start of the file and schedule the
orphan handler. This ultimately allocates some 0s to the InoIdx block
in which incorporation makes it dirty.
This needs to be detected in incorporate and instead of the inode being
written out, we punch a hole in the inode file and set the inode-available bit.

We write out the inode map update in the same checkpoint that the
delete starts. Until the delete completes, the inode will have B_Claimed set
so that others won't try to use the same inode number.

--------------------------------------------------

The duality of the inode index block and data block is becoming a bore.
We often need to update them both, which is awkward.
And getting from one to the other is awkward.
But we need them to be different because:
1/ they contain different sorts of information
2/ When an indexing tree grows, we need a new InoIdx block but not new data

Maybe we just need clearer rules on what gets updated when.
Current uglinesses include:

skip data block if InoIdx is pending
use datablock when trying to do indexblock

clear B_Alloc on InoIdx when clearing on data

clear Dirty on InoIdx when clear on data

don't clear Pinned on data when InoIdx is pinned
only put data on phase_leafs when index not pinned
??Destroy inode when data block gets freed (need better test)

need to dirty both blocks

also pins dblock for an inode

also flips dblock phase

copy Dirty flag across

An issue is that we need to keep the flags correct so that 'Pinned' and 'parent'
etc hang around as required.

An ugliness is that normally data blocks don't flip-phase. They just drop out and
maybe get pinned again in the next phase.
But inode datablocks seem to need to. Maybe they shouldn't.
Maybe the InoIdx being pinned should be enough. While it is pinned it holds
a reference on the data block(??) and when we want to force a write, we pass
credits over to data block.. We could set Alloc and clear Dirty on InoIdx then
start processing the data block.

Hmmm... maybe just tidy up the code, and leave the functionality as it is.

When a file is small, the data is in the inode.
The InoIdx block is still an index block though, and there is a separate
data block which the data gets copied into.
So the one data block is a child of the InoIdx block which is a twin of the

When the file grows and the inode must contain real information, we simply update
the content of the inode to be index info and make sure to write out the

When the file grows again, the InoIdx block must be demoted to a regular
Index block, and a new InoIdx block created to parent that Index block.
As the InoIdx block doesn't have its own buffer, it must steal the buffer from
the new block that will become the new InoIdx block.

-------------------------------

The Inode on disk records 'data_blocks' and 'index_blocks' which record
how many blocks are in the index tree below the inode.

The in-memory inode records:
cblocks. This counts the number of blocks of the inode we flushed out
now. It includes all blocks stored plus any that have been allocated
to disk as new blocks in the current phase.
It is also reduced when "holes" are registered for incorporation.
They are "committed" blocks.
pblocks. This counts changes that have been registered for incorporation in
the *next* phase. On a phase change, pblocks is added to
They are "phased" blocks.
ablocks. This counts new blocks that have been dirtied but have not yet
been registered for incorporation - i.e. they haven't been written
They are "allocated" blocks.
We add to ablocks when we set B_Prealloc on a data block with physaddr==0,
we remove from ablocks when we clear B_Prealloc on such a block, and
we also remove when we set physaddr - prealloc must be set.

'cblocks' is loaded from the inode, and written out.
When a block which didn't have an address is "lafs_allocated_block"
with an address, then we increment 'cblocks' or 'pblocks' depending
on whether the index block is in-phase with the inode.
We also decrement 'ablocks'.
When we dirty a block with no physical address, we increment 'ablocks'.
When we remove a block (lafs_allocated_block with phys==0) or otherwise
in truncate, we decrement 'cblocks' or 'pblocks', and also 'ablocks'
if the block was dirty but had phys=0 already.

For index blocks, we similarly have ciblocks and piblocks.
No aiblocks is needed as we don't pre-allocate index blocks.

The value returned for getattr is the sum of cblocks, pblocks, ablocks.
I wonder if ciblocks and piblocks should be added.
The counters are protected by .... i_lock
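
To check that the bookkeeping above adds up, here is a toy version of
the three data-block counters. The struct and function names are
hypothetical, locking is left out, and count_phase_flip is an
assumption: the note on what happens to pblocks at a phase change is
incomplete, so folding it into cblocks and zeroing it is only a guess.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy copy of the in-memory counters; no i_lock here. */
struct blk_counts {
    long cblocks;  /* "committed" */
    long pblocks;  /* "phased" */
    long ablocks;  /* "allocated" */
};

/* B_Prealloc set on a data block with physaddr == 0. */
void count_prealloc(struct blk_counts *c)
{
    c->ablocks++;
}

/* B_Prealloc cleared on such a block without it being written. */
void count_unprealloc(struct blk_counts *c)
{
    assert(c->ablocks > 0);
    c->ablocks--;
}

/* A block that had no address is given one. */
void count_got_address(struct blk_counts *c, bool index_in_phase)
{
    assert(c->ablocks > 0);  /* prealloc must be set */
    c->ablocks--;
    if (index_in_phase)
        c->cblocks++;
    else
        c->pblocks++;
}

/* A block that had an address is removed (phys set back to 0). */
void count_lost_address(struct blk_counts *c, bool index_in_phase)
{
    if (index_in_phase)
        c->cblocks--;
    else
        c->pblocks--;
}

/* ASSUMPTION: on a phase change the phased delta joins the committed
 * count and starts again from zero. */
void count_phase_flip(struct blk_counts *c)
{
    c->cblocks += c->pblocks;
    c->pblocks = 0;
}

/* What getattr would report for data blocks. */
long count_for_getattr(const struct blk_counts *c)
{
    return c->cblocks + c->pblocks + c->ablocks;
}
```

Giving a preallocated block an address moves it from ablocks to
cblocks or pblocks, so the getattr sum does not change at that point,
and under the assumption above a phase flip does not change it either.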

We need to implement this in:

---------------------------------------
InoIdx and Data - when is which dirty?

Currently when we try to dirty an inode data block, we actually
dirty the InoIdx block instead if it exists and is pinned.
But I'm not sure that is correct.
In cluster_allocated we don't handle a non-pinned inode data block
if the InoIdx block is Pinned to the same phase... which doesn't
make sense because the data block isn't pinned.
But when the InoIdx block is ready, we pin the data block which allows

There are two times when we want to write an inode block.
1/ when writepage writes an embedded page 0. In this case there are
no other index blocks to be pinning the InoIdx. So when the data
block is successfully allocated we should immediately allocate
the inoidx and thus the data, clearing all the 'Dirty' bits.
2/ when a checkpoint gets to it. The InoIdx will be pinned but the
data block not. Once we allocate the InoIdx, that will 'pin' the
data block so it will then be processed.

If we perform metadata operations on an inode while no inoidx is present
(as no data is being accessed) we want to Pin the data block to ensure it
gets written... Or we could just write it and ensure roll-forward picks up

In that case an inode block is just like any data block. We make changes
and dirty it. Eventually it gets written. It can be pinned if we want,
but there is no point pinning it if the InoIdx is pinned to the same phase.
Roll-forward must pick up all content except index information.

Mark_inode_dirty is called for atime, ctime, size...
Most of those we can control and see, just not atime.
So only really write the inode if we know of an important change...

An inode-data-block can be dirtied:
- in flush_data_to_inode when we copy from block 0
- in lafs_cluster_allocate when the index block is being allocated
- in lafs_inode_fillblock
- in inode_map_new_commit ??? Not needed, init does it already.
An InoIdx block can be dirtied:
- in lafs_dirty_dblock when dirtying the data block - don't do this!
- prior to incorporating changes or adding addresses
i.e. before lafs_add_block_address

If a db is pinned while the ib is pinned to the same phase, drop
that pinning, it isn't needed.
When an ib is allocated, pin the db and make sure it is dirty.
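
The two pinning rules just stated, written down as a tiny sketch. The
types and names are hypothetical, and a plain int phase with
PHASE_NONE stands in for whatever the real Pinned flag and phase bits
look like. What happens to the ib's own pin after allocation is not
stated above, so the sketch leaves it untouched.

```c
#include <assert.h>
#include <stdbool.h>

enum { PHASE_NONE = -1 };  /* assumption: encoding for "not pinned" */

struct sketch_blk {
    int pinned_phase;  /* PHASE_NONE, 0 or 1 */
    bool dirty;
};

/* The inode's data block (db) and its InoIdx block (ib). */
struct inode_blocks {
    struct sketch_blk db;
    struct sketch_blk ib;
};

/* Rule 1: a db pin is dropped if the ib is pinned to the same phase. */
void pin_dblock(struct inode_blocks *b, int phase)
{
    assert(phase == 0 || phase == 1);
    if (b->ib.pinned_phase == phase) {
        b->db.pinned_phase = PHASE_NONE;  /* the ib pin is enough */
        return;
    }
    b->db.pinned_phase = phase;
}

/* Rule 2: when the ib is allocated, pin the db (to the ib's phase)
 * and make sure it is dirty, so that it then gets processed. */
void iblock_allocated(struct inode_blocks *b)
{
    assert(b->ib.pinned_phase != PHASE_NONE);
    b->db.pinned_phase = b->ib.pinned_phase;
    b->db.dirty = true;
}
```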

Writepage on an inodefile page should fail if any inodes have pinned
InoIdx. Otherwise it can succeed.