The LaFS directory structure

28 February 2009, 12:37 UTC

I've spent most of this week working on my filesystem, LaFS, having not touched it since December. As usual, working on it re-awakens my enthusiasm for it, and as I seem to be in a blogging mood, I'm going to write about it. Currently, my favourite part of my filesystem is the directory structure, so I'm going to write about that.

The difficulty with designing directories for a Posix filesystem is that Posix provides two ways to find entries in a directory.

The most obvious way to find an entry is to look it up by name. So to be able to implement large directories at all efficiently, you need an index that can lead you quite quickly from a name to an entry.

However Posix also requires you to be able to look up an entry given a small fixed-length identifier. This is needed to implement seekdir. The filesystem gets to choose the identifier, and it returns it via the readdir or telldir function. However, whatever identifier is returned must continue to work for that entry indefinitely, until that entry is removed. There is no mechanism for the identifier to time out or be refreshed. It must be really stable. Even if Posix didn't require this, NFS does. To be able to export a filesystem via NFS, you really need to be able to find entries given small fixed-length keys.

As names in directories are not fixed length, and the maximum length is not small, this seems to imply that you need two separate indexes for a directory, and then need to keep them in-sync with each other, and a number of filesystems do just this.

My clever idea, which I only realised after failing to make a couple of other approaches work, and after arguing with Ted Ts'o about the ext3 directory structure, is that you can get by with only one index. Here is how.

Other Directory Layouts

It is not at all a new idea to use a hash of the filename as the primary key to find the directory entry for that name. ext3 has done this for some time. The problem with this approach is that collisions are quite possible, even if they are fairly unlikely. The approach used by ext3 (and others?) is to group all the colliding names together and try to return them as a group, so that no code ever tries to seek into the middle of that group.

So ext3 uses the name hash as both the key for name lookup and the key for seekdir lookup. Though it is a fairly imperfect solution, it mostly works.

However it is worth noting that ext3 still has two indexes here. There is the index that maps from hash value to location in the directory-file. But there is also an index that maps location-in-file to location-on-storage. So there are still two indexes that need to be looked up.

It's been a while since I've looked at btrfs, but I think it also has two indexes. Both are B-Trees on the storage device (not in files which are themselves indexed in the device). One B-Tree uses simple lexical ordering of names for lookup. I'm not sure what the other uses for its "small fixed length key". It might be inode number or something like that. Anyway, there are two indexes.

For LaFS, I only have one index. This is how it works.

The clever idea for LaFS

Like ext3 I create a 32bit hash of the filename. If there is already an entry with that hash value, I use internal chaining: add 1 to the hash value and look for that slot instead. In most cases there is no collision. But in the rare case that there is, I have a well defined mechanism to handle it, and a unique short key for every entry.

This does mean that I need to store, in each directory entry, the amount by which the hash was increased to avoid a collision, and it also means that I cannot delete entries that have collided until all collisions that are later in the file have been removed. But this should be very rare unless the directory is truly enormous, or the hash function is very bad. In any case, it will continue to work correctly, though it could slow down in very bad circumstances.
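The chaining scheme above can be sketched in a few lines. This is a minimal Python model, not the actual LaFS code (which is C): a dict stands in for the directory's hash-to-entry mapping, and the hash function is a parameter so collisions can be demonstrated.

```python
def insert_entry(table, name, hash_fn):
    """Insert `name`, probing upward from its hash until a free slot is found.

    Returns (slot, offset): the unique 32-bit key for the entry, and the
    amount added to the raw hash (the value LaFS would store in the entry).
    """
    h = hash_fn(name) & 0xFFFFFFFF
    offset = 0
    # Internal chaining: if the slot is taken, add 1 and try again.
    while (h + offset) & 0xFFFFFFFF in table:
        offset += 1
    slot = (h + offset) & 0xFFFFFFFF
    table[slot] = (name, offset)   # store the offset alongside the entry
    return slot, offset
```

With a deliberately terrible hash function that maps every name to 42, three entries land in slots 42, 43 and 44, with stored offsets 0, 1 and 2; each still has a unique short key.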

The next clever idea is where to store directory entries in the file (I do use a well defined file which is like ext3 but different, I think, from btrfs).

The first approximation is that I store each directory entry in the block numbered by the (adjusted) hash value. As I support 32 bit block indexes, this is a perfect fit. I then only have one index to look up - the index from file address to device address. And this is an index that is used a lot for normal files, and so should be well tested.

Of course one block per entry would be a terrible waste. So we combine multiple entries into one block and use the following trick to be able to find them.

In Posix, files can have 'holes' in them, i.e. sections of the file which are not actually stored anywhere, have never been written to, and read as being full of zeros. Posix doesn't have any standard way of finding where the holes are in a file - you have to guess (though various OS implementations have various extensions to provide this information). But the filesystem itself has complete knowledge of which blocks are allocated and which aren't. So it can use this information to help find information in the file.

So with LaFS, we gather multiple directory entries into a single block, and store that in some block with an index larger than the hash of any entry in the block, but not larger than the hash of any entry that comes later in the directory. As a special case, the last block of directory entries is stored in block zero (zero is treated as being larger than any hash value).

This is probably a bit confusing, so I'll spell out how it gets used.

To find the entry for a given hash, find the first allocated block (not a hole) in the file with an index greater than that hash. If no such block exists, use block zero. Then look in that block. You will either find the entry, or be sure that the entry doesn't exist.
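That lookup rule is compact enough to express directly. Here is a small Python sketch of it (an illustration, not the LaFS implementation): a sorted list of allocated block numbers stands in for the file's block-mapping index, which in the real filesystem is the ordinary file address-to-device address index.

```python
import bisect

def find_block(allocated, h):
    """Return the block that must hold an entry with hash `h`.

    Rule: the first allocated block whose index is strictly greater
    than h, falling back to block zero (which is treated as being
    larger than any hash value) if no such block exists.
    """
    blocks = sorted(b for b in allocated if b != 0)
    i = bisect.bisect_right(blocks, h)   # first block index strictly > h
    return blocks[i] if i < len(blocks) else 0
```

For example, with blocks 0, 101 and 205 allocated, a hash of 50 leads to block 101, a hash of 150 to block 205, and a hash of 300 falls through to block zero.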

When the first entry is created in a directory, it is placed in block zero, no matter what its hash is. Subsequent entries are created in block zero until block zero becomes full. When we run out of room in block zero we need to split the block. We do this by creating two blocks, one with the entries with lower hash values, and one with the entries with higher hash values. The cut-off point is chosen to balance the usage across the two blocks.

The second of these blocks is stored in block zero of the file. The first is stored in block N+1, where N is the largest hash value in that block. Now the above rule will continue to find the correct block for any entry.

As blocks get full, they split. Each time the entries with higher hash values stay where they are, and the entries with lower hash values go into a new block indexed at N+1 for a different N.
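The split can be modelled in a few lines as well. Another hedged Python sketch, continuing the toy model above: `blocks` maps block index to a dict of hash-to-name entries, and a full block at `full_idx` is split so the lower half moves to block N+1 while the higher half stays put.

```python
def split_block(blocks, full_idx):
    """Split a full directory block; return the index of the new block.

    Lower-hash entries move to block N+1, where N is the largest hash
    among them; higher-hash entries stay where they are, so the
    'first allocated block with index > hash' rule keeps working.
    """
    entries = sorted(blocks[full_idx].items())   # (hash, name), ascending
    mid = len(entries) // 2                      # balance the two halves
    lower, higher = entries[:mid], entries[mid:]
    new_idx = lower[-1][0] + 1                   # N+1
    blocks[new_idx] = dict(lower)
    blocks[full_idx] = dict(higher)              # higher half stays in place
    return new_idx
```

Splitting block zero containing hashes 10, 20, 30, 40 creates block 21 holding hashes 10 and 20, while 30 and 40 remain in block zero; a lookup for hash 10 now finds block 21 (the first allocated block greater than 10), and a lookup for hash 30 still falls through to block zero.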

Clearly this process will not run out of block locations in the directory file before we hit 2^32 directory entries, which is the maximum allowed.

When directory blocks become under-full due to deletions they can be merged in a lazy fashion. There is no need to merge them for correctness, just to avoid space wastage.

This mechanism provides perfectly correct semantics with a directory size limit of 2^32, and good performance as long as we remain an order of magnitude away from that limit. And it only uses a single index, the same index that is used for all file block lookup.

Directories created using this scheme will be very sparse, very rarely having two blocks that are adjacent. This means that we need to handle sparse files quite efficiently, but this is relatively straightforward.

Block Organisation

The layout of entries inside a single block is much less interesting. There are probably lots of good ways to do this, and I'm not aware of any particularly clever or efficient approaches (else I would have copied them).

I pack entries at the front of the block, but starting on N-byte boundaries where 256N is the block size. This means I can use an 8bit value to index entries in the block. I then use an AVL tree to allow fast lookup in the block. Each entry contains 2 one-byte tree pointers. The entry also contains the file name, inode number, file type, AVL balance flag and the hash offset that was added by internal chaining. 2 bits are allocated for this offset. If the value is more than 1 but less than 256, an extra byte is needed. For larger values, 4 bytes are added to the entry.
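The alignment arithmetic is worth spelling out, since it is what makes the one-byte intra-block index work. A small Python sketch (illustrative only; the byte counts in the example are assumptions, not measurements of real LaFS entries):

```python
def entry_slots(block_size, entry_bytes):
    """Return (alignment, slots) for an entry in a directory block.

    Entries start on N-byte boundaries where 256*N is the block size,
    so a block always has exactly 256 slot positions and an 8-bit
    value suffices to index any entry within it.
    """
    align = block_size // 256            # N: 256 slots per block
    slots = -(-entry_bytes // align)     # ceil division: slots consumed
    return align, slots
```

So in a 4096-byte block entries align to 16-byte boundaries, and a hypothetical 40-byte entry occupies three slots; in a 1024-byte block the alignment drops to 4 bytes and the same entry spans ten slots.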

When entries are deleted, we just set the inode number to be zero. When we run out of space at the end, we compact the block to see if that makes room before we consider splitting it. Deleting the last entry from a block could be optimised, and maybe it is - I'd have to check the code.