08 March 2011, 07:03 UTC
"LaFS" is a filesystem that I have been working on since 1999. Yes, that is a long time. It has never been my first priority so I have had to fit it around my work on NFSd and md/RAID and regular bugfixing and such. I have maybe used about 10%-15% of my time on it, so I guess that it getting up to 12-18 months "fill time".
LaFS is a log-structured filesystem. The original goal was to work well with RAID and NFS. That is still much of the goal but I see other value in it as well -- such as working well with Flash storage.
How it compares with NILFS or BTRFS I cannot really say, I haven't looked very closely at either. Maybe I will one day and report here what I find.
Current source code can be found at git://neil.brown.name/LaFS and git://neil.brown.name/lafs-utils, or browse on the web at http://neil.brown.name/git/
08 March 2011, 07:47 UTClog segments and RAID6 reshaping
Part of the design approach of LaFS - and any other log structured filesystem - is to divide the device space into relatively large segments. Each segment is many megabytes in size so the time to write a whole segment is much more than the time to seek to a new segment. Writes happen sequentially through a segment, so write throughput should be as high as the device can manage.
(obviously there needs to be a way to find or create segments with no live data so they can be written to. This is called cleaning and will not be discussed further here).
One of the innovations of LaFS is to allow segments to be aligned with the stripes in a RAID5 or RAID6 array so that each segment is a whole number of stripes and so that LaFS knows the details of the layout including chunk size and width (number of data devices).
This allows LaFS to always write in whole 'strips' - where a 'strip' is one block from each device chosen such that they all contribute to the one parity block. Blocks in a strip may not be contiguous (they only are if the chunksize matches the block size), so one would not normally write a single strip. However doing so is the most efficient way to write to RAID6 as no pre-reading is needed. So as LaFS knows the precise geometry and is free with how it chooses where to write, it can easily write just a strip if needed. It can also pad out the write with blocks of NULs to make sure a whole strip is written each time.
Normally one would hope that several strip would be written at once, hopefully a whole stripe or more, but it is very valuable to be able to write whole strips at a time.
This is lovely in theory but in practice there is a problem. People like to make their RAID6 arrays bigger, often by adding one or two devices to the array and "restriping" or "reshaping" the array. When you do this the geometry changes significantly and the alignment of strips and stripes and segments will be quite different. Suddenly the efficient IO practice of LaFS becomes very inefficient.
There are two ways to address this, one which I have had in mind since the beginning, one which only occurred to me recently.
28 February 2009, 12:37 UTCThe LaFS directory structure
I've spent most of this week working on my filesystem - LaFS, having not touched it since December. As usual, working on it re-awakens my enthusiasm for it and as I seem to be in a blogging mood, I'm going to write about it. Currently. my favourite part of my filesystem is the directory structure. So I'm going to write about that.
The difficulty with designing directories for a Posix filesystem is that Posix provides two ways to find entries in a directory.
The most obvious way to find an entry is to look it up by name. So to be able to implement large directories at all efficiently, you need an index that can lead you quite quickly from a name to an entry.
However Posix also requires you to be able to look up an entry given a small fixed-length identified. This is needed to implement seekdir. The filesystem gets to choose the identifier and it returns it via the readdir or telldir function. However whatever identifier is returned must continue to work for that entry indefinitely until that entry is removed. There is no mechanism for the identifer to time out or be refreshed. It must be really stable. Even if Posix didn't require this, NFS does. To be able to export a filesystem via NFS, you really need to be able to find entries given small fixed-length keys.
As names in directories are not fixed length, and the maximum length is not small, this seems to imply that you need two separate indexes for a directory, and then need to keep them in-sync with each other, and a number of filesystems do just this.
My clever idea, which I only realised after failing to make a couple of other approaches work, and after arguing with Ted T'so about the ext3 directory structure, is that you can get by with only one index. Here is how.