<?xml version="1.0" encoding="utf-8"?>
<feed version="0.3" xmlns="http://purl.org/atom/ns#">
<link rel="alternate" type="text/html" href="http://neil.brown.name/blog/LaFS"/>

<title>The LaFS filesystem</title>
<modified>2011-03-08T07:47:33Z</modified>
<author></author>
<entry>
<title>log segments and RAID6 reshaping</title>
<issued>2011-03-08T07:47:33Z</issued>
<modified>2011-03-08T07:47:33Z</modified>
<id>http://neil.brown.name/blog/20110308074733</id>
<link rel="alternate" type="text/html" href="http://neil.brown.name/blog/20110308074733"/>
<content type="text/html" mode="escaped">

&lt;p&gt;Part of the design approach of LaFS - and any other log structured filesystem - is to divide the device space into relatively large segments.  Each segment is many megabytes in size so the time to write a whole segment is much more than the time to seek to a new segment.   Writes happen sequentially through a segment, so write throughput should be as high as the device can manage.

&lt;p&gt;(Obviously there needs to be a way to find or create segments with no live data so they can be written to.  This is called cleaning and will not be discussed further here.)

&lt;p&gt;One of the innovations of LaFS is to allow segments to be aligned with the stripes in a RAID5 or RAID6 array so that each segment is a whole number of stripes and so that LaFS knows the details of the layout including chunk size and width (number of data devices).

&lt;p&gt;This allows LaFS to always write in whole 'strips' - where a 'strip' is one block from each device chosen such that they all contribute to the one parity block.  Blocks in a strip may not be contiguous (they only are if the chunksize matches the block size), so one would not normally write a single strip.  However doing so is the most efficient way to write to RAID6 as no pre-reading is needed.  So as LaFS knows the precise geometry and is free with how it chooses where to write, it can easily write just a strip if needed.  It can also pad out the write with blocks of NULs to make sure a whole strip is written each time.
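
&lt;p&gt;As a rough illustration of the padding arithmetic, here is a minimal sketch.  It is not LaFS code: the function names are hypothetical, and it assumes a strip holds exactly one block per data device, with the parity blocks computed by the array and not counted here.

```python
def nul_padding(pending_blocks, data_devices):
    """Number of NUL blocks to append so the write ends on a strip boundary.

    Assumption: one strip is one block from each data device.
    """
    # Python's modulo of a negative number is non-negative, so this is
    # the distance up to the next multiple of data_devices (0 if aligned).
    return -pending_blocks % data_devices


def strips_written(pending_blocks, data_devices):
    """Whole strips written once the pending blocks have been padded."""
    padded = pending_blocks + nul_padding(pending_blocks, data_devices)
    return padded // data_devices
```

&lt;p&gt;With 4 data devices and 5 blocks waiting to be written, this gives 3 NUL blocks of padding and 2 whole strips.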

&lt;p&gt;Normally one would hope that several strips would be written at once, ideally a whole stripe or more, but it is very valuable to be able to write whole strips at a time.

&lt;p&gt;This is lovely in theory but in practice there is a problem.  People like to make their RAID6 arrays bigger, often by adding one or two devices to the array and &amp;quot;restriping&amp;quot; or &amp;quot;reshaping&amp;quot; the array.
When you do this the geometry changes significantly and the alignment of strips and stripes and segments will be quite different.  Suddenly the efficient IO practice of LaFS becomes very inefficient.

&lt;p&gt;There are two ways to address this: one which I have had in mind since the beginning, and one which only occurred to me recently.
&lt;p&gt;&lt;a href=http://neil.brown.name/blog/20110308074733&gt;read more...(No comments)&lt;/a&gt;</content>
</entry>
<entry>
<title>The LaFS directory structure</title>
<issued>2009-02-28T12:37:09Z</issued>
<modified>2009-02-28T12:37:09Z</modified>
<id>http://neil.brown.name/blog/20090228123709</id>
<link rel="alternate" type="text/html" href="http://neil.brown.name/blog/20090228123709"/>
<content type="text/html" mode="escaped">

&lt;p&gt;I've spent most of this week working on my filesystem - LaFS, having not touched it since December.
As usual, working on it re-awakens my enthusiasm for it and as I seem to be in a blogging mood, I'm going to write
about it.  Currently, my favourite part of my filesystem is the directory structure.  So I'm going to write about that.

&lt;p&gt;The difficulty with designing directories for a Posix filesystem is that Posix provides two ways to find entries
in a directory.

&lt;p&gt;The most obvious way to find an entry is to look it up by name.  So to be able to implement large directories at all efficiently, you need an index that can lead you quite quickly from a name to an entry.

&lt;p&gt;However Posix also requires you to be able to look up an entry given a small fixed-length identifier.  This is needed
to implement seekdir.  The filesystem gets to choose the identifier and it returns it via the readdir or telldir 
function.  However whatever identifier is returned must continue to work for that entry indefinitely until
that entry is removed.  There is no mechanism for the identifier to time out or be refreshed.  It must be really
stable.  Even if Posix didn't require this, NFS does.  To be able to export a filesystem via NFS, you 
really need to be able to find entries given small fixed-length keys.

&lt;p&gt;As names in directories are not fixed length, and the maximum length is not small, this seems to imply that
you need two separate indexes for a directory, and then need to keep them in sync with each other, and
a number of filesystems do just this.

&lt;p&gt;My clever idea, which I only realised after failing to make a couple of other approaches work, and after 
arguing with Ted Ts'o about the ext3 directory structure, is that you can get by with only one index.
Here is how.
&lt;p&gt;&lt;a href=http://neil.brown.name/blog/20090228123709&gt;read more...(No comments)&lt;/a&gt;</content>
</entry>

</feed>

