Document Storage

11 July 2004, 08:29 UTC

The next question to ask and answer is: how do we internal store a document.

The key issues are:

Layout

Possibly the simplest storage for format for editting text is to have a largish buffer, with the text before the cursor at the beginning and the text after the cursor at the end. With this layout, additions and deletions near the cursor are trivial, and local cursor movement is quite efficient.

Two significant problems with this approach as it stands are:

Both of these can be eased by having a list of moderately sized buffers each with a hole in the middle where edits can take place.

When a section of the document grows larger than it's buffer, the buffer is split in two. When a cursor movement takes the cursor out of one of these buffers to another, only the destination buffer needs contents to be shuffled around.

Space efficiency

Providing some effort is made to coalesce adjacent buffers that together will fit into one buffer, space usage should never be more than twice the size of the document, normally less. If we also keep track of space utilisation and make an effort to move large holes closer together, then we should be able to keep utilisation substantially higher than that.

This assumes that there is not a significant amount of extra meta-data per buffer. We can keep the meta-data to data ratio low simply by making the main buffers suitably large.

Stable Pointers

Stable pointers are fairly easy to implement. Each buffer will have a doubly linked list of pointers that point to content in that buffer. This list will be sorted and will probably have a handle at the cursor and a handle at the start/end. Each pointer will store a reference to the buffer and an offset, positive from the beginning or negative from the end. Cursor movements that cross a pointer will simply update the offset of each pointer they cross. When a buffer is split, or when two buffers are joined, more work is required, but it is in proportion to the other work of splitting or joining, and in general, splitting and joining will not be frequent operations.

Attributes

To discuss attribute storage, we need a clear understanding on the purpose and use of attributes.

Attribute are summary information produced by parsing the text of a document with respect to some grammar. The attributes are not part of the document. They are not saved with the document or copied when part of the document is copied to a cut-buffer. They are simply summary information and can always be regenerated if there is a need, though doing so too often might be expensive.

Attributes are used to help the display mechanism and the editting mechanism understand the context of a particular point in the document.

An attribute associates a range of characters with a node in the abstract syntax of the document and a number of name=value pairs, such as "family=sansserif" or "language=french" or "spelling=doubtful".

It would seem that the "obvious" implementation for attributes is as separate objects each with a start and end pointer into the document.

The operation on attributes that needs to be particularly efficient is to find all the attributes for a particular point and to maintain that list while stepping a point through the document.

This would require the start/end pointers of an attribute should contain links back to the attribute, and the from each buffer, a list of all attributes that span the buffer can be easily found.

One approach would be to store all attributes that span one or more buffers in a list that must be searched. Another approach would be to have a separate list of spanning attributes for each buffer.

Possibly it would be best to find a balance between these. i.e. to have a number of lists of spanning attributes, each listing all attributes that span any buffer in some set. That way the lists needn't get too long, but there needn't be too many.

Either approach would require it to be possible to compare two buffers for their order within the document. This would best be done using sparse numbering of the buffers, with a renumbering process that happened whenever the numbers become too dense.

An interesting question is whether attributes should be mutable. i.e., should it be possible to change the name=value pairs associated with an attributes, and should it be possible to move the endpoints. The alternative is to require that new attributes are created as needed and old ones destroyed. As cloning an attribute but changing bits should be fairly cheap, immutable should be OK. The only show-stopper would be if an attribute we referenced, but a program variable for instance, and we wanted to make a change and leave the reference intact. I think long-term storage of attributes in variables is probably a bad idea, so lets just make them immutable.

Undo

Finally we need to consider how undo information is stored.

As attributes and pointers do not need to saved with undo information, we only need to store the text. The document will be re-parsed after an undo operation to make sure the attributes are correct.

The operations that need to be reversable are simply inserts and deletes, and possibly combinations of these. Saving the text of a deletion is relatively straight forward - it can go into one or more buffers.

Saving the location of a deletion, or the endpoints of an insertion is not so trivial. These could be stored as normal pointers, however this would cause a quick proliferation of pointers that could impact performance.

However fully stable pointers are not needed. The pointer only needs to be accurate at the time of the modification. For this reason we can simply use an index from the start of the document.

For this to be useful, we need to maintain in each buffer some idea of where in the document it is (in terms of character count). We cannot keep this always correct efficiently, but we can keep track of the last buffer in which it is known to be correct, and can then regenerate it at need.

So each undo record is simply an index into the document, a length that is deleted, a text that is inserted, and a flag to say whether this is the last of a group or not. These can simply be stored sequentially in a buffer.

How to manage redo and undo of redo etc in a way that is really useful is a topic for another day, but it shouldn't require changes to the data structure.

Summary

There are a number of buffers containing parts of the text. Each buffer contains:

Each pointer is a structure containing:

An attribute contains:




[æ]