The next question to ask and answer is: how do we internal store a document.
The key issues are:
- space-efficient storage of large texts
- time-efficient small manipulation of text (add/delete in middle)
- easy storage of stable pointers into the text
- easy storage of assorted attributes with the text
- easy undo-list maintenance
Layout
Possibly the simplest storage for format for editting text is to have a largish buffer, with the text before the cursor at the beginning and the text after the cursor at the end. With this layout, additions and deletions near the cursor are trivial, and local cursor movement is quite efficient.Two significant problems with this approach as it stands are:
- The document cannot grow larger than the buffer
- Large cursor movements can be quite expensive.
Both of these can be eased by having a list of moderately sized buffers each with a hole in the middle where edits can take place.
When a section of the document grows larger than it's buffer, the buffer is split in two. When a cursor movement takes the cursor out of one of these buffers to another, only the destination buffer needs contents to be shuffled around.
Space efficiency
Providing some effort is made to coalesce adjacent buffers that together will fit into one buffer, space usage should never be more than twice the size of the document, normally less. If we also keep track of space utilisation and make an effort to move large holes closer together, then we should be able to keep utilisation substantially higher than that.This assumes that there is not a significant amount of extra meta-data per buffer. We can keep the meta-data to data ratio low simply by making the main buffers suitably large.
Stable Pointers
Stable pointers are fairly easy to implement. Each buffer will have a doubly linked list of pointers that point to content in that buffer. This list will be sorted and will probably have a handle at the cursor and a handle at the start/end. Each pointer will store a reference to the buffer and an offset, positive from the beginning or negative from the end. Cursor movements that cross a pointer will simply update the offset of each pointer they cross. When a buffer is split, or when two buffers are joined, more work is required, but it is in proportion to the other work of splitting or joining, and in general, splitting and joining will not be frequent operations.Attributes
To discuss attribute storage, we need a clear understanding on the purpose and use of attributes.Attribute are summary information produced by parsing the text of a document with respect to some grammar. The attributes are not part of the document. They are not saved with the document or copied when part of the document is copied to a cut-buffer. They are simply summary information and can always be regenerated if there is a need, though doing so too often might be expensive.
Attributes are used to help the display mechanism and the editting mechanism understand the context of a particular point in the document.
An attribute associates a range of characters with a node in the abstract syntax of the document and a number of name=value pairs, such as "family=sansserif" or "language=french" or "spelling=doubtful".
It would seem that the "obvious" implementation for attributes is as separate objects each with a start and end pointer into the document.
The operation on attributes that needs to be particularly efficient is to find all the attributes for a particular point and to maintain that list while stepping a point through the document.
This would require the start/end pointers of an attribute should contain links back to the attribute, and the from each buffer, a list of all attributes that span the buffer can be easily found.
One approach would be to store all attributes that span one or more buffers in a list that must be searched. Another approach would be to have a separate list of spanning attributes for each buffer.
Possibly it would be best to find a balance between these. i.e. to have a number of lists of spanning attributes, each listing all attributes that span any buffer in some set. That way the lists needn't get too long, but there needn't be too many.
Either approach would require it to be possible to compare two buffers for their order within the document. This would best be done using sparse numbering of the buffers, with a renumbering process that happened whenever the numbers become too dense.
An interesting question is whether attributes should be mutable. i.e., should it be possible to change the name=value pairs associated with an attributes, and should it be possible to move the endpoints. The alternative is to require that new attributes are created as needed and old ones destroyed. As cloning an attribute but changing bits should be fairly cheap, immutable should be OK. The only show-stopper would be if an attribute we referenced, but a program variable for instance, and we wanted to make a change and leave the reference intact. I think long-term storage of attributes in variables is probably a bad idea, so lets just make them immutable.
Undo
Finally we need to consider how undo information is stored.As attributes and pointers do not need to saved with undo information, we only need to store the text. The document will be re-parsed after an undo operation to make sure the attributes are correct.
The operations that need to be reversable are simply inserts and deletes, and possibly combinations of these. Saving the text of a deletion is relatively straight forward - it can go into one or more buffers.
Saving the location of a deletion, or the endpoints of an insertion is not so trivial. These could be stored as normal pointers, however this would cause a quick proliferation of pointers that could impact performance.
However fully stable pointers are not needed. The pointer only needs to be accurate at the time of the modification. For this reason we can simply use an index from the start of the document.
For this to be useful, we need to maintain in each buffer some idea of where in the document it is (in terms of character count). We cannot keep this always correct efficiently, but we can keep track of the last buffer in which it is known to be correct, and can then regenerate it at need.
So each undo record is simply an index into the document, a length that is deleted, a text that is inserted, and a flag to say whether this is the last of a group or not. These can simply be stored sequentially in a buffer.
How to manage redo and undo of redo etc in a way that is really useful is a topic for another day, but it shouldn't require changes to the data structure.
Summary
There are a number of buffers containing parts of the text. Each buffer contains:
- an array of characters containg text before and after the buffer cursor (say, 8 kilobytes)
- a sparse sequence number - 32 bits
- a index of the first char of this buffer in the file. This is only known to be accurate if the spare sequence number is less than a global sparse sequence number of the last known accurate index
- a doubly linked list of pointers
- a pointer which is the cursor and is in the above list. This pointer stores the size of the hole in the buffer in it's private field
- a pointer to a list of attributes which might span this buffer
- forward and backward pointers to surrounding buffers in the buffer list
Each pointer is a structure containing:
- A pointer to the buffer holding the point
- An index into that buffer
- A private field which can hold either a pointer or an integer
- Forward and backward pointers to surrounding points in the same buffer
- A flag to indicate if the point should be placed before or after a character inserted at the point. This may well be the least significant bit of the index field, which itself has been shifted one place
An attribute contains:
- A pointer to the start and end of the range
- A packed list of words which identify properties. The first is the type of node in the abstract syntax tree. Subsequent words might have an embeded equals sign to show name=value. Each word is preceeded by its length and is not nul terminated.
