Parsing Procedure

03 May 2004, 20:24 UTC

Parsing of documents needs to happen incrementally, both a high-level parse that just finds major sections, and low-level parses that only do a small local part of the document.

UNFINISHED

For all but the most trivial documents, the text of the document needs to be parsed before it can usefully be displayed. However parsing the whole document before displaying the first screenful may take longer than is acceptable.

Also, it is probably asking too much of the editing procedures to require that they always keep the parse completely up-to-date. Therefore, the document will (potentially) need to be re-parse whenever a change is made. Reparse the whole document would again be unacceptable.

To allow for these potential problems we need an infrastructure that allows and encourages partial parsing.

top-level, low granularity, parsing

To address the first issue, it would be good to support a very course level of parsing that just found the high level structures and didn't bother too much about the fine detail.

Depending on the actual language (grammar) that was being parsed, this may or may not be able to skim lightly over much of the text just looking for particular high-level markers. For other languages, a closer inspection of all the text will be needed to, for example, count matching brackets to find the outer-most sets. Even this is case the parsing can be fairly efficient, partly because a lot of detail can still be ignored, and partly because the result of the parse (attributes) do not need to be generated and stored.

It should be sufficient for the top-level parse to just record the top one or two levels of the syntax tree and to leave the remaining detail to on-demand parsing.

full detail, on demand parsing

To address the second issue we need to support the parsing of small sections of text - arbitrarily small. For this reason the stored attributes always return the syntax node the corresponds to each range of text. Parsing a section of text should just entail finding a suitable larger section of text that encloses the given range, and parsing that starting from the node mentioned in the attribute.

This approach will fall apart if there are any syntax errors in the parsed text, or if the section of text doesn't match the required syntax node. To cope with this, we need to have an effective mechanism for dealing with apparent syntax errors.

This approach also depends on keeping track of which parts of the text have been changed (or have not yet been parse) and so need to be re-parsed. This could be achieved by setting an "unparsed" attribute on ranges of text. Then if a parse of a piece of text is needed (for display or editing) and any of the text is "unparsed", then a suitable syntax attribute that includes all the unparsed text is found and the text spanned by that is parsed.


Parsing in the face of "errors"

It is (hopefully) reasonable to assume that most of the time, most of a document is syntactically correct. MoreHere




[æ]