Formatting language for document editor

11 April 2004, 22:19 UTC

If we want to be able to behave like a formatted document editor, we need a language for defining formatting. We could use something like HTML or TeX, but that would be fairly boring and limiting. It is much more fun and educational to define a new one.

Of course, supporting HTML and TeX as well would be useful, but if I wanted to start from scratch, what would it look like...

The key syntax aspects of a formatting language are representing structure and attributes. Everything else is "just text" which may have some complications for non-native character sets but is fairly simple.

Structure can be represented in two ways: bracketing and operators.

Bracketing involves having a distinguished character or string mark the start of a structural subelement, and a matching character or string mark the end. HTML uses <tag> and </tag>. TeX uses { and }. LaTeX also uses \begin{tag} and \end{tag}. Python uses matched indenting, which doesn't quite fit the description but works the same way.

Operators involve having distinguished symbols or words that work as prefix, infix, or postfix operators and that have precedence.

A good example is a space which is an infix operator between a word and a list of words. Also blank-line is an infix operator between a paragraph and a list of paragraphs. When describing mathematical text, more operators become useful.

Attributes give detail to a structural unit. These might include labels, font/face changes, spacing information etc.

HTML uses "name=value" inside <tag ...> units. TeX uses various things, mostly \name related.

Attributes are usefully seen as just another structural element so the same syntax can be used to introduce them.

While it is generally useful to use both operators and bracketing, it is good if bracketing can be used for all structuring, with operators as a useful and common alternate.

I like [ ] for bracketing. They are less common than ( ) and similar to { }, but as the usage will be quite different from TeX, a different symbol is probably a good thing.

Immediately after the opening bracket is a tag followed by either white-space or a closing bracket. This tag defines the structural element that is being created. Thus [word someletters] makes a word.

If there is white space, everything after it is the content which is interpreted differently depending on the tag and the context. In all cases [ will introduce a new structural element, and ] will close the current element. Also, in all cases \[ will be a literal '[' and \] will be a literal ']'.
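A minimal sketch of how this bracketing could be parsed, in Python. The function name parse and the (tag, children) tuple representation are my own choices for illustration, not part of the design. It also assumes that a tag may end at an opening bracket as well as at white-space, so that compact forms like [table[row ...]] work.

```python
def parse(text):
    """Parse bracketed markup into a list of strings and (tag, children) tuples."""
    pos = 0

    def flush(buf, children):
        # Move any pending literal text into the list of children.
        if buf:
            children.append("".join(buf))
            buf.clear()

    def parse_body(closing):
        # Read content up to the matching ']' (or to end of input at top level).
        nonlocal pos
        children, buf = [], []
        while pos < len(text):
            ch = text[pos]
            if ch == "\\" and text[pos + 1:pos + 2] in ("[", "]"):
                buf.append(text[pos + 1])      # \[ and \] are literal brackets
                pos += 2
            elif ch == "[":
                flush(buf, children)
                pos += 1
                children.append(parse_element())
            elif ch == "]":
                if not closing:
                    raise ValueError("unmatched ']' at offset %d" % pos)
                pos += 1
                flush(buf, children)
                return children
            else:
                buf.append(ch)
                pos += 1
        if closing:
            raise ValueError("missing ']' at end of input")
        flush(buf, children)
        return children

    def parse_element():
        # The tag runs up to white space or a bracket (the '[' case is an assumption).
        nonlocal pos
        start = pos
        while pos < len(text) and not text[pos].isspace() and text[pos] not in "[]":
            pos += 1
        tag = text[start:pos]
        if pos < len(text) and text[pos].isspace():
            pos += 1                           # skip the separator after the tag
        return (tag, parse_body(closing=True))

    return parse_body(closing=False)
```

With this, parse("[word someletters]") gives a single ("word", ["someletters"]) element. How the content is then interpreted is left to each tag.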

Each tag can introduce some operators which will be sequences of characters, or of some higher level token. So <space> is an infix operator which causes the preceding to be a word and the following to be a wordlist. <space> is also a prefix operator which is an identity operator (thus multiple spaces mean the same as one). So "hello there" is the same as "[word hello][word there]".

<newline><newline> is an infix operator that encloses the preceding wordlist as a paragraph and starts a new wordlist.

Under a different tag, spaces and newlines might be treated quite differently. A "preformatted" tag might leave them all as literal characters.
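The space and blank-line operators can be read as shorthand for explicit brackets. The sketch below rewrites plain text into the bracketed form. The tag name "para" and both function names are hypothetical, a single newline is assumed to act like a space, and the input is assumed to contain no brackets of its own.

```python
import re


def words_to_brackets(wordlist):
    """Rewrite the <space> operator: each run of white space separates two words."""
    # str.split() with no argument collapses runs of white space, which matches
    # the prefix <space> being an identity operator.
    return "".join("[word %s]" % word for word in wordlist.split())


def text_to_brackets(text):
    """Rewrite the blank-line operator: each blank line closes a paragraph."""
    paragraphs = re.split(r"\n\s*\n", text.strip())
    return "".join(
        "[para %s]" % words_to_brackets(para) for para in paragraphs if para.strip()
    )
```

So words_to_brackets("hello   there") yields the same result as words_to_brackets("hello there").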

There should be a level of indirection maintained between tags used in the document, and tags used to directly affect formatting. So "quote" might be used in the document, and it sets formatting to temporarily reduce left and right margins, and might allow an extra tag "ref" which contains a reference that is right-justified at the end, in a different font.
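One way to picture this indirection is a small style table that maps document tags to formatting settings and to the extra tags they allow. Everything named here (STYLE, BASE, TOP_LEVEL, resolve, the attribute names and their values) is invented for illustration only.

```python
# Hypothetical style table: document tags on the left, formatting on the right.
STYLE = {
    "quote": {"format": {"indent_left": 2, "indent_right": 2}, "allows": {"ref"}},
    "ref": {"format": {"align": "right", "font": "italic"}, "allows": set()},
}
# Assumed defaults for text outside any tag.
BASE = {"indent_left": 0, "indent_right": 0, "align": "left", "font": "roman"}
# Tags that may appear at the top level of the document (an assumption).
TOP_LEVEL = {"quote"}


def resolve(path):
    """Return the formatting in effect for a list of nested tags, outermost first."""
    fmt = dict(BASE)
    allowed = TOP_LEVEL
    for tag in path:
        if tag not in allowed:
            raise ValueError("tag %r is not allowed here" % tag)
        fmt.update(STYLE[tag]["format"])   # inner tags override outer settings
        allowed = STYLE[tag]["allows"]     # each tag decides what may nest in it
    return fmt
```

Here resolve(["quote", "ref"]) is right-aligned and indented, while resolve(["ref"]) is rejected because "ref" only exists inside "quote".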

Aside from tables, the most interesting part of describing the layout of text is the measurement of paragraphs. This includes positioning, indents on both sides which may vary from line to line, interline space, and interword space (whether it should expand to justify the line). The formatting should be able to do at least simple maths on the sizes of components when giving measurements.

The indents/width of lines in the paragraph should be able to be specified per-line or per-measure. e.g. the first 2 lines are indented just so, or the first 1.5 centimeters are indented thusly.
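As a sketch of the difference between per-line and per-measure indents, the function below works out the indent of each line in a paragraph. The names and the rule that a line is indented when its top edge lies within the measure are my assumptions. Lengths are in any one consistent unit (millimetres in the example).

```python
import math


def indented_line_count(kind, extent, line_height):
    """Number of leading lines affected by an indent rule."""
    if kind == "lines":
        return int(extent)                      # "the first 2 lines"
    if kind == "measure":
        # Assumption: a line is affected if its top edge is inside the measure.
        return math.ceil(extent / line_height)  # "the first 15 mm"
    raise ValueError("unknown rule kind %r" % kind)


def paragraph_indents(kind, extent, indent, line_height, n_lines):
    """Return the indent of each of the n_lines lines of a paragraph."""
    count = indented_line_count(kind, extent, line_height)
    return [indent if i < count else 0 for i in range(n_lines)]
```

With 5 mm lines, paragraph_indents("measure", 15, 10, 5, 5) indents the first three lines. The same measure with 4 mm lines would indent four, which is the point of specifying by measure.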

Tables are, I think, generally poorly done. I know that I often fight with tables in TeX trying to get them to do what I want. HTML is better in some ways in that the defaults are better. But it does not seem possible to specify a style to apply to all rows of a table. Maybe style sheets make this easier.

A table-within-a-table is a conceptually simple idea that does not seem to be handled well. A row in a table can often be thought of as a label in the first column, and some data in subsequent columns. If a particular label really wants multiple rows of data, then you have a table in a table. A trivial example might have days in the first column, hours in the next, and events in the third.

Attempting to deal with this tends to take a symptomatic approach of asking to combine two cells, one above the other, into one large cell. Hence the "rowspan" attribute to <td> in HTML.

Doing this hides a significant facet of the table - the table-in-a-table aspect.
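To illustrate, here is a sketch in which the table-in-a-table is the primary description and "rowspan" is merely derived from it. The schedule data is made up, and the function name and the (text, rowspan) cell representation are my own.

```python
# Illustrative data: each row is a label plus a sub-table of (hour, event) rows.
schedule = [
    ("Monday", [("09:00", "Standup"), ("14:00", "Review")]),
    ("Tuesday", [("10:00", "Planning")]),
]


def flatten(table):
    """Flatten label + sub-table rows into flat rows of (text, rowspan) cells."""
    rows = []
    for label, subrows in table:
        for i, subrow in enumerate(subrows):
            cells = [(text, 1) for text in subrow]
            if i == 0:
                # The label cell spans every row of its sub-table.
                cells.insert(0, (label, len(subrows)))
            rows.append(cells)
    return rows
```

Going in this direction is mechanical. Recovering the nesting from a flat table with rowspans takes more guesswork, which is why I would rather store the nested form.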

It is tempting also to allow tables to be specified in column-order instead of the normal row-order (did I get those names the right way around?). It is not immediately clear how this would be an advantage, but a desire for generality advocates it.

Describing the table contents would be simple enough:

[table[row[cell xx][cell yy][cell [table ...]]]]

where "row" here means a linear group of cells that could be horizontal or vertical. However, describing the required structure and intended formatting is maybe not so trivial. It is best to keep this quite separate from the data (unlike how TeX does headings for tables; something more like tbl is preferred), but it has to be part of the table, not part of the document style.

Maybe the best is a "template" where one or more rows are given as just formatting commands with a place-holder for the actual content. The template would have to specify row-order or column-order.

Question: Suppose we wanted to present the data as a list of column headings, then a list of row labels, then the actual data in row-order. How would that be done?

Answer: The table would be two rows, the second of which is a table. The first is a cell followed by a table of column headings. The second row is a table with a row of row labels followed by a table.... I think this can work.

Once the table has been arranged with suitable subtables, and attributes set for each cell, how will it be formatted? The important question is one of spacing.

If multiple cells allow the content to wrap, how is space distributed over them? One could imagine a column with one very long line and lots of short lines, and another column with uniformly moderate length lines.

We would want the column with moderate length lines to get most of the available width as it would benefit the most.

Some sort of "pressure" semantic seems to be needed. A first guess is made by distributing space evenly. Then we evaluate the pressure in each column as the sum of the number of line breaks. We then redistribute the space based on the pressure. The redistribution would need to be damped so that there is no risk of oscillation. There would probably need to be extra pressure from "bad" line breaks so that columns containing single words aren't made smaller than those words unless absolutely necessary.
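A rough sketch of the pressure idea follows. It measures width in characters and wraps greedily. The penalty value, the damping factor, the fixed number of rounds, and the "+ 1" that keeps every column's pressure above zero are all arbitrary choices of mine, not something worked out.

```python
BAD_BREAK_PENALTY = 10  # assumed extra pressure for a word wider than its column


def line_breaks(text, width):
    """Count greedy line breaks in text, plus a penalty for words that do not fit."""
    breaks, used, penalty = 0, 0, 0
    for word in text.split():
        if len(word) > width:
            penalty += BAD_BREAK_PENALTY
        need = len(word) if used == 0 else used + 1 + len(word)
        if used and need > width:
            breaks += 1              # start a new line with this word
            used = len(word)
        else:
            used = need
    return breaks + penalty


def distribute(columns, total_width, rounds=20, damping=0.5):
    """Share total_width among columns (lists of cell texts) by pressure."""
    n = len(columns)
    widths = [total_width / n] * n   # first guess: distribute space evenly
    for _ in range(rounds):
        pressure = [
            1 + sum(line_breaks(cell, w) for cell in col)
            for col, w in zip(columns, widths)
        ]
        total = sum(pressure)
        targets = [total_width * p / total for p in pressure]
        # Damped step: move only part of the way towards the target widths.
        widths = [w + damping * (t - w) for w, t in zip(widths, targets)]
    return widths
```

The damped step keeps the total width constant. Because the number of line breaks changes in jumps, a constant damping factor can still leave the widths wobbling slightly around a boundary, so a real version would probably want the damping to increase over time.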

The last interesting aspect of document formatting (yes, I'm sure there are many more, but it's getting late) is cross-references.

This includes tables of contents and indexes. It includes footnotes and endnotes. And it includes running heads that follow the current section name.

In each case, text (whether visible (contents) or invisible (index/footnote)) in one part of the document causes text to appear in another part of the document.

In the case of a running head, several texts may try to affect the running head text, but only one will over-rule on each page. In the other cases, all texts are collected and appear somewhere.

It is simplest to decide that all cross-references add text to some other document, and that document might choose only to display one of them.

For the running head, there might be two virtual documents for each page, one with section headings from the previous page, one with section headings from this page. The running-head display choice might be the first from the "this-page" list if such exists, else the last from the "previous-page" list. This running head should be inserted on the list of the next page so that very long sections are dealt with properly.
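The selection rule can be sketched as follows, with each virtual document reduced to a plain list of heading strings. The function names are hypothetical.

```python
def running_head(this_page, previous_page):
    """First heading on this page if any, else the last one carried from before."""
    if this_page:
        return this_page[0]
    if previous_page:
        return previous_page[-1]
    return None


def heads_for_pages(headings_per_page):
    """Return the running head chosen for each page, in order."""
    heads, previous = [], []
    for this_page in headings_per_page:
        head = running_head(this_page, previous)
        heads.append(head)
        # The chosen head is inserted on the next page's "previous-page" list,
        # ahead of this page's own headings, so long sections keep their head.
        previous = ([head] if head is not None else []) + list(this_page)
    return heads
```

For example, a section "Intro" that runs for three pages without another heading keeps "Intro" as the running head on all three.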

Each page needs to have a layout which includes virtual documents and general formatting such as positioning of headers, footers, page numbers, footnote areas, and highlights (a little box in the middle containing an important sentence in large type).

Rendering with footnotes can be a bit interesting as the footnote takes up space that we might want to use for the main document.

It would be good to do a trial layout, see what virtual documents have been created, then do another trial with those documents in place, which could reduce the size of those documents. Use that to finalise the sub-documents that use competing space, and then perform the final rendering.

This would be eased by describing the page in an order such that the conditional virtual documents take the space they need first. So it is "set the header, set the footer, set the footnotes, set the highlight, finally set the body".

Note that this means that any virtual document needs to be ready to not render all of its content on this page (just as the main document does). It may either discard the remainder, or deliver it into a virtual document on the next, or a later page.

It also means that if we enlarge the space of the main document, it can only result in enlarging the virtual documents.
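A toy version of the trial layouts for one page, with footnotes as the only virtual document. Each body line is a (line_height, footnote_height) pair, with footnote_height 0 for a line that has no footnote. The names and this very simplified model are assumptions for the sake of the sketch.

```python
def layout_page(lines, page_height):
    """Return (body_line_count, footnote_space, footnote_overflow) for one page."""

    def fit(space):
        # How many leading body lines fit in the given vertical space.
        used = count = 0
        for height, _ in lines:
            if used + height > space:
                break
            used += height
            count += 1
        return count

    def notes(count):
        # Total height of the footnotes created by the first `count` lines.
        return sum(note for _, note in lines[:count])

    # Trial 1: the body gets the whole page; see which footnotes that creates.
    reserved = notes(fit(page_height))
    # Trial 2: lay out again with those footnotes in place; fewer lines fit,
    # so the footnotes can only get smaller.
    reserved = notes(fit(page_height - reserved))
    # Final: the footnote space is now fixed and the body takes the rest.
    count = fit(page_height - reserved)
    # Enlarging the body may pull in more footnotes than were reserved for;
    # that remainder has to be delivered to a later page.
    overflow = max(0, notes(count) - reserved)
    return count, reserved, overflow
```

With 10-unit lines on a 60-unit page and 10-unit footnotes on the second and fifth lines, this reserves space for one footnote, sets five body lines, and reports 10 units of footnote text to carry over.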



