On Thu, Jun 30, 2011 at 2:45 AM, Jan Paul Posma <jp.posma@gmail.com> wrote:
On 30-Jun-2011, at 9:09, Andreas Jonsson wrote:
> 1. How should the treatment of table garbage be specified?  My
>   recommendation is to change the semantics compared to the original
>   and just specify that table garbage should be ignored.

This kind of garbage handling would be acceptable, right? This probably falls in the category "shouldn't do that anyway", IMO.

I'm actually not certain how much of that is done by MediaWiki's parser/sanitizer, how much by Tidy postprocessing, and how much by the browser... We should probably define rules for handling it consistently, since some of it is going to end up in there anyway. That's the same reason HTML5 specifies parse error recovery behavior.

A few things are legit inside a table outside rows, such as a <caption> or an explicit <thead>, <tbody> or <tfoot> (though with limitations), so it's not all potential garbage. :)


> 1. Once both types of tables have been opened, use internal tokens
> interchangeably.
>
> 2. Let inner tables take precedence and disable tokens of outer table type.
>
> 3. Let outer tables take precedence and implicitly terminate inner table
> if table tokens of outer table type are encountered.

I wrote down a whole piece about how enabling and disabling tokens shouldn't be needed here, but I see the problem now. I think option 2 is the most intuitive. But can't we include TD and TR tokens in the wikitext table specification? That would extend the grammar a bit, but that is no problem.

Option 1 is how things work in MediaWiki's existing parser, so that's what we'll be doing for the new internal parser at least. It might not feel very pure, but in principle it's kind of like having a "<SPAN>...</span>" in case-insensitive HTML -- the wikitext and HTML table tokens are basically different encodings of the same functional symbol.
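
To make the "different encodings of the same symbol" idea concrete, here is a rough sketch. It is not the existing parser's code: the symbol names (TABLE_OPEN, ROW, ...), the token table, and the normalize() helper are all made up for illustration, and real tokens carry attributes that this ignores.

```python
# Hypothetical mapping from raw table tokens to functional symbols.
# The names and the set of tokens are illustrative assumptions only.
SYMBOLS = {
    "{|": "TABLE_OPEN",
    "<table>": "TABLE_OPEN",
    "|}": "TABLE_CLOSE",
    "</table>": "TABLE_CLOSE",
    "|-": "ROW",
    "<tr>": "ROW",
    "|": "CELL",
    "<td>": "CELL",
}


def normalize(token: str) -> str:
    """Map a raw token to its functional symbol, ignoring HTML tag case.

    Anything that is not a known table token is treated as plain text.
    """
    return SYMBOLS.get(token.strip().lower(), "TEXT")
```

With a table like this, "{|", "<table>" and "<TABLE>" all come out as the same symbol, so later stages don't need to care which syntax opened or closed a table.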

Since we also have to handle tables opened and closed at different hierarchical levels (e.g., between different templates), we currently expect to describe them in the parse tree in a way that lets the open & close tags get matched up later on, during processing of the parse tree, rather than during the original raw-source parse. This should also fit nicely with the notion that the open & close tokens can be slightly different variants (versus recording them as a single node in a hierarchical tree with children for contents) -- a specific table's contents can be snurched out with an iterator going from the start token to the end token over the tree, much like you'd do for extracting an arbitrary text selection.
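
As a minimal sketch of that late matching, assume the parse output has already been flattened to a list of (symbol, raw text) tokens. The token shape, the symbol names, and both function names below are assumptions for illustration, not the planned parse tree format.

```python
from typing import Iterator, List, Tuple

Token = Tuple[str, str]  # (functional symbol, raw source text); hypothetical shape


def match_tables(tokens: List[Token]) -> List[Tuple[int, int]]:
    """Pair table open/close tokens after parsing, using a stack.

    Returns (open_index, close_index) pairs sorted by open index.
    Unmatched close tokens are skipped; unmatched opens are left unpaired.
    """
    stack: List[int] = []
    pairs: List[Tuple[int, int]] = []
    for i, (symbol, _raw) in enumerate(tokens):
        if symbol == "TABLE_OPEN":
            stack.append(i)
        elif symbol == "TABLE_CLOSE" and stack:
            pairs.append((stack.pop(), i))
    return sorted(pairs)


def table_contents(tokens: List[Token], start: int, end: int) -> Iterator[Token]:
    """Iterate over the tokens strictly between a matched open and close."""
    for i in range(start + 1, end):
        yield tokens[i]


# An inner table opened as HTML but closed with wikitext syntax.
demo: List[Token] = [
    ("TABLE_OPEN", "{|"),
    ("CELL", "|"),
    ("TEXT", "a"),
    ("TABLE_OPEN", "<table>"),
    ("CELL", "<td>"),
    ("TEXT", "b"),
    ("TABLE_CLOSE", "|}"),
    ("TABLE_CLOSE", "</table>"),
]
```

Here match_tables(demo) pairs the inner "<table>" with the "|}" that follows it, and table_contents() walks the span between a matched pair the same way you'd walk a text selection.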

-- brion