On Thu, Jun 30, 2011 at 2:45 AM, Jan Paul Posma <jp.posma(a)gmail.com> wrote:
On 30-Jun-2011, at 9:09, Andreas Jonsson wrote:
1. How should the treatment of table garbage be specified? My
recommendation is to change the semantics compared to the original
and just specify that table garbage should be ignored.
This kind of garbage handling would be acceptable, right? This probably
falls in the category "shouldn't do that anyway", IMO.
I'm actually not certain how much of that is done by MediaWiki's
parser/sanitizer, how much by Tidy postprocessing, and how much by the
browser... Probably should define rules for how to handle it consistently
since some is gonna end up in there anyway... same reason that HTML 5 is
specifying parse error recovery behavior.
A few things are legit inside a table outside rows, such as a <caption> or
an explicit <thead>, <tbody> or <tfoot> (though with limitations), so it's
not all potential garbage. :)
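
One way to read the "ignore table garbage" recommendation, keeping the legit
elements named above, is a whitelist filter over a table's direct children.
This is only a sketch: the name drop_table_garbage, the (tag, payload) node
shape, and the whitelist itself are assumptions made for illustration, not
what MediaWiki, Tidy, or any browser actually does, and the whitelist is
not meant to be complete.

```python
# Hypothetical whitelist: names allowed directly inside a table, outside
# any row. Taken from the examples in this message, plus rows themselves.
ALLOWED_TABLE_CHILDREN = {"caption", "thead", "tbody", "tfoot", "tr"}


def drop_table_garbage(children):
    """Keep only whitelisted direct children of a table.

    `children` is a list of (tag, payload) pairs; text nodes use None
    as the tag. Everything not on the whitelist is silently dropped.
    """
    return [
        (tag, payload)
        for tag, payload in children
        if tag is not None and tag.lower() in ALLOWED_TABLE_CHILDREN
    ]
```

Dropping is the simplest rule to specify; moving the stray content somewhere
else instead would need a different function.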
1. Once both types of tables have been opened, use internal tokens
interchangeably.
2. Let inner tables take precedence and disable tokens of outer table
type.
3. Let outer tables take precedence and implicitly terminate the inner table
if table tokens of outer table type are encountered.
I wrote a whole piece down about how enabling and disabling tokens
shouldn't be needed here, but I see the problem now. I think option 2 is the
most intuitive. But can't we include TD and TR tokens in the wikitext table
specification? That would extend the grammar a bit, but that is no problem.
1) is how things work in MediaWiki's existing parser, so that's what we'll
be doing for the new internal parser at least. It might not feel very pure,
but in principle it's kind of like having a "<SPAN>...</span>" in
case-insensitive HTML -- they're basically different encodings of the same
functional symbol.
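
The "different encodings of the same functional symbol" idea can be sketched
as a lookup that maps both surface syntaxes onto one symbol. The names
TOKEN_SYMBOLS, classify and same_symbol are hypothetical, and the sketch
ignores tag attributes and the rule that wikitext table tokens only count at
the start of a line.

```python
# Hypothetical token table: surface syntax -> (symbol, syntax family).
TOKEN_SYMBOLS = {
    "{|": ("TABLE_OPEN", "wikitext"),
    "|}": ("TABLE_CLOSE", "wikitext"),
    "|-": ("ROW", "wikitext"),
    "<table>": ("TABLE_OPEN", "html"),
    "</table>": ("TABLE_CLOSE", "html"),
    "<tr>": ("ROW", "html"),
}


def classify(raw):
    """Return (symbol, family) for a raw table token, or None if unknown."""
    # Lowercasing makes "<TABLE>" and "<table>" the same token.
    return TOKEN_SYMBOLS.get(raw.strip().lower())


def same_symbol(a, b):
    """True if two raw tokens encode the same functional symbol."""
    ca, cb = classify(a), classify(b)
    return ca is not None and cb is not None and ca[0] == cb[0]
```

With this table, same_symbol("{|", "<TABLE>") holds, so a table opened in one
syntax could be closed in the other, as in option 1.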
Since we also have to handle tables opened and closed at different
hierarchical levels (e.g. between different templates), we currently expect to
describe them in the parse tree in a way that the open & close tags get
matched up later on during processing of the parse tree rather than during
the original raw-source parse. This should also fit nicely with the notion
that the open & close tokens can be slightly different variants (versus
recording them as a single node in a hierarchical tree with children for
contents) -- a specific table's contents can be snurched out with an
iterator going from the start to the end tokens over the tree, much like
you'd do for extracting an arbitrary text selection.
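
A minimal sketch of that deferred matching, assuming for simplicity a flat
list of (symbol, payload) tokens rather than a real parse tree. The names
match_tables and table_contents are hypothetical, and open and close tokens
are matched purely by symbol, whichever syntax produced them.

```python
def match_tables(tokens):
    """Pair TABLE_OPEN and TABLE_CLOSE tokens in a flat token list.

    Returns a dict mapping the index of each open token to the index of
    its close token. Unmatched closes are skipped and unmatched opens
    are left out of the result.
    """
    pairs = {}
    stack = []  # indices of opens still waiting for a close
    for i, (symbol, _payload) in enumerate(tokens):
        if symbol == "TABLE_OPEN":
            stack.append(i)
        elif symbol == "TABLE_CLOSE" and stack:
            pairs[stack.pop()] = i
    return pairs


def table_contents(tokens, open_index, pairs):
    """Yield the tokens strictly between an open and its matching close."""
    for i in range(open_index + 1, pairs[open_index]):
        yield tokens[i]
```

Because the matching runs after tokenizing, it makes no difference whether
the open and close tokens came from the same template or from different ones.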
-- brion