2011-07-01 01:55, Brion Vibber wrote:
On Thu, Jun 30, 2011 at 2:45 AM, Jan Paul Posma <jp.posma@gmail.com> wrote:
On 30-Jun-2011, at 9:09, Andreas Jonsson wrote:
1. How should the treatment of table garbage be specified? My
recommendation is to change the semantics compared to the original
and just specify that table garbage should be ignored.
This kind of garbage handling would be acceptable, right? This
probably falls in the category "shouldn't do that anyway", IMO.
I'm actually not certain how much of that is done by MediaWiki's
parser/sanitizer, how much by Tidy postprocessing, and how much by the
browser... Probably should define rules for how to handle it
consistently since some is gonna end up in there anyway... same reason
that HTML 5 is specifying parse error recovery behavior.
As I wrote, the garbage text will appear _before_ the rendered table, so
the output from the parser will be (almost) valid HTML.
A few things are legit inside a table outside rows, such as a <caption>
or an explicit <thead>, <tbody> or <tfoot> (though with limitations),
so it's not all potential garbage. :)
1. Once both types of tables have been opened, use internal tokens
interchangeably.
2. Let inner tables take precedence and disable tokens of outer
table type.
3. Let outer tables take precedence and implicitly terminate inner
table if table tokens of outer table type are encountered.
I wrote a whole piece down about how enabling and disabling tokens
shouldn't be needed here, but I see the problem now. I think option
2 is the most intuitive. But can't we include TD and TR tokens in
the wikitext table specification? That would extend the grammar a
bit, but that is no problem.
1) is how things work in MediaWiki's existing parser, so that's what
we'll be doing for the new internal parser at least. It might not feel
very pure, but in principle it's kind of like having a
"<SPAN>...</span>" in case-insensitive HTML -- they're basically
different encodings of the same functional symbol.
Since we also have to handle tables opened and closed at different
hierarchical levels (e.g. between different templates), we currently
expect to describe them in the parse tree in a way that the open & close
tags get matched up later on during processing of the parse tree rather
than during the original raw-source parse. This should also fit nicely
with the notion that the open & close tokens can be slightly different
variants (versus recording them as a single node in a hierarchical tree
with children for contents) -- a specific table's contents can be
snurched out with an iterator going from the start to the end tokens
over the tree, much like you'd do for extracting an arbitrary text
selection.
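
To illustrate the idea of matching open and close tokens after the
raw-source parse, here is a minimal Python sketch. It is not MediaWiki
code: the Token class, the kind names and both helper functions are
hypothetical, and the stack-based pairing is just one assumed way of
doing the later matching step.

```python
from dataclasses import dataclass
from typing import Iterator, List, Tuple


@dataclass
class Token:
    kind: str    # hypothetical kinds: "table_open", "table_close", "text"
    source: str  # raw spelling as it appeared, e.g. "{|" or "<table>"


def match_tables(tokens: List[Token]) -> List[Tuple[int, int]]:
    """Pair table open/close tokens in a flat token list using a stack.

    Pairing looks only at the token kind, so an open written as
    "<table>" may be closed by "|}" and vice versa.
    Unmatched close tokens are skipped in this sketch.
    """
    stack: List[int] = []
    pairs: List[Tuple[int, int]] = []
    for i, tok in enumerate(tokens):
        if tok.kind == "table_open":
            stack.append(i)
        elif tok.kind == "table_close" and stack:
            pairs.append((stack.pop(), i))
    return sorted(pairs)


def table_contents(tokens: List[Token], start: int, end: int) -> Iterator[Token]:
    """Iterate over the tokens strictly between a matched open and close."""
    for i in range(start + 1, end):
        yield tokens[i]


tokens = [
    Token("table_open", "{|"),
    Token("text", "a"),
    Token("table_open", "<table>"),
    Token("text", "b"),
    Token("table_close", "|}"),
    Token("text", "c"),
    Token("table_close", "</table>"),
]
pairs = match_tables(tokens)
inner = [t.source for t in table_contents(tokens, *pairs[1])]
```

In this example the inner table is opened with "<table>" and closed
with "|}", which is the "slightly different variants" case: the
pairing does not care which spelling was used.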
But what you are describing is not the same behavior as 1). MediaWiki
processes HTML and wikitext tables separately, and to me it seems
highly unintentional that the inner tokens can be used interchangeably
and this only happens if both types of tables are nested with each
other. But you are here proposing yet another option, which I believe
is a lot simpler to implement:
4. There is only one type of table, but the tokens have aliases.
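
As a rough sketch of what option 4 could look like, the wikitext and
HTML spellings can be mapped onto one set of canonical token names
before any table logic runs. The table below and the canonical()
helper are assumptions made for illustration, not part of any existing
parser; real wikitext tokens also depend on line position, and HTML
tags may carry attributes, both of which this sketch ignores.

```python
from typing import Optional

# Hypothetical alias table: each surface spelling maps to one canonical
# token name, so later stages see only a single kind of table.
ALIASES = {
    "{|": "TABLE_OPEN",   "<table>": "TABLE_OPEN",
    "|}": "TABLE_CLOSE",  "</table>": "TABLE_CLOSE",
    "|+": "CAPTION",      "<caption>": "CAPTION",
    "|-": "ROW",          "<tr>": "ROW",
    "!": "HEADER_CELL",   "<th>": "HEADER_CELL",
    "|": "CELL",          "<td>": "CELL",
}


def canonical(spelling: str) -> Optional[str]:
    """Return the canonical token name for a spelling, or None.

    Lowercasing makes HTML tags case-insensitive and leaves the
    wikitext spellings unchanged.
    """
    return ALIASES.get(spelling.strip().lower())


same_open = canonical("<TABLE>") == canonical("{|")
```

With such a mapping, "{|" and "<TABLE>" are the same symbol as far as
the table grammar is concerned, so the question of enabling or
disabling tokens per table type does not arise.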