Now that there are more people on this list, I thought I might bring up tables for discussion again. There are two things that I would like to have specified: the treatment of "table garbage", and the mixing of table flavours.
There are two flavours of tables: html-tables and wikitext tables. A wikitext table has the structure:
^'{|' table garbage ^'|' block element contents ^'|-' table garbage ^'|}'
An html table has the structure:
'<table>' table garbage '<tr>' table garbage '<td>' block element contents '</td>' table garbage '</tr>' table garbage '</table>'
MediaWiki processes tables by extracting any recognizable part of the table from the text, and writing out the rendered html at a position right _after_ the position where the table appears. The things that I call "table garbage" are left in place and will thus surprisingly appear before the table in the rendered output. (Table garbage is parsed the same way as block element contents.)
1. How should the treatment of table garbage be specified? My recommendation is to change the semantics compared to the original and just specify that table garbage should be ignored.
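To make the difference concrete, here is a minimal sketch of the two garbage policies for a single, non-nested wikitext table. It is a toy model, not MediaWiki's code: the function name `render_table` and the `keep_garbage` flag are made up for illustration, table attributes are ignored, and garbage is emitted as plain text rather than parsed as block content.

```python
def render_table(lines, keep_garbage):
    """Render one non-nested wikitext table given as a list of source lines.

    Toy model: '{|' opens the table, '|}' closes it, '|-' starts a row and
    '|' starts a cell. Any other line seen outside a cell is table garbage.
    """
    garbage, rows = [], []
    in_cell = False
    for line in lines:
        if line.startswith("{|"):
            in_cell = False
        elif line.startswith("|}"):
            break
        elif line.startswith("|-"):
            rows.append([])          # start a new row
            in_cell = False
        elif line.startswith("|"):
            if not rows:
                rows.append([])      # the first row is implicit
            rows[-1].append(line[1:].strip())
            in_cell = True
        elif in_cell:
            rows[-1][-1] += " " + line.strip()   # continuation of cell content
        else:
            garbage.append(line.strip())         # table garbage
    html = "<table>" + "".join(
        "<tr>" + "".join(f"<td>{cell}</td>" for cell in row) + "</tr>"
        for row in rows if row
    ) + "</table>"
    # keep_garbage=True mimics the behavior described above: garbage ends up
    # before the rendered table. keep_garbage=False is the proposed rule.
    prefix = " ".join(garbage) + " " if keep_garbage and garbage else ""
    return prefix + html


source = ["{|", "stray text", "| a", "|-", "more stray", "| b", "|}"]
print(render_table(source, keep_garbage=True))
print(render_table(source, keep_garbage=False))
```

With `keep_garbage=True` the two stray lines are printed in front of the table; with `keep_garbage=False` only the table is printed.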
The behavior of MediaWiki is that the internal table tokens ('<td>', '<tr>' etc. for html tables and ^'|', ^'|-' etc. for wikitext tables) are activated when opening up a table of the corresponding type. But when nesting tables of different types, the internal table tokens can be used more or less interchangeably. For example, this input:
<table>
<td>
{|
| cell <td> cell <tr><td> cell
|-
| cell
|}
</table>
renders as this html:
<table> <td> <table> <tr> <td> cell </td><td> cell <tr></td><td> cell
</td></tr> <tr> <td> cell </td></tr></table> </table>
I have previously suggested that it should be specified that only the internal table tokens of the right type can be used. Thus, opening a wikitext table inside an html table would activate parsing of the wikitext table tokens and deactivate parsing of html table tokens. This is a behavior that I find appealing. But since PEGs are currently in fashion, this is a behavior that might be problematic to implement. So there is also a third alternative: implicitly terminate the inner table when encountering table tokens from the outer table, which should be straightforward to implement with a PEG grammar.
So to summarize the alternatives:
1. Once both types of tables have been opened, use internal tokens interchangeably.
2. Let inner tables take precedence and disable tokens of outer table type.
3. Let outer tables take precedence and implicitly terminate the inner table if table tokens of the outer table type are encountered.
Which should be specified? I recommend 2 or 3.
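The three alternatives can be compared with a small decision function. This is only a sketch under stated assumptions: the names `classify`, `open_tables` and the result strings are hypothetical, and only the internal tokens named above ('|', '|-', '<td>', '<tr>') are modelled.

```python
WIKI_TOKENS = {"|", "|-"}        # internal wikitext table tokens named above
HTML_TOKENS = {"<td>", "<tr>"}   # internal html table tokens named above


def classify(token, open_tables, policy):
    """Decide how an internal table token is treated.

    open_tables lists the flavours ("wiki" or "html") of the currently open
    tables, outermost first. policy is 1, 2 or 3, matching the alternatives
    listed above. Returns "table-token", "text" or "close-inner".
    """
    if token in WIKI_TOKENS:
        flavour = "wiki"
    elif token in HTML_TOKENS:
        flavour = "html"
    else:
        return "text"
    if flavour not in open_tables:
        return "text"            # no table of this flavour is open at all
    if open_tables[-1] == flavour:
        return "table-token"     # token matches the innermost table
    # The token belongs to an enclosing table of the other flavour.
    if policy == 1:
        return "table-token"     # tokens are interchangeable
    if policy == 2:
        return "text"            # inner table wins, outer tokens are disabled
    if policy == 3:
        return "close-inner"     # outer table wins, inner table is terminated
    raise ValueError(f"unknown policy: {policy}")


# A '<td>' inside a wikitext table that is nested in an html table:
for policy in (1, 2, 3):
    print(policy, classify("<td>", ["html", "wiki"], policy))
```

The loop prints `table-token`, `text` and `close-inner` for alternatives 1, 2 and 3 respectively.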
Best regards,
Andreas Jonsson
On 30-Jun-2011, at 9:09, Andreas Jonsson wrote:
> (..) ^'{|' table garbage ^'|' block element contents ^'|-' table garbage ^'|}' (..) '<table>' table garbage '<tr>' table garbage '<td>' block element contents '</td>' table garbage '</tr>' table garbage '</table>'
> 1. How should the treatment of table garbage be specified? My recommendation is to change the semantics compared to the original and just specify that table garbage should be ignored.
This kind of garbage handling would be acceptable, right? This probably falls in the category "shouldn't do that anyway", IMO.
> 1. Once both types of tables have been opened, use internal tokens interchangeably.
> 2. Let inner tables take precedence and disable tokens of outer table type.
> 3. Let outer tables take precedence and implicitly terminate the inner table if table tokens of the outer table type are encountered.
I wrote a whole piece down about how enabling and disabling tokens shouldn't be needed here, but I see the problem now. I think option 2 is the most intuitive. But can't we include TD and TR tokens in the wikitext table specification? That would extend the grammar a bit, but that is no problem.
Cheers, Jan Paul
On Thu, Jun 30, 2011 at 2:45 AM, Jan Paul Posma jp.posma@gmail.com wrote:
> On 30-Jun-2011, at 9:09, Andreas Jonsson wrote:
> > 1. How should the treatment of table garbage be specified? My recommendation is to change the semantics compared to the original and just specify that table garbage should be ignored.
> This kind of garbage handling would be acceptable, right? This probably falls in the category "shouldn't do that anyway", IMO.
I'm actually not certain how much of that is done by MediaWiki's parser/sanitizer, how much by Tidy postprocessing, and how much by the browser... Probably should define rules for how to handle it consistently since some is gonna end up in there anyway... same reason that HTML 5 is specifying parse error recovery behavior.
A few things are legit inside a table outside rows, such as a <caption> or an explicit <thead>, <tbody> or <tfoot> (though with limitations), so it's not all potential garbage. :)
> > 1. Once both types of tables have been opened, use internal tokens interchangeably.
> > 2. Let inner tables take precedence and disable tokens of outer table type.
> > 3. Let outer tables take precedence and implicitly terminate the inner table if table tokens of the outer table type are encountered.
> I wrote a whole piece down about how enabling and disabling tokens shouldn't be needed here, but I see the problem now. I think option 2 is the most intuitive. But can't we include TD and TR tokens in the wikitext table specification? That would extend the grammar a bit, but that is no problem.
1) is how things work in MediaWiki's existing parser, so that's what we'll be doing for the new internal parser at least. It might not feel very pure, but in principle it's kind of like having a "<SPAN>...</span>" in case-insensitive HTML -- they're basically different encodings of the same functional symbol.
Since we also have to handle tables opened and closed at different hierarchical levels (eg between different templates), we currently expect to describe them in the parse tree in a way that the open & close tags get matched up later on during processing of the parse tree rather than during the original raw-source parse. This should also fit nicely with the notion that the open & close tokens can be slightly different variants (versus recording them as a single node in a hierarchical tree with children for contents) -- a specific table's contents can be snurched out with an iterator going from the start to the end tokens over the tree, much like you'd do for extracting an arbitrary text selection.
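A rough sketch of that idea, assuming a flat list of tokens rather than any actual parse tree format: open and close tokens are paired in a later pass, either variant of a close token may pair with either variant of an open token, and a table's contents are read off as the span between the two. The helper names `match_tables` and `table_contents` are invented for this illustration.

```python
OPEN_TOKENS = {"{|", "<table>"}     # variants of the "table open" symbol
CLOSE_TOKENS = {"|}", "</table>"}   # variants of the "table close" symbol


def match_tables(tokens):
    """Pair table open/close tokens in a flat token list.

    Runs after the raw-source parse. Returns a dict mapping the index of each
    open token to the index of its matching close token. Unmatched tokens are
    left out of the result.
    """
    stack, pairs = [], {}
    for index, token in enumerate(tokens):
        if token in OPEN_TOKENS:
            stack.append(index)
        elif token in CLOSE_TOKENS and stack:
            pairs[stack.pop()] = index
    return pairs


def table_contents(tokens, open_index, pairs):
    """Return the tokens strictly between a table's open and close tokens."""
    return tokens[open_index + 1:pairs[open_index]]


tokens = ["<table>", "<td>", "{|", "|", "cell", "|}", "</table>"]
pairs = match_tables(tokens)
print(pairs)
print(table_contents(tokens, 2, pairs))
```

For this token list the inner table (opened at index 2) is paired with the `|}` at index 5, and its contents are `['|', 'cell']`.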
-- brion
2011-07-01 01:55, Brion Vibber skrev:
> On Thu, Jun 30, 2011 at 2:45 AM, Jan Paul Posma <jp.posma@gmail.com> wrote:
> > On 30-Jun-2011, at 9:09, Andreas Jonsson wrote:
> > > 1. How should the treatment of table garbage be specified? My recommendation is to change the semantics compared to the original and just specify that table garbage should be ignored.
> > This kind of garbage handling would be acceptable, right? This probably falls in the category "shouldn't do that anyway", IMO.
> I'm actually not certain how much of that is done by MediaWiki's parser/sanitizer, how much by Tidy postprocessing, and how much by the browser... Probably should define rules for how to handle it consistently since some is gonna end up in there anyway... same reason that HTML 5 is specifying parse error recovery behavior.
As I wrote, the garbage text will appear _before_ the rendered table, so the output from the parser will be (almost) valid html.
> A few things are legit inside a table outside rows, such as a <caption> or an explicit <thead>, <tbody> or <tfoot> (though with limitations), so it's not all potential garbage. :)
> > > 1. Once both types of tables have been opened, use internal tokens interchangeably.
> > > 2. Let inner tables take precedence and disable tokens of outer table type.
> > > 3. Let outer tables take precedence and implicitly terminate the inner table if table tokens of the outer table type are encountered.
> > I wrote a whole piece down about how enabling and disabling tokens shouldn't be needed here, but I see the problem now. I think option 2 is the most intuitive. But can't we include TD and TR tokens in the wikitext table specification? That would extend the grammar a bit, but that is no problem.
> 1) is how things work in MediaWiki's existing parser, so that's what we'll be doing for the new internal parser at least. It might not feel very pure, but in principle it's kind of like having a "<SPAN>...</span>" in case-insensitive HTML -- they're basically different encodings of the same functional symbol.
> Since we also have to handle tables opened and closed at different hierarchical levels (eg between different templates), we currently expect to describe them in the parse tree in a way that the open & close tags get matched up later on during processing of the parse tree rather than during the original raw-source parse. This should also fit nicely with the notion that the open & close tokens can be slightly different variants (versus recording them as a single node in a hierarchical tree with children for contents) -- a specific table's contents can be snurched out with an iterator going from the start to the end tokens over the tree, much like you'd do for extracting an arbitrary text selection.
But what you are describing is not the same behavior as 1). MediaWiki processes html and wikitext tables separately, and to me it seems highly unintentional that the inner tokens can be used interchangeably; this only happens if both types of tables are nested with each other. But you are here proposing yet another option, which I believe is a lot simpler to implement:
4. There is only one type of table, but the tokens have aliases.
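As a sketch of what option 4 could look like, each token variant is mapped to one canonical symbol before any table logic runs. The symbol names and the `normalize` helper are assumptions made up for this example, and only the tokens mentioned in this thread are covered. The html closing tags '</td>' and '</tr>' have no wikitext counterpart here and are passed through unchanged.

```python
# Both spellings of each table token map to the same canonical symbol.
ALIASES = {
    "{|": "TABLE_OPEN", "<table>": "TABLE_OPEN",
    "|}": "TABLE_CLOSE", "</table>": "TABLE_CLOSE",
    "|-": "ROW", "<tr>": "ROW",
    "|": "CELL", "<td>": "CELL",
}


def normalize(tokens):
    """Replace table tokens by their canonical symbol, leaving other tokens alone.

    The lookup is lowercased so that '<TD>' and '<td>' are treated alike.
    """
    return [ALIASES.get(token.lower(), token) for token in tokens]


print(normalize(["{|", "|", "x", "<TD>", "y", "</table>"]))
```

After normalization the mixed input above becomes `['TABLE_OPEN', 'CELL', 'x', 'CELL', 'y', 'TABLE_CLOSE']`, so a later stage sees one table type regardless of which spelling was used.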
Andreas Jonsson wrote:
> But what you are describing is not the same behavior as 1). MediaWiki processes html and wikitext tables separately, and to me it seems highly unintentional that the inner tokens can be used interchangeably; this only happens if both types of tables are nested with each other. But you are here proposing yet another option, which I believe is a lot simpler to implement:
> 4. There is only one type of table, but the tokens have aliases.
I think that used to be the case, and that it was later changed so that each token only works with its own table type, but I couldn't find where that change was made.
wikitext-l@lists.wikimedia.org