List item tokens must appear at the beginning of a line, as lists aren't allowed inside a preformatted box:
: foo -> <dl><dd> foo </dd></dl>
 : foo -> <pre> : foo </pre>   (note the leading space)
However, if you put a table inside it, the list item token may appear indented:
:{| |foo |} -> <dl><dd><table><tbody><tr><td> foo </td></tr></tbody></table></dd></dl>
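The rule described above can be sketched as a toy line classifier (Python; a deliberately simplified model of the real tokenizer, written only to illustrate the edge case):

```python
def classify_line(line):
    """Toy classifier for the edge case described above.

    A line starting with ':' begins a definition-list item; a line
    starting with a space is preformatted; but an indented ':' whose
    content immediately opens a table with '{|' still starts a
    definition list, because a list can't live inside a preformatted
    box but a table can live inside a list item.
    """
    if line.startswith(':'):
        return 'dl/dd'                     # : foo  -> <dl><dd>...
    if line.startswith(' '):
        stripped = line.lstrip()
        if stripped.startswith(':') and stripped[1:].lstrip().startswith('{|'):
            return 'dl/dd'                 # indented :{| -> still a list
        return 'pre'                       # " : foo" -> <pre>...
    return 'paragraph'

print(classify_line(': foo'))        # dl/dd
print(classify_line(' : foo'))       # pre
print(classify_line(' :{| table'))   # dl/dd
```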
/Andreas
"Andreas Jonsson" andreas.jonsson@kreablo.se wrote in message news:4C60168E.4000901@kreablo.se...
List item tokens must appear at the beginning of a line, as lists aren't allowed inside a preformatted box:
: foo -> <dl><dd> foo </dd></dl>
 : foo -> <pre> : foo </pre>   (note the leading space)
However, if you put a table inside it, the list item token may appear indented:
:{|
|foo |} -> <dl><dd><table><tbody><tr><td> foo
</td></tr></tbody></table></dd></dl>
This kind of unexpected edge-case is arguably something that should be fixed in any formal markup specification.
- Mark Clements (HappyDog)
On 9 August 2010 17:04, Mark Clements (HappyDog) gmane@kennel17.co.uk wrote:
This kind of unexpected edge-case is arguably something that should be fixed in any formal markup specification.
How prevalent is it in actual wikitext? Is it an edge case people actually use much, or are all instances of it basically errors? That'll be the question.
- d.
On 10 August 2010 11:09, David Gerard dgerard@gmail.com wrote:
On 9 August 2010 17:04, Mark Clements (HappyDog) gmane@kennel17.co.uk wrote:
This kind of unexpected edge-case is arguably something that should be fixed in any formal markup specification.
How prevalent is it in actual wikitext? Is it an edge case people actually use much, or are all instances of it basically errors? That'll be the question.
Its only potential use is in making the wikitext more easily readable, which doesn't seem important enough to warrant such a weird edge case. Any formal spec is going to end up breaking things; that can't really be helped (unless we just write down a spec for the current behaviour, bugs and all, which sounds like a lost opportunity to me).
2010-08-10 12:45, Thomas Dalton skrev:
On 10 August 2010 11:09, David Gerard dgerard@gmail.com wrote:
On 9 August 2010 17:04, Mark Clements (HappyDog) gmane@kennel17.co.uk wrote:
This kind of unexpected edge-case is arguably something that should be fixed in any formal markup specification.
How prevalent is it in actual wikitext? Is it an edge case people actually use much, or are all instances of it basically errors? That'll be the question.
Its only potential use is in making the wikitext more easily readable, which doesn't seem important enough to warrant such a weird edge case. Any formal spec is going to end up breaking things; that can't really be helped (unless we just write down a spec for the current behaviour, bugs and all, which sounds like a lost opportunity to me).
If you consider the large body of information tied to MediaWiki syntax, it is likely that for any border case, there is a revision of some page that will trigger that border case.
Regarding strategy on how to replace the MediaWiki parser, I can see two extremes:
1. Search out all weird edge cases and reproduce them in parser rules. Walk through the revisions of Wikipedia and, for each edge case, note all revisions for which the parser rule for that edge case is executed. Based on that data, determine which edge cases can be safely removed, or define a conversion for the content.
2. Don't support any edge cases. Just consider the content broken and let the wiki users themselves fix it. Historic revisions of pages will be permanently broken.
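The survey in strategy 1 could be instrumented roughly like this (a Python sketch; the edge-case detector and the revision source are hypothetical placeholders, not real MediaWiki APIs):

```python
from collections import Counter

def find_edge_cases(wikitext):
    """Hypothetical detector: yields the name of each edge-case
    rule that this text would trigger. Only one toy rule is shown,
    matching the indented list-item-with-table case above."""
    for line in wikitext.splitlines():
        stripped = line.lstrip()
        if line != stripped and stripped.startswith(':') and '{|' in stripped:
            yield 'indented-list-item-with-table'

def survey(revisions):
    """Count, per edge case, how many revisions trigger it,
    as a basis for deciding which rules can safely be dropped."""
    hits = Counter()
    for rev_id, text in revisions:
        for case in set(find_edge_cases(text)):
            hits[case] += 1
    return hits

revs = [(1, ': normal list'), (2, ' :{| |weird |}'), (3, 'plain text')]
print(survey(revs))   # Counter({'indented-list-item-with-table': 1})
```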
I am trying to support as many edge cases as reasonable in my attempt to write a new parser. It seems, however, that the parser is actively developed and that backwards compatibility with edge cases maybe isn't much of a concern. For instance, in 1.16.0beta3 we have:
$text = $this->doAllQuotes( $text );
$text = $this->replaceInternalLinks( $text );
$text = $this->replaceExternalLinks( $text );
which in trunk is:
$text = $this->replaceInternalLinks( $text );
$text = $this->doAllQuotes( $text );
$text = $this->replaceExternalLinks( $text );
So, it is now possible to have apostrophes in internal links, but still not in external ones.
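The effect of that reordering can be illustrated with a toy two-pass pipeline (Python; the link-holder scheme is loosely modeled on the parser's placeholder mechanism, and the quote/link syntax handling is deliberately simplified):

```python
import re

def do_all_quotes(text):
    # Toy version: each ''...'' pair becomes <i>...</i>.
    return re.sub(r"''(.*?)''", r"<i>\1</i>", text)

def replace_internal_links(text, holders):
    # Toy link-holder scheme: the generated HTML is stashed and
    # replaced by an opaque marker, so later passes (like the
    # quote pass) cannot see the apostrophes in the link target.
    def repl(m):
        holders.append('<a href="/wiki/%s">%s</a>' % (m.group(1), m.group(1)))
        return '\x7fLINK%d\x7f' % (len(holders) - 1)
    return re.sub(r"\[\[(.*?)\]\]", repl, text)

def expand_holders(text, holders):
    return re.sub(r'\x7fLINK(\d+)\x7f',
                  lambda m: holders[int(m.group(1))], text)

src = "''[[Tim''s page]]''"

# Old order: quotes run first and pair up an apostrophe pair inside
# the link target with the outer italics, mangling the [[...]] span.
h1 = []
old = expand_holders(replace_internal_links(do_all_quotes(src), h1), h1)

# New order: the link is hidden behind a marker before the quote pass.
h2 = []
new = expand_holders(do_all_quotes(replace_internal_links(src, h2)), h2)

print(old)   # mangled: an </i> ends up spliced into the link target
print(new)   # <i><a href="/wiki/Tim''s page">Tim''s page</a></i>
```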
From the parser's point of view, the edge cases can be divided into "harmless", where a rule to support it does not increase the complexity of the parser significantly, and "harmful", where adding a rule to support them would either dramatically increase the size of the parser or make it possible to craft contents that will take more than linear time or memory to process. The edge cases surrounding links definitely fall into the harmful category. I will be writing a separate post about links later.
Maybe it would be a good idea to provide some feedback to the user regarding bad syntax. In my parser implementation, I am considering generating special events for syntax that should be avoided. For instance:
begin_table:
    begin = BEGIN_TABLE NEWLINE*
    (
      { X->beginGarbageBlock(X, "Unsupported syntax: content between the {| and the first column in a table."); }
      ( (inline_element)=> garbage_inline_text NEWLINE* )*
      block_elements?
      { X->endGarbageBlock(X); }
    )*
    { X->beginTable(X, $begin->custom); }
    ;
This could, for instance, be rendered in HTML as: <div class="garbage" title="Unsupported syntax: content between ..."> </div>.
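One possible shape for the receiving side of such events (a Python sketch; the beginGarbageBlock/endGarbageBlock names are taken from the grammar above, everything else is hypothetical):

```python
import html

class HtmlListener:
    """Toy event listener that wraps unsupported syntax in a
    warning <div>, as suggested above. The method names mirror
    the grammar actions; the rest is a hypothetical sketch."""

    def __init__(self):
        self.out = []

    def beginGarbageBlock(self, message):
        # Open a div carrying the warning as a tooltip.
        self.out.append('<div class="garbage" title="%s">'
                        % html.escape(message, quote=True))

    def text(self, s):
        self.out.append(html.escape(s))

    def endGarbageBlock(self):
        self.out.append('</div>')

listener = HtmlListener()
listener.beginGarbageBlock(
    "Unsupported syntax: content between the {| and the first column in a table.")
listener.text("stray text")
listener.endGarbageBlock()
print(''.join(listener.out))
```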
/Andreas
Andreas Jonsson schrieb:
- Don't support any edge cases. Just consider the content broken and let the wiki users themselves fix it. Historic revisions of pages will be permanently broken.
Or keep the old parser around to deal with old revisions. Revisions that work with the new parser can be flagged as such.
-- daniel
On 8/11/10, Daniel Kinzler daniel@brightbyte.de wrote:
Or keep the old parser around to deal with old revisions. Revisions that work with the new parser can be flagged as such.
This made me think of Quirks mode in browsers.
We can make the new parser (and standardized wikitext syntax) more strict by treating many edge cases as errors, and if the new parser detected any error when parsing, use the old parser instead.
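That quirks-mode dispatch could look roughly like this (a Python sketch; the parser names and the strictness rule are hypothetical stand-ins, not the real parsers):

```python
class StrictParseError(Exception):
    """Raised by the strict new parser when it meets wikitext
    outside the standardized subset (hypothetical)."""

def render(wikitext, new_parser, old_parser):
    """Quirks-mode dispatch: try the strict new parser first and
    fall back to the legacy parser on any strictness error."""
    try:
        return new_parser(wikitext), 'standard'
    except StrictParseError:
        return old_parser(wikitext), 'quirks'

# Toy stand-ins for the two parsers:
def new_parser(text):
    if text.lstrip() != text:   # e.g. treat odd indentation as an error
        raise StrictParseError(text)
    return '<p>%s</p>' % text

def old_parser(text):
    return '<pre>%s</pre>' % text

print(render('hello', new_parser, old_parser))   # ('<p>hello</p>', 'standard')
print(render(' :{|', new_parser, old_parser))    # ('<pre> :{|</pre>', 'quirks')
```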
On 11 August 2010 17:11, Liangent liangent@gmail.com wrote:
On 8/11/10, Daniel Kinzler daniel@brightbyte.de wrote:
Or keep the old parser around to deal with old revisions. Revisions that work with the new parser can be flagged as such.
This made me think of Quirks mode in browsers. We can make the new parser (and standardized wikitext syntax) more strict by treating many edge cases as errors, and if the new parser detected any error when parsing, use the old parser instead.
If Tim will buy it :-) The non-quirks mode had better cover *almost all* current revisions on the major WMF wikis, at the least. (The most recent current version dumps would be suitable test data.)
- d.
On Wed, Aug 11, 2010 at 6:02 PM, David Gerard dgerard@gmail.com wrote:
On 11 August 2010 17:11, Liangent liangent@gmail.com wrote:
On 8/11/10, Daniel Kinzler daniel@brightbyte.de wrote:
Or keep the old parser around to deal with old revisions. Revisions that work with the new parser can be flagged as such.
This made me think of Quirks mode in browsers. We can make the new parser (and standardized wikitext syntax) more strict by treating many edge cases as errors, and if the new parser detected any error when parsing, use the old parser instead.
If Tim will buy it :-) The non-quirks mode had better cover *almost all* current revisions on the major WMF wikis, at the least. (The most recent current version dumps would be suitable test data.)
Do we have a short list of "worst case scenario" pages, which use lots of special cases for some reason, and that we could use as a test set? Not something specially constructed, but real, live wikipedia pages.
Magnus
Liangent schrieb:
On 8/11/10, Daniel Kinzler daniel@brightbyte.de wrote:
Or keep the old parser around to deal with old revisions. Revisions that work with the new parser can be flagged as such.
This made me think of Quirks mode in browsers.
We can make the new parser (and standardized wikitext syntax) more strict by treating many edge cases as errors, and if the new parser detected any error when parsing, use the old parser instead.
There are no errors. This is an axiom of wiki markup: any text is valid wiki text. It may not look like what you thought it would, but there will be no "syntax error" messages, ever.
We *could* however issue warnings. That could actually be helpful. Perhaps in the form of special css classes / hidden markers, that can be made visible with some JS gadget. Maybe they should even be visible per default to logged in users.
-- daniel
On 12 August 2010 08:35, Daniel Kinzler daniel@brightbyte.de wrote:
There are no errors. This is an axiom of wiki markup: any text is valid wiki text. It may not look like what you thought it would, but there will be no "syntax error" messages, ever.
Yes. This is important: humans do not naturally write correctly-formed XML - they splatter a bunch of tag soup on a page, see if it rendered properly in "preview" and save when it looks good enough. That is, they learn the subtle nuances of wikitext the way they learn natural language.
Trying to turn wikitext into a proper language, with invalid formations, will not, I predict, work well. Unless you aim for humans mostly not to use wikitext and instead to use only WYSIWYG or other structured editors. This would be a major change in the way WMF wikis work, however, and may not fly.
We *could* however issue warnings. That could actually be helpful. Perhaps in the form of special css classes / hidden markers, that can be made visible with some JS gadget. Maybe they should even be visible per default to logged in users.
There are some examples that clearly break. If you fail to close a <ref> tag, watch the sea of red in the rendering. So you may be able to push wikitext towards being something where syntax errors exist, if that is considered a desirable goal. (I don't consider it one, but then again I want everything and a pony, like any unreasonable human does.)
- d.
Andreas Jonsson wrote:
I am trying to support as many edge cases as reasonable in my attempt to write a new parser. It seems, however, that the parser is actively developed and that backwards compatibility with edge cases maybe isn't much of a concern. For instance, in 1.16.0beta3 we have:
$text = $this->doAllQuotes( $text );
$text = $this->replaceInternalLinks( $text );
$text = $this->replaceExternalLinks( $text );
which in trunk is:
$text = $this->replaceInternalLinks( $text );
$text = $this->doAllQuotes( $text );
$text = $this->replaceExternalLinks( $text );
So, it is now possible to have apostrophes in internal links, but still not in external ones.
Yes. That was for supporting apostrophes in page titles without having to write entities in the link. External links had some issues if you tried to do it.
Maybe it would be a good idea to provide some feedback to the user regarding bad syntax. In my parser implementation, I am considering generating special events for syntax that should be avoided. For instance:
I have supported for a long time the notion of a parser giving optional warnings about "unsupported" wikitext.
wikitext-l@lists.wikimedia.org