A happy little milestone


Steve Bennett

7 Feb 2008

4:11 p.m.

I have successfully parsed my first nested table. It's 3 in the morning but I'm quite happy :)

One of the really complicated bits about the nested table syntax is that the contents of multi-line cells look exactly like normal text (with lists, headers, tables and so forth) except that each row can't begin with a pipe. I tried at least 4 different ways of implementing that rule (my practical ANTLR knowledge is still pretty weak), and finally this simple method worked:

It's a complete duplicate of the normal "line" rule, except with the addition of "nonpipe" before the paragraph.
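The grammar rule itself did not survive in this archive, so the following is only a rough Python sketch of the idea described above, not the ANTLR rule from the post. The function name, the labels and the set of recognised constructs are all hypothetical; the only point is that the in-cell variant is the same classifier with one extra "no leading pipe" condition before the paragraph fallback.

```python
def classify_line(line: str, in_table_cell: bool = False) -> str:
    """Return a coarse, hypothetical label for one line of wikitext."""
    if line.startswith("{|"):
        return "table_start"
    if line.startswith("="):
        return "header"
    if line.startswith(("*", "#")):
        return "list_item"
    if in_table_cell and line.startswith("|"):
        # Inside a multi-line cell, a leading pipe belongs to the enclosing
        # table, so the line must not fall through to "paragraph".
        return "table_markup"
    # Everything else is ordinary paragraph text.
    return "paragraph"
```

Outside a cell, a line such as "| moop" falls through to "paragraph"; with in_table_cell=True the same line is handed back to the table syntax instead.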

Anyway, now it's onto the next round of "yes, the grammar works, now to stop ANTLR spewing 5000 warnings at me".

Steve


Steve Bennett

7 Feb

4:32 p.m.

On 2/8/08, Steve Bennett stevagewp@gmail.com wrote:

...

I have successfully parsed my first nested table.

Hey, it parsed this mediawiki example almost perfectly:

{| border="1" cellpadding="5" cellspacing="0" align="center"
|+'''An example table'''
|-
! style="background:#efefef;" | First header
! colspan="2" style="background:#ffdead;" | Second header
|-
| upper left
|  
| rowspan=2 style="border-bottom:3px solid grey;" valign="top" | right side
|-
| style="border-bottom:3px solid grey;" | lower left
| style="border-bottom:3px solid grey;" | lower middle
|-
| colspan="3" align="center" |
{| border="0"
|+''A table in a table''
|-
| align="center" width="150px" | [[Image:Wiki.png]]
| align="center" width="150px" | [[Image:Wiki.png]]
|-
| align="center" colspan="2" style="border-top:1px solid red; border-right:1px solid red; border-bottom:2px solid red; border-left:1px solid red;" | Two Wikimedia logos
|}
|}

(when I say almost, it treats the image links literally...because, as I now realise, it doesn't allow nesting elements in single-line table cells. d'oh...)

Steve

Magnus Manske

8 Feb

1:46 p.m.

On Feb 7, 2008 4:32 PM, Steve Bennett stevagewp@gmail.com wrote:

...

On 2/8/08, Steve Bennett stevagewp@gmail.com wrote:

...
I have successfully parsed my first nested table.

Hey, it parsed this mediawiki example almost perfectly:

{| border="1" cellpadding="5" cellspacing="0" align="center"
|+'''An example table'''
|-
! style="background:#efefef;" | First header
! colspan="2" style="background:#ffdead;" | Second header
|-
| upper left
|  
| rowspan=2 style="border-bottom:3px solid grey;" valign="top" | right side
|-
| style="border-bottom:3px solid grey;" | lower left
| style="border-bottom:3px solid grey;" | lower middle
|-
| colspan="3" align="center" |
{| border="0"
|+''A table in a table''
|-
| align="center" width="150px" | [[Image:Wiki.png]]
| align="center" width="150px" | [[Image:Wiki.png]]
|-
| align="center" colspan="2" style="border-top:1px solid red; border-right:1px solid red; border-bottom:2px solid red; border-left:1px solid red;" | Two Wikimedia logos
|}
|}

(when I say almost, it treats the image links literally...because, as I now realise, it doesn't allow nesting elements in single-line table cells. d'oh...)

Congrats!

My http://tools.wikimedia.de/~magnus/wiki2xml/w2x.php parses it correctly as well, but it's still manual PHP hacks, while yours is a real parser - respect!

Cheers, Magnus

Steve Bennett

3:31 p.m.

On 2/9/08, Magnus Manske magnusmanske@googlemail.com wrote:

...

My http://tools.wikimedia.de/~magnus/wiki2xml/w2x.php parses it correctly as well, but it's still manual PHP hacks, while yours is a real parser - respect!

Not too much respect. I think I have only *just* worked out why I need all these syntactic predicates and what backtracking is used for.

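Throughout my grammar everywhere I have had to place these predicates like this:

((LEFT_BRACKET LEFT_BRACKET LEFT_BRACKET) => literal_left_bracket
// try and save it some time on [[[foo]]]?
|(literal_left_bracket bracketed_url) => literal_left_bracket
|(image) => image
|(category) => category
|(external_link) => external_link
|(internal_link) => internal_link
|(magic_link) => magic_link
|pre_block
|(formatted_text_elem) => formatted_text_elem
)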

The bit before the => on each line basically says "look ahead, and if the syntax matches the bit in brackets, then go ahead and parse it as the bit after the =>".

I never knew why I needed them to make it work, but now I see: in the case of an image, if it just dove straight into trying to parse a string like [[image:foo]] (not a valid image), it would hit the first [[, think the image rule matched, and keep going. Eventually it would realise the rule didn't match but it would be too late: because the grammar is blatantly not LALR (I think?), it would just fail (unless it could backtrack, which I'm not using). By using the syntactic predicate, it's able to prevent itself from falling in a hole - it looks ahead, sees "that looks like an image...oh wait, no it's not!", and tries the next rule instead.
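To make the "look first, then commit" behaviour concrete, here is a minimal Python sketch. It is not ANTLR and not the grammar from this thread: the two patterns are deliberately tiny stand-ins for the image and internal_link rules, and the list of file extensions is an assumption made only so that [[image:foo]] counts as "not a valid image". Each alternative is tried as a guard at the current position, and input is consumed only once a guard has matched in full.

```python
import re
from typing import List, Tuple

# Hypothetical, heavily simplified patterns (assumption: an image needs a
# file extension; real wikitext rules are far more involved).
IMAGE = re.compile(r"\[\[image:[^\[\]|]+\.(?:png|jpg|gif)(?:\|[^\[\]]*)?\]\]", re.I)
LINK = re.compile(r"\[\[[^\[\]]+\]\]")

def parse_inline(text: str) -> List[Tuple[str, str]]:
    """Split text into (kind, lexeme) pairs using ordered, guarded alternatives."""
    alternatives = [("image", IMAGE), ("internal_link", LINK)]
    out: List[Tuple[str, str]] = []
    pos = 0
    while pos < len(text):
        for kind, pattern in alternatives:
            m = pattern.match(text, pos)  # look ahead; nothing is consumed yet
            if m:
                out.append((kind, m.group()))
                pos = m.end()  # commit only after the whole guard matched
                break
        else:
            # No guard matched: consume one character as plain text.
            if out and out[-1][0] == "text":
                out[-1] = ("text", out[-1][1] + text[pos])
            else:
                out.append(("text", text[pos]))
            pos += 1
    return out
```

With these stand-in patterns, "[[image:foo.jpg|thumb]]" is reported as an image, while "[[image:foo]]" fails the image guard and is picked up by the next alternative as an internal link, rather than failing halfway through the image rule.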

There's a huge amount of messiness in the grammar so far caused by me not really understanding this stuff. I also haven't been very clean about where newlines and whitespace are handled exactly.

Anyway, my latest rant about tables (sorry Magnus :)) In the following table, which part is the style attribute for a table cell, and which part is the cell contents:

{|
|an [[image:foo.jpg|thumb|blah|]] or [[blaah|moo|wah]] floop | moop
|}

(reminder: cell definitions with style attributes look like this: | style | contents ||...)

Buggered if I know. I might have to impose a rule involving the range of possible characters that could appear in the style attribute. I didn't really want to have to actually parse that bit properly...
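One way such a character-range rule could look is sketched below in Python. This is an illustration under stated assumptions, not what either parser in this thread does: the split_cell helper is hypothetical, and it assumes the attribute part consists only of name=value pairs, so anything containing link brackets or free text is treated as cell contents.

```python
import re

# Assumption: an attribute run is one or more name=value pairs, where the
# value is double-quoted, single-quoted, or a bare token without pipes.
ATTRS = re.compile(
    r'''\s*(?:[A-Za-z][\w-]*\s*=\s*(?:"[^"|]*"|'[^'|]*'|[^\s|"']+)\s*)+'''
)

def split_cell(cell: str):
    """Split 'attrs | contents' into (attrs, contents); attrs is '' if absent."""
    head, sep, tail = cell.partition("|")
    if sep and ATTRS.fullmatch(head):
        # Everything before the first pipe looks like attributes.
        return head.strip(), tail.strip()
    # Otherwise the first pipe is part of the contents (e.g. inside a link).
    return "", cell.strip()
```

Under this rule the cell from the example above, "an [[image:foo.jpg|thumb|blah|]] or [[blaah|moo|wah]] floop | moop", has no attribute part at all, because "an [[image:foo.jpg" does not look like a list of attributes. Whether that matches MediaWiki's own behaviour is a separate question.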

Steve

Magnus Manske

8:49 p.m.

On Feb 8, 2008 3:31 PM, Steve Bennett stevagewp@gmail.com wrote:

...

On 2/9/08, Magnus Manske magnusmanske@googlemail.com wrote:

...
My http://tools.wikimedia.de/~magnus/wiki2xml/w2x.php parses it correctly as well, but it's still manual PHP hacks, while yours is a real parser - respect!

Not too much respect. I think I have only *just* worked out why I need all these syntactic predicates and what backtracking is used for.

Throughout my grammar everywhere I have had to place these predicates like this:
((LEFT_BRACKET LEFT_BRACKET LEFT_BRACKET) => literal_left_bracket
// try and save it some time on [[[foo]]]?
|(literal_left_bracket bracketed_url) => literal_left_bracket
|(image) => image
|(category) => category
|(external_link) => external_link
|(internal_link) => internal_link
|(magic_link) => magic_link
|pre_block
|(formatted_text_elem) => formatted_text_elem
)

The bit before the => on each line basically says "look ahead, and if the syntax matches the bit in brackets, then go ahead and parse it as the bit after the =>".

I never knew why I needed them to make it work, but now I see: in the case of an image, if it just dove straight into trying to parse a string like [[image:foo]] (not a valid image), it would hit the first [[, think the image rule matched, and keep going. Eventually it would realise the rule didn't match but it would be too late: because the grammar is blatantly not LALR (I think?), it would just fail (unless it could backtrack, which I'm not using). By using the syntactic predicate, it's able to prevent itself from falling in a hole - it looks ahead, sees "that looks like an image...oh wait, no it's not!", and tries the next rule instead.

There's a huge amount of messiness in the grammar so far caused by me not really understanding this stuff. I also haven't been very clean about where newlines and whitespace are handled exactly.

Anyway, my latest rant about tables (sorry Magnus :)) In the following table, which part is the style attribute for a table cell, and which part is the cell contents:

{|
|an [[image:foo.jpg|thumb|blah|]] or [[blaah|moo|wah]] floop | moop
|}

(reminder: cell definitions with style attributes look like this: | style | contents ||...)

Buggered if I know. I might have to impose a rule involving the range of possible characters that could appear in the style attribute. I didn't really want to have to actually parse that bit properly...

That's exactly what I did in wiki2xml, and it works (yesss, still ahead;-)

Of course, I cheap out in another regard there: wiki2xml parses images and links alike, and parses even links with "too many" parameters. My reasons for that:
* Laziness
* No need to know the language/wiki settings (which make "Image:" special for en)
* Flexible for "add-ons" (who knows, we might use three-part links someday...)
* Not much additional burden for the next level (XML-to-something)
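The "parse images and links alike" approach can be pictured with a short Python sketch. wiki2xml itself is PHP and its internals are not shown in this thread, so the Link type and parse_link function here are hypothetical names for illustration only. The parser just records a target and however many pipe-separated parameters follow, and leaves their interpretation to a later stage.

```python
from typing import List, NamedTuple

class Link(NamedTuple):
    target: str
    params: List[str]

def parse_link(markup: str) -> Link:
    """Parse '[[target|p1|p2|...]]' without deciding what kind of link it is."""
    if not (markup.startswith("[[") and markup.endswith("]]")):
        raise ValueError("not a double-bracket link")
    # The first field is the target; any number of parameters may follow.
    target, *params = markup[2:-2].split("|")
    return Link(target.strip(), [p.strip() for p in params])
```

Both "[[blaah|moo|wah]]" and "[[image:foo.jpg|thumb|blah|]]" go through the same code path. Deciding that "image:" is special, or that a link has too many parameters, becomes the job of whatever consumes the result.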

Cheers, Magnus

Steve Bennett

9 Feb

2:21 a.m.

On 2/9/08, Magnus Manske magnusmanske@googlemail.com wrote:

...

Of course, I cheap out in another regard there: wiki2xml parses images and links alike, and parses even links with "too many" parameters. My reasons for that:

It just depends a lot on what exactly the levels of parsing/code generation/extensions etc are in the final product. It would seem incomplete for me to not parse image options and to pretend that "thumbnail" is some mysterious third party extension keyword that I don't need to know anything about. OTOH, I'm not planning on doing anything with <ref> beyond recognising that it's a particular, valid XML-style construct.

(Another mini milestone, it now parses that example perfectly, images and all. Though I'm sure there are still plenty of tables that will break it)

Steve

Magnus Manske

13 Feb

10:27 a.m.

On Feb 9, 2008 2:21 AM, Steve Bennett stevagewp@gmail.com wrote:

...

(Another mini milestone, it now parses that example perfectly, images and all. Though I'm sure there are still plenty of tables that will break it)

For wiki2xml, I started a test suite using the parser tests from MediaWiki. That might be an idea for your parser as well.

Note that wiki2xml automatically gets rid of "irrelevant" whitespace, whereas MediaWiki keeps it; that leads to "correct" output by wiki2xml, which is marked as wrong in an "equal" comparison of strings.
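A whitespace-tolerant comparison for such a test suite might look like the Python sketch below. It is only an illustration: the helper names are made up, and the definition of "irrelevant" whitespace used here (runs of whitespace collapse to one space, whitespace between adjacent tags disappears) is an assumption that would be wrong inside constructs such as <pre>, where whitespace is significant.

```python
import re

def normalize_ws(text: str) -> str:
    """Drop whitespace between tags and collapse remaining runs to one space."""
    text = re.sub(r">\s+<", "><", text.strip())
    return re.sub(r"\s+", " ", text)

def outputs_match(expected: str, actual: str) -> bool:
    """Compare two outputs while ignoring whitespace-only differences."""
    return normalize_ws(expected) == normalize_ws(actual)
```

With a comparison like this, output that differs from the expected string only in line breaks or indentation is no longer reported as a failure, while genuine differences in content still are.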

Magnus

Mark Clements

7 Feb

5:09 p.m.

----- Original Message -----
From: "Steve Bennett" stevagewp@gmail.com
To: "Wikitext-l" wikitext-l@lists.wikimedia.org
Sent: 07 February 2008 16:11
Subject: [Wikitext-l] A happy little milestone

...

I have successfully parsed my first nested table. It's 3 in the morning but I'm quite happy :)

Well done Steve! You're doing a great job on this! :-)

- Mark Clements (HappyDog)

Thomas Dalton

5:33 p.m.

...

Anyway, now it's onto the next round of "yes, the grammar works, now to stop ANTLR spewing 5000 warnings at me".

Or, it could even be time for bed... Is there a Parser Implementers Anonymous meeting near you?
