On Feb 8, 2008 3:31 PM, Steve Bennett stevagewp@gmail.com wrote:
On 2/9/08, Magnus Manske magnusmanske@googlemail.com wrote:
My http://tools.wikimedia.de/~magnus/wiki2xml/w2x.php parses it correctly as well, but it's still manual PHP hacks, while yours is a real parser - respect!
Not too much respect. I think I have only *just* worked out why I need all these syntactic predicates and what backtracking is used for.
Everywhere in my grammar I have had to place predicates like this:
((LEFT_BRACKET LEFT_BRACKET LEFT_BRACKET) => literal_left_bracket // try and save it some time on [[[foo]]]?
|(literal_left_bracket bracketed_url) => literal_left_bracket
|(image) => image
|(category) => category
|(external_link) => external_link
|(internal_link) => internal_link
|(magic_link) => magic_link
|pre_block
|(formatted_text_elem) => formatted_text_elem
)
The bit before the => on each line basically says "look ahead, and if the syntax matches the bit in brackets, then go ahead and parse it as the bit after the =>".
I never knew why I needed them to make it work, but now I see: in the case of an image, if it just dove straight into trying to parse a string like [[image:foo]] (not a valid image), it would hit the first [[, think the image rule matched, and keep going. Eventually it would realise the rule didn't match but it would be too late: because the grammar is blatantly not LALR (I think?), it would just fail (unless it could backtrack, which I'm not using). By using the syntactic predicate, it's able to prevent itself from falling in a hole - it looks ahead, sees "that looks like an image...oh wait, no it's not!", and tries the next rule instead.
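For illustration only, here is a minimal Python sketch of the idea behind a syntactic predicate. It is not the ANTLR grammar discussed above: the `Parser` class, its toy `image` and `internal_link` rules, and the "an image name needs a file extension" check are all assumptions made up for the example. The point is that `predicate` runs a rule speculatively and always rewinds, so a failed guess costs nothing and the next alternative gets a clean start.

```python
class ParseError(Exception):
    """Raised when a rule does not match at the current position."""


class Parser:
    def __init__(self, text):
        self.text = text
        self.pos = 0

    def expect(self, literal):
        # Consume an exact literal or fail.
        if not self.text.startswith(literal, self.pos):
            raise ParseError(f"expected {literal!r} at {self.pos}")
        self.pos += len(literal)

    def until(self, stop_chars):
        # Consume characters up to (not including) any stop character.
        start = self.pos
        while self.pos < len(self.text) and self.text[self.pos] not in stop_chars:
            self.pos += 1
        return self.text[start:self.pos]

    def image(self):
        # Hypothetical toy rule: [[image:NAME.EXT]]
        self.expect("[[")
        if self.text[self.pos:self.pos + 6].lower() != "image:":
            raise ParseError("not in the image namespace")
        self.pos += 6
        name = self.until("]|")
        if "." not in name:
            raise ParseError("image name has no file extension")
        self.expect("]]")
        return ("image", name)

    def internal_link(self):
        # Hypothetical toy rule: [[TARGET]]
        self.expect("[[")
        target = self.until("]|")
        self.expect("]]")
        return ("link", target)

    def predicate(self, rule):
        # Look ahead: try the rule, report whether it matched, always rewind.
        saved = self.pos
        try:
            rule()
            return True
        except ParseError:
            return False
        finally:
            self.pos = saved

    def element(self):
        # Equivalent in spirit to: (image) => image | (internal_link) => internal_link
        for rule in (self.image, self.internal_link):
            if self.predicate(rule):
                return rule()
        raise ParseError("no alternative matched")
```

With these toy rules, `Parser("[[image:foo]]").element()` falls through the image alternative and is parsed as an ordinary link instead of failing halfway into the image rule.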
There's a huge amount of messiness in the grammar so far caused by me not really understanding this stuff. I also haven't been very clean about where newlines and whitespace are handled exactly.
Anyway, my latest rant about tables (sorry Magnus :)). In the following table, which part is the style attribute for a table cell, and which part is the cell contents?
{|
|an [[image:foo.jpg|thumb|blah|]] or [[blaah|moo|wah]] floop | moop
|}
(reminder: cell definitions with style attributes look like this: | style | contents ||...)
Buggered if I know. I might have to impose a rule involving the range of possible characters that could appear in the style attribute. I didn't really want to have to actually parse that bit properly...
That's exactly what I did in wiki2xml, and it works (yesss, still ahead;-)
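A rough Python sketch of what such a character-range rule could look like. This is not wiki2xml's actual code: the function name `split_cell`, the allowed character set, and the requirement that an attribute part contain an "=" are all assumptions chosen for the example.

```python
import re

# Assumption: a style/attribute part only holds name="value" pairs, so it may
# contain word characters, whitespace and a little punctuation, but never
# brackets or braces.
ATTR_CHARS = re.compile(r"""^[\w\s="':;#%.,-]*$""")


def split_cell(cell):
    """Split 'style | contents' into (style, contents).

    Returns (None, contents) when the text before the first pipe does not
    look like an attribute list.
    """
    head, sep, tail = cell.partition("|")
    if sep and "=" in head and ATTR_CHARS.match(head):
        return head.strip(), tail.strip()
    return None, cell.strip()
```

Under these assumptions the cell from the table above has no style part, because the text before its first pipe contains "[[", while something like `align="right" | moop` splits into an attribute and the contents.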
Of course, I cheap out in another regard there: wiki2xml parses images and links alike, and parses even links with "too many" parameters. My reasons for that:
* Laziness
* No need to know the language/wiki settings (which make "Image:" special for en)
* Flexible for "add-ons" (who knows, we might use three-part links someday...)
* Not much additional burden for the next level (XML-to-something)
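To make the "parse images and links alike" approach concrete, here is a small Python sketch. It is not taken from wiki2xml: `parse_link` and its return shape are hypothetical, and it deliberately ignores nested links. It accepts any number of pipe-separated parameters and leaves it to a later stage to decide what the target's prefix means.

```python
def parse_link(text):
    """Parse '[[target|p1|p2|...]]' without any namespace knowledge."""
    if not (text.startswith("[[") and text.endswith("]]")):
        raise ValueError("not a double-bracketed link")
    # The first field is the target; everything else is kept as parameters.
    target, *params = text[2:-2].split("|")
    return {"target": target, "params": params}
```

Both `[[image:foo.jpg|thumb|blah|]]` and `[[blaah|moo|wah]]` go through the same code path; whether "image:" is special is left to whatever consumes the result.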
Cheers, Magnus