I'm pleased to report that my ANTLR grammar outperforms* the current mediawiki parser on the following pathological text:
[[[[image:foo.jpg|thumb|[[[o]]][[foo||]]|[[image:bar.jpg|thumb|[[roo my doo|zoo|]]]]]]]]]
It's really amazing what you discover about Wikitext when you sit down to analyse it like this. For example, a square bracket - [ - is:
- the start of an external link, if the rest of it is present, and not in a context where external links are forbidden (notably, captions of internal links or other external links), and not inside a nowiki tag
- part of the start of an internal link, as long as the rest is present, and it couldn't be interpreted as an internal link, and in an appropriate context
- a literal otherwise - that is, in any non-linkable context, not followed by the appropriate tags to make it a link, or inside a nowiki
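To make the three cases for a square bracket concrete, here is a rough Python sketch. It is not the ANTLR grammar and not what the MediaWiki parser does: the function name `classify_bracket`, the flags `in_nowiki` and `in_caption`, and both regular expressions are hypothetical simplifications invented for illustration. In particular, the "appropriate context" test for internal links is left out.

```python
import re
from enum import Enum


class Bracket(Enum):
    EXTERNAL_LINK = "start of an external link"
    INTERNAL_LINK = "part of the start of an internal link"
    LITERAL = "literal"


# Hypothetical, heavily simplified patterns. Real wikitext has many more rules.
_INTERNAL = re.compile(r"\[\[[^\[\]|]+(\|[^\[\]]*)?\]\]")
_EXTERNAL = re.compile(r"\[(https?|ftp)://[^\s\]]+( [^\]]*)?\]")


def classify_bracket(text, pos, in_nowiki=False, in_caption=False):
    """Classify the '[' at text[pos] using the three cases listed above.

    in_nowiki and in_caption are assumed to be supplied by the caller;
    this sketch does not track context itself.
    """
    if text[pos] != "[":
        raise ValueError("no square bracket at this position")
    if in_nowiki:
        return Bracket.LITERAL
    # First bracket of a complete [[...]] construct.
    if _INTERNAL.match(text, pos):
        return Bracket.INTERNAL_LINK
    # Second bracket of a complete [[...]] construct.
    if pos > 0 and text[pos - 1] == "[" and _INTERNAL.match(text, pos - 1):
        return Bracket.INTERNAL_LINK
    # External links are forbidden inside link captions.
    if not in_caption and _EXTERNAL.match(text, pos):
        return Bracket.EXTERNAL_LINK
    return Bracket.LITERAL


if __name__ == "__main__":
    sample = "[[[o]]]"
    for i, ch in enumerate(sample):
        if ch == "[":
            print(i, classify_bracket(sample, i).value)
```

Even this toy version needs lookahead past the bracket (is "the rest" present?) plus knowledge of the surrounding context, which is why the same character cannot be classified by a simple tokenizer.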
A pipe - | - is:
- an option separator for an image, provided that it's not within an embedded object such as internal link or another image, and provided that it's not within a nowiki
- a link caption separator, provided that it's not in nowiki tags
- any of a dozen other cases that I haven't dealt with yet, like tables, templates, parser functions, categories, ...
- literal otherwise.
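The pipe cases can be sketched the same way, this time with a small stack to track which construct each pipe sits in. Again, this is only an illustration under loud assumptions: `classify_pipes` is a made-up name, brackets are assumed to nest cleanly, only `[[...]]` and `<nowiki>` are recognised, and treating only the first pipe of a plain link as the caption separator is my assumption. Tables, templates and the rest are ignored.

```python
NOWIKI_OPEN = "<nowiki>"
NOWIKI_CLOSE = "</nowiki>"


def classify_pipes(text):
    """Return a list of (index, role) pairs, one per '|' in text."""
    roles = []
    stack = []  # one [kind, pipes_seen] entry per open [[...]] construct
    in_nowiki = False
    i = 0
    while i < len(text):
        if text.startswith(NOWIKI_CLOSE, i):
            in_nowiki = False
            i += len(NOWIKI_CLOSE)
        elif text.startswith(NOWIKI_OPEN, i):
            in_nowiki = True
            i += len(NOWIKI_OPEN)
        elif in_nowiki:
            # Everything inside nowiki is literal, pipes included.
            if text[i] == "|":
                roles.append((i, "literal"))
            i += 1
        elif text.startswith("[[", i):
            is_image = text[i + 2:i + 8].lower() == "image:"
            stack.append(["image" if is_image else "link", 0])
            i += 2
        elif text.startswith("]]", i) and stack:
            stack.pop()
            i += 2
        else:
            if text[i] == "|":
                if not stack:
                    roles.append((i, "literal"))
                else:
                    # Only the innermost open construct owns this pipe.
                    kind, seen = stack[-1]
                    if kind == "image":
                        roles.append((i, "image option separator"))
                    elif seen == 0:
                        roles.append((i, "link caption separator"))
                    else:
                        # Assumption: later pipes in a plain link are literal.
                        roles.append((i, "literal"))
                    stack[-1][1] += 1
            i += 1
    return roles


if __name__ == "__main__":
    for index, role in classify_pipes("[[image:foo.jpg|thumb|see [[bar|baz]]]]"):
        print(index, role)
```

The stack is what expresses "not within an embedded object": a pipe inside a nested link belongs to that link, not to the enclosing image.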
It's fun! I think...
Steve
* The current parser gives up. ANTLR, after a monumental struggle involving 21 levels of method call and a bit of backtracking, parses it correctly.
* The current parser gives up. ANTLR, after a monumental struggle involving 21 levels of method call and a bit of backtracking, parses it correctly.
By what definition of "correct"? The only definition we have is what the current parser does, so giving up is, technically speaking, the correct behaviour.
On 11/26/07, Thomas Dalton thomas.dalton@gmail.com wrote:
By what definition of "correct"? The only definition we have is what the current parser does, so giving up is, technically speaking, the correct behaviour.
I think we finally achieved consensus that the definition of "correct" is somewhere between what the parser currently does, and what people think it ought to do as evidenced by the code they write.
You could put it this way: The definition of "correct" is whatever the parser does, unless that appears to be incorrect.
Steve
On 11/26/07, Thomas Dalton thomas.dalton@gmail.com wrote:
You could put it this way: The definition of "correct" is whatever the parser does, unless that appears to be incorrect.
"It's right, except when it's wrong." I like that. :)
Here's another definition: The correct treatment of wikitext is whatever the current parser looks like it's trying, possibly unsuccessfully, to do.
Steve
On Mon, Nov 26, 2007 at 10:51:39AM +1100, Steve Bennett wrote:
You could put it this way: The definition of "correct" is whatever the parser does, unless that appears to be incorrect.
And the award for Outstanding Achievement in Accidental Humor in a Mailing List Posting goes ... *to*
Cheers, -- jra
On Nov 25, 2007 11:29 PM, Steve Bennett stevagewp@gmail.com wrote:
I'm pleased to report that my ANTLR grammar outperforms* the current mediawiki parser on the following pathological text:
[[[[image:foo.jpg|thumb|[[[o]]][[foo||]]|[[image:bar.jpg|thumb|[[roo my doo|zoo|]]]]]]]]]
FWIW, my wiki2xml doesn't give up either, and generates XML very quickly. However, there's still a fluke in there (more than one;-) that causes "[[image:foo.jpg" to be a link target. Might be the correct behaviour, though, when you think about it...
I'll look at this more closely, eventually; nevertheless, it generates "good" XML already, which IMHO is the most important thing.
Magnus
wikitext-l@lists.wikimedia.org