One issue I've been having has to do with high level punctuation
getting tangled up in embedded text. In wikitext, it's generally ok to
write a literal ]] - it just means two right square brackets in a row.
But of course inside an image element such as
[[image:foo.jpg|caption]], the ]] means the end of the image element,
not just raw text.
I can see, and have sort of tried, three ways to handle this:
1) Using traditional grammar approaches, backtracking and so forth,
hoping the parser is smart enough to match the right string, and "pull
back" at the right moment. Unfortunately, this seems very difficult
without knowing the compiler-compiler extremely well, and it is
probably slow to boot.
2) Using bottom up* context flags like "inside image element", so when
a "]]" is found, we know whether or not we can treat it as a
literal. Problem: you end up smearing knowledge about the image
element everywhere: why does the RIGHT_SQUARE_BRACKET literal want to
know anything about image elements?
3) Using top down restrictions on literals like "prohibit literal
double right square bracket". Similar to 2), but when a "]]" is found
it just dumbly looks at the corresponding flag to decide whether to
match it as literal.
Method 3 seems the most promising now. I was using 2), but it seemed
to become very complex all of a sudden.
I now have code that looks like this:
image_caption
@init {prohibit_literal_link_end++; prohibit_literal_pipe++;}
    : inline_text?
      -> ^(TEXT inline_text?)
    ;
    finally {prohibit_literal_link_end--; prohibit_literal_pipe--;}

...

literal_link_end
    : {prohibit_literal_link_end <= 0}?=> link_end
    ;
This seems to be relatively readable too: "An image caption is any
text, except that there can't be an unescaped literal pipe or link_end
(]]) in it." and "A literal link end is whenever you encounter a raw
link_end, unless someone has said you can't."
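For anyone who doesn't read ANTLR, here is a rough Python sketch of the
same counter idea as a hand-written scanner. It is only an illustration,
not the actual grammar: the ToyParser class, its method names, and the
simplified "[[image:NAME|CAPTION]]" syntax are all made up for this
example. The point is just that inline_text never mentions images; it
only asks whether a literal is currently prohibited.

```python
IMAGE_OPEN = "[[image:"
LINK_END = "]]"
PIPE = "|"


class ToyParser:
    """Hypothetical toy scanner showing 'prohibit literal' counters."""

    def __init__(self, text):
        self.text = text
        self.pos = 0
        # Counters rather than booleans, so nested rules can each
        # increment on entry and decrement on exit.
        self.prohibit_literal_link_end = 0
        self.prohibit_literal_pipe = 0

    def at(self, s):
        return self.text.startswith(s, self.pos)

    def expect(self, s):
        if not self.at(s):
            raise ValueError(f"expected {s!r} at offset {self.pos}")
        self.pos += len(s)

    def literal_allowed(self):
        # The literal rules "dumbly" consult their flags and nothing else.
        if self.at(LINK_END) and self.prohibit_literal_link_end > 0:
            return False
        if self.at(PIPE) and self.prohibit_literal_pipe > 0:
            return False
        return True

    def inline_text(self):
        # Consume characters until an image opens or a literal is prohibited.
        start = self.pos
        while (self.pos < len(self.text)
               and not self.at(IMAGE_OPEN)
               and self.literal_allowed()):
            self.pos += 1
        return self.text[start:self.pos]

    def image_caption(self):
        # Mirrors @init {...} and finally {...} in the grammar rule.
        self.prohibit_literal_link_end += 1
        self.prohibit_literal_pipe += 1
        try:
            return self.inline_text()
        finally:
            self.prohibit_literal_link_end -= 1
            self.prohibit_literal_pipe -= 1

    def image(self):
        self.expect(IMAGE_OPEN)
        # Assumption for this toy: the file name obeys the same
        # restrictions as the caption, so the same rule is reused.
        name = self.image_caption()
        caption = None
        if self.at(PIPE):
            self.expect(PIPE)
            caption = self.image_caption()
        self.expect(LINK_END)
        return ("image", name, caption)

    def document(self):
        nodes = []
        while self.pos < len(self.text):
            if self.at(IMAGE_OPEN):
                nodes.append(self.image())
            else:
                nodes.append(("text", self.inline_text()))
        return nodes
```

With this sketch, ToyParser("a ]] b [[image:foo.jpg|cap]] c").document()
keeps the first ]] as plain text, because no rule has prohibited it
there, and treats the second ]] as the end of the image.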
Seems to keep me a bit saner, too. Anyway, just thought I would share.
Steve
* I'm using the terms 'bottom up' and 'top down' extremely loosely here.
Steve (and others): What needs to be done for the ANTLR grammar that
can be parallelised, so that the many people desperately after
reliable independent parsing of wikitext can contribute to the effort?
Also: how to speed up ANTLR-generated PHP, so this has half a chance
of being implemented?
- d.
Forwarding, just in case anyone is on this list that isn't on the main
mediawiki one.
---------- Forwarded message ----------
From: Dirk Riehle <dirk(a)riehle.org>
Date: 20 Jan 2008 19:25
Subject: [Mediawiki-l] Wiki Creole grammar, schema, transformations
made available
To: wiki-research-l(a)lists.wikimedia.org, mediawiki-l(a)lists.wikimedia.org
For those who were interested in a Mediawiki grammar etc, here is a
first step:
--------
For research purposes as well as the Wiki Creole community's
convenience, we are making our EBNF grammar, the XML schema definition,
and the to/from XML transformations available. You can use these
specifications to create your own parsers as well as use standard
technology (DOM, XSLT) to work with wiki pages and display or save them.
For more, see the dedicated Wiki Creole page at
http://www.riehle.org/wiki-creole as well as the WikiCreole community at
http://www.wikicreole.com
Dirk
--
Phone: + 1 (650) 215 3459
Web: http://www.riehle.org
_______________________________________________
MediaWiki-l mailing list
MediaWiki-l(a)lists.wikimedia.org
http://lists.wikimedia.org/mailman/listinfo/mediawiki-l
How are numbered lists implemented in the present grammar? Would it be
hard (in future) to put in some sort of number-from provision or tell
the parser not to insert a </ol>?
- d.
---------- Forwarded message ----------
From: Herta Van den Eynde <herta.vandeneynde(a)gmail.com>
Date: 16 Jan 2008 13:23
Subject: Re: [Mediawiki-l] numbered list broken by image or template
To: MediaWiki announcements and site admin list
<mediawiki-l(a)lists.wikimedia.org>
On 16/01/2008, Kilian <winkelklammern(a)texttheater.de> wrote:
> On Wednesday, 16.01.2008, at 13:38 +0100, Herta Van den Eynde wrote:
> > When you use a numbered list, and insert an image or a template, the
> > numbering is broken.
> > E.g.
> >
> > # one
> > # two
> > [[Image:some-image.png]]
> > # three
> >
> > will display:
> >
> > 1. one
> > 2. two
> >
> > Image:some-image.png
> >
> > 1. three
> >
> >
> > Is there a way to restart the numbering where you left off, so that the
> > third element still reads:
> >
> > 3. three
> >
> > Kind regards,
> >
> > Herta
> >
>
> Hi Herta,
>
> The problem is not the image but the line break. Here's how to mask it
> so that it does not break the item:
>
> # one
> # two<br/>[[Image:Some-image.png]]
> # three
>
> ~ Kilian
Thanks, Kilian. That does indeed solve the problem with images.
Unfortunately many (most?) of our templates contain line breaks. Any
way to work around those?
Kind regards,
Herta
--
Herta Van den Eynde
"Life on Earth may be expensive,
but it comes with a free ride around the Sun."
Compare and contrast:
1. <pre> a <nowiki> block </nowiki> </pre>
2. <pre> a <nowiki> block </pre>
3. a <nowiki> block </nowiki>
4. a <nowiki> block
Why is the <nowiki> rendered literally in 2, but stripped out in 1?
My working understanding of nowiki and pre was that each of them
alters the parsing/lexing behaviour, treating everything other than
its own closing partner literally. So <pre> <nowiki> </pre> should
render <nowiki> literally, and <nowiki> <pre> </nowiki> should render
<pre> literally. But this doesn't seem to be quite the case.
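To pin down what I mean by that working understanding, here is a small
Python sketch of the model. It does not describe what MediaWiki actually
does: the function name split_literal_spans is made up, and the handling
of an unclosed opening tag (treated as ordinary text) is purely an
assumption of the sketch.

```python
LITERAL_TAGS = ("nowiki", "pre")


def split_literal_spans(text):
    """Split text into (kind, content) segments under the simple model:
    after <nowiki> or <pre>, everything up to that tag's own closing
    partner is literal, including the other tag."""
    segments = []
    pos = 0
    while pos < len(text):
        # Find the earliest opening tag at or after pos.
        candidates = [(text.find(f"<{t}>", pos), t) for t in LITERAL_TAGS]
        candidates = [(i, t) for i, t in candidates if i != -1]
        if not candidates:
            segments.append(("wikitext", text[pos:]))
            break
        start, tag = min(candidates)
        if start > pos:
            segments.append(("wikitext", text[pos:start]))
        body_start = start + len(tag) + 2  # skip "<tag>"
        end = text.find(f"</{tag}>", body_start)
        if end == -1:
            # Assumption: an unclosed opening tag is ordinary text.
            segments.append(("wikitext", text[start:body_start]))
            pos = body_start
        else:
            segments.append((tag, text[body_start:end]))
            pos = end + len(tag) + 3  # skip "</tag>"
    return segments
```

Under this model, example 1 would come out as a single pre segment with
both <nowiki> and </nowiki> left in as literal text, which is not what
the current rendering does.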
Would anyone care to hazard a guess as to what the correct behaviour
*should* be? Does anyone rely on one treatment over the other? The
current behaviour seems inconsistent, especially comparing 2 with 4
above.
Steve