[Mediawiki-l] Wikitext grammar
Trevor Parscal
tparscal at wikimedia.org
Fri Aug 6 17:59:07 UTC 2010
The current "parser" is, as David Gerard said, not much of a parser by
any conventional definition. It's more of a macro expander (for parser
tags and templates) plus a series of mostly regular-expression-based
replacement routines, which produce partially valid HTML that is then,
in most cases, repaired into valid HTML.
This past spring I wrote a parser which tokenizes and parses wikitext
into a node-tree. It understands template nesting and it completely
ignores HTML comments and parser tags using a masking technique.
/start of long-winded explanation/
The key to parsing wikitext is to use a mental model of what's going on,
not to get stuck on the source code of the "parser" or get too worked up
about BNF and its variants. Wikitext is based on blocks - blocks are one
or more consecutive lines which share a rendering intent, such as a
paragraph, list, table, or heading. Some blocks should be merged with
neighboring blocks of the same type, such as consecutive list items,
while some mixed lines - single lines containing more than one logical
block - should be broken apart, such as raw text typed on the same line
just after a table closing.
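To make the block model concrete, here is a minimal sketch (in Python, not the PHP of the actual parser, and with invented block names) of the two core ideas: classify each line by its leading syntax, then merge runs of consecutive same-type lines into one block:

```python
import re

def classify(line):
    """Return a block type for a single line of wikitext.

    These rules are an illustrative subset, not the parser's real rule set.
    """
    if re.match(r'=+[^=].*=+\s*$', line):
        return 'heading'
    if line.startswith(('*', '#', ';', ':')):
        return 'list'
    if line.startswith(('{|', '|', '!')):
        return 'table'
    if line.strip() == '':
        return 'blank'
    return 'paragraph'

def blocks(lines):
    """Merge consecutive same-type lines into (type, [lines]) blocks."""
    result = []
    for line in lines:
        kind = classify(line)
        if result and result[-1][0] == kind:
            result[-1][1].append(line)   # merge into the previous block
        else:
            result.append((kind, [line]))
    return result

for kind, chunk in blocks(['== Heading ==', 'Some text.', '* one', '* two']):
    print(kind, chunk)
```

Splitting mixed lines (e.g. text trailing a table close on the same line) would need an extra pass that breaks a single line into several logical blocks; it is omitted here for brevity.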
The parser I wrote explains these rules and all syntax in a simple
meta-language expressed in PHP arrays. I've been running real Wikipedia
articles through it for a while with excellent results. I do not have a
template-expander or HTML renderer yet, so right now the results are
merely syntax highlighted wikitext visually broken into logical blocks,
or raw JSON/XML dumps of the node-tree.
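The "rules expressed in arrays" idea can be sketched like this (Python dicts standing in for the PHP arrays the post describes; the rule names and fields are invented for illustration) - the syntax lives in data, so the parsing engine itself stays generic:

```python
import re

# Declarative block rules: each entry says how a line is recognized and
# whether consecutive matching lines merge into one block. Order matters:
# earlier rules win.
BLOCK_RULES = {
    'heading':   {'match': r'^=+[^=].*=+\s*$', 'merge': False},
    'list':      {'match': r'^[*#;:]',         'merge': True},
    'table':     {'match': r'^(\{\||\||!)',    'merge': True},
    'paragraph': {'match': r'^(?!\s*$)',       'merge': True},
}

def match_rule(line):
    """Find the first rule whose pattern matches the line."""
    for name, rule in BLOCK_RULES.items():
        if re.match(rule['match'], line):
            return name
    return 'blank'

print(match_rule('* item'))     # → list
print(match_rule('== H =='))    # → heading
```

Changing the grammar then means editing the data table, not the engine, which is presumably what made running many real articles through it practical.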
The reason I went about writing this parser was to solve a problem on
the front-end: there's no way to know where any given portion of a page
came from, and the current parser doesn't follow any rules of
encapsulation. A given bit of output could have been text directly
within the article, the result of expanding one or more templates, or
the product of processing a parser tag. By parsing the wikitext into a
node-tree, the page can be rendered in an encapsulated way, and IDs and
classes can be added to the output to explain where each bit of text
came from.
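A minimal sketch of that provenance tagging, assuming a node-tree already exists (the node shapes and class names here are invented, not the parser's real output format):

```python
from html import escape

def render(node):
    """Render a node-tree to HTML, wrapping generated content in elements
    that record where it came from (article body vs. template expansion)."""
    if node['type'] == 'text':
        return escape(node['value'])
    inner = ''.join(render(c) for c in node.get('children', []))
    if node['type'] == 'template':
        # Encapsulated template output: one element that opens and closes
        # around everything the template produced.
        return ('<div class="origin-template" data-template="%s">%s</div>'
                % (escape(node['name']), inner))
    return '<div class="origin-article">%s</div>' % inner

tree = {'type': 'paragraph', 'children': [
    {'type': 'text', 'value': 'Hello '},
    {'type': 'template', 'name': 'welcome', 'children': [
        {'type': 'text', 'value': 'world'}]},
]}
print(render(tree))
```

Because every template's output is a single well-formed subtree, the front-end can point at any element and answer "where did this come from?" from its classes and data attributes alone.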
By encapsulation, I specifically mean that the result of any generated
content, such as template expansion or parser-hooks, should be complete,
valid HTML, opening all tags it closes and closing all tags it opens.
This is different from the way templates and parser-hooks currently
work, and would require adjusting some templates, but such template
reform is feasible, and such use of templates is defensibly evil anyway.
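The encapsulation rule amounts to a balance check on template output. A real implementation would use a proper HTML parser; this stack-based sketch (which ignores attributes and handles only a few void elements) shows the idea:

```python
import re

VOID = {'br', 'hr', 'img', 'input', 'meta', 'link'}  # tags with no close

def is_balanced(html):
    """Return True if every tag opened is closed, in order, and vice versa."""
    stack = []
    for closing, name in re.findall(r'<(/?)([a-zA-Z][a-zA-Z0-9]*)[^>]*>', html):
        if name.lower() in VOID:
            continue
        if closing:
            if not stack or stack.pop() != name.lower():
                return False   # close with no matching open
        else:
            stack.append(name.lower())
    return not stack           # any leftover opens are also a failure

print(is_balanced('<div><b>ok</b></div>'))  # → True
print(is_balanced('</td></tr><tr><td>'))    # → False (table-row template)
```

The second example is exactly the kind of template the post calls out: it closes tags it never opened and opens tags it never closes, so it can only be rendered by leaking markup across template boundaries.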
I showed a demo of this parser working in Berlin this year, and got more
done on it while stuck in Berlin thanks to the volcano of death, but
since I've been back at work I have not had much time to complete it. I
intend to get this code up on our SVN soon as part of the flying-cars
version of MediaWiki I've been hacking away at on my laptop.
Just wanted to throw this all in here, hopefully it will be useful. I'm
glad to share more about what I learned embarking on this endeavor and
share my code as well - might commit it within a week or two.
/end of long-winded explanation/
In short, the current "parser" is a bad example of how to write a
parser, but it does work. I have found that studying how it works is far
less useful than observing what it does in practice and reverse
engineering it with more scalable and flexible parsing techniques in mind.
- Trevor
On 8/4/10 3:58 PM, David Gerard wrote:
> On 4 August 2010 20:45, lmhelp<lmbox at wanadoo.fr> wrote:
>
>> I am wondering if there exists a "grammar" for the "Wikicode"/"Wikitext"
>> language (or an *exhaustive* (and formal) set of rules about how a
>> "Wikitext" is constructed).
>> I've looked for such a grammar/set of rules on the Web but I couldn't find
>> one...
>
> There isn't one. The "parser" is not actually a parser - it takes
> wikitext in, does things to it and spits HTML out. Much of its
> expected behaviour is actually emergent properties of the vagaries of
> PHP.
>
> Many have tried to write a description of wikitext that isn't the code
> itself, all so far have failed ...
>
>
>> - Is a grammar available somewhere?
>> - Do you have any idea how to extract the first paragraph of a Wiki article?
>> - Any advice?
>> - Does a Java "Wikitext" "parser" exist which would do it?
>
> If anyone ever does come up with an algorithm that accurately
>
> _______________________________________________
> MediaWiki-l mailing list
> MediaWiki-l at lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/mediawiki-l