[Mediawiki-l] Wikitext grammar

Trevor Parscal tparscal at wikimedia.org
Fri Aug 6 17:59:07 UTC 2010


  The current "parser" is, as David Gerard said, not much of a parser by 
any conventional definition. It's more of a macro-expander (for parser 
tags and templates) and a series of mostly-regular-expression-based 
replacement routines, which produce partially valid HTML that is then, 
in most cases, repaired into valid HTML.
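
To make that concrete, here is a toy illustration of the regex-replacement 
style of transformation. It is purely hypothetical - the function name and 
patterns are mine, not anything from the actual Parser.php - but it shows 
the general shape of the routines I'm describing:

  <?php
  // Toy sketch of the regex-replacement approach (hypothetical code,
  // not the real Parser.php): inline markup is rewritten directly
  // into HTML tags, one pattern at a time.
  function toyInlineMarkup( $text ) {
      // Handle '''bold''' before ''italic'' so the longer run of
      // apostrophes isn't swallowed by the shorter pattern.
      $text = preg_replace( "/'''(.+?)'''/", '<b>$1</b>', $text );
      $text = preg_replace( "/''(.+?)''/", '<i>$1</i>', $text );
      return $text;
  }

  echo toyInlineMarkup( "Some '''bold''' and ''italic'' text." );
  // Some <b>bold</b> and <i>italic</i> text.

Chain enough of these passes together and you get something close to what 
the current code does, which is exactly why it's so hard to describe with 
a grammar.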

This past spring I wrote a parser which tokenizes and parses wikitext 
into a node-tree. It understands template nesting and it completely 
ignores HTML comments and parser tags using a masking technique.
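
The masking itself is simple in concept. As a rough, hypothetical sketch 
(the function names and placeholder format below are invented for 
illustration, not taken from my code), the idea is to swap the regions the 
tokenizer must never look inside for unique placeholder tokens up front, 
and restore them once parsing is done:

  <?php
  // Hypothetical masking pass: HTML comments and <nowiki> extents are
  // replaced with unique placeholders so the tokenizer never sees
  // their contents; the originals are restored after parsing.
  function maskRegions( $text, &$masked ) {
      $masked = array();
      preg_match_all( '/<!--.*?-->|<nowiki>.*?<\/nowiki>/s', $text, $m );
      foreach ( $m[0] as $i => $region ) {
          $key = "\x07mask-$i\x07";  // control chars won't occur in wikitext
          $masked[$key] = $region;
          $text = str_replace( $region, $key, $text );
      }
      return $text;
  }

  function unmaskRegions( $text, $masked ) {
      // strtr() maps each placeholder back to its original region
      return strtr( $text, $masked );
  }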

/start of long-winded explanation/

The key to parsing wikitext is to use a mental model of what's going on, 
not to get stuck on the source code of the "parser" or get too worked up 
about BNF and its variants. Wikitext is based on blocks - a block is one 
or more consecutive lines that share a rendering intent, such as a 
paragraph, list, table, or heading. Some blocks should be merged with 
neighboring blocks of the same type, such as list items, while some mixed 
lines (single lines containing more than one logical block) should be 
broken apart, such as raw text typed on the same line just after a table 
closing.
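
As a sketch of what that line-grouping could look like (this is 
illustrative code, not my actual implementation - the type names and 
patterns are invented), classify each line by its leading syntax and then 
merge runs of consecutive lines that share a type:

  <?php
  // Hypothetical block detection: classify each line, then merge
  // consecutive lines of the same type into a single block.
  function classifyLine( $line ) {
      if ( preg_match( '/^=+.*=+\s*$/', $line ) ) return 'heading';
      if ( preg_match( '/^[*#:;]/', $line ) ) return 'list';
      if ( preg_match( '/^(\{\||\|\}|\|\+|\||!)/', $line ) ) return 'table';
      if ( trim( $line ) === '' ) return 'blank';
      return 'paragraph';
  }

  function groupBlocks( $lines ) {
      $blocks = array();
      foreach ( $lines as $line ) {
          $type = classifyLine( $line );
          $last = count( $blocks ) - 1;
          if ( $last >= 0 && $blocks[$last]['type'] === $type ) {
              $blocks[$last]['lines'][] = $line;  // merge with neighbor
          } else {
              $blocks[] = array( 'type' => $type, 'lines' => array( $line ) );
          }
      }
      return $blocks;
  }

The mixed-line case (a table closing and raw text on the same line) needs 
a splitting pass on top of this before the grouping happens.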

The parser I wrote encodes these rules and all syntax in a simple 
meta-language expressed in PHP arrays. I've been running real Wikipedia
articles through it for a while with excellent results. I do not have a 
template-expander or HTML renderer yet, so right now the results are 
merely syntax highlighted wikitext visually broken into logical blocks, 
or raw JSON/XML dumps of the node-tree.
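
To give a flavor of what I mean by a meta-language in PHP arrays - the 
real rule tables are more involved, and the keys and patterns here are 
invented for illustration - a block type might be described by the pattern 
that opens it and whether neighboring blocks of the same type should be 
merged:

  <?php
  // Purely illustrative rule table; the actual arrays differ.
  $blockRules = array(
      'heading' => array( 'match' => '/^(=+)(.*?)\1\s*$/', 'merge' => false ),
      'list'    => array( 'match' => '/^[*#:;]+/',         'merge' => true ),
      'table'   => array( 'match' => '/^\{\|/',            'merge' => true ),
      'pre'     => array( 'match' => '/^ /',               'merge' => true ),
  );

The point of keeping the rules declarative is that the engine walking the 
lines stays small, and the syntax can be adjusted without touching it.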

The reason I went about writing this parser was to solve a problem on 
the front-end: there's no way to know where any given portion of a page 
came from, and the current parser doesn't follow any rules of 
encapsulation. Any given bit of output could have come from text directly 
within the article, from expanding one or more templates, or from 
processing a parser tag. By parsing the wikitext into a node-tree, it can 
be rendered in an encapsulated way, and IDs and classes can be added to 
the output to explain where each bit of text came from.
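
As a hypothetical sketch of what that could look like (the node structure 
and attribute scheme here are invented, not my actual code), the renderer 
could wrap each node's output in an element that records its origin:

  <?php
  // Hypothetical provenance-aware rendering: each node carries an id
  // and an 'origin' (article text, a template, a parser tag), and the
  // renderer exposes both in the generated markup.
  function renderNode( $node ) {
      $inner = '';
      foreach ( $node['children'] as $child ) {
          $inner .= is_array( $child )
              ? renderNode( $child )
              : htmlspecialchars( $child );
      }
      return '<div class="wt-origin-' . $node['origin'] . '" id="wt-node-'
          . $node['id'] . '">' . $inner . '</div>';
  }

  $tree = array(
      'id' => 1,
      'origin' => 'template',
      'children' => array( 'Hello from a template expansion' ),
  );
  echo renderNode( $tree );
  // <div class="wt-origin-template" id="wt-node-1">Hello from a template expansion</div>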

By encapsulation, I specifically mean that the output of any generated 
content, such as template expansion or parser-hooks, should be complete, 
valid HTML, opening all tags it closes and closing all tags it opens. 
This is different from the way templates and parser-hooks currently work, 
and would require adjustments to some templates, but such template reform 
is feasible, and such use of templates is defensibly evil anyway.
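
One way to picture the rule - this is only an illustrative check, not 
anything that exists in MediaWiki - is that walking the tags of a 
template's output with a stack should never pop a tag the output didn't 
open, and should end with the stack empty:

  <?php
  // Illustrative encapsulation check: every close tag must match the
  // most recent open tag, and nothing may be left open at the end.
  // (Ignores HTML void elements like <br> for brevity.)
  function isEncapsulated( $html ) {
      $stack = array();
      preg_match_all( '/<(\/?)([a-zA-Z][a-zA-Z0-9]*)[^>]*?(\/?)>/', $html, $tags, PREG_SET_ORDER );
      foreach ( $tags as $tag ) {
          if ( $tag[3] === '/' ) continue;  // self-closing, e.g. <br/>
          if ( $tag[1] === '/' ) {
              if ( array_pop( $stack ) !== strtolower( $tag[2] ) ) return false;
          } else {
              $stack[] = strtolower( $tag[2] );
          }
      }
      return count( $stack ) === 0;
  }

  var_dump( isEncapsulated( '<div><p>ok</p></div>' ) );  // bool(true)
  var_dump( isEncapsulated( '</td><td>half a row' ) );   // bool(false)

Templates that emit half a table row, like the second example, are exactly 
the kind that would need reform under this model.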

I showed a demo of this parser working in Berlin this year, and got more 
done on it while stuck in Berlin thanks to the volcano of death, but since 
I've been back at work I have not had much time to complete it. I intend 
to get this code up on our SVN soon as part of the flying-cars version of 
MediaWiki I've been hacking away at on my laptop.

Just wanted to throw this all in here; hopefully it will be useful. I'm 
glad to share more about what I learned embarking on this endeavor, and 
to share my code as well - I might commit it within a week or two.

/end of long-winded explanation/

In short, the current "parser" is a bad example of how to write a 
parser, but it does work. I have found that studying how it works is far 
less useful than observing what it does in practice and reverse 
engineering it with more scalable and flexible parsing techniques in mind.

- Trevor

On 8/4/10 3:58 PM, David Gerard wrote:
> On 4 August 2010 20:45, lmhelp<lmbox at wanadoo.fr>  wrote:
>
>> I am wondering if there exists a "grammar" for the "Wikicode"/"Wikitext"
>> language (or an *exhaustive* and formal set of rules about how a
>> "Wikitext" is constructed).
>> I've looked for such a grammar/set of rules on the Web but I couldn't find
>> one...
>
> There isn't one. The "parser" is not actually a parser - it takes
> wikitext in, does things to it and spits HTML out. Much of its
> expected behaviour is actually emergent properties of the vagaries of
> PHP.
>
> Many have tried to write a description of wikitext that isn't the code
> itself, all so far have failed ...
>
>
>> - Is a grammar available somewhere?
>> - Do you have any idea how to extract the first paragraph of a Wiki article?
>> - Any advice?
>> - Does a Java "Wikitext" "parser" exist which would do it?
>
> If anyone ever does come up with an algorithm that accurately
>



