The current "parser" is, as David Gerard said, not much of a parser by any conventional definition. It's more of a macro-expander (for parser tags and templates) and a series of mostly-regular-expression-based replacement routines, which result in partially valid HTML which is then repaired in most cases to be valid HTML.
This past spring I wrote a parser which tokenizes and parses wikitext into a node-tree. It understands template nesting, and it uses a masking technique (roughly sketched below) to completely ignore HTML comments and parser tags.
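To give a rough idea of the masking step - this is a simplified sketch, not my actual code, and the function names, tag list, and placeholder format are invented for the example - it boils down to swapping the ignored regions out for opaque placeholders before tokenizing, and swapping them back in afterwards:

// Simplified sketch: hide HTML comments and parser-tag bodies behind
// placeholders so the tokenizer never sees their contents.
function maskWikitext( $text, &$masked ) {
    $masked = array();
    $patterns = array(
        '/<!--.*?-->/s',                           // HTML comments
        '/<(nowiki|pre|math)\b[^>]*>.*?<\/\1>/is', // a few parser tags
    );
    foreach ( $patterns as $pattern ) {
        $text = preg_replace_callback( $pattern, function ( $m ) use ( &$masked ) {
            $key = "\x07mask:" . count( $masked ) . "\x07";
            $masked[$key] = $m[0];
            return $key;
        }, $text );
    }
    return $text;
}

// After the node-tree is built, the placeholders are swapped back.
function unmaskWikitext( $text, $masked ) {
    return strtr( $text, $masked );
}

The only real requirement is that the placeholder can never occur in ordinary wikitext, which is why a control character works nicely.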
/start of long-winded explanation/
The key to parsing wikitext is to work from a mental model of what's going on, rather than getting stuck on the source code of the "parser" or too worked up about BNF and its variants. Wikitext is based on blocks - a block is one or more consecutive lines which share a rendering intent, such as a paragraph, list, table, or heading. Some blocks (one or more lines) should be merged with neighboring blocks of the same type, such as list items, while some mixed lines (single lines containing more than one logical block) should be broken apart, such as raw text typed on the same line immediately after a table closing.
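Concretely, the line-level pass looks something like the following. This is only an illustration - the block types and regexes here are simplified stand-ins, not the rules my parser actually uses:

// Illustration only: classify each line, then merge consecutive lines
// that share a rendering intent into a single block.
function classifyLine( $line ) {
    if ( preg_match( '/^={1,6}.*={1,6}\s*$/', $line ) ) {
        return 'heading';
    } elseif ( preg_match( '/^[*#:;]/', $line ) ) {
        return 'list';
    } elseif ( preg_match( '/^(\{\||\|)/', $line ) ) {
        return 'table';
    } elseif ( trim( $line ) === '' ) {
        return 'blank';
    }
    return 'paragraph';
}

function groupBlocks( array $lines ) {
    $blocks = array();
    foreach ( $lines as $line ) {
        $type = classifyLine( $line );
        $last = count( $blocks ) - 1;
        // Lists, tables and paragraphs merge with a preceding block of the same type
        if ( $last >= 0 && $blocks[$last]['type'] === $type
            && in_array( $type, array( 'list', 'table', 'paragraph' ) ) ) {
            $blocks[$last]['lines'][] = $line;
        } else {
            $blocks[] = array( 'type' => $type, 'lines' => array( $line ) );
        }
    }
    return $blocks;
}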
The parser I wrote expresses these rules, and all other syntax, in a simple meta-language written as PHP arrays. I've been running real Wikipedia articles through it for a while with excellent results. I don't have a template expander or an HTML renderer yet, so right now the output is just syntax-highlighted wikitext visually broken into logical blocks, or raw JSON/XML dumps of the node-tree.
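To make "meta-language expressed in PHP arrays" less abstract, here's the flavor of it - the keys and rule names below are invented for this example and don't match my actual schema:

// Invented example of a rule table; the real schema differs.
$blockRules = array(
    'heading' => array(
        'match' => '/^(={1,6})\s*(.*?)\s*\1\s*$/',
        'merge' => false, // each heading line is its own block
    ),
    'list' => array(
        'match' => '/^([*#:;]+)\s*(.*)$/',
        'merge' => true,  // consecutive list lines collapse into one block
    ),
    'paragraph' => array(
        'match' => '/\S/',
        'merge' => true,
    ),
);

The nice thing about driving the parser from a table like this is that the grammar is data: it can be documented, tested, and tweaked without touching the tokenizer.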
The reason I set about writing this parser was to solve a problem on the front end: there's no way to know where any given portion of a page came from, and the current parser doesn't follow any rules of encapsulation. A given bit of output could have been text written directly in the article, the result of expanding one or more templates, or the product of processing a parser tag. By parsing the wikitext into a node-tree, it can be rendered in an encapsulated way, and IDs and classes can be added to the output to explain where each bit of text came from.
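As an illustration of what that provenance could look like in the output - the node structure, class names, and ID scheme here are hypothetical, just to show the shape of it:

// Hypothetical renderer: each node carries an 'origin' (article,
// template, parser tag), which surfaces in the HTML as a class.
function renderNode( array $node ) {
    $attribs = ' class="wt-' . htmlspecialchars( $node['origin'] ) . '"';
    if ( isset( $node['template'] ) ) {
        // a real renderer would need to make these unique
        $attribs .= ' id="' . htmlspecialchars( 'tpl-' . $node['template'] ) . '"';
    }
    $html = '';
    foreach ( $node['children'] as $child ) {
        $html .= is_array( $child ) ? renderNode( $child ) : htmlspecialchars( $child );
    }
    return '<div' . $attribs . '>' . $html . '</div>';
}

With something like that in place, the front end can answer "which template did this come from?" by walking the DOM instead of guessing.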
By encapsulation I specifically mean that the result of any generated content, such as template expansion or parser hooks, should be complete, validated HTML - opening all tags it closes and closing all tags it opens. This is different from the way templates and parser hooks currently work, and would require adjusting some templates, but such template reform is feasible, and that use of templates is arguably evil anyway.
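Checking that a template or hook honors that contract is cheap once its output is treated as a unit - something along these lines (a naive sketch; a real check would use a proper HTML parser and know about optional end tags):

// Naive balance check, for illustration only.
function isEncapsulated( $html ) {
    $void = array( 'br', 'hr', 'img', 'input', 'meta', 'link' );
    $stack = array();
    preg_match_all( '/<(\/?)([a-zA-Z0-9]+)[^>]*?(\/?)>/', $html, $tags, PREG_SET_ORDER );
    foreach ( $tags as $tag ) {
        $name = strtolower( $tag[2] );
        if ( in_array( $name, $void ) || $tag[3] === '/' ) {
            continue; // void or self-closing: nothing to balance
        }
        if ( $tag[1] === '/' ) {
            if ( array_pop( $stack ) !== $name ) {
                return false; // closed a tag it never opened
            }
        } else {
            $stack[] = $name;
        }
    }
    return count( $stack ) === 0; // every opened tag was closed
}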
I showed a demo of this parser working in Berlin this year, and got more done on it while stuck there thanks to the volcano of death, but since I've been back at work I haven't had much time to complete it. I intend to get this code into our SVN soon as part of the flying-cars version of MediaWiki I've been hacking away at on my laptop.
Just wanted to throw this all in here; hopefully it will be useful. I'm glad to share more about what I learned embarking on this endeavor, and to share my code as well - I might commit it within a week or two.
/end of long-winded explanation/
In short, the current "parser" is a bad example of how to write a parser, but it does work. I have found that studying how it works is far less useful than observing what it does in practice and reverse engineering it with more scalable and flexible parsing techniques in mind.
- Trevor
On 8/4/10 3:58 PM, David Gerard wrote:
On 4 August 2010 20:45, lmhelplmbox@wanadoo.fr wrote:
I am wondering if there exists a "grammar" for the "Wikicode"/"Wikitext" language (or an *exhaustive* (and formal) set of rules about how is constructed a "Wikitext"). I've looked for such a grammar/set of rules on the Web but I couldn't find one...
There isn't one. The "parser" is not actually a parser - it takes wikitext in, does things to it and spits HTML out. Much of its expected behaviour is actually emergent properties of the vagaries of PHP.
Many have tried to write a description of wikitext that isn't the code itself; all have failed so far ...
- Is a grammar available somewhere?
- Do you have any idea how to extract the first paragraph of a Wiki article?
- Any advice?
- Does a Java "Wikitext" "parser" exists which would do it?
If anyone ever does come up with an algorithm that accurately