[Wikitech-l] Re: An alternate parser

14 Aug 2004


Magnus Manske wrote:
...
If we'd limit ourselves to PHP, we could avoid XML, but PHP is not very 
well suited for this kind of text parsing, even with regexps, especially 
when it comes to performance. Also, using XML as an intermediate at one 
point opens potential for exchange with other services, either for 
output (generating static dumps, statistic analyses, etc.) or input 
(e.g., using other wiki markup in MediaWiki, for other projects).
I can sympathise with this line of reasoning. The lexing and parsing 
needs to be done in C/C++ for efficiency, but then the generated parse 
tree should be processed and compiled by PHP, so we need to "transfer" 
it, and XML does that job.
...
The one I'm currently working on is "manual" C++, but I see no reason 
not to use a parser-generator. Question: Can a parser-generator ensure 
the output (no matter the input) is valid XML?
Krzysztof Kowalczyk already tried to explain this, but I'll put it in 
different words. The parser generator doesn't "ensure" the output is 
valid XML; the programmer has to do that. But it's easy. A parser 
generator generates the code to turn wikitext into a parse tree; to go 
from a parse tree to a valid-XML representation of that parse tree is 
like 1+1=2. It's trivial.
Do you know what a parse tree is? If not, let me know, and I'll try to 
explain that to you too.
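
To make the "1+1=2" claim concrete, here is a minimal sketch in Python. It 
is only an illustration, not code from any actual parser: the Node class, 
the node kinds ("article", "bold", "text") and the to_xml function are all 
hypothetical names. The point it shows is that once you have a tree, a 
standard XML library does the escaping, so the output is well-formed no 
matter what characters the text nodes contain.

```python
import xml.etree.ElementTree as ET


class Node:
    """Hypothetical parse-tree node: a kind, optional text, child nodes."""

    def __init__(self, kind, text=None, children=None):
        self.kind = kind
        self.text = text
        self.children = children or []


def to_xml(node):
    """Recursively convert a parse-tree node into an ElementTree element."""
    elem = ET.Element(node.kind)
    if node.text is not None:
        elem.text = node.text  # escaped by the library on serialization
    for child in node.children:
        elem.append(to_xml(child))
    return elem


# A small hand-built tree standing in for the parser's output.
tree = Node("article", children=[
    Node("bold", children=[Node("text", "AT&T")]),
    Node("text", " uses <marquee> & friends"),
])

xml_string = ET.tostring(to_xml(tree), encoding="unicode")
print(xml_string)
```

The serializer turns "&" and "<" in text nodes into entities, so the 
programmer never builds XML by string concatenation.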
...
Can it remove potentially harmful HTML tags?
Don't think of it as "removing". Don't think of the whole process as 
turning one text directly into another. It's not like that.
You give a parser generator a "grammar" (see [[formal grammar]]). If 
this grammar says that "<marquee>" isn't a syntax element, then it will 
be interpreted as text, and stored as such in the parse tree. Later 
(much later), when the parse tree is compiled into actual (X)HTML, the 
text would (obviously) be HTML-escaped. So it would become &lt;marquee&gt;.
Alternatively, of course, one can explicitly define <marquee> to be a 
null syntax element in the grammar, so that the final output doesn't 
contain it. But that shouldn't be necessary.
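
As a rough illustration of the first approach, consider the Python sketch 
below. Everything in it is an assumption made for the example: the 
"grammar" is just a set called KNOWN_TAGS, the parse result is a flat list 
of nodes rather than a real tree, and no tag balancing is checked. It shows 
only the one idea that a tag outside the grammar stays text and gets 
HTML-escaped in the final step.

```python
import html
import re

# Hypothetical grammar: only these tags count as syntax elements.
KNOWN_TAGS = {"b", "i"}
TAG_RE = re.compile(r"<(/?)([a-zA-Z]+)>")


def parse(source):
    """Return a flat list of ("open"|"close", tag) and ("text", str) nodes."""
    nodes, pos = [], 0
    for m in TAG_RE.finditer(source):
        tag = m.group(2).lower()
        if tag not in KNOWN_TAGS:
            continue  # not in the grammar: remains part of the text
        if m.start() > pos:
            nodes.append(("text", source[pos:m.start()]))
        nodes.append(("close" if m.group(1) else "open", tag))
        pos = m.end()
    if pos < len(source):
        nodes.append(("text", source[pos:]))
    return nodes


def compile_html(nodes):
    """Compile the nodes into HTML, escaping every text node."""
    out = []
    for kind, value in nodes:
        if kind == "text":
            out.append(html.escape(value, quote=False))
        elif kind == "open":
            out.append("<%s>" % value)
        else:
            out.append("</%s>" % value)
    return "".join(out)


result = compile_html(parse("<b>bold</b> and <marquee>text</marquee>"))
print(result)
```

Nothing is "removed" anywhere: the marquee tag was never syntax, so it is 
never emitted as markup.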
...
The chain would then be preprocessing-parsing-XML-XHTML.
Uhm... no. See:
http://mail.wikipedia.org/pipermail/wikitech-l/2004-August/012135.html
where I describe the process.
You don't want to do anything to the text before parsing it (assuming 
here that parsing includes lexing, although technically they're separate 
steps). You want to do all processing after parsing.
Why? Well, because this is the purpose of parsing. We want to turn the 
text into a data structure that computers can handle better than text. 
It is much easier and much less error-prone to say "if this object is a 
tree node representing template-inclusion, then do this" than to say 
"search the string for some fuzzy pattern that looks a bit like a 
template inclusion, but look out for nesting, and make sure you get the 
parameters right, because they might contain pipes inside piped links, 
and try not to mess things up."
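
A small Python sketch of what "if this object is a tree node representing 
template-inclusion, then do this" can look like. The node classes (Text, 
Link, Template) and the function find_templates are hypothetical and much 
simpler than a real wikitext tree would be. The example page has a piped 
link and a nested template inside template parameters, which is exactly 
the case that is awkward to handle by searching the string.

```python
from dataclasses import dataclass


@dataclass
class Text:
    value: str


@dataclass
class Link:
    target: str
    label: list  # list of nodes shown as the link text


@dataclass
class Template:
    name: str
    params: list  # each parameter is itself a list of nodes


def find_templates(nodes):
    """Yield every Template node in document order, including nested ones."""
    for node in nodes:
        if isinstance(node, Template):
            yield node
            for param in node.params:
                yield from find_templates(param)
        elif isinstance(node, Link):
            yield from find_templates(node.label)


page = [
    Text("See "),
    Template("infobox", params=[
        # A piped link inside a parameter: no pipe ambiguity in a tree.
        [Link("Paris", label=[Text("the capital")])],
        # A template nested inside another template's parameter.
        [Template("flag", params=[[Text("France")]])],
    ]),
]

names = [t.name for t in find_templates(page)]
print(names)
```

The pipe inside the link never has to be told apart from the pipe between 
parameters, because the parser already settled that when it built the tree.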
...
The XML could be cached, as all changes influenced by user options would 
happen only in the final step.
That is correct! But this XML would still be just the parse tree for the 
wiki text.
...
Caveat: Cache will have to be invalidated 
for variables and templates that change (e.g., {{NUMBEROFARTICLES}} and 
edited templates).
This is obvious. We already have to invalidate the cache for everything 
that changes.
However, currently we also need to invalidate the parser cache for pages 
that include a template we have edited. This should not be necessary. We 
should be able to retrieve the parse tree for a page independently of 
that for the included templates. To put them together at page-view time 
is not costly, because we have to compile the parse tree into HTML 
anyway (which is also not costly, but it means one sweep through the 
parse tree).
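
A sketch of that idea in Python, under loud assumptions: parse_cache is a 
plain dictionary standing in for whatever cache would really be used, the 
parse trees are flat lists of (kind, value) pairs, and the render function 
and its depth limit are invented for the example. It shows that editing a 
template replaces only the template's own cache entry, while the page's 
cached parse tree stays untouched and is combined with it at view time.

```python
import html

# Hypothetical cache: title -> parse tree, here a flat list of nodes.
parse_cache = {
    "Main": [("text", "Hello, "), ("template", "Greeting"), ("text", "!")],
    "Template:Greeting": [("text", "world")],
}

MAX_DEPTH = 10  # arbitrary guard against templates that include themselves


def render(title, depth=0):
    """Compile a cached parse tree to HTML, pulling in templates on the fly."""
    if depth > MAX_DEPTH:
        raise RecursionError("template inclusion too deep: " + title)
    out = []
    for kind, value in parse_cache[title]:
        if kind == "template":
            out.append(render("Template:" + value, depth + 1))
        else:
            out.append(html.escape(value))
    return "".join(out)


page_tree_before = parse_cache["Main"]
before = render("Main")

# The template is edited: only its own cache entry is replaced.
parse_cache["Template:Greeting"] = [("text", "wiki")]
after = render("Main")

print(before)
print(after)
```

The single sweep in render is the "compile the parse tree into HTML" step 
that has to happen on every page view anyway.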
...
Once the parser is basically working, we can [...]
run some benchmarks against the current parser. That 
should give us some more facts to base a decision on.
Personally, I'm almost inclined to say that having a proper parser is 
more important than performance. :-) But I'm confident that it will 
outperform the current parser by far.
Timwi
