Timwi wrote:
Magnus Manske wrote:
The one I'm currently working on is "manual" C++, but I see no reason not to use a parser generator. Question: Can a parser generator ensure the output (no matter the input) is valid XML?
Krzysztof Kowalczyk already tried to explain this, but I'll put it in different words. The parser generator doesn't "ensure" the output is valid XML; the programmer has to do that. But it's easy. A parser generator generates the code to turn wikitext into a parse tree; to go from a parse tree to a valid-XML representation of that parse tree is like 1+1=2. It's trivial.
Do you know what a parse tree is? If not, let me know, and I'll try to explain that to you too.
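To make the "trivial" step concrete, here is a rough sketch of that tree-to-XML walk. The Node type, escapeXml, and toXml are hypothetical names for illustration; the real parser's node classes would look different, but the serialization is the same recursive walk over whatever tree shape they form.

```cpp
#include <string>
#include <vector>

// Hypothetical parse-tree node: an element with a tag and children,
// or a text leaf (empty tag).
struct Node {
    std::string tag;             // element name; empty means a text leaf
    std::string text;            // leaf text (used only when tag is empty)
    std::vector<Node> children;  // child nodes (used only for elements)
};

// Escape the characters XML reserves, so the output is well-formed
// no matter what the original wikitext contained.
std::string escapeXml(const std::string& s) {
    std::string out;
    for (char c : s) {
        if (c == '&')      out += "&amp;";
        else if (c == '<') out += "&lt;";
        else if (c == '>') out += "&gt;";
        else               out += c;
    }
    return out;
}

// Parse tree -> XML: open tag, serialize children, close tag.
// Validity falls out of the structure: tags are opened and closed in
// matching pairs by construction, and leaf text is always escaped.
std::string toXml(const Node& n) {
    if (n.tag.empty())
        return escapeXml(n.text);
    std::string out = "<" + n.tag + ">";
    for (const Node& c : n.children)
        out += toXml(c);
    return out + "</" + n.tag + ">";
}
```

For example, a "bold" node containing the leaf text "x & y" serializes to <bold>x &amp; y</bold> — well-formed regardless of the input.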
I do know what it is. I already wrote a compiler or two during my computer science classes. The way I put the question in the original mail was confusing, and I apologize. It was meant rhetorically, as in "I don't think it can."
The chain would then be preprocessing-parsing-XML-XHTML.
Uhm... no. See: http://mail.wikipedia.org/pipermail/wikitech-l/2004-August/012135.html where I describe the process.
You don't want to do anything to the text before parsing it (assuming here that parsing includes lexing, although technically they're separate steps). You want to do all processing after parsing.
By "preprocessing", I mean replacing the {{template}} inclusions with the appropriate text stored in the database. The external parser can't do that, and it can't be done afterwards, when everything is XML already (see below). So we'll have to do it as the very first step, just as the C/C++ preprocessor does.
Why? Well, because this is the purpose of parsing. We want to turn the text into a data structure that computers can handle better than text. It is much easier and much less error-prone to say "if this object is a tree node representing template-inclusion, then do this" than to say "search the string for some fuzzy pattern that looks a bit like a template inclusion, but look out for nesting, and make sure you get the parameters right, because they might contain pipes inside piped links, and try not to mess things up."
Example: {| {{template}} | stuff |}, with {{template}} expanding to "bgcolor=#FFFFFF". Wouldn't that be filtered out if we did the template replacement when everything is already XML? Because if we let the parser loose on the template inclusion, the XML would probably look something like <table <wikitemplate>template</wikitemplate>> <tr><td>stuff</td></tr></table>, which is *not* valid XML. C/C++ also handles "#define" before the actual parser runs. It is exactly the same as our templates.
Of course, we can explicitly forbid cases like the above.
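As a rough sketch of what that first step could look like — expandTemplates is a hypothetical name, a plain in-memory map stands in for the database lookup, and template parameters are ignored for brevity:

```cpp
#include <map>
#include <string>

// Expand {{...}} inclusions in the raw text before the parser ever
// sees it, just as cpp expands #define before the C parser runs.
std::string expandTemplates(std::string text,
                            const std::map<std::string, std::string>& db) {
    // Expand innermost inclusions first so templates may nest; the
    // iteration cap guards against a template that includes itself.
    for (int guard = 0; guard < 1000; ++guard) {
        auto close = text.find("}}");
        if (close == std::string::npos) break;       // nothing left to expand
        auto open = text.rfind("{{", close);
        if (open == std::string::npos) break;        // stray "}}", leave it
        std::string name = text.substr(open + 2, close - open - 2);
        auto it = db.find(name);
        // unknown template: drop the inclusion (a real pass would warn)
        text.replace(open, close - open + 2,
                     it != db.end() ? it->second : "");
    }
    return text;
}
```

With the table example above and db mapping "template" to "bgcolor=#FFFFFF", this turns {| {{template}} | stuff |} into {| bgcolor=#FFFFFF | stuff |} before the table markup is ever parsed, so the parser only sees ordinary wikitext.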
The XML could be cached, as all changes influenced by user options would happen only in the final step.
That is correct! But this XML would still be just the parse tree for the wiki text.
Yes, the XML->(X)HTML step (compiling) would have to be done again each time. But I figure that will take way less time than parsing does now.
Personally, I'm almost inclined to say that having a proper parser is more important than performance. :-) But I'm confident that it will outperform the current parser by far.
All in due time (pun intended:-)
Magnus