On Sat, Nov 10, 2007 at 11:26:42PM +0100, Merlijn van Deen wrote:
> Most importantly, I think we should stop storing wikitext. Storing wikitext makes it hard to make changes in the syntax, because it would break pretty much every existing page. Wikitext is an ambiguous way of storing 'the way it is meant'; XML is a clear way of doing this. As the text is compressed, using wikitext or XML does not make that big of a difference.

We went over this one about 6 months ago; check the archives.

> However, XML makes parsing much easier. Yes, it will need two steps, but when regenerating the page from the database, it's much easier (no ugly regexps, just a simple SAX parser). Besides, as a pywikipedia developer, I'd like to have XML output ;)

Sure, but we *still* need to regularize the parser before we can do that.
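
For the archives, here is roughly what the "simple SAX parser" step could look like. This is only a sketch under assumptions: the element names (page, h, p, b, i, link) and the target attribute are made up for illustration and are not an actual MediaWiki storage format, and WikitextWriter / xml_to_wikitext are hypothetical names.

```python
import xml.sax
from xml.sax.handler import ContentHandler


class WikitextWriter(ContentHandler):
    """Turn a made-up XML page format back into MediaWiki-style wikitext."""

    # Hypothetical inline elements and the markup that wraps them.
    INLINE = {"b": "'''", "i": "''"}

    def __init__(self):
        super().__init__()
        self.out = []

    def startElement(self, name, attrs):
        if name in self.INLINE:
            self.out.append(self.INLINE[name])
        elif name == "link":
            # Assumed attribute: target = title of the linked page.
            self.out.append("[[" + attrs.get("target", "") + "|")
        elif name == "h":
            self.out.append("== ")

    def endElement(self, name):
        if name in self.INLINE:
            self.out.append(self.INLINE[name])
        elif name == "link":
            self.out.append("]]")
        elif name == "h":
            self.out.append(" ==\n")
        elif name == "p":
            self.out.append("\n\n")

    def characters(self, content):
        # SAX may deliver text in several chunks; collecting them is enough.
        self.out.append(content)


def xml_to_wikitext(xml_text):
    handler = WikitextWriter()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return "".join(handler.out).strip()
```

A different handler class could emit wikicreole from the same stored XML, which is the per-user part of the proposal. None of this removes the need to get the wikitext-to-XML direction right first.
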
> To summarize: We should switch to storing a much more descriptive format so changes in the wikitext format do not break anything: the wikitext can just be generated from the XML, in whichever format you want. This means it should be able to use (cleaned up) MediaWiki wikitext, wikicreole or many other systems - per user. (Although as far as I can see wikicreole isn't available as a context-free grammar either...)

I should note that it seems likely to become harder to calculate diffs if we store the parse tree instead of the wikitext... but on this point I'll be willing to admit I might be entirely off base.
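
To make the diff worry concrete, one naive approach would be to flatten each stored tree into lines and run an ordinary line diff over them. This is a sketch only, assuming a hypothetical XML storage format; flatten and tree_diff are names I made up, and a real tree diff would have to cope with moved or re-nested nodes, which this does not.

```python
import difflib
import xml.etree.ElementTree as ET


def flatten(xml_text):
    """Return one line per text run: the element's path plus the text."""
    root = ET.fromstring(xml_text)
    lines = []

    def walk(elem, path):
        here = path + "/" + elem.tag
        lines.append(here + ": " + (elem.text or "").strip())
        for child in elem:
            walk(child, here)
            # Text that follows a child still belongs to this element.
            tail = (child.tail or "").strip()
            if tail:
                lines.append(here + ": " + tail)

    walk(root, "")
    return lines


def tree_diff(old_xml, new_xml):
    """Plain unified diff over the flattened trees of two revisions."""
    return list(difflib.unified_diff(
        flatten(old_xml), flatten(new_xml), "old", "new", lineterm=""))
```

The output is readable for small edits, but a change that only rewraps a paragraph in another element shows up as every line under it changing, which is the kind of thing I suspect makes this harder than diffing wikitext.
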
Cheers,
-- jra