I've been working, off and on, on a blogging script (in Perl), and I implemented some wikimarkup in it. Under stress tests, it became apparent that I was going about it the wrong way, as every time a page was viewed, the markup had to be parsed.
My first thought was naturally to use some sort of caching system, but that would take up an awful lot of space on my little server, and the speedup wouldn't really justify it in my case.
What I did instead was parse it when it was saved, and then if I wanted to go back and edit it, the script would simply "de-parse" it for presentation to me in the editing form. A couple lines out of my script as an example:
$body =~ s/<em><strong>(.*?)<\/strong><\/em>/'''''$1'''''/g;
$body =~ s/<a href="(.*?)">(.*?)<\/a>/[$1 $2]/g;
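For illustration only, the save-time direction might look something like the following, assuming the same two constructs as above (this is a sketch, not code from the actual script):

# Hypothetical save-time counterpart: turn the wikimarkup back into HTML.
$body =~ s{'''''(.*?)'''''}{<em><strong>$1</strong></em>}g;   # bold italic
$body =~ s{\[(\S+) ([^\]]+)\]}{<a href="$1">$2</a>}g;         # [url text] external link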
By now, I was wondering how the Wikipedia software was handling this. Turns out it's storing the wikimarkup, not the HTML, and parsing it in the viewing code.
By now it's probably obvious where I'm going with this. Could one of these methods (either storing a parsed and non-parsed version or the approach I took with "de-parsing") be used for some performance gain on Wikipedia's webserver?
Nicholas Knight wrote:
By now it's probably obvious where I'm going with this. Could one of these methods (either storing a parsed and non-parsed version or the approach I took with "de-parsing") be used for some performance gain on Wikipedia's webserver?
De-parsing strikes me as a rather odd way to do it. Furthermore, Jimbo has often remarked that disk space is not a problem. (he may come to regret that remark when we hit a million articles.... but hey! ;)
I would suggest we consider semi-parsing. Save two versions of the article: a) wikitext b) the wikitext parsed into HTML, with wikilinks still as [[link]]. Note that this would not be a fully-formed HTML document, just a fragment since it would not have a head section or enclosing tags.
Upon page read, it's b) that is inserted into the delivered page. Links are parsed live, since their status as existing / stub / ghost depends on the state of the database at that moment.
Upon page edit, a) is sent to the edit box of the edit page.
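To make the read-time step concrete, the link pass over stored version b) might look roughly like this (a sketch only; article_exists is a hypothetical database lookup, and the stub case is ignored for brevity):

# Rough sketch of expanding the [[...]] links left in the semi-parsed HTML fragment.
sub article_exists { my ($title) = @_; return 0; }   # hypothetical DB lookup, stubbed out here

sub expand_links {
    my ($html) = @_;
    $html =~ s{\[\[([^\]|]+)(?:\|([^\]]+))?\]\]}{
        my $target = $1;
        my $label  = defined $2 ? $2 : $1;    # [[Target|label]] or plain [[Target]]
        article_exists($target)
            ? qq{<a href="/wiki/$target">$label</a>}
            : qq{<a class="new" href="/wiki/$target?action=edit">$label</a>};
    }ge;
    return $html;
}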
On Wednesday 13 August 2003 09:18, tarquin wrote:
Nicholas Knight wrote:
By now it's probably obvious where I'm going with this. Could one of these methods (either storing a parsed and non-parsed version or the approach I took with "de-parsing") be used for some performance gain on Wikipedia's webserver?
De-parsing strikes me as a rather odd way to do it.
It's odd, no doubt about that; it just happens to be the best fit in my case. :)
Furthermore, Jimbo has often remarked that disk space is not a problem. (he may come to regret that remark when we hit a million articles.... but hey! ;)
In general I wouldn't expect it to be a problem. It's just a concern on my personal server, for which I don't have much in the way of funds available for upgrading. Thought I'd throw it out there anyway, as it struck me as a rather elegant solution for cases where disk space is a problem. :)
I would suggest we consider semi-parsing. Save two versions of the article: a) wikitext b) the wikitext parsed into HTML, with wikilinks still as [[link]]. Note that this would not be a fully-formed HTML document, just a fragment since it would not have a head section or enclosing tags.
upon page read, it's b) that is inserted into the delivered page. Links are parsed live, since their status as existing / stub / ghost depends on the state of the database at that moment.
Oops! Right, forgot about that since it's not applicable to my script ("all the world's a blog" syndrome? ;)). The 'semi-parsing' solution seems perfect to me.