Lee Daniel Crocker (lee@piclab.com) said:
(David A. Wheeler dwheeler@dwheeler.com) said:
- Perhaps for simple reads of the current article (cur), you
could completely skip using MySQL and use the filesystem instead.
In other words, caching.
Sorry, I wasn't clear. I wasn't thinking of caching - I was thinking of accessing the filesystem INSTEAD of MySQL when getting the current wikitext.
Why? Well, I suspect that reading the filesystem directly is much faster than accessing the data via MySQL - if most accesses are simple reads, you can serve them without user-level locks and the rest of the per-query overhead. Even more importantly, checking whether an article exists becomes a simple filesystem check, which is likely to be much faster than the corresponding MySQL request.
Would it be faster? I don't know; the only _real_ way to find out is to benchmark it.
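To make the idea concrete, here's the kind of thing I mean - just a sketch, and the path layout, normalization, and directory names are assumptions of mine, not how the real PHP code stores anything:

    import os
    import hashlib

    # Hypothetical layout: the current wikitext of each article lives in a
    # file whose path is derived from the title.  The two-level hash
    # directory just keeps any single directory from getting huge.
    CUR_DIR = "/var/wiki/cur"

    def title_to_path(title):
        # Assume titles are normalized with underscores, as in URLs;
        # shard by the first two hex digits of an MD5 of the name.
        name = title.replace(" ", "_")
        h = hashlib.md5(name.encode("utf-8")).hexdigest()
        return os.path.join(CUR_DIR, h[:2], name)

    def article_exists(title):
        # One stat() call - no database round trip, no locks.
        return os.path.exists(title_to_path(title))

    def read_current_text(title):
        with open(title_to_path(title), encoding="utf-8") as f:
            return f.read()

Whether that actually beats a SELECT on an indexed cur table is exactly what such a benchmark would have to show.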
Of course, if Wikipedia is near the breaking point for performance, another approach would be to change the design so that reading requires only one lookup (for the data itself). You noted the two big problems, and I agree that they're the sticking points. You could abandon most per-user settings, except the ones a user can supply themselves to select between different stylesheets, and abandon displaying links differently depending on whether their targets exist. Less desirable, but you've already abandoned supporting search! Then you can cache the generated HTML as well.
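As a sketch of what caching the generated HTML might look like (the render hook, paths, and invalidate-on-edit policy here are all just assumptions for illustration):

    import os

    CACHE_DIR = "/var/wiki/html-cache"
    os.makedirs(CACHE_DIR, exist_ok=True)

    def _cache_path(title):
        return os.path.join(CACHE_DIR, title.replace(" ", "_") + ".html")

    def cached_html(title, render):
        # 'render' stands in for whatever turns wikitext into HTML.
        path = _cache_path(title)
        if os.path.exists(path):
            with open(path, encoding="utf-8") as f:
                return f.read()
        html = render(title)
        tmp = path + ".tmp"
        with open(tmp, "w", encoding="utf-8") as f:
            f.write(html)
        os.rename(tmp, path)  # atomic on POSIX; readers never see a partial file
        return html

    def invalidate(title):
        # Call this whenever the article is edited.
        try:
            os.remove(_cache_path(title))
        except FileNotFoundError:
            pass

The catch, of course, is everything that makes one user's HTML differ from another's - which is why dropping those per-user differences is the price of this approach.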
If it's a choice between having a working Wikipedia and having the bells & whistles, I think working is the better plan. You can always include them as settable options, to be restored once the system no longer has performance problems.
Although databases are more flexible for storing structured data, for simple unstructured data a filesystem-based approach might be more suitable. It also lets you use other existing tools (like the many tools that build indexes over files for later rapid searching).
A quick start might be to temporarily disable all checking of links, and see if that helps much.
[Rendering] could also be sped up, e.g., by rewriting it in flex. My "html2wikipedia" is written in flex - it's really fast and didn't take long to write. The real problem is, I suspect that isn't the bottleneck.
It isn't. And there's no reason to expect flex to be any faster than any other language.
Actually, for some lexing applications flex can be MUCH faster. That's because it can pre-compile a large set of patterns into C, and compile the result. Its "-C" option can, for some applications, result in blazingly fast operations. You CAN do the same thing by hand, but it takes a long time to hand-optimize that kind of code.
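You can see the same effect in miniature even in a scripting language - the point being that one pre-compiled matcher over the whole pattern set beats re-trying each pattern on every piece of input. (This is only an illustration of the principle; the wiki markup shown is simplified, and I'm not suggesting rewriting the renderer in Python.)

    import re

    # Simplified stand-ins for a few wiki-markup constructs.
    PATTERNS = [r"\[\[[^\]]+\]\]",   # internal link
                r"'''[^']+'''",      # bold
                r"''[^']+''"]        # italics

    # Naive approach: scan the text once per pattern.
    def tokens_one_by_one(text):
        return [m.group(0) for p in PATTERNS for m in re.finditer(p, text)]

    # Pre-compiled approach: build one alternation once, scan the text once.
    # flex does the analogous thing far more aggressively - it compiles the
    # whole pattern set into a single C state machine at build time.
    COMBINED = re.compile("|".join(PATTERNS))

    def tokens_combined(text):
        return [m.group(0) for m in COMBINED.finditer(text)]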
However, there's no point in rewriting what is not the bottleneck, which is why I was hoping to hear whether someone has done measurements to identify the real bottlenecks, e.g., "50% of the system time is spent doing X". If most time is spent rendering articles for display (without editing), then it's worth examining what's taking the time. If the time is spent checking whether links exist, then clearly that check is what's worth examining.
Oh, one note - if you want to simply record whether or not a given article exists, and quickly check it, one fancy way of doing this is a Bloom filter. You hash the article title and use a compact bit-array structure to record its existence or non-existence. More info, and MIT-licensed code for a completely different application, are at: http://www.ir.bbn.com/projects/SPIE (there, they hash packets so that later queries can ask "did you see this packet?"). Given the relatively small total size of the article titles, it's not clear you need this (you can store all the titles in memory), but I just thought I'd mention it.
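For the curious, here's a toy sketch of such a filter (sizes picked arbitrarily; real code would choose the bit-array size and number of hashes based on the expected number of titles and the false-positive rate you can live with):

    import hashlib

    class BloomFilter:
        """Compact set of article titles: answers 'definitely absent' or
        'probably present' (false positives possible, false negatives not)."""

        def __init__(self, num_bits=1 << 20, num_hashes=4):
            self.num_bits = num_bits
            self.num_hashes = num_hashes  # up to four 4-byte hashes fit in one MD5
            self.bits = bytearray(num_bits // 8)

        def _positions(self, title):
            # Derive several bit positions from a single MD5 of the title.
            digest = hashlib.md5(title.encode("utf-8")).digest()
            for i in range(self.num_hashes):
                chunk = digest[i * 4:(i + 1) * 4]
                yield int.from_bytes(chunk, "big") % self.num_bits

        def add(self, title):
            for pos in self._positions(title):
                self.bits[pos // 8] |= 1 << (pos % 8)

        def __contains__(self, title):
            return all(self.bits[pos // 8] & (1 << (pos % 8))
                       for pos in self._positions(title))

    # titles = BloomFilter()
    # titles.add("Ada Lovelace")
    # "Ada Lovelace" in titles        -> True
    # "No Such Article" in titles     -> False (almost certainly)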
Anyway, thanks for listening. My hope is that the Wikipedia doesn't become a victim of its own success :-).