Lee Daniel Crocker lee@piclab.com said:
(David A. Wheeler dwheeler@dwheeler.com) said:
- Perhaps for simple reads of the current article (cur), you
could completely skip using MySQL and use the filesystem instead.
In other words, caching.
Sorry, I wasn't clear. I wasn't thinking of caching - I was thinking of accessing the filesystem INSTEAD of MySQL when getting the current wikitext.
Why? Well, I suspect that accessing the filesystem directly is much faster than going through MySQL - if most accesses are simple reads, you can serve them without user-level locks or any of the database's other overhead. Even more importantly, checking for existence becomes a simple filesystem stat, which is likely to be much faster than a MySQL query.
Would it be faster? I don't know; the only _real_ way to find out is to benchmark it.
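To sketch what I mean in PHP (the directory layout and the getFromDatabase() fallback are made up for illustration, not the real code):

    function articlePath($title) {
        // Hash the title and fan out into subdirectories, so no
        // single directory grows too large for the filesystem.
        $hash = md5($title);
        return "/var/wikipedia/cur/" . substr($hash, 0, 2) . "/$hash.txt";
    }

    function articleExists($title) {
        // One stat() on a local disk; no SQL, no locks.
        return file_exists(articlePath($title));
    }

    function getCurrentWikitext($title) {
        $path = articlePath($title);
        if (is_readable($path)) {
            return file_get_contents($path);
        }
        return getFromDatabase($title);  // made-up MySQL fallback
    }

Edits would still go through MySQL; the save would just write the file as well, so the two stay in sync.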
Of course, if wikipedia is near the breaking point for performance, another approach would be to change the design so that reading requires only one lookup (for the data itself). You noted the two big problems, and I agree that they're the sticking points. You could abandon most per-user settings, except ones the user can supply themselves to select between different stylesheets, and abandon displaying links differently depending on whether or not the target article exists. Less desirable, but you've already abandoned supporting search! Then you can cache the generated HTML as well.
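For instance (the paths and renderArticle() are made up for illustration):

    function cachedHtmlPath($title) {
        return "/var/wikipedia/html/" . md5($title) . ".html";
    }

    function serveArticle($title) {
        $cache = cachedHtmlPath($title);
        if (is_readable($cache)) {
            readfile($cache);           // cheap path: no parsing, no SQL
            return;
        }
        $html = renderArticle($title);  // made-up full wikitext-to-HTML render
        $fp = fopen($cache, "w");       // save it for the next reader
        if ($fp) {
            fwrite($fp, $html);
            fclose($fp);
        }
        print $html;
    }

Saving an article would just unlink() its cached file, so the next view re-renders it.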
If it's a choice between having a working wikipedia, and having the bells & whistles, I think working is the better plan. You can always keep the frills as settable options, to be re-enabled once the system no longer has performance problems.
Although databases are more flexible for storing structured data, for simple unstructured data a filesystem-based approach may be more suitable. It also lets you use existing tools, like the many indexing tools that build search indexes over plain files.
A quick start might be to temporarily disable all checking of links, and see if that helps much.
[Rendering] could also be sped up, e.g., by rewriting it in flex. My "html2wikipedia" is written in flex - it's really fast and didn't take long to write. The real problem is, I suspect that isn't the bottleneck.
It isn't. And there's no reason to expect flex to be any faster than any other language.
Actually, for some lexing applications flex can be MUCH faster. That's because it can pre-compile a large set of patterns into C, and compile the result. Its "-C" option can, for some applications, result in blazingly fast operations. You CAN do the same thing by hand, but it takes a long time to hand-optimize that kind of code.
However, there's no point in rewriting what is not the bottleneck, which is why I was hoping to hear whether someone has done measurements to identify the real bottlenecks, e.g., "50% of the system time is spent doing X". If most time is spent rendering articles for display (without editing), then it's worth examining what's taking the time. If the time is spent checking whether links exist, then clearly that's worth examining.
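Even crude timing would settle it. Something like this sketch would do, where wfRender() and wfCheckLinks() are stand-in names for whatever the real functions are:

    function timed($label, $func, $arg) {
        // Wall-clock a call and append the result to the error log.
        list($usec, $sec) = explode(" ", microtime());
        $t0 = (float)$sec + (float)$usec;
        $result = $func($arg);          // PHP variable-function call
        list($usec, $sec) = explode(" ", microtime());
        $t1 = (float)$sec + (float)$usec;
        error_log(sprintf("%s: %.1f ms", $label, ($t1 - $t0) * 1000));
        return $result;
    }

    $html = timed("render", "wfRender", $wikitext);
    $ok   = timed("link check", "wfCheckLinks", $title);

A day of logs like that would tell you exactly where the time goes.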
Oh, one note - if you just want to store whether or not a given article exists, and check it quickly, one fancy way to do it is a Bloom filter. You hash the article title and record its existence or non-existence in a compact bit array. More info, and MIT-licensed code for a completely different application, is at: http://www.ir.bbn.com/projects/SPIE (there, they hash packets so that later queries can ask "did you see this packet?"). Given the relatively small total size of the titles, it's not clear you need this (you could just keep all the titles in memory), but I thought I'd mention it.
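Here's a toy sketch of the idea in PHP, just to make it concrete - the sizes and the md5 slicing are arbitrary; real parameters depend on the number of titles and the false-positive rate you can live with:

    $m = 1 << 20;                       // 1M bits = 128 KB bit array
    $k = 4;                             // bits set per title
    $filter = str_repeat("\0", $m >> 3);

    function bloomBits($title, $m, $k) {
        // Slice the md5 of the title into $k independent bit positions.
        $hash = md5($title);
        $bits = array();
        for ($i = 0; $i < $k; $i++) {
            $bits[] = hexdec(substr($hash, $i * 7, 7)) % $m;
        }
        return $bits;
    }

    function bloomAdd(&$filter, $title, $m, $k) {
        foreach (bloomBits($title, $m, $k) as $b) {
            $filter[$b >> 3] = chr(ord($filter[$b >> 3]) | (1 << ($b & 7)));
        }
    }

    function bloomMayExist($filter, $title, $m, $k) {
        foreach (bloomBits($title, $m, $k) as $b) {
            if (!(ord($filter[$b >> 3]) & (1 << ($b & 7)))) {
                return false;           // definitely not there
            }
        }
        return true;                    // probably there (false positives possible)
    }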
Anyway, thanks for listening. My hope is that the Wikipedia doesn't become a victim of its own success :-).
David A. Wheeler wrote:
If it's a choice between having a working wikipedia, and having the bells & whistles, I think working is the better plan.
I agree completely.
We've been advised by... I'm sorry, but I forgot who it was, but he's the author of a well-known book on this sort of thing... that separating webserving and database should be a huge win. If that's right, then we should be good to go after the new server is installed this weekend, and after some time spent getting it into service.
In general, I think that it is absolutely true that responsiveness is more important than frills. I have never thought of the feature of links appearing differently depending on whether or not the article exists as a frill, but I suppose it is. We could conceivably abandon that and any other feature that requires "on the fly" anything, and make the site very fast.
But it's probably better, for some features, to throw hardware at it.
--Jimbo
On Thu, 1 May 2003, Jimmy Wales wrote:
We've been advised by... I'm sorry, but I forgot who it was, but he's the author of a well-known book on this sort of thing... that separating webserving and database should be a huge win. If that's right, then we should be good to go after the new server is installed this weekend, and after some time spent getting it into service.
Well, to be fair, one of the very first questions I asked upon joining wikien-l was whether the database and webserver were on one host or two, with exactly that reasoning in mind. But I don't think I've written any well-known books... unless the Wikipedia itself counts? ;)
(David A. Wheeler david_a_wheeler@yahoo.com):
- Perhaps for simple reads of the current article (cur), you
could completely skip using MySQL and use the filesystem instead.
In other words, caching.
Sorry, I wasn't clear. I wasn't thinking of caching - I was thinking of accessing the filesystem INSTEAD of MySQL when getting the current wikitext.
No, you were clear. I am using "caching" in the plain English sense of the word. Using the file system as a cache in front of the database is just one possible implementation of the idea.
It isn't. And there's no reason to expect flex to be any faster than any other language.
Actually, for some lexing applications flex can be MUCH faster. That's because it can pre-compile a large set of patterns into C, and compile the result. Its "-C" option can, for some applications, result in blazingly fast operations.
I suppose that's true. I do want to formalize the wikitext grammar at some point, and using something like Lex/Yacc code compiled and linked into PHP as a module is certainly a possibility.
Oh, one note - if you just want to store whether or not a given article exists, and check it quickly, one fancy way to do it is a Bloom filter. You hash the article title and record its existence or non-existence in a compact bit array.
Yes, that's a very good idea. I just recompiled the PHP on the server to have the shared memory extensions, so putting a Bloom filter into that memory is probably better than a more typical hash table.
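Something along these lines, perhaps - the key and sizes are made up, and this assumes the shmop extension:

    $m = 1 << 20;                        // bits in the filter
    $shm = shmop_open(0xb100f, "c", 0644, $m >> 3);

    function shmSetBit($shm, $b) {
        // Read-modify-write one byte of the shared bit array.
        $byte = shmop_read($shm, $b >> 3, 1);
        shmop_write($shm, chr(ord($byte) | (1 << ($b & 7))), $b >> 3);
    }

    function shmTestBit($shm, $b) {
        return (ord(shmop_read($shm, $b >> 3, 1)) & (1 << ($b & 7))) != 0;
    }

Every Apache child then sees the same bits, so a link-existence check never touches the database.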