Lee Daniel Crocker <lee(a)piclab.com> said:
> (David A. Wheeler <dwheeler(a)dwheeler.com> said:)
> > 1. Perhaps for simple reads of the current article (cur), you
> > could completely skip using MySQL and use the filesystem instead.
> In other words, caching.
Sorry, I wasn't clear.
I wasn't thinking of caching - I was thinking of accessing the
filesystem INSTEAD of MySQL when getting the current wikitext.
Why? Well, I suspect that accessing the filesystem directly
is much faster than accessing the data via MySQL - if most
accesses are simple reads, then you can access it without
user-level locks, etc., etc. Even more importantly,
checking for existence is a simple filesystem check - which
is likely to be much faster than the MySQL request.
Would it be faster? I don't know; the only _real_ way to find out
is to benchmark it.
Of course, if wikipedia is near the breaking point for performance,
another approach would be to change the design so that reading
only requires one lookup (for the data itself).
You noted the two big problems, and I agree that they're the
sticking points.
You could abandon many user settings, except ones that the user
can supply themselves to select between different stylesheets, and
abandon displaying links differently depending on whether or not
their targets exist. Less desirable, but you've already abandoned
supporting search! Then you can cache the generated HTML as well.
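Once the output no longer varies per user, one cached copy of the rendered page can serve everyone. A rough sketch, assuming a hypothetical `render` function and cache directory (and ignoring invalidation on edit, which a real version would need):

```python
import os

CACHE_DIR = "/var/wiki/html-cache"  # hypothetical location

def cached_html(title, render, cache_dir=CACHE_DIR):
    # 'render' is whatever function turns wikitext into HTML; once
    # output no longer varies per user, one cached copy serves everyone.
    path = os.path.join(cache_dir, title.replace(" ", "_") + ".html")
    if os.path.exists(path):
        with open(path, encoding="utf-8") as f:
            return f.read()          # cache hit: no DB, no rendering
    html = render(title)             # cache miss: render once...
    with open(path, "w", encoding="utf-8") as f:
        f.write(html)                # ...and store for later readers
    return html
```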
If it's a choice between having a working wikipedia, and
having the bells & whistles, I think working is the better plan.
You can always include them as settable options, to be returned once
the system doesn't have performance problems.
Although databases are more flexible for storing structured
data, for simple unstructured data a plain filesystem-based
approach might be more suitable. It also lets you use other
existing tools (like the many tools that build indexes over
files for later rapid searching).
A quick start might be to temporarily disable all checking
of links, and see if that helps much.
> > [Rendering] could also be sped up, e.g., by rewriting it in flex.
> > My "html2wikipedia" is written in flex - it's really fast and
> > didn't take long to write. The real problem is, I suspect that
> > isn't the bottleneck.
> It isn't. And there's no reason to expect flex to be any faster
> than any other language.
Actually, for some lexing applications flex can be MUCH faster.
That's because it can pre-compile a large set of patterns
into C, and compile the result. Its "-C" option can, for
some applications, result in blazingly fast operations.
You CAN do the same thing by hand, but it takes a long time to
hand-optimize that kind of code.
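For comparison (this is not flex itself, and these are not MediaWiki's actual token patterns), the same "compile all the patterns once, then scan the input in a single pass" idea looks roughly like this in Python; flex goes much further by generating a C DFA at build time:

```python
import re

# Compile one combined alternation up front, rather than re-matching
# each pattern separately per character position. Illustrative patterns
# only - not MediaWiki's real wikitext grammar.
WIKI_TOKENS = re.compile(
    r"(?P<link>\[\[[^\]]+\]\])"      # [[internal link]]
    r"|(?P<bold>'''[^']+''')"        # '''bold'''
    r"|(?P<text>[^\[']+)"            # plain text run
)

def tokenize(wikitext):
    # One left-to-right pass; each match reports which pattern fired.
    return [(m.lastgroup, m.group()) for m in WIKI_TOKENS.finditer(wikitext)]
```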
However, there's no point in rewriting what is not the bottleneck.
Which is why I was hoping to hear whether someone has done measurements
to identify the real bottlenecks, e.g., "50% of the system
time is spent doing X". If most time is spent rendering
articles for display (without editing), then it's worth examining
what's taking the time. If the time is spent on checking if
links exist, then clearly that's worth examining.
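Even a crude measurement would do. For instance, a sketch that tallies wall-clock time per phase of a (hypothetical) request handler, to get exactly those "X% of the time is spent doing Y" numbers:

```python
import time
from collections import defaultdict

# Accumulated wall-clock seconds per named phase.
timings = defaultdict(float)

class timed:
    # Context manager: adds the elapsed time of its block to timings.
    def __init__(self, phase):
        self.phase = phase
    def __enter__(self):
        self.start = time.perf_counter()
    def __exit__(self, *exc):
        timings[self.phase] += time.perf_counter() - self.start

# Hypothetical request phases (sleeps stand in for real work):
with timed("link-check"):
    time.sleep(0.001)   # stand-in for checking which links exist
with timed("render"):
    time.sleep(0.001)   # stand-in for wikitext -> HTML
```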
Oh, one note - if you want to simply store whether or not a
given article entry exists, and quickly check it, one
fancy way of doing this is by using a Bloom filter.
You can hash the article title, and a compact bit-vector
structure can then store its existence or non-existence
(at the cost of a small false-positive rate).
More info, and MIT-licensed code, for a completely different
application are at:
http://www.ir.bbn.com/projects/SPIE
(there, they hash packets so that later queries can ask
"did you see this packet?"). Given the relatively small
total size of the article titles, it's not clear you need this
(you can store all the titles in memory), but I just thought
I'd mention it.
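For illustration only (this is not the SPIE code), a minimal Bloom filter over article titles might look like the following; the sizes m and k are arbitrary here:

```python
import hashlib

class BloomFilter:
    # Minimal sketch: m bits, k index values carved from one digest.
    def __init__(self, m=8 * 1024 * 1024, k=4):
        self.m, self.k = m, k
        self.bits = bytearray(m // 8)

    def _positions(self, title):
        digest = hashlib.sha256(title.encode("utf-8")).digest()
        for i in range(self.k):
            # Take k 4-byte chunks of the digest as independent-ish hashes.
            chunk = digest[4 * i:4 * i + 4]
            yield int.from_bytes(chunk, "big") % self.m

    def add(self, title):
        for pos in self._positions(title):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, title):
        # May report false positives, but never false negatives.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(title))
```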
Anyway, thanks for listening. My hope is that the Wikipedia
doesn't become a victim of its own success :-).