(Mark Christensen mchristensen@humantech.com):
A quick start might be to temporarily disable all checking of links, and see if that helps much.
This seems to be a helpful suggestion. Without profiling, it's hard to tell where the bottleneck is, but I think link checking is a good guess.
And this test would be fairly simple to run (now that Lee has created a functioning test suite) and could probably tell us whether link checking is the bottleneck. If so, then at least we know where to focus our optimization efforts.
If this is the problem, we are in luck, because there have been a lot of good improvement suggestions. But they all add complexity to the code (or the database setup), and "premature optimization is the root of all evil," so if link checking isn't the bottleneck it would be counterproductive to spend a lot of time trying to optimize it.
I think it's a bit premature for that. The differently-rendered missing links feature is pretty critical, not just a frill. In the short term the hardware will bail us out until we can find a solution.
Along those lines, I'd like to ask for some feedback: if I do the Bloom-filter-in-shared-memory thing, I can choose parameters to trade memory against accuracy. So here's the first question: Bloom filters have no false negatives (that is, there's no risk they'll show an existing page as non-existing), but they do have false positives. What is an acceptable false-positive rate? With a 16-bit filter, the rate will be about one in 65,000, which I think is a bit too high. With a 32-bit filter, it's one in 4 billion, which seems reasonable. With a 24-bit filter, it's one in 16 million, which might also be OK.
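To make the trade-off concrete, here is a rough sketch in Python of how those figures fall out of the standard Bloom filter formulas (false-positive rate of about 2^-k with k hash probes per title, filter size of about n*k/ln 2 bits). The 200,000-page count is the estimate below; everything else is just illustrative:

    import math

    n = 200_000  # approximate number of existing pages (figure from below)

    # With optimal sizing, a Bloom filter that uses k independent hash probes
    # per title has a false-positive rate of about 2^-k and needs roughly
    # n * k / ln(2) bits in total.
    for k in (16, 24, 32):
        m_bits = n * k / math.log(2)                # total filter size in bits
        index_bits = math.ceil(math.log2(m_bits))   # bits needed to address one probe
        print(f"k={k:2d}: 1 false positive in {2**k:,} lookups, "
              f"~{m_bits / 8 / 1e6:.2f} MB, "
              f"{k * index_bits} hash bits needed")

The k=32 row is where the 768-bit figure comes from: 32 probes times roughly 24 bits of index each.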
A 32-bit filter for 200,000 pages will fit into 2 MB of memory, which seems reasonable. It would require a 768-bit hash function. Does anybody have a recommendation for a good 768-bit hash function for titles?
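I don't have a strong recommendation myself, but one cheap way to get the needed width is to salt an ordinary digest with the probe index rather than hunting for a single 768-bit function. A sketch of that idea follows; the filter size, probe count, and use of MD5 are all just assumptions for illustration, and a plain bytearray stands in for the shared-memory segment:

    import hashlib

    M_BITS = 9_500_000   # filter size in bits (assumed: roughly n * 32 / ln 2)
    K = 32               # probes per title, for a ~1-in-4-billion false-positive rate

    def probe_positions(title):
        # Derive K bit positions (~24 bits each, 768 bits in all) for a title
        # by hashing the title salted with the probe index.
        positions = []
        for i in range(K):
            digest = hashlib.md5(f"{i}:{title}".encode("utf-8")).digest()
            positions.append(int.from_bytes(digest[:4], "big") % M_BITS)
        return positions

    bloom = bytearray(M_BITS // 8 + 1)   # stand-in for the shared-memory block

    def add(title):
        for pos in probe_positions(title):
            bloom[pos // 8] |= 1 << (pos % 8)

    def might_exist(title):
        # False means "definitely no such page"; True means "almost certainly exists".
        return all(bloom[pos // 8] & (1 << (pos % 8)) for pos in probe_positions(title))

    add("Main Page")
    print(might_exist("Main Page"), might_exist("No Such Page"))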
Alternatively, 2 MB is probably enough space for a traditional hash table of titles and article IDs, provided the average title isn't too long and there's some reasonably efficient storage method for it.
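A quick back-of-the-envelope for that option, with the average title length and per-entry overhead as pure guesses, suggests title length matters a lot:

    # Rough sizing of a plain title -> article-ID table for a few assumed
    # average title lengths (the lengths and per-entry overhead are guesses).
    n_pages = 200_000
    id_bytes = 4        # article ID
    overhead = 4        # per-entry bookkeeping: length byte, padding, etc. (assumed)

    for avg_title in (6, 12, 20):
        total = n_pages * (avg_title + id_bytes + overhead)
        print(f"avg title {avg_title:2d} bytes -> ~{total / 1e6:.1f} MB")

At those numbers the table runs past 2 MB fairly quickly, which is why the "reasonably efficient storage method" caveat matters.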