(Mark Christensen mchristensen@humantech.com):
A quick start might be to temporarily disable all checking of links, and see if that helps much.
This seems to be a helpful suggestion. Without profiling, it's hard to tell where the bottleneck is, but I think link checking is a good guess.
And this test would be fairly simple to run (now that Lee has created a functioning test suite) and could probably tell us whether link checking is the bottleneck. If so, then at least we know where to focus our optimization efforts.
If this is the problem, we are in luck, because there have been a lot of good improvement suggestions. But they all add complexity to the code (or the database setup), and "premature optimization is the root of all evil," so if link checking isn't the bottleneck it would be counterproductive to spend a lot of time trying to optimize it.
I think it's a bit premature for that. The differently-rendered missing links feature is pretty critical, not just a frill. In the short term the hardware will bail us out until we can find a solution.
Along those lines, I'd like to ask for some feedback: if I do the Bloom-filter-in-shared-memory thing, I can choose parameters to trade memory against accuracy. So here's the first question: Bloom filters have no false negatives (that is, there's no risk they'll show an existing page as non-existing), but they do have false positives. What is an acceptable false-positive rate? With a 16-bit filter, the rate will be about one in 65,000, which I think is a bit too high. With a 32-bit filter, it's one in 4 billion, which seems reasonable. With a 24-bit filter, it's one in 16 million, which might also be OK.
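To make the trade-off concrete, here is a rough sketch in Python of how those figures fall out of the standard Bloom filter formulas (false-positive rate of about 2^-k with k hash probes per title, filter size of about n*k/ln 2 bits). The 200,000-page count is the estimate below; everything else is just illustrative:

    import math

    n = 200_000  # approximate number of existing pages (figure from below)

    # With optimal sizing, a Bloom filter that uses k independent hash probes
    # per title has a false-positive rate of about 2^-k and needs roughly
    # n * k / ln(2) bits in total.
    for k in (16, 24, 32):
        m_bits = n * k / math.log(2)                # total filter size in bits
        index_bits = math.ceil(math.log2(m_bits))   # bits needed to address one probe
        print(f"k={k:2d}: 1 false positive in {2**k:,} lookups, "
              f"~{m_bits / 8 / 1e6:.2f} MB, "
              f"{k * index_bits} hash bits needed")

The k=32 row is where the 768-bit figure comes from: 32 probes times roughly 24 bits of index each.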
A 32-bit filter for 200,000 pages will fit into 2 MB of memory, which seems reasonable. It would require a 768-bit hash function. Does anybody have a recommendation for a good 768-bit hash function for titles?
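I don't have a strong recommendation myself, but one cheap way to get the needed width is to salt an ordinary digest with the probe index rather than hunting for a single 768-bit function. A sketch of that idea follows; the filter size, probe count, and use of MD5 are all just assumptions for illustration, and a plain bytearray stands in for the shared-memory segment:

    import hashlib

    M_BITS = 9_500_000   # filter size in bits (assumed: roughly n * 32 / ln 2)
    K = 32               # probes per title, for a ~1-in-4-billion false-positive rate

    def probe_positions(title):
        # Derive K bit positions (~24 bits each, 768 bits in all) for a title
        # by hashing the title salted with the probe index.
        positions = []
        for i in range(K):
            digest = hashlib.md5(f"{i}:{title}".encode("utf-8")).digest()
            positions.append(int.from_bytes(digest[:4], "big") % M_BITS)
        return positions

    bloom = bytearray(M_BITS // 8 + 1)   # stand-in for the shared-memory block

    def add(title):
        for pos in probe_positions(title):
            bloom[pos // 8] |= 1 << (pos % 8)

    def might_exist(title):
        # False means "definitely no such page"; True means "almost certainly exists".
        return all(bloom[pos // 8] & (1 << (pos % 8)) for pos in probe_positions(title))

    add("Main Page")
    print(might_exist("Main Page"), might_exist("No Such Page"))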
Alternatively, 2 MB is probably enough space for a traditional hash table of titles and article IDs, provided the average title isn't too long and there's some reasonably efficient storage method for it.
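A quick back-of-the-envelope for that option, with the average title length and per-entry overhead as pure guesses, suggests title length matters a lot:

    # Rough sizing of a plain title -> article-ID table for a few assumed
    # average title lengths (the lengths and per-entry overhead are guesses).
    n_pages = 200_000
    id_bytes = 4        # article ID
    overhead = 4        # per-entry bookkeeping: length byte, padding, etc. (assumed)

    for avg_title in (6, 12, 20):
        total = n_pages * (avg_title + id_bytes + overhead)
        print(f"avg title {avg_title:2d} bytes -> ~{total / 1e6:.1f} MB")

At those numbers the table runs past 2 MB fairly quickly, which is why the "reasonably efficient storage method" caveat matters.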