>>>From: Brion Vibber <brion(a)pobox.com>
>>>
>>>On Thu, 2003-05-01 at 08:53, Nick Reinking wrote:
>>> > Off the top of my head, I can't think of any simple mathematical way
>>>to
>>> > do what you want to do. (that being making articles selected randomly
>>> > less likely to be selected randomly over time)
>>>
>>>Not sure who wants to do that... For me it's quite sufficient to make
>>>the probability of selection random, and let the size of the selection
>>>field make multiple selections _reasonably_ rare.
>>
>>With over 100,000 articles, it's unlikely the user will ever see the same
>>article come up twice. A previous flawed algorithm had the property of
>>making a few articles much more likely to be selected.
>>
>I thought the reason we were having this discussion was because articles
>were repeating too often, indicating some fault in the randomizer?
>I have commonly seen an article twice in a sequence of 3-5 random page
>requests - about a dozen times in the past month. On one occasion I saw an
>article 3 times in a sequence of 6, which is definitely not statistically
>random.
A calculator is only as smart as the fool pressing the buttons. Same with
pseudorandom number generators. The PRNG appears to have been working
perfectly, allowing us to misuse it with amazing precision. For an
explanation of what went wrong, see my first post on this topic.
As for Brion's comment about MySQL updating RAND() seeding -- I assumed they
would be using a persistent seed, but that comment suggests they're making
the same mistake our own code makes: initialising the seed on every
connection, from the system time. Just because you ask for microseconds
doesn't mean you get microseconds. The PHP manual doesn't say how
microtime() is implemented; what if it's using the old 18.2 Hz clock?
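To make the failure mode concrete, here's a minimal sketch (hypothetical
code, not quoted from our tree) of seeding per request from a coarse clock:

  <?php
  // Seed on every request from the "microsecond" clock:
  list( $usec, $sec ) = explode( ' ', microtime() );
  mt_srand( (int) ( (float) $usec * 1000000 ) );

  // If the underlying clock only ticks at ~18.2 Hz, every request landing
  // in the same ~55 ms tick gets the same seed, and therefore the same
  // mt_rand() sequence: identical "random" results for all of them.
  echo mt_rand() / mt_getrandmax();
  ?>

Seeding once and letting the generator carry its own state forward avoids
the collision entirely.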
-- Tim Starling.
>From: Brion Vibber <brion(a)pobox.com>
>
>On Thu, 2003-05-01 at 08:53, Nick Reinking wrote:
> > Off the top of my head, I can't think of any simple mathematical way to
> > do what you want to do. (that being making articles selected randomly
> > less likely to be selected randomly over time)
>
>Not sure who wants to do that... For me it's quite sufficient to make
>the probability of selection random, and let the size of the selection
>field make multiple selections _reasonably_ rare.
With over 100,000 articles, it's unlikely the user will ever see the same
article come up twice. A previous flawed algorithm had the property of
making a few articles much more likely to be selected.
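(For the record, a quick birthday-problem estimate backs this up: with
N = 100,000 articles and k random hits, the chance of any repeat is about
1 - exp(-k(k-1)/2N), so a dozen clicks should collide only about 0.07% of
the time.)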
>The resetting of the random weight upon load is just meant to keep the
>pot stirring, and "in theory" shouldn't have any effect at all on
>randomness if the random indexes were random to begin with.
Quite right. I would recommend only setting cur_random on article creation.
It would save that little tiny bit of database load. There's no reason to
think this "pot stirring" will make the selection more random, now that the
pot has been pre-stirred.
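In code it's just one extra column on the insert path; a minimal sketch
(column list abbreviated, PHP 4 mysql functions assumed):

  <?php
  // Assign cur_random exactly once, at article creation, and never
  // touch it again on page views:
  $rand = mt_rand() / mt_getrandmax();
  mysql_query( "INSERT INTO cur (cur_title, cur_text, cur_random)" .
               " VALUES ('$title', '$text', $rand)" );
  ?>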
> > Seems to me that the best way to do this is either:
> > 1) Be truly random. In PostgreSQL, this would be something like:
> > SELECT cur_id from cur LIMIT 1 OFFSET random(SELECT count(cur_id) FROM cur)
> > (this function should be really fast, not sure if faster than:)
>
>Mysql doesn't seem to let you use a non-constant expression for
>'LIMIT offset,count' (and besides, still no subqueries), but we can pick
>a random number easily enough in PHP. The problem is that we aren't sure
>what range to use; hence a random index attached to each article whose
>range is known (between 0 and 1).
The MySQL documentation suggests using "ORDER BY RAND() LIMIT 1", for
versions later than 3.23. But I think the current algorithm is quite clever.
Its only drawback is the overhead associated with maintaining the extra
index, and initialising the field on article creation.
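For reference, the selection step in the current scheme amounts to
something like this (simplified sketch; the real code also has to handle
the case where nothing sorts above the random point):

  <?php
  // Pick a uniform point in [0,1] and take the first article at or
  // above it in cur_random order: one indexed range scan.
  $rand = mt_rand() / mt_getrandmax();
  $res = mysql_query( "SELECT cur_id FROM cur" .
                      " WHERE cur_random >= $rand" .
                      " AND cur_namespace = 0 AND cur_is_redirect = 0" .
                      " ORDER BY cur_random LIMIT 1" );
  $row = mysql_fetch_row( $res );
  ?>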
>We don't want to select non-article pages or redirects, but we don't
>currently have a count of exactly that. There's the "good articles"
>count, but it has other criteria, so if we used it we'd always crop off
>relatively recent articles. If we inflate it based on an estimate we'd
>get overemphasis on the _most_ recent article if we estimate too high.
>
>Sure, there's count(*), but that takes time to run... might be faster
>with an index specifically on cur_namespace,cur_is_redirect. Or we could
>maintain another column in the stats table which counts non-redirect
>article pages with _no_ other criteria (such as size or comma/bracket
>count). Rebuilding indexes on cur is _very_ slow, so if we need to
>change the indexes, best to plan ahead.
Maybe we could have a "good article" flag, stored in cur on edit.
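A sketch of what that flag might buy us (cur_is_article is a hypothetical
column name, and the criteria shown are stand-ins for whatever "good
article" ends up meaning):

  <?php
  // On save: recompute the flag for the page being touched.
  $flag = ( $namespace == 0 && !$isRedirect ) ? 1 : 0;
  mysql_query( "UPDATE cur SET cur_is_article = $flag WHERE cur_id = $id" );

  // The article count then becomes a cheap indexed query:
  $res = mysql_query( "SELECT COUNT(*) FROM cur WHERE cur_is_article = 1" );
  ?>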
-- Tim Starling.
>> If this is the problem, we are in luck because there have been a lot
>> of good improvement suggestions. But they all add complexity to the
>> code (or database setup) and "premature optimization is the root of
>> all kinds of evil," so if link checking isn't a bottleneck it would be
>> counterproductive to spend a lot of time to try to optimize it.
> I think it's a bit premature for that yet. I think the differently-
> rendered missing links feature is pretty critical, not just a frill.
> In the short term the hardware will bail us out until we can find
> a solution.
What I was suggesting is just that we _test_ the performance with
differently rendered links turned off, so that we know if this is a
place where optimization is required. I expect it is, but I think it
makes sense to check.
Even a simple hash in shared memory means that we are maintaining data
in two places all the time, which makes things somewhat more complex to
maintain.
I think this is a good idea -- if link checking is creating the current
bottleneck. And that's why I voiced support for turning off the links to
see the performance impact before anybody spends time coding an
optimized link checking routine.
--Mark Christensen
> Well, to be fair, one of the very first questions
> I asked upon joining wikien-l was whether the
> database and webserver were on one host or two,
> with exactly that reasoning in mind. But I don't
> think I've written any well-known books...
> unless the Wikipedia itself counts? ;)
I believe Jimbo was talking about Ian Gilfillan, author of "Mastering
MySQL 4", who posted a couple of suggestions back in February.
--Mark
Lee Daniel Crocker <lee(a)piclab.com> said:
> (David A. Wheeler <dwheeler(a)dwheeler.com>) said:
> >
> > 1. Perhaps for simple reads of the current article (cur), you
> > could completely skip using MySQL and use the filesystem instead.
>
> In other words, caching.
Sorry, I wasn't clear.
I wasn't thinking of caching - I was thinking of accessing the
filesystem INSTEAD of MySQL when getting the current wikitext.
Why? Well, I suspect that accessing the filesystem directly
is much faster than accessing the data via MySQL - if most
accesses are simple reads, then you can access it without
user-level locks, etc., etc. Even more importantly,
checking for existence is a simple filesystem check - which
is likely to be much faster than the MySQL request.
Would it be faster? I don't know; the only _real_ way to find out
is to benchmark it.
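For concreteness, the idea is something like this (directory layout and
file naming are my assumptions, not a proposal for the real layout):

  <?php
  // Current wikitext lives in one file per article:
  $path = '/var/wiki/cur/' . md5( $title ) . '.txt';

  // Existence check is a single stat() call, no database round trip:
  if ( file_exists( $path ) ) {
      $fp = fopen( $path, 'rb' );
      $text = fread( $fp, filesize( $path ) );
      fclose( $fp );
  }
  // ...versus: SELECT cur_text FROM cur WHERE cur_title = '...'
  ?>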
Of course, if wikipedia is near the breaking point for performance,
another approach would be to change the design so that reading
only requires one lookup (for the data itself).
You noted the two big problems, and I agree that they're the
sticking points.
You could abandon many user settings, except ones that the user
can supply themselves to select between different stylesheets, and
abandon displaying links differently depending on whether or not
they're there. Less desirable, but you've already abandoned supporting
search! Then you can cache the generated HTML as well.
If it's a choice between having a working wikipedia, and
having the bells & whistles, I think working is the better plan.
You can always include them as settable options, to be returned once
the system doesn't have performance problems.
Although databases are more flexible for storing structured
data, for simple unstructured data, a simple filesystem-based
approach might be more suitable. This also lets you use other
existing tools (like the many tools that index files for
later rapid searching).
A quick start might be to temporarily disable all checking
of links, and see if that helps much.
> > [Rendering] could also be sped up, e.g., by rewriting it in flex.
> > My "html2wikipedia" is written in flex - it's really fast and didn't
> > take long to write. The real problem is, I suspect that
> > isn't the bottleneck.
>
> It isn't. And there's no reason to expect flex to be any faster
> than any other language.
Actually, for some lexing applications flex can be MUCH faster.
That's because it can pre-compile a large set of patterns
into C, and compile the result. Its "-C" option can, for
some applications, result in blazingly fast operations.
You CAN do the same thing by hand, but it takes a long time to
hand-optimize that kind of code.
However, there's no point in rewriting what is not the bottleneck.
Which is why I was hoping to hear if someone has done measurements
to identify the real bottlenecks, e.g., "50% of the system
time is spent doing X". If most time is spent rendering
articles for display (without editing), then it's worth examining
what's taking the time. If the time is spent on checking if
links exist, then clearly that's worth examining.
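Even crude wall-clock bracketing would answer that. A minimal sketch:

  <?php
  // Bracket a suspect section with timestamps and log the difference.
  function now_float() {
      list( $usec, $sec ) = explode( ' ', microtime() );
      return (float) $usec + (float) $sec;
  }
  $t0 = now_float();
  // ... render the article, or check the links ...
  error_log( 'render took ' . ( now_float() - $t0 ) . 's' );
  ?>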
Oh, one note - if you want to simply store whether a
given article entry exists, and quickly check it, one
fancy way of doing this is by using a Bloom filter.
You can hash the article title, and then use a compact bit
structure to store its existence or non-existence.
More info, and MIT-licensed code, for a completely different
application are at:
http://www.ir.bbn.com/projects/SPIE
(there, they hash packets so that later queries can ask
"did you see this packet"?). Given the relatively small
size of article text, it's not clear you need this
(you can store all the titles in memory), but I just thought
I'd mention it.
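For the curious, here's a toy version of the idea (the sizes and the
two-positions-from-one-md5 trick are arbitrary choices for the sketch;
real filters use k independent hashes):

  <?php
  $size = 1 << 20;                           // bits in the filter
  $filter = str_repeat( "\0", $size >> 3 );  // packed bit array

  function bloom_bits( $title, $size ) {
      // Derive two bit positions from one md5 (7 hex chars each, to
      // stay within 32-bit integer range):
      $h = md5( $title );
      return array( hexdec( substr( $h, 0, 7 ) ) % $size,
                    hexdec( substr( $h, 8, 7 ) ) % $size );
  }
  function bloom_add( &$filter, $title, $size ) {
      foreach ( bloom_bits( $title, $size ) as $i ) {
          $filter[$i >> 3] = chr( ord( $filter[$i >> 3] ) | ( 1 << ( $i & 7 ) ) );
      }
  }
  function bloom_test( $filter, $title, $size ) {
      foreach ( bloom_bits( $title, $size ) as $i ) {
          if ( !( ord( $filter[$i >> 3] ) & ( 1 << ( $i & 7 ) ) ) ) {
              return false;   // definitely not present
          }
      }
      return true;            // probably present (false positives possible)
  }
  ?>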
Anyway, thanks for listening. My hope is that the Wikipedia
doesn't become a victim of its own success :-).
LDC wrote:
>> [ibiblio] would LOVE to host wikipedia.org...
>I don't think that's an option, but certainly
>they might be useful as a mirror, or storage
>for backups, etc.
My thoughts exactly LDC. If they go for it, we would
have a fantastic mirror AND we would still have
direct control over our own servers.
--Daniel Mayer (aka mav)
Someone needs to run this:
UPDATE cur SET cur_random=RAND()
Here's a quote from Village Pump explaining why:
The pseudorandom number generator which backs the Special:Randompage link
appears to be not so good. I just hit it about two dozen times and had
several links come up multiple times. -º¡º
Emphasis on pseudo. I got RNA transcription four times in about a dozen
clicks - twice in succession. Also, Zog's favorite song came up - but
redirected to Johnny Rebel. I thought redirects were excluded. Geoffrey
Here's what I found out with a few SQL queries. The cur_random field values
are strongly clustered up around the high end. In fact, there are no
articles at all between about 0.18 and 0.4, and only a few below 0.18. A few
minutes of browsing through the old versions of SpecialRandompage.php shows
why. A previous version of the software selected the lowest-numbered
cur_random value, and set it to a random value. So here's why we now see
poor results: Most of the pages are clustered up above 0.9 or so, so when
you click Special:Randompage , there's a high chance of picking one of the
few low-numbered articles. The cur_random value is then reset, and there's
still a high chance of the new value being below 0.9. Hence, the few
privileged low-numbered articles get selected far more often, and unless
someone re-randomizes the cur_random column, it takes a long time for the
high-numbered articles to diffuse back down. -- Tim Starling 04:25 May 1,
2003 (UTC)
--END QUOTE--
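To see the drift concretely, here's a toy simulation of the old
behaviour (illustrative only, not the wiki code):

  <?php
  // Start with 1000 uniform values, then repeatedly do what the old
  // code did: take the LOWEST value and re-randomize it.
  $n = 1000;
  $r = array();
  for ( $i = 0; $i < $n; $i++ ) {
      $r[] = mt_rand() / mt_getrandmax();
  }
  for ( $hit = 0; $hit < 20000; $hit++ ) {
      $min = 0;
      for ( $i = 1; $i < $n; $i++ ) {
          if ( $r[$i] < $r[$min] ) $min = $i;
      }
      $r[$min] = mt_rand() / mt_getrandmax();  // usually lands higher
  }
  // The low end keeps being vacated, so values pile up near 1.0,
  // which is exactly the clustering visible in cur_random.
  sort( $r );
  printf( "min=%f median=%f max=%f\n", $r[0], $r[$n / 2], $r[$n - 1] );
  ?>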
-- Tim Starling.