Hello,
If I'm not mistaken, the random page scheme makes some pages much more likely to be retrieved than others. If I understand correctly, each page is assigned a random number (cur_random) when it's created. To retrieve a random page, a random number x is generated and the page with cur_random > x and lower than any other cur_random is returned.
The probability of retrieving a page is proportional to the gap between its cur_random and the next lower cur_random. If cur_random is generated uniformly, the gaps have an exponential distribution. This implies there will be some gaps that are much larger than others. I guess that we could directly test this by dumping out cur_random and computing the gaps.
Since the bias (if it exists) is independent of everything else (category, age, size, authors, etc etc) I think it would be no more than a curiosity, except for people who are doing some kind of comprehensive statistical stuff.
Sorry if this is just a rerun of some earlier discussion. I did look but couldn't find anything.
best, Robert Dodier
On Jan 8, 2005, at 10:59 PM, Robert Dodier wrote: [snip]
I guess that we could directly test this by dumping out cur_random and computing the gaps.
Please feel free to do so. cur table dumps are available to the public at http://download.wikimedia.org/
Sorry if this is just a rerun of some earlier discussion. I did look but couldn't find anything.
http://www.google.com/search? hl=en&lr=&q=site%3Amail.wikipedia.org+cur_random&btnG=Search
-- brion vibber (brion @ pobox.com)
Robert Dodier wrote:
Hello,
If I'm not mistaken, the random page scheme makes some pages much more likely to be retrieved than others. If I understand correctly, each page is assigned a random number (cur_random) when it's created. To retrieve a random page, a random number x is generated and the page with cur_random > x and lower than any other cur_random is returned.
The probability of retrieving a page is proportional to the gap between its cur_random and the next lower cur_random. If cur_random is generated uniformly, the gaps have an exponential distribution. This implies there will be some gaps that are much larger than others. I guess that we could directly test this by dumping out cur_random and computing the gaps.
If you like that one, you might be interested in a bug we had with Special:Randompage back in 2002 and early 2003. Pages were selected by taking the next article with a cur_random value greater than some randomly selected value on [0,1). Then, the cur_random value was updated, set to another random value. If this new random value was close to zero, the chance of it being selected again is small. Hence, the net effect is to cause the cur_random of all articles to drift towards zero. Eventually, there were only a few hundred articles out of 100,000 which had a cur_random value greater than 0.01.
I plotted the distribution of cur_random at the time, and it seemed to make a nice curve. I've been interested since then in whether there is an analytic form for that distribution. I suspect it tends to a delta function as t->infinity, but is there some normalised distribution which describes the shape?
-- Tim Starling
wikitech-l@lists.wikimedia.org