Someone needs to run this:
UPDATE cur SET cur_random=RAND()
Here's a quote from Village Pump explaining why:
The pseudorandom number generator which backs the Special:Randompage link appears to be not so good. I just hit it about two dozen times and had several links come up multiple times. -º¡º
Emphasis on pseudo. I got RNA transcription four times in about a dozen clicks - twice in succession. Also, Zog's favorite song came up - but redirected to Johnny Rebel. I thought redirects were excluded. Geoffrey
Here's what I found out with a few SQL queries. The cur_random field values are strongly clustered up around the high end. In fact, there are no articles at all between about 0.18 and 0.4, and only a few below 0.18. A few minutes of browsing through the old versions of SpecialRandompage.php shows why: a previous version of the software selected the lowest-numbered cur_random value and set it to a new random value. So here's why we now see poor results: most of the pages are clustered up above 0.9 or so, so when you click Special:Randompage, there's a high chance of picking one of the few low-numbered articles. The cur_random value is then reset, and there's still a high chance of the new value being below 0.9. Hence, the few privileged low-numbered articles get selected far more often, and unless someone re-randomizes the cur_random column, it takes a long time for the high-numbered articles to diffuse back down. -- Tim Starling 04:25 May 1, 2003 (UTC)
--END QUOTE--
-- Tim Starling.
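Reading Tim's description, the old selection logic in SpecialRandompage.php was roughly equivalent to the following pair of statements (a reconstruction from the description above, not the actual code; <id> stands for the row just selected):

    -- Grab the page with the lowest random index...
    SELECT cur_id FROM cur ORDER BY cur_random ASC LIMIT 1;

    -- ...then immediately re-roll that page's index.
    UPDATE cur SET cur_random = RAND() WHERE cur_id = <id>;

With most values piled up above 0.9, the freshly drawn RAND() usually lands below the pile again, so the same small set of low-valued pages keeps winning the ORDER BY. That is exactly the bias described in the quote.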
On Thu, May 01, 2003 at 08:10:38AM -0700, Brion Vibber wrote:
On Wed, 2003-04-30 at 21:48, Tim Starling wrote:
Someone needs to run this:
UPDATE cur SET cur_random=RAND()
Ok, it'll be a bit before it's done. Hooray for big slow databases. :D
-- brion vibber (brion @ pobox.com)
Off the top of my head, I can't think of any simple mathematical way to do what you want to do (that being: making articles that have been randomly selected less likely to be selected again over time). Even if every cur_random value is totally re-randomized, it still isn't going to be fair.
Seems to me that the best way to do this is either:

1) Be truly random. In PostgreSQL, this would be something like:

     SELECT cur_id from cur LIMIT 1 OFFSET random(SELECT count(cur_id) FROM cur)

   (this function should be really fast, not sure if faster than:)

     SELECT cur_id from cur LIMIT 1 OFFSET random(SELECT count(*) FROM cur)

   (but you get the idea, may need other constraints)

2) Keep a timestamp instead of a random number. That way, whenever an article is "randomly" selected, it gets its timestamp updated to the current time. Always select the oldest article for a "random" page. New articles always get a current timestamp here.
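As written, option 1 isn't valid PostgreSQL: random() takes no arguments, and the row offset has to be computed as an expression of its own. A version that should actually run on a modern PostgreSQL looks like this (a sketch; note that OFFSET still walks past every skipped row, so each hit costs O(n)):

    -- Skip a uniformly random number of rows, then take one.
    SELECT cur_id
    FROM cur
    OFFSET floor(random() * (SELECT count(*) FROM cur))::bigint
    LIMIT 1;

Option 2 would come down to something like this, where last_served is a hypothetical column invented for illustration and <id> is the row just selected:

    -- Serve the least-recently-served page...
    SELECT cur_id FROM cur ORDER BY last_served ASC LIMIT 1;

    -- ...and stamp it so it moves to the back of the queue.
    UPDATE cur SET last_served = NOW() WHERE cur_id = <id>;

Note that option 2 is really a round-robin, not a random draw: two people clicking at nearly the same time will tend to get consecutive pages rather than independent picks.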
On Thu, 2003-05-01 at 08:53, Nick Reinking wrote:
Off the top of my head, I can't think of any simple mathematical way to do what you want to do (that being: making articles that have been randomly selected less likely to be selected again over time)
Not sure who wants to do that... For me it's quite sufficient to make the probability of selection random, and let the size of the selection field make multiple selections _reasonably_ rare.
The resetting of the random weight upon load is just meant to keep the pot stirring, and "in theory" shouldn't have any effect at all on randomness if the random indexes were random to begin with.
Seems to me that the best way to do this is either:
- Be truly random. In PostgreSQL, this would be something like:

      SELECT cur_id from cur LIMIT 1 OFFSET random(SELECT count(cur_id) FROM cur)

  (this function should be really fast, not sure if faster than:)
MySQL doesn't seem to let you use a non-constant expression for 'LIMIT offset,count' (and besides, still no subqueries), but we can pick a random number easily enough in PHP. The problem is that we aren't sure what range to use; hence a random index attached to each article, whose range is known (between 0 and 1).
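Concretely, the scheme described here boils down to a query along these lines (a sketch; the 0.4213 threshold stands in for a value drawn in PHP, e.g. mt_rand() / mt_getrandmax()):

    -- Jump to the first article at or above a random point in [0, 1).
    SELECT cur_id, cur_title
    FROM cur
    WHERE cur_random >= 0.4213
    ORDER BY cur_random
    LIMIT 1;

One edge case the sketch glosses over: if the threshold lands above the highest cur_random in the table, the query returns nothing, so a real implementation needs to retry or wrap around to 0. The filters discussed next (namespace, redirect) would go into the WHERE clause as well.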
We don't want to select non-article pages or redirects, but we don't currently have a count of exactly that. There's the "good articles" count, but it has other criteria, so if we used it we'd always crop off relatively recent articles. If we inflate it based on an estimate we'd get overemphasis on the _most_ recent article if we estimate too high.
Sure, there's count(*), but that takes time to run... might be faster with an index specifically on cur_namespace,cur_is_redirect. Or we could maintain another column in the stats table which counts non-redirect article pages with _no_ other criteria (such as size or comma/bracket count). Rebuilding indexes on cur is _very_ slow, so if we need to change the indexes, best to plan ahead.
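A sketch of the two counting options just mentioned (the index name and the stats column name are made up for illustration):

    -- Option A: index the filter columns so the count can avoid a full table scan.
    CREATE INDEX cur_ns_redirect ON cur (cur_namespace, cur_is_redirect);
    SELECT COUNT(*) FROM cur
    WHERE cur_namespace = 0 AND cur_is_redirect = 0;

    -- Option B: keep a dedicated, unfiltered article counter in the stats table,
    -- bumped on page creation/deletion, so selection reads one precomputed value.
    SELECT ss_article_pages FROM stats;

Option B is O(1) at selection time, but it adds bookkeeping everywhere pages are created, deleted, or turned into redirects.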
-- brion vibber (brion @ pobox.com)