Magnus Manske wrote:
While working on another tool, I noticed that we still have that page_random value in the page table. There seems to be a better way to do this:
http://jan.kneschke.de/projects/mysql/order-by-rand/
That method only requires the primary key to be an integer (which is true for the page table), nothing else. The guy who wrote the page tested it on a table with 1,000,000 rows; 1,000 "random" queries took ~0.6 seconds. Sounds like it would be fast enough for us, and we could get rid of that page_random field altogether.
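For readers who have not followed the link, here is a minimal sketch of the general idea as I understand it: pick a random integer up to the largest key, then take the first row whose key is at or above it. It uses an in-memory SQLite table rather than MySQL, and the helper names (make_demo_db, random_page_id) and the one-column table layout are made up for illustration; the SQL on the linked page may well differ.

```python
import random
import sqlite3


def make_demo_db(ids):
    """Build a throwaway in-memory table with the given page_id values (hypothetical layout)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE page (page_id INTEGER PRIMARY KEY)")
    conn.executemany("INSERT INTO page VALUES (?)", [(i,) for i in ids])
    return conn


def random_page_id(conn, rng=random):
    """Return the first existing page_id at or above a random target in [1, MAX(page_id)]."""
    (max_id,) = conn.execute("SELECT MAX(page_id) FROM page").fetchone()
    target = rng.randint(1, max_id)  # inclusive on both ends
    row = conn.execute(
        "SELECT page_id FROM page WHERE page_id >= ? ORDER BY page_id LIMIT 1",
        (target,),
    ).fetchone()
    return row[0]


if __name__ == "__main__":
    conn = make_demo_db([1, 2, 5, 9])  # ids 3, 4, 6, 7, 8 are gaps
    print([random_page_id(conn) for _ in range(10)])
```

Both queries can be answered from the primary key index alone, which is why no extra column is needed.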
Since the page table on enwiki has 5 million rows and the maximum page_id is 6 million, I imagine there would be a bit of a bias towards articles whose page_id sits just above a gap in the ID sequence. That's probably better than the current situation though. I did an interesting back-of-the-envelope calculation on the degree of bias in the page_random system, in #wikimedia-tech. Maybe I will write it up if I get time...
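To make the gap bias concrete, here is a rough sketch (this is not the page_random calculation from IRC). It assumes the method picks a uniform integer in [1, max page_id] and returns the first existing page_id at or above it, and that IDs start at 1. Under those assumptions a page is chosen with probability (size of the gap below it + 1) / max page_id. The function name selection_probabilities is made up for illustration.

```python
from fractions import Fraction


def selection_probabilities(ids):
    """Exact chance of each page_id being picked by the 'first id >= random target' rule.

    Assumes the target is uniform over 1..max(ids); every target in a gap
    falls through to the page just above that gap.
    """
    ids = sorted(ids)
    max_id = ids[-1]
    probs = {}
    prev = 0  # ids are assumed to start at 1
    for page_id in ids:
        # This page absorbs its own id plus all missing ids directly below it.
        probs[page_id] = Fraction(page_id - prev, max_id)
        prev = page_id
    return probs


if __name__ == "__main__":
    for page_id, p in selection_probabilities([1, 2, 5, 9]).items():
        print(page_id, p)
```

With 5 million rows spread over 6 million IDs, the average page absorbs 1.2 target values, but a page sitting just above a run of k missing IDs absorbs k + 1 of them, so it is k + 1 times as likely to come up as a page with no gap below it.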
-- Tim Starling