While working on another tool, I noticed that we still have that page_random value in the page table. There seems to be a better way to do this:
http://jan.kneschke.de/projects/mysql/order-by-rand/
That method only requires the primary key being an integer (which is true for the page table), nothing else. The guy who wrote the page tested it on a table with 1.000.000 rows, 1.000 "random" queries took ~0.6 seconds. Sounds like it would be fast enough for us, and we could get rid of that page_random field altogether.
Which reminds me: Someone familiar with query.php, please add a function to get a random page title.
Thanks, Magnus
On Wed, Jul 26, 2006 at 09:25:56PM +0200, Magnus Manske wrote:
While working on another tool, I noticed that we still have that page_random value in the page table. There seems to be a better way to do this:
Is this MySQL-specific? Is it possible to do this with PostgreSQL, too?
Regards,
jens
Jens Frank wrote:
On Wed, Jul 26, 2006 at 09:25:56PM +0200, Magnus Manske wrote:
While working on another tool, I noticed that we still have that page_random value in the page table. There seems to be a better way to do this:
Is this MySQL-specific? Is it possible to do this with PostgreSQL, too?
PostgreSQL has JOINs and max(), so why not?
Magnus
-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1
Magnus Manske writes:
http://jan.kneschke.de/projects/mysql/order-by-rand/
That method only requires the primary key being an integer (which is true for the page table), nothing else. The guy who wrote the page tested it on a table with 1.000.000 rows, 1.000 "random" queries took ~0.6 seconds. Sounds like it would be fast enough for us, and we could get rid of that page_random field altogether.
Um, no, it is faster than a pure ORDER BY RANDOM (almost anything is!), but not as fast as our current system, which is an excellent one. His method also only works well if you have no holes: his method of finding a value if the table has holes is not truly random.
Jens Frank asked:
Is this MySQL-specific? Is it possible to do this with PostgreSQL, too?
It's certainly possible in Postgres, but I'd not recommend it, for the reasons mentioned above. I also wrote a blog post on this exact topic about two months ago:
http://people.planetpostgresql.org/greg/index.php?/archives/40-Getting-rando... or http://makeashorterlink.com/?K5D25287D
- -- Greg Sabino Mullane greg@turnstep.com PGP Key: 0x14964AC8 200607261611 http://biglumber.com/x/web?pk=2529DF6AB8F79407E94445B4BC9B906714964AC8
http://people.planetpostgresql.org/greg/index.php?/archives/40-Getting-rando...
This is being pedantic, but your example
SELECT * FROM mydata WHERE myrand >= (SELECT RANDOM() OFFSET 0) ORDER BY myrand ASC LIMIT 1;
of course doesn't draw the rows uniformly at random, because the spacing between consecutive 'myrand' values is not uniform.
I suppose that if you periodically change the value of the myrand column, it should make up for that lack of randomness, for most applications - certainly Wikipedia.
If you pick a random value using the method above, and then walk forward or backward by a random number of rows, it could help even things out. Probably not worth it though.
Frederik
On Wed, Jul 26, 2006 at 08:16:21PM -0000, Greg Sabino Mullane wrote:
-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1
Magnus Manske writes:
http://jan.kneschke.de/projects/mysql/order-by-rand/
That method only requires the primary key being an integer (which is true for the page table), nothing else. The guy who wrote the page tested it on a table with 1.000.000 rows, 1.000 "random" queries took ~0.6 seconds. Sounds like it would be fast enough for us, and we could get rid of that page_random field altogether.
Um, no, it is faster than a pure ORDER BY RANDOM (almost anything is!), but not as fast as our current system, which is an excellent one. His method also only works well if you have no holes: his method of finding a value if the table has holes is not truly random.
Jens Frank asked:
Is this MySQL-specific? Is it possible to do this with PostgreSQL, too?
It's certainly possible in Postgres, but I'd not recommend it, for the reasons mentioned above. I also wrote a blog post on this exact topic about two months ago:
http://people.planetpostgresql.org/greg/index.php?/archives/40-Getting-rando... or http://makeashorterlink.com/?K5D25287D
Greg Sabino Mullane greg@turnstep.com PGP Key: 0x14964AC8 200607261611 http://biglumber.com/x/web?pk=2529DF6AB8F79407E94445B4BC9B906714964AC8 -----BEGIN PGP SIGNATURE-----
iD8DBQFEx80nvJuQZxSWSsgRAlI+AJ4jgKZHPobK/rfiXJOmZ0AgxX+glQCeIVeu b2uKnN5zNCM/mbnh5/hqtg4= =MKEi -----END PGP SIGNATURE-----
Wikitech-l mailing list Wikitech-l@wikimedia.org http://mail.wikipedia.org/mailman/listinfo/wikitech-l
Magnus Manske wrote:
While working on another tool, I noticed that we still have that page_random value in the page table. There seems to be a better way to do this:
http://jan.kneschke.de/projects/mysql/order-by-rand/
That method only requires the primary key being an integer (which is true for the page table), nothing else. The guy who wrote the page tested it on a table with 1.000.000 rows, 1.000 "random" queries took ~0.6 seconds. Sounds like it would be fast enough for us, and we could get rid of that page_random field altogether.
Since the page table on enwiki has 5 million rows and the maximum page_id is 6 million, I imagine there would be a bit of a bias towards articles at the top edges of gaps. That's probably better than the current situation though. I did an interesting back of the envelope calculation on the degree of bias in the page_random system, in #wikimedia-tech. Maybe I will write it up if I get time...
-- Tim Starling
wikitech-l@lists.wikimedia.org