Random page function

List overview All Threads
Download

newer

older

Problem linking to an anchor

WikiWYSIWYG (was: Re: New SVN...

Magnus Manske

26 Jul 2006 26 Jul '06

9:25 p.m.

While working on another tool, I noticed that we still have that page_random value in the page table. There seems to be a better way to do this:

http://jan.kneschke.de/projects/mysql/order-by-rand/

That method only requires the primary key being an integer (which is true for the page table), nothing else. The guy who wrote the page tested it on a table with 1.000.000 rows, 1.000 "random" queries took ~0.6 seconds. Sounds like it would be fast enough for us, and we could get rid of that page_random field altogether.

Which reminds me: Someone familiar with query.php, please add a function to get a random page title.

Thanks, Magnus

Show replies by date

jf＠mormo.org

26 Jul 26 Jul

9:35 p.m.

On Wed, Jul 26, 2006 at 09:25:56PM +0200, Magnus Manske wrote:

...

While working on another tool, I noticed that we still have that page_random value in the page table. There seems to be a better way to do this:

http://jan.kneschke.de/projects/mysql/order-by-rand/

Is this MySQL-specific? Is it possible to do this with PostgreSQL, too?

Regards,

jens

Magnus Manske

9:37 p.m.

Jens Frank wrote:

...

On Wed, Jul 26, 2006 at 09:25:56PM +0200, Magnus Manske wrote:

...
While working on another tool, I noticed that we still have that page_random value in the page table. There seems to be a better way to do this:

http://jan.kneschke.de/projects/mysql/order-by-rand/

Is this MySQL-specific? Is it possible to do this with PostgreSQL, too?

PostgreSQL has JOINs and max(), so why not?

Magnus

Greg Sabino Mullane

10:16 p.m.

-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1

Magnus Manske writes:

...

http://jan.kneschke.de/projects/mysql/order-by-rand/

That method only requires the primary key being an integer (which is true for the page table), nothing else. The guy who wrote the page tested it on a table with 1.000.000 rows, 1.000 "random" queries took ~0.6 seconds. Sounds like it would be fast enough for us, and we could get rid of that page_random field altogether.

Um, no, it is faster than a pure ORDER BY RANDOM (almost anything is!), but not as fast as our current system, which is an excellent one. His method also only works well if you have no holes: his method of finding a value if the table has holes is not truly random.

Jens Frank asked:

...

Is this MySQL-specific? Is it possible to do this with PostgreSQL, too?

It's certainly possible in Postgres, but I'd not recommend it, for the reasons mentioned above. I also wrote a blog post on this exact topic about two months ago:

http://people.planetpostgresql.org/greg/index.php?/archives/40-Getting-rando... or http://makeashorterlink.com/?K5D25287D

- -- Greg Sabino Mullane greg@turnstep.com PGP Key: 0x14964AC8 200607261611 http://biglumber.com/x/web?pk=2529DF6AB8F79407E94445B4BC9B906714964AC8

...PGP SIGNATURE...

-----BEGIN PGP SIGNATURE----- iD8DBQFEx80nvJuQZxSWSsgRAlI+AJ4jgKZHPobK/rfiXJOmZ0AgxX+glQCeIVeu b2uKnN5zNCM/mbnh5/hqtg4= =MKEi -----END PGP SIGNATURE-----

Frederik Eaton

11:17 p.m.

...

http://people.planetpostgresql.org/greg/index.php?/archives/40-Getting-rando...

This is being pedantic, but your example

SELECT * FROM mydata WHERE myrand >= (SELECT RANDOM() OFFSET 0) ORDER BY myrand ASC LIMIT 1;

of course doesn't draw the rows uniformly at random, because the spacing between consecutive 'myrand' values is not uniform.

I suppose that if you periodically change the value of the myrand column, it should make up for that lack of randomness, for most applications - certainly Wikipedia.

If you pick a random value using the method above, and then walk forward or backward by a random number of rows, it could help even things out. Probably not worth it though.

Frederik

On Wed, Jul 26, 2006 at 08:16:21PM -0000, Greg Sabino Mullane wrote:

...

-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1

Magnus Manske writes:

...
http://jan.kneschke.de/projects/mysql/order-by-rand/

That method only requires the primary key being an integer (which is true for the page table), nothing else. The guy who wrote the page tested it on a table with 1.000.000 rows, 1.000 "random" queries took ~0.6 seconds. Sounds like it would be fast enough for us, and we could get rid of that page_random field altogether.

Um, no, it is faster than a pure ORDER BY RANDOM (almost anything is!), but not as fast as our current system, which is an excellent one. His method also only works well if you have no holes: his method of finding a value if the table has holes is not truly random.

Jens Frank asked:

...
Is this MySQL-specific? Is it possible to do this with PostgreSQL, too?

It's certainly possible in Postgres, but I'd not recommend it, for the reasons mentioned above. I also wrote a blog post on this exact topic about two months ago:

http://people.planetpostgresql.org/greg/index.php?/archives/40-Getting-rando... or http://makeashorterlink.com/?K5D25287D

Greg Sabino Mullane greg@turnstep.com PGP Key: 0x14964AC8 200607261611 http://biglumber.com/x/web?pk=2529DF6AB8F79407E94445B4BC9B906714964AC8 -----BEGIN PGP SIGNATURE-----

iD8DBQFEx80nvJuQZxSWSsgRAlI+AJ4jgKZHPobK/rfiXJOmZ0AgxX+glQCeIVeu b2uKnN5zNCM/mbnh5/hqtg4= =MKEi -----END PGP SIGNATURE-----

Wikitech-l mailing list Wikitech-l@wikimedia.org http://mail.wikipedia.org/mailman/listinfo/wikitech-l

-- http://ofb.net/~frederik/

Tim Starling

27 Jul 27 Jul

7:31 p.m.

Magnus Manske wrote:

...

While working on another tool, I noticed that we still have that page_random value in the page table. There seems to be a better way to do this:

http://jan.kneschke.de/projects/mysql/order-by-rand/

That method only requires the primary key being an integer (which is true for the page table), nothing else. The guy who wrote the page tested it on a table with 1.000.000 rows, 1.000 "random" queries took ~0.6 seconds. Sounds like it would be fast enough for us, and we could get rid of that page_random field altogether.

Since the page table on enwiki has 5 million rows and the maximum page_id is 6 million, I imagine there would be a bit of a bias towards articles at the top edges of gaps. That's probably better than the current situation though. I did an interesting back of the envelope calculation on the degree of bias in the page_random system, in #wikimedia-tech. Maybe I will write it up if I get time...

-- Tim Starling

6721

Age (days ago)

6722

Last active (days ago)

wikitech-l@lists.wikimedia.org

5 comments

5 participants

tags (0)

participants (5)

Frederik Eaton
Greg Sabino Mullane
jf＠mormo.org
Magnus Manske
Tim Starling