The Swedish Wikipedia now has more than 1.5 million articles, compared to 600,000 in January 2013 and 500,000 in September 2012. The growth is due to a bot creating large numbers of articles on animal and plant species.
The Swedish Wikipedia community has discussed the matter thoroughly, and there is strong consensus to keep these articles and to keep generating more. (Many German Wikipedians are known to consider them bad articles that should be removed, but that is not their decision to make.)
The current implementation of [[Special:Random]], however, gives equal weight to every existing article and this is perceived as a problem that needs to be fixed.
But it is not obvious how a bug report or feature request should be written. A naive approach would be to ask for a random article that wasn't created by a bot, but that misses the point. Users want bot-generated articles to come up, just not so often. And some manually written stubs are also less wanted. Perhaps the random function should be weighted by article length or by the number of page views? But is it practical to implement such a weighted random function? Are the necessary data in the database?
On 23/08/13 10:48, Lars Aronsson wrote:
But it is not obvious how a bug report or feature request should be written. A naive approach would be to ask for a random article that wasn't created by a bot, but that misses the point.
That was my solution when this issue came up on the English Wikipedia:
http://www.mediawiki.org/wiki/Special:Code/MediaWiki/4256
The configured SQL excluded pages most recently edited by Rambot. Derek Ramsey was opposed to it, since he thought his US census stubs deserved eyeballs just as much as any hand-written article, but IIRC I managed to get this solution deployed, at least for a year or two.
Users want bot-generated articles to come up, just not so often. And some manually written stubs are also less wanted. Perhaps the random function should be weighted by article length or by the number of page views? But is it practical to implement such a weighted random function? Are the necessary data in the database?
It would not be especially simple. The existing database schema does not allow weighted random selection. A special data structure could be used, or it could be implemented (inefficiently) in Lucene.
An approximation would be to select, say, 100 articles from the database using page_random, then calculate a weight for each of those 100 articles using complex criteria, then do a weighted random selection from those 100 articles.
Article length is in the database, but page view count is not.
-- Tim Starling
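A minimal sketch of this two-stage approximation in Python. The `fetch_candidates` callable is a hypothetical stand-in for the database query (the usual Special:Random lookup via page_random, but with a larger LIMIT), and weighting by article length is only an illustration of the "complex criteria":

    import random

    def weighted_random_page(fetch_candidates, sample_size=100):
        """Pick a page in two stages: fetch a batch of candidates via
        page_random, then do a weighted random selection among them.

        fetch_candidates(threshold, n) is a hypothetical callable returning
        up to n (page_id, page_len) tuples with page_random >= threshold.
        """
        candidates = fetch_candidates(random.random(), sample_size)
        if not candidates:
            return None
        # Illustrative weighting by length; any "complex criteria" could
        # be computed here instead (bot-created, has an image, ...).
        weights = [max(length, 1) for _page_id, length in candidates]
        chosen_id, _length = random.choices(candidates, weights=weights, k=1)[0]
        return chosen_id

    # Example with a stubbed-out database layer:
    if __name__ == "__main__":
        rows = [(i, random.randint(50, 20000)) for i in range(1, 1001)]
        def fake_fetch(threshold, n):
            return random.sample(rows, min(n, len(rows)))
        print(weighted_random_page(fake_fetch))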
On 08/23/2013 03:57 AM, Tim Starling wrote:
An approximation would be to select, say, 100 articles from the database using page_random, then calculate a weight for each of those 100 articles using complex criteria, then do a weighted random selection from those 100 articles.
Interesting. An even easier/coarser approximation would be to make a second draw only when the first draw doesn't meet some criteria (e.g. bot-created, shorter than L bytes, lacks illustration).
On an average day, Special:Random (and its translation Special:Slumpsida) seems to be called some 9000 times on sv.wikipedia.
This "make a second draw" approach would also let you tune how often you see the "bad" articles: if the first draw is a bad article, flip a coin to decide whether to make a second draw; repeat if the new article is also bad, but never make more than N draws. Someone with time on their hands and a statistical bent could compute how often "good" and "bad" articles come up as a function of the ratio of good to bad articles, the coin-flip probability, and the limit N. --scott
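A rough sketch of that loop in Python; `random_page` and `is_bad` are hypothetical stand-ins for the existing Special:Random lookup and whatever "bad article" criteria are chosen:

    import random

    def random_page_with_redraws(random_page, is_bad, p=0.9, max_redraws=5):
        """Draw a page; if it is "bad", redraw with probability p.
        Never make more than max_redraws extra draws."""
        page = random_page()
        for _ in range(max_redraws):
            # Keep the page if it is good, or if the coin says to show it anyway.
            if not is_bad(page) or random.random() >= p:
                break
            page = random_page()
        return page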
The probability of displaying a "bad" page would be:
B q ((p B)^N - 1) / (p B - 1) + B (p B)^N
(modulo errors), where B is the fraction of bad pages, p is the probability of repeating, q is the probability of displaying (so p+q = 1), and N is the allowed number of repetitions. -- LF
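For what it's worth, that expression agrees with a brute-force simulation of the process (a self-contained check; the special case in the helper only matters when p = B = 1):

    import random

    def bad_page_probability(B, p, N):
        """Closed-form: B*q*((p*B)**N - 1)/(p*B - 1) + B*(p*B)**N, with q = 1 - p."""
        q = 1 - p
        if p * B == 1:  # limit of the expression as p*B -> 1 (only if p = B = 1)
            return B * q * N + B
        return B * q * ((p * B) ** N - 1) / (p * B - 1) + B * (p * B) ** N

    def simulate(B, p, N, trials=200_000):
        """Monte Carlo version of the redraw process described above."""
        bad_shown = 0
        for _ in range(trials):
            for draw in range(N + 1):
                bad = random.random() < B
                if not bad:
                    break          # good page: show it
                if draw == N or random.random() >= p:
                    break          # out of redraws, or the coin says show it
            bad_shown += bad
        return bad_shown / trials

    if __name__ == "__main__":
        B, p, N = 2 / 3, 0.9, 5
        print(bad_page_probability(B, p, N), simulate(B, p, N))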
On Fri, Aug 23, 2013 at 9:24 AM, Lord_Farin lord_farin@proofwiki.org wrote:
The probability of displaying a "bad" page would be:
B q ((p B)^N - 1) / (p B - 1) + B (p B)^N
(modulo errors), where B is the fraction of bad pages, p is the probability of repeating, q is the probability of displaying (so p+q = 1), and N is the allowed number of repetitions.
I'm going to rewrite that as: B (1-p) ((p B)^N - 1) / (p B - 1) + B (p B)^N ...and I'm also going to take your word on the math, because my brain is lazy this morning.
Let's run the numbers, assuming the 500,000 articles the Swedish wiki had in September 2012 were all good, and the million articles added since are all bad. Thus B = 2/3. Let's start with N at 5, so in the worst case we're doing six times as many SQL queries (the initial draw plus up to N re-draws). p is the tunable parameter. So if:
p = 0.00: prob of getting a bad page = 67% (sanity check; this is what they've got now)
p = 0.50: prob of getting a bad page = 50%
p = 0.75: prob of getting a bad page = 34%
p = 0.80: prob of getting a bad page = 30%
p = 0.90: prob of getting a bad page = 20%
p = 0.95: prob of getting a bad page = 15%
p = 1.00: prob of getting a bad page = 9% (this floor is set by N)
If you let N go up to 10, then:
p = 0.90: prob of getting a bad page = 17%
p = 0.95: prob of getting a bad page = 10%
p = 1.00: prob of getting a bad page = 1%
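For reference, these percentages can be reproduced directly from the closed-form expression (rounded to whole percent):

    B = 2 / 3  # fraction of "bad" pages in the scenario above

    def bad_prob(p, N):
        q = 1 - p
        return B * q * ((p * B) ** N - 1) / (p * B - 1) + B * (p * B) ** N

    for N in (5, 10):
        for p in (0, 0.5, 0.75, 0.80, 0.90, 0.95, 1.00):
            print(f"N={N:>2}  p={p:.2f}  bad page shown {bad_prob(p, N):.0%} of the time")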
My expectation is that about a 10% chance of getting a 'bad page' would make Swedish Wikipedians happy, so I'd recommend p=1, N=5. But the knobs can be twiddled. --scott
Just add all the non-bot articles to a category and use Special:RandomInCategory. ;-)
2013/8/23 Lars Aronsson lars@aronsson.se
A naive approach would be to ask for a random article that wasn't created by a bot, but that misses the point. Users want bot-generated articles to come up, just not so often.
Unfortunately, there is no exact record of whether the creator of an article was a human or a bot, and I have long been sad about this. The bot flag of an edit is only stored in the recentchanges table for a while, and cannot be retrieved from the page history.
Tell me if I am out of date, and I will open a bottle of champagne at once.
On Fri, Aug 23, 2013 at 1:38 PM, Bináris wikiposta@gmail.com wrote:
Unfortunately, there is no exact record of whether the creator of an article was a human or a bot, and I have long been sad about this. The bot flag of an edit is only stored in the recentchanges table for a while, and cannot be retrieved from the page history.
That's pretty much accurate, at least until there is a way to add or change revision tags through the API.
-- Tyler Romeo, Stevens Institute of Technology, Class of 2016, Major in Computer Science. www.whizkidztech.com | tylerromeo@gmail.com
Hi Lars,
On 23-8-2013 2:48, Lars Aronsson wrote:
The current implementation of [[Special:Random]], however, gives equal weight to every existing article and this is perceived as a problem that needs to be fixed.
We talked about this at Wikimania.
If you compare our current implementation to a wheel of fortune [1], all our articles are spread evenly around the wheel. A weighted version would place the bot articles closer together, so you would hit them less often. You just need a good algorithm to calculate this distribution.
You could implement this algorithm as a MediaWiki extension that updates page_random with this different distribution. That way you don't need to change the database schema, only the logic at page save.
Maarten
[1] https://commons.wikimedia.org/wiki/File:Wheel_of_Fortune_template.svg
On Sat, Aug 24, 2013 at 6:38 PM, Maarten Dammers maarten@mdammers.nl wrote:
If you compare our current implementation to a wheel of fortune [1], all our articles are spread evenly around the wheel. A weighted version would place the bot articles closer together, so you would hit them less often. You just need a good algorithm to calculate this distribution.
You could implement this algorithm as a MediaWiki extension that updates page_random with this different distribution. That way you don't need to change the database schema, only the logic at page save.
As a simple implementation: normally articles are saved with a random sort key between 0 and 1. If bot articles were instead saved with a random sort key between 0 and 0.1, they could be expected to be seen by Special:RandomPage about 10 times less often than before. (I.e., if there was a b% chance of getting a bot page from Special:RandomPage previously, there would now be roughly a (b/10)% chance.) --scott
ps. If you want to be numerically precise, you need to be more careful with the edge conditions. For example, if there are no non-bot articles, then 90% of the time the algorithm will end up wrapping around and choosing the lowest-sorted bot article, which would warp the expected distribution. It would be more correct to "re-roll the dice" in that case, which introduces an extra term into the probability and ends up resolving the apparent contradiction when b = 100%.
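A small self-contained sketch of that sort-key scheme, assuming a simplified model of Special:Random (pick a uniform threshold, take the first page at or above it, wrap around if none); the 2:1 bot-to-human ratio from the thread is scaled down for speed, and the wrap-around caveat from the ps above is left unhandled:

    import bisect
    import random

    BOT_RANGE = 0.1  # bot pages get page_random in [0, 0.1); hand-written pages in [0, 1)

    def sort_key(bot_created: bool) -> float:
        """The value this scheme would store in page_random at save time."""
        return random.random() * (BOT_RANGE if bot_created else 1.0)

    def special_random(pages):
        """Simplified model of Special:Random: first page whose sort key is
        >= a uniform threshold, wrapping to the start if there is none."""
        r = random.random()
        i = bisect.bisect_left(pages, (r, ""))
        return pages[i][1] if i < len(pages) else pages[0][1]

    if __name__ == "__main__":
        # Scaled-down version of the thread's numbers: two bot pages per hand-written page.
        pages = sorted([(sort_key(True), "bot") for _ in range(10_000)]
                       + [(sort_key(False), "human") for _ in range(5_000)])
        trials = 50_000
        hits = sum(special_random(pages) == "bot" for _ in range(trials))
        print(f"bot pages shown {hits / trials:.0%} of the time (vs. 67% unweighted)")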