Ok, so attachments don't work. Thus, I copy and paste my brief correspondence with Jimbo. See below.
------------------------------------------------------------------------------------------------------
Subject: Re: using the "random page" feature in academic studies
From: "Jimmy (Jimbo) Wales" jwales@wikia.com
Date: Thu, 16 Sep 2004 08:33:40 -0700
To: samuel sha@chello.se
Well, as it turns out, I do have a lack of knowledge, and so I recommend to you the wikitech-l mailing list or the #mediawiki IRC channel on freenode.net.
I am unsure how the random article function works; it may avoid certain types of articles for some reason. I don't actually know.
In terms of a study of quality, I do think that "random article" may be a misleading starting point, but of course this depends on the interpretation of the results.
Wikipedia is a work in progress, and so what would likely be interesting would be to have "random articles" but sorted into categories depending on things like: how many edits, how long the article has been around, how "stable" it is (i.e. if it got a ton of edits in the past but has reached an equilibrium now), how long it is, etc.
Any random article is probably not as good as Britannica, for example. But "featured articles" are generally much much better.
--Jimbo
samuel wrote:
Hi!
First off, thank you for bringing one of the best ideas in the history of the Internet to fruition. With that out of the way, onwards to the actual point of this email: I am about to conduct a study of the quality of Wikipedia's content. In order to do this, I will need to randomly select articles, and for this, the "random page" feature appears a natural choice. However, I cannot find any information on how it works, and it is essential that such information is described in the methodology section of the study. The questions I have regarding the selection are the following:
- What counts as a page?
- Is the selection done from all pages/articles or a subset of them?
- Is there any weight attached to outcomes (for example, so that a very popular or frequently edited article would have a higher chance of appearing as a random article)?
If you, for whatever reason, have a better idea than using the "random page" feature for a study of this sort, I would be very glad if you could let me know. Also, if you are unable to answer this email for lack of knowledge (I find this hard to believe, but you never know) or time constraints, or whatever, I would greatly appreciate if you could point me in the direction of someone who is more likely to be up for the task.
Thank you very much!
Best regards, Samuel Härgestam, undergrad in mathematics, computer science and philosophy at Stockholm University, as well as a true Wikipedia lover
Here's some relevant code snippets from SpecialRandomPage.php:
$rand = mt_rand() / mt_getrandmax();
$randstr = number_format( $rand, 12, ".", "" );
$sqlget = "SELECT cur_id,cur_title FROM $cur $use_index WHERE cur_namespace=0 AND cur_is_redirect=0 $extra AND cur_random>$randstr ORDER BY cur_random LIMIT 1";
Here's the PHP documentation on mt_rand():
The salient points are:
* A random number is generated using the PHP function mt_rand()
* That number is used to find an article in the default namespace (not a redirect) with the closest greater cur_random value than the generated number.
* The cur_random value is generated with mt_rand() and saved with the article in Article.php.
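The points above can be sketched as a toy model (Python used for illustration; the real code is the PHP/MySQL shown earlier, and the helper names here are made up):

```python
import bisect
import random

def make_cur_random():
    # Rough equivalent of number_format(mt_rand() / mt_getrandmax(), 12, ".", ""):
    # a uniform value in [0, 1] kept to 12 decimal places.
    return round(random.random(), 12)

def random_page(articles):
    """Toy model of Special:Randompage.

    `articles` is a list of (cur_random, cur_title) pairs for non-redirect
    pages in the main namespace.  Mirrors
    "WHERE cur_random > $randstr ORDER BY cur_random LIMIT 1".
    """
    ordered = sorted(articles, key=lambda a: a[0])
    keys = [a[0] for a in ordered]
    randstr = make_cur_random()
    i = bisect.bisect_right(keys, randstr)  # first index with cur_random > randstr
    if i == len(ordered):
        return None  # no article above randstr; this edge case is not modeled further
    return ordered[i]
```

This is a sketch only; in particular, how MediaWiki handles the case where randstr exceeds every stored cur_random is not modeled.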
It appears that if multiple articles should happen to share the same cur_random number, you could end up with a slight bias, since then the choice of which article to return is determined by how the database decides to order the articles, and if this isn't arbitrary then some articles are less likely to be returned than others. The degree of non-randomness that this implies probably depends on the number of articles and the range of the assignable random numbers.
Hope this helps, Alan
Thanks!
It seems to me that however the database chooses to order articles with the same value for cur_random, there will be a bias, regardless of whether the order is arbitrary, since even an arbitrary order is static. Let's take an example to see if I understand this correctly. Say that we have two articles X and Y with cur_random(X) = cur_random(Y). Whenever randstr falls just below this shared value (i.e. there is no other article whose cur_random y satisfies randstr < y < cur_random(X)), the database always returns the same one of the two, whether that is X or Y. So the other article will never be returned, whatever the value of randstr.
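A quick simulation supports this reading of the code (illustrative Python; the static tie-break order is modeled with a stable sort):

```python
import random

def pick(articles, randstr):
    """Return the first stored article with cur_random > randstr.

    sorted() is stable, so articles sharing a cur_random value keep a fixed
    relative order -- this models a database whose tie-break order is static.
    """
    for cur_random, title in sorted(articles, key=lambda a: a[0]):
        if cur_random > randstr:
            return title
    return None

# X and Y share a cur_random value; Z sits below them.
articles = [(0.5, "X"), (0.5, "Y"), (0.2, "Z")]

hits = {"X": 0, "Y": 0, "Z": 0}
random.seed(1)
for _ in range(10_000):
    title = pick(articles, random.random())
    if title is not None:
        hits[title] += 1

# Y is unreachable: every randstr that could select it selects X first.
```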
Thus it seems my question is transformed to:
1) How is cur_random set in the first place?
2) What is the precision of cur_random and randstr? (I don't speak PHP, so I can't tell from the code.)
3) How many articles are there?
With this information, it should be possible to calculate the risk of error in the future survey.
--Samuel
Alan Wessman wrote:
Here's some relevant code snippets from SpecialRandomPage.php:
$rand = mt_rand() / mt_getrandmax();
$randstr = number_format( $rand, 12, ".", "" );
$sqlget = "SELECT cur_id,cur_title FROM $cur $use_index WHERE cur_namespace=0 AND cur_is_redirect=0 $extra AND cur_random>$randstr ORDER BY cur_random LIMIT 1";
Here's the PHP documentation on mt_rand():
The salient points are:
- A random number is generated using the PHP function mt_rand()
- That number is used to find an article in the default namespace (not a redirect) with the closest greater cur_random value than the generated number.
- The cur_random value is generated with mt_rand() and saved with the article in Article.php.
It appears that if multiple articles should happen to share the same cur_random number, you could end up with a slight bias, since then the choice of which article to return is determined by how the database decides to order the articles, and if this isn't arbitrary then some articles are less likely to be returned than others. The degree of non-randomness that this implies probably depends on the number of articles and the range of the assignable random numbers.
Hope this helps,
Alan
_______________________________________________
Wikitech-l mailing list
Wikitech-l@wikimedia.org
http://mail.wikipedia.org/mailman/listinfo/wikitech-l
On Sep 16, 2004, at 1:34 PM, samuel wrote:
- How is cur_random set in the first place?
A random number is selected and set at article creation.
- What is the precision of cur_random and randstr? (I don't speak PHP, so I can't tell from the code.)
mt_getrandmax() is (2^31)-1, so there should be ~2 billion possible random numbers.
cur_random is of type 'double', which I believe is 64-bit floating point. I _think_ PHP uses doubles for floating-point numbers internally, but I'm not sure.
On the SQL side, the numbers are converted to decimal and back, so some precision may be lost. Queries look something like this:
SELECT cur_id,cur_title FROM cur USE INDEX (cur_random) WHERE cur_namespace=0 AND cur_is_redirect=0 AND cur_random>0.440311077722 ORDER BY cur_random LIMIT 1
- How many articles are there?
http://en.wikipedia.org/wiki/Special:Statistics
-- brion vibber (brion @ pobox.com)
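Brion's figure allows a rough birthday-problem bound on how often two articles end up sharing a cur_random value. A sketch (the article count N is a made-up placeholder, not the actual 2004 figure; Special:Statistics would give the real number):

```python
# Birthday-problem estimate of how many cur_random collisions to expect.
M = 2**31 - 1   # distinct raw values mt_rand() can produce, per Brion
N = 350_000     # hypothetical article count -- substitute the real figure

expected_colliding_pairs = N * (N - 1) / (2 * M)
```

With these example numbers the estimate comes out near 28 colliding pairs, so at most a few dozen articles out of N would be shadowed by ties -- likely negligible for a quality survey, though the exact figure depends on the real article count.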
On Thu, Sep 16, 2004 at 01:44:45PM -0600, Alan Wessman wrote:
Here's some relevant code snippets from SpecialRandomPage.php:
- A random number is generated using the PHP function mt_rand()
- That number is used to find an article in the default namespace (not a redirect) with the closest greater cur_random value than the generated number.
- The cur_random value is generated with mt_rand() and saved with the article in Article.php.
Some pages have a cur_random of 0, and will never be chosen by Special:Randompage.
Regards,
JeLuF
On Sep 16, 2004, at 1:42 PM, Jens Frank wrote:
Some pages have a cur_random of 0, and will never be chosen by Special:Randompage.
A quick spot-check on enwiki indicates that this is probably due to a bug in Special:Movepage; all the afflicted pages I checked began as redirects created by a page move and were subsequently edited into articles. Most likely they were created without a cur_random assigned.
Should check if this bug is still current, and correct the remaining affected articles.
-- brion vibber (brion @ pobox.com)
Brion Vibber brion@ikso.net wrote:
Should check if this bug is still current, and correct the remaining affected articles.
In case you didn't notice already:
Index: Title.php
===================================================================
RCS file: /cvsroot/wikipedia/phase3/includes/Title.php,v
retrieving revision 1.116
diff -u -u -r1.116 Title.php
--- Title.php	14 Sep 2004 05:35:34 -0000	1.116
+++ Title.php	17 Sep 2004 03:57:52 -0000
@@ -1101,6 +1101,8 @@
 		$dbw =& wfGetDB( DB_MASTER );
 		$now = $dbw->timestamp();
 		$won = wfInvertTimestamp( wfTimestamp(TS_MW,$now) );
+		wfSeedRandom();
+		$rand = number_format( mt_rand() / mt_getrandmax(), 12, '.', '' );
 		# Rename cur entry
 		$dbw->updateArray( 'cur',
@@ -1127,6 +1129,7 @@
 			'inverse_timestamp' => $won,
 			'cur_touched' => $now,
 			'cur_is_redirect' => 1,
+			'cur_random' => $rand,
 			'cur_is_new' => 1,
 			'cur_text' => "#REDIRECT [[" . $nt->getPrefixedText() . "]]\n"
 		), $fname );
The initial code for conversion between Simplified and Traditional Chinese is now in CVS. It uses two conversion tables to perform normal SC->TC and TC->SC conversion. A new wiki tag was introduced to specify special conversions without using the conversion tables, for example, -{zh-cn some_phrase zh-tw some_other_phrase zh-hk yet_some_other_phrase}-. Using -{some_phrase}- alone will display some_phrase for all without any conversion. A quick hack also allows the UI messages to be converted this way, but this is likely to be replaced in the future by a more general way of switching UIs.
Thanks to Tim and Jens at IRC for help!
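The -{...}- syntax described above can be illustrated with a toy converter (a hypothetical Python sketch; the real implementation is the PHP language converter in MediaWiki, and the exact tokenization rules are assumed from the description):

```python
import re

VARIANTS = {"zh-cn", "zh-tw", "zh-hk"}
TAG = re.compile(r"-\{(.*?)\}-")

def convert(text, variant):
    """Replace each -{...}- tag with the phrase for `variant`.

    If the tag body names no variants, it is shown verbatim for everyone,
    matching the -{some_phrase}- behaviour described above.
    """
    def repl(m):
        tokens = m.group(1).split()
        if not tokens or tokens[0] not in VARIANTS:
            return m.group(1)  # no conversion
        mapping, code = {}, None
        for tok in tokens:
            if tok in VARIANTS:
                code = tok
            elif code:
                mapping[code] = (mapping.get(code, "") + " " + tok).strip()
        # Fall back to the raw body if the reader's variant is not listed.
        return mapping.get(variant, m.group(1))
    return TAG.sub(repl, text)
```

The fallback behaviour for an unlisted variant is an assumption of this sketch, not something the announcement specifies.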
zhengzhu wrote:
A new wiki tag was introduced to specify special conversions without using the conversion tables, for example, -{zh-cn some_phrase zh-tw some_other_phrase zh-hk yet_some_other_phrase}-. Using -{some_phrase}- alone will display some_phrase for all without any conversion.
Wow. I wonder if we could use the same syntax to specify the various spellings of English. Then finally users will no longer need to see a mix.
Timwi
Timwi wrote:
Wow. I wonder if we could use the same syntax to specify the various spellings of English. Then finally users will no longer need to see a mix.
It seems like it should, although it may complicate things by making for some messy syntax. The main cases I can think of offhand that would require some marking up are ones that shouldn't be "translated", such as proper names (you don't want to do color->colour if it's the American Color Movie Association or something).
-Mark
On Fri, 17 Sep 2004 17:30:23 +0100, Timwi timwi@gmx.net wrote:
Wow. I wonder if we could use the same syntax to specify the various spellings of English. Then finally users will no longer need to see a mix.
I guess this presupposes a nice way of choosing between en-us and en-gb in the wiki preferences: I wouldn't put much money on many people's browsers being set up to request the right variant (just think how many people still complain that they can't type "£", because Windows thinks they have a US keyboard!)
Rowan Collins wrote:
On Fri, 17 Sep 2004 17:30:23 +0100, Timwi timwi@gmx.net wrote:
Wow. I wonder if we could use the same syntax to specify the various spellings of English. Then finally users will no longer need to see a mix.
I guess this presupposes a nice way of choosing between en-us and en-gb in the wiki preferences: I wouldn't put much money on many people's browsers being set up to request the right variant (just think how many people still complain that they can't type "£", because Windows thinks they have a US keyboard!)
Zhengzhu is currently coding a UI in the user preferences which will allow Chinese users to select between Traditional Chinese and Simplified Chinese. The same interface could be used for variants of any language.
Timwi
I guess it could. There is now a way to specify preferred language variant in the user's preference setting page. See LanguageZh.php:getVariants(). I guess it will be nice to have a more general way of parsing the -{}- tag, to handle more general situations.
On Fri, 17 Sep 2004 18:27:06 +0100, Rowan Collins rowan.collins@gmail.com wrote:
On Fri, 17 Sep 2004 17:30:23 +0100, Timwi timwi@gmx.net wrote:
Wow. I wonder if we could use the same syntax to specify the various spellings of English. Then finally users will no longer need to see a mix.
I guess this presupposes a nice way of choosing between en-us and en-gb in the wiki preferences: I wouldn't put much money on many people's browsers being set up to request the right variant (just think how many people still complain that they can't type "£", because Windows thinks they have a US keyboard!)
-- Rowan Collins BSc [IMSoP]
It horrifies me personally to think about doing anything of this sort, i.e. having markup to specify en-us and en-gb spellings. What a waste of energy, if you ask me.
Rowan Collins wrote:
On Fri, 17 Sep 2004 17:30:23 +0100, Timwi timwi@gmx.net wrote:
Wow. I wonder if we could use the same syntax to specify the various spellings of English. Then finally users will no longer need to see a mix.
I guess this presupposes a nice way of choosing between en-us and en-gb in the wiki preferences: I wouldn't put much money on many people's browsers being set up to request the right variant (just think how many people still complain that they can't type "£", because Windows thinks they have a US keyboard!)
-- Rowan Collins BSc [IMSoP]
Jimmy (Jimbo) Wales wrote:
It horrifies me personally to think about doing anything of this sort, i.e. having markup to specify en-us and en-gb spellings. What a waste of energy, if you ask me.
Jimmy... out of everyone here, you should know best that Wikipedia isn't about what you or anyone thinks is a waste of time, but what people *actually* waste their time on.
If people are willing to spend some time using this syntax to specify both spellings, why not let them? It might seem like a little thing to many if not most people, but it would certainly give Wikipedia's content a more professional and consistent overall appearance.
Timwi
Timwi wrote:
If people are willing to spend some time using this syntax to specify both spellings, why not let them? It might seem like a little thing to many if not most people, but it would certainly give Wikipedia's content a more professional and consistent overall appearance.
One reason is that adding more syntax increases a problem that I already think is growing -- which is excessive syntax for newcomers. When you click "edit this page", you should be presented with something as obvious as possible.
I'm not making a decree or anything, mind you. Rather, I'm expressing an opinion that adding software features to deal with US/UK spelling differences is a bad idea. I feel the same way about auto-conversion of units. If community opinion differs overwhelmingly from mine, I will not stand in the way of course. But I don't think community opinion differs on these points.
--Jimbo
Jimmy (Jimbo) Wales wrote:
It horrifies me personally to think about doing anything of this sort, i.e. having markup to specify en-us and en-gb spellings. What a waste of energy, if you ask me.
Perhaps a better use for it would be to merge the Indonesian and Malaysian wikis. The two dialects are, according to some sources, only as different as en-us and en-gb.
-- Tim Starling
On Sep 16, 2004, at 12:16 PM, samuel wrote:
I am about to conduct a study of the quality of Wikipedia's content. In order to do this, I will need to randomly select articles, and for this, the "random page" feature appears a natural choice.
Depending on what you're trying to measure, simply taking a random selection of pages and looking at them in isolation may or may not be useful to you. Wikipedia is a permanent work in progress; its purpose is to generate content and grow and develop it over time. The actual distribution of a published or "stable" encyclopedia is a distinct project from the free-for-all editing on the wiki.
Due to Wikipedia's nature it can be fully expected that many, many pages at any given time will be found wanting, because they are new or unpopular subjects and thus insufficiently developed. By the same token, Wikipedia can be fully expected to cover many topics that more traditional encyclopedias don't cover at all, or cover in less detail.
If you're interested in comparing the quality of Wikipedia articles against that of more traditional encyclopedias, consider also doing random selection of topics covered by those encyclopedias, then seeking the same particular topics in Wikipedia for comparison. (And vice-versa!)
However, I cannot find any information on how it works, and it is essential that such information is described in the methodology section of the study. The questions I have regarding the selection are the following:
- What counts as a page?
Returnable pages are those in article namespace (not talk pages, user pages, etc) and not marked as redirects.
- Is the selection done from all pages/articles or a subset of them?
From all articles that meet the above criteria, with one exception: on en.wikipedia.org the random selection has been hacked not to return pages last edited by the accounts Ram-Man or Rambot. This is a crude (and in my view unfortunate) hack to appease complaints that the random page function returned articles on cities and towns in the US too often.
- Is there any weight attached to outcomes (for example, so that a very popular or frequently edited article would have a higher chance of appearing as a random article)?
No.
You are of course welcome to download the article database from http://download.wikimedia.org/ and select & sort articles in any fashion you see fit.
-- brion vibber (brion @ pobox.com)
Brion Vibber wrote:
On Sep 16, 2004, at 12:16 PM, samuel wrote:
I am about to conduct a study of the quality of Wikipedia's content. In order to do this, I will need to randomly select articles, and for this, the "random page" feature appears a natural choice.
Depending on what you're trying to measure, simply taking a random selection of pages and looking at them in isolation may or may not be useful to you. Wikipedia is a permanent work in progress; its purpose is to generate content and grow and develop it over time. The actual distribution of a published or "stable" encyclopedia is a distinct project from the free-for-all editing on the wiki.
Due to Wikipedia's nature it can be fully expected that many, many pages at any given time will be found wanting, because they are new or unpopular subjects and thus insufficiently developed. By the same token, Wikipedia can be fully expected to cover many topics that more traditional encyclopedias don't cover at all, or cover in less detail.
Yes, I am quite familiar with the nature of Wikipedia and how it works. The object of the study is to measure the quality of the current (by "current" I mean the Wikipedia that exists at a certain, as yet undetermined, moment in time) online version of Wikipedia, since that is what people actually use. There has been a lot of debate as to how reliable this version of Wikipedia is, and it is that debate I am interested in settling.
If you're interested in comparing the quality of Wikipedia articles against that of more traditional encyclopedias, consider also doing random selection of topics covered by those encyclopedias, then seeking the same particular topics in Wikipedia for comparison. (And vice-versa!)
This is indeed what will be done. It is for the "vice-versa" part that I considered using the "random page" feature to select articles.
However, I cannot find any information on how it works, and it is essential that such information is described in the methodology section of the study. The questions I have regarding the selection are the following:
- What counts as a page?
Returnable pages are those in article namespace (not talk pages, user pages, etc) and not marked as redirects.
- Is the selection done from all pages/articles or a subset of them?
From all articles that meet the above criteria, with one exception: on en.wikipedia.org the random selection has been hacked not to return pages last edited by the accounts Ram-Man or Rambot. This is a crude (and in my view unfortunate) hack to appease complaints that the random page function returned articles on cities and towns in the US too often.
Interesting. Are Ram-Man and Rambot actual users who submitted such changes, or are the accounts created for the specific purpose of identifying the articles to be ignored?
- Is there any weight attached to outcomes (for example, so that a very popular or frequently edited article would have a higher chance of appearing as a random article)?
No.
According to other posters to the list, there are two ways that certain articles could be under-weighted: first, those that have cur_random = 0, and second, those which share a cur_random value with at least one other article.
You are of course welcome to download the article database from http://download.wikimedia.org/ and select & sort articles in any fashion you see fit.
That might very well be the best idea, and what I will actually do in the end. I just thought it convenient to use the "random page" feature if it indeed worked in the way that the study demands.
-- brion vibber (brion @ pobox.com)
On Sep 16, 2004, at 1:59 PM, samuel wrote:
Interesting. Is Ram-Man and Rambot actual users who submitted such changes, or are the accounts created for the specific purpose of identifying the articles to be ignored?
Ram-Man is an actual human; he set up a bot to do a mass import of stub pages on cities and towns generated from data from the 2000 US census, which created some 30000 pages. After a while he switched the bot to its own account, Rambot.
See: http://en.wikipedia.org/wiki/User:Rambot
According to other posters to the list, there are two ways that certain articles could have negative weights associated with them, the first being those that have cur_random = 0
This only happens in the case of bugs in the software.
and the other being those which share cur_random with at least one other article.
This should be pretty rare and hopefully doesn't happen. :) If it turns out not to be rare, a 'jitter' could perhaps be added to compensate.
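The 'jitter' idea might look something like this (a hypothetical Python sketch, not anything currently in MediaWiki; the function name and scale are invented):

```python
import random

def jitter(cur_random, scale=1e-12):
    """Nudge a stored cur_random by a tiny random offset so that exact
    ties become distinct values, while staying inside [0, 1]."""
    return min(max(cur_random + random.uniform(-scale, scale), 0.0), 1.0)

random.seed(7)
a = jitter(0.5)  # two articles that previously tied at 0.5 ...
b = jitter(0.5)  # ... now almost surely get distinct values
```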
-- brion vibber (brion @ pobox.com)
samuel wrote:
Yes, I am quite familiar with the nature of Wikipedia and how it works. The object of the study is to measure the quality of the current (by "current" I mean the Wikipedia that exists at a certain, as of yet undetermined, moment in time) online version of Wikipedia, since that is what people actually use.
Excellent. For this, one slight improvement I suggest, rather than selecting random pages, is to select random pages with a weighting adjusted for the traffic to those pages. What we're really interested in is the quality of the content that people actually use, as opposed to obscure "dead-end", orphan or near-orphan pages that end users seldom see.
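The weighting described here amounts to sampling pages with probability proportional to traffic. A minimal sketch (the page-view counts are invented; real numbers would have to come from server logs):

```python
import random

# Hypothetical (title, monthly page views) pairs standing in for real
# traffic statistics.
pages = [("Popular article", 90_000), ("Average article", 9_000), ("Near-orphan", 12)]

titles = [title for title, _ in pages]
views = [v for _, v in pages]

random.seed(42)
sample = random.choices(titles, weights=views, k=5)  # sampling with replacement
```

random.choices samples with replacement; a without-replacement survey would need something like weighted reservoir sampling instead.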
This is indeed what will be done. It is for the "vice-versa" part that I considered using the "random page" feature to select articles.
It's a good idea -- some form of randomization is a great way to look at it. Probably the random pages feature is too crude to really answer the most interesting questions, though.
--Jimbo
Jimmy (Jimbo) Wales wrote:
samuel wrote:
Yes, I am quite familiar with the nature of Wikipedia and how it works. The object of the study is to measure the quality of the current (by "current" I mean the Wikipedia that exists at a certain, as of yet undetermined, moment in time) online version of Wikipedia, since that is what people actually use.
Excellent. For this, one slight improvement I suggest, rather than selecting random pages, is to select random pages with a weighting adjusted for the traffic to those pages. What we're really interested in is the quality of the content that people actually use, as opposed to obscure "dead-end", orphan or near-orphan pages that end users seldom see.
I agree. The study will (hopefully -- this project cannot be allowed to grow arbitrarily large) measure the quality of Wikipedia sample pages collected in several different ways. Examples of categories that are interesting to look at with respect to quality:
1) Completely random pages
2) Most popular pages
3) Least popular pages
4) Most edited pages
5) Least edited pages
6) Most recently edited pages
7) Least recently edited pages
These are not the only possible selections (an obvious omission is selections from the usual categories of human knowledge -- technology, culture, etc.; it would be interesting to see, for example, whether technology articles are less likely to contain factual errors than articles about art). However, the first category is important, since there is no obvious analogue to, say, "most recently edited pages" in other encyclopedias, and some comparison to other encyclopedias is necessary if the scale on which Wikipedia's quality is judged is to be truly meaningful. (By the way, if anyone has ideas for methods to select articles from these other categories as well, by all means, let me know! Some of them are rather straightforward, such as "most recently edited pages", but others much less so.)
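Given a database dump, most of the listed categories reduce to sorting per-page metadata by one key and taking the head or tail of the list (a sketch with made-up fields and values):

```python
import random

# Hypothetical per-page metadata extracted from a dump.
pages = [
    {"title": "A", "edits": 120, "last_edit": "2004-09-15"},
    {"title": "B", "edits": 3,   "last_edit": "2001-06-02"},
    {"title": "C", "edits": 45,  "last_edit": "2004-09-16"},
]

def top(pages, key, n=1, reverse=True):
    """Titles of the n pages ranked highest (or lowest) on `key`."""
    return [p["title"] for p in sorted(pages, key=lambda p: p[key], reverse=reverse)[:n]]

most_edited = top(pages, "edits")
least_edited = top(pages, "edits", reverse=False)
most_recent = top(pages, "last_edit")       # ISO dates sort lexicographically
random_pick = random.choice(pages)["title"]  # category 1: completely random
```

Popularity (categories 2 and 3) would need traffic data, which is not in the dump itself.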
--Samuel
This is indeed what will be done. It is for the "vice-versa" part that I considered using the "random page" feature to select articles.
It's a good idea -- some form of randomization is a great way to look at it. Probably the random pages feature is too crude to really answer the most interesting questions, though.
--Jimbo
Brion Vibber wrote:
On Sep 16, 2004, at 12:16 PM, samuel wrote:
I am about to conduct a study of the quality of Wikipedia's content. In order to do this, I will need to randomly select articles, and for this, the "random page" feature appears a natural choice.
Depending on what you're trying to measure, simply taking a random selection of pages and looking at them in isolation may or may not be useful to you. Wikipedia is a permanent work in progress; its purpose is to generate content and grow and develop it over time. The actual distribution of a published or "stable" encyclopedia is a distinct project from the free-for-all editing on the wiki.
On a similar note, since Wikipedia is strongly a hypertext work (at least in its current HTML-based form; a future print version would be different), random sampling articles from the flat article-space doesn't measure the same sort of experience that users would see. It probably measures something, and possibly something valuable, but it's another thing to keep in mind: people often arrive at Wikipedia pages through other Wikipedia pages, in a non-uniformly-random way.
-Mark
Brion Vibber wrote:
From all articles that meet the above criteria, with one exception: on en.wikipedia.org the random selection has been hacked not to return pages last edited by the accounts Ram-Man or Rambot. This is a crude (and in my view unfortunate) hack to appease complaints that the random page function returned articles on cities and towns in the US too often.
Did you do this by hacking the code (adding to the SQL query) or the DB (setting the relevant cur_random fields to 0)? What you said sounded like the former, but personally I think the latter makes more sense (it would automatically heal over time as articles are edited by other people and are assigned a cur_random that way).
Timwi
On Sep 17, 2004, at 9:19 AM, Timwi wrote:
Brion Vibber wrote:
on en.wikipedia.org the random selection has been hacked not to return pages last edited by the accounts Ram-Man or Rambot.
Did you do this by hacking the code (adding to the SQL query) or the DB (setting the relevant cur_random fields to 0)?
It wasn't me that did it, but here's the hack in the settings:
'wgExtraRandompageSQL' => array( 'enwiki' => 'cur_user<>3903 AND cur_user<>6120', ),
What you said sounded like the former, but personally I think the latter makes more sense (it would automatically heal over time as articles are edited by other people and are assigned a cur_random that way).
That would *not* heal, as currently cur_random is assigned only at creation time, not on further edits. The current hack *does* heal, as the last-editor user ID will change when it's further edited. As long as Ram-Man is not still editing with his own account, anyway. ;)
-- brion vibber (brion @ pobox.com)