(I tried to post this question before but was not properly registered for the mailing list. If this is a repeat I apologize.)
I need some guidance on how to get some data out of the query service. I signed up for an account, but I'm not sure whether I'm supposed to create an "issue" or follow some other process. I'm also not sure whether the query service is the right way to go about this.
In short, as part of a graduate research project, I need to select about 100 Wikipedia articles and find the most frequent editors of each (there is a "contributors" tool which ranks them), such as those with more than 10 edits to the article. Then, for each of these editors, I need to generate a list of all pages they have edited, with a frequency count for each.
Does this sound like the query service is the right way to go about collecting this data and if so can someone point me to the proper procedure for making such a request?
Thanks.
Jim Hutchinson wrote:
Does this sound like the query service is the right way to go about collecting this data and if so can someone point me to the proper procedure for making such a request?
If you get a toolserver account, you don't need the query service, as you could run the queries yourself. Otherwise, requesting such queries against the toolserver tables is one way to do it. Another way would be for you to download the stub-articles.xml.gz file from http://dumps.wikimedia.org/ and process it to fetch that data. 100 articles is easy to do.
Jim Hutchinson wrote:
Does this sound like the query service is the right way to go about collecting this data and if so can someone point me to the proper procedure for making such a request?
Probably not.
https://wiki.toolserver.org/view/Query_service explains the query service. It's largely for requests such as "I'd like to get the name of every article on the English Wikipedia containing a deleted image" or something like that. Those types of requests have a lot of results and would be completely impractical to do via other means (such as individual HTTP requests to the API). It's vastly simpler to query the database directly and output the results to a text file or wherever.
https://wiki.toolserver.org/view/Account_approval_process explains how to obtain a Toolserver shell account. If you were going to build a tool to study article contributors or you needed to run queries frequently, it would make sense for you to apply for an account. In this case, because your scope is so narrow, it doesn't make much sense for you to apply for an account.
You can try using the dumps, as Platonides suggested, or what I would do in this case is use the public API. Every Wikimedia wiki has one, for example: http://en.wikipedia.org/w/api.php. That allows you to easily get lists of the contributors to any page, which you can then aggregate and analyze.
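As a rough illustration of the API route, here is a minimal Python sketch. It assumes the standard `action=query` / `prop=revisions` parameters and the JSON layout in which pages are keyed by page id; the sample response is hand-written, not real data, and a page with more revisions than one request returns would need continuation handling that is left out here.

```python
from collections import Counter
from urllib.parse import urlencode

API_URL = "http://en.wikipedia.org/w/api.php"  # endpoint named in the message above


def revisions_url(title, limit=500):
    """Build a query URL listing the users behind a page's revisions."""
    params = {
        "action": "query",
        "prop": "revisions",
        "titles": title,
        "rvprop": "user",
        "rvlimit": limit,
        "format": "json",
    }
    return API_URL + "?" + urlencode(params)


def count_editors(api_response):
    """Tally edits per user from one decoded JSON response."""
    counts = Counter()
    for page in api_response["query"]["pages"].values():
        for rev in page.get("revisions", []):
            if "user" in rev:  # revisions with a hidden username carry no "user" key
                counts[rev["user"]] += 1
    return counts


# Hand-written sample in the assumed response shape (not real data).
sample = {"query": {"pages": {"1": {"title": "Science", "revisions": [
    {"user": "A"}, {"user": "B"}, {"user": "A"}]}}}}

print(revisions_url("Science"))
print(count_editors(sample).most_common())
```

Fetching the URL and decoding the JSON is deliberately left to whatever HTTP client you prefer; only the aggregation step is shown.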
Feel free to ask further questions on this list as necessary and appropriate.
MZMcBride
On Thu, Apr 28, 2011 at 5:19 PM, MZMcBride z@mzmcbride.com wrote:
You can try using the dumps, as Platonides suggested, or what I would do in this case is use the public API. Every Wikimedia wiki has one, for example: http://en.wikipedia.org/w/api.php. That allows you to easily get lists of the contributors to any page, which you can then aggregate and analyze.
Thank you for the information and feedback. What I need is somewhat more complicated than a list of contributors to a single page. In fact, there is already a tool in Wikipedia (just called contributors, I think) that lists all the contributors to an article and their number of edits. What I need to do is, using that list of contributors, select the top 20 or so (excluding bots) for each of the hundred selected articles and get a list of all of the other articles to which each of them contributed with a frequency count of edits. Ideally, this data would be in a table of sorts for each article selected (so 100 tables).
This could, of course, be done manually by searching for contributions by username. However, this would be time-consuming and possibly error-prone. My hope was that a query could grab this information fairly quickly as well as automatically count frequencies of edits per article, etc.
I don't have the expertise to do this myself, but I do know someone who can and has requested an account. However, he is afraid he will not be granted an account for what will likely be a one time project.
Is there likely an API that can do what I described or would a query be an easier or more efficient way to go?
Thanks again.
Jim Hutchinson wrote:
Is there likely an API that can do what I described or would a query be an easier or more efficient way to go?
Yeah, user contributions are one of those things that are sometimes vastly easier with the direct queries, as some users have thousands and thousands of edits. This is obviously a problem for very few sites (mostly some popular Wikimedia wikis), but for the wikis where it is a problem, it can really slow things down to have to pull/aggregate so many edits. :-)
You can try filing a ticket in JIRA through the Query service. Alternately, I recently made a page at Meta-Wiki (http://meta.wikimedia.org/wiki/Tech) where you could try posting. I mostly set up the page because I felt trying to get people to figure out JIRA, in addition to forcing them to articulate what they actually want from the database, is too much. Plus it can act as more of a forum/help desk than JIRA can.
Hope that helps,
MZMcBride
On Thu, 2011-04-28 at 21:49 -0600, Jim Hutchinson wrote:
Is there likely an API that can do what I described or would a query be an easier or more efficient way to go?
Technically, most of this shouldn't be too hard to do using SQL queries on the toolserver. One disadvantage, though, is that the toolserver does not have (direct) access to page text. This could be a problem if you, say, wanted to exclude reverts from the edit count, weight edits by the amount of text added or do some other kind of fine-grained processing.
Basically, you have three steps you want to do:
1. Select 200 random articles.
2. Get the top contributors for each of them.
3. Get the edit counts for those contributors.
The first step is easy, as long as the (not quite uniform) random page selection algorithm built into MediaWiki is good enough for you. You could do it using a Toolserver SQL query, or just by clicking the "random page" link 200 times (by hand or by bot), but the simplest way would probably be to use the API: http://www.mediawiki.org/wiki/API:Random
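For what it's worth, a request to the random-page API could be put together roughly like this. This is only a sketch: it assumes the `list=random` module with its `rnnamespace` and `rnlimit` parameters, the per-request limit of 10 is an assumption (the actual cap depends on the wiki and the account), and the sample response is made up.

```python
from urllib.parse import urlencode

API_URL = "http://en.wikipedia.org/w/api.php"


def random_pages_url(limit=10):
    """Build a URL asking for random main-namespace pages."""
    params = {
        "action": "query",
        "list": "random",
        "rnnamespace": 0,   # 0 = main (article) namespace
        "rnlimit": limit,   # assumed cap; repeat the request to reach 200
        "format": "json",
    }
    return API_URL + "?" + urlencode(params)


def titles_from_response(response):
    """Pull the page titles out of one decoded JSON response."""
    return [entry["title"] for entry in response["query"]["random"]]


# Made-up response in the assumed shape.
sample = {"query": {"random": [
    {"id": 1, "ns": 0, "title": "Science"},
    {"id": 2, "ns": 0, "title": "Physics"},
]}}

print(random_pages_url())
print(titles_from_response(sample))
```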
If you wanted a more uniform sample, you could download the page table SQL dump (page.sql.gz), extract the page titles from it (with appropriate filtering, e.g. to exclude redirects) and randomly select 200 of them.
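The sampling itself is short once the titles are extracted. The sketch below assumes the page table has already been reduced to (title, namespace, is_redirect) tuples; that record shape, the function name and the toy rows are all hypothetical, and parsing page.sql.gz is not shown.

```python
import random


def sample_titles(pages, k, seed=None):
    """Uniformly sample up to k article titles, skipping redirects and non-articles."""
    candidates = [title for title, namespace, is_redirect in pages
                  if namespace == 0 and not is_redirect]
    rng = random.Random(seed)  # a fixed seed makes the sample reproducible
    return rng.sample(candidates, min(k, len(candidates)))


# Toy rows standing in for the extracted page table.
pages = [
    ("Science", 0, False),
    ("Sci", 0, True),            # redirect, excluded
    ("Talk:Science", 1, False),  # not in the main namespace, excluded
    ("Physics", 0, False),
    ("Chemistry", 0, False),
]

picked = sample_titles(pages, 2, seed=42)
print(picked)
```

Recording the seed alongside the results lets someone else regenerate the same sample from the same dump.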
The second step could be easily done on the Toolserver, as long as you only wanted to count edits. For more fine-grained filtering based on page text, you could use Special:Export to obtain a "mini-dump" of the pages in your sample, including their full history, in XML format. Alternatively, the same information is also available using the API.
(The detail about excluding bots comes down to determining what is a bot. MediaWiki does feature a "bot flag", which can be used to filter out users having it. Unfortunately, for various reasons, not all bot accounts necessarily have the flag set. You might be able to filter out more bots by looking at, say, the categories on their user page, but ultimately you may still end up having to do some manual filtering.)
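A first-pass bot filter along those lines might look like the sketch below. The user records, their keys and the category hint are hypothetical; in practice the group membership and user-page categories would have to be fetched first, and the "suspects" list would still need a manual look.

```python
# Hypothetical hint: substring looked for in the categories of a user page.
BOT_CATEGORY_HINTS = ("bots",)


def is_probable_bot(user):
    """Return True if the account has the bot flag or a bot-like user-page category."""
    if "bot" in user.get("groups", ()):
        return True
    return any(hint in category.lower()
               for category in user.get("categories", ())
               for hint in BOT_CATEGORY_HINTS)


def split_bots(users):
    """Separate account names into (humans, suspected bots)."""
    humans = [u["name"] for u in users if not is_probable_bot(u)]
    suspects = [u["name"] for u in users if is_probable_bot(u)]
    return humans, suspects


# Made-up accounts for illustration.
users = [
    {"name": "A", "groups": ["user"]},
    {"name": "FlaggedBot", "groups": ["user", "bot"]},
    {"name": "UnflaggedBot", "groups": ["user"], "categories": ["Wikipedia bots"]},
]

humans, suspects = split_bots(users)
print(humans, suspects)
```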
The last step could, again, be fairly easily done on the Toolserver as long as you only wanted the raw edit counts. In fact, it would probably be best to start with that data anyway, and then refine it by looking at the relevant page histories if necessary.
- Select 200 random articles.
- Get the top contributors for each of them.
- Get the edit counts for those contributors.
I think he already has the list(s) of 200 articles and does not want random ones. Plus, he doesn't want the edit counts; he wants their top edited articles, with the edit count per article.
My personal opinion is that this HAS to be done via php (though I can't comment on server load). Use php-mysql to determine the list of top contributors per given article, then loop over each contributor and give *his* top edited articles... Shouldn't be hard, though you might want to clarify what you mean by "top". (Top 3? More than X edits? More than X% edits per day/week/month/beginning of time? More than X% edits of the top editor?).
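To make the loop concrete, here is a sketch of the two queries. It is written in Python with an in-memory SQLite database rather than php-mysql, purely so it can be run anywhere; the two tables are a cut-down stand-in for the real page and revision tables, the rows are toy data, and the threshold is lowered to 2 so the toy data produces output.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE page (page_id INTEGER PRIMARY KEY, page_namespace INTEGER, page_title TEXT);
CREATE TABLE revision (rev_id INTEGER PRIMARY KEY, rev_page INTEGER, rev_user_text TEXT);
""")
conn.executemany("INSERT INTO page VALUES (?, ?, ?)",
                 [(1, 0, "Science"), (2, 0, "Human_evolution"), (3, 0, "Intelligent_design")])
# Toy revisions as (page id, user name).
revs = ([(1, "A")] * 3 + [(1, "B")] * 2 + [(1, "C")]
        + [(2, "B")] * 2 + [(3, "A")] * 4 + [(3, "C")])
conn.executemany("INSERT INTO revision (rev_page, rev_user_text) VALUES (?, ?)", revs)

# Editors of one article with at least min_edits edits to it.
TOP_EDITORS = """
SELECT rev_user_text FROM revision JOIN page ON page_id = rev_page
WHERE page_namespace = 0 AND page_title = ?
GROUP BY rev_user_text HAVING COUNT(*) >= ?
ORDER BY COUNT(*) DESC, rev_user_text
"""

# Every article one editor touched, with the number of edits to each.
PAGES_BY_USER = """
SELECT page_title, COUNT(*) AS edits FROM revision JOIN page ON page_id = rev_page
WHERE page_namespace = 0 AND rev_user_text = ?
GROUP BY page_title ORDER BY edits DESC, page_title
"""


def editor_profile(conn, title, min_edits):
    """Map each top editor of `title` to a list of (article, edit count)."""
    profile = {}
    for (user,) in conn.execute(TOP_EDITORS, (title, min_edits)).fetchall():
        profile[user] = conn.execute(PAGES_BY_USER, (user,)).fetchall()
    return profile


profile = editor_profile(conn, "Science", min_edits=2)
for user, pages in profile.items():
    print(user, pages)
```

Whether one query per editor is acceptable on a shared server is a separate question; the same result could probably be had with a single join, at the cost of readability.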
-Manishearth
Manish Goregaokar wrote:
My personal opinion is that this HAS to be done via php (though I can't comment on server load). Use php-mysql to determine the list of top contributors per given article, then loop over each contributor and give *his* top edited articles... Shouldn't be hard, though you might want to clarify what you mean by "top". (Top 3? More than X edits? More than X% edits per day/week/month/beginning of time? More than X% edits of the top editor?).
It's quite easy processing the stub-pages-articles dump, too.
1. Read the dump, if the page title matches, record all editing users.
2. Order the author list per article, select which ones pass to the next phase.
3. Read the dump again, if the user edited that page (and it's in the main namespace), record that page name.
4. ???
5. Profit
You may be able to combine several steps into a single SQL query, but I'm not convinced that would perform significantly better. Working from an XML dump is a bit old-fashioned, but more reproducible.
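The two passes over the dump could be sketched like this in Python. Treat it as an outline only: the inline sample is a heavily simplified stand-in for a real stub dump, the function names are made up, the threshold is lowered to 2 for the toy data, and the main-namespace check from step 3 is left out because the sample carries no namespace information.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter, defaultdict

# Tiny stand-in for a stub dump (real dumps are far larger and use an XML namespace).
SAMPLE = b"""<mediawiki>
<page><title>Science</title>
<revision><contributor><username>A</username></contributor></revision>
<revision><contributor><username>A</username></contributor></revision>
<revision><contributor><username>B</username></contributor></revision>
</page>
<page><title>Human evolution</title>
<revision><contributor><username>A</username></contributor></revision>
</page>
</mediawiki>"""


def local(tag):
    """Strip an XML namespace prefix such as '{uri}page' down to 'page'."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(stream):
    """Yield (title, [usernames]) for each <page>, freeing memory as we go."""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if local(elem.tag) != "page":
            continue
        title, users = None, []
        for child in elem.iter():
            name = local(child.tag)
            if name == "title":
                title = child.text
            elif name == "username":  # anonymous edits have no <username> and are skipped
                users.append(child.text)
        yield title, users
        elem.clear()


def top_editors(stream, wanted_titles, min_edits):
    """Pass 1: users with at least min_edits edits to any wanted article."""
    selected = set()
    for title, users in iter_pages(stream):
        if title in wanted_titles:
            counts = Counter(users)
            selected.update(u for u, n in counts.items() if n >= min_edits)
    return selected


def pages_edited(stream, editors):
    """Pass 2: for each selected user, count edits per page title."""
    table = defaultdict(Counter)
    for title, users in iter_pages(stream):
        for user in users:
            if user in editors:
                table[user][title] += 1
    return table


editors = top_editors(io.BytesIO(SAMPLE), {"Science"}, min_edits=2)
table = pages_edited(io.BytesIO(SAMPLE), editors)
print(editors, dict(table["A"]))
```

For a real run the `io.BytesIO` objects would be replaced by the opened dump file, read twice.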
On Fri, Apr 29, 2011 at 7:58 AM, Manish Goregaokar manishsmail@gmail.com wrote:
I think he already has the list(s) of 200 articles and does not want random ones. Plus, he doesn't want the edit counts; he wants their top edited articles, with the edit count per article.
My personal opinion is that this HAS to be done via php (though I can't comment on server load). Use php-mysql to determine the list of top contributors per given article, then loop over each contributor and give *his* top edited articles... Shouldn't be hard, though you might want to clarify what you mean by "top". (Top 3? More than X edits? More than X% edits per day/week/month/beginning of time? More than X% edits of the top editor?).
Thanks again for the info. Yes, this is basically correct. I am looking to collect this info based on 100 articles from the Wikipedia science series. If the data proves relatively easy to collect, I would like to collect data on all articles in the science series, which is around 200 articles. Top contributors for me are those with 10 or more edits in the sampled article from the science series. For the sake of clarity, here is a short sample of the data I'm looking for.
From the "science" article http://en.wikipedia.org/wiki/Science
Clicking "view history" and then "contributors" gives a ranked list of all contributors in order of most edits.
http://toolserver.org/~daniel/WikiSense/Contributors.php?wikilang=en&wik...
The top three editors (let's call them A, B, and C) currently have 445, 73 and 70 edits respectively. Clicking on contributor "A" to see their user page and then the "user contributions" from the tool box shows all their edits. For example, he/she has several edits to the articles "intelligent design" and "southern poverty law center", etc., and user "B" has edits to "rock formations" and "human evolution". I would like to count the frequency of all these edits across the top users for the sampled (e.g. science) articles, sorted by the article title.
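One possible way to produce that kind of frequency table, once per-editor counts exist, is sketched below in Python. The function names and column headers are hypothetical, and apart from the 445 and 73 edits to "Science" quoted above, the counts are invented for illustration.

```python
import csv
import io
from collections import Counter


def frequency_rows(per_user_counts):
    """Sum edits per article across all top editors, sorted by article title."""
    totals = Counter()
    for counts in per_user_counts.values():
        totals.update(counts)  # adds each editor's counts to the running totals
    return sorted(totals.items())


def to_csv(rows):
    """Render the rows as CSV text that a spreadsheet can import."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["article", "edits_by_top_editors"])
    writer.writerows(rows)
    return buf.getvalue()


# Only the Science counts come from the message; the rest are made up.
per_user = {
    "A": {"Science": 445, "Intelligent design": 5},
    "B": {"Science": 73, "Human evolution": 3},
}

rows = frequency_rows(per_user)
print(to_csv(rows))
```

One such table per sampled article would give the 100 tables mentioned earlier in the thread.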
I don't know what the best way to arrange the data would be, but below is a Google Doc Spreadsheet that sort of shows what I think it would look like.
If the Query Service seems the best approach (is this done using the php-mysql referenced above or is it a different process?) then I will go ahead and create a task on https://jira.toolserver.org/browse/DBQ. If this is not the best or correct way to go any guidance is appreciated.
Thanks.
Thanks again for all the feedback. Due to my limits on time, I went ahead and submitted a task on the query service.
https://jira.toolserver.org/browse/DBQ-140
I don't know where this goes from here, but if anyone has any suggestions, please share.
Thanks, Jim
toolserver-l@lists.wikimedia.org