On Thu, 2011-04-28 at 21:49 -0600, Jim Hutchinson wrote:
> Thank you for the information and feedback. What I need is somewhat more complicated than a list of contributors to a single page. In fact, there is already a tool in Wikipedia (just called "contributors", I think) that lists all the contributors to an article and their number of edits. What I need to do is, using that list of contributors, select the top 20 or so (excluding bots) for each of the hundred selected articles and get a list of all of the other articles to which each of them contributed, with a frequency count of edits. Ideally, this data would be in a table of sorts for each article selected (so 100 tables).
> This could, of course, be done manually by searching for contributions by username. However, this would be time-consuming and possibly error-prone. My hope was that a query could grab this information fairly quickly as well as automatically count frequencies of edits per article, etc.
> I don't have the expertise to do this myself, but I do know someone who can and has requested an account. However, he is afraid he will not be granted an account for what will likely be a one-time project.
> Is there an API that can do what I described, or would a query be an easier or more efficient way to go?
Technically, most of this shouldn't be too hard to do using SQL queries on the Toolserver. One disadvantage, though, is that the Toolserver does not have (direct) access to page text. This could be a problem if you, say, wanted to exclude reverts from the edit count, weight edits by the amount of text added, or do some other kind of fine-grained processing.
Basically, you have three steps you want to do:
1. Select 200 random articles.
2. Get the top contributors for each of them.
3. Get the edit counts for those contributors.
The first step is easy, as long as the (not quite uniform) random page selection algorithm built into MediaWiki is good enough for you. You could do it using a Toolserver SQL query, or just by clicking the "random page" link 200 times (by hand or by bot), but the simplest way would probably be to use the API: http://www.mediawiki.org/wiki/API:Random
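For what it's worth, here's a minimal sketch of the API route in Python (using the requests library; list=random is capped at a fairly small number of results per request without special rights, so this just keeps asking until it has enough titles):

    import requests

    API = "http://en.wikipedia.org/w/api.php"

    def random_articles(n=200):
        """Collect n random main-namespace article titles via list=random."""
        titles = set()
        while len(titles) < n:
            reply = requests.get(API, params={
                "action": "query",
                "list": "random",
                "rnnamespace": 0,   # main (article) namespace only
                "rnlimit": 10,      # per-request cap is low without special rights
                "format": "json",
            }).json()
            for page in reply["query"]["random"]:
                titles.add(page["title"])
        return sorted(titles)[:n]

    print(random_articles())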
If you wanted a more uniform sample, you could download the page table SQL dump (page.sql.gz), extract the page titles from it (with appropriate filtering, e.g. to exclude redirects) and randomly select 200 of them.
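A rough sketch of that, assuming the usual layout of the page table dump (the row format in the regex is from memory, so check it against an actual page.sql.gz before trusting the output):

    import gzip
    import random
    import re

    # One row of the page table: (page_id, page_namespace, 'page_title',
    # 'page_restrictions', page_counter, page_is_redirect, ...).  This is an
    # approximation of the dump format -- verify against a real page.sql.gz.
    ROW = re.compile(r"\((\d+),(\d+),'((?:[^'\\]|\\.)*)','(?:[^'\\]|\\.)*',\d+,([01]),")

    titles = []
    with gzip.open("page.sql.gz", "rt", encoding="utf-8", errors="replace") as dump:
        for line in dump:
            if not line.startswith("INSERT INTO"):
                continue
            for _, ns, title, is_redirect in ROW.findall(line):
                if ns == "0" and is_redirect == "0":   # articles only, no redirects
                    titles.append(title.replace("_", " "))

    sample = random.sample(titles, 200)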
The second step could be easily done on the Toolserver, as long as you only wanted to count edits. For more fine-grained filtering based on page text, you could use Special:Export to obtain a "mini-dump" of the pages in your sample, including their full history, in XML format. Alternatively, the same information is also available using the API.
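For the plain edit-counting case, the Toolserver query could look something like this from Python (the host name and the ~/.my.cnf credentials file are how I remember the setup, so adjust as needed; you'd run it once per sampled article):

    import os
    import MySQLdb   # available on the Toolserver, if I remember right

    # Host and credentials file are from memory -- adjust to your account setup.
    conn = MySQLdb.connect(db="enwiki_p",
                           host="enwiki-p.rrdb.toolserver.org",
                           read_default_file=os.path.expanduser("~/.my.cnf"))

    TOP_CONTRIBUTORS = """
        SELECT rev_user_text, COUNT(*) AS edits
        FROM revision
        JOIN page ON rev_page = page_id
        WHERE page_namespace = 0 AND page_title = %s
        GROUP BY rev_user_text
        ORDER BY edits DESC
        LIMIT 20
    """

    def top_contributors(title):
        """Return up to 20 (username, edit count) pairs for one article."""
        cursor = conn.cursor()
        cursor.execute(TOP_CONTRIBUTORS, (title.replace(" ", "_"),))
        return cursor.fetchall()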
(The detail about excluding bots comes down to determining what counts as a bot. MediaWiki does have a "bot flag", which can be used to filter out accounts that carry it. Unfortunately, for various reasons, not all bot accounts necessarily have the flag set. You might be able to filter out more bots by looking at, say, the categories on their user page, but ultimately you may still end up having to do some manual filtering.)
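For the flag-based part of that filtering, the API can at least tell you which accounts in a list carry the "bot" group; a sketch:

    import requests

    API = "http://en.wikipedia.org/w/api.php"

    def flagged_bots(usernames):
        """Return the subset of usernames that carry the 'bot' user group."""
        bots = set()
        for i in range(0, len(usernames), 50):   # list=users takes <= 50 names at a time
            reply = requests.get(API, params={
                "action": "query",
                "list": "users",
                "ususers": "|".join(usernames[i:i + 50]),
                "usprop": "groups",
                "format": "json",
            }).json()
            for user in reply["query"]["users"]:
                if "bot" in user.get("groups", []):
                    bots.add(user["name"])
        return bots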
The last step could, again, be fairly easily done on the Toolserver as long as you only wanted the raw edit counts. In fact, it would probably be best to start with that data anyway, and then refine it by looking at the relevant page histories if necessary.
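If you end up going the API route for this step too, list=usercontribs gives you the same raw per-article counts. A sketch (the continuation handling matches the current API, which may need tweaking for older installs, and "ExampleUser" is just a placeholder):

    from collections import Counter
    import requests

    API = "http://en.wikipedia.org/w/api.php"

    def edits_per_article(username):
        """Count a user's main-namespace edits, grouped by article title."""
        counts = Counter()
        params = {
            "action": "query",
            "list": "usercontribs",
            "ucuser": username,
            "ucnamespace": 0,     # articles only
            "ucprop": "title",
            "uclimit": 500,       # maximum for ordinary accounts
            "format": "json",
        }
        while True:
            reply = requests.get(API, params=params).json()
            for contrib in reply["query"]["usercontribs"]:
                counts[contrib["title"]] += 1
            if "continue" not in reply:
                break
            params.update(reply["continue"])   # keep paging through older edits
        return counts

    # e.g. edits_per_article("ExampleUser").most_common(20)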