[Toolserver-l] Query Service Inquiry

Ilmari Karonen nospam at vyznev.net
Fri Apr 29 13:40:42 UTC 2011


On Thu, 2011-04-28 at 21:49 -0600, Jim Hutchinson wrote:
> Thank you for the information and feedback. What I need is somewhat
> more complicated than a list of contributors to a single page. In
> fact, there is already a tool in Wikipedia (just called contributors,
> I think) that lists all the contributors to an article and their
> number of edits. What I need to do is, using that list of
> contributors, select the top 20 or so (excluding bots) for each of the
> hundred selected articles and get a list of all of the other articles
> to which each of them contributed with a frequency count of edits.
> Ideally, this data would be in a table of sorts for each article
> selected (so 100 tables).
> 
> This could, of course, be done manually by searching for contributions
> by username. However, this will be time consuming and possibly error
> prone. My hope was that a query could grab this information fairly
> quickly as well as automatically count frequencies of edits per
> article, etc.
> 
> I don't have the expertise to do this myself, but I do know someone
> who can and has requested an account. However, he is afraid he will
> not be granted an account for what will likely be a one time project.
> 
> Is there likely an API that can do what I described or would a query
> be an easier or more efficient way to go?

Technically, most of this shouldn't be too hard to do using SQL queries
on the toolserver.  One disadvantage, though, is that the toolserver
does not have (direct) access to page text.  This could be a problem if
you, say, wanted to exclude reverts from the edit count, weight edits by
the amount of text added or do some other kind of fine-grained
processing.

Basically, you have three steps you want to do:

 1. Select 200 random articles.
 2. Get the top contributors for each of them.
 3. Get the edit counts for those contributors.

The first step is easy, as long as the (not quite uniform) random page
selection algorithm built into MediaWiki is good enough for you.  You
could do it using a Toolserver SQL query, or just by clicking the
"random page" link 200 times (by hand or by bot), but the simplest way
would probably be to use the API:
http://www.mediawiki.org/wiki/API:Random
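
As a rough sketch of the API route, the following builds the request
URLs for list=random.  The target wiki (API_URL) and the batch size of
10 per request are assumptions; check the API documentation for the
limit that applies to your account.  It only constructs the URLs and
does not send anything.

```python
from urllib.parse import urlencode

# Assumption: English Wikipedia is the wiki being sampled.
API_URL = "https://en.wikipedia.org/w/api.php"


def random_page_urls(total, per_request=10):
    """Build enough list=random request URLs to cover `total` pages."""
    urls = []
    remaining = total
    while remaining > 0:
        batch = min(per_request, remaining)
        params = {
            "action": "query",
            "list": "random",
            "rnnamespace": 0,  # main (article) namespace only
            "rnlimit": batch,
            "format": "json",
        }
        urls.append(API_URL + "?" + urlencode(params))
        remaining -= batch
    return urls


if __name__ == "__main__":
    for url in random_page_urls(200):
        print(url)
```

Separate requests may return the same page more than once, so the
results would need de-duplicating before use.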

If you wanted a more uniform sample, you could download the page table
SQL dump (page.sql.gz), extract the page titles from it (with
appropriate filtering, e.g. to exclude redirects) and randomly select
200 of them.
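
A minimal sketch of that sampling step, assuming the titles have
already been extracted from the dump into (namespace, title,
is_redirect) tuples.  The tuple layout and the function name are
illustrative only; parsing page.sql.gz itself is not shown.

```python
import random


def sample_titles(pages, k, seed=None):
    """Uniformly sample k article titles, skipping redirects.

    pages: iterable of (namespace, title, is_redirect) tuples
           (a hypothetical layout, not the dump's own format).
    """
    candidates = [
        title
        for namespace, title, is_redirect in pages
        if namespace == 0 and not is_redirect
    ]
    rng = random.Random(seed)  # a fixed seed makes the sample repeatable
    return rng.sample(candidates, k)


if __name__ == "__main__":
    demo = [
        (0, "Alpha", False),
        (0, "Beta", True),    # redirect, excluded
        (1, "Talk:Alpha", False),  # not an article, excluded
        (0, "Gamma", False),
        (0, "Delta", False),
    ]
    print(sample_titles(demo, 2, seed=1))
```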

The second step could be easily done on the Toolserver, as long as you
only wanted to count edits.  For more fine-grained filtering based on
page text, you could use Special:Export to obtain a "mini-dump" of the
pages in your sample, including their full history, in XML format.
Alternatively, the same information is also available using the API.
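
For the plain edit-count version, a query along these lines should be
close to what is needed.  This is only a sketch: it runs against a toy
in-memory SQLite table that mimics the rev_page and rev_user_text
columns of MediaWiki's revision table.  On the Toolserver the database
is MySQL, so the connection code and placeholder style would differ.

```python
import sqlite3

TOP_CONTRIBUTORS_SQL = """
SELECT rev_user_text, COUNT(*) AS edits
FROM revision
WHERE rev_page = ?
GROUP BY rev_user_text
ORDER BY edits DESC, rev_user_text
LIMIT ?
"""


def top_contributors(conn, page_id, limit=20):
    """Return [(username, edit_count), ...] for one page, busiest first."""
    return conn.execute(TOP_CONTRIBUTORS_SQL, (page_id, limit)).fetchall()


# Toy data standing in for the real revision table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE revision ("
    "rev_id INTEGER PRIMARY KEY, rev_page INTEGER, rev_user_text TEXT)"
)
conn.executemany(
    "INSERT INTO revision (rev_page, rev_user_text) VALUES (?, ?)",
    [(1, "Alice"), (1, "Alice"), (1, "Bob"), (2, "Bob"),
     (1, "Carol"), (1, "Alice"), (1, "Bob")],
)

if __name__ == "__main__":
    print(top_contributors(conn, 1, limit=2))
```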

(The detail about excluding bots comes down to determining what is a
bot.  MediaWiki does feature a "bot flag", which can be used to filter
out users having it.  Unfortunately, for various reasons, not all bot
accounts necessarily have the flag set.  You might be able to filter out
more bots by looking at, say, the categories on their user page, but
ultimately you may still end up having to do some manual filtering.)
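
One way to combine the two kinds of filtering is sketched below.  It
assumes you already have a set of flagged bot names (on the Toolserver
this could presumably come from the user_groups table, where ug_group
is 'bot') plus a hand-maintained list of unflagged bots.  The function
and parameter names are made up for illustration.

```python
def exclude_bots(contributors, flagged_bots, manual_bots=()):
    """Drop bot accounts from a ranked contributor list.

    contributors: list of (username, edit_count), already sorted.
    flagged_bots: usernames that have the bot flag set.
    manual_bots:  unflagged bots identified by hand.
    """
    excluded = set(flagged_bots) | set(manual_bots)
    return [(user, edits) for user, edits in contributors
            if user not in excluded]


if __name__ == "__main__":
    ranked = [("ExampleBot", 120), ("Alice", 40), ("OldScript", 35), ("Bob", 12)]
    print(exclude_bots(ranked, {"ExampleBot"}, manual_bots=["OldScript"]))
```

Filtering before taking the top 20 (rather than after) avoids ending
up with fewer than 20 human contributors per article.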

The last step could, again, be fairly easily done on the Toolserver as
long as you only wanted the raw edit counts.  In fact, it would probably
be best to start with that data anyway, and then refine it by looking at
the relevant page histories if necessary.
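
The raw counts for this last step could be produced with a grouped
query like the one sketched here.  As before, this is a toy example
against an in-memory SQLite table shaped like MediaWiki's revision
table, not the actual Toolserver setup.  Joining against the page
table to turn page IDs into titles is left out.

```python
import sqlite3


def edits_per_page(conn, usernames):
    """Return [(username, page_id, edit_count), ...] for the given users."""
    usernames = list(usernames)
    if not usernames:
        return []
    placeholders = ",".join("?" for _ in usernames)
    sql = f"""
        SELECT rev_user_text, rev_page, COUNT(*) AS edits
        FROM revision
        WHERE rev_user_text IN ({placeholders})
        GROUP BY rev_user_text, rev_page
        ORDER BY rev_user_text, edits DESC, rev_page
    """
    return conn.execute(sql, usernames).fetchall()


# Toy data standing in for the real revision table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE revision ("
    "rev_id INTEGER PRIMARY KEY, rev_page INTEGER, rev_user_text TEXT)"
)
conn.executemany(
    "INSERT INTO revision (rev_page, rev_user_text) VALUES (?, ?)",
    [(1, "Alice"), (1, "Alice"), (2, "Alice"), (2, "Bob"),
     (3, "Bob"), (3, "Bob"), (1, "Carol")],
)

if __name__ == "__main__":
    for row in edits_per_page(conn, ["Alice", "Bob"]):
        print(row)
```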

-- 
Ilmari Karonen



