From: Brion Vibber <brion(a)wikimedia.org>
Subject: Re: [Mediawiki-api] List of all authors via API
To: "MediaWiki API announcements & discussion"
<mediawiki-api(a)lists.wikimedia.org>
Date: Wednesday, October 29, 2008, 10:24 AM
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
Magnus Manske wrote:
On Fri, Oct 24, 2008 at 5:59 PM, Brion Vibber <brion(a)wikimedia.org> wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> Johannes Beigel wrote:
>> Is there a way (or a plan to implement one) to retrieve the list of
>> unique contributors for a given article (from a given revision down to
>> the first one)? Ideally this would accept parameters for the mentioned
>> filtering. I guess inside of MediaWiki code this can be handled very
>> efficiently (using appropriate database queries) and would eliminate
>> the need to transfer lots of redundant data over the socket.
> Given that this could require filtering through hundreds of thousands of
> unique revisions for a single request, I don't think we currently have a
> good plan for that. :)

I just ran a DISTINCT mysql query for all non-IP editors of
[[en:George W. Bush]] on the toolserver, and that took 3 seconds.
There are 41790 revisions.

Indeed, it's not as bad as I was afraid. I'm still a little leery that
the EXPLAIN lists "Using temporary" though. :P

Considering that this would be a worst case article, and that it ran
on the overtaxed toolserver, it does seem possible. Maybe if we'd have
one MySQL slave / Apache dedicated for this task?

Probably fine to pull from the same slaves already dedicated for
contributions queries (relevant indexes are already pulled into memory).

Figuring out how to get something other than a raw list of thousands of
editors for a "nice" author list remains a harder task. :)

Wouldn't that be a snap using a GROUP BY clause? Sorry, I don't know the
database structure, but generically:
SELECT contributors, COUNT(*) FROM database GROUP BY contributors
would return a list of all contributors and the number of contributions each has made.
Adding a HAVING COUNT(*) > X clause would return only those contributors who've made
over X contributions. Of course, I've only worked on small databases, so I have no idea
what the overhead on this would be...
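
A minimal sketch of the GROUP BY idea, run against an in-memory SQLite table rather than a real wiki database. The table and column names (revision, rev_page, rev_user, rev_user_text) are assumptions loosely modelled on a MediaWiki-style revision table, the sample rows are made up, and the authors() helper is hypothetical. It assumes anonymous (IP) edits are stored with a user id of 0, and it says nothing about how such a query would perform on a production slave.

```python
import sqlite3

# Toy stand-in for a revision table. Schema and data are illustrative
# assumptions, not the actual production layout.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE revision ("
    " rev_id INTEGER PRIMARY KEY,"
    " rev_page INTEGER,"
    " rev_user INTEGER,"
    " rev_user_text TEXT)"
)
rows = [
    (1, 1, 10, "Alice"),
    (2, 1, 11, "Bob"),
    (3, 1, 10, "Alice"),
    (4, 1, 0, "192.0.2.7"),  # assumed convention: IP edits have user id 0
    (5, 1, 10, "Alice"),
    (6, 1, 11, "Bob"),
    (7, 1, 12, "Carol"),
    (8, 2, 12, "Carol"),     # edit to a different page
]
conn.executemany("INSERT INTO revision VALUES (?, ?, ?, ?)", rows)


def authors(conn, page_id, min_edits=1):
    """Return (name, edit_count) for registered editors of one page.

    Editors with fewer than min_edits revisions are dropped by HAVING.
    """
    cur = conn.execute(
        """
        SELECT rev_user_text, COUNT(*) AS edits
        FROM revision
        WHERE rev_page = ? AND rev_user != 0
        GROUP BY rev_user_text
        HAVING COUNT(*) >= ?
        ORDER BY edits DESC, rev_user_text
        """,
        (page_id, min_edits),
    )
    return cur.fetchall()


print(authors(conn, 1))               # every registered editor of page 1
print(authors(conn, 1, min_edits=2))  # only editors with at least 2 edits
```

Sorting by edit count and cutting off at a threshold is one possible way to get from a raw list of thousands of editors to a shorter "nice" author list, though picking a sensible threshold is a separate question.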