--- On Wed, 10/29/08, Brion Vibber brion@wikimedia.org wrote:
From: Brion Vibber brion@wikimedia.org
Subject: Re: [Mediawiki-api] List of all authors via API
To: "MediaWiki API announcements & discussion" mediawiki-api@lists.wikimedia.org
Date: Wednesday, October 29, 2008, 10:24 AM
Magnus Manske wrote:
On Fri, Oct 24, 2008 at 5:59 PM, Brion Vibber brion@wikimedia.org wrote:
Johannes Beigel wrote:
Is there a way (or a plan to implement one) to retrieve the list of unique contributors for a given article (from a given revision down to the first one)? Ideally this would accept parameters for the mentioned filtering. I guess inside of MediaWiki code this can be handled very efficiently (using appropriate database queries) and would eliminate the need to transfer lots of redundant data over the socket.
Given that this could require filtering through hundreds of thousands of unique revisions for a single request, I don't think we currently have a good plan for that. :)
I just ran a DISTINCT mysql query for all non-IP editors of [[en:George W. Bush]] on the toolserver, and that took 3 seconds. There are 41790 revisions.
Indeed, it's not as bad as I was afraid. I'm still a little leery that the EXPLAIN lists "Using temporary" though. :P
Considering that this would be a worst case article, and that it ran on the overtaxed toolserver, it does seem possible. Maybe if we'd have one MySQL slave / Apache dedicated for this task?
Probably fine to pull from the same slaves already dedicated for contributions queries (relevant indexes are already pulled into memory).
Figuring out how to get something other than a raw list of thousands of editors for a "nice" author list remains a harder task. :)
Wouldn't that be a snap using a GROUP BY clause? Sorry, I don't know the database structure, but generically:
SELECT contributors, COUNT(*) FROM database GROUP BY contributors
would return a list of all contributors and the number of contributions each has made; it could be tweaked (with a HAVING clause) to return only those contributors who've made over X contributions. Of course, I've only worked on small databases, so I have no idea what the overhead on this would be...
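
To make the "over X contributions" idea concrete, here is a rough sketch using Python's built-in sqlite3 module against a toy in-memory table. The table name `revisions`, its columns `page` and `contributor`, and the helper `top_contributors` are all made up for illustration; this is not MediaWiki's actual schema, and it says nothing about how the query would perform on a real wiki.

```python
import sqlite3

# Toy stand-in for a revision table. Table and column names are hypothetical,
# chosen for illustration only; they are not MediaWiki's real schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE revisions (page TEXT, contributor TEXT)")
conn.executemany(
    "INSERT INTO revisions (page, contributor) VALUES (?, ?)",
    [
        ("Example", "Alice"),
        ("Example", "Bob"),
        ("Example", "Alice"),
        ("Example", "Carol"),
        ("Example", "Alice"),
        ("Example", "Bob"),
        ("Other", "Dave"),
    ],
)


def top_contributors(conn, page, min_edits):
    """Return (contributor, edit_count) rows for one page, keeping only
    contributors with more than min_edits edits, most active first."""
    return conn.execute(
        """
        SELECT contributor, COUNT(*) AS edits
        FROM revisions
        WHERE page = ?
        GROUP BY contributor
        HAVING COUNT(*) > ?
        ORDER BY edits DESC, contributor ASC
        """,
        (page, min_edits),
    ).fetchall()


# Contributors with more than one edit to the "Example" page.
print(top_contributors(conn, "Example", 1))
```

With the sample rows above, only the two contributors with more than one edit to "Example" come back; raising the threshold trims the list further, which may be one way toward a shorter "nice" author list.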