Last night, after I asked Brion whether he could turn on Rob's Special:Contributors extension on Wikimedia sites, and Domas colorfully indicated that it would be a Very Bad Idea(TM). The data for the extension can be retrieved from the revision table, but due to the existing topology of the table and the current load on the revision table itself, it would cause a very expensive query. He indicated that another option would be to create a new database table with page, user and editcount data. I've kept thinking about the idea, and have modified Domas's suggestion slightly:
mysql> describe contributors;
+--------------------+-----------------+------+-----+---------+------------- ---+
| Field | Type | Null | Key | Default | Extra |
+--------------------+-----------------+------+-----+---------+------------- ---+
| co_id | int(8) unsigned | NO | PRI | NULL | auto_increment |
| co_page | int(8) unsigned | NO | MUL | NULL | |
| co_user | int(5) unsigned | NO | MUL | 0 | |
| co_user_text | varchar(255) | NO | | NULL | |
| co_editcount | int(11) | NO | | NULL | |
| co_editcount_minor | int(11) | NO | | NULL | |
| co_timestamp | char(14) | NO | | NULL | |
+--------------------+-----------------+------+-----+---------+------------- ---+
Each row in the database would represent each row in the output of http://vs.aka-online.de/cgi-bin/wppagehiststat.pl, to pick an example. The co_id is the primary key of the table, and co_page links to the usual suspect, page.page_id. Co_user and co_user_text link to user_id and user_name, as well. Co_editcount is the number of edits said user has made to the page, and co_editcount_minor represents the number of minor edits done to the page (added because someone is going to ask for it sooner or later). Co_timestamp is the timestamp of generation of the row, added in case a fancy algorithm to detect old queries is made in the future.
As to how to initially populate the table: well, a bulk initialization script is one option, but Domas didn't like it either. ;) :P What I suggested is to populate the table on demand, and maintain it later, and after a night's sleep, I thought about how to refine it more:
* On page creation, generate a row for the page and the user.
* On page edit, one of two options:
** If a row for the page and user exists already, then UPDATE the co_editcount and co_editcount_minor fields as necessary.
** If rows for the page exist, but none points to the user, generate a new row.
** If there are no rows pointing to the page, create the necessary rows from the data in the revision table.
* On a query from [[Special:Contributors]], if there are no rows pointing to the row, generate them from revision, then return the expected results, Otherwise, just return the results.
* On page deletion/undeletion/oversight, SELECT all the rows pointing to a particular page and DELETE them from the table, essentially invalidating the cache for the page.
As to why the change is desired, it was described in detail at http://bugzilla.wikimedia.org/show_bug.cgi?id=7988. Also, this could help to implement some long-lasting suggestions to enhance Wikimedia projects' credibility and citability. For example, Roy Rosenzweig of George Mason University made a few recommendations in his article "Can History be Open Source? Wikipedia and the Future of the Past" (http://chnm.gmu.edu/resources/essays/d/42, and a good read, by the way), such as putting the total number of edits made to a page into every page's footer (next to [[MediaWiki:Lastmodifiedat]]). Also, with a JOIN with the user table, a percentage of edits made by active editors can be displayed to the viewer, which is another suggestion by Rosenzweig.
There are probably more efficient ways to do this, but I just wanted to get the ball rolling... any comments? Should it be done?
Titoxd.