New subject: [Mediawiki-l] Request for comments: Schema change to add Special:Contributors

9 Jan 2007


      Last night, after I asked Brion whether he could turn on Rob's
Special:Contributors extension on Wikimedia sites, and Domas colorfully
indicated that it would be a Very Bad Idea(TM). The data for the extension
can be retrieved from the revision table, but due to the existing topology
of the table and the current load on the revision table itself, it would
cause a very expensive query. He indicated that another option would be to
create a new database table with page, user and editcount data. I've kept
thinking about the idea, and have modified Domas's suggestion slightly:
mysql> describe contributors;
+--------------------+-----------------+------+-----+---------+-------------
---+
| Field              | Type            | Null | Key | Default | Extra
|
+--------------------+-----------------+------+-----+---------+-------------
---+
| co_id              | int(8) unsigned | NO   | PRI | NULL    |
auto_increment |
| co_page            | int(8) unsigned | NO   | MUL | NULL    |
|
| co_user            | int(5) unsigned | NO   | MUL | 0       |
|
| co_user_text       | varchar(255)    | NO   |     | NULL    |
|
| co_editcount       | int(11)         | NO   |     | NULL    |
|
| co_editcount_minor | int(11)         | NO   |     | NULL    |
|
| co_timestamp       | char(14)        | NO   |     | NULL    |
|
+--------------------+-----------------+------+-----+---------+-------------
---+
Each row in the database would represent each row in the output of
http://vs.aka-online.de/cgi-bin/wppagehiststat.pl, to pick an example. The
co_id is the primary key of the table, and co_page links to the usual
suspect, page.page_id. Co_user and co_user_text link to user_id and
user_name, as well. Co_editcount is the number of edits said user has made
to the page, and co_editcount_minor represents the number of minor edits
done to the page (added because someone is going to ask for it sooner or
later). Co_timestamp is the timestamp of generation of the row, added in
case a fancy algorithm to detect old queries is made in the future.
As to how to initially populate the table: well, a bulk initialization
script is one option, but Domas didn't like it either. ;) :P What I
suggested is to populate the table on demand, and maintain it later, and
after a night's sleep, I thought about how to refine it more:
* On page creation, generate a row for the page and the user.
* On page edit, one of two options:
** If a row for the page and user exists already, then UPDATE the
co_editcount and co_editcount_minor fields as necessary.
** If rows for the page exist, but none points to the user, generate a new
row.
** If there are no rows pointing to the page, create the necessary rows from
the data in the revision table.
* On a query from [[Special:Contributors]], if there are no rows pointing to
the row, generate them from revision, then return the expected results,
Otherwise, just return the results.
* On page deletion/undeletion/oversight, SELECT all the rows pointing to a
particular page and DELETE them from the table, essentially invalidating the
cache for the page.
As to why the change is desired, it was described in detail at
http://bugzilla.wikimedia.org/show_bug.cgi?id=7988. Also, this could help to
implement some long-lasting suggestions to enhance Wikimedia projects'
credibility and citability. For example, Roy Rosenzweig of George Mason
University made a few recommendations in his article "Can History be Open
Source? Wikipedia and the Future of the Past"
(http://chnm.gmu.edu/resources/essays/d/42, and a good read, by the way),
such as putting the total number of edits made to a page into every page's
footer (next to [[MediaWiki:Lastmodifiedat]]). Also, with a JOIN with the
user table, a percentage of edits made by active editors can be displayed to
the viewer, which is another suggestion by Rosenzweig.
There are probably more efficient ways to do this, but I just wanted to get
the ball rolling... any comments? Should it be done?
Titoxd.