Last night, after I asked Brion whether he could turn on Rob's Special:Contributors extension on Wikimedia sites, and Domas colorfully indicated that it would be a Very Bad Idea(TM). The data for the extension can be retrieved from the revision table, but due to the existing topology of the table and the current load on the revision table itself, it would cause a very expensive query. He indicated that another option would be to create a new database table with page, user and editcount data. I've kept thinking about the idea, and have modified Domas's suggestion slightly:
mysql> describe contributors;
+--------------------+-----------------+------+-----+---------+------------- ---+
| Field | Type | Null | Key | Default | Extra |
+--------------------+-----------------+------+-----+---------+------------- ---+
| co_id | int(8) unsigned | NO | PRI | NULL | auto_increment |
| co_page | int(8) unsigned | NO | MUL | NULL | |
| co_user | int(5) unsigned | NO | MUL | 0 | |
| co_user_text | varchar(255) | NO | | NULL | |
| co_editcount | int(11) | NO | | NULL | |
| co_editcount_minor | int(11) | NO | | NULL | |
| co_timestamp | char(14) | NO | | NULL | |
+--------------------+-----------------+------+-----+---------+------------- ---+
Each row in the database would represent each row in the output of http://vs.aka-online.de/cgi-bin/wppagehiststat.pl, to pick an example. The co_id is the primary key of the table, and co_page links to the usual suspect, page.page_id. Co_user and co_user_text link to user_id and user_name, as well. Co_editcount is the number of edits said user has made to the page, and co_editcount_minor represents the number of minor edits done to the page (added because someone is going to ask for it sooner or later). Co_timestamp is the timestamp of generation of the row, added in case a fancy algorithm to detect old queries is made in the future.
As to how to initially populate the table: well, a bulk initialization script is one option, but Domas didn't like it either. ;) :P What I suggested is to populate the table on demand, and maintain it later, and after a night's sleep, I thought about how to refine it more:
* On page creation, generate a row for the page and the user.
* On page edit, one of two options:
** If a row for the page and user exists already, then UPDATE the co_editcount and co_editcount_minor fields as necessary.
** If rows for the page exist, but none points to the user, generate a new row.
** If there are no rows pointing to the page, create the necessary rows from the data in the revision table.
* On a query from [[Special:Contributors]], if there are no rows pointing to the row, generate them from revision, then return the expected results, Otherwise, just return the results.
* On page deletion/undeletion/oversight, SELECT all the rows pointing to a particular page and DELETE them from the table, essentially invalidating the cache for the page.
As to why the change is desired, it was described in detail at http://bugzilla.wikimedia.org/show_bug.cgi?id=7988. Also, this could help to implement some long-lasting suggestions to enhance Wikimedia projects' credibility and citability. For example, Roy Rosenzweig of George Mason University made a few recommendations in his article "Can History be Open Source? Wikipedia and the Future of the Past" (http://chnm.gmu.edu/resources/essays/d/42, and a good read, by the way), such as putting the total number of edits made to a page into every page's footer (next to [[MediaWiki:Lastmodifiedat]]). Also, with a JOIN with the user table, a percentage of edits made by active editors can be displayed to the viewer, which is another suggestion by Rosenzweig.
There are probably more efficient ways to do this, but I just wanted to get the ball rolling... any comments? Should it be done?
Titoxd.
On 10/01/07, Titoxd@Wikimedia titoxd.wikimedia@gmail.com wrote:
Last night, after I asked Brion whether he could turn on Rob's Special:Contributors extension on Wikimedia sites, and Domas colorfully indicated that it would be a Very Bad Idea(TM). The data for the extension
Domas loves to hate.
can be retrieved from the revision table, but due to the existing topology of the table and the current load on the revision table itself, it would cause a very expensive query. He indicated that another option would be to
Yes. There's an index that we don't have applied that would help this a little, but not a huge amount.
create a new database table with page, user and editcount data. I've kept thinking about the idea, and have modified Domas's suggestion slightly:
Most of this seems sane.
later). Co_timestamp is the timestamp of generation of the row, added in case a fancy algorithm to detect old queries is made in the future.
As to how to initially populate the table: well, a bulk initialization script is one option, but Domas didn't like it either. ;) :P What I
A bulk initialiser would be possible, but would still incur heavy expense and isn't so easy to batch up.
suggested is to populate the table on demand, and maintain it later, and after a night's sleep, I thought about how to refine it more:
The basic premise is sound, even if the implementation could be ironed out a little. I'll sleep on it and mess about with it soon.
There are probably more efficient ways to do this, but I just wanted to get the ball rolling... any comments? Should it be done?
P = "extension goes live" Q = "extension is optimised" R = "rob gets peace"
P --> Q and P --> R ¬Q --> ¬R
R, thus Q
Rob Church
On 10/01/07, Rob Church robchur@gmail.com wrote:
P = "extension goes live" Q = "extension is optimised" R = "rob gets peace"
P --> Q and P --> R ¬Q --> ¬R
R, thus Q
That should have been R --> P, of course.
Rob Church
Gah, pressed the wrong button.
Anyways, what I wanted to ask is, does this require a core change?
Titoxd.
-----Original Message----- From: Rob Church [mailto:robchur@gmail.com] Sent: Tuesday, January 09, 2007 8:03 PM To: MediaWiki announcements and site admin list Subject: Re: [Mediawiki-l] Request for comments: Schema change to addSpecial:Contributors [snip]
As to how to initially populate the table: well, a bulk initialization script is one option, but Domas didn't like it either. ;) :P What I
A bulk initialiser would be possible, but would still incur heavy expense and isn't so easy to batch up.
suggested is to populate the table on demand, and maintain it later, and after a night's sleep, I thought about how to refine it more:
The basic premise is sound, even if the implementation could be ironed out a little. I'll sleep on it and mess about with it soon.
There are probably more efficient ways to do this, but I just wanted to
get
the ball rolling... any comments? Should it be done?
P = "extension goes live" Q = "extension is optimised" R = "rob gets peace"
P --> Q and P --> R ¬Q --> ¬R
R, thus Q
Rob Church
On 10/01/07, Titoxd@Wikimedia titoxd.wikimedia@gmail.com wrote:
Gah, pressed the wrong button.
Anyways, what I wanted to ask is, does this require a core change?
No.
Rob Church
Hi!
Domas loves to hate.
I hate just chronic whiners, who start calling you idiot after you misspell their name or leave weeping after minor criticism, but to not limit criticism they emit themselves. I do not have issues with other people ;-)
To stay on topic - we might want eventually to put covering index for all contributions-based operations, but still, for this operation we may want to keep a separate table (and do that for most queries, that need aggregation). For different traversing methods covering indexes are way to go.
BR, Domas
On 13/01/07, Domas Mituzas midom.lists@gmail.com wrote:
I hate just chronic whiners, who start calling you idiot after you misspell their name or leave weeping after minor criticism, but to not limit criticism they emit themselves. I do not have issues with other people ;-)
I called you an idiot because I had run out of options - you were, in fact, being remarkably insulting, which is amazingly difficult to do or so I'm told.
I don't know where this "leave weeping" nonsense comes from, to be frank - I chose to go on a break in August for my sanity and for the sanity of others. It did me the world of good and I've pledged to keep taking more time off this thing. We obviously have an irrevocably broken working relationship, which is extremely sad, because I respect your database knowledge, but I have to point out that it wasn't I who cast the stone to break it.
To stay on topic - we might want eventually to put covering index for all contributions-based operations, but still, for this operation we may want to keep a separate table (and do that for most queries, that need aggregation). For different traversing methods covering indexes are way to go.
I haven't questioned the necessity of optimising the extension and agree that a separate table is probably the best approach.
Rob Church
Rob Church wrote:
On 13/01/07, Domas Mituzas midom.lists@gmail.com wrote:
I hate just chronic whiners, who start calling you idiot after you misspell
taking more time off this thing. We obviously have an irrevocably broken working relationship, which is extremely sad
well, folks, do you know it quite amusing to read such post when you haven't any idea of what is the subject :-))
keep cool... MediaWiki is a nice peace of software that can or can't be enhanced, good thing but not vital. nobody is going to die just now :-))
have a nice day, all of you :-)
jdd
mediawiki-l@lists.wikimedia.org