I'm working on a (currently read-only) Subversion interface to MediaWiki: http://www.mediawiki.org/wiki/WebDAV
It's implemented in PHP and lets me checkout wiki pages using a Subversion client, or as Subversion externals: http://svnbook.red-bean.com/en/1.4/svn.advanced.externals.html
I hope I'll eventually succeed in using this interface to edit pages offline, using Emacs version control mode, or Subclipse.
Today I'm stuck on an SQL issue. To implement the Subversion update-report, I need a list of pages which changed since revision X, and whether those pages have any revisions before X (whether those pages are "new").
The first half of this query (list of changed pages) was straight forward. $entryCondition corresponds to revision.rev_id > X, but is actually a conversion of the Subversion client's claims about its current entry states to SQL an condition:
$where = array(); $where[] = 'page_id = revision.rev_page'; if ( !empty( $entryCondition ) ) { $where[] = $entryCondition; }
$options = array(); $options['GROUP BY'] = 'page_id';
$results = $dbr->select( array( 'page', 'revision' ), array( 'page_title', 'MAX(revision.rev_id)' ), $where, null, $options );
The second half of this query (whether pages are "new") has me stuck.
1) I considered building an array of pages with revisions before X; if a page id is in array, it's not "new".
The interface is used to update from revision X, where X is often close to the overall max rev_id (HEAD). Because in MediaWiki the list of changed pages is always shorter than or equal to HEAD - X, and because the list of pages with revisions before X may be huge, the array may be huge relative to the number of pages the update-report actually handles. So I rejected this approach.
2) I considered first getting the list of pages which changed since revision X, then building an array of pages with revisions before X, limited to the list of changed pages using a "page_id IN ( list of changed pages )" SQL condition. This limits the array to only the pages the update-report actually handles.
However, if this is an initial checkout, the list of changed pages may be all wiki pages. In this case the "page_id IN ( list of changed pages )" SQL condition will be huge. So I rejected this approach.
Finally, I think what I need is something like a LEFT JOIN from revisions since X to revisions before X ON equal page ids. I can then check for NULL rows in the second table, corresponding to "new" pages.
1) My first problem is performing this query with MediaWiki's database layer. t1 is a row for each page changed since X, t2 is a row for each page with revisions before X and NULL rows for pages without:
$where = array(); $where[] = 'page_id = t1.rev_page'; if ( !empty( $entryCondition ) ) { $where[] = $entryCondition; }
$options = array(); $options['GROUP BY'] = 'page_id';
$results = $dbr->select( array( 'page', 'revision AS t1 LEFT JOIN revision AS t2 ON t1.rev_page = t2.rev_page AND t2.rev_id < t1.rev_id' ), array( 'page_title', 'MAX(t1.rev_id)', 't2.rev_id' ), $where, null, $options );
The expected SQL is something like:
SELECT page_title, t1.rev_id, t2.rev_id FROM page, revision AS t1 LEFT JOIN revision AS t2 ON t1.rev_page = t2.rev_page AND t2.rev_id < t1.rev_id WHERE page_id = t1.rev_page AND t1.rev_id > 18 GROUP BY page_id;
However I actually get:
SELECT page_title,MAX(t1.rev_id),t2.rev_id FROM `page`,`revision AS t1 LEFT JOIN revision AS t2 ON t1.rev_page = t2.rev_page AND t2.rev_id < t1.rev_id` WHERE (page_id = t1.rev_page) AND (t1.rev_id > 18) GROUP BY page_id
I'm sure the back ticks are a problem, but am not yet fully conversant with MediaWiki's database layer, so don't know the "right" way to fix them. Suggestions?
2) My second problem is the SQL query itself. It appears to work, however I suspect there's a problem in the "ON" clause. Because I GROUP BY page_id, t1.rev_id is _a_ revision id greater than X, but not necessarily the _minimum_ revision id greater than X.
I tried putting "t2.rev_id < MIN(t1.rev_id)" in the "ON" clause, but MySQL complained: Invalid use of group function
I haven't simply put "NOT $entryCondtition" in the "ON" clause because, though in these examples it corresponds to "NOT t2.rev_id > 18", it may actually be a far more complicated condition.
Can anyone suggest changes to or provide feedback on this SQL query?
Much thanks, Jack