I'm working on a (currently read-only) Subversion interface to
MediaWiki:
http://www.mediawiki.org/wiki/WebDAV
It's implemented in PHP and lets me checkout wiki pages using a
Subversion client, or as Subversion externals:
http://svnbook.red-bean.com/en/1.4/svn.advanced.externals.html
I hope I'll eventually succeed in using this interface to edit pages
offline, using Emacs version control mode, or Subclipse.
Today I'm stuck on an SQL issue. To implement the Subversion
update-report, I need a list of pages which changed since revision X,
and whether those pages have any revisions before X (whether those pages
are "new").
The first half of this query (list of changed pages) was straight
forward. $entryCondition corresponds to revision.rev_id > X, but is
actually a conversion of the Subversion client's claims about its
current entry states to SQL an condition:
$where = array();
$where[] = 'page_id = revision.rev_page';
if ( !empty( $entryCondition ) ) {
$where[] = $entryCondition;
}
$options = array();
$options['GROUP BY'] = 'page_id';
$results = $dbr->select( array( 'page', 'revision' ), array(
'page_title', 'MAX(revision.rev_id)' ), $where, null, $options );
The second half of this query (whether pages are "new") has me stuck.
1) I considered building an array of pages with revisions before X; if a
page id is in array, it's not "new".
The interface is used to update from revision X, where X is often
close to the overall max rev_id (HEAD). Because in MediaWiki the list
of changed pages is always shorter than or equal to HEAD - X, and
because the list of pages with revisions before X may be huge, the
array may be huge relative to the number of pages the update-report
actually handles. So I rejected this approach.
2) I considered first getting the list of pages which changed since
revision X, then building an array of pages with revisions before X,
limited to the list of changed pages using a "page_id IN ( list of
changed pages )" SQL condition. This limits the array to only the
pages the update-report actually handles.
However, if this is an initial checkout, the list of changed pages
may be all wiki pages. In this case the "page_id IN ( list of changed
pages )" SQL condition will be huge. So I rejected this approach.
Finally, I think what I need is something like a LEFT JOIN from
revisions since X to revisions before X ON equal page ids. I can then
check for NULL rows in the second table, corresponding to "new" pages.
1) My first problem is performing this query with MediaWiki's database
layer. t1 is a row for each page changed since X, t2 is a row for
each page with revisions before X and NULL rows for pages without:
$where = array();
$where[] = 'page_id = t1.rev_page';
if ( !empty( $entryCondition ) ) {
$where[] = $entryCondition;
}
$options = array();
$options['GROUP BY'] = 'page_id';
$results = $dbr->select( array( 'page', 'revision AS t1 LEFT JOIN
revision AS t2 ON t1.rev_page = t2.rev_page AND t2.rev_id < t1.rev_id' ), array(
'page_title', 'MAX(t1.rev_id)', 't2.rev_id' ), $where, null,
$options );
The expected SQL is something like:
SELECT page_title, t1.rev_id, t2.rev_id FROM page, revision AS t1
LEFT JOIN revision AS t2 ON t1.rev_page = t2.rev_page AND t2.rev_id
< t1.rev_id WHERE page_id = t1.rev_page AND t1.rev_id > 18 GROUP BY
page_id;
However I actually get:
SELECT page_title,MAX(t1.rev_id),t2.rev_id FROM `page`,`revision
AS t1 LEFT JOIN revision AS t2 ON t1.rev_page = t2.rev_page AND
t2.rev_id < t1.rev_id` WHERE (page_id = t1.rev_page) AND
(t1.rev_id > 18) GROUP BY page_id
I'm sure the back ticks are a problem, but am not yet fully
conversant with MediaWiki's database layer, so don't know the
"right"
way to fix them. Suggestions?
2) My second problem is the SQL query itself. It appears to work,
however I suspect there's a problem in the "ON" clause. Because I
GROUP BY page_id, t1.rev_id is _a_ revision id greater than X, but
not necessarily the _minimum_ revision id greater than X.
I tried putting "t2.rev_id < MIN(t1.rev_id)" in the "ON" clause,
but
MySQL complained: Invalid use of group function
I haven't simply put "NOT $entryCondtition" in the "ON" clause
because, though in these examples it corresponds to "NOT t2.rev_id >
18", it may actually be a far more complicated condition.
Can anyone suggest changes to or provide feedback on this SQL query?
Much thanks, Jack