Hi,
I'm trying to use the MediaWiki API to get information about revisions to a given page. I understand that it is possible to get the entire contents of a given revision. However, I would like to know whether it is possible to get only the diffs introduced in a given revision. This would reduce bandwidth consumption.
My goal is to get the most frequently used terms in a set of revisions to a given page. For instance, for the page "Wikipedia", what were the most common terms in all revisions between 2007-09 and 2007-11?
Thanks in advance for any comments on this issue, -- Sérgio Nunes
S. Nunes wrote:
> Hi,
> I'm trying to use the Mediawiki API to get information about revisions to a given page. I understand that it is possible to get the entire contents of a given revision. However I would like to know if it is possible to only get the diffs introduced in a given revision? This would result in a reduced bandwidth consumption.
At present, there is no way to do this with the API. It is an oft-requested feature, but unfortunately no one has gotten around to implementing it. In the meantime, the only way to get the diff for a revision is to fetch the full text of that revision and of the previous one, and compare the two using GNU diff or a similar utility.
Sorry for the inconvenience!
- -- Daniel Cannon (AmiDaniel)
cannon.danielc@gmail.com
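Comparing two revisions locally, as suggested above, does not strictly require GNU diff. A minimal sketch using Python's standard difflib module follows; the function name revision_diff and the sample texts are made up for illustration, and fetching the two revision texts is left out.

```python
import difflib


def revision_diff(old_text, new_text):
    """Return unified-diff lines describing how old_text became new_text."""
    return list(
        difflib.unified_diff(
            old_text.splitlines(),
            new_text.splitlines(),
            fromfile="previous revision",  # labels are arbitrary
            tofile="current revision",
            lineterm="",  # input lines carry no newline, so add none
        )
    )


# Placeholder wikitext; real input would be two fetched revisions.
previous = "Wikipedia is an encyclopedia.\nIt is free."
current = "Wikipedia is a free encyclopedia.\nIt is free.\nAnyone can edit it."
print("\n".join(revision_diff(previous, current)))
```

Only the changed lines plus a little context are printed, but note that both full revisions still have to be downloaded first, so this does not save any bandwidth.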
On Mon, 26 Nov 2007, Daniel Cannon wrote:
> At present time, there is no way to do this with the API. It is an oft requested feature, but unfortunately, no one has gotten around to getting it implemented. In the meantime the only way to get diffs for a revision is to fetch the full page text of the revision and the previous revision and compare the two using GNU diff or a similar utility.
You can save some bandwidth (on large pages) by adding action=render to a regular diff url, like this:
http://en.wikipedia.org/w/index.php?action=raw&title=Wikipedia:Administr...
However, you then also have to screen-scrape the relevant parts of the diff, and to convert &quot; &lt; &gt; and &amp; back to " < > and &.
Paolo (en:User:Tizio)
On Mon, 26 Nov 2007, Paolo Liberatore wrote:
> http://en.wikipedia.org/w/index.php?action=raw&title=Wikipedia:Administr...
Sorry, that was actually:
http://en.wikipedia.org/w/index.php?action=render&title=Wikipedia:Admini...
(only action=view and action=render work with diffs)
Paolo (en:User:Tizio)
Paolo Liberatore wrote:
> You can save some bandwidth (on large pages) by adding action=render to a regular diff url, like this:
> http://en.wikipedia.org/w/index.php?action=render&title=Wikipedia:Admini...
Even better, you can get only the diff: http://en.wikipedia.org/w/index.php?action=render&title=Wikipedia:Admini...
Platonides wrote:
> Paolo Liberatore wrote:
>> You can save some bandwidth (on large pages) by adding action=render to a regular diff url, like this:
>> http://en.wikipedia.org/w/index.php?action=render&title=Wikipedia:Admini...
> Even better, you can get only the diff: http://en.wikipedia.org/w/index.php?action=render&title=Wikipedia:Admini...
Both URLs give you an HTML-formatted diff. I discovered some plain diff generation code hiding in a dark corner, which will output diffs like:
    1c1
    < Changed line
    ---
    > New line
    5c5,6
    < Changed line 2
    ---
    > New line 2
    > Added line
I'll start building an API module around it, so the functionality should be available in a couple of days.
Roan Kattouw (Catrope)
Roan Kattouw wrote:
> Both URLs give you an HTML-formatted diff. I discovered some plain diff generation code hiding in a dark corner, which will output diffs like:
The second one won't give you the page. You can get the diff in a plain format by removing every tag and unescaping the entities. Taking the class names into account when removing the tags is also helpful.
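The tag stripping and unescaping described above can be sketched with Python's standard html.parser module. The class names diff-addedline and diff-deletedline are an assumption about the HTML diff markup and should be checked against the actual page; DiffCellExtractor and extract_changes are hypothetical names.

```python
from html.parser import HTMLParser

# Assumed class names for the table cells of an HTML diff.
ADDED_CLASS = "diff-addedline"
DELETED_CLASS = "diff-deletedline"


class DiffCellExtractor(HTMLParser):
    """Collect the plain text of added and deleted cells in an HTML diff."""

    def __init__(self):
        # convert_charrefs=True unescapes &quot; &lt; &gt; &amp; in text.
        super().__init__(convert_charrefs=True)
        self.added = []
        self.deleted = []
        self._target = None  # list currently being filled, if any
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag != "td":
            return  # nested tags (div, span, ...) are simply dropped
        classes = (dict(attrs).get("class") or "").split()
        if ADDED_CLASS in classes:
            self._target = self.added
        elif DELETED_CLASS in classes:
            self._target = self.deleted
        else:
            self._target = None
        self._buffer = []

    def handle_data(self, data):
        if self._target is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._target is not None:
            self._target.append("".join(self._buffer))
            self._target = None


def extract_changes(diff_html):
    """Return (added_lines, deleted_lines) found in diff_html."""
    parser = DiffCellExtractor()
    parser.feed(diff_html)
    parser.close()
    return parser.added, parser.deleted
```

Using the class of each cell rather than blindly stripping tags keeps added and deleted text apart, which is what a term count over revisions would need.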
Roan Kattouw wrote:
> I discovered some plain diff generation code hiding in a dark corner
> [snip] I'll start building an API module around it, so the functionality should be available in a couple of days.
I've added diff generation to the prop=revisions module in r27890 [1]. Follow the link to see why this implementation is not yet good enough and will be improved soon.
Roan Kattouw (Catrope)
[1] http://svn.wikimedia.org/viewvc/mediawiki/?view=rev&revision=27890
On Nov 27, 2007 5:40 PM, Roan Kattouw roan.kattouw@home.nl wrote:
> I've added diff generation to the prop=revisions module in r27890 [1]. Follow the link to see why this implementation is not yet good enough and will be improved soon.
> Roan Kattouw (Catrope)
> [1] http://svn.wikimedia.org/viewvc/mediawiki/?view=rev&revision=27890
> Mediawiki-api mailing list Mediawiki-api@lists.wikimedia.org http://lists.wikimedia.org/mailman/listinfo/mediawiki-api
So this does really output a diff string? Would it be possible to return a list of diffs in the following form:
[ { 'line': 4, 'action': 'removed' }, { 'line': 7, 'action': 'modified', 'old': 'test1', 'new': 'test2' } ]
Or is this not possible with the current diff engines?
Bryan
Bryan Tong Minh wrote:
> So this does really output a diff string? Would it be possible to return a list of diffs in the following form:
> [ { 'line': 4, 'action': 'removed' }, { 'line': 7, 'action': 'modified', 'old': 'test1', 'new': 'test2' } ]
> Or is this not possible with the current diff engines?
Yes. I'll add an rvdiffformat parameter (later) so the diff can be output as normal, unified or array.
Roan Kattouw (Catrope)
Roan Kattouw wrote:
> Yes. I'll add an rvdiffformat parameter (later) so the diff can be output as normal, unified or array.
I've added rvdiffformat in r27913 [1]. The format of the array returned is as follows: (note that if lines are added or removed, line numbers will differ between versions)
For a changed line:
    {
        'action': 'change'
        'old': Old text
        'new': New text
        'oldline': Line number in old version
        'newline': Line number in new version
    }
For an added line:
    {
        'action': 'add'
        'new': Added text
        'newline': Line number in new version
    }
For a removed line:
    {
        'action': 'delete'
        'old': Removed text
        'oldline': Line number in old version
    }
Unchanged lines are not listed.
Roan Kattouw (Catrope)
[1] http://svn.wikimedia.org/viewvc/mediawiki?view=rev&revision=27913
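For anyone diffing locally, a structure like the array format described above can be approximated with Python's difflib. This is only a sketch, not the code from r27913: the function name array_diff is hypothetical, line numbers are assumed to be 1-based, and pairing replaced lines one-to-one is a simplification.

```python
import difflib


def array_diff(old_text, new_text):
    """Return a list of change records for the lines that differ."""
    old_lines = old_text.splitlines()
    new_lines = new_text.splitlines()
    matcher = difflib.SequenceMatcher(a=old_lines, b=new_lines, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue  # unchanged lines are not listed
        old_block = old_lines[i1:i2]
        new_block = new_lines[j1:j2]
        paired = min(len(old_block), len(new_block))
        # Lines present on both sides of a replaced block count as changes.
        for k in range(paired):
            changes.append({
                "action": "change",
                "old": old_block[k],
                "new": new_block[k],
                "oldline": i1 + k + 1,
                "newline": j1 + k + 1,
            })
        # Leftover old lines were removed.
        for k in range(paired, len(old_block)):
            changes.append({
                "action": "delete",
                "old": old_block[k],
                "oldline": i1 + k + 1,
            })
        # Leftover new lines were added.
        for k in range(paired, len(new_block)):
            changes.append({
                "action": "add",
                "new": new_block[k],
                "newline": j1 + k + 1,
            })
    return changes
```

Because additions and removals shift later lines, oldline and newline are tracked separately, matching the note above that line numbers differ between versions.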
Thanks for the quick implementation. Can you please post an example on how to call this new function of the API? Is this readily available on Wikipedia? If not, when will it be?
-- Sérgio Nunes
On 11/27/07, Roan Kattouw roan.kattouw@home.nl wrote:
> I've added rvdiffformat in r27913 [1].
> [snip]
> [1] http://svn.wikimedia.org/viewvc/mediawiki?view=rev&revision=27913
S. Nunes wrote:
> Thanks for the quick implementation. Can you please post an example on how to call this new function of the API?
Get a diff between r64 and r65:
    api.php?action=query&prop=revisions&revids=65&rvdiffto=64
Diff each rev of Main Page to the previous rev:
    api.php?action=query&prop=revisions&titles=Main%20Page&rvdifftoprev
> Is this readily available on Wikipedia?
No.
> If not, when will it be?
On the next software update, probably in a couple of weeks.
Roan Kattouw (Catrope)
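The two calls above can also be assembled programmatically. A small sketch with Python's urllib.parse follows; diff_query_url is a hypothetical helper, and it assumes the API accepts an empty value (rvdifftoprev=) for a flag parameter, which urlencode emits instead of the bare rvdifftoprev shown above.

```python
from urllib.parse import quote, urlencode


def diff_query_url(api_base, **params):
    """Build a prop=revisions query URL with the given extra parameters."""
    query = {"action": "query", "prop": "revisions", **params}
    # quote_via=quote encodes spaces as %20 rather than '+'.
    return api_base + "?" + urlencode(query, quote_via=quote)


# Diff between r64 and r65.
print(diff_query_url("api.php", revids=65, rvdiffto=64))
# Diff each revision of Main Page to the previous one
# (flag passed with an empty value, which is an assumption).
print(diff_query_url("api.php", titles="Main Page", rvdifftoprev=""))
```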
Roan Kattouw wrote:
> On the next software update, probably in a couple of weeks.
Because its performance impact was too high, the diff generation code has been removed. I'll see if I can get this to work in some other way, but for now you'll just have to diff the revisions yourself.
Roan Kattouw (Catrope)
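Diffing locally is enough for the original goal of finding frequent terms. A rough sketch, assuming the revision texts have already been fetched and are in chronological order; added_terms is a hypothetical name, and a modified line is treated as wholly added, so words that were already on that line are counted again.

```python
import difflib
import re
from collections import Counter


def added_terms(revisions):
    """Count words on lines that each revision adds over its predecessor."""
    counts = Counter()
    for older, newer in zip(revisions, revisions[1:]):
        for line in difflib.ndiff(older.splitlines(), newer.splitlines()):
            # ndiff prefixes lines that exist only in the newer text with "+ ".
            if line.startswith("+ "):
                counts.update(re.findall(r"\w+", line[2:].lower()))
    return counts


# Placeholder texts standing in for three consecutive revisions.
history = [
    "alpha beta",
    "alpha beta\ngamma gamma",
    "alpha beta\ngamma gamma\ngamma delta",
]
print(added_terms(history).most_common(3))
```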