Hi,
I want to get some feedback on a possible Summer of Code project proposal. For last year's GSoC I created an HTML diffing library for Daisy CMS. The algorithm has proven to work well, and I'm thinking of porting it to MediaWiki.
What the algorithm does is take the source of two pages and merge them to visualize the diff. The code I have already does something like this: http://users.pandora.be/guyvdb/wikipediadiff.jpg
Is this a feasible project for Wikimedia? I'm personally not very impressed with the current "diff pages". I think a visual diff would bring that part of MediaWiki up to par with the rest of the software.
Thanks
Guy
On 22/03/2008, Guy Van den Broeck guyvdb@gmail.com wrote:
> The code I have already does something like this:
> http://users.pandora.be/guyvdb/wikipediadiff.jpg
> Is this a feasible project for Wikimedia? [...]
As a user, I like that a lot.
- d.
On Sun, Mar 23, 2008 at 12:28 AM, Guy Van den Broeck guyvdb@gmail.com wrote:
> What the algorithm does is take the source of two pages and merge them to visualize the diff. The code I have already does something like this: http://users.pandora.be/guyvdb/wikipediadiff.jpg
> Is this a feasible project for Wikimedia? [...]
It would be a neat feature to add alongside the normal diff (IMO). How fast is that algorithm in MediaWiki compared to the normal diff?
It seems fast enough in Daisy, AFAICS:
http://cocoondev.org/daisy/index/version/34/diff?otherVersion=36
Hi,
Thanks for the positive feedback.
It's reasonably fast for medium-sized documents. The problem is that it relies on a word-for-word LCS pass, which means the number of elements increases by a factor of about 30 (assuming a line averages 30 words) and the worst-case execution time increases by a factor of about 900.
In Daisy this has not turned out to be a problem. There are heuristics that work in constant time, and in practice the LCS cost is O(N) instead of O(N²). Performance might still be a problem, though, and investigating all options in that department would be part of the project itself.
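To make the blow-up concrete, here is a minimal word-level diff built on the classic LCS dynamic program. This is an illustrative Python sketch, not the Daisy code, and the heuristics mentioned above are deliberately left out so the quadratic table stays visible:

```python
# Minimal word-level diff via the classic LCS dynamic program.
# Illustrative only: real implementations trim common prefixes and
# suffixes and anchor on unique words, so the quadratic worst case
# is rarely hit in practice.

def word_diff(old_text, new_text):
    a, b = old_text.split(), new_text.split()
    n, m = len(a), len(b)
    # O(n*m) table: this is why moving from lines to words
    # (roughly 30x more elements) costs ~900x in the worst case.
    lcs = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if a[i] == b[j]:
                lcs[i][j] = lcs[i + 1][j + 1] + 1
            else:
                lcs[i][j] = max(lcs[i + 1][j], lcs[i][j + 1])
    # Walk the table to emit kept (=), removed (-) and added (+) words.
    i = j = 0
    ops = []
    while i < n and j < m:
        if a[i] == b[j]:
            ops.append(("=", a[i])); i += 1; j += 1
        elif lcs[i + 1][j] >= lcs[i][j + 1]:
            ops.append(("-", a[i])); i += 1
        else:
            ops.append(("+", b[j])); j += 1
    ops += [("-", w) for w in a[i:]] + [("+", w) for w in b[j:]]
    return ops

print(word_diff("the quick brown fox", "the slow brown fox jumps"))
# [('=', 'the'), ('-', 'quick'), ('+', 'slow'), ('=', 'brown'),
#  ('=', 'fox'), ('+', 'jumps')]
```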
Even if speed is a problem for large installs, the project can still be very useful for smaller installs where ease of use has higher priority. Note that I don't want to get rid of the old diff page just yet :)
--Guy
2008/3/23, Mohamed Magdy mohamed.m.k@gmail.com:
> It would be a neat feature to add alongside the normal diff (IMO). How fast is that algorithm in MediaWiki compared to the normal diff? [...]
> --alnokta
On 23/03/2008, Guy Van den Broeck guyvdb@gmail.com wrote:
> The problem is that it relies on a word-for-word LCS pass [...] Performance might still be a problem, though, and investigating all options in that department would be part of the project itself.
Ahh, yeah, you'd need a better algorithm :-)
Still, it's pretty darn shiny and a highly desirable thing!
- d.
On Sat, Mar 22, 2008 at 6:28 PM, Guy Van den Broeck guyvdb@gmail.com wrote:
> [...] Is this a feasible project for Wikimedia? I'm personally not very impressed with the current "diff pages". I think a visual diff would bring that part of MediaWiki up to par with the rest of the software.
I agree that inline diffs would be nicer, instead of side-by-side. Having it be an HTML-rendered diff instead of a wikitext diff is useful to some extent, but it hides information. It seems like it would be relatively difficult to convey the fact that templates or images were changed, for instance, and things like comments (which must be included in diffs for proper usability) would also be an issue. Some mechanism would have to be devised to convey that such invisible changes took place. Possibly you could have an option to do a wikitext diff instead, but that doesn't seem ideal to me. Doing it one way that works well for everyone would be best if possible.
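One conceivable way to surface such invisible edits (purely a hypothetical sketch, not an existing MediaWiki mechanism): pull the constructs that don't show up directly in the rendered HTML, such as comments and template invocations, out of both wikitext revisions and flag any difference next to the visual diff.

```python
import re

# Hypothetical helper: extract wikitext constructs that are invisible
# (or only indirectly visible) in the rendered page. The regexes are
# naive; nested templates would need a real parser.
def invisible_parts(wikitext):
    comments = re.findall(r"<!--.*?-->", wikitext, re.DOTALL)
    templates = re.findall(r"\{\{.*?\}\}", wikitext, re.DOTALL)
    return comments, templates

def invisible_changes(old_text, new_text):
    old_c, old_t = invisible_parts(old_text)
    new_c, new_t = invisible_parts(new_text)
    notes = []
    if old_c != new_c:
        notes.append("hidden comments changed")
    if old_t != new_t:
        notes.append("template invocations changed")
    return notes

print(invisible_changes("{{stub}} text <!-- todo -->",
                        "{{stub|date=2008}} text"))
# ['hidden comments changed', 'template invocations changed']
```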
As for performance, please note that Wikimedia uses a diff engine written in C++. One written in PHP would probably not be acceptable on Wikipedia, from past experience (diffing used to eat a huge amount of CPU). Scalability is also important, within reason: [[George W. Bush]] is 128 KiB, for instance.
On Sat, Mar 22, 2008 at 10:56 PM, Carl Beckhorn cbeckhorn@fastmail.fm wrote:
> On Sat, Mar 22, 2008 at 09:19:35PM -0400, Simetrical wrote:
>> I agree that inline diffs would be nicer, instead of side-by-side.
> As an option, they can be useful. But when I'm looking for a 2-byte change to a 10 KB file, the side-by-side diff will be superior.
Why do you say that? There's a difference between side-by-side versus inline, and showing the whole page versus showing only the changed portions. I also seem to notice some "jump to change" widgets in the screenshot.
2008/3/23, Simetrical Simetrical+wikilist@gmail.com:
> It seems like it would be relatively difficult to convey the fact that templates or images were changed, for instance, and things like comments (which must be included in diffs for proper usability) would also be an issue. [...]
> As for performance, please note that Wikimedia uses a diff engine written in C++. [...]
Actually, images are handled rather well: http://cocoondev.org/daisy/index/version/12/diff?&otherDocumentId=2-cd&a... Note that the image overlays are probably wrong on Safari, but in principle it works for images.
Templates and, for instance, table changes are handled too. In Daisy we chose to display a tooltip window with an interpretation of the underlying HTML changes. I'm sure we can find something similar tailored to the needs of MediaWiki.
If I start working on the HTML diff, then I might as well add a word-for-word source diff like I did for Daisy: http://cocoondev.org/daisy/index/version/12/diff?&otherDocumentId=2-cd&a... It suffers from the same performance penalty as the HTML diff, but it conveys all the information present.
With respect to performance, I think there are a lot of options. We can fall back on a simpler diff when the file size or execution time exceeds a certain threshold, or the HTML diff can be an extra (experimental) link on the current diff page. In general, I don't think the performance concern should hold back this project. Once we have the optimized HTML diff code we can decide how and when to integrate it.
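Roughly, that fallback policy could look like this. A sketch only: the thresholds and both diff callables are made up, and the soft timeout leaves the worker thread running, which a real implementation would have to cut off harder.

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError

# Illustrative cutoffs; real values would be configuration.
MAX_BYTES = 100 * 1024   # skip the expensive visual diff on huge pages
TIME_BUDGET_S = 1.0      # soft time budget for the visual diff

def diff_with_fallback(old, new, visual_diff, simple_diff):
    """Try the rich visual diff; fall back to a plain diff when the
    input is too large or the visual pass blows its time budget."""
    if len(old) + len(new) > MAX_BYTES:
        return simple_diff(old, new)
    pool = ThreadPoolExecutor(max_workers=1)
    try:
        future = pool.submit(visual_diff, old, new)
        try:
            return future.result(timeout=TIME_BUDGET_S)
        except TimeoutError:
            # Note: the worker thread keeps running to completion;
            # a production version would need a harder cutoff.
            return simple_diff(old, new)
    finally:
        pool.shutdown(wait=False)
```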
On 23/03/2008, Guy Van den Broeck guyvdb@gmail.com wrote:
> We can fall back on a simpler diff when the file size or execution time exceeds a certain threshold, or the HTML diff can be an extra (experimental) link on the current diff page. [...]
It'd be better, I suspect, to do a really nice one (from the usage POV) and scale back as and when the algorithm is just too inefficient; at least you'll know what you're aiming for in the next version.
- d.
Simetrical schreef:
> [...] Possibly you could have an option to do a wikitext diff instead, but that doesn't seem ideal to me. Doing it one way that works well for everyone would be best if possible.
Why not both? Right now, we just render the new version of the article right below the text diff. We could replace that with an inline diff.
> As for performance, please note that Wikimedia uses a diff engine written in C++. One written in PHP would probably not be acceptable on Wikipedia, from past experience (diffing used to eat a huge amount of CPU). [...]
CPU, and memory IIRC. PHP is very bad at allocating memory efficiently. An inline diff implementation should:
* be written in C++ (and *possibly* have an *alternative* version in PHP, as we have with the current diff system)
* probably be integrated with wikidiff2 so the two diffs are generated simultaneously; this avoids calculating the differences between the same set of revisions twice
* use the diff cache; this basically means a diff is only rendered once, then cached (see the sketch below)
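The last bullet might look roughly like this; the cache, the key shape and the render function are illustrative, not MediaWiki's actual diff-cache API.

```python
# Sketch of "render once, then cache": keyed by the revision pair so
# repeat views of the same diff never recompute it.
diff_cache = {}

def get_diff_html(old_rev_id, new_rev_id, render_diff):
    key = (old_rev_id, new_rev_id)
    cached = diff_cache.get(key)
    if cached is None:
        cached = render_diff(old_rev_id, new_rev_id)  # expensive, done once
        diff_cache[key] = cached
    return cached
```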
Roan Kattouw (Catrope)
Roan Kattouw schreef:
> * probably be integrated with wikidiff2 so the two diffs are generated simultaneously; this avoids calculating the differences between the same set of revisions twice
Not really, as one is a wikitext diff and the other an HTML diff. What can be used is a cached representation of the rendered HTML for one of the sides, although that gives me a feeling of "something will get inconsistent here".
Guy Van den Broeck, +1 to your proposal. Even if it only marked modified/added paragraphs it would be a useful addition. That is... amazing!
I'm not sure what the visual diff's abstract syntax tree will contain. It's possible for it to be wikitext; HTML is also possible.
My proposal is submitted. Now I'm leaving on a two-week road trip in Argentina. I hope no refinements are necessary :)
Cheers,
Guy
2008/3/25, Platonides Platonides@gmail.com:
> Not really, as one is a wikitext diff and the other an HTML diff. [...]
> Guy Van den Broeck, +1 to your proposal. [...]
Platonides schreef:
> Roan Kattouw schreef:
>> * probably be integrated with wikidiff2 so the two diffs are generated simultaneously [...]
> Not really, as one is a wikitext diff and the other an HTML diff. [...]
Maybe I should have made myself clearer. What I don't want to happen is:
* wikidiff2 works its magic and figures out that "recieved" was changed to "received" on line 123
* wikidiff2 outputs an HTML diff outlining that
* inlinediff works (more or less) the same magic and figures out that "recieved" => "received" on line 123
* inlinediff outputs its HTML
The diff information (what was added/removed/changed and where) shouldn't be calculated twice, but should somehow be shared between the two. This will improve performance and increase consistency between the two diffs. It seemed to me that integrating inlinediff with wikidiff2 would be the most practical way to implement this sharing of information.
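As a rough illustration of that sharing, with Python's difflib standing in for wikidiff2 (none of this is MediaWiki code): compute the opcodes once, then let each renderer consume the same list.

```python
from difflib import SequenceMatcher

# Compute the edit operations once and feed the same opcode list to
# both renderers, instead of diffing the revisions twice.
def shared_opcodes(old_words, new_words):
    return SequenceMatcher(a=old_words, b=new_words).get_opcodes()

def render_text_diff(ops, a, b):
    out = []
    for tag, i1, i2, j1, j2 in ops:
        if tag in ("delete", "replace"):
            out.append("-" + " ".join(a[i1:i2]))
        if tag in ("insert", "replace"):
            out.append("+" + " ".join(b[j1:j2]))
    return "\n".join(out)

def render_inline_html(ops, a, b):
    out = []
    for tag, i1, i2, j1, j2 in ops:
        if tag == "equal":
            out.append(" ".join(a[i1:i2]))
        if tag in ("delete", "replace"):
            out.append("<del>" + " ".join(a[i1:i2]) + "</del>")
        if tag in ("insert", "replace"):
            out.append("<ins>" + " ".join(b[j1:j2]) + "</ins>")
    return " ".join(out)

a = "I recieved your mail".split()
b = "I received your mail".split()
ops = shared_opcodes(a, b)            # diffed once
print(render_text_diff(ops, a, b))    # -recieved / +received
print(render_inline_html(ops, a, b))
# I <del>recieved</del> <ins>received</ins> your mail
```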
Roan Kattouw (Catrope)
Roan Kattouw roan.kattouw@home.nl wrote:
> [...] The diff information (what was added/removed/changed and where) shouldn't be calculated twice, but should somehow be shared between the two. [...]
I doubt that this is feasible, because a visual diff will have to consider the effect of a change on templates & Co., which is very difficult unless you want to re-implement the whole parser.
Is there really so much computing time spent on diffs that performance here is an issue?
Tim
On Tue, Mar 25, 2008 at 11:20 AM, Tim Landscheidt tim@tim-landscheidt.de wrote:
> Is there really so much computing time spent on diffs that performance here is an issue?
Yes. Otherwise nobody would have bothered rewriting the diff engine in C++.
Magically calculating both diffs at once is not possible. The visual diff essentially compares tree structures at a fine granularity. A lot of the markup needs to be added into the equation to compare equality. This is too different from a source diff.
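A toy example of the difference (an illustrative sketch, not the actual data model): in an HTML diff, two words only compare equal if their surrounding markup matches too, so a plain source diff's equality test doesn't carry over.

```python
from dataclasses import dataclass

# Illustrative node shape: a word plus the markup it sits inside.
@dataclass(frozen=True)
class WordNode:
    text: str
    ancestors: tuple  # e.g. ("p", "b") for a bold word in a paragraph

def nodes_equal(x: WordNode, y: WordNode) -> bool:
    # Same text in different markup is still a change in a visual diff.
    return x.text == y.text and x.ancestors == y.ancestors

old = WordNode("hello", ("p",))
new = WordNode("hello", ("p", "b"))   # same word, now bolded
print(nodes_equal(old, new))          # False: the markup changed
```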
2008/3/25, Simetrical Simetrical+wikilist@gmail.com:
> Yes. Otherwise nobody would have bothered rewriting the diff engine in C++.
Yes, definitely, calculating diffs is costly, and one should carefully look at that double-computation issue: some parts might be common to both diff engines.
While this idea looks promising, I'm really curious to see how this will get implemented: considering how many issues are related to the parser, the idea of a visual diff, with our current (somewhat complicated) syntax, seems very challenging.
2008/3/25, Tim Landscheidt tim@tim-landscheidt.de:
> I doubt that this is feasible, because a visual diff will have to consider the effect of a change on templates & Co. [...]
> Is there really so much computing time spent on diffs that performance here is an issue?
Guy Van den Broeck wrote:
> I want to get some feedback on a possible Summer of Code project proposal. [...] I think a visual diff would bring that part of MediaWiki up to par with the rest of the software.
Very nice!
Definitely keep in mind the performance & hidden-markup issues that have been brought up in the thread, but a nice clean diff view would be fantastic.
-- brion vibber (brion @ wikimedia.org)