r39406 contains a brand new implementation of the diff algorithm (in PHP). It's in the same family as the previous one but has better performance and a better heuristic for large diffs.
This one is actually considerably faster than the old diff. In attachment I included benchmark output that I use with some of the biggest articles on wikipedia and some medium sized ones. It simulates the actual diff workload with a line based diff first, followed by a word based diff. The comparison is done between a set of sometimes distant versions for both the old algorithm as the new one. The summary in the end concludes that the new implementation is almost 30 times faster. This is mainly due to a couple of problematic diffs that the old code choked on. For these articles the new code uses a heuristic that is very fast. By looking at the length of the LCS you can see that a common subsequence is still found that is over 90% optimal and sometimes even 100%. The settings that I used to determine when the heuristic kicks in are a bit arbitrary. Even without enabling this heuristic the new code a lot faster for this benchmark.
2008/8/5 Guy Van den Broeck guyvdb@gmail.com:
Yes I realise, I was mainly doing this out of a personal interest. Of course before I started this I was not aware of how big the performance impact would be. I still think the new Diff has added value with the greedy heuristic for large documents. The PHP implementation is used on small default installs and they can have big documents too. Not getting an out of memory error, or having timeouts is useful there.
2008/8/5 Simetrical Simetrical+wikilist@gmail.com:
On Tue, Aug 5, 2008 at 3:55 PM, Guy Van den Broeck guyvdb@gmail.com wrote:
I can't draw any final conclusions with respect to performance because I don't have a site of considerable size. I'd appreciate it if somebody could test this implementation and compare it to the current php installation.
Any site of considerable size is likely to be running wikidiff2, which is written in C++ and almost certainly much faster than your PHP implementation. (Or did you really mean that your PHP code is faster than the C++ library?) So if your target is big sites, you shouldn't have written it in PHP.
Diffing is not a significant bottleneck at present with wikidiff2, however, so a rewrite for performance reasons alone is unlikely to be productive. I imagine the Wikimedia sysadmins (for instance) would be uninterested in trying out new solutions when an established one works perfectly well.
Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l