Hoi,
How many characters are there, according to your software, in the word Mbɔ́tɛ?
The correct answer is 5.
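To show why software may disagree, here is a minimal Python sketch. It assumes the word is encoded as six code points, with the acute accent stored as a separate combining mark (U+0301) on the open o. The helper names are hypothetical, and skipping combining marks only approximates user-perceived characters; full grapheme segmentation has more rules than this.

```python
import unicodedata

def count_code_points(text):
    # Naive count: every Unicode code point is one "character".
    return len(text)

def count_base_characters(text):
    # Approximation of user-perceived characters: decompose the text,
    # then skip combining marks (accents attached to a base letter).
    decomposed = unicodedata.normalize("NFD", text)
    return sum(1 for ch in decomposed if unicodedata.combining(ch) == 0)

# The word from this message, spelled out with escapes (assumed encoding):
# M, b, open o (U+0254), combining acute (U+0301), t, open e (U+025B).
word = "Mb\u0254\u0301t\u025b"

print(count_code_points(word))      # naive count
print(count_base_characters(word))  # count with combining marks skipped
```

A diff that works on single code points could split the accent from its letter, which is one more reason to diff on larger units.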
Thanks,
GerardM
2009/1/10 Greg Hewgill <greg(a)hewgill.com>
2009/1/8 Brion Vibber <brion(a)wikimedia.org>:
Definitely of interest! If you haven't
already, I'd love to see some
documentation on the format on
mediawiki.org, and it'd be great if we
I did some similar work a while ago using Python's difflib[1] as the
diffing engine. Since difflib was much too slow when fed lists of
single characters, I broke the articles up into sequences of words,
which improved the speed dramatically (but it's still not as fast as
Robert's).
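A rough sketch of the word-level approach, assuming plain whitespace splitting (the original tokenization is not described here) and a hypothetical helper name:

```python
import difflib

def word_opcodes(old_text, new_text):
    # Diff on word lists instead of character lists: far fewer elements
    # for SequenceMatcher to compare.
    old_words = old_text.split()
    new_words = new_text.split()
    # autojunk=False turns off difflib's "popular element" heuristic,
    # which can otherwise ignore very common words in long articles.
    matcher = difflib.SequenceMatcher(None, old_words, new_words, autojunk=False)
    # Keep only the changed regions, with the words involved on each side.
    return [(tag, old_words[i1:i2], new_words[j1:j2])
            for tag, i1, i2, j1, j2 in matcher.get_opcodes()
            if tag != "equal"]

print(word_opcodes("the quick brown fox", "the quick red fox"))
```

Each returned tuple names the operation (replace, delete or insert) and the words affected in the old and new revision.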
My goal was slightly different: rather than producing exact
revision deltas, I was looking for "blame" information[2]. I also
built a graph of reverts by matching the SHA1 hashes of revision
texts, and used it to find the shortest path between the current
revision and the first one, which skips over page-blanking events in
most cases.
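One way the revert graph idea could look in Python. This is only a sketch under my own assumptions, not the actual blame processor: revisions arrive as a non-empty, chronologically ordered list of strings, revisions with identical SHA1 digests are treated as one node, and the function name is hypothetical.

```python
import hashlib
from collections import deque

def shortest_revision_path(revisions):
    # One SHA1 digest per revision text; identical texts share a digest.
    digests = [hashlib.sha1(t.encode("utf-8")).hexdigest() for t in revisions]
    # Remember the first revision index at which each digest appears.
    first_index = {}
    for i, d in enumerate(digests):
        first_index.setdefault(d, i)
    # Undirected edges between the digests of consecutive revisions.
    neighbours = {}
    for a, b in zip(digests, digests[1:]):
        if a != b:
            neighbours.setdefault(a, set()).add(b)
            neighbours.setdefault(b, set()).add(a)
    # Breadth-first search from the current revision back to the first.
    start, goal = digests[-1], digests[0]
    previous = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            break
        for nxt in sorted(neighbours.get(node, ())):
            if nxt not in previous:
                previous[nxt] = node
                queue.append(nxt)
    # Walk the search tree from the first revision towards the current one.
    path = []
    node = goal
    while node is not None:
        path.append(first_index[node])
        node = previous[node]
    return path

# Revision 2 blanks the page and revision 3 reverts the blanking.
history = ["A", "A B", "", "A B", "A B C"]
print(shortest_revision_path(history))
```

Because the revert restores a text that was already seen, the blanked revision lies on a detour and the shortest path leaves it out.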
The output for the first 1400 or so articles in enwiki can be found
here:
http://hewgill.com/~greg/wikiblame/
I would be interested in adapting my blame processor to use a faster
diffing algorithm, since it took my machine many hours to process
those 1400 articles.
[1]: http://python.org/doc/2.5/lib/module-difflib.html
[2]: http://hewgill.com/journal/entries/461-wikipedia-blame
Greg Hewgill
http://hewgill.com
_______________________________________________
Wikitech-l mailing list
Wikitech-l(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l