On Thu, Aug 31, 2017 at 7:50 PM, Jérémie Roquet jroquet@arkanosis.net wrote:
Hi Platonides,
2017-08-31 19:40 GMT+02:00 Platonides platonides@gmail.com:
On Thu, Aug 31, 2017 at 3:10 PM, Jérémie Roquet jroquet@arkanosis.net wrote:
PS: what could be incredibly useful for diving into article histories would be to import them into git², as it would allow the user to see diffs between revisions the way you see them online, to look for when a given sentence was added or removed, etc. There are some very user-friendly tools to present the histories to non-technical users once the import has been made.
Not as much as you think. I did that once, but the results were worse than expected. git (and other SCMs) diff line by line. With code, you have many relatively independent lines, and the diff is based on those. On Wikipedia articles, however, each line is a full paragraph. Thus, as soon as someone added a sentence (or a word), the full paragraph showed as changed.
Good point, thanks!
Did you try with git's builtin diff UI, or with some other frontend? I have never tried on Wikimedia dumps (I really should!) but I regularly have to diff XML files with horribly long lines, which I naively believe to be very close to what diffing Wikimedia dumps would look like, and diff-so-fancy and vimdiff do wonders with that. Unfortunately, “user-friendly” GUIs like GitKraken, which I'd have recommended to non-technical users, appear to handle diffs as poorly as git's builtin UI…
Best regards,
-- Jérémie
I think I attempted to use git gui blame, and perhaps git bisect. I'm not sure how I finally handled whatever I was looking for; it was a long time ago. You might be able to get better results with some preprocessing, though.
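To illustrate what such preprocessing might look like, here is a minimal sketch in Python that rewrites each paragraph as one sentence per line before the revision is committed, so that a line-based diff only flags the sentence that changed. This is not what I actually ran back then. The function name one_sentence_per_line, the sentence-boundary regex and the list of markup prefixes to skip are all assumptions made for the example; the regex is naive (it will split after abbreviations such as "Mr. Smith"), and the output is meant for the git import only, not for rendering.

```python
import re

# Assumption: a sentence ends at '.', '!' or '?' followed by whitespace and
# then an uppercase letter, an opening bracket or a quote. This is a rough
# heuristic, not a real sentence tokenizer.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+(?=[A-Z\[\"'])")

# Assumption: lines starting with these characters are wikitext markup
# (headings, templates, tables, lists, HTML) and are left untouched.
_MARKUP_PREFIXES = ("=", "{", "|", "!", "*", "#", ":", ";", "<")


def one_sentence_per_line(wikitext: str) -> str:
    """Split every prose paragraph (one long line) into one sentence per line."""
    out = []
    for line in wikitext.split("\n"):
        if not line.strip() or line.lstrip().startswith(_MARKUP_PREFIXES):
            out.append(line)  # keep blank lines and markup as they are
            continue
        out.extend(_SENTENCE_END.split(line))
    return "\n".join(out)


if __name__ == "__main__":
    old = "Git is a tool. It tracks changes. It is fast."
    new = "Git is a tool. It tracks many changes. It is fast."
    print(one_sentence_per_line(old))
    print("---")
    print(one_sentence_per_line(new))
```

With the text of each revision passed through this function before committing, a one-word edit in the middle of a paragraph should show up as a single changed line instead of the whole paragraph.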
Cheers
xmldatadumps-l@lists.wikimedia.org