On Thu, Aug 31, 2017 at 7:50 PM, Jérémie Roquet <jroquet(a)arkanosis.net> wrote:
2017-08-31 19:40 GMT+02:00 Platonides <platonides(a)gmail.com>:
On Thu, Aug 31, 2017 at 3:10 PM, Jérémie Roquet wrote:
PS: what could be incredibly useful to dive into article histories
would be to import them in git², as it would allow the user to see
diffs between revisions the way you see them online, to look for when
a given sentence has been added / removed, etc. There are some very
user-friendly tools to present the histories to non-technical users
once the import has been made.
Not as much as you think. I did that once, but the results were worse
than expected. git (and other SCMs) diff line by line. In source code
you have many relatively independent lines, and the diff is based on
those. In Wikipedia articles, by contrast, each line is a full
paragraph. Thus, as soon as someone added a sentence (or a word), the
full paragraph showed as changed.
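
To illustrate the point above, here is a minimal sketch in Python using
the standard difflib module. It is not how git computes diffs; the
helper name changed() and the sample sentences are made up for the
example. It only shows that the same edit looks very different
depending on whether the unit of comparison is a line or a word.

```python
import difflib

def changed(old_tokens, new_tokens):
    """Return (removed, added) tokens between two token sequences."""
    matcher = difflib.SequenceMatcher(None, old_tokens, new_tokens, autojunk=False)
    removed, added = [], []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            removed.extend(old_tokens[i1:i2])
            added.extend(new_tokens[j1:j2])
    return removed, added

# Hypothetical one-paragraph article: the whole paragraph is a single line.
old = "Paris is the capital of France. It lies on the Seine."
new = "Paris is the capital of France. It lies on the river Seine."

# Line-based comparison: the only line (the full paragraph) is reported as changed.
line_removed, line_added = changed([old], [new])

# Word-based comparison: only the inserted word is reported.
word_removed, word_added = changed(old.split(), new.split())
```

git itself has a --word-diff option for git diff that works along
similar lines, which may be worth trying on such an import.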
Good point, thanks!
Did you try with git's builtin diff UI, or with some other frontend? I
have never tried on Wikimedia dumps (I really should!) but I have to
diff XML files with horribly long lines on a regular basis — which is
something I naively believe to be very close to what diffing Wikimedia
dumps would look like — and diff-so-fancy and vimdiff do wonders with
that. Unfortunately, “user-friendly” GUIs like GitKraken, which I'd
have recommended to non-technical users, appear to handle diffs as
poorly as git's builtin UI…
I think I attempted to use git gui blame, and perhaps git bisect. Not sure
how I finally handled whatever I was looking for. It has been a long time.
You might be able to get better results with some preprocessing, though.
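
One possible form of such preprocessing, as a rough sketch only: split
each paragraph into one sentence per line before committing a revision,
so that a line-based diff is confined to the sentences that changed.
The function name one_sentence_per_line() and the sentence-boundary
regex are assumptions made for this example. The regex is naive
(abbreviations such as "e.g." would be split) and the transformation
is lossy, since paragraph breaks and sentence breaks both end up as
newlines, so it would only suit a copy kept for diffing.

```python
import re

# Naive sentence boundary: whitespace that follows '.', '!' or '?'.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")

def one_sentence_per_line(wikitext):
    """Rewrite text so that every sentence sits on its own line."""
    lines = []
    for paragraph in wikitext.split("\n"):
        # An empty paragraph yields [""], so blank lines are preserved.
        lines.extend(_SENTENCE_END.split(paragraph))
    return "\n".join(lines)

sample = "First sentence. Second one.\nNew paragraph here."
result = one_sentence_per_line(sample)
```

With this in place, adding a word to "Second one." would touch only
that line in a line-based diff, not the whole first paragraph.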