In the early days, one could follow Recent changes daily to see if anything had changed. Nowadays, watchlists reduce the amount of information to just the pages one is interested in.
Many articles are so long, and have been edited so many times, that the history view is almost useless. If I want to find out when and how the sentence "Overall, the city is relatively flat" in the article [[en:Paris]] has changed over time, I can sit all day and analyze individual diffs.
I think it would be very useful if I could highlight a sentence, paragraph or section of an article and get a reduced history view with only those edits that changed that part of the page. What sorts of indexes would be needed to facilitate such a search? Has anybody already implemented this as a separate tool?
On Sun, Jan 16, 2011 at 7:34 PM, Lars Aronsson lars@aronsson.se wrote:
I think it would be very useful if I could highlight a sentence, paragraph or section of an article and get a reduced history view with only those edits that changed that part of the page. What sorts of indexes would be needed to facilitate such a search? Has anybody already implemented this as a separate tool?
The New York Times has made some progress in this area.[0] Of course, their articles don't get edited the way Wikipedia ones do...
I've tried using WikiBlame[1] a few times, but it operates at the level of strings, rather than sections/paragraphs/sentences, so, like you, I'm left to do most of my digging by hand. Glad to know I'm not alone in my pain. :-)
[0] http://open.blogs.nytimes.com/2011/01/11/emphasis-update-and-source/ [1] http://en.wikipedia.org/wiki/User:Flominator/WikiBlame
2011/1/17 Benjamin Lees emufarmers@gmail.com
On Sun, Jan 16, 2011 at 7:34 PM, Lars Aronsson lars@aronsson.se wrote:
I think it would be very useful if I could highlight a sentence, paragraph or section of an article and get a reduced history view with only those edits that changed that part of the page. What sorts of indexes would be needed to facilitate such a search? Has anybody already implemented this as a separate tool?
Before I dug a little into wiki mysteries, I was absolutely sure that wiki articles were stored as small pieces (paragraphs?), so that a small edit to a very long page would take exactly the same disk space as a small edit to a short page. But I soon discovered that things are different. :-)
Obviously, I'm far from having the skill and competence to compare different solutions and their performance; nevertheless, I guess that the history architecture of Google Docs is different from the wiki one. Probably the Dropbox history engine is different too.
Alex
On Mon, Jan 17, 2011 at 5:55 AM, Alex Brollo alex.brollo@gmail.com wrote:
Before I dug a little into wiki mysteries, I was absolutely sure that wiki articles were stored as small pieces (paragraphs?), so that a small edit to a very long page would take exactly the same disk space as a small edit to a short page. But I soon discovered that things are different. :-)
Wikimedia stores diffs using delta compression, so actually this is basically what happens. The size of the edit is what determines the size of the stored diff, not the size of the page. (I don't know how this works in detail, though.) IIRC, default MediaWiki doesn't work this way.
2011/1/17 Aryeh Gregor Simetrical+wikilist@gmail.com:
Wikimedia stores diffs using delta compression, so actually this is basically what happens. The size of the edit is what determines the size of the stored diff, not the size of the page. (I don't know how this works in detail, though.) IIRC, default MediaWiki doesn't work this way.
Wikimedia doesn't technically use delta compression. It concatenates a couple dozen adjacent revisions of the same page and compresses that (with gzip?), achieving very good compression ratios because there is a huge amount of duplication in, say, 20 adjacent revisions of [[Barack Obama]] (small changes to a large page, probably a few identical versions due to vandalism reverts, etc.). However, decompressing it just gets you the raw text, so nothing in this storage system helps generation of diffs. Diff generation is still done by shelling out to wikidiff2 (a custom C++ diff implementation that generates diffs with HTML markup like <ins>/<del>) and caching the result in memcached.
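For illustration, here is a toy Python sketch of why concatenating adjacent revisions compresses so well. It is only a demonstration of the idea (zlib standing in for gzip), not the actual MediaWiki code:

import random
import zlib

# Twenty nearly identical revisions of the same page: a large base text
# plus a small edit at the end of each revision.
random.seed(0)
words = ["paris", "seine", "flat", "city", "river", "hill", "overall", "the"]
base = " ".join(random.choice(words) for _ in range(3000))
revisions = [base + " edit number %d" % i for i in range(20)]

# Compressing each revision on its own cannot exploit the redundancy
# *between* revisions...
separately = sum(len(zlib.compress(r.encode())) for r in revisions)

# ...but compressing the concatenation of all twenty can, which is
# roughly what the concatenated-revision storage does.
together = len(zlib.compress("".join(revisions).encode()))

print("compressed separately:", separately, "bytes")
print("compressed as one blob:", together, "bytes")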
Roan Kattouw (Catrope)
On Mon, Jan 17, 2011 at 9:12 PM, Roan Kattouw roan.kattouw@gmail.com wrote:
Wikimedia doesn't technically use delta compression. It concatenates a couple dozen adjacent revisions of the same page and compresses that (with gzip?), achieving very good compression ratios because there is a huge amount of duplication in, say, 20 adjacent revisions of [[Barack Obama]] (small changes to a large page, probably a few identical versions due to vandalism reverts, etc.).
We used to do this, but the problem was that many articles are much larger than the compression window of typical compression algorithms, so the redundancy between adjacent revisions wasn't helping compression except for short articles. Tim wrote a diff-based history storage method (see DiffHistoryBlob in includes/HistoryBlob.php) and deployed it on Wikimedia, for 93% space savings:
http://lists.wikimedia.org/pipermail/wikitech-l/2010-March/047231.html
I don't know if this was ever deployed to all of external storage, though. In that thread Tim mentioned only recompressing about 40% of revisions, and said that the recompression script required care and human attention to work correctly, so maybe he never got around to recompressing all the rest -- I don't think he ever said, that I saw.
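To make the diff-based approach concrete, here is a rough Python sketch of the general idea: keep the first revision in full and store each later revision only as a delta against its predecessor. This uses difflib and is only an illustration; DiffHistoryBlob itself is implemented quite differently.

from difflib import SequenceMatcher

class DiffHistory:
    # Toy diff-based history store: the first revision is kept in full
    # (a "key frame"), every later revision only as a delta against its
    # predecessor.
    def __init__(self, first_text):
        self.full = first_text
        self.deltas = []
        self._latest = first_text   # kept only to compute the next delta

    def add(self, new_text):
        ops = SequenceMatcher(None, self._latest, new_text).get_opcodes()
        delta = []
        for tag, i1, i2, j1, j2 in ops:
            if tag == 'equal':
                delta.append(('copy', i1, i2))           # reuse old text
            elif tag in ('replace', 'insert'):
                delta.append(('data', new_text[j1:j2]))  # store only new text
            # 'delete' needs no entry: the range is simply not copied
        self.deltas.append(delta)
        self._latest = new_text

    def revision(self, n):
        # Rebuild revision n by replaying deltas from the key frame.
        text = self.full
        for delta in self.deltas[:n]:
            parts = []
            for piece in delta:
                if piece[0] == 'copy':
                    parts.append(text[piece[1]:piece[2]])
                else:
                    parts.append(piece[1])
            text = ''.join(parts)
        return text

h = DiffHistory("Paris is flat.\n")
h.add("Overall, the city is relatively flat.\n")
assert h.revision(1) == "Overall, the city is relatively flat.\n"

A real implementation would also start a fresh key frame every so often, so old revisions can be rebuilt without replaying the entire chain.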
2011/1/19 Aryeh Gregor Simetrical+wikilist@gmail.com:
We used to do this, but the problem was that many articles are much larger than the compression window of typical compression algorithms, so the redundancy between adjacent revisions wasn't helping compression except for short articles. Tim wrote a diff-based history storage method (see DiffHistoryBlob in includes/HistoryBlob.php) and deployed it on Wikimedia, for 93% space savings:
http://lists.wikimedia.org/pipermail/wikitech-l/2010-March/047231.html
That's right, I forgot about that.
I don't know if this was ever deployed to all of external storage, though. In that thread Tim mentioned only recompressing about 40% of revisions, and said that the recompression script required care and human attention to work correctly, so maybe he never got around to recompressing all the rest -- I don't think he ever said, that I saw.
I think he finished recompressing a couple of months ago.
Roan Kattouw (Catrope)
It seems a completely different topic, but: is there something to learn about text storage from the clever trick used for storing TeX formulas? I did a little bit of "reverse engineering" on that algorithm; I never found any useful application for it, but it was a lot of fun. :-)
Alex
On Tue, Jan 18, 2011 at 7:21 PM, Aryeh Gregor Simetrical+wikilist@gmail.com wrote:
On Mon, Jan 17, 2011 at 9:12 PM, Roan Kattouw roan.kattouw@gmail.com wrote:
Wikimedia doesn't technically use delta compression. It concatenates a couple dozen adjacent revisions of the same page and compresses that (with gzip?), achieving very good compression ratios because there is a huge amount of duplication in, say, 20 adjacent revisions of [[Barack Obama]] (small changes to a large page, probably a few identical versions due to vandalism reverts, etc.).
We used to do this, but the problem was that many articles are much larger than the compression window of typical compression algorithms, so the redundancy between adjacent revisions wasn't helping compression except for short articles. Tim wrote a diff-based history storage method (see DiffHistoryBlob in includes/HistoryBlob.php) and deployed it on Wikimedia, for 93% space savings:
http://lists.wikimedia.org/pipermail/wikitech-l/2010-March/047231.html
Why isn't this being used for the dumps?
On Wed, Jan 19, 2011 at 3:59 AM, Anthony wikimail@inbox.org wrote:
Why isn't this being used for the dumps?
Well, the relevant code is totally unrelated, so the question is sort of a non sequitur. If you mean "Why don't we have incremental dumps?", I guess Ariel is the person to ask. I'm assuming the answer is (as usual in software development) that there are higher-priority things to do right now. The concept of incremental dumps is pretty obvious, but that doesn't mean it wouldn't take some manpower to get them working.
On Wed, Jan 19, 2011 at 3:33 AM, Aryeh Gregor Simetrical+wikilist@gmail.com wrote:
On Wed, Jan 19, 2011 at 3:59 AM, Anthony wikimail@inbox.org wrote:
Why isn't this being used for the dumps?
Well, the relevant code is totally unrelated, so the question is sort of a non sequitur.
No, the question is why the relevant code is totally unrelated. Specifically, I'm talking about the full history dumps.
If you mean "Why don't we have incremental dumps?"
No, that's not the question. The question is why you are uncompressing and undiffing (from DiffHistoryBlobs) only to recompress (to bz2) and then uncompress and recompress (to 7z), when you can get roughly the same compression by just extracting the blobs and removing any non-public data. Or, if it's easier, continue to uncompress (in gz) and undiff, then rediff and recompress (in gz), as that will be much, much faster than compressing in bz2.
You'll also wind up with a full history dump which is *much* easier to work with. Yes, you'll break backward compatibility, but considering that the English full history dump never finishes, even if you just implemented it for that one it'd be better than the present, which is to have nothing.
I'm assuming the answer is (as usual in software development) that there are higher-priority things to do right now.
And there are lots of lower-priority things that are being done. And lots of dollars sitting on the sidelines doing nothing.
"Anthony" wikimail@inbox.org wrote in message news:AANLkTi=UK+UF3y_B+ZLd57WCfUEF_7rf-Bt8TNvtg+2f@mail.gmail.com...
No, that's not the question. The question is why you are uncompressing and undiffing (from DiffHistoryBlobs) only to recompress (to bz2) and then uncompress and recompress (to 7z), when you can get roughly the same compression by just extracting the blobs and removing any non-public data.
That's probably not nearly as straightforward as it sounds. RevDel'd and suppressed revisions are not removed from the text storage; even Oversighted revisions are left there, and only the entry in the revision table is removed or altered. I don't know OTTOMH how regularly the DiffHistoryBlob system stores a 'key frame', and how easy it would be to break diff chains in order to snip out non-public data from them, but I'd guess a) not very, and b) that the current code doesn't give any consideration to doing so because there's no reason for it to do so. So refactoring it to incorporate that, while not impossible, is a non-trivial amount of work.
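Conceptually the snipping operation is simple enough; something like this Python sketch, which assumes you already have every revision text decompressed and in hand. Retrofitting it into the existing blob code, without that assumption, is where the work is:

from difflib import SequenceMatcher

def public_delta_chain(revisions, suppressed):
    # Toy sketch: rebuild a delta chain with suppressed revisions snipped
    # out.  `revisions` holds every revision text, `suppressed` the indices
    # that must not appear in a public dump.  Each delta is taken against
    # the previous *public* revision, so dropping a revision means the next
    # delta has to be re-based against a different predecessor.
    chain, prev = [], ""
    for i, text in enumerate(revisions):
        if i in suppressed:
            continue
        delta = []
        for tag, i1, i2, j1, j2 in SequenceMatcher(None, prev, text).get_opcodes():
            if tag == 'equal':
                delta.append(('copy', i1, i2))
            elif tag in ('replace', 'insert'):
                delta.append(('data', text[j1:j2]))
        chain.append((i, delta))
        prev = text
    return chain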
And there are lots of lower-priority things that are being done. And lots of dollars sitting on the sidelines doing nothing.
Low-priority interesting things tend to get done when you have volunteers doing them. While the value of some of the Foundation's expenditure is commonly debated, I think you'd struggle to argue that many of the WMF's dollars are not doing *anything*.
--HM
On Wed, Jan 19, 2011 at 7:49 PM, Happy-melon happy-melon@live.com wrote:
"Anthony" wikimail@inbox.org wrote in message news:AANLkTi=UK+UF3y_B+ZLd57WCfUEF_7rf-Bt8TNvtg+2f@mail.gmail.com...
No, that's not the question. The question is why you are uncompressing and undiffing (from DiffHistoryBlobs) only to recompress (to bz2) and then uncompress and recompress (to 7z), when you can get roughly the same compression by just extracting the blobs and removing any non-public data.
That's probably not nearly as straightforward as it sounds.
I have no idea how straightforward it sounds, so I won't argue with that.
RevDel'd and suppressed revisions are not removed from the text storage; even Oversighted revisions are left there, and only the entry in the revision table is removed or altered. I don't know OTTOMH how regularly the DiffHistoryBlob system stores a 'key frame', and how easy it would be to break diff chains in order to snip out non-public data from them, but I'd guess a) not very, and b) that the current code doesn't give any consideration to doing so because there's no reason for it to do so. So refactoring it to incorporate that, while not impossible, is a non-trivial amount of work.
It wouldn't be trivial, but it wouldn't be particularly hard either. Most of the work is already being done. It's just being done inefficiently.
On Wed, Jan 19, 2011 at 7:49 PM, Happy-melon happy-melon@live.com wrote:
And there are lots of lower-priority things that are being done. And lots of dollars sitting on the sidelines doing nothing.
Low-priority interesting things tend to get done when you have volunteers doing them. While the value of some of the Foundation's expenditure is commonly debated, I think you'd struggle to argue that many of the WMF's dollars are not doing *anything*.
Last I checked there were millions of them sitting in the bank.
On Wed, Jan 19, 2011 at 4:15 PM, Anthony wikimail@inbox.org wrote:
No, the question is why the relevant code is totally unrelated.
Well, you might ask why we don't just (selectively) dump the page, revision, and text tables instead of doing XML dumps -- it seems like it would be much simpler -- but I have no idea. Perhaps it's to ease processing with non-MediaWiki tools, but I'm not sure why that's a design goal compared to the simplicity of SQL dumps. Surely it wouldn't be too hard to write a maintenance/ tool that just fetches the revision text for a particular article at a particular point, using only those three tables without any MediaWiki framework so it can be used standalone. Not to mention, the text table is immutable, so creating and publishing text table dumps incrementally should be trivial.
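As a very rough sketch of how trivial the incremental part could be (old_id/old_text are the text table's column names; the state file, the emit() stub and the sqlite stand-in are made up for illustration, and deletion/oversight is ignored entirely):

import sqlite3

def emit(old_id, old_text):
    # Placeholder: a real dumper would append this row to the dump file.
    print(old_id, len(old_text), "bytes")

def dump_new_text_rows(conn, state_file="last_dumped_id.txt"):
    # Minimal sketch of an incremental text-table dump.  Because rows in
    # the text table are never modified once written, it is enough to
    # remember the highest old_id already dumped and export everything
    # above it.
    try:
        last_id = int(open(state_file).read())
    except (FileNotFoundError, ValueError):
        last_id = 0
    rows = conn.execute(
        "SELECT old_id, old_text FROM text WHERE old_id > ? ORDER BY old_id",
        (last_id,)).fetchall()
    for old_id, old_text in rows:
        emit(old_id, old_text)
    if rows:
        open(state_file, "w").write(str(rows[-1][0]))

if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE text (old_id INTEGER PRIMARY KEY, old_text TEXT)")
    conn.execute("INSERT INTO text VALUES (1, 'Overall, the city is relatively flat.')")
    dump_new_text_rows(conn)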
But I'm not going to criticize anyone from the peanut gallery here. I don't actually know much about the dumps work. Happy-melon is correct to point out that it might not be trivial to snip private info (even oversighted revisions) from the text table, depending on how it's constructed. There might be other concerns too.
And there are lots of lower-priority things that are being done. And lots of dollars sitting on the sidelines doing nothing.
That's a discussion for foundation-l, not wikitech-l.
On Thu, Jan 20, 2011 at 4:04 AM, Anthony wikimail@inbox.org wrote:
It wouldn't be trivial, but it wouldn't be particularly hard either. Most of the work is already being done. It's just being done inefficiently.
I'm glad to see you know what you're talking about here. Presumably you've examined the relevant code closely and determined exactly how you'd implement the necessary changes in order to evaluate the difficulty. Needless to say, patches are welcome.
On Fri, Jan 21, 2011 at 6:48 AM, Aryeh Gregor Simetrical+wikilist@gmail.com wrote:
Not to mention, the text table is immutable, so creating and publishing text table dumps incrementally should be trivial.
The problem there is deletion and oversight. The best solution if you didn't have to worry about that would be to have a database on the dump servers with only public data, which accesses a live feed (over the LAN). Then creating a dump would be as simple as pg_dump, and fancier incremental dumps could be made relatively simply as well.
Then again, if your live feed tells you which revisions to delete/oversight, that's still a viable solution.
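A sketch of what the consumer side of such a feed might look like; the event format and the one-table schema here are invented purely for illustration, since the real feed would have to be designed first:

import sqlite3

def apply_feed_event(db, event):
    # Toy sketch of the "live feed" idea: a second database holds only
    # public data and is updated from a stream of events.
    if event["type"] == "new_revision":
        db.execute("INSERT INTO revision (rev_id, page_id, text) VALUES (?, ?, ?)",
                   (event["rev_id"], event["page_id"], event["text"]))
    elif event["type"] in ("revision_deleted", "revision_suppressed"):
        # The public copy simply drops the row; the canonical cluster keeps it.
        db.execute("DELETE FROM revision WHERE rev_id = ?", (event["rev_id"],))
    db.commit()

# Once such a copy exists, "creating a dump" really is just dumping that
# database (mysqldump/pg_dump, or sqlite3's .dump in this toy).
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE revision (rev_id INTEGER PRIMARY KEY, page_id INTEGER, text TEXT)")
apply_feed_event(db, {"type": "new_revision", "rev_id": 1, "page_id": 42, "text": "Paris is flat."})
apply_feed_event(db, {"type": "revision_suppressed", "rev_id": 1})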
On Thu, Jan 20, 2011 at 4:04 AM, Anthony wikimail@inbox.org wrote:
It wouldn't be trivial, but it wouldn't be particularly hard either. Most of the work is already being done. It's just being done inefficiently.
I'm glad to see you know what you're talking about here. Presumably you've examined the relevant code closely and determined exactly how you'd implement the necessary changes in order to evaluate the difficulty. Needless to say, patches are welcome.
Access to the servers is welcome. I can't possibly test and improve performance without it.
Alternatively, give me a free live feed, and I'll make a decent dump system here at home, and provide the source code when I'm done.
On Sun, Jan 16, 2011 at 7:34 PM, Lars Aronsson lars@aronsson.se wrote:
Many articles are so long, and have been edited so many times, that the history view is almost useless. If I want to find out when and how the sentence "Overall, the city is relatively flat" in the article [[en:Paris]] has changed over time, I can sit all day and analyze individual diffs.
I think it would be very useful if I could highlight a sentence, paragraph or section of an article and get a reduced history view with only those edits that changed that part of the page. What sorts of indexes would be needed to facilitate such a search? Has anybody already implemented this as a separate tool?
How would you define a particular sentence, paragraph or section of an article? The difficulty of the solution lies in answering that question.
On Mon, Jan 17, 2011 at 3:49 PM, Anthony wikimail@inbox.org wrote:
How would you define a particular sentence, paragraph or section of an article? The difficulty of the solution lies in answering that question.
Difficult, but doable. Jan-Paul's sentence-level editing tool is able to make the distinction. It would perhaps be possible to use that as a framework for sentence-level diffs.
Bryan
2011/1/17 Bryan Tong Minh bryan.tongminh@gmail.com
Difficult, but doable. Jan-Paul's sentence-level editing tool is able to make the distinction. It would perhaps be possible to use that as a framework for sentence-level diffs.
Difficult, but the diff between versions of a page already does it. Looking at diffs between versions, I firmly believed that only the changed paragraphs were stored, so that the page was built from updated diff segments. I had no idea how this could be done, but it was all "magic"!
Alex
On Mon, Jan 17, 2011 at 10:40 AM, Alex Brollo alex.brollo@gmail.com wrote:
2011/1/17 Bryan Tong Minh bryan.tongminh@gmail.com
Difficult, but doable. Jan-Paul's sentence-level editing tool is able to make the distinction. It would perhaps be possible to use that as a framework for sentence-level diffs.
Difficult, but the diff between versions of a page already does it. Looking at diffs between versions, I firmly believed that only the changed paragraphs were stored, so that the page was built from updated diff segments. I had no idea how this could be done, but it was all "magic"!
Paragraphs are much easier to recognize than sentences, as wikitext has a paragraph delimiter - a blank line. To truly recognize sentences, you basically have to engage in natural language processing, though you can probably get it right 90% of the time without too much effort.
And recognizing what's going on when a sentence changes *and* is moved from one paragraph to another requires an even greater level of natural language understanding. Again, though, you can probably get it right most of the time without too much effort.
Wikitext actually makes it easier for the most part, as you can use tricks such as the fact that the periods in [[I.M. Someone]] don't represent sentence delimiters, since they are contained in square brackets. But not all periods which occur in the middle of a sentence are contained in square brackets, and not all sentences end with a period.
I'd say "difficult but doable" is quite accurate, although with the caveat that even the state of the art tools available today are probably going to make mistakes that would be obvious to a human. I'm sure there are tools for this, and there are probably some decent ones that are open source. But it's not as simple as just adding an index.
On Mon, Jan 17, 2011 at 12:41 PM, Anthony wikimail@inbox.org wrote:
And recognizing what's going on when a sentence changes *and* is moved from one paragraph to another requires an even greater level of natural language understanding. Again, though, you can probably get it right most of the time without too much effort.
Or at the paragraph level, when two paragraphs are combined into one (vs. one paragraph being deleted), or one paragraph is split into two (vs. one paragraph being added), or any of the various other, more complicated changes that take place.
If you want a high level of accuracy when trying to determine who added a particular fact (such as "Overall, the city is relatively flat", which may have started out as "Paris, in general, contains very few changes in elevation"), you really need to combine automated tools with human understanding.
On 01/17/2011 06:50 PM, Anthony wrote:
If you want a high level of accuracy when trying to determine who added a particular fact (such as "Overall, the city is relatively flat", which may have started out as "Paris, in general, contains very few changes in elevation"), you really need to combine automated tools with human understanding.
Our current "diff" is not perfect, it often performs worse than the GNU "wdiff" (word diff) utility. But it is still useful. What I'm calling for is a way to filter out (or group together) some of the edits from the history view that had nothing at all to do with the specified sentence or paragraph. This shouldn't be impossible to do. It need not be perfect. The more irrelevant edits it can filter out, the better.
I'm a "Unix programmer" from the days of RCS, which is functionally equivalent to the version control in MediaWiki. In RCS,tracing when, how and by whom a particular piece of code was altered (i.e., who introduced that bug) is as hard as it now is in MediaWiki.Do any of the newer systems (SVN, Git, ...) or commercial integrated development environments have better support for this?
On 01/17/2011 03:49 PM, Anthony wrote:
How would you define a particular sentence, paragraph or section of an article? The difficulty of the solution lies in answering that question.
I think the definition could vary, and the functionality could still be useful. The API parameters could be the offset and length in the given article version, just like substr().
A user interface (depending on skin) could input the offset and length by point-and-click (region select) or by pointing at a word and finding the preceding and following blank line. Some user interface might care about sentence separators.
The search could be simplified if each edit preserved some parameters of the diff, an "edit index", e.g. "inserted 7 characters at offset 4711". Then we know that this edit is irrelevant if the sought offset is nowhere near 4711, and as we go back in history, our offset needs to be reduced by 7 if it is larger than 4711. Doing such offset arithmetic for a thousand article edits should be a lot faster than calling diff over and over again. Then again, the diffs are still necessary to build such an edit index. This could be done in a one-time conversion or on demand, using the edit index as a cache of such parameters.
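For what it's worth, here is a small Python sketch of that offset arithmetic. The edit-index format (newest-first, with ('insert'|'delete', offset, size) entries per revision) is hypothetical, and changes within a single edit are treated independently, so this is only an approximation of what a real implementation would need:

def trace_region(edit_index, offset, length):
    # edit_index: [(rev_id, [(kind, at, size), ...]), ...], newest first.
    # `at` is given in the coordinates of the revision *after* that edit.
    # Returns the revisions whose edits touched [offset, offset + length).
    relevant = []
    for rev_id, changes in edit_index:
        touched = False
        for kind, at, size in changes:
            if kind == 'insert':
                # inserted text occupies [at, at + size) in the newer text
                if at < offset + length and offset < at + size:
                    touched = True
                if at <= offset:
                    offset -= min(size, offset - at)  # shift back, clamped
            else:  # 'delete': text vanished at position `at`
                if offset <= at <= offset + length:
                    touched = True
                if at <= offset:
                    offset += size  # the deleted text sat before our region
        if touched:
            relevant.append(rev_id)
    return relevant

# An edit that inserted 7 characters at offset 4711 touches a region at
# offset 4700 but is irrelevant to one at offset 100.
print(trace_region([(42, [('insert', 4711, 7)])], 4700, 30))  # -> [42]
print(trace_region([(42, [('insert', 4711, 7)])], 100, 30))   # -> []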
What is the reason, and what can it bring to the community?
masti
On 01/17/2011 01:34 AM, Lars Aronsson wrote:
I think it would be very useful if I could highlight a sentence, paragraph or section of an article and get a reduced history view with only those edits that changed that part of the page. What sorts of indexes would be needed to facilitate such a search? Has anybody already implemented this as a separate tool?
On 01/17/2011 11:36 PM, masti wrote:
What is the reason, and what can it bring to the community?
I tried to describe this. The task of finding out the history of a part of an article is very time consuming for long articles with a long history, where you have to manually look through lots of revisions that aren't related to the part of the article you are interested in.
I took as the example the part of the flat geography of the city of Paris. Was this part controversial? Who edited it? Has it changed? When and by whom?
Most edits to the article Paris are probably related to new elections, new buildings, new institutions. Most edits have nothing to do with the flat geography. So could the history view of maybe 5000 edits be quickly reduced down to 50 edits or even 5?
On 01/18/2011 12:30 AM, Lars Aronsson wrote:
On 01/17/2011 11:36 PM, masti wrote:
What is the reason, and what can it bring to the community?
I tried to describe this. The task of finding out the history of a part of an article is very time consuming for long articles with a long history, where you have to manually look through lots of revisions that aren't related to the part of the article you are interested in.
I took as the example the part of the flat geography of the city of Paris. Was this part controversial? Who edited it? Has it changed? When and by whom?
Most edits to the article Paris are probably related to new elections, new buildings, new institutions. Most edits have nothing to do with the flat geography. So could the history view of maybe 5000 edits be quickly reduced down to 50 edits or even 5?
In this rare situation it could be beneficial, but does it really make sense in general? The workload and the added interface complication are, in my opinion, not worth it.
masti
masti wrote:
On 01/18/2011 12:30 AM, Lars Aronsson wrote:
On 01/17/2011 11:36 PM, masti wrote:
What is the reason, and what can it bring to the community?
I tried to describe this. The task of finding out the history of a part of an article is very time consuming for long articles with a long history, where you have to manually look through lots of revisions that aren't related to the part of the article you are interested in.
I took as the example the part of the flat geography of the city of Paris. Was this part controversial? Who edited it? Has it changed? When and by whom?
Most edits to the article Paris are probably related to new elections, new buildings, new institutions. Most edits have nothing to do with the flat geography. So could the history view of maybe 5000 edits be quickly reduced down to 50 edits or even 5?
In this rare situation it could be beneficial, but does it really make sense in general? The workload and the added interface complication are, in my opinion, not worth it.
masti
I think it makes sense, but more as an external tool that selects those edits for you. There are tools like http://wikipedia.ramselehof.de/wikiblame.php which aim to do these things; although I don't think they are very good, they may be a good place to start.