Hello comrades, I've run into a challenge too interesting to keep to myself ;) My immediate goal is to prototype an "offline" wikipedia, similar to Kiwix, which allows the end-user to make edits and synchronize them back to a central repository like enwiki.
The catch is, how to insert these changes without edit conflicts? With linear revision numbering, I can't imagine a natural representation of the data, only some kind of ad-hoc sandbox solution.
Extending the article revision numbering to represent a branching history would be the natural way to handle optimistic replication.
Non-linear revisioning might also facilitate simpler models for page protection, and would allow the formation of multiple, independent consensuses.
-Adam Wight
On 17/07/12 00:22, Adam Wight wrote:
> Extending the article revision numbering to represent a branching history would be the natural way to handle optimistic replication.
Actually, the revision table already allows for non-linear development (it stores which version you edited the article from). You could even make a version other than the one with the latest timestamp "win" (by changing page_latest). You will need to change the way history is viewed, however, and add a system to keep track of "heads" and "merges". There may be some assumptions across the codebase about the latest revision being the active one, too.
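To illustrate (a minimal sketch in plain Python, not actual MediaWiki code, with the revision table simplified to tuples): rev_parent_id already encodes a tree, and "heads" are just revisions that no other revision names as its parent.

    revisions = [
        # (rev_id, rev_parent_id); parent 0 marks a page's first revision
        (1, 0),
        (2, 1),
        (3, 1),  # also edited from rev 1, so the history has branched
    ]

    parents = {parent for _, parent in revisions}
    heads = [rev for rev, _ in revisions if rev not in parents]
    print(heads)  # [2, 3]: two heads, so something must decide which one "wins"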
On 07/16/2012 04:10 PM, Platonides wrote:
> Actually, the revision table already allows for non-linear development (it stores which version you edited the article from).
Cool! That's a nice solution because it's transparent to the end-user's system. However, if we use the current schema as you're describing, we would have to reconcile rev_id conflicts during the merge. This seems like a nasty problem if the merge is asynchronous, for example a batched changeset sent in email. -adam
On 17/07/12 01:49, Adam Wight wrote:
> However, if we use the current schema as you're describing, we would have to reconcile rev_id conflicts during the merge.
Not really. They would be lost in favour of the target ones. You keep a list of the rev_ids in the source wiki and the ids they get in the target wiki, adjusting the following rev_parent_id values to the target wiki numbers. It could be a problem for merges after the first one, but it's good enough for a first version.
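Sketched in plain Python (purely illustrative; the function name and tuple layout are made up, and a real implementation would operate on the revision table):

    def import_revisions(source_revs, next_target_id, id_map):
        """source_revs: (rev_id, rev_parent_id) pairs from the source wiki.
        id_map: source rev_id -> target rev_id for revisions already imported."""
        imported = []
        for rev_id, parent_id in source_revs:
            id_map[rev_id] = next_target_id
            # Parents already known on the target keep their mapped id;
            # parent 0 (a page's first revision) is left alone.
            new_parent = id_map.get(parent_id, parent_id)
            imported.append((next_target_id, new_parent))
            next_target_id += 1
        return imported

    id_map = {}
    print(import_revisions([(10, 0), (11, 10)], next_target_id=500, id_map=id_map))
    # [(500, 0), (501, 500)]: source ids are discarded, parent links preserved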
The nasty problem I see is how to determine the winner in a version conflict:

  B
 /
A
 \
  C
B and C both are revisions with common parent A. How do you handle the merge? What revision should be shown in the title?
This is all a fantastic idea. Distributing Wikipedia in a fashion similar to git will make it a lot easier to use in areas where Internet connections are not so common.
I wonder whether this sort of feature could be implemented in the existing Kiwix codebase? That would be ideal, I think.
Thank you, Derric Atzrott
> I wonder whether this sort of feature could be implemented in the existing Kiwix codebase?
Ward is working on it. :) http://wardcunningham.github.com/ https://github.com/WardCunningham/Smallest-Federated-Wiki
On Tue, Jul 17, 2012 at 4:32 AM, Derric Atzrott <datzrott@alizeepathology.com> wrote:
> This is all a fantastic idea. Distributing Wikipedia in a fashion similar to git will make it a lot easier to use in areas where Internet connections are not so common.
I have added this thread to https://en.wikipedia.org/wiki/User:HaeB/Timeline_of_distributed_Wikipedia_pr... .
On 2012-07-17 07:32, Derric Atzrott wrote:
> This is all a fantastic idea. Distributing Wikipedia in a fashion similar to git will make it a lot easier to use in areas where Internet connections are not so common.
It always surprises me when people express enthusiasm for this kind of idea, since my instinctive assumption is the exact opposite: that this couldn't possibly be feasible or practical.
Just out of curiosity, how large are the git-managed projects that you have successfully handled this way? Number of files, lines of code, bytes or commits per day? Did you ever run into a software project where a fully decentralized git solution was impractical, e.g. because pulling in the daily updates took more than an hour on your available bandwidth?
> Just out of curiosity, how large are the git-managed projects that you have successfully handled this way?
I can't say that I've handled any large git-managed projects this way, but I am given to understand that this is the very thing for which git was designed. Given this, I would hope that a git-like model would be good for decentralized editing.
Thank you, Derric Atzrott
On Mon, Jul 23, 2012 at 7:25 AM, Derric Atzrott datzrott@alizeepathology.com wrote:
> Given this, I would hope that a git-like model would be good for decentralized editing.
It's really not. Things that are (relatively) simple in the database tend to require walking the entire revision tree in Git in order to figure the same data out.
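A toy illustration in plain Python (not git internals, just the shape of the problem): the database keeps one pointer per page, while a bare commit graph only has parent links, so answering the same question means walking history from a head.

    # Database model: one pointer per page, O(1).
    page_latest = {"Foo": 3}

    # Commit-graph model: walk parents until a commit touches the page.
    commits = {
        "c3": {"parent": "c2", "touches": {"Bar"}},
        "c2": {"parent": "c1", "touches": {"Foo"}},
        "c1": {"parent": None, "touches": {"Foo", "Bar"}},
    }

    def latest_touching(head, title):
        c = head
        while c is not None:
            if title in commits[c]["touches"]:
                return c
            c = commits[c]["parent"]

    print(page_latest["Foo"], latest_touching("c3", "Foo"))  # 3 c2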
Git is awesome for software development, but trying to use it as an article development tool is really a bad solution in search of a problem. We could've had the same argument years ago and said "why use a database? SVN stores information in a linear history that's useful for articles." Having diverging articles may be cool/desired, but using Git is not the answer.
-Chad
> Git is awesome for software development, but trying to use it as an article development tool is really a bad solution in search of a problem.
Fair enough. I learn something new every day. I definitely think that distributed article editing is a great idea, even if a git-like system is not the answer to it.
Thank you, Derric Atzrott
> It's really not. Things that are (relatively) simple in the database tend to require walking the entire revision tree in Git in order to figure the same data out.
Git is almost never used in a truly decentralized fashion, so it isn't optimized for that type of use. See git "hub", for example. Actual peer-to-peer is infinitely more scalable ;) because you don't have one poor enterprise Java server getting hit by everyone in the world; instead, individuals distribute the load among themselves.
That would be a difficult model for Wikipedia however, because maintaining an authoritative edition would require centralized cryptography, at the least.
Allowing articles on our central server to diverge temporarily is easily achievable, with very little overhead. In fact, when you consider the savings in revert wars, maybe there is a net gain.
I'm interested in writing a mediawiki extension to allow us to experiment with this idea.
-Adam
On 07/16/2012 04:49 PM, Adam Wight wrote:
> Cool! That's a nice solution because it's transparent to the end-user's system. However, if we use the current schema as you're describing, we would have to reconcile rev_id conflicts during the merge. This seems like a nasty problem if the merge is asynchronous, for example a batched changeset sent in email.
And that would be the core problem of asynchronous optimistic replication ;) Simple last-write-wins or union (for shopping carts..) strategies are still manageable, but merging textual changes is harder. Manual intervention will often be needed.
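For concreteness, rough sketches of those strategies in plain Python (purely illustrative; the function names are my own); the three-way text case shows where manual intervention comes in:

    def last_write_wins(a, b):
        # a, b: (timestamp, value) pairs for the same item
        return max(a, b)  # the later timestamp wins outright

    def union_merge(a, b):
        # shopping-cart style: never lose anything, just take everything
        return set(a) | set(b)

    def merge_text(base, ours, theirs):
        if ours == theirs:
            return ours
        if ours == base:
            return theirs
        if theirs == base:
            return ours
        raise ValueError("conflict: needs editor-guided merge")

    print(last_write_wins((1, "x"), (2, "y")))        # (2, 'y')
    print(union_merge({"apple"}, {"apple", "pear"}))  # {'apple', 'pear'}
    try:
        merge_text("A", "B", "C")
    except ValueError as e:
        print(e)  # conflict: needs editor-guided merge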
The editor, rather than some unsuspecting reader, is best equipped to resolve these conflicts, so some degree of synchrony in the 'push' stage might make sense, to provide an opportunity for editor-guided merging.
Gabriel
Gabriel Wicke <wicke@wikidev.net> wrote:
> The editor, rather than some unsuspecting reader, is best equipped to resolve these conflicts, so some degree of synchrony in the 'push' stage might make sense, to provide an opportunity for editor-guided merging.
Although it might be simpler for the original editor to merge their own changes, that's not always what we want. The most flexible arrangement would be to separate the process into three workflows: edit, synchronize, and merge. Different people could perform each stage, or they can be folded together when appropriate.
On protected pages, for example, we specifically want some amount of peer review before deciding to merge. This could be seen as positive feedback also, if each successfully merged change comes with a bit of validation by the community.
Even a simple branching model will offer some delicious low-hanging fruit, for example, editors could "Save Draft" for any article and resume editing later.
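As a sketch (hypothetical Python, not actual extension code): a draft is just a new revision whose parent is the current head, saved without moving the page's "latest" pointer.

    page = {"latest": 7}
    revisions = {7: {"parent": 6, "text": "published text"}}

    def save_draft(page, revisions, text, new_rev_id):
        # An ordinary revision branching off the current head; page["latest"]
        # is deliberately untouched, so readers still see rev 7 while the
        # editor can resume from rev 8 later.
        revisions[new_rev_id] = {"parent": page["latest"], "text": text}
        return new_rev_id

    draft = save_draft(page, revisions, "work in progress...", new_rev_id=8)
    print(page["latest"], draft)  # 7 8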
-adam
Hi, I've started working on an extension to manage branching history, calling it "Nonlinear". Here's the crude code, https://github.com/adamwight/Nonlinear
[Screenshot of the effect on revision history]
Excerpts from Adam Wight's message of Mon Jul 16 18:22:22 -0400 2012:
> My immediate goal is to prototype an "offline" wikipedia, similar to Kiwix, which allows the end-user to make edits and synchronize them back to a central repository like enwiki.
There is a tool for managing non-linear history in mediawiki data sets. It's actually a combination of git, the version control system, and the MediaWiki API. It's called git-remote-mediawiki.
First, I'll quote its documentation:
<quote>
Getting started with Git-Mediawiki

Then, the first operation you should do is cloning the remote mediawiki. To do so, run the command

    git clone mediawiki::http://yourwikiadress.com

You can commit your changes locally as usual with the command

    git commit
</quote>
You can read more here: https://github.com/Bibzball/Git-Mediawiki/wiki/User-manual
I've been enjoying it lately, though it has some rough edges. It is under periodic development, and in the near future I plan to make more of a user community around it.
It is probably entirely unwieldy to use on English Wikipedia directly, but it could be adjusted to permit the importing of database dumps, and then let people branch off those.
-- Asheesh.