Hi all,
I've been directed here by Brion, Robchurch and others on #wikimedia-tech. So I propose a new feature for Wikipedia which people on #wikimedia-tech mostly refer to as a blame page or blame map. I would prefer to call it something like "Track contributions mode" (because of its similarity to MS Word's track changes mode) or "Hall of fame", but whatever. I have a live prototype written in PHP & MySQL at http://217.147.83.36:9000/ - an example of a "blame map" can be seen at http://217.147.83.36:9000/history::171 and two blame maps compared at http://217.147.83.36:9000/history::171=169
For some reason, folks on #wikimedia-tech were mainly concerned with speed and almost nothing else, so I'll try to explain the performance issues as best I can.
First of all, I DO NOT propose to recalculate diffs for all the zillions of edits Wikipedia already has. Diffs would only be calculated for new edits. Next, I want to explain in detail how I see this working. First, I propose to modify the revision table and add a flag with the following possible values: "Revision is too old to be diffed", "Revision is waiting to be diffed", "Revision has been diffed". Another table should also be added to store the blame map for each revision. The blame map for each subsequent revision will be calculated incrementally, so it doesn't really matter whether an article has 10 or 1,000 revisions; we only ever need the last blame map.
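For concreteness, the data model being described might look something like the following sketch. Everything here (the rev_diff_status column, the blamemap table and the numeric status values) is a hypothetical illustration, not taken from the prototype:

<?php
// Hypothetical values for a new revision.rev_diff_status column.
define( 'DIFF_TOO_OLD', 0 ); // revision predates the feature; never diffed
define( 'DIFF_PENDING', 1 ); // waiting to be diffed by a background server
define( 'DIFF_DONE',    2 ); // blame map has been calculated

// Hypothetical blamemap table, one row per diffed revision:
//   bm_rev_id   INT UNSIGNED  -- revision this map describes
//   bm_page_id  INT UNSIGNED  -- page, so the latest map per page is easy to find
//   bm_map      MEDIUMBLOB    -- serialized map: word position => revision that
//                                introduced that word
?>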
I also propose to have one or more separate, dedicated diff servers whose sole job is to calculate diffs in the background. I.e. a diff server grabs a revision flagged "Revision is waiting to be diffed" and the last blame map from the database, calculates the diff, stores the new blame map in the database and changes the revision flag to "Revision has been diffed". Repeat.
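A background worker along these lines might look roughly like the sketch below. It uses the hypothetical column and table from the previous sketch, plain PDO for database access, and a placeholder calculateBlameMap() standing in for the prototype's actual diff logic:

<?php
// Hypothetical background diff worker, meant to run on a dedicated server.
$db = new PDO( 'mysql:host=db.example;dbname=wikidb', 'diffworker', 'secret' );

while ( true ) {
    // Grab the oldest revision that is still waiting to be diffed.
    $rev = $db->query(
        "SELECT rev_id, rev_page FROM revision
         WHERE rev_diff_status = 1 ORDER BY rev_id LIMIT 1"
    )->fetch( PDO::FETCH_ASSOC );
    if ( !$rev ) {
        sleep( 5 ); // nothing pending; poll again later
        continue;
    }

    // Load the most recent blame map for this page (false for a new page).
    $stmt = $db->prepare(
        "SELECT bm_map FROM blamemap WHERE bm_page_id = ?
         ORDER BY bm_rev_id DESC LIMIT 1"
    );
    $stmt->execute( array( $rev['rev_page'] ) );
    $previousMap = $stmt->fetchColumn();

    // Placeholder: incrementally fold the new revision into the old map.
    $newMap = calculateBlameMap( $previousMap, $rev['rev_id'] );

    // Store the new map and mark the revision as diffed.
    $db->prepare( "INSERT INTO blamemap (bm_rev_id, bm_page_id, bm_map)
                   VALUES (?, ?, ?)" )
       ->execute( array( $rev['rev_id'], $rev['rev_page'], $newMap ) );
    $db->prepare( "UPDATE revision SET rev_diff_status = 2 WHERE rev_id = ?" )
       ->execute( array( $rev['rev_id'] ) );
}
?>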
In addition, the article display logic should be altered. The module that displays an article should check the diff flag. If the flag is set to "Revision is too old to be diffed", no further changes are needed. If it is set to "Revision is waiting to be diffed", then a Credits section should be created that contains only the message "Calculation in progress". If it is set to "Revision has been diffed", then a Credits section should be created that contains the list of contributors ordered by contribution size. The list of contributors in the correct order can be generated with a single select against the blame map table, and this select can be cached. A direct link to the blame map should be displayed too; if the user clicks on this link, the corresponding blame map is presented. Every blame map can be generated with a single select and can be placed in a cache. Yawn.
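The display-side check is then a small function; the sketch below assumes the same hypothetical status values, a memcached-style cache object, and a placeholder renderCredits() that turns a stored blame map into the contributor list:

<?php
// Hypothetical hook called while rendering an article page.
function showCreditsSection( PDO $db, Memcache $cache, $pageId, $revId, $status ) {
    if ( $status == 0 ) {                 // too old to be diffed
        return '';                        // no Credits section at all
    }
    if ( $status == 1 ) {                 // still waiting for a diff server
        return "<div class='credits'>Calculation in progress</div>";
    }
    // Diffed: show the contributor list, cached so the select runs rarely.
    $key = "credits:$pageId:$revId";
    $html = $cache->get( $key );
    if ( $html === false ) {
        $stmt = $db->prepare( "SELECT bm_map FROM blamemap WHERE bm_rev_id = ?" );
        $stmt->execute( array( $revId ) );
        $html = renderCredits( $stmt->fetchColumn() ); // placeholder renderer
        $cache->set( $key, $html, 0, 3600 );           // cache for an hour
    }
    return $html;
}
?>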
If you are still awake by now, here are more thoughts on fault tolerance. Should a diff server die, crash, fail or whatever, the only side effect the end user will see is a "Calculation in progress" message right after the article body. That's it. No slowdown or anything. If the user still wants to see some kind of diff, he or she can still use the old diff engine. Because blame maps aren't calculated in real time, this feature is an impractical target for DoS attacks. However, I should point out that any real-time diff algorithm is one big fat target for DoS attacks on other wikis which run on a single server without some sort of acceleration.
There is also a small Unicode issue. Due to crappy UTF-8 support in PHP, all non-Latin characters are currently ignored. I believe this could be solved either by enabling proper Unicode support in PHP or by writing custom code to separate words. But before that, I propose to test on the English Wikipedia first, because if it works for English it should work for other languages.
So I offer the following practical steps. Dedicate one of the servers to be a diff playground; I will need a shell account on this server. Install MediaWiki on it along with the diff logic running in the background. Create a read-only MySQL account on the live database server. As a result, this diff server can grab new revisions from the live database, diff them and store the results locally. This way we can find out how many edits a single server can process and see how many servers this feature will require in total (I don't think it will be more than 2-3, though).
In conclusion, I'd like to say that in my opinion this feature will be useful and practical if implemented. It can also be a crucial building block for other interesting features. However, I want to stress that I'm not interested in doing this *unless* it is used on the English Wikipedia and I'm given appropriate credit. I can give the reason why I want that in a private e-mail.
Thank you for reading this long and boring e-mail.
I'm forwarding this to MediaWiki-l also. I think it's more relevant there anyway.
I like that a lot, and think it should be implemented in MediaWiki ASAP. I'm e-mailing him to ask why he only wants to contribute if it's used in 'teh wikipedia'<!-- wikicrapia more like it -->, there are lots of other wiki sites out there that need this and could definitely use it. I'm aallll for this, and willing to help in any way I can.
Hoi, This small Unicode issue is a show stopper. When software is suggested that only works on Latin script, you do not appreciate the amount of work that is done in other scripts using the MediaWiki software.
Apart from that why would it be boring.. this is a technical list. Personally I am interested in two things as well, what other projects are you referring to and how you want to see this attribution done.
Thanks, GerardM
Elliott F. Cable wrote:
I'm forwarding this to MediaWiki-l also. I think it's more relevant there anyway.
I like that a lot, and think it should be implemented in MediaWiki ASAP. I'm e-mailing him to ask why he only wants to contribute if it's used in 'teh wikipedia'<!-- wikicrapia more like it -->, there are lots of other wiki sites out there that need this and could definitely use it. I'm aallll for this, and willing to help in any way I can.
Gerard Meijssen wrote:
Hoi, This small Unicode issue is a show stopper. When software is suggested that only works on Latin script, you do not appreciate the amount of work that is done in other scripts using the MediaWiki software.
Apart from that why would it be boring.. this is a technical list. Personally I am interested in two things as well, what other projects are you referring to and how you want to see this attribution done.
I discussed unicode support with the original poster on IRC. I couldn't get through to him that adding UTF-8 support to a PHP application is trivial, and requires no special UTF-8 support within PHP itself. MediaWiki's UTF-8 support is mostly implemented from scratch using PHP's binary-safe string handling. My wikidiff2 module in C++ also contains a simple UTF-8 decoder within the word splitting routine. It's not difficult.
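To make "simple UTF-8 decoder" concrete, here is a minimal sketch of the same idea in PHP; it is an illustration only, not the actual MediaWiki or wikidiff2 code, and it needs nothing from PHP beyond ordinary byte-indexed strings:

<?php
// Decode one UTF-8 sequence starting at byte offset $i of $s, returning the
// code point and advancing $i past the sequence.
// Sketch only: assumes well-formed input and does no error checking.
function utf8CodePointAt( $s, &$i ) {
    $b = ord( $s[$i++] );
    if ( $b < 0x80 ) {                    // 0xxxxxxx: plain ASCII
        return $b;
    } elseif ( $b < 0xE0 ) {              // 110xxxxx: two-byte sequence
        $cp = $b & 0x1F; $extra = 1;
    } elseif ( $b < 0xF0 ) {              // 1110xxxx: three-byte sequence
        $cp = $b & 0x0F; $extra = 2;
    } else {                              // 11110xxx: four-byte sequence
        $cp = $b & 0x07; $extra = 3;
    }
    while ( $extra-- > 0 ) {              // 10xxxxxx continuation bytes
        $cp = ( $cp << 6 ) | ( ord( $s[$i++] ) & 0x3F );
    }
    return $cp;
}
?>

A word splitter can loop over a string with this and classify each code point, for example against the code point ranges listed later in this thread.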
-- Tim Starling
On 08/06/06, Tim Starling t.starling@physics.unimelb.edu.au wrote:
Gerard Meijssen wrote:
Hoi, This small Unicode issue is a show stopper. When software is suggested that only works on Latin script, you do not appreciate the amount of work that is done in other scripts using the MediaWiki software.
"You do not appreciate" - rather a confrontational tone, there. Who are we to assume that someone else doesn't appreciate the amount of effort put in elsewhere? It might be correct, but then again, there might be no specific bias against it.
Apart from that why would it be boring.. this is a technical list. Personally I am interested in two things as well, what other projects are you referring to and how you want to see this attribution done.
Apart from why what would be boring? The post was to get feedback; don't withhold it. I would imagine standard attribution for the code under the GNU GPL, blah blah blah. We won't be adding flashing banners, "Wikipedia now uses a feature from XYZ". Or are we to start crediting developers with individual features? "Thanks for clearing your watchlist, c/o Rob Church."
I discussed unicode support with the original poster on IRC. I couldn't get through to him that adding UTF-8 support to a PHP application is trivial,
My impression of the poster was that he didn't completely understand the whole UTF-8/Unicode/blah thing or its implications, and looked somewhat confused.
and requires no special UTF-8 support within PHP itself. MediaWiki's UTF-8 support is mostly implemented from scratch using PHP's binary-safe string handling. My wikidiff2 module in C++ also contains a simple UTF-8 decoder within the word splitting routine. It's not difficult.
If the *idea* is found to be viable, adding the UTF-8 goodies will be trivial, and we'll put the damn effort in.
Rob Church
Regarding UTF-8 support. Perhaps it would be better if I try to explain some of the problems I'm facing. For example, I'm not tracking the most frequently used English words (a, the, and, or…). In my opinion, every language should be tweaked separately, and that's why I'm suggesting to first test it on the English Wikipedia. Also, I don't have a problem with finding spaces in UTF-8 encoded strings and splitting there. The problem is that some Unicode characters, like ẅ (letter w with two dots on top, Unicode code 0x1E85), are used to write words, and some Unicode characters, such as ' (left single quotation mark, Unicode code 0x2018), are used to separate words. I also believe these characters could be encoded as HTML entities in wikitext. As I'm tracking words, I need to distinguish between these "character classes", as they are known in regular expressions (i.e. \w word character and \W non-word character). If Tim Starling has a silver bullet that can solve these problems, feel free to e-mail it to me. However, in my opinion, implementing that kind of UTF-8 support from scratch can be a somewhat tricky business. The bottom line is that the problems above *can* be solved, but what I suggest is to try the English Wikipedia first to see how it's going to work in general and whether it's a useful feature. Support for other languages could and should be added later, one language at a time.
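For what it's worth, PCRE as exposed through PHP's preg functions can already make exactly this distinction when it is built with UTF-8 and Unicode property support; whether a given server's PCRE has that support is an assumption, not a given, and the splitWords() helper below is only an illustration:

<?php
// Split UTF-8 text into words on anything that is not a letter, digit or
// underscore, using Unicode character properties (requires PCRE built with
// UTF-8 and Unicode property support).
function splitWords( $text ) {
    return preg_split( '/[^\p{L}\p{N}_]+/u', $text, -1, PREG_SPLIT_NO_EMPTY );
}

// U+2018 (left single quotation mark) has property \p{P}, so it splits;
// U+1E85 (w with diaeresis) has property \p{L}, so it stays inside the word.
var_dump( splitWords( "One\xE2\x80\x98two" ) );  // array( 'One', 'two' )
var_dump( splitWords( "One\xE1\xBA\x85two" ) );  // array( 'Oneẅtwo' )
?>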
Roman Nosov wrote:
Regarding UTF-8 support. Perhaps it would be better if I try to explain some of the problems I'm facing. For example, I'm not tracking the most frequently used English words (a, the, and, or…). In my opinion, every language should be tweaked separately, and that's why I'm suggesting to first test it on the English Wikipedia. Also, I don't have a problem with finding spaces in UTF-8 encoded strings and splitting there. The problem is that some Unicode characters, like ẅ (letter w with two dots on top, Unicode code 0x1E85), are used to write words, and some Unicode characters, such as ' (left single quotation mark, Unicode code 0x2018), are used to separate words. I also believe these characters could be encoded as HTML entities in wikitext. As I'm tracking words, I need to distinguish between these "character classes", as they are known in regular expressions (i.e. \w word character and \W non-word character). If Tim Starling has a silver bullet that can solve these problems, feel free to e-mail it to me. However, in my opinion, implementing that kind of UTF-8 support from scratch can be a somewhat tricky business. The bottom line is that the problems above *can* be solved, but what I suggest is to try the English Wikipedia first to see how it's going to work in general and whether it's a useful feature. Support for other languages could and should be added later, one language at a time.
High-numbered punctuation characters are rare, so the approach I took in wikidiff2 was to consider them part of the word. I considered all non-alphanumeric characters less than 0xc0 as word-splitting punctuation characters. There are three languages that I'm aware of that don't use spaces to separate words, and thus require special handling: Chinese, Japanese and Thai. They are the only ones that I was able to find while searching the web for word segmentation information, and nobody from any other language wiki has complained. Chinese and Japanese are adequately handled by doing character-level diffs -- I received lots of praise from the Japanese Wikipedia for this scheme. Chinese and Japanese word segmentation for search or machine translation is a much more difficult problem, but luckily solving it is unnecessary for diff formatting. Character-level diffs may well be superior anyway.
For Thai I am using character-level diffs, and although I haven't received any complaints from the Wikipedians, I believe this is less than ideal. Thai has lots of composing characters, so you often end up highlighting little dots on top of letters and the like. Really what is required here is dictionary-based word segmentation. Our search engine is also next to useless on the Thai Wikipedia due to the lack of word segmentation. But that's not a problem Roman has to solve.
Putting all that together, here's how I detect word characters in wikidiff2:
inline bool my_istext(int ch) {
    // Standard alphanumeric
    if ((ch >= '0' && ch <= '9') || (ch == '_') ||
        (ch >= 'A' && ch <= 'Z') || (ch >= 'a' && ch <= 'z'))
    {
        return true;
    }
    // Punctuation and control characters
    if (ch < 0xc0) return false;
    // Thai, return false so it gets split up
    if (ch >= 0xe00 && ch <= 0xee7) return false;
    // Chinese/Japanese, same
    if (ch >= 0x3000 && ch <= 0x9fff) return false;
    if (ch >= 0x20000 && ch <= 0x2a000) return false;
    // Otherwise assume it's from a language that uses spaces
    return true;
}
Now this might not sound "trivial" anymore. UTF-8 support is trivial, I'll stand by that, but supporting all the languages of the world is not so trivial. But as you can see, language support isn't as hard as you might think, because lots of research has already been done.
-- Tim Starling
On 6/8/06, Tim Starling t.starling@physics.unimelb.edu.au wrote:
High-numbered punctuation characters are rare, so the approach I took in wikidiff2 was to consider them part of the word. I considered all non-alphanumeric characters less than 0xc0 as word-splitting punctuation characters.
The Unicode character databases actually include information on which chars are letters, which are punctuation, etc. Some programming languages incorporate this into appropriate functions such as isletter(), ispunct() or the like. I believe Perl has them. I don't know whether PHP has them or not, but if it doesn't, that might be considered a bug.
There are three languages that I'm aware of that don't use spaces to separate words, and thus require special handling: Chinese, Japanese and Thai. They are the only ones that I was able to find while searching the web for word segmentation information, and nobody from any other language wiki has complained.
The other language I can think of that doesn't use spaces is Khmer, but it doesn't have many fonts yet, and so there are very few web sites, if any, and surely no wikis. Some other Southeast Asian scripts may fall into the same category.
Chinese and Japanese are adequately handled by doing character-level diffs -- I received lots of praise from the Japanese Wikipedia for this scheme. Chinese and Japanese word segmentation for search or machine translation is a much more difficult problem, but luckily solving it is unnecessary for diff formatting. Character-level diffs may well be superior anyway.
For Thai I am using character-level diffs, and although I haven't received any complaints from the Wikipedians, I believe this is less than ideal. Thai has lots of composing characters, so you often end up highlighting little dots on top of letters and the like. Really what is required here is dictionary-based word segmentation.
I believe there are free dictionary-based word segmentation algorithms available for Thai. They're known not to be perfect, but I'm not aware of any free Thai word segmenters that do better.
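The usual starting point there is greedy longest matching against a wordlist; a rough sketch follows (the $dictionary argument is a hypothetical flat array of words, and real Thai segmenters handle ambiguity considerably better than this):

<?php
// Greedy longest-match segmentation of UTF-8 $text against a wordlist.
// Sketch only: picks the longest dictionary word at each position and falls
// back to a single character when nothing matches.
function segmentGreedy( $text, array $dictionary, $maxWordChars ) {
    $dict  = array_flip( $dictionary );  // word => index, for O(1) lookup
    $chars = preg_split( '//u', $text, -1, PREG_SPLIT_NO_EMPTY );
    $count = count( $chars );
    $out   = array();
    for ( $i = 0; $i < $count; $i += $len ) {
        // Try the longest candidate first and shrink until a dictionary hit.
        for ( $len = min( $maxWordChars, $count - $i ); $len > 1; $len-- ) {
            if ( isset( $dict[ implode( '', array_slice( $chars, $i, $len ) ) ] ) ) {
                break;
            }
        }
        // $len is 1 here if no dictionary word matched.
        $out[] = implode( '', array_slice( $chars, $i, $len ) );
    }
    return $out;
}
?>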
Andrew Dunbar (hippietrail)
Well, it looks like my question about why some quotation marks break words and others don't will remain unanswered ("rareness" of high-numbered punctuation doesn't make it part of a word)… Anyway, if that level of UTF-8 support is sufficient for MediaWiki then the Unicode issue is "solved". Unicode über alles.
On 6/8/06, Roman Nosov rnosov@gmail.com wrote:
Well, it looks like my question about why some quotation marks break words and others don't will remain unanswered ("rareness" of high-numbered punctuation doesn't make it part of a word)… Anyway, if that level of UTF-8 support is sufficient for MediaWiki then the Unicode issue is "solved". Unicode über alles.
I think it was adequately explained - the reason it isn't detected is that the algorithm doesn't know it's a separation character, so it's not separated. If the algorithm did know, it would be separated properly.
So perhaps someone, like you, should submit a quick patch to that part of the diff engine, as outlined by Tim, that makes it properly interpret that code point. If there's a general rule or table in the Unicode standard then implementing that might be an even better option.
The Unicode site, by the way, is www.unicode.org, and you can find a database of Unicode character properties here:
http://www.unicode.org/Public/UNIDATA/UnicodeData.txt
with information on interpreting them here:
http://ftp.lanet.lv/ftp/mirror/unicode/3.2-Update/UnicodeData-3.2.0.html
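The third field of each semicolon-separated line in that file is the general category (Lu, Ll, Po, Pi and so on), which is enough to build a table of word-splitting code points. A sketch, assuming a locally downloaded copy of the file and a hypothetical loadSplitters() helper:

<?php
// Build a set of code points that should split words, from UnicodeData.txt.
// Each line is semicolon-separated: field 0 = code point in hex,
// field 2 = general category. Sketch only: the "First>"/"Last>" range
// entries used for large CJK blocks are not expanded here.
function loadSplitters( $path = 'UnicodeData.txt' ) {
    $splitters = array();
    foreach ( file( $path ) as $line ) {
        $fields = explode( ';', $line );
        if ( count( $fields ) < 3 ) {
            continue;
        }
        $cat = $fields[2];
        // P* = punctuation, Z* = separators, C* = control/format/other.
        if ( $cat !== '' && strpos( 'PZC', $cat[0] ) !== false ) {
            $splitters[ hexdec( $fields[0] ) ] = true;
        }
    }
    return $splitters;  // e.g. isset( $splitters[0x2018] ) is true
}
?>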
Enjoy!
Rob Church wrote:
Gerard Meijssen wrote:
Hoi, This small Unicode issue is a show stopper. When software is suggested that only works on Latin script, you do not appreciate the amount of work that is done in other scripts using the MediaWiki software.
"You do not appreciate" - rather a confrontational tone, there. Who are we to assume that someone else doesn't appreciate the amount of effort put in elsewhere? It might be correct, but then again, there might be no specific bias against it.
It is already confrontational of a programmer to pretend the whole world could make do with Latin-1. It is one of the most devastating and accordingly infuriating assumptions that still prevails despite the fact that Unicode is decades old. We're in the 21st century; it is no longer appropriate to even start programming anything where any user-visible text is restricted to Latin-1 or any other 8-bit charset.
Unicode is not a "feature". Unicode is an implementation detail. Latin-1 is a bug.
Timwi
On 08/06/06, Timwi timwi@gmx.net wrote:
It is already confrontational of a programmer to pretend the whole world could make do with Latin-1. It is one of the most devastating and accordingly infuriating assumptions that still prevails despite the fact that Unicode is decades old. We're in the 21st century; it is no longer appropriate to even start programming anything where any user-visible text is restricted to Latin-1 or any other 8-bit charset.
Of course, of course, I clean forgot. Because a quick proof of concept has to be PERFECT, doesn't it. Do excuse that little oversight.
It's not perfect yet. Get over it and give some feedback on the idea.
Rob Church
I totally agree with Timwi – proper Unicode support is a requirement, not a feature. However, can someone tell me why PHP comes with no appropriate out-of-the-box support for such a vital feature in the 21st century? The root cause of my diff engine ignoring Unicode at the moment is that many PHP functions simply don't work with UTF-8 encoded strings. The PHP team promises proper Unicode support only in version 6. Yeah, I guess we are still in the nineties…
However, I think it's much better to honestly say upfront that Unicode isn't properly supported than to claim that it is. For example look no further than Wikipedia's current diff engine. Self-appointed Unicode expert Tim Starling brags that it is extremely easy to build UTF-8 support from scratch. Well, let's check that.
For example, if you use an ordinary single quote (the one from damned Latin-1; you can easily find it on your keyboard) to separate two words in Wikipedia, then there is no problem: the diff engine will see these as two separate words. However, if you use a left single quotation mark (Unicode code 0x2018, the one MS Word likes to use) to separate two words, oops: now these two words are treated as one.
Test case for everyone to check:

Using an ordinary single quote:
  First edit:         One'two
  Second edit:        One'three
  Diff engine output: correctly highlights the words "two" and "three"

Using a left single quotation mark (Unicode code 0x2018; you might need to type it rather than copy & paste it, of course all due to the excellent Unicode support of each and every e-mail program):
  First edit:         One'two
  Second edit:        One'three
  Diff engine output: incorrectly highlights both strings
So my question to all the Unicode Nazis here is: why is the quote from the Latin-1 charset treated *differently* from the slightly different Unicode quote?
Roman Nosov wrote:
So my question to all the Unicode Nazis here is: why is the quote from the Latin-1 charset treated *differently* from the slightly different Unicode quote?
Because they are different characters and the diff engine doesn't recognize the MS Word one? It can be fixed, I believe :o)

My editor (vim) uses the "regular" single quote when I edit Unicode text.
Roman Nosov wrote:
I totally agree with Timwi – proper Unicode support is a requirement, not a feature. However, can someone tell me why PHP comes with no appropriate out-of-the-box support for such a vital feature in the 21st century?
Well, you see, that (the fact that PHP misses out on the most basic vital features, not just Unicode specifically) is kind of why no sensible 21st-century programmer would ever recommend PHP to start writing something new, and why the only people who choose or even recommend PHP are amateurs. Now obviously we are "stuck" with MediaWiki written in PHP, so we have to use it and live with its severe shortcomings...
For example look no further than Wikipedia's current diff engine.
You have mentioned some properties of the current diff engine, but I'm afraid I don't see how any of them are in any way a problem or an issue.
Timwi
On 10/06/06, Timwi timwi@gmx.net wrote:
Well, you see, that (the fact that PHP misses out on the most basic vital features, not just Unicode specifically) is kind of why no sensible 21st-century programmer would ever recommend PHP to start writing something new, and why the only people who choose or even recommend PHP are amateurs.
Got a penchant for trolling, eh?
Rob Church
Hmm, I don't think this thread is a good place for fighting language wars. Your POV is that PHP is a bad language; my POV is that PHP offers a reasonable trade-off between performance, standards support, cost, reliability, complexity and so on. However, I did some research, and it looks like the first prize in the category "Best support of Unicode in regular expressions" goes to Perl (Perl is cited as an example many times by Unicode.org). Unfortunately, PHP clearly sucks at the moment (even with the mbstring extension). Perhaps version 6 will change that. So it might make sense to rewrite the standalone component of a new diff engine in Perl.
Also, it looks like some people don't understand the punctuation issue. In the Unicode *standard*, punctuation marks can have codes below 0xc0 as well as *above* it. If you look at the code written by Tim Starling you'll see:

  // Punctuation and control characters
  if (ch < 0xc0) return false;

So basically the code above assumes that punctuation marks can only have codes below 0xc0, which is incorrect. On the other hand, if you type in MS Word a left single quotation mark, then a sequence of letters, then a right single quotation mark, only the sequence of letters will be spell-checked. Which is nice, and shows that the MS Word developers respect at least the Unicode standard. In other words, Word sees the difference between *all* Unicode punctuation marks and all Unicode letters. But you won't be able to repeat the same trick with MediaWiki: the current diff engine considers all punctuation marks with codes above 0xc0 to be letters and makes them part of a word.

Tim Starling says in his defence that high-numbered punctuation is rare and that processing it incorrectly won't do much damage. Well, to a certain extent it's a good defence, but if you accept it then you should also accept statements like "Opera is a rarely used browser, so if Wikipedia renders incorrectly in Opera it wouldn't do much damage" or "Supporting just IE and FF is sufficient". BTW, I noticed a few glitches with how Wikipedia is displayed in Opera. Probably I've drunk too much open source Kool-Aid, but here is a good example of a proprietary product (manufactured by the so-much-hated Microsoft) obeying standards and open source software that selectively supports standards. Someone suggested to me to fix it. Well, I'm afraid I'm more on the bug-creating side of things :) In fact, I was expecting that the "Unicode Nazis" would rush to fix it. Instead, all I got were "who cares" type responses. I guess I should add more water to my Kool-Aid next time…

Also, a small suggestion to all new participants in this thread: please state whether or not you like the feature in question (you can find a description in the original e-mail).
Rob Church wrote:
On 08/06/06, Timwi timwi@gmx.net wrote:
It is already confrontational of a programmer to pretend the whole world could make do with Latin-1. It is one of the most devastating and accordingly infuriating assumptions that still prevails despite the fact that Unicode is decades old. We're in the 21st century; it is no longer appropriate to even start programming anything where any user-visible text is restricted to Latin-1 or any other 8-bit charset.
Of course, of course, I clean forgot. Because a quick proof of concept has to be PERFECT, doesn't it. Do excuse that little oversight.
It's not perfect yet. Get over it and give some feedback on the idea.
I concur. If the feature is useful at all, a test version would be useful even if it only worked in a Cyrillic variant of EBCDIC.
However, responding to the original poster here (I'm pretty sure Rob agrees), the English Wikipedia is rarely if ever the right place to test proof-of-concept code. That'd be like testing an experimental engine design on the Autobahn during rush hour.
On Fri, Jun 09, 2006 at 12:50:00AM +0300, Ilmari Karonen wrote:
I concur. If the feature is useful at all, a test version would be useful even if it only worked in a Cyrillic variant of EBCDIC.
Pervert.
:-)
Cheers, -- jra
Roman Nosov wrote:
Hi all,
I've been directed here by Brion, Robchurch and others on #wikimedia-tech. So I propose a new feature for Wikipedia which people on #wikimedia-tech mostly refer to as a blame page or blame map. I would prefer to call it something like "Track contributions mode" (because of its similarity to MS Word's track changes mode) or "Hall of fame", but whatever. I have a live prototype written in PHP & MySQL at http://217.147.83.36:9000/ - an example of a "blame map" can be seen at http://217.147.83.36:9000/history::171 and two blame maps compared at http://217.147.83.36:9000/history::171=169
Wow! That's cool and useful. I only ask myself how it scales with large articles and dozens of editors. de:Benutzer:Jah / meta:user:Jah implemented a similar function for offline use a year ago,[1] but your script looks much more developed. With blame maps you can finally find out who is responsible for which statement and directly ask him to cite his sources :-) Can you add something to give a link to the normal diff/edit where a specific paragraph was changed for the last time?
Thank you and greetings, Jakob
[1] http://de.wikipedia.org/wiki/Wikipedia:Hauptautoren An example: http://de.wikipedia.org/wiki/Wikipedia:Hauptautoren/Stern
On Thu, Jun 29, 2006 at 10:28:12PM +0200, Jakob Voss wrote:
Wow! That's cool and useful. I only ask myself how it scales with large articles and dozens of editors. de:Benutzer:Jah / meta:user:Jah implemented a similar function for offline use a year ago,[1] but your script looks much more developed. With blame maps you can finally find out who is responsible for which statement and directly ask him to cite his sources :-) Can you add something to give a link to the normal diff/edit where a specific paragraph was changed for the last time?
I concur: that is *outtahand* cool. Is it possible to specify a cutoff, either in edits-back-from-now, or after-a-given-date, and leave older material uncolored? (Specifically, this might be really cool for patrolling... not necessarily on WP, mind you.)
Is this something that can be plugged into MediaWiki? Or was that just a mockup, and you're still early on?
Cheers, -- jra
On 30/06/06, Jay R. Ashworth jra@baylink.com wrote:
Is this something that can be plugged into MediaWiki? Or was that just a mockup, and you're still early on?
When it was shown to me, I was under the impression it was proof-of-concept code. With our current method of storing text, it wouldn't be a *simple* thing to add in, but it *would* be something that could be considered, if we could cache the diffs in some manner; otherwise this becomes a real performance hog.
It's worth noting that someone else has registered an intention to do something similar and provide it as a script on the toolserver. That sounds infeasible to me, however, since we all know that, right now (and likely, for some considerable time), Zedler does not have access to the text of pages.
Rob Church
On 6/30/06, Rob Church robchur@gmail.com wrote:
When it was shown to me, I was under the impression it was proof-of-concept code. With our current method of storing text, it wouldn't be a *simple* thing to add in, but it *would* be something that could be considered, if we could cache the diffs in some manner; otherwise this becomes a real performance hog.
When you say "performance hog", are you thinking in terms of the blame map being displayed every time someone clicks "history", or only when they click some specific button?
Steve
On 30/06/06, Steve Bennett stevage@gmail.com wrote:
When you say "performance hog", are you thinking in terms of the blame map being displayed every time someone clicks "history", or only when they click some specific button?
I'm thinking in terms of the algorithm having to refer to diffs of the page which won't exist unless cached from a previous operation. So it would have to generate them there and then. Diffs are cheap enough, sure, but keep pulling the text records and diffing against them en masse, and we'll notice the spike.
Rob Church
Hello,
a *simple* thing to add in, but it *would* be something that could be considered, if we could cache the diffs in some manner; otherwise this becomes a real performance hog.
Caching diffs ain't that fun. The interesting approach to this problem would be generating an intermediate diff that could be edited incrementally, revision after revision, then displayed. But still, it is eye candy that requires quite a lot of resources...
Domas
On 30/06/06, Domas Mituzas midom.lists@gmail.com wrote:
But still, it is eye candy that requires quite a lot of resources...
You call it eye candy, I call it an interesting and potentially damn useful feature idea. Running a web site that gets upwards of ten thousand odd hits per second, allowing any old web user to edit it, etc. etc. also requires resources.
How do we handle the hit rate? Caching. Squids, shared memory caching and opcode caching, not to mention encouraging the client side cache where possible. Our resources evolve to meet our needs. We can adapt to what our code base is doing.
Sure, unreasonable and needless performance drains are a bitch, and I'm all for eradicating them. But the emphasis is on unreasonable and needless.
I'm well aware of the issues that the innocuous phrase "caching diffs" brings up, hence the "...". But here we have an interesting proposal - it's newborn, it might not have the cleanest implementation, and it might turn out to be a load of bollocks. You won't know until we've examined it in more detail.
So performance is an issue. When is it not?
Rob Church
On 6/30/06, Rob Church robchur@gmail.com wrote:
On 30/06/06, Domas Mituzas midom.lists@gmail.com wrote:
But still, it is eye candy that requires quite a lot of resources...
You call it eye candy, I call it an interesting and potentially damn useful feature idea. Running a web site that gets upwards of ten thousand odd hits per second, allowing any old web user to edit it, etc. etc. also requires resources.
I think this blame map idea is a hell of a lot more than eye candy, and would be a massively useful addition to Wikipedia, well worth investing time and effort, and possibly new hardware on. It would really help dealing with vandalism, hoaxing, unsourced statements and probably lots of other problems we haven't even realised yet. Hell, it would even make it easier to attribute unsigned posts to talk pages.
So performance is an issue. When is it not?
I love the fact that we have server admins telling us not to worry too much about the performance side of things :)
Steve
On 30/06/06, Steve Bennett stevage@gmail.com wrote:
I love the fact that we have server admins telling us not to worry too much about the performance side of things :)
I'm not a server admin.
Rob Church
On 6/30/06, Rob Church robchur@gmail.com wrote:
On 30/06/06, Steve Bennett stevage@gmail.com wrote:
I love the fact that we have server admins telling us not to worry too much about the performance side of things :)
I'm not a server admin.
So much for that theory then :)
Steve
Hi!
So much for that theory then :)
hehe, when cluster load goes up, lots of strange things start to happen (well, of course less than a year ago, when it all went haywire), and frustration goes up thousand-fold :)
so when someone says "hey look, they're having lots of resources", converted to server-admin speak it might sound like "cluster going down, going down" :-)
Domas
Steve Bennett wrote:
I love the fact that we have server admins telling us not to worry too much about the performance side of things :)
We *should* worry about the performance side of things, but not let that fear impede the addition of useful functionality. I believe this is what Rob was actually saying.
On 30/06/06, Ivan Krstic krstic@fas.harvard.edu wrote:
We *should* worry about the performance side of things, but not let that fear impede the addition of useful functionality. I believe this is what Rob was actually saying.
Précisément.
Rob Church
On Fri, Jun 30, 2006 at 10:31:32AM +0200, Steve Bennett wrote:
I think this blame map idea is a hell of a lot more than eye candy, and would be a massively useful addition to Wikipedia, well worth investing time and effort, and possibly new hardware on. It would really help dealing with vandalism, hoaxing, unsourced statements and probably lots of other problems we haven't even realised yet. Hell, it would even make it easier to attribute unsigned posts to talk pages.
Oh, and tooltip popups, for the attributions?
Cheers, -- jra
Wooo, you type faster than you think!
You call it eye candy, I call it an interesting and potentially damn useful feature idea. Running a web site that gets upwards of ten thousand odd hits per second, allowing any old web user to edit it, etc. etc. also requires resources.
Yes, this is what we constantly work on =)
How do we handle the hit rate? Caching. Squids, shared memory caching and opcode caching, not to mention encouraging the client side cache where possible. Our resources evolve to meet our needs. We can adapt to what our code base is doing.
When did you work on performance? How many of the various eye candies can be cached? Squids can cache content for anonymous users; this is what they're doing at the moment. Shared memory caching isn't that much of a panacea; our usual MySQL queries are nearly as fast as memcached hits - both in the single-millisecond range. It is usually used to offload processing, but you have to define a clear case for how and when to use the cache.
Say, for the simple two-revision diff we not only cache the results, but also have a great C++ module for doing the actual work. You should note that we have zillion-revision articles, so not only do you run out of RGB (colours to highlight contributors with), but you also run up quite a lot of expense.
might turn out to be a load of bollocks. You won't know until we've examined it in more detail. So performance is an issue. When is it not?
Many things we did to improve performance were based on profiling data. And of course, common sense.
We had DifferenceEngine taking nearly 10% of our cluster CPU use, with less than 0.5% of backend requests being diffs. Now we're down to 2% of cluster CPU use, with less than 0.5% of backend requests being diffs. Add in the regular page loading routines, because there's more to showing a diff than DifferenceEngine.

So sure, lots of work has been done, and it became quite efficient.
Now if we introduce a task that is far more complex than a two-rev diff:

a) how much use will it get (towards creating an encyclopedia)?
b) how efficient would it be?
Of course, if someone comes up with an efficient feature, it is great. But a complex, efficient feature requires complex, efficient programming, and we have just two guys who are actually doing such work (and capable of doing it). And sure, they're both loaded with running the site. If anyone else wants to join the ship, of course, they're absolutely welcome.
And people shouldn't have the misconception that we have lots of resources and can run anything. We're a donation-powered website that has to fix its infrastructure before autumn, when we'll have yet another surge of users.
Anyone has the right to download the dump and provide nice eye candy for the community, if they want to participate that way :) Just one important thing to remember - don't try to get popular ;-)
Domas
On 30/06/06, Domas Mituzas midom.lists@gmail.com wrote:
When did you work on performance? How many of the various eye candies can be cached?
I didn't. My point was that the people best suited to do so shouldn't be put off by the fact that there might have to be a bit of extra work. God forbid that a programmer should express an opinion on a matter that's still mostly confined to the development side at this time.
Many things we did to improve performance were based on profiling data. And of course, common sense.
You analyse the problem, you think of the best way of solving it and you do it. Yes. Why won't that method work here?
a) how much of use that will get (to create an encyclopedia)
A lot, as has been echoed here already.
Of course, if someone comes up with an efficient feature, it is great. But a complex, efficient feature requires complex, efficient programming, and we have just two guys who are actually doing such work (and capable of doing it).
I won't go into details on this, but I suggest you avoid the backhanded insults to the rest of us. Who are not paid. Who do this because we give a shit. Why, I don't know. I couldn't possibly tell you what compels me to continue working in this environment, writing little features and bug fixes.
If anyone else wants to join the ship, of course, they're absolutely welcome.
You frequently give off a vastly different impression.
And people shouldn't have the misconception that we have lots of resources and can run anything. We're a donation-powered website that has to fix its infrastructure before autumn, when we'll have yet another surge of users.
I'm aware of this. I'm aware that we're running an Alexa top-15 web site on a budget that probably equals the average coffee expenditure of some of the other organisations in that ballpark.
I had no such misconceptions. I know how bad it is. I know that a dodgy query can send things sideways; I've *watched* the cluster die badly.
Anyone has the right to download the dump and provide nice eye candy for the community, if they want to participate that way :) Just one important thing to remember - don't try to get popular ;-)
Just one important thing to remember. Don't sacrifice utility for performance. Otherwise, we'll just have to tell Jimbo Wales his grand little scheme is far too unfeasible to run.
Are we intending on having a productive dialogue here, Domas?
Rob Church
Hello,
I didn't. My point was that the people best suited to do so shouldn't be put off by the fact that there might have to be a bit of extra work. God forbid that a programmer should express an opinion on a matter that's still mostly confined to the development side at this time.
There were a few guys complaining about how much work they invested in one feature or another, and why it still isn't running on the site. There has to be an understanding of what is needed for code to actually be used. If that understanding is there, sure, it's great, let's continue the development.
You analyse the problem, you think of the best way of solving it and you do it. Yes. Why won't that method work here?
Exactly, this is why my initial mail on this issue was just a ramble about incremental diffs. I haven't received any opinions on them yet :) There can be various ideas on how to deal with it: diff just the last 3-5 revisions and make it a rolling diff.
Sure, diffs are quite critical in wikis - they add to collaboration, and they must be there. But the cost of a diff shouldn't be neglected either ;-)
I won't go into details on this, but I suggest you avoid the backhanded insults to the rest of us. Who are not paid. Who do this because we give a shit. Why, I don't know. I couldn't possibly tell you what compels me to continue working in this environment, writing little features and bug fixes.
It was not an insult. I know you have a problem and you think the whole world is against you. Well, maybe it is, but that is off-topic. I'm not paid either. I also work on little changes and bug fixes. And I wouldn't take the fact that we have two hard-working guys writing fascinating stuff as an insult. Because I'd like to be able to do that much too. Maybe I can't, so I do what I can.
You frequently give off a vastly different impression.
:-(
Just one important thing to remember. Don't sacrifice utility for performance. Otherwise, we'll just have to tell Jimbo Wales his grand little scheme is far too unfeasible to run.
We're not sacrificing. We have a few other features that have been impacting our performance for years, and we keep them there. Because they make sense :) And yet another issue: our performance is our utility. Make it slow, and people will get frustrated and won't enjoy the free encyclopedia.
Are we intending on having a productive dialogue here, Domas?
When weren't we? ;-)
Domas
On 30/06/06, Domas Mituzas midom.lists@gmail.com wrote:
It was not an insult. I know you have a problem and you think the whole world is against you. Well, maybe it is, but that is off-topic.
That statement is completely unacceptable and I would like an apology.
I'm not paid either. I also work on little changes and bug fixes. And I wouldn't take the fact that we have two hard-working guys writing fascinating stuff as an insult. Because I'd like to be able to do
It was the phrasing of it that implied Brion and Tim were our only competent coders, I think.
that much too. Maybe I can't, so I do what I can.
Same here.
We're not sacrificing. We have a few other features that have been impacting our performance for years, and we keep them there. Because they make sense :)
So could this. Although I will concede it wouldn't be critical.
And yet another issue: our performance is our utility. Make it slow, and people will get frustrated and won't enjoy the free encyclopedia.
That's a very valid and fair point.
When weren't we? ;-)
All right, let's scrap the pettiness and move on.
Rob Church
On Fri, Jun 30, 2006 at 10:36:45AM +0100, Rob Church wrote:
On 30/06/06, Domas Mituzas midom.lists@gmail.com wrote:
It was not an insult. I know you have a problem and you think the whole world is against you. Well, maybe it is, but that is off-topic.
That statement is completely unacceptable and I would like an apology.
<referee> Domas, I've been watching this list for the last year, and Rob is the *last* person about whom I, personally, would have formed that impression. I have to agree with him; you might want to think about why you think that, and why you said it. </referee>
Cheers, -- jra
Domas Mituzas wrote:
Exactly, this is why my initial mail on this issue was just a ramble about incremental diffs. I haven't received any opinions on them yet :) There can be various ideas on how to deal with it: diff just the last 3-5 revisions and make it a rolling diff.
Incidentally, I've recently been looking into this extensively, using a Wikipedia dataset. I have a bunch of hard numbers; I'll clean them up and post them here in the next few days.
On 6/30/06, Rob Church robchur@gmail.com wrote:
On 30/06/06, Domas Mituzas midom.lists@gmail.com wrote:
But still, it is eye candy that requires quite a lot of resources...
You call it eye candy, I call it an interesting and potentially damn useful feature idea. Running a web site that gets upwards of ten thousand odd hits per second, allowing any old web user to edit it, etc. etc. also requires resources.
I don't think of this as anything remotely like eye candy. In fact it will save resources compared with manually hunting down who is responsible for a particular change when it is not recent - something I already do. Of course, with a handy tool for this, people will use it several orders of magnitude more often than they currently do manually.
Andrew Dunbar (hippietrail)
How do we handle the hit rate? Caching. Squids, shared memory caching and opcode caching, not to mention encouraging the client side cache where possible. Our resources evolve to meet our needs. We can adapt to what our code base is doing.
Sure, unreasonable and needless performance drains are a bitch, and I'm all for eradicating them. But the emphasis is on unreasonable and needless.
I'm well aware of the issues that the innocuous phrase "caching diffs" brings up, hence the "...". But here we have an interesting proposal - it's newborn, it might not have the cleanest implementation, and it might turn out to be a load of bollocks. You won't know until we've examined it in more detail.
So performance is an issue. When is it not?
Rob Church
On Fri, Jun 30, 2006 at 09:23:22AM +0100, Rob Church wrote:
So performance is an issue. When is it not?
And, not to put too fine a point on it, while performance considerations need to be taken into account for WMF sites, not all MWs are WMF sites.
It would be interesting to have even a 3-sigma estimate of how many *hits*, total, go to WMF MediaWikiae vs MWs run by others...
Cheers, -- jra
Rob Church wrote:
It's worth noting that someone else has registered an intention to do something similar and provide it as a script on the toolserver. That sounds infeasible to me, however, since we all know that, right now (and likely, for some considerable time), Zedler does not have access to the text of pages.
Well, my idea was simpler. While this new system would be superb at giving attribution, and an example of GFDL compliance, in everyday use you won't need to know 'who added each piece', but rather 'who added that quote' on the Jimbo Wales user page (~3000 edits). Off-topic: interesting how Jimbo's IP was hidden [http://en.wikipedia.org/w/index.php?title=User:Jimbo_Wales&diff=7476787&...]
So you would ask 'when was this phrase [at revision y] introduced?', getting:
a) You must be wrong, the article doesn't have such a phrase
b) Three edits before, by Foo [diff]
I don't think it would be very expensive. I reckon most changes will have been introduced within the last 10 edits (4 edits or so on average). Of course you should then try to guess whether it was a blanking plus reversion, which complicates things a little more ;)
The worst case would be a page that has a lot of edits where the token has been there for a lot of them. The first time it will be slow, but on subsequent runs the cache will speed it up a lot, so you'll have to go looking for another big article.
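(A rough PHP sketch of the walk-back search described above. The function name and the $fetchText/$parentOf callbacks are invented for illustration, and a real version would also need the blanking/reversion handling just mentioned.)

<?php
// Hypothetical sketch: find which edit introduced $phrase, starting from
// revision $startRev and walking backwards through its parents.
// $fetchText($revId) is assumed to return the raw wikitext of a revision,
// and $parentOf($revId) the id of the previous revision (null at the first).
function findIntroducingRevision($phrase, $startRev, callable $fetchText, callable $parentOf)
{
    if (strpos($fetchText($startRev), $phrase) === false) {
        return null;           // case (a): the phrase isn't in revision y at all
    }
    $rev = $startRev;
    while (($parent = $parentOf($rev)) !== null) {
        if (strpos($fetchText($parent), $phrase) === false) {
            return $rev;       // case (b): $rev is the edit that added the phrase
        }
        $rev = $parent;        // phrase already present, keep walking back
    }
    return $rev;               // present since the very first revision
}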
Your problem is that the toolserver (zedler) doesn't have direct access to stored text due to external storage (which we were told will be solved... one of these years), and text queries need to hit the pmtpa server. Yeah, this can reduce performance, but:
- we only want raw text; no need to _stress_ Wikimedia servers with page rendering
- if a revision was queried before, it will be cached on the toolserver :-)
- <s>we</s> I expect to need few revisions per query (as stated above)
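(For illustration, a tiny sketch of that kind of cached raw-text fetch, which could serve as the $fetchText callback in the sketch above. The cache directory and helper name are made up; action=raw with oldid is a standard MediaWiki query, but error handling is omitted here.)

<?php
// Hypothetical toolserver-side helper: fetch the raw wikitext of one revision
// and keep a copy on local disk, so the Wikimedia servers are only hit once
// per revision. The cache directory and function name are invented.
function fetchRawRevision($revId, $cacheDir = '/tmp/revcache')
{
    $cacheFile = "$cacheDir/$revId.txt";
    if (is_file($cacheFile)) {
        return file_get_contents($cacheFile);    // fetched earlier, reuse it
    }
    if (!is_dir($cacheDir)) {
        mkdir($cacheDir, 0777, true);
    }
    // action=raw returns plain wikitext, so no page rendering on the far end.
    $url = 'http://en.wikipedia.org/w/index.php?action=raw&oldid=' . (int)$revId;
    $text = file_get_contents($url);
    file_put_contents($cacheFile, $text);
    return $text;
}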
We could watch under real conditions which problems arise, and whether it's necessary to reduce the frequency of queries or set a wait time. Until then, everything is speculation :-)
As a side note, there's currently a script on the toolserver querying every change on Wikipedia. Asking for a few revisions when a user calls the PHP script shouldn't matter in comparison ;-) External storage is a toolserver problem, so I don't know why you complain to me so much about it. Moreover, there are worse ways of implementing it - what about doing it in JavaScript? hahaha.
Platonides
On 6/30/06, Platonides Platonides@gmail.com wrote:
Well, my idea was simpler. While this new system would be superb at giving attribution, and an example of GFDL compliance, in everyday use you won't need to know 'who added each piece', but rather 'who added that quote' on the Jimbo Wales user page (~3000 edits). Off-topic: interesting how Jimbo's IP was hidden
[snip]
The proposed system would be completely inadequate for attribution purposes - too easy to confuse the computer even in obvious cases... plus not all cases are obvious; someone can still hold a significant copyright interest in the work even if every word they wrote has been replaced.
(Not that I don't think it would be a useful tool, but don't advertise it as something that it is not :) )
On 01/07/06, Platonides Platonides@gmail.com wrote:
As a side note, there's currently a script on the toolserver querying every change on Wikipedia. Asking for a few revisions when a user calls the PHP script shouldn't matter in comparison ;-)
Whose?
External storage is a toolserver problem, so I don't know why you complain to me so much about it. Moreover, there are worse ways of implementing it - what about doing it in JavaScript? hahaha.
I'm not complaining, so don't misrepresent me as doing so. I'm pointing out that a strong reliance upon text access is not too brilliant for a project running on the toolserver at present.
Rob Church
On 6/30/06, Platonides Platonides@gmail.com wrote: [snip]
As a side note, there's currently a script on the toolserver querying every change on Wikipedia. Asking for a few revisions when a user calls the PHP script shouldn't matter in comparison ;-) External storage is a toolserver problem, so I don't know why you complain to me so much about it. Moreover, there are worse ways of implementing it - what about doing it in JavaScript? hahaha.
There is a huge difference between reading every article change (less than 1 per second on a one month average) and reading all of the 9000ish revisions to George W. Bush every time someone wants to view the blame map.
Hi,
On Saturday 01 July 2006 18:00, Gregory Maxwell wrote:
On 6/30/06, Platonides Platonides@gmail.com wrote: [snip]
As a side note, there's currently a script on the toolserver querying every change on Wikipedia. Asking for a few revisions when a user calls the PHP script shouldn't matter in comparison ;-) External storage is a toolserver problem, so I don't know why you complain to me so much about it. Moreover, there are worse ways of implementing it - what about doing it in JavaScript? hahaha.
There is a huge difference between reading every article change (less than 1 per second on a one month average) and reading all of the 9000ish revisions to George W. Bush every time someone wants to view the blame map.
Cache the blame map. In addition, cache it for each revision. Limit the cache to "N maps or M megabytes, whichever is reached first".
I think it should be possible to generate the blame-map for revision N+1 from the map of revision N and the diff between the revisions.
Best wishes,
Tels
On 7/1/06, Tels nospam-abuse@bloodgate.com wrote:
Cache the blame map. In addition, cache it for each revision. Limit the cache to "N maps or M megabytes, whichever is reached first".
I think it should be possible to generate the blame-map for revision N+1 from the map of revision N and the diff between the revisions.
Locality is poor; any time you talk about caching revisions you're fighting a losing battle.
We'd really need incremental production of blame maps... where you can take a finished blame map for revisions 1..5, add revisions 6 and 7, and get the 1..7 blame map. Then blame maps could simply be generated and stored... and when they are requested, it would only require fetching the map and updating it.
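(A minimal sketch of that incremental update, assuming the blame map is stored as a flat list of word/author pairs and that a word-level diff of revisions N and N+1 is already available as keep/delete/insert operations. Both representations are made up for this sketch, not how MediaWiki stores anything today.)

<?php
// Hypothetical incremental update: given the blame map covering revisions
// 1..N as a flat list of ['word' => ..., 'author' => ...] entries, and a
// word-level diff from revision N to N+1 expressed as keep/delete/insert
// operations, produce the blame map covering revisions 1..N+1.
function applyDiffToBlameMap(array $blameMap, array $diffOps, $newAuthor)
{
    $result = [];
    $pos = 0;                               // cursor into the old blame map
    foreach ($diffOps as $op) {
        switch ($op[0]) {
            case 'keep':                    // unchanged words keep their author
                for ($i = 0; $i < $op[1]; $i++) {
                    $result[] = $blameMap[$pos++];
                }
                break;
            case 'delete':                  // removed words drop out of the map
                $pos += $op[1];
                break;
            case 'insert':                  // new words are credited to the new editor
                foreach ($op[1] as $word) {
                    $result[] = ['word' => $word, 'author' => $newAuthor];
                }
                break;
        }
    }
    return $result;                         // store this; the next edit starts from here
}

// Example diff of the latest edit:
// [['keep', 3], ['delete', 1], ['insert', ['new', 'sentence']]]

The appeal is exactly what's described above: each new revision only costs one diff against its predecessor, so nobody ever has to replay the 9000-revision history of George W. Bush.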