Hi,
If the summary field text is in a multi-byte encoding and we take the first 150 characters of the comment, a strange character sometimes appears in the history.
Example: http://commons.wikimedia.org/w/index.php?title=Image:Venera-7_diagram.jpg&am... ============== (==Описание/Description== *ru:Межпланетная автоматическая станция «Венера-7»: 1 — панели солнечных батарей; 2 — датчик астроориентации; 3 — защитная �) ==============
I think it's the first byte of a truncated two-byte character. So we should use "mb_substr" instead of "substr", shouldn't we?
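Here is a minimal sketch of the difference (assuming PHP with the mbstring extension loaded):

<?php
// substr() counts bytes; each Cyrillic letter is two bytes in UTF-8,
// so a byte-based cut can land in the middle of a character.
$comment = 'Описание'; // "Description": 8 characters, 16 bytes

echo substr( $comment, 0, 5 );             // "Оп" plus a stray lead byte ("�")
echo mb_substr( $comment, 0, 5, 'UTF-8' ); // "Описа": five whole characters
?>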
-- All the best, meta:ajvol
On 6/9/05, Александр Сигачёв alexander.sigachov@gmail.com wrote:
> Hi,
> If the summary field text is in a multi-byte encoding and we take the first 150 characters of the comment, a strange character sometimes appears in the history.
> Example: http://commons.wikimedia.org/w/index.php?title=Image:Venera-7_diagram.jpg&am...
> (==Описание/Description== *ru:Межпланетная автоматическая станция «Венера-7»: 1 — панели солнечных батарей; 2 — датчик астроориентации; 3 — защитная �)
> I think it's the first byte of a truncated two-byte character. So we should use "mb_substr" instead of "substr", shouldn't we?
I haven't actually looked at the code, but it should be using the truncate() function from the language class. However, the Language.php version of that function is not Unicode-aware, so things like this will keep happening until bug 2069 is solved (http://bugzilla.wikimedia.org/show_bug.cgi?id=2069).
Ævar Arnfjörð Bjarmason wrote:
> I haven't actually looked at the code, but it should be using the truncate() function from the language class. However, the Language.php version of that function is not Unicode-aware, so things like this will keep happening until bug 2069 is solved (http://bugzilla.wikimedia.org/show_bug.cgi?id=2069).
I don't understand this claim. The LanguageUtf8 truncate *is* already UTF-8 aware; 2069 is a code layout issue only and does not affect functionality.
If there's a bug here, it's from failing to call the function in the first place and letting the database crop the field.
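A minimal sketch of the intended call, assuming the Language::truncate( $string, $length, $ellipsis ) signature and a $wgLang already set up for the wiki:

<?php
// Truncate in PHP, where the language class can respect character
// boundaries, instead of letting MySQL crop the column mid-character.
global $wgLang;
$comment = $wgLang->truncate( $comment, 150, '' );
?>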
-- brion vibber (brion @ pobox.com)
Brion Vibber wrote:
> Ævar Arnfjörð Bjarmason wrote:
> > I haven't actually looked at the code, but it should be using the truncate() function from the language class. However, the Language.php version of that function is not Unicode-aware, so things like this will keep happening until bug 2069 is solved (http://bugzilla.wikimedia.org/show_bug.cgi?id=2069).
> I don't understand this claim. The LanguageUtf8 truncate *is* already UTF-8 aware; 2069 is a code layout issue only and does not affect functionality.
> If there's a bug here, it's from failing to call the function in the first place and letting the database crop the field.
Right, to put it another way: LanguageUtf8 is the base class for every language class except LanguageLatin1. $wgLang->truncate() will always use the correct encoding for the wiki; it's only if you call it as Language::truncate() that you'll run into trouble.
Note that you can use mb_substr() in MediaWiki if you like; I implemented a simulation of it for systems without mbstring, using the /./u trick.
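Roughly, the trick looks like this (a sketch of the idea, not the exact code; the function name is made up):

<?php
// With the /u modifier, "." matches one whole UTF-8 character rather
// than one byte, so PCRE can take the first $count characters safely.
function utf8_substr_sketch( $str, $count ) {
    preg_match( '/^(.{0,' . intval( $count ) . '})/us', $str, $m );
    return isset( $m[1] ) ? $m[1] : '';
}

echo utf8_substr_sketch( 'Описание', 5 ); // "Описа"
?>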
-- Tim Starling
On 6/9/05, Brion Vibber brion@pobox.com wrote:
> Ævar Arnfjörð Bjarmason wrote:
> > I haven't actually looked at the code, but it should be using the truncate() function from the language class. However, the Language.php version of that function is not Unicode-aware, so things like this will keep happening until bug 2069 is solved (http://bugzilla.wikimedia.org/show_bug.cgi?id=2069).
> I don't understand this claim. The LanguageUtf8 truncate *is* already UTF-8 aware; 2069 is a code layout issue only and does not affect functionality.
> If there's a bug here, it's from failing to call the function in the first place and letting the database crop the field.
I was assuming it was a Latin-1 wiki (I didn't actually notice it was Commons; they all look alike).