I am trying to create wiki pages programmatically. I have been partially successful, but I cannot seem to get past problems with accented characters not appearing correctly. Below I have a self-contained example. It creates a page called "Page Test 1" fine, except that instead of reading "Fédération" (with two accented "e"s), the page text contains complete junk for that part. In Internet Explorer it shows as a Chinese character (!), and in Firefox I get two nasty blobs with question marks in them.
What can I do to ensure the code-page translation works correctly? There is a whole bunch of stuff for dealing with special characters, but which should I use, and how should I be pre-processing 'user' input?
Thanks!
Hugh Prior
<?
require_once("../includes/Article.php");
require_once("../includes/Title.php");
require_once("../includes/EditPage.php");
require_once("../includes/GlobalFunctions.php");

/**
 * Test page creation
 */
function pageCreate() {
    global $wgLoadBalancer;
    global $wgUser;

    // Create the page text
    $pageText = "Fédération";
    $wikiPageName = "Page Test 1";

    // Code adapted from "maintenance/InitialiseMessages.inc"
    $dbw =& wfGetDB( DB_MASTER );

    $title = new Title();
    $title = $title->newFromText( $wikiPageName );

    $article = new Article( $title );
    $newid = $article->insertOn( $dbw, 'sysop' );

    $revision = new Revision( array(
        'page'      => $newid,
        'text'      => $pageText,
        'user'      => 0,
        'user_text' => "My user text",
        'comment'   => '',
    ) );
    $revid = $revision->insertOn( $dbw );
    $article->updateRevisionOn( $dbw, $revision );

    $dbw->commit();
}

// Call the page creation
pageCreate();
?>
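[For reference, a minimal sketch of the kind of "pre-processing" being asked about here. It assumes the mbstring extension is available and that any non-UTF-8 input would be ISO-8859-1; the function name is invented for illustration, and it is no substitute for saving the source file itself as UTF-8, as the replies below explain.]

<?
// Sketch only: make sure a string is UTF-8 before handing it to MediaWiki.
// Assumes mbstring is loaded; the helper name is invented.
function preprocessTextForUTF8( $text, $fallback = 'ISO-8859-1' ) {
    if ( mb_check_encoding( $text, 'UTF-8' ) ) {
        // Already valid UTF-8: leave it alone.
        return $text;
    }
    // Otherwise assume the fallback single-byte encoding and convert.
    return mb_convert_encoding( $text, 'UTF-8', $fallback );
}

$pageText = preprocessTextForUTF8( "Fédération" );
?>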
Thank you Brion for your answer. But I am not much the wiser. I know of course that I need to do some special character treatment.
How? Is there some sort of "preprocessTextForUTF8()" function which I need to call? Is there some sort of "$revision->saveAsUTF8()" function which I need to call?
Thanks.
Hugh Prior wrote:
Thank you Brion for your answer. But I am not much the wiser. I know of course that I need to do some special character treatment.
How? Is there some sort of "preprocessTextForUTF8()" function which I need to call? Is there some sort of "$revision->saveAsUTF8()" function which I need to call?
Your text editor will have some sort of encoding setting. Use it.
-- brion vibber (brion @ pobox.com)
"Brion Vibber" brion@pobox.com wrote:
Your text editor will have some sort of encoding setting. Use it.
Thanks for trying Brion.
However, in view of the actual problem, what you suggest is, sorry to say, complete nonsense. The whole idea is that the page is created via a PROGRAM and not via the browser, so browser settings are totally irrelevant. Sure, the page needs to display correctly in a browser afterwards, but it should not be up to the end user to fudge the browser into some bizarre setting just because a letter "e" has a simple accent.
If you look at the sample code you will see the sample text which causes a problem:
$pageText = "Fédération";
It is not complex text. It is not as if I am trying to input Chinese via a program into a wiki.
If you think that the code, being PHP, still has to be run by a browser, ask yourself how the code shown in the sample could run and generate correct output when the PHP program is run from the command line.
To reiterate: how can I get the simple program from my first message (shown above) to create wiki pages with the accents correct?
I think it isn't nonsense actually.
MediaWiki is UTF-8. UTF-8 has no problem with plain ASCII as long as you stay within the common English subset.
Latin-1 characters are not transparent to this. If you edit your PHP in an editor that saves plain 8-bit "ASCII"/Latin-1, it won't be proper UTF-8.
To be on the safe side, since I don't know which platform you're running, my humble advice is to try jEdit (Java) and change the buffer encoding setting to UTF-8.
(Just load the file, change the buffer encoding to UTF-8, and save again. You can save to a different file and see that the two are not binary-identical.) Utilities >> Buffer Options >> Character Encoding
Hope that helps.
Hugh Prior wrote:
"Brion Vibber" brion@pobox.com wrote:
Your text editor will have some sort of encoding setting. Use it.
Thanks for trying Brion.
However, in view of the actual problem, what you suggest is, sorry to say, complete nonsense.
That's only, sorry to say, because you have no idea what you're talking about.
$pageText = "Fédération";
It is not complex text.
It is not as if I am trying to input Chinese via a program into a wiki.
Actually, it's exactly like that. Your string contains two non-ASCII characters, which will need to be properly encoded or you'll get some data corruption. Specifically, they must be UTF-8 encoded.
There's *no* qualitative difference between "é" and something like "本"; both are non-ASCII characters and therefore must be properly encoded in the UTF-8 source file.
The symptoms you described are *exactly* the symptoms of a miscoded 8-bit ISO 8859-1 (or Windows "ANSI" or whatever they call it) character in what should be a UTF-8 text stream.
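[A small sketch that makes the symptom visible, assuming PHP is run from the command line; the hex output depends entirely on the encoding the source file was saved in.]

<?
// If this file is saved as ISO-8859-1, the literal below is the single byte e9;
// if it is saved as UTF-8, it is the two bytes c3 a9. MediaWiki expects UTF-8,
// so a lone e9 byte in the page text is exactly the kind of invalid sequence
// that browsers render as blobs, question marks or wrong glyphs.
echo bin2hex( "é" ), "\n";
?>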
If you think that the code, being PHP, still has to be run by a browser,
I'm talking about the text editor you used to save the PHP source file containing literal strings. There's no "browser" involved in your problem.
-- brion vibber (brion @ pobox.com)
A couple of mistakes there.
There is a difference between 'é' and '木' for many editors, including non-windows editors that default to ASCII. The character 'é' is indeed ASCII, it's just not 7-bit ASCII but "extended ASCII" or "8-bit ASCII" . Many popular editors default to 8-bit ASCII, and others default to 8859-1 , also known as "latin 1" ; some use "Windows encoding" which is not exactly the same thing, but it's close. There is also "Mac encoding" which is also close but again it's different. Those just to mention the "popular" ones.
ASCII values from 128 on, those used with the first bit set, are the problematic ones. UTF8 reserves those to indicate more bytes are needed for displaying a char. UTF8 is variable-length while all the others I've mentioned here are "1-byte-1-char" so to speak.
Brion is right about 'é' being not "UTF8 friendly" - meaning all lower ASCII ( 0-127 or, in hexadecimal, 0-7F) are encoded the same for all popular 8-bit representations and also UTF8.
In other words, Hugh will have to check the encoding of the file, and Brion is right about this not being a browser problem whatsoever.
Hope that helped, Hugh. Also read my email from yesterday where I tried to give you a solution instead of scolding you ;)
UTF8 is, by the way, not the best encoding for Asian text. UTF8 is meant to display English text effectively (1 byte) while still being able to map all of Unicode. This is nice, but since all Japanese and Chinese characters (at least all I tried; I'd have to check the tables to make sure) take 3 BYTES OR MORE (sorry for shouting), that alone is reason enough to use another wiki like the popular Japanese PukiWiki (using EUC-Japanese), or others typically using SJIS or EUC-Japanese, EUC-Chinese, Big5, etc. It would be very nice to have an UTF16 version, which would only take 2-bytes for each character most of the time, 33%+- better space-wise. I'm aware it's bad to have one more thing to care about (different encodings), so I really understand this is not being done. For me UTF-8 is more or less okay, since my wiki will be mixed Latin-1 + Asian text.
For those who made it to the end of this message, thanks for your patience :-) now back to my busy-ass life as a game developer... I'm late for my commute.
muyuubyou wrote:
A couple of mistakes there.
There is a difference between 'é' and '木' for many editors, including non-windows editors that default to ASCII. The character 'é' is indeed ASCII, it's just not 7-bit ASCII but "extended ASCII" or "8-bit ASCII" .
False. ASCII is 7-bits only. Anything that's 8 bits is *not* ASCII, but some other encoding.
Many/most 8-bit character encodings other than EBCDIC are *supersets* of ASCII, which incorporate the 7-bit ASCII character set in the lower 128 code points and various other characters in the high 128 code points.
Many people erroneously call any mapping from a number to a character that can fit in 8 bits an "ASCII code", however this is incorrect.
Many popular editors default to 8-bit ASCII,
There's no such thing.
and others default to 8859-1, also known as "latin 1" ;
That part is reasonably true for Windows operating systems and some older Unix/Linux systems in North America and western Europe.
Mac OS X and most modern Linux systems default to UTF-8.
ASCII values from 128 on,
No such thing; there are no ASCII values from 128 on. However many 8-bit character encodings which are supersets of ASCII contain *non*-ASCII characters in the 128-256 range. Since these represent wildly different characters for each such encoding (perhaps an accented Latin letter, perhaps a Greek letter, perhaps a Thai letter, perhaps an Arabic letter...) it's unwise to simply assume that it will have any meaning in a program that doesn't know about your favorite encoding selection.
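[A quick sketch of that point, assuming the iconv extension is available: the same byte 0xE9 turns into completely different characters depending on which 8-bit encoding you decode it as.]

<?
$byte = "\xE9";
echo bin2hex( iconv( 'ISO-8859-1', 'UTF-8', $byte ) ), "\n"; // c3a9 -> "é"
echo bin2hex( iconv( 'ISO-8859-7', 'UTF-8', $byte ) ), "\n"; // ceb9 -> Greek iota "ι"
echo bin2hex( iconv( 'KOI8-R', 'UTF-8', $byte ) ), "\n";     // d098 -> Cyrillic "И"
?>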
UTF8 is, by the way, not the best encoding for Asian text.
That depends on what you mean by "best". If by "best" you mean only "as compact as possible for the particular data I want to use at the moment" then yes, there are other encodings which are more compact.
If, however, compatibility is an issue, UTF-8 is extremely functional and works very well with UNIX/C-style string handling, pathnames, and byte-oriented communications protocols at a minor 50% increase in uncompressed size for such languages.
If space were an issue, though, you'd be using data compression.
UTF8 is meant to display English text effectively (1 byte)
False; UTF-8 is meant to be compatible with 7-bit ASCII and the treatment of specially meaningful bytes such as 0 and the '/' path separator in string handling in Unix-like environments. (It was created for Bell Labs' Plan 9 operating system, an experimental successor to Unix.)
That it happens to also be compact for English is nice, too.
It would be very nice to have an UTF16 version, which would only take 2-bytes for each character most of the time, 33%+- better space-wise.
Much of the time, the raw amount of space taken up by text files is fairly insignificant. Text is small compared to image and multimedia data, and it compresses very well.
Modern memory and hard disk prices strongly favor accessibility and compatibility in most cases over squeezing a few percentage points out of uncompressed text size.
-- brion vibber (brion @ pobox.com)
"Extended ASCII" is "accepted" and thus exists, regardless it came from the ASCII board or not. The fact that bytes are 8-bits and almost everything in computers is in bytes or multiples thereof has created this nightmare of 8-bit encodings we're still suffering today. IBM's first extension is what many people call "extended ASCII" we like it or not, and that is what I was talking about. Namely the DOS representation of the higher 128 codes. It came with "IBM PC".
There is no such thing blahblah can be true if you ignore the gazillion lines of legacy code thinking otherwise.
I agree with you it's unwise to assume programs will map your non-ASCII right, but since many do it's a common thing. 99% (in ammount) of things in 1 byte are latin-1. The other "important" languages are impossible to represent in 1 byte anyway except for arabic and hebrew, but those are usually isolated from our "computer isle" in the west. For instance, 90% of the "interweb" that isn't Chinese or Japanese, it belongs to a latin-1 covered language.
False; UTF-8 is meant to be compatible with 7-bit ASCII and the treatment of specially meaningful bytes such as 0 and the '/' path separator in string handling in Unix-like environments. (It was created for Bell Labs' Plan 9 operating system, an experimental successor to Unix.)
That it happens to also be compact for English is nice, too.
In the real world that "happens to be compact for English" is crucial. Unix was developed "in English" mainly, and therefore encoding of the English language plus some extra codes was all that fell into consideration. They simply didn't need to comment code in Japanese. What you say is factually true, but I was just pointing out the most important reason in regards to the topic at hand. By saying "UTF8 is meant to display English correctly" I didn't imply it isn't meant to do anything else or that was the basis of it. I could have said "UTF8 is meant to encode English correctly and effectively, among other things" but I just didn't want to shift the focus.
and others default to 8859-1, also known as "latin 1" ;
That part is reasonably true for Windows operating systems and some older Unix/Linux systems in North America and western Europe.
Mac OS X and most modern Linux systems default to UTF-8.
Yeah, but his editor of choice most probably isn't, or there wouldn't have been a problem in the first place. Let's keep the focus.
It's a very common scenario, for minor changes, that people connect via telnet or SSH and quickly edit something directly on their test server, instead of editing locally and then uploading via FTP (or using an FTP-capable editor like gvim with the FTP plugin, for instance). It's also very common that consoles are set to ISO-8859-1, and thus vi, pico or nano will use that. It can also happen that it's a shared environment and the user just can't install stuff... also, many telnet/SSH clients are not UTF-8 compatible, or he may have some sort of configuration problem I can't even imagine now. Shit happens.
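[If a file has already been saved as Latin-1 that way, a minimal command-line sketch for converting it, assuming iconv is available and that the input really is ISO-8859-1 (a file that is already UTF-8 would get double-encoded, so check first).]

<?
// Usage: php convert-to-utf8.php input.php output.php  (sketch only)
list( , $in, $out ) = $argv;
$latin1 = file_get_contents( $in );
$utf8 = iconv( 'ISO-8859-1', 'UTF-8', $latin1 );
file_put_contents( $out, $utf8 );
?>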
UTF8 is, by the way, not the best encoding for Asian text.
That depends on what you mean by "best". If by "best" you mean only "as compact as possible for the particular data I want to use at the moment" then yes, there are other encodings which are more compact.
If, however, compatibility is an issue, UTF-8 is extremely functional and works very well with UNIX/C-style string handling, pathnames, and byte-oriented communications protocols at a minor 50% increase in uncompressed size for such languages.
If space were an issue, though, you'd be using data compression.
Compatibility is always an issue, I'm afraid, and for this project UTF-8 is IMO the best choice if we have to stick to just one encoding. For Wikipedia this is undoubtedly true. Other wikis, I'm sure, would use a different thing. But it still "Just Works", so I'm not complaining. It also makes things nice for the developers, because many IDEs and editors support UTF-8 out of the box.
But, of course, space is always an issue. Using data compression has an impact on processor performance. Having a better encoding for your text is "compression without a processing penalty", to put it in layman's terms, and having to retrieve more data slows down your wiki for several reasons: more data to retrieve from the database, and more bandwidth needed / longer transmission time. For instance, for the average Japanese wiki it would save 30% space on the server, 15-20% in bandwidth even with mod-gz, and give 30% better memory usage in database caching (caching is good for MediaWiki, as you know better than me for sure) - equivalent to having 30%+ more memory for caching. Those are rough figures. I'm not asking you to change this, as it would involve a lot of time I'm sure you can put to better use, just to keep it in consideration if at some point you had time to support more than one encoding for MediaWiki. Many wikis hardly use any images at all, and when they do, they keep them somewhere outside the database (I haven't looked at this in MediaWiki - are you storing them in BLOBs?).
So, "UTF8 is not the best for Asian text" as in: "by using exclusively UTF-8, you're bogging your performance down 20%+ for many people". And extra tweaks are not realistic for the joe-wiki-admin, who most probably won't have caching at all.
This is not a critique. For me the wiki works well, it's fast enough, and UTF-8 happens to suit me fine. This direction just keeps MediaWiki from being more popular in Asia. Stability and functionality rank above performance on my list.
muyuubyou wrote: [snipped excuses for making false statements claimed as corrections for "mistakes" which were true statements]
The only relevant thing in this discussion is that you always have to save your text files in the proper encoding (which for MediaWiki is always UTF-8, the standard Unicode encoding for Unix and text-based communication protocols).
-- brion vibber (brion @ pobox.com)
[ 'é' and Japanese 'ki' not being in the same league for the vast majority of editors in the world - that wasn't a false statement.
I can agree to take back my claim of 'é' being ASCII ("extended" or not), because strictly speaking it isn't. ]
Hugh, can you reply and let me know if my suggestion worked?
Mr Vibber, pissed or not, it would be wise to reply to questions from users with more diplomacy, regardless of the tone used in the question in the first place. Granted, you were told you were "speaking nonsense" when you were right, but instead of "you have no idea what you're talking about; é and Japanese code are both the same for UTF-8" you could have said "actually, both of them have to be encoded properly in UTF-8" and nothing would have happened. Please don't take this personally.
Is this the right list for suggestions? If so, please take my previous comment about UTF-8 and UTF-16 as a suggestion - please don't snip it out. Just by having UTF-8 AND UTF-16 things would improve. Sure, it's a lot of work, but it's just something to consider for the future.
Different issue (sorry to mix stuff, but the list is busy enough already)
My issue with Firefox is happening in my installation but not in wikipedia. Not sure what it is, but I'll try to find out when I have more time.
The following only occurs with Chinese and Japanese text in page titles:
Basically, when I pass the script an existing page, it opens with no problem in all browsers; but when I pass the script a nonexistent one, it mangles the title - only in Firefox (I don't have other Mozilla browsers installed at home at the moment; I must check it out with Mozilla, SeaMonkey, Netscape...). Opera works just fine. IE and IE-based ones too. It's probably some strange behavior from the browser... but then again it doesn't happen with Wikipedia. Just in case someone has any pointers.
If it's something easy please don't scold me, I'm just a user who hasn't looked too much into the code *hides away*
Thanks muyuubyou and Brion for all your information.
I now understand clearly that I need to make sure my editor is working in UTF-8, and that the accented e (é) is going to cause problems because it is not one of the 128 characters (codes 0-127) that make up the original ASCII.
At present I haven't yet found an editor-type solution but I have made a note that jEdit is one option.
When I get the problem fixed I'll let you know.
Thanks again for all your help. :-)
Hugh Prior
I use Dreamweaver MX as my editor, and I changed the "Document Encoding" to be "UTF-8 (Unicode)" instead of the default "Western (Latin 1)", and now this works fine (at least as a hard-coded example anyway).
i.e. in Dreamweaver: Modify->Properties->Page Properties->Document Encoding
I have also changed the default encoding for new documents to UTF-8 within Dreamweaver: Edit->Preferences->New Document->Default Encoding
Hope that helps somebody out there.
Hugh Prior
muyuubyou wrote:
Mr Vibber, pissed or not, it would be wise to reply to questions from users with more diplomacy, regardless of the tone used in the question in the first place. Granted, you were told you were "speaking nonsense" when you were right, but instead of "you have no idea what you're talking about; é and Japanese code are both the same for UTF-8" you could have said "actually, both of them have to be encoded properly in UTF-8" and nothing would have happened. Please don't take this personally.
I'm sorry if I was a bit snappy.
Is this the right list for suggestions? If so, please take my previous comment about UTF-8 and UTF-16 as a suggestion - please don't snip it out. Just by having UTF-8 AND UTF-16 things would improve. Sure, it's a lot of work, but it's just something to consider for the future.
If you're using MySQL 4.1 or 5.0 and MediaWiki's experimental MySQL 5 mode, and you know for sure that you aren't going to use compressed text storage, you might be able to get away with changing text.old_text to a TEXT field type and assigning it the ucs2 charset. This will store its data as UCS-2 instead of UTF-8.
You can do the same for any of the various name, comment, etc fields.
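[For the record, a sketch of the kind of schema change being described, under the caveats above (MySQL 4.1/5.0, MediaWiki's experimental MySQL 5 mode, no compressed text storage). MEDIUMTEXT is an assumption made to keep old_text's original capacity; check maintenance/tables.sql for your version, and back up first.]

<?
$dbw =& wfGetDB( DB_MASTER );
$textTable = $dbw->tableName( 'text' );
// Store the bulk revision text as UCS-2 instead of UTF-8 (sketch only).
$dbw->query( "ALTER TABLE $textTable MODIFY old_text MEDIUMTEXT CHARACTER SET ucs2 NOT NULL" );
?>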
Unfortunately MySQL doesn't support UTF-16 at this time, and its UTF-8 storage is also limited so that characters outside the basic multilingual plane (the classic 16-bit range) can't be stored at all. Attempting to insert these characters will cause the field to become truncated (in UTF-8) or just corrupt the character (in UCS-2).
If MySQL supported it, my preference would be to use UTF-16 with 16-bit collation for the non-bulk-text fields; that is, allow clean translation to/from compliant UTF-8 but keep the indexes at 2 bytes per code point. This would keep the size of the indexes down compared to their UTF-8 support (which currently needs 3 bytes per character and would need 4 if they made it actually support full UTF-8).
Index size directly relates to key caching and index scanning performance, so on a large-scale setup that can be relevant. (Bulk text storage is much less significant in this respect; individual records are picked out cheaply based on an integer index lookup.)
Alternatively, you could potentially whip up some kind of text storage handler for MediaWiki that would convert the internal UTF-8 data into UTF-16 for storage in the blob. I doubt this would be significantly pleasant though. :)
Using UTF-16 internally or for output isn't really possible.
My issue with Firefox is happening in my installation but not in wikipedia. Not sure what it is, but I'll try to find out when I have more time.
The following only occurs with Chinese and Japanese text in page titles:
Basically, when I pass the script an existing page, it opens with no problem in all browsers; but when I pass the script a nonexistent one, it mangles the title - only in Firefox (I don't have other Mozilla browsers installed at home at the moment; I must check it out with Mozilla, SeaMonkey, Netscape...). Opera works just fine. IE and IE-based ones too. It's probably some strange behavior from the browser... but then again it doesn't happen with Wikipedia. Just in case someone has any pointers.
Do you have an example? How is it mangled, exactly?
In what way are you passing the data?
* Typing on the URL bar
* From an <a href> link on a web page
* From a <form> on a web page
If in a URL or link, is the title:
* percent-encoded UTF-8 bytes, per RFC 3987
* percent-encoded bytes in some other encoding, such as EUC-JP or Shift-JIS
* raw typed text
Current versions of IE are, I think, set to send unencoded characters in URLs as percent-encoded UTF-8. Mozilla for some reason has left this option off, so sometimes it'll send unencoded characters in <a href> links in the source page's character set. I'm not sure what it'll do in the URL bar (locale encoding?) but it seems to be happy to send UTF-8 from the URL bar on my Windows XP box if I paste in some random Chinese text.
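[A tiny sketch of what correctly percent-encoded UTF-8 looks like, assuming the PHP source file itself is saved as UTF-8.]

<?
// rawurlencode() percent-encodes whatever bytes it is given, so with a UTF-8
// source file this produces the UTF-8 percent-encoding MediaWiki expects.
echo rawurlencode( "中文" ), "\n";              // %E4%B8%AD%E6%96%87
echo urldecode( "%E4%B8%AD%E6%96%87" ), "\n";   // the raw UTF-8 bytes for 中文
?>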
LanguageJa and LanguageZh don't set fallback encoding checks, so non-Unicode encodings of Japanese and Chinese won't be detected or automatically converted. (There are multiple character sets in use for these, making it extra difficult compared with most European languages.)
-- brion vibber (brion @ pobox.com)
That was a very interesting update about Unicode support in MySQL. Thanks!
Well, about my little issue with Firefox: I just type in the URL bar, and no cookie. Following links works (for instance, link to pages created with Opera, work also under Firefox).
For instance:
http://someIPhere/wiki/index.php/%E4%B8%AD%E6%96%87 I press enter, then the URL bar turns to: http://someIPhere/wiki/index.php/%C3%96%C3%90%C3%8E%C3%84
, which is page "ÖÐÎÄ"
Ö is UTF-16 => 00D6, UTF-8 => C3 96
Those are 8 bytes there in the URL, 4 for each character I typed... that shouldn't be.
If I Google 中文, Google returns me this page: http://www.google.com/search?q=%E4%B8%AD%E6%96%87 which looks more UTF8 to me. And works, too.
Been browsing the Unicode chart and the first character is 4e2d: http://unicode.org/cgi-bin/GetUnihanData.pl?codepoint=4e2d UTF8 => E4 B8 AD
The second one is 6587 (God, unicode.org is hell to browse this stuff) which in turn is UTF8 => E6 96 87
I wonder what the browser is doing, because 中 has nothing to do with anything starting with C3 96 in any encoding.
Hope that helps somehow.
muyuubyou wrote:
Well, about my little issue with Firefox: I just type in the URL bar, and no cookie. Following links works (for instance, link to pages created with Opera, work also under Firefox).
For instance: http://someIPhere/wiki/index.php/%E4%B8%AD%E6%96%87
In GB 18030, that's: d6 d0 ce c4
I press enter, then the URL bar turns to: http://someIPhere/wiki/index.php/%C3%96%C3%90%C3%8E%C3%84
, which is page "ÖÐÎÄ"
In ISO 8859-1, that's: d6 d0 ce c4
By any chance is your desktop set to a Chinese locale? It sounds like Firefox is taking the non-ASCII chars in the URL you type and encoding them as GB 18030. MediaWiki sees these unexpected non-UTF-8 characters and tries to convert them from a fallback, which is the default of ISO 8859-1 or Windows-1252.
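[A minimal sketch of that chain in PHP, assuming the iconv extension knows GB18030 and that this source file is saved as UTF-8; it reproduces exactly the URL reported above.]

<?
$typed = "中文";                                 // what the user typed: e4 b8 ad e6 96 87 in UTF-8
$gb = iconv( 'UTF-8', 'GB18030', $typed );       // what a GB 18030 locale sends: d6 d0 ce c4
$misread = iconv( 'ISO-8859-1', 'UTF-8', $gb );  // MediaWiki's Latin-1 fallback turns it into "ÖÐÎÄ"
echo rawurlencode( $misread ), "\n";             // %C3%96%C3%90%C3%8E%C3%84
?>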
Try setting the network.standard-url.encode-utf8 hidden preference in Firefox to true, see if that fixes it. (Who knows why they haven't turned this on yet, it's been standard on even MSIE for years...)
-- brion vibber (brion @ pobox.com)
My locale is indeed set to Chinese... because most of my users will be Chinese. Changing the configuration in Firefox didn't work.
Great find BTW :-)
... but I guess my little problem is there to stay, since I can't change my users' locale. Wish they fixed that in Firefox, it's almost 10% of my visits.
To top it off, it doesn't work with locales set to latin-1 either.
Thanks a lot, Brion. I should probably mail somebody at Mozilla :D I'm stubborn.
muyuubyou wrote:
My locale is indeed set to Chinese... because most of my users will be Chinese. Changing the configuration in Firefox didn't work.
Great find BTW :-)
... but I guess my little problem is there to stay, since I can't change my users' locale. Wish they fixed that in Firefox, it's almost 10% of my visits.
If it's reasonably consistent, and you have working iconv or mbstring on PHP, you might be able to set the conversion by adding a fallbackEncoding() method on LanguageZh_cn.
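[A hypothetical sketch of what that might look like: a method added to the existing LanguageZh_cn class in includes/LanguageZh_cn.php. The method name follows Brion's suggestion and the return value is an assumption; verify against your MediaWiki version that the title-decoding path actually calls it.]

    /**
     * Encoding to try when an incoming title is not valid UTF-8.
     * Sketch only - name and usage per the suggestion above.
     */
    function fallbackEncoding() {
        return 'GB18030';
    }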
-- brion vibber (brion @ pobox.com)