-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1
On 22/11/2010 14:25, Asaf Bartov wrote:
> Note that the bug exists in GNU/Linux as well -- it's just better hidden... :) UTF8 uses a _variable_ number of bytes to encode a code point. Often a single byte is enough. But if your filename includes very special characters, such as an "em-dash" (–) or an IPA character such as *ʧ* -- then the character would take up two or three bytes, and for some obscure characters it can be up to _four_ bytes.
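To make the variable-length point concrete, here is a minimal C++ sketch that encodes a single code point as UTF-8 so the byte count can be checked. The name `encode_utf8` is hypothetical and is not part of libzim or Kiwix; it is only meant as an illustration.

```cpp
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical helper (not from libzim or Kiwix): encode one Unicode
// code point as a UTF-8 byte sequence of 1 to 4 bytes.
std::string encode_utf8(std::uint32_t cp) {
    std::string out;
    if (cp <= 0x7F) {
        // ASCII range: one byte.
        out += static_cast<char>(cp);
    } else if (cp <= 0x7FF) {
        // Two bytes, e.g. the IPA character U+02A7.
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0xFFFF) {
        // Surrogate code points are not valid on their own.
        if (cp >= 0xD800 && cp <= 0xDFFF) {
            throw std::invalid_argument("surrogate code point");
        }
        // Three bytes, e.g. the em-dash U+2014.
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0x10FFFF) {
        // Four bytes, for code points outside the Basic Multilingual Plane.
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        throw std::invalid_argument("code point beyond U+10FFFF");
    }
    return out;
}
```

Calling `encode_utf8(0x02A7).size()` gives the length for *ʧ*, and `encode_utf8(0x2014).size()` the length for the em-dash.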
I don't think there is an issue with UTF8 file names containing an em-dash, either in libzim or in Kiwix. I have tested it and it works. The reason, I think, is that the kernel interprets the char* string directly as UTF8 (ext3/4 is in UTF8).
But on Windows, it is not possible to interpret the char* directly as UTF16; otherwise an ASCII-encoded path would not work. So I suppose STL open() & co (or the kernel) perform a charset conversion to UTF16 before asking the filesystem.
So if you want to open a file whose name contains characters outside the ASCII charset, I suppose you have to use a special STL open() that accepts wchar and give it the path directly in UTF16.
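If that theory holds, one possible shape for a portable wrapper is sketched below. It is untested on Windows and rests on assumptions: the path is stored as UTF-8 internally, and the names `utf8_to_utf16` and `open_utf8_path` are hypothetical, not existing libzim or Kiwix functions. On Windows the path is converted to UTF-16 and passed to `_wfopen` from the Microsoft C runtime; elsewhere the bytes go to `fopen` unchanged. The decoder does only minimal validation (it does not reject overlong encodings).

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <stdexcept>
#include <string>
#ifdef _WIN32
#include <wchar.h>  // _wfopen (Microsoft C runtime)
#endif

// Hypothetical helper: decode UTF-8 bytes into UTF-16 code units.
std::u16string utf8_to_utf16(const std::string& in) {
    std::u16string out;
    std::size_t i = 0;
    while (i < in.size()) {
        unsigned char b = static_cast<unsigned char>(in[i]);
        std::uint32_t cp = 0;
        std::size_t len = 0;
        // The lead byte tells us how long the sequence is.
        if (b < 0x80)                { cp = b;        len = 1; }
        else if ((b & 0xE0) == 0xC0) { cp = b & 0x1F; len = 2; }
        else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; len = 3; }
        else if ((b & 0xF8) == 0xF0) { cp = b & 0x07; len = 4; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");
        if (i + len > in.size()) {
            throw std::invalid_argument("truncated UTF-8 sequence");
        }
        // Each continuation byte contributes six more bits.
        for (std::size_t k = 1; k < len; ++k) {
            unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80) {
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            }
            cp = (cp << 6) | (c & 0x3F);
        }
        i += len;
        if (cp > 0x10FFFF) {
            throw std::invalid_argument("code point beyond U+10FFFF");
        }
        if (cp < 0x10000) {
            out += static_cast<char16_t>(cp);
        } else {
            // Outside the BMP: emit a surrogate pair.
            cp -= 0x10000;
            out += static_cast<char16_t>(0xD800 + (cp >> 10));
            out += static_cast<char16_t>(0xDC00 + (cp & 0x3FF));
        }
    }
    return out;
}

// Hypothetical wrapper: open a file whose path is given in UTF-8.
std::FILE* open_utf8_path(const std::string& path, const char* mode) {
#ifdef _WIN32
    // wchar_t is 16 bits on Windows, so UTF-16 code units copy over directly.
    std::u16string p = utf8_to_utf16(path);
    std::u16string m = utf8_to_utf16(mode);
    std::wstring wp(p.begin(), p.end());
    std::wstring wm(m.begin(), m.end());
    return _wfopen(wp.c_str(), wm.c_str());
#else
    // Assumption: the filesystem accepts the UTF-8 bytes as they are.
    return std::fopen(path.c_str(), mode);
#endif
}
```

A caller would use it like `fopen`, for example `open_utf8_path(name, "rb")`, and check the result for a null pointer.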
That is my theory.
> So French accents fit in one byte, but some other characters do not. If I had a ZIM file with such a character on GNU/Linux, the code would fail too.
It does not look like it :)
> We do need a portable solution. I don't know the right way to do it off the top of my head, so perhaps someone else on the list can offer advice. If no one can, I'm willing to figure it out myself.
Yes, that would be great. Tommi, you are the STL expert :)
Thanks for your feedback, Asaf.
Emmanuel