This issue has been mentioned on wikitech-l at http://mail.wikipedia.org/pipermail/wikitech-l/2006-March/034397.html and in BugZilla:5790, http://bugzilla.wikimedia.org/show_bug.cgi?id=5790.
Hi all,
As I mentioned on the pages above, some wiki sites have incorrect lang tags that correspond to neither ISO 639 codes nor IANA language tags. I have examined this issue and submitted a patch to BugZilla. However, the patch cannot be accepted, as Brion said it would break the caching system.
Are there any workarounds or suggestions for resolving this issue without breaking the current caching system? (My knowledge of PHP5 classes is limited, so I need help from someone who is familiar with them.)
The patch would solve the problem of lang tags that do not match any valid language tag; for example, "simple" is neither an ISO 639 code nor an IANA language tag. The patch also fixes incorrect language tags in cases where the right tag depends on the user interface language. It includes sample code showing how the language tag is determined, so that, for example, a Traditional Chinese user sees the content in Traditional Chinese fonts (zh-hant/zh-TW) rather than Simplified Chinese fonts (zh/zh-hans/zh-CN).
regards, :) Shinjiman
-- http://meta.wikimedia.org/User:Shinjiman http://en.wikipedia.org/User:Shinjiman
Shinjiman wrote:
> As I mentioned on the pages above, some wiki sites have incorrect lang tags that correspond to neither ISO 639 codes nor IANA language tags. I have examined this issue and submitted a patch to BugZilla. However, the patch cannot be accepted, as Brion said it would break the caching system.
There was not any such patch for this, and it would be totally unnecessary to make one. Just let us know what the incorrect ones are and we'll fix the configuration.
Are you maybe thinking of the patch for something totally different which tries to guess the visitor's language variant and change the lang attribute based on it?
-- brion vibber (brion @ pobox.com)
Brion Vibber <brion@...> writes:
> There was not any such patch for this, and it would be totally unnecessary to make one. Just let us know what the incorrect ones are and we'll fix the configuration.
> Are you maybe thinking of the patch for something totally different which tries to guess the visitor's language variant and change the lang attribute based on it?
For example: the Simple English Wikipedia shows the lang tag as "simple", which exists neither in ISO 639 nor as an IANA language tag. For Simple English, the lang code should be "en" (English).
Traditional Chinese readers reading a Traditional Chinese web page should see the page in a Traditional Chinese font. However, this is not the case on Wikipedia (or on other wiki sites running MediaWiki). The Chinese Wikipedia (like other Chinese-language wikis running MediaWiki) has introduced the LanguageConverter class, and the lang tag is "zh" (Chinese). This is not a problem for Simplified Chinese readers, because the major browsers then use Simplified Chinese fonts. It is a problem for Traditional Chinese readers, who end up reading Chinese text in a Simplified Chinese font.
As for your suggestion to use the language variant to determine the language code the user is using: it is practically impossible to determine the lang tag from the language variant. Many users on the Chinese Wikipedia currently disable the language variant by default, because the Chinese word conversion currently causes many problems. (This is the main point of the issue.) I am one of them: I use a Traditional Chinese interface language (UI language = zh-hk) _without enabling_ any language variant (variant = zh). So the patch I submitted does not determine the lang tag from the language variant alone; it checks both the interface language and $wgContLanguageCode (the global interface language).
To summarise, I suggest adding a new attribute that assigns the lang tag correctly, using arrays or something that provides similar functionality. For example:
Language code            | Language tag
-------------------------+----------------------------------------------
en                       | en
de                       | de
simple (Simple English)  | en
zh-cn                    | zh-cn    <= originally supposed to be zh-hans (R1)
zh-sg                    | zh-sg    <= originally supposed to be zh-hans (R1)
zh-tw                    | zh-hant
zh-hk                    | zh-hant
zh-mo                    | zh-hant
zh-min-nan               | zh-hant  <= (R2)
zh-yue                   | zh-hant  <= (R2)
*Remarks:
1. zh-cn/zh-sg are used instead of zh-hans for browser compatibility (IE6 is likely to misunderstand the zh-hans lang tag).
2. zh-hant is used instead of zh-min-nan/zh-yue for browser compatibility (both IE and Firefox are likely to misunderstand the zh-min-nan and zh-yue lang tags).
The table above constructs a mapping from the various language codes to lang tags.
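As a rough illustration only, the mapping table could be held in a simple lookup structure. The sketch below is in Python rather than MediaWiki's PHP, and the names LANG_TAG_MAP and lang_tag_for are hypothetical; they do not come from the submitted patch.

```python
# Hypothetical lookup table mirroring the proposed mapping (not the actual patch).
LANG_TAG_MAP = {
    "en": "en",
    "de": "de",
    "simple": "en",           # Simple English is tagged as English
    "zh-cn": "zh-cn",         # kept instead of zh-hans for browser compatibility
    "zh-sg": "zh-sg",         # kept instead of zh-hans for browser compatibility
    "zh-tw": "zh-hant",
    "zh-hk": "zh-hant",
    "zh-mo": "zh-hant",
    "zh-min-nan": "zh-hant",  # browser-compatibility choice from remark 2
    "zh-yue": "zh-hant",      # browser-compatibility choice from remark 2
}


def lang_tag_for(language_code: str) -> str:
    """Return the mapped lang tag, or the code itself when no mapping exists."""
    return LANG_TAG_MAP.get(language_code.lower(), language_code)
```

Codes that are not in the table (for example "fr") pass through unchanged, which matches the idea that only the problematic codes need remapping.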
Once this language mapping is in place, OutputPage.php (for older skins) and Monobook.php (for newer skins) need to be modified to output <html xml:lang="XXX" lang="XXX"> correctly (where "XXX" is the correct lang code, instead of using $wgContLanguageCode directly) to address this issue.
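To make the intended output concrete, here is a minimal Python sketch of emitting the opening tag from an already-resolved lang code. The function name html_open_tag is an assumption for illustration; the real change would be made in the PHP skin files named above.

```python
from xml.sax.saxutils import quoteattr


def html_open_tag(lang_tag: str, direction: str = "ltr") -> str:
    """Build the opening <html> tag with matching xml:lang and lang attributes."""
    lang = quoteattr(lang_tag)  # escapes and quotes the attribute value
    return (
        '<html xmlns="http://www.w3.org/1999/xhtml" '
        f"xml:lang={lang} lang={lang} dir={quoteattr(direction)}>"
    )
```

Both attributes receive the same value, so the resolved code only has to be computed once per page view.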
I hope the information I have provided helps in addressing this kind of issue more smoothly. :)
regards Shinjiman
As for the problem in the Simple English Wikipedia, I think its $wgLanguageCode should be changed to "en", because it uses the English messages.
Wikitech-l mailing list Wikitech-l@wikimedia.org http://mail.wikipedia.org/mailman/listinfo/wikitech-l
Rotem Liss <mail@...> writes:
> As for the problem in the Simple English Wikipedia, I think its $wgLanguageCode should be changed to "en", because it uses the English messages.
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="XXX" lang="XXX">
The lang (and xml:lang) attribute on the HTML tag is incorrect for some languages, and this value should not simply be identical to $wgContLanguageCode. For example, there is no language tag called "simple" according to ISO 639, RFC 1766 or RFC 3066 (R1, R2). Hence my previous patch, submitted to Bug:5790. The main purpose of the patch is to add a new language tag mapping for user interface languages that use an incorrect language tag, and to change how the value "XXX" is obtained, since it should not be $wgContLanguageCode. I think Brion may not fully understand the actual situation on some language wikis. However, a resolution to this issue is worth considering.
An alternative way to solve this issue is to add a new text field that makes the lang attribute of the HTML tag customisable, so that any logged-in user can change the value in their preferences. (This idea was originally submitted by 百楽兎 [http://zh.wikipedia.org/wiki/User:%CE%A0rate].)
References
==========
R1: W3C, Language information and text direction [http://www.w3.org/TR/html4/struct/dirlang.html]
R2: W3C, Language tags in HTML and XML [http://www.w3.org/International/articles/language-tags/]
Shinjiman wrote:
> <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="XXX" lang="XXX">
> The lang (and xml:lang) attribute on the HTML tag is incorrect for some languages, and this value should not simply be identical to $wgContLanguageCode.
Incorrect; it *is* supposed to be the value of $wgContLanguageCode, as by definition $wgContLanguageCode is the RFC 3066 language code for the language of the wiki's content.
A reasonable case might be made that when variant display conversion is engaged, the lang attribute should be overridden.
> For example, there is no language tag called "simple"
Indeed there's not; that would be "en".
Note that $wgContLanguageCode is not the same as the *domain name* or *interwiki identifier*. These are separate issues.
> according to ISO 639, RFC 1766 or RFC 3066 (R1, R2). Hence my previous patch, submitted to Bug:5790. The main purpose of the patch is to add a new language tag mapping for user interface languages that use an incorrect language tag.
That would be stupid and useless. Instead, use the correct code to begin with.
-- brion vibber (brion @ pobox.com)
Hoi, The case for the Simple English Wikipedia is indeed obvious. It is more problematic when you want to link a Wikipedia that uses a code that will never be accepted as a language code because it is considered a language family, or a code that is used for another language, or a code that is specific while the Wikipedia uses it to indicate a larger language "continuum". Another issue is that language codes are retired; this leads to a different interpretation of the meaning of ku, fa and several others (this is part of ISO-639-3).
Having meaningful links between the Wikimedia codes for interwiki links and language codes is not trivial. For WiktionaryZ we are going to standardise on ISO-639-3 and have CLEAR codes that identify languages that are not recognised at present. One consequence is that the Babel templates will use the ISO-639-3 codes as well.
RFC 3066 indicates that it reserves tags for subsequent revisions of the ISO-639 code. ISO-639-3 clearly states that the codes will not be recycled. It also says that this principle will be maintained for any future revisions of the code. It is therefore safe to use ISO-639-3.
Thanks, GerardM
Gerard Meijssen <gerard.meijssen@...> writes:
> Hoi, The case for the Simple English Wikipedia is indeed obvious. It is more problematic when you want to link a Wikipedia that uses a code that will never be accepted as a language code because it is considered a language family, or a code that is used for another language, or a code that is specific while the Wikipedia uses it to indicate a larger language "continuum". Another issue is that language codes are retired; this leads to a different interpretation of the meaning of ku, fa and several others (this is part of ISO-639-3).
> Having meaningful links between the Wikimedia codes for interwiki links and language codes is not trivial. For WiktionaryZ we are going to standardise on ISO-639-3 and have CLEAR codes that identify languages that are not recognised at present. One consequence is that the Babel templates will use the ISO-639-3 codes as well.
> RFC 3066 indicates that it reserves tags for subsequent revisions of the ISO-639 code. ISO-639-3 clearly states that the codes will not be recycled. It also says that this principle will be maintained for any future revisions of the code. It is therefore safe to use ISO-639-3.
> Thanks, GerardM
On this point, you may have some ideas about the language tag attribute mapping that was previously posted in Bugzilla:5790 (http://bugzilla.wikimedia.org/show_bug.cgi?id=5790). I have made a flow chart (http://bugzilla.wikimedia.org/attachment.cgi?id=1704&action=view) showing how to determine the lang attribute so that text is rendered in the correct font. Would you also take a look at my proposed patches (http://bugzilla.wikimedia.org/attachment.cgi?id=1705&action=view and http://bugzilla.wikimedia.org/attachment.cgi?id=1706&action=view) and see whether they have the problem Brion pointed out, namely that the additions are considered unsafe for the cache? (Brion did not say *how* they would affect the cache.) Do you have any opinions on my proposed design, or are there any problems with running the code? (I have tested it on my local wiki, but I still have no idea *how* the cache would be affected, as Brion pointed out.) :)
thanks and regards Shinjiman
Brion Vibber <brion@...> writes:
> Shinjiman wrote:
>> <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="XXX" lang="XXX">
>> The lang (and xml:lang) attribute defined at the HTML tag in some language is not correct and it's supposed to not making this value identical to $wgContLanguageCode.
> Incorrect; it *is* supposed to be the value of $wgContLanguageCode, as by definition $wgContLanguageCode is the RFC 3066 language code for the language of the wiki's content.
> A reasonable case might be made that when variant display conversion is engaged, the lang attribute should be overridden.
Please note that the original purpose of the Language Converter is to change the page content *only* (articles, project pages, portals, etc.), for example [中国↔中國] in Chinese or [standardize↔standardise] in English. It seems Brion does not understand this original purpose, since he talks about changing the lang attribute in the HTML tag. (However, I wonder whether there are any PHP functions that can change the lang attribute in the HTML tag once it has been set by the output of OutputPage.php and SkinTemplate.php.)
>> For example there's no such language tag called "simple",
> Indeed there's not; that would be "en".
> Note that $wgContLanguageCode is not the same as the *domain name* or *interwiki identifier*. These are separate issues.
As it stands, there is still no separate value that defines the lang attribute in the HTML tag; it is currently defined by $wgContLanguageCode. If Brion (or any developer with shell access) changed $wgContLanguageCode from "simple" to "en" in LocalSettings.php on simple, that would correct the mislabelled tag. But this change would *break* the user language settings: anonymous visitors and newly registered users would get the English messages instead of the Simple English messages. (See http://simple.wikipedia.org/wiki/Special:Allmessages for the differences between the original English messages and the Simple English messages; currently neither LanguageSimple.php nor MessagesSimple.php is defined in the MediaWiki package.)
As far as I know, the value "simple" is *neither* the domain name nor the interwiki identifier. Rather, the values (including "simple") that affect $wgLanguageCode / $wgContLanguageCode are defined in the $wgLanguageNames array in languages/Names.php.
>> according to ISO639, RFC1766, RFC3066 (R1,R2). Hence for my previous patch that submitted to Bug:5790. The main purpose of the patch is adding a new Language Tag Mapping against the user interface language which using the incorrect language tag.
> That would be stupid and useless. Instead, use the correct code to begin with.
Have you read my previous posts?
* http://mail.wikipedia.org/pipermail/wikitech-l/2006-May/035546.html
* http://mail.wikipedia.org/pipermail/wikitech-l/2006-May/035578.html
As I mentioned in those posts, some users use the [zh-tw/zh-hk] user interface language but have the language variant *turned off* [zh]. For them it is impossible to determine the new lang attribute from the language variant. With Brion's idea of using the language variant to determine the new lang attribute, Traditional Chinese text would be rendered in Simplified Chinese fonts (because the language variant preference is [zh]). That is why I say that the language variant value is not suitable for this case.
> Instead, use the correct code to begin with.
I agree with this to some extent. It seems the file names in the languages directory and the values in languages/Names.php are not defined according to either the latest ISO-639-3 or the IANA language tags. Hopefully there will be some changes in this regard. :)
Hopefully this makes the issue clearer.
Shinjiman wrote:
> Please note that the original purpose of the Language Converter is to change the page content *only* (articles, project pages, portals, etc.), for example [中国↔中國] in Chinese or [standardize↔standardise] in English. It seems Brion does not understand this original purpose,
I understand it just fine. I have no idea how it relates to what you're proposing, which appears to be something totally unrelated trying to snag the user-agent's Accept-Language string (?!??!!?)
> As it stands, there is still no separate value that defines the lang attribute in the HTML tag; it is currently defined by $wgContLanguageCode. If Brion (or any developer with shell access) changed $wgContLanguageCode from "simple" to "en" in LocalSettings.php on simple, that would correct the mislabelled tag. But this change would *break* the user language settings, giving the English messages instead of the Simple English messages,
No it won't. There is no "simple English" language, it's just English. The defined user interface messages on the wiki will be shown when English is selected.
> As I mentioned in those posts, some users use the [zh-tw/zh-hk] user interface language but have the language variant *turned off* [zh]. It is impossible to determine the new lang attribute from the language variant.
I don't understand what this means. Can you explain it?
If you mean, "some people haven't selected a language variant", then in that case we cannot assign anything more specific than "zh" to that page display.
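The variant-based rule described here can be sketched roughly as follows. This is a Python illustration under stated assumptions, not MediaWiki code: lang_for_page is a hypothetical name, and a variant equal to the content language (or no variant at all) is treated as "no variant selected".

```python
from typing import Optional


def lang_for_page(content_lang: str, variant: Optional[str]) -> str:
    """Use the selected variant as the lang tag, else fall back to the content language."""
    # No variant, or a variant identical to the content language, gives nothing
    # more specific than the content language itself.
    if variant and variant.lower() != content_lang.lower():
        return variant.lower()
    return content_lang
```

With this rule, a reader who selected zh-tw gets lang="zh-tw", while a reader with the variant left at zh gets plain "zh".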
-- brion vibber (brion @ pobox.com)
Brion Vibber <brion@...> writes:
> I don't understand what this means. Can you explain it?
> If you mean, "some people haven't selected a language variant", then in that case we cannot assign anything more specific than "zh" to that page display.
Yes, that is what I have been saying. In my preferences, I use the Traditional Chinese user interface and have the language variant disabled (i.e. the language variant is set to zh). It is not that "some people *haven't* selected a language variant"; rather, "they have *disabled* the language variant". That is the problem with your suggestion of using the language variant to determine the lang attribute.
The problem above shows that it is not possible to determine the lang attribute from the language variant value.
Hence, I have tried to write a method that determines the lang attribute from *both* the user interface language value and $wgContLanguageCode. (It is the same design as the function submitted in Bug:5790; I would like to explain the procedure in sentences.) (And as you have said, *in your point of view*, this code is written totally differently from what you wanted.)
First, we check whether $wgContLanguageCode is on the change list; if it is *not*, return the $wgContLanguageCode value. (This is determined by the first array, $wgDispLangNeedChanges.)
If the language code *is* on the change list, do the following:
- If the key is found in the correct-to list, return the value associated with the key. (Defined in the second array, $wgDispLangChangeTo.)
For logged-in users
-------------------
- Try to get a language code value from the user interface preferences.
For anonymous visitors
----------------------
- Determine which font character set should be used (Simplified Chinese fonts / Traditional Chinese fonts) from HTTP_ACCEPT_LANGUAGE, but only if the browser *supports and has enabled* this feature.
- If determining the lang attribute *fails*, use $wgContLanguageCode as the final fallback.
- Finally, adjust the language tag where font character sets are assigned incorrectly. For example, zh-hk and zh-mo are assigned Simplified Chinese fonts, so we need to change them to zh-hant or similar, since zh-hant is assigned Traditional Chinese fonts. If the key cannot be found in the array, return $wgContLanguageCode as the final fallback. (This is defined in the third array, $wgDispLangChanges.)
This code does not affect the lang attribute when the $wgContLanguageCode value is *outside* the change list (defined in the first array): for en, de, fr, ja, ko, etc., the lang attribute is the same as $wgContLanguageCode.
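The procedure above can be sketched as follows. This is an illustrative Python version, not the PHP patch from Bug:5790: the three tables stand in for $wgDispLangNeedChanges, $wgDispLangChangeTo and $wgDispLangChanges, their contents are assumed from the examples in this thread, and the function names are hypothetical.

```python
from typing import List, Optional

NEED_CHANGES = {"simple", "zh"}   # cf. $wgDispLangNeedChanges (contents assumed)
CHANGE_TO = {"simple": "en"}      # cf. $wgDispLangChangeTo (contents assumed)
FONT_FIXES = {                    # cf. $wgDispLangChanges (contents assumed)
    "zh-cn": "zh-cn", "zh-sg": "zh-sg",
    "zh-tw": "zh-hant", "zh-hk": "zh-hant", "zh-mo": "zh-hant",
}


def parse_accept_language(header: str) -> List[str]:
    """Return the language ranges of an Accept-Language header, best first."""
    ranges = []
    for position, part in enumerate(header.split(",")):
        tag, _, params = part.strip().partition(";")
        if not tag.strip():
            continue
        quality = 1.0  # default weight when no q parameter is given
        params = params.strip()
        if params.startswith("q="):
            try:
                quality = float(params[2:])
            except ValueError:
                quality = 0.0
        ranges.append((-quality, position, tag.strip().lower()))
    return [tag for _, _, tag in sorted(ranges)]


def display_lang(content_lang: str, ui_lang: Optional[str] = None,
                 accept_language: Optional[str] = None) -> str:
    """Pick the lang attribute from the content language and the reader's settings."""
    if content_lang not in NEED_CHANGES:
        return content_lang                  # not on the change list
    if content_lang in CHANGE_TO:
        return CHANGE_TO[content_lang]       # fixed replacement
    if ui_lang:                              # logged-in user: UI preference
        candidates = [ui_lang.lower()]
    else:                                    # anonymous: browser header, if any
        candidates = parse_accept_language(accept_language or "")
    for candidate in candidates:
        if candidate in FONT_FIXES:
            return FONT_FIXES[candidate]
    return content_lang                      # final fallback
```

For instance, display_lang("zh", ui_lang="zh-hk") yields "zh-hant" even though no language variant is involved, while display_lang("en", ui_lang="zh-hk") stays "en" because "en" is not on the change list. Note that the anonymous branch makes the output depend on a request header, which is the kind of per-visitor variation a page cache has to account for.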
I think that there won't be problems with my design. However, I do not understand *how* this design affects the cache, as you mentioned before. The best I can do is to make the design as good as possible and to solve the problem more effectively. :)
So why do I need to change the value there, so that the lang attribute is not fixed to the $wgContLanguageCode value? It is quite a long story to explain why I proposed doing this.
======================================================================
There have been some discussions in which other users complained about the font rendering problem. (http://zh.wikipedia.org/wiki/Wikipedia:%E4%BA%92%E5%8A%A9%E5%AE%A2%E6%A0%88/... E6%8A%80%E6%9C%AF/%E5%AD%98%E6%A1%A3/2005%E5%B9%B412%E6%9C%88#.E7.B9.81.E9.AB.94.E7.B6.B2.E9.A0.81.E8.83.BD.E5.90.A6.E6.94.B9.E6.88.90.E6.96.B0.E7.B4.B0.E6.98.8E.E9.AB.94.EF.BC.9F (in Chinese; translated into English: "Readers in Taiwan and Hong Kong commonly use the MingLiU font to read Chinese content, and the MingLiU font is friendlier to readers from Taiwan and Hong Kong." (My additional note: MingLiU is the Traditional Chinese system font used in Windows.)))
There is also a request to add a character to the conversion table: (http://zh.wikipedia.org/wiki/Wikipedia:%E7%B9%81%E7%AE%80%E4%BD%93%E8%BD%AC%E6%8D%A2%E8%AF%B7%E6%B1%82#-.7Btw:.E8.A7.92.3Dhk:.E8.A7.92_.E2.89.A0_cn:.EF.A0.B3.7D- (in Chinese; translated into English: "The character 角 is different from 角; requesting that the character be added to the conversion table." (My additional note: since the character 角 has the same Unicode code point [U+89D2] in Traditional and Simplified Chinese, the only difference appears when it is rendered with Traditional Chinese versus Simplified Chinese fonts.)))
======================================================================
So, regarding the complaints mentioned above, we would like to see where the issue comes from. The user 百楽兎 (the original submitter of bug 5790) found that the following tag causes the font problem: <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="zh" lang="zh" dir="ltr">. The lang attribute [zh] is treated as Simplified Chinese in most browsers, so I investigated which PHP files contain that tag.
When I began examining the code, I had to find out where the issue arises, so I searched for the phrase "http://www.w3.org/1999/xhtml" to find out which files contain this string.
I found that the <html> tag is in OutputPage.php and Monobook.php, and that the lang attribute appears to be fixed to the value of $wgContLanguageCode. So I asked myself "how to make the lang attribute vary, *without changing any values, including the $wgContLanguageCode value*".
======================================================================
From this point, I understood the functions of those two files and the Language Converter:
OutputPage.php - the main part that *outputs* the <html> tag in the old skins.
Monobook.php - the main part that *outputs* the <html> tag in the new skins.
LanguageConverter - the Language Converter; this is used to change the text of the contents *only*.
As I see it, the function of the LanguageConverter is to change the text of the contents only; it does not change any attributes defined in the HTML code. I have also considered using the language variant value, as Brion suggested, but it is nearly impossible to determine the lang attribute from the language variant (in zh, there are 25 combinations that can be used). Problems are likely when the user interface language is set to Traditional Chinese [zh-tw/zh-hk] and the language variant has been turned off [zh]: the Traditional Chinese interface would be rendered with a Simplified Chinese font. From this case it is quite obvious that we cannot use the language variant alone to determine the lang attribute. I also think that adding this function to the LanguageConverter would overload it. So I looked for an alternative way to achieve the goal (checking *both* the user interface language and $wgContLanguageCode).
So I put the function that determines the lang attribute into OutputPage.php and Monobook.php and tested it on a local installation. Detecting the lang attribute from *both* the user interface language and $wgLanguageCode succeeded. However, Brion said that this would affect the cache (I have no idea *how* the cache would be affected).
So there is something I want to ask: if my design (I mean the lang attribute detection function) does not have a problem, how can we solve the cache issue that Brion mentioned? And if Brion's method (using the language variant to detect the lang attribute) is used, how would we solve the problem in some cases (UI=zh-tw,V=zh or UI=zh-hk,V=zh), and where can we put the code?
regards and thanks Man
------------------------------------
A little test of the font rendering
Copy the following into a text editor and save it as an HTML file:
<span lang="zh-cn"><font size="6">角 骨 解 述</font></span><br/>
<span lang="zh-tw"><font size="6">角 骨 解 述</font></span><br/>
<span lang="ja"><font size="6">角 骨 解 述</font></span><br/>
Use a browser to test it. It shows that even though the Unicode character is the same, it is rendered in different ways. (I also tried this on my Linux machine (FC5) and on the Linux machines at my school (which run FC4); the character 述 is rendered as a different glyph.)
The language of the Simple English Wikipedia is defined ($wgLanguageCode) as "simple", but because this language does not exist, all the messages fall back to English (except the changed ones). It's an ugly workaround, and the configuration ($wgLanguageCode) should be changed to "en" on simple.wikipedia.org. I don't think it will break the settings: the custom messages ("change this page" instead of "edit", "show any page" instead of "random page", etc.) *will* remain, because they are defined in the MediaWiki namespace. However, the "lang" and "xml:lang" attributes will be "en" as well if $wgLanguageCode="en".
So finally, we should just apply:
- $wgLanguageCode = "simple";
+ $wgLanguageCode = "en";
in LocalSettings.php of simple.wikipedia.org to fix the problem there, and it *won't* break any messages. However, I don't understand the problem in Chinese Wikipedia.
Brion Vibber wrote:
Shinjiman wrote:
As it stands, we still don't have any value that defines the lang attribute in the HTML tag; it is currently defined by $wgContLanguageCode. If Brion (or any developer with shell access) changed $wgContLanguageCode from "simple" to "en" in LocalSettings.php on simple, it would correct the mislabelled tag. But this change would *break* the user language settings, so that the English messages are used instead of the Simple English messages,
No it won't. There is no "simple English" language, it's just English. The defined user interface messages on the wiki will be shown when English is selected.
-- brion vibber (brion @ pobox.com)
On Fri, May 12, 2006 at 09:08:24PM -0700, Brion Vibber wrote:
No it won't. There is no "simple English" language, it's just English.
Are you making that assertion, brion, in the limited context of "languages that Mediawiki supports"?
Cause, y'know:
http://en.wikipedia.org/wiki/Simple_english
(and, on reading the disambig page, I guess I was really remembering Special English), but... could you expand, one more sentence?
Cheers, -- jra
Jay R. Ashworth wrote:
On Fri, May 12, 2006 at 09:08:24PM -0700, Brion Vibber wrote:
No it won't. There is no "simple English" language, it's just English.
Are you making that assertion, brion, in the limited context of "languages that Mediawiki supports"?
Cause, y'know:
Hey wow, that page agrees with me.
-- brion vibber (brion @ pobox.com)
On Sat, May 13, 2006 at 12:25:47PM -0700, Brion Vibber wrote:
Jay R. Ashworth wrote:
On Fri, May 12, 2006 at 09:08:24PM -0700, Brion Vibber wrote:
No it won't. There is no "simple English" language, it's just English.
Are you making that assertion, brion, in the limited context of "languages that Mediawiki supports"?
Cause, y'know:
Hey wow, that page agrees with me.
If there's no Simple English language; how can we have a wikipedia in it?
:-)
Cheers, -- jra
There's no Zlatiborian language, but we have a test wiki in it.
<zlatibor humour>The glorious Zlatiborians shall soon triumph over their evil Serbian overlords and will be able to speak their beautiful language and practice their magnificent culture freely once again!</zlatibor humour>
Wikitech-l mailing list Wikitech-l@wikimedia.org http://mail.wikipedia.org/mailman/listinfo/wikitech-l
On 13/05/06, Mark Williamson node.ue@gmail.com wrote:
There's no Zlatiborian language, but we have a test wiki in it.
<zlatibor humour>The glorious Zlatiborians shall soon triumph over their evil Serbian overlords and will be able to speak their beautiful language and practice their magnificent culture freely once again!</zlatibor humour>
Next up, the Vogon poetry wiki.
Rob Church
On Sat, May 13, 2006 at 10:37:36PM +0100, Rob Church wrote:
Next up, the Vogon poetry wiki.
Like plurdled gabblebloggits on a lurgid bee, I'm tellin' ya.
Cheers, -- jra
Brion Vibber <brion@...> writes:
Shinjiman wrote:
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="XXX" lang="XXX">
The lang (and xml:lang) attribute defined on the HTML tag is not correct for some languages, and this value should not simply be made identical to $wgContLanguageCode.
Hi, Brion, how about the second idea which Shinjiman said?
"An alternative way to solve this issue is adding a new option in user's preference setting, then the lang attribute in the HTML tag can be customizable and Logged on users can change the value to meet their actual need."
We need your professional advice, thank you.
I would also like to show you again why the correct lang code is important for Chinese users and how complicated it is. Please see this screenshot:
http://img90.imageshack.us/img90/9348/zhfamily6zw.png
You may not read Chinese, but that is all right, because you can see that there are 8 lang codes and 2 fonts in the table. You may also notice that the character structures differ between the 2 fonts. That is the point. For Chinese people, a different character structure affects their reading. Therefore, I think making <HTML lang="XXX"> customizable is the wiser method, because of the complexity of Chinese.
百楽兎
Shinjiman and Happy Rabbit, I don't know about everybody else here, but I'd find this issue easier to think about if I understood a bit more about what you're trying to accomplish. Apologies for not knowing as much as I should about these issues.
My main question is this. Why do we need the extra mapping of the HTML lang tag? Why is it not adequate to set the value of $wgContLanguageCode appropriately for each wiki, so that it matches the language that the wiki's pages are written in?
(I'm ignoring the question of interwiki links, and of the wiki's customized pages, because I think those are separate issues, and not the one I think we're most interested in here.)
I can guess what the problem might be, but I'm not sure. Please tell me which of the following scenarios it is, or if it isn't any of these, please explain what the issue really is.
1. Perhaps, within one wiki, there are different pages written in different variants of the language. Since $wgContLanguageCode is constant across the whole wiki, it can't be correct for each page. It must be overridden depending on the page being displayed.
2. Perhaps some browsers are broken, and don't choose the correct font based on the HTML lang tag. So even if $wgContLanguageCode is set correctly for the language a wiki's pages are written in, some users will see it incorrectly. So those users need a way to override the value of $wgContLanguageCode.
3. Perhaps the problem is this "Language Converter" thing. Even though a wiki's pages are written in one language, that language can be automatically altered, on the fly, as the page is displayed, to some other variant of the language. But when this is done, the value of $wgContLanguageCode for that wiki is no longer correct for the language being displayed.
Also, I don't understand what information would be used to adjust the HTML lang tag. Brion has suggested that $wgContLanguageCode should always be used, that is, that
html_lang_tag = $wgContLanguageCode
You're suggesting that the HTML lang tag has to be derived as a function of $wgContLanguageCode and some other information:
html_lang_tag = f( $wgContLanguageCode, ??? )
The question is, what is the other information? A per-user preference? Extra tags on the page being displayed?
Steve Summit wrote:
Shinjiman and Happy Rabbit, I don't know about everybody else here, but I'd find this issue easier to think about if I understood a bit more about what you're trying to accomplish. Apologies for not knowing as much as I should about these issues.
My main question is this. Why do we need the extra mapping of the HTML lang tag? Why is it not adequate to set the value of $wgContLanguageCode appropriately for each wiki, so that it matches the language that the wiki's pages are written in?
The base problem is that various characters may appear differently when rendered in a Simplified Chinese font vs a Traditional Chinese font.
See http://en.wikipedia.org/wiki/Han_unification for some background on this issue.
Characters that differ semantically between the variants generally are assigned separate Unicode code points, but for characters that simply have different customary appearances the difference is left to the fonts.
Browsers will (or at least, may) pick appropriate default fonts based on the language specified in the web page.
Thus for example, a page which is marked as "zh" (defaulting usually to Simplified Chinese) but which contains text written for Traditional form may display oddly.
We have both Traditional and Simplified forms mixed on the Chinese-language Wikipedia. The wiki software is extended with two key abilities relevant to this:
1) When logged in, a user can select which language the user-interface text appears in. As well as completely different languages, they can choose a specific variant of Chinese (eg, Traditional instead of Simplified).
2) A visitor can select to have the text of a page automatically converted to their preferred variant (Simplified or Traditional) for more comfortable reading.
For a quick sample, here's the main page in default: http://zh.wikipedia.org/wiki/%E9%A6%96%E9%A1%B5
and here with all text converted to Simplified script forms: http://zh.wikipedia.org/w/index.php?title=%E9%A6%96%E9%A1%B5&variant=zh-...
and here with all text converted to Traditional script forms: http://zh.wikipedia.org/w/index.php?title=%E9%A6%96%E9%A1%B5&variant=zh-...
I don't know Chinese well enough to tell what's displaying right and what's displaying wrong, but depending on your browser and fonts situation, you might be able to see that some of the characters appear in different fonts.
At the moment no matter which variant you pick, the HTML still declares its language to be "zh", just plain Chinese. Shinjiman indicates that this defaults to Simplified Chinese in browsers that pick fonts depending on the declared language code; thus this might be displaying incorrectly when text is actually Traditional Chinese.
Now, it would appear to make reasonable sense to pick a more detailed language code when variant conversion is in use, so that browsers' font selections pick the matching variant. On this I concur completely.
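A minimal sketch of that idea, written in Python purely for illustration: the function name is hypothetical, and it assumes that a variant code such as zh-tw is itself acceptable as a language tag. When no variant conversion is in use, the wiki's content language code is returned unchanged.

```python
from typing import Optional


def lang_for_variant(content_code: str, variant: Optional[str] = None) -> str:
    """Return the lang tag for a page, preferring the active variant.

    Hypothetical helper: the variant is used only when it is a more
    detailed form of the content language (e.g. "zh-tw" for "zh").
    """
    if variant and variant.lower().startswith(content_code.lower() + "-"):
        return variant.lower()
    # No conversion in use, or the variant does not belong to this wiki's
    # content language: keep the plain content language code.
    return content_code
```

Because the result depends only on the content language and the selected variant, nothing about the visitor's preferences or browser headers is needed to compute it.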
Where we appear to have got bogged down is in how this relates to two things:
a) The user interface language selected in the user's wiki preferences.
b) The Accept-Language header sent from the browser.
As for a), it may make sense for a lot of user-interface text to appear with its own overriding lang attributes. This isn't necessarily easy in all cases, but if it's appropriate it's something we can talk about.
As for b), by itself it doesn't seem to make a lot of sense to me; first, picking a language code from the client would simply cause it to fail to match with the content language. Second, the more general issue of picking user-interface language based on the header is something we've provisionally rejected for a long time; non-default UI languages are often not fully customized and would often produce a sub-par or confusing user experience. Additionally supporting it would require changes to the caching infrastructure and would increase server load by an unknown and potentially large amount.
b) appears to be entirely separate from the question at hand, so I'd prefer to leave it for another discussion.
-- brion vibber (brion @ pobox.com)
Thanks for all that background, Brion.
You wrote:
Now, it would appear to make reasonable sense to pick a more detailed language code when variant conversion is in use, so that browsers' font selections pick the matching variant. On this I concur completely.
So if I'm understanding all this (including several of the other things you wrote) the progression would be:
First implement language code adjustment when variant conversion is in use (and based solely on that variant conversion). This seems to be the most clearly required, the most straightforward to implement, and the most likely to yield big benefits.
Then, if there are still problems which variant conversion can't solve, investigate additional language code adjustment based on the user's user interface language preference.
Then, if there are *still* problems, revisit the possibility of also looking at the browser's Accept-Language header.
No, it's not right to use "zh-hant" for zh-min-nan. That Wikipedia is written in Latin letters, some are special diacritics that don't show up in most Traditional Chinese fonts. This was an issue before -- when it said "zh-min-nan" in the tag, it caused the text to display wrong in some browsers. I'm not sure what tag is used at present, though.
Mark
On 10/05/06, Shinjiman shinjiman@gmail.com wrote:
Brion Vibber <brion@...> writes:
There was not any such patch for this, and it would be totally unnecessary to make one. Just let us know what the incorrect ones are and we'll fix the configuration.
Are you maybe thinking of the patch for something totally different which tries to guess the visitor's language variant and change the lang attribute based on it?
For example: on the Simple English Wikipedia, the lang tag is shown as "simple", which exists neither as an ISO 639 code nor as an IANA language tag. For 'Simple English', the lang code should be "en" (English).
And when Traditional Chinese readers read a Traditional Chinese web page, they are supposed to see it in a Traditional Chinese font. However, this does not hold for Wikipedia (and the various wiki sites running MediaWiki). The Chinese Wikipedia (and the various Chinese-language wiki sites running MediaWiki) has introduced the LanguageConverter class, and the lang tag is "zh" (Chinese). This is not a problem for Simplified Chinese readers, because the (major) browsers will use Simplified Chinese fonts. However, it is a problem for Traditional Chinese readers, who end up reading the Chinese text in a Simplified Chinese font.
You suggested using the language variant to determine the language code that the user is using, but it is nearly impossible to determine the lang tag from the language variant, since many users on the Chinese Wikipedia currently disable the language variant by default because the Chinese word conversion causes many problems. (This is the main point of the issue.) That includes me: I use a Traditional Chinese interface language (UI language = zh-hk) _without enabling_ any language variant (Variant = zh). So the patch I submitted does not determine the lang tag from the language variant alone, but checks both the interface language and $wgContLanguageCode (Global interface language).
To summarise my statement above, I suggest adding a new attribute that assigns the lang tag correctly, using arrays or something that provides similar functionality. For example:
Language code            | Language tag
-------------------------+----------------------------------------------
en                       | en
de                       | de
simple (Simple English)  | en
zh-cn                    | zh-cn   <= originally supposed to be zh-hans (R1)
zh-sg                    | zh-sg   <= originally supposed to be zh-hans (R1)
zh-tw                    | zh-hant
zh-hk                    | zh-hant
zh-mo                    | zh-hant
zh-min-nan               | zh-hant <= (R2)
zh-yue                   | zh-hant <= (R2)
*Remarks:
1. The tags zh-cn/zh-sg are used instead of zh-hans for browser compatibility (IE6 is likely to misunderstand the zh-hans lang tag).
2. The tag zh-hant is used instead of zh-min-nan/zh-yue for browser compatibility (both IE and Firefox are likely to misunderstand the zh-min-nan and zh-yue lang tags).
The table above constructs a lang tag mapping for the various languages.
After this kind of language mapping is done, OutputPage.php (for older skins) and Monobook.php (for newer skins) need to be modified to output <html xml:lang="XXX" lang="XXX"> correctly ("XXX" being the correct lang code, instead of using $wgContLanguageCode directly) to address this issue.
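As a rough illustration of such a mapping and of the tag it would produce, here is a small Python sketch. It is not MediaWiki code: the dictionary simply restates the proposed table, and the names LANG_TAG_FOR_CODE, lang_tag_for and html_open_tag are hypothetical.

```python
from html import escape

# The proposed mapping from language code to lang tag, restated as a
# lookup table. Removing a key makes that code map to itself again.
LANG_TAG_FOR_CODE = {
    "en": "en",
    "de": "de",
    "simple": "en",
    "zh-cn": "zh-cn",
    "zh-sg": "zh-sg",
    "zh-tw": "zh-hant",
    "zh-hk": "zh-hant",
    "zh-mo": "zh-hant",
    "zh-min-nan": "zh-hant",
    "zh-yue": "zh-hant",
}


def lang_tag_for(code: str) -> str:
    """Look up the lang tag for a language code; unknown codes map to themselves."""
    return LANG_TAG_FOR_CODE.get(code, code)


def html_open_tag(code: str) -> str:
    """Build the opening <html> tag with matching xml:lang and lang values."""
    tag = escape(lang_tag_for(code), quote=True)
    return (
        '<html xmlns="http://www.w3.org/1999/xhtml" '
        f'xml:lang="{tag}" lang="{tag}">'
    )
```

Keeping the mapping in one table means a disputed row can be changed or dropped without touching the code that writes the tag.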
I hope the information I have provided helps you to address this kind of issue more smoothly. :)
regards Shinjiman
Mark Williamson <node.ue@...> writes:
No, it's not right to use "zh-hant" for zh-min-nan. That Wikipedia is written in Latin letters, some are special diacritics that don't show up in most Traditional Chinese fonts. This was an issue before -- when it said "zh-min-nan" in the tag, it caused the text to display wrong in some browsers. I'm not sure what tag is used at present, though.
Mark
For this problem, I can remove or change the key in those arrays to address the issue, so that zh-min-nan itself or en (Latin character sets) is returned when $wgContLanguageCode is set to zh-min-nan. :)
Man
wikitech-l@lists.wikimedia.org