I am working on an update of really outdated LanguageCs.php (using the up-to-date messages in cs:MediaWiki namespace) and I have found a problem I do not understand:
In SpecialBlockip.php, there are commands like
$wgOut->addWikiText( htmlspecialchars( wfMsg( "blockiptext" ) ) );
AFAICS this means that no HTML entities may occur in the blockiptext message (as every & gets converted to &amp;). Is there any reason for that? Why not drop the htmlspecialchars() call? Is there any generic rule for when it is used and when not? (I mean, some other special pages seem to use addHTML( wfMsg( ... ) ) etc.)
Thanks, [[cs:User:Mormegil|Mormegil]]
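The effect described in this message can be reproduced with PHP's built-in htmlspecialchars() alone. This is only a minimal sketch: the message text is a made-up stand-in, not the real blockiptext, and no MediaWiki code is involved.

```php
<?php
// Hypothetical message text containing HTML entities (not the real blockiptext).
$message = 'IP&nbsp;address &bdquo;quoted&ldquo;';

// htmlspecialchars() escapes every "&", so the entities stop being entities.
$escaped = htmlspecialchars($message);

echo $escaped, "\n";
```

Once the escaped string reaches the browser, the reader sees the literal text "&nbsp;" instead of a non-breaking space.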
Petr Kadlec wrote:
> I am working on an update of really outdated LanguageCs.php (using the up-to-date messages in cs:MediaWiki namespace) and I have found a problem I do not understand:
> In SpecialBlockip.php, there are commands like
> $wgOut->addWikiText( htmlspecialchars( wfMsg( "blockiptext" ) ) );
> AFAICS this means that no HTML entities may occur in the blockiptext message (as every & gets converted to &amp;). Is there any reason for that? Why not drop the htmlspecialchars() call? Is there any generic rule for when it is used and when not? (I mean, some other special pages seem to use addHTML( wfMsg( ... ) ) etc.)
I don't think there's any particular reason for that, besides helping out translators who generally don't know HTML. Remember that wfMsg() can take parameters, and such parameters always have to be cleaned before being output if they come from the user.
-- Tim Starling
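The point about parameters can be illustrated by escaping only the user-supplied values and leaving the message text alone. The sketch below is an assumption-laden stand-in: fillMessage() is a hypothetical helper written for this example, not MediaWiki's wfMsg(), and it only imitates the $1, $2 placeholder style.

```php
<?php
// Hypothetical stand-in for message substitution; this is NOT MediaWiki's wfMsg().
function fillMessage(string $template, array $params): string {
    $replacements = [];
    foreach (array_values($params) as $i => $param) {
        // Escape only the user-supplied parameter, never the template itself.
        $replacements['$' . ($i + 1)] = htmlspecialchars($param, ENT_QUOTES, 'UTF-8');
    }
    // strtr() replaces each placeholder once and does not rescan its own output.
    return strtr($template, $replacements);
}

// The template keeps its entity; the hostile parameter is neutralised.
$template = 'Blocked user&nbsp;$1.';
$result = fillMessage($template, ['<script>alert(1)</script>']);
echo $result, "\n";
```

Note that this still trusts the template text completely, so it is only safe when the message itself cannot carry hostile markup.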
OK, although I would imagine that wrapping the parameters in htmlspecialchars() would be a little bit "more correct", I can understand that.
So I'll ask a question that seems to be far off this topic, but it is not. :-) What importance, meaning and purpose is there in the following message that appears in some LanguageXx.php files?
# This file is encoded in UTF-8, no byte order mark.
# For compatibility with Latin-1 installations, please
# don't add literal characters above U+00ff.
What difference is there between e.g. U+00FF (UTF-8 encoding C3, BF) and U+0100 (encoded to C4, 80), with regards to Latin-1 installations?
And if the message were true, what other method could I use to include characters above U+00FF in my messages? (Which is the only reason for using UTF-8 anyway.) As I see from the previous problem (ha! topic connection :-) ), I cannot use HTML entities like "scaron".
Well, anyway we do use literal characters above U+00FF in LanguageCs.php just because there is no other practically usable way to write Czech...
So, my final deduction is that the abovementioned message is rather strange and I should ignore it, write any Unicode character to the file normally, and generally not use HTML entities. (Which is unfortunate especially for nbsp, which is normally indistinguishable from a plain space character.) Am I correct?
Thanks, [[cs:User:Mormegil|Mormegil]]
Petr Kadlec wrote:
> OK, although I would imagine that wrapping the parameters in htmlspecialchars() would be a little bit "more correct", I can understand that.
> So I'll ask a question that seems to be far off this topic, but it is not. :-) What importance, meaning and purpose is there in the following message that appears in some LanguageXx.php files?
> # This file is encoded in UTF-8, no byte order mark.
> # For compatibility with Latin-1 installations, please
> # don't add literal characters above U+00ff.
> What difference is there between e.g. U+00FF (UTF-8 encoding C3, BF) and U+0100 (encoded to C4, 80), with regards to Latin-1 installations?
That warning doesn't apply to LanguageCs.php. It only applies to the language files with that comment. The story is that some wikis (in particular en, da, nl and sv) have been encoded in latin-1 since the year dot. The language files for those wikis used to be latin-1, but that prevented the creation of new utf-8 wikis in those languages. So Brion converted all the language files to utf-8, and wrote LanguageLatin1.php, which uses iconv to convert the text to latin-1 at runtime. Characters above U+00FF can't be represented in latin-1, and are instead converted to a question mark. This is rarely an issue since the languages with latin-1 wikis generally only need latin-1 characters.
> So, my final deduction is that the abovementioned message is rather strange and I should ignore it, write any Unicode character to the file normally, and generally not use HTML entities. (Which is unfortunate especially for nbsp, which is normally indistinguishable from a plain space character.) Am I correct?
Yes, that's correct.
-- Tim Starling
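The U+00FF boundary discussed above can be checked mechanically. The sketch below does not reproduce LanguageLatin1.php or its iconv call; fitsInLatin1() is a hypothetical helper that only tests whether a UTF-8 string contains a character that Latin-1 cannot represent.

```php
<?php
// Hypothetical helper: true if every character is at or below U+00FF,
// i.e. representable in Latin-1 (ISO-8859-1).
function fitsInLatin1(string $utf8): bool {
    // With the /u modifier, \x{...} matches Unicode code points.
    // preg_match() returns false on invalid UTF-8, which is also reported as "does not fit".
    return preg_match('/[\x{0100}-\x{10FFFF}]/u', $utf8) === 0;
}

// UTF-8 byte sequences of the two characters from the question.
echo bin2hex("\u{00FF}"), "\n"; // c3bf
echo bin2hex("\u{0100}"), "\n"; // c480

var_dump(fitsInLatin1("\u{00FF}")); // last character Latin-1 can hold
var_dump(fitsInLatin1("\u{0161}")); // s with caron, outside Latin-1
```

Both characters take two bytes in UTF-8, so the difference is not in the file encoding but in whether the target Latin-1 character set has a slot for the code point.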
Well, meta:LanguageCs.php did contain the Latin-1 note, but now I see it was in error. Thanks for clarifying that.
And one (I hope final) request: I have updated meta:LanguageCs.php. I would like to ask someone to transfer it into CVS. Or, is there anything special I should do in order to get the CVS updated?
Thanks, [[cs:User:Mormegil|Mormegil]]
On Oct 24, 2004, at 7:01 AM, Petr Kadlec wrote:
> AFAICS this means that no HTML entities may occur in the blockiptext message (as every & gets converted to &amp;). Is there any reason for that? Why not drop the htmlspecialchars() call? Is there any generic rule for when it is used and when not? (I mean, some other special pages seem to use addHTML( wfMsg( ... ) ) etc.)
Raw HTML is dangerous for several reasons, chiefly these two:
* If a message does not contain 100% valid HTML, pages using it would become completely inaccessible in strict XHTML output mode. An accidental change by a sysop to a critical message could make the ENTIRE WIKI inaccessible to everyone until someone with direct database access came in to fix it.
* HTML can contain JavaScript; a sysop account could be used to add a cross-site scripting attack to EVERY PAGE OF THE WIKI. (On some larger wikis we have literally hundreds of sysops, who might not all be trusted; also accounts may be compromised by various means including unknown but limited-exposure cross-site scripting attacks.)
As such I've been gradually moving wiki messages over to be wikitext where possible, or else plaintext (via htmlspecialchars). This is not yet complete.
-- brion vibber (brion @ pobox.com)
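The plaintext route mentioned above can be sketched with htmlspecialchars() alone. The message below is invented for illustration (including the example.invalid URL), and the snippet stands apart from any MediaWiki output code.

```php
<?php
// Hypothetical message that a compromised sysop account has edited (illustration only).
$message = 'You are blocked.<script src="//example.invalid/x.js"></script>';

// Treating the message as plain text turns the markup into harmless character data.
$safe = htmlspecialchars($message, ENT_QUOTES, 'UTF-8');

echo $safe, "\n";
```

Because every "<" and ">" is escaped, the result can neither inject a script nor break the well-formedness of the surrounding page.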
Yes, I understand why direct HTML is dangerous, but the impossibility of using HTML entities is annoying when you want to use some Unicode characters that are not on keyboard (e.g. bdquo, ldquo, nbsp etc.). So really _plain_ text messages should be IMHO limited only to those where wikitext support would be too complicated/impossible to add.
[[cs:User:Mormegil|Mormegil]]
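For characters that are not on the keyboard, one possible workaround is to decode the entities once, outside the wiki, and paste the resulting literal UTF-8 characters into the message. This is a sketch of that idea using PHP's html_entity_decode(), not something the thread prescribes.

```php
<?php
// Entities named in the discussion: bdquo, ldquo, nbsp, scaron.
$withEntities = '&bdquo;text&ldquo;&nbsp;&scaron;';

// Decode once, offline; the result contains literal UTF-8 characters.
$literal = html_entity_decode($withEntities, ENT_QUOTES | ENT_HTML401, 'UTF-8');

echo $literal, "\n";
// Hex dump, useful because nbsp looks just like a plain space on screen.
echo bin2hex($literal), "\n";
```

The hex dump is there because a literal non-breaking space (bytes c2 a0) is visually indistinguishable from an ordinary space in most editors.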
On Sunday 24 October 2004 22:33, Petr Kadlec wrote:
> HTML is dangerous
Perhaps only sysops or devels could use direct HTML and PHP in a wikipage.
When a user tries to add HTML/PHP, it will be rendered as plain text.
On Oct 24, 2004, at 12:52 PM, NSK wrote:
> On Sunday 24 October 2004 22:33, Petr Kadlec wrote:
>> HTML is dangerous
> Perhaps only sysops or devels could use direct HTML and PHP in a wikipage.
If you read the following paragraphs you'll see two reasons why that's insufficient.
-- brion vibber (brion @ pobox.com)
wikitech-l@lists.wikimedia.org