In that sense, I think we should change what valhallasw originally suggested to something like: "All files must be in UTF-8 format, without a BOM. They must start with '# -*- coding: utf-8 -*-' (without quotes) as their first line..."
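
To make the proposed rule concrete, here is a rough sketch of how it could be checked on the raw bytes of a file. The names check_source and CODING_LINE are made up for illustration and are not part of Pywikipedia; the check only covers the wording suggested above (no BOM, coding declaration as the first line, valid UTF-8).

```python
import codecs

# The declaration the proposed rule requires as the first line (assumed exact match).
CODING_LINE = b"# -*- coding: utf-8 -*-"


def check_source(data):
    """Return a list of problems found in the raw bytes of one source file.

    Hypothetical helper: it checks only the three points of the proposed rule.
    """
    problems = []

    # 1. The file must not start with the UTF-8 BOM (bytes EF BB BF).
    if data.startswith(codecs.BOM_UTF8):
        problems.append("starts with a UTF-8 BOM")
        data = data[len(codecs.BOM_UTF8):]

    # 2. The first line must be the coding declaration.
    first_line = data.splitlines()[0] if data else b""
    if first_line.rstrip() != CODING_LINE:
        problems.append("first line is not the coding declaration")

    # 3. The whole file must decode as UTF-8.
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        problems.append("not valid UTF-8")

    return problems
```

Reading each file in binary mode and passing its contents to check_source would give an empty list for files that follow the rule.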

Hojjat (aka Huji)


On 11/7/07, Russell Blau <russblau@imapmail.org> wrote:
Huji wrote:

> About BOM, I hope every editor has a way to add it to the beginning of the
> file. (In MediaWiki code, when a BOM was added by Notepad, I had to remove
> it to make the code work correctly, and it was a pain in the ass; now, I
> have this pessimistic feeling about adding it for Pywikipedia.)

UTF-8 files should not contain a BOM.  According to the Unicode BOM FAQ[1],
"UTF-8 can contain a BOM. However, it makes no difference as to the
endianness of the byte stream. UTF-8 always has the same byte order. An
initial BOM is only used as a signature — an indication that an otherwise
unmarked text file is in UTF-8. Note that some recipients of UTF-8 encoded
data do not expect a BOM. Where UTF-8 is used transparently in 8-bit
environments, the use of a BOM will interfere with any protocol or file
format that expects specific ASCII characters at the beginning, such as the
use of "#!" at the beginning of Unix shell scripts."  In Python, the
encoding is specified by an explicit "# -*- coding" line; not only is there
no need for a BOM, but having one there screws up Python's interpretation of
the file.
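
As a small illustration of the FAQ's point about "#!", the sketch below shows that a file saved with a BOM no longer begins with the expected ASCII characters, and how the three BOM bytes could be removed. strip_utf8_bom is a hypothetical helper name, and the sample script contents are invented for the example.

```python
import codecs


def strip_utf8_bom(data):
    """Return data without a leading UTF-8 BOM (bytes EF BB BF), if present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


# A made-up script as an editor might save it "with signature".
script = codecs.BOM_UTF8 + b"#!/usr/bin/python\n# -*- coding: utf-8 -*-\n"

print(script.startswith(b"#!"))                  # False: the BOM comes first
print(strip_utf8_bom(script).startswith(b"#!"))  # True once the BOM is removed
```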

[1] http://unicode.org/faq/utf_bom.html#25

Russ