Okay, Danny and I have argued about this on IRC for the past 45
minutes and still don't agree. My position is as follows. XHTML1 is
a reformulation of HTML4 and does not change it except to make it
XML-conformant, or as otherwise noted. HTML4 clearly states that id
and name attributes must meet exactly the same production: in
particular, they must start with a letter.
Now, the XHTML DTD doesn't say this. According to the XHTML1
Transitional DTD[1], the name attribute is of type NMTOKEN. This
production can, in particular, begin with non-letters such as digits
and hyphens[2]. Likewise, according to the exact same DTD, the id
attribute is of type ID, and this can also begin with certain
non-letters (underscore and colon)[3]. So it seems to
me that if you think the DTD takes precedence over the productions
given in HTML4, both id *and* name attributes can begin with
non-letters (although the exact class of characters they can begin
with varies).
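To make the difference between the two types concrete, here's a rough
sketch of both productions as regular expressions. It is restricted to
ASCII, which is my simplification: the real XML 1.0 productions also
admit a large set of non-ASCII characters. The names is_name() and
is_nmtoken() are made up for illustration.

```python
import re

# ASCII-only approximations of the XML 1.0 productions (an assumption of
# this sketch; the real productions also allow many non-ASCII characters).
NAME_START = r"[A-Za-z_:]"
NAME_CHAR = r"[A-Za-z0-9._:-]"

XML_NAME = re.compile(rf"{NAME_START}{NAME_CHAR}*")  # attributes of type ID
XML_NMTOKEN = re.compile(rf"{NAME_CHAR}+")           # attributes of type NMTOKEN


def is_name(value):
    """True if value matches the (ASCII-restricted) Name production."""
    return XML_NAME.fullmatch(value) is not None


def is_nmtoken(value):
    """True if value matches the (ASCII-restricted) Nmtoken production."""
    return XML_NMTOKEN.fullmatch(value) is not None


for candidate in ["section", "_private", "1st", "-x", "a b"]:
    print(f"{candidate!r}: ID={is_name(candidate)} NMTOKEN={is_nmtoken(candidate)}")
```

Under these assumptions "1st" and "-x" are acceptable as NMTOKEN but
not as ID, while "_private" is acceptable as both even though it
doesn't start with a letter.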
However, no XML standard says that a DTD specifies the full
restrictions on the type of content allowed. Additional restrictions
may be added in the prose of the specification. XML that conforms to
the DTD is valid, but might still be non-conformant. In this
particular case, it's not clear from the normative text whether the
constraints from HTML4 are supposed to be maintained. In the
informative appendix C.8 of XHTML1[4], it says:
"Further, since the set of legal values for attributes of type ID is
much smaller than for those of type CDATA, the type of the name
attribute has been changed to NMTOKEN. This attribute is constrained
such that it can only have the same values as type ID, or as the Name
production in XML 1.0 Section 2.3, production 5. Unfortunately, this
constraint cannot be expressed in the XHTML 1.0 DTDs."
In other words, the NMTOKEN type was only selected because it was the
closest possible match. In fact, the intent seems to have been to
weaken the HTML4 restrictions on both name and id, and allow a wider
selection of characters to be used (the Name production from XML is
considerably laxer than [a-zA-Z][a-zA-Z0-9_:.-]* from HTML4).
However, it also implies that id and name adhere to exactly the same
standards, and in particular, neither one may start with a digit,
hyphen, etc. But this isn't stated anywhere normative that I can
find.
On the other hand, it's recommended that XHTML1 documents that are
intended to be backward-compatible restrict themselves to the HTML4
definitions anyway:[4]
"Note that the collection of legal values in XML 1.0 Section 2.3,
production 5 is much larger than that permitted to be used in the ID
and NAME types defined in HTML 4. When defining fragment identifiers
to be backward-compatible, only strings matching the pattern
[A-Za-z][A-Za-z0-9:_.-]* should be used. See Section 6.2 of [HTML4]
for more information."
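As an illustration of that recommendation, here's a minimal sketch of
checking a fragment identifier against the quoted pattern and of
forcing an arbitrary string into it. The escaping scheme (hex-encoding
disallowed bytes as ".XX" and prepending "x" when the result doesn't
start with a letter) is purely hypothetical, as are the function names;
I'm not claiming it matches what Sanitizer::escapeId() actually does.

```python
import re

# Pattern quoted from XHTML 1.0 appendix C.8.
HTML4_ID = re.compile(r"[A-Za-z][A-Za-z0-9:_.-]*")

# Characters copied through unchanged. "." is deliberately left out so
# that it stays reserved as the escape marker in this sketch.
SAFE_CHAR = re.compile(r"[A-Za-z0-9:_-]")


def is_backward_compatible(fragment):
    """True if fragment matches the HTML4-compatible pattern."""
    return HTML4_ID.fullmatch(fragment) is not None


def make_backward_compatible(text, prefix="x"):
    """Hypothetical escaping scheme, for illustration only."""
    pieces = []
    for byte in text.encode("utf-8"):
        char = chr(byte)
        if SAFE_CHAR.fullmatch(char):
            pieces.append(char)
        else:
            # Encode every other byte as "." plus two hex digits.
            pieces.append(f".{byte:02X}")
    result = "".join(pieces)
    # The pattern requires a leading letter, so prepend one if needed.
    if not re.match(r"[A-Za-z]", result):
        result = prefix + result
    return result


print(make_backward_compatible("1st section"))
```

Whatever the actual scheme, the point is that the same escaped value
satisfies the backward-compatible pattern no matter which attribute it
ends up in.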
In fact, this is what we currently try to do (see
Sanitizer::escapeId()). I've always considered the fact that we allow
section anchors that start with non-letters to be a bug. But I don't
think we gain anything in conformance from not adding ids to anchors
as well: the requirements for both are the same. Also, in section
4.10[5], the standard says:
"In order to ensure that XHTML 1.0 documents are well-structured XML
documents, XHTML 1.0 documents MUST use the id attribute when defining
fragment identifiers on the elements listed above."
According to this, we're non-conformant for *not* having an id
attribute on the <a> element. (This is in an informative section but
uses the word "MUST", so I have no idea what that's supposed to mean.)
But above all, Tidy has been adding the id attribute to <a> elements
on Wikipedia for years, and will keep doing so with or without this
change. If something is wrong with that, this reversion is not helping
anything.
The fix should be to change what we're outputting for anchors in the
first place. For the time being, we should keep the change so that
we're at least doing things consistently with or without Tidy. If
it's wrong, fix it for everyone, not just everyone not using Tidy.
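For clarity, this is roughly the output shape I'm arguing for: the
same value emitted in both attributes of the anchor. It's a sketch
only; anchor() is a hypothetical helper, not anything in our code, and
it assumes the value has already been escaped into a legal identifier.

```python
from html import escape


def anchor(fragment):
    """Render an empty anchor with identical name and id values.

    Hypothetical helper for illustration; assumes fragment is already
    a legal identifier and only guards against attribute injection.
    """
    value = escape(fragment, quote=True)
    return f'<a name="{value}" id="{value}"></a>'


print(anchor("History"))
```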
[1] http://www.w3.org/TR/xhtml1/dtds.html#a_dtd_XHTML-1.0-Transitional
[2] http://www.w3.org/TR/REC-xml/#NT-Nmtoken
[3] http://www.w3.org/TR/REC-xml/#NT-Name,
    http://www.w3.org/TR/REC-xml/#id
[4] http://www.w3.org/TR/xhtml1/#C_8
[5] http://www.w3.org/TR/xhtml1/#h-4.10