In XML, named entity references like &nbsp; and &middot; (with the special exceptions of &lt; &gt; &amp; &quot; &apos;) can be treated as well-formedness errors across the board by conformant XML processors. (Yes, this means that *any* XML document that uses *any* named entity reference except the special five is not well-formed, if you ask these XML processors.) Alternatively, if a DTD is provided, conformant XML processors can retrieve the DTD, parse it, and treat the reference as a well-formedness error if it doesn't occur in the DTD, otherwise parse it as you'd expect. (Yes, processors can really pick whichever behavior they want, as far as I understand it. As we all know, the great thing about standards is how many there are to choose from.)
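For a concrete illustration of the first behavior, here's a quick sketch using Python's expat-based parser as a stand-in for a conformant non-validating XML processor (the specific tool is just for demonstration, not anything we'd ship):

```python
import xml.etree.ElementTree as ET

# A named reference with no DTD in sight: a conformant processor
# is entitled to reject this as not well-formed.
try:
    ET.fromstring('<p>a&nbsp;b</p>')
    print("parsed")
except ET.ParseError as e:
    print("well-formedness error:", e)  # expat reports an undefined entity

# The equivalent numeric reference needs no DTD and always parses.
elem = ET.fromstring('<p>a&#160;b</p>')
print(repr(elem.text))  # 'a\xa0b' -- the literal non-breaking space
```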
In practice, as far as I can tell, the XML UAs that our users actually use do the latter, retrieving the DTD. (Otherwise they'd instantly break, and our users wouldn't use them!) Thus we get away with using &nbsp; and such, and still work in these UAs. But this means we have to provide a doctype with a DTD, which means we can't use just <!DOCTYPE html>. This is the default behavior on trunk -- we output an XHTML Strict DTD when the document is actually HTML5. This has a few disadvantages, in addition to just being odd:
1) Validators treat the content as XHTML Strict, not HTML5, so it fails validation unless you specifically ask for HTML5 validation. I've already seen a couple of complaints about this, and we haven't even released yet. Lots of people care about validation.
2) XML processors are still within their rights to reject the page, declining to process the DTD and treating the page as non-well-formed.
3) For XML processors that do process the DTD, we force them to do a network load as soon as they start parsing the page. Presumably this slows down parsing (dunno how much in practice), and it also hurts the W3C's poor servers: http://www.w3.org/blog/systeam/2008/02/08/w3c_s_excessive_dtd_traffic
The alternative is to simply not use any named character references -- replace them all by numeric ones. E.g., use &#160; instead of &nbsp;, and &#183; instead of &middot;. Then we can use <!DOCTYPE html> by default and avoid these problems. In fact, we already do this for anything that passes through the parser, as far as I can tell -- we convert it to UTF-8.
The problem is that if we do this and then miss a few entities somewhere in the source code, some pages will mysteriously become non-well-formed and tools will break. Plus, of course, you have the usual risks of breakage from mass changes. Overall, though, I'd prefer that we do this, because the alternative is that I'd have to pester the standards people and validator people for a means to let us validate properly with an XHTML Strict doctype.
Are there any objections to me removing all named entity references from MediaWiki output?