Some notes.
1. Things that looked like character entities were extracted from the English database dump (cur only).
2. A lot of them were malformed: they didn't end with ;, but with <, a space or something like that.
Here are the results from the Perl script:
These were considered "HTML entities":
  547567  &\S+?;  &#\d+;?  &#[xX][0-9a-fA-F]+;?
Other matches:
  574064  &\S+?;?
  516728  &\S+?;  &#\d+;  &#[xX][0-9a-fA-F]+;
  29547   23015
3. Of those 547567:
  823     hex refs (all correctly ended)
  134505  decimal refs (103661 ended correctly, only 77%)
  412239  other (incorrectly ended ones already excluded)
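The split in item 3 could be reproduced roughly like this. It is only a sketch in Python, not the Perl script that produced the numbers above; the patterns are adapted from the ones listed in item 2, and the function name `classify` and the counter keys are made up for illustration.

```python
import re
from collections import Counter

# One combined pattern, tried in this order: hex ref, decimal ref, anything else.
# For numeric refs the trailing ';' is optional, so unterminated ones are found too.
CANDIDATE = re.compile(
    r"&#[xX](?P<hex>[0-9a-fA-F]+)(?P<hex_end>;?)"
    r"|&#(?P<dec>\d+)(?P<dec_end>;?)"
    r"|&(?P<named>\S+?);"
)

def classify(text):
    """Count hex, decimal and other entity candidates found in text."""
    counts = Counter()
    for m in CANDIDATE.finditer(text):
        if m.group("hex") is not None:
            counts["hex"] += 1
            if m.group("hex_end"):
                counts["hex_terminated"] += 1
        elif m.group("dec") is not None:
            counts["dec"] += 1
            if m.group("dec_end"):
                counts["dec_terminated"] += 1
        else:
            # Everything else that ends with ';' (named entities, but also
            # false positives such as CGI parameters in URLs).
            counts["other"] += 1
    return counts

sample = "a &#x41; b &#65 c &#66; d &amp; e &lt<"
print(dict(classify(sample)))
```

Note that `&lt<` in the sample is not counted at all, because the "other" pattern requires a closing ;.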
4. After hex->dec conversion:
  32701  unique entities
  32415  unique numerical entities
  286    unique named entities (on a second look, a lot of these seem to be things like CGI parts of URLs etc. and not real HTML entities)
5. 30 most popular named entities:
  294565  sup2
  48034   deg
  23015   nbsp
  9558    middot
  2449    eacute
  2092    amp
  1738    gt
  1480    lt
  1474    times
  1417    radic
  1273    quot
  1210    mdash
  909     ouml
  759     uuml
  743     alpha
  708     rarr
  651     lambda
  629     aacute
  615     pi
  612     phi
  511     epsilon
  510     mu
  508     egrave
  455     iacute
  409     ndash
  396     oacute
  395     omega
  390     le
  368     gamma
  357     sigma
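The hex->dec conversion and the counting in items 4 and 5 can be sketched as follows. This is an illustration in Python under stated assumptions, not the original script: it only looks at correctly ended references, it restricts names to ASCII letters and digits, and the helper names `normalize`, `count_entities` and `top_named` are hypothetical.

```python
import re
from collections import Counter

# Correctly ended references only: hex, decimal, or an alphanumeric name.
REF = re.compile(r"&(#[xX][0-9a-fA-F]+|#[0-9]+|[A-Za-z][A-Za-z0-9]*);")

def normalize(body):
    """Map hex and decimal refs to one decimal key; leave names unchanged."""
    if body[:2].lower() == "#x":
        return "#" + str(int(body[2:], 16))
    if body.startswith("#"):
        return "#" + str(int(body[1:]))  # also drops leading zeros
    return body

def count_entities(text):
    """Count occurrences of each normalized entity in text."""
    return Counter(normalize(m.group(1)) for m in REF.finditer(text))

def top_named(counts, n):
    """Return the n most frequent named (non-numeric) entities."""
    named = Counter({k: v for k, v in counts.items() if not k.startswith("#")})
    return named.most_common(n)

counts = count_entities("&#x41; &#65; &#065; &deg; &deg; &sup2;")
print(len(counts))           # number of unique entities
print(top_named(counts, 2))  # most popular named entities
```

Here `&#x41;`, `&#65;` and `&#065;` all collapse into the single key `#65`, which is why the conversion has to happen before counting unique entities.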
6. XHTML absolutely can't contain malformed &-entities. A page with one is not well-formed XML, and a browser that parses it as XML won't even display it. So if we want to move to XHTML and have goodies like MathML, we must make the Wikipedia parser understand them.
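The strictness in item 6 is easy to see with any XML parser. Below is a small sketch using Python's standard `xml.etree.ElementTree`; the helper name `is_well_formed` is made up for illustration. Note that this parser loads no DTD, so it also rejects named entities other than the five predefined by XML (amp, lt, gt, quot, apos), which is stricter than an XHTML browser that knows the XHTML entity set.

```python
import xml.etree.ElementTree as ET

def is_well_formed(fragment):
    """Return True if fragment parses as XML, False on any parse error."""
    try:
        ET.fromstring(fragment)
        return True
    except ET.ParseError:
        return False

print(is_well_formed("<p>AT&amp;T 25&#176;C</p>"))  # correctly ended references
print(is_well_formed("<p>AT&T</p>"))                # bare ampersand
print(is_well_formed("<p>x &#176 y</p>"))           # decimal ref without ';'
```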
7. Search would benefit a lot from replacing HTML entities with the proper characters in the text used for searching.
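As a rough idea of what item 7 means, here is a sketch in Python using the standard `html.unescape` function. It is not how Wikipedia's search works; the function name `to_search_text` and the decision to turn non-breaking spaces into ordinary spaces are assumptions made for the example.

```python
import html

def to_search_text(wikitext):
    """Replace HTML entities with real characters for indexing."""
    text = html.unescape(wikitext)
    # Assumption: treat a non-breaking space like an ordinary space when searching.
    return text.replace("\u00a0", " ")

print(to_search_text("20&deg;C&nbsp;per m&sup2;, &#x41;&#66;"))
```

With this, a search for "m²" or "20°C" can match text that was typed as `m&sup2;` or `20&deg;C`.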
8. If we want to add an option for generating PNGs of non-Latin characters, then we must parse them.
9. We may want small inline images, like those on Sensei's Library that are generated by W1, B3 etc. Of course we can't turn every W1 into an image, but using &W1; would do fine. They will be needed if we ever add support for game diagrams. Using a lot of [[Image:w1.png]] just doesn't seem right.
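The &W1; idea from item 9 could look something like the sketch below. Everything in it is hypothetical: the function name `expand_stones`, the rule that W and B followed by a number are the only stone entities, and the image naming scheme (w1.png is taken from the note above, b3.png is assumed by analogy).

```python
import re

# Hypothetical stone entities: &W<number>; for white, &B<number>; for black.
STONE = re.compile(r"&([WB])([1-9][0-9]*);")

def expand_stones(wikitext):
    """Replace stone entities with image links; leave other entities alone."""
    def to_image(match):
        colour, number = match.group(1).lower(), match.group(2)
        return f"[[Image:{colour}{number}.png]]"
    return STONE.sub(to_image, wikitext)

print(expand_stones("White plays &W1;, black answers &B3; &amp; so on. W1 stays text."))
```

A plain W1 in the text is left untouched, which is the point of using the entity syntax.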
10. Summary note: We need to make the parser understand &-entities. It's impossible to ignore the problem for much longer.
If you create a new parser for Wikipedia, please consider this issue.