No subject

Fri Aug 17 21:06:45 UTC 2012

things that looked like character entities were extracted.

2.
Many of them were malformed: they ended not with ';' but with '<', a space,
or some other character.

Here are the results from a Perl script:

These were considered "html entities":
547567  &\S+?; &#\d+;? &#[xX][0-9a-fA-F]+;?

Other matches:
574064  &\S+?;?
516728  &\S+?; &#\d+; &#[xX][0-9a-fA-F]+;
 29547  &nbsp
 23015  &nbsp;

3.
Of those 547567:
       823	hex refs (all correctly ended)
    134505	decimal refs (103661 ended correctly - only 77%)
    412239	other (incorrectly ended already excluded)
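
The breakdown above can be approximated with a short script. This is only a
sketch in Python, not the Perl script that produced the numbers: the pattern
is stricter than the \S+? patterns listed in point 2, and the names CANDIDATE
and classify are made up for illustration.

```python
import re
from collections import Counter

# Anything that starts with '&' and looks like a reference, with or without
# the closing ';'. Assumption: named references are plain alphanumerics.
CANDIDATE = re.compile(r"&(#[xX][0-9a-fA-F]+|#\d+|[A-Za-z][A-Za-z0-9]*)(;?)")

def classify(text):
    """Count (kind, terminated) pairs for entity-like strings in text."""
    counts = Counter()
    for body, semicolon in CANDIDATE.findall(text):
        if body[:2].lower() == "#x":
            kind = "hex"
        elif body.startswith("#"):
            kind = "decimal"
        else:
            kind = "named"
        counts[(kind, bool(semicolon))] += 1
    return counts

sample = "x&sup2; &#176;C &#176 C &#xB0; a&nbsp<b>"
for (kind, terminated), n in sorted(classify(sample).items()):
    print(kind, "ended with ;" if terminated else "not ended with ;", n)
```

The share of correctly ended references in each class is then the count for
(kind, True) divided by the total for that kind.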

4.
After hex->dec conversion:
    32701	unique entities
    32415	unique numerical entities
      286	unique named entities	(on a second look, many of these seem
    					 to be things like the CGI part of URLs
					 rather than real HTML entities)
5.
30 most popular named entities:
 294565	sup2
  48034	deg
  23015	nbsp
   9558	middot
   2449	eacute
   2092	amp
   1738	gt
   1480	lt
   1474	times
   1417	radic
   1273	quot
   1210	mdash
    909	ouml
    759	uuml
    743	alpha
    708	rarr
    651	lambda
    629	aacute
    615	pi
    612	phi
    511	epsilon
    510	mu
    508	egrave
    455	iacute
    409	ndash
    396	oacute
    395	omega
    390	le
    368	gamma
    357	sigma
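
The counts in points 4 and 5 could be reproduced roughly as follows. Again
this is a Python sketch under assumptions, not the original script: it only
looks at references that end with ';', and REFERENCE, normalize and tally are
hypothetical names.

```python
import re
from collections import Counter

# Only correctly ended references are counted here.
REFERENCE = re.compile(r"&(#[xX][0-9a-fA-F]+|#\d+|[A-Za-z][A-Za-z0-9]*);")

def normalize(body):
    """Rewrite numeric references in one canonical decimal form."""
    if body[:2].lower() == "#x":
        return "#%d" % int(body[2:], 16)   # hex -> dec, e.g. '#xB0' -> '#176'
    if body.startswith("#"):
        return "#%d" % int(body[1:])       # drops leading zeros
    return body

def tally(text):
    """Return (all references, named references only) as Counters."""
    counts = Counter(normalize(b) for b in REFERENCE.findall(text))
    named = Counter({k: v for k, v in counts.items() if not k.startswith("#")})
    return counts, named

sample = "m&sup2; km&sup2; 20&deg;C 20&#xB0;C 20&#176;C"
counts, named = tally(sample)
print(len(counts))            # number of unique entities
print(named.most_common(30))  # most popular named entities
```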

6.
XHTML cannot contain malformed &-entities at all: a page that has one is not
well-formed and won't even display.
So if we want to move to XHTML and have goodies like MathML,
we must make the Wikipedia parser understand them.
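
To illustrate how strict this is, here is a small check. It uses Python's
generic XML parser as a stand-in for a browser's XHTML parser, which is an
assumption; is_well_formed is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

def is_well_formed(fragment):
    """Return True if the fragment parses as XML, False otherwise."""
    try:
        ET.fromstring(fragment)
        return True
    except ET.ParseError:
        return False

print(is_well_formed("<p>x &lt; y</p>"))  # reference ends with ';'
print(is_well_formed("<p>x &lt y</p>"))   # reference ends with a space
```

The first fragment parses; the second is rejected outright, where an HTML
browser would typically still render it.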

7.
Search would benefit greatly from replacing HTML entities with the actual
characters in the text that is used for searching.
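
A minimal sketch of that replacement, using Python's standard html module
rather than anything in the Wikipedia code; searchable is a hypothetical name.

```python
import html

def searchable(text):
    """Replace character references with the characters they stand for."""
    return html.unescape(text)

print(searchable("25&deg;C, 10 m&sup2;, caf&eacute;, caf&#233;, caf&#xE9;"))
```

After this step the named, decimal and hex spellings of the same character
all become identical text, so one search term matches all of them.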

8.
If we want to add an option for generating PNGs of non-Latin characters,
then we must parse the entities.

9.
We may want small inline images, like those that Sensei's Library generates
from W1, B3, etc. Of course we can't turn every W1 into an image,
but using &W1; would do fine. They will be needed if we ever add support
for game diagrams; using lots of [[Image:w1.png]] just doesn't seem right.
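
Something like the following would be enough for the &W1; idea. It is a
sketch only: the [WB] plus number syntax and the lowercase file naming are
assumptions based on the examples above, and expand_stones is a made-up name.

```python
import re

# '&W1;' or '&B3;': colour letter followed by a move number.
STONE = re.compile(r"&([WB])(\d+);")

def expand_stones(wikitext):
    """Turn '&W1;' into '[[Image:w1.png]]' (hypothetical file naming)."""
    return STONE.sub(
        lambda m: "[[Image:%s%s.png]]" % (m.group(1).lower(), m.group(2)),
        wikitext,
    )

print(expand_stones("After &W1; black answers at &B2;, but plain W1 is left alone."))
```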

10. Summary note:
We need to make the parser understand &-entities.
It's impossible to ignore the problem for much longer.

If you create a new parser for Wikipedia, please consider this issue.