[Pywikipedia-l] Recursive HTML entities - pywikibot

12 Mar 2015


I've been working on html2unicode over the last few days and stumbled
upon the fact that an &amp; also works as a normal ampersand there, so
that &amp;amp;, for example, gets converted into &. The commit which
introduced this into core (fc61025 [1]) is not really descriptive, so I
searched compat's code and found the corresponding commit f97dfb0
[2].
That commit links to the discussion on @xqt's talk page [3], which
doesn't really explain what is happening there. The API never returns
HTML entities unless they are part of a page's content. I've been
testing such links [4]: [[&amp;]] does work, but [[&amp;amp;]] does not.
Likewise, the entity &nbsp; gets handled properly, but [[&amp;nbsp;]] is
also only decoded once.
My question is why this is necessary, especially in core, which only
does API requests and so shouldn't suffer from such a problem; it could
probably be changed. The only reason I can see is that something is
handling text improperly and converts &nbsp; into &amp;nbsp;, which
shouldn't be our concern.
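
To make the difference between decoding once and decoding recursively concrete, here is a minimal sketch. It assumes Python 3 and uses the standard library's html.unescape rather than pywikibot's html2unicode, and the two helper names are made up for illustration only:

```python
import html


def unescape_once(text):
    """Decode HTML entities a single time (hypothetical helper)."""
    return html.unescape(text)


def unescape_recursive(text):
    """Decode repeatedly until the text stops changing (hypothetical helper)."""
    previous = None
    while previous != text:
        previous = text
        text = html.unescape(text)
    return text


print(unescape_once('&amp;amp;'))               # &amp;
print(unescape_recursive('&amp;amp;'))          # &
print(unescape_once('&amp;nbsp;'))              # &nbsp;
print(repr(unescape_recursive('&amp;nbsp;')))   # '\xa0'
```

The single-pass version matches what I see for the wiki links, while the recursive version mirrors the behaviour I am asking about.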
Fabian
[1]: https://github.com/wikimedia/pywikibot-core/commit/fc6102527e4c556cd77aa8773...
[2]: https://git.wikimedia.org/blobdiff/pywikibot%2Fcompat.git/f97dfb0d1ca49751cc...
[3]: https://de.wikipedia.org/w/index.php?title=Benutzer_Diskussion%3AXqt&act...
[4]: https://en.wikipedia.org/wiki/User:XZise/linktest