I've been working on html2unicode over the last few days and stumbled upon the fact that &amp; is also accepted in place of a plain ampersand, so that &amp;amp;, for example, gets converted into &. The commit which introduced this into core (fc61025 [1]) is not really descriptive, so I searched compat's code and found the corresponding commit f97dfb0 [2].
It links to the discussion on @xqt's talk page [3], which doesn't really explain what is happening. The API never returns HTML entities unless they are part of a page's content. I've been testing such links [4]: [[&amp;]] does work, but [[&amp;amp;]] does not. The entity &nbsp; is also handled properly, but in [[&amp;nbsp;]] it is likewise decoded only once.
My question is why this is necessary, especially in core, which only does API requests and shouldn't suffer from such a problem, so it could probably be changed. The only reason I can see is that something is processing text improperly and converts &nbsp; into &amp;nbsp;, which shouldn't be our concern.
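To make the difference concrete, here is a minimal sketch of the two behaviours. It is not the actual html2unicode code; the pattern and the names decode_lenient and decode_strict are made up for illustration, and only the standard library is used.

```python
import html
import re

# Hypothetical pattern, NOT pywikibot's actual one: an entity whose leading
# "&" may itself have been escaped as "&amp;".
ENTITY = re.compile(r'&(?:amp;)?(#\d+|#[xX][0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]*);')


def decode_lenient(text):
    """Decode entities, treating a leading '&amp;' like a plain '&'."""
    def repl(match):
        entity = '&' + match.group(1) + ';'
        decoded = html.unescape(entity)
        # Leave the input untouched if the entity name is unknown.
        return match.group(0) if decoded == entity else decoded
    return ENTITY.sub(repl, text)


def decode_strict(text):
    """Decode exactly one level of escaping."""
    return html.unescape(text)


for sample in ('&amp;', '&amp;amp;', '&amp;nbsp;', '&amp;#228;'):
    print(repr(sample), repr(decode_lenient(sample)), repr(decode_strict(sample)))
```

With the lenient variant a double-escaped entity collapses in one step, while the strict variant only removes one level, which is what I would expect to be enough for API results.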
Fabian
[1]: https://github.com/wikimedia/pywikibot-core/commit/fc6102527e4c556cd77aa8773...
[2]: https://git.wikimedia.org/blobdiff/pywikibot%2Fcompat.git/f97dfb0d1ca49751cc...
[3]: https://de.wikipedia.org/w/index.php?title=Benutzer_Diskussion%3AXqt&act...
[4]: https://en.wikipedia.org/wiki/User:XZise/linktest
Hi,
I think the problem is that langlinks did/can/could use &#xxxx;
https://no.wikipedia.org/w/index.php?title=Wikipedia:Om&diff=9814748&...
However, those &'s don't appear to be in the current API langlinks results for the old revision.
https://no.wikipedia.org/w/api.php?action=query&prop=langlinks&revid...
But I wouldn't be surprised if the MW 1.18 API literally emitted the langlinks unparsed, just as the MW API currently emits #redirect targets unparsed.
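One way to check this against a given API result is to scan the langlink titles for leftover numeric character references. The sketch below assumes the classic JSON layout of action=query&prop=langlinks (pages keyed by page id, the title under the "*" key); the function name and the sample data are invented for illustration and are not real API output.

```python
import re

# Matches numeric character references such as &#228; or &#xE4;
NUMERIC_REF = re.compile(r'&#(?:\d+|[xX][0-9a-fA-F]+);')


def undecoded_langlinks(response):
    """Return (lang, title) pairs whose title still contains &#...; references."""
    found = []
    for page in response.get('query', {}).get('pages', {}).values():
        for link in page.get('langlinks', []):
            title = link.get('*', '')
            if NUMERIC_REF.search(title):
                found.append((link.get('lang'), title))
    return found


# Invented sample data, only shaped like a langlinks result.
sample = {
    'query': {
        'pages': {
            '1': {
                'title': 'Example',
                'langlinks': [
                    {'lang': 'de', '*': 'Beispiel'},
                    {'lang': 'sv', '*': 'Exempel &#228;'},
                ],
            },
        },
    },
}

print(undecoded_langlinks(sample))
```

If this comes back empty for current results, the lenient decoding would only matter for older MediaWiki versions.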
On Fri, Mar 13, 2015 at 7:42 AM, Fabian Neundorf CommodoreFabianus@gmx.de wrote:
Pywikipedia-l mailing list Pywikipedia-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/pywikipedia-l