Dear all,
a colleague of mine described the following issue at https://www.mediawiki.org/wiki/Extension_talk:External_Data#Problems_with_sp... and I was able to find the root cause.
When retrieving XML (I did not check other formats) and the actual value contains non-ASCII characters, the XML parser may pass the value to the character data handler in several pieces rather than in one call (see https://secure.php.net/manual/en/function.xml-set-character-data-handler.php). For example, if the value is "grün" or "journée", the data handler is called twice for each of these values (1. "gr", 2. "ün", or 1. "journ", 2. "ée", respectively).
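To make the chunking visible, here is a minimal standalone sketch (not code from the extension) that records every call to the character data handler. The variable $chunks is just an illustrative name. Where exactly the text is split may depend on the underlying parser library, so I am not stating an expected split here; the only thing that is guaranteed is that the pieces, joined together, give the full value.

```php
<?php
// Collect every piece of text the parser hands to the character data handler.
$chunks = [];

$parser = xml_parser_create( 'UTF-8' );
xml_set_character_data_handler(
	$parser,
	function ( $parser, $data ) use ( &$chunks ) {
		// One array entry per handler call, so a split value shows up as several entries.
		$chunks[] = $data;
	}
);
xml_parse( $parser, '<color>grün</color>', true );

// Shows how many pieces the value arrived in.
var_dump( $chunks );
// The concatenation of all pieces is always the complete value.
var_dump( implode( '', $chunks ) === 'grün' );
```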
The data handler in get_web_data for XML is ED_Utils::getContent (https://github.com/wikimedia/mediawiki-extensions-ExternalData/blob/master/ED_Utils.php). The current implementation of getContent adds a new element to the $edgXMLValues[$edgCurrentXMLTag] array every time it is called, so it creates a new element for each piece.
My understanding is that only multiple values, i.e. repeated elements with the same tag, should end up in different buckets:
<colors>
  <color>blau</color>
  <color>grün</color>
  <color>rot</color>
</colors>
should end up as
$edgXMLValues['color'][0] = 'blau'
$edgXMLValues['color'][1] = 'grün'
$edgXMLValues['color'][2] = 'rot'
and
<greetings>
  <greeting>Bonne journée</greeting>
  <greeting>Bonne soirée</greeting>
</greetings>
should end up as
$edgXMLValues['greeting'][0] = 'Bonne journée'
$edgXMLValues['greeting'][1] = 'Bonne soirée'
However, the current implementation returns:
$edgXMLValues['color'][0] = 'blau'
$edgXMLValues['color'][1] = 'gr'
$edgXMLValues['color'][2] = 'ün'
$edgXMLValues['color'][3] = 'rot'

$edgXMLValues['greeting'][0] = 'Bonne journ'
$edgXMLValues['greeting'][1] = 'ée'
$edgXMLValues['greeting'][2] = 'Bonne soir'
$edgXMLValues['greeting'][3] = 'ée'
IMHO, getContent should check whether it is being called again for the very same XML element and, if so, append the content to the last element's value instead of adding a new array element.
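A minimal sketch of what I mean, under stated assumptions: the class ChunkSafeXMLCollector and its properties are hypothetical names for illustration, not the actual ED_Utils code, which works with the globals mentioned above. The idea is that the start-element handler sets a flag, the first piece of text opens a new bucket, and every further piece for the same element is appended to that bucket.

```php
<?php
// Hypothetical stand-in for the XML handlers; all names here are illustrative only.
class ChunkSafeXMLCollector {
	/** @var array Values grouped by lower-cased tag name */
	public $values = [];
	/** @var string|null Tag of the element currently open, null between elements */
	private $currentTag = null;
	/** @var bool True until the first piece of text for the current element arrives */
	private $startOfElement = false;

	public function startElement( $parser, $name, $attrs ) {
		// PHP's parser folds tag names to upper case by default.
		$this->currentTag = strtolower( $name );
		$this->startOfElement = true;
	}

	public function endElement( $parser, $name ) {
		// Text after a closing tag (e.g. whitespace between siblings) is ignored.
		$this->currentTag = null;
	}

	public function getContent( $parser, $content ) {
		if ( $this->currentTag === null ) {
			return;
		}
		$tag = $this->currentTag;
		if ( $this->startOfElement ) {
			// First piece for this element: open a new bucket.
			$this->values[$tag][] = $content;
			$this->startOfElement = false;
		} else {
			// Further piece for the same element: append to the last bucket.
			$last = count( $this->values[$tag] ) - 1;
			$this->values[$tag][$last] .= $content;
		}
	}
}

$collector = new ChunkSafeXMLCollector();
$parser = xml_parser_create( 'UTF-8' );
xml_set_element_handler( $parser, [ $collector, 'startElement' ], [ $collector, 'endElement' ] );
xml_set_character_data_handler( $parser, [ $collector, 'getContent' ] );
xml_parse( $parser, '<colors> <color>blau</color> <color>grün</color> <color>rot</color> </colors>', true );

// One entry per <color> element, no matter how many pieces each value arrived in.
print_r( $collector->values['color'] );
```

Note that this sketch also records the whitespace directly after <colors> under 'colors' and drops text that follows a child element; the real handler would have to decide how to treat such cases.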
Best regards
Christian
mediawiki-l@lists.wikimedia.org