On 2 May 2014 08:00, Brian Wolff bawolff@gmail.com wrote: ...
The encoding should be utf8 with NFC, but even if it's not quite correct gw/mw should convert it.
...
SUMMARY
Not a bug. Most headaches with the NYPL upload seem to be due to the source data.
ANALYSIS
Based on more tranches of uploads from NYPL and several days of playing with format problems, I believe this was a source site problem rather than a bug with GWT. Here's what I know:
* My xml is created from Python, correctly encoded using UTF-8.
* Most of the metadata is scraped from webpages declared as "iso-8859-1", i.e. the latin-1 charset. The API (key needed) is limited; I only use it to find a direct link to the full tiff, which is mysteriously avoided in the fully open website.
* Characters like "e acute" may appear both correctly and incorrectly in the resulting xml. The incorrect forms exist in at least two variations for this character. However, my use of encode('Windows-1252') on source fields may be part of this; I'm using it based on earlier trial and error with multiple methods. The number of records with errors is now relatively small.
* Reading the file as UTF-8 in JEdit may include hidden characters, making hand-correction tricky.
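To illustrate the kind of corruption described above, here is a minimal Python 3 sketch of how an "e acute" can end up garbled when UTF-8 bytes are read as a single-byte charset, and how one round of that damage can sometimes be undone. The function names are hypothetical, and this is not the script used for the NYPL uploads; it assumes the damage is a single UTF-8-read-as-cp1252 (or latin-1) round trip, which may not hold for every record.

```python
def misread_utf8_as(text, wrong_charset="cp1252"):
    """Show how text looks when its UTF-8 bytes are decoded with the wrong charset."""
    return text.encode("utf-8").decode(wrong_charset, errors="replace")


def try_repair(text, wrong_charset="cp1252"):
    """Try to undo one round of UTF-8-read-as-wrong_charset damage.

    Returns the input unchanged when the round trip is not possible,
    which is what happens for text that was already correct.
    """
    try:
        return text.encode(wrong_charset).decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


if __name__ == "__main__":
    good = "\u00e9cole"                    # "école" with a composed e acute
    bad = misread_utf8_as(good)            # two junk characters replace the e acute
    print(repr(bad), "->", repr(try_repair(bad)))
    print(repr(good), "->", repr(try_repair(good)))  # left untouched
```

Because a page may mix correct and damaged fields, a repair like this is safer applied field by field, with the result checked by eye, than applied blindly to a whole file.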
These behaviours indicate to me that the NYPL source site has inconsistently encoded its web pages. If I decode or encode, I get unpredictable results. At the moment, my shortest workflow seems to be to let Python create the xml as a UTF-8 file without any further character encoding and then fix any oddities by hand.
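Before fixing oddities by hand, it can help to list where the hidden characters are. The following is a rough Python 3 sketch using only the standard `unicodedata` module; the function names are hypothetical, and the choice of Unicode categories to flag (control and format characters) is an assumption about what "hidden" means here. It also shows NFC normalization, as mentioned in the quoted message.

```python
import unicodedata


def find_hidden_characters(text):
    """Return (index, code point, name) for control and format characters.

    Ordinary tab, newline and carriage return are ignored.
    """
    hits = []
    for index, char in enumerate(text):
        category = unicodedata.category(char)
        if category in ("Cc", "Cf") and char not in "\t\n\r":
            name = unicodedata.name(char, "<unnamed>")
            hits.append((index, "U+%04X" % ord(char), name))
    return hits


def to_nfc(text):
    """Normalize text to NFC so composed and decomposed accents compare equal."""
    return unicodedata.normalize("NFC", text)


if __name__ == "__main__":
    sample = "Caf\u200be\u0301"  # zero width space, then e + combining acute
    for hit in find_hidden_characters(sample):
        print(hit)
    print(repr(to_nfc("e\u0301")))
```

Running this over each metadata field before writing the xml would give a short list of positions to inspect, rather than hunting for invisible characters in an editor.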
As a "more advanced" user of GWT, the fact that I am spending so much time on xml preparation should be of concern for how much our user guides for the tool say about xml checking, testing and preparation. I have little doubt that similar encoding problems (especially for multi-language or ancient texts) will continue to dog some of our users. It would be worth the user community expanding the manual help pages to cover other xml validation issues and on-wiki conventions with regard to html encoding, hidden text etc., possibly growing the trouble-shooting guide into a separate document.
PS with regard to Dan's timeout point, I am unsure how much this is affecting my uploads. I did have to compile an xml file of re-uploads after many were mysteriously skipped, but the second time around they appear to have all been uploaded, so this performance-related issue might be entirely dependent on the stress WMF's servers happen to be under. My most significant reason for skipped images is missing NYPL catalogue pages; however, these are discovered when I create the xml file. This is something I am deferring until after uploads are complete. In practice I might not spend time investigating it, as the NYPL is moving to a new website scheme, so it might make sense to park new uploads for a year and let their system stabilize.
Links
* Example NYPL source file with non-ascii characters: http://digitalgallery.nypl.org/nypldigital/dgkeysearchdetail.cfm?imageID=126...
* API: http://api.repo.nypl.org/
Fae