On 2 May 2014 08:00, Brian Wolff <bawolff(a)gmail.com> wrote:
...
The encoding should be utf8 with NFC, but even if its
not quite correct
gw/mw should convert it.
...
SUMMARY
Not a bug. Most headaches with the NYPL upload seem to be due to the
source data.
ANALYSIS
Based on more tranches of uploads from NYPL and several days of
playing with format problems, I believe this was a source site problem
rather than a bug with GWT. Here's what I know so far:
* My xml is created from Python correctly encoded using UTF-8.
* Most of the metadata is scraped from webpages declared as
"iso-8859-1", i.e. the latin-1 charset. The API (which needs a key) is
limited; I only use it to find a direct link to the full tiff, which
is mysteriously absent from the fully open website.
* Characters like "e acute" may come out both correctly and
incorrectly in the resulting xml. The incorrect forms exist in at
least two variations for this character. My use of
encode('Windows-1252') on source fields may be part of this; I'm using
it based on earlier trial and error with multiple methods. The number
of records with errors is now relatively small.
* When read as UTF-8 in jEdit, the file may include hidden characters,
making hand-correction tricky.
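
As a rough illustration of how a single "e acute" can end up in more
than one incorrect form, here is a minimal sketch. It assumes Python 3
and is not the script used for these uploads; the variable names are
made up for the example.

```python
# Illustration only: how one "e acute" (U+00E9) can surface in several forms.
E_ACUTE = "\u00e9"

utf8_bytes = E_ACUTE.encode("utf-8")         # two bytes: C3 A9
latin1_bytes = E_ACUTE.encode("iso-8859-1")  # one byte: E9

# Variation 1: UTF-8 bytes misread as Windows-1252 give two characters
# (A-tilde followed by the copyright sign).
mojibake_once = utf8_bytes.decode("windows-1252")

# Variation 2: repeating the same mistake on already-damaged text gives four.
mojibake_twice = mojibake_once.encode("utf-8").decode("windows-1252")

# Variation 3: a latin-1 byte misread as UTF-8 is not valid UTF-8 at all,
# so a lenient decoder substitutes the replacement character U+FFFD.
latin1_as_utf8 = latin1_bytes.decode("utf-8", errors="replace")

for label, value in [
    ("correct", E_ACUTE),
    ("utf-8 read as windows-1252", mojibake_once),
    ("double mis-decode", mojibake_twice),
    ("latin-1 read as utf-8", latin1_as_utf8),
]:
    # ascii() shows the exact code points, including ones an editor may hide.
    print(label, ascii(value))
```

If a source site mixes pages that are really latin-1 with pages that
are really UTF-8 under one declared charset, any single fixed
decode/encode step will repair some records and damage others.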
These behaviours indicate to me that the NYPL source site has
inconsistently encoded its web pages. If I decode or encode I get
unpredictable results. At the moment, my shortest workflow seems to be
to let Python create the xml as a UTF-8 file without any further
re-encoding and then fix any oddities by hand.
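
To make the hand-fixing step less dependent on spotting hidden
characters in an editor, something like the following could list the
suspect positions first. This is a sketch under assumptions: Python 3,
a hypothetical helper name find_suspects, and a heuristic pattern that
will miss some damage and may occasionally flag legitimate text.

```python
import re
import unicodedata

# Heuristic: A-circumflex or A-tilde followed by a character in
# U+0080..U+00BF is the usual trace of UTF-8 bytes decoded as latin-1.
MOJIBAKE = re.compile("[\u00c2\u00c3][\u0080-\u00bf]")


def find_suspects(text):
    """Return (line, column, kind, chars) tuples for characters worth checking."""
    hits = []
    # Split on "\n" only, so unusual line-separator characters are not
    # swallowed by the splitting itself.
    for lineno, line in enumerate(text.split("\n"), start=1):
        for match in MOJIBAKE.finditer(line):
            hits.append((lineno, match.start() + 1, "mojibake", match.group()))
        for col, ch in enumerate(line, start=1):
            if ch == "\ufffd":
                hits.append((lineno, col, "replacement", ch))
            elif unicodedata.category(ch) in ("Cc", "Cf") and ch not in "\t\r":
                # Control and format characters are usually invisible in editors.
                hits.append((lineno, col, "hidden", ch))
    return hits


sample = (
    "<title>Caf\u00c3\u00a9</title>\n"
    "<desc>ok\u200b</desc>\n"
    "<x>caf\u00e9</x>"
)
for lineno, col, kind, chars in find_suspects(sample):
    print(lineno, col, kind, ascii(chars))
```

The third sample line holds a correctly encoded "e acute" and is not
reported; only the damaged and hidden characters are.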
I am a "more advanced" user of GWT, so the fact that I am spending so
much time on xml preparation should prompt us to look at how much our
user guides for the tool say about xml checking, testing and
preparation. I have little doubt that similar encoding problems
(especially for multilanguage or ancient texts) will continue to dog
some of our users. Other xml validation issues and on-wiki conventions
with regard to html encoding, hidden text etc. would be worth the user
community expanding on in the manual help pages, possibly growing the
trouble-shooting guide into a separate document.
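
As one example of the kind of pre-upload check a guide could describe,
here is a minimal sketch using only the Python 3 standard library. The
helper name check_well_formed and the element names are hypothetical,
and this only tests that the file parses; it says nothing about what
GWT itself checks or requires.

```python
import xml.etree.ElementTree as ET


def check_well_formed(xml_bytes):
    """Return None if the bytes parse as xml, otherwise the parser's message."""
    try:
        ET.fromstring(xml_bytes)
    except ET.ParseError as err:
        return str(err)
    return None


record = (
    '<?xml version="1.0" encoding="UTF-8"?>'
    "<records><record><title>Caf\u00e9</title></record></records>"
)

good = record.encode("utf-8")
# Same text, but written out as latin-1 while still declaring UTF-8.
bad = record.encode("iso-8859-1")

print(check_well_formed(good))
print(check_well_formed(bad))
```

A check like this catches bytes that contradict the declared encoding,
but not text that was mis-decoded earlier and then written out as valid
UTF-8, so it complements rather than replaces reading the records.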
PS with regard to Dan's timeout point, I am unsure how much this is
affecting my uploads. I did have to compile an xml file of re-uploads
after many were mysteriously skipped, but the second time around they
appear to have all been uploaded, so this performance-related issue
might be entirely dependent on the stress WMF's servers happen to be
under. The most significant cause of skipped images is missing NYPL
catalogue pages; however, these are discovered when I
create the xml file. This is something I am deferring until after
uploads are complete; in practice I might not spend time investigating
it as the NYPL is moving to a new website scheme, so it might make
sense to park new uploads for a year and let their system stabilize.
Links
* Example NYPL source file with non-ascii characters:
http://digitalgallery.nypl.org/nypldigital/dgkeysearchdetail.cfm?imageID=12…
* API
http://api.repo.nypl.org/
Fae
--
faewik(a)gmail.com
https://commons.wikimedia.org/wiki/User:Fae