Does anyone have a working definition of characters allowed in the filenames that I could apply in my pre-processing of the xml files? See [1] and [2] for the technical standards that apply by default.
In my NYPL uploads I have found that characters in the chosen filename like:
* ö (o umlaut)
* Æ (upper case ash / ae ligature)
caused GWT to halt the upload at that point (no warning back to me). These characters should be acceptable to the MediaWiki software.
These characters seem to be okay in the image page body, just not the filename. Other characters like é (e acute) appear to process fine. For the 18th century and earlier maps from the NYPL, this is a major time-sink. :-(
Links
1. https://commons.wikimedia.org/wiki/MediaWiki:Filename-prefix-blacklist
2. https://commons.wikimedia.org/wiki/MediaWiki:Titleblacklist
Fae
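A minimal pre-processing sketch for the question above, in Python. The character set is an assumption: it covers characters that MediaWiki titles generally cannot contain, plus the path separators and colon that tend to be rewritten in file names. It does not reproduce the regex rules in the two blacklist pages linked above, and the function names are hypothetical. Non-ASCII letters such as ö, Æ and é are deliberately left untouched.

```python
import re

# Assumption: characters that are typically rejected or rewritten in
# MediaWiki file titles. This is NOT the Commons title blacklist.
ILLEGAL_TITLE_CHARS = re.compile(r'[#<>\[\]|{}:/\\]')


def illegal_characters(title):
    """List the assumed-illegal characters found in a title, in order."""
    return ILLEGAL_TITLE_CHARS.findall(title)


def clean_filename(title, replacement="-"):
    """Replace assumed-illegal characters and tidy up whitespace.

    Accented and ligature characters are kept as they are.
    """
    cleaned = ILLEGAL_TITLE_CHARS.sub(replacement, title)
    # Collapse runs of whitespace into single spaces.
    return re.sub(r"\s+", " ", cleaned).strip()
```

Running `illegal_characters()` over every `<filename>` value before an upload gives a quick report of titles that need attention, without altering anything.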
Fæ, 01/05/2014 14:10:
In my NYPL uploads I have found that characters in the chosen filename like:
- ö (o umlaut)
- Æ (upper case ash / ae ligature)
caused GWT to halt the upload at that point (no warning back to me). These characters should be acceptable to the MediaWiki software.
Sounds like a major bug, is it filed?
Nemo
On 1 May 2014 13:21, Federico Leva (Nemo) nemowiki@gmail.com wrote:
Sounds like a major bug, is it filed?
Not on bugzilla yet. I'd like to know if others have a similar experience of GWT halting mid-run for these sorts of reasons.
Unfortunately, with no error reporting back from partial runs, diagnosis becomes quite hard, even working out which file your run failed on! Charsets are a tricky thing to debug, as this sort of error can boil down to the local file encoding. My xml should be 'utf-8', I am working on an OSX system, and my xml header is correct, but for all I know my use of JEdit could be creating discrepancies.
Fae
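One way to rule the local file encoding in or out is to check the raw bytes of the xml file before uploading. The sketch below is only an illustration in Python, with a hypothetical function name; it reports where a byte string stops being valid UTF-8, so the offending line can be found in an editor.

```python
def find_utf8_errors(data):
    """Return (byte_offset, line_number, bad_bytes) for each invalid UTF-8 run."""
    errors = []
    start = 0
    while start < len(data):
        try:
            data[start:].decode("utf-8")
            break  # the rest of the data is valid UTF-8
        except UnicodeDecodeError as exc:
            # exc.start and exc.end are relative to the slice we decoded.
            offset = start + exc.start
            line_number = data.count(b"\n", 0, offset) + 1
            errors.append((offset, line_number, data[offset:start + exc.end]))
            start += exc.end  # resume after the bad bytes
    return errors


# Usage sketch (the path is a placeholder):
# with open("upload.xml", "rb") as handle:
#     for offset, line, bad in find_utf8_errors(handle.read()):
#         print(f"line {line}, byte {offset}: {bad!r}")
```

An empty result only shows that the file is valid UTF-8. It says nothing about whether the characters are the intended ones.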
As an example, below is a failed record in my most recent 're-run', where some records are being skipped (I have no idea how the GWT chooses when to skip or fail). If someone can spot the problem I'd be very grateful.
<record>
  <filename>Islip Village and Vicinity Suffolk County, N.Y. NYPL1602998</filename>
  <additional_names>L.E. Neuman and Co. -- Engraver ;F.W. Beers and Co. -- Publisher</additional_names>
  <creator>Wendelken and Co. -- Publisher</creator>
  <link_image>http://link.nypl.org/2Qqj_oLvSbWRwPxtB1rq_wZ</link_image>
  <o1_notes>{{Information field|name=Notes|value=- }}</o1_notes>
  <o_source_description>{{Information field|name=Source description|value=1 p. l., 22 maps on 30 l. (part fold.) 20 in. }}</o_source_description>
  <o1_catalog_call_number>{{Information field|name=Catalogue call number|value=Map Div.++ (Suffolk county, N.Y.) (Wendelken and company, New York. Atlas of the towns of Babylon) }}</o1_catalog_call_number>
  <cat_uploader>Images uploaded by {{subst:User:Fae/Fae}}</cat_uploader>
  <image_title>Islip Village and Vicinity Suffolk County, N.Y.</image_title>
  <o1_item_page_plate>{{Information field|name=Item/Page/Plate|value=Section J }}</o1_item_page_plate>
  <link_catalog> http://digitalgallery.nypl.org/nypldigital/dgkeysearchdetail.cfm?imageID=160... {{Institution:New York Public Library}}</link_catalog>
  <source>Atlases of the United States / New York / Atlas of the towns of Babylon, Islip, and south part of Brookhaven in Suffolk Co., N.Y. Published by Wendelken and Co., 36 Vesey street, New York ... Engraved and printed by L.E. Neuman and Co.</source>
  <record_id>1092191</record_id>
  <o1_alternate_title>{{Information field|name=Alternate title|value=- }}</o1_alternate_title>
  <permission>From The Lionel Pincus and Princess Firyal Map Division. http://maps.nypl.org {{CC0}}</permission>
  <cat_upload_project>NYPL maps</cat_upload_project>
  <o1_item_physical_description>{{Information field|name=Item physical description|value=- }}</o1_item_physical_description>
  <date>c1888</date>
  <o1_standard_reference>{{Information field|name=Standard reference|value=- }}</o1_standard_reference>
  <o_location>{{Information field|name=Location|value=Stephen A. Schwarzman Building / The Lionel Pincus and Princess Firyal Map Division }}</o_location>
  <record_no>7906</record_no>
  <o_digital_item_published>{{Information field|name=Digital item published|value=11-15-2007; updated 3-25-2011 }}</o_digital_item_published>
  <o_digital_id>{{Information field|name=Digital ID|value=1602998 }}</o_digital_id>
</record>
Fae
On 01/05/2014, Fæ faewik@gmail.com wrote:
As an example, below is a failed record in my most recent 're-run', where some records are being skipped (I have no idea how the GWT chooses when to skip or fail). If someone can spot the problem I'd be very grateful
...
After downloading the linked tiff file from the previous example, I think that record may be a tangent: the NYPL might be failing to put the right mime data on some of their tiffs. I doubt I'll ever be able to fix that.
Instead, a good example of characters giving a problem is the file at [1]. This caused the GWT run to halt but was successfully loaded once I changed the "Æ" (ae ligature) character in Ægean to a simple "A". The only cause of this failure must have been the character, which is allowed in the mediawiki software.
Links 1. https://commons.wikimedia.org/wiki/File:A_new_map_of_the_islands_of_the_Agea...
Fae
Fæ, 01/05/2014 14:59:
Instead, a good example of characters giving a problem is the file at [1]. This caused the GWT run to halt but was successfully loaded once I changed the "Æ" (ae ligature) character in Ægean to a simple "A". The only cause of this failure must have been the character, which is allowed in the mediawiki software.
Links 1.https://commons.wikimedia.org/wiki/File:A_new_map_of_the_islands_of_the_Agea...
Thanks, this gives you clear steps to reproduce and makes a valuable bug report. Please file. :)
Nemo
characters
----------
i have a test xml i use to test titles and added the characters you mentioned. i had no problem uploading the test xml file. here are 2 results that seem to indicate that there should not be an issue with the characters:
http://commons.wikimedia.beta.wmflabs.org/wiki/File:The_%22King%E2%80%99s_of...
http://commons.wikimedia.beta.wmflabs.org/wiki/File:Dice_players_-_Lo_L%C3%A...
example record
--------------
i tested the example record locally and after about 2 minutes i got the message:
The file you submitted was too large.
original URL: http://link.nypl.org/2Qqj_oLvSbWRwPxtB1rq_wZ
evaluated URL: http://link.nypl.org/2Qqj_oLvSbWRwPxtB1rq_wZ
my wiki was set to a limit of 100mb, so i up’d it to 1000mb.
i also switched to the new preview branch i have in gerrit, https://gerrit.wikimedia.org/r/#/c/127839/, for bug https://bugzilla.wikimedia.org/show_bug.cgi?id=63864, which no longer downloads an image to the wiki during the preview step. instead it downloads all mediafiles in a background job.
the job successfully completed after 3 minutes and the image was viewable in my local wiki.
i also took a look at our wikitech instance and saw that you had uploaded the image there without issue. i also repeated the upload but got the message:
“This file did not pass file verification.”
this seems to have been thrown by UploadBase.php, so i'd have to look further into that issue. but i also suspect that commons may have just timed out on the download of the image in the preview step. this type of error seems similar to bug 63864. i just need someone to +2 the patch i made so that we can test the new preview step on the beta cluster.
with kind regards, dan
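The size limit and file verification failures described above can be screened for before a run. The following is only a sketch in Python, not part of GWToolset: the function names are hypothetical, the 100 MB default mirrors the local wiki setting mentioned above rather than any Commons limit, and `image/tiff` is an assumed expected type for these records.

```python
import urllib.request


def head_info(url, timeout=30):
    """Fetch response headers with a HEAD request (needs network access)."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return dict(response.headers.items())


def check_media(headers, limit_mb=100, expected_types=("image/tiff",)):
    """Return a list of likely problems based on the response headers."""
    lowered = {key.lower(): value for key, value in headers.items()}
    problems = []

    length = lowered.get("content-length")
    if length is None or not length.isdigit():
        problems.append("no usable Content-Length header")
    elif int(length) > limit_mb * 1024 * 1024:
        problems.append(f"file is larger than {limit_mb} MB")

    # Drop any parameters such as "; charset=..." before comparing.
    content_type = lowered.get("content-type", "").split(";")[0].strip().lower()
    if content_type not in expected_types:
        problems.append(f"unexpected Content-Type: {content_type or 'missing'}")
    return problems
```

Calling `check_media(head_info(url))` for each `<link_image>` value would flag records worth setting aside, although a server that answers HEAD requests differently from GET requests could give misleading results.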
Glamtools mailing list Glamtools@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/glamtools
Thanks for the detailed investigation Dan. There must be some oddity in the way I'm creating my xml (Python generated, then edited in JEdit for any tweaks, should be 'utf-8') so I'll continue plugging at it.
I keep missing your emails and finding them under my spam folder, no idea why.
Fae
Could you include the byte values of your ae character if possible (on unix computers, possibly also mac, the hd or hexdump command can tell you this), or just attach the xml in question to a bug (since there is a possibility that your email client might change the character)?
The encoding should be UTF-8 in NFC form, but even if it's not quite correct gw/mw should convert it.
--bawolff
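As an alternative to hexdump, the same information can be pulled out in Python with the standard `unicodedata` module. This is a sketch with hypothetical helper names: it lists the code point, UTF-8 bytes and Unicode name of each character, and checks whether a string is already in NFC form.

```python
import unicodedata


def describe(text):
    """One line per character: code point, UTF-8 bytes in hex, Unicode name."""
    lines = []
    for ch in text:
        utf8_hex = " ".join(f"{byte:02x}" for byte in ch.encode("utf-8"))
        name = unicodedata.name(ch, "<unnamed>")
        lines.append(f"U+{ord(ch):04X}  {utf8_hex}  {name}")
    return lines


def is_nfc(text):
    """True if the text is unchanged by NFC normalisation."""
    return unicodedata.normalize("NFC", text) == text


# A precomposed "ö" is a single code point, while a decomposed one is
# "o" followed by U+0308 (combining diaeresis). The two look identical
# on screen but are different byte sequences.
```

Note that "Æ" has no decomposed form, so NFC versus NFD should make no difference for that character. For "ö" the two forms do differ.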
bawolff,
i don’t think the issue with the records has to do with the characters in the xml file. i think the issue has to do with the timeout of downloading large mediafiles on commons from a web page and possibly that the mediafile is served in a way that the UploadBase::verifyFile() method doesn’t like.
if you, or anyone on the list, has some time to review the latest patch of https://gerrit.wikimedia.org/r/#/c/127839/, please do so and +2 it if you’re okay with it, or let me know what you think needs to be changed, so that it can be deployed to the beta cluster and production servers. once the patch has been deployed to the beta cluster we can test fae’s records using this new preview method.
with kind regards, dan
On 2 May 2014 08:00, Brian Wolff bawolff@gmail.com wrote: ...
The encoding should be utf8 with NFC, but even if its not quite correct gw/mw should convert it.
...
SUMMARY
Not a bug. Most headaches with the NYPL upload seem to be due to the source data.
ANALYSIS
Based on more tranches of uploads from NYPL and several days of playing with format problems, I believe this was a source site problem rather than a bug with GWT. Here's what I know:
* My xml is created from Python, correctly encoded using UTF-8.
* Most of the metadata is scraped from webpages declared as "iso-8859-1", i.e. the latin-1 charset. The (key needed) API is limited; I only use it to find a direct link to the full tiff, which is mysteriously avoided in the fully open website.
* Characters like "e acute" may appear both correctly and incorrectly in the resulting xml. The incorrect forms exist in at least two variations for this character. My use of encode('Windows-1252') on source fields may be part of this; I chose it after earlier trial and error with multiple methods. The number of records with errors is now relatively small.
* Reading the file as UTF-8 in JEdit may include hidden characters, making hand-correction tricky.
These behaviours indicate to me that the NYPL source site has inconsistently encoded its web pages. If I decode or encode I get unpredictable results. At the moment, my shortest workflow seems to be to let Python create the xml as a UTF file without any more character encoding and then fix any oddities by hand.
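For pages that mix encodings like this, one common approach is to try strict UTF-8 first and fall back to Windows-1252 only when that fails. The sketch below is an assumption about what might help, not a description of my current script, and the function names are hypothetical. The second helper undoes the familiar "Ã©" form that appears when UTF-8 bytes were decoded as Windows-1252 somewhere upstream.

```python
def decode_scraped(raw):
    """Decode scraped bytes: strict UTF-8 first, then Windows-1252."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # cp1252 leaves a few byte values undefined, so replace those
        # rather than failing on them.
        return raw.decode("cp1252", errors="replace")


def repair_mojibake(text):
    """Undo UTF-8 that was wrongly decoded as Windows-1252.

    If the round trip does not work, the text is returned unchanged.
    """
    try:
        return text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text
```

Neither helper can tell which encoding a page really intended, so any record they change would still be worth checking by eye.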
As a "more advanced" user of GWT, the fact that I am spending so much time in xml preparation should be of concern for how much we say in our user guides for the tool about xml checking, testing and preparation. I have little doubt that similar encoding problems (especially for multilanguage or ancient texts) will continue to dog some of our users. Other xml validation issues and on-wiki conventions with regard to html encoding, hidden text etc. would be worth the user community expanding on in the manual help pages, possibly growing the trouble-shooting guide as a separate document.
PS with regard to Dan's timeout point, I am unsure how much this is affecting my uploads. I did have to compile an xml file of re-uploads after many were mysteriously skipped, but the second time around they appear to have all been uploaded, so this performance related issue might be entirely dependent on the stress WMF's servers happen to be under. My most significant reason for skipped images is NYPL missing catalogue pages; however, these are discovered when I create the xml file. This is something I am deferring until after uploads are complete; in practice I might not spend time investigating it as the NYPL is moving to a new website scheme, so it might make sense to park new uploads for a year and let their system stabilize.
Links
* Example NYPL source file with non-ascii characters: http://digitalgallery.nypl.org/nypldigital/dgkeysearchdetail.cfm?imageID=126...
* API http://api.repo.nypl.org/
Fae