Hi, I'm preparing an image donation of some 350 picture books from 1810 to 1880 (taken from the collection http://www.geheugenvannederland.nl/?/en/collecties/prentenboeken_van_1810_to...). For every book I've constructed an XML file describing the pages (metadata). So e.g. for a book of 20 pages I have an XML with 20 records. I can upload these in the normal way via the GWToolset web interface, also assigning a Commons category to the book.
For 1 book that's doable, but for 350 books I would need to upload 350 XML files, 1 by 1, using the GWT web interface (using the same JSON mapping file for all uploads). But this would take me a lot of time (and it's rather boring)...
So I'm wondering if / how I could automate this. Is there a more direct/efficient way?
I can imagine that I could do some command-line interfacing (Pywiki??), with the XML, the JSON mapping and the target Commonscat name as input parameters. Would that be an option?
Any tricks, tips & directions are very welcome
Met vriendelijke groet / With kind regards
Olaf Janssen
Wikipedia & open data coordinator
Koninklijke Bibliotheek - National Library of the Netherlands
olaf.janssen@kb.nl | +31 (0)70 3140 388 | @ookgezellig | www.slideshare.net/OlafJanssenNL
Prins Willem-Alexanderhof 5 | 2595 BE Den Haag | Postbus 90407 | 2509 LK Den Haag | (070) 314 09 11 | www.kb.nl
English version: http://www.kb.nl/en/email | Disclaimer: http://www.kb.nl/disclaimer
Have you considered uploading to the Internet Archive and then uploading to Commons using the IA-upload tool? (This is the normal process for texts.)
I don't know if that can be automated.
https://internetarchive.readthedocs.org/en/latest/cli.html https://tools.wmflabs.org/ia-upload/commons/init https://github.com/Tpt/ia-upload
I'm afraid I've only done them one at a time. Cheers
On Mon, Feb 15, 2016 at 7:18 AM, Olaf Janssen Olaf.Janssen@kb.nl wrote:
Glamtools mailing list Glamtools@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/glamtools
The easiest way is to merge your XML files into one large XML, then put this through the GWT. Though I can imagine ways of automating the front end of GWT, it would be a clumsy way of going about it.
If your concern is that you want to create separate book categories, then add a category field in the XML that can vary by book. You can add several variable categories as an option on the mappings page. For example https://commons.wikimedia.org/wiki/File:DAILY_MENU_%28held_by%29_REVERE_HOUSE_%28at%29_%22BOSTON,_MA%22_%28%28HOTEL%3F%29%29_%28NYPL_Hades-269316-476896%29.tiff was uploaded with the category "NYPL Rare Book Division" automatically generated from the NYPL metadata. (To be fair, I'm not using the GWT for most of the NYPL material for reasons mentioned on the related project page.)
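The merge-plus-per-book-category approach can be scripted in a few lines. A minimal sketch, assuming each book's XML looks like `<records><record>...</record></records>` and that the `record` and `category` element names match your JSON mapping (both are assumptions; adapt to your actual schema):

```python
import xml.etree.ElementTree as ET
from pathlib import Path

def merge_books(xml_dir, out_file):
    """Merge per-book GWToolset XML files into one large batch file,
    adding a per-book <category> element to every <record>."""
    merged = ET.Element("records")
    for book_file in sorted(Path(xml_dir).glob("*.xml")):
        tree = ET.parse(book_file)
        # Derive the Commons category from the file name; in practice you
        # would take it from your own per-book metadata instead.
        category = book_file.stem
        for record in tree.getroot().iter("record"):
            cat = ET.SubElement(record, "category")
            cat.text = category  # constant within a book, varies between books
            merged.append(record)
    ET.ElementTree(merged).write(out_file, encoding="utf-8",
                                 xml_declaration=True)
```

The single merged file then goes through the GWToolset web interface once, with the `category` field mapped as a variable category on the mappings page.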
Responding to J Hayes' comment in this thread: you can mass-upload to IA with off-the-shelf Python modules. However, just as much care should be taken to map out the metadata using IA's metadata options; because they are so open and flexible, the resulting archives tend to be confusingly inconsistent. This would still leave the challenge of finding a good mapping to Commons templates if you then wanted to upload from IA to Commons rather than from somewhere else.
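For the IA route, the off-the-shelf `internetarchive` Python module exposes an `upload()` function taking an item identifier, files, and a metadata dict. A hedged sketch: the `book` field names and the chosen metadata keys below are illustrative assumptions, not a vetted mapping, and `"opensource"` is only a placeholder collection:

```python
# pip install internetarchive  (the off-the-shelf module mentioned above)
try:
    from internetarchive import upload
except ImportError:
    upload = None  # keeps the sketch importable without the module installed

def ia_metadata(book):
    """Map one book record to an IA metadata dict (field names assumed)."""
    return {
        "mediatype": "texts",
        "title": book["title"],
        "date": book["year"],
        "language": book.get("language", "dut"),
        "collection": "opensource",  # placeholder; use your real collection
    }

def upload_book(identifier, image_files, book):
    """One IA item per book; the page images become files of that item."""
    if upload is None:
        raise RuntimeError("pip install internetarchive first")
    return upload(identifier, files=image_files, metadata=ia_metadata(book))
```

Working out a consistent mapping in `ia_metadata` up front is exactly the care Fae recommends, since IA itself will accept almost anything.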
Fae
On 15 February 2016 at 12:18, Olaf Janssen Olaf.Janssen@kb.nl wrote:
Dear all,
Upload using ftp.company.com possible?
--------------------------------------
A scanning company has scanned photos/slides for a Dutch GLAM and offers them on their FTP site, with a username and password for download.
If they take away this username/passwd, would GWToolset accept as upload URL for instance
http://ftp.company.nl/(..directories to be specified..)/1_01.tif
? I would guess yes, but I'd like to hear from you experts out there.
Thanks a lot, hans muller
Update of my question:
1. Would a ftp-domain (not http(s) protocol) like
http://ftp.company.com
be acceptable for GWToolset?
2. If so, would a call with user/passwd be acceptable as an upload URL for GWToolset? (Of course after whitelisting the domain.) Type:
http://ftp.company.com?user=...&password=...&dir=TIFF/1&file=1_0...
Or would, for instance, the access time be too slow for GWToolset (that depends, of course), etc.?
Thank you for considering this question,
hans muller https://commons.wikimedia.org/wiki/User:Hansmuller
On Sun, 28 February 2016, 12:11 pm, Hans Muller wrote:
On Sun, Feb 28, 2016 at 6:49 AM, Hans Muller j.m.muller@hccnet.nl wrote:
Update of my question:
- Would a ftp-domain (not http(s) protocol) like
http://ftp.company.com
be acceptable for GWToolset?

http://ftp.company.com is an http-protocol URL for a server named "ftp". Did you mean ftp://ftp.company.com?
http://ftp.company.com is acceptable (since it starts with http://), but ftp://ftp.company.com would not be.
In principle we might be able to add ftp support (it uses curl on the backend, which supports ftp; I think it's just the validation code that rejects ftp). We'd also need to make sure that the Squid proxy supports ftp (Squid supports ftp in principle, but I have no idea if it's enabled in Wikimedia).
So basically, I'd suggest filing a bug. If anyone was actually maintaining GWToolset it would probably get fixed. Given the current situation, who knows.
- If so, would a call with user/passwd be acceptable as an upload URL for
GWToolset? (Of course after whitelisting the domain.) Type:
http://ftp.company.com?user=...&password=...&dir=TIFF/1&file=1_0...
The syntax for username and password in urls is
ftp://username:password@ftp.company.com/TIFF/1/1_01.tiff
This is also true for http urls when using HTTP authentication.
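The components of such a credential-carrying URL can be inspected with Python's standard `urllib.parse`, a quick way to double-check that a URL you hand to a tool is shaped the way you think (using the hypothetical ftp.company.com host from this thread):

```python
from urllib.parse import urlsplit

# The username:password@host form described above, for ftp:// or http://
url = "ftp://username:password@ftp.company.com/TIFF/1/1_01.tiff"
parts = urlsplit(url)

print(parts.scheme)    # "ftp"  (GWToolset currently accepts only http/https)
print(parts.username)  # "username"
print(parts.password)  # "password"
print(parts.hostname)  # "ftp.company.com"
print(parts.path)      # "/TIFF/1/1_01.tiff"
```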
Or would, for instance, the access time be too slow for GWToolset (that depends, of course), etc.?
Timeouts would be the same as for http. Which are quite high, so it would probably be fine on that count.
-- -bawolff
Thanks for your swift reply, even on Sunday!
Method 1
http://ftp.company.com?user=NAME&password=PASSWD&dir=DIR&file=FI...
works! But Method 2
ftp://username:password@ftp.company.com/DIR/FILE
did not, at least in my hands.
Anyway, I wonder whether the group on Phabricator would whitelist a domain for use with Method 1. However, they did whitelist the domain of a Dutch scanning company,
memorix.nl
according to the list of whitelisted domains on the GWToolset starting page; perhaps that one also sits behind a username/password lock...
Thx and best regards, hans muller https://commons.wikimedia.org/wiki/User:Hansmuller
On Sun, 28 February 2016, 7:36 pm, bawolff wrote: