Dear all,
We need to do a couple of batch extracts from Wikimedia Commons including a copy of the "Wiki takes Chester" photos. Is there anyone who could do this for us or show us how?
The extract would be expected to fit on a memory stick, and we would supply the memory stick. Needless to say the extract needs to comply with the reuse policy (https://commons.wikimedia.org/wiki/Commons:Reusing_content_outside_Wikimedia#Downloading), so some metadata will also need extraction.
Regards
Jonathan Cardy GLAM (Galleries, Libraries, Archives & Museums) Organiser/Trefnydd GLAM (Galeriau, Llyfrgelloedd, Archifdai a llawer Mwy!) Wikimedia UK 0207 065 0990
Wikimedia UK is a Company Limited by Guarantee registered in England and Wales, Registered No. 6741827. Registered Charity No.1144513. Registered Office 4th Floor, Development House, 56-64 Leonard Street, London EC2A 4LT. United Kingdom. Wikimedia UK is the UK chapter of a global Wikimedia movement. The Wikimedia projects are run by the Wikimedia Foundation (who operate Wikipedia, amongst other projects).
Wikimedia UK is an independent non-profit charity with no legal control over Wikipedia nor responsibility for its contents.
I'm happy to sort this out with a small Python script if you email me:
1. A list of categories (or the parent the others sit under)
2. Explain what you want as metadata (a text file of the image page edit history, perhaps?)
Cheers, Fae
Fae, would you be able to share the script once it has been written? Would it be a matter of plugging in a category at one end and picking up a zip file at the other? I ask because, although there is a planned Python workshop, my availability at weekends is limited, so I'm not sure I'd be able to attend, and such a script would be quite handy.
I suggest that anyone with topics they would like to cover in a python/pywikipediabot workshop consider adding them to the discussion on the event registration talk page, so that Jonathan can pull ideas and expected outcomes together. He's trying to agree a new date for a workshop, and I'm thinking of the value of splitting it into a basics session of, say, 2 hours one evening and a more advanced practical session one afternoon (you can then choose to come to one rather than both). I would be happy for this to be either a weekday or a weekend, depending on what most people can make.
Go to https://wiki.wikimedia.org.uk/wiki/Python_and_Wikimedia_bots_workshop_Oct_2013 to add your ideas on dates and content of the workshop(s).
I have pasted below the code for a recursive dump of Wikipedia Takes Chester that I cobbled together before breakfast, but it's not all that helpful without first getting the basics of Python modules, pywikipediabot and the Wikimedia API (which it is built on) into your head. It is badly written, but works, and I can tweak this into a general multi-cat-dump routine with a couple of minutes' work. The idea of having a couple of workshops is to give a group of contributors the basic "bot" writing skills and an effective kit-bag of methods to write anything they can imagine, from clever analytical reports to daily house-keeping bots, even if they use fairly poor code to do so ;-)
The main problem we have with pywikipediabot is that the documentation is poor. For example, I don't think the function "fileUrl()" is documented anywhere; for several months I was using the API directly to do what this function does nicely, simply because I didn't know it was available. It would also probably fail in mysterious ways if used on the wrong class of object, such as a category rather than an image page, which is exactly the kind of thing a manual ought to help the user understand. It would be great if those interested in improving the manuals could play around with the various commands and illustrate them with example working code (and highlight common errors!). I would hope that one outcome of the workshop would be to achieve some of this, perhaps even laying down a few short demonstration screen-capture videos of what these tools can do and how to go about setting yourself up to use them.
BTW, the "unidecode" bit below was hacked on after the dump fell over trying to write "façade" in a local file name. It neatly transcribes it as "facade"; it's a clever module for handling non-ASCII international characters of all sorts.
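A minimal illustration of that transcription, assuming the unidecode package is installed:

# -*- coding: utf-8 -*-
from unidecode import unidecode
print unidecode(u"façade")   # prints: facade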
Fae

----

The main part of 'batchCatDump.py', treat as CC-BY-SA. This takes all images recursively under the Commons category 'catname' and saves the full-size image along with the current text of its associated image page in a local directory. In this case it generated 468 image files and the same number of matching html files, taking just under 2GB on a USB stick.

# Assumed setup for the 'main part' below: pywikipedia 'compat' imports and a
# Commons site handle (not included in the snippet as originally posted).
import os
import urllib
import wikipedia
import catlib
import pagegenerators
from unidecode import unidecode

site = wikipedia.getSite('commons', 'commons')

catname = "Wikipedia Takes Chester"
cat = catlib.Category(site, catname)
gen = pagegenerators.CategorizedPageGenerator(cat, recurse=True)
count = 0

savedir = "//Volumes/Fae_32GB/Wiki/" + catname + "/"
if not os.path.exists(savedir):
    os.makedirs(savedir)

for page in gen:
    title = page.title()
    if not title[0:5] == "File:":
        continue
    count += 1
    utitle = unidecode(title[5:])
    saveas = savedir + utitle
    if os.path.exists(saveas):
        continue
    if utitle != title[5:]:
        print "Transcribing title as", utitle
    html = page.get()
    source = page.fileUrl()
    urllib.urlretrieve(source, saveas)
    f = open(saveas + ".html", "w")
    f.write(unidecode(html))
    f.close()
This is very interesting and useful: I have never come across "fileUrl()" before.
In the example below, you are using source=page.fileUrl(). Is there a similar call that will get the full-size version of a file on Commons?
So far as I can see, getting the file via the API requires knowledge of the URL, which itself means calculating an md5 hash on the image name. In AppleScript, this seems to work, so long as there are no odd characters, but I'm hoping there is a simpler call to use in Python:
set imageName to findReplStr's findReplStr(imageName, " ", "_") # replace spaces with underscores
set hash to do shell script "md5 -q -s " & quoted form of imageName
set sub2 to text 1 thru 2 of hash # AppleScript uses 1-based strings
set sub1 to text 1 of sub2
set imageURL1 to "http://upload.wikimedia.org/wikipedia/commons/"
set imageURL to imageURL1 & sub1 & "/" & sub2 & "/" & imageName
set imageURL to findReplStr's findReplStr(imageURL, " ", "_")
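For comparison, a rough Python sketch of the same hashed-path construction (an illustrative guess, not code from this thread; it assumes only the standard hashlib and urllib modules):

import hashlib
import urllib

def commons_upload_url(image_name):
    # Originals live under /<first hex char>/<first two hex chars> of the MD5
    # of the file name, with spaces replaced by underscores.
    name = image_name.replace(" ", "_")
    h = hashlib.md5(name).hexdigest()
    return ("http://upload.wikimedia.org/wikipedia/commons/"
            + h[0] + "/" + h[0:2] + "/" + urllib.quote(name))

print commons_upload_url("Gay.jpg")
# http://upload.wikimedia.org/wikipedia/commons/9/90/Gay.jpg

The simpler route, as the reply below explains, is to let the API (or fileUrl()) hand you this URL directly.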
I am looking forward to your proposed workshops.
Michael
On 18/10/2013, Michael Maggs Michael@maggs.name wrote:
So far as I can see, getting the file via the API requires knowledge of the URL, which itself means calculating an md5 hash on the image name. In Applescript, this seems to work, so long as there are no odd characters, but
No need for md5 hash calculations :-)
In Python, the call to fileUrl() 'translates' the Commons image page name into the URL of the full-size image file to download, without your having to provide any other information. Behind the scenes it uses the API to make a "query" call returning "imageinfo".
You may find a real example of API calls useful. Starting with the image title "Gay.jpg" [1], we can ask the API for its properties by making a query call [2]; within the results (which you can request as XML or JSON) is the URL of the full-sized image file, which you can then download [3].
Rather than full size, you can request a particular size, such as a width of 100px [4]. In my script that queries tineye.com to check whether mobile images are possible copyright violations, I only used thumbnails of width 300px, which saved a lot of bandwidth. :-)
1. http://commons.wikimedia.org/wiki/File:Gay.jpg
2. https://commons.wikimedia.org/w/api.php?action=query&titles=File:Gay.jpg...
3. https://upload.wikimedia.org/wikipedia/commons/9/90/Gay.jpg
4. https://commons.wikimedia.org/w/api.php?action=query&titles=File:Gay.jpg...
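As a minimal sketch of that imageinfo query in Python (standard library only; the parameter names prop=imageinfo, iiprop=url and iiurlwidth are the generic MediaWiki API parameters rather than anything quoted above):

import json
import urllib
import urllib2

def commons_image_urls(title, thumb_width=None):
    # Ask the Commons API for the direct file URL (and optionally a thumbnail URL).
    params = {
        "action": "query",
        "titles": title,          # e.g. "File:Gay.jpg"
        "prop": "imageinfo",
        "iiprop": "url",
        "format": "json",
    }
    if thumb_width:
        params["iiurlwidth"] = thumb_width
    url = "https://commons.wikimedia.org/w/api.php?" + urllib.urlencode(params)
    data = json.load(urllib2.urlopen(url))
    page = data["query"]["pages"].values()[0]
    info = page["imageinfo"][0]
    return info["url"], info.get("thumburl")

full, thumb = commons_image_urls("File:Gay.jpg", thumb_width=100)
# full  -> https://upload.wikimedia.org/wikipedia/commons/9/90/Gay.jpg
# thumb -> URL of a 100px-wide thumbnail
urllib.urlretrieve(full, "Gay.jpg")

(For scripted bulk downloads, Wikimedia asks clients to send a descriptive User-Agent header, which can be set via urllib2.Request.)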
I would imagine that an introductory workshop should cover the basics of how to use the API and how to find parameters in the manual. This would be particularly useful for folks more comfortable programming in languages other than Python, or using other tools, who could still use all the features of the API to do interesting stuff.
Fae
Thank you. I will make use of that over the weekend to download some of the WLM images for the jury to review.
Regards
Michael
On Fri, 2013-10-18 at 17:53 +0100, Michael Maggs wrote:
Thank you. I will make use of that over the weekend to download some of the WLM images for the jury to review.
I've always wondered why the 'reference' library for the API is in Python. Given that MediaWiki is written in PHP, why isn't that the language used?
I've not kept my NewsieBot classes [1] fully up to date, but a rewrite is underway. All it is currently in use for is cleaning up the sandbox on enWN. It should still all work, although the image handling is not well tested.
[1] https://github.com/Brian-McNeil/NewsieBot-WikiInterface
Brian McNeil.
I'm curious, why do you want to do this? Harry Mitchell
Phone: 024 7698 0977 Skype: harry_j_mitchell
Hi Harry,
There are a couple of reciprocal image donations where we are getting or hoping to get sets of images and I want to send them the equivalent part of the Commons collection.
Regards
Jonathan Cardy GLAM (Galleries, Libraries, Archives & Museums) Organiser/Trefnydd GLAM (Galeriau, Llyfrgelloedd, Archifdai a llawer Mwy!) Wikimedia UK 0207 065 0990