Fae
This is very interesting and useful: I have never come across "fileUrl()"
before.
In the example below, you are using source=page.fileUrl(). Is there a similar call that
will get the full-size version of a file on Commons?
So far as I can see, getting the file via the API requires knowledge of the URL, which
in turn means calculating an MD5 hash of the image name. In AppleScript this seems to
work, so long as there are no odd characters, but I'm hoping there is a simpler call
to use in Python:
set imageName to findReplStr's findReplStr(imageName, " ", "_") # replace spaces with underscores
set hash to do shell script "md5 -q -s " & quoted form of imageName
set sub2 to text 1 thru 2 of hash # AppleScript uses 1-based strings
set sub1 to text 1 of sub2
set imageURL1 to "http://upload.wikimedia.org/wikipedia/commons/"
set imageURL to imageURL1 & sub1 & "/" & sub2 & "/" & imageName
set imageURL to findReplStr's findReplStr(imageURL, " ", "_")
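For what it's worth, the same hashed-path calculation is short in pure Python using only
the standard library. This is a sketch, assuming the standard Commons upload.wikimedia.org
directory layout; the function name commons_image_url is just illustrative:

```python
import hashlib

def commons_image_url(image_name):
    """Build the direct upload.wikimedia.org URL for a Commons file.

    MediaWiki's hashed upload layout: spaces become underscores, then
    the MD5 hex digest of the name gives the one-character and
    two-character subdirectories.
    """
    name = image_name.replace(" ", "_")
    digest = hashlib.md5(name.encode("utf-8")).hexdigest()
    return ("http://upload.wikimedia.org/wikipedia/commons/"
            "%s/%s/%s" % (digest[0], digest[:2], name))

print(commons_image_url("Wikipedia Takes Chester.jpg"))
```

Note this gives no guarantee the file actually exists on Commons; it only reproduces
the path the server would use for that name.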
I am looking forward to your proposed workshops.
Michael
On 16 Oct 2013, at 15:10, Fæ wrote:
I suggest that anyone with topics they would like to
cover in a python/pywikipediabot workshop, consider adding to discussion on the event
registration talk page, so that Jonathan can pull ideas and expected outcomes together.
He's trying to agree a new date for a workshop and I'm thinking of the value of
splitting it into a basics session for, say, 2 hours one evening and a more advanced
practical session one afternoon (you can then choose to come to one rather than both). I
would be happy for this to be either a weekday or a weekend depending on what most people
can make.
Go to
<https://wiki.wikimedia.org.uk/wiki/Python_and_Wikimedia_bots_workshop_Oct_2013> to
add your ideas on dates and content of a workshop(s).
I have pasted the code for a recursive dump of Wikipedia Takes Chester that I cobbled
together before breakfast below, but it's not all that helpful without getting the
basics of python modules, pywikipediabot and the Wikimedia API (that it is built on) in
your head first. It is badly written, but works, and I can tweak this to be a general
multi cat-dump routine with a couple of minutes work. The idea of having a couple of
workshops is to give a group of contributors the basic "bot" writing skills and
an effective kit-bag of methods to write anything they can imagine, from clever analytical
reports to daily house-keeping bots, even if they use fairly poor code to do so ;-)
The main problem we have with pywikipediabot is that the documentation is poor. For
example, I don't think the function "fileUrl()" is documented anywhere: for several
months I was using the API directly to do what this function nicely does, as I didn't
know it was available. It would also probably fail in mysterious ways if used on the
wrong class of object, such as a category rather than an image page, something that a
manual ought to help the user understand. It would be great if those interested in
improving the manuals could play around with the various commands and illustrate them
with working example code (and highlight common errors!). I would hope that the outcome of the workshop would
be to achieve some of this, perhaps even laying down a few short demonstration
screen-capture videos of what these tools can do, and how to go about setting yourself up
to use them.
BTW the "unidecode" bit below was hacked on after the dump fell over trying to
write façade into a local file name. unidecode is a clever module for handling non-ASCII
international characters of all sorts; it neatly transcribes "façade" as "facade".
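For anyone without the third-party unidecode package installed, a rough standard-library
approximation is possible: decompose accented characters and drop the combining marks.
It is far less clever than unidecode (it handles accents, not e.g. Cyrillic or CJK), and
ascii_fold is just an illustrative name:

```python
import unicodedata

def ascii_fold(text):
    # Decompose accented characters (NFKD), then drop anything
    # that does not survive an ASCII encoding.
    return (unicodedata.normalize("NFKD", text)
            .encode("ascii", "ignore").decode("ascii"))

print(ascii_fold(u"façade"))  # facade
```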
Fae
----
# The main part of 'batchCatDump.py', treat as CC-BY-SA.
# This takes all images recursively under Commons category 'catname' and saves
# the full-size image along with the current text of its associated image page
# in a local directory. In this case it generated 468 image files and the same
# number of matching html files, taking just under 2 GB on a USB stick.
# Imports and site setup restored for completeness (pywikipediabot "compat"):
import os
import urllib
import wikipedia
import catlib
import pagegenerators
from unidecode import unidecode

site = wikipedia.getSite('commons', 'commons')

catname = "Wikipedia Takes Chester"
cat = catlib.Category(site, catname)
gen = pagegenerators.CategorizedPageGenerator(cat, recurse=True)
count = 0
savedir = "//Volumes/Fae_32GB/Wiki/" + catname + "/"
if not os.path.exists(savedir):
    os.makedirs(savedir)
for page in gen:
    title = page.title()
    if not title[0:5] == "File:":
        continue
    count += 1
    utitle = unidecode(title[5:])
    saveas = savedir + utitle
    if os.path.exists(saveas):
        continue
    if utitle != title[5:]:
        print "Transcribing title as", utitle
    html = page.get()           # wikitext of the file description page
    source = page.fileUrl()     # URL of the full-size file
    urllib.urlretrieve(source, saveas)
    f = open(saveas + ".html", "w")
    f.write(unidecode(html))
    f.close()
--
faewik(a)gmail.com
http://j.mp/faewm
_______________________________________________
Wikimedia UK mailing list
wikimediauk-l(a)wikimedia.org
http://mail.wikimedia.org/mailman/listinfo/wikimediauk-l
WMUK:
http://uk.wikimedia.org