Fae
This is very interesting and useful: I have never come across "fileUrl()" before.
In the example below, you are using source=page.fileUrl(). Is there a similar call that will get the full-size version of a file on Commons?
So far as I can see, getting the file via the API requires knowledge of the URL, which itself means calculating an md5 hash of the image name. In AppleScript, this seems to work, so long as there are no odd characters, but I'm hoping there is a simpler call to use in Python:
set imageName to findReplStr's findReplStr(imageName, " ", "_") #replace spaces with underscores
set hash to do shell script "md5 -q -s " & quoted form of imageName
set sub2 to text 1 thru 2 of hash #Applescript uses 1-based strings
set sub1 to text 1 of sub2
set imageURL1 to "http://upload.wikimedia.org/wikipedia/commons/"
set imageURL to imageURL1 & sub1 & "/" & sub2 & "/" & imageName
set imageURL to findReplStr's findReplStr(imageURL, " ", "_")
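For comparison, here is a minimal Python 3 sketch of the same md5 calculation. It assumes the upload directory layout that the AppleScript above relies on (first hex digit of the hash, then the first two hex digits, then the file name). The function name commons_file_url is made up for illustration and is not part of pywikipediabot. Hashing the UTF-8 bytes and percent-encoding the name is intended to cope with the "odd characters" case.

```python
import hashlib
from urllib.parse import quote

# Same base URL as in the AppleScript above.
COMMONS_UPLOAD_BASE = "http://upload.wikimedia.org/wikipedia/commons/"


def commons_file_url(image_name):
    """Guess the full-size upload URL for a Commons file name (no 'File:' prefix)."""
    # Spaces become underscores, as in the AppleScript.
    name = image_name.strip().replace(" ", "_")
    # Assumption: the wiki stores titles with a capitalised first letter.
    name = name[:1].upper() + name[1:]
    # Hash the UTF-8 bytes so that non-ASCII names are handled too.
    digest = hashlib.md5(name.encode("utf-8")).hexdigest()
    # Layout: <first hex digit>/<first two hex digits>/<percent-encoded name>
    return COMMONS_UPLOAD_BASE + digest[0] + "/" + digest[:2] + "/" + quote(name)
```

Calling commons_file_url("Some image.jpg") returns a URL that could then be passed to a download routine. Treat the result as a guess: it only holds if the file really is stored under that exact name.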
I am looking forward to your proposed workshops.
Michael
On 16 Oct 2013, at 15:10, Fæ wrote:
I suggest that anyone with topics they would like to cover in a python/pywikipediabot workshop consider adding to the discussion on the event registration talk page, so that Jonathan can pull ideas and expected outcomes together. He's trying to agree a new date for a workshop and I'm thinking of the value of splitting it into a basics session for, say, 2 hours one evening and a more advanced practical session one afternoon (you can then choose to come to one rather than both). I would be happy for this to be either a weekday or a weekend depending on what most people can make.
Go to https://wiki.wikimedia.org.uk/wiki/Python_and_Wikimedia_bots_workshop_Oct_2013 to add your ideas on dates and content for the workshop(s).
I have pasted below the code for a recursive dump of Wikipedia Takes Chester that I cobbled together before breakfast, but it's not all that helpful without getting the basics of Python modules, pywikipediabot and the Wikimedia API (that it is built on) in your head first. It is badly written, but works, and I can tweak this to be a general multi cat-dump routine with a couple of minutes' work. The idea of having a couple of workshops is to give a group of contributors the basic "bot" writing skills and an effective kit-bag of methods to write anything they can imagine, from clever analytical reports to daily house-keeping bots, even if they use fairly poor code to do so ;-)
The main problem we have with pywikipediabot is that documentation is poor. For example, I don't think the function "fileUrl()" is documented anywhere; for several months I was using the API directly to do what this function nicely does, as I didn't know it was available. It would also probably fail in mysterious ways if used on the wrong class of object, such as a category rather than an image page, which is something that a manual ought to help the user understand. It would be great if those interested in improving the manuals could play around with the various commands and illustrate with example working code (and highlight common errors!). I would hope that the outcome of the workshop would be to achieve some of this, perhaps even laying down a few short demonstration screen-capture videos of what these tools can do, and how to go about setting yourself up to use them.
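As an illustration of what "using the API directly" can look like, here is a rough Python 3 sketch built on the MediaWiki API's prop=imageinfo query with iiprop=url. The helper names imageinfo_query_url and extract_file_url are invented for this example, and the response layout assumed is the default JSON format (pages keyed by page id). It is not how fileUrl() is implemented, just one way to get at the same URL. The second helper returns None rather than raising when the page has no file behind it, which is the category-versus-image-page trap mentioned above.

```python
from urllib.parse import urlencode

API_ENDPOINT = "https://commons.wikimedia.org/w/api.php"


def imageinfo_query_url(title):
    """Build an API request URL asking for the file URL of one page title."""
    params = {
        "action": "query",
        "titles": title,        # e.g. "File:Example.jpg"
        "prop": "imageinfo",
        "iiprop": "url",
        "format": "json",
    }
    return API_ENDPOINT + "?" + urlencode(params)


def extract_file_url(response):
    """Return the file URL from a decoded API response, or None if there is none."""
    pages = response.get("query", {}).get("pages", {})
    for page in pages.values():
        info = page.get("imageinfo")
        # Categories, articles and missing pages carry no "imageinfo" entry.
        if info:
            return info[0].get("url")
    return None
```

To use it, fetch imageinfo_query_url("File:...") with urllib.request.urlopen, decode the body with json.load, and pass the resulting dictionary to extract_file_url.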
BTW the "unidecode" bit below was hacked on after the dump fell over trying to write façade in a local file name. It neatly transcribes it into "facade"; unidecode is a clever module for handling non-ASCII international characters of all sorts.
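For anyone without unidecode installed, a rough standard-library stand-in is sketched below (Python 3). This is a swapped-in technique, not what unidecode does internally: it decomposes accented letters with unicodedata and then discards whatever is not ASCII. The function name ascii_fold is made up for the example.

```python
import unicodedata


def ascii_fold(text):
    """Approximate ASCII version of text, for use in local file names."""
    # NFKD splits an accented letter into its base letter plus a combining mark.
    decomposed = unicodedata.normalize("NFKD", text)
    # Dropping non-ASCII code points removes the combining marks.
    return decomposed.encode("ascii", "ignore").decode("ascii")
```

This copes with façade, but characters that have no decomposition, such as æ, are simply dropped rather than transliterated, so unidecode remains the better choice for anything beyond accented Latin letters.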
Fae
/* The main part of 'batchCatDump.py', treat as CC-BY-SA. This takes all images recursively under Commons category 'catname' and saves the full size image along with the current text of its associated image page in a local directory. In this case it generated 468 image files and the same number of matching html files, taking just under 2GB on a usb stick. */
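# Standard library and third-party imports needed by this excerpt
import os
import urllib
from unidecode import unidecode
# pywikipediabot modules; 'site' is created earlier in the script (not shown)
import catlib, pagegenerators

catname = "Wikipedia Takes Chester"
cat = catlib.Category(site, catname)
gen = pagegenerators.CategorizedPageGenerator(cat, recurse=True)
count = 0

savedir = "//Volumes/Fae_32GB/Wiki/" + catname + "/"
if not os.path.exists(savedir):
    os.makedirs(savedir)

for page in gen:
    title = page.title()
    # Only image pages are of interest; skip subcategories and anything else
    if not title[0:5] == "File:":
        continue
    count += 1
    # Transcribe non-ASCII characters so the local file name is safe
    utitle = unidecode(title[5:])
    saveas = savedir + utitle
    # Skip files already downloaded on a previous run
    if os.path.exists(saveas):
        continue
    if utitle != title[5:]:
        print "Transcribing title as", utitle
    html = page.get()
    source = page.fileUrl()
    # Save the full-size image, then the image page text alongside it
    urllib.urlretrieve(source, saveas)
    f = open(saveas + ".html", "w")
    f.write(unidecode(html))
    f.close()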
-- faewik@gmail.com http://j.mp/faewm
Wikimedia UK mailing list
wikimediauk-l@wikimedia.org
http://mail.wikimedia.org/mailman/listinfo/wikimediauk-l
WMUK: http://uk.wikimedia.org