I suggest that anyone with topics they would like covered in a
python/pywikipediabot workshop adds them to the discussion on the event
registration talk page, so that Jonathan can pull ideas and expected
outcomes together. He's trying to agree a new date for a workshop, and I'm
thinking of the value of splitting it into a basics session of, say, 2
hours one evening and a more advanced practical session one afternoon (you
could then choose to come to one rather than both). I would be happy for this
to be on either a weekday or a weekend, depending on what most people can make.
Go to <
https://wiki.wikimedia.org.uk/wiki/Python_and_Wikimedia_bots_workshop_Oct_2…
> to add your ideas on dates and content for the workshop(s).
I have pasted below the code for a recursive dump of Wikipedia Takes Chester
that I cobbled together before breakfast, but it's not all that helpful
without first getting the basics of python modules, pywikipediabot and the
Wikimedia API (that it is built on) into your head. It is badly
written, but works, and I could tweak it into a general multi-category dump
routine with a couple of minutes' work (see the sketch after the code
below). The idea of having a couple of
workshops is to give a group of contributors the basic "bot" writing skills
and an effective kit-bag of methods to write anything they can imagine,
from clever analytical reports to daily house-keeping bots, even if they
use fairly poor code to do so ;-)
The main problem we have with pywikipediabot is that the documentation is
poor. For example, I don't think the function "fileUrl()" is documented
anywhere; for several months I was using the API directly to do what this
function does nicely, simply because I didn't know it was available. It
would also probably fail in mysterious ways if used on the wrong class of
object, such as a category rather than an image page, something a manual
ought to help the user understand. It would be great if those interested in
improving the manuals could play around with the various commands and
illustrate them with example working code (and highlight common errors!). I
would hope that the outcome of the workshop would be to achieve some of
this, perhaps even laying down a few short demonstration screen-capture
videos of what these tools can do and how to go about setting yourself up
to use them.
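
As an illustration of the sort of example a manual could carry, here is a
minimal sketch of guarding fileUrl() behind a namespace check. It assumes
the pywikipediabot "compat" framework; the safe_file_url wrapper is my own
invention for illustration, not part of the library:

# -*- coding: utf-8 -*-
# Minimal sketch, assuming the pywikipediabot "compat" framework.
# safe_file_url() is a hypothetical helper, not a library function.
import wikipedia

def safe_file_url(page):
    # Namespace 6 is the File: namespace; calling fileUrl() on anything
    # else (e.g. a category page) is the "wrong class of object" above.
    if page.namespace() != 6:
        raise ValueError("Not an image page: %s" % page.title())
    return wikipedia.ImagePage(page.site(), page.title()).fileUrl()

site = wikipedia.getSite('commons', 'commons')
print safe_file_url(wikipedia.Page(site, "File:Example.jpg"))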
BTW the "unidecode" bit below was hacked on after the dump fell over trying
to write façade in a local file name. It neatly transcribes it into
"facade", a clever module for handling non-ascii international characters
of all sorts.
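
A quick demonstration (assuming the unidecode package is installed):

# -*- coding: utf-8 -*-
from unidecode import unidecode

print unidecode(u"façade")      # -> facade
print unidecode(u"Ærøskøbing")  # -> AEroskobing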
Fae
----
# The main part of 'batchCatDump.py', treat as CC-BY-SA.
# This takes all images recursively under Commons category 'catname' and
# saves the full size image along with the current text of its associated
# image page in a local directory. In this case it generated 468 image files
# and the same number of matching html files, taking just under 2GB on a usb
# stick.

# Imports and site setup added here for completeness; the script assumes
# the pywikipediabot "compat" framework.
import os
import urllib

import wikipedia
import catlib
import pagegenerators
from unidecode import unidecode

site = wikipedia.getSite('commons', 'commons')

catname = "Wikipedia Takes Chester"
cat = catlib.Category(site, catname)
gen = pagegenerators.CategorizedPageGenerator(cat, recurse=True)
count = 0
savedir = "/Volumes/Fae_32GB/Wiki/" + catname + "/"
if not os.path.exists(savedir):
    os.makedirs(savedir)
for page in gen:
    title = page.title()
    if not title[0:5] == "File:":
        continue  # skip categories and other non-image pages
    count += 1
    utitle = unidecode(title[5:])  # transcribe non-ascii characters
    saveas = savedir + utitle
    if os.path.exists(saveas):
        continue  # already saved on an earlier run
    if utitle != title[5:]:
        print "Transcribing title as", utitle
    html = page.get()        # wikitext of the image description page
    source = page.fileUrl()  # direct URL of the full-size image
    urllib.urlretrieve(source, saveas)
    f = open(saveas + ".html", "w")
    f.write(unidecode(html))
    f.close()
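
P.S. For what it's worth, the general multi-category dump tweak mentioned
above could look something like this untested sketch, wrapping the same
loop in a function and taking category names from the command line (it
reuses the imports and 'site' object from the script above; the function
and category names are hypothetical):

# Hypothetical refactoring of the loop above into a reusable function.
import sys

def dump_category(site, catname, saveroot="/Volumes/Fae_32GB/Wiki/"):
    cat = catlib.Category(site, catname)
    gen = pagegenerators.CategorizedPageGenerator(cat, recurse=True)
    savedir = saveroot + catname + "/"
    if not os.path.exists(savedir):
        os.makedirs(savedir)
    for page in gen:
        title = page.title()
        if not title.startswith("File:"):
            continue
        saveas = savedir + unidecode(title[5:])
        if os.path.exists(saveas):
            continue
        urllib.urlretrieve(page.fileUrl(), saveas)
        f = open(saveas + ".html", "w")
        f.write(unidecode(page.get()))
        f.close()

# e.g. python batchCatDump.py "Wikipedia Takes Chester" "Some Other Category"
for name in sys.argv[1:]:
    dump_category(site, name)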
--
faewik@gmail.com
http://j.mp/faewm