Hi Kent.
I'll respond to a few points. The rest are outside my knowledge / scope (e.g., mwxml2sql; Debian).
Hope this helps, and good luck on wp-mirror.
----
> interlanguage links have been removed to the wikidata project, the rendering of which requires mediawiki-1.21+;
You will need a local instance of http://www.wikidata.org . You could probably build it using the files from here: http://dumps.wikimedia.org/wikidatawiki/ .
Note that wikidata is also being used in infoboxes. For example,
now has multiple {{#property}} statements.
> infoboxes now require the scribunto extension which requires mediawiki-1.20+
I believe most of the wikis have moved the infobox generation from Templates to Modules. They've moved a lot of other functionality as well (for example: references and message boxes).
The Scribunto extension is at https://www.mediawiki.org/wiki/Extension:Scribunto . It runs the named function of a Lua module and replaces the {{#invoke:Module_name|function_name|arguments}} call with the text that function returns.
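If it helps to see which modules a page depends on, here is a rough Python sketch that pulls the module name, function name, and arguments out of simple {{#invoke:...}} calls. It is only an illustration: parse_invoke and INVOKE_RE are names I made up, the pattern ignores nested templates, and it does not run any Lua. The real expansion is done by MediaWiki with Scribunto installed.

```python
import re

# Matches a simple, non-nested call such as {{#invoke:Module|function|a|b}}.
# Nested templates or links inside the arguments are NOT handled (assumption).
INVOKE_RE = re.compile(r"\{\{#invoke:([^|{}]+)\|([^|{}]+)((?:\|[^|{}]*)*)\}\}")


def parse_invoke(wikitext):
    """Return a list of (module, function, args) for each simple #invoke call."""
    calls = []
    for match in INVOKE_RE.finditer(wikitext):
        module = match.group(1).strip()
        function = match.group(2).strip()
        raw_args = match.group(3)
        # raw_args looks like "|a|b"; drop the leading pipe before splitting.
        args = raw_args[1:].split("|") if raw_args else []
        calls.append((module, function, args))
    return calls


if __name__ == "__main__":
    sample = "Intro {{#invoke:Infobox|infobox|title=Example}} text"
    for module, function, args in parse_invoke(sample):
        print(module, function, args)
```

A scan like this over a dump could tell you which Module pages need to be present in the mirror before infoboxes will render.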
> category - dump files now have 5 fields, whereas the database schema has 6 fields;
I believe they removed the cat_hidden field, which was effectively deprecated. A category's hidden status is now saved in page_props.
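As a rough sketch of how the hidden status could be looked up instead, the query below joins page and page_props. It assumes the property name is hiddencat (the one set by __HIDDENCAT__) and that categories live in namespace 14. The Python part uses an in-memory sqlite database with a cut-down copy of the two tables purely as a stand-in for the real MySQL schema; hidden_categories and make_demo_db are illustrative names, not anything from MediaWiki.

```python
import sqlite3

# Query against MediaWiki-style tables; column names follow the MediaWiki schema.
# Assumption: hidden categories carry the page_props property 'hiddencat'.
HIDDEN_CATEGORIES_SQL = """
SELECT p.page_title
FROM page AS p
JOIN page_props AS pp ON pp.pp_page = p.page_id
WHERE p.page_namespace = 14
  AND pp.pp_propname = 'hiddencat'
ORDER BY p.page_title
"""


def hidden_categories(conn):
    """Return the titles of all category pages flagged as hidden."""
    return [row[0] for row in conn.execute(HIDDEN_CATEGORIES_SQL)]


def make_demo_db():
    """Build a tiny in-memory stand-in for the page and page_props tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE page (page_id INTEGER, page_namespace INTEGER, page_title TEXT);
        CREATE TABLE page_props (pp_page INTEGER, pp_propname TEXT, pp_value TEXT);
        INSERT INTO page VALUES (1, 14, 'Visible_category');
        INSERT INTO page VALUES (2, 14, 'Hidden_category');
        INSERT INTO page VALUES (3, 0, 'Some_article');
        INSERT INTO page_props VALUES (2, 'hiddencat', '');
        INSERT INTO page_props VALUES (3, 'defaultsort', 'Article');
        """
    )
    return conn


if __name__ == "__main__":
    print(hidden_categories(make_demo_db()))
```

The same SELECT should carry over to the mirror's MySQL database, though I have not tried it there.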
> The large image dump tarballs are now a year old.
I think there's still some infrastructure work that needs to be done on the Wikimedia side.
> We are beginning to see thumb dumps from the xowa project.
I've been uploading thumbs for the major wikis to archive.org. See here for a summary: https://archive.org/search.php?query=xowa
I'm planning to upload thumbs for all the major languages listed with more than 200,000 articles on https://en.wikipedia.org/wiki/Main_Page . There are roughly 27 languages listed there. My progress has been about 1 wiki per week (I'm also uploading sister wikis for a given language). I've done 4 so far, and I'm hoping to be done with the other 23 by mid-year.
At about that time, I'm hoping to have a more automated way of generating updates. Currently, I'm only releasing monthly updates for en.wikipedia.org and quarterly updates for the other main wikis.
Note that the thumbs are uploaded as sqlite databases. I chose sqlite because tarballs are slower to extract, update, and query. The database schema is fairly basic, and you should be able to retrieve any file with a sqlite library.
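As a starting point, here is a small Python sketch that opens one of the sqlite files and lists its tables and columns, so you can see the schema before writing retrieval queries. It deliberately does not assume any table or column names from the thumb databases; describe_database is just an illustrative helper, and the file path is whatever you downloaded.

```python
import sqlite3
import sys


def describe_database(path):
    """Return {table_name: [column_name, ...]} for every table in a sqlite file."""
    conn = sqlite3.connect(path)
    try:
        tables = [
            row[0]
            for row in conn.execute(
                "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
            )
        ]
        schema = {}
        for table in tables:
            # Quote the identifier so unusual table names do not break the PRAGMA.
            quoted = '"{}"'.format(table.replace('"', '""'))
            columns = conn.execute("PRAGMA table_info({})".format(quoted)).fetchall()
            # Each row is (cid, name, type, notnull, dflt_value, pk); keep the name.
            schema[table] = [column[1] for column in columns]
        return schema
    finally:
        conn.close()


if __name__ == "__main__" and len(sys.argv) > 1:
    for table, columns in describe_database(sys.argv[1]).items():
        print(table, columns)
```

Once the table holding the image data is identified, a plain SELECT on its blob column through the same sqlite3 connection should be enough to write a file back to disk.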