Hi Kent.

I'll respond to a few points. The rest are outside my knowledge or scope (for example, mwxml2sql and Debian).

Hope this helps, and good luck on wp-mirror.

----

> interlanguage links have been removed to the wikidata project, the rendering of which requires mediawiki-1.21+;
You will need a local instance of http://www.wikidata.org . You could probably build it using the files from here: http://dumps.wikimedia.org/wikidatawiki/

Each wiki would also need the wikibase extension: https://www.mediawiki.org/wiki/Extension:Wikibase.

Note that wikidata is also being used in infoboxes. For example, https://simple.wikipedia.org/w/index.php?title=Google&action=edit now has multiple {{#property}} statements.
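If you want to gauge how much a given page depends on wikidata, a rough sketch like the following can list the {{#property}} calls in its wikitext. The regular expression is a simplification of my own (it ignores nesting and unusual whitespace), and the sample text and property IDs are made up for illustration, not taken from the Google page.

```python
import re

# Simplified pattern: captures the first argument of {{#property:...}}.
# It does not handle nested templates inside the argument.
PROPERTY_RE = re.compile(r"\{\{\s*#property\s*:\s*([^|}]+)", re.IGNORECASE)


def find_property_calls(wikitext):
    """Return the first argument of every {{#property:...}} call found."""
    return [match.strip() for match in PROPERTY_RE.findall(wikitext)]


# Hypothetical wikitext, for illustration only.
sample = "Founded by {{#property:P112}} in {{#property:P571}}."
print(find_property_calls(sample))
```

A page that returns a non-empty list here will not render completely without access to a wikibase repository.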

> infoboxes now require the scribunto extension which requires mediawiki-1.20+
I believe most of the wikis have moved the infobox generation from Templates to Modules. They've moved a lot of other functionality as well (for example: references and message boxes).

The Scribunto extension is at https://www.mediawiki.org/wiki/Extension:Scribunto . It transforms a {{#invoke:Module_name|function_name|arguments}} into the appropriate text.
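To make the shape of an #invoke call concrete, here is a minimal sketch that splits one into module name, function name and arguments. This only illustrates the syntax; it is not how Scribunto itself parses or executes anything (the real extension runs Lua code inside MediaWiki's parser). The function name parse_invoke is my own, and the sketch assumes no nested templates or pipes inside arguments.

```python
def parse_invoke(call):
    """Split '{{#invoke:Module|function|arg1|arg2}}' into its parts.

    Simplification: arguments containing nested templates or
    piped links are not handled.
    """
    text = call.strip()
    if not (text.startswith("{{") and text.endswith("}}")):
        raise ValueError("not a template-style call")
    inner = text[2:-2].strip()
    prefix, sep, rest = inner.partition(":")
    if not sep or prefix.strip().lower() != "#invoke":
        raise ValueError("not an #invoke call")
    parts = [part.strip() for part in rest.split("|")]
    if len(parts) < 2 or not parts[0] or not parts[1]:
        raise ValueError("module and function names are required")
    return parts[0], parts[1], parts[2:]


# Hypothetical call, for illustration only.
module, function, args = parse_invoke("{{#invoke:Infobox|infobox|title=Google}}")
print(module, function, args)
```

Without Scribunto installed, a call like this has nothing to execute it, so the infobox does not render.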

> category - dump files now have 5 fields, whereas the database schema has 6 fields;

I believe they removed the cat_hidden field, which was effectively deprecated. A category's hidden status is saved in the page_props table.
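As a sketch of how the hidden status could be looked up, the snippet below queries page_props for the 'hiddencat' property. It assumes the usual MediaWiki column names (pp_page, pp_propname, pp_value) and uses an in-memory SQLite table as a stand-in for the real MySQL database; the page IDs and rows are invented for illustration.

```python
import sqlite3

# In-memory stand-in for the MediaWiki page_props table (assumed columns).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE page_props (pp_page INTEGER, pp_propname TEXT, pp_value TEXT)"
)
# Hypothetical rows: page 101 is a hidden category, page 102 is not.
conn.executemany(
    "INSERT INTO page_props VALUES (?, ?, ?)",
    [(101, "hiddencat", ""), (102, "defaultsort", "Google")],
)


def is_hidden_category(connection, page_id):
    """Return True if the page has a 'hiddencat' entry in page_props."""
    row = connection.execute(
        "SELECT 1 FROM page_props "
        "WHERE pp_page = ? AND pp_propname = 'hiddencat' LIMIT 1",
        (page_id,),
    ).fetchone()
    return row is not None


print(is_hidden_category(conn, 101), is_hidden_category(conn, 102))
```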

>  The large image dump tarballs are now a year old. 

I raised this issue back in July. See here for Kevin Day's response (from your.org): http://lists.wikimedia.org/pipermail/xmldatadumps-l/2013-July/000861.html
I think there's still some infrastructure work that needs to be done on the Wikimedia side.

> We are beginning to see thumb dumps from the xowa project.

I've been uploading thumbs for the major wikis to archive.org. See here for a summary: https://archive.org/search.php?query=xowa

I'm planning to upload all thumbs for all the major languages that are listed as > 200,000 on https://en.wikipedia.org/wiki/Main_Page . There are roughly 27 languages listed there. My progress has been about 1 wiki per week (I'm also uploading sister wikis for a given language). I've done 4 so far. I'm hoping to be done with the other 23 sometime by mid-year.

At about that time, I'm hoping to have a more automated way of generating updates. Currently, I'm only releasing monthly updates for en.wikipedia.org and quarterly updates for the other main wikis.

Note that the thumbs are uploaded as sqlite databases. I chose sqlite because tarballs are slower to extract, update, and query. The database schema is fairly basic, and you should be able to retrieve any file with a sqlite library.
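Since the schema is not spelled out here, a reasonable first step is to inspect a downloaded database before writing any retrieval code. The sketch below lists every table with its columns using Python's built-in sqlite3 module. It makes no assumption about the actual table or column names; the file name in the usage comment is hypothetical.

```python
import sqlite3


def list_tables(db_path):
    """Return {table_name: [(column_name, column_type), ...]} for a SQLite file."""
    conn = sqlite3.connect(db_path)
    try:
        names = [
            row[0]
            for row in conn.execute(
                "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
            )
        ]
        tables = {}
        for name in names:
            # PRAGMA table_info rows: (cid, name, type, notnull, dflt_value, pk)
            columns = conn.execute(f'PRAGMA table_info("{name}")').fetchall()
            tables[name] = [(col[1], col[2]) for col in columns]
        return tables
    finally:
        conn.close()


# Usage (hypothetical file name):
# for table, columns in list_tables("thumbs.sqlite3").items():
#     print(table, columns)
```

Once the table holding the binary data is identified, a plain SELECT on it should be enough to pull out an individual thumb.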



On Fri, Jan 10, 2014 at 2:43 AM, wp mirror <wpmirrordev@gmail.com> wrote:
Dear Ariel,

Happy New Year.  I am gearing up for wp-mirror-0.7.  To that end, I would like to list some issues that I see; and I would like to offer my help in solving them.

0) Problem Statements

0.1) Page Rendering.  Wp-mirror-0.6 works well in the sense that it builds a faithful mirror of any of your wikis.  However, during 2013 the rendering of pages eroded materially.  For example,

     o interlanguage links have vanished both from rendered pages and from dump files;
     o infoboxes are no longer rendered;
     o most transclusions now render as redlinks even though the templates are easily found in the underlying database; etc.

I understand that this erosion occurred because wp-mirror-0.6 still uses mediawiki-1.19, whereas WMF has moved on to mediawiki-1.23.  For example, I understand that:

     o interlanguage links have been removed to the wikidata project, the rendering of which requires mediawiki-1.21+;
     o infoboxes now require the scribunto extension which requires mediawiki-1.20+

0.2) Database Schema.  Some differences in database schema have appeared.

     o category - dump files now have 5 fields, whereas the database schema has 6 fields;
     o externallinks - dump files now have 4 fields, whereas the database schema has 3 fields.

Loading these two tables generates the error message:  ``Column count doesn't match value at row 1.''

0.3) Version Lifecycle.  According to <http://www.mediawiki.org/wiki/Version_lifecycle> mediawiki 1.23 LTS is slated for May 2014.  However, the Debian packaging team is silent as to their plans for a transition from mediawiki-1.19 LTS to mediawiki-1.23 LTS.

0.4) Image Dumps.  The large image dump tarballs are now a year old.  This means that, while wp-mirror still downloads the bulk of its images from these tarballs, there are a growing number that must be downloaded individually from WMF.

0.5) Thumbs.  One person has asked me if dump files of thumbs could be made available. We are beginning to see thumb dumps from the xowa project.

0.6) IPv6.  I am glad to see that <gerrit.wikimedia.org> has an IPv6 address.  However, <bastion.wmflabs.org> still does not.  My internal network is IPv6 only.

1) mwxml2sql

This utility from Ariel Glenn has proved invaluable to the wp-mirror project. This utility, together with MySQL 5.5 fast index creation, allows wp-mirror to build mirrors much faster than before (80% less time). 

1.1) Need for update.  According to its version information, mwxml2sql may only be valid through mediawiki-1.21.

(shell)$ mwxml2sql --version
mwxml2sql 0.0.2
Supported input schema versions: 0.4 through 0.8.
Supported output MediaWiki versions: 1.5 through 1.21.

Since I am looking forward to mediawiki-1.23 LTS (see below), I would like to know if mwxml2sql should be updated.

1.2) Help Offer.  If mwxml2sql does need updating, I would be happy to help with this; and to package it for Debian as I have done before. Perhaps we could call it mwxml2sql-0.0.3.

2) mediawiki-1.23 LTS.  

2.1) Vision. I would like wp-mirror-0.7 to be able to build a mirror that serves pages that look no different than those served by WMF.

2.2) DEB package.  To that end, I am thinking of packaging mediawiki-1.23 together with the extensions needed for rendering WMF wikis with wikidata content, infoboxes, math, transclusions, etc.   Given WMF's ``continuous integration'' development model, I would like to be able to automatically generate a tarball and DEB package each time WMF pushes an update to its servers.

2.3) Debian package repository.  Such a DEB package would be distributed with wp-mirror. In preparation for this, I have set up a Debian package repository at <http://download.savannah.gnu.org/releases/wp-mirror/>.  It is currently used to distribute wp-mirror-0.6 and an unstable version of wp-mirror-0.7.  Home page <http://www.nongnu.org/wp-mirror/>.

2.4) Help Offer.  I am happy to do most of this work myself.  However, I will need some guidance on interacting with the appropriate GIT repositories.  I hope that you can put me in touch with someone involved in the ``continuous integration'' process.

3) Media dumps

I am thinking that updating the image dumps annually would be adequate.  Including thumbs in those dumps would materially assist the off-line community.  I could easily update wp-mirror-0.7 to give the user a choice (no media files, thumbs only, full size media files).

Sincerely Yours,
Kent

