Hi Kent.
I'll respond to a few points. The rest are outside my knowledge or scope
(for example, mwxml2sql and Debian).
Hope this helps, and good luck on wp-mirror.
----
interlanguage links have been removed to the wikidata project, the
rendering of which requires mediawiki-1.21+;
You will need a local instance of http://www.wikidata.org . You could
probably build it using the files from here:
http://dumps.wikimedia.org/wikidatawiki/
Each wiki would also need the Wikibase extension:
https://www.mediawiki.org/wiki/Extension:Wikibase .
Note that wikidata is also being used in infoboxes. For example,
https://simple.wikipedia.org/w/index.php?title=Google&action=edit now has
multiple {{#property}} statements.
infoboxes now require the scribunto extension which requires
mediawiki-1.20+
I believe most of the wikis have moved the infobox generation from
Templates to Modules. They've moved a lot of other functionality as well
(for example: references and message boxes).
The Scribunto extension is at
https://www.mediawiki.org/wiki/Extension:Scribunto . It transforms a
{{#invoke:Module_name|function_name|arguments}} into the appropriate text.
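To get a feel for how much a given page depends on Scribunto, you could scan its wikitext for #invoke calls. This is only a rough sketch: the function name find_invokes is mine, and the regular expression handles simple, non-nested calls only (real wikitext can nest templates inside arguments, which this will miss).

```python
import re

# Matches simple, non-nested {{#invoke:Module|function|args...}} calls.
# Assumption: arguments contain no nested {{...}} templates.
INVOKE_RE = re.compile(
    r"\{\{\s*#invoke:\s*([^|{}]+?)\s*\|\s*([^|{}]+?)\s*((?:\|[^{}]*)?)\}\}"
)


def find_invokes(wikitext):
    """Return a list of (module, function, [args]) for each simple #invoke call."""
    calls = []
    for match in INVOKE_RE.finditer(wikitext):
        module, function, rest = match.group(1), match.group(2), match.group(3)
        # 'rest' is either empty or starts with '|', so drop the first empty piece.
        args = [arg.strip() for arg in rest.split("|")[1:]] if rest else []
        calls.append((module, function, args))
    return calls
```

Running it over the text of a few templates should show which Modules a mirror needs before those pages can render.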
category - dump files now have 5 fields, whereas the database schema has
6 fields;
I believe they removed the cat_hidden field, which was effectively
deprecated. A category's hidden status is saved in page_props.
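If you need the hidden flag back, a lookup against page_props might look like the sketch below. It rests on several assumptions: I'm using an in-memory sqlite database as a stand-in for the mirror's MySQL database, the table and column names are the page and page_props columns as I remember them (please check them against your tables.sql), and I'm assuming the property is named 'hiddencat' and that categories live in namespace 14. The function names are made up for the example.

```python
import sqlite3


def hidden_categories(conn):
    """Return titles of category pages that carry the 'hiddencat' page property."""
    rows = conn.execute(
        "SELECT p.page_title FROM page AS p "
        "JOIN page_props AS pp ON pp.pp_page = p.page_id "
        "WHERE p.page_namespace = 14 "       # assumption: 14 is the Category namespace
        "AND pp.pp_propname = 'hiddencat' "  # assumption: property name for hidden categories
        "ORDER BY p.page_title"
    )
    return [title for (title,) in rows]


def demo_connection():
    """Build a tiny in-memory stand-in with only the columns the query needs."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE page (page_id INTEGER PRIMARY KEY,
                           page_namespace INTEGER, page_title TEXT);
        CREATE TABLE page_props (pp_page INTEGER, pp_propname TEXT, pp_value TEXT);
        INSERT INTO page VALUES (1, 14, 'Stubs'), (2, 14, 'Physics'), (3, 0, 'Google');
        INSERT INTO page_props VALUES (1, 'hiddencat', '');
        """
    )
    return conn
```

The same SELECT should carry over to MySQL more or less unchanged, since it only uses a plain join.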
The large image dump tarballs are now a year old.
I raised this issue back in July. See here for Kevin Day's response (from
your.org):
http://lists.wikimedia.org/pipermail/xmldatadumps-l/2013-July/000861.html
I think there's still some infrastructure work that needs to be done on the
Wikimedia side.
We are beginning to see thumb dumps from the xowa project.
I've been uploading thumbs for the major wikis to archive.org. See here for
a summary:
https://archive.org/search.php?query=xowa
I'm planning to upload all thumbs for all the major languages that are
listed as > 200,000 on https://en.wikipedia.org/wiki/Main_Page . There are
roughly 27 languages listed there. My progress has been about 1 wiki per
week (I'm also uploading sister wikis for a given language). I've done 4 so
far. I'm hoping to be done with the other 23 by mid-year.
At about that time, I'm hoping to have a more automated way of generating
updates. Currently, I'm only releasing monthly updates for
en.wikipedia.org and quarterly updates for the other main wikis.
Note that the thumbs are uploaded as sqlite databases. I chose sqlite because
tarballs are slower to extract / update / query. The database schema is
fairly basic, and you should be able to retrieve any file with a sqlite
library.
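As a starting point, here is a small sketch that opens one of the downloaded sqlite files and lists its tables and columns, so you can see the schema before writing any retrieval queries. It deliberately assumes nothing about table or column names; the function name describe_database and any file name you pass it are placeholders.

```python
import sqlite3


def describe_database(path):
    """Return {table_name: [column names]} for every table in a sqlite file."""
    conn = sqlite3.connect(path)
    try:
        tables = [
            name
            for (name,) in conn.execute(
                "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
            )
        ]
        schema = {}
        for table in tables:
            # PRAGMA table_info rows are (cid, name, type, notnull, dflt_value, pk).
            quoted = table.replace('"', '""')
            columns = conn.execute('PRAGMA table_info("{}")'.format(quoted))
            schema[table] = [row[1] for row in columns]
        return schema
    finally:
        conn.close()
```

Once the table holding the file data is identified, a plain SELECT on that table through the same sqlite3 module should be enough to pull out individual thumbs.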
On Fri, Jan 10, 2014 at 2:43 AM, wp mirror <wpmirrordev(a)gmail.com> wrote:
Dear Ariel,
Happy New Year. I am gearing up for wp-mirror-0.7. To that end, I would
like to list some issues that I see; and I would like to offer my help in
solving them.
0) Problem Statements
0.1) Page Rendering. Wp-mirror-0.6 works well in the sense that it builds
a faithful mirror of any of your wikis. However, during 2013 the rendering
of pages eroded materially. For example,
o interlanguage links have vanished both from rendered pages and from
dump files;
o infoboxes are no longer rendered;
o most transclusions now render as redlinks even though the templates
are easily found in the underlying database; etc.
I understand that this erosion occurred because wp-mirror-0.6 still uses
mediawiki-1.19, whereas WMF has moved on to mediawiki-1.23. For example, I
understand that:
o interlanguage links have been removed to the wikidata project, the
rendering of which requires mediawiki-1.21+;
o infoboxes now require the scribunto extension which requires
mediawiki-1.20+
0.2) Database Schema. Some differences in database schema have appeared.
o category - dump files now have 5 fields, whereas the database
schema has 6 fields;
o externallinks - dump files now have 4 fields, whereas the database
schema has 3 fields.
Loading these two tables generates the error message: ``Column count
doesn't match value count at row 1.''
0.3) Version Lifecycle. According to
<http://www.mediawiki.org/wiki/Version_lifecycle> mediawiki 1.23 LTS is
slated for May 2014. However, the Debian packaging team is silent as to
their plans for a transition from mediawiki-1.19 LTS to mediawiki-1.23 LTS.
0.4) Image Dumps. The large image dump tarballs are now a year old. This
means that, while wp-mirror still downloads the bulk of its images from
these tarballs, there are a growing number that must be downloaded
individually from WMF.
0.5) Thumbs. One person has asked me if dump files of thumbs could be
made available. We are beginning to see thumb dumps from the xowa project.
0.6) IPv6. I am glad to see that <gerrit.wikimedia.org> has an IPv6
address. However, <bastion.wmflabs.org> still does not. My internal
network is IPv6 only.
1) mwxml2sql
This utility from Ariel Glenn has proved invaluable to the wp-mirror
project. This utility, together with MySQL 5.5 fast index creation, allows
wp-mirror to build mirrors much faster than before (80% less time).
1.1) Need for update. According to its version information, mwxml2sql may
only be valid through mediawiki-1.21.
(shell)$ mwxml2sql --version
mwxml2sql 0.0.2
Supported input schema versions: 0.4 through 0.8.
Supported output MediaWiki versions: 1.5 through 1.21.
Since I am looking forward to mediawiki-1.23 LTS (see below), I would
like to know if mwxml2sql should be updated.
1.2) Help Offer. If mwxml2sql does need updating, I would be happy to
help with this; and to package it for Debian as I have done before. Perhaps
we could call it mwxml2sql-0.0.3.
2) mediawiki-1.23 LTS.
2.1) Vision. I would like wp-mirror-0.7 to be able to build a mirror that
serves pages that look no different than those served by WMF.
2.2) DEB package. To that end, I am thinking of packaging mediawiki-1.23
together with the extensions needed for rendering WMF wikis with wikidata
content, infoboxes, math, transclusions, etc. Given WMF's ``continuous
integration'' development model, I would like to be able to automatically
generate a tarball and DEB package each time WMF pushes an update to its
servers.
2.3) Debian package repository. Such a DEB package would be distributed
with wp-mirror. In preparation for this, I have set up a Debian package
repository at <http://download.savannah.gnu.org/releases/wp-mirror/>. It
is currently used to distribute wp-mirror-0.6 and an unstable version of
wp-mirror-0.7. Home page <http://www.nongnu.org/wp-mirror/>.
2.4) Help Offer. I am happy to do most of this work myself. However, I
will need some guidance on interacting with the appropriate GIT
repositories. I hope that you can put me in touch with someone involved in
the ``continuous integration'' process.
3) Media dumps
I am thinking that updating the image dumps annually would be adequate.
Including thumbs in those dumps would materially assist the off-line
community. I could easily update wp-mirror-0.7 to give the user a choice
(no media files, thumbs only, full size media files).
Sincerely Yours,
Kent
_______________________________________________
Xmldatadumps-l mailing list
Xmldatadumps-l(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l