Re: [Wikitech-l] Importing XML Dumps - templates not working

21 Sep 2015

Note that Kiwix's "mw-offliner" script (
http://www.openzim.org/wiki/Build_your_ZIM_file#MWoffliner ) does a
pretty good job of converting a bunch of wiki pages to HTML, although
it starts from a live wiki instance (and a properly-configured Parsoid
pointed at it) rather than an XML dump.  Zim-format dumps (for
example, from https://dumps.wikimedia.org/other/kiwix/zim/wikipedia/ )
can also be unpacked into a directory tree of HTML files.

There are also the "HTML dumps" that the service team is involved
with.  The following links have more information:
https://phabricator.wikimedia.org/T88728
https://phabricator.wikimedia.org/T93396

Perhaps your use case could inform the ongoing design of that service.
 --scott

On Mon, Sep 21, 2015 at 3:12 PM, Brion Vibber <bvibber(a)wikimedia.org> wrote:
...
  On Mon, Sep 21, 2015 at 12:09 PM, v0id null <v0idnull(a)gmail.com> wrote:

  #1: mwdumper has not been updated in a very long time. I did try to use it,
 but it did not seem to work properly. I don't entirely remember what the
 problem was but I believe it was related to schema incompatibility. xml2sql
 comes with a warning about having to rebuild links. Considering that I'm
 just in a command line and passing in page IDs manually, do I really need
 to worry about it? I'd be thrilled not to have to reinvent the wheel here.

 You would need to rebuild link tables if you need them for either mwdumper
 or xml2sql. For your case it doesn't sound like you'd need them.

  #2: Is there some way to figure it out? As I showed in a previous reply,
 the template that it can't find is there in the page table.

 As noted in a previous reply, your import process is buggy and the page
 record's page_title field is incorrect, so it cannot be found. You need to
 correctly parse the incoming title into its namespace and base title
 portions, then store the namespace as a numeric ID in page_namespace and
 the base title as text in page_title.

  #3: Those Lua modules, are they stock modules included with the MediaWiki
 software, or something much more custom? If the latter, are they available
 to download somewhere?

 They are on the wiki, in the 'Module' namespace. Should be included with a
 complete dump. I have no idea about the 'articles' dump, but I would assume
 it *should* include them.

 #4: I'm no expert on MediaWiki, but it seems that the titles in
 the XML dump need to be formatted, mainly by replacing spaces with
 underscores.

 That's another thing your import process needs to do. I recommend using
 existing code that already has all this logic. :)
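
For illustration only, here is a minimal Python sketch of the title handling
described in replies #2 and #4: splitting a dump title into a numeric
namespace ID and a base title, and replacing spaces with underscores. The
function name split_title and the small NAMESPACES map are hypothetical; a
real importer would read the full namespace list from the siteinfo block at
the top of the dump, and would also need the case and alias handling that
MediaWiki's own code performs. It is not a substitute for the existing
import code recommended above.

```python
# Hypothetical subset of a wiki's namespace map. The real names and IDs
# should be read from the <siteinfo><namespaces> block of the XML dump.
NAMESPACES = {
    "Talk": 1,
    "User": 2,
    "Template": 10,
    "Module": 828,
}


def split_title(full_title, namespaces=NAMESPACES):
    """Return (page_namespace, page_title) for a title taken from a dump."""
    prefix, sep, rest = full_title.partition(":")
    ns_id = namespaces.get(prefix.strip().replace("_", " "))
    if sep and ns_id is not None and rest.strip():
        base = rest
    else:
        # No recognised namespace prefix: the page is in the main
        # namespace (ID 0) and the whole string is the base title.
        ns_id, base = 0, full_title
    # page_title is stored with underscores in place of spaces.
    return ns_id, "_".join(base.split())


if __name__ == "__main__":
    for title in ("Template:Infobox person", "Star Trek: The Next Generation"):
        print(title, "->", split_title(title))
```

Note that a colon alone does not imply a namespace: the second title above
stays in the main namespace because "Star Trek" is not a known prefix.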

 -- brion

-- 
(http://cscott.net)
