Hi alex. I added some notes below based on my experience. (I'm the
developer for XOWA, which generates offline wikis from the Wikimedia XML
dumps.) Feel free to follow up on-list or off-list if you are interested.
Thanks.
On Mon, Sep 21, 2015 at 3:09 PM, v0id null <v0idnull(a)gmail.com> wrote:
#1: mwdumper has not been updated in a very long time. I did try to use
it, but it did not seem to work properly. I don't entirely remember what
the problem was, but I believe it was related to schema incompatibility.
xml2sql comes with a warning about having to rebuild links. Considering
that I'm just in a command line and passing in page IDs manually, do I
really need to worry about it? I'd be thrilled not to have to reinvent
the wheel here.
#2: Is there some way to figure it out? As I showed in a previous reply,
the template that it can't find is there in the page table.
As Brion indicated, you need to strip the namespace name. The XML dump
also has a "namespaces" node near the beginning. It lists every namespace
in the wiki with "name" and "ID". You can use a rule like "if the title
starts with a namespace and a colon, strip it". Hence, a title like
"Template:Date" starts with "Template:" and goes into the page table with
a title of just "Date" and a namespace of "10" (the namespace ID for
"Template").
#3: Those Lua modules, are they stock modules included with the MediaWiki
software, or something much more custom? If the latter, are they
available to download somewhere?
Yes, these are articles with a title starting with "Module:". They will
be in the pages-articles.xml.bz2 dump. You should make sure you have
Scribunto set up on your wiki, or else it won't use them. See the
Scribunto extension documentation on mediawiki.org.
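If you want to confirm the modules are actually present in your dump, a
quick Python scan works (a rough sketch; it assumes the standard export
format, where each page's title is a <title> element):

import bz2
import xml.etree.ElementTree as ET

count = 0
with bz2.open("pages-articles.xml.bz2") as f:
    for _, elem in ET.iterparse(f):
        # Tags are namespace-qualified, e.g. "{...export-0.10/}title".
        if elem.tag.endswith("}title") and elem.text \
                and elem.text.startswith("Module:"):
            count += 1
        elem.clear()  # keep memory use flat on a large dump

print(count, "Module: pages found")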
#4: I'm not any expert on MediaWiki, but it seems that the titles in the
XML dump need to be formatted, mainly replacing spaces with underscores.
Yes, surprisingly, that is the only change you'll need to make: replace
spaces with underscores.
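In Python that is a one-liner (a trivial sketch; the page table's
page_title column stores the underscore form of the dump's <title> text):

def to_db_title(dump_title):
    # "Foo Bar" in the dump becomes "Foo_Bar" in page.page_title.
    return dump_title.replace(" ", "_")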
thanks for the response
--alex
On Mon, Sep 21, 2015 at 3:00 PM, Brion Vibber <bvibber(a)wikimedia.org>
wrote:
A few notes:
1) It sounds like you're recreating all the logic of importing a dump
into a SQL database, which may be introducing problems if you have bugs
in your code. For instance, you may be mistakenly treating namespaces as
text strings instead of numbers, or failing to escape things, or missing
something else. I would recommend using one of the many existing tools
for importing a dump, such as mwdumper or xml2sql:
https://www.mediawiki.org/wiki/Manual:Importing_XML_dumps#Using_mwdumper
2) Make sure you've got a dump that includes the templates and Lua
modules etc. It sounds like either you don't have the Template: pages or
your import process does not handle namespaces correctly.
3) Make sure you've got all the necessary extensions to replicate the
wiki you're using a dump from, such as Lua. Many templates on Wikipedia
call Lua modules, and won't work without them.
4) Not sure what "not web friendly" means regarding titles?
-- brion
On Mon, Sep 21, 2015 at 11:50 AM, v0id null <v0idnull(a)gmail.com> wrote:
> Hello Everyone,
>
> I've been trying to write a Python script that will take an XML dump
> and generate all HTML, using MediaWiki itself to handle all the
> parsing/processing, but I've run into a problem where all of the parsed
> output has warnings that templates couldn't be found. I'm not sure what
> I'm doing wrong.
>
> So I'll explain my steps:
>
> First I execute the SQL script maintenance/tables.sql
>
> Then I remove some indexes from the tables to speed up insertion.
>
> Finally I go through the XML, which will execute the following insert
> statements:
>
> 'insert into page
>   (page_id, page_namespace, page_title, page_is_redirect, page_is_new,
>    page_random, page_latest, page_len, page_content_model)
>   values (%s, %s, %s, %s, %s, %s, %s, %s, %s)'
>
> 'insert into text (old_id, old_text) values (%s, %s)'
>
> 'insert into recentchanges
>   (rc_id, rc_timestamp, rc_user, rc_user_text, rc_title, rc_minor,
>    rc_bot, rc_cur_id, rc_this_oldid, rc_last_oldid, rc_type, rc_source,
>    rc_patrolled, rc_ip, rc_old_len, rc_new_len, rc_deleted, rc_logid)
>   values (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s,
>    %s, %s, %s)'
>
> 'insert into revision
>   (rev_id, rev_page, rev_text_id, rev_user, rev_user_text,
>    rev_timestamp, rev_minor_edit, rev_deleted, rev_len, rev_parent_id,
>    rev_sha1)
>   values (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)'
>
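One field here that is easy to get wrong when hand-inserting revision
rows: MediaWiki stores rev_sha1 as the SHA-1 of the revision text encoded
in base 36 and left-padded to 31 characters, not as the usual hex digest.
A minimal Python sketch of that encoding:

import hashlib

BASE36 = "0123456789abcdefghijklmnopqrstuvwxyz"

def rev_sha1(text):
    # SHA-1 of the text, base-36 encoded and left-padded to 31 chars,
    # matching what MediaWiki stores in revision.rev_sha1.
    n = int(hashlib.sha1(text.encode("utf-8")).hexdigest(), 16)
    digits = ""
    while n:
        n, r = divmod(n, 36)
        digits = BASE36[r] + digits
    return digits.rjust(31, "0")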
> All IDs from the XML dump are kept. I noticed that the titles are not
> web friendly. Thinking this was the problem, I ran the
> maintenance/cleanupTitles.php script, but it didn't seem to fix
> anything.
>
> Doing this, I can now run the following PHP script:
> $id = 'some revision id';
> $rev = Revision::newFromId( $id );
> $titleObj = $rev->getTitle();
> $pageObj = WikiPage::factory( $titleObj );
>
> $context = RequestContext::newExtraneousContext($titleObj);
>
> $popts = ParserOptions::newFromContext($context);
> $pout = $pageObj->getParserOutput($popts);
>
> var_dump($pout);
>
> The mText property of $pout contains the parsed output, but it is full
> of stuff like this:
>
> <a href="/index.php?title=Template:Date&action=edit&redlink=1"
> class="new" title="Template:Date (page does not exist)">Template:Date</a>
>
>
> I feel like I'm missing a step here. I tried importing the
> templatelinks SQL dump, but it also did not fix anything. It also did
> not include any header or footer, which would be useful.
>
> Any insight or help is much appreciated, thank you.
>
> --alex
_______________________________________________
Wikitech-l mailing list
Wikitech-l(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l