#1: mwdumper has not been updated in a very long time. I did try to use it,
but it did not seem to work properly. I don't entirely remember what the
problem was but I believe it was related to schema incompatibility. xml2sql
comes with a warning about having to rebuild links. Considering that I'm
just in a command line and passing in page IDs manually, do I really need
to worry about it? I'd be thrilled not to have to reinvent the wheel here.
#2: Is there some way to figure it out? As I showed in a previous reply,
the template that it can't find is there in the page table.
#3: Those Lua modules, are they stock modules included with the MediaWiki
software, or something much more custom? If the latter, are they available
to download somewhere?
#4: I'm no expert on MediaWiki, but it seems that the titles in the XML
dump need to be formatted, mainly by replacing spaces with underscores.
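To make #4 concrete, here is a minimal sketch of the title formatting I mean, assuming the wiki uses MediaWiki's default first-letter capitalization. The helper name to_db_key is hypothetical, not a MediaWiki function, and MediaWiki's own Title class does considerably more than this.

```python
import re


def to_db_key(title_text: str) -> str:
    """Approximate the display-title -> page_title conversion.

    Hypothetical helper: covers only whitespace handling and first-letter
    capitalization, not the full logic of MediaWiki's Title class.
    """
    # Treat underscores and whitespace alike, collapse runs, trim the ends.
    key = re.sub(r"[\s_]+", " ", title_text).strip()
    # Spaces are stored as underscores in the page table.
    key = key.replace(" ", "_")
    # Assumption: the wiki capitalizes the first letter of every title.
    return key[:1].upper() + key[1:]
```

This would be applied to the part of the title after any namespace prefix, since page_title is stored without the prefix.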
Thanks for the response.
--alex
On Mon, Sep 21, 2015 at 3:00 PM, Brion Vibber <bvibber(a)wikimedia.org> wrote:
A few notes:
1) It sounds like you're recreating all the logic of importing a dump into
a SQL database, which may be introducing problems if you have bugs in your
code. For instance you may be mistakenly treating namespaces as text
strings instead of numbers, or failing to escape things, or missing
something else. I would recommend using one of the many existing tools for
importing a dump, such as mwdumper or xml2sql:
https://www.mediawiki.org/wiki/Manual:Importing_XML_dumps#Using_mwdumper
2) Make sure you've got a dump that includes the templates and Lua modules
etc. It sounds like either you don't have the Template: pages or your
import process does not handle namespaces correctly.
3) Make sure you've got all the necessary extensions to replicate the wiki
you're using a dump from, such as Lua. Many templates on Wikipedia call Lua
modules, and won't work without them.
4) Not sure what "not web friendly" means regarding titles?
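On the namespace handling mentioned in 1) and 2), a rough sketch of what treating namespaces as numbers could look like in an import script. It assumes the dump's <siteinfo> block lists each namespace as <namespace key="N">Name</namespace>; the function names and the SAMPLE fragment are made up for illustration.

```python
import xml.etree.ElementTree as ET


def namespace_ids(siteinfo_xml: str) -> dict:
    """Map namespace names to numeric IDs from a <siteinfo> fragment."""
    root = ET.fromstring(siteinfo_xml)
    ids = {}
    for elem in root.iter():
        # Compare the local tag name so any export schema URI is ignored.
        if elem.tag.rsplit("}", 1)[-1] == "namespace":
            # The main namespace has no name; store it under "".
            ids[elem.text or ""] = int(elem.get("key"))
    return ids


def split_title(full_title: str, ids: dict) -> tuple:
    """Return (namespace_id, title_without_prefix)."""
    prefix, sep, rest = full_title.partition(":")
    if sep and prefix in ids:
        return ids[prefix], rest
    # No known prefix: the whole string is a main-namespace title.
    return 0, full_title


# Illustrative fragment only, not copied from a real dump.
SAMPLE = """<siteinfo><namespaces>
  <namespace key="0" case="first-letter" />
  <namespace key="10" case="first-letter">Template</namespace>
  <namespace key="828" case="case-sensitive">Module</namespace>
</namespaces></siteinfo>"""
```

With this, "Template:Date" would be stored as page_namespace 10 and page_title "Date", rather than as a main-namespace page whose title contains the prefix.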
-- brion
On Mon, Sep 21, 2015 at 11:50 AM, v0id null <v0idnull(a)gmail.com> wrote:
Hello Everyone,
I've been trying to write a Python script that will take an XML dump and
generate all HTML, using MediaWiki itself to handle all the
parsing/processing, but I've run into a problem where all the parsed
output has warnings that templates couldn't be found. I'm not sure what
I'm doing wrong.
So I'll explain my steps:
First I execute the SQL script maintenance/tables.sql
Then I remove some indexes from the tables to speed up insertion.
Finally I go through the XML, executing the following insert statements:
'insert into page
(page_id, page_namespace, page_title, page_is_redirect, page_is_new,
page_random, page_latest, page_len, page_content_model)
values (%s, %s, %s, %s, %s, %s, %s, %s, %s)'
'insert into text (old_id, old_text) values (%s, %s)'
'insert into recentchanges
(rc_id, rc_timestamp, rc_user, rc_user_text, rc_title, rc_minor, rc_bot,
rc_cur_id, rc_this_oldid, rc_last_oldid, rc_type, rc_source, rc_patrolled,
rc_ip, rc_old_len, rc_new_len, rc_deleted, rc_logid)
values (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s,
%s, %s)'
'insert into revision
(rev_id, rev_page, rev_text_id, rev_user, rev_user_text, rev_timestamp,
rev_minor_edit, rev_deleted, rev_len, rev_parent_id, rev_sha1)
values (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)'
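As an illustration of how the parameters for the page insert might be assembled, here is a simplified sketch. The helper name page_params and its arguments are hypothetical; it assumes the namespace has already been resolved to a number and the title is already in database form (no prefix, underscores for spaces).

```python
import random

PAGE_INSERT = (
    "insert into page (page_id, page_namespace, page_title, "
    "page_is_redirect, page_is_new, page_random, page_latest, page_len, "
    "page_content_model) values (%s, %s, %s, %s, %s, %s, %s, %s, %s)"
)


def page_params(page_id, ns_id, db_key, latest_rev_id, text,
                is_redirect=False, is_new=False, content_model="wikitext"):
    """Build the parameter tuple for PAGE_INSERT (hypothetical helper)."""
    return (
        int(page_id),
        int(ns_id),                  # numeric namespace, e.g. 10 for Template
        db_key,                      # title without prefix, underscores for spaces
        int(is_redirect),
        int(is_new),
        random.random(),             # page_random: a float in [0, 1)
        int(latest_rev_id),
        len(text.encode("utf-8")),   # page_len counts bytes, not characters
        content_model,
    )
```

The tuple is meant to be passed as the second argument to cursor.execute(PAGE_INSERT, ...) on a driver that uses %s placeholders, so that the driver does the escaping and the numeric columns stay numeric.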
All IDs from the XML dump are kept. I noticed that the titles are not web
friendly. Thinking this was the problem I ran the
maintenance/cleanupTitles.php script but it didn't seem to fix anything.
Doing this, I can now run the following PHP script:
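$id = 'some revision id'; // placeholder for a rev_id kept from the dump
// Load the revision and the page it belongs to.
$rev = Revision::newFromId( $id );
$titleObj = $rev->getTitle();
$pageObj = WikiPage::factory( $titleObj );
// Build parser options from a context tied to this title.
$context = RequestContext::newExtraneousContext( $titleObj );
$popts = ParserOptions::newFromContext( $context );
// Parse the page and dump the resulting ParserOutput object.
$pout = $pageObj->getParserOutput( $popts );
var_dump( $pout );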
The mText property of $pout contains the parsed output, but it is full of
stuff like this:
<a href="/index.php?title=Template:Date&action=edit&redlink=1" class="new"
title="Template:Date (page does not exist)">Template:Date</a>
I feel like I'm missing a step here. I tried importing the templatelinks
SQL dump, but it also did not fix anything. It also did not include any
header or footer which would be useful.
Any insight or help is much appreciated, thank you.
--alex
_______________________________________________
Wikitech-l mailing list
Wikitech-l(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l