Hi alex. I added some notes below based on my experience. (I'm the
developer for XOWA, which generates offline wikis from the Wikimedia XML
dumps.) Feel free to follow up on-list or off-list if you are interested.
Thanks.
On Mon, Sep 21, 2015 at 3:09 PM, v0id null <v0idnull(a)gmail.com> wrote:
#1: mwdumper has not been updated in a very long time. I did try to use
it, but it did not seem to work properly. I don't entirely remember what
the problem was, but I believe it was related to schema incompatibility.
xml2sql comes with a warning about having to rebuild links. Considering
that I'm just in a command line and passing in page IDs manually, do I
really need to worry about it? I'd be thrilled not to have to reinvent
the wheel here.
#2: Is there some way to figure it out? As I showed in a previous reply,
the template that it can't find is there in the page table.
As Brion indicated, you need to strip the namespace name. The XML dump
also has a "namespaces" node near the beginning. It lists every namespace
in the wiki with "name" and "ID". You can use a rule like "if the title
starts with a namespace and a colon, strip it". Hence, a title like
"Template:Date" starts with "Template:" and goes into the page table with
a title of just "Date" and a namespace of "10" (the namespace ID for
"Template").
#3: Those Lua modules, are they stock modules included with the MediaWiki
software, or something much more custom? If the latter, are they
available to download somewhere?
Yes, these are articles with a title starting with "Module:". They will
be in the pages-articles.xml.bz2 dump. You should make sure you have
Scribunto set up on your wiki, or else it won't use them. See the
Scribunto extension documentation on mediawiki.org.
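If you want to confirm the modules are actually present in your dump, a
quick Python scan works (a rough sketch; it assumes the standard export
format, where each page's title is a <title> element):

import bz2
import xml.etree.ElementTree as ET

count = 0
with bz2.open("pages-articles.xml.bz2") as f:
    for _, elem in ET.iterparse(f):
        # Tags are namespace-qualified, e.g. "{...export-0.10/}title".
        if elem.tag.endswith("}title") and elem.text \
                and elem.text.startswith("Module:"):
            count += 1
        elem.clear()  # keep memory use flat on a large dump

print(count, "Module: pages found")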
#4: I'm not any expert on MediaWiki, but it seems that the titles in the
XML dump need to be formatted, mainly replacing spaces with underscores.
Yes, surprisingly, that is the only change you'll need to make: replace
spaces with underscores.
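In Python that is a one-liner (a trivial sketch; the page table's
page_title column stores the underscore form of the dump's <title> text):

def to_db_title(dump_title):
    # "Foo Bar" in the dump becomes "Foo_Bar" in page.page_title.
    return dump_title.replace(" ", "_")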
thanks for the response
--alex
On Mon, Sep 21, 2015 at 3:00 PM, Brion Vibber <bvibber(a)wikimedia.org>
wrote:
A few notes:
1) It sounds like you're recreating all the logic of importing a dump
into a SQL database, which may be introducing problems if you have bugs
in your code. For instance, you may be mistakenly treating namespaces as
text strings instead of numbers, or failing to escape things, or missing
something else. I would recommend using one of the many existing tools
for importing a dump, such as mwdumper or xml2sql:
https://www.mediawiki.org/wiki/Manual:Importing_XML_dumps#Using_mwdumper
2) Make sure you've got a dump that includes the templates and Lua
modules etc. It sounds like either you don't have the Template: pages or
your import process does not handle namespaces correctly.
3) Make sure you've got all the necessary extensions to replicate the
wiki you're using a dump from, such as Lua. Many templates on Wikipedia
call Lua modules, and won't work without them.
4) Not sure what "not web friendly" means regarding titles?
-- brion
On Mon, Sep 21, 2015 at 11:50 AM, v0id null <v0idnull(a)gmail.com> wrote:
> Hello Everyone,
>
> I've been trying to write a Python script that will take an XML dump
> and generate all HTML, using MediaWiki itself to handle all the
> parsing/processing, but I've run into a problem where all of the parsed
> output has warnings that templates couldn't be found. I'm not sure what
> I'm doing wrong.
>
> So I'll explain my steps:
>
> First I execute the SQL script maintenance/tables.sql
>
> Then I remove some indexes from the tables to speed up insertion.
>
> Finally I go through the XML, which will execute the following insert
> statements:
>
> 'insert into page
>   (page_id, page_namespace, page_title, page_is_redirect, page_is_new,
>    page_random, page_latest, page_len, page_content_model)
>   values (%s, %s, %s, %s, %s, %s, %s, %s, %s)'
>
> 'insert into text (old_id, old_text) values (%s, %s)'
>
> 'insert into recentchanges
>   (rc_id, rc_timestamp, rc_user, rc_user_text, rc_title, rc_minor,
>    rc_bot, rc_cur_id, rc_this_oldid, rc_last_oldid, rc_type, rc_source,
>    rc_patrolled, rc_ip, rc_old_len, rc_new_len, rc_deleted, rc_logid)
>   values (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s,
>    %s, %s, %s)'
>
> 'insert into revision
>   (rev_id, rev_page, rev_text_id, rev_user, rev_user_text,
>    rev_timestamp, rev_minor_edit, rev_deleted, rev_len, rev_parent_id,
>    rev_sha1)
>   values (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)'
>
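One field here that is easy to get wrong when hand-inserting revision
rows: MediaWiki stores rev_sha1 as the SHA-1 of the revision text encoded
in base 36 and left-padded to 31 characters, not as the usual hex digest.
A minimal Python sketch of that encoding:

import hashlib

BASE36 = "0123456789abcdefghijklmnopqrstuvwxyz"

def rev_sha1(text):
    # SHA-1 of the text, base-36 encoded and left-padded to 31 chars,
    # matching what MediaWiki stores in revision.rev_sha1.
    n = int(hashlib.sha1(text.encode("utf-8")).hexdigest(), 16)
    digits = ""
    while n:
        n, r = divmod(n, 36)
        digits = BASE36[r] + digits
    return digits.rjust(31, "0")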
> All IDs from the XML dump are kept. I noticed that the titles are not
> web friendly. Thinking this was the problem, I ran the
> maintenance/cleanupTitles.php script, but it didn't seem to fix
> anything.
>
> Doing this, I can now run the following PHP script:
> $id = 'some revision id';
> $rev = Revision::newFromId( $id );
> $titleObj = $rev->getTitle();
> $pageObj = WikiPage::factory( $titleObj );
>
> $context = RequestContext::newExtraneousContext($titleObj);
>
> $popts = ParserOptions::newFromContext($context);
> $pout = $pageObj->getParserOutput($popts);
>
> var_dump($pout);
>
> The mText property of $pout contains the parsed output, but it is full
> of stuff like this:
>
> <a href="/index.php?title=Template:Date&action=edit&redlink=1"
> class="new" title="Template:Date (page does not exist)">Template:Date</a>
>
>
> I feel like I'm missing a step here. I tried importing the
> templatelinks SQL dump, but it also did not fix anything. It also did
> not include any header or footer, which would be useful.
>
> Any insight or help is much appreciated, thank you.
>
> --alex
_______________________________________________
Wikitech-l mailing list
Wikitech-l(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l