this one. I believe this was to contain all latest revisions of all pages.
I do see that there are template pages in there; at least, there are pages
with a title in the format of Template:[some template name].
On Mon, Sep 21, 2015 at 2:53 PM, John <phoenixoverride(a)gmail.com> wrote:
What kind of dump are you working from?
On Mon, Sep 21, 2015 at 2:50 PM, v0id null <v0idnull(a)gmail.com> wrote:
Hello Everyone,
I've been trying to write a Python script that will take an XML dump and
generate all the HTML, using MediaWiki itself to handle all the
parsing/processing, but I've run into a problem where all the parsed output
has warnings that templates couldn't be found. I'm not sure what I'm doing
wrong.
So I'll explain my steps:
First, I execute the SQL script maintenance/tables.sql.
Then I remove some indexes from the tables to speed up insertion.
Finally, I go through the XML, executing the following insert statements
for each page:
'insert into page
(page_id, page_namespace, page_title, page_is_redirect, page_is_new,
page_random, page_latest, page_len, page_content_model)
values (%s, %s, %s, %s, %s, %s, %s, %s, %s)'
'insert into text (old_id, old_text) values (%s, %s)'
'insert into recentchanges
(rc_id, rc_timestamp, rc_user, rc_user_text, rc_title, rc_minor, rc_bot,
rc_cur_id, rc_this_oldid, rc_last_oldid, rc_type, rc_source, rc_patrolled,
rc_ip, rc_old_len, rc_new_len, rc_deleted, rc_logid)
values (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s,
%s, %s)'
'insert into revision
(rev_id, rev_page, rev_text_id, rev_user, rev_user_text, rev_timestamp,
rev_minor_edit, rev_deleted, rev_len, rev_parent_id, rev_sha1)
values (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)'
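For illustration, the XML pass described above could be sketched roughly as
follows in Python. This is a minimal sketch, not the script from this thread:
the function name iter_pages and the keys of the returned dict are
hypothetical, and it assumes a dump that carries a single revision per page
(if a page has several, only the last one seen is used). It strips the XML
namespace from tag names because the export schema version differs between
dumps.

```python
import io
import xml.etree.ElementTree as ET


def local(tag):
    """Drop the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(source):
    """Yield one dict per <page> element (hypothetical helper).

    `source` is a filename or a binary file-like object.
    """
    for _event, elem in ET.iterparse(source, events=("end",)):
        if local(elem.tag) != "page":
            continue
        # Direct children of <page>: title, ns, id, revision, ...
        fields = {local(child.tag): child for child in elem}
        rev = fields.get("revision")
        rev_fields = {local(c.tag): c for c in rev} if rev is not None else {}
        text_elem = rev_fields.get("text")
        yield {
            "page_id": int(fields["id"].text),
            "namespace": int(fields["ns"].text),
            "title": fields["title"].text,
            "rev_id": int(rev_fields["id"].text) if "id" in rev_fields else None,
            "text": (text_elem.text or "") if text_elem is not None else "",
        }
        # Free the memory held by the finished page subtree.
        elem.clear()


if __name__ == "__main__":
    # Tiny made-up dump fragment, only to show the call shape.
    demo = (
        b'<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">'
        b"<page><title>Template:Date</title><ns>10</ns><id>7</id>"
        b"<revision><id>42</id><text>example</text></revision></page>"
        b"</mediawiki>"
    )
    for row in iter_pages(io.BytesIO(demo)):
        print(row)
```

Each yielded dict carries the values that feed the page, text, and revision
inserts; streaming with iterparse keeps memory use flat on large dumps.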
All IDs from the XML dump are kept. I noticed that the titles are not web
friendly. Thinking this was the problem, I ran the
maintenance/cleanupTitles.php script, but it didn't seem to fix anything.
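One thing that may be worth checking here, offered as an assumption rather
than a confirmed diagnosis: MediaWiki stores page_title without the
namespace prefix and with underscores in place of spaces, while the numeric
namespace goes in page_namespace (10 for Template). The dump's title element
holds the full prefixed title, so inserting it unchanged could leave template
rows that title lookups never match. A minimal sketch of the conversion
follows; to_db_key and namespace_names are hypothetical names, and the
mapping is assumed to be built from the namespaces listed in the dump's
siteinfo header.

```python
def to_db_key(full_title, namespace_id, namespace_names):
    """Convert a dump title to the form stored in page_title.

    namespace_names maps namespace id -> localized name, for example
    {0: "", 10: "Template"} (hypothetical argument for this sketch).
    """
    prefix = namespace_names.get(namespace_id, "")
    title = full_title
    # Drop "Template:" and similar prefixes; the main namespace has none.
    if prefix and title.startswith(prefix + ":"):
        title = title[len(prefix) + 1:]
    # Database keys use underscores where display titles use spaces.
    return title.strip().replace(" ", "_")
```

With this, a page whose dump title is "Template:Date" in namespace 10 would
be inserted with page_namespace = 10 and page_title = "Date".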
Doing this, I can now run the following PHP script:
$id = 'some revision id';
$rev = Revision::newFromId( $id );
$titleObj = $rev->getTitle();
$pageObj = WikiPage::factory( $titleObj );
$context = RequestContext::newExtraneousContext($titleObj);
$popts = ParserOptions::newFromContext($context);
$pout = $pageObj->getParserOutput($popts);
var_dump($pout);
The mText property of $pout contains the parsed output, but it is full of
stuff like this:
<a href="/index.php?title=Template:Date&action=edit&redlink=1"
class="new" title="Template:Date (page does not exist)">Template:Date</a>
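To see how widespread the problem is, the parsed output can be scanned for
red links like the one above. The following is a rough sketch using only the
Python standard library; RedLinkCollector and missing_pages are hypothetical
names, and the " (page does not exist)" suffix is assumed to match the
English interface text shown in the sample.

```python
from html.parser import HTMLParser


class RedLinkCollector(HTMLParser):
    """Collect the titles of links rendered with class="new"."""

    SUFFIX = " (page does not exist)"  # assumed English tooltip suffix

    def __init__(self):
        super().__init__()
        self.missing = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr = dict(attrs)
        classes = (attr.get("class") or "").split()
        if "new" not in classes:
            return
        title = attr.get("title") or ""
        if title.endswith(self.SUFFIX):
            title = title[: -len(self.SUFFIX)]
        self.missing.append(title)


def missing_pages(html_text):
    """Return the sorted, de-duplicated titles of red-linked pages."""
    collector = RedLinkCollector()
    collector.feed(html_text)
    collector.close()
    return sorted(set(collector.missing))
```

Running missing_pages over the mText of each page gives a list of the
templates the parser could not resolve, which can then be compared against
the rows actually present in the page table.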
I feel like I'm missing a step here. I tried importing the templatelinks
SQL dump, but it did not fix anything either. The parsed output also does
not include any header or footer, which would be useful.
Any insight or help is much appreciated, thank you.
--alex
_______________________________________________
Wikitech-l mailing list
Wikitech-l(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l