Hello Everyone,
I've been trying to write a Python script that takes an XML dump and generates all of the HTML, using MediaWiki itself to handle all the parsing and processing, but I've run into a problem: all of the parsed output has warnings that templates couldn't be found. I'm not sure what I'm doing wrong.
So I'll explain my steps:
First I execute the SQL script maintenance/tables.sql.
Then I remove some indexes from the tables to speed up insertion.
Finally I go through the XML and execute the following insert statements (a rough sketch of this loop follows the statements below):
'insert into page (page_id, page_namespace, page_title, page_is_redirect, page_is_new, page_random, page_latest, page_len, page_content_model) values (%s, %s, %s, %s, %s, %s, %s, %s, %s)'
'insert into text (old_id, old_text) values (%s, %s)'
'insert into recentchanges (rc_id, rc_timestamp, rc_user, rc_user_text, rc_title, rc_minor, rc_bot, rc_cur_id, rc_this_oldid, rc_last_oldid, rc_type, rc_source, rc_patrolled, rc_ip, rc_old_len, rc_new_len, rc_deleted, rc_logid) values (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)'
'insert into revision (rev_id, rev_page, rev_text_id, rev_user, rev_user_text, rev_timestamp, rev_minor_edit, rev_deleted, rev_len, rev_parent_id, rev_sha1) values (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)'
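A minimal sketch of what that import loop can look like in Python, assuming a DB-API driver with %s parameter binding (pymysql here) and the export-0.10 dump schema; the filename, connection settings, and the placeholder values for the flag columns are illustrative, and the revision/recentchanges inserts are omitted for brevity:

import bz2
import xml.etree.ElementTree as ET

import pymysql  # assumption: any %s-paramstyle DB-API driver works the same way

# Assumption: adjust to the xmlns declared on the <mediawiki> root of your dump.
NS = "{http://www.mediawiki.org/xml/export-0.10/}"

conn = pymysql.connect(host="localhost", user="wiki", password="wiki",
                       db="wiki", charset="utf8mb4")
cur = conn.cursor()

with bz2.open("enwikinews-latest-pages-articles.xml.bz2") as fh:
    for _, elem in ET.iterparse(fh):
        if elem.tag != NS + "page":
            continue
        page_id = int(elem.findtext(NS + "id"))
        ns_id = int(elem.findtext(NS + "ns"))
        title = elem.findtext(NS + "title")
        rev = elem.find(NS + "revision")
        rev_id = int(rev.findtext(NS + "id"))
        text = rev.findtext(NS + "text") or ""

        # MediaWiki stores page_title without the namespace prefix and with
        # spaces replaced by underscores; <ns> is the numeric namespace ID.
        if ns_id != 0 and ":" in title:
            title = title.split(":", 1)[1]
        title = title.replace(" ", "_")

        cur.execute(
            "insert into page (page_id, page_namespace, page_title, page_is_redirect, "
            "page_is_new, page_random, page_latest, page_len, page_content_model) "
            "values (%s, %s, %s, %s, %s, %s, %s, %s, %s)",
            (page_id, ns_id, title, 0, 1, 0.5, rev_id, len(text), "wikitext"))
        cur.execute("insert into text (old_id, old_text) values (%s, %s)",
                    (rev_id, text))
        elem.clear()  # keep memory bounded while streaming the dump

conn.commit()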
All IDs from the XML dump are kept. I noticed that the titles are not web friendly. Thinking this was the problem, I ran the maintenance/cleanupTitles.php script, but it didn't seem to fix anything.
Doing this, I can now run the following PHP script:

$id = 'some revision id';
$rev = Revision::newFromId( $id );           // load the revision
$titleObj = $rev->getTitle();                // its page title
$pageObj = WikiPage::factory( $titleObj );   // the corresponding WikiPage
$context = RequestContext::newExtraneousContext( $titleObj );
$popts = ParserOptions::newFromContext( $context );
$pout = $pageObj->getParserOutput( $popts ); // parse with those options
var_dump( $pout );
The mText property of $pout contains the parsed output, but it is full of stuff like this:
<a href="/index.php?title=Template:Date&action=edit&redlink=1" class="new" title="Template:Date (page does not exist)">Template:Date</a>
I feel like I'm missing a step here. I tried importing the templatelinks SQL dump, but that did not fix anything either. The output also does not include any page header or footer, which would be useful.
Any insight or help is much appreciated, thank you.
--alex
What kind of dump are you working from?
http://dumps.wikimedia.org/enwikinews/latest/enwikinews-latest-pages-article...
This one. I believe it contains the latest revisions of all pages. I do see that there are template pages in there; at least, there are pages with titles in the format Template:[some template name].
For example, the above mentioned missing template does seem to exist from what I can tell:
mysql> select page_title from page where page_title='Template:Date';
+---------------+
| page_title    |
+---------------+
| Template:Date |
+---------------+
1 row in set (0.02 sec)
Your import process is definitely broken. page_title should be just 'Date', while page_namespace has the numeric key for template pages.
-- brion
A few notes:
1) It sounds like you're recreating all the logic of importing a dump into a SQL database, which may be introducing problems if you have bugs in your code. For instance you may be mistakenly treating namespaces as text strings instead of numbers, or failing to escape things, or missing something else. I would recommend using one of the many existing tools for importing a dump, such as mwdumper or xml2sql:
https://www.mediawiki.org/wiki/Manual:Importing_XML_dumps#Using_mwdumper
2) Make sure you've got a dump that includes the templates and lua modules etc. It sounds like either you don't have the Template: pages or your import process does not handle namespaces correctly.
3) Make sure you've got all the necessary extensions to replicate the wiki you're using a dump from, such as Lua. Many templates on Wikipedia call Lua modules, and won't work without them.
4) Not sure what "not web friendly" means regarding titles?
-- brion
#1: mwdumper has not been updated in a very long time. I did try to use it, but it did not seem to work properly. I don't entirely remember what the problem was but I believe it was related to schema incompatibility. xml2sql comes with a warning about having to rebuild links. Considering that I'm just in a command line and passing in page IDs manually, do I really need to worry about it? I'd be thrilled not to have to reinvent the wheel here.
#2: Is there some way to figure it out? As I showed in a previous reply, the template that it can't find is there in the page table.
#3: Those lua modules, are they stock modules included with the mediawiki software, or something much more custom? If the latter, are they available to download somewhere?
#4: I'm no expert on MediaWiki, but it seems that the titles in the XML dump need to be formatted, mainly by replacing spaces with underscores.
thanks for the response --alex
On Mon, Sep 21, 2015 at 12:09 PM, v0id null v0idnull@gmail.com wrote:
#1: mwdumper has not been updated in a very long time. I did try to use it, but it did not seem to work properly. I don't entirely remember what the problem was but I believe it was related to schema incompatibility. xml2sql comes with a warning about having to rebuild links. Considering that I'm just in a command line and passing in page IDs manually, do I really need to worry about it? I'd be thrilled not to have to reinvent the wheel here.
You would need to rebuild link tables if you need them for either mwdumper or xml2sql. For your case it doesn't sound like you'd need them.
#2: Is there some way to figure it out? As I showed in a previous reply, the template that it can't find is there in the page table.
As noted in my previous reply, your import process is buggy and the page record's page_title field is incorrect, so the template cannot be found. You need to parse the incoming title into its namespace and base-title portions and store them correctly: the numeric namespace ID in page_namespace and the text portion in page_title.
#3: Those lua modules, are they stock modules included with the mediawiki software, or something much more custom? If the latter, are they available to download somewhere?
They are on the wiki, in the 'Module' namespace. Should be included with a complete dump. I have no idea about the 'articles' dump, but I would assume it *should* include them.
#4: I'm no expert on MediaWiki, but it seems that the titles in the XML dump need to be formatted, mainly by replacing spaces with underscores.
That's another thing your import process needs to do. I recommend using existing code that already has all this logic. :)
-- brion
Note that Kiwix's "mw-offliner" script ( http://www.openzim.org/wiki/Build_your_ZIM_file#MWoffliner ) does a pretty good job of converting a bunch of wiki pages to HTML, although it starts from a live wiki instance (and a properly-configured Parsoid pointed at it) rather than an XML dump. Zim-format dumps (for example, from https://dumps.wikimedia.org/other/kiwix/zim/wikipedia/ ) can also be unpacked into a directory tree of HTML files.
There are also the "HTML dumps" that the service team is involved with. This following links have more information: https://phabricator.wikimedia.org/T88728 https://phabricator.wikimedia.org/T93396
Perhaps your use case could inform the ongoing design of that service. --scott
Hi alex. I added some notes below based on my experience. (I'm the developer for XOWA (http://gnosygnu.github.io/xowa/) which generates offline wikis from the Wikimedia XML dumps) Feel free to follow up on-list or off-list if you are interested. Thanks.
On Mon, Sep 21, 2015 at 3:09 PM, v0id null v0idnull@gmail.com wrote:
#2: Is there some way to figure it out? As I showed in a previous reply, the template that it can't find is there in the page table.
As brion indicated, you need to strip the namespace name. The XML dump also has a "namespaces" node near the beginning. It lists every namespace in the wiki with "name" and "ID". You can use a rule like "if the title starts with a namespace name and a colon, strip it". Hence, a title like "Template:Date" starts with "Template:" and goes into the page table with a title of just "Date" and a namespace of "10" (the namespace ID for "Template").
#3: Those lua modules, are they stock modules included with the mediawiki software, or something much more custom? If the latter, are they available to download somewhere?
Yes, these are articles with a title starting with "Module:". They will be in the pages-articles.xml.bz2 dump. You should make sure you have Scribunto set up on your wiki, or else it won't use them. See: https://www.mediawiki.org/wiki/Extension:Scribunto
#4: I'm no expert on MediaWiki, but it seems that the titles in the XML dump need to be formatted, mainly by replacing spaces with underscores.
Yes, surprisingly, the only change you'll need to make is to replace spaces with underscores.
Hope this helps.
Thanks for the input everyone. I was not aware that importing the XML dumps was so involved.
In the end I used xml2sql, but it required two patches and a bit more work on my end to get it working. I also had to strip the <DiscussionThreading> tag out of the XML dump. Nevertheless, it is very fast.
For those wondering, I'm toying around with an automated news categorizer and wanted to use Wikinews as a corpus. Not perfect, but this is just hobbyist-level stuff. I'm using NLTK, so I wanted to keep things Python-centric, but I've written a PHP script that runs as a simple TCP server that my Python script can connect to and ask for HTML output. My Python script first downloads MediaWiki and the right XML dump, unzips everything, sets up LocalSettings.php, compiles xml2sql, runs it, and then imports the SQL into the database, so it essentially automates making an offline installation of (I assume) any MediaWiki XML dump. It then starts the simple PHP server (using plain sockets); I just send it page IDs and it responds with the fully rendered HTML, including headers and footers.
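A rough sketch of what the Python client side of that can look like, assuming a hypothetical framing where the client sends a page ID terminated by a newline and the server writes back the HTML and closes the connection (the host, port, and framing here are assumptions for illustration):

import socket

HOST, PORT = "127.0.0.1", 8765  # assumption: wherever the PHP render server listens

def fetch_html(page_id):
    """Send one page ID to the render server and read back the rendered HTML.

    Framing assumed here: the ID followed by a newline goes out, and the HTML
    comes back until the server closes the connection.
    """
    with socket.create_connection((HOST, PORT)) as conn:
        conn.sendall(("%d\n" % page_id).encode("utf-8"))
        chunks = []
        while True:
            data = conn.recv(65536)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")

if __name__ == "__main__":
    print(fetch_html(12345)[:200])  # 12345 is a hypothetical page ID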
I figure that with this approach I can run a few forks on the Python and PHP sides to speed up the process.
Then I use Python to parse the HTML and pull whatever I need from the page, which for now is the categories and the article content, which I can then use to train classifiers in NLTK.
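For the category part, a minimal sketch of that extraction step, reusing the mw-normal-catlinks and mw-content-text IDs that the rendered skin output exposes (lxml is an assumption; any HTML parser would do):

import lxml.html  # assumption: lxml is installed

def extract_categories(html):
    """Pull the category names out of the rendered page's catlinks box."""
    doc = lxml.html.fromstring(html)
    return doc.xpath("//div[@id='mw-normal-catlinks']//li/a/text()")

def extract_body_text(html):
    """Grab the plain text of the article body (the mw-content-text div)."""
    doc = lxml.html.fromstring(html)
    return doc.get_element_by_id("mw-content-text").text_content()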
Maybe not the easiest approach, but it is easy to use. I've looked at the Python parsers, but none of them seem like they would be as successful or as correct as using MediaWiki itself.
---alex
You might consider pointing a Parsoid instance at your "simple PHP server". Using the Parsoid-format HTML DOM has several benefits over using the output of the PHP parser directly. Categories are much easier to extract, for instance.
See https://commons.wikimedia.org/wiki/File%3ADoing_Cool_Things_with_Wiki_Conten... (recording at https://youtu.be/3WJID_WC7BQ) and https://doc.wikimedia.org/Parsoid/master/#!/guide/jsapi for some more hints on running queries over the Parsoid DOM. --scott
Looking at https://www.mediawiki.org/wiki/Parsoid/Setup
It seems that I need a web server set up for MediaWiki, plus Node.js, and I'd have to go through the Parsoid API, which I guess goes through MediaWiki anyhow.
Right now I use XPath to find everything I need. Getting categories, for example, is as simple as:
$dom = new DOMDocument();
@$dom->loadHTML( $html ); // assuming $html holds the rendered page HTML
$xpath = new DOMXPath( $dom );
$contents = $xpath->query( "//div[@id='mw-normal-catlinks']//li/a" );
$categories = [];
foreach ( $contents as $el ) { $categories[] = $el->textContent; }
Is there information that Parsoid makes available that isn't available from Mediawiki output directly?
thanks, -alex
On Wed, Sep 23, 2015 at 3:27 PM, v0id null v0idnull@gmail.com wrote:
Is there information that Parsoid makes available that isn't available from Mediawiki output directly?
Yes, certainly. https://www.mediawiki.org/wiki/Parsoid/MediaWiki_DOM_spec should give you an idea.
One example (of many) is comments in wikitext, which are stripped from the PHP parser's output but preserved in Parsoid HTML.
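For example, a quick way to see that from Python, assuming a local Parsoid service exposing the v3 HTTP API (the base URL below, including the "localhost" domain and port 8000, is an assumption that depends on your Parsoid settings):

import re
import requests  # assumption: any HTTP client works

# Assumed Parsoid v3 endpoint layout: /{domain}/v3/page/html/{title}
PARSOID = "http://localhost:8000/localhost/v3/page/html"

def parsoid_comments(title):
    """Fetch Parsoid HTML for a page and return any wikitext comments it preserved."""
    resp = requests.get("%s/%s" % (PARSOID, title))
    resp.raise_for_status()
    return re.findall(r"<!--.*?-->", resp.text, flags=re.DOTALL)

print(parsoid_comments("Main_Page")[:5])  # "Main_Page" is just an example title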
Your question really is, "is there information *I need* that Parsoid makes available that isn't available from Mediawiki output directly?" I don't know the answer to that. Probably the right thing is to keep going with the implementation you've got, but if you get stuck keep in the back of your mind that switching to Parsoid might help. --scott