It seems that I need a web server set up for MediaWiki, plus Node.js, and
I'd have to go through the Parsoid API, which I guess goes through
MediaWiki anyhow.
Right now I use XPath to find everything I need. Getting categories, for
example, is as simple as:

$xpath = new DOMXPath($dom);
$contents = $xpath->query("//div[@id='mw-normal-catlinks']//li/a");
$categories = [];
foreach ($contents as $el) {
    $categories[] = $el->textContent;
}
Is there information that Parsoid makes available that isn't available from
MediaWiki output directly?
thanks,
-alex
On Wed, Sep 23, 2015 at 2:49 PM, C. Scott Ananian <cananian(a)wikimedia.org>
wrote:
You might consider pointing a Parsoid instance at your "simple PHP
server". Using the Parsoid-format HTML DOM has several benefits over
using the output of the PHP parser directly. Categories are much easier
to extract, for instance.
See
https://commons.wikimedia.org/wiki/File%3ADoing_Cool_Things_with_Wiki_Conte…
(recording at https://youtu.be/3WJID_WC7BQ) and
https://doc.wikimedia.org/Parsoid/master/#!/guide/jsapi for some more
hints on running queries over the Parsoid DOM.
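For instance, Parsoid marks each category as a `<link rel="mw:PageProp/Category">` element in its output, so collecting them is a flat attribute scan rather than an XPath into the rendered category box. A minimal stdlib-Python sketch (the sample markup fed in below is invented):

```python
# Collect Parsoid category links by scanning start tags; the rel value
# "mw:PageProp/Category" is how Parsoid annotates category membership.
from html.parser import HTMLParser

class CategoryCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.categories = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "link" and a.get("rel") == "mw:PageProp/Category":
            # href looks like "./Category:Published"
            self.categories.append(a.get("href", "").split(":", 1)[-1])

collector = CategoryCollector()
collector.feed('<link rel="mw:PageProp/Category" href="./Category:Published"/>')
print(collector.categories)  # ['Published']
```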
--scott
On Wed, Sep 23, 2015 at 2:25 PM, v0id null <v0idnull(a)gmail.com> wrote:
Thanks for the input everyone. I was not aware that importing the XML
dumps was so involved.

In the end I used xml2sql, but it required two patches, and a bit more
work on my end, to get it to work. I also had to strip out the
<DiscussionThreading> tag from the XML dump. But nevertheless it is very
fast.
For those wondering, I'm toying around with an automated news categorizer
and wanted to use Wikinews as a corpus. Not perfect, but this is just
hobbyist-level stuff here. I'm using NLTK, so I wanted to keep things
Python-centric, but I've written a PHP script that runs as a simple TCP
server that my Python script can connect to and ask for the HTML output.

My Python script first downloads MediaWiki and the right XML dump, unzips
everything, sets up LocalSettings.php, compiles xml2sql, runs it, then
imports the SQL into the database. So it essentially automates making an
offline installation of what I assume is any MediaWiki XML dump. Then it
starts that simple PHP server (using plain sockets) and just sends it
page IDs, and it responds with the fully rendered HTML, including headers
and footers.

I figure with this approach I can run a few forks on the Python and PHP
side to speed up the process.

Then I use Python to parse through the HTML to get whatever I need from
the page, which for now is the categories and the article content, which
I can then use to train classifiers from NLTK.

Maybe not the easiest approach, but it does make it easy to use. I've
looked at the Python parsers, but none of them seem like they will be as
successful or as correct as using MediaWiki itself.
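A minimal sketch of that page-ID exchange. The framing is an assumption (newline-terminated ID in, HTML back until the server closes the connection); the thread doesn't show the actual protocol, and a throwaway Python stand-in plays the PHP side here:

```python
# Sketch only: the real PHP server's protocol isn't shown in the thread,
# so the framing below (ID + newline in, HTML out, then close) is assumed.
import socket
import threading

def serve_one(listener):
    """Stand-in for the PHP side: read one page ID, reply with HTML, close."""
    conn, _ = listener.accept()
    with conn:
        page_id = conn.recv(64).decode().strip()
        conn.sendall(f"<html><body>page {page_id}</body></html>\n".encode())

def fetch_html(host, port, page_id):
    """Client side: send a page ID, read HTML until the server closes."""
    with socket.create_connection((host, port)) as s:
        s.sendall(f"{page_id}\n".encode())
        chunks = []
        while data := s.recv(4096):
            chunks.append(data)
        return b"".join(chunks).decode().strip()

listener = socket.socket()
listener.bind(("127.0.0.1", 0))  # bind to any free port
listener.listen(1)
port = listener.getsockname()[1]
threading.Thread(target=serve_one, args=(listener,), daemon=True).start()
print(fetch_html("127.0.0.1", port, 12345))  # <html><body>page 12345</body></html>
```

Running a few forked clients like this against a pool of server processes is what parallelizes the rendering step described above.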
---alex
On Tue, Sep 22, 2015 at 11:09 PM, gnosygnu <gnosygnu(a)gmail.com> wrote:
> Hi alex. I added some notes below based on my experience. (I'm the
> developer for XOWA (http://gnosygnu.github.io/xowa/) which generates
> offline wikis from the Wikimedia XML dumps.) Feel free to follow up
> on-list or off-list if you are interested. Thanks.
>
> On Mon, Sep 21, 2015 at 3:09 PM, v0id null <v0idnull(a)gmail.com> wrote:
>
> > #1: mwdumper has not been updated in a very long time. I did try to
> > use it, but it did not seem to work properly. I don't entirely
> > remember what the problem was but I believe it was related to schema
> > incompatibility. xml2sql comes with a warning about having to rebuild
> > links. Considering that I'm just in a command line and passing in
> > page IDs manually, do I really need to worry about it? I'd be
> > thrilled not to have to reinvent the wheel here.
>
>
> > #2: Is there some way to figure it out? As I showed in a previous
> > reply, the template that it can't find is there in the page table.
> >
> As brion indicated, you need to strip the namespace name. The XML dump
> also has a "namespaces" node near the beginning. It lists every
> namespace in the wiki with "name" and "ID". You can use a rule like "if
> the title starts with a namespace and a colon, strip it". Hence, a
> title like "Template:Date" starts with "Template:" and goes into the
> page table with a title of just "Date" and a namespace of "10" (the
> namespace id for "Template").
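That rule can be sketched in a few lines of Python (a hypothetical helper; the namespace map is hard-coded as a sample here, while a real importer would read it from the dump's <namespaces> node):

```python
# Toy namespace map; real name->id pairs come from the <namespaces>
# node at the top of the XML dump.
NAMESPACES = {"Template": 10, "Category": 14, "Module": 828}

def split_title(full_title):
    """Split a dump title into (namespace id, page table title)."""
    if ":" in full_title:
        ns_name, rest = full_title.split(":", 1)
        if ns_name in NAMESPACES:
            return NAMESPACES[ns_name], rest
    return 0, full_title  # 0 = main namespace

print(split_title("Template:Date"))  # (10, 'Date')
print(split_title("Main Page"))      # (0, 'Main Page')
```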
>
>
> > #3: Those lua modules, are they stock modules included with the
> > mediawiki software, or something much more custom? If the latter, are
> > they available to download somewhere?
> >
> Yes, these are articles with a title starting with "Module:". They
> will be in the pages-articles.xml.bz2 dump. You should make sure you
> have Scribunto set up on your wiki, or else it won't use them. See:
> https://www.mediawiki.org/wiki/Extension:Scribunto
>
>
> > #4: I'm not an expert on mediawiki, but it seems that the titles in
> > the xml dump need to be formatted, mainly replacing spaces with
> > underscores.
> >
> Yes, surprisingly, the only change you'll need to make is to replace
> spaces with underscores.
>
> Hope this helps.
>
>
> > thanks for the response
> > --alex
> >
> > On Mon, Sep 21, 2015 at 3:00 PM, Brion Vibber <bvibber(a)wikimedia.org>
> > wrote:
> >
> > > A few notes:
> > >
> > > 1) It sounds like you're recreating all the logic of importing a
> > > dump into a SQL database, which may be introducing problems if you
> > > have bugs in your code. For instance you may be mistakenly treating
> > > namespaces as text strings instead of numbers, or failing to escape
> > > things, or missing something else. I would recommend using one of
> > > the many existing tools for importing a dump, such as mwdumper or
> > > xml2sql:
> > >
> > > https://www.mediawiki.org/wiki/Manual:Importing_XML_dumps#Using_mwdumper
> > >
> > > 2) Make sure you've got a dump that includes the templates and lua
> > > modules etc. It sounds like either you don't have the Template:
> > > pages or your import process does not handle namespaces correctly.
> > >
> > > 3) Make sure you've got all the necessary extensions to replicate
> > > the wiki you're using a dump from, such as Lua. Many templates on
> > > Wikipedia call Lua modules, and won't work without them.
> > >
> > > 4) Not sure what "not web friendly" means regarding titles?
> > >
> > > -- brion
> > >
> > >
> > > On Mon, Sep 21, 2015 at 11:50 AM, v0id null <v0idnull(a)gmail.com> wrote:
> > >
> > > > Hello Everyone,
> > > >
> > > > I've been trying to write a python script that will take an XML
> > > > dump and generate all HTML, using MediaWiki itself to handle all
> > > > the parsing/processing, but I've run into a problem where all the
> > > > parsed output has warnings that templates couldn't be found. I'm
> > > > not sure what I'm doing wrong.
> > > >
> > > > So I'll explain my steps:
> > > >
> > > > First I execute the SQL script maintenance/tables.sql
> > > >
> > > > Then I remove some indexes from the tables to speed up insertion.
> > > >
> > > > Finally I go through the XML, which will execute the following
> > > > insert statements:
> > > >
> > > > 'insert into page
> > > >    (page_id, page_namespace, page_title, page_is_redirect,
> > > >     page_is_new, page_random, page_latest, page_len,
> > > >     page_content_model)
> > > >  values (%s, %s, %s, %s, %s, %s, %s, %s, %s)'
> > > >
> > > > 'insert into text (old_id, old_text) values (%s, %s)'
> > > >
> > > > 'insert into recentchanges
> > > >    (rc_id, rc_timestamp, rc_user, rc_user_text, rc_title,
> > > >     rc_minor, rc_bot, rc_cur_id, rc_this_oldid, rc_last_oldid,
> > > >     rc_type, rc_source, rc_patrolled, rc_ip, rc_old_len,
> > > >     rc_new_len, rc_deleted, rc_logid)
> > > >  values (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s,
> > > >          %s, %s, %s, %s)'
> > > >
> > > > 'insert into revision
> > > >    (rev_id, rev_page, rev_text_id, rev_user, rev_user_text,
> > > >     rev_timestamp, rev_minor_edit, rev_deleted, rev_len,
> > > >     rev_parent_id, rev_sha1)
> > > >  values (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)'
> > > >
> > > > All IDs from the XML dump are kept. I noticed that the titles are
> > > > not web friendly. Thinking this was the problem, I ran the
> > > > maintenance/cleanupTitles.php script but it didn't seem to fix
> > > > anything.
> > > >
> > > > Doing this, I can now run the following PHP script:
> > > >
> > > > $id = 'some revision id';
> > > > $rev = Revision::newFromId( $id );
> > > > $titleObj = $rev->getTitle();
> > > > $pageObj = WikiPage::factory( $titleObj );
> > > >
> > > > $context = RequestContext::newExtraneousContext( $titleObj );
> > > > $popts = ParserOptions::newFromContext( $context );
> > > > $pout = $pageObj->getParserOutput( $popts );
> > > >
> > > > var_dump( $pout );
> > > >
> > > > The mText property of $pout contains the parsed output, but it is
> > > > full of stuff like this:
> > > >
> > > > <a href="/index.php?title=Template:Date&action=edit&redlink=1"
> > > > class="new" title="Template:Date (page does not exist)">Template:Date</a>
> > > >
> > > >
> > > > I feel like I'm missing a step here. I tried importing the
> > > > templatelinks SQL dump, but it also did not fix anything. It also
> > > > did not include any header or footer, which would be useful.
> > > >
> > > > Any insight or help is much appreciated, thank you.
> > > >
> > > > --alex
> > _______________________________________________
> > Wikitech-l mailing list
> > Wikitech-l(a)lists.wikimedia.org
> > https://lists.wikimedia.org/mailman/listinfo/wikitech-l
--
(http://cscott.net)