Thanks for the input, everyone. I was not aware that importing the XML dumps
was so involved.
In the end I used xml2sql, but it required two patches, and a bit more work
on my end, to get it working. I also had to strip out the
<DiscussionThreading> tag from the XML dump. But nevertheless it is very
fast.
For those wondering, I'm toying around with an automated news categorizer
and wanted to use Wikinews as a corpus. Not perfect, but this is just
hobbyist-level stuff here. I'm using nltk, so I wanted to keep things
Python-centric, but I've written a PHP script that runs as a simple TCP
server that my Python script can connect to and ask for the HTML output. My
Python script first downloads MediaWiki and the right XML dump, unzips
everything, sets up LocalSettings.php, compiles xml2sql, runs it, then
imports the SQL into the database. So it essentially automates making an
offline installation of what I assume is any MediaWiki XML dump. Then it
starts that simple PHP server (using plain sockets), and just sends it page
IDs, and it responds with the fully rendered HTML, including headers and
footers.
I figure with this approach, I can run a few forks on the Python and PHP
sides to speed up the process.
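A bare-bones sketch of the Python client side looks like this (the
newline-terminated page-ID request and read-until-close framing are just
assumptions for illustration; the real protocol depends on how the PHP
server is written):

```python
import socket

def fetch_rendered_html(page_id, host="127.0.0.1", port=8000):
    """Ask the PHP rendering server for the HTML of one page ID."""
    with socket.create_connection((host, port)) as sock:
        # Send the page ID terminated by a newline (assumed protocol).
        sock.sendall(f"{page_id}\n".encode("utf-8"))
        chunks = []
        while True:
            # Read until the server closes the connection.
            data = sock.recv(65536)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8")
```

One connection per page keeps the framing trivial, and each forked worker
can just open its own connection.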
Then I use Python to parse through the HTML to get whatever I need from the
page, which for now is the categories and the article content, which I can
then use to train classifiers from nltk.
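The scraping step can be sketched with the standard library alone (the
"Category:" href pattern is an assumption; the exact link shape depends on
the wiki's URL configuration):

```python
from html.parser import HTMLParser

class PageScraper(HTMLParser):
    """Pull category names and visible text out of rendered wiki HTML."""

    def __init__(self):
        super().__init__()
        self.categories = []
        self.text_parts = []
        self._skip = 0  # depth inside <script>/<style> tags

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
        href = dict(attrs).get("href", "")
        # Assumed link shape: /wiki/Category:Foo or
        # /index.php?title=Category:Foo&...
        if tag == "a" and "Category:" in href:
            self.categories.append(href.split("Category:")[1].split("&")[0])

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self._skip -= 1

    def handle_data(self, data):
        if self._skip == 0 and data.strip():
            self.text_parts.append(data.strip())
```

From there, the collected text can be tokenized into featuresets and paired
with the categories as labels, which is the (featureset, label) shape that
nltk's NaiveBayesClassifier.train expects.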
Maybe not the easiest approach, but it does make things easy to use. I've
looked at the Python parsers, but none of them seem like they will be as
successful or as correct as using MediaWiki itself.
---alex
On Tue, Sep 22, 2015 at 11:09 PM, gnosygnu <gnosygnu(a)gmail.com> wrote:
Hi alex. I added some notes below based on my experience. (I'm the
developer for XOWA (http://gnosygnu.github.io/xowa/), which generates
offline wikis from the Wikimedia XML dumps.) Feel free to follow up on-list
or off-list if you are interested. Thanks.
On Mon, Sep 21, 2015 at 3:09 PM, v0id null <v0idnull(a)gmail.com> wrote:
> #1: mwdumper has not been updated in a very long time. I did try to use
> it, but it did not seem to work properly. I don't entirely remember what
> the problem was, but I believe it was related to schema incompatibility.
> xml2sql comes with a warning about having to rebuild links. Considering
> that I'm just in a command line and passing in page IDs manually, do I
> really need to worry about it? I'd be thrilled not to have to reinvent
> the wheel here.
> #2: Is there some way to figure it out? As I showed in a previous reply,
> the template that it can't find is there in the page table.
As brion indicated, you need to strip the namespace name. The XML dump
also has a "namespaces" node near the beginning. It lists every namespace
in the wiki with "name" and "ID". You can use a rule like "if the title
starts with a namespace and a colon, strip it". Hence, a title like
"Template:Date" starts with "Template:" and goes into the page table with
a title of just "Date" and a namespace of "10" (the namespace ID for
"Template").
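A rough sketch of that rule in Python (a few namespace IDs are hard-coded
here for illustration; a real import should read the name/ID pairs from the
"namespaces" node of the dump itself):

```python
# Namespace names and IDs as listed in the dump's "namespaces" node.
# (Hard-coded subset for illustration only.)
NAMESPACES = {"Talk": 1, "User": 2, "Template": 10, "Module": 828}

def split_title(full_title):
    """Return (namespace_id, db_title) for a title from the XML dump.

    Strips a recognized "Namespace:" prefix and replaces spaces with
    underscores, matching how titles are stored in the page table.
    """
    ns_id, title = 0, full_title  # 0 is the main article namespace
    prefix, sep, rest = full_title.partition(":")
    if sep and prefix in NAMESPACES:
        ns_id, title = NAMESPACES[prefix], rest
    return ns_id, title.replace(" ", "_")
```

So "Template:Date" becomes (10, "Date"), and a main-namespace title like
"Main Page" becomes (0, "Main_Page"), which also covers the underscore
point in #4 below.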
> #3: Those lua modules, are they stock modules included with the mediawiki
> software, or something much more custom? If the latter, are they
> available to download somewhere?
Yes, these are articles with a title starting with "Module:". They will be
in the pages-articles.xml.bz2 dump. You should make sure you have Scribunto
set up on your wiki, or else it won't use them. See:
https://www.mediawiki.org/wiki/Extension:Scribunto
> #4: I'm not any expert on mediawiki, but it seems that the titles in
> the xml dump need to be formatted, mainly replacing spaces with
> underscores.
Yes, surprisingly, the only change you'll need to make is to replace
spaces with underscores.
Hope this helps.
> Thanks for the response.
> --alex
> On Mon, Sep 21, 2015 at 3:00 PM,
Brion Vibber <bvibber(a)wikimedia.org>
> wrote:
> > A few notes:
>
> > 1) It sounds like you're recreating all the logic of importing a dump
> > into a SQL database, which may be introducing problems if you have bugs
> > in your code. For instance you may be mistakenly treating namespaces as
> > text strings instead of numbers, or failing to escape things, or
> > missing something else. I would recommend using one of the many
> > existing tools for importing a dump, such as mwdumper or xml2sql:
> > https://www.mediawiki.org/wiki/Manual:Importing_XML_dumps#Using_mwdumper
>
> > 2) Make sure you've got a dump that includes the templates and lua
> > modules etc. It sounds like either you don't have the Template: pages
> > or your import process does not handle namespaces correctly.
>
> > 3) Make sure you've got all the necessary extensions to replicate the
> > wiki you're using a dump from, such as Lua. Many templates on Wikipedia
> > call Lua modules, and won't work without them.
>
> > 4) Not sure what "not web friendly" means regarding titles?
>
> > -- brion
>
>
> > On Mon, Sep 21, 2015 at 11:50 AM, v0id null <v0idnull(a)gmail.com> wrote:
>
> > > Hello Everyone,
> >
> > > I've been trying to write a python script that will take an XML dump
> > > and generate all HTML, using MediaWiki itself to handle all the
> > > parsing/processing, but I've run into a problem where all the parsed
> > > output has warnings that templates couldn't be found. I'm not sure
> > > what I'm doing wrong.
> >
> > > So I'll explain my steps:
> >
> > > First I execute the SQL script maintenance/tables.sql.
> >
> > > Then I remove some indexes from the tables to speed up insertion.
> >
> > > Finally I go through the XML, which will execute the following insert
> > > statements:
> >
> > > 'insert into page
> > >  (page_id, page_namespace, page_title, page_is_redirect, page_is_new,
> > >   page_random, page_latest, page_len, page_content_model)
> > >  values (%s, %s, %s, %s, %s, %s, %s, %s, %s)'
> >
> > > 'insert into text (old_id, old_text) values (%s, %s)'
> >
> > > 'insert into recentchanges
> > >  (rc_id, rc_timestamp, rc_user, rc_user_text, rc_title, rc_minor,
> > >   rc_bot, rc_cur_id, rc_this_oldid, rc_last_oldid, rc_type, rc_source,
> > >   rc_patrolled, rc_ip, rc_old_len, rc_new_len, rc_deleted, rc_logid)
> > >  values (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s,
> > >   %s, %s, %s)'
> >
> > > 'insert into revision
> > >  (rev_id, rev_page, rev_text_id, rev_user, rev_user_text,
> > >   rev_timestamp, rev_minor_edit, rev_deleted, rev_len, rev_parent_id,
> > >   rev_sha1)
> > >  values (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)'
> >
> > > All IDs from the XML dump are kept. I noticed that the titles are not
> > > web friendly. Thinking this was the problem, I ran the
> > > maintenance/cleanupTitles.php script, but it didn't seem to fix
> > > anything.
> >
> > > Doing this, I can now run the following PHP script:
> > > $id = 'some revision id';
> > > $rev = Revision::newFromId( $id );
> > > $titleObj = $rev->getTitle();
> > > $pageObj = WikiPage::factory( $titleObj );
> >
> > > $context = RequestContext::newExtraneousContext( $titleObj );
> >
> > > $popts = ParserOptions::newFromContext( $context );
> > > $pout = $pageObj->getParserOutput( $popts );
> >
> > > var_dump( $pout );
> >
> > > The mText property of $pout contains the parsed output, but it is
> > > full of stuff like this:
> >
> > > <a href="/index.php?title=Template:Date&action=edit&redlink=1"
> > > class="new" title="Template:Date (page does not exist)">Template:Date</a>
> >
> >
> > > I feel like I'm missing a step here. I tried importing the
> > > templatelinks SQL dump, but it also did not fix anything. It also did
> > > not include any header or footer, which would be useful.
> >
> > > Any insight or help is much appreciated, thank you.
> >
> > > --alex
> > > _______________________________________________
> > > Wikitech-l mailing list
> > > Wikitech-l(a)lists.wikimedia.org
> > > https://lists.wikimedia.org/mailman/listinfo/wikitech-l