It seems that I need a web server set up for MediaWiki, plus Node.js, and
I'd have to go through the Parsoid API, which I guess goes through
MediaWiki anyhow.
Right now I use XPath to find everything I need. Getting categories, for
example, is as simple as:

$xpath = new DOMXPath($dom);
$contents = $xpath->query("//div[@id='mw-normal-catlinks']//li/a");
$categories = [];
foreach ($contents as $el) {
    $categories[] = $el->textContent;
}
Is there information that Parsoid makes available that isn't available from
MediaWiki output directly?
thanks,
-alex
On Wed, Sep 23, 2015 at 2:49 PM, C. Scott Ananian <cananian(a)wikimedia.org>
wrote:
You might consider pointing a Parsoid instance at your "simple PHP
server". Using the Parsoid-format HTML DOM has several benefits over
using the output of the PHP parser directly. Categories are much easier
to extract, for instance.
See
https://commons.wikimedia.org/wiki/File%3ADoing_Cool_Things_with_Wiki_Conte…
(recording at https://youtu.be/3WJID_WC7BQ) and
https://doc.wikimedia.org/Parsoid/master/#!/guide/jsapi for some more
hints on running queries over the Parsoid DOM.
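For instance, Parsoid marks each category as a `<link rel="mw:PageProp/Category">` element in its output, so collecting them is a flat attribute scan rather than an XPath into the rendered category box. A minimal stdlib-Python sketch (the sample markup fed in below is invented):

```python
# Collect Parsoid category links by scanning start tags; the rel value
# "mw:PageProp/Category" is how Parsoid annotates category membership.
from html.parser import HTMLParser

class CategoryCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.categories = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "link" and a.get("rel") == "mw:PageProp/Category":
            # href looks like "./Category:Published"
            self.categories.append(a.get("href", "").split(":", 1)[-1])

collector = CategoryCollector()
collector.feed('<link rel="mw:PageProp/Category" href="./Category:Published"/>')
print(collector.categories)  # ['Published']
```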
--scott
On Wed, Sep 23, 2015 at 2:25 PM, v0id null <v0idnull(a)gmail.com> wrote:
Thanks for the input everyone. I was not aware that importing the XML
dumps was so involved.

In the end I used xml2sql, but it required two patches, and a bit more
work on my end, to get it to work. I also had to strip out the
<DiscussionThreading> tag from the XML dump. But nevertheless it is very
fast.
For those wondering, I'm toying around with an automated news categorizer
and wanted to use Wikinews as a corpus. Not perfect, but this is just
hobbyist-level stuff here. I'm using NLTK, so I wanted to keep things
Python-centric, but I've written a PHP script that runs as a simple TCP
server that my Python script can connect to and ask for the HTML output.

My Python script first downloads MediaWiki and the right XML dump, unzips
everything, sets up LocalSettings.php, compiles xml2sql, runs it, then
imports the SQL into the database. So it essentially automates making an
offline installation of what I assume is any MediaWiki XML dump. Then it
starts that simple PHP server (using plain sockets) and just sends it
page IDs, and it responds with the fully rendered HTML, including headers
and footers.

I figure with this approach I can run a few forks on the Python and PHP
side to speed up the process.

Then I use Python to parse through the HTML to get whatever I need from
the page, which for now is the categories and the article content, which
I can then use to train classifiers from NLTK.

Maybe not the easiest approach, but it does make it easy to use. I've
looked at the Python parsers, but none of them seem like they will be as
successful or as correct as using MediaWiki itself.
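A minimal sketch of that page-ID exchange. The framing is an assumption (newline-terminated ID in, HTML back until the server closes the connection); the thread doesn't show the actual protocol, and a throwaway Python stand-in plays the PHP side here:

```python
# Sketch only: the real PHP server's protocol isn't shown in the thread,
# so the framing below (ID + newline in, HTML out, then close) is assumed.
import socket
import threading

def serve_one(listener):
    """Stand-in for the PHP side: read one page ID, reply with HTML, close."""
    conn, _ = listener.accept()
    with conn:
        page_id = conn.recv(64).decode().strip()
        conn.sendall(f"<html><body>page {page_id}</body></html>\n".encode())

def fetch_html(host, port, page_id):
    """Client side: send a page ID, read HTML until the server closes."""
    with socket.create_connection((host, port)) as s:
        s.sendall(f"{page_id}\n".encode())
        chunks = []
        while data := s.recv(4096):
            chunks.append(data)
        return b"".join(chunks).decode().strip()

listener = socket.socket()
listener.bind(("127.0.0.1", 0))  # bind to any free port
listener.listen(1)
port = listener.getsockname()[1]
threading.Thread(target=serve_one, args=(listener,), daemon=True).start()
print(fetch_html("127.0.0.1", port, 12345))  # <html><body>page 12345</body></html>
```

Running a few forked clients like this against a pool of server processes is what parallelizes the rendering step described above.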
---alex
On Tue, Sep 22, 2015 at 11:09 PM, gnosygnu <gnosygnu(a)gmail.com> wrote:
> Hi alex. I added some notes below based on my experience. (I'm the
> developer for XOWA (http://gnosygnu.github.io/xowa/) which generates
> offline wikis from the Wikimedia XML dumps.) Feel free to follow up
> on-list or off-list if you are interested. Thanks.
>
> On Mon, Sep 21, 2015 at 3:09 PM, v0id null <v0idnull(a)gmail.com> wrote:
>
> > #1: mwdumper has not been updated in a very long time. I did try to
> > use it, but it did not seem to work properly. I don't entirely
> > remember what the problem was but I believe it was related to schema
> > incompatibility. xml2sql comes with a warning about having to rebuild
> > links. Considering that I'm just in a command line and passing in
> > page IDs manually, do I really need to worry about it? I'd be
> > thrilled not to have to reinvent the wheel here.
>
>
> > #2: Is there some way to figure it out? As I showed in a previous
> > reply, the template that it can't find is there in the page table.
> >
> As brion indicated, you need to strip the namespace name. The XML dump
> also has a "namespaces" node near the beginning. It lists every
> namespace in the wiki with "name" and "ID". You can use a rule like "if
> the title starts with a namespace and a colon, strip it". Hence, a
> title like "Template:Date" starts with "Template:" and goes into the
> page table with a title of just "Date" and a namespace of "10" (the
> namespace id for "Template").
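That rule can be sketched in a few lines of Python (a hypothetical helper; the namespace map is hard-coded as a sample here, while a real importer would read it from the dump's <namespaces> node):

```python
# Toy namespace map; real name->id pairs come from the <namespaces>
# node at the top of the XML dump.
NAMESPACES = {"Template": 10, "Category": 14, "Module": 828}

def split_title(full_title):
    """Split a dump title into (namespace id, page table title)."""
    if ":" in full_title:
        ns_name, rest = full_title.split(":", 1)
        if ns_name in NAMESPACES:
            return NAMESPACES[ns_name], rest
    return 0, full_title  # 0 = main namespace

print(split_title("Template:Date"))  # (10, 'Date')
print(split_title("Main Page"))      # (0, 'Main Page')
```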
>
>
> > #3: Those lua modules, are they stock modules included with the
> > mediawiki software, or something much more custom? If the latter, are
> > they available to download somewhere?
> >
> Yes, these are articles with a title starting with "Module:". They
> will be in the pages-articles.xml.bz2 dump. You should make sure you
> have Scribunto set up on your wiki, or else it won't use them. See:
> https://www.mediawiki.org/wiki/Extension:Scribunto
>
>
> > #4: I'm not an expert on mediawiki, but it seems that the titles in
> > the xml dump need to be formatted, mainly replacing spaces with
> > underscores.
> >
> Yes, surprisingly, the only change you'll need to make is to replace
> spaces with underscores.
>
> Hope this helps.
>
>
> > thanks for the response
> > --alex
> >
> > On Mon, Sep 21, 2015 at 3:00 PM, Brion Vibber <bvibber(a)wikimedia.org>
> > wrote:
> >
> > > A few notes:
> > >
> > > 1) It sounds like you're recreating all the logic of importing a
> > > dump into a SQL database, which may be introducing problems if you
> > > have bugs in your code. For instance you may be mistakenly treating
> > > namespaces as text strings instead of numbers, or failing to escape
> > > things, or missing something else. I would recommend using one of
> > > the many existing tools for importing a dump, such as mwdumper or
> > > xml2sql:
> > >
> > > https://www.mediawiki.org/wiki/Manual:Importing_XML_dumps#Using_mwdumper
> > >
> > > 2) Make sure you've got a dump that includes the templates and lua
> > > modules etc. It sounds like either you don't have the Template:
> > > pages or your import process does not handle namespaces correctly.
> > >
> > > 3) Make sure you've got all the necessary extensions to replicate
> > > the wiki you're using a dump from, such as Lua. Many templates on
> > > Wikipedia call Lua modules, and won't work without them.
> > >
> > > 4) Not sure what "not web friendly" means regarding titles?
> > >
> > > -- brion
> > >
> > >
> > > On Mon, Sep 21, 2015 at 11:50 AM, v0id null <v0idnull(a)gmail.com> wrote:
> > >
> > > > Hello Everyone,
> > > >
> > > > I've been trying to write a python script that will take an XML
> > > > dump and generate all HTML, using MediaWiki itself to handle all
> > > > the parsing/processing, but I've run into a problem where all the
> > > > parsed output has warnings that templates couldn't be found. I'm
> > > > not sure what I'm doing wrong.
> > > >
> > > > So I'll explain my steps:
> > > >
> > > > First I execute the SQL script maintenance/tables.sql
> > > >
> > > > Then I remove some indexes from the tables to speed up insertion.
> > > >
> > > > Finally I go through the XML, which will execute the following
> > > > insert statements:
> > > >
> > > > 'insert into page
> > > >    (page_id, page_namespace, page_title, page_is_redirect,
> > > >     page_is_new, page_random, page_latest, page_len,
> > > >     page_content_model)
> > > >  values (%s, %s, %s, %s, %s, %s, %s, %s, %s)'
> > > >
> > > > 'insert into text (old_id, old_text) values (%s, %s)'
> > > >
> > > > 'insert into recentchanges
> > > >    (rc_id, rc_timestamp, rc_user, rc_user_text, rc_title,
> > > >     rc_minor, rc_bot, rc_cur_id, rc_this_oldid, rc_last_oldid,
> > > >     rc_type, rc_source, rc_patrolled, rc_ip, rc_old_len,
> > > >     rc_new_len, rc_deleted, rc_logid)
> > > >  values (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s,
> > > >          %s, %s, %s, %s)'
> > > >
> > > > 'insert into revision
> > > >    (rev_id, rev_page, rev_text_id, rev_user, rev_user_text,
> > > >     rev_timestamp, rev_minor_edit, rev_deleted, rev_len,
> > > >     rev_parent_id, rev_sha1)
> > > >  values (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)'
> > > >
> > > > All IDs from the XML dump are kept. I noticed that the titles are
> > > > not web friendly. Thinking this was the problem, I ran the
> > > > maintenance/cleanupTitles.php script but it didn't seem to fix
> > > > anything.
> > > >
> > > > Doing this, I can now run the following PHP script:
> > > >
> > > > $id = 'some revision id';
> > > > $rev = Revision::newFromId( $id );
> > > > $titleObj = $rev->getTitle();
> > > > $pageObj = WikiPage::factory( $titleObj );
> > > >
> > > > $context = RequestContext::newExtraneousContext( $titleObj );
> > > > $popts = ParserOptions::newFromContext( $context );
> > > > $pout = $pageObj->getParserOutput( $popts );
> > > >
> > > > var_dump( $pout );
> > > >
> > > > The mText property of $pout contains the parsed output, but it is
> > > > full of stuff like this:
> > > >
> > > > <a href="/index.php?title=Template:Date&action=edit&redlink=1"
> > > > class="new" title="Template:Date (page does not exist)">Template:Date</a>
> > > >
> > > >
> > > > I feel like I'm missing a step here. I tried importing the
> > > > templatelinks SQL dump, but it also did not fix anything. It also
> > > > did not include any header or footer, which would be useful.
> > > >
> > > > Any insight or help is much appreciated, thank you.
> > > >
> > > > --alex
> > _______________________________________________
> > Wikitech-l mailing list
> > Wikitech-l(a)lists.wikimedia.org
> > https://lists.wikimedia.org/mailman/listinfo/wikitech-l
--
(http://cscott.net)