[Mediawiki-l] Mass Import

Wolfe, Jeff Jeff_Wolfe at intuit.com
Wed Apr 13 21:28:37 UTC 2005


Hi John,

I'm not familiar with LWP (though I Googled it and got the basic idea), but
I'll take any help I can get.  One could almost build a command-line SDK
that way for cases where you don't want to hit the database directly.

I was thinking about just pushing rows directly into the cur, category, and
searchindex tables, but I think you have an excellent point.  I really like
being able to attribute the author, source, etc.  Have you considered trying
to use some of the PHP scripts from the command line as an alternative?

I would indeed appreciate your scripts if you don't mind.

Thanks,
Jeff


-----Original Message-----
From: mediawiki-l-bounces at Wikimedia.org
[mailto:mediawiki-l-bounces at Wikimedia.org] On Behalf Of John Blumel
Sent: Wednesday, April 13, 2005 3:57 PM
To: MediaWiki announcements and site admin list
Subject: Re: [Mediawiki-l] Mass Import

On Apr 13, 2005, at 4:23pm, Wolfe, Jeff wrote:

> I'm seeking a way to mass import lots of data into a MediaWiki.  I can 
> massage my data in most reasonable ways and have direct access to the 
> database.  I can use existing PHP, generate fake URLs, or hit the SQL 
> database directly.  Does anyone have a suggestion?

I'm working on a similar issue and decided to load the data through
MediaWiki's web interface, using a bot written in Perl (using LWP). I went
that way for a couple of reasons, chiefly because I want the original
submission attributable to a specific source (depending on the user name I
give the bot), and because I want all the updates that normally take place
(category assignment, recent changes, etc.) to happen without my having to
worry about what exactly the MediaWiki code does and when it does it.
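
The upload step looks roughly like the sketch below. It's still rough: the
wiki URL and page title are placeholders, the wp* form field names are just
what my wiki's edit form uses (other versions may also want hidden fields
such as wpEdittime or an edit token scraped from the form), and you'd need a
login request first if the wiki restricts editing.

#!/usr/bin/perl
# Rough sketch of the upload step.  The wiki URL and page title below are
# placeholders, and the wp* form field names are just what my wiki's edit
# form uses -- treat them as assumptions that may vary between versions.
use strict;
use warnings;
use LWP::UserAgent;

my $wiki  = 'http://example.com/wiki/index.php';   # placeholder wiki URL
my $title = 'Some_Imported_Entry';                 # placeholder page title
my $text  = "Generated wiki text goes here.\n";

my $ua = LWP::UserAgent->new;
$ua->cookie_jar({});    # keep session cookies between requests

# Fetch the edit form first so the wiki sets up its edit state.
my $get = $ua->get("$wiki?title=$title&action=edit");
die 'edit form fetch failed: ' . $get->status_line . "\n"
    unless $get->is_success;

# Submit the generated text the same way a browser would.
my $post = $ua->post(
    "$wiki?title=$title&action=submit",
    {
        wpTextbox1 => $text,
        wpSummary  => 'Mass import by bot',
        wpSave     => 'Save page',
    },
);

# A successful save normally comes back as a redirect to the saved page.
print(($post->is_success || $post->is_redirect)
    ? "saved $title\n"
    : 'failed: ' . $post->status_line . "\n");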

One of my sources has about 900 entries and there are several others that
are smaller, so it's a lot less work than creating all these entries
manually, even though some of the sources are non-trivial to parse, and I
expect fewer errors in the final text using this method.
I'm also generating category info from the extracted data and will insert it
into the final wiki text before it is uploaded, so that the submitted
entries will be assigned to specific categories.
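
That part is simple; a small helper (the names here are made up) just
appends the usual [[Category:...]] tags to the generated text before the
bot saves it:

# Hypothetical helper: tag generated wiki text with categories so each
# entry lands in the right categories once the bot saves it.
sub add_categories {
    my ($wikitext, @categories) = @_;
    $wikitext .= "\n";
    $wikitext .= "[[Category:$_]]\n" for @categories;
    return $wikitext;
}

# e.g. (category names made up):
my $entry = add_categories("Some generated entry text.\n",
                           'Glossary', 'Imported entries');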

The bot, in this case, simply does the work of submitting the generated
entries, and I'm writing individual scripts to parse the various source
materials. The next step is to generate HTML output (one file per entry)
from the data files I've generated (again, individual scripts, since the
sources contain different types of information) and then convert that to
wiki text for the bot to upload. (I could skip the HTML, but I'd like to be
able to "preview" a sampling of the entries before I start uploading them,
and it's not that much more work.) I'll probably also create a second bot to
delete a set of entries, just so that I can get rid of the entries resulting
from "test runs" on a test wiki I set up.

You're welcome to the scripts I'm working on, although none of them is
completely finished at the moment, other than a couple of parsing scripts
that wouldn't be of much use to you.
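
For what it's worth, the HTML-to-wiki-text step only has to handle the
handful of tags my generators emit, so I expect it to be little more than a
few substitutions along these lines (a rough sketch; anything fancier would
need a real HTML parser):

# Rough HTML-to-wikitext conversion.  Assumes the generated HTML uses only
# a few simple tags; anything fancier would need a real HTML parser.
sub html_to_wikitext {
    my ($html) = @_;
    $html =~ s{<h2>(.*?)</h2>}{== $1 ==}gs;    # section headings
    $html =~ s{<b>(.*?)</b>}{'''$1'''}gs;      # bold
    $html =~ s{<i>(.*?)</i>}{''$1''}gs;        # italics
    $html =~ s{<li>(.*?)</li>}{* $1}gs;        # list items
    $html =~ s{</?(?:ul|p)>}{}gs;              # drop wrapper tags
    return $html;
}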


John Blumel

_______________________________________________
MediaWiki-l mailing list
MediaWiki-l at Wikimedia.org
http://mail.wikipedia.org/mailman/listinfo/mediawiki-l



