On Tue, Jun 12, 2012 at 11:19 PM, Steve Bennett <stevagewp(a)gmail.com> wrote:
Hi all,
I've been tasked with setting up a local copy of the English
Wikipedia for researchers - sort of like another Toolserver. I'm not
having much luck, and wondered if anyone has done this recently, and
what approach they used? We only really need the current article text
- history and meta pages aren't needed.
Things I have tried:
1) Downloading and mounting the SQL dumps
No good because they don't contain article text
2) Downloading and mounting other SQL "research dumps" (eg
ftp://ftp.rediris.es/mirror/WKP_research)
No good because they're years out of date
3) Using WikiXRay on the enwiki-latest-pages-meta-history?.xml-.....xml files
No good because they decompress to an astronomically large size. I got about
halfway through decompressing them and was already over 7 TB.
Also, WikiXRay appears to be old and out of date (although
interestingly its author Felipe Ortega has just committed to the
gitorious repository[1] on Monday for the first time in over a year)
4) Using MWDumper (http://www.mediawiki.org/wiki/Manual:MWDumper)
No good because it's old and out of date: it only supports export
version 0.3, and the current dumps are 0.6
5) Using importDump.php on a latest-pages-articles.xml dump [2]
No good because it just spews out 7.6 GB of this output:
PHP Warning: xml_parse(): Unable to call handler in_() in
/usr/share/mediawiki/includes/Import.php on line 437
PHP Warning: xml_parse(): Unable to call handler out_() in
/usr/share/mediawiki/includes/Import.php on line 437
PHP Warning: xml_parse(): Unable to call handler in_() in
/usr/share/mediawiki/includes/Import.php on line 437
PHP Warning: xml_parse(): Unable to call handler in_() in
/usr/share/mediawiki/includes/Import.php on line 437
...
So, any suggestions for approaches that might work? Or suggestions for
fixing the errors in step 5?
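
Since only the current article text is needed, one option that sidesteps
the import step entirely is to stream the compressed pages-articles dump
and pull out each page's title and text directly. The following is only a
rough sketch, not a tested pipeline: the function names are made up for
illustration, and it assumes the dump carries an <ns> element per page
(when it does not, every page is treated as an article).

```python
import bz2
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def iter_articles(source):
    """Yield (title, text) for each main-namespace page in a dump stream.

    `source` is any binary file-like object holding MediaWiki export XML.
    Hypothetical helper, not part of MediaWiki or any existing tool.
    """
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # first event is the start of the root element
    title, ns, text = None, None, None
    for event, elem in context:
        if event != "end":
            continue
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "ns":
            ns = elem.text
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            # Assumption: namespace "0" means an article; a missing <ns>
            # element is treated as an article too.
            if ns in (None, "0"):
                yield title, text
            title, ns, text = None, None, None
            root.clear()  # drop finished pages so memory stays bounded


def iter_dump(path):
    """Stream articles from a bz2-compressed dump without unpacking it to disk."""
    with bz2.open(path, "rb") as stream:
        yield from iter_articles(stream)
```

Looping over iter_dump() with the path of a downloaded .bz2 dump would then
give one (title, text) pair per article, which could be written to whatever
store the researchers prefer. The text is raw wikitext, not rendered HTML.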
Steve
[1] http://gitorious.org/wikixray
[2] http://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.b…
_______________________________________________
Wikitech-l mailing list
Wikitech-l(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l