Hi all, I've been tasked with setting up a local copy of the English Wikipedia for researchers - sort of like another Toolserver. I'm not having much luck, and wondered if anyone has done this recently, and what approach they used? We only really need the current article text - history and meta pages aren't needed.
Things I have tried:
1) Downloading and mounting the SQL dumps
No good because they don't contain article text
2) Downloading and mounting other SQL "research dumps" (e.g. ftp://ftp.rediris.es/mirror/WKP_research)
No good because they're years out of date
3) Using WikiXRay on the enwiki-latest-pages-meta-history?.xml-.....xml files
No good because they decompress to an astronomical size. I got about halfway through decompressing them and was already over 7 TB.
Also, WikiXRay appears to be old and out of date (although interestingly its author Felipe Ortega has just committed to the gitorious repository[1] on Monday for the first time in over a year)
4) Using MWDumper (http://www.mediawiki.org/wiki/Manual:MWDumper)
No good because it's old and out of date: it only supports export version 0.3, and the current dumps are 0.6
5) Using importDump.php on a latest-pages-articles.xml dump [2]
No good because it just spews out 7.6 GB of this output:
PHP Warning: xml_parse(): Unable to call handler in_() in /usr/share/mediawiki/includes/Import.php on line 437
PHP Warning: xml_parse(): Unable to call handler out_() in /usr/share/mediawiki/includes/Import.php on line 437
PHP Warning: xml_parse(): Unable to call handler in_() in /usr/share/mediawiki/includes/Import.php on line 437
PHP Warning: xml_parse(): Unable to call handler in_() in /usr/share/mediawiki/includes/Import.php on line 437
...
So, any suggestions for approaches that might work? Or suggestions for fixing the errors in step 5?
Steve
[1] http://gitorious.org/wikixray
[2] http://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz...
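For anyone following along later: since only the current article text is needed, the pages-articles dump can also be read as a stream, so nothing ever has to be fully decompressed to disk or held in memory. A minimal sketch of that approach in Python, assuming the standard <page>/<revision>/<text> layout of the export XML; the dump path and the "store it somewhere" step are placeholders, not part of any tool mentioned above:

import bz2
import xml.etree.ElementTree as ET

DUMP = "enwiki-latest-pages-articles.xml.bz2"  # placeholder path

def localname(tag):
    # Tags carry the export-0.x namespace, e.g. "{http://www.mediawiki.org/xml/export-0.6/}page"
    return tag.rsplit("}", 1)[-1]

with bz2.BZ2File(DUMP) as stream:
    context = ET.iterparse(stream, events=("start", "end"))
    _, root = next(context)  # the <mediawiki> root element, kept so it can be pruned
    for event, elem in context:
        if event == "end" and localname(elem.tag) == "page":
            title, text = None, ""
            for child in elem.iter():
                name = localname(child.tag)
                if name == "title":
                    title = child.text
                elif name == "text":
                    text = child.text or ""
            # ... hand (title, text) to whatever store the researchers will query ...
            root.clear()  # prune processed pages so memory use stays flat

Nothing here holds more than one page in memory at a time, and the .bz2 never has to be expanded on disk.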
On 2012-06-12 23:19, Steve Bennett wrote:
I've been tasked with setting up a local copy of the English Wikipedia for researchers - sort of like another Toolserver. I'm not having much luck,
Have your researchers learn Icelandic. Importing the small Icelandic Wikipedia is fast. They can test their theories and see if their hypotheses make any sense. When they've done their research on Icelandic, have them learn Danish, then Norwegian, Swedish, Dutch, before going to German and finally English. There's a fine spiral of language sizes around the North Sea.
It's when they get frustrated waiting 15 minutes for an analysis of Norwegian that they will find smarter algorithms, which will let them take on the larger languages.
mwdumper seems to work for recent dumps: http://lists.wikimedia.org/pipermail/mediawiki-l/2012-May/039347.html
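If mwdumper does cope with the 0.6 dumps, the documented pipeline is just mwdumper writing SQL straight into mysql (java -jar mwdumper.jar --format=sql:1.5 dump.xml.bz2 | mysql ...). A rough sketch of wiring that up from Python; the jar path, database name and user are placeholders, and it assumes the target database already has the MediaWiki schema loaded:

import subprocess

# Equivalent of the documented shell one-liner:
#   java -jar mwdumper.jar --format=sql:1.5 dump.xml.bz2 | mysql -u USER -p DBNAME
dumper = subprocess.Popen(
    ["java", "-jar", "mwdumper.jar", "--format=sql:1.5",
     "enwiki-latest-pages-articles.xml.bz2"],
    stdout=subprocess.PIPE)
loader = subprocess.Popen(
    ["mysql", "-u", "wikiuser", "-p", "wikidb"],  # placeholder user/database; -p prompts for the password
    stdin=dumper.stdout)
dumper.stdout.close()  # let mwdumper see a broken pipe if mysql exits early
loader.communicate()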
I ran into this problem recently. A Python script is available at https://svn.wikimedia.org/viewvc/mediawiki/trunk/extensions/Offline/mwimport..., which will convert .xml.bz2 dumps into flat fast-import files that can be loaded into most databases. Sorry, this tool is still alpha quality.
Feel free to contact me with problems.
-Adam Wight
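In case it helps to see the general shape of the flat-file route (a sketch of the idea only, not the mwimport script itself): emit one tab-separated row per page, then let MySQL bulk-load the file, which is far faster than feeding row-by-row INSERTs through importDump.php. Here iter_pages() is a hypothetical stand-in for whatever streaming dump reader you use (e.g. the one sketched earlier in the thread), and the table and column names are placeholders:

def tsv_escape(s):
    # LOAD DATA's defaults treat backslash as the escape character and
    # tab/newline as field/line terminators, so escape those three.
    return s.replace("\\", "\\\\").replace("\t", "\\t").replace("\n", "\\n")

with open("pages.tsv", "w", encoding="utf-8") as out:
    # iter_pages() is a stand-in for a streaming reader yielding (title, text) pairs
    for title, text in iter_pages("enwiki-latest-pages-articles.xml.bz2"):
        out.write(tsv_escape(title) + "\t" + tsv_escape(text) + "\n")

# Then, from the mysql client (placeholder table/columns):
#   LOAD DATA LOCAL INFILE 'pages.tsv' INTO TABLE page_text (page_title, page_text);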
Thanks, I'm trying this. It consumes phenomenal amounts of memory, though - I keep getting a "Killed" message from Ubuntu, even with a 20 GB swap file. Will keep trying with an even bigger one.
I'll also give mwdumper another go.
Steve