It turns out that only some download tools can read the whole 5.4GB file.
This is the status:

* 3.0GB (pages-articles) file, though the checksum is wrong! I've tried many download tools.
Problem 1: I managed to get most of the 3.0GB file into MySQL, but got an error message in the middle:

Exception in thread "main" java.io.IOException: An invalid XML character (Unicode: 0x2) was found in the element content of the document.
        at org.mediawiki.importer.XmlDumpReader.readDump(Unknown Source)
        at org.mediawiki.dumper.Dumper.main(Unknown Source)
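In case it helps, this is the kind of filter I was thinking of putting between the decompressed dump and mwdumper to drop the offending 0x2 character. It's a rough, untested sketch of my own (the class name and the stdin/stdout filtering approach are just my assumptions, not anything from mwdumper):

    import java.io.*;

    // Sketch: read the decompressed XML dump from stdin, drop any character
    // that is illegal under the XML 1.0 Char production (e.g. 0x2), and
    // write the cleaned XML to stdout.
    public class XmlCharFilter {
        // Code points allowed by XML 1.0: #x9, #xA, #xD, #x20-#xD7FF,
        // #xE000-#xFFFD, #x10000-#x10FFFF.
        static boolean isValidXmlChar(int cp) {
            return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
        }

        public static void main(String[] args) throws IOException {
            Reader in = new BufferedReader(new InputStreamReader(System.in, "UTF-8"));
            Writer out = new BufferedWriter(new OutputStreamWriter(System.out, "UTF-8"));
            int c;
            while ((c = in.read()) != -1) {
                if (Character.isHighSurrogate((char) c)) {
                    // keep supplementary-plane characters (surrogate pairs) intact
                    int low = in.read();
                    if (low != -1 && Character.isLowSurrogate((char) low)
                            && isValidXmlChar(Character.toCodePoint((char) c, (char) low))) {
                        out.write(c);
                        out.write(low);
                    }
                    continue;
                }
                if (isValidXmlChar(c)) {
                    out.write(c);
                }
                // otherwise silently drop the bad character (0x2 in my case)
            }
            out.flush();
        }
    }

I would then pipe the filtered XML into mwdumper, assuming it can read an uncompressed dump from stdin; I still need to check that.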
Problem 2: I got exactly 42000 rows. How could that be?
* 5.4GB file (pages-meta-current), with the correct checksum
Problem 3: I tried to load the 5.4GB data into MySQL (to a clean database, of course) and got only 4000 rows! I don't understand why everything is so difficult. While it seems fine on the command line, it doesn't look like that in the database:

D:\Projects\wikipedia> java -jar mwdumper.jar --format=sql:1.5 F:\Datasets\Wikipedia\enwiki-20071018-pages-meta-current.xml.bz2 | mysql -u <username> -p wikipedia --default-character-set=utf8
...
10,632,000 pages (778.271/sec), 10,632,000 revs (778.271/sec)
10,633,000 pages (778.294/sec), 10,633,000 revs (778.294/sec)
10,633,249 pages (778.301/sec), 10,633,249 revs (778.301/sec)
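For completeness, this is roughly how I count rows on the MySQL side afterwards to compare against the page/rev totals mwdumper prints. It's a quick sketch; the JDBC URL, the credentials, and the assumption that mwdumper fills the page, revision and text tables of a stock 1.5 schema are mine to verify:

    import java.sql.*;

    // Sketch: count rows in the MediaWiki tables after an mwdumper import.
    // Needs the MySQL Connector/J jar on the classpath; connection details
    // below are placeholders.
    public class CountImportedRows {
        public static void main(String[] args) throws Exception {
            String url = "jdbc:mysql://localhost/wikipedia?useUnicode=true&characterEncoding=UTF-8";
            try (Connection conn = DriverManager.getConnection(url, "username", "password");
                 Statement st = conn.createStatement()) {
                for (String table : new String[] {"page", "revision", "text"}) {
                    try (ResultSet rs = st.executeQuery("SELECT COUNT(*) FROM " + table)) {
                        rs.next();
                        System.out.println(table + ": " + rs.getLong(1) + " rows");
                    }
                }
            }
        }
    }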
Thanks
-----Original Message-----
From: wikitech-l-bounces@lists.wikimedia.org [mailto:wikitech-l-bounces@lists.wikimedia.org] On Behalf Of David A. Desrosiers
Sent: Tuesday, October 30, 2007 11:17 PM
To: Wikimedia developers
Subject: Re: [Wikitech-l] Dump is small
On Tue, 2007-10-30 at 23:21 +0900, Fawad Nazir wrote:
The problem is not with downloading. The problem is with loading the data into mysql.
Fawad, you must be replying to another post, because I was replying to Osnat, who was complaining that the 5.4G dump was 1.5G. I just disproved that.
On Mon, 2007-10-29 at 12:47 +0200, Osnat Etgar wrote:
When I try to download pages-meta-current the size is 1.5GB, instead of 5.4GB. When I download pages-articles the size is 3GB like it should be.
Now, I have all of the relevant dumps locally... so let me try to unpack and import them into MySQL, and see if they continue to work. I suspect that they will, because they always have for me.
I have the whole process of fetch, unpack, and import scripted to happen unattended, and aside from initial debugging, it has not failed in the last year or more.
I'll post back with my results when that is done.