It turns out that only some download tools can read the whole 5.4GB file.
This is the status:
* 3.0GB file (pages-articles), which downloads, though the checksum is
wrong!! I've tried many download tools.
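For reference, this is roughly how I'm checking it, assuming the GNU md5sum
utility is available and comparing against the md5sums file published
alongside the dump (exact filenames are my guess):
md5sum enwiki-20071018-pages-articles.xml.bz2
The printed hash should match the line for that file in
enwiki-20071018-md5sums.txt from the same dump directory; for my download it
doesn't.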
Problem 1: I managed to get most of the 3.0GB file into mysql, but got an
error message in the middle -
Exception in thread "main" java.io.IOException: An invalid XML character
(Unicode: 0x2) was found in the element content of the document.
at org.mediawiki.importer.XmlDumpReader.readDump(Unknown Source)
at org.mediawiki.dumper.Dumper.main(Unknown Source)
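0x2 is a control character that XML 1.0 does not allow, so I'm guessing my
copy of the dump has a few stray control bytes in it. One workaround I'm
considering (only a sketch; it assumes mwdumper will read the dump from
standard input when no file argument is given, and that bzcat and tr are
available, e.g. from Cygwin) is to strip the illegal bytes before they reach
the parser:
bzcat enwiki-20071018-pages-articles.xml.bz2 | tr -d
'\000-\010\013\014\016-\037' | java -jar mwdumper.jar --format=sql:1.5 |
mysql -u <username> -p wikipedia --default-character-set=utf8
The tr range deletes everything below 0x20 except tab, newline and carriage
return, which are the only control characters XML 1.0 permits. I haven't
tested this end to end yet.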
Problem 2: I got exactly 42000 rows. How could that be?
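To see where each import actually stopped, I've been counting rows in the
core tables directly (assuming the standard MediaWiki 1.5 schema that the
sql:1.5 output targets):
mysql> SELECT COUNT(*) FROM page;
mysql> SELECT COUNT(*) FROM revision;
mysql> SELECT COUNT(*) FROM text;
Since these dumps carry exactly one revision per page, all three counts
should come out roughly equal if an import runs to completion.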
* 5.4GB file (pages-meta-current), with the correct checksum
Problem 3: I tried to load the 5.4GB file into mysql (into a clean database,
of course) and got only 4000 rows! I don't understand why everything is
so difficult.
While the file seems fine on the command line, it doesn't look that way
in the database:
D:\Projects\wikipedia> java -jar mwdumper.jar --format=sql:1.5
F:\Datasets\Wikipedia\enwiki-20071018-pages-meta-current.xml.bz2 | mysql
-u <username> -p wikipedia --default-character-set=utf8
...
10,632,000 pages (778.271/sec), 10,632,000 revs (778.271/sec)
10,633,000 pages (778.294/sec), 10,633,000 revs (778.294/sec)
10,633,249 pages (778.301/sec), 10,633,249 revs (778.301/sec)
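One thing I still plan to try, in case the mysql client is hitting an error
partway through that gets lost behind mwdumper's progress counter (my guess
is an oversized INSERT tripping over max_allowed_packet), is to write the
SQL to a file first and load it in a separate step so any error is actually
visible:
java -jar mwdumper.jar --format=sql:1.5
F:\Datasets\Wikipedia\enwiki-20071018-pages-meta-current.xml.bz2 >
pages-meta-current.sql
mysql -u <username> -p wikipedia --default-character-set=utf8 <
pages-meta-current.sql
The intermediate file will be very large, and raising max_allowed_packet in
the server configuration may be needed either way, but this is only a guess
until I can test it.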
Thanks
-----Original Message-----
From: wikitech-l-bounces@lists.wikimedia.org
[mailto:wikitech-l-bounces@lists.wikimedia.org] On Behalf Of David A.
Desrosiers
Sent: Tuesday, October 30, 2007 11:17 PM
To: Wikimedia developers
Subject: Re: [Wikitech-l] Dump is small
On Tue, 2007-10-30 at 23:21 +0900, Fawad Nazir wrote:
The problem is not with downloading. The problem is with loading the
data into mysql.
Fawad, you must be replying to another post, because I was replying to
Osnat, who was complaining that the 5.4G dump was 1.5G. I just disproved
that.
On Mon, 2007-10-29 at 12:47 +0200, Osnat Etgar wrote:
When I try to download pages-meta-current the size is 1.5GB, instead
of 5.4GB. When I download pages-articles the size is 3GB like it
should be.
Now, I have all of the relevant dumps locally... so let me try to unpack
and import them into MySQL, and see if they continue to work. I suspect
that they will, because they always have for me.
I have the whole fetch, unpack, import process scripted to happen
unattended, and aside from initial debugging it has not failed yet in
the last year or more.
I'll post back with my results when that is done.
--
David A. Desrosiers
desrod@gnu-designs.com
setuid@gmail.com
http://projects.plkr.org/
Skype...: 860-967-3820
_______________________________________________
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
http://lists.wikimedia.org/mailman/listinfo/wikitech-l