The last pages-meta-current.xml.bz2 in 20071018 is said to be 5.4GB, but when I downloaded it, it was only 1.5 GB. Why is that? Is there a problem with this dump? I saw there was a lot of discussion about it. What should I do? Thanks
Are you looking for enwiki-20070908-stub-meta-history.xml.gz? That one is 5.4GB. Try downloading it.
fawad@cyprus:~/wiki$ ls -lh
-r-------- 1 fawad fawad 5.4G 2007-10-21 09:29 enwiki-20070908-stub-meta-history.xml.gz
I don't want all the history. I just want the current articles, so I am downloading pages-meta-current.xml.bz2 and pages-articles.xml.bz2. When I try to download pages-meta-current the size is 1.5GB, instead of 5.4GB. When I download pages-articles the size is 3GB, like it should be. Where else can I get the pages-meta-current? The previous dump? When I look for the previous one, I can only find a status.html file. Maybe I don't really need the pages-meta-current if I only want the current articles?
On Mon, 2007-10-29 at 12:47 +0200, Osnat Etgar wrote:
I don't want all the history. I just want the current articles, so I am downloading pages-meta-current.xml.bz2 and pages-articles.xml.bz2
You don't want the history, but you want all of the discussion and user pages? Are you sure?
I'm testing a download of the -meta-current.xml.bz2 right now to see if it does indeed work, but it will take 1/2 day to get it all. I'll post back and let you know what happens.
Where else can I get the pages-meta-current? The previous dump? When I look for the previous one, I can only find a status.html file. Maybe I don't really need the pages-meta-current if I only want the current articles?
The server claims to have the right amount of bytes, so let's see what happens when my download completes:
Server: Wikimedia dump service 20050523 (lighttpd)
Content-Length: 5780471837
Hi, Thanks for your reply. I am a newbie in this field.
1. I only want the articles. No history, no user information, no discussions. I do want articles, lists and disambiguation pages. Maybe I understood it wrong and I don't need pages-meta-current? Only pages-articles?
2. I have some data from pages-meta-current in my database.
2.1. I got an error in the middle, after over a million pages were extracted:
Exception in thread "main" java.io.IOException: An invalid XML character (Unicode: 0x2) was found in the element content of the document.
        at org.mediawiki.importer.XmlDumpReader.readDump(Unknown Source)
        at org.mediawiki.dumper.Dumper.main(Unknown Source)
While we are here, I'd like to ask some more questions, if you don't mind:
1. How do I read the data from MySQL? I don't understand how entries are connected to one another and how I should read it.
2. Do I have to clean up the MySQL tables every time I want to insert another dump, whether it's an update file or a totally different one?
3. Is there a way to get only a delta file instead of the whole dump again?
4. How do I add .sql.gz files to MySQL?
Thanks a lot for your answers
Osnat
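(For what it's worth: if only current article text is wanted, pages-articles.xml.bz2 by itself is normally enough, since pages-meta-current additionally carries the talk and user pages. A minimal import sketch, assuming the target database already contains the MediaWiki tables; the database name and credentials are placeholders:)

$ java -jar mwdumper.jar --format=sql:1.5 enwiki-20071018-pages-articles.xml.bz2 \
    | mysql -u <username> -p wikipedia --default-character-set=utf8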
For your first question ("How do I read the data from MySQL?"), look at this link, posted by Paul Coghlan just this morning in answer to a similar question: http://www.mediawiki.org/wiki/Manual:Database_layout
DSig David Tod Sigafoos | SANMAR Corporation PICK Guy 206-770-5585 davesigafoos@sanmar.com
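(To make the table relationships concrete, assuming a stock MediaWiki 1.5+ schema with no table prefix: page.page_latest points at revision.rev_id, and revision.rev_text_id points at text.old_id, so the current wikitext of an article can be fetched with something like the sketch below; the database name and credentials are placeholders.)

$ mysql -u <username> -p wikipedia -e "SELECT p.page_title, t.old_text
    FROM page p
    JOIN revision r ON r.rev_id = p.page_latest
    JOIN text t ON t.old_id = r.rev_text_id
    WHERE p.page_namespace = 0 LIMIT 1"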
I am also new, but I can answer a few of your questions.
- How do I read the data from MySQL? I don't understand how entries are connected to one another and how I should read it.
Have a look at this: http://www.mediawiki.org/wiki/Category:MediaWiki_database_tables
- Do I have to clean up MySQL tables every time I want to insert another dump? Either an update file or totally different one?
I did this :( and it eats up a lot of my time. If anyone can help, that would be great.
- Is there a way to get only the delta file instead of the whole dump again?
- How do I add .sql.gz files to MySQL?
$ gunzip <filename>.sql.gz
$ mysql -u (username) -p (db-name) < (filename).sql
Hope this helps.
Regards,
On 10/29/07, Fawad Nazir fawad.nazir@gmail.com wrote:
- How do I add .sql.gz files to MySQL?
$ gunzip <filename>.sql.gz
$ mysql -u (username) -p (db-name) < (filename).sql
Or
mysql -u username -p db-name | gunzip --to-stdout filename.sql
Should be faster and eat up less disk space.
Bryan
On 10/29/07, Bryan Tong Minh bryan.tongminh@gmail.com wrote:
mysql -u username -p db-name | gunzip --to-stdout filename.sql
Argh, the other way round
gunzip --to-stdout filename.sql | mysql -u username -p db-name
On Mon, 2007-10-29 at 22:59 +0900, Fawad Nazir wrote:
I did this :(, eats up a lot of my time. If anyone can help that would be grt.
How does it eat up a lot of your time? You just set it to import the data, and walk away. I have mine automated, so that I dump and reload the top 9 languages of each project into MySQL.
The whole process, for all languages, takes just about an hour, with enwiki taking the bulk of that time. My machine is a lowly dual-core AMD64 machine with 4GB of RAM.
I have mine automated, so that I dump and reload the top 9 languages of each project into MySQL.
[..snip..]
I have the whole process of fetch, unpack, import scripted to happen unattended and aside from initial debugging, it has not failed yet in the last year or more.
Care to post a link to the script? It might be useful to others and save reinventing the wheel, especially if you could say "I want languages X, Y, Z, these are my db names, my db hostname, my username, and my password" and then drop it into /etc/cron.daily/. It would do a daily poll to see if there are updated successful dumps in any of those languages, and if so download them, test the archive validity, check the MD5sum, and if both look okay then unpack and import. Even if it only does most of the above, it could be a useful starting point for people who want a semi-recent local copy of some Wikipedia sites that auto-updates.
-- All the best, Nick.
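(No such script got posted in this thread, but here is a rough sketch of the kind of cron job Nick describes. It is not David's script; the fixed dump date, the <wiki>-<date>-md5sums.txt file name and the mwdumper-based import are all assumptions, and each target database must already contain the MediaWiki tables.)

#!/bin/sh
# Sketch only: the dump date is hard-coded rather than polled for,
# and the md5sums file name and credentials are assumptions.
DATE=20071018
DBUSER=wiki
DBPASS=secret
for W in enwiki frwiki dewiki; do
    BASE=http://download.wikimedia.org/$W/$DATE
    DUMP=$W-$DATE-pages-articles.xml.bz2
    wget -c $BASE/$DUMP || continue
    wget -c $BASE/$W-$DATE-md5sums.txt || continue
    # verify the published checksum and the archive integrity before importing
    grep $DUMP $W-$DATE-md5sums.txt | md5sum -c - || continue
    bzip2 -t $DUMP || continue
    # the target database must already hold the page/revision/text tables
    java -jar mwdumper.jar --format=sql:1.5 $DUMP \
        | mysql -u $DBUSER -p$DBPASS --default-character-set=utf8 $W
done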
On Tue, 2007-11-06 at 11:59 +1100, Nick Jenkins wrote:
Care to post a link to the script?
Sorry, the script is not public.. at the moment. I'm currently using it to create and drive the content backend for my Plucker Wikipedia projects (see url in sig).
But my mwdumper mods will certainly be made available, when I'm sure they're working properly, and don't go breaking things :D
On Mon, 2007-10-29 at 09:02 -0400, David A. Desrosiers wrote:
I'm testing a download of the -meta-current.xml.bz2 right now to see if it does indeed work, but it will take 1/2 day to get it all. I'll post back and let you know what happens.
It took a while on my lowly home DSL connection, but here are the results using plain wget:
$ wget -c http://download.wikimedia.org/enwiki/20071018/enwiki-20071018-pages-meta-cur...
--20:52:54--  http://download.wikimedia.org/enwiki/20071018/enwiki-20071018-pages-meta-cur...
           => `enwiki-20071018-pages-meta-current.xml.bz2'
Resolving download.wikimedia.org... 66.230.200.212
Connecting to download.wikimedia.org|66.230.200.212|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 5,780,471,837 (5.4G) [application/x-bzip]
100%[==================================>] 5,780,471,837 158.53K/s ETA 00:00
07:03:59 (153.96 KB/s) - `enwiki-20071018-pages-meta-current.xml.bz2' saved [5780471837/5780471837]
$ du -sch enwiki-20071018-pages-meta-current.xml.bz2
5.4G    enwiki-20071018-pages-meta-current.xml.bz2
5.4G    total
$ md5sum enwiki-20071018-pages-meta-current.xml.bz2
faa7e18e372103d861c8c933efd782e2  enwiki-20071018-pages-meta-current.xml.bz2
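(The checksum can also be compared against the md5sums file published next to the dump, and bzip2 can test the archive itself; the md5sums file name below follows the usual naming pattern and is an assumption:)

$ wget http://download.wikimedia.org/enwiki/20071018/enwiki-20071018-md5sums.txt
$ grep pages-meta-current enwiki-20071018-md5sums.txt | md5sum -c -
$ bzip2 -t enwiki-20071018-pages-meta-current.xml.bz2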
Hi David,
The problem is not with downloading. The problem is with loading the data into mysql.
-Fawad.
On Tue, 2007-10-30 at 23:21 +0900, Fawad Nazir wrote:
The problem is not with downloading. The problem is with loading the data into mysql.
Fawad, you must be replying to another post, because I was replying to Osnat, who was complaining that the 5.4G dump was 1.5G. I just disproved that.
On Mon, 2007-10-29 at 12:47 +0200, Osnat Etgar wrote:
When I try to download pages-meta-current the size is 1.5GB, instead of 5.4GB. When I download pages-articles the size is 3GB like it should be.
Now, I have all of the relevant dumps locally... so let me try to unpack and import them into MySQL, and see if they continue to work. I suspect that they will, because they always have for me.
I have the whole process of fetch, unpack, import scripted to happen unattended and aside from initial debugging, it has not failed yet in the last year or more.
I'll post back with my results when that is done.
On Tue, 2007-10-30 at 17:16 -0400, David A. Desrosiers wrote:
I have the whole process of fetch, unpack, import scripted to happen unattended and aside from initial debugging, it has not failed yet in the last year or more.
I'll post back with my results when that is done.
The full import of page, revision and text for Wikipedia, Wikiquote and Wiktionary in 9 languages (en fr de pl it pt sv es nl) took:
real    91m19.910s
user    15m16.073s
sys     1m43.810s
The number of records found for the Wikipedia versions of these languages are:
for lang in en fr de pl it pt sv es nl; do mysql mediawiki -e "select count(*) from ep_${lang}_text"; done;
| count(*) |  wiki
|  5836166 |  English Wikipedia
|  1147662 |  French
|  1344476 |  German
|   628225 |  Polish
|   688577 |  Italian
|   726593 |  Portuguese
|   445601 |  Swedish
|   556088 |  Spanish
|   616810 |  Dutch
Everything looks to be working fine for me.
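(A quick way to spot-check an import like this, assuming the page table carries the same ep_<lang>_ prefix as the text table above; that table name is a guess:)

$ mysql mediawiki -e "SELECT page_namespace, COUNT(*) FROM ep_en_page GROUP BY page_namespace"

A pages-meta-current import should show rows in the talk and user namespaces as well, whereas a pages-articles import should not.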
It turns out that only some download tools can read the whole 5.4GB file.
This is the status:
* 3.0GB file (pages-articles), though the checksum is wrong!! I've tried many download tools.
Problem 1: I managed to get most of the 3.0GB into mysql, but got an error message in the middle:
Exception in thread "main" java.io.IOException: An invalid XML character (Unicode: 0x2) was found in the element content of the document.
        at org.mediawiki.importer.XmlDumpReader.readDump(Unknown Source)
        at org.mediawiki.dumper.Dumper.main(Unknown Source)
Problem 2: I got exactly 42000 rows. How could that be?
* 5.4GB file (pages-meta-current), with the correct checksum
Problem 3: I tried to load the 5.4GB data into mysql (to a clean database, of course) and got only 4000 rows! I don't understand why everything is so difficult. While it seems fine on the command line, it doesn't look like that in the database:
D:\Projects\wikipedia> java -jar mwdumper.jar --format=sql:1.5 F:\Datasets\Wikipedia\enwiki-20071018-pages-meta-current.xml.bz2 | mysql -u <username> -p wikipedia --default-character-set=utf8
...
10,632,000 pages (778.271/sec), 10,632,000 revs (778.271/sec)
10,633,000 pages (778.294/sec), 10,633,000 revs (778.294/sec)
10,633,249 pages (778.301/sec), 10,633,249 revs (778.301/sec)
Thanks
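(One possible workaround for the 0x2 error, not verified here: strip the control characters that XML 1.0 forbids before the stream reaches mwdumper. This assumes mwdumper reads the dump from standard input when given "-"; the database name and credentials are placeholders.)

$ bzcat enwiki-20071018-pages-meta-current.xml.bz2 \
    | tr -d '\000-\010\013\014\016-\037' \
    | java -jar mwdumper.jar --format=sql:1.5 - \
    | mysql -u <username> -p wikipedia --default-character-set=utf8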