Hi, thanks for your reply. I am a newbie in this field.
1. I only want the articles: no history, no user information, no discussions. I do want articles, lists and disambiguation pages. Maybe I understood it wrong and I don't need pages-meta-current at all, only pages-articles?
2. I have some data from pages-meta-current in my database.
2.1. I got an error in the middle, after over a million pages were extracted:

    Exception in thread "main" java.io.IOException: An invalid XML character
    (Unicode: 0x2) was found in the element content of the document.
        at org.mediawiki.importer.XmlDumpReader.readDump(Unknown Source)
        at org.mediawiki.dumper.Dumper.main(Unknown Source)
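I am thinking of working around this by stripping the illegal control characters before the XML reaches mwdumper. This is just a sketch I put together (the class name is mine, and I have not run it against a full dump): it copies stdin to stdout byte by byte and drops the control bytes that XML 1.0 forbids, so the 0x2 should disappear without touching the UTF-8 text. I would decompress the dump, pipe it through this, and then pipe the result into mwdumper as before.

    import java.io.*;

    // Copy stdin to stdout, dropping ASCII control bytes that are illegal in
    // XML 1.0 (everything below 0x20 except tab, LF and CR). Multi-byte UTF-8
    // sequences are left alone, since all of their bytes are >= 0x80.
    public class StripInvalidXmlChars {
        public static void main(String[] args) throws IOException {
            InputStream in = new BufferedInputStream(System.in);
            OutputStream out = new BufferedOutputStream(System.out);
            int b;
            while ((b = in.read()) != -1) {
                boolean illegalControl = b < 0x20 && b != 0x09 && b != 0x0A && b != 0x0D;
                if (!illegalControl) {
                    out.write(b);
                }
            }
            out.flush();
        }
    }

Does that sound like a sane approach, or is there a standard fix for these dumps?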
While we are here, I'd like to ask some more questions, if you don't mind:
1. How do I read the data from MySQL? I don't understand how the entries are connected to one another or how I should read them (my current guess is in the P.S. below).
2. Do I have to clean up the MySQL tables every time I want to insert another dump, whether it is an update or a completely different one?
3. Is there a way to get only a delta file instead of downloading the whole dump again?
4. How do I load .sql.gz files into MySQL?

Thanks a lot for your answers,
Osnat
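P.S. Regarding question 1, this is my current guess at how to read the current text of one article, based on the page, revision and text tables I see in my database. The connection URL, user name and password are placeholders, and I am not sure the joins or the old_flags handling are right:

    import java.sql.*;

    // My guess: fetch the current text of one article by title, assuming
    //   page.page_latest     -> revision.rev_id
    //   revision.rev_text_id -> text.old_id
    // and that old_text is plain UTF-8 (no gzip flag in old_flags).
    // The connection URL, user and password below are placeholders.
    public class ReadArticle {
        public static void main(String[] args) throws Exception {
            Class.forName("com.mysql.jdbc.Driver");
            String title = args.length > 0 ? args[0] : "Main_Page"; // underscores, not spaces
            Connection conn = DriverManager.getConnection(
                    "jdbc:mysql://localhost/wikidb?useUnicode=true&characterEncoding=UTF-8",
                    "wikiuser", "secret");
            String sql =
                "SELECT t.old_text FROM page p " +
                "JOIN revision r ON r.rev_id = p.page_latest " +
                "JOIN `text` t ON t.old_id = r.rev_text_id " +
                "WHERE p.page_namespace = 0 AND p.page_title = ?";
            PreparedStatement stmt = conn.prepareStatement(sql);
            stmt.setString(1, title);
            ResultSet rs = stmt.executeQuery();
            if (rs.next()) {
                System.out.println(new String(rs.getBytes("old_text"), "UTF-8"));
            } else {
                System.out.println("No article found for " + title);
            }
            conn.close();
        }
    }

If this is wrong, I'd be glad for a pointer to the schema documentation.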
-----Original Message-----
From: wikitech-l-bounces@lists.wikimedia.org [mailto:wikitech-l-bounces@lists.wikimedia.org] On Behalf Of David A. Desrosiers
Sent: Monday, October 29, 2007 3:03 PM
To: Wikimedia developers
Subject: Re: [Wikitech-l] Dump is small
On Mon, 2007-10-29 at 12:47 +0200, Osnat Etgar wrote:
I don't want all the history. I just want the current articles, so I am downloading pages-meta-current.xml.bz2 and pages-articles.xml.bz2
You don't want the history, but you want all of the discussion and user pages? Are you sure?
I'm testing a download of the -meta-current.xml.bz2 right now to see if it does indeed work, but it will take 1/2 day to get it all. I'll post back and let you know what happens.
Where else can I get the pages-meta-current? The previous dump? When I look for the previous one, I can only find a status.html file. Maybe I don't really need the pages-meta-current if I only want the current articles?
The server claims to have the right amount of bytes, so let's see what happens when my download completes:
Server: Wikimedia dump service 20050523 (lighttpd)
Content-Length: 5780471837