Hi, thanks for your reply. I am a newbie in this field.
1. I only want the articles: no history, no user information, no discussions. I do want articles, lists and disambiguation pages. Maybe I understood it wrong and I don't need pages-meta-current at all, only pages-articles?
2. I have some data from pages-meta-current in my database.
2.1. I got an error in the middle, after over a million pages were extracted:

    Exception in thread "main" java.io.IOException: An invalid XML character
    (Unicode: 0x2) was found in the element content of the document.
        at org.mediawiki.importer.XmlDumpReader.readDump(Unknown Source)
        at org.mediawiki.dumper.Dumper.main(Unknown Source)
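I am thinking of working around this by stripping the illegal control characters before the XML reaches mwdumper. This is just a sketch I put together (the class name is mine, and I have not run it against a full dump): it copies stdin to stdout byte by byte and drops the control bytes that XML 1.0 forbids, so the 0x2 should disappear without touching the UTF-8 text. I would decompress the dump, pipe it through this, and then pipe the result into mwdumper as before.

    import java.io.*;

    // Copy stdin to stdout, dropping ASCII control bytes that are illegal in
    // XML 1.0 (everything below 0x20 except tab, LF and CR). Multi-byte UTF-8
    // sequences are left alone, since all of their bytes are >= 0x80.
    public class StripInvalidXmlChars {
        public static void main(String[] args) throws IOException {
            InputStream in = new BufferedInputStream(System.in);
            OutputStream out = new BufferedOutputStream(System.out);
            int b;
            while ((b = in.read()) != -1) {
                boolean illegalControl = b < 0x20 && b != 0x09 && b != 0x0A && b != 0x0D;
                if (!illegalControl) {
                    out.write(b);
                }
            }
            out.flush();
        }
    }

Does that sound like a sane approach, or is there a standard fix for these dumps?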
While we are here, I'd like to ask some more questions, if you don't mind:
1. How do I read the data from MySQL? I don't understand how the entries are connected to one another or how I should read them (my current guess is in the P.S. below).
2. Do I have to clean up the MySQL tables every time I want to insert another dump, whether it is an update or a completely different one?
3. Is there a way to get only a delta file instead of downloading the whole dump again?
4. How do I load .sql.gz files into MySQL?

Thanks a lot for your answers,
Osnat
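P.S. Regarding question 1, this is my current guess at how to read the current text of one article, based on the page, revision and text tables I see in my database. The connection URL, user name and password are placeholders, and I am not sure the joins or the old_flags handling are right:

    import java.sql.*;

    // My guess: fetch the current text of one article by title, assuming
    //   page.page_latest     -> revision.rev_id
    //   revision.rev_text_id -> text.old_id
    // and that old_text is plain UTF-8 (no gzip flag in old_flags).
    // The connection URL, user and password below are placeholders.
    public class ReadArticle {
        public static void main(String[] args) throws Exception {
            Class.forName("com.mysql.jdbc.Driver");
            String title = args.length > 0 ? args[0] : "Main_Page"; // underscores, not spaces
            Connection conn = DriverManager.getConnection(
                    "jdbc:mysql://localhost/wikidb?useUnicode=true&characterEncoding=UTF-8",
                    "wikiuser", "secret");
            String sql =
                "SELECT t.old_text FROM page p " +
                "JOIN revision r ON r.rev_id = p.page_latest " +
                "JOIN `text` t ON t.old_id = r.rev_text_id " +
                "WHERE p.page_namespace = 0 AND p.page_title = ?";
            PreparedStatement stmt = conn.prepareStatement(sql);
            stmt.setString(1, title);
            ResultSet rs = stmt.executeQuery();
            if (rs.next()) {
                System.out.println(new String(rs.getBytes("old_text"), "UTF-8"));
            } else {
                System.out.println("No article found for " + title);
            }
            conn.close();
        }
    }

If this is wrong, I'd be glad for a pointer to the schema documentation.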
-----Original Message-----
From: wikitech-l-bounces@lists.wikimedia.org [mailto:wikitech-l-bounces@lists.wikimedia.org] On Behalf Of David A. Desrosiers
Sent: Monday, October 29, 2007 3:03 PM
To: Wikimedia developers
Subject: Re: [Wikitech-l] Dump is small
On Mon, 2007-10-29 at 12:47 +0200, Osnat Etgar wrote:
I don't want all the history. I just want the current articles, so I am downloading pages-meta-current.xml.bz2 and pages-articles.xml.bz2
You don't want the history, but you want all of the discussion and user pages? Are you sure?
I'm testing a download of the -meta-current.xml.bz2 right now to see if it does indeed work, but it will take 1/2 day to get it all. I'll post back and let you know what happens.
Where else can I get the pages-meta-current? The previous dump? When I look for the previous one, I can only find a status.html file. Maybe I don't really need the pages-meta-current if I only want the current articles?
The server claims to have the right amount of bytes, so let's see what happens when my download completes:
Server: Wikimedia dump service 20050523 (lighttpd)
Content-Length: 5780471837