Hello,
Given that the current dump process is having problems, why not provide a simple fix such as offering a raw table format, SQL files, or even CSV files?
I think there are still quite a lot of people using MySQL who would prefer these formats instead of XML (as you know, XML generation and parsing really take time...)
Howard
howard chen wrote:
Hello,
Given that the current dump process is having problems, why not provide a simple fix such as offering a raw table format, SQL files, or even CSV files?
Because that would contain private data. It needs to be filtered first, and for that the text blobs have to be uncompressed and pulled apart so that individual revisions can be handled. Once all that is done, it can just as well be written as XML. XML as such is not the problem.
-- daniel
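For readers unfamiliar with the storage layer, here is a rough, hypothetical sketch of the pipeline Daniel describes: decompress the stored text blobs, drop non-public revisions, and write the rest out as XML. The field names and the plain zlib compression are assumptions for illustration only; the real text table uses several storage formats (gzip, serialized objects, external storage).

```python
import zlib
from xml.sax.saxutils import escape

def dump_revisions(rows, out):
    """Write public revisions as XML; `rows` is any iterable of dicts (hypothetical schema)."""
    out.write("<revisions>\n")
    for row in rows:
        if row["deleted"]:          # filter out private/suppressed data first
            continue
        # assume a plain zlib-compressed text blob for simplicity
        text = zlib.decompress(row["text_blob"]).decode("utf-8")
        out.write("  <revision>\n")
        out.write("    <id>%d</id>\n" % row["rev_id"])
        out.write("    <text>%s</text>\n" % escape(text))
        out.write("  </revision>\n")
    out.write("</revisions>\n")
```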
-----Original Message----- Given that the current dump process is having problems, why not provide a simple fix such as offering a raw table format, SQL files, or even CSV files?
Howard
Howard,
Can't you get the SQL files from running mysqldump from the command line? Why does something new need to be created? I hope I'm not being dense, but I don't understand what new niche you are asking to fill.
Thanks! -Courtney
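As a point of reference, for a single local MediaWiki installation Courtney's suggestion does work: mysqldump can emit plain SQL, or tab-separated per-table files via --tab. A minimal sketch follows, with hypothetical database name, user, and paths; note that --tab requires the FILE privilege and a directory the server can write to.

```python
import subprocess

# Plain SQL dump of a local wiki database (prompts for the password)
with open("wikidb.sql", "wb") as out:
    subprocess.run(["mysqldump", "-u", "wikiuser", "-p", "wikidb"],
                   stdout=out, check=True)

# Tab-separated data files (one .txt per table) plus .sql schema files
subprocess.run(
    ["mysqldump", "-u", "wikiuser", "-p", "--tab=/tmp/wikidump", "wikidb"],
    check=True,
)
```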
On Tue, Mar 31, 2009 at 10:02 AM, Christensen, Courtney ChristensenC@battelle.org wrote:
-----Original Message----- Given that the current dump process is having problems, why not provide a simple fix such as offering a raw table format, SQL files, or even CSV files?
Howard,
Can't you get the SQL files from running mysqldump from the command line? Why does something new need to be created? I hope I'm not being dense, but I don't understand what new niche you are asking to fill.
Because the data (text) isn't in a single database; even for a single project, it is spread across a large number of machines. It's also stored in a mixture of bizarre internal formats.
The file format is pretty much irrelevant to the 'cost' of producing a dump.
howard chen wrote:
Hello,
Given that the current dump process is having problems, why not provide a simple fix such as offering a raw table format, SQL files, or even CSV files?
Ooooh! You meant for Wikipedia, not for individual MediaWiki installations? Well, that makes a lot more sense. We call DumpHTML our dump process; it isn't running smoothly for us at the moment, and clients want flattened HTML from the wikis to take around to trade shows and whatever.
Sorry for being confused! -Courtney
(as you know, XML generation and parsing really take time...)
I didn't know that. Ever tried SAX?
On 3/31/09 8:21 AM, Domas Mituzas wrote:
(as you know, XML generation and parsing really take time...)
I didn't know that. Ever tried SAX?
Indeed, not all XML software is slow...
http://dotnot.org/blog/archives/2008/02/
- Trevor
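To make the SAX point concrete, here is a minimal sketch of streaming through a dump without loading it into memory, just counting page elements; the dump filename is a placeholder.

```python
import xml.sax

class PageCounter(xml.sax.ContentHandler):
    """Count <page> elements while streaming through the dump."""
    def __init__(self):
        super().__init__()
        self.pages = 0

    def startElement(self, name, attrs):
        if name == "page":
            self.pages += 1

handler = PageCounter()
xml.sax.parse("enwiki-pages-articles.xml", handler)  # placeholder filename
print(handler.pages, "pages")
```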
On Tue, Mar 31, 2009 at 9:57 AM, howard chen howachen@gmail.com wrote:
Given that the current dump process is having problems, why not provide a simple fix such as offering a raw table format, SQL files, or even CSV files?
Please pardon this newbie question: is there a succinct explanation of what the problem is with the current Wikipedia dump process?
//Ed
On Wed, Apr 1, 2009 at 12:07 AM, Ed Summers ehs@pobox.com wrote:
Please pardon this newbie question: is there a succinct explanation of what the problem is with the current Wikipedia dump process?
"needs a rewrite"?
what the problem is with the current Wikipedia dump process?
It's choking on almost 300 million revisions. It wasn't designed for a wiki this size and needed to be rewritten two years ago.
On Wed, Apr 1, 2009 at 9:28 AM, Aryeh Gregor <Simetrical+wikilist@gmail.com> wrote:
On Wed, Apr 1, 2009 at 12:07 AM, Ed Summers ehs@pobox.com wrote:
Please pardon this newbie question: is there a succinct explanation of what the problem is with the current Wikipedia dump process?
"needs a rewrite"?
Ed Summers wrote:
On Tue, Mar 31, 2009 at 9:57 AM, howard chen howachen@gmail.com wrote:
Given that the current dump process is having problems, why not provide a simple fix such as offering a raw table format, SQL files, or even CSV files?
Please pardon this newbie question: is there a succinct explanation of what the problem is with the current Wikipedia dump process?
http://wikitech.wikimedia.org/view/Data_dump_redesign
-- brion