Hello.
Yesterday, I was moving around mysqldump files of our processed databases from parsed Wikipedia dumps, and this simple question came to my mind.
Is there any special reason to use an "ad-hoc" XML schema for Wikipedia dumps? Could a mysqldump on every language edition slow down the Wikipedia MySQL server?
I guess some problem could arise, and that's why we don't use it. Otherwise, perhaps we could consider creating such a mysqldump to speed up the import process back into our local servers, instead of having to parse a huge XML file.
That's especially true for the very large meta-history.xml versions. And you can still filter out sensitive tables (user, etc.).
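For illustration, a minimal sketch of the kind of stream parsing this requires (element names follow the MediaWiki export schema; the file name is just an example, not our actual code):

import bz2
import xml.etree.ElementTree as ET

def iter_pages(path):
    """Yield (title, number_of_revisions) for every <page> in a dump."""
    with bz2.open(path, "rb") as stream:
        title, revisions = None, 0
        for _, elem in ET.iterparse(stream, events=("end",)):
            tag = elem.tag.rsplit("}", 1)[-1]   # drop the export namespace
            if tag == "title":
                title = elem.text
            elif tag == "revision":
                revisions += 1
                elem.clear()                    # free the revision text early
            elif tag == "page":
                yield title, revisions
                title, revisions = None, 0
                elem.clear()                    # keep memory roughly constant

if __name__ == "__main__":
    # Example file name; any pages-meta-history dump works the same way.
    for title, revs in iter_pages("eswiki-pages-meta-history.xml.bz2"):
        print(title, revs)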
Regards,
Felipe.
Felipe Ortega wrote:
Yesterday, I was moving around mysqldump files of our processed databases from parsed Wikipedia dumps, and this simple question came to my mind.
Is there any special reason to use an "ad-hoc" XML schema for Wikipedia dumps?
1) The format is relatively stable, unlike our database schema.
2) Our databases are spread over dozens of servers, in mixes of internal binary compression formats whose interpretation is dependent on our configuration and custom code.
3) Our internal databases mix public and private information, which we have to separate for external dumps. Thus only completely public tables are dumped with mysqldump.
Thus, we use a stable, safe data schema for public page dumps. Dumping raw SQL of these tables would be unstable, insecure, and useless for most people.
- -- brion vibber (brion @ wikimedia.org)
Brion Vibber brion@wikimedia.org wrote: 1) The format is relatively stable, unlike our database schema.
Sure, that's why we also have to update tables.sql (sometimes) to load the data back into the server :)
2) Our databases are spread over dozens of servers, in mixes of internal binary compression formats whose interpretation is dependent on our configuration and custom code.
3) Our internal databases mix public and private information, which we have to separate for external dumps. Thus only completely public tables are dumped with mysqldump.
That makes sense, Brion. Thank you for this clarification.
Regards,
Felipe.
Hi
On Sat, Apr 19, 2008 at 2:45 AM, Brion Vibber brion@wikimedia.org wrote:
Felipe Ortega wrote:
Yesterday, I was moving around mysqldump files of our processed databases from parsed Wikipedia dumps, and this simple question came to my mind.
Is there any special reason to use an "ad-hoc" XML schema for Wikipedia dumps?
1) The format is relatively stable, unlike our database schema.
2) Our databases are spread over dozens of servers, in mixes of internal binary compression formats whose interpretation is dependent on our configuration and custom code.
3) Our internal databases mix public and private information, which we have to separate for external dumps. Thus only completely public tables are dumped with mysqldump.
Thus, we use a stable, safe data schema for public page dumps. Dumping raw SQL of these tables would be unstable, insecure, and useless for most people.
I agree that dumping to SQL statements is a little bit useless, but how about CSV?
mysqldump allows you to dump to CSV files instead of raw SQL statements (you can specify the fields you want); they are pretty safe, and storage-efficient for download.
Even better, mysqlimport can import those CSV files at very high speed.
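Roughly what I have in mind, sketched as a small illustrative Python wrapper (the database name, table list and output path are made-up examples; note that --tab writes its files on the database server host):

import subprocess

DB = "wikidb"                     # made-up database name
TABLES = ["page", "revision"]     # example tables
OUTDIR = "/tmp/dump"              # directory on the MySQL server host

# Dump each table as comma-separated text files instead of INSERT statements.
subprocess.run(
    ["mysqldump", "--tab=" + OUTDIR,
     "--fields-terminated-by=,", '--fields-optionally-enclosed-by="',
     DB] + TABLES,
    check=True)

# Reload them; mysqlimport derives the target table from each file name
# (page.txt -> table `page`).
subprocess.run(
    ["mysqlimport", "--local",
     "--fields-terminated-by=,", '--fields-optionally-enclosed-by="',
     DB] + ["%s/%s.txt" % (OUTDIR, t) for t in TABLES],
    check=True)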
Of course, many people are already using the XML files, so I am not asking you to change that, but to provide another set of dumps in CSV format, which could save many people a lot of file downloading, XML parsing, etc.
What do you think?
Thanks. Howard
howard chen wrote:
I agree that dumping to SQL statements is a little bit useless, but how about CSV?
mysqldump allows you to dump to CSV files instead of raw SQL statements (you can specify the fields you want); they are pretty safe, and storage-efficient for download.
Even better, mysqlimport can import those CSV files at very high speed.
Issues 1, 2 and 3 that apply to SQL also apply to any other form of dump done via MySQL, including CSV. There is no feasible way of providing a CSV dump, for the same reasons that an SQL one cannot be provided. The problem here is not the format, but the process through which it is created.
MinuteElectron.
On 19.04.2008 14:35:18, howard chen wrote:
I agree that dumping to SQL statements is a little bit useless, but how about CSV?
Bah, CSV is deprecated. XML is a much more flexible and even human-readable format. It is so flexible that if you want CSV, you can easily transform the XML into CSV using XSLT.
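For example, something along these lines with lxml in Python (the stylesheet, the input file name and the export namespace version are only illustrative; the namespace changes between dump releases):

from lxml import etree   # third-party library

# Turn every <page> into one "title,id" line of text output.
XSLT_SRC = b"""\
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:mw="http://www.mediawiki.org/xml/export-0.3/">
  <xsl:output method="text"/>
  <xsl:template match="/">
    <xsl:for-each select="//mw:page">
      <xsl:value-of select="mw:title"/>
      <xsl:text>,</xsl:text>
      <xsl:value-of select="mw:id"/>
      <xsl:text>&#10;</xsl:text>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>
"""

transform = etree.XSLT(etree.XML(XSLT_SRC))
doc = etree.parse("pages-articles.xml")   # example input file
print(str(transform(doc)), end="")        # CSV-ish lines on stdout

Note that a plain XSLT 1.0 transform needs the whole document in memory, so for the big meta-history files a streaming approach is still needed, and fields containing commas would also need quoting.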
Leon