Hello.
Yesterday, I was moving around mysqldump files of our processed databases from parsed Wikipedia dumps, and this simple question came to my mind.
Is there any special reason to use an "ad-hoc" XML schema for Wikipedia dumps? Could a mysqldump on every language edition slow down the Wikipedia MySQL server?
I guess some problem could arise, and that's why we don't use it. Otherwise, perhaps we could consider creating such a mysqldump to speed up the import process back into our local servers, instead of having to parse a huge XML file.
That's especially true for the very large meta-history.xml versions. And you can still filter out sensitive tables (user, etc.).
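For illustration, a minimal sketch of the kind of stream parsing this requires (element names follow the MediaWiki export schema; the file name is just an example, not our actual code):

import bz2
import xml.etree.ElementTree as ET

def iter_pages(path):
    """Yield (title, number_of_revisions) for every <page> in a dump."""
    with bz2.open(path, "rb") as stream:
        title, revisions = None, 0
        for _, elem in ET.iterparse(stream, events=("end",)):
            tag = elem.tag.rsplit("}", 1)[-1]   # drop the export namespace
            if tag == "title":
                title = elem.text
            elif tag == "revision":
                revisions += 1
                elem.clear()                    # free the revision text early
            elif tag == "page":
                yield title, revisions
                title, revisions = None, 0
                elem.clear()                    # keep memory roughly constant

if __name__ == "__main__":
    # Example file name; any pages-meta-history dump works the same way.
    for title, revs in iter_pages("eswiki-pages-meta-history.xml.bz2"):
        print(title, revs)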
Regards,
Felipe.
Felipe Ortega wrote:
Yesterday, I was moving around mysqldump files of our processed databases from parsed Wikipedia dumps, and this simple question came to my mind.
Is there any special reason to use an "ad-hoc" XML schema for Wikipedia dumps?
1) The format is relatively stable, unlike our database schema.
2) Our databases are spread over dozens of servers, in mixes of internal binary compression formats whose interpretation is dependent on our configuration and custom code.
3) Our internal databases mix public and private information, which we have to separate for external dumps. Thus only completely public tables are dumped with mysqldump.
Thus, we use a stable, safe data schema for public page dumps. Dumping raw SQL of these tables would be unstable, insecure, and useless for most people.
- -- brion vibber (brion @ wikimedia.org)
Brion Vibber brion@wikimedia.org wrote: 1) The format is relatively stable, unlike our database schema.
Sure, that's why we also have to update tables.sql (sometimes) to load the data back into the server :)
2) Our databases are spread over dozens of servers, in mixes of internal binary compression formats whose interpretation is dependent on our configuration and custom code.
3) Our internal databases mix public and private information, which we have to separate for external dumps. Thus only completely public tables are dumped with mysqldump.
That makes sense, Brion. Thank you for this clarification.
Regards,
Felipe.
Hi
On Sat, Apr 19, 2008 at 2:45 AM, Brion Vibber brion@wikimedia.org wrote:
Felipe Ortega wrote:
Yesterday, I was moving around mysqldump files of our processed databases from parsed Wikipedia dumps, and this simple question came to my mind.
Is there any special reason to use an "ad-hoc" XML schema for Wikipedia dumps?
1) The format is relatively stable, unlike our database schema.
2) Our databases are spread over dozens of servers, in mixes of internal binary compression formats whose interpretation is dependent on our configuration and custom code.
3) Our internal databases mix public and private information, which we have to separate for external dumps. Thus only completely public tables are dumped with mysqldump.
Thus, we use a stable, safe data schema for public page dumps. Dumping raw SQL of these tables would be unstable, insecure, and useless for most people.
I agree that dumping to SQL statements is a little bit useless, but how about CSV?
mysqldump allows you to dump to CSV files instead of raw SQL statements (you can specify the fields you want); they are pretty safe, and storage-efficient for download.
Even better, mysqlimport can import those CSV files at very high speed.
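Roughly what I have in mind, sketched as a small illustrative Python wrapper (the database name, table list and output path are made-up examples; note that --tab writes its files on the database server host):

import subprocess

DB = "wikidb"                     # made-up database name
TABLES = ["page", "revision"]     # example tables
OUTDIR = "/tmp/dump"              # directory on the MySQL server host

# Dump each table as comma-separated text files instead of INSERT statements.
subprocess.run(
    ["mysqldump", "--tab=" + OUTDIR,
     "--fields-terminated-by=,", '--fields-optionally-enclosed-by="',
     DB] + TABLES,
    check=True)

# Reload them; mysqlimport derives the target table from each file name
# (page.txt -> table `page`).
subprocess.run(
    ["mysqlimport", "--local",
     "--fields-terminated-by=,", '--fields-optionally-enclosed-by="',
     DB] + ["%s/%s.txt" % (OUTDIR, t) for t in TABLES],
    check=True)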
Of course, many people are already using the XML files, so I am not asking you to change that, but to provide another set of dumps in CSV format, which could save many people a lot of file downloading, XML parsing, etc.
What do you think?
Thanks. Howard
howard chen wrote:
I agree that dumping to SQL statements is a little bit useless, but how about CSV?
mysqldump allows you to dump to CSV files instead of raw SQL statements (you can specify the fields you want); they are pretty safe, and storage-efficient for download.
Even better, mysqlimport can import those CSV files at very high speed.
Issues 1, 2 and 3 that apply to SQL also apply to any other form of dump done via MySQL, including CSV. There is no feasible way of providing a CSV dump, for the same reasons that an SQL one cannot be provided. The problem here is not the format, but the process through which it is created.
MinuteElectron.
On 19.04.2008 14:35:18, howard chen wrote:
I agree that dumping to SQL statements is a little bit useless, but how about CSV?
Bah, CSV is deprecated. XML is a much more flexible and even human-readable format. It is so flexible that if you want CSV, you can easily transform the XML into CSV using XSLT.
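For example, something along these lines with lxml in Python (the stylesheet, the input file name and the export namespace version are only illustrative; the namespace changes between dump releases):

from lxml import etree   # third-party library

# Turn every <page> into one "title,id" line of text output.
XSLT_SRC = b"""\
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:mw="http://www.mediawiki.org/xml/export-0.3/">
  <xsl:output method="text"/>
  <xsl:template match="/">
    <xsl:for-each select="//mw:page">
      <xsl:value-of select="mw:title"/>
      <xsl:text>,</xsl:text>
      <xsl:value-of select="mw:id"/>
      <xsl:text>&#10;</xsl:text>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>
"""

transform = etree.XSLT(etree.XML(XSLT_SRC))
doc = etree.parse("pages-articles.xml")   # example input file
print(str(transform(doc)), end="")        # CSV-ish lines on stdout

Note that a plain XSLT 1.0 transform needs the whole document in memory, so for the big meta-history files a streaming approach is still needed, and fields containing commas would also need quoting.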
Leon