Hey,
Are there any ideas for some kind of incremental dump? Or are bandwidth and disk storage no problem compared to the more complex implementation of incremental dumps? And are there any statistics about how much bandwidth is used by downloads and database dumps (i.e. not normal visitor traffic)?
Incremental dumps have always been a perennial request from new developers. It would not be too sophisticated to provide streams of changed articles, but proper synchronization is currently possible only by replicating SQL commands.
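Something like this would already give a crude stream of changed articles (a Python sketch only; it assumes direct database access and the pre-1.5 "cur" table layout, so column names and connection details are guesses):

    # Sketch: print the current text of every page changed since a timestamp.
    # Assumes the pre-1.5 schema (recentchanges + cur); column names are from
    # memory and may be off.
    import MySQLdb

    def dump_changed(since, conn_args):
        db = MySQLdb.connect(**conn_args)
        c = db.cursor()
        c.execute("SELECT DISTINCT rc_namespace, rc_title FROM recentchanges"
                  " WHERE rc_timestamp > %s", (since,))
        for ns, title in c.fetchall():
            c.execute("SELECT cur_text FROM cur WHERE cur_namespace = %s"
                      " AND cur_title = %s", (ns, title))
            row = c.fetchone()
            if row:
                print("== %s:%s ==" % (ns, title))
                print(row[0])

    # e.g. everything changed since 2005-01-01 (connection details invented):
    dump_changed("20050101000000",
                 dict(host="localhost", user="wiki", passwd="secret", db="wikidb"))

But that gives you current text only; deletions, moves and full history are exactly where it stops being a dump and starts being replication.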
We do currently have incremental dumps of a sort (MySQL binlogs), but unlike the public dump service, they contain the internal tables (with all the sensitive information). To provide them publicly we would either have to set up yet another MySQL slave with limited replication, or somehow filter the binlogs, which is a pain as well.
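Filtering would look something like this in principle (a sketch only; mysqlbinlog is real, but the list of private tables and the naive statement-level matching are just to illustrate why it's a pain):

    # Sketch: drop statements that touch private tables from mysqlbinlog output.
    # The private-table list is a guess, and matching table names by substring
    # is exactly the kind of fragile filtering that makes this unattractive.
    import subprocess, sys

    PRIVATE_TABLES = ("user", "watchlist", "ipblocks")

    def filter_binlog(path):
        out = subprocess.Popen(["mysqlbinlog", path],
                               stdout=subprocess.PIPE, text=True).stdout
        stmt = []
        for line in out:
            stmt.append(line)
            if line.rstrip().endswith(";"):
                text = "".join(stmt)
                if not any(t in text for t in PRIVATE_TABLES):
                    sys.stdout.write(text)
                stmt = []

    filter_binlog("/var/log/mysql/binlog.000042")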
There's yet another issue: with 1.5 a major database redesign will happen, which will force us to provide non-SQL dumps. We're not even sure whether revision texts will be kept in MySQL in the future.
Cheers, Domas
Domas Mituzas wrote:
There's yet another issue: with 1.5 a major database redesign will happen, which will force us to provide non-SQL dumps. We're not even sure whether revision texts will be kept in MySQL in the future.
I've heard this before, but I don't really get it. What is the problem with storing article texts in the DB? It would seem that LiveJournal is storing *way* more text in their DB than Wikipedia, so what is different about Wikipedia that makes it infeasible to follow their example?
Timwi
Timwi wrote in gmane.science.linguistics.wikipedia.technical:
What is the problem with storing article texts in the DB? It would seem that LiveJournal is storing *way* more text in their DB than Wikipedia, so what is different about Wikipedia that makes it infeasible to follow their example?
one thing to consider is that LJ is able to cluster text storage over several databases, separated by journal. we cannot split en: between more than one database, at least until if/when MySQL Cluster becomes useful.
(well, some homegrown solution can be used, but then that's just external text storage that happens to use MySQL).
in any case, just because LJ does it one way doesn't mean that's the best way to do it...
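for what it's worth, such a homegrown external text store needn't be much more than this (python sketch; the table layout, pointer format and cluster names are all made up):

    # sketch of an external text store that "happens to use mysql": blobs live
    # on a separate server, the main database only keeps a pointer string.
    import MySQLdb

    class ExternalTextStore:
        def __init__(self, cluster, conn_args):
            self.cluster = cluster
            self.db = MySQLdb.connect(**conn_args)

        def store(self, text):
            c = self.db.cursor()
            c.execute("INSERT INTO blobs (blob_text) VALUES (%s)", (text,))
            self.db.commit()
            return "%s://%d" % (self.cluster, c.lastrowid)

        def fetch(self, pointer):
            blob_id = int(pointer.split("://", 1)[1])
            c = self.db.cursor()
            c.execute("SELECT blob_text FROM blobs WHERE blob_id = %s", (blob_id,))
            return c.fetchone()[0]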
kate.
Kate Turner wrote:
Timwi wrote in gmane.science.linguistics.wikipedia.technical:
What is the problem with storing article texts in the DB? It would seem that LiveJournal is storing *way* more text in their DB than Wikipedia, so what is different about Wikipedia that makes it infeasible to follow their example?
one thing to consider is that LJ is able to cluster text storage over several databases, separated by journal. we cannot split en: between more than one database, at least until if/when MySQL Cluster becomes useful. (well, some homegrown solution can be used, but then that's just external text storage that happens to use MySQL).
Of course we can split en: (by namespace, by first letter, by whatever you want).
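For instance, a trivial router choosing a server by namespace or first letter could look like this (Python sketch; the server names and the particular split are invented):

    # Sketch: deterministic routing of en: pages to one of several servers,
    # by namespace and then by first letter of the title.
    SERVERS = ["text-db1", "text-db2", "text-db3", "text-db4"]

    def server_for(namespace, title):
        if namespace != 0:                     # keep non-article namespaces together
            return SERVERS[0]
        first = title[:1].upper() or "A"
        if "A" <= first <= "M":
            return SERVERS[1]
        if "N" <= first <= "Z":
            return SERVERS[2]
        return SERVERS[3]                      # digits, punctuation, non-Latin titles

    print(server_for(0, "Mozart"))             # -> text-db2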
in any case, just because LJ does it one way doesn't mean that's the best way to do it...
What was the purpose of this dismissive remark? LiveJournal's way is clearly better than Wikipedia's current state, judging by the server speed and the frequency of database error messages.
Timwi