Sj:
What is it that will take a lot of time to update, rewriting the queries? There are some SQL-friendly people around here who would like to help with
that.
I know SQL. Right now the scripts parse the dumps directly, doing a lot of encoding, splitting, sorting and merging to process the files within limited memory.
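To give an idea of what that looks like, here is a rough sketch of the split/sort/merge pattern (not the actual wikistats code; file names, record format and bucket count are made up):

  #!/usr/bin/perl
  # Sketch of the split / sort / merge approach: read the dump line by
  # line, split records into bucket files, sort each bucket on disk with
  # the external 'sort' tool, then merge the sorted buckets.
  use strict;
  use warnings;

  my $dump    = 'pages_full.xml';   # placeholder input dump
  my $buckets = 16;                 # number of temporary bucket files

  # Phase 1: split the dump into buckets, so no single file has to fit
  # in memory at once.
  my @fh;
  for my $b (0 .. $buckets - 1) {
      open $fh[$b], '>', "bucket_$b.tmp" or die "open bucket $b: $!";
  }
  open my $in, '<', $dump or die "open $dump: $!";
  while (my $line = <$in>) {
      next unless $line =~ /<title>(.*?)<\/title>/;  # crude record detection
      my $title  = $1;
      my $bucket = ord($title) % $buckets;           # crude bucketing by first byte
      print { $fh[$bucket] } "$title\n";
  }
  close $in;
  close $_ for @fh;

  # Phase 2: sort each bucket on disk, then merge the sorted buckets.
  for my $b (0 .. $buckets - 1) {
      system('sort', '-o', "bucket_$b.sorted", "bucket_$b.tmp") == 0
          or die "sort failed on bucket $b";
  }
  system('sort', '-m', '-o', 'titles.sorted',
         map({ "bucket_$_.sorted" } 0 .. $buckets - 1)) == 0
      or die "merge failed";

Only one bucket is sorted at a time, so memory stays bounded however large the dumps grow; the price is a lot of extra passes over the data on disk.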
SQL would be much easier. I do, however, worry about run time. I asked Brion in Berlin, and he had no idea either. Heavy ad hoc SQL queries have already often been forcefully aborted or forbidden altogether, no doubt for good reasons, so what should I expect for this monster job? It already runs 10 hours on the English Wikipedia alone, and over 24 hours for all Wikipedias. Also, parsing the dumps should still be much faster than SQL, with all its extra I/O. Of course some tests could shed light on this.
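For comparison, the kind of per-editor, per-month tally involved here would collapse into a single GROUP BY against the database. A minimal sketch with DBI, where the connection details are placeholders and the table and column names are my guess at the schema rather than something I have verified:

  #!/usr/bin/perl
  # Sketch: one aggregate query instead of a split/sort/merge pipeline.
  # Connection details are placeholders; table and column names are
  # illustrative and may not match the live MediaWiki schema.
  use strict;
  use warnings;
  use DBI;

  my $dbh = DBI->connect('DBI:mysql:database=wikidb;host=localhost',
                         'wikiuser', 'secret', { RaiseError => 1 });

  # Edits per editor per month -- a typical wikistats metric.
  my $sth = $dbh->prepare(q{
      SELECT rev_user_text          AS editor,
             LEFT(rev_timestamp, 6) AS yyyymm,
             COUNT(*)               AS edits
      FROM   revision
      GROUP  BY editor, yyyymm
  });
  $sth->execute;
  while (my ($editor, $month, $edits) = $sth->fetchrow_array) {
      print "$month\t$editor\t$edits\n";
  }
  $dbh->disconnect;

Whether the servers would tolerate such a query running for hours is exactly the open question: the code gets simpler, the load question does not.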
Apart from that, I lack the energy for any serious programming; all that is left goes to my day job. So I won't work on migrating to SQL or writing new encoding, splitting, sorting and merging code anytime soon. If I do some perl hacking during my holiday, it should be targeted towards EasyTimeline, where I have made promises that are half a year overdue (mainly Unicode support). If someone else wants to step in for the wikistats code, that would of course be great.
Erik Zachte
wikitech-l@lists.wikimedia.org