Sj:
What is it that will take a lot of time to update, rewriting the queries? There are some SQL-friendly people around here who would like to help with
that.
I know SQL. Right now the scripts parse the dumps directly, doing a lot of encoding, splitting, sorting and merging to process the files within limited memory.
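To give an idea of what that looks like, here is a rough sketch of the split/sort/merge pattern (not the actual wikistats code; file names, record format and bucket count are made up):

  #!/usr/bin/perl
  # Sketch of the split / sort / merge approach: read the dump line by
  # line, split records into bucket files, sort each bucket on disk with
  # the external 'sort' tool, then merge the sorted buckets.
  use strict;
  use warnings;

  my $dump    = 'pages_full.xml';   # placeholder input dump
  my $buckets = 16;                 # number of temporary bucket files

  # Phase 1: split the dump into buckets, so no single file has to fit
  # in memory at once.
  my @fh;
  for my $b (0 .. $buckets - 1) {
      open $fh[$b], '>', "bucket_$b.tmp" or die "open bucket $b: $!";
  }
  open my $in, '<', $dump or die "open $dump: $!";
  while (my $line = <$in>) {
      next unless $line =~ /<title>(.*?)<\/title>/;  # crude record detection
      my $title  = $1;
      my $bucket = ord($title) % $buckets;           # crude bucketing by first byte
      print { $fh[$bucket] } "$title\n";
  }
  close $in;
  close $_ for @fh;

  # Phase 2: sort each bucket on disk, then merge the sorted buckets.
  for my $b (0 .. $buckets - 1) {
      system('sort', '-o', "bucket_$b.sorted", "bucket_$b.tmp") == 0
          or die "sort failed on bucket $b";
  }
  system('sort', '-m', '-o', 'titles.sorted',
         map({ "bucket_$_.sorted" } 0 .. $buckets - 1)) == 0
      or die "merge failed";

Only one bucket is sorted at a time, so memory stays bounded however large the dumps grow; the price is a lot of extra passes over the data on disk.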
SQL would be much easier. I do, however, worry about run time. I asked Brion in Berlin, and he had no idea either. Heavy ad hoc SQL queries have already often been forcefully aborted or forbidden altogether, no doubt for good reasons, so what should I expect for this monster job? It already runs 10 hours on the English Wikipedia alone, and over 24 hours for all Wikipedias. Also, parsing the dumps should still be much faster than SQL, with all its extra I/O. Of course some tests could shed light on this.
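For comparison, the kind of per-editor, per-month tally involved here would collapse into a single GROUP BY against the database. A minimal sketch with DBI, where the connection details are placeholders and the table and column names are my guess at the schema rather than something I have verified:

  #!/usr/bin/perl
  # Sketch: one aggregate query instead of a split/sort/merge pipeline.
  # Connection details are placeholders; table and column names are
  # illustrative and may not match the live MediaWiki schema.
  use strict;
  use warnings;
  use DBI;

  my $dbh = DBI->connect('DBI:mysql:database=wikidb;host=localhost',
                         'wikiuser', 'secret', { RaiseError => 1 });

  # Edits per editor per month -- a typical wikistats metric.
  my $sth = $dbh->prepare(q{
      SELECT rev_user_text          AS editor,
             LEFT(rev_timestamp, 6) AS yyyymm,
             COUNT(*)               AS edits
      FROM   revision
      GROUP  BY editor, yyyymm
  });
  $sth->execute;
  while (my ($editor, $month, $edits) = $sth->fetchrow_array) {
      print "$month\t$editor\t$edits\n";
  }
  $dbh->disconnect;

Whether the servers would tolerate such a query running for hours is exactly the open question: the code gets simpler, the load question does not.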
Apart from that, I lack the energy for any serious programming; all that is left goes to my day job. So I won't work on migrating to SQL or writing new encoding, splitting, sorting and merging code anytime soon. If I do some perl hacking during my holiday, it should be targeted towards EasyTimeline, where I have made promises that are half a year overdue (mainly Unicode support). If someone else wants to step in for the wikistats code, that would of course be great.
Erik Zachte
wikitech-l@lists.wikimedia.org