Hi all
a) Are we now allowed - from a toolserver account - to iterate over German and English articles (first all, then only the new ones) - or not?
In theory yes, in practice no. Since this would currently mean pulling every single article via HTTP, it is discouraged, because it creates a lot of load. Doing it through WikiProxy would mean that each revision is only loaded once, which makes this a bit better. But it's still slow (more than 1 second per article).
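Just to illustrate the overhead: here is a minimal sketch in plain PHP (not WikiProxy itself; the title is only an example) of fetching a single article's current wikitext over HTTP via MediaWiki's raw output. Iterating over all articles this way means one full HTTP round trip per page, which is where the load comes from.

  <?php
  // Fetch the current wikitext of one article via action=raw.
  // Doing this for every article means one HTTP request per page.
  $title = 'Berlin';   // example title
  $url   = 'http://de.wikipedia.org/w/index.php?action=raw&title=' . urlencode($title);
  $text  = file_get_contents($url);   // requires allow_url_fopen
  echo strlen($text), " bytes\n";
  ?>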
Currently, the best way to bulk-process article text is to read from an XML dump. You can adapt the existing importers to fit your purpose; code is available in PHP, Java and C#, I believe.
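For reference, a rough sketch (not the importer code itself; the file name is just an example) of streaming through a dump with PHP's XMLReader and picking out each page's title and text:

  <?php
  // Stream through an XML dump without loading it all into memory,
  // and hand each page's title and wikitext to your own processing code.
  $reader = new XMLReader();
  $reader->open('dewiki-pages-articles.xml');   // example file name

  $title = null;
  while ($reader->read()) {
      if ($reader->nodeType != XMLReader::ELEMENT) continue;
      if ($reader->name == 'title') {
          $title = $reader->expand()->textContent;   // page title
      } elseif ($reader->name == 'text') {
          $text = $reader->expand()->textContent;    // wikitext of the revision
          // ... process $title and $text here ...
      }
  }
  $reader->close();
  ?>

The real importers also deal with namespaces, multiple revisions and so on, so treat this only as a starting point.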
b) Is there a technical solution (in PHP, WikiProxy?) to our problem of accessing all pages - even those residing in external storage?
WikiProxy solves the problem of accessing external storage, for any page you want. It does not solve it very efficiently, so it should not be used to access *all* pages in a run.
c) In order to mirror those pages on the toolserver, can perhaps Kate or Brion come to the rescue?
Again, in theory, yes. In practice, both are quite busy; maybe we should try asking someone else (like, I don't know... JeLuF, perhaps?). I imagine this would involve setting up a second MySQL server instance, and replication for that. There are probably some other tricky things to take care of. Perhaps we should officially request technical help with this from the e.V. I have already talked to elian about it.
On a slightly related note: we still do not get updates for any data on the Asian cluster (the databases we have are stuck in October). Apparently, it would be possible to resolve this, but it's tricky. The *real* solution would be to have multi-master replication, which (I am told) is expected to be supported by MySQL 5.2.
Regards, -- Daniel