Hi all,
a) Are we now allowed - from a toolserver account - to
iterate over german
and english articles (first all, then only the new ones) - or not?
In theory yes, in practice no. Doing this now would mean pulling every
single article via HTTP, which is discouraged because it creates a lot
of load. Doing it through WikiProxy would mean that each revision is
only loaded once, which makes this a bit better. But it's still slow
(more than 1 sec per article).
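To put that per-article latency in perspective, here is a rough back-of-envelope calculation; the article counts are illustrative assumptions, not exact figures for de/en Wikipedia:

```python
# Back-of-envelope: how long a full per-article HTTP pass would take.
# The article counts are rough illustrative assumptions, not exact figures.
SECONDS_PER_ARTICLE = 1.0   # "more than 1 sec per article"
DE_ARTICLES = 350_000       # assumed order of magnitude for de
EN_ARTICLES = 900_000       # assumed order of magnitude for en

total_seconds = (DE_ARTICLES + EN_ARTICLES) * SECONDS_PER_ARTICLE
total_days = total_seconds / (60 * 60 * 24)
print(f"{total_days:.0f} days")  # prints: 14 days
```

Even under these optimistic assumptions, a single full pass takes about two weeks, which is why bulk HTTP crawling is discouraged.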
Currently, the best way to bulk-process article text is to read from an
XML dump. You can adapt the existing importers to fit your purpose; code
is available in PHP, Java and C#, I believe.
b) Is there a technical solution (in PHP, WikiProxy?)
to solve our problem
trying to access all pages - even those residing in external storage?
WikiProxy solves the problem of accessing external storage, for any page
you want. It does not solve it very efficiently, so it should not be
used to access *all* pages in one run.
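WikiProxy's actual interface is not shown here; purely as an illustration of what single-page fetching over HTTP looks like, MediaWiki's Special:Export endpoint returns one page's current revision as XML:

```python
# Illustration of addressing a single page over HTTP.
# This sketches the general idea, not WikiProxy's actual interface:
# MediaWiki's Special:Export endpoint serves one page's current
# revision as export XML.
import urllib.parse

def export_url(wiki_host, title):
    """Build a Special:Export URL for one page (current revision only)."""
    return "http://%s/wiki/Special:Export/%s" % (
        wiki_host, urllib.parse.quote(title.replace(" ", "_")))

print(export_url("de.wikipedia.org", "Heinrich Heine"))
# http://de.wikipedia.org/wiki/Special:Export/Heinrich_Heine
```

One such request per page is exactly the access pattern that is fine for a handful of pages and far too slow for all of them.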
c) In order to mirror those pages on toolserver can
perhaps Kate or Brion
come to the rescue?
Again, in theory, yes. In practice, both are quite busy, maybe we should
try asking someone else (like, I don't know... JeLuF, perhaps?). I
imagine this would involve setting up a second mysql server instance,
and replication for that. There are probably some other tricky things to
take care of. Perhaps we should officially request technical help with
this from the e.V. I have already talked to elian about it.
On a slightly related note: we still do not get updates for any data on
the Asian cluster (the databases we have are stuck in October).
Apparently, it would be possible to resolve this, but it's tricky. The
*real* solution would be to have multi-master replication, which (I am
told) is expected to be supported by MySQL 5.2.
Regards,
-- Daniel
--
Homepage:
http://brightbyte.de