Currently, the best way to bulk-process article text is to read from an
XML dump. You can adapt the existing importers to fit your purpose; code
is available in PHP, Java and C#, I believe.
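
For illustration, here is a minimal sketch of pulling titles and texts
out of a dump with Java's standard StAX parser; the dump file name is
just a placeholder.

import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;
import java.io.FileInputStream;

public class DumpReader {
    public static void main(String[] args) throws Exception {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        XMLStreamReader reader =
            factory.createXMLStreamReader(new FileInputStream("pages-articles.xml"));

        String title = null;
        StringBuilder buf = new StringBuilder();

        while (reader.hasNext()) {
            int event = reader.next();
            if (event == XMLStreamConstants.START_ELEMENT) {
                buf.setLength(0);                  // start collecting this element's text
            } else if (event == XMLStreamConstants.CHARACTERS) {
                buf.append(reader.getText());
            } else if (event == XMLStreamConstants.END_ELEMENT) {
                String name = reader.getLocalName();
                if (name.equals("title")) {
                    title = buf.toString();        // remember the page title
                } else if (name.equals("text")) {
                    // one article's raw wikitext is now in buf
                    System.out.println(title + ": " + buf.length() + " bytes");
                }
            }
        }
        reader.close();
    }
}

Streaming like this matters because the full dumps are far too large to
load into memory as one DOM tree.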
Well, I think this means that Stefan's team has to recode a lot. Pulling
the titles and texts out of the XML dump is easy, but you only get a new
dump every one or two months. On the other hand, XML is more robust,
while the database structure changes with every MediaWiki version - for
instance, I was not aware of the external text storage before.
XML dumps should be handled by the wiki tools anyway - not only the
monthly dumps, but also Special:Export, which uses the same format.
Queries done through it are supposed to be better for the server load,
as it needs only one request to get many articles.
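
For example, something like this fetches several articles in a single
Special:Export request (the wiki URL and the titles are placeholders;
the titles go newline-separated into the 'pages' parameter, and
curonly=1 limits the result to the current revisions):

import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.util.List;

public class ExportFetch {
    public static void main(String[] args) throws Exception {
        List<String> titles = List.of("Foo", "Bar", "Baz");
        // Special:Export takes many titles in one 'pages' parameter,
        // separated by newlines
        String pages = URLEncoder.encode(String.join("\n", titles),
                StandardCharsets.UTF_8);
        URI uri = URI.create("https://en.wikipedia.org/w/index.php"
                + "?title=Special:Export&pages=" + pages + "&curonly=1");

        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(uri).GET().build();
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body());
    }
}

The response body uses the same XML schema as the dumps, so the same
parser can be reused for both.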
Well, you'd also need some kind of guessing about which articles will be
queried next in order to optimize it. Or you could fetch the requested
article plus the next X pages in the DB that would need an HTTP query
anyway.
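
As a toy sketch of that read-ahead idea - assuming the layer keeps a
list of titles in page order, and X is configurable; both are stand-ins
for whatever the real layer knows about the database:

import java.util.ArrayList;
import java.util.List;

public class ReadAhead {
    private final List<String> pageOrder;   // titles in DB page order
    private final int readAhead;            // X pages to prefetch

    public ReadAhead(List<String> pageOrder, int readAhead) {
        this.pageOrder = pageOrder;
        this.readAhead = readAhead;
    }

    // Returns the requested title plus up to X following titles,
    // to be resolved in one multi-title Special:Export request.
    public List<String> batchFor(String title) {
        int i = pageOrder.indexOf(title);
        if (i < 0) return List.of(title);
        int end = Math.min(pageOrder.size(), i + 1 + readAhead);
        return new ArrayList<>(pageOrder.subList(i, end));
    }
}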
Leo, you should also look in that direction: it is easier for the
programmer to state the total set of articles to be queried up front than
to rely on the fetching layer to guess at optimizations.
Maybe you could add another parameter to the wikiproxy for the articles I
want too, to make the wikiproxy aware of them?
The most accurate way would be to have the layer act asynchronously, so
it would accept a query but not actually perform it over HTTP unless a) a
'notwait' parameter is set; b) the query queue is X entries long; or c)
the oldest entry is Y seconds old (a wait timeout). Then it resolves all
the queued queries at once. However, this makes the client side more
difficult, as client programs tend to use an ask-process, ask-process
loop.
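
Here is a rough sketch of that queueing behaviour; the names and
thresholds are made up, and the actual flush would be one multi-title
Special:Export call:

import java.util.ArrayList;
import java.util.List;

public class BatchingLayer {
    private final int maxQueue;          // condition b): queue length X
    private final long maxWaitMillis;    // condition c): timeout Y
    private final List<String> queue = new ArrayList<>();
    private long oldestEnqueuedAt = 0;

    public BatchingLayer(int maxQueue, long maxWaitMillis) {
        this.maxQueue = maxQueue;
        this.maxWaitMillis = maxWaitMillis;
    }

    public synchronized void request(String title, boolean noWait) {
        if (queue.isEmpty()) oldestEnqueuedAt = System.currentTimeMillis();
        queue.add(title);
        if (noWait                                       // a) caller can't wait
                || queue.size() >= maxQueue              // b) queue is X long
                || System.currentTimeMillis() - oldestEnqueuedAt
                        >= maxWaitMillis) {              // c) oldest entry too old
            flush();
        }
    }

    private void flush() {
        List<String> batch = new ArrayList<>(queue);
        queue.clear();
        // Here the real layer would issue one Special:Export request for
        // all queued titles and hand the results back to the callers.
        System.out.println("Flushing batch of " + batch.size() + ": " + batch);
    }
}

Note that condition c) is only checked here when another request comes
in; a real implementation would also need a background timer so a stale
queue gets flushed on its own.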