Currently, the best way to bulk-process article text is to read from an XML dump. You can adapt the existing importers to fit your purpose; code is available in PHP, Java and C#, I believe.
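For what it's worth, here is a minimal Python sketch of the "read from the dump" approach - not one of the existing PHP/Java/C# importers - that streams (title, text) pairs out of a pages-articles XML file. The file name and the way the export namespace is stripped are assumptions on my part; adjust them for the dump you actually have.

    # Minimal sketch: stream titles and article text from a MediaWiki XML dump.
    import xml.etree.ElementTree as ET

    def iter_articles(dump_path):
        """Yield (title, wikitext) pairs from a pages-articles XML dump."""
        title, text = None, None
        for event, elem in ET.iterparse(dump_path, events=("end",)):
            tag = elem.tag.rsplit("}", 1)[-1]   # drop the XML namespace prefix
            if tag == "title":
                title = elem.text
            elif tag == "text":
                text = elem.text or ""
            elif tag == "page":
                yield title, text
                elem.clear()                     # keep memory usage flat on big dumps

    if __name__ == "__main__":
        for title, text in iter_articles("pages-articles.xml"):  # assumed file name
            print(title, len(text))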
Well, I think this means that Stefan's team has to recode a lot. Pulling the titles and texts out of the XML dump is easy, but you only get a new dump every one or two months. On the other hand, XML is more robust, while the database structure will change with every MediaWiki version - for instance, I was not aware of the external text storage before.
XML dumps should be handled via the wiki itself - not only the monthly dumps, but also Special:Export, which uses the same format. Queries done through it are supposed to be better for the server load, as a single request is enough to get many articles.
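Something like the following is what I have in mind for batching through Special:Export - a hedged Python sketch only; the 'pages', 'curonly' and 'action' field names are my recollection of the export form, so check them against your wiki's Special:Export page before relying on them. The returned XML can then go through the same dump parser as above.

    # Sketch: fetch several articles in one Special:Export request.
    import urllib.parse
    import urllib.request

    def export_pages(wiki_base, titles, current_only=True):
        """POST a batch of titles to Special:Export and return the XML text."""
        data = {
            "pages": "\n".join(titles),   # one title per line, as in the export form
            "action": "submit",
        }
        if current_only:
            data["curonly"] = "1"         # only the latest revision of each page
        req = urllib.request.Request(
            wiki_base + "/index.php?title=Special:Export",
            data=urllib.parse.urlencode(data).encode("utf-8"),
        )
        with urllib.request.urlopen(req) as resp:
            return resp.read().decode("utf-8")

    # Example: one HTTP request, many articles, same XML format as the dumps.
    # xml_text = export_pages("https://en.wikipedia.org/w", ["Foo", "Bar", "Baz"])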
Well, you'd also need some kind of guessing about which articles will be queried next in order to optimize it. Or you could fetch the requested article plus the next X pages in the database that still need an HTTP query.
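As a tiny sketch of that prefetch idea - export_pages stands for any function that takes a batch of titles (e.g. the Special:Export sketch above), and remaining_titles is a hypothetical list of pages still awaiting an HTTP query, not existing wikiproxy code:

    def fetch_with_prefetch(title, remaining_titles, export_pages, prefetch=10):
        """Fetch the requested title plus up to `prefetch` more pending titles."""
        batch = [title] + [t for t in remaining_titles if t != title][:prefetch]
        return export_pages(batch)   # one request covers the whole batch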
Leo, you should also look in that direction, as it is easier for the programmer to know the total set of articles to be queried than to rely on the fetching layer to guess the optimizations.
Maybe you could add another parameter to the wikiproxy for the articles I will want too, to make the wikiproxy aware of them? The most accurate way would be to have the layer act asynchronously: it would accept a query but not actually perform it over HTTP until a) a 'notwait' parameter is set; b) the query queue is X entries long; or c) the oldest entry is Y seconds old (a wait timeout). Then it resolves all the queued queries at the same time. However, this makes the client side more difficult, as client programs tend to use an ask-process-ask-process loop.
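Roughly what I mean, as a Python sketch - QueryQueue and fetch_batch are made-up names for illustration, not part of the current wikiproxy:

    # Sketch of the batching idea: queued titles are only sent over HTTP when
    # the caller passes notwait, the queue reaches X entries, or the oldest
    # entry has waited Y seconds. fetch_batch stands in for whatever performs
    # the actual batch request (e.g. a Special:Export call).
    import time

    class QueryQueue:
        def __init__(self, fetch_batch, max_size=50, max_wait=5.0):
            self.fetch_batch = fetch_batch    # callable: list of titles -> {title: text}
            self.max_size = max_size          # condition b) queue is X entries long
            self.max_wait = max_wait          # condition c) oldest entry is Y seconds old
            self.pending = []                 # list of (title, enqueue_time)

        def ask(self, title, notwait=False):
            """Queue a title; results come back only when a flush is triggered."""
            self.pending.append((title, time.monotonic()))
            oldest = self.pending[0][1]
            if (notwait                                    # condition a)
                    or len(self.pending) >= self.max_size  # condition b)
                    or time.monotonic() - oldest >= self.max_wait):  # condition c)
                return self.flush()
            return None                                    # nothing fetched yet

        def flush(self):
            titles = [t for t, _ in self.pending]
            self.pending.clear()
            return self.fetch_batch(titles)   # one round trip for the whole batch

And that last point is exactly the difficulty: a client stuck in the ask-process loop would end up passing notwait=True on every call, which defeats the batching.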