[Toolserver-l] Troubles with reading Articles

Stefan F. Keller sfkeller at hsr.ch
Sat Mar 25 13:38:24 UTC 2006


On Saturday, March 25, 2006 1:48 PM Daniel wrote:
> Right now, no token is necessary to use WikiProxy - the "lock" will
> become active when I update my tools next time. Then, you can get an
> access token by asking me :)
> 
> Note that you will *never* need a token to access WikiProxy locally from
> the toolserver. The IPs are whitelisted.

Thank you and Jakob!

> > Well, I think this means that Stefan's team has to recode a lot. Pulling
> > the titles and texts out of the XML dump is easy but you only get a new
> > dump every 1 or 2 months. On the other hand XML is more robust while the
[...]
> For the analysis of large volumes of texts doing it "live" isn't really
> an option anyway, I think. And being able to handle XML dumps is a good
> idea anyway :)

If you assume that the full run has to be repeated regularly, you are right.
In our case we only need to run it _once_ (OK, I admit: twice, because of
some testing). After that we would visit only those articles which have
changed within, say, the last one or several days. We would even accept the
lowest priority while our process runs.

On the other hand, a dump could serve for this first 'full access', but only
if it's a recent one (because we would then iterate only over the delta
since the timestamp of the dump).
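
The delta idea could be sketched roughly as follows. This is only an
illustration under assumptions: the names parse_ts and changed_since are
hypothetical, timestamps are assumed to be UTC strings of the form
"2006-03-25T13:38:24Z", and the mapping of titles to last-edit times is
assumed to come from whatever source is available.

```python
from datetime import datetime, timezone

def parse_ts(ts):
    # Assumed timestamp format: "YYYY-MM-DDTHH:MM:SSZ", always UTC.
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)

def changed_since(last_edits, dump_ts):
    # last_edits: hypothetical dict mapping article title -> last-edit timestamp.
    # Returns the titles edited strictly after the dump was taken, sorted.
    cutoff = parse_ts(dump_ts)
    return sorted(t for t, ts in last_edits.items() if parse_ts(ts) > cutoff)
```

Only the titles returned here would need to be fetched again; everything
else is already covered by the dump.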

Also, pulling text out of the XML dump is not really that easy; it needs lots
of additional code (dumps from several tables, re-indexing, etc.) compared
to online access. And as far as I'm aware it is not repeatable so far: either
the path/filename of the most recent dewiki dump needs to be constant, or
the online request for the XML dump should tell us its timestamp.
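
For the simplest case, reading titles, timestamps and texts from a single
pages dump might look like the sketch below. It assumes the dump uses
page/title/revision/timestamp/text elements; iter_pages and local are
hypothetical names, and the sample data with its namespace URI is made up
for illustration. It does not cover the other tables or the re-indexing
mentioned above.

```python
import io
import xml.etree.ElementTree as ET

def local(tag):
    # Drop a "{namespace}" prefix, if any, from an element tag.
    return tag.rsplit("}", 1)[-1]

def iter_pages(source):
    # Stream (title, timestamp, text) per page without loading the whole file.
    # If a page holds several revisions, the last one seen wins.
    title = timestamp = text = None
    for _event, elem in ET.iterparse(source, events=("end",)):
        name = local(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "timestamp":
            timestamp = elem.text
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            yield title, timestamp, text
            title = timestamp = text = None
            elem.clear()  # release the page's children once it is handled

# Made-up sample; the namespace URI is a placeholder, not the real one.
SAMPLE = (
    b'<mediawiki xmlns="http://example.invalid/export">'
    b"<page><title>Foo</title><revision>"
    b"<timestamp>2006-03-20T10:00:00Z</timestamp><text>Hello</text>"
    b"</revision></page>"
    b"<page><title>Bar</title><revision>"
    b"<timestamp>2006-03-24T08:30:00Z</timestamp><text/>"
    b"</revision></page>"
    b"</mediawiki>"
)

pages = list(iter_pages(io.BytesIO(SAMPLE)))
```

The same generator accepts a file name or an open file instead of the
in-memory sample.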

-- Stefan



