I did some data gathering last fall that is more or less the same as what Claudia is asking about. Looking up the bot flag and checking the username are both commonly regarded as reasonable ways of filtering out bots. I chose to apply both: if there's no bot flag, we look for a typical bot signature in the username (regex: "bot$| ", username either ends with bot or a part of it does). The match is case-insensitive, since some users have usernames like "FoObOt".
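A minimal sketch of that two-step filter in Python might look like the following. The second half of the regex did not survive in the text above, so the `\bbot\b` alternative (the word "bot" standing alone inside the name) is only an assumed stand-in, and `is_probable_bot` and `has_bot_flag` are hypothetical names for illustration.

```python
import re

# ASSUMPTION: only "bot$" is taken from the message; the second alternative
# (r"\bbot\b", i.e. "bot" as a standalone word) is a guess at the intent.
BOT_NAME_RE = re.compile(r"bot$|\bbot\b", re.IGNORECASE)


def is_probable_bot(username, has_bot_flag=False):
    """Return True if the account is flagged as a bot or has a bot-like name.

    `has_bot_flag` is whatever the caller already looked up for the account
    (hypothetical parameter; how it is fetched is up to the caller).
    """
    if has_bot_flag:
        return True
    # Fall back to the username heuristic, ignoring case ("FoObOt").
    return BOT_NAME_RE.search(username) is not None
```

A name heuristic like this will produce some false positives (a human user called "Talbot" would match), so it is probably best treated as a fallback to the flag rather than a replacement for it.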
Checking the edit history to find when interwiki links were first added can be time-consuming if the page had lots of activity. I therefore chose to use a binary search, halving the distance between two test points until either the actual edit is found, or we're down to so few edits that all can be efficiently grabbed through the API (e.g. using Pywikibot's PreloadingGenerator). Otherwise you might be examining thousands of edits for no reason.
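A rough sketch of the binary search, in Python. It assumes the revisions are indexed oldest first and that interwiki links stay in the page once they have been added (if they were removed and re-added, a binary search can land on the wrong edit). `fetch_text` is a hypothetical callback that returns the wikitext of the revision at a given index, and `INTERWIKI_RE` is a deliberately simplified pattern, not a complete interwiki matcher.

```python
import re

# ASSUMPTION: simplified pattern for links such as [[de:Beispiel]];
# a real check would use the wiki's actual list of language prefixes.
INTERWIKI_RE = re.compile(r"\[\[[a-z]{2,3}(?:-[a-z]+)?:[^\]|]+\]\]")


def has_interwiki(text):
    """Return True if the wikitext contains something that looks like an interwiki link."""
    return INTERWIKI_RE.search(text) is not None


def find_first_interwiki(n_revisions, fetch_text, batch_size=10):
    """Return the index of the first revision with interwiki links, or None.

    Revisions are indexed 0 (oldest) to n_revisions - 1 (newest).
    """
    if n_revisions == 0:
        return None
    lo, hi = 0, n_revisions - 1
    # If the newest revision has no links, assume they were never added.
    if not has_interwiki(fetch_text(hi)):
        return None
    # Invariant: revision `hi` has links; revisions before `lo` are assumed not to.
    while hi - lo + 1 > batch_size:
        mid = (lo + hi) // 2
        if has_interwiki(fetch_text(mid)):
            hi = mid          # first occurrence is at mid or earlier
        else:
            lo = mid + 1      # first occurrence is after mid
    # Few enough revisions left: check them all, oldest first
    # (this is where a single bulk API request would go).
    for i in range(lo, hi + 1):
        if has_interwiki(fetch_text(i)):
            return i
    return None
```

For a page with 200 revisions where the links first appear at index 137, this sketch fetches about a dozen revision texts instead of walking through all of them.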
Having Toolserver access simplifies the process a lot since all the metadata is more easily accessible, but the revision text will still have to be grabbed from the API.
Hope some of this helps; let me know if there are any questions.
Cheers, Morten
On 8 May 2012 08:39, Bináris wikiposta@gmail.com wrote:
2012/5/8 Merlijn van Deen valhallasw@arctus.nl
This is not completely true - the bot flag is also a property of the user account. You can query e.g.
http://nl.wikipedia.org/w/index.php?title=Speciaal:Gebruikerslijst&offse...
Yes, that's true. And if you want to be quite accurate, you must also determine the date of acquiring the bot flag from bureau logs and compare it to the page history. :-)
-- Bináris
Pywikipedia-l mailing list Pywikipedia-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/pywikipedia-l