> Erik’s manually compiled list of bots (data/erikZ.bots)

 

My list is not really manual, except for a few names.

 

Wikistats has several criteria for bot detection:

 

1) Is the bot flag set in the user group table?

 

2) Does it sound like a bot? (adds hundreds of names)

(Nowadays only bots are allowed to have names that sound like a bot.)

To be precise: does ‘bot’ occur at the end of the name or before a non-alphabetic character?

 

3) Is it known to be an unregistered bot? (Wikipedia has a list of false negatives at http://en.wikipedia.org/wiki/Wikipedia:List_of_Wikipedians_by_number_of_edits/Unflagged_bots )

4) Is the name flagged as a bot on at least 10 wikis? If so, treat it as a bot on every wiki within the project.

 

5) Three names that sound like bots are hard-coded exemptions (people who wrote about it).
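
The five criteria above can be sketched roughly as follows. This is only an illustration in Python, not the actual Wikistats code: the function names, the data structures (`flagged_by_wiki`, `unflagged_bots`, `exemptions`), the case-insensitive match and the order in which the rules are applied are all assumptions.

```python
import re

# Rule 2: 'bot' at the end of the name or directly before a non-alphabetic
# character. [^\W\d_] matches any letter; case-insensitivity is an assumption.
BOT_NAME_RE = re.compile(r"bot(?![^\W\d_])", re.IGNORECASE)


def sounds_like_bot(name):
    """Return True if the name matches the 'sounds like a bot' pattern."""
    return BOT_NAME_RE.search(name) is not None


def is_bot(name, wiki, flagged_by_wiki, unflagged_bots, exemptions, min_wikis=10):
    """Hypothetical combination of the five criteria.

    flagged_by_wiki: dict mapping wiki -> set of names with the bot flag
    unflagged_bots:  set of names known to be unregistered bots
    exemptions:      set of hard-coded names that only sound like bots
    """
    # 1) bot flag set on this wiki
    if name in flagged_by_wiki.get(wiki, set()):
        return True
    # 2) name sounds like a bot, unless 5) it is a hard-coded exemption
    if sounds_like_bot(name) and name not in exemptions:
        return True
    # 3) known unregistered (unflagged) bot
    if name in unflagged_bots:
        return True
    # 4) flagged on at least min_wikis wikis within the project
    flagged_wikis = sum(1 for names in flagged_by_wiki.values() if name in names)
    return flagged_wikis >= min_wikis
```

With this pattern, names such as "SineBot" or "Bot-Helper" match, while "Botany" or "Robotics" do not, because the letters after 'bot' prevent the match.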

 

Erik

 

From: analytics-bounces@lists.wikimedia.org [mailto:analytics-bounces@lists.wikimedia.org] On Behalf Of Dario Taraborelli
Sent: Tuesday, November 05, 2013 7:00 AM
To: A mailing list for the Analytics Team at WMF and everybody who has an interest in Wikipedia and analytics.
Subject: Re: [Analytics] Analytics slave for s6 not replicating

 

Hey Christian,

 

I went through the geowiki source and I have a couple of questions on the data. 

 

Active Editor definition

Can you confirm that:

• edits are limited to ns0 (cuc_namespace = 0)

• you’re applying a 30-day look-back window for each day, excluding the first 30 days in cu_changes

• you’re filtering out bots using a union of user_group = ‘bot’ and Erik’s manually compiled list of bots (data/erikZ.bots)

• you are not applying the countable page and content namespace filters

• the caveat about overcounting users editing from multiple countries still applies (it looks like it does given that the data is generated by counts aggregated by country / date / project stored in the staging DB)

• edits to redirect pages are included
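
To make sure I am reading the definition correctly, here is how I would sketch it in Python. This is not the geowiki code: the function name, the `(user, namespace, edit_date)` tuple layout, the exact window boundaries (30 dates ending on and including the given day) and the default threshold of 5 edits are my assumptions.

```python
from collections import Counter
from datetime import date, timedelta


def active_editors(edits, day, bots, window_days=30, min_edits=5):
    """Return the users with at least min_edits ns0 edits in the window.

    edits: iterable of (user, namespace, edit_date) tuples (assumed layout)
    day:   last day of the look-back window (inclusive)
    bots:  set of user names to filter out
    """
    start = day - timedelta(days=window_days)  # exclusive lower bound
    counts = Counter(
        user
        for user, namespace, edit_date in edits
        if namespace == 0 and user not in bots and start < edit_date <= day
    )
    return {user for user, n in counts.items() if n >= min_edits}
```

No countable-page or redirect filter is applied here, in line with the points above.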

 

Geolookup issues

What happens to unresolved IP addresses? I’ve been told by a number of folks that the geoip DB had several issues lately, meaning that the volume of IPs that do not resolve to a specific country may have changed over time. How likely do you think it is that this introduced artifacts in the data, inflating or deflating the 5+ counts?

 

Anomalies in the data

Are the anomalies in series such as enwiki (for example, the one starting on 2013-01-09) caused by geoip issues or by temporary disruptions in the job that runs the geowiki script? 

 

Longer term, if we’re not interested in country-level data, I think we should generate this data directly from the revision tables unless there’s a strong reason to use cu_changes (which I might be missing). This will avoid over-reporting due to multiple-country editor counting, avoid potential issues with changes in the geoip DB (like the unconfirmed ones that I mentioned above) and also make the whole data replicable (right now historical data from geowiki cannot be reproduced from scratch from the DBs, due to the 3-month lifecycle of cu_changes).
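
A toy illustration of the over-reporting I mean, in Python. The rows and names are made up, and I am assuming the per-country aggregates are counts of distinct users per country.

```python
from collections import defaultdict


def per_country_totals(rows):
    """Count distinct users per country from (user, country) rows."""
    editors = defaultdict(set)
    for user, country in rows:
        editors[country].add(user)
    return {country: len(users) for country, users in editors.items()}


# Made-up example: alice edits from two countries, bob from one.
rows = [("alice", "SE"), ("alice", "DE"), ("bob", "SE")]

# Summing the per-country aggregates counts alice twice.
summed_over_countries = sum(per_country_totals(rows).values())

# Counting distinct users directly does not.
distinct_editors = len({user for user, _ in rows})
```

Here the summed per-country figure is 3 while there are only 2 distinct editors, which is the gap that counting directly from the revision tables would avoid.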

 

Dario