> Erik's manually compiled list of bots (data/erikZ.bots)
My list is not really manual, except for a few names.
Wikistats has several criteria for bot detection:
1) Is the bot flag set in the user group table?
2) Does it sound like a bot? (adds hundreds of names)
(nowadays only bot names are allowed to sound like a bot)
To be precise: does "bot" occur at the end of the name or before a
non-alphabetic character?
3) Is it known to be an unregistered bot? (Wikipedia has a list of false
negatives at
http://en.wikipedia.org/wiki/Wikipedia:List_of_Wikipedians_by_number_of_edits/Unflagged_bots )
4) Is the name flagged as a bot on at least 10 wikis? Then treat it as a bot
on every wiki within the project.
5) Three names that sound like a bot are hard-coded exemptions (people who
wrote about it)
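
A minimal Python sketch of how these five criteria could be combined. This
is not the Wikistats code: the function names, the parameters, the
case-insensitive match and the order in which the checks run are all
assumptions made for illustration.

```python
import re

# Criterion 2: "bot" at the end of the name or directly before a
# non-alphabetic character. Case-insensitive matching is an assumption.
BOT_NAME = re.compile(r"bot(?![^\W\d_])", re.IGNORECASE)


def sounds_like_bot(name):
    """Return True if the name matches the 'sounds like a bot' heuristic."""
    return BOT_NAME.search(name) is not None


def is_bot(name, has_bot_flag=False, unflagged_bots=frozenset(),
           flagged_wiki_count=0, exemptions=frozenset(), min_wikis=10):
    """Hypothetical combination of criteria 1 to 5 for one user name."""
    if has_bot_flag:                      # 1) bot flag set in user groups
        return True
    if name in unflagged_bots:            # 3) known bot without a flag
        return True
    if flagged_wiki_count >= min_wikis:   # 4) flagged on at least 10 wikis
        return True
    if name in exemptions:                # 5) hard-coded exemptions
        return False
    return sounds_like_bot(name)          # 2) name heuristic
```

For example, sounds_like_bot("SineBot II") is True because "Bot" is followed
by a space, while sounds_like_bot("Botanist") is False because a letter
follows "bot".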
Erik
From: analytics-bounces(a)lists.wikimedia.org
[mailto:analytics-bounces@lists.wikimedia.org] On Behalf Of Dario
Taraborelli
Sent: Tuesday, November 05, 2013 7:00 AM
To: A mailing list for the Analytics Team at WMF and everybody who has an
interest in Wikipedia and analytics.
Subject: Re: [Analytics] Analytics slave for s6 not replicating
Hey Christian,
I went through the geowiki source and I have a couple of questions on the
data.
Active Editor definition
Can you confirm that:
* edits are limited to ns0 (cuc_namespace = 0)
* you're applying a 30-day look-back window for each day, excluding the
  first 30 days in cu_changes
* you're filtering out bots using a union of user_group = bot and Erik's
  manually compiled list of bots (data/erikZ.bots)
* you are not applying the countable page and content namespace filters
* the caveat about overcounting users editing from multiple countries still
  applies (it looks like it does, given that the data is generated by counts
  aggregated by country / date / project stored in the staging DB)
* edits to redirect pages are included
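
To make the definition I am asking about concrete, here is a rough Python
sketch of how I read it. It is not taken from the geowiki source: the
function name, the tuple layout of the input, and the assumption that the
5-edit threshold is applied per user and country are mine.

```python
from collections import Counter
from datetime import timedelta


def active_editors_by_country(edits, day, bots=frozenset(),
                              window_days=30, min_edits=5):
    """Count 5+ editors per country for the window ending on `day`.

    `edits` is an iterable of (user, country, edit_day, namespace) tuples,
    a hypothetical stand-in for rows derived from cu_changes.
    """
    start = day - timedelta(days=window_days)
    per_user_country = Counter()
    for user, country, edit_day, namespace in edits:
        if namespace != 0 or user in bots:
            continue  # ns0 only, bots excluded
        if start < edit_day <= day:
            per_user_country[(user, country)] += 1

    counts = Counter()
    for (user, country), n in per_user_country.items():
        if n >= min_edits:
            counts[country] += 1
    return counts
```

Under this reading, a user with 5 or more ns0 edits from each of two
countries inside the window is counted once per country, so summing the
per-country counts overstates the number of distinct active editors.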
Geolookup issues
What happens to unresolved IP addresses? I've been told by a number of folks
that the geoip DB had several issues lately, meaning that the volume of IPs
that do not resolve to a specific country may have changed over time. How
likely do you think it is that this introduced artifacts in the data,
inflating or deflating 5+ counts?
Anomalies in the data
Are the anomalies in series such as enwiki (for example, the one starting on
2013-01-09) caused by geoip issues or by temporary disruptions in the job
that runs the geowiki script?
Longer term, if we're not interested in country-level data, I think we
should generate this data directly from the revision tables unless there's a
strong reason to use cu_changes (which I might be missing). This will avoid
over-reporting due to multiple-country editor counting, avoid potential
issues with changes in the geoip DB (like the unconfirmed ones that I
mentioned above) and also make the whole data replicable (right now
historical data from geowiki cannot be reproduced from scratch from the DBs,
due to the 3-month lifecycle of cu_changes).
Dario