Hey Christian,
I went through the geowiki source and I have a couple of questions on the data.
Active Editor definition
Can you confirm that:
• edits are limited to ns0 (cuc_namespace = 0)
• you’re applying a 30-day look-back window for each day, excluding the first 30 days in
cu_changes
• you’re filtering out bots using a union of user_group = ‘bot’ and Erik’s manually
compiled list of bots (data/erikZ.bots)
• you are not applying the countable page and content namespace filters
• the caveat about overcounting users editing from multiple countries still applies (it
looks like it does, given that the data is generated from counts aggregated by country /
date / project and stored in the staging DB)
• edits to redirect pages are included
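To make sure we mean the same thing, here is a minimal sketch of the definition as I read it from the points above. It is not taken from the geowiki code: the function name, the tuple layout and the 5-edit threshold are my own assumptions for illustration, and the toy data is made up.

```python
from collections import Counter
from datetime import date, timedelta


def active_editors(edits, day, bots, window_days=30, min_edits=5):
    """Return users with at least `min_edits` ns0 edits in the window ending on `day`.

    `edits` is an iterable of (user, namespace, edit_day) tuples (hypothetical layout).
    `bots` is the union of flagged bots and the manually compiled bot list.
    """
    start = day - timedelta(days=window_days)
    counts = Counter(
        user
        for user, namespace, edit_day in edits
        # ns0 only, bots excluded, edit falls inside the look-back window
        if namespace == 0 and user not in bots and start < edit_day <= day
    )
    return {user for user, n in counts.items() if n >= min_edits}


# Toy data: alice has 5 ns0 edits, bob has 3 ns0 edits plus 4 in ns1,
# somebot has 10 ns0 edits but is on the bot list.
edits = [("alice", 0, date(2013, 1, d)) for d in range(1, 6)]
edits += [("bob", 0, date(2013, 1, 2))] * 3 + [("bob", 1, date(2013, 1, 3))] * 4
edits += [("somebot", 0, date(2013, 1, 4))] * 10

result = active_editors(edits, date(2013, 1, 9), bots={"somebot"})
print(result)
```

Note that redirects and non-countable pages are not filtered here, which matches my reading of the last two points.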
Geolookup issues
What happens to unresolved IP addresses? I’ve been told by a number of folks that the
geoip DB has had several issues lately, meaning that the volume of IPs that do not resolve
to a specific country may have changed over time. How likely do you think it is that this
introduces artifacts in the data, inflating or deflating 5+ counts?
Anomalies in the data
Are the anomalies in series such as enwiki (for example, the one starting on 2013-01-09)
caused by geoip issues or by temporary disruptions in the job that runs the geowiki
script?
Longer term, if we’re not interested in country-level data, I think we should generate
this data directly from the revision tables unless there’s a strong reason to use
cu_changes (which I might be missing). This would avoid over-reporting editors who edit
from multiple countries, sidestep potential issues with changes in the geoip DB (like the
unconfirmed ones I mentioned above), and make the whole dataset reproducible (right now
historical geowiki data cannot be regenerated from scratch from the DBs, due to the
3-month lifecycle of cu_changes).
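To illustrate the over-reporting point, here is a small sketch with made-up data. It assumes the 5+ threshold is applied per (user, country) pair before the counts are summed, which is my guess at how the aggregation behaves, not something I have confirmed in the code; the function names are hypothetical.

```python
from collections import Counter


def summed_country_counts(edits, min_edits=5):
    """Count (user, country) pairs reaching the threshold, then sum across countries."""
    per_pair = Counter(edits)  # (user, country) -> number of edits
    return sum(1 for n in per_pair.values() if n >= min_edits)


def distinct_editors(edits, min_edits=5):
    """Count users reaching the threshold regardless of country."""
    per_user = Counter(user for user, _country in edits)
    return sum(1 for n in per_user.values() if n >= min_edits)


# alice edits from two countries and clears the threshold in both.
edits = [("alice", "DE")] * 5 + [("alice", "FR")] * 5 + [("bob", "US")] * 6

summed = summed_country_counts(edits)  # alice is counted once per country
distinct = distinct_editors(edits)     # alice is counted once
print(summed, distinct)
```

With this toy input the summed figure is 3 while there are only 2 distinct editors.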
Dario