Hey Christian,
I went through the geowiki source and I have a couple of questions on the data.

Active Editor definition
Can you confirm that:
• edits are limited to ns0 (cuc_namespace = 0)
• you’re applying a 30-day look-back window for each day, excluding the first 30 days in cu_changes
• you’re filtering out bots using a union of user_group = ‘bot’ and Erik’s manually compiled list of bots (data/erikZ.bots)
• you are not applying the countable page and content namespace filters
• the caveat about overcounting users editing from multiple countries still applies (it looks like it does, given that the data is generated by counts aggregated by country / date / project stored in the staging DB)
• edits to redirect pages are included
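
To make sure we mean the same thing, here is a minimal Python sketch of the definition as I read it. It is not taken from the geowiki code: the function name, the (user, namespace, edit_date) tuple layout, and the exact window boundaries are my assumptions, and the threshold of 5 is the 5+ cutoff.

```python
from collections import Counter
from datetime import date, timedelta


def active_editors(edits, day, bots=frozenset(), window_days=30, threshold=5):
    """Return the users with at least `threshold` ns0 edits in the look-back window.

    `edits` is an iterable of (user, namespace, edit_date) tuples. This layout
    is hypothetical and only stands in for rows read from cu_changes.
    """
    # Assumption: the window covers the `window_days` days ending on `day`.
    start = day - timedelta(days=window_days)
    counts = Counter(
        user
        for user, namespace, edit_date in edits
        if namespace == 0                 # ns0 only
        and start < edit_date <= day      # 30-day look-back
        and user not in bots              # union of both bot lists
    )
    return {user for user, n in counts.items() if n >= threshold}
```

If any of the filters in the list above is wrong, it should map onto one of the three conditions in the generator expression.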

Geolookup issues
What happens to unresolved IP addresses? A number of folks have told me that the geoip DB has had several issues lately, meaning that the volume of IPs that do not resolve to a specific country may have changed over time. How likely do you think it is that this introduced artifacts in the data, inflating or deflating the 5+ counts?

Anomalies in the data
Are the anomalies in series such as enwiki (for example, the one starting on 2013-01-09) caused by geoip issues or by temporary disruptions in the job that runs the geowiki script?

Longer term, if we’re not interested in country-level data, I think we should generate this data directly from the revision tables unless there’s a strong reason to use cu_changes (which I might be missing). This will avoid over-reporting due to multiple-country editor counting, avoid potential issues with changes in the geoip DB (like the unconfirmed ones that I mentioned above) and also make the whole data replicable (right now historical data from geowiki cannot be reproduced from scratch from the DBs, due to the 3-month lifecycle of cu_changes).
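
To illustrate the over-reporting point with a toy example (the data layout and function names here are made up, and I am assuming each user already counts as active in every country listed for them): summing per-country counts counts a multi-country editor once per country, while a project-level count over distinct users does not.

```python
from collections import Counter


def summed_country_counts(editor_countries):
    """Sum per-country editor counts, as an aggregation by country would.

    `editor_countries` maps a user to the set of countries they edited from
    (hypothetical layout, for illustration only).
    """
    by_country = Counter()
    for countries in editor_countries.values():
        by_country.update(countries)  # one count per (user, country) pair
    return sum(by_country.values())


def distinct_editor_count(editor_countries):
    """Count each editor once, regardless of how many countries they edited from."""
    return len(editor_countries)


example = {"alice": {"US", "DE"}, "bob": {"US"}}
print(summed_country_counts(example), distinct_editor_count(example))  # 3 2
```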
Dario