Can you confirm that:
• edits are limited to ns0 (cuc_namespace = 0)
• you’re applying a 30-day look-back window for each day, excluding the first 30 days available in cu_changes
• you’re filtering out bots using a union of user_group = ‘bot’ and Erik’s manually compiled list of bots (data/erikZ.bots)
• you are not applying the countable page and content namespace filters
• the caveat about overcounting users editing from multiple countries still applies (it looks like it does given that the data is generated by counts aggregated by country / date / project stored in the staging DB)
• edits to redirect pages are included
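The definition implied by the points above can be sketched in Python as follows. This is only an illustration of the counting rule under stated assumptions, not the actual geowiki code: the function name, the (user, day, namespace) tuple layout and the choice of an inclusive 30-day window ending on the given day are all hypothetical.

```python
from collections import Counter
from datetime import date, timedelta

def active_editors(edits, bots, day, window_days=30, threshold=5):
    """Count non-bot users with at least `threshold` ns0 edits in the
    `window_days`-day window ending on `day` (inclusive, an assumption)."""
    start = day - timedelta(days=window_days - 1)
    counts = Counter(
        user
        for user, edit_day, namespace in edits
        # ns0 only, bots excluded, edit inside the look-back window
        if namespace == 0 and user not in bots and start <= edit_day <= day
    )
    return sum(1 for n in counts.values() if n >= threshold)

# `bots` stands for the union of the two bot sources mentioned above.
bots = {"ExampleBot"} | {"AnotherBot"}
sample = [("alice", date(2013, 1, 10), 0)] * 5
print(active_editors(sample, bots, date(2013, 1, 31)))
```

Note that nothing here checks whether a page is countable or a redirect, which matches the points about those filters not being applied.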
Geolookup issues
What happens to unresolved IP addresses? I’ve been told by a number of folks that the geoip DB has had several issues lately, meaning that the volume of IPs that do not resolve to a specific country may have changed over time. How likely do you think it is that this introduces artifacts in the data, inflating or deflating the 5+ counts?
Anomalies in the data
Are the anomalies in series such as enwiki (for example, the one starting on 2013-01-09) caused by geoip issues or by temporary disruptions in the job that runs the geowiki script?
Longer term, if we’re not interested in country-level data, I think we should generate this data directly from the revision tables unless there’s a strong reason to use cu_changes (which I might be missing). This would avoid over-reporting editors who are counted under multiple countries and avoid potential issues with changes in the geoip DB (like the unconfirmed ones that I mentioned above). It would also make the whole dataset replicable: right now historical data from geowiki cannot be reproduced from scratch from the DBs, due to the 3-month lifecycle of cu_changes.
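The multiple-country over-reporting can be shown with a small Python sketch. The function names and the (user, country) input layout are hypothetical, and the per-country threshold logic is an assumption about how the aggregated counts behave, not a description of the geowiki script.

```python
from collections import Counter

def summed_per_country(edits, threshold=5):
    """Sum of per-country 5+ editor counts, as obtained when adding up
    counts that were aggregated by country."""
    per_pair = Counter(edits)  # edits per (user, country) pair
    return sum(1 for n in per_pair.values() if n >= threshold)

def project_level(edits, threshold=5):
    """5+ editor count computed per user, ignoring country."""
    per_user = Counter(user for user, _ in edits)
    return sum(1 for n in per_user.values() if n >= threshold)

# One editor with 5 edits from each of two countries.
edits = [("alice", "US")] * 5 + [("alice", "DE")] * 5
print(summed_per_country(edits), project_level(edits))
```

In this example the editor is counted once per country when the country-level counts are summed, but only once when counting per user, which is what a revision-table-based count would give.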