Comments inline
Erik
-----Original Message----- From: analytics-bounces@lists.wikimedia.org [mailto:analytics-bounces@lists.wikimedia.org] On Behalf Of Christian Aistleitner Sent: Tuesday, November 05, 2013 3:47 PM To: A mailing list for the Analytics Team at WMF and everybody who has an interest in Wikipedia and analytics. Subject: Re: [Analytics] Analytics slave for s6 not replicating
Hi Dario,
On Mon, Nov 04, 2013 at 10:00:25PM -0800, Dario Taraborelli wrote:
I went through the geowiki source and I have a couple of questions on the data.
I am happy to share what I learnt about the geowiki code base :-)
Active Editor definition Can you confirm that: • edits are limited to ns0 (cuc_namespace = 0)
Yes.
• you’re applying a 30-day look-back window for each day excluding the first 30 days in cu_changes
[EZ]: Suggestion: if we make that 28 instead of 30 the weekly ripple is gone (some days 30 day history includes 5 weekends, most days 4)
As I learnt that some people have and use direct access to the geowiki tables of the staging database, we have to split cases.
In the staging database, there are other “look-back windows” as well (say, 14 days). But for the files in the geowiki-data repository, and for graphs such as http://gp.wmflabs.org/graphs/active_editors_total http://gp.wmflabs.org/graphs/enwiki_editor_counts we only consider 30-day “look-back windows”.
However, I do not understand “excluding the first 30 days in cu_changes”. To my knowledge, timewise geowiki only filters for the “look-back window”.
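Erik's point about the weekly ripple can be checked with a small sketch. It counts the Saturdays and Sundays inside a trailing window for a run of consecutive end dates. The function names and the sample start date are made up for illustration and are not part of geowiki.

```python
from datetime import date, timedelta


def weekend_days_in_window(end_day, window_days):
    """Count Saturdays and Sundays in the window_days days ending at end_day (inclusive)."""
    days = (end_day - timedelta(days=offset) for offset in range(window_days))
    # date.weekday() returns 5 for Saturday and 6 for Sunday.
    return sum(1 for d in days if d.weekday() >= 5)


def distinct_weekend_counts(window_days, start=date(2013, 11, 1), span=28):
    """Set of weekend-day counts seen over `span` consecutive window end dates."""
    return {
        weekend_days_in_window(start + timedelta(days=i), window_days)
        for i in range(span)
    }


print(distinct_weekend_counts(28))  # a 28-day window always holds exactly 4 weekends
print(distinct_weekend_counts(30))  # a 30-day window holds 8, 9 or 10 weekend days
```

A 28-day window is a whole number of weeks, so its weekday mix never changes. The two extra days of a 30-day window sometimes land on a weekend, which is where the ripple comes from.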
• you’re filtering out bots using a union of user_group = ‘bot’ and Erik’s manually compiled list of bots (data/erikZ.bots)
[EZ]: see previous mail on bots criteria
(Assuming “user_group” should be read as “ug_group”)
Yes.
However, consider that data/erikZ.bots is stale and in dire need of an update. So, bot detection of new bots mostly relies on ug_groups.
[EZ]: Up to date files are on stat1002. My earlier suggestion to send you these files (one per project) on a regular basis still holds.
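The union described above (bot flag in ug_group, or membership in the manually compiled list) could look roughly like this. It is only a sketch: the function name and the in-memory data shapes are assumptions, and the file format of data/erikZ.bots is not reproduced here, so the list is assumed to be already loaded into a set of user names.

```python
def filter_out_bots(editors, groups_by_user, listed_bots):
    """Return the editors that are neither flagged as bots nor on the manual list.

    editors        -- iterable of user names
    groups_by_user -- dict mapping user name to a set of ug_group values (assumed shape)
    listed_bots    -- set of user names from the manually compiled bot list (assumed shape)
    """
    return [
        name
        for name in editors
        if "bot" not in groups_by_user.get(name, set()) and name not in listed_bots
    ]


# Hypothetical sample data, for illustration only.
groups = {"FlaggedBot": {"bot"}, "Alice": {"sysop"}}
manual_list = {"OldUnflaggedBot"}
print(filter_out_bots(["Alice", "FlaggedBot", "OldUnflaggedBot"], groups, manual_list))
```

With a stale manual list, only the ug_group half of the union catches newly created bots, which is the concern raised above.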
• you are not applying the countable page and content namespace filters
Yes. We do not limit to Countable Pages. There is a bug for changing that.
We do not limit to Content Namespaces, only namespace 0. As that matches the current definition of “Active editor” [1], I do not think adding such a filter to geowiki is on the agenda. But Toby or Diederik might know better whether such a change has been scheduled.
[EZ]: Definition page: "Some day Wikistats may dynamically establish countable namespaces per wiki via the API". Well that happened a few months ago. I updated the definition page.
• the caveat about overcounting users editing from multiple countries still applies [...]
That depends a bit on which part of geowiki you're looking at.
For per project country breakdowns this observation should not hold.
But for graphs such as http://gp.wmflabs.org/graphs/active_editors_total http://gp.wmflabs.org/graphs/enwiki_editor_counts your observation is accurate. It is also noted in the graph's description.
Note however that those graphs come with a “Tentative” in the title. So we do know that those graphs come with many problems. But they at least allow us to expose immediate trends, which proved to be a pressing need. So it's better to have those tentative graphs online than to show nothing.
But yes, it's unsatisfactory.
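To illustrate the overcounting caveat: a minimal sketch, assuming the totals are obtained by summing per-country editor counts (my reading of the thread, not verified against the geowiki code). The function names and sample rows are invented, and the 5+ edits threshold is left out for brevity.

```python
from collections import defaultdict


def editors_per_country(edits):
    """Count distinct editors per country from (editor, country) pairs."""
    by_country = defaultdict(set)
    for editor, country in edits:
        by_country[country].add(editor)
    return {country: len(names) for country, names in by_country.items()}


def summed_total(edits):
    """Total obtained by adding up the per-country counts."""
    return sum(editors_per_country(edits).values())


def distinct_total(edits):
    """Total obtained by counting each editor once, regardless of country."""
    return len({editor for editor, _ in edits})


# Editor "A" edits from two countries and is therefore counted twice in the sum.
sample = [("A", "AT"), ("A", "DE"), ("B", "AT")]
print(summed_total(sample), distinct_total(sample))
```

Per-country breakdowns stay correct in this model. Only the cross-country sum inflates.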
Since you walked through the code base already: Patches welcome!
Especially patches that have been coordinated with consumers of the graphs ;-)
• edits to redirect pages are included
You anticipate a discussion that I have wanted to start for some time: Analytics' edit definition is bound to wikistats [2], and is vague in many directions (Is page creation an edit? How to treat redirects?...).
[EZ]: Why would page creation not be an edit? [EZ]: Redirects are not counted. Quote from definition page: "In the context of wikistats countable pages are pages which contain an internal link (aka wikilink) or category link, and are not a redirect page. This conforms to the traditional definition of an 'article' within the Wikimedia community." Any other 'vagueness' left?
Due to other pressing issues, this discussion has not yet been started, and geowiki still uses the definition used by the original author: Each row in cu_changes is considered an edit.
Geolookup issues What happens to unresolved IP addresses?
That depends on the nature of “unresolved” and on whether you're interested in the database or the generated csvs, and on whether you're interested in aggregated counts, country breakdowns or city breakdowns.
So let me assume you're asking in the context of the graph urls above.
If the IP address does not look like an IPv4 (yes :-( ) address or the geoip module returns an empty result, the edits get thrown in fallback buckets, which are considered when aggregating across countries. So the edits/editors get counted in the graphs at the above urls.
For the edit to get ignored, the geoip module would need to throw an exception. However, according to our logs, this has not happened a single time in recent weeks.
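The three outcomes described above (resolved country, fallback bucket, ignored edit) can be sketched as follows. This is not the geowiki implementation: the bucket label, the function name, and the `lookup` callable standing in for the geoip module are all hypothetical, and the IPv4 check uses Python's standard `ipaddress` module rather than whatever test geowiki applies.

```python
import ipaddress

FALLBACK_BUCKET = "Unknown"  # hypothetical label for the fallback bucket


def country_bucket(ip_text, lookup):
    """Return the bucket an edit is counted in, or None if the edit is ignored.

    lookup is a stand-in for the geoip module: a callable that takes the
    address string and returns a country code, or an empty/None result.
    """
    try:
        ipaddress.IPv4Address(ip_text)
    except ValueError:
        # Not an IPv4 address (this includes IPv6): counted in the fallback bucket.
        return FALLBACK_BUCKET
    try:
        country = lookup(ip_text)
    except Exception:
        # Only an exception from the lookup causes the edit to be dropped.
        return None
    # An empty lookup result also lands in the fallback bucket.
    return country or FALLBACK_BUCKET


# Toy lookup table using documentation-only addresses.
toy_lookup = {"192.0.2.1": "AT"}.get
print(country_bucket("192.0.2.1", toy_lookup))     # resolved
print(country_bucket("2001:db8::1", toy_lookup))   # not IPv4, fallback
print(country_bucket("198.51.100.7", toy_lookup))  # empty result, fallback
```

Because fallback buckets are included when aggregating across countries, only the exception path would remove edits from the totals.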
How likely do you think is the possibility of artifacts in the data, inflating or deflating 5+ counts?
Are you referring to reports of http://geoiplookup.wikimedia.org/ timeouts [3]?
If so, it is highly unlikely that it affects geowiki, as geowiki is not relying on that service, but uses the GeoIP databases directly.
If you are not referring to the above timeouts, but to different reports, could you please provide more details?
(geowiki logs show no geolocation problems.) (Graphs do not show obvious artifacts.)
Anomalies in the data Are the anomalies in series such as enwiki (for example, the one starting on 2013-01-09) caused by geoip issues or by temporary disruptions in the job that runs the geowiki script?
As that was long before I joined the team, and I could not find any documentation about this anomaly, I can only speculate about it.
So I'll have to leave definite answers to those who know first hand.
However, the drop you mention seems to be limited to enwiki. And the drop is linear downwards over several days. The size of the drop/day is roughly the number of new active editors that we'd expect per day. So it looks just like no new rows being added to cu_changes, while older ones move out of the “look-back window”. So when only looking at the graph, database issues (e.g.: replication stuck) on the analytics slave for s1 might be a plausible explanation. This would also match other characteristics of the drop.
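The stuck-replication hypothesis can be illustrated with a toy model. It assumes, purely for simplicity, that every editor is active on exactly one day and that each day brings the same number of editors. It ignores the 5+ edits threshold and uses invented names throughout.

```python
def windowed_editor_counts(editors_by_day, window_days, stall_after=None):
    """Distinct editors seen in a trailing window, for each day.

    editors_by_day -- list of sets of editor names, indexed by day number
    stall_after    -- last day whose rows were replicated; later days add no rows
    """
    counts = []
    for day in range(len(editors_by_day)):
        first = max(0, day - window_days + 1)
        seen = set()
        for d in range(first, day + 1):
            if stall_after is None or d <= stall_after:
                seen |= editors_by_day[d]
        counts.append(len(seen))
    return counts


# 50 days with 10 fresh editors per day; replication stops after day 39.
days = [{f"editor_{day}_{i}" for i in range(10)} for day in range(50)]
stalled = windowed_editor_counts(days, 30, stall_after=39)
print(stalled[39:46])  # falls by 10 per day once no new rows arrive
```

In this model the count falls by exactly the daily number of new editors while old rows age out of the window, which is the linear shape described above.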
Longer term, if we’re not interested in country-level data, [...]
There are voices that strongly request per country break downs :-)
That said, the overreporting is of course a problem. But as argued above, it's better to have the overreporting, tentative graphs that at least allow us to exhibit trends than no graph at all ;-)
As there seems to be some larger demand for daily metrics, I am not sure whether generating those graphs from within geowiki will be the long-term solution. Others need to decide that.
Have fun, Christian
[1] https://www.mediawiki.org/w/index.php?title=Analytics/Metric_definitions&...
[2] https://www.mediawiki.org/w/index.php?title=Analytics/Metric_definitions&...
[3] E.g.: October 19 on https://wikitech.wikimedia.org/wiki/Server_Admin_Log
    07:13 ori-l: reports of geoiplookup timing out in AU at enwiki VPT: https://en.wikipedia.org/wiki/Wikipedia:Village_pump_%28technical%29#geoiplo...
-- ---- quelltextlich e.U. ---- \ ---- Christian Aistleitner ---- Companies' registry: 360296y in Linz Christian Aistleitner Gruendbergstrasze 65a Email: christian@quelltextlich.at 4040 Linz, Austria Phone: +43 732 / 26 95 63 Fax: +43 732 / 26 95 63 Homepage: http://quelltextlich.at/ ---------------------------------------------------------------