Comments inline
Erik
-----Original Message-----
From: analytics-bounces(a)lists.wikimedia.org [mailto:analytics-bounces@lists.wikimedia.org]
On Behalf Of Christian Aistleitner
Sent: Tuesday, November 05, 2013 3:47 PM
To: A mailing list for the Analytics Team at WMF and everybody who has an interest in
Wikipedia and analytics.
Subject: Re: [Analytics] Analytics slave for s6 not replicating
Hi Dario,
On Mon, Nov 04, 2013 at 10:00:25PM -0800, Dario Taraborelli wrote:
I went through the geowiki source and I have a couple of questions on the data.
I am happy to share what I learnt about the geowiki code base :-)
Active Editor definition
Can you confirm that:
• edits are limited to ns0 (cuc_namespace = 0)
Yes.
• you’re applying a 30-day look-back window for each day excluding the first 30 days
in cu_changes
[EZ]: Suggestion: if we make that 28 instead of 30, the weekly ripple is gone (for some
days the 30-day history includes 5 weekends, for most days 4).
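The arithmetic behind the 28-versus-30 suggestion can be checked with a small sketch. It counts weekend days (Saturdays and Sundays) in a window that slides day by day. The function names, the start date, and the two-week span are illustrative assumptions and are not taken from the geowiki code.

```python
from datetime import date, timedelta


def weekend_days(end_day, window):
    """Count Saturdays and Sundays in the `window` days ending on `end_day` (inclusive)."""
    days = (end_day - timedelta(days=offset) for offset in range(window))
    return sum(1 for day in days if day.weekday() >= 5)


def distinct_weekend_counts(window, start=date(2013, 11, 1), span=14):
    """Weekend-day counts seen while the window slides over `span` consecutive end days."""
    return {weekend_days(start + timedelta(days=i), window) for i in range(span)}


# A 28-day window always covers exactly four full weeks.
counts_28 = distinct_weekend_counts(28)
# A 30-day window covers four weeks plus two extra days, which may or may not be weekend days.
counts_30 = distinct_weekend_counts(30)
```

With a 28-day window every day sees the same number of weekend days, so weekday-dependent editing patterns cancel out. With a 30-day window the count varies with the weekday on which the window ends, which is one plausible source of the ripple.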
As I learnt that some people have and use direct access to the geowiki tables of the
staging database, we have to split cases.
In the staging database, there are different “look-back windows” as well (say, 14 days).
But for the files in the geowiki-data repository, and for graphs such as
http://gp.wmflabs.org/graphs/active_editors_total
http://gp.wmflabs.org/graphs/enwiki_editor_counts
we only consider 30-day “look-back windows”.
However, I do not understand “excluding the first 30 days in cu_changes”. To my
knowledge, timewise geowiki only filters for the “look-back window”.
• you’re filtering out bots using a union of user_group = ‘bot’ and Erik’s manually
compiled list of bots (data/erikZ.bots)
[EZ]: see previous mail on bots criteria
(Assuming “user_group” should be read as “ug_group”)
Yes.
However, consider that data/erikZ.bots is stale and in dire need of an update. So
detection of new bots mostly relies on ug_group.
[EZ]: Up to date files are on stat1002. My earlier suggestion to send you these files (one
per project) on a regular basis still holds.
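The union rule discussed above, and the effect of a stale list, can be sketched as follows. The user names, the `is_bot` helper, and the sample data are hypothetical; this is an illustration of the rule, not the geowiki implementation.

```python
def is_bot(user_name, user_groups, listed_bots):
    """A user is treated as a bot if flagged via ug_group = 'bot' OR named in the bot list."""
    return "bot" in user_groups or user_name in listed_bots


# Hypothetical stand-in for the contents of a bot list such as data/erikZ.bots.
listed_bots = {"OldListBot"}

# Hypothetical users mapped to their ug_group values.
users = {
    "Alice": set(),
    "FlaggedBot": {"bot"},
    "OldListBot": set(),
    "NewUnflaggedBot": set(),  # a bot that is neither flagged nor on the stale list
}

humans = sorted(
    name for name, groups in users.items() if not is_bot(name, groups, listed_bots)
)
```

In this toy data, "NewUnflaggedBot" slips through as a human, which is the situation a stale list makes more likely.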
• you are not applying the countable page and content namespace filters
Yes.
We do not limit to Countable Pages. There is a bug for changing that.
We do not limit to Content Namespaces, only namespace 0. As that matches the current
definition of “Active editor” [1], I do not think adding such a filter to geowiki is on
the agenda. But Toby or Diederik might know better whether such a change has been scheduled.
[EZ]: Definition page: "Some day Wikistats may dynamically establish countable
namespaces per wiki via the API". Well that happened a few months ago. I updated the
definition page.
• the caveat about overcounting users editing from multiple countries still applies [...]
That depends a bit on which part of geowiki you're looking at.
For per project country breakdowns this observation should not hold.
But for graphs such as
http://gp.wmflabs.org/graphs/active_editors_total
http://gp.wmflabs.org/graphs/enwiki_editor_counts
your observation is accurate. It is also noted in the graph's description.
Note, however, that those graphs come with a “Tentative” in the title. So we do know that
those graphs come with many problems. But they at least allow us to expose immediate
trends, which proved to be a pressing need. So it's better to have those tentative
graphs online than to show nothing.
But yes, it's unsatisfactory.
Since you walked through the code base already: Patches welcome!
Especially patches that have been coordinated with consumers of the graphs ;-)
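The overcounting caveat for the aggregated graphs can be illustrated with a toy example: an editor who edits from two countries appears in two per-country counts, so summing those counts exceeds the number of distinct editors. The data and variable names are made up for illustration.

```python
from collections import defaultdict

# Hypothetical (editor, country) pairs within one look-back window.
edits = [("A", "DE"), ("A", "AT"), ("B", "DE"), ("C", "US")]

# Per-country sets of editors, as a per-country breakdown would see them.
editors_by_country = defaultdict(set)
for editor, country in edits:
    editors_by_country[country].add(editor)

# Summing per-country counts counts editor "A" twice (once for DE, once for AT).
summed_over_countries = sum(len(editors) for editors in editors_by_country.values())

# Counting distinct editors across all countries counts "A" once.
distinct_editors = len({editor for editor, _ in edits})
```

The per-country numbers themselves are unaffected; only the total obtained by adding them up is inflated.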
• edits to redirect pages are included
You anticipate a discussion that I have wanted to start for some time:
Analytics' edit definition is bound to wikistats [2], and is vague in many directions
(Is page creation an edit? How to treat redirects?...).
[EZ]: Why would page creation not be an edit?
[EZ]: Redirects are not counted. Quote from definition page: "In the context of
wikistats countable pages are pages which contain an internal link (aka wikilink) or
category link, and are not a redirect page. This conforms to the traditional definition of
an 'article' within the Wikimedia community." Any other 'vagueness' left?
Due to other pressing issues, this discussion has not yet been started, and geowiki still
uses the definition used by the original author:
Each row in cu_changes is considered an edit.
Geolookup issues
What happens to unresolved IP addresses?
That depends on the nature of “unresolved” and on whether you're interested in the
database or the generated csvs, and on whether you're interested in aggregated counts,
country breakdowns or city breakdowns.
So let me assume you're asking in the context of the graph urls above.
If the IP address does not look like an IPv4 (yes :-( ) address or the geoip module
returns an empty result, the edits get thrown in fallback buckets, which are considered
when aggregating across countries. So the edits/editors get counted in the graphs at the
above urls.
For the edit to get ignored, the geoip module would need to throw an exception. However,
according to our logs, this has not happened a single time in recent weeks.
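The three outcomes described above (resolved, fallback bucket, ignored) can be sketched like this. The bucket name, the `country_bucket` helper, and the stub lookup are assumptions for illustration; the real geowiki code and the GeoIP API it uses may look different.

```python
import ipaddress
from collections import Counter

UNKNOWN_BUCKET = "Unknown"  # hypothetical name for a fallback bucket


def country_bucket(ip_text, lookup):
    """Return a country bucket for an edit, or None if the edit is ignored."""
    try:
        address = ipaddress.ip_address(ip_text)
    except ValueError:
        return UNKNOWN_BUCKET      # does not parse as an IP address at all
    if address.version != 4:
        return UNKNOWN_BUCKET      # not IPv4, so it goes to the fallback bucket
    try:
        country = lookup(ip_text)
    except Exception:
        return None                # only a lookup exception drops the edit
    return country or UNKNOWN_BUCKET  # an empty result also falls back


def stub_lookup(ip_text):
    """Stand-in for a GeoIP database lookup (hypothetical data)."""
    if ip_text == "203.0.113.9":
        raise RuntimeError("simulated lookup failure")
    return {"192.0.2.1": "AT"}.get(ip_text, "")


sample_ips = ["192.0.2.1", "2001:db8::1", "198.51.100.7", "203.0.113.9"]
buckets = [country_bucket(ip, stub_lookup) for ip in sample_ips]
counted = Counter(bucket for bucket in buckets if bucket is not None)
```

Edits in the fallback bucket still contribute to totals aggregated across countries; only the edit whose lookup raised an exception is left out.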
How likely do you think is the possibility of artifacts in the data, inflating or
deflating 5+ counts?
Are you referring to reports of http://geoiplookup.wikimedia.org/ timeouts [3]?
If so, it is highly unlikely that they affect geowiki, as geowiki does not rely on that
service, but uses the GeoIP databases directly.
If you are not referring to the above timeouts, but to different reports, could you please
provide more details?
(geowiki logs show no geolocation problems.) (Graphs do not show obvious artifacts.)
Anomalies in the data
Are the anomalies in series such as enwiki (for example, the one
starting on 2013-01-09) caused by geoip issues or by temporary
disruptions in the job that runs the geowiki script?
As that was long before I joined the team, and I could not find any documentation about
this anomaly, I can only speculate about it.
So I'll have to leave definite answers to those who know first hand.
However, the drop you mention seems to be limited to enwiki. And the drop is linear
downwards over several days. The size of the drop/day is roughly the number of new active
editors that we'd expect per day.
So it looks as if no new rows were being added to cu_changes, while older ones moved out of
the “look-back window”. So when only looking at the graph, database issues (e.g.:
replication stuck) on the analytics slave for s1 might be a plausible explanation. This
would also match other characteristics of the drop.
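The speculated mechanism can be made concrete with a deliberately simplified model: a fixed number of new editors edits on each day, and replication stops after a given day. The function, its parameters, and all numbers are hypothetical and only meant to show the shape of the resulting curve.

```python
def rolling_counts(new_editors_per_day, last_replicated_day, report_days, window=30):
    """Editors visible in the look-back window when rows stop arriving after a given day.

    Simplifying assumption: each editor edits on exactly one day, and the same
    number of new editors shows up every day.
    """
    counts = []
    for day in report_days:
        first_day_in_window = day - window + 1
        last_day_with_rows = min(day, last_replicated_day)
        days_with_rows = max(0, last_day_with_rows - first_day_in_window + 1)
        counts.append(days_with_rows * new_editors_per_day)
    return counts


# Replication is assumed to stall after day 40; reports are generated for days 38..45.
counts = rolling_counts(100, last_replicated_day=40, report_days=range(38, 46))
daily_drops = [earlier - later for earlier, later in zip(counts, counts[1:])]
```

In this model the count stays flat while rows keep arriving and then falls by one day's worth of new editors per day, which is the linear decline described above.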
Longer term, if we’re not interested in country-level data, [...]
There are voices that strongly request per-country breakdowns :-)
That said, the overreporting is of course a problem. But as argued above, it's better
to have overreporting, tentative graphs that at least allow us to see trends than no
graph at all ;-)
As there seems to be larger demand for daily metrics, I am not sure whether generating
those graphs from within geowiki will be the long-term solution. Others need to decide
that.
Have fun,
Christian
[1]
https://www.mediawiki.org/w/index.php?title=Analytics/Metric_definitions&am…
[2]
https://www.mediawiki.org/w/index.php?title=Analytics/Metric_definitions&am…
[3] E.g.: October 19 on
https://wikitech.wikimedia.org/wiki/Server_Admin_Log
07:13 ori-l: reports of geoiplookup timing out in AU at enwiki VPT:
https://en.wikipedia.org/wiki/Wikipedia:Village_pump_%28technical%29#geoipl…
--
---- quelltextlich e.U. ---- \\ ---- Christian Aistleitner ----
Companies' registry: 360296y in Linz Christian Aistleitner
Gruendbergstrasze 65a Email: christian(a)quelltextlich.at
4040 Linz, Austria Phone: +43 732 / 26 95 63
Fax: +43 732 / 26 95 63
Homepage:
http://quelltextlich.at/
---------------------------------------------------------------