Hi,
Just a quick heads-up: the analytics slave for s6 (frwiki, jawiki, ruwiki) seems to have stopped replicating at 2013-10-28 ~07:15. I filed an RT ticket. If your scripts rely on the slave, expect the numbers to be off until the problem has been fixed.
We know that at least the following graphs (and corresponding CSVs) are affected:
http://gp.wmflabs.org/graphs/active_editors_total
http://gp.wmflabs.org/graphs/frwiki_editor_counts
http://gp.wmflabs.org/graphs/jawiki_editor_counts
http://gp.wmflabs.org/graphs/ruwiki_editor_counts
We'll rerun data aggregation for them after the problem has been fixed, so expect the numbers for dates >= 2013-10-28 to jump after the fix.
Best regards, Christian
Thanks Christian -- appreciate the work and the heads-up.
-Toby
On Fri, Nov 1, 2013 at 4:12 PM, Christian Aistleitner < christian@quelltextlich.at> wrote:
Hi,
just a quick heads up that the analytics slave for s6 (frwiki, jawiki, ruwiki) seems to not be replicating since 2013-10-28 ~07:15. I filed an RT ticket. But if your scripts rely on the slave, expect the numbers to be off until the problem has been fixed.
We know that at least the following graphs (and corresponding CSVs) are affected: http://gp.wmflabs.org/graphs/active_editors_total http://gp.wmflabs.org/graphs/frwiki_editor_counts http://gp.wmflabs.org/graphs/jawiki_editor_counts http://gp.wmflabs.org/graphs/ruwiki_editor_counts
We'll rerun data aggregation for them after the problem has been fixed. So expect the numbers for >=2013-10-28 to jump after the fix.
Best regards, Christian
-- ---- quelltextlich e.U. ---- \ ---- Christian Aistleitner ---- Companies' registry: 360296y in Linz Christian Aistleitner Gruendbergstrasze 65a Email: christian@quelltextlich.at 4040 Linz, Austria Phone: +43 732 / 26 95 63 Fax: +43 732 / 26 95 63 Homepage: http://quelltextlich.at/
Analytics mailing list Analytics@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/analytics
I did a quick check on tools-login to see whether labs replication is affected and, oddly, it is not.
frwiki, ruwiki, and jawiki have data from a few hours ago in a couple of tables. So maybe people could use labs db for a while until this is fixed. This also means that wikimetrics is unaffected.
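A quick check like this boils down to comparing the newest timestamp in a table with the current time. A minimal sketch, assuming timestamps in MediaWiki's 14-digit `YYYYMMDDHHMMSS` UTC format; the function name, the sample values and the 6-hour threshold are hypothetical, and fetching the newest timestamp (for example the maximum `rc_timestamp` in `recentchanges`) is left out:

```python
from datetime import datetime, timedelta

MW_FORMAT = "%Y%m%d%H%M%S"  # MediaWiki stores timestamps as 14-digit UTC strings


def replication_lag(newest_ts: str, now: datetime) -> timedelta:
    """Return how far the newest replicated row lags behind `now`."""
    return now - datetime.strptime(newest_ts, MW_FORMAT)


# Hypothetical values: newest row from 2013-10-28 07:15, checked on 2013-11-01 23:12 UTC.
lag = replication_lag("20131028071500", datetime(2013, 11, 1, 23, 12))
is_stale = lag > timedelta(hours=6)  # the 6-hour threshold is an arbitrary choice
print(lag, is_stale)
```

A replica that is only a few hours behind would pass this check, while one stuck since 2013-10-28 would not.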
Hi,
On Sat, Nov 02, 2013 at 12:12:00AM +0100, Christian Aistleitner wrote:
the analytics slave for s6 (frwiki, jawiki, ruwiki) seems to not be replicating since 2013-10-28 ~07:15.
ops fixed the problem in the meantime. Yay! Thanks.
We know that at least the following graphs (and corresponding CSVs) are affected: http://gp.wmflabs.org/graphs/active_editors_total http://gp.wmflabs.org/graphs/frwiki_editor_counts http://gp.wmflabs.org/graphs/jawiki_editor_counts http://gp.wmflabs.org/graphs/ruwiki_editor_counts
Jobs have been rerun, and the above urls show the expected data again.
Best regards, Christian
Hey Christian,
I went through the geowiki source and I have a couple of questions on the data.
Active Editor definition
Can you confirm that:
• edits are limited to ns0 (cuc_namespace = 0)
• you’re applying a 30-day look-back window for each day excluding the first 30 days in cu_changes
• you’re filtering out bots using a union of user_group = ‘bot’ and Erik’s manually compiled list of bots (data/erikZ.bots)
• you are not applying the countable page and content namespace filters
• the caveat about overcounting users editing from multiple countries still applies (it looks like it does given that the data is generated by counts aggregated by country / date / project stored in the staging DB)
• edits to redirect pages are included
Geolookup issues
What happens to unresolved IP addresses? I’ve been told by a number of folks that the geoip DB had several issues lately, meaning that the volume of IPs that do not resolve to a specific country may have changed over time. How likely do you think it is that this introduced artifacts in the data, inflating or deflating the 5+ counts?
Anomalies in the data
Are the anomalies in series such as enwiki (for example, the one starting on 2013-01-09) caused by geoip issues or by temporary disruptions in the job that runs the geowiki script?
Longer term, if we’re not interested in country-level data, I think we should generate this data directly from the revision tables unless there’s a strong reason to use cu_changes (which I might be missing). This will avoid over-reporting due to multiple-country editor counting, avoid potential issues with changes in the geoip DB (like the unconfirmed ones that I mentioned above) and also make the whole data replicable (right now historical data from geowiki cannot be reproduced from scratch from the DBs, due to the 3-month lifecycle of cu_changes).
Dario
Hi Dario,
On Mon, Nov 04, 2013 at 10:00:25PM -0800, Dario Taraborelli wrote:
I went through the geowiki source and I have a couple of questions on the data.
I am happy to share what I learnt about the geowiki code base :-)
Active Editor definition Can you confirm that: • edits are limited to ns0 (cuc_namespace = 0)
Yes.
• you’re applying a 30-day look-back window for each day excluding the first 30 days in cu_changes
As I learnt that some people have and use direct access to the geowiki tables of the staging database, we have to split cases.
In the staging database, there are other “look-back windows” as well (say, 14 days). But for the files in the geowiki-data repository, and for graphs such as http://gp.wmflabs.org/graphs/active_editors_total and http://gp.wmflabs.org/graphs/enwiki_editor_counts, we only consider 30-day “look-back windows”.
However, I do not understand “excluding the first 30 days in cu_changes”. To my knowledge, time-wise geowiki only filters by the “look-back window”.
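To make the “look-back window” concrete, here is a minimal sketch of counting active editors for one day. It is not the geowiki code: the row layout, the function name and the inclusive window boundaries are assumptions, the threshold of 5 edits follows the “active editor” definition discussed in this thread, and bot filtering is left out.

```python
from collections import Counter
from datetime import date, timedelta


def active_editors(edits, day, window_days=30, min_edits=5):
    """Count users with at least `min_edits` namespace-0 edits in the
    `window_days` days ending on `day` (inclusive)."""
    start = day - timedelta(days=window_days - 1)
    per_user = Counter(
        user
        for user, namespace, edit_day in edits
        if namespace == 0 and start <= edit_day <= day
    )
    return sum(1 for count in per_user.values() if count >= min_edits)


# Hypothetical rows: (user, namespace, day of edit)
edits = (
    [("alice", 0, date(2013, 10, 20))] * 5    # 5 mainspace edits: active
    + [("bob", 0, date(2013, 10, 20))] * 4    # only 4 edits: not active
    + [("carol", 1, date(2013, 10, 20))] * 9  # namespace 1 edits are ignored
    + [("dave", 0, date(2013, 9, 1))] * 7     # outside the 30-day window
)
print(active_editors(edits, date(2013, 11, 1)))
```

Widening `window_days` changes which rows are considered, which is why the different windows in the staging database yield different counts for the same day.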
• you’re filtering out bots using a union of user_group = ‘bot’ and Erik’s manually compiled list of bots (data/erikZ.bots)
(Assuming “user_group” should be read as “ug_group”)
Yes.
However, consider that data/erikZ.bots is stale and in dire need of an update. So, bot detection of new bots mostly relies on ug_groups.
• you are not applying the countable page and content namespace filters
Yes. We do not limit to Countable Pages. There is a bug for changing that.
We do not limit to Content Namespaces, only namespace 0. As that matches the current definition of “Active editor” [1], I do not think adding such a filter to geowiki is on the agenda. But Toby, or Diederik might know better if such a change has been scheduled.
• the caveat about overcounting users editing from multiple countries still applies [...]
That depends a bit on which part of geowiki you're looking at.
For per project country breakdowns this observation should not hold.
But for graphs such as http://gp.wmflabs.org/graphs/active_editors_total and http://gp.wmflabs.org/graphs/enwiki_editor_counts, your observation is accurate. It is also noted in the graph's description.
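The overcounting can be illustrated with a few rows. This is only a sketch with made-up data, assuming the totals are obtained by summing per-country editor counts; the names and the data layout are hypothetical and not taken from geowiki:

```python
from collections import defaultdict

# Hypothetical (editor, country) pairs for one project and one look-back window.
observations = [
    ("alice", "FR"),
    ("alice", "BE"),  # alice edited from two countries
    ("bob", "FR"),
    ("carol", "JP"),
]

editors_by_country = defaultdict(set)
for editor, country in observations:
    editors_by_country[country].add(editor)

# Summing per-country counts counts alice twice ...
summed_over_countries = sum(len(editors) for editors in editors_by_country.values())
# ... while counting distinct editors directly does not.
distinct_editors = len({editor for editor, _ in observations})

print(summed_over_countries, distinct_editors)
```

Within a single country's breakdown each editor appears once, which is why the per-project country breakdowns are not affected in the same way.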
Note, however, that those graphs come with a “Tentative” in the title. So we do know that those graphs come with many problems. But they at least allow us to expose immediate trends, which proved to be a pressing need. So it's better to have those tentative graphs online than to show nothing.
But yes, it's unsatisfactory.
Since you walked through the code base already: Patches welcome!
Especially patches that have been coordinated with consumers of the graphs ;-)
• edits to redirect pages are included
You anticipate a discussion that I have wanted to start for some time: Analytics' edit definition is bound to wikistats [2], and is vague in many directions (Is page creation an edit? How to treat redirects? ...).
Due to other pressing issues, this discussion has not yet been started, and geowiki still uses the definition used by the original author: Each row in cu_changes is considered an edit.
Geolookup issues What happens to unresolved IP addresses?
That depends on the nature of “unresolved” and on whether you're interested in the database or the generated csvs, and on whether you're interested in aggregated counts, country breakdowns or city breakdowns.
So let me assume you're asking in the context of the graph urls above.
If the IP address does not look like an IPv4 (yes :-( ) address or the geoip module returns an empty result, the edits get thrown in fallback buckets, which are considered when aggregating across countries. So the edits/editors get counted in the graphs at the above urls.
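The fallback behaviour can be sketched roughly as follows. This is not the geowiki implementation: the bucket names are made up, the IPv4 check via the standard `ipaddress` module is an assumption about what “looks like an IPv4 address” means, and `lookup` merely stands in for the real GeoIP call.

```python
import ipaddress
from collections import Counter


def country_bucket(ip: str, lookup) -> str:
    """Return a country code, or a fallback bucket name when the address
    cannot be resolved. `lookup` stands in for the real GeoIP call."""
    try:
        ipaddress.IPv4Address(ip)
    except ValueError:
        return "UNKNOWN_NOT_IPV4"  # hypothetical bucket name
    return lookup(ip) or "UNKNOWN_NO_RESULT"  # hypothetical bucket name


# Hypothetical lookup table in place of a GeoIP database.
fake_db = {"192.0.2.1": "FR", "198.51.100.7": "JP"}
ips = ["192.0.2.1", "198.51.100.7", "203.0.113.9", "2001:db8::1"]

buckets = Counter(country_bucket(ip, fake_db.get) for ip in ips)
total_edits = sum(buckets.values())  # fallback buckets are part of the total
print(buckets, total_edits)
```

Because the fallback buckets are included when summing across countries, unresolved addresses shift edits between buckets but do not remove them from the aggregated totals.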
For an edit to get ignored, the geoip module would need to throw an exception. However, according to our logs, this did not happen a single time in recent weeks.
How likely do you think is the possibility of artifacts in the data, inflating or deflating 5+ counts?
Are you referring to reports of http://geoiplookup.wikimedia.org/ timeouts [3]?
If so, it is highly unlikely that it affects geowiki, as geowiki is not relying on that service, but uses the GeoIP databases directly.
If you are not referring to above timeouts, but different reports, could you please provide more details?
(geowiki logs show no geolocation problems.) (Graphs do not show obvious artifacts.)
Anomalies in the data Are the anomalies in series such as enwiki (for example, the one starting on 2013-01-09) caused by geoip issues or by temporary disruptions in the job that runs the geowiki script?
As that was long before I joined the team, and I could not find any documentation about this anomaly, I can only speculate about it.
So I'll have to leave definite answers to those who know first hand.
However, the drop you mention seems to be limited to enwiki. And the drop is linear downwards over several days. The size of the drop/day is roughly the number of new active editors that we'd expect per day. So it looks just like no new rows being added to cu_changes, while older ones move out of the “look-back window”. So when only looking at the graph, database issues (e.g.: replication stuck) on the analytics slave for s1 might be a plausible explanation. This would also match other characteristics of the drop.
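The shape of such a drop can be reproduced with a toy model. The sketch below makes strong simplifying assumptions that are not taken from the real data: every editor qualifies on exactly one day, the same number of editors qualifies each day, and all numbers are invented.

```python
def active_count(day, window_days, editors_per_day, last_replicated_day):
    """Editors whose (single) qualifying day falls inside the look-back
    window and was replicated before replication stopped."""
    first_day_in_window = day - window_days + 1
    last_visible_day = min(day, last_replicated_day)
    visible_days = max(0, last_visible_day - first_day_in_window + 1)
    return visible_days * editors_per_day


# Hypothetical numbers: 100 editors qualify each day, replication stops after day 50.
series = [active_count(day, 30, 100, last_replicated_day=50) for day in range(48, 56)]
print(series)
```

In this model the count stays flat while rows keep arriving and then falls by one day's worth of editors per day, which is the linear pattern described above.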
Longer term, if we’re not interested in country-level data, [...]
There are voices that strongly request per country break downs :-)
That said, the overreporting is of course a problem. But as argued above, it's better to have overreporting, tentative graphs that at least allow us to see trends than no graphs at all ;-)
As there seems to be some larger demand for daily metrics, I am not sure whether generating those graphs from within geowiki will be the long term solution. Others need to decide that.
Have fun, Christian
[1] https://www.mediawiki.org/w/index.php?title=Analytics/Metric_definitions&...
[2] https://www.mediawiki.org/w/index.php?title=Analytics/Metric_definitions&...
[3] E.g.: October 19 on https://wikitech.wikimedia.org/wiki/Server_Admin_Log
    07:13 ori-l: reports of geoiplookup timing out in AU at enwiki VPT: https://en.wikipedia.org/wiki/Wikipedia:Village_pump_%28technical%29#geoiplo...
Comments inline
Erik
-----Original Message-----
From: analytics-bounces@lists.wikimedia.org [mailto:analytics-bounces@lists.wikimedia.org] On Behalf Of Christian Aistleitner
Sent: Tuesday, November 05, 2013 3:47 PM
To: A mailing list for the Analytics Team at WMF and everybody who has an interest in Wikipedia and analytics.
Subject: Re: [Analytics] Analytics slave for s6 not replicating
Hi Dario,
On Mon, Nov 04, 2013 at 10:00:25PM -0800, Dario Taraborelli wrote:
I went through the geowiki source and I have a couple of questions on the data.
I am happy to share what I learnt about the geowiki code base :-)
Active Editor definition Can you confirm that: • edits are limited to ns0 (cuc_namespace = 0)
Yes.
• you’re applying a 30-day look-back window for each day excluding the first 30 days in cu_changes
[EZ]: Suggestion: if we make that 28 instead of 30, the weekly ripple is gone (on some days the 30-day history includes 5 weekends, on most days 4)
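The effect of the window length on the number of weekend days can be checked with a few lines. This sketch only counts Saturdays and Sundays in a trailing window; the function name and the chosen dates are arbitrary, and it says nothing about actual edit volumes.

```python
from datetime import date, timedelta


def weekend_days_in_window(end_day: date, window_days: int) -> int:
    """Count Saturdays and Sundays in the window ending on `end_day`."""
    days = (end_day - timedelta(days=offset) for offset in range(window_days))
    return sum(1 for d in days if d.weekday() >= 5)


# Slide both windows over two weeks of November 2013.
end_days = [date(2013, 11, 1) + timedelta(days=i) for i in range(14)]
counts_30 = {weekend_days_in_window(d, 30) for d in end_days}
counts_28 = {weekend_days_in_window(d, 28) for d in end_days}
print(sorted(counts_30), sorted(counts_28))
```

A 28-day window always spans exactly four full weeks, so its weekend-day count never changes, whereas the 30-day window picks up zero, one or two extra weekend days depending on the end date.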
As I learnt that some people have and use direct access to the geowiki tables of the staging database, we have to split cases.
In the staging database, there are other “look-back windows” as well (say, 14 days). But for the files in the geowiki-data repository, and for graphs such as http://gp.wmflabs.org/graphs/active_editors_total and http://gp.wmflabs.org/graphs/enwiki_editor_counts, we only consider 30-day “look-back windows”.
However, I do not understand “excluding the first 30 days in cu_changes”. To my knowledge, time-wise geowiki only filters by the “look-back window”.
• you’re filtering out bots using a union of user_group = ‘bot’ and Erik’s manually compiled list of bots (data/erikZ.bots)
[EZ]: see previous mail on bots criteria
(Assuming “user_group” should be read as “ug_group”)
Yes.
However, consider that data/erikZ.bots is stale and in dire need of an update. So, bot detection of new bots mostly relies on ug_groups.
[EZ]: Up to date files are on stat1002. My earlier suggestion to send you these files (one per project) on a regular basis still holds.
• you are not applying the countable page and content namespace filters
Yes. We do not limit to Countable Pages. There is a bug for changing that.
We do not limit to Content Namespaces, only namespace 0. As that matches the current definition of “Active editor” [1], I do not think adding such a filter to geowiki is on the agenda. But Toby, or Diederik might know better if such a change has been scheduled.
[EZ]: Definition page: "Some day Wikistats may dynamically establish countable namespaces per wiki via the API". Well that happened a few months ago. I updated the definition page.
• the caveat about overcounting users editing from multiple countries still applies [...]
That depends a bit on which part of geowiki you're looking at.
For per project country breakdowns this observation should not hold.
But for graphs such as http://gp.wmflabs.org/graphs/active_editors_total and http://gp.wmflabs.org/graphs/enwiki_editor_counts, your observation is accurate. It is also noted in the graph's description.
Note, however, that those graphs come with a “Tentative” in the title. So we do know that those graphs come with many problems. But they at least allow us to expose immediate trends, which proved to be a pressing need. So it's better to have those tentative graphs online than to show nothing.
But yes, it's unsatisfactory.
Since you walked through the code base already: Patches welcome!
Especially patches that have been coordinated with consumers of the graphs ;-)
• edits to redirect pages are included
You anticipate a discussion that I have wanted to start for some time: Analytics' edit definition is bound to wikistats [2], and is vague in many directions (Is page creation an edit? How to treat redirects? ...).
[EZ]: Why would page creation not be an edit?
[EZ]: Redirects are not counted. Quote from definition page: "In the context of wikistats countable pages are pages which contain an internal link (aka wikilink) or category link, and are not a redirect page. This conforms to the traditional definition of an 'article' within the Wikimedia community." Any other 'vagueness' left?
Hi Erik,
[ rearranged quotations due to Outlook's lack of quotation marks ]
On Tue, Nov 05, 2013 at 04:15:58PM +0100, Erik Zachte wrote:
Christian Aistleitner wrote:
However, consider that data/erikZ.bots is stale and in dire need of an update. So, bot detection of new bots mostly relies on ug_groups.
Up to date files are on stat1002. My earlier suggestion to send you these files (one per project) on a regular basis still holds.
Thanks for renewing that offer. However, we cannot consume those files 1:1; we need to do some minor conversions, further checks, etc. As that'll take some time, we cannot do it ad hoc, but need to schedule it.
We do not limit to Content Namespaces, only namespace 0. As that matches the current definition of “Active editor” [...]
Definition page: "Some day Wikistats may dynamically establish countable namespaces per wiki via the API". Well that happened a few months ago. I updated the definition page.
Thanks for updating the documentation on “content namespaces”. I hope the veterans in our team can vet this change.
However, “active editor” is defined as “[...] person [...] who makes 5 or more edits [...] in /mainspace/”.
Whatever additional namespaces “content namespace” might allow, the definition of “active editor” strips it again down to only mainspace == namespace 0.
You anticipate a discussion that I have wanted to start for some time: Analytics' edit definition is bound to wikistats [2], and is vague in many directions (Is page creation an edit? How to treat redirects? ...).
Why would page creation not be an edit?
I'd hope that page creation is considered an edit. I really do. It seems the natural choice. But from reading only the definition, I do not know.
The definition only mentions “updates”. And to me, creation is not an update.
And MediaWiki itself distinguishes them as well:
  define( 'RC_EDIT', 0 );
  define( 'RC_NEW', 1 );
But as said above, I am not trying to start the “What's an edit”-discussion right now.
Best regards, Christian
Erik’s manually compiled list of bots (data/erikZ.bots)
My list is not really manual, except for a few names.
Wikistats has several criteria for bot detection:
1) Is the bot flag set in the user group table?
2) Does it sound like a bot? (adds hundreds of names)
(nowadays only bot names are allowed to sound like a bot)
To be precise: does bot occur at end of name or before non alpha char?
3) Is it known to be an unregistered bot? (Wikipedia has a list of false negatives at http://en.wikipedia.org/wiki/Wikipedia:List_of_Wikipedians_by_number_of_edits/Unflagged_bots )
4) If a name is flagged as a bot on at least 10 wikis, then treat it as a bot on any wiki within the project
5) Three names that sound like bot are hard coded exemptions (people who wrote about it)
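The criteria above could be combined roughly as in the following sketch. It is not the wikistats code: the regular expression is one reading of “bot at end of name or before non alpha char”, all function and parameter names are hypothetical, and letting the exemption list override only the name test is an assumption.

```python
import re

# "bot" at the end of the name, or followed by a non-letter character.
BOT_NAME = re.compile(r"bot(?![^\W\d_])", re.IGNORECASE)


def sounds_like_bot(name: str) -> bool:
    return BOT_NAME.search(name) is not None


def is_bot(name, flagged=frozenset(), known_unflagged=frozenset(),
           exempt=frozenset(), wikis_flagged_on=0):
    """Combine the five criteria; all arguments are hypothetical inputs."""
    return (
        name in flagged                                      # criterion 1
        or (sounds_like_bot(name) and name not in exempt)    # criteria 2 and 5
        or name in known_unflagged                           # criterion 3
        or wikis_flagged_on >= 10                            # criterion 4
    )


for candidate in ["ExampleBot", "ExampleBot 2", "Botanist", "Abbott"]:
    print(candidate, sounds_like_bot(candidate))
```

With this pattern a name such as “Botanist” is not treated as a bot, because the letters following “bot” rule out the match.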
Erik