How come the historical statistics vary over time? Last month I looked at the statistics for Swedish Wikipedia (http://stats.wikimedia.org/EN/TablesWikipediaSV.htm) and it had 132 new Wikipedians in December 2012 and 142 for November. When the January stats came out, they said 139 in December and 144 in November. How is that possible?
Two screendumps can be found on these links: https://dl.dropbox.com/u/8363895/WPSV%20Dec%202012.png https://dl.dropbox.com/u/8363895/WPSV%20Jan%202013.png
And while I am asking, is this table available in .csv or .ods so we can play around with it easily?

*Jan Ainali*
Wikimedia Sverige http://se.wikimedia.org/wiki/Huvudsida 076-2122776
On 17 March 2013 13:55, Jan Ainali jan.ainali@wikimedia.se wrote:
How come the historical statistics vary over time? Last month I looked at the statistics for Swedish Wikipedia and it had 132 new Wikipedians in December 2012 and 142 for November. When the January stats came now it says it was 139 in December and 144 in November. How is that possible?
"New Wikipedians" counts people who have edited at least ten times since registering an account; however, they don't have to make all ten edits in that month.
http://stats.wikimedia.org/EN/TablesWikipediansNew.htm
So if someone registers in December and makes two edits, they won't show up on that list; if they come back in January and make eight more, then when the stats are recalculated they'll show up as a new user for December.
-- - Andrew Gray andrew.gray@dunelm.org.uk
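Andrew's recount mechanism can be sketched in a few lines of Python. The data layout (registration date plus a running edit total per user) is invented for illustration; the real Wikistats scripts work from the database dumps:

```python
from collections import Counter
from datetime import date

def new_wikipedians_per_month(users):
    """Count 'new Wikipedians' per month as described above: a user
    counts once their lifetime edit total reaches ten, and is attributed
    to the month they registered. Recounting a later snapshot can
    therefore retroactively change past months."""
    counts = Counter()
    for registered, total_edits in users:
        if total_edits >= 10:
            counts[(registered.year, registered.month)] += 1
    return counts

# Snapshot taken 1 January: one December registrant has only 2 edits,
# so only the other one counts for December.
snapshot_jan1 = [(date(2012, 12, 5), 2), (date(2012, 12, 9), 12)]
# Snapshot taken 1 February: the first user made 8 more edits in
# January, so the recalculated December figure grows.
snapshot_feb1 = [(date(2012, 12, 5), 10), (date(2012, 12, 9), 12)]

print(new_wikipedians_per_month(snapshot_jan1)[(2012, 12)])  # 1
print(new_wikipedians_per_month(snapshot_feb1)[(2012, 12)])  # 2
```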
Well, that makes sense when there is an increase. But looking at my screenshot from December I can see that in October 2012 there were 152 new Wikipedians. One month later they have decreased to 151. Can that be due to deleted articles that "send them below the bar" again?
*Best regards, Jan Ainali*
Wikimedia Sverige http://se.wikimedia.org/wiki/Huvudsida 076-2122776
Yes, that can be due to deleted articles: the stats are calculated from the current dumps, which do not include them.
Yes, probably deletions. The rationale for "rewriting history", last time I asked Erik, is that if the pages got deleted they arguably should never have been created.
Andrew Gray, 17/03/2013 15:52:
So if someone registers in December and makes two edits, they won't show up on that list; if they come back in January and make eight more, then when the stats are recalculated they'll show up as a new user for December.
In this case, you should file a bug, because they're supposed to show up as new users in January: https://www.mediawiki.org/wiki/Analytics/Metric_definitions#Contributor (disclaimer: I added that line myself).
Nemo
Thanks, then I understand. Regarding my second question, is the data available in any easier format than digging in the dump myself or trying to convert that html file?
*Best regards, Jan Ainali*
Wikimedia Sverige http://se.wikimedia.org/wiki/Huvudsida 076-2122776
Jan Ainali, 17/03/2013 16:59:
Thanks, then I understand. Regarding my second question, is the data available in any easier format than digging in the dump myself or trying to convert that html file?
I didn't answer because I can't help much here, but the documentation is here: https://meta.wikimedia.org/wiki/Wikistats_csv It's very sparse; please, everyone, add every little bit of knowledge you have on the topic (absolute zero for me).
Nemo
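For anyone following up on the CSV route, parsing such a file needs nothing beyond the standard library. The column layout below (wiki code, month, count) is a guess for illustration, not the documented Wikistats format; the figures are the ones quoted earlier in the thread:

```python
import csv
import io

# Hypothetical excerpt in the general shape of a Wikistats CSV;
# the real (sparsely documented) layout is described at
# https://meta.wikimedia.org/wiki/Wikistats_csv
sample = """sv,2012-11,144
sv,2012-12,139
sv,2013-01,131
"""

# csv.reader yields lists of strings; convert the count to int.
rows = [(wiki, month, int(value))
        for wiki, month, value in csv.reader(io.StringIO(sample))]

# Index the Swedish Wikipedia figures by month for easy lookup.
december = {month: value for wiki, month, value in rows if wiki == "sv"}
print(december["2012-12"])  # 139
```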
That is exactly what I was looking for, thanks!
Except, the link for the Wikipedia csv does not work. That has also been noted on the discussion page on 3 March 2012.
*Jan Ainali*
Wikimedia Sverige http://se.wikimedia.org/wiki/Huvudsida 076-2122776
Jan,
I fixed the external links.
BTW, the page is massively incomplete. Many files are not yet documented. I just added a section about Editor Counts.
For general background see also http://www.mediawiki.org/wiki/Analytics/Wikistats
Erik Zachte
TL;DR: static monthly counts would be costly to implement and would cure the lesser problem, while possibly making trends harder to assess.
By coincidence this discussion is very well timed from my perspective. I was going to reflect on this issue of 'rewriting history' today or tomorrow anyway, as Erik Moeller asked me to reconsider the current setup. I see complications that I'd like to put forward here, so we can fully appreciate the consequences.
A) Expectations differ depending on context
First general remarks:
It is indeed somewhat confusing to see numbers for any given month change in subsequent publications. We are not used to that in the daily news. We don't expect to read in tomorrow's newspaper: "The inflation rate for 2011 has increased by 0.2%." We expect these numbers to be established once and for all.
On the other hand, in science incremental insights are totally the norm. Over the decades we have seen the age of the earth change by substantial amounts. We wouldn't be that amazed to read that the average yearly temperature of the last ice age has been reassessed again. Many scientists would love to go back in time and redo centuries-old climatic measurements with today's super accurate equipment.
B) Quantifying impact of current approach
As for the scale of the readjustments, it is fairly modest, I would say.
Typically active editor counts drop by up to 2% in the subsequent month (when they are the second-newest figure in the list). Then a second adjustment of around 0.5% follows, and from then on the numbers are pretty stable. This is my general impression; I haven't researched it in depth, and it might vary per wiki. I assume this pattern will be less consistent on small wikis with intermittent bursts of (delete) activity.
Incidentally the phenomenon has been dubbed 'deletion drift', but I think that overemphasizes its impact. Drift suggests continuous change (like in continental drift).
C) Practical complications could outweigh benefits
If we freeze the first assessment for any month and make that canonical I see several practical complications:
C1) Our environment is not static, definitions and conventions change.
C1a) From time to time users of some wiki decide to declare an extra namespace countable, long after that namespace was first established. [1]
C1b) Definition of what constitutes an article can even change more profoundly:
Long ago some large Wiktionaries (or all?) decided to abandon the requirement for an internal link. For some time dummy links were added for Wikistats, but those have since been removed. I learnt about this only much later (and again Wikistats is behind in accommodating this change).
Both these changes can have a more substantial effect on trends than the 'deletion drift'. Either we would have to rule out any definition changes, or live with the fact that trend lines can contain substantial discontinuities, which would all have to be annotated, making long term trending really hard to do.
C2) We might want to rerun the counts anyway because a script bug has been found and fixed
To undo only the effects of the buggy algorithm we would need to reprocess all data exactly as we did before.
With the current dump scheme that would be impractical, to say the least:
- Dumps are only retained for one or two years
- We would have to rerun each dump and only process the last month of data from each dump
Unless ... we build a new dump file format which only grows every month and from which data are never removed. That would have to be a private dump (some articles/revisions are deleted from today's dumps for privacy reasons). While doable, that would be a large investment to cater for a relatively minor issue.
C3) A third complication is that article counts now depend on which dump was processed.
For Wikipedias we process stub dumps (full archive dumps take too long). For smaller projects we may do full archive dumps. In stub dumps there is no raw article text, so we can't assess the existence of an internal link.
Regardless of C1) and C2), it would be great if we could add the missing meta info to stub dumps, so we can forget about full archive dumps altogether in a Wikistats context and be more consistent. See https://bugzilla.wikimedia.org/show_bug.cgi?id=42318, comment 5.
[1] BTW Wikistats is still behind on including some namespaces in article counts. Technically the scripts are ready to automate this fully, but I haven't put it up for decision yet.
Erik Zachte
Erik Zachte, 17/03/2013 20:15:
TL;DR static monthly counts will be costly to implement and cure the lesser problem, while possibly making trends harder to assess
Thanks Erik for this lovely email. :) Erm, dumb question: if the use-case is "journalist is confused by current stats not matching previous claims", can't we just set up archives of the HTML reports?
[...] Regardless from C1) and C2) it would be great if we could add missing meta info to stub dumps, so we can forget about full archive dumps all-together in a wikistats context, and be more consistent . See https://bugzilla.wikimedia.org/show_bug.cgi?id=42318 comment 5
+1 (shameless plug for own bug).
[1] BTW Wikistats is still behind on including some namespaces into article counts. Technically the scripts are ready to automate this fully, but I haven't put it up for decision yet.
Nemo
Erm, dumb question: if the use-case is "journalist is confused by current stats not matching previous claims", can't we just set up archives of the HTML reports?
Wouldn't that add to the confusion? I'd prefer a simple explanation in the introduction.
Also 800 wikis x 27 languages x so many static tables is already 20 GB per month.
Erik Zachte
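Erik's 20 GB/month figure is easy to sanity-check as a back-of-envelope product; the roughly 1 MB of static tables per report is an assumed average, not a number from the thread:

```python
# 800 wikis, each with reports rendered in 27 languages, at an assumed
# ~1 MB of static HTML tables per report.
wikis, languages, mb_per_report = 800, 27, 1

total_gb = wikis * languages * mb_per_report / 1024
print(round(total_gb, 1))  # 21.1 -- in the ballpark of the quoted 20 GB
```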
Nemo, after rethinking, I probably misunderstood your suggestion.
You probably mean one extra set of HTML files which is rebuilt every month with one row added and existing rows unchanged, rather than a full set of HTML files published every month with all those monthly editions kept online.
I'm still not sure whether two sets of data wouldn't add to the confusion, but disk size would not be an issue really.
Erik Zachte
Erik Zachte, 17/03/2013 21:26:
Nemo, after rethinking, I probably misunderstood your suggestion.
You probably mean one extra set of html files which is rebuilt every month with one row added and existing rows unchanged. Rather than a full set of html files published every month and all those monthly editions fully kept online.
No, your first interpretation was correct. I don't know what would be confusing, and I certainly hadn't imagined the 20 GB figure, which however should not be the worst of the problems (there's disk space available for some dozen years of that): I was just trying to imagine the most sensible/efficient solution (in terms of coding and thinking required) for the problem. :)
I'm still not sure whether two sets of data wouldn't add to the confusion, but disk size would not be an issue really.
Yes, if you go for this new system it would perhaps be worth also providing the "old way" report somewhere, unless it complicates the code or processing too much.
Nemo
On 03/17/2013 08:15 PM, Erik Zachte wrote:
C1b) Definition of what constitutes an article can even change more profoundly:
Recently the Swedish Wikipedia started to add bot-created articles on a large scale, as has previously been done on the Dutch and some other Wikipedias. These articles are not bad; they cite sources and are accurate, so they should be counted among the existing articles. But they are not very popular, since they cover obscure topics.
This leads to the idea that perhaps we should count articles that are actually read. It's easy to identify articles that are very short or don't cite sources, but in order to count articles that aren't read, we need to be sure that robots of all kinds are excluded.
In excluding robot accesses from the visitor statistics, it's also relevant to ask whether accesses from editors should be counted. If I'm a steam engine enthusiast and write articles about every engineer and railroad, maybe I'm the only audience for those articles. When I want to know if my articles have any readership, I don't want to include myself in the audience count. If I'm only writing for my own reading, then I don't really need Wikipedia; the usefulness of Wikipedia starts when the second human reader turns up.
Are there any ideas or strategies for a good audience count?
If, instead of page views, we were to count the number of different IP addresses, then each bot or editor would just count as one identity, and this would reduce their impact.
If we can define a good measurement for audience, then it would start a new statistics series and we would not have any problems with mismatch with any previous data.
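Lars's distinct-IP idea amounts to replacing a raw view counter with a set of addresses per article, so a bot or the article's own author contributes at most one "reader" each. A minimal sketch, with an invented log format:

```python
from collections import defaultdict

# Hypothetical access log: (article title, client IP address).
hits = [
    ("Steam_engine", "198.51.100.7"),
    ("Steam_engine", "198.51.100.7"),   # repeat visit: still one reader
    ("Steam_engine", "203.0.113.42"),
    ("Obscure_railroad", "198.51.100.7"),
]

# Collect the distinct IPs seen per article; a set deduplicates
# repeat visits from the same address automatically.
readers = defaultdict(set)
for article, ip in hits:
    readers[article].add(ip)

audience = {article: len(ips) for article, ips in readers.items()}
print(audience["Steam_engine"])      # 2
print(audience["Obscure_railroad"])  # 1
```

Note that shared or dynamic IP addresses cut both ways here: one address can hide several readers, and one reader can appear under several addresses, so this is an estimate rather than a true audience count.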
Lars Aronsson, 18/03/2013 10:38:
Are there any ideas or strategies for a good audience count?
Erik proposed some sort of successor for https://meta.wikimedia.org/wiki/Wikipedia_article_depth : http://infodisiac.com/blog/2013/02/which-single-wikimedia-metric-would-inspi... Now that interwiki bots are on Wikidata only, for your "audience" question we could maybe count how many articles have had an editor other than the creator. However, changing the article count definition would require a broad discussion, and the main problem is making the count comparable between languages and, if possible, projects. The local decision of what's a countable namespace is the tool we currently use (but Wikistats doesn't compute it).
Nemo