TL;DR: static monthly counts would be costly to implement and would cure only the lesser problem, while possibly making trends harder to assess.
By coincidence this discussion is very well timed from my perspective. I was going to reflect on this issue of 'rewriting history' today or tomorrow anyway as Erik Moeller asked me to reconsider the current setup. I see complications that I'd like to put forward here, so we can fully appreciate the consequences.
A) Expectations differ depending on context
First, some general remarks:
It is indeed somewhat confusing to see numbers for any given month change in subsequent publications. We are not used to that in the daily news. We don't expect to read in tomorrow's newspaper: "The inflation rate for 2011 has increased by 0.2%." We expect these numbers to be established once and for all.
On the other hand, in science incremental insights are totally the norm. Over the decades we have seen the age of the earth change by substantial amounts. We wouldn't be that amazed to read that the average yearly temperature of the last ice age has been reassessed again. Many scientists would love to go back in time and redo centuries-old climatic measurements with today's super-accurate equipment.
B) Quantifying impact of current approach
As for the scale of the readjustments: it is fairly modest, I would say.
Typically active editor counts drop by up to 2% in the subsequent month (when they are the one-but-newest figure in the list). A second adjustment is around 0.5%, and from then on the numbers are pretty stable. This is my general impression; I haven't researched it in depth, and it might vary per wiki. I assume this pattern will be less consistent on small wikis with intermittent bursts of (delete) activity.
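To make the scale concrete, here is a minimal sketch (Python, with made-up numbers and a hypothetical data layout, not actual wikistats code) of what such an adjustment between successive report runs for the same month looks like:

    # Hypothetical snapshots: active editor count for one and the same month,
    # as reported in three successive wikistats runs (numbers illustrative only).
    snapshots = [
        ("first run",  10000),  # first assessment, right after the month closes
        ("second run",  9800),  # after more page deletions: roughly -2%
        ("third run",   9750),  # second, smaller adjustment: roughly -0.5%
    ]

    for (prev_run, prev_count), (run, count) in zip(snapshots, snapshots[1:]):
        change = (count - prev_count) / prev_count * 100
        print(f"{run}: {count} active editors ({change:+.1f}% vs {prev_run})")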
Incidentally, the phenomenon has been dubbed 'deletion drift', but I think that overemphasizes its impact. Drift suggests continuous change (as in continental drift).
C) Practical complications could outweigh benefits
If we freeze the first assessment for any month and make that canonical, I see several practical complications:
C1) Our environment is not static: definitions and conventions change.
C1a) From time to time users of some wiki decide to declare an extra namespace countable, long after that namespace was first established. [1]
C1b) The definition of what constitutes an article can change even more profoundly:
Long ago some large wiktionaries (or all?) decided to abandon the requirement for an internal link. For some time dummy links were added for wikistats, but those have been removed again. I learnt about this only much later (and again wikistats is behind in accommodating this change).
Both of these changes can have a more substantial effect on trends than the 'deletion drift'. Either we would have to rule out any definition changes, or live with the fact that trend lines can contain substantial discontinuities, which would all have to be annotated, making long-term trending really hard to do.
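To illustrate why such definition changes cut deeper than deletion drift, here is a simplified, purely hypothetical sketch (Python, not actual wikistats code) of an article countability check; the countable namespace set (C1a) and the internal-link requirement (C1b) are exactly the two knobs that can be turned retroactively:

    import re

    # Simplified, hypothetical countability check. Changing either knob
    # retroactively shifts the whole article count series, not just the
    # most recent month.
    INTERNAL_LINK = re.compile(r"\[\[[^\]]+\]\]")

    def is_countable(namespace, text, countable_namespaces, require_internal_link=True):
        if namespace not in countable_namespaces:
            return False
        if require_internal_link and not INTERNAL_LINK.search(text or ""):
            return False
        return True

    # The same page counts or not, depending on the convention in force:
    page_text = "A short definition without any internal links."
    print(is_countable(0, page_text, {0}, require_internal_link=True))   # False
    print(is_countable(0, page_text, {0}, require_internal_link=False))  # True

Note that the link requirement needs the raw article text, which is exactly what stub dumps don't carry (see C3 below).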
C2) We might want to rerun the counts anyway because a script bug has been found and fixed
To undo only the effects of the buggy algorithm, we would need to reprocess all data exactly as we did before.
With the current dump scheme that would be impractical, to say the least:
- Dumps are only retained for one or two years
- We would have to rerun each dump and only process the last month of data from each dump
Unless ... we build a new dump file format, one that only grows every month and from which data are never removed. That would have to be a private dump (some articles/revisions are deleted from today's dumps for privacy reasons). While doable, that would be a large investment to cater for a relatively minor issue.
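For concreteness, a rough sketch of what such an append-only scheme might look like (entirely hypothetical; no such format exists today): one immutable segment per wiki per month, written once and never rewritten, so that any month can later be reprocessed exactly as it was first seen:

    import os

    # Hypothetical append-only layout: one write-once segment per wiki per month.
    # Reprocessing month M later means re-reading segment M as originally written,
    # instead of re-deriving that month from a much newer (and changed) full dump.
    def segment_path(base_dir, wiki, month):
        return os.path.join(base_dir, wiki, month + ".xml.gz")

    def write_monthly_segment(base_dir, wiki, month, data):
        path = segment_path(base_dir, wiki, month)
        if os.path.exists(path):
            raise RuntimeError("segment %s already exists and must never be rewritten" % path)
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, "wb") as f:
            f.write(data)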
C3) A third complication is that article counts now depend on which dump was processed.
For wikipedias we process stub dumps (full archive dumps take too long). For smaller projects we may use full archive dumps. Stub dumps contain no raw article text, so we can't assess the existence of an internal link.
Regardless of C1) and C2), it would be great if we could add the missing meta info to stub dumps, so that we can forget about full archive dumps altogether in a wikistats context, and be more consistent. See https://bugzilla.wikimedia.org/show_bug.cgi?id=42318 comment 5
[1] BTW Wikistats is still behind on including some namespaces in article counts. Technically the scripts are ready to automate this fully, but I haven't put it up for decision yet.
Erik Zachte
-----Original Message-----
From: analytics-bounces@lists.wikimedia.org [mailto:analytics-bounces@lists.wikimedia.org] On Behalf Of Federico Leva (Nemo)
Sent: Sunday, March 17, 2013 5:17 PM
To: A mailing list for the Analytics Team at WMF and everybody who has an interest in Wikipedia and analytics.
Subject: Re: [Analytics] Stat variances over time
Jan Ainali, 17/03/2013 16:59:
Thanks, then I understand. Regarding my second question: is the data available in any easier format than digging in the dump myself or trying to convert that HTML file?
I didn't answer because I can't help much here, but the documentation is here: https://meta.wikimedia.org/wiki/Wikistats_csv It's very sparse; please, everyone, add every little bit of knowledge you have on the topic (absolute zero for me).
Nemo