[Foundation-l] Analysis of statistics

Milos Rancic millosh at gmail.com
Sat Jul 25 06:04:30 UTC 2009


On Fri, Jul 24, 2009 at 5:55 PM, Felipe Ortega<glimmer_phoenix at yahoo.es> wrote:
> You can check more precise figures and graphs in my thesis about general statistics for survivability for all logged editors and core editors (the top 10% most active editors in each month), from the beginning until Dec. 2007, in the top-ten language versions (at that time).
>
> http://libresoft.es/Members/jfelipe/phd-thesis (page)
> http://libresoft.es/Members/jfelipe/thesis-wkp-quantanalysis (doc)
>
> As for the percentages of users by age, education level, etc. my impression is that opinions from experienced community members are often well oriented. But they're only opinions. Until we get the results of the general survey, we won't have a clear picture of the current "recruitment" targets for all versions.
>
> Nevertheless, according to our updates, it seems that the situation is not getting better from Jan 2008 onwards.

Great work, Felipe! I've seen your work mentioned, but until now I
hadn't read it. I have now looked through the highlights of your
thesis, and they are very informative. I am quoting some of the
conclusions here:

Q5: What is the average lifetime of Wikipedia volunteer authors in the
project?: The main conclusion we can infer from our survival analysis
performed on the community of authors in the top ten Wikipedias is
that there is an extraordinarily high mortality rate in all languages.
Actually, we show that the monthly number of deaths of logged authors
in the top ten language versions surpassed the monthly number of new
logged authors coming to contribute for the first time in a certain
version. Therefore, the higher mortality rate, since the beginning of
2007, offers a possible explanation for the steady-state reached by
the monthly number of contributions and monthly number of active pages
in all versions during the same period. A significant proportion of
authors (more than 50% in all versions) abandon the project after
more than 200 days. Moreover, reaching the core group of very active
authors does not ensure that those authors will exhibit better
survivability since, in fact, more than 50% of them abandon that core
of very active authors after less than 100 days (less than 30 in the
case of the Portuguese and English Wikipedias). Complementing these
findings, the application of the Cox proportional hazards model lets us
demonstrate that the participation of logged authors in FAs or talk
pages significantly improves the survivability of such contributors,
and that contributing to both key types of pages yields the largest
improvement in the average lifetime of authors.
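
A minimal sketch of how a median author lifetime like the figures quoted
above could be estimated, assuming a plain Kaplan-Meier estimator. This is
not the thesis code: the function names, the sample durations (days between
an author's first and last edit) and the "observed" flags (False for authors
still active when the data ends) are all hypothetical.

```python
from collections import Counter


def kaplan_meier(durations, observed):
    """Return [(t, S(t))] for each time t with at least one observed death.

    durations: days each author stayed in the project (hypothetical data).
    observed:  True if the author left, False if still active (censored).
    """
    deaths = Counter(t for t, d in zip(durations, observed) if d)
    exits = Counter(durations)  # deaths plus censored authors at each time
    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(exits):
        if deaths[t]:
            # Multiply by the fraction of at-risk authors surviving past t.
            survival *= 1.0 - deaths[t] / at_risk
            curve.append((t, survival))
        at_risk -= exits[t]
    return curve


def median_lifetime(curve):
    """First time at which the survival estimate drops to 0.5 or below."""
    for t, s in curve:
        if s <= 0.5:
            return t
    return None  # fewer than half of the authors left within the data


# Made-up sample: seven authors, three of them still active (censored).
durations = [10, 30, 30, 90, 200, 400, 400]
observed = [True, True, True, True, False, False, False]
curve = kaplan_meier(durations, observed)
print(curve)
print(median_lifetime(curve))
```

The Cox proportional hazards model mentioned in the quote goes one step
further and relates such lifetimes to covariates (for example, whether an
author edited FAs or talk pages); that needs a statistics package and is
not sketched here.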

Q7: Is it possible to infer, based on previous history data, any
sustainability conditions affecting the top-ten Wikipedias in due
course?: As a main conclusion, looking at the evolution of the key
parameters already identified as relevant to explain the progress in
time of the top ten Wikipedias and their communities, we find that
those statistics describing the activity of logged authors tend to
follow Pareto-like distributions that become, in general, more and
more log-linear as time elapses. On the other hand, metrics describing
articles have progressively lost the old Pareto-like shape of their
distribution, reaching a lognormal shape during 2007 (probably, as a
result of the stabilization of the number of logged authors in all
versions, as well). The analysis of the evolution in time of
contributions from the core of very active authors identified in each
month of history of a certain language version reveals that former
core authors do not provide a comparable amount of effort to the
level offered by new, even more active members of the core.
Nevertheless, again the evolution parameters point out a somewhat
delicate situation, since the monthly inequality level of the
contributions from logged authors still maintains the same values as in
previous years. Thus, this indicates that either the inequality of the
distribution of revisions maintains the present level (in which case
the authors would not be able to address as many articles as in
previous years) or else, that the inequality level of this
distribution will continue to grow, until core authors begin to find
their natural limit in the maximum number of revisions performed and
number of different articles reviewed.

5.1.2 Sustainability conditions

The main conclusion that we can infer from the overall results of our
quantitative analysis is that there exists a severe risk that the
top-ten language versions of Wikipedia will not maintain their
current activity level in due course. According to our graphs and
numbers, the inequality level of the contributions from logged authors
is becoming more and more biased towards the core of very active
authors. At the same time, the monthly Gini coefficients show that the
inequality level of contributions from logged authors has remained
stable over time, at the cost of demanding more and more contributions
from active authors to alleviate this deficit of monthly revisions.
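
For readers unfamiliar with the Gini coefficient used above as the
inequality measure, here is a minimal sketch of how it can be computed for
one month of per-author revision counts. It assumes the standard
sorted-rank formula; the function name and the sample counts are
hypothetical and are not taken from the thesis.

```python
def gini(values):
    """Gini coefficient of non-negative counts.

    0.0 means every author contributed equally; values near 1.0 mean
    almost all contributions come from a few authors.
    """
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # Rank-weighted sum: G = 2 * sum(i * x_i) / (n * sum(x)) - (n + 1) / n
    weighted = sum(i * x for i, x in enumerate(xs, start=1))
    return (2.0 * weighted) / (n * total) - (n + 1.0) / n


# Made-up revision counts for one month, one entry per logged author.
equal_month = [5, 5, 5, 5]
skewed_month = [0, 0, 0, 10]
print(gini(equal_month))
print(gini(skewed_month))
```

Computing this value month by month would give a time series of the kind
the quoted paragraph describes as having remained stable.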

Furthermore, we have seen that the distribution of the total number of
revisions per author follows an upper truncated Pareto distribution.
As more core authors begin to reach the upper limit of their human
contribution capacity, we will see a point in the future of these
language versions in which the steady-state of the monthly Gini
coefficient will start to decrease. This situation would not pose a
problem in itself, except for the fact that we have demonstrated that
the most significant part of the content creation effort in Wikipedia
is not undertaken by casual, passing-by authors, but by members of the
core of very active contributors.

On top of that, the lack of new core members seriously threatens the
scalability of the top-ten language versions regarding the quality of
their content. We have demonstrated in the analysis previously
presented that the eldest, top-active contributors are responsible for
the majority of revisions in FAs, as well. Since the number of core
authors has reached a steady-state (due to the levelling off of the
total number of active authors per month), the group of authors providing
the primary source of effort in the revision of quality articles has
stalled. Without new core members, the number of different articles
that could potentially become FAs cannot expand, since we do not have
enough reviewers for that content. Since the total number of quality
articles generated so far in the top-ten language editions is fairly
low, we can conclude that this approach will not help to
accelerate the creation of quality content in Wikipedia in due course.
It is true that Wikipedia has succeeded in competing with other
traditional encyclopaedias, namely Britannica [44], but if we do not
have a clear strategy for making the creation of quality content in
Wikipedia more agile, the project will never evolve from its
current character of “good starting point to look for a quick
introduction to a new topic, from which we can jump to more serious
information sources”.

To conclude this section, it would be disappointing to avoid offering
some insights about possible solutions for the top-ten Wikipedias to
improve their current trend. Nevertheless, some of the knowledge
needed to formulate such recommendations could well be a matter
for a doctoral thesis of its own, namely the causes driving Wikipedia
authors to eventually join the core of very active users. Since we
have not answered such questions, we can simply settle for enumerating
direct countermeasures to alleviate these findings.

In the first place, increasing the number of core authors should
become a priority for the project, and as a first step, Wikipedia
should focus on increasing the number of monthly active authors. Indeed,
donation campaigns are necessary to aid in the financial support of
the project, but attracting new contributors or recovering older ones
should be an equally important goal, given the current situation.
Apparently, a lot of work still has to be done, not only to create new
articles, broadening Wikipedia coverage, but also revising current
articles to let them reach the FA distinction at some point. Whether
or not featuring some of these quality articles on the main page has
a direct influence on the number of revisions received, it is
undoubtedly true that content featured on the main page of every
language version at least obtains superior visibility in the
community. A good idea could then be to promote “candidate articles” on
the main page, thus favoring the reception of new revisions. Many times,
users do not know about the existence of articles until they are
featured on the main page, or else, until they need to access them
explicitly. In the same way, we recommend displaying a “randomly
selected” article (instead of the current approach of providing a
simple link), to try and increase the number of revisions received by
standard articles, as well.

Since the importance of the core of very active members has been
demonstrated, thinking about possible tools to further automate their
daily tasks, thus facilitating their normal activities, should also be
taken into account. We know about current useful tools made with this
goal in mind, but perhaps trying to collect new ideas and
suggestions from these users could be another option. Since Wikipedia
is an open community, it would be quite difficult to further reduce
vandalism, and the access of trolls and other undesirable contributors
to articles and talk pages. Moreover, previous research has
demonstrated that these acts of vandalism against content or the
community itself have been effectively controlled with the current
approaches.

Finally, we cannot ignore the potential benefits of large scale
contributions coming from specific communities, especially from
educational institutions at all levels. The potential applications of
Wikipedia to learning environments have also been a matter of research,
and some authors have shown that direct contribution approaches may
have negative consequences for both the quality of content and the
willingness of young authors to continue to contribute if they get
strictly negative responses to their first revisions. All the same,
semi-controlled strategies like providing a final version of the
contribution, may have better effects for both the quality of content
and maintaining the involvement of young contributors. In this regard,
providing special tools for highlighting these contributions could
facilitate the work of experienced Wikipedia authors, who could then
provide more focused comments.


