Pursuant to prior discussions about the need for a research
policy on Wikipedia, WikiProject Research is drafting a
policy regarding the recruitment of Wikipedia users to
participate in studies.
At this time, we have a proposed policy, and an accompanying
group that would facilitate recruitment of subjects in much
the same way that the Bot Approvals Group approves bots.
The policy proposal can be found at:
The Subject Recruitment Approvals Group mentioned in the proposal
is being described at:
Before we move forward with seeking approval from the Wikipedia
community, we would like additional input about the proposal,
and would welcome additional help improving it.
Also, please consider participating in WikiProject Research at:
University of Minnesota
Cross-posting this request to wiki-research-l. Anyone have data on
frequently used section titles in articles (any language), or know of
datasets/publications that examined this?
I'm not aware of any off the top of my head, Amir.
---------- Forwarded message ----------
From: Amir E. Aharoni <amir.aharoni(a)mail.huji.ac.il>
Date: Sat, Jul 11, 2015 at 3:29 AM
Subject: [Wikitech-l] statistics about frequent section titles
To: Wikimedia developers <wikitech-l(a)lists.wikimedia.org>
Did anybody ever try to collect statistics about frequent section titles in
Wikimedia projects?
For Wikipedia, for example, titles such as "Biography", "Early life",
"Bibliography", "External links", "References", "History", etc., appear in
a lot of articles, and their counterparts appear in a lot of languages.
There are probably similar things in Wikivoyage, Wiktionary, and possibly
other projects.
Did anybody ever try to collect statistics of the most frequent section
titles in each language and project?
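No such dataset comes to mind, but heading counts are fairly easy to pull out of wikitext. A rough sketch (assuming the usual == Title == heading convention; not tied to any particular dump-parsing library):

```python
import re
from collections import Counter

# Wikitext headings look like "== Title ==", with the level given by
# the number of "=" characters (2 through 6).
HEADING_RE = re.compile(r"^(={2,6})\s*(.*?)\s*\1\s*$", re.MULTILINE)

def section_title_counts(pages):
    """Count how often each section title appears across wikitext pages."""
    counts = Counter()
    for text in pages:
        for _level, title in HEADING_RE.findall(text):
            counts[title] += 1
    return counts

# Toy input; a real run would iterate over a dump of one language/project.
pages = [
    "== Early life ==\ntext\n== References ==\ntext",
    "== History ==\ntext\n== References ==\ntext",
]
print(section_title_counts(pages).most_common(1))
# [('References', 2)]
```

Running this per language and per project would give exactly the per-wiki frequency tables asked about above.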
Amir Elisha Aharoni · אָמִיר אֱלִישָׁע אַהֲרוֹנִי
“We're living in pieces,
I want to live in peace.” – T. Moore
Jonathan T. Morgan
Senior Design Researcher
User:Jmorgan (WMF) <https://meta.wikimedia.org/wiki/User:Jmorgan_(WMF)>
I've been working for some time on graphs that visualize the entire edit
activity of a wiki. I'm documenting all of it at
The graphs can be viewed at
https://cosmiclattes.github.io/wikigraphs/data/wikis.html. Currently only
graphs for 'en' have been put up; I'll add graphs for the other wikis soon.
- The editors are split into groups based on the month in which they
made their first edit.
- The active edit sessions (as counts or percentages, etc.) for the groups
are then plotted as stacked bars or as a matrix. I've used the canonical
definition of an active edit session. The values are within ±0.1% of the
values on https://stats.wikimedia.org/
- There is a selector on each graph that lets you filter the data in the
graph. Moving the cursor to the left end of the selector turns it into a
resize cursor. The selection can then be moved or redrawn.
- In graphs 1,2 the selector filters by percentage.
- In graphs 3,4,5 the selector filters by the age of the cohort.
- Longevity of editors fell drastically starting in Jan 06 and has since
stabilized at the levels of Jan 07.
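To make the grouping concrete, here is a toy sketch of the cohort bucketing described above (hypothetical data; the real pipeline works from full edit histories):

```python
from collections import defaultdict

# Toy edit log: (editor_id, "YYYY-MM" month of the edit).
edits = [
    ("alice", "2006-01"), ("alice", "2006-03"),
    ("bob",   "2006-03"), ("bob",   "2006-04"),
    ("carol", "2006-03"),
]

# Cohort = month of the editor's first edit.
first_edit = {}
for editor, month in sorted(edits, key=lambda e: e[1]):
    first_edit.setdefault(editor, month)

# For each calendar month, collect the active editors per cohort --
# this is the per-cohort series that gets stacked in the bars.
activity = defaultdict(lambda: defaultdict(set))
for editor, month in edits:
    activity[month][first_edit[editor]].add(editor)

for month in sorted(activity):
    stacks = {cohort: len(editors) for cohort, editors in activity[month].items()}
    print(month, stacks)
```

Each printed row corresponds to one stacked bar: the month on the x-axis, and one stack segment per cohort.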
I would love to hear what you think of the graphs, and any ideas you might
have for me.
With 8% more editors contributing over 100 edits in June 2015 than in June
2014 <https://stats.wikimedia.org/EN/TablesWikipediaEN.htm>, we have now
had six consecutive months where this particular metric of the core
community is looking positive. One or two months could easily be a
statistical blip, especially when you compare calendar months that may have
five weekends in one year and four in the next. But six months in a row does begin
to look like a change in pattern.
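For what it's worth, the year-over-year comparison itself is just the following (toy counts, not the real figures from stats.wikimedia.org):

```python
# Number of editors with >100 edits in the two comparison months
# (illustrative numbers only).
june_2014 = 3000
june_2015 = 3240

# Year-over-year change in this metric, as a percentage.
change = (june_2015 - june_2014) / june_2014 * 100
print(f"{change:+.0f}%")  # +8%
```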
As far as caveats go, I'm aware of several reasons why raw edit count
is a suspect measure, but I'm not aware of anything that came in
this year that would have artificially inflated edit counts and moved
more of the under-100 editors into the >100 group.
I know there was a recent speedup, which should increase subsequent edit
rates, and one of the edit filters got disabled in June, but neither of
those should be relevant to the Jan-May period.
Would anyone on this list be aware of something that could otherwise have
thrown that statistic off?
Otherwise I'm considering submitting something to the Signpost.
Asaf Bartov has announced the WMF initiative: Community Capacity Development
on the Wikimedia-l mailing list. The thread starts here:
The initiative can be found here:
As experimentation is mentioned in parts of this, I have asked a question
about the freedom to experiment, the engineering resources to support
experiments, and (possibly involving some of you) the support of WMF for the
qualitative and quantitative collection and analysis of data arising from
such experiments.
We speculate a lot on this list about what might make a difference, but
generally that's all we can do, as we have no way to test an idea. So I am
genuinely curious to know whether the resourcing is there to support
experimentation.
I also note that the on-wiki pages invite ideas. It occurred to me that
there might be scope for re-using ideas that have been put forward on this
list.
There is a lot of knowledge on quality in online databases. It is known
that all of them have a certain error rate. This is true for Wikidata as
much as any other source.
My question is: is there a way to track Wikidata quality improvements over
time? I have blogged about one approach; however, it is an approach to
improve quality, not an approach to determine quality and track the
improvement of quality.
The good news is that there are many dumps of Wikidata so it is possible to
compare current Wikidata with how it was in the past.
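As a toy illustration of that dump-to-dump comparison, one could track a simple proxy metric such as the share of statements that carry a reference. The metric and the dump structure below are simplified assumptions, not the real Wikidata dump format:

```python
# Two toy "dumps": item id -> list of statements, where each statement
# records the references attached to it. Real Wikidata JSON dumps are far
# richer; this only sketches the comparison idea.
def referenced_share(dump):
    """Fraction of statements that cite at least one reference."""
    total = referenced = 0
    for statements in dump.values():
        for statement in statements:
            total += 1
            referenced += bool(statement.get("references"))
    return referenced / total if total else 0.0

old_dump = {"Q1": [{"references": []}, {"references": ["ref"]}]}
new_dump = {"Q1": [{"references": ["ref"]}, {"references": ["ref"]}],
            "Q2": [{"references": []}]}

print(f"{referenced_share(old_dump):.2f} -> {referenced_share(new_dump):.2f}")
# 0.50 -> 0.67
```

Computing such a metric over the sequence of historical dumps would give one quality curve over time.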
Would this be something that makes sense to get into for Wikimedia
research, particularly in light of Wikidata becoming more easily
available to Wikipedia?
I was at OpenSym yesterday (many thanks to Dirk for organizing this!), and
I was chatting with some people about attribution of content to its authors
in a wiki.
So I got inspired, and I cleaned up some code that Michael Shavlovsky and I
had written for this:
The way to use it is super simple (see below). The attribution object can
also be serialized and de-serialized to/from json (see documentation on
The idea behind the code is to attribute the content to the *earliest
revision* where the content was inserted, not the latest as diff tools
usually do. So if some piece of text is inserted, then deleted, then
re-inserted (in a revert or a normal edit), we still attribute it to the
earliest revision. This is somewhat similar to what we tried to do in
WikiTrust, but it's better done, and far more efficient.
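For intuition, here is a deliberately naive sketch of the earliest-insertion idea. This is not the actual algorithm, which matches tokens in context (N-grams) and is far more efficient; attributing each token in isolation, as below, mishandles repeated words:

```python
def attribute(revisions):
    """Attribute each token of the final revision to the earliest
    revision in which that token ever appeared."""
    earliest = {}  # token -> label of the revision where it first appeared
    for label, text in revisions:
        for token in text.split():
            earliest.setdefault(token, label)
    _last_label, last_text = revisions[-1]
    return [earliest[token] for token in last_text.split()]

revs = [
    ("rev0", "I like to eat pasta"),
    ("rev1", "I like to eat pasta with tomato sauce"),
    ("rev2", "I like to eat pasta"),                   # sauce deleted
    ("rev3", "I like to eat rice with tomato sauce"),  # sauce re-inserted
]
print(attribute(revs))
# ['rev0', 'rev0', 'rev0', 'rev0', 'rev3', 'rev1', 'rev1', 'rev1']
```

Note how "with tomato sauce" is credited to rev1, where it was first inserted, even though rev2 deleted it and rev3 re-inserted it.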
The algorithm details can be found in
I hope this might be of interest!
a = authorship_attribution.AuthorshipAttribution.new_attribution_processor(N=4)
a.add_revision("I like to eat pasta".split(), revision_info="rev0")
a.add_revision("I like to eat pasta with tomato sauce".split(),
               revision_info="rev1")
a.add_revision("I like to eat pasta".split(), revision_info="rev2")
a.add_revision("I like to eat rice with tomato sauce".split(),
               revision_info="rev3")
# Attribution of the tokens of the latest revision -- "with tomato sauce"
# is credited to rev1, where it was first inserted, even though rev2
# deleted it and rev3 re-inserted it:
['rev0', 'rev0', 'rev0', 'rev0', 'rev3', 'rev1', 'rev1', 'rev1']