Pursuant to prior discussions about the need for a research
policy on Wikipedia, WikiProject Research is drafting a
policy regarding the recruitment of Wikipedia users to
participate in studies.
At this time, we have a proposed policy, and an accompanying
group that would facilitate recruitment of subjects in much
the same way that the Bot Approvals Group approves bots.
The policy proposal can be found at:
http://en.wikipedia.org/wiki/Wikipedia:Research
The Subject Recruitment Approvals Group mentioned in the proposal
is being described at:
http://en.wikipedia.org/wiki/Wikipedia:Subject_Recruitment_Approvals_Group
Before we move forward with seeking approval from the Wikipedia
community, we would like additional input about the proposal,
and would welcome additional help improving it.
Also, please consider participating in WikiProject Research at:
http://en.wikipedia.org/wiki/Wikipedia:WikiProject_Research
--
Bryan Song
GroupLens Research
University of Minnesota
We’re glad to announce the release of an aggregate clickstream dataset extracted from English Wikipedia
http://dx.doi.org/10.6084/m9.figshare.1305770
This dataset contains counts of (referer, article) pairs aggregated from the HTTP request logs of English Wikipedia. This snapshot captures 22 million (referer, article) pairs from a total of 4 billion requests collected during the month of January 2015.
This data can be used for various purposes:
• determining the most frequent links people click on for a given article
• determining the most common links people followed to an article
• determining how much of the total traffic to an article clicked on a link in that article
• generating a Markov chain over English Wikipedia
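For the first use case, here is a minimal sketch, assuming the dump is a
tab-separated file with referer, article, and count columns; the column
names and file name below are assumptions, so check the figshare page for
the actual schema:

import csv
from collections import Counter

def top_referers(path, article, k=10):
    # Sum request counts per referer for the given article title.
    # Column names ('prev', 'curr', 'n') are assumptions; adjust to the
    # actual header of the released file.
    counts = Counter()
    with open(path) as f:
        for row in csv.DictReader(f, delimiter="\t"):
            if row["curr"] == article:
                counts[row["prev"]] += int(row["n"])
    return counts.most_common(k)

print(top_referers("2015_01_clickstream.tsv", "London"))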
We created a page on Meta for feedback and discussion about this release: https://meta.wikimedia.org/wiki/Research_talk:Wikipedia_clickstream
Ellery and Dario
Cross-posting this request to wiki-research-l. Anyone have data on
frequently used section titles in articles (any language), or know of
datasets/publications that examined this?
I'm not aware of any off the top of my head, Amir.
- Jonathan
---------- Forwarded message ----------
From: Amir E. Aharoni <amir.aharoni(a)mail.huji.ac.il>
Date: Sat, Jul 11, 2015 at 3:29 AM
Subject: [Wikitech-l] statistics about frequent section titles
To: Wikimedia developers <wikitech-l(a)lists.wikimedia.org>
Hi,
Did anybody ever try to collect statistics about frequent section titles in
Wikimedia projects?
For Wikipedia, for example, titles such as "Biography", "Early life",
"Bibliography", "External links", "References", "History", etc., appear in
a lot of articles, and their counterparts appear in a lot of languages.
There are probably similar things in Wikivoyage, Wiktionary and possibly
other projects.
Did anybody ever try to collect statistics of the most frequent section
titles in each language and project?
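For illustration, a rough sketch of how such counts could be collected from
a pages-articles XML dump (the file name is an assumption, and a real run
should use a proper dump parser rather than this line-by-line scan):

import bz2
import re
from collections import Counter

# Level-2 wikitext headings look like "== Heading ==" on their own line.
HEADING_RE = re.compile(r"^==\s*([^=].*?)\s*==\s*$")

def count_section_titles(dump_path):
    counts = Counter()
    with bz2.open(dump_path, "rt", encoding="utf-8", errors="replace") as f:
        for line in f:
            match = HEADING_RE.match(line.strip())
            if match:
                counts[match.group(1)] += 1
    return counts

counts = count_section_titles("enwiki-latest-pages-articles.xml.bz2")
print(counts.most_common(20))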
--
Amir Elisha Aharoni · אָמִיר אֱלִישָׁע אַהֲרוֹנִי
http://aharoni.wordpress.com
“We're living in pieces,
I want to live in peace.” – T. Moore
--
Jonathan T. Morgan
Senior Design Researcher
Wikimedia Foundation
User:Jmorgan (WMF) <https://meta.wikimedia.org/wiki/User:Jmorgan_(WMF)>
Hi All,
I've been working on graphs to visualize the entire edit activity of a wiki
for some time now. I'm documenting all of it at
https://meta.wikimedia.org/wiki/Research:Editor_Behaviour_Analysis_%26_Grap…
The graphs can be viewed at
https://cosmiclattes.github.io/wikigraphs/data/wikis.html. Currently only
graphs for 'en' have been put up; I'll add graphs for the other wikis soon.
Methodology
- The editors are split into groups based on the month in which they
made their first edit.
- The active edit sessions (as values or percentages) for each group are
  then plotted as stacked bars or as a matrix. I've used the canonical
  definition of an active edit session; the values are within ±0.1% of the
  values on https://stats.wikimedia.org/ (a rough sketch of the cohort
  grouping is given after this list).
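A simplified sketch of the cohort grouping (not the actual code behind the
graphs; the column names and the 5-edits-per-month "active" threshold are
assumptions on my part):

import pandas as pd

def cohort_activity(edits, min_edits=5):
    # edits: DataFrame with 'user' and 'timestamp' (datetime) columns.
    edits = edits.copy()
    edits["month"] = edits["timestamp"].dt.to_period("M")
    # Cohort = month of the editor's first edit.
    cohort = edits.groupby("user")["month"].min().rename("cohort")
    edits = edits.join(cohort, on="user")
    # An editor counts as active in a month with at least min_edits edits.
    per_month = edits.groupby(["cohort", "month", "user"]).size()
    active = per_month[per_month >= min_edits].reset_index()
    # Rows: cohort, columns: month, values: number of active editors.
    return (active.groupby(["cohort", "month"])["user"]
            .nunique().unstack(fill_value=0))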
Selector
- There is a selector on each graph that lets you filter the data in the
  graph. On moving the cursor to the left end of the selector you will get a
  resize cursor; the selection can then be moved or redrawn.
- In graphs 1,2 the selector filters by percentage.
- In graphs 3,4,5 the selector filters by the age of the cohort.
Preliminary Finding
- Longevity of editors fell drastically starting in January 2006 and has
  since stabilized at the levels seen from January 2007.
https://meta.wikimedia.org/wiki/Research:Editor_Behaviour_Analysis_%26_Grap…
I would love to hear what you think of the graphs and any ideas you may have
for me.
Jeph
Hi,
With 8% more editors contributing over 100 edits in June 2015 than in June
2014 <https://stats.wikimedia.org/EN/TablesWikipediaEN.htm>, we have now
had six consecutive months where this particular metric of the core
community is looking positive. One or two months could easily be a
statistical blip, especially when you compare calender months that may have
5 weekends in one year and four the next. But 6 months in a row does begin
to look like a change in pattern.
As far as caveats go, I'm aware of several reasons why raw edit count is a
suspect measure, but I'm not aware of anything that has come in this year
that would have artificially inflated edit counts and brought more of the
under-100 editors into the >100 group.
I know there was a recent speedup, which should increase subsequent edit
rates, and one of the edit filters got disabled in June, but neither of
those should be relevant to the Jan-May period.
Would anyone on this list be aware of something else that could have thrown
off that statistic?
Otherwise I'm considering submitting something to the Signpost.
Regards
Jonathan
Asaf Bartov has announced the WMF initiative: Community Capacity Development
on the Wikimedia-l mailing list. The thread starts here:
https://lists.wikimedia.org/pipermail/wikimedia-l/2015-August/078954.html
The initiative can be found here:
https://meta.wikimedia.org/wiki/Community_Capacity_Development
As experimentation is mentioned in parts of this, I have asked a question
about the freedom to experiment, the engineering resources to support
experiments, and (possibly involving some of you) the support of WMF for the
qualitative and quantitative collection and analysis of data arising from
such experiments.
We speculate a lot on this list about what might make a difference, but
generally that's all we can do, as we have no way to test an idea. So I am
genuinely curious to know whether the resourcing is there to support
experimentation.
I also note that the on-wiki pages invite ideas. It occurred to me that
there might be scope for re-using ideas that have been put forward on this
list.
Kerry
Hoi,
There is a lot of knowledge on quality in online databases. It is known
that all of them have a certain error rate. This is true for Wikidata as
much as any other source.
My question is: is there a way to track Wikidata quality improvements over
time? I blogged about one approach [1]; it is, however, only an approach to
improving quality, not an approach to determining quality and tracking its
improvement.
The good news is that there are many dumps of Wikidata so it is possible to
compare current Wikidata with how it was in the past.
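As a purely hypothetical illustration of what such tracking could look like,
here is a sketch that computes one crude indicator, the share of statements
carrying at least one reference, from a Wikidata JSON dump; running it
against an old dump and a current one would give a single point of
comparison. The dump file names below are assumptions.

import gzip
import json

def referenced_statement_ratio(dump_path):
    # Share of statements that carry at least one reference, assuming the
    # standard JSON dump layout (one entity per line inside a JSON array).
    total = referenced = 0
    with gzip.open(dump_path, "rt", encoding="utf-8") as f:
        for line in f:
            line = line.strip().rstrip(",")
            if not line or line in ("[", "]"):
                continue
            entity = json.loads(line)
            for statements in entity.get("claims", {}).values():
                for statement in statements:
                    total += 1
                    if statement.get("references"):
                        referenced += 1
    return float(referenced) / total if total else 0.0

print(referenced_statement_ratio("wikidata-20150105.json.gz"))
print(referenced_statement_ratio("wikidata-20150803.json.gz"))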
Would this be something that makes sense for Wikimedia research to get into,
particularly in light of Wikidata becoming more easily available to
Wikipedia?
Thanks,
GerardM
[1]
http://ultimategerardm.blogspot.nl/2015/08/wikidata-quality-probability-and…
Dear All,
Yesterday I was at OpenSym (many thanks to Dirk for organizing it!), and I
was chatting with some people about attributing content to its authors in a
wiki.
So I got inspired, and I cleaned up some code that Michael Shavlovsky and I
had written for this:
https://github.com/lucadealfaro/authorship-tracking
The way to use it is super simple (see below). The attribution object can
also be serialized and de-serialized to/from json (see documentation on
github).
The idea behind the code is to attribute the content to the *earliest
revision* where the content was inserted, not the latest, as diff tools
usually do. So if some piece of text is inserted, then deleted, then
re-inserted (in a revert or a normal edit), we still attribute it to the
earliest revision. This is somewhat similar to what we tried to do in
WikiTrust, but it's better done, and far more efficient.
The algorithm details can be found in
http://www2013.wwwconference.org/proceedings/p343.pdf
I hope this might be of interest!
Luca
import authorship_attribution

# Create an attribution processor and feed it revisions in order,
# each as a list of tokens labelled with its revision id.
a = authorship_attribution.AuthorshipAttribution.new_attribution_processor(N=4)
a.add_revision("I like to eat pasta".split(), revision_info="rev0")
a.add_revision("I like to eat pasta with tomato sauce".split(), revision_info="rev1")
a.add_revision("I like to eat rice with tomato sauce".split(), revision_info="rev3")

# One attribution label per token of the latest revision:
print a.get_attribution()
# ['rev0', 'rev0', 'rev0', 'rev0', 'rev3', 'rev1', 'rev1', 'rev1']