We’re glad to announce the release of an aggregate clickstream dataset extracted from English Wikipedia:
http://dx.doi.org/10.6084/m9.figshare.1305770
This dataset contains counts of (referer, article) pairs aggregated from the HTTP request logs of English Wikipedia. This snapshot captures 22 million (referer, article) pairs from a total of 4 billion requests collected during the month of January 2015.
This data can be used for various purposes:
• determining the most frequent links people click on for a given article
• determining the most common links people followed to an article
• determining how much of the total traffic to an article clicked on a link in that article
• generating a Markov chain over English Wikipedia
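As a rough illustration of the first and last uses listed above, here is a minimal Python sketch. It assumes the data has already been read into (referer, article, count) tuples; the file's actual column names and layout are not described in this message, and the function names and sample rows below are made up for illustration.

```python
from collections import defaultdict

def outgoing_counts(rows):
    """Group click counts by referer: {referer: {article: count}}."""
    table = defaultdict(lambda: defaultdict(int))
    for referer, article, count in rows:
        table[referer][article] += count
    return table

def top_links(table, referer, k=3):
    """Most frequently followed links from `referer`, highest count first."""
    targets = table.get(referer, {})
    return sorted(targets.items(), key=lambda kv: (-kv[1], kv[0]))[:k]

def transition_probabilities(table):
    """Normalise each referer's counts into transition probabilities."""
    chain = {}
    for referer, targets in table.items():
        total = sum(targets.values())
        chain[referer] = {a: c / total for a, c in targets.items()}
    return chain

# Made-up rows for illustration only; not taken from the dataset.
rows = [("A", "B", 30), ("A", "C", 10), ("B", "C", 5)]
table = outgoing_counts(rows)
chain = transition_probabilities(table)
```

Note that a chain built this way only covers transitions present in the pairs; requests that end a session or leave the site are not modelled.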
We created a page on Meta for feedback and discussion about this release: https://meta.wikimedia.org/wiki/Research_talk:Wikipedia_clickstream
Ellery and Dario
Hi all,
Does anyone know if data about the gender of contributors on projects other
than English Wikipedia is available? In addition to other language
Wikipedias, it would be interesting to have data about Commons,
Wikisource, Wikivoyage, Wiktionary, Wikispecies, etc.
Also, anecdotally I observe a relatively high percentage of female
participation in education and GLAM activities. Do we have data about the
gender of participants in education and GLAM, particularly in leadership
roles?
Thanks,
Pine
Hi,
Bob West, Jure Leskovec, and I are organizing a workshop at ICWSM
focused on the challenges and opportunities of Wikipedia. You can find more
information about the workshop and the call for papers below.
Looking forward to seeing many of you in person at the workshop.
Best,
Leila
*Call for Workshop Papers*
Workshop on Wikipedia, a Social Pedia: Research Challenges and Opportunities
May 26, Oxford, England
co-located with the 9th International Conference on Weblogs and Social
Media (ICWSM 2015)
http://snap.stanford.edu/wiki-icwsm15/
Deadline for papers: Tuesday, March 24, 2015, 23:59 AoE
Wikipedia is one of the most popular sites on the Web, a main source of
knowledge for a large fraction of Internet users, and, in the light of its
collaborative nature, an inherently social medium. Therefore, and since not
only all content but also many activity logs are available to the public,
Wikipedia has become an important object of study for researchers across
many subfields of the computational and social sciences, such as
social-network analysis, social psychology, education, anthropology,
political science, human-computer interaction, cognitive science,
artificial intelligence, linguistics, and natural-language processing.
This workshop is a venue for all researchers exploring social aspects of
Wikipedia. The workshop will feature high-profile speakers from academia
and the Wikimedia Foundation and aims to create a forum where participants
can connect both among each other and with researchers at the Wikimedia
Foundation.
Topics of interest include, but are not limited to:
- Collaborative content creation
- Consensus-finding and conflict resolution on editorial issues
- Content consumption on Wikipedia
- Participation in discussions and their dynamics
- Collaborative task management
- Evolution of hierarchies
- Wikipedia as a sensor for real-world events, culture, etc.
- Demographics of Wikipedia readers and editors
- Engagement and incentivization of editors
We invite the submission of regular research papers (6–8 pages) as well as
position papers (2–4 pages). Authors whose papers are accepted to the
workshop will have the opportunity to participate in a poster session.
*Submission instructions*
Regular and position papers should be formatted according to AAAI
formatting guidelines (http://www.aaai.org/Publications/Author/author.php).
Please submit papers using EasyChair at
https://easychair.org/conferences/?conf=wikiicwsm2015
*Review and the archival of papers*
Authors will be notified of acceptance or rejection on or before Tuesday,
March 31, 2015.
The accepted papers will be published on the workshop webpage (unless the
authors object), and authors whose papers are accepted will have the
opportunity to participate in a poster session.
*Organizing committee*
Robert West, Stanford University
Jure Leskovec, Stanford University
Leila Zia, Wikimedia Foundation
On Mon, Feb 16, 2015 at 10:41 PM, koltzenburg(a)w4w.net <koltzenburg(a)w4w.net>
wrote:
> ____Aaron wrote:
> "higher quality survey data"
> well, and how does one recognize low quality and how come it is so low?
> and "quality" by whose epistemological aims and standards?
>
> "causes and mechanisms that drive the gender gap (and related
> participation gaps)"
> which "related participation gaps" do you have in mind here?
>
Jane's response was helpful and similar to mine.
Based on existing surveys, there are demographic and social categories of
people who are underrepresented among current editors. I don't have
specifics off the top of my head, but if you look at WMF survey results for
US editors and compare the findings to US census data (for example), you
can get an idea of some categories. Women are underrepresented to an
extreme degree, but they are not the only population that does not seem to
edit en:WP. I am less knowledgeable about other WPs, but I suspect there
are other inequalities and gaps on other wikis.
> where would these gaps be situated in terms of areas of participation?
>
See above.
> and, again, in which language version(s)?
>
See above.
On Mon, Feb 16, 2015 at 11:38 PM, Federico Leva (Nemo) <nemowiki(a)gmail.com>
wrote:
>
> Speaking of which, the WMF doesn't have resources to appropriately
> process the 2012 survey data, so results aren't available yet. Did you
> consider offering them to take care of it, at least for the gendergap
> number? You would then be able to publish an update.
>
> https://meta.wikimedia.org/wiki/Research_talk:Wikipedia_Editor_Survey_2012#…
As before, my understanding is that the method by which respondents were
selected to participate in the survey does not meet standard methods of
survey sampling (see this chunk of the description of the survey:
<https://meta.wikimedia.org/wiki/Research:Wikipedia_Editor_Survey_2012#When.…>).
As a result, I do not trust the results of
the 2012 survey to generate precise estimates of the gender gap or other
demographic details about participation. I've spoken to some very receptive
folks at the foundation about this and I hope that they/we will be able to
improve it in the future. I'm eager to help improve the survey data
collection procedures. Unfortunately, I do not have the capacity to analyze
the current survey data in greater depth.
The thing that allowed Mako and me to do the study that we published in
PLOS ONE was the fact that (1) the old UNU-Merit & WMF survey sought to
include readers as well as editors; *and* (2) at the exact same time Pew
carried out a survey in which they asked a nearly identical question about
readership. We used the overlapping results about WP readership from both
surveys to generate a correction for the data about editorship. Without
similar data on readership and similar data from a representative sample of
some reference population (in the case of the Pew survey, US adults), we
cannot perform the same correction. As a result, I do not feel comfortable
estimating how biased (or unbiased) the 2012 survey results may be.
a
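To make the general idea of such a correction concrete, here is a toy Python sketch of reweighting by group. It is not the method used in the published study; it only shows how overlapping readership figures from a reference survey could rescale editor counts from a non-representative sample. The function name, group labels and all numbers are hypothetical.

```python
def reweighted_editor_shares(sample_readers, reference_readers, sample_editors):
    """Rescale editor counts per group by how over- or under-represented
    that group is among readers in the sample versus the reference survey.

    All arguments are dicts mapping a group label to a count.
    Returns the adjusted share of editors per group (shares sum to 1).
    """
    sample_total = sum(sample_readers.values())
    reference_total = sum(reference_readers.values())
    adjusted = {}
    for group, editors in sample_editors.items():
        sample_share = sample_readers[group] / sample_total
        reference_share = reference_readers[group] / reference_total
        # Weight > 1 means the group is under-represented in the sample.
        adjusted[group] = editors * reference_share / sample_share
    adjusted_total = sum(adjusted.values())
    return {group: value / adjusted_total for group, value in adjusted.items()}

# Hypothetical counts, for illustration only.
shares = reweighted_editor_shares(
    sample_readers={"group_a": 75, "group_b": 25},
    reference_readers={"group_a": 50, "group_b": 50},
    sample_editors={"group_a": 90, "group_b": 10},
)
```

Without the readership counts from both surveys, the weights cannot be computed at all, which is the situation described above for the 2012 data.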
On Tue, Feb 17, 2015 at 12:00 AM, Jane Darnell <jane023(a)gmail.com> wrote:
> Hi Claudia,
> I responded to your questions in the text - hope it's readable.
> Jane
>
> ____WereSpielChequers wrote:
> "the community is more abrasive towards women"
>
> I think he is simply referring to earlier discussions where the
> conclusion was "the community can be perceived to be abrasive", and that
> conclusion, in yet other discussions, led to this one, which should
> be rephrased as "the community is more often perceived as abrasive by women
> than by men"
>
> ____Kerry wrote:
> "But I would agree that if an organisation sets a target (25% women in this
> particular case) and then does not put in place a means of measuring the
> progress against that target, one has to question the point of
> establishing a
> target."
>
> ___Claudia (responding to Kerry):
> I think one has to question the point of not putting in place a means of
> measuring the progress...
> and also ask why, if the issue is a high priority (allegedly, one might
> add, in
> speeches at meetings, in interviews with the press...) this organisation
> does
> not fund any top level research... - or does it?
>
> I think here you are forgetting about the "holy shit graph" which shows
> a reduction in the number of active editors over time. This is much more of
> a direct threat to the Wikiverse than the gendergap, which, as has been
> stated before, is only one of many serious gaps in knowledge coverage.
> Oddly, I think it is one of the easiest of all "participatory gaps" to
> measure, but we seem to constantly get stranded in objections to ways that
> previous editor surveys have been held, leading to the strange situation of
> never actually being able to run even one editor survey twice. Since we
> have not yet been able to establish any trend at all, we are only comparing
> apples to oranges.
>
> ____Aaron wrote:
> "higher quality survey data"
> __Claudia (responding to Aaron): ...how does one recognize low quality..?
> Hmm. I just looked and I couldn't find the criticism of the various editor
> surveys. Is this stashed somewhere on meta? Or do we need to sift through
> reams of emails until we find all the various objections? Objections
> galore, as I recall.
>
> ___Claudia: which "related participation gaps" do you have in mind here?
> Off the top of my head, some of these would be
>
> 1) lack of geographical editor coverage such as active editors in rural
> areas or even in whole states such as Wyoming or South Dakota and the whole
> "Global South participation problem" (the Global South participation
> problem is even helped along inadvertently by the new read-only
> "Wikipedia-zero" effect);
> 2) lack of topical expertise on subjects that technically don't lend
> themselves well to the Wikiverse, such as auditory fields (musical
> production) or visual fields (how to paint, how to make movies, how to
> choreograph motion)
> 3) lack of topical expertise on subjects that legally don't lend
> themselves well to the Wikiverse, such as articles about artworks under
> copyright that cannot be illustrated in an article;
> 4) lack of topical editor coverage on subjects previously shut out - there
> is still unwillingness by a whole group to re-enter the Wikiverse after
> being banned (earlier shut-outs such as blocking whole institution-wide ip
> ranges for vandalism or whole areas of expertise such as groups of writers
> for their COI editing, carry with them a history of anti-Wikipedia
> sentiment that lasts a long time in various enclaves)
>
> ___Claudia:
> and, again, in which language version(s)?
> That's easy - the languages that we can technically support but don't yet
> have Wikipedias for and the languages for which we don't even have the
> fonts to display them.
>
> best,
> Claudia
>
>
> On Tue, Feb 17, 2015 at 7:41 AM, <koltzenburg(a)w4w.net> wrote:
>
>> Hi WereSpielChequers, Kerry, Aaron and all,
>>
>> ____WereSpielChequers wrote:
>> "the community is more abrasive towards women"
>>
>> this may be stats expert discourse, but let me show you how the question
>> itself has a gendered slant.
>> imagine what would happen - also in your research design - if it read:
>> "the
>> community is less abrasive towards men" - how does this compare to the
>> first question re who are "the community"?
>>
>> and again, re phasing ten years in 2011 and four years on, which language
>> version(s) are hypotheses based on?
>>
>> ____Kerry wrote:
>> "But I would agree that if an organisation sets a target (25% women in
>> this
>> particular case) and then does not put in place a means of measuring the
>> progress against that target, one has to question the point of
>> establishing a
>> target."
>>
>> I think one has to question the point of not putting in place a means of
>> measuring the progress...
>> and also ask why, if the issue is a high priority (allegedly, one might
>> add, in
>> speeches at meetings, in interviews with the press...) this organisation
>> does
>> not fund any top level research... - or does it?
>>
>> ____Aaron wrote:
>> "higher quality survey data"
>> well, and how does one recognize low quality and how come it is so low?
>> and "quality" by whose epistemological aims and standards?
>>
>> "causes and mechanisms that drive the gender gap (and related
>> participation gaps)"
>> which "related participation gaps" do you have in mind here?
>> where would these gaps be situated in terms of areas of participation?
>> and, again, in which language version(s)?
>>
>> best,
>> Claudia
>>
>> ---------- Original Message -----------
>> From: aaron shaw <aaronshaw(a)northwestern.edu>
>> To: Research into Wikimedia content and communities
>> <wiki-research-l(a)lists.wikimedia.org>
>> Sent: Mon, 16 Feb 2015 20:50:17 -0800
>> Subject: Re: [Wiki-research-l] a cautious note on gender stats Re: Fwd:
>> [Gendergap] Wikipedia readers
>>
>> > Hi all!
>> >
>> > Thanks, Jeremy & Dariusz for following up.
>> >
>> > On Mon, Feb 16, 2015 at 5:58 AM, Dariusz
>> > Jemielniak <darekj(a)alk.edu.pl> wrote:
>> >
>> > > As far as I recall, they did a follow-up on this topic, and maybe a
>> > > publication coming up?
>> >
>> > Sadly, no follow ups at the moment.
>> >
>> > If we want to have a more precise sense of the
>> > demographics of participants the biggest need in
>> > this space is simply higher quality survey data.
>> > My paper with Mako has a lot of detail about why
>> > the 2008 editor survey (and all subsequent editor
>> > surveys, to my knowledge) has some profound limitations.
>> >
>> > The identification and estimation of the effects
>> > of particular causes and mechanisms that drive the
>> > gender gap (and related participation gaps)
>> > presents an even tougher challenge for
>> > researchers and is an area of active inquiry.
>> >
>> > all the best,
>> > Aaron
>> ------- End of Original Message -------
>>
>> _______________________________________________
>> Wiki-research-l mailing list
>> Wiki-research-l(a)lists.wikimedia.org
>> https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
>>
>
>
Hey all!
We've released a highly-aggregated dataset of readership data -
specifically, data about where, geographically, traffic to each of our
projects (and all of our projects) comes from. The data can be found
at http://dx.doi.org/10.6084/m9.figshare.1317408 - additionally, I've
put together an exploration tool for it at
https://ironholds.shinyapps.io/WhereInTheWorldIsWikipedia/
Hope it's useful to people!
--
Oliver Keyes
Research Analyst
Wikimedia Foundation
Hello all,
I have uploaded the results from the *Signpost* readership survey to
Wikimedia Commons in PDF format:
https://commons.wikimedia.org/wiki/File:Signpost_February_2015_survey_resul…
Thanks very much to the WMF Learning and Evaluation Team for letting us use
Qualtrics.
The *Signpost* management team recently agreed to cross-post selected
content from the Wikimedia Blog into the *Signpost*. By doing this we can
both increase the exposure of Blog content (many *Signpost* readers don't
read the blog) and enhance the value of the *Signpost* to its current
readers (some of whom would like to see more coverage of sister projects
and other, diverse parts of the Wikimedia ecosystem).
Your comments on the survey results would be appreciated. The *Signpost*
management team will have more to say after we study these results in more
detail, and we will publish our comments in a future *Signpost* issue.
Cheers,
Pine
*Signpost* Publication and Newsroom Manager
*This is an Encyclopedia* <https://www.wikipedia.org/>
*One gateway to the wide garden of knowledge, where lies
The deep rock of our past, in which we must delve
The well of our future,
The clear water we must leave untainted for those who come after us,
The fertile earth, in which truth may grow in bright places, tended by many hands,
And the broad fall of sunshine, warming our first steps toward knowing how much we do not know.*
*—Catherine Munro*
Erik Zachte, 25/02/2015 23:34:
> Compare https://ironholds.shinyapps.io/WhereInTheWorldIsWikipedia/ and
> http://stats.wikimedia.org/wikimedia/squids/SquidReportPageViewsPerLanguage…
Ironholds' looks more vulnerable to bots, it's easier to see in small
wikis (though, kudos! many more small wikis are included than in
wikistats). For instance, 20 more percentage points for USA on Breton
and Bavarian Wikipedias, 30 on Welsh, 40 on Alemannic, almost 50 on
Kurdish. For Chinese bots they look similar, though in some cases I'm
not sure what's going on: for instance als.wiki also sees CH and RO emerge.
Will the new pageviews definition use the same bot filtering method?
Nemo
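The kind of comparison described above (percentage-point differences in each country's share of a wiki's traffic between two reports) can be sketched in Python as follows. The function names and the counts are hypothetical; neither report's real figures are reproduced here.

```python
def share_by_country(counts):
    """Convert raw per-country request counts into percentage shares."""
    total = sum(counts.values())
    return {country: 100.0 * n / total for country, n in counts.items()}

def point_difference(report_a, report_b):
    """Percentage-point difference (a minus b) in each country's share."""
    shares_a = share_by_country(report_a)
    shares_b = share_by_country(report_b)
    countries = set(shares_a) | set(shares_b)
    return {c: shares_a.get(c, 0.0) - shares_b.get(c, 0.0) for c in countries}

# Hypothetical counts for one small wiki in two reports.
diff = point_difference({"US": 60, "FR": 40}, {"US": 40, "FR": 60})
```

Because shares are computed per wiki, a small wiki with few requests can swing by many points when a modest amount of automated traffic is added to one country.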
Yours is looking at just December, while mine is looking at the entire
year, for starters. Also, what's the apps/mobile web inclusion for
that report?
On 25 February 2015 at 17:34, Erik Zachte <ezachte(a)wikimedia.org> wrote:
> I am surprised that the new data, with crawlers excluded, show more wp:en traffic from US (43%) than the old data (36.4% for 2014), which contained much crawler traffic, presumably most of that from US.
>
> Compare https://ironholds.shinyapps.io/WhereInTheWorldIsWikipedia/ and
> http://stats.wikimedia.org/wikimedia/squids/SquidReportPageViewsPerLanguage…
>
> Any thoughts?
>
> Erik
>
> -----Original Message-----
> From: analytics-bounces(a)lists.wikimedia.org [mailto:analytics-bounces@lists.wikimedia.org] On Behalf Of Oliver Keyes
> Sent: Wednesday, February 25, 2015 22:37
> To: Research into Wikimedia content and communities
> Cc: A mailing list for the Analytics Team at WMF and everybody who has an interest in Wikipedia and analytics.
> Subject: Re: [Analytics] [Wiki-research-l] [Release]
>
> The one major caveat, I think, is that the danger of proportionate data is that it makes small projects very vulnerable to artificial traffic spikes. I'd go out on a limb and say that some of the massive bumps in popularity we see in particular combinations are likely due to either undetected automata or simply a project having so little traffic that a small number of people can sway the results outlandishly.
>
> On 25 February 2015 at 16:32, Andrew Lih <andrew.lih(a)gmail.com> wrote:
>> Great job.
>>
>> Who knew Esperanto was big in Japan and China at #2 and #3?
>>
>>
>>
>
>
>
> --
> Oliver Keyes
> Research Analyst
> Wikimedia Foundation
>
> _______________________________________________
> Analytics mailing list
> Analytics(a)lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/analytics
>
>
--
Oliver Keyes
Research Analyst
Wikimedia Foundation
Thanks for doing that, Andrew!
On Tue, Feb 24, 2015 at 1:41 PM, Andrew Otto <aotto(a)wikimedia.org> wrote:
> I also added some Hadoop-based use cases to that document.
>
>
> https://www.mediawiki.org/w/index.php?title=Wikimedia_MediaWiki_Core_Team%2…
>
>
> > On Feb 21, 2015, at 05:03, Emmanuel Engelhart <kelson(a)kiwix.org> wrote:
> >
> > Hi
> >
> > Thank you, Nemo, for pointing out that interesting page about how to improve
> > Wikimedia dumping processes. This topic is of course a primary concern for
> > the Kiwix developer team.
> >
> > Here is my contribution:
> >
> > https://www.mediawiki.org/w/index.php?title=Wikimedia_MediaWiki_Core_Team%2…
> >
> > Hope to see things going forward on this, I will help as much as I can.
> >
> > Regards
> > Emmanuel
> >
> > On 21.02.2015 08:44, Federico Leva (Nemo) wrote:
> >> FYI
> >>
> >>
> >> -------- Forwarded message --------
> >> Subject: [Xmldatadumps-l] Your comments needed (long term dumps
> >> rewrite?)
> >> Date: Thu, 19 Feb 2015 12:30:01 +0200
> >> From: Ariel Glenn WMF <ariel(a)wikimedia.org>
> >> To: Xmldatadumps-l(a)lists.wikimedia.org
> >>
> >>
> >>
> >> The MediaWiki Core team has opened a discussion about getting more
> >> involved in and maybe redoing the dumps infrastructure. A good starting
> >> point is to understand how folks use the dumps already or want to use
> >> them but can't, and some questions about that are listed here:
> >>
> >> https://www.mediawiki.org/wiki/Wikimedia_MediaWiki_Core_Team/Backlog/Improv…
> >>
> >> I've added some notes but please go weigh in. Don't be shy about what
> >> you do/what you need, this is the time to get it all on the table.
> >>
> >> Ariel
> >>
> >>
> >>
> >>
> >> _______________________________________________
> >> Offline-l mailing list
> >> Offline-l(a)lists.wikimedia.org
> >> https://lists.wikimedia.org/mailman/listinfo/offline-l
> >>
> >
> >
> > --
> > Kiwix - Wikipedia Offline & more
> > * Web: http://www.kiwix.org
> > * Twitter: https://twitter.com/KiwixOffline
> > * more: http://www.kiwix.org/wiki/Communication
> >
Hi all,
I am new to working with Wikipedia dumps. I am trying to obtain the full
revision history of all the articles on Wikipedia. I downloaded
enwiki-20140707-pages-meta-history1.xml-*.7z
from https://dumps.wikimedia.org/enwiki/20140707/. However, looking at
the XML files, the revision histories of individual articles do not match
the revision histories shown on the history pages of the Wikipedia website.
The dump seems to contain significantly fewer revisions than can be found
on Wikipedia.
Has anyone experienced this? Am I downloading the wrong files?
Best,
Behzad
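One way to check the numbers is to count revisions per page directly from a decompressed dump file and compare them with the history page. The following is a minimal Python sketch, assuming the file has already been extracted from the .7z archive and uses the usual <page>, <title> and <revision> elements of the MediaWiki XML export format; the function names and the tiny sample document are made up for illustration.

```python
import io
import xml.etree.ElementTree as ET

def local_name(tag):
    """Strip an XML namespace prefix such as '{http://...}page'."""
    return tag.rsplit("}", 1)[-1]

def count_revisions(source):
    """Stream through a dump and return {page title: number of revisions}.

    `source` may be a file name or a binary file object. Elements are
    cleared as soon as they are processed so memory use stays low.
    """
    counts = {}
    title = None
    revisions = 0
    for _event, elem in ET.iterparse(source, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "revision":
            revisions += 1
            elem.clear()
        elif name == "page":
            counts[title] = revisions
            title, revisions = None, 0
            elem.clear()
    return counts

# Tiny hand-written document for illustration; the namespace is a placeholder.
sample = b"""<mediawiki xmlns="http://example.org/export">
<page><title>A</title><revision><id>1</id></revision><revision><id>2</id></revision></page>
<page><title>B</title><revision><id>3</id></revision></page>
</mediawiki>"""
counts = count_revisions(io.BytesIO(sample))
```

Running this over every downloaded file and summing the counts per title would show whether revisions are really missing or only spread across several files.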