We’re glad to announce the release of an aggregate clickstream dataset extracted from English Wikipedia:
http://dx.doi.org/10.6084/m9.figshare.1305770
This dataset contains counts of (referer, article) pairs aggregated from the HTTP request logs of English Wikipedia. This snapshot captures 22 million (referer, article) pairs from a total of 4 billion requests collected during the month of January 2015.
This data can be used for various purposes:
• determining the most frequent links people click on for a given article
• determining the most common links people followed to an article
• determining how much of the total traffic to an article clicked on a link in that article
• generating a Markov chain over English Wikipedia
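As a rough illustration of the uses listed above, here is a minimal Python sketch. It assumes each record has been reduced to a (referer, article, count) tuple; the sample rows, the function names, and that tuple layout are made up for illustration and are not the dataset's actual file format, so adapt the loading step to the real columns.

```python
from collections import defaultdict

# Hypothetical sample rows: (referer, article, count). All values are invented.
ROWS = [
    ("other-google", "London", 500),
    ("England", "London", 120),
    ("United_Kingdom", "London", 80),
    ("London", "River_Thames", 60),
    ("London", "England", 30),
    ("London", "Big_Ben", 10),
]


def top_outgoing(rows, article, k=3):
    """Most frequent links people clicked on from `article`."""
    out = [(curr, n) for prev, curr, n in rows if prev == article]
    return sorted(out, key=lambda pair: pair[1], reverse=True)[:k]


def top_incoming(rows, article, k=3):
    """Most common referers people followed to reach `article`."""
    inc = [(prev, n) for prev, curr, n in rows if curr == article]
    return sorted(inc, key=lambda pair: pair[1], reverse=True)[:k]


def click_through_share(rows, article):
    """Rough share of requests for `article` that were followed by a click
    on a link in it (outgoing count divided by incoming count)."""
    views = sum(n for _, curr, n in rows if curr == article)
    clicks = sum(n for prev, _, n in rows if prev == article)
    return clicks / views if views else 0.0


def transition_probabilities(rows):
    """First-order Markov chain: P(next article | current page)."""
    totals = defaultdict(int)
    for prev, _, n in rows:
        totals[prev] += n
    probs = defaultdict(dict)
    for prev, curr, n in rows:
        probs[prev][curr] = n / totals[prev]
    return dict(probs)
```

The transition probabilities are normalized per referer, so each row of the resulting chain sums to 1 over the links observed in the data; pages with no recorded outgoing clicks simply have no row.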
We created a page on Meta for feedback and discussion about this release: https://meta.wikimedia.org/wiki/Research_talk:Wikipedia_clickstream
Ellery and Dario
Cross-posting this request to wiki-research-l. Anyone have data on
frequently used section titles in articles (any language), or know of
datasets/publications that examined this?
I'm not aware of any off the top of my head, Amir.
- Jonathan
---------- Forwarded message ----------
From: Amir E. Aharoni <amir.aharoni(a)mail.huji.ac.il>
Date: Sat, Jul 11, 2015 at 3:29 AM
Subject: [Wikitech-l] statistics about frequent section titles
To: Wikimedia developers <wikitech-l(a)lists.wikimedia.org>
Hi,
Did anybody ever try to collect statistics about frequent section titles in
Wikimedia projects?
For Wikipedia, for example, titles such as "Biography", "Early life",
"Bibliography", "External links", "References", "History", etc., appear in
a lot of articles, and their counterparts appear in a lot of languages.
There are probably similar things in Wikivoyage, Wiktionary and possibly
other projects.
Did anybody ever try to collect statistics of the most frequent section
titles in each language and project?
--
Amir Elisha Aharoni · אָמִיר אֱלִישָׁע אַהֲרוֹנִי
http://aharoni.wordpress.com
“We're living in pieces,
I want to live in peace.” – T. Moore
_______________________________________________
Wikitech-l mailing list
Wikitech-l(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
--
Jonathan T. Morgan
Senior Design Researcher
Wikimedia Foundation
User:Jmorgan (WMF) <https://meta.wikimedia.org/wiki/User:Jmorgan_(WMF)>
Hi all
Does anyone know of any information on editor retention rates based on
whether the person used Visual Editor or Wikitext?
I'm sure there are many ways you could explore this subject; my specific
interest is whether, when running editor training, it would be better to
teach people to use VE or wikitext.
Many thanks
John
Hoi,
At Wikidata we often find issues with data imported from a Wikipedia. Lists
have been produced with these issues on the Wikipedia involved and arguably
they do present issues with the quality of Wikipedia or Wikidata for that
matter. So far hardly anything resulted from such outreach.
When Wikipedia is a black box, not communicating with the outside
world, at some stage the situation becomes toxic. At this moment there are
already those at Wikidata that argue not to bother about Wikipedia quality
because in their view, Wikipedians do not care about Wikipedia's own quality.
Arguably known issues with quality are the easiest to solve.
There are many ways to approach this subject. It is indeed a quality issue
both for Wikidata and Wikipedia. It can be seen as a research issue; how to
deal with quality and how do such mechanisms function if at all.
I blogged about it.
Thanks,
GerardM
http://ultimategerardm.blogspot.nl/2015/11/what-kind-of-box-is-wikipedia.ht…
Gerard,
On Tue, Nov 24, 2015 at 7:15 AM, Gerard Meijssen <gerard.meijssen(a)gmail.com>
wrote:
> Hoi,
> To start off, results from the past are no indications of results in the
> future. It is the disclaimer insurance companies have to state in all their
> adverts in the Netherlands. When you continue and make it a "theological"
> issue, you lose me because I am not of this faith, far from it. Wikidata is
> its own project and it is utterly dissimilar from Wikipedia. To start off,
> Wikidata has been a certified success from the start. The improvement it
> brought by bringing all interwiki links together is enormous. That alone
> should be a pointer that Wikipedia think is not realistic.
>
These benefits are internal to Wikimedia and a completely separate issue
from third-party re-use of Wikidata content as a default reference source,
which is the issue of concern here.
> To continue, people have been importing data into Wikidata from the start.
> They are the statements you know and, it was possible to import them from
> Wikipedia because of these interwiki links. So when you call for sources,
> it is fairly safe to assume that those imports are supported by the quality
> of the statements of the Wikipedias
The quality of three-quarters of the 280+ Wikipedia language versions is
about at the level the English Wikipedia had reached in 2002.
Even some of the larger Wikipedias have significant problems. The Kazakh
Wikipedia for example is controlled by functionaries of an oppressive
regime[1], and the Croatian one is reportedly[2] controlled by fascists
rewriting history (unless things have improved markedly in the Croatian
Wikipedia since that report, which would be news to me). The Azerbaijani
Wikipedia seems to have problems as well.
The Wikimedia movement has always had an important principle: that all
content should be traceable to a "reliable source". Throughout the first
decade of this movement and beyond, Wikimedia content has never been
considered a reliable source. For example, you can't use a Wikipedia
article as a reference in another Wikipedia article.
Another important principle has been the disclaimer: pointing out to people
that the data is anonymously crowdsourced, and that there is no guarantee
of reliability or fitness for use.
Both of these principles are now being jettisoned.
Wikipedia content is considered a reliable source in Wikidata, and Wikidata
content is used as a reliable source by Google, where it appears without
any indication of its provenance. This is a reflection of the fact that
Wikidata, unlike Wikipedia, comes with a CC0 licence. That decision was, I
understand, made by Denny, who is both a Google employee and a WMF board
member.
The benefit to Google is very clear: this free, unattributed content adds
value to Google's search engine result pages, and improves Google's revenue
(currently running at about $10 million an hour, much of it from ads).
But what is the benefit to the end user? The end user gets information of
undisclosed provenance, which is presented to them as authoritative, even
though it may be compromised. In what sense is that an improvement for
society?
To me, the ongoing information revolution is like the 19th century
industrial revolution done over. It created whole new categories of abuse,
which it took a century to (partly) eliminate. But first, capitalists had a
field day, and the people who were screwed were the common folk. Could we
not try to learn from history?
> and if anything, that is also where
> they typically fail because many assumptions at Wikipedia are plain wrong
> at Wikidata. For instance a listed building is not the organisation the
> building is known for. At Wikidata they each need their own item and
> associated statements.
>
> Wikidata is already a success for other reasons. VIAF no longer links to
> Wikipedia but to Wikidata. The biggest benefit of this move is for people
> who are not interested in English. Because of this change VIAF links
> through Wikidata to all Wikipedias not only en.wp. Consequently people may
> find through VIAF Wikipedia articles in their own language through their
> library systems.
>
At the recent Wikiconference USA, a Wikimedia veteran and professional
librarian expressed the view to me that
* circular referencing between VIAF and Wikidata will create a humongous
muddle that nobody will be able to sort out again afterwards, because –
unlike wiki mishaps in other topic areas – here it's the most authoritative
sources that are being corrupted by circular referencing;
* third parties are using Wikimedia content as a *reference standard* when
that was never the intention (see above).
I've seen German Wikimedians express concerns that quality assurance
standards have dropped alarmingly since the project began, with bot users
mass-importing unreliable data.
> So do not forget about Wikipedia and the lessons learned. These lessons are
> important to Wikipedia. However, they do not necessarily apply to Wikidata
> particularly when you approach Wikidata as an opportunity to do things in a
> different way. Set theory, a branch of mathematics, is exactly what we
> need. When we have data at Wikidata of a given quality, e.g. 90%, and we have
> data at another source with a given quality, e.g. 90%, we can compare the two
> and find a subset where the two sources do not match. When we curate the
> differences, it is highly likely that we improve quality at Wikidata or at
> the other source.
This sounds like "Let's do it quick and dirty and worry about the problems
later".
I sometimes get the feeling software engineers just love a programming
challenge, because that's where they can hone and display their skills.
Dirty data is one of those challenges: all the clever things one can do to
clean up the data! There is tremendous optimism about what can be done. But
why have bad data in the first place, starting with rubbish and then
proving that it can be cleaned up a bit using clever software?
The effort will make the engineer look good, sure, but there will always be
collateral damage as errors propagate before they are fixed. The engineer's
eyes are not typically on the content, but on their software. The content
their bots and programs manipulate at times seems almost incidental,
something for "others" to worry about – "others" who don't necessarily
exist in sufficient numbers to ensure quality.
In short, my feeling is that the engineering enthusiasm and expertise
applied to Wikidata aren't balanced by a similar level of commitment to
scholarship in generating the data, and getting them right first time.
We've seen where that approach can lead with Wikipedia. Wikipedia hoaxes
and falsehoods find their way into the blogosphere, the media, even the
academic literature. The stakes with Wikidata are potentially much higher,
because I fear errors in Wikidata stand a good chance of being massively
propagated by Google's present and future automated information delivery
mechanisms, which are completely opaque. Most internet users aren't even
aware to what extent the Google Knowledge Graph relies on anonymously
compiled, crowdsourced data; they will just assume that if Google says it,
it must be true.
In addition to honest mistakes, transcription errors, outdated info etc.,
the whole thing is a propagandist's wet dream. Anonymous accounts!
Guaranteed identity protection! Plausible deniability! No legal liability!
Automated import and dissemination without human oversight! Massive impact
on public opinion![3]
If information is power, then this provides the best chance of a power grab
humanity has seen since the invention of the newspaper. In the media
landscape, you at least have right-wing, centrist and left-wing
publications each presenting their version of the truth, and you know who's
publishing what and what agenda they follow. You can pick and choose,
compare and contrast, read between the lines. We won't have that online.
Wikimedia-fuelled search engines like Google and Bing dominate the
information supply.
The right to enjoy a pluralist media landscape, populated by players who
are accountable to the public, was hard won in centuries past. Some
countries still don't enjoy that luxury today. Are we now blithely giving
it away, in the name of progress, and for the greater glory of technocrats?
I don't trust the way this is going. I see a distinct possibility that
we'll end up with false information in Wikidata (or, rather, the Google
Knowledge Graph) being used to "correct" accurate information in other
sources, just because the Google/Wikidata content is ubiquitous. If you
build circular referencing loops fuelled by spurious data, you don't
provide access to knowledge, you destroy it. A lie told often enough etc.
To quote Heather Ford and Mark Graham, "We know that the engineers and
developers, volunteers and passionate technologists are often trying to do
their best in difficult circumstances. But there need to be better attempts
by people working on these platforms to explain how decisions are made
about what is represented. These may just look like unimportant lines of
code in some system somewhere, but they have a very real impact on the
identities and futures of people who are often far removed from the
conversations happening among engineers."
I agree with that. The "what" should be more important than the "how", and
at present it doesn't seem to be.
It's well worth thinking about, and having a debate about what can be done
to prevent the worst from happening.
In particular, I would like to see the decision to publish Wikidata under a
CC0 licence revisited. The public should know where the data it gets comes
from; that's a basic issue of transparency.
Andreas
[1]
https://en.wikipedia.org/wiki/Wikipedia:Wikipedia_Signpost/2015-10-07/Op-ed
[2]
http://www.dailydot.com/politics/croatian-wikipedia-fascist-takeover-contro…
[3]
http://www.politico.com/magazine/story/2015/08/how-google-could-rig-the-201…
These resources may give ideas to people who create project proposals or
any of several kinds of reports.
http://www.datavizcatalogue.com/index.html
http://survey.timeviz.net/
Have fun exploring,
Pine
P.S. In the multimedia domain, I'd like to have the ability to add
interactive visualizations to Wikipedia and its sister projects. For
example, imagine how engaging it would be to have interactive phylogenetic
trees that allow the user to zoom in and out and see images of species.
We are now working on the "Cases" page of the draft Code of conduct.
This will become a separate page (for readability of the final CoC), but
is being drafted on the same page with the rest.
This includes both the intro section, and all the sub-sections, which
means everything that starts with "2." in the ToC. Currently this is
"Handling reports", "Responses and resolutions", and "Appealing a
resolution". However, the sections within "Cases" may change:
* Section:
https://www.mediawiki.org/wiki/Code_of_Conduct/Draft#Page:_Code_of_conduct_…
* Talk:
https://www.mediawiki.org/wiki/Talk:Code_of_Conduct/Draft#Finishing_the_Cas…
* Alternatively, you can provide anonymous feedback to
conduct-discussion(a)wikimedia.org.
This is the best time to make any necessary changes to this page (and
explain why, in edit summaries and/or talk) and discuss it on the talk page.
Other updates:
* The text of the "Report a problem" section has been frozen. Thanks to
everyone who helped discuss and edit these sections. Participation
(including both named and anonymous) helped us improve the
confidentiality line.
Thanks,
Matt Flaschen
Hi all,
Some of us plan to have a conversation at the WCONUSA unconference sessions
about ENWP culture. Are there any recommended readings that you could
suggest as preparation, particularly on the subject of how to reinforce or
incentivize desirable user behavior? I think that Jonathan may have done
some research on this topic for the Teahouse, and Ocassi may have done
research for TWA. I'm interested in applicable research as preparation both
for the unconference discussion and for my planned video series that
intends to inform and inspire new editors.
Thanks,
Pine
OpenSym 2016 General Call for Submissions (Papers)
----------------------------------------------
OpenSym 2016, the 12th International Symposium on Open Collaboration
August 17-19, 2016 | Berlin, Germany
http://opensym.org/os2016
ABOUT THE CONFERENCE
----------------------------------------------
The 12th International Symposium on Open Collaboration (OpenSym 2016) is the
premier conference on open collaboration research and practice, including open
source, open data, open education, wikis and related social media, Wikipedia,
and IT-driven open innovation research.
OpenSym is the first conference series to bring together the different strands
of open collaboration research and practice, seeking to create synergies and
inspire new collaborations between computer scientists, social scientists,
legal scholars, and everyone interested in understanding open collaboration
and how it is changing the world.
OpenSym 2016 will be held in Berlin, Germany, on August 17-19, 2016.
OpenSym is held in-cooperation with ACM SIGWEB and ACM SIGSOFT and the
conference proceedings will be archived in the ACM digital library like all
prior editions.
The research paper submission deadline is March 25th, 2016.
RESEARCH TRACK CALLS FOR SUBMISSIONS
----------------------------------------------
The conference provides peer-reviewed research tracks on
- Free, libre, and open source software research, chaired by Stephan Koch,
Boğaziçi Üniversitesi, and Daniel German, University of Victoria. For more
information see
http://www.opensym.org/os2016/call-for-papers/free-libre-and-open-source-so…
- Open data research, chaired by Dirk Riehle, Friedrich-Alexander University
Erlangen-Nürnberg, and Ina Schieferdecker, Fraunhofer FOKUS and TU Berlin. For
more information see
http://www.opensym.org/os2016/call-for-papers/open-data-research-track/
- Open education research, chaired by Astrid Wichmann, Ruhr-University Bochum,
and Johannes Moskaliuk, University of Tübingen. For more information see
http://www.opensym.org/os2016/call-for-papers/open-education-research-track/
- IT-driven open innovation research, chaired by Albrecht Fritzsche,
Friedrich-Alexander University of Erlangen-Nürnberg, and Srinivasan R, Indian
Institute of Management Bangalore. For more information see
http://www.opensym.org/os2016/call-for-papers/it-driven-open-innovation-res…
- Wikipedia research, chaired by Claudia Müller-Birn, Freie Universität
Berlin, and Benjamin Mako Hill, University of Washington. For more information
see http://www.opensym.org/os2016/call-for-papers/wikipedia-research-track/
- Open collaboration (wikis, social media, etc.) research, chaired by Oscar
Diaz, Universidad del Pais Vasco, and Dirk Riehle, Friedrich-Alexander
University Erlangen-Nürnberg. For more information see
http://www.opensym.org/os2016/call-for-papers/open-collaboration-research-t…
Research papers present integrative reviews or original reports of substantive
new work: theoretical, empirical, and/or in the design, development and/or
deployment of novel concepts, systems, and mechanisms. Research papers will be
reviewed by a research track program committee to meet rigorous academic
standards of publication. Papers will be reviewed for relevance, conceptual
quality, innovation and clarity of presentation.
Each track has its own call for papers, which you can find at
http://www.opensym.org/os2016/call-for-papers/. Submission deadline is March
25th, 2016.
Authors whose submitted papers have been accepted for presentation at the
conference have a choice of
- having their paper become part of the official proceedings, archived in the
ACM Digital Library,
- having no publication record at all but only the presentation at the conference.
OpenSym seeks to accommodate the needs of the different research disciplines
it draws on.
DOCTORAL SYMPOSIUM CALL FOR SUBMISSIONS
----------------------------------------------
OpenSym seeks to explore the synergies between all strands of open
collaboration research. Thus, we will have a doctoral symposium, in which
Ph.D. students from different disciplines can present their work and receive
feedback from senior faculty and their peers.
The doctoral symposium is led by Lutz Prechelt, Freie Universität Berlin.
The call for papers for the doctoral symposium can be found at
http://www.opensym.org/os2016/call-for-papers/doctoral-symposium-at-opensym/.
Submission deadline is May 6th, 2016.
INDUSTRY AND COMMUNITY TRACK CALL FOR SUBMISSIONS
----------------------------------------------
OpenSym is also seeking submissions for experience reports (long and short),
tutorials, workshops, panels, industry and community posters, and demos. Such
work accepted for presentation or performance at the conference is considered
part of the community track. It will be put into the proceedings in a
community track section; authors can opt out of the publication, as with
research papers.
The industry and community track is led by Simon Dückert of Cogneon GmbH and
Lorraine Morgan of Maynooth University.
The call for submissions to the community track can be found at
http://www.opensym.org/os2016/call-for-papers/industry-and-community-track/.
The first submission deadline is April 22nd, 2016. A second submission
deadline for late-comers (at the risk of not getting a seat) is June 2nd, 2016.
THE OPENSYM CONFERENCE EXPERIENCE
----------------------------------------------
OpenSym 2016 will be held in Berlin on August 17-19, 2016. Research and
community presentations and performances will be accompanied by keynotes,
invited speakers, and a social program in one of the most vibrant cities on
this planet.
The open space track is a key ingredient of the event that distinguishes
OpenSym from other conferences. It is an integral part of the program that
makes it easy to talk to other researchers and practitioners and to stretch
your imagination and conversations beyond the limits of your own
subdiscipline, exposing you to the full breadth of open collaboration
research. The open space track is entirely participant-organized, is open for
everyone, and requires no submission or review.
CONFERENCE ORGANIZATION
----------------------------------------------
The general chair of the conference is Anthony "Tony" Wasserman, CMU Silicon
Valley.
Feel free to contact us with any questions you might have at info(a)opensym.org.
The conference committee can be found at
http://www.opensym.org/os2016/organization/.
--
Prof. Dr. Dirk Riehle, Friedrich-Alexander-University Erlangen-Nürnberg
Open Source Research Group, Applied Software Engineering
Web: http://osr.cs.fau.de, Email: dirk.riehle(a)fau.de
Cell phone: +49 157 8153 4150 or +1 650 450 8550
--
Website: http://dirkriehle.com - Twitter: @dirkriehle
Ph (DE): +49-157-8153-4150 - Ph (US): +1-650-450-8550