We’re glad to announce the release of an aggregate clickstream dataset extracted from English Wikipedia:
http://dx.doi.org/10.6084/m9.figshare.1305770
This dataset contains counts of (referer, article) pairs aggregated from the HTTP request logs of English Wikipedia. This snapshot captures 22 million (referer, article) pairs from a total of 4 billion requests collected during the month of January 2015.
This data can be used for various purposes:
• determining the most frequent links people click on for a given article
• determining the most common links people followed to an article
• determining how much of the total traffic to an article clicked on a link in that article
• generating a Markov chain over English Wikipedia
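As a rough illustration of the last two uses, here is a minimal Python sketch. It assumes the pairs have already been loaded as (referer, article, count) tuples; the sample rows, the "external-search" referer label, and the function names are all hypothetical and do not reflect the actual file layout of the release. The click-through figure is only an approximation, since one visit can produce several clicks.

```python
from collections import defaultdict

# Hypothetical sample rows: (referer, article, count). These tuples only
# illustrate the shape of the data, not the real file format of the release.
pairs = [
    ("Main_Page", "London", 60),
    ("Main_Page", "Paris", 40),
    ("London", "Paris", 30),
    ("London", "River_Thames", 10),
    ("external-search", "London", 100),
]


def transition_probabilities(rows):
    """Normalize outgoing counts per referer into transition probabilities."""
    totals = defaultdict(int)
    for referer, _article, count in rows:
        totals[referer] += count
    chain = defaultdict(dict)
    for referer, article, count in rows:
        chain[referer][article] = count / totals[referer]
    return dict(chain)


def click_through_rate(rows, article):
    """Approximate share of traffic to `article` that clicked a link in it."""
    incoming = sum(c for _r, a, c in rows if a == article)
    outgoing = sum(c for r, _a, c in rows if r == article)
    return outgoing / incoming if incoming else 0.0


chain = transition_probabilities(pairs)
print(chain["London"])                      # outgoing probabilities from London
print(click_through_rate(pairs, "London"))  # outgoing clicks / incoming requests
```

Each row of `chain` sums to 1 over the observed outgoing links, which is what makes it usable as a first-order Markov chain over articles.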
We created a page on Meta for feedback and discussion about this release: https://meta.wikimedia.org/wiki/Research_talk:Wikipedia_clickstream
Ellery and Dario
Cross-posting this request to wiki-research-l. Anyone have data on
frequently used section titles in articles (any language), or know of
datasets/publications that examined this?
I'm not aware of any off the top of my head, Amir.
- Jonathan
---------- Forwarded message ----------
From: Amir E. Aharoni <amir.aharoni(a)mail.huji.ac.il>
Date: Sat, Jul 11, 2015 at 3:29 AM
Subject: [Wikitech-l] statistics about frequent section titles
To: Wikimedia developers <wikitech-l(a)lists.wikimedia.org>
Hi,
Did anybody ever try to collect statistics about frequent section titles in
Wikimedia projects?
For Wikipedia, for example, titles such as "Biography", "Early life",
"Bibliography", "External links", "References", "History", etc., appear in
a lot of articles, and their counterparts appear in a lot of languages.
There are probably similar things in Wikivoyage, Wiktionary and possibly
other projects.
Did anybody ever try to collect statistics of the most frequent section
titles in each language and project?
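For anyone who wants to try this, one possible starting point is sketched below in Python. It assumes the article wikitext is already available as strings (for example, extracted from a dump) and that headings use the usual "== Title ==" syntax; the function name and sample texts are hypothetical, and headings produced by templates would be missed.

```python
import re
from collections import Counter

# Matches wikitext headings such as "== History ==" or "=== Early life ===".
# The backreference \1 requires the same number of "=" on both sides.
HEADING_RE = re.compile(r"^(={2,6})[ \t]*(.+?)[ \t]*\1[ \t]*$", re.MULTILINE)


def count_section_titles(wikitexts):
    """Count in how many articles each section title appears."""
    counts = Counter()
    for text in wikitexts:
        # Use a set so a title is counted at most once per article.
        titles = {m.group(2) for m in HEADING_RE.finditer(text)}
        counts.update(titles)
    return counts


# Hypothetical sample articles, for illustration only.
articles = [
    "Intro\n== History ==\ntext\n== References ==\n",
    "== Early life ==\n=== Childhood ===\n== References ==\n",
]
counts = count_section_titles(articles)
print(counts.most_common(3))
```

Running this per language and per project would give the kind of frequency table asked about above.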
--
Amir Elisha Aharoni · אָמִיר אֱלִישָׁע אַהֲרוֹנִי
http://aharoni.wordpress.com
“We're living in pieces,
I want to live in peace.” – T. Moore
_______________________________________________
Wikitech-l mailing list
Wikitech-l(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
--
Jonathan T. Morgan
Senior Design Researcher
Wikimedia Foundation
User:Jmorgan (WMF) <https://meta.wikimedia.org/wiki/User:Jmorgan_(WMF)>
We’ve gotten good participation as we’ve worked on sections of the Code
of Conduct over the past few months, and have made considerable
improvements to the draft based on your feedback.
Given that, and the community approval through the discussions on each
section, the best approach is to proceed by approving section-by-section
until the last section is done.
So, please continue to improve the Code of Conduct by participating now
and as future sections are discussed. When the last section is
completed and approved on the talk page, the Code of Conduct will become
policy and no longer be marked as a draft.
Also, two more discussions regarding the Code of Conduct have been
resolved and incorporated into the draft.
* "Enforcement issues" addressed the reporting process and clarified
that Committee decisions could not be circumvented
* "Marginalized and underrepresented groups" forbids discrimination
Thanks,
Matt Flaschen
Citations and references are the building blocks of Wikimedia projects.
However, as of today, they are still treated as second-class citizens.
Structured databases such as Wikidata offer a unique opportunity
<https://www.wikidata.org/wiki/Wikidata:WikiProject_Source_MetaData> to
turn into reality over a decade of endeavors to build the sum of all
citations and bibliographic metadata into a centralized repository. To
coordinate upcoming work in this space, we're organizing a technical event
in late May and opening up applications for prospective participants.
*WikiCite 2016 <https://meta.wikimedia.org/wiki/WikiCite_2016>* is a
hands-on event focused on designing data models and technology to *improve
the coverage, quality, standards-compliance and machine-readability of
citations and source metadata in Wikipedia, Wikidata and other Wikimedia
projects*. Our goal, in particular, is to define a technical roadmap for
building a repository of all Wikimedia references in Wikidata.
We are bringing together Wikidatans, Wikipedians, software engineers, data
modelers, and information and library science experts from organizations
including *Crossref*, *Zotero*, *CSL*, *ContentMine*, *Google*, *DataCite*,
*NISO*, *OCLC* and the *NIH*. We are also inviting academic researchers
with experience working with Wikipedia's citations and bibliographic data.
WikiCite will be hosted in *Berlin* on *May 25-26, 2016*. Participation in
the event is capped at about 50 participants and we expect to have a number
of open slots for applicants:
- if you were pre-invited and have already filled in a form, you will
receive a separate note from the organizers
- if you have not been invited but you would like to participate, please
fill in this application form <http://goo.gl/forms/Yv6rve2wCt> to give
us some information about you and your interest and expected contribution
to the event.
Please help us pass this on to anyone who has done important technical work
on Wikimedia references and citations.
*Important dates*
- *March 29, 2016*: applications open
- *April 11, 2016*: applications close
- *April 15, 2016*: notifications of acceptance are issued (if you
applied for a travel grant, we'll be able to confirm by this date if we can
cover the costs of your trip)
For any questions, you can contact the organizing committee:
wikicite(a)wikimedia.org
The organizers,
Dario Taraborelli
Jonathan Dugan
Lydia Pintscher
Daniel Mietchen
Cameron Neylon
*Dario Taraborelli*, Head of Research, Wikimedia Foundation
wikimediafoundation.org • nitens.org • @readermeter
<http://twitter.com/readermeter>
I usually send these to multiple lists, but I realized I forgot to send
this to the ones besides wikitech-l.
The "Marginalized and underrepresented groups" discussion
(https://www.mediawiki.org/wiki/Talk:Code_of_Conduct/Draft#New_proposed_word…)
is still open. I'll probably give it two weeks total, which means
closing it late tomorrow.
Matt Flaschen
-------- Forwarded Message --------
Subject: Please provide feedback on new discrimination and enforcement
sections of Code of Conduct
Date: Wed, 16 Mar 2016 20:23:24 -0400
From: Matthew Flaschen <mflaschen(a)wikimedia.org>
To: Wikitech List <wikitech-l(a)lists.wikimedia.org>
Thanks for your participation in the recent Code of Conduct discussions.
The "Marginalized and underrepresented groups" discussion had a lot of
feedback. There was not consensus to use the exact original wording,
but many people expressed willingness to support a modified text.
I've proposed such a new text, based on Neil P. Quinn's text, with a
small modification to account for discrimination required by law (e.g.
age of people who can sign certain contracts).
Please participate at
https://www.mediawiki.org/wiki/Talk:Code_of_Conduct/Draft#New_proposed_word…
.
The "Enforcement issues" section received general support, but some of
that was conditional, or expressed preference for wording that developed
during the discussion. The original wording also did not address the
appeals body, which was raised in the discussion.
Please participate at
https://www.mediawiki.org/wiki/Talk:Code_of_Conduct/Draft#Circumvention_tex…
Update regarding completed discussions:
The "Clarification of legitimate reasons for publication of private
communications and identity protection" and "Definitions - trolling,
bad-faith reports" discussions have been closed.
They both had support, and I've incorporated the text into the draft.
Thanks,
Matt
Hi everybody,
We’re preparing for the March 2016 research newsletter and looking for contributors. Please take a look at: https://etherpad.wikimedia.org/p/WRN201603 and add your name next to any paper you are interested in covering. Our target publication date is Wednesday March 30 UTC although actual publication might happen several days later. As usual, short notes and one-paragraph reviews are most welcome.
Highlights from this month:
• Advances in Information Retrieval
• Candidate Searching and Key Coreference Resolution for Wikification
• Developing an annotator for Latin texts using Wikipedia
• "Did i say something wrong?" A word-level analysis of Wikipedia articles for deletion discussions
• Gender Biases in Cyberspace: A Two-Stage Model, the New Arena of Wikipedia and Other Websites
• Improving Information Literacy Skills through Learning To Use and Edit Wikipedia: A Chemistry Perspective
• Motivational determinants of participation trajectories in Wikipedia
• Open Content, Linus’ Law, and Neutral Point of View
• Teaching with Wikipedia in a 21st-century classroom: Perceptions of Wikipedia and its educational benefits
• Wikidata as a semantic framework for the Gene Wiki initiative
• Wikipedia in the anti-SOPA protests as a case study of direct, deliberative democracy in cyberspace
• *CSCW 2016 Conference proceedings
• *Wiki Workshop 2016 proceedings
If you have any questions about the format or process, feel free to get in touch off-list.
Masssly, Tilman Bayer and Dario Taraborelli
[1] http://meta.wikimedia.org/wiki/Research:Newsletter
Apologies for cross-posting
2nd Call for Research & Innovation Papers
SEMANTiCS 2016 - The Linked Data Conference
Transfer // Engineering // Community
12th International Conference on Semantic Systems
Leipzig, Germany
September 12-15, 2016
http://2016.semantics.cc
Important Dates (Research & Innovation)
* Abstract Submission Deadline: April 14, 2016 (11:59 pm, Hawaii time)
* Paper Submission Deadline: April 21, 2016 (11:59 pm, Hawaii time)
* Notification of Acceptance: May 26, 2016 (11:59 pm, Hawaii time)
* Camera-Ready Paper: June 16, 2016 (11:59 pm, Hawaii time)
Submissions via Easychair:
https://easychair.org/conferences/?conf=semantics2016research
As in the previous years, SEMANTiCS’16 proceedings are expected to be
published by ACM ICP.
The annual SEMANTiCS conference is the meeting place for professionals
who make semantic computing work, who understand its benefits and
encounter its limitations. Every year, SEMANTiCS attracts information
managers, IT-architects, software engineers and researchers from
organisations ranging from NPOs, through public administrations to the
largest companies in the world. Attendees learn from industry experts
and top researchers about emerging trends and topics in the fields of
semantic software, enterprise data, linked data & open data strategies,
methodologies in knowledge modelling and text & data analytics. The
SEMANTiCS community is highly diverse; attendees have responsibilities
in interlinking areas like knowledge management, technical
documentation, e-commerce, big data analytics, enterprise search,
document management, business intelligence and enterprise vocabulary
management.
The success of last year’s conference in Vienna with more than 280
attendees from 22 countries proves that SEMANTiCS 2016 will continue a
long tradition of bringing together colleagues from around the world.
There will be presentations on industry implementations, use case
prototypes, best practices, panels, papers and posters to discuss
semantic systems in birds-of-a-feather sessions as well as informal
settings. SEMANTiCS addresses problems common among information
managers, software engineers, IT-architects and various specialist
departments working to develop, implement and/or evaluate semantic
software systems.
The SEMANTiCS program is a rich mix of technical talks, panel
discussions of important topics and presentations by people who make
things work - just like you. In addition, attendees can network with
experts in a variety of fields. These relationships provide great value
to organisations as they encounter subtle technical issues in any stage
of implementation. The expertise gained by SEMANTiCS attendees has a
long-term impact on their careers and organisations. These factors make
SEMANTiCS the major industry-related event across Europe for our community.
#SEMANTiCS 2016 will especially welcome submissions for the following
hot topics:
* Data Quality Management
* Data Science (Data Mining, Machine Learning, Network Analytics)
* Semantics on the Web, Linked (Open) Data & schema.org
* Corporate Knowledge Graphs
* Knowledge Integration and Language Technologies
* Economics of Data, Data Services and Data Ecosystems
Following the success of previous years, the ‘horizontals’ (research)
and ‘verticals’ (industries) below are of interest for the conference:
Horizontals
* Enterprise Linked Data & Data Integration
* Knowledge Discovery & Intelligent Search
* Business Models, Governance & Data Strategies
* Big Data & Text Analytics
* Data Portals & Knowledge Visualization
* Semantic Information Management
* Document Management & Content Management
* Terminology, Thesaurus & Ontology Management
* Smart Connectivity, Networking & Interlinking
* Smart Data & Semantics in IoT
* Semantics for IT Safety & Security
* Semantic Rules, Policies & Licensing
* Community, Social & Societal Aspects
Verticals
* Industry & Engineering
* Life Sciences & Health Care
* Public Administration
* Galleries, Libraries, Archives & Museums (GLAM)
* Education & eLearning
* Media & Data Journalism
* Publishing, Marketing & Advertising
* Tourism & Recreation
* Financial & Insurance Industry
* Telecommunication & Mobile Services
* Sustainable Development: Climate, Water, Air, Ecology
* Energy, Smart Homes & Smart Grids
* Food, Agriculture & Farming
* Safety, Security & Privacy
* Transport, Environment & Geospatial
#Research / Innovation Papers
The Research & Innovation track at SEMANTiCS welcomes the submission of
papers on novel scientific research and/or innovations relevant to the
topics of the conference. Submissions must be original and must not have
been submitted for publication elsewhere. The Research & Innovation
track at SEMANTiCS is a single-blind review process (author names are
visible to reviewers, reviewers stay anonymous). The submitted abstract
and the topics are leveraged to find adequate reviewers for submitted
papers. Please write an email to
semantics2016researchtrack(a)easychair.org, if you have any questions.
Papers should follow the ACM ICPS guidelines for formatting and must not
exceed 8 pages in length for full papers and 4 pages for short papers,
including references and optional appendices. The layout templates can
be found here:
http://www.acm.org/sigs/publications/proceedings-templates All accepted
full papers and short papers will be published in the digital library of
the ACM ICP Series. Research & Innovation papers should be submitted
through EasyChair at:
https://easychair.org/conferences/?conf=semantics2016research. Papers
must be submitted in PDF (Adobe's Portable Document Format) format.
Other formats will not be accepted. For the camera-ready version, the
source files (Latex, WordPerfect, Word) will also be needed.
Research and Innovation Chairs:
* Anna Fensel, University of Innsbruck
* Amrapali Zaveri, Stanford University
Contact email address: semantics2016researchtrack(a)easychair.org
Research and Innovation Deputy Chairs:
* Bernhard Haslhofer, Austrian Institute of Technology
* Artem Revenko, Semantic Web Company
Conference Chairs:
* Sebastian Hellmann, AKSW/KILT, InfAI, Leipzig University
* Tassilo Pellegrini, UAS St. Pölten
Senior Program Committee:
* Paul Buitelaar, Insight - National University of Ireland, Galway
* Oscar Corcho, Universidad Politécnica de Madrid
* Claudia D'Amato, University of Bari
* Brian Davis, DERI NUIG
* Victor de Boer, VU Amsterdam
* Christian Dirschl, Wolters Kluwer Germany
* Michel Dumontier, Stanford University
* Agata Filipowska, Department of Information Systems, Poznan University
of Economics
* Bernhard Haslhofer, AIT-Austrian Institute of Technology
* Sebastian Hellmann, AKSW/KILT, InfAI, Leipzig University
* Andreas Hotho, University of Wuerzburg
* Jose Emilio Labra Gayo, Universidad de Oviedo
* Peter Mika, Yahoo! Research
* Axel-Cyrille Ngonga Ngomo, University of Leipzig
* Josiane Xavier Parreira, Siemens AG Österreich
* Heiko Paulheim, University of Mannheim
* Tassilo Pellegrini, University of Applied Sciences St. Pölten
* Marta Sabou, Vienna University of Technology
* Harald Sack, Hasso-Plattner-Institute for IT Systems Engineering,
University of Potsdam
* Pierre-Yves Vandenbussche, Fujitsu
* Ruben Verborgh, Ghent University - iMinds
* Maria Esther Vidal, Universidad Simon Bolivar, Dept. Computer Science
On Wed, Mar 23, 2016 at 1:06 PM, Federico Leva (Nemo) <nemowiki(a)gmail.com>
wrote:
> Dan Andreescu, 23/03/2016 15:58:
>
>>
>> *Clean-up:* Analytics data on dumps was crammed into /other with
>> unrelated datasets. We made a new page to receive current and future
>> datasets [3] and linked to it from /other and /. Please let us know if
>> anything there looks confusing or opaque and I'll be happy to clarify.
>>
>
> I assume the old URLs will redirect to the new ones, right?
>
Good question, we didn't change any old URLs actually, so if you're trying
to get to other/pagecounts-ez, other/pagecounts-raw and all that, they're
all still there, just linked-to from /analytics. We did it this way
because we figured people had scripts that depended on those URLs. We
thought about moving and symlinking but it's probably unlikely that we'll
ever be able to delete the other/** location.
So mainly we just have a new page where we can do a better job of focusing
on the analytics datasets.
cc-ing our friends in research and wikitech (sorry I forgot initially)
> We're happy to announce a few improvements to Analytics data releases on
> dumps.wikimedia.org:
>
> * We are releasing a new dataset, an estimate of Unique Devices accessing
> our projects [1]
> * We are officially making available a better Pageviews dataset [2]
> * We are deprecating two older pageview statistics datasets
> * We moved Analytics data from /other to /analytics [3]
>
> Details follow:
>
>
> *Unique Devices:* Since 2009, the Wikimedia Foundation used comScore to
> report data about unique web visitors. In January 2016, however, we
> decided to stop reporting comScore numbers [4] because of certain
> limitations in the methodology; these limitations translated into
> misreported mobile usage. We are now ready to replace comScore numbers with
> the Unique Devices Dataset [5][1]. While unique devices does not equal
> unique visitors, it is a good proxy for that metric, meaning that a major
> increase in the number of unique devices is likely to come from an increase
> in distinct users. We understand that counting uniques raises fairly big
> privacy concerns, so we use a very privacy-conscious way to count unique
> devices: it does not include any cookie by which your browser history can
> be tracked [6].
>
> We invite you to explore this new dataset and hope it’s helpful for the
> Wikimedia community in better understanding our projects. This data can
> help measure the reach of Wikimedia projects on the web.
>
> *Pageviews:* This [2] is the best quality data available for counting the
> number of pageviews our projects receive at the article and project level.
> We've upgraded from pagecounts-raw to pagecounts-all-sites, and now to
> pageviews, in order to filter out more spider traffic and measure something
> closer to what we think is a real user viewing content. A short history
> might be useful:
>
> * pagecounts-raw: was maintained by Domas Mituzas originally and taken
> over by the analytics team. It was and still is the most used dataset,
> though it has some major problems. It does not count access to the mobile
> site, it does not filter out spider or bot traffic, and it suffers from
> unknown loss due to logging infrastructure limitations.
> * pagecounts-all-sites: uses the same pageview definition as
> pagecounts-raw, and so also does not filter out spider or bot traffic. But
> it does include access to mobile and zero sites, and is built on a more
> reliable logging infrastructure.
> * pagecounts-ez: is derived from the best data available at the time.
> So until December 2015, it was based on pagecounts-raw and
> pagecounts-all-sites, and now it's based on pageviews. This dataset is
> great because it compresses very large files without losing any
> information, still providing hourly page and project level statistics.
>
> So the new dataset, pageviews, is what's behind our pageview API and is
> now available in static files for bulk download back to May 2015. But having
> multiple ways to download pageview data is confusing for consumers, so
> we're keeping only pageviews and pagecounts-ez and deprecating the other
> two. If you'd like to read more about the current pageview definition,
> details are on the research page [7].
>
> *Deprecating:* We are deprecating the pagecounts-raw and
> pagecounts-all-sites datasets in May 2016 (discussion here:
> https://phabricator.wikimedia.org/T130656 ). This data suffers from many
> artifacts, lack of mobile data, and/or infrastructure problems, and so is
> not comparable to the new way we track pageviews. It will remain here
> because we have historical data that may be useful, but it will not be
> maintained or updated beyond May 2016.
>
> *Clean-up:* Analytics data on dumps was crammed into /other with
> unrelated datasets. We made a new page to receive current and future
> datasets [3] and linked to it from /other and /. Please let us know if
> anything there looks confusing or opaque and I'll be happy to clarify.
>
>
> [1] http://dumps.wikimedia.org/other/unique_devices
> [2] http://dumps.wikimedia.org/other/pageviews
> [3] http://dumps.wikimedia.org/analytics/
> [4] https://meta.wikimedia.org/wiki/ComScore/Announcement
> [5] https://meta.wikimedia.org/wiki/Research:Unique_Devices
> [6]
> https://meta.wikimedia.org/wiki/Research:Unique_Devices#How_do_we_count_uni…
> [7] https://meta.wikimedia.org/wiki/Research:Page_view
>
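For readers who want to work with the bulk pageviews files mentioned in the announcement, here is a minimal Python sketch of aggregating hourly lines per page. It assumes each line carries four whitespace-separated fields (domain code, page title, view count, response size); that layout, the sample lines, and the function name `top_pages` are assumptions for illustration, so please check the dataset documentation before relying on them.

```python
from collections import Counter


def top_pages(lines, domain_code, n=3):
    """Sum view counts per page title for one domain code; return the top n."""
    counts = Counter()
    for line in lines:
        parts = line.split()
        if len(parts) != 4:
            continue  # skip lines that do not match the assumed layout
        code, title, views, _size = parts
        if code == domain_code:
            counts[title] += int(views)
    return counts.most_common(n)


# Hypothetical sample lines, not real data.
sample = [
    "en Main_Page 1200 0",
    "en London 300 0",
    "de Berlin 250 0",
    "en London 50 0",
    "broken line",
]
print(top_pages(sample, "en"))
```

In practice the lines would come from iterating over a decompressed hourly file rather than an in-memory list.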