We’re glad to announce the release of an aggregate clickstream dataset extracted from English Wikipedia:
http://dx.doi.org/10.6084/m9.figshare.1305770
This dataset contains counts of (referer, article) pairs aggregated from the HTTP request logs of English Wikipedia. This snapshot captures 22 million (referer, article) pairs from a total of 4 billion requests collected during the month of January 2015.
This data can be used for various purposes:
• determining the most frequent links people click on for a given article
• determining the most common links people followed to an article
• determining what share of the total traffic to an article went on to click a given link in that article
• generating a Markov chain over English Wikipedia
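As a rough illustration of the uses listed above, here is a minimal Python sketch. It assumes the data has already been loaded as (referer, article, count) tuples. The sample rows, the "external-search" referer label and all function names are hypothetical and are not taken from the release itself, whose file layout and column names may differ.

```python
from collections import defaultdict

# Hypothetical sample rows in (referer, article, count) form.
# The real dataset's layout and referer labels may differ.
rows = [
    ("London", "River_Thames", 120),
    ("London", "Big_Ben", 80),
    ("England", "London", 300),
    ("external-search", "London", 700),
    ("Big_Ben", "London", 50),
]


def top_outgoing(rows, referer, n=10):
    """Most frequently clicked targets reached from `referer`."""
    pairs = [(art, cnt) for ref, art, cnt in rows if ref == referer]
    return sorted(pairs, key=lambda p: p[1], reverse=True)[:n]


def top_incoming(rows, article, n=10):
    """Most common referers that led readers to `article`."""
    pairs = [(ref, cnt) for ref, art, cnt in rows if art == article]
    return sorted(pairs, key=lambda p: p[1], reverse=True)[:n]


def click_through_share(rows, article, link):
    """Share of all recorded traffic to `article` that clicked on `link`."""
    incoming = sum(cnt for _, art, cnt in rows if art == article)
    clicks = sum(cnt for ref, art, cnt in rows
                 if ref == article and art == link)
    return clicks / incoming if incoming else 0.0


def transition_probabilities(rows):
    """Row-normalise counts into first-order transition probabilities.

    Normalises over observed outgoing clicks only, so readers who
    leave the site without clicking are not modelled.
    """
    totals = defaultdict(int)
    for ref, _, cnt in rows:
        totals[ref] += cnt
    probs = defaultdict(dict)
    for ref, art, cnt in rows:
        probs[ref][art] = cnt / totals[ref]
    return probs


if __name__ == "__main__":
    print(top_outgoing(rows, "London"))
    print(top_incoming(rows, "London"))
    print(click_through_share(rows, "London", "River_Thames"))
    print(dict(transition_probabilities(rows)))
```

For the full 22 million pairs, the same logic would probably be better expressed with a dataframe or database query than with plain lists; the functions here only show the shape of each computation.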
We created a page on Meta for feedback and discussion about this release: https://meta.wikimedia.org/wiki/Research_talk:Wikipedia_clickstream
Ellery and Dario
Cross-posting this request to wiki-research-l. Anyone have data on
frequently used section titles in articles (any language), or know of
datasets/publications that examined this?
I'm not aware of any off the top of my head, Amir.
- Jonathan
---------- Forwarded message ----------
From: Amir E. Aharoni <amir.aharoni(a)mail.huji.ac.il>
Date: Sat, Jul 11, 2015 at 3:29 AM
Subject: [Wikitech-l] statistics about frequent section titles
To: Wikimedia developers <wikitech-l(a)lists.wikimedia.org>
Hi,
Did anybody ever try to collect statistics about frequent section titles in
Wikimedia projects?
For Wikipedia, for example, titles such as "Biography", "Early life",
"Bibliography", "External links", "References", "History", etc., appear in
a lot of articles, and their counterparts appear in a lot of languages.
There are probably similar things in Wikivoyage, Wiktionary and possibly
other projects.
Did anybody ever try to collect statistics of the most frequent section
titles in each language and project?
--
Amir Elisha Aharoni · אָמִיר אֱלִישָׁע אַהֲרוֹנִי
http://aharoni.wordpress.com
“We're living in pieces,
I want to live in peace.” – T. Moore
_______________________________________________
Wikitech-l mailing list
Wikitech-l(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
--
Jonathan T. Morgan
Senior Design Researcher
Wikimedia Foundation
User:Jmorgan (WMF) <https://meta.wikimedia.org/wiki/User:Jmorgan_(WMF)>
Hi all,
Some of us plan to have a conversation at the WCONUSA unconference sessions
about ENWP culture. Are there any recommended readings that you could
suggest as preparation, particularly on the subject of how to reinforce or
incentivize desirable user behavior? I think that Jonathan may have done
some research on this topic for the Teahouse, and Ocassi may have done
research for TWA. I'm interested in applicable research as preparation both
for the unconference discussion and for my planned video series that
intends to inform and inspire new editors.
Thanks,
Pine
Hi everybody,
We’re preparing for the October 2015 research newsletter and looking for contributors. Please take a look at: https://etherpad.wikimedia.org/p/WRN201510 and add your name next to any paper you are interested in covering. Our target publication date is Wednesday October 28 UTC. As usual, short notes and one-paragraph reviews are most welcome.
Highlights from this month:
Use and awareness of Wikipedia among the M.C.A students of C. D. Jain college of commerce, Shrirampur : A Study
Understanding Editing Behaviors in Multilingual Wikipedia
The Impact and Evolution of Group Diversity in Online Open Collaboration
Teaching Wikipedia: The Pedagogy and Politics of an Open Access Writing Community
“An Encyclopedia, Not an Experiment in Democracy”: Wikipedia Biographies, Authorship, and the Wikipedia Subject
“You get what you need”: A study of students’ attitudes towards using Wikipedia when doing school assignments
Machine Learning and the Detection of Anomalies in Wikipedia
"Collective remembering of organizations: Co-construction of organizational pasts in Wikipedia"
Top 100 historical figures of Wikipedia
Towards a Class-Based Model of Information Organization in Wikipedia
Transparency, Control, and Content Generation on Wikipedia: Editorial Strategies and Technical Affordances
Influence of Wikipedia and other web resources on acute and critical care decisions. A web-based survey
Utilising Wikipedia for text mining applications
Automatic Identification and Disambiguation of Concepts and Named Entities in the Multilingual Wikipedia
Beyond Friendships and Followers: The Wikipedia Social Network
Intellectual Interchanges in the History of Massive Online Open-editing Encyclopedia, Wikipedia
Exploration of Online Culture Through Network Analysis of Wikipedia
Cyberpsychology, Behavior, and Social Networking
Measuring Article Quality in Wikipedia using the Collaboration Network
How do Twitter, Wikipedia, and Harrison's principles of medicine describe heart attacks?
Sociotechnical interaction at work: an ethnographic study of the Wikipedia community
Wikipedia and history: a worthwhile partnership in the digital era?
If you have any questions about the format or process, feel free to get in touch off-list.
Masssly, Tilman Bayer and Dario Taraborelli
[1] http://meta.wikimedia.org/wiki/Research:Newsletter
*Extended Deadline: November 20, 2015*
*CFP: Semantic Web Journal - Special Issue on Quality Management of
Semantic Web Assets (Data, Services and Systems)*
http://www.semantic-web-journal.net/blog/call-papers-special-issue-quality-…
Submission guidelines
*Deadline: November 20, 2015* (extended from October 31, 2015)
Submissions shall be made through the Semantic Web journal website at
http://www.semantic-web-journal.net. Prospective authors must take
notice of the submission guidelines posted at
http://www.semantic-web-journal.net/authors. Note that you need to
request an account on the website for submitting a paper. Please
indicate in the cover letter that it is for the Special Issue on Quality
Management of Semantic Web Assets (Data, Services and Systems).
Submissions are possible in the following categories: full research
papers, application reports, reports on tools and systems, and case
studies. While there is no upper limit, paper length must be justified
by content.
Guest editors
* Amrapali Zaveri, University of Leipzig, AKSW Group, Germany
* Dimitris Kontokostas, University of Leipzig, AKSW Group, Germany
* Sebastian Hellmann, University of Leipzig, AKSW Group, Germany
* Jürgen Umbrich, Vienna University of Economics and Business, Austria
*Overview and Topics*
The standardization and adoption of Semantic Web technologies has
resulted in a variety of assets, including an unprecedented volume of
data being semantically enriched and systems and services, which consume
or publish this data. Although gathering, processing and publishing data
is a step towards further adoption of Semantic Web, quality does not yet
play a central role in these assets (e.g., data lifecycle,
system/service development).
Quality management essentially refers to the activities and tasks involved
in guaranteeing a certain level of consistency and meeting the quality
requirements for the assets. In general, quality management consists of
the following four phases and components: (i) quality planning, (ii)
quality control, (iii) quality assurance and (iv) quality improvement.
The quality planning phase in the Semantic Web typically involves the
design of procedures, strategies and policies to support the management
of the assets. The quality control and assurance components aim
primarily to prevent errors and to meet quality requirements
pertaining to the Semantic Web standards. Quality assessment methods,
which provide the necessary input for the controlling and assurance
tasks, are a core part of both components.
Quality assessment of Semantic Web Assets (data, services and systems),
in particular, presents new challenges that were not handled before in
other research areas. Thus, adopting existing approaches for data
quality assessment is not a straightforward solution. These challenges
are related to the openness of the Semantic Web, the diversity of the
information and the unbounded, dynamic set of autonomous data sources,
publishers and consumers (legal and software agents). Additionally,
detecting the quality of available data sources and making the
information explicit is yet another challenge. Moreover, noise in one
data set, or missing links between different data sets, propagates
throughout the Web of Data, and imposes great challenges on the data
value chain.
In case of systems and services, different implementations follow the
specifications for RDF and SPARQL to varying extents, or even propose
and offer new, non-standardized extensions. This causes strong
incompatibilities between systems, e.g., between the SPARQL features
used in query engines and the features supported in RDF stores. The
potential heterogeneity and incompatibility pose several challenges for
the quality assessments in and for such systems and services.
Eventually, quality improvement methods are used to further enhance the
value of the Semantic Web Assets. One important step to improve the
quality of data is identifying the root cause of the problem and then
designing corresponding data improvement solutions. These solutions
select the most effective and efficient strategies and related set of
techniques and tools to improve quality. Quality improvement metrics for
products and services entail understanding and improving operational
processes and establishing valid and reliable service performance measures.
This Special Issue is addressed to those members of the community
interested in providing novel methodologies or frameworks in managing,
assessing, monitoring, maintaining and improving the quality of the
Semantic Web data, services and systems, and in introducing tools and
user interfaces which can effectively assist in this management.
Topics of Interest
We welcome original high quality submissions on (but are not restricted
to) the following topics:
* Methodologies and frameworks to plan, control, assure or improve the
quality of Semantic Web Assets
* Quality exploration and analysis interfaces
* Quality monitoring
* Developing, deploying and managing quality service ecosystems
* Assessing the quality evolution of Semantic Web Assets
* Large-scale quality assessment of structured datasets
* Crowdsourcing data quality assessment
* Quality assessment leveraging background knowledge
* Use-case driven quality management
* Evaluation of trustworthiness of data
* Web Data and LOD quality benchmarks
* Data Quality improvement methods and frameworks, e.g., linkage,
alignment, cleaning, enrichment, correctness
* Service/system quality improvement methods and frameworks
* Managing sustainability issues in services
* Guarantee of service (availability, performance)
* Systems for transparent management of open data
Hey,
I was wondering if anyone on the list had any contacts with Norwegian
academics doing research on Wikipedia, particularly from a gender gap
perspective?
Sincerely,
Laura Hale
--
twitter: purplepopple
Hi everyone,
The next Research showcase is completely dedicated to Teahouse. :-) It will
be live-streamed this Wednesday, October 21 at 18:30 (UTC). The streaming
link is:
http://www.youtube.com/watch?v=T73vRiNsRxo
As usual, you can join the conversation on IRC at #wikimedia-research. And,
you can watch our past research showcases here
<https://www.mediawiki.org/wiki/Wikimedia_Research/Showcase#Archive>.
We look forward to seeing you!
Leila
This month
The impact of the Wikipedia Teahouse on new editor retention
By *Jonathan Morgan, Aaron Halfaker*
New Wikipedia editors face a variety of social and technical barriers to
participation. These barriers have been shown to cause even promising,
highly-motivated newcomers to give up and leave Wikipedia shortly after
joining. The Wikipedia Teahouse was launched in 2012 to provide new editors
with a space on Wikipedia where they could ask questions, introduce
themselves, and learn the ropes of editing in a friendly and supportive
environment, with the goal of increasing the percentage of good-faith
newcomers who go on to become productive Wikipedians. Research has shown
that the Teahouse provided a positive experience for participants, and
suggested that participating in the Teahouse led to more editing activity
and longer survival for new editors who participated. The current study
examines the impact of Teahouse invitations on new editors' survival over a
longer period of time (2-6 months), and presents findings related to
contextual factors within editors' first few sessions that are associated
with overall survival rate and editing patterns associated with increased
likelihood of visiting the Teahouse.
Call for papers: Special Issue on
Mining Social Semantics on the Social Web
In recent years the amount of data available on the social web has
grown massively. Consequently, researchers have developed approaches
that leverage this social web data to tackle interesting challenges of
the semantic web. Among them are methods for learning ontologies from
social media or crowdsourcing, extracting semantics from data
collected by citizen science and participatory sensing initiatives, or
for better understanding and describing user activities.
The rich data provided by the social web can be used to learn and
construct the semantic web. This can be facilitated by learning basic
semantic relationships, e.g., between entities, or by employing more
sophisticated methods that are able to construct a complete knowledge
graph or ontology. Other methods enrich content from the social web
and link it to the semantic web.
The proposed special issue is open to all submissions that utilize
data from the social web a) with the help of semantic web
technologies, b) for inferring and extracting semantics, or c) for
enriching and linking content with/to the semantic web or existing
knowledge structures like the linked open data cloud. Any kind of data
can be utilized as long as it has a connection with the social web,
e.g., tags from Flickr, tweets from Twitter, check-ins from
Foursquare, articles from Wikipedia, shared mobile sensor data, data
from participatory mapping, crowd-sourced data, etc. Examples include
approaches for inferring the semantics of tags, extracting semantics
from Wikipedia articles, or enriching tweets with named entities. The
resulting knowledge can be integrated into structures like the linked
open data cloud.
== Topics of Interest ==
We welcome original high quality submissions on (but are not
restricted to) the following topics:
- linked open data and the social web
- machine learning for the semantic web on social web data
- semantic enrichment (e.g., sentiment detection, polarity, named
entity recognition, ...) of user-generated texts (e.g., Wikipedia
articles, tweets, blogs, ...)
- extraction and modelling of arguments and discourse
- never-ending language learning from user-generated content
- ontology learning from user-generated content
- semantics of social tagging (e.g., inferring semantics of tags,
identifying relationships between tags, learning ontologies from tags,
...)
- mining Wikipedia (e.g., extracting semantics from articles, semantic
enrichment of articles, inter-language analyses, mining the Wikipedia
category graph, ...)
- temporal and spatial semantics of content from the social web
- inferring semantics from user data, usage logs, mobile sensing, ...
- extracting location-based semantics from Foursquare, OpenStreetMap, ...
- leveraging crowd-sourcing for the semantic web
- capturing the semantics of user interactions
- inferring semantics from user data and usage logs
== Submissions ==
31 January 2016 - Paper submission deadline
Submissions shall be made through the Semantic Web journal website at
http://www.semantic-web-journal.net/. Prospective authors must take
notice of the submission guidelines posted at
http://www.semantic-web-journal.net/authors. Note that you need to
request an account on the website for submitting a paper. Please
indicate in the cover letter that it is for the special issue on
Mining Semantics in/from the Social Web.
Submissions are possible as full research papers or surveys. While
there is no upper limit, the paper length must be justified by content.
== Important Dates ==
- Call for papers: September 2015
- Submission deadline: 31 January 2016
- Notification: 31 March 2016
== Guest editors ==
Please use the e-mail address social-semantic-issue(a)l3s.de for inquiries.
- Andreas Hotho, University of Würzburg, Germany
- Robert Jäschke, L3S Research Center, Germany
- Kristina Lerman, University of Southern California, United States
<http://www.semantic-web-journal.net/blog/call-papers-special-issue-mining-s…>
--
Prof. Dr. Robert Jäschke
L3S Research Center/Leibniz University Hannover
http://www.kbs.uni-hannover.de/~jaeschke/
+49-(0)511-762-17775
<<<<< please participate: http://researchersontwitter.appspot.com/ >>>>>