I could use a little help with understanding these dumps:
I'm trying to verify the claim that ENWP is the world's largest open text
project, and to do that I need to verify that ENWP is larger than English
Wikisource. Which files should I be comparing?
Are there any other projects that could make a claim to be a larger open
text project than ENWP? Perhaps there's a library somewhere that has such a
huge volume of out-of-copyright materials that the combined bytes of
published text are larger than ENWP?
The next Research showcase will be live-streamed this Wednesday (tomorrow),
September 16 at 11:30 PST. The streaming link is:
As usual, you can join the conversation on IRC at #wikimedia-research.
We look forward to seeing you!
Morten Warncke-Wang will talk about the misalignment between production and
consumption of quality content on Wikipedia, and Besnik Fetahu proposes a
news-article suggestion task to improve news coverage in Wikipedia.
*Fun or Functional? The Misalignment Between Content Quality and Popularity
By Morten Warncke-Wang
In peer production communities like Wikipedia, individual community members
typically decide for themselves where to make contributions, often driven
by factors such as “fun” or a belief that “information should be free”.
However, the extent to which this bottom-up, interest-driven content
production paradigm meets the needs of consumers of this content is unclear.
In this talk, I analyse four large Wikipedia language editions, finding
extensive misalignment between production and consumption of quality
content in all of them, and I show how this greatly impacts Wikipedia’s
readers. I also examine misalignment in more detail by studying how it
relates to specific topics, and to what extent high popularity is related
to sudden changes in demand (i.e. “breaking news”). Finally, I discuss
technologies and community practices that can help reduce misalignment in
*Automated News Suggestions for Populating Wikipedia Entity Pages*
By Besnik Fetahu
Wikipedia entity pages are a valuable source of information for direct
consumption and for knowledge-base construction, update and maintenance.
Facts in these entity pages are typically supported by references. Recent
studies show that as much as 20% of the references are from online news
sources. However, many entity pages are incomplete even if relevant
information is already available in existing news articles. Even for the
already present references, there is often a delay between the news article
publication time and the reference time. In this work, we therefore look at
Wikipedia through the lens of news and propose a novel news-article
suggestion task to improve news coverage in Wikipedia, and reduce the lag
of newsworthy references. Our work finds direct application, as a
precursor, to Wikipedia page generation and knowledge-base acceleration
tasks that rely on relevant and high quality input sources. We propose a
two-stage supervised approach for suggesting news articles to entity pages
for a given state of Wikipedia. First, we suggest news articles to Wikipedia
entities (article-entity placement) relying on a rich set of features which
take into account the salience and relative authority of entities, and the
novelty of news articles to entity pages. Second, we determine the exact
section in the entity page for the input article (article-section
placement) guided by class-based section templates. We perform an extensive
evaluation of our approach based on ground-truth data that is extracted
from external references in Wikipedia. We achieve a high precision value of
up to 93% in the article-entity suggestion stage and up to 84% for the
article-section placement. Finally, we compare our approach against
competitive baselines and show significant improvements.
I just filed a Request for Comments on Meta to enable Flow -- an improved
discussion system -- in the Research namespace on meta. See
This is meant to replace the talk-page based discussion mechanism that is
currently in use.
To try out a Flow board, see https://www.mediawiki.org/wiki/Talk:Sandbox
Please leave your comments at the bottom of the Meta page I linked to
above. Here's a copy-paste for the lazy:
I propose that Flow <https://meta.wikimedia.org/wiki/Flow> be enabled on
the Research_talk namespace. This namespace is used to document and list
projects involving original research of Wikimedia projects and
technologies. See Research:Index
I have been working on developing the research community around Wikimedia
wikis as well as performing my own research projects. As a result, I have
been very active in the "Research" namespace writing new content and
discussing projects with others. It's clear to me that our work in this
space will benefit from adopting Flow for discussions. Many researchers of
Wikimedia projects do not engage on-wiki often and they find Flow to be
more intuitive than talk pages. Flow topics are also much easier to track
with notifications (a benefit I hope to reap). Further, the "Research"
namespace content on meta is still new enough -- and under active
development -- that we are poised to adapt to Flow-based conversation
Note that I am the most active editor in this namespace. I have written our
core templates (*e.g.*, Template:Research project
pages (*e.g.*, Category:Research projects
documentation (*e.g.*, Research:New project
<https://meta.wikimedia.org/wiki/Research:New_project>). So this change
will primarily affect me and those I work most closely with. As I have
been developing templates and other mechanisms to support navigation in
this namespace, I accept responsibility for making sure that workflow
problems around such a transition are resolved. After all, given that I'm
the most active editor in this namespace, most of those problems will be my
Note that I am proposing this under my volunteer account, but I also work
as a senior research scientist for the Wikimedia Foundation. See User:Halfak
(WMF) <https://meta.wikimedia.org/wiki/User:Halfak_(WMF)>. --EpochFail
<https://meta.wikimedia.org/wiki/User_talk:EpochFail>) 17:36, 8 September
I know there's been lots of research (well, some) about why and how people
read Wikipedia, but have there been any significant studies or research
about why they don't edit? Not why they stopped editing or their numbers
are dropping or how they got sick of the bureaucracy and markup code, but
what barriers might exist to them ever making an edit in the first place.
I noticed after HTTPS was enabled by default that there were many fewer
spambots on one of the wikis that I monitor for recent changes. Did anyone
else notice a decline in spambots after HTTPS was enabled?
This may be relevant to discussions about the highly active editor stats.
While I doubt that spambots and vandals succeed in getting to 100 edits on
the larger Wikipedias very often, rollbackers might. Additionally, a
reduction in spambots and spambot-related rollbacks might affect the number
of new accounts registered and the number of edits per month stats.