Out of idle curiosity ...
Are there significant numbers of articles NOT tagged by any WikiProject? In my experience
on-wiki, every article (apart from recently created ones) is tagged by one or more
WikiProjects.
I guess the converse question is: which articles are tagged by the most WikiProjects? I am
often surprised at how many WikiProjects jump in to tag some article I have created (I am
more likely to notice the tagging of articles I create because they automatically go on my
watchlist).
Kerry
-----Original Message-----
From: Wiki-research-l [mailto:wiki-research-l-bounces@lists.wikimedia.org] On Behalf Of
Isaac Johnson
Sent: Thursday, 16 January 2020 6:54 AM
To: Research into Wikimedia content and communities
<wiki-research-l(a)lists.wikimedia.org>
Subject: [Wiki-research-l] New dataset of articles tagged by WikiProjects
Hey Research Community,
TL;DR New dataset:
https://figshare.com/articles/Wikipedia_Articles_and_Associated_WikiProject…
More details:
I wanted to notify everyone that we have published a dataset of the articles on English
Wikipedia that have been tagged by WikiProjects [1] through templates on their associated
talk pages. We are not planning to make this an ongoing release, but I have provided the
script that I used to generate it in the Figshare item so that others might update /
adjust to meet their needs.
As anyone who has done research on WikiProjects knows, it can be complicated to determine
what articles fit under a particular WikiProject's purview. The motivation for
generating this dataset was to support our work in developing topic models for Wikipedia
(see [2] for an overview), but we imagine that there are many other ways in which this
dataset might be useful:
* Previous work has examined how active WikiProjects are based on edits to their pages in
the Wikipedia namespace. This dataset makes it much easier to identify which WikiProjects
are managing the most valuable articles on Wikipedia (in terms of quality or pageviews).
* Many topic-level analyses of Wikipedia rely on the category network. Categories can be
very messy and difficult to work with, but WikiProjects represent an alternative that is
often simpler and still quite rich. For instance, this could be used for temporal analyses
of article quality, demand, or distribution by topic.
* While WikiProjects are English-only and therefore limited in their utility to other
languages, we also provide the Wikidata ID and sitelinks (i.e., titles for corresponding
articles in other languages) to allow for multilingual analyses. This could be used to
compare gaps in coverage, akin to past work that has used categories [3].
The main challenges, besides processing time, are how to 1) effectively extract the
WikiProject templates from talk pages and 2) consistently link them to a canonical
WikiProject name and topic. For example, the canonical template for WikiProject Medicine is
https://en.wikipedia.org/wiki/Template:WikiProject_Medicine but another one used is
https://en.wikipedia.org/w/index.php?title=Template:WPMED&redirect=no (and there are
13 more). To capture articles tagged with any of these templates and link them all to the
same canonical WikiProject and eventually a higher-level topic, we built a near-complete
list of WikiProjects based on the WikiProject Directory [4] and gathered all of their
associated templates. We purposefully excluded WikiProjects under the assistance /
maintenance category [5]. Then, when parsing talk pages from the dump files, we check for
any of these templates and list them under their canonical name. As a backup, we also
employ case-insensitive string matching with "WP" and "WikiProject", which helps to
guarantee that we do not miss any WikiProjects but introduces a number of
false positives as well. If you wish to map the WikiProjects listed in the dataset to
their higher-level topics, the mapping is in the figshare item and code that allows you to
do that can be found here:
https://github.com/wikimedia/drafttopic/blob/master/drafttopic/utilities/ta…
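To make the matching steps above more concrete, here is a minimal Python sketch. It is not
the script from the Figshare item, just an illustration under stated assumptions: the alias
table, the regular expression, and the function name are all hypothetical, and the backup
match is simplified to a case-insensitive prefix check on the template name. Only the two
Medicine template names come from the example above.

```python
import re

# Hypothetical alias table: lowercased template name -> canonical WikiProject.
# A real table would be built from the WikiProject Directory and would hold
# every alias of every project, not just these two.
TEMPLATE_TO_PROJECT = {
    "wikiproject medicine": "WikiProject Medicine",
    "wpmed": "WikiProject Medicine",
}

# Captures the template name that follows "{{", up to the first "|" or "}}".
TEMPLATE_NAME = re.compile(r"\{\{\s*([^|{}]+?)\s*(?:\||\}\})")


def extract_wikiprojects(talk_wikitext):
    """Return (canonical, fallback) sets of names found in talk-page wikitext."""
    canonical, fallback = set(), set()
    for match in TEMPLATE_NAME.finditer(talk_wikitext):
        # Treat underscores as spaces and collapse runs of whitespace.
        raw = " ".join(match.group(1).replace("_", " ").split())
        key = raw.lower()
        if key in TEMPLATE_TO_PROJECT:
            # Known template: report it under its canonical name.
            canonical.add(TEMPLATE_TO_PROJECT[key])
        elif key.startswith(("wp", "wikiproject")):
            # Backup match: catches templates missing from the table, but it
            # also lets through false positives such as wrapper templates.
            fallback.add(raw)
    return canonical, fallback


if __name__ == "__main__":
    sample = "{{WikiProject banner shell|1=\n{{WPMED|class=C}}\n}}"
    print(extract_wikiprojects(sample))
```

In the sample, the wrapper template lands in the fallback set even though it is not a
WikiProject, which is the kind of false positive the backup matching can introduce.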
[1] https://en.wikipedia.org/wiki/Wikipedia:WikiProject_Council
[2] https://dl.acm.org/doi/10.1145/3274290
[3] https://meta.wikimedia.org/wiki/Research:Newsletter/2019/September#Wikipedi…
[4] https://en.wikipedia.org/wiki/Wikipedia:WikiProject_Council/Directory
[5] https://en.wikipedia.org/wiki/Wikipedia:WikiProject_Council/Directory/Wikip…
Best,
Isaac
--
Isaac Johnson (he/him/his) -- Research Scientist -- Wikimedia Foundation