Hey Research Community,
TL;DR New dataset:
https://figshare.com/articles/Wikipedia_Articles_and_Associated_WikiProject…
More details:
I wanted to notify everyone that we have published a dataset of the
articles on English Wikipedia that have been tagged by WikiProjects [1]
through templates on their associated talk pages. We are not planning to
make this an ongoing release, but I have included the script that I used to
generate it in the Figshare item so that others can update or adjust it to
meet their needs.
As anyone who has done research on WikiProjects knows, it can be
complicated to determine what articles fit under a particular WikiProject's
purview. The motivation for generating this dataset was to support our work
in developing topic models for Wikipedia (see [2] for an overview), but we
imagine that there are many other ways in which this dataset might be
useful:
* Previous work has examined how active WikiProjects are based on edits to
their pages in the Wikipedia namespace. This dataset makes it much easier
to identify which WikiProjects are managing the most valuable articles on
Wikipedia (in terms of quality or pageviews).
* Many topic-level analyses of Wikipedia rely on the category network.
Categories can be very messy and difficult to work with, but WikiProjects
represent an alternative that often is simpler and still quite rich. For
instance, this could be used for temporal analyses of article quality,
demand, or distribution by topic.
* While WikiProjects are English-only and therefore limited in their
utility to other languages, we also provide the Wikidata ID and sitelinks
-- i.e. titles for corresponding articles in other languages -- to allow
for multilingual analyses. This could be used to compare gaps in coverage
-- e.g., akin to past work that has used categories [3].
The main challenge, besides processing time, is twofold: 1) effectively
extracting the WikiProject templates from talk pages, and 2) consistently
linking them to a canonical WikiProject name and topic. For example, the
canonical template for WikiProject Medicine is
https://en.wikipedia.org/wiki/Template:WikiProject_Medicine but another one
used is
https://en.wikipedia.org/w/index.php?title=Template:WPMED&redirect=no (and
there are 13 more). To capture articles tagged with any of these templates
and link them all to the same canonical WikiProject and, eventually, a
higher-level topic, we built a near-complete list of WikiProjects based on
the WikiProject Directory [4] and gathered all of their associated
templates. We purposefully excluded WikiProjects under the assistance /
maintenance category [5]. Then, when parsing talk pages from the dump files,
we check for any of these templates and list them under their canonical
name. As a backup, we also employ case-insensitive string matching with
"WP" and "WikiProject", which helps ensure that we do not miss any
WikiProjects but also introduces a number of false positives. If you
wish to map the WikiProjects listed in the dataset to their higher-level
topics, the mapping is in the Figshare item, and code that allows you to do
that can be found here:
https://github.com/wikimedia/drafttopic/blob/master/drafttopic/utilities/ta…
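For readers who want a feel for the matching logic described above, here is a
minimal sketch in Python. It is not the script from the Figshare item: the
alias table, the regular expression, and the function names are all
hypothetical, the table holds only the two Medicine templates named in this
email, and the "starts with WP or WikiProject" rule is just one assumed way to
implement the backup string matching.

```python
import re

# Hypothetical alias table: normalized template name -> canonical WikiProject.
# Only the two Medicine templates mentioned in the email are listed here;
# a real table would cover every redirect for every WikiProject.
TEMPLATE_TO_CANONICAL = {
    "wikiproject medicine": "WikiProject Medicine",
    "wpmed": "WikiProject Medicine",
}

# Captures the template name that follows "{{", up to the first "|" or "}}".
TEMPLATE_NAME = re.compile(r"\{\{\s*([^|{}]+?)\s*(?=\||\}\})")


def normalize(name):
    """Underscores to spaces, collapse whitespace, lowercase (a simplification)."""
    return " ".join(name.replace("_", " ").split()).lower()


def extract_wikiprojects(talk_wikitext):
    """Return (canonical matches, fallback matches) for one talk page."""
    matched = set()
    fallback = set()
    for raw in TEMPLATE_NAME.findall(talk_wikitext):
        name = normalize(raw)
        if name in TEMPLATE_TO_CANONICAL:
            # Known template or redirect: record the canonical name.
            matched.add(TEMPLATE_TO_CANONICAL[name])
        elif name.startswith(("wp", "wikiproject")):
            # Backup string match: catches unknown projects, but also
            # false positives such as wrapper templates.
            fallback.add(" ".join(raw.replace("_", " ").split()))
    return matched, fallback


if __name__ == "__main__":
    sample = (
        "{{Talk header}}\n"
        "{{WikiProject banner shell|1=\n"
        "{{WPMED|class=B}}\n"
        "{{WikiProject Biology|class=B}}\n"
        "}}"
    )
    print(extract_wikiprojects(sample))
```

In the sample, the WPMED template resolves to WikiProject Medicine through the
alias table, while the banner shell wrapper is picked up only by the backup
rule, which illustrates the kind of false positive mentioned above.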
[1] https://en.wikipedia.org/wiki/Wikipedia:WikiProject_Council
[2] https://dl.acm.org/doi/10.1145/3274290
[3] https://meta.wikimedia.org/wiki/Research:Newsletter/2019/September#Wikipedi…
[4] https://en.wikipedia.org/wiki/Wikipedia:WikiProject_Council/Directory
[5] https://en.wikipedia.org/wiki/Wikipedia:WikiProject_Council/Directory/Wikip…
Best,
Isaac
--
Isaac Johnson (he/him/his) -- Research Scientist -- Wikimedia Foundation