Hey Research Community,
TL;DR New dataset: https://figshare.com/articles/Wikipedia_Articles_and_Associated_WikiProject_...
More details:
I wanted to notify everyone that we have published a dataset of the articles on English Wikipedia that have been tagged by WikiProjects [1] through templates on their associated talk pages. We are not planning to make this an ongoing release, but I have provided the script that I used to generate it in the Figshare item so that others might update / adjust to meet their needs.
As anyone who has done research on WikiProjects knows, it can be complicated to determine what articles fit under a particular WikiProject's purview. The motivation for generating this dataset was to support our work in developing topic models for Wikipedia (see [2] for an overview), but we imagine that there are many other ways in which this dataset might be useful:
* Previous work has examined how active WikiProjects are based on edits to their pages in the Wikipedia namespace. This dataset makes it much easier to identify which WikiProjects are managing the most valuable articles on Wikipedia (in terms of quality or pageviews).
* Many topic-level analyses of Wikipedia rely on the category network. Categories can be very messy and difficult to work with, but WikiProjects represent an alternative that often is simpler and still quite rich. For instance, this could be used for temporal analyses of article quality, demand, or distribution by topic.
* While WikiProjects are English-only and therefore limited in their utility to other languages, we also provide the Wikidata ID and sitelinks -- i.e. titles for corresponding articles in other languages -- to allow for multilingual analyses. This could be used to compare gaps in coverage -- e.g., akin to past work that has used categories [3].
The main challenges, besides processing time, are 1) extracting the WikiProject templates from talk pages effectively and 2) linking them consistently to a canonical WikiProject name and topic. For example, the canonical template for WikiProject Medicine is https://en.wikipedia.org/wiki/Template:WikiProject_Medicine but another one in use is https://en.wikipedia.org/w/index.php?title=Template:WPMED&redirect=no (and there are 13 more).
To capture articles tagged with any of these templates and link them all to the same canonical WikiProject, and eventually to a higher-level topic, we built a near-complete list of WikiProjects based on the WikiProject Directory [4] and gathered all of their associated templates. We purposefully excluded WikiProjects under the assistance / maintenance category [5]. When parsing talk pages from the dump files, we check for any of these templates and list them under their canonical name. As a backup, we also employ case-insensitive string matching with "WP" and "WikiProject", which helps ensure that we did not miss any WikiProjects but also introduces a number of false positives.
If you wish to map the WikiProjects listed in the dataset to their higher-level topics, the mapping is in the Figshare item, and code that allows you to do that can be found here: https://github.com/wikimedia/drafttopic/blob/master/drafttopic/utilities/tax...
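To make the matching step described above concrete, here is a minimal sketch in Python. It is not the script from the Figshare item. The alias table, the regular expression, and the function names are all assumptions made for illustration, and only the two Medicine templates come from the message itself. The message does not say exactly how the backup string match is applied, so this sketch assumes it is run against template names, with "WP" treated as a prefix.

```python
import re

# Hypothetical alias table: normalized template name -> canonical WikiProject.
# Only these two Medicine templates are named in the message; the real list
# was gathered from the WikiProject Directory.
TEMPLATE_TO_PROJECT = {
    "wikiproject medicine": "WikiProject Medicine",
    "wpmed": "WikiProject Medicine",
}

# Captures the template name that follows "{{", up to "|", a brace, or a newline.
TEMPLATE_NAME = re.compile(r"\{\{\s*([^|{}\n]+)")


def normalize(name):
    """Lowercase a template name, treat "_" as a space, drop a "Template:" prefix."""
    name = " ".join(name.replace("_", " ").split()).lower()
    if name.startswith("template:"):
        name = name[len("template:"):].strip()
    return name


def wikiprojects_on_talk_page(wikitext):
    """Return (canonical projects, unmapped backup matches) for one talk page."""
    canonical, fallback = set(), set()
    for match in TEMPLATE_NAME.finditer(wikitext):
        name = normalize(match.group(1))
        if name in TEMPLATE_TO_PROJECT:
            canonical.add(TEMPLATE_TO_PROJECT[name])
        elif name.startswith("wp") or "wikiproject" in name:
            # Backup string match: catches unlisted templates,
            # but can also produce false positives.
            fallback.add(name)
    return canonical, fallback


talk = "{{WikiProject banner shell|1=\n{{WPMED|class=B}}\n{{WP Foo|class=C}}\n}}"
canonical, fallback = wikiprojects_on_talk_page(talk)
print(sorted(canonical), sorted(fallback))
```

In this toy input, the wrapper template "WikiProject banner shell" lands in the backup set even though it is not a WikiProject, which is the kind of false positive the string match can introduce.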
[1] https://en.wikipedia.org/wiki/Wikipedia:WikiProject_Council
[2] https://dl.acm.org/doi/10.1145/3274290
[3] https://meta.wikimedia.org/wiki/Research:Newsletter/2019/September#Wikipedia...
[4] https://en.wikipedia.org/wiki/Wikipedia:WikiProject_Council/Directory
[5] https://en.wikipedia.org/wiki/Wikipedia:WikiProject_Council/Directory/Wikipe...
Best,
Isaac
Out of idle curiosity ...
Are there significant numbers of articles NOT tagged by any WikiProject? In my experience on-wiki, nearly every article (apart from recently created ones) is tagged by one or more WikiProjects.
I guess the converse question is: which articles are tagged by the most WikiProjects? I am often surprised at how many WikiProjects jump in to tag an article I have created (I am more likely to notice the tagging of articles I create because they automatically go on my watchlist).
Kerry
-----Original Message-----
From: Wiki-research-l [mailto:wiki-research-l-bounces@lists.wikimedia.org] On Behalf Of Isaac Johnson
Sent: Thursday, 16 January 2020 6:54 AM
To: Research into Wikimedia content and communities wiki-research-l@lists.wikimedia.org
Subject: [Wiki-research-l] New dataset of articles tagged by WikiProjects
-- Isaac Johnson (he/him/his) -- Research Scientist -- Wikimedia Foundation
_______________________________________________
Wiki-research-l mailing list
Wiki-research-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
Good question -- in our experience, almost every article on English Wikipedia is tagged with at least one WikiProject. There are 5,926,244 articles in the dataset with associated WikiProjects, and while I don't remember how many articles were in the December dump on which this was based, it was fewer than 6M. Many WikiProjects, though, could reasonably tag many more articles than they do, so while almost all English Wikipedia articles are represented here, I would not say that the WikiProject (and, by extension, topic) tags are complete.
--Isaac
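As a rough check on the figures above, the following sketch turns them into a lower bound on coverage and shows one way the two questions (untagged articles, most-tagged articles) could be asked of the data. The record layout and the article titles are hypothetical, since the dataset's actual schema is not described in this thread. The 6,000,000 total is only the upper bound mentioned in the message.

```python
tagged_articles = 5_926_244   # figure quoted in the message
dump_upper_bound = 6_000_000  # "fewer than 6M", so an upper bound on the total

# Dividing by an upper bound on the total gives a lower bound on coverage.
min_coverage = tagged_articles / dump_upper_bound
print(f"at least {min_coverage:.1%} of articles carry a WikiProject tag")

# Hypothetical records: (article title, list of canonical WikiProjects).
records = [
    ("Article A", ["WikiProject Medicine", "WikiProject Biography"]),
    ("Article B", ["WikiProject Biography"]),
    ("Article C", []),
]

# Number of WikiProject tags per article.
tag_counts = {title: len(projects) for title, projects in records}

untagged = [title for title, n in tag_counts.items() if n == 0]
most_tagged = max(tag_counts, key=tag_counts.get)
print(untagged, most_tagged)
```

The bound only holds if every article in the dataset also appears in the dump, which seems likely given that the dataset was generated from it.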
On Wed, Jan 15, 2020 at 3:13 PM Kerry Raymond kerry.raymond@gmail.com wrote:
> Are there significant numbers of articles NOT tagged by any WikiProject?
Hi Kerry,
I suspect the answer is likely to differ depending on whether you count articles tagged by people involved in a particular WikiProject, articles tagged into a WikiProject by new page patrollers and other taggers, or articles tagged into all of their relevant WikiProjects.
The backlog of articles not allocated to any WikiProject at all is usually pretty small. But that doesn't mean that all articles are fully tagged for WikiProjects.
A few years ago I was involved in a major cleanup operation for our then backlog of unsourced biographies of living people. One of our tactics was to create reports for each relevant WikiProject showing the unsourced biographies that were relevant to them, and to encourage them to help delete or improve the articles in that report. At one point in the cleanup we realised that only about half of the articles we were looking at were tagged to any WikiProject other than Biography. So a group of volunteers went through what I think was twenty thousand unsourced biographies that were only tagged to WikiProject Biography and tagged them to the relevant WikiProjects; that usually meant at least one geographic project and one occupational one. Some of the WikiProjects were assiduous in improving all the articles we found for them; Heavy Metal I remember being very efficient. Others just trawled through and nominated the unnotables and hoaxes for deletion; WikiProject Croatia was one of those.
With a large proportion of WikiProjects dormant at any one time, I rather suspect that most of the tagging for WikiProjects is in effect a subset of the categorisation process rather than a sign that someone interested in the topic has tagged the article for their WikiProject.
Regards
Jonathan