Hi all,
[If you are not interested in discussions about the category system on English Wikipedia, you can stop here. :)]
We have run into a problem that some of you may have thought about or addressed before. We are trying to clean up the category system on English Wikipedia by turning the category structure into an IS-A hierarchy. (The output of this work can be useful for the research on template recommendation [1], for example, but the use cases won't stop there.) One issue we are facing is the following:
We currently use SQL dumps to extract the categories associated with every article on English Wikipedia (main namespace). [2] Using this approach, we get 5 categories associated with the Flow cytometry bioinformatics article [3]:
Flow_cytometry
Bioinformatics
Wikipedia_articles_published_in_peer-reviewed_literature
Wikipedia_articles_published_in_PLOS_Computational_Biology
CS1_maint:_Multiple_names:_authors_list
The problem is that we are only interested in the first two categories. We have one cleaning step in which we keep only categories that belong to category Article; that step removes the last category above, but the two Wikipedia_... categories remain. We need to somehow prune those two categories from the data.
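To make the cleaning step concrete, here is a minimal sketch of one way it could work, assuming "belongs to category Article" means "is reachable from that root category by following subcategory links". The function names, the root name "Articles", and the toy edges are made up for illustration; the (category, parent) pairs have the same shape as the output of the second query in [2].

```python
from collections import defaultdict, deque


def descendants(edges, root):
    """Return root plus every category reachable from it via parent -> child links.

    edges is an iterable of (category, parent) pairs.
    """
    children = defaultdict(set)
    for category, parent in edges:
        children[parent].add(category)

    seen = {root}
    queue = deque([root])
    while queue:
        current = queue.popleft()
        for child in children[current]:
            # The seen set also protects against cycles in the category graph.
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


def keep_reachable(article_categories, allowed):
    """Drop every category of every article that is not in the allowed set."""
    return {
        title: [c for c in cats if c in allowed]
        for title, cats in article_categories.items()
    }


# Toy data (hypothetical, not the real Wikipedia hierarchy).
toy_edges = [
    ("Bioinformatics", "Articles"),
    ("Flow_cytometry", "Bioinformatics"),
    ("Bioinformatics", "Flow_cytometry"),  # deliberate cycle
    ("Some_maintenance_category", "Some_tracking_root"),
]
toy_articles = {
    "Flow_cytometry_bioinformatics": [
        "Flow_cytometry",
        "Bioinformatics",
        "Some_maintenance_category",
    ],
}

allowed = descendants(toy_edges, "Articles")
cleaned = keep_reachable(toy_articles, allowed)
```

A filter of this kind only removes categories that sit outside the root's subtree, which is why the two Wikipedia_... categories survive it whenever they are themselves reachable from the root.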
One way to do this would be to parse wikitext instead of the SQL dumps and extract only categories marked by the pattern [[Category:XX]], but in that case we would lose a good category such as Guided_missiles_of_Norway, because it is generated by a template.
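For reference, the wikitext approach could look roughly like the sketch below. The regular expression and the function name are assumptions for illustration, not code we run; it only handles plain [[Category:XX]] and [[Category:XX|sort key]] links, and the template name in the second sample is invented.

```python
import re

# Matches [[Category:Name]] and [[Category:Name|sort key]].
# Links of the form [[:Category:Name]] are ordinary links, not category
# assignments, and are intentionally not matched.
CATEGORY_LINK = re.compile(
    r"\[\[\s*Category\s*:\s*([^\]|]+?)\s*(?:\|[^\]]*)?\]\]",
    re.IGNORECASE,
)


def explicit_categories(wikitext):
    """Return the categories written directly in the wikitext, in dump-style form."""
    return [
        match.group(1).strip().replace(" ", "_")
        for match in CATEGORY_LINK.finditer(wikitext)
    ]


sample = (
    "Some article text.\n"
    "[[Category:Flow cytometry]]\n"
    "[[Category:Bioinformatics|Flow]]\n"
    "See also [[:Category:Cell biology]].\n"
)
found = explicit_categories(sample)

# A category that is added by a template leaves no [[Category:...]] link
# in the article's own wikitext, so nothing is found here.
templated = "{{Some hypothetical infobox}}\nArticle text only."
found_in_templated = explicit_categories(templated)
```

The second sample shows the limitation mentioned above: anything a template contributes is invisible to this kind of parsing unless the templates are expanded first.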
Any ideas on how we can start with a "cleaner" dataset of categories that relate to the topic of the articles, as opposed to maintenance-related or other types of categories?
Thanks, Leila
[1] https://meta.wikimedia.org/wiki/Research:Expanding_Wikipedia_stubs_across_languages
[2] The exact code we use is
SELECT p.page_id id, p.page_title title, cl.cl_to category
FROM categorylinks cl
JOIN page p ON cl.cl_from = p.page_id
WHERE cl.cl_type = 'page' AND p.page_namespace = 0 AND p.page_is_redirect = 0
and the edges of the category graph are extracted with
SELECT p.page_title category, cl.cl_to parent
FROM categorylinks cl
JOIN page p ON p.page_id = cl.cl_from
WHERE p.page_namespace = 14
[3] https://en.wikipedia.org/wiki/Flow_cytometry_bioinformatics