Hi Cristian,
On Thu, Jul 20, 2017 at 8:21 AM, Cristian Consonni cristian@balist.es wrote:
Hi Leila,
On 11/07/2017 03:20, Leila Zia wrote:
Using this approach, we get 5 categories associated with the Flow cytometry bioinformatics article [3]:
Flow_cytometry
Bioinformatics
Wikipedia_articles_published_in_peer-reviewed_literature
Wikipedia_articles_published_in_PLOS_Computational_Biology
CS1_maint:_Multiple_names:_authors_list
I wanted to point out that, to me, the main difference between the first two categories and the last three is that the latter are automatically added by templates. In fact, if you look at the page source you will only find the first two.
This makes sense. Here is how we ended up in this place:
* If we used XML dumps (which we initially did) for category extraction (based on link extraction), we would consider a category such as Guided_missiles_of_Norway a root category, which is wrong. The issue with this category is that its parent categories are generated by templates, and we could not (at least not relatively easily) pick this information up from the XML dumps. As a result, we decided to go with SQL dumps.
* The nice thing about using SQL dumps is that we can recover the parents of a category such as Guided_missiles_of_Norway. The downside is that we lose the information about which categories are generated via templates and which are added the usual way.
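To make the root-category problem concrete, here is a minimal Python sketch. It assumes the category links have already been reduced to (child, parent) title pairs; the function names and the two "Hypothetical_..." category names are placeholders invented for illustration and are not taken from the real dumps.

```python
def build_parent_map(edges):
    """Map every category to the set of its parent categories."""
    parents = {}
    for child, parent in edges:
        parents.setdefault(child, set()).add(parent)
        parents.setdefault(parent, set())  # make sure the parent is a node too
    return parents


def apparent_roots(parents):
    """Categories with no recorded parent look like roots of the graph."""
    return sorted(cat for cat, ps in parents.items() if not ps)


# Link extraction from page source misses parents that templates add,
# so Guided_missiles_of_Norway has no recorded parent here.
xml_like_edges = [
    ("Hypothetical_subcategory", "Guided_missiles_of_Norway"),
]

# An SQL-style listing also contains the template-generated parent link
# (the parent name below is a placeholder, not the real one).
sql_like_edges = xml_like_edges + [
    ("Guided_missiles_of_Norway", "Hypothetical_parent_category"),
]

print(apparent_roots(build_parent_map(xml_like_edges)))
print(apparent_roots(build_parent_map(sql_like_edges)))
```

With the first edge list, Guided_missiles_of_Norway is reported as a root; with the second it is not. Note that the plain pairs carry no flag saying whether a link came from a template, which is the downside mentioned above.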
Two more things to add:
* Focusing on categories that belong to Main_topic_articles seems to address the issue we ran into.
* We discussed whether a category such as "Wikipedia_articles_published_in_PLOS_Computational_Biology" is a good one or not, and given that its path is reasonable (by eye-balling), we now consider it a category that should stay as a good category in the category graph. Check the path for it:
Wikipedia_articles_published_in_PLOS_Computational_Biology
Public_Library_of_Science
Open_access_publishers
Academic_publishing_companies
Academic_publishing
Academia
Education
Euthenics
Social_sciences
... and up to the root
So now we know it is good that our approach for building the graph of categories doesn't exclude this category immediately.
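A path check like the one above can be sketched in Python as a breadth-first walk over parent links. This is only an illustration: the function name is made up, each category is given a single parent (real categories usually have several), and the chain stops at Social_sciences because the rest of the path is elided in this message.

```python
from collections import deque


def path_to_ancestor(parents, start, target):
    """Return one shortest chain of parent links from start to target, or None."""
    queue = deque([[start]])
    seen = {start}  # guards against cycles in the category graph
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == target:
            return path
        for parent in sorted(parents.get(node, ())):
            if parent not in seen:
                seen.add(parent)
                queue.append(path + [parent])
    return None


# The chain quoted in this message, up to the last category that is spelled out.
chain = [
    "Wikipedia_articles_published_in_PLOS_Computational_Biology",
    "Public_Library_of_Science",
    "Open_access_publishers",
    "Academic_publishing_companies",
    "Academic_publishing",
    "Academia",
    "Education",
    "Euthenics",
    "Social_sciences",
]
# Simplification: one parent per category, taken from consecutive chain entries.
parents = {child: {parent} for child, parent in zip(chain, chain[1:])}

for category in path_to_ancestor(parents, chain[0], "Social_sciences"):
    print(category)
```

On a full category graph the same walk could be pointed at whichever top-level category is used as the anchor, and a None result would flag categories with no path to it.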
Best, Leila
Cristian
Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l