On Thu, Jul 20, 2017 at 8:21 AM, Cristian Consonni <cristian(a)balist.es> wrote:
On 11/07/2017 03:20, Leila Zia wrote:
Using this approach, we get 5 categories
associated with the Flow cytometry
bioinformatics article:
I wanted to point out that to me the main difference between the first
two categories and the last three is that the former are automatically
added by templates. In fact, if you look at the page source you will
only find the first two.
This makes sense. Here is why we ended up in this place:
* If we used XML dumps (which we initially did) for category
extraction (based on link extraction), we would consider a category
such as Guided_missiles_of_Norway a root category (which is wrong).
The issue with this category is that its parent categories are
generated by templates and we could not (at least not relatively easily)
pick this information up from XML dumps. As a result, we decided to go
with SQL dumps.
* The nice thing about using SQL dumps is that we can save the parents
of a category such as Guided_missiles_of_Norway. The downside is that
we lose the information about which categories are generated via a
template and which are added the usual way.
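
To illustrate the two points above, here is a minimal sketch, not our actual pipeline. It assumes the (child, parent) category pairs have already been extracted, and all category names and edges are hypothetical placeholders. It shows how a category whose parents come only from templates looks like a root when we see only the links written in the page source, and stops looking like one once the template-generated links are included.

```python
def build_parent_map(edges):
    """Map every category to the set of its parent categories."""
    parents = {}
    for child, parent in edges:
        parents.setdefault(child, set()).add(parent)
        parents.setdefault(parent, set())  # make sure parents appear as nodes too
    return parents


def apparent_roots(parents):
    """Categories with no known parent, i.e. what we would treat as roots."""
    return sorted(cat for cat, ps in parents.items() if not ps)


# Hypothetical toy data: links visible in the page source only.
wikitext_edges = [("Cat_C", "Cat_B")]
# Hypothetical toy data: the same links plus one added by a template.
all_edges = wikitext_edges + [("Cat_B", "Cat_A")]

roots_from_wikitext = apparent_roots(build_parent_map(wikitext_edges))
roots_from_all = apparent_roots(build_parent_map(all_edges))
```

With only the page-source links, Cat_B has no known parent and is wrongly treated as a root. With the full set of links, only Cat_A remains as a root, but nothing in the pairs tells us which link came from a template.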
Two more things to add:
* Focusing on categories that belong to Main_topic_articles seems to
address the issue we ran into.
* We discussed whether a category such as
"Wikipedia_articles_published_in_PLOS_Computational_Biology" is a good
one or not. Given that its path looks reasonable (by eyeballing it), we
now consider it a category that should stay in the category graph.
Check the path for it:
... and up to the root
So now we know that it's good that our approach for building the graph
of categories doesn't exclude this category immediately.
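
The kind of path check described above can be sketched as follows. This is only an illustration under stated assumptions: the parent map, the function name path_to_root, and every category name in it are hypothetical, and the real path for the PLOS category is not reproduced here. The search goes breadth-first from a category up through its parents and keeps a set of visited nodes, since a category graph may contain cycles.

```python
from collections import deque


def path_to_root(parents, start, root):
    """Return one shortest path from start up to root, or None if unreachable."""
    queue = deque([[start]])
    seen = {start}  # guards against cycles in the category graph
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == root:
            return path
        for parent in sorted(parents.get(node, ())):
            if parent not in seen:
                seen.add(parent)
                queue.append(path + [parent])
    return None


# Hypothetical toy graph: category -> set of parent categories.
toy_parents = {
    "Leaf_category": {"Middle_category"},
    "Middle_category": {"Root_category", "Leaf_category"},  # includes a cycle
    "Root_category": set(),
    "Orphan_category": set(),
}

found = path_to_root(toy_parents, "Leaf_category", "Root_category")
missing = path_to_root(toy_parents, "Orphan_category", "Root_category")
```

A category for which this returns a path can then be inspected by eye, as we did. A category for which it returns None never reaches the chosen root.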
Wiki-research-l mailing list