category extraction question


Leila Zia

10 Jul 2017

8:20 p.m.

Hi all,

[If you are not interested in discussions related to the category system (on English Wikipedia), you can stop here. :)]

We have run into a problem that some of you may have thought about or addressed before. We are trying to clean up the category system on English Wikipedia by turning the category structure into an IS-A hierarchy. (The output of this work can be useful for the research on template recommendation [1], for example, but the use cases won't stop there.) One issue that we are facing is the following:

We are currently using SQL dumps to extract the categories associated with every article on English Wikipedia (main namespace). [2] Using this approach, we get 5 categories associated with the Flow cytometry bioinformatics article [3]:

Flow_cytometry
Bioinformatics
Wikipedia_articles_published_in_peer-reviewed_literature
Wikipedia_articles_published_in_PLOS_Computational_Biology
CS1_maint:_Multiple_names:_authors_list

The problem is that only the first two categories are the ones we are interested in. We have one cleaning step in which we keep only categories that belong to category Article. That step removes the last category above, but the other two Wikipedia_... categories remain. We need to somehow prune those two categories from the data.
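The cleaning step described above can be sketched as a downward traversal of the category graph. This is a minimal sketch and not the code used in the project: it assumes the (category, parent) rows from the edge query in [2] are already loaded as tuples, and the function names and toy data are hypothetical.

```python
from collections import defaultdict, deque

def descendants(edges, root):
    """Return every category reachable downward from root, including root."""
    children = defaultdict(set)
    for category, parent in edges:
        children[parent].add(category)
    seen = {root}
    queue = deque([root])
    while queue:
        current = queue.popleft()
        for child in children.get(current, ()):
            if child not in seen:  # the real graph contains cycles
                seen.add(child)
                queue.append(child)
    return seen

def keep_under_root(article_categories, edges, root):
    """Keep, for each article, only the categories that sit under root."""
    allowed = descendants(edges, root)
    return {
        article: [c for c in categories if c in allowed]
        for article, categories in article_categories.items()
    }

# Toy data for illustration only; the parent links are made up.
toy_edges = [
    ("Biology", "Article"),
    ("Bioinformatics", "Biology"),
    ("Flow_cytometry", "Biology"),
    ("Biology", "Bioinformatics"),  # deliberate cycle
    ("CS1_maint", "Tracking"),
]
toy_articles = {
    "Flow_cytometry_bioinformatics": ["Flow_cytometry", "Bioinformatics", "CS1_maint"],
}
print(keep_under_root(toy_articles, toy_edges, "Article"))
```

A filter like this only removes categories that have no path to the chosen root, which is why categories such as the two Wikipedia_... ones can survive it.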

One way we could do the above would be to parse wikitext instead of the SQL dumps and focus on extracting categories marked by the pattern [[Category:XX]], but in that case, we would lose a good category such as Guided_missiles_of_Norway because that's generated by a template.
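The wikitext alternative can be sketched with a regular expression. This is a minimal sketch under stated assumptions: the function name and sample text are hypothetical, and the pattern only finds categories written literally in the page source, so anything emitted by a template is missed, which is exactly the limitation described above.

```python
import re

# Matches [[Category:Name]] and [[Category:Name|sort key]].
# A leading colon, as in [[:Category:Name]], is a plain link and is not matched.
CATEGORY_LINK = re.compile(
    r"\[\[\s*Category\s*:\s*([^\]|]+?)\s*(?:\|[^\]]*)?\]\]",
    re.IGNORECASE,
)

def explicit_categories(wikitext):
    """Return category names written directly in the wikitext, in order."""
    names = []
    for match in CATEGORY_LINK.finditer(wikitext):
        raw = match.group(1).strip()
        if not raw:
            continue
        # Normalize to the underscore form used in the SQL dumps.
        name = (raw[0].upper() + raw[1:]).replace(" ", "_")
        if name not in names:
            names.append(name)
    return names

sample = (
    "Some article text. {{Hypothetical citation template}}\n"
    "[[Category:Flow cytometry]]\n"
    "[[category:bioinformatics| ]]\n"
)
print(explicit_categories(sample))  # ['Flow_cytometry', 'Bioinformatics']
```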

Any ideas on how we can start with a "cleaner" dataset of categories related to the topic of the articles, as opposed to maintenance-related or other types of categories?

Thanks, Leila

[1] https://meta.wikimedia.org/wiki/Research:Expanding_Wikipedia_stubs_across_languages

[2] The exact code we use is

    SELECT p.page_id id, p.page_title title, cl.cl_to category
    FROM categorylinks cl
    JOIN page p ON cl.cl_from = p.page_id
    WHERE cl_type = 'page' AND page_namespace = 0 AND page_is_redirect = 0

and the edges of the category graph are extracted with

    SELECT p.page_title category, cl.cl_to parent
    FROM categorylinks cl
    JOIN page p ON p.page_id = cl.cl_from
    WHERE p.page_namespace = 14

[3] https://en.wikipedia.org/wiki/Flow_cytometry_bioinformatics


Stuart A. Yeates

10 Jul 2017

8:45 p.m.

The category system on en.wiki is not an IS-A system, and there have been several discussions about making it based on mathematical principles, which have come to nothing because the consensus of editors is against it. The best way to think about categories is as a locally-faceted related links system.

Having said that, Category:Wikipedia maintenance is an important root that is probably useful for separating the wheat from the chaff. Most of these are also hidden categories. I'm not sure whether this flag appears in the SQL, but see https://en.wikipedia.org/wiki/Wikipedia:Categorization#Hiding_categories
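Filtering on the hidden flag could look like the sketch below. It rests on an assumption that is not confirmed anywhere in this thread: that the dump's page_props table marks hidden categories with pp_propname = 'hiddencat'. Please verify that against your own dump before relying on it. The function name and the sample hidden set are hypothetical.

```python
# Assumption: hidden categories can be listed with a query along these lines.
# The page_props lookup is not part of the original thread; check it first.
HIDDEN_CATEGORY_SQL = """
SELECT p.page_title
FROM page_props pp
JOIN page p ON p.page_id = pp.pp_page
WHERE pp.pp_propname = 'hiddencat' AND p.page_namespace = 14
"""

def drop_hidden(article_categories, hidden_categories):
    """Remove every category that appears in the hidden set."""
    hidden = set(hidden_categories)
    return {
        article: [c for c in categories if c not in hidden]
        for article, categories in article_categories.items()
    }

# The hidden set below is made up for illustration, not read from a dump.
toy_hidden = {"CS1_maint:_Multiple_names:_authors_list"}
toy_article_categories = {
    "Flow_cytometry_bioinformatics": [
        "Flow_cytometry",
        "Bioinformatics",
        "CS1_maint:_Multiple_names:_authors_list",
    ],
}
print(drop_hidden(toy_article_categories, toy_hidden))
```

The same function works with the set of descendants of Category:Wikipedia maintenance in place of the hidden set, if that root turns out to be the better separator.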

cheers stuart

-- ...let us be heard from red core to black sky


Bowen Yu

8:51 p.m.

Hi Leila,

I did something similar before. I was trying to create "top-level" category labels for the articles, like history, society, technology, etc. I parsed the wikitext in the dump data to extract all the subcategory labels of each article. Also, by parsing pages in namespace 14, I created a category-relation graph for all the category labels, where ideally each subcategory can reach some "top-level" category. Then, for each article, you can follow its subcategory labels up through the graph to find the top-level categories. More detail can be found in the "3.3.2 Independent Variables - Identity-based Attachment" subsection of the paper. Hope it helps!
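The lookup described above can be sketched as an upward breadth-first search. This is a minimal sketch and not the code from the paper: the function name, the depth limit, and the toy graph are all assumptions added for illustration.

```python
from collections import deque

def top_level_labels(categories, parents, top_level, max_depth=10):
    """Climb from an article's categories and collect the top-level labels reached.

    parents maps a category to the set of its parent categories.
    max_depth is an added safeguard (an assumption, not from the thread),
    since long paths in the real graph tend to reach unrelated topics.
    """
    found = set()
    seen = set(categories)
    queue = deque((category, 0) for category in categories)
    while queue:
        current, depth = queue.popleft()
        if current in top_level:
            found.add(current)
            continue  # do not climb above a top-level label
        if depth == max_depth:
            continue
        for parent in parents.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                queue.append((parent, depth + 1))
    return found

# Toy graph for illustration only; the parent links are made up.
toy_parents = {
    "Flow_cytometry": {"Cell_biology"},
    "Cell_biology": {"Biology"},
    "Biology": {"Science"},
    "Bioinformatics": {"Computational_science", "Biology"},
    "Computational_science": {"Technology"},
}
toy_top_level = {"Science", "Technology", "History"}
print(top_level_labels(["Flow_cytometry", "Bioinformatics"], toy_parents, toy_top_level))
```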


Leila Zia

11 Jul 2017

9:53 a.m.

Hi Stuart,

On Mon, Jul 10, 2017 at 6:45 PM, Stuart A. Yeates syeates@gmail.com wrote:

The category system on en.wiki is not an IS-A system, and there have been several discussions about making it based on mathematical principles, which have come to nothing because the consensus of editors is against it. The best way to think about categories is as a locally-faceted related links system.

It would be great if you could share a link to one or more of those conversations, if they are not too hard to find. This is a conversation that comes up often and I'd like to educate myself on the background. (And to confirm: on our end the goal is not to change the category system on enwiki, but to make it machine-understandable for specific applications.)

Having said that, Category:Wikipedia maintenance is an important root that is probably useful for separating the wheat from the chaff. Most of these are also hidden categories. I'm not sure whether this flag appears in the SQL, but see https://en.wikipedia.org/wiki/Wikipedia:Categorization#Hiding_categories

Looking into these. Thanks!

Best, Leila


Stuart A. Yeates

24 Jul 2017

7:22 p.m.

Sorry it's taken me so long to get back to this.

https://pdfs.semanticscholar.org/dea9/142b39bdc2c3738e0f9cb7c6d117750ef2f7.p... and https://meta.wikimedia.org/wiki/Beyond_categories are good places to start on the issues with cats on en.wiki.

cheers stuart

-- ...let us be heard from red core to black sky


Leila Zia

25 Jul 2017

2:05 p.m.

On Mon, Jul 24, 2017 at 5:22 PM, Stuart A. Yeates syeates@gmail.com wrote:


Sorry it's taken me so long to get back to this. https://pdfs.semanticscholar.org/dea9/142b39bdc2c3738e0f9cb7c6d117750ef2f7.p... and https://meta.wikimedia.org/wiki/Beyond_categories are good places to start on the issues with cats on en.wiki.

Very helpful. Thanks!

Leila


Cristian Consonni

20 Jul 2017

10:21 a.m.

Hi Leila,

On 11/07/2017 03:20, Leila Zia wrote:

Using this approach, we get 5 categories associated with Flow cytometry bioinformatics article [3]:

Flow_cytometry
Bioinformatics
Wikipedia_articles_published_in_peer-reviewed_literature
Wikipedia_articles_published_in_PLOS_Computational_Biology
CS1_maint:_Multiple_names:_authors_list

I wanted to point out that, to me, the main difference between the first two categories and the last three is that the latter are automatically added by templates. In fact, if you look at the page source you will only find the first two.

Cristian

Leila Zia

25 Jul 2017

2:03 p.m.

Hi Cristian,

On Thu, Jul 20, 2017 at 8:21 AM, Cristian Consonni cristian@balist.es wrote:

I wanted to point out that, to me, the main difference between the first two categories and the last three is that the latter are automatically added by templates. In fact, if you look at the page source you will only find the first two.

This makes sense. Here is why we ended up in this place:

* If we used XML dumps (which we initially did) for category extraction (based on link extraction), we would consider a category such as Guided_missiles_of_Norway a root category, which is wrong. The issue with this category is that its parent categories are generated by templates, and we could not (at least not easily) pick this information up from the XML dumps. As a result, we decided to go with SQL dumps.
* The nice thing about using SQL dumps is that we can capture the parents of a category such as Guided_missiles_of_Norway. The downside is that we lose the information about which categories are generated via a template and which are added the usual way.

Two more things to add:

* Focusing on categories that belong to Main_topic_articles seems to address the issue we ran into.
* We discussed whether a category such as "Wikipedia_articles_published_in_PLOS_Computational_Biology" is a good one or not. Given that its path looks reasonable (by eyeballing), we now consider it a category that should stay in the category graph. Check the path for it:

Wikipedia_articles_published_in_PLOS_Computational_Biology
Public_Library_of_Science
Open_access_publishers
Academic_publishing_companies
Academic_publishing
Academia
Education
Euthenics
Social_sciences
... and up to the root

so now we know that it's good that our approach for building the graph of categories doesn't exclude this category immediately.
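A path like the one above can be produced rather than eyeballed with a breadth-first search toward the root. This is a minimal sketch: the function name is hypothetical, the toy graph reuses only the first links of the path listed above, and treating "Education" as the root is an assumption made to keep the example short.

```python
from collections import deque

def path_to_root(category, parents, root):
    """Return one shortest upward path from category to root, or None."""
    previous = {category: None}
    queue = deque([category])
    while queue:
        current = queue.popleft()
        if current == root:
            path = []
            while current is not None:
                path.append(current)
                current = previous[current]
            return path[::-1]
        # sorted() only makes the result deterministic when several parents exist
        for parent in sorted(parents.get(current, ())):
            if parent not in previous:
                previous[parent] = current
                queue.append(parent)
    return None

# Toy graph built from the path quoted above, plus one made-up cycle.
toy_parents = {
    "Wikipedia_articles_published_in_PLOS_Computational_Biology": {"Public_Library_of_Science"},
    "Public_Library_of_Science": {"Open_access_publishers"},
    "Open_access_publishers": {"Academic_publishing_companies"},
    "Academic_publishing_companies": {"Academic_publishing", "Open_access_publishers"},
    "Academic_publishing": {"Academia"},
    "Academia": {"Education"},
}
start = "Wikipedia_articles_published_in_PLOS_Computational_Biology"
for step in path_to_root(start, toy_parents, "Education"):
    print(step)
```

Because categories can have several parents, the shortest path is only one of possibly many, so a reasonable-looking path does not rule out other, less sensible ones.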

Best, Leila


Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
