Re: [Wiki-research-l] category extraction question

11 Jul 2017

Hi Leila,

I've been working on taxonomy learning from Wikipedia categories in my 
past research.
Here's a recap of the approach I proposed to address the pruning problem 
you faced. It's a pipeline with a bottom-up direction, i.e., from the 
leaves up to the root.

Stage 1: leaf nodes
INPUT = category and category links SQL dumps, as in your setup
1.1. extract the full set of article pages;
1.2. extract categories that are linked to article pages only, by 
looking at the outgoing links for each article;
1.3. identify the set of categories with no sub-categories.
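
As a rough illustration of Stage 1, here is a minimal Python sketch. It
assumes the dumps have already been reduced to two lists of pairs, such as
the rows returned by the two queries quoted further down: (article,
category) links and (category, parent) edges. The function name
leaf_categories and the toy data are hypothetical, not part of the
original work.

```python
def leaf_categories(article_links, category_edges):
    """Return categories linked from articles that have no sub-categories.

    article_links:  iterable of (article_title, category) pairs
    category_edges: iterable of (category, parent_category) pairs
    """
    # Step 1.2: categories reached through the outgoing links of articles.
    article_categories = {category for _article, category in article_links}
    # Any category that appears as a parent has at least one sub-category.
    parent_categories = {parent for _child, parent in category_edges}
    # Step 1.3: keep only the categories with no sub-categories.
    return article_categories - parent_categories


# Toy data (made up for illustration only).
article_links = [
    ("Article_1", "Cat_A"),
    ("Article_1", "Cat_B"),
    ("Article_2", "Cat_C"),
]
category_edges = [("Cat_A", "Cat_B"), ("Cat_C", "Cat_D")]

leaves = leaf_categories(article_links, category_edges)
print(sorted(leaves))
```

Cat_B is dropped because Cat_A is one of its sub-categories; Cat_D never
appears because no article links to it directly.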

Stage 2: prominent nodes
INPUT = stage 1 output
2.1. traverse the graph upward from the leaves, see the algorithm in [1];
2.2. apply NLP to identify categories that hold is-a relations, i.e., *noun 
phrases* with a *plural head*, inspired by the YAGO approach [2, 3];
2.3. (optional) set a usage weight based on the number of category 
interlanguage links (more links = more usage across language chapters).
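
To make steps 2.2 and 2.3 more concrete, here is a very crude Python
sketch. It is an assumption-laden stand-in, not the YAGO method itself: a
real pipeline would use a proper noun-phrase parser or POS tagger, whereas
this one guesses the head as the last word before the first preposition
and treats a trailing "s" as plural. The names PREPOSITIONS,
NON_PLURAL_ENDINGS, head_noun, has_plural_head and usage_weights are all
hypothetical, and the interlanguage links are assumed to be available as
(category, language) pairs.

```python
from collections import Counter

# Small, English-only preposition list (illustrative, far from complete).
PREPOSITIONS = {"of", "in", "by", "from", "at", "for", "on", "with", "about"}
# Endings that end in "s" but are usually not plurals (crude heuristic).
NON_PLURAL_ENDINGS = ("ss", "us", "is", "ics")


def head_noun(category_title):
    """Guess the head: the last word before the first preposition."""
    head_phrase = []
    for token in category_title.replace("_", " ").split():
        if token.lower() in PREPOSITIONS:
            break
        head_phrase.append(token)
    return head_phrase[-1] if head_phrase else ""


def has_plural_head(category_title):
    """Step 2.2 (approximation): does the guessed head look plural?"""
    head = head_noun(category_title).lower()
    return head.endswith("s") and not head.endswith(NON_PLURAL_ENDINGS)


def usage_weights(interlanguage_links):
    """Step 2.3: count interlanguage links per category."""
    return Counter(category for category, _lang in interlanguage_links)


print(has_plural_head("Guided_missiles_of_Norway"))
print(has_plural_head("Flow_cytometry"))
print(usage_weights([("Cat_A", "de"), ("Cat_A", "fr"), ("Cat_B", "it")]))
```

The suffix test will misfire on irregular plurals and on singular nouns
ending in "s", which is one reason to prefer a real tagger in practice.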

These 2 stages should output the clean dataset you're looking for.
Based on that, you can then build the taxonomy.

Feel free to ping me if you need more information.
Best,

Marco

[1] Input: L (leaf nodes set) Output: PN (prominent nodes set)
PN = empty set;
for all l in L do
	isProminent = true;
	P = getTransitiveParents(l);
	for all p in P do
		C = getChildren(p);
		areAllLeaves = true;
		for all c in C do
			if c not in L then
				areAllLeaves = false;
				break;
			end if
		end for
		if areAllLeaves then
			PN.add(p);
			isProminent = false;
		end if
	end for
	if isProminent then
		PN.add(l);
	end if
end for
return PN
[2] F. M. Suchanek, G. Kasneci, and G. Weikum. YAGO: A core of semantic
knowledge. In Proceedings of the 16th International Conference on World
Wide Web, pages 697–706. ACM, 2007.
[3] J. Hoffart, F. M. Suchanek, K. Berberich, and G. Weikum. YAGO2: A
spatially and temporally enhanced knowledge base from Wikipedia.
Artificial Intelligence, 194:28–61, 2013.

On 7/11/17 03:21, wiki-research-l-request(a)lists.wikimedia.org wrote:
...
 Date: Mon, 10 Jul 2017 18:20:47 -0700
 From: Leila Zia <leila(a)wikimedia.org>
 To: Research into Wikimedia content and communities
 	<wiki-research-l(a)lists.wikimedia.org>
 Subject: [Wiki-research-l] category extraction question
 Message-ID:
 	<CAK0Oe2s_VDPS3JNLY8_0V+CFeXHMT+0p-VNbsv+0mtD2NMT7dA(a)mail.gmail.com>
 Content-Type: text/plain; charset="UTF-8"

 Hi all,

 [If you are not interested in discussions related to the category system
 (on English Wikipedia), you can stop here. :)]

 We have run into a problem that some of you may have thought about or
 addressed before. We are trying to clean up the category system on English
 Wikipedia by turning the category structure into an IS-A hierarchy. (The
 output of this work can be useful for the research on template
 recommendation [1], for example, but the use-cases won't stop there). One
 issue that we are facing is the following:

 We are currently using SQL dumps to extract categories associated with
 every article on English Wikipedia (main namespace). [2] Using this
 approach, we get 5 categories associated with the Flow cytometry
 bioinformatics article [3]:

 Flow_cytometry
 Bioinformatics

 Wikipedia_articles_published_in_peer-reviewed_literature
 Wikipedia_articles_published_in_PLOS_Computational_Biology
 CS1_maint:_Multiple_names:_authors_list

 The problem is that only the first two categories are the ones we are
 interested in. We have one cleaning step through which we only keep
 categories that belong to category Article and that step removes the last
 category above, but the other two Wikipedia_... remain there. We need to
 somehow prune the data and clean it from those two categories.

 One way we could do the above would be to parse wikitext instead of the SQL
 dumps and focus on extracting categories marked by pattern [[Category:XX]],
 but in that case, we would lose a good category such as
 Guided_missiles_of_Norway because that's generated by a template.

 Any ideas on how we can start with a "cleaner" dataset of categories
 related to the topic of the articles as opposed to maintenance related or
 other types of categories?

 Thanks,
 Leila

 [1] https://meta.wikimedia.org/wiki/Research:Expanding_Wikipedia_stubs_across_languages

 [2] The exact code we use is

 SELECT p.page_id id, p.page_title title, cl.cl_to category
 FROM categorylinks cl
 JOIN page p
 ON cl.cl_from = p.page_id
 WHERE cl_type = 'page'
 AND page_namespace = 0
 AND page_is_redirect = 0

 and the edges of the category graph are extracted with

 SELECT p.page_title category, cl.cl_to parent
 FROM categorylinks cl
 JOIN page p
 ON p.page_id = cl.cl_from
 WHERE p.page_namespace = 14

 [3] https://en.wikipedia.org/wiki/Flow_cytometry_bioinformatics
