are good places to
start on the issues with cats on en.wiki.
cheers
stuart
--
...let us be heard from red core to black sky
On 12 July 2017 at 02:53, Leila Zia <leila(a)wikimedia.org> wrote:
Hi Stuart,
On Mon, Jul 10, 2017 at 6:45 PM, Stuart A. Yeates <syeates(a)gmail.com> wrote:
The category system on en.wiki is not an IS-A system and there have been
several discussions about making it based on mathematical principles,
which have come to nothing because the consensus of editors is against
it.
The best way to think about categories is as a locally-faceted related
links system.
It would be great if you can share a link to one or more of those
conversations, if it's not too hard to find them. This is a
conversation that comes up often and I'd like to educate myself with
this background. (and to confirm: on our end the goal is not to change
the category system on enwiki, but to make it machine understandable
for specific applications.)
Having said that, Category:Wikipedia maintenance is an important root
probably useful for separating the wheat from the chaff. Most of these are
also hidden categories. I'm not sure whether this flag appears in the SQL.
Looking into these. thanks!
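
As a rough illustration of the suggestion above, the sketch below treats Category:Wikipedia maintenance as a root, collects every category reachable beneath it through (category, parent) edges of the kind produced by the category-graph query in [2], and drops those from an article's category list. The function names are hypothetical, and the toy edges are invented for illustration only; where these categories actually sit on en.wiki has not been checked here.

```python
from collections import defaultdict, deque


def descendants(edges, root):
    """Return every category reachable below `root` (root included).

    `edges` is an iterable of (category, parent) pairs, the same shape as
    the rows returned by a category-graph query.
    """
    children = defaultdict(set)
    for category, parent in edges:
        children[parent].add(category)

    seen = {root}
    queue = deque([root])
    while queue:
        current = queue.popleft()
        for child in children[current]:
            # The category graph can contain cycles, so skip visited nodes.
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


def prune(article_categories, edges, root="Wikipedia_maintenance"):
    """Keep only the categories that are not located under `root`."""
    blocked = descendants(edges, root)
    return [c for c in article_categories if c not in blocked]


# Toy data: these edges are assumptions made up for the example.
TOY_EDGES = [
    ("Wikipedia_articles_published_in_peer-reviewed_literature",
     "Wikipedia_maintenance"),
    ("Wikipedia_articles_published_in_PLOS_Computational_Biology",
     "Wikipedia_articles_published_in_peer-reviewed_literature"),
    ("Flow_cytometry", "Cell_biology"),
]

ARTICLE_CATEGORIES = [
    "Flow_cytometry",
    "Bioinformatics",
    "Wikipedia_articles_published_in_peer-reviewed_literature",
    "Wikipedia_articles_published_in_PLOS_Computational_Biology",
]

if __name__ == "__main__":
    print(prune(ARTICLE_CATEGORIES, TOY_EDGES))
```

Because the category graph is not a strict hierarchy, a topical category that happens to have a maintenance ancestor would also be dropped by this approach, so the result is best treated as a first pass rather than a final filter.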
Best,
Leila
cheers
stuart
--
...let us be heard from red core to black sky
On 11 July 2017 at 13:20, Leila Zia <leila(a)wikimedia.org> wrote:
> Hi all,
>
> [If you are not interested in discussions related to the category system
> (on English Wikipedia), you can stop here. :)]
>
> We have run into a problem that some of you may have thought about or
> addressed before. We are trying to clean up the category system on English
> Wikipedia by turning the category structure into an IS-A hierarchy. (The
> output of this work can be useful for the research on template
> recommendation [1], for example, but the use-cases won't stop there.) One
> issue that we are facing is the following:
>
> We are currently using SQL dumps to extract the categories associated with
> every article on English Wikipedia (main namespace). [2] Using this
> approach, we get 5 categories associated with the Flow cytometry
> bioinformatics article [3]:
>
> Flow_cytometry
> Bioinformatics
>
> Wikipedia_articles_published_in_peer-reviewed_literature
> Wikipedia_articles_published_in_PLOS_Computational_Biology
> CS1_maint:_Multiple_names:_authors_list
>
> The problem is that only the first two categories are the ones we are
> interested in. We have one cleaning step through which we only keep
> categories that belong to category Article, and that step removes the last
> category above, but the other two Wikipedia_... categories remain. We need
> to somehow prune the data and clean it of those two categories.
>
> One way we could do the above would be to parse wikitext instead of the SQL
> dumps and focus on extracting categories marked by the pattern
> [[Category:XX]], but in that case we would lose a good category such as
> Guided_missiles_of_Norway because that's generated by a template.
>
> Any ideas on how we can start with a "cleaner" dataset of categories
> related to the topic of the articles, as opposed to maintenance-related or
> other types of categories?
Thanks,
Leila
[1] https://meta.wikimedia.org/wiki/Research:Expanding_Wikipedia_stubs_across_languages
[2] The exact code we use is

SELECT p.page_id id, p.page_title title, cl.cl_to category
FROM categorylinks cl
JOIN page p
ON cl.cl_from = p.page_id
WHERE cl_type = 'page'
AND page_namespace = 0
AND page_is_redirect = 0

and the edges of the category graph are extracted with

SELECT p.page_title category, cl.cl_to parent
FROM categorylinks cl
JOIN page p
ON p.page_id = cl.cl_from
WHERE p.page_namespace = 14

[3] https://en.wikipedia.org/wiki/Flow_cytometry_bioinformatics
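
On the question raised in this thread of whether the hidden-category flag is visible in SQL: MediaWiki records the __HIDDENCAT__ magic word as a page property named hiddencat in the page_props table, attached to the category page itself. The sketch below extends the article-category query in [2] to skip such categories. It uses Python's sqlite3 with a few made-up rows purely so that it is runnable; the table and column names follow the MediaWiki schema, while the helper names, the toy rows, and the choice of which category is marked hidden are assumptions for illustration.

```python
import sqlite3

# Same shape as the article-category query in [2], plus a NOT EXISTS clause
# that drops any category whose own page carries the 'hiddencat' property.
QUERY = """
SELECT p.page_id AS id, p.page_title AS title, cl.cl_to AS category
FROM categorylinks cl
JOIN page p ON cl.cl_from = p.page_id
WHERE cl.cl_type = 'page'
  AND p.page_namespace = 0
  AND p.page_is_redirect = 0
  AND NOT EXISTS (
      SELECT 1
      FROM page cat
      JOIN page_props pp ON pp.pp_page = cat.page_id
      WHERE cat.page_namespace = 14
        AND cat.page_title = cl.cl_to
        AND pp.pp_propname = 'hiddencat'
  )
ORDER BY category
"""


def build_toy_db():
    """Create an in-memory database with a tiny, invented data set."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE page (
            page_id INTEGER PRIMARY KEY,
            page_namespace INTEGER,
            page_title TEXT,
            page_is_redirect INTEGER
        );
        CREATE TABLE categorylinks (cl_from INTEGER, cl_to TEXT, cl_type TEXT);
        CREATE TABLE page_props (pp_page INTEGER, pp_propname TEXT, pp_value TEXT);
    """)
    conn.executemany("INSERT INTO page VALUES (?, ?, ?, ?)", [
        (1, 0, "Flow_cytometry_bioinformatics", 0),
        (10, 14, "Flow_cytometry", 0),
        (11, 14, "Bioinformatics", 0),
        (12, 14, "CS1_maint:_Multiple_names:_authors_list", 0),
    ])
    conn.executemany("INSERT INTO categorylinks VALUES (?, ?, ?)", [
        (1, "Flow_cytometry", "page"),
        (1, "Bioinformatics", "page"),
        (1, "CS1_maint:_Multiple_names:_authors_list", "page"),
    ])
    # Assumption for the example: only the CS1 category is marked hidden.
    conn.execute("INSERT INTO page_props VALUES (12, 'hiddencat', '')")
    return conn


def visible_categories(conn):
    """Return the non-hidden categories, sorted by category title."""
    return [row[2] for row in conn.execute(QUERY)]


if __name__ == "__main__":
    print(visible_categories(build_toy_db()))
```

A filter like this only removes categories that are explicitly flagged as hidden, so any maintenance category left visible would still need a separate pruning step.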
_______________________________________________
Wiki-research-l mailing list
Wiki-research-l(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l