Categories problem

Julien Lemoine

27 Aug 2006

11:23 a.m.

Hello,

I read the thread "how bad is a category with ....", and I was wondering how categories are filled. If I understand correctly, categories are filled in by the editors of each article. This assumes that these editors know the whole set of categories and that these categories will not change over time. I was wondering whether there are projects to help *detect* categories and then to help editors by *suggesting* categories?

I am thinking about two different technologies to help deal with these two problems:

1) Text clustering to help find categories, but probably not using the classical approaches where the word space is used to describe a document (applying part-of-speech tagging http://en.wikipedia.org/wiki/Part-of-speech_tagging, stemming http://en.wikipedia.org/wiki/Stemmer, ...). I am thinking about clustering the link graph (this seems similar to the clique problem http://en.wikipedia.org/wiki/Clique_problem but with different constraints), i.e. each document will not be described by its words (or lemmas, LSA vector, ...) but by its links to other articles, using an algorithm that does not need the number of clusters before processing but does need a distance or a similarity threshold. With this kind of processing, you will have a set of clusters that are linked together, but a cluster will probably not be a complete graph (this is the difference from the clique problem).

Once you have the clusters, you need to try labelling them with a category:
- give the user the role of identifying the category name
- use the word space to find the best words to describe this set of articles
- ...

Then you can run this algorithm on a category to try to split it into subcategories.

2) Machine learning or link-graph exploration to suggest categories while an article is being edited. The first idea is to learn the existing categories with a machine learning algorithm (using the word space) in order to guess the categories of a new article (but this algorithm will have to deal with new categories and with the fact that the number of documents without a category is greater than the number of documents with one). The second idea is much simpler and easier to implement: when you edit an article, you can suggest the categories of linked articles (this can be replaced by another graph-exploration algorithm).
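
The link-graph clustering described in point 1 could be sketched roughly as follows. This is only an illustration under assumptions not stated in the message: Jaccard overlap of outgoing links is used as the similarity, and clusters are formed by single-link merging with a union-find structure. The names `jaccard` and `cluster_by_links` are hypothetical. As in the description, the number of clusters is not given in advance, only a similarity threshold, and a resulting cluster is connected but need not be a complete graph.

```python
from itertools import combinations


def jaccard(a: set[str], b: set[str]) -> float:
    """Similarity of two link sets: shared links divided by all links."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_by_links(links: dict[str, set[str]], threshold: float) -> list[set[str]]:
    """Group articles whose outgoing-link sets are at least `threshold` similar.

    Single-link clustering: two articles end up in the same cluster if a
    chain of sufficiently similar pairs connects them.
    """
    parent = {title: title for title in links}

    def find(title: str) -> str:
        # Follow parent pointers to the cluster representative,
        # shortening the path along the way.
        while parent[title] != title:
            parent[title] = parent[parent[title]]
            title = parent[title]
        return title

    # Compare every pair of articles (quadratic; fine for a sketch only).
    for a, b in combinations(links, 2):
        if jaccard(links[a], links[b]) >= threshold:
            parent[find(a)] = find(b)

    clusters: dict[str, set[str]] = {}
    for title in links:
        clusters.setdefault(find(title), set()).add(title)
    return list(clusters.values())
```

The pairwise comparison is quadratic in the number of articles, so a real run over a whole wiki would need some form of candidate pruning; that part is left out here.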

Are there functions like these in Wikimedia? And do you think that this kind of algorithm could help? Finally, do you know of people working on such functionality (maybe people working on the semantic web)?

Best Regards. Julien Lemoine

Platonides

27 Aug

9:13 p.m.

Add category redirections to it. If you tried to save with a category which is a redirect, you shouldn't be able to, or it should automagically change to the redirect target, though it may be confusing to see the wiki not obeying you ;-) As a plus, the job queue could handle the case when redirects are *being made*. However, not all kinds of category additions may be detected by an automated process.

Steve Bennett

10 p.m.

On 8/27/06, Platonides Platonides@gmail.com wrote:

...

Add to it category redirections. If you tried to save with a category which is a redirect, you shouldn't be able to, or automagically change to the redirected one though it may be confusing seeing the wiki don't obeying you ;-)

I think the behaviour where the saved text is not what you actually typed ought to be kept to a minimum. At current I know of that happening in two instances: ~~~~ (and variants) and [[Foo (blah)|]], which is replaced by [[Foo (blah)|Foo]]. It's kind of surprising to the user...

Steve
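
The second rewrite mentioned above, [[Foo (blah)|]] becoming [[Foo (blah)|Foo]] on save, can be illustrated with a small sketch. This is not MediaWiki's actual code: the function name `pre_save_transform` and the regular expression are assumptions made for illustration, and only the parenthetical case discussed in this thread is handled.

```python
import re

# Matches [[Title (qualifier)|]]: a link with a trailing parenthetical
# and an empty label. Group 1 is the full target, group 2 the bare title.
PIPE_TRICK = re.compile(r"\[\[(([^\[\]|()]+?)\s*\([^\[\]|()]*\))\|\]\]")


def pre_save_transform(wikitext: str) -> str:
    """Expand empty-label links before the text is stored."""

    def expand(match: re.Match) -> str:
        target, bare_title = match.group(1), match.group(2)
        return f"[[{target}|{bare_title}]]"

    return PIPE_TRICK.sub(expand, wikitext)
```

Because the expansion happens before storage, the saved text differs from what was typed, which is exactly the surprise being discussed.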

Jay R. Ashworth

10:30 p.m.

On Mon, Aug 28, 2006 at 12:00:26AM +0200, Steve Bennett wrote:

...

On 8/27/06, Platonides Platonides@gmail.com wrote:

...
Add to it category redirections. If you tried to save with a category which is a redirect, you shouldn't be able to, or automagically change to the redirected one though it may be confusing seeing the wiki don't obeying you ;-)

I think the behaviour where the saved text is not what you actually typed ought to be kept to a minimum. At current I know of that happening in two instances: ~~~~ (and variants) and [[Foo (blah)|]], which is replaced by [[Foo (blah)|Foo]]. It's kind of surprising to the user...

I concur, but both examples are miswarts: on inspection, it's pretty clear why those choices were made...

Cheers, -- jra

-- Jay R. Ashworth jra@baylink.com Designer Baylink RFC 2100 Ashworth & Associates The Things I Think '87 e24 St Petersburg FL USA http://baylink.pitas.com +1 727 647 1274 The Internet: We paved paradise, and put up a snarking lot.

Rob Church

10:42 p.m.

On 27/08/06, Jay R. Ashworth jra@baylink.com wrote:

...

On Mon, Aug 28, 2006 at 12:00:26AM +0200, Steve Bennett wrote:

...
On 8/27/06, Platonides Platonides@gmail.com wrote:

...
Add to it category redirections. If you tried to save with a category which is a redirect, you shouldn't be able to, or automagically change to the redirected one though it may be confusing seeing the wiki don't obeying you ;-)

I think the behaviour where the saved text is not what you actually typed ought to be kept to a minimum. At current I know of that happening in two instances: ~~~~ (and variants) and [[Foo (blah)|]], which is replaced by [[Foo (blah)|Foo]]. It's kind of surprising to the user...

I concur, but both examples are miswarts: on inspection, it's pretty clear why those choices were made...

You wouldn't need to change the saved text; the magical redirection could occur during the links update process.

Rob Church
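
A rough sketch of what resolving category redirects during the links update might look like, leaving the saved text untouched. Everything here is hypothetical: the `redirects` mapping, the function names and the simplified category-link pattern are assumptions for illustration, not MediaWiki internals.

```python
import re

# Simplified pattern for [[Category:Name]] and [[Category:Name|sort key]].
CATEGORY = re.compile(r"\[\[Category:([^\[\]|]+)(?:\|[^\[\]]*)?\]\]")


def resolve_redirect(name: str, redirects: dict[str, str]) -> str:
    """Follow category redirects to their final target, guarding against loops."""
    seen: set[str] = set()
    while name in redirects and name not in seen:
        seen.add(name)
        name = redirects[name]
    return name


def category_links(wikitext: str, redirects: dict[str, str]) -> set[str]:
    """Return the categories to record for a page.

    The wikitext itself is only read, never rewritten: redirects are
    applied to the recorded links, not to what the editor typed.
    """
    return {
        resolve_redirect(match.group(1).strip(), redirects)
        for match in CATEGORY.finditer(wikitext)
    }
```

With this split, the editor still sees the category name they typed, while the page is listed under the redirect target.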

Andre Engels

29 Aug

9:49 a.m.

2006/8/28, Jay R. Ashworth jra@baylink.com:

...

...
I think the behaviour where the saved text is not what you actually typed ought to be kept to a minimum. At current I know of that happening in two instances: ~~~~ (and variants) and [[Foo (blah)|]], which is replaced by [[Foo (blah)|Foo]]. It's kind of surprising to the user...

I concur, but both examples are miswarts: on inspection, it's pretty clear why those choices were made...

I agree where the ~~~~ is concerned, but how about [[Foo (blah)|]]? I have always considered the way that has been programmed to be some kind of error. I think it would have been better to keep [[Foo (blah)|]] in the text, and change it at parsing. Apart from the 'surprise' factor, I see two more advantages:
* It diminishes the human readability of the wiki text less (just disregard the special symbols and anything between brackets and you get the 'flat text')
* It makes it much easier for newbies to learn this 'trick'. Now you have to hear or read it somewhere, but if it were not changed on save, one could find it in existing wiki text and learn that way.

-- Andre Engels, andreengels@gmail.com ICQ: 6260644 -- Skype: a_engels
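
The alternative proposed above, keeping [[Foo (blah)|]] in the stored text and deriving the label only at parse time, might look something like the sketch below. It is a toy renderer written for illustration: `render_links`, the `/wiki/` URL prefix and both patterns are assumptions, not the real parser.

```python
import html
import re

# [[Target]] or [[Target|label]]; the label may be empty.
LINK = re.compile(r"\[\[([^\[\]|]+)(?:\|([^\[\]]*))?\]\]")
# A trailing parenthetical such as " (blah)".
PAREN_SUFFIX = re.compile(r"\s*\([^()]*\)$")


def render_links(wikitext: str) -> str:
    """Render wiki links to HTML without ever changing the stored wikitext."""

    def to_html(match: re.Match) -> str:
        target, label = match.group(1), match.group(2)
        if label is None:
            # [[Foo]]: the label is simply the target.
            label = target
        elif label == "":
            # [[Foo (blah)|]]: derive the label here, at parse time.
            label = PAREN_SUFFIX.sub("", target)
        href = "/wiki/" + target.replace(" ", "_")
        return f'<a href="{html.escape(href)}">{html.escape(label)}</a>'

    return LINK.sub(to_html, wikitext)
```

In this design the empty-label form stays visible in existing wikitext, so it can be discovered by reading other pages, at the cost of putting the special case into the parser.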

Steve Bennett

11:44 a.m.

On 8/29/06, Andre Engels andreengels@gmail.com wrote:

...

I agree where the ~~~~ is concerned, but how about [[Foo (blah)|]]? I have always considered the way that has been programmed to be some kind of error. I think it would have been better to keep [[Foo (blah)|]] in the text, and change it at parsing. Apart from the 'surprise' factor, I see two more advantages:

* It diminishes the human readability of the wiki text less (just disregard the special symbols and anything between brackets and you get the 'flat text')
* It makes it much easier for newbies to learn this 'trick'. Now you have to hear or read it somewhere, but if it were not changed on save, one could find it in existing wiki text and learn that way.

Agree with all that - I would be curious to hear what the rationale was. Most likely it simplifies parsing to do it this way, as the parser doesn't need to be touched.

On the other hand, it raises questions about the true syntax of MediaWiki: is [[foo (blah)|]] well-formed Wikitext or not? Technically, it's not. Practically, it is.

Steve

Jay R. Ashworth

3:56 p.m.

On Tue, Aug 29, 2006 at 01:44:54PM +0200, Steve Bennett wrote:

...

On the other hand, it raises questions about the true syntax of MediaWiki: is [[foo (blah)|]] well-formed Wikitext or not? Technically, it's not. Practically, it is.

I don't see why it's not well-formed. They've just made it magic, because it looks prettier on output to hide the parenthetical.

Again: miswart -- looks broken, but not, because it was done for a good reason.

Cheers, -- jra

Steve Bennett

4:46 p.m.

On 8/29/06, Jay R. Ashworth jra@baylink.com wrote:

...

I don't see why it's not well-formed. They've just made it magic, because it looks prettier on output to hide the parenthetical.

How would you know? How can you test if the parser can parse something if every time you try and save it, mediawiki converts it into something else?

Maybe something really tricky like using subst would work: [[foo (blah){{a}} where {{a}} is the string "|]]".

Steve

Rob Church

4:49 p.m.

On 29/08/06, Steve Bennett stevage@gmail.com wrote:

...

How would you know? How can you test if the parser can parse something if every time you try and save it, mediawiki converts it into something else?

The pedantic response: Special page extension or otherwise; pass it direct into the parser without going via a pre-save transform.

Rob Church

Jay R. Ashworth

5:53 p.m.

On Tue, Aug 29, 2006 at 06:46:24PM +0200, Steve Bennett wrote:

...

On 8/29/06, Jay R. Ashworth jra@baylink.com wrote:

...
I don't see why it's not well-formed. They've just made it magic, because it looks prettier on output to hide the parenthetical.

How would you know? How can you test if the parser can parse something if every time you try and save it, mediawiki converts it into something else?

Maybe something really tricky like using subst would work: [[foo (blah){{a}} where {{a}} is the string "|]]".

Your question is, I think, really "does the parser treat it as a first class object, handling all legal arguments in that situation?"

And yeah, that's a valid question. :-)

Cheers, -- jra

Brion Vibber

7:48 p.m.

Steve Bennett wrote:

...

On 8/29/06, Andre Engels andreengels@gmail.com wrote:

...
I agree where the ~~~~ is concerned, but how about [[Foo (blah)|]]? I have always considered the way that has been programmed to be some kind of error. I think it would have been better to keep [[Foo (blah)|]] in the text, and change it at parsing. Apart from the 'surprise' factor, I see two more advantages:

* It diminishes the human readability of the wiki text less (just disregard the special symbols and anything between brackets and you get the 'flat text')
* It makes it much easier for newbies to learn this 'trick'. Now you have to hear or read it somewhere, but if it were not changed on save, one could find it in existing wiki text and learn that way.

Agree with all that - I would be curious to hear what the rationale was.

So the feature could be removed if unpopular without breaking text.

-- brion vibber (brion @ pobob.com)

Jay R. Ashworth

3:54 p.m.

On Tue, Aug 29, 2006 at 11:49:44AM +0200, Andre Engels wrote:

...

2006/8/28, Jay R. Ashworth jra@baylink.com:

...
...
I think the behaviour where the saved text is not what you actually typed ought to be kept to a minimum. At current I know of that happening in two instances: ~~~~ (and variants) and [[Foo (blah)|]], which is replaced by [[Foo (blah)|Foo]]. It's kind of surprising to the user...

I concur, but both examples are miswarts: on inspection, it's pretty clear why those choices were made...

I agree where the ~~~~ is concerned, but how about [[Foo (blah)|]]? I have always considered the way that has been programmed to be some kind of error. I think it would have been better to keep [[Foo (blah)|]] in the text, and change it at parsing.

I'm of two minds on that. I think I might like invisible magic in the parser even less than I like visible magic in the subst:

...

Apart from the 'surprise' factor, I see two more advantages:
* It diminishes the human readability of the wiki text less (just disregard the special symbols and anything between brackets and you get the 'flat text')
* It makes it much easier for newbies to learn this 'trick'. Now you have to hear or read it somewhere, but if it were not changed on save, one could find it in existing wiki text and learn that way.

Well, if you learned it.

I really *would* like to know what percentage of our editors are the sort who would learn something like that; I'm *certain* it's declining... and we'd do well to remember that.

Cheers, -- jra

Platonides

28 Aug

9:45 p.m.

"Steve Bennett"

...

I think the behaviour where the saved text is not what you actually typed ought to be kept to a minimum. At current I know of that happening in two instances: ~~~~ (and variants) and [[Foo (blah)|]], which is replaced by [[Foo (blah)|Foo]]. It's kind of surprising to the user...

Steve

I agree it's annoying; I tried to improve on a message saying: "You can't add this category: it's a redirect." Uh...

Steve Bennett

10:45 p.m.

On 8/28/06, Platonides Platonides@gmail.com wrote:

...

I agree it's annoying, tried to improve a message saying: You can't add this category: it's a redirect. uh..

Why shouldn't category redirects work exactly as you'd expect: exactly the same as if you'd put the target of the redirect?

Steve

Nick Jenkins

4:47 a.m.

Hi Julien,

...

I was wondering if there is projects to help *detect* categories and then to help editors by *suggesting* categories ?

I think it's a good idea.

The closest thing I can see is at: http://en.wikipedia.org/wiki/Wikipedia:Auto-categorization (although I'm not 100% clear on what that project was doing with categories, so maybe it's not as related as it sounds, but I thought it best to mention it).

...

Once you have the clusters, you need to try labelling them with a category :

- give to the user the role of identifying the category name
- use the words space to find the better words that describe this set of articles

...

Then you can run this algorithm on a category to try to split it in sub categories.

How are you going to apply the categories? E.g. leave a message on the talk page / automatically apply them with a bot / an external web page where people can see possible categories / or something integrated into MediaWiki?

If it's something interactive, you can maybe produce a checkbox list of the top 10 possible categories, have the user tick all the categories that apply. Then for the categories that apply, you could maybe expand those and show the subcategories, and have the user tick the ones that apply, and keep on expanding the subcategories that apply until the user has gotten as specific as they can.
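
To make the "top 10 possible categories" idea concrete, here is a minimal sketch that combines it with Julien's second idea of suggesting the categories of linked articles. The function name `suggest_categories` and its inputs (`article_links`, `categories_of`, `already`) are hypothetical; how that data would be obtained from a wiki is left open.

```python
from collections import Counter
from collections.abc import Iterable, Mapping


def suggest_categories(
    article_links: Iterable[str],
    categories_of: Mapping[str, Iterable[str]],
    already: frozenset[str] = frozenset(),
    top_n: int = 10,
) -> list[str]:
    """Rank candidate categories by how many linked articles carry them."""
    counts: Counter[str] = Counter()
    for linked in article_links:
        for category in categories_of.get(linked, ()):
            # Skip categories the article being edited already has.
            if category not in already:
                counts[category] += 1
    # Most frequent first; this list would feed the checkbox form.
    return [category for category, _ in counts.most_common(top_n)]
```

The returned list is what an interface could show as checkboxes; expanding the ticked categories into subcategories would be a further step not covered here.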

...

The second idea is really more simple and easier to implement : When you edit an article, you can suggest categories of linked articles (can be replaced by an other graph-exploration algorithm).

Can it maybe also suggest which stubs to use? I can never remember what the right stub to use is (so I just use the standard "{{stub}}"), but there has got to be a better way than having to remember the full list of stub types.

All the best, Nick.

Julien Lemoine

6:02 a.m.

Hi Nick,

Nick Jenkins wrote:

...

Hi Julien,

...
I was wondering if there is projects to help *detect* categories and then to help editors by *suggesting* categories ?

I think it's a good idea.

The closest thing I can see is at: http://en.wikipedia.org/wiki/Wikipedia:Auto-categorization (although I'm not 100% clear on what that project was doing with categories, so maybe it's not as related as it sounds, but I thought it best to mention it).

Thank you for the link, it is very interesting. I will try to contact the author of this page.

...

...
Once you have the clusters, you need to try labelling them with a category :

- give to the user the role of identifying the category name
- use the words space to find the better words that describe this set of articles

...

Then you can run this algorithm on a category to try to split it in sub categories.

How are you going to apply the categories? E.g. leave a message on the talk page / automatically apply them with a bot / an external web page where people can see possible categories / or something integrated into MediaWiki?

I think a proof of concept on an external page is the better way to see whether the results are good enough to be integrated into MediaWiki.

...

If it's something interactive, you can maybe produce a checkbox list of the top 10 possible categories, have the user tick all the categories that apply. Then for the categories that apply, you could maybe expand those and show the subcategories, and have the user tick the ones that apply, and keep on expanding the subcategories that apply until the user has gotten as specific as they can.

Clustering is really an interactive process, but it needs quite a lot of CPU time. I don't think I will produce an interactive result, at least at the beginning.

...

...
The second idea is really more simple and easier to implement : When you edit an article, you can suggest categories of linked articles (can be replaced by an other graph-exploration algorithm).

Can it maybe also suggest which stubs to use? I can never remember what the right stub to use is (so I just use the standard "{{stub}}"), but there has got to be a better way than having to remember the full list of stub types.

Yes, you can extend it to stubs with the same kind of processing.

Best Regards. Julien Lemoine

wikitech-l@lists.wikimedia.org

participants (8)

Andre Engels
Brion Vibber
Jay R. Ashworth
Julien Lemoine
Nick Jenkins
Platonides
Rob Church
Steve Bennett