Is there someone operating a CatBot for deleting things like [[Category:FHM 100 Sexiest Women List]] from all the entries (more than 100, ironically, because it's "retroactive")? I created [[Category:Hustler Asshole of the Month]] to help make my point, but this is being ignored in favor of a typically titillating TNA trajectory, so... before I go ahead and populate it, some ideas are welcome. See http://en.wikipedia.org/wiki/Wikipedia:Categories_for_deletion
Geeks being largely guys, there seems to be a pass given to this one policy-wise, but that's unrelated to the need for pruning categories, IMHO.
s
Is there someone operating a CatBot for deleting things like [[Category:FHM 100 Sexiest Women List]] from all the entries
It occurs to me (in a fairly inconsequential and rambling way) that this need is due to the software's use of inline markup for what should be out-of-line metadata - i.e. we can't have an easy database removal of a category, because the article end of the equation is stored in the text of the article, rather than (as well as?) separately in the DB.
It's kind of obvious, but it's a disadvantage that more and more features are encountering (interlanguage links and some template maintenance tasks are similarly hard to automate, for the same reason).
I know there's been an absolute ton of discussion about metadata, key-value pairs, etc, but I wonder if it's not worth going for a fairly simple system of having a "metadata" text-box which is actually a magic interface to special fields in the DB - the user edits special markup like they do now (for translations, categories, etc), but it's parsed at save and not actually stored as text (except perhaps in some kind of cache if necessary). That way, a lot of 'bots' could be integrated and operate at the DB level, rather than sitting outside the software and pretending to be users.
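Roughly, the sort of save-time extraction I'm imagining would look like the sketch below - the regexes, table names and column names are all invented for illustration; nothing like this exists in the code today:

    import re

    # Invented patterns for the two kinds of inline "metadata" mentioned above:
    # category tags like [[Category:Foo]] and interlanguage links like [[de:Deutschland]].
    CATEGORY_RE = re.compile(r'\[\[Category:([^\]|]+)(?:\|[^\]]*)?\]\]')
    INTERLANG_RE = re.compile(r'\[\[([a-z]{2,3}):([^\]:]+)\]\]')

    def extract_metadata(wikitext):
        """Split inline metadata out of the article text, returning
        (cleaned_text, categories, interlanguage_links) so the metadata can
        live in its own DB tables instead of inside the text blob."""
        categories = [c.strip() for c in CATEGORY_RE.findall(wikitext)]
        interlang = INTERLANG_RE.findall(wikitext)
        cleaned = INTERLANG_RE.sub('', CATEGORY_RE.sub('', wikitext))
        return cleaned.rstrip(), categories, interlang

    def save_article(db, title, wikitext):
        # Hypothetical schema: article text stays in 'cur', metadata goes into
        # side tables that bots (or plain SQL) can operate on directly.
        text, categories, interlang = extract_metadata(wikitext)
        db.execute("UPDATE cur SET cur_text=%s WHERE cur_title=%s", (text, title))
        db.execute("DELETE FROM article_category WHERE ac_title=%s", (title,))
        for cat in categories:
            db.execute("INSERT INTO article_category (ac_title, ac_category)"
                       " VALUES (%s, %s)", (title, cat))
        # Removing a category from every article then becomes one statement:
        #   DELETE FROM article_category WHERE ac_category='FHM 100 Sexiest Women List'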
But I guess that's all just a pipe-dream, and there are plenty of things it doesn't solve. Maybe it would be better just to produce a special bot-friendly interface, instead (limitable to registered bot accounts); one that doesn't have to wait for and parse HTML responses, thus slowing down both it and the website...
I'll shut up now.
On Wed, 14 Jul 2004 18:24:06 +0100, Rowan Collins rowan.collins@gmail.com wrote:
Maybe it would be better just to produce a special bot-friendly interface, instead (limitable to registered bot accounts); one that doesn't have to wait for and parse HTML responses, thus slowing down both it and the website...
That sounds like a really good idea.
I was talking with Tim Starling on IRC briefly last night about maybe moving some of the parsing code out into Javascript so that the client could do some of the work.
An interface for viewing articles as unparsed wikicode (and minus the surrounding page framework, so it could all go in an IFRAME, like how Epoz and FCKedit work) would be useful for both a Javascript parser and a bot interface.
I assume there are routines for outputting articles this way anyway, since something needs to populate the <textarea> fields when editing.
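The read side would be trivial for a bot, too - something like this sketch, where the format=wikitext parameter is purely hypothetical, standing in for whatever the stripped-down interface ends up being called:

    import urllib.parse
    import urllib.request

    def fetch_wikitext(title):
        """Fetch the raw, unparsed wikicode of an article, with none of the
        surrounding skin/framework HTML.  The 'format=wikitext' parameter is
        made up for the sake of the example."""
        url = ("http://en.wikipedia.org/w/index.php?"
               + urllib.parse.urlencode({"title": title, "format": "wikitext"}))
        with urllib.request.urlopen(url) as response:
            return response.read().decode("utf-8")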
/off to go pick through code
-Bill Clark
Rowan Collins wrote:
I know there's been an absolute ton of discussion about metadata, key-value pairs, etc, but I wonder if it's not worth going for a fairly simple system of having a "metadata" text-box which is actually a magic interface to special fields in the DB - the user edits special markup like they do now (for translations, categories, etc), but it's parsed at save and not actually stored as text (except perhaps in some kind of cache if necessary). That way, a lot of 'bots' could be integrated and operate at the DB level, rather than sitting outside the software and pretending to be users.
Someone (was it me?) once proposed this two-step plan:
1. Take meta info out of the text on save and store it in special db fields, then paste it back at the end of the page on the next edit.
2. Once this works, split the edit box in two - one for article, one for meta data.
Acceptance is everything here :-)
But I guess that's all just a pipe-dream, and there are plenty of things it doesn't solve. Maybe it would be better just to produce a special bot-friendly interface, instead (limitable to registered bot accounts); one that doesn't have to wait for and parse HTML responses, thus slowing down both it and the website...
How about editing like "...?title=xyz&action=edit&mode=bot"? That could remove the framework, and return a blank page with either "OK" or "EDIT_CONFLICT" after saving, thus letting the bot know what happened.
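A bot's side of that conversation could then be as short as the sketch below (the mode=bot parameter and the bare-word replies are just the suggestion above, not anything the software currently does; the wpTextbox1/wpEdittime names are borrowed from the existing edit form):

    import urllib.parse
    import urllib.request

    def bot_save(title, new_text, edit_time):
        """Save a page through the proposed bot mode and report what happened."""
        url = ("http://en.wikipedia.org/w/index.php?"
               + urllib.parse.urlencode({"title": title,
                                         "action": "edit",
                                         "mode": "bot"}))
        postdata = urllib.parse.urlencode({
            "wpTextbox1": new_text,   # the article text
            "wpEdittime": edit_time,  # timestamp of the revision the bot started from
        }).encode("utf-8")
        with urllib.request.urlopen(url, postdata) as response:
            reply = response.read().decode("utf-8").strip()
        # No HTML to parse: the body is just a status word.
        if reply == "OK":
            return True
        if reply == "EDIT_CONFLICT":
            return False
        raise RuntimeError("unexpected reply from server: " + reply)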
I'll shut up now.
No, please, by all means, stay with it :-)
Magnus
Magnus Manske magnus.manske@web.de wrote:
Someone (was it me?) once proposed this two-step plan:
1. Take meta info out of the text on save and store it in special db fields, then paste it back at the end of the page on the next edit.
2. Once this works, split the edit box in two - one for article, one for meta data.
Acceptance is everything here :-)
That actually sounds very sensible - it allows for a fair bit of robustness testing on the conversion process. It might even be possible to leave the inline-syntax-extractor enabled to give users a warning if they enter metadata in the wrong box, but act on it anyway (as though it had been entered in the other box in the first place).
On a very minor note, if it did get implemented, it might be best to label it something less intimidating than "metadata", especially given the existing uses of "meta" within Wikimedia projects.
How about editing like "...?title=xyz&action=edit&mode=bot"? That could remove the framework, and return a blank page with either "OK" or "EDIT_CONFLICT" after saving, thus letting the bot know what happened.
That sounds good (plus "NO_PERMISSION" for 'protected' pages, etc). I guess the main thing that would need thinking about for such an interface is the *submission* part of the process - it could work essentially as now, but I wonder if it couldn't be simplified. The actual HTTP POST is probably simple enough, but then you've got to handle cookies for authentication, and there are 'hidden' fields to do with edit conflicts that need retaining and sending back. From a bot's point of view, it would be best if these were all rolled into one framework (some kind of XML?) so that the whole thing could be dumped back as one big POSTDATA field. I think cookies in particular seem an unnecessary overhead for a single-purpose bot, although I don't know how easily general authentication can be separated from cookie-specific code as the software stands.
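Just to illustrate what "one big POSTDATA field" might mean - the format below is entirely made up, it simply bundles the pieces listed above (credentials, the edit-conflict timestamp, the new text) into a single self-contained submission:

    import urllib.parse
    import urllib.request
    from xml.sax.saxutils import quoteattr, escape

    def build_submission(username, token, title, edit_time, text):
        """Roll authentication, conflict-detection data and the article text
        into one self-describing document, so the bot never has to juggle
        cookies or hidden form fields.  The format is invented."""
        return ("<submission>"
                "<auth user=%s token=%s/>"
                "<page title=%s edittime=%s/>"
                "<text>%s</text>"
                "</submission>") % (quoteattr(username), quoteattr(token),
                                    quoteattr(title), quoteattr(edit_time),
                                    escape(text))

    def submit(url, payload):
        data = urllib.parse.urlencode({"postdata": payload}).encode("utf-8")
        with urllib.request.urlopen(url, data) as response:
            return response.read().decode("utf-8").strip()  # "OK", "EDIT_CONFLICT", ...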
I'll shut up now.
No, please, by all means, stay with it :-)
Well, one of these days, I'll stop dreaming and submit a properly-written patch. Mañana, mañana...
Rowan Collins wrote:
removal of a category, because the article end of the equation is stored in the text of the article, rather than (as well as?) separately in the DB.
The reason that they are stored in the article text is that article texts are easy to edit. If you want to avoid this, you have to provide another easy way to input and edit the same information. Perhaps the edit page should have a list of one-line input fields for metadata under the big textarea, like this:
[ Germany is a country in Europe... ]
[ bla bla                           ]
[ bla bla                           ]

metadata: [ de:Deutschland    ]  (leave blank to remove)
metadata: [ category: Country ]
metadata: [ fr:Allemagne      ]
metadata: [                   ]  (add new metadata here)

[save] [preview]
However, this is the road to compartmentalization, and wiki has so far been very successful in doing without that.
A less intrusive approach would be to separate out all metadata when the page is saved, and never actually store metadata with the article text. In this case, it has to be collected from where it is stored before the textarea is presented to the next editor.
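In rough, invented code, rebuilding the textarea for the next editor might look like this (the db handle, table names and column names are all just placeholders for the sketch):

    def load_for_editing(db, title):
        """Fetch the stored article body, then append the metadata kept in
        separate tables, so the next editor sees it inline and can edit it
        exactly as before."""
        text = db.query_value("SELECT cur_text FROM cur WHERE cur_title=%s", (title,))
        lines = [text, ""]
        for (lang, target) in db.query_rows(
                "SELECT il_lang, il_to FROM interlang WHERE il_from=%s", (title,)):
            lines.append("[[%s:%s]]" % (lang, target))
        for (category,) in db.query_rows(
                "SELECT ac_category FROM article_category WHERE ac_title=%s", (title,)):
            lines.append("[[Category:%s]]" % category)
        return "\n".join(lines)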