Re: [Wikitech-l] GSoC Project: Category Moving

Samuel Wantman

2 Apr 2008

2:03 a.m.

I don't know if this has been discussed, but I'm hoping some serious consideration could be put into creating a category history that can be viewed and used for reverting. Every addition and removal of an article should be kept in the history. It should be possible to revert every change. Categories should be able to be put on watchlists. Without the ability to watch a category, see its history and revert changes, it is really not possible to get the categorization of articles to improve much. Considering the lack of these "wiki" features it is quite remarkable that categories have gotten as good as they are in several projects. It is extremely frustrating to create and populate a category with hundreds of members, just to have someone undo all or most of the effort. There is no easy way to monitor a category or undo the damage. This technical limitation has the effect of strengthening the status quo and quashing innovation.

-- Samuel Wantman [[en:User:Sam]]


Bryan Tong Minh

2 Apr

5 a.m.

New subject: GSoC Project: Category Moving

On Wed, Apr 2, 2008 at 8:03 AM, Samuel Wantman wantman@earthlink.net wrote:

...

I don't know if this has been discussed, but I'm hoping some serious consideration could be put into creating a category history that can be viewed and used for reverting. Every addition and removal of an article should be kept in the history. It should be possible to revert every change. Categories should be able to be put on watchlists. Without the ability to watch a category, see its history and revert changes, it is really not possible to get the categorization of articles to improve much. Considering the lack of these "wiki" features it is quite remarkable that categories have gotten as good as they are in several projects. It is extremely frustrating to create and populate a category with hundreds of members, just to have someone undo all or most of the effort. There is no easy way to monitor a category or undo the damage. This technical limitation has the effect of strengthening the status quo and quashing innovation.

-- Samuel Wantman [[en:User:Sam]]

Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l

In fact it would be useful to have a general "links recentchanges" which tracks changes in pagelinks, imagelinks, categorylinks and interwikilinks.
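One way to picture what such a "links recentchanges" would record is a plain set difference between the link targets of a page before and after an edit. This is only a minimal sketch: `LinkChange` and `diff_links` are hypothetical names invented for illustration, not MediaWiki code, and the link types are just labelled with the table names mentioned above.

```python
from typing import Iterable, List, NamedTuple


class LinkChange(NamedTuple):
    """One hypothetical 'links recentchanges' entry."""
    page_id: int
    link_type: str   # e.g. "pagelinks", "imagelinks", "categorylinks"
    target: str
    action: str      # "added" or "removed"


def diff_links(page_id: int, link_type: str,
               old_targets: Iterable[str],
               new_targets: Iterable[str]) -> List[LinkChange]:
    """Compare the link targets before and after an edit."""
    old, new = set(old_targets), set(new_targets)
    # Targets present only after the edit were added by it.
    changes = [LinkChange(page_id, link_type, t, "added")
               for t in sorted(new - old)]
    # Targets present only before the edit were removed by it.
    changes += [LinkChange(page_id, link_type, t, "removed")
                for t in sorted(old - new)]
    return changes
```

Each returned entry could then be stored alongside the user and timestamp of the edit, which is the information a category watchlist or history view would need.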

Bryan

Simetrical

3 Apr

10:36 a.m.

New subject: GSoC Project: Category Moving

On Wed, Apr 2, 2008 at 2:03 AM, Samuel Wantman wantman@earthlink.net wrote:

...

I don't know if this has been discussed, but I'm hoping some serious consideration could be put into creating a category history that can be viewed and used for reverting.

That would be a very good feature, yes. It's also worth considering at some point.

On Wed, Apr 2, 2008 at 4:58 AM, Bryan Tong Minh bryan.tongminh@gmail.com wrote:

...

Wouldn't it be easier, for upgrading and backwards compatibility, to keep the current cl_to field, which would indicate the category named in the wikitext, and add a cl_id field that indicates the real category being pointed to?

cl_to is a VARCHAR(255) times 200 million rows. Being able to get rid of it would significantly reduce the size (therefore also, to some extent, improve the speed) of the categorylinks table. Furthermore, having both the name and ID stored will unnecessarily allow inconsistency, i.e., it's gratuitously denormalized.

There will probably have to be a transitional period where both fields are present, just for the sake of updating. However, I'm viewing this as best made an intra-version period, so it changes totally from one release to the next. This is a breaking schema change, but we can't *always* avoid those. We don't have major versions that we can pack them all into; instead we sprinkle them in minor versions.

On Wed, Apr 2, 2008 at 8:02 AM, Roan Kattouw roan.kattouw@home.nl wrote:

...

Simetrical wrote:

...
Well, the simple SQL query could turn out to be a problem for very large categories. I might be wrong; a single update may well run faster than the insert/delete we have right now for large page deletions.

That's why I suggested using the category table rather than changing lots of rows in categorylinks.

Using the category table how? Just changing the id's? It doesn't work if you want to then change them back, or alter redirects. You could do a join, but that seems like it would break sorted retrieval.

...

There is one thing nobody mentioned yet: nonexistent categories can have members, so it's possible to move one category on top of another one. For example, let [[Category:A]] be an existent category and [[Category:B]] a nonexistent one that does have members. If [[Category:A]] is then moved to [[Category:B]] (which is allowed, since the target doesn't exist), the categories would have to be merged. The thing is that A and B had different category IDs before the move, but the merged category will only have one ID after the move. This again means updating category IDs in the categorylinks table. We could probably use row count estimates here to decide which ID the unified category gets (A's or B's, depending on which one would result in more rows being changed) and stuff the UPDATEs in the job queue if both estimates are unacceptably large.
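The row-count heuristic described above can be sketched as follows. This is a rough illustration only: `MergePlan`, `plan_merge` and the `max_inline_rows` threshold are hypothetical names, and the row counts are assumed to come from whatever estimate the database provides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MergePlan:
    keep_id: int         # category ID the merged category ends up with
    rewrite_id: int      # category ID whose categorylinks rows get rewritten
    rows_to_change: int  # estimated number of rows to UPDATE
    deferred: bool       # True if the UPDATEs should go to the job queue


def plan_merge(id_a: int, rows_a: int, id_b: int, rows_b: int,
               max_inline_rows: int) -> MergePlan:
    """Keep the ID that has more rows, so fewer rows need rewriting."""
    if rows_a >= rows_b:
        keep_id, rewrite_id, rows = id_a, id_b, rows_b
    else:
        keep_id, rewrite_id, rows = id_b, id_a, rows_a
    # The smaller count exceeds the threshold only when both counts do,
    # which is the "both estimates are unacceptably large" case.
    return MergePlan(keep_id, rewrite_id, rows, deferred=rows > max_inline_rows)
```

For example, `plan_merge(1, 5000, 2, 40, 1000)` keeps ID 1 and rewrites the 40 rows of ID 2 immediately, while two categories with thousands of members each would be deferred.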

Why would we want to allow moving one category on top of another? Why not ban it, and allow people to create a redirect if they want to "merge" them?

Roan Kattouw

10:41 a.m.

New subject: GSoC Project: Category Moving

Simetrical wrote:

...

Using the category table how? Just changing the id's? It doesn't work if you want to then change them back, or alter redirects. You could do a join, but that seems like it would break sorted retrieval.

By keeping the category ID the same over a move (like we do with page IDs now), so you only have to change the category's name in the category table rather than in potentially millions of categorylinks rows.
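A minimal sketch of this idea, using SQLite in place of MySQL and a heavily simplified schema: the `cl_id` column is the one proposed in this thread rather than an existing field, and `build_demo_db`, `move_category` and `members` are hypothetical helpers written only for illustration.

```python
import sqlite3


def build_demo_db() -> sqlite3.Connection:
    """Create a simplified category/categorylinks pair (assumed schema)."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE category (
            cat_id    INTEGER PRIMARY KEY,
            cat_title TEXT UNIQUE NOT NULL
        );
        CREATE TABLE categorylinks (
            cl_from INTEGER NOT NULL,                           -- page ID
            cl_id   INTEGER NOT NULL REFERENCES category(cat_id),
            PRIMARY KEY (cl_from, cl_id)
        );
    """)
    db.execute("INSERT INTO category (cat_id, cat_title) VALUES (1, 'Old_name')")
    db.executemany("INSERT INTO categorylinks (cl_from, cl_id) VALUES (?, 1)",
                   [(page_id,) for page_id in range(100, 105)])
    return db


def move_category(db: sqlite3.Connection, old_title: str, new_title: str) -> int:
    """Rename a category by touching one row; categorylinks is untouched."""
    with db:  # commit on success, roll back on error
        cur = db.execute("UPDATE category SET cat_title = ? WHERE cat_title = ?",
                         (new_title, old_title))
    return cur.rowcount


def members(db: sqlite3.Connection, title: str) -> list:
    """Look up member page IDs by category title via a join."""
    rows = db.execute(
        "SELECT cl_from FROM categorylinks "
        "JOIN category ON cat_id = cl_id "
        "WHERE cat_title = ? ORDER BY cl_from", (title,))
    return [row[0] for row in rows]
```

The move itself updates a single row however many members the category has. The lookup by title now needs a join, which is where the concern about sorted retrieval raised earlier would have to be checked against the real schema and indexes.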

...

Why would we want to allow moving one category on top of another? Why not ban it, and allow people to create a redirect if they want to "merge" them?

That's also an option, hadn't thought of that.

Roan Kattouw (Catrope)

Platonides

5 Apr

5:06 p.m.

New subject: GSoC Project: Category Moving

Roan Kattouw wrote:

...

...
Why would we want to allow moving one category on top of another? Why not ban it, and allow people to create a redirect if they want to "merge" them?

That's also an option, hadn't thought of that.

Roan Kattouw (Catrope)

Merge it with another requested feature: blue category links (used categories should be treated as existing). So when you add a nonexistent category, the category page is automatically created. A warning step could be placed before auto-creation, which is good, since you probably don't want to use a nonexistent category. Or you could use a switch to force added categories to exist (if you want a new one, create the page separately).

Marco Schuster

6 Apr

6:46 a.m.

New subject: GSoC Project: Category Moving

Samuel Wantman wrote:

...

I don't know if this has been discussed, but I'm hoping some serious consideration could be put into creating a category history that can be viewed and used for reverting. Every addition and removal of an article should be kept in the history. It should be possible to revert every change. Categories should be able to be put on watchlists. Without the ability to watch a category, see its history and revert changes, it is really not possible to get the categorization of articles to improve much. Considering the lack of these "wiki" features it is quite remarkable that categories have gotten as good as they are in several projects. It is extremely frustrating to create and populate a category with hundreds of members, just to have someone undo all or most of the effort. There is no easy way to monitor a category or undo the damage. This technical limitation has the effect of strengthening the status quo and quashing innovation.

You could use a combination of toolserver and some hook in MediaWiki:

1) When a user adds or removes a category, an SQL query is called to update the category table. Maybe there is also a hook there, which is exactly what we need.
2) When the hook is run, the toolserver is contacted in some way (maybe via UDP, with a UDP server listening on the TS), and the TS then knows that in article A the category B was added/removed by user C.
3) The TS can then make the gathered data available to the public (human-readable, bot-readable, whatever).
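The three steps above could be sketched roughly as follows, with both ends running on one machine. Everything here is an assumption made for illustration: the JSON message format, the function names, and the use of the loopback address. No actual MediaWiki hook or toolserver service is shown.

```python
import json
import socket


def make_listener(host: str = "127.0.0.1") -> socket.socket:
    """Stand-in for the UDP server that would listen on the toolserver."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((host, 0))   # port 0 lets the OS pick a free port
    sock.settimeout(2.0)   # do not block forever if a datagram is lost
    return sock


def notify_category_change(addr, page: str, category: str,
                           user: str, action: str) -> None:
    """What a hook handler might send when a category is added or removed."""
    payload = json.dumps({"page": page, "category": category,
                          "user": user, "action": action}).encode("utf-8")
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sender:
        sender.sendto(payload, addr)


def receive_change(listener: socket.socket) -> dict:
    """Read and decode one notification on the listening side."""
    data, _sender = listener.recvfrom(65535)
    return json.loads(data.decode("utf-8"))
```

UDP gives no delivery guarantee, so a history built this way could silently miss changes. That is one reason a tracker integrated into the software would be more dependable than this kind of side channel.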

Marco

Bryan Tong Minh

6:56 a.m.

New subject: GSoC Project: Category Moving

On Sun, Apr 6, 2008 at 12:46 PM, Marco Schuster marco@harddisk.is-a-geek.org wrote:

...

Samuel Wantman wrote:

...
I don't know if this has been discussed, but I'm hoping some serious consideration could be put into creating a category history that can be viewed and used for reverting. Every addition and removal of an article should be kept in the history. It should be possible to revert every change. Categories should be able to be put on watchlists. Without the ability to watch a category, see its history and revert changes, it is really not possible to get the categorization of articles to improve much. Considering the lack of these "wiki" features it is quite remarkable that categories have gotten as good as they are in several projects. It is extremely frustrating to create and populate a category with hundreds of members, just to have someone undo all or most of the effort. There is no easy way to monitor a category or undo the damage. This technical limitation has the effect of strengthening the status quo and quashing innovation.

You could use a combination of toolserver and some hook in MediaWiki:

1) When a user adds or removes a category, an SQL query is called to update the category table. Maybe there is also a hook there, which is exactly what we need.
2) When the hook is run, the toolserver is contacted in some way (maybe via UDP, with a UDP server listening on the TS), and the TS then knows that in article A the category B was added/removed by user C.
3) The TS can then make the gathered data available to the public (human-readable, bot-readable, whatever).

Marco

We don't want features that should be in MediaWiki to be put on the toolserver. I opened a bug some days ago for tracking of link changes, see https://bugzilla.wikimedia.org/show_bug.cgi?id=13588. Please add implementation details there.

Bryan

Platonides

11:21 a.m.

New subject: GSoC Project: Category Moving

Bryan Tong Minh wrote:

...

We don't want features that should be in MediaWiki to be put on the toolserver. I opened a bug some days ago for tracking of link changes, see https://bugzilla.wikimedia.org/show_bug.cgi?id=13588. Please add implementation details there.

Bryan

Couldn't you track the link changes based on the binlogs content? It'd be quite low level, but already feasible.

Simetrical

11:54 a.m.

New subject: GSoC Project: Category Moving

On Sun, Apr 6, 2008 at 11:21 AM, Platonides Platonides@gmail.com wrote:

...

Couldn't you track the link changes based on the binlogs content? It'd be quite low level, but already feasible.

Er, assuming that the user is keeping binlogs, and using MySQL for that matter. And assuming your application has filesystem read access to the binlogs, which would be approximately equal to having read access to all databases for security purposes. And assuming that they're kept forever, or that you're okay with having your history truncated every time the binlogs are. Which is a lot of assumptions.

The obvious way to implement link versioning is just to have fields like cl_from be foreign keys into revision, not page. The slight issue with this is increasing the size of all the links tables by a factor of, say, a hundred, when they're already among the largest tables. It should be feasible somehow; it's no more data to version than the article data, and in fact it's redundant to the article data. But no obvious plan strikes me at first thought.

Platonides

1:03 p.m.

New subject: GSoC Project: Category Moving

Simetrical wrote:

...

On Sun, Apr 6, 2008 at 11:21 AM, Platonides wrote:

...
Couldn't you track the link changes based on the binlogs content? It'd be quite low level, but already feasible.

Er, assuming that the user is keeping binlogs, and using MySQL for that matter. And assuming your application has filesystem read access to the binlogs, which would be approximately equal to having read access to all databases for security purposes. And assuming that they're kept forever, or that you're okay with having your history truncated every time the binlogs are. Which is a lot of assumptions.

I was merely assuming a toolserver scenario. So MySQL is used, having read access isn't a big deal, and sensitive data isn't replicated (I think). Plus, why keep them forever? You probably only want a short time window for tracking changes. Thus the new field would be useless for many entries.

Simetrical

1:12 p.m.

New subject: GSoC Project: Category Moving

On Sun, Apr 6, 2008 at 1:03 PM, Platonides Platonides@gmail.com wrote:

...

I was merely assuming a toolserver scenario.

Wrong list? :) A toolserver tool isn't a useful solution if we're talking about software development, and isn't great even if we're only talking about Wikimedia use (which there's no reason we should be).

...

So MySQL is used, having read access isn't a big deal, and sensitive data isn't replicated (I think).

The sensitive data on the toolserver is cordoned off by views for ordinary users. Roots can access all sensitive info, AFAIK. In fact, if my understanding is correct, the toolserver uses the exact same binlogs as the main cluster's slaves. At any rate, the binlogs are going to be on the database servers, not the normal toolserver people have access to.

Is it even possible to configure MySQL to only log certain statements, within a given database or table? It seems like it would be impractical. You would have to rewrite the UPDATE statements. If it used row-based replication, that might be different, but that doesn't work well with MySQL 4.0. :)

...

Plus, why keep them forever? You probably only want a short time window for tracking changes. Thus the new field would be useless for many entries.

I guess you're not aiming very high with this. Undoubtedly someone could hack up some toolserver thing, yeah. I'd be more interested in talking about long-term, scalable solutions, properly integrated into the software.

Bryan Tong Minh

1:31 p.m.

New subject: GSoC Project: Category Moving

On Sun, Apr 6, 2008 at 7:12 PM, Simetrical Simetrical+wikilist@gmail.com wrote:

...

...
Plus, why keep them forever? You probably only want a short time window for tracking changes. Thus the new field would be useless for many entries.

I guess you're not aiming very high with this. Undoubtedly someone could hack up some toolserver thing, yeah. I'd be more interested in talking about long-term, scalable solutions, properly integrated into the software.

If storing this forever would be a problem, it could easily be purged if stored in a separate table, just as is done for recentchanges.
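A minimal sketch of such a separate, purgeable table, using SQLite for brevity. The `linkchanges` table, its `lc_` columns and the helper functions are all hypothetical names chosen for illustration; only the idea of purging by age, as with recentchanges, comes from the discussion.

```python
import sqlite3


def open_log() -> sqlite3.Connection:
    """Create a hypothetical link-change log, separate from the links tables."""
    db = sqlite3.connect(":memory:")
    db.execute("""
        CREATE TABLE linkchanges (
            lc_id        INTEGER PRIMARY KEY,
            lc_timestamp INTEGER NOT NULL,   -- Unix time of the change
            lc_page      INTEGER NOT NULL,   -- page whose links changed
            lc_type      TEXT NOT NULL,      -- e.g. 'categorylinks'
            lc_target    TEXT NOT NULL,
            lc_action    TEXT NOT NULL       -- 'added' or 'removed'
        )""")
    # An index on the timestamp keeps the purge from scanning the whole table.
    db.execute("CREATE INDEX lc_timestamp_idx ON linkchanges (lc_timestamp)")
    return db


def log_change(db, timestamp, page, link_type, target, action) -> None:
    """Append one change record."""
    with db:
        db.execute(
            "INSERT INTO linkchanges "
            "(lc_timestamp, lc_page, lc_type, lc_target, lc_action) "
            "VALUES (?, ?, ?, ?, ?)",
            (timestamp, page, link_type, target, action))


def purge_older_than(db, now: int, max_age_seconds: int) -> int:
    """Delete entries older than the retention window; return how many."""
    with db:
        cur = db.execute("DELETE FROM linkchanges WHERE lc_timestamp < ?",
                         (now - max_age_seconds,))
    return cur.rowcount
```

Because the log lives in its own table, the retention window can be changed or the purge skipped entirely without touching categorylinks or the other links tables.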

Marco Schuster

4:16 p.m.

New subject: GSoC Project: Category Moving

Simetrical wrote:

...

I guess you're not aiming very high with this. Undoubtedly someone could hack up some toolserver thing, yeah. I'd be more interested in talking about long-term, scalable solutions, properly integrated into the software.

In my opinion it's easier to have a quick and dirty hack that provides this feature for the people who need it. And if something really good is implemented later, we can switch to it. One good example of this thinking is, IMO, Leon's access counters, which have now been made obsolete by http://stats.grok.se/.

Marco

wikitech-l@lists.wikimedia.org

12 comments

6 participants

participants (6)

Bryan Tong Minh
Marco Schuster
Platonides
Roan Kattouw
Samuel Wantman
Simetrical