Okay. Since the watchlist on en: is disabled, I have to think.
I suspect the problem is that when a user checks her watchlist, MediaWiki gets the list and searches through all the articles to see the recent changes. Right?
If not, discard my email.
If so....
I suspect there are 150000+ articles to check against 10-500 entries on the watchlist.
I do not know how many watchlist entries there are (brion?), but I suspect at least an order of magnitude less. Let's say 10000 entries.
What if the entries have:
watchlist db: the user id, the watched article #, when the article last changed (date, submitter, comment)
And every time an article gets updated, it updates all the _watchlist_ entries for itself.
Checking the watchlist from then on is one simple lookup and print. Nothing.
Updating is a _simple_ update (UPDATE watchtable SET changed=NOW() WHERE artnum=666; etc.) on article change.
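Roughly like this, as a sketch only - the table and column names are made up by me, not the real schema:

  CREATE TABLE watchtable (
    userid    INT NOT NULL,       -- the watching user
    artnum    INT NOT NULL,       -- the watched article
    changed   DATETIME,           -- when the article last changed
    submitter VARCHAR(255),       -- who made that change
    comment   VARCHAR(255),       -- their edit summary
    PRIMARY KEY (userid, artnum),
    INDEX (artnum)                -- so the on-save update finds all watchers fast
  );

  -- on article save, fan out to every watcher of that article:
  UPDATE watchtable
     SET changed = NOW(), submitter = 'SomeUser', comment = 'fixed a typo'
   WHERE artnum = 666;

  -- checking the watchlist becomes one indexed read:
  SELECT artnum, changed, submitter, comment
    FROM watchtable
   WHERE userid = 42
   ORDER BY changed DESC;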
Well?
On Jan 15, 2004, at 01:46, Peter Gervai wrote:
I suspect the problem is that when a user checks her watchlist, MediaWiki gets the list and searches through all the articles to see the recent changes. Right?
There are two ways the db may search the watchlist.
It can look through all the items in your watchlist, check their timestamps etc, then sort the result and take only the recent items. This is efficient for small watchlists.
Or, it can look through the most recently changed pages to see which are on your watchlist. If you have a really big watchlist this can be more efficient, but at the rate at which pages are edited on en.wikipedia.org, this only helps if the cutoff is quite short indeed.
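In rough SQL, the two plans would look something like this (a sketch with simplified names, not the literal query the code runs):

  -- Plan 1: start from the watchlist, join each entry to cur, sort.
  -- Efficient when the watchlist is small.
  SELECT cur_namespace, cur_title, cur_timestamp
    FROM watchlist, cur
   WHERE wl_user = 42
     AND cur_namespace = wl_namespace
     AND cur_title = wl_title
   ORDER BY cur_timestamp DESC;

  -- Plan 2: start from recently changed pages, keep those on the watchlist.
  -- Only wins if the timestamp cutoff keeps the candidate set small.
  SELECT cur_namespace, cur_title, cur_timestamp
    FROM cur, watchlist
   WHERE cur_timestamp > '20040114000000'  -- the cutoff
     AND wl_user = 42
     AND wl_namespace = cur_namespace
     AND wl_title = cur_title
   ORDER BY cur_timestamp DESC;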
I suspect there are 150000+ articles to check against 10-500 entries on the watchlist.
The longest watchlists are in the range of a couple thousand, iirc. A few hundred is not atypical for power users. Most people have relatively few.
I do not know how many watchlist entries there are (brion?), but I suspect at least an order of magnitude less. Let's say 10000 entries.
Actually, it's about the same order as the total number of pages. de.wikipedia has 60419 watchlist entries; en.wikipedia has 263770.
However any given page has at most a couple hundred people watching it.
The most-watched page on de.wikipedia is the Hauptseite with 115 watchers; on en.wikipedia it's Wikipedia:Village pump with 288. [Aside: Jesus Christ is less popular than World War II at 59 vs 60 watchers, but He still beats out George W. Bush by two Wikipedians!]
What if the entries have:
watchlist db: the user id, the watched article #, when the article last changed (date, submitter, comment)
And every time an article gets updated, it updates all the _watchlist_ entries for itself.
Updating up to a couple hundred rows on page save may or may not be a worse performance drain than the current system. If it works, it might be worth it for the faster reads.
Tests I did some time ago were inconclusive about read improvements, but they may not have been properly indexed for the join to cur to get the revision data.
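For the record, the sort of indexes that join wants - hypothetical names, and whether my test tables actually had them is exactly the part I'm unsure about:

  -- let the join probe cur by (namespace, title) instead of scanning it:
  CREATE INDEX cur_name_title ON cur (cur_namespace, cur_title);
  -- let one user's watchlist be read as a single range scan:
  CREATE INDEX wl_user ON watchlist (wl_user);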
Also, a slight complication: currently the watchlist uses only a single entry for each page and its talk page. To add timestamps to sort on, we'd have to double the number of rows involved.
-- brion vibber (brion @ pobox.com)
On Thu, 15 Jan 2004 02:35:45 -0800, Brion Vibber brion@pobox.com wrote:
I suspect there are 150000+ articles to check against 10-500 entries on the watchlist.
The longest watchlists are in the range of a couple thousand, iirc. A few hundred is not atypical for power users. Most people have relatively few.
Ok.
I do not know how many watchlist entries there are (brion?), but I suspect at least an order of magnitude less. Let's say 10000 entries.
Actually, it's about the same order as the total number of pages. de.wikipedia has 60419 watchlist entries; en.wikipedia has 263770.
Huh. I thought it was much less.
However any given page has at most a couple hundred people watching it.
What if the entries have:
watchlist db: the user id, the watched article #, when the article last changed (date, submitter, comment)
And every time an article gets updated, it updates all the _watchlist_ entries for itself.
Updating up to a couple hundred rows on page save may or may not be a worse performance drain than the current system. If it works, it might be worth it for the faster reads.
You should check how many updates and how many watchlist checks there are per second or minute.
I think it should be tested (I don't know how big a piece of code that would require, which would be discarded if it turns out not to be better)...
I don't know about locking issues and differences in read/write performance on a loaded MySQL; maybe lots of small writes (an update on each save) slow the db down more than lots of larger reads (checking the watchlist on demand).
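The write side could be measured with something as crude as this (I'm guessing at table and column names; the watchlist-check rate would have to come from the web server logs instead):

  -- edits in the last minute; the literal is one minute ago,
  -- in the wiki's YYYYMMDDHHMMSS timestamp format:
  SELECT COUNT(*)
    FROM recentchanges
   WHERE rc_timestamp > '20040115103000';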
Tests I did some time ago were inconclusive about read improvements, but they may not have been properly indexed for the join to cur to get the revision data.
I can't help in that. :-)
Also, a slight complication: currently the watchlist uses only a single entry for each page and its talk page. To add timestamps to sort on, we'd have to double the number of rows involved.
Well, you can usually only gain speed by using more space for auxiliary tables. I do not believe that a hardware upgrade is the solution for a lack of design. :-]
(...which doesn't mean that I have the holy grail handy, though.)
cya, grin
"Peter Gervai" grin@grin.hu writes:
Well, you can usually only gain speed by using more space for auxiliary tables. I do not believe that a hardware upgrade is the solution for a lack of design. :-]
I've the feeling you are not wrong.
For the moment, my idea is: the user should be allowed to fetch his list once a day. If that's still too expensive, simplify the layout or contents of the watchlist - don't generate diff and version links (I am able to do the diff locally). Or prepare user watchlists once a day and let the user download this daily list. Without watchlists, working on the project isn't that much fun...
And switch to PostgreSQL. PostgreSQL obeys the SQL standard much better and thus it is more likely to attract SQL developers.
On Jan 16, 2004, at 20:40, Karl Eichwalder wrote:
For the moment, my idea is: the user should be allowed to fetch his list once a day.
What's been hacked in for now is that a fetched watchlist will be cached for an hour, so someone clicking the button over and over won't load things down. The limits can be adjusted here and there if necessary.
If that's still too expensive, simplify the layout or contents of the watchlist - don't generate diff and version links (I am able to do the diff locally).
I'm not sure I understand what you mean; generating diff links is a simple string concatenation and doesn't involve the database.
And switch to PostgreSQL. PostgreSQL obeys the SQL standard much better and thus it is more likely to attract SQL developers.
I'd like to assume this statement is backed up with a plan to adapt MediaWiki to work with PostgreSQL and demonstrate superior performance with benchmarks. We're all in favor of choice here and would like MediaWiki to be more generalized, but PostgreSQL advocacy here has a history of being a lot of bluster followed by no action.
Unless someone would like to lend the benefit of their experience in getting the code set up and actually demonstrate something, it's just not going to happen; code doesn't write itself.
-- brion vibber (brion @ pobox.com)
Brion Vibber brion@pobox.com writes:
What's been hacked in for now is that a fetched watchlist will be cached for an hour, so someone clicking the button over and over won't load things down. The limits can be adjusted here and there if necessary.
That's great, thanks a lot!
Unless someone would like to lend the benefit of their experience in getting the code set up and actually demonstrate something, it's just not going to happen; code doesn't write itself.
I just learnt the basics of PostgreSQL, though. Studying the weekly dump was not enough to understand the database structure (of course); all you can see at first glance is that some special MySQL data types are in use.
I guess first I'll have to browse the Meta wiki - I hate web browsing ;)
On Fri, 16 Jan 2004 22:03:32 -0800, Brion Vibber brion@pobox.com wrote:
On Jan 16, 2004, at 20:40, Karl Eichwalder wrote:
For the moment, my idea is: the user should be allowed to fetch his list once a day.
What's been hacked in for now is that a fetched watchlist will be cached for an hour, so someone clicking the button over and over won't load things down. The limits can be adjusted here and there if necessary.
I hope it has a note somewhere saying so. (Not that I would use my watchlist more often than every 2-3 hours.)
And switch to PostgreSQL. PostgreSQL obeys the SQL standard much better and thus it is more likely to attract SQL developers.
I'd like to assume this statement is backed up with a plan to adapt MediaWiki to work with PostgreSQL and demonstrate superior performance with benchmarks. We're all in favor of choice here and would like MediaWiki to be more generalized, but PostgreSQL advocacy here has a history of being a lot of bluster followed by no action.
Unless someone would like to lend the benefit of their experience in getting the code set up and actually demonstrate something, it's just not going to happen; code doesn't write itself.
Yep. That's the reason I kept my big mouth shut. :) I'm using pgsql and have tried MySQL a few times; I was not impressed by its feature set, but MySQL was definitely faster on _read_. The main difference, I believe (and it is backed by tests of more potent beings than me), shows when a database is used in heavily loaded read/write mode, thanks to the better locking features of pgsql. Since I am not familiar with PHP (to put it nicely; the "I hate PHP" phrase applies), I cannot tell how complex it would be to change from MySQL to pgsql, but from what I've heard PHP is pretty general about SQL handling, so it might not be that complex.
The only thing I'm pretty sure of is that it won't be Brion who does it. He uses MySQL, he KNOWS the beast, and I see no reason to force pgsql down his throat. :) Anyone with PHP+pgsql experience could step in and check: maybe it is a trivial change. (Afaik MySQL's SQL syntax is getting close to the standards, so it should be simple.)
If anyone wants to use my (sort of) pgsql experience (without PHP experience), feel free to ask. I do not have much time to code, though. (Another euphemism; I do not have time to code at all.)
Brion, if you can tell me which SQL commands should be timed under pgsql, I can try to convert the en: database to pgsql and do some tests - check the timing of commands, etc. Naturally I do not have such a huge beast of a machine as you do, but still, it may help.
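For example, something like this per query - guessing at the names; EXPLAIN ANALYZE (in PostgreSQL 7.2 and later) runs the query and reports actual times plus the plan chosen:

  EXPLAIN ANALYZE
  SELECT cur_namespace, cur_title, cur_timestamp
    FROM watchlist, cur
   WHERE wl_user = 42
     AND cur_namespace = wl_namespace
     AND cur_title = wl_title
   ORDER BY cur_timestamp DESC;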
Peter
Just my 2c here, which isn't worth much, but as an open source project, MySQL help is easier to find than PostgreSQL help (even though it's open source too). The extra efficiency in running speed would require a sacrifice of development speed. Not a good tradeoff in my mind.
You'll get much faster results from improving algorithms than by switching to a different SQL engine, IMHO.
-Kelly
Karl Eichwalder wrote:
And switch to PostgreSQL. PostgreSQL obeys the SQL standard much better and thus it is more likely to attract SQL developers.
This is a pretty useless discussion. MySQL has the much larger user base, especially when it comes to database-backed websites. Besides this, MySQL is generally faster, except for applications optimized for the richer feature set of pgsql, and pgsql needs this ugly VACUUM command from time to time, if I remember correctly. Usually, pgsql is much slower on INSERT, UPDATE and DELETE commands, but faster on SELECT and JOIN. See http://www.mysql.com/information/benchmarks-old.html (and http://www.ntlug.org/~cbbrowne/rdbmssql.html for some background information).
Don't get me wrong, I like PostgreSQL, and I would like to see some database abstraction layer in MediaWiki and the possibility to plug in other database backends like InterBase or Firebird for more academic reasons, but I don't see that it would currently *help* us.
However, nobody is prevented from contributing some kind of database abstraction to CVS ;)
Regards, -asb
"BV" == Brion Vibber brion@pobox.com writes:
BV> There are two ways the db may search the watchlist.
Three ways. It could do a join to the cur table and old table, or to recent changes.
~ESP
On Jan 15, 2004, at 14:29, Evan Prodromou wrote:
BV> There are two ways the db may search the watchlist.
Three ways. It could do a join to the cur table and old table, or to recent changes.
Special:Watchlist joins only to cur; there are two ways to optimize the join/sort operation (starting from watchlist, or starting from cur).
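For anyone who wants to compare the two orderings by hand: MySQL's STRAIGHT_JOIN joins the tables in the order written, so both plans can be timed explicitly (simplified names again, a sketch only):

  -- start from the watchlist:
  SELECT STRAIGHT_JOIN cur_title, cur_timestamp
    FROM watchlist, cur
   WHERE wl_user = 42
     AND cur_namespace = wl_namespace
     AND cur_title = wl_title;

  -- start from cur:
  SELECT STRAIGHT_JOIN cur_title, cur_timestamp
    FROM cur, watchlist
   WHERE wl_user = 42
     AND cur_namespace = wl_namespace
     AND cur_title = wl_title;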
If the function did something other than what it does, then it might have many different ways of doing it, of course.
-- brion vibber (brion @ pobox.com)