RE: [Wikitech-l] Re: Faster parsing

List overview All Threads
Download

newer

older

Re: Cool project for perl...

Unbelieveable 1.1.0

Poor, Edmund W

26 Feb 2004 26 Feb '04

5:11 p.m.

Timwi is, of course, completely accurate in all his statements. If I were a college professor, I would award him A+ and try to get him a job as a teaching fellow! :-)

My worst error was describing n log n as "log n" -- back to algebra class for Uncle Ed!

Anyway, since there are lots of sharp minds ready to pounce on any bugs, why don't we start taking an organized look at the database structure? I'm clearly no expert on /devising/ sort algorithms, but I'm fairly good at recognizing whether someone has come up with a good idea.

Who wants to work on database structure with me -- or, at least, is willing to let me sit in and watch?!

Ed Poor, aka Uncle Ed

Show replies by date

Timwi

26 Feb 26 Feb

9:34 p.m.

New subject: Faster parsing

Poor, Edmund W wrote:

...

Anyway, since there are lots of sharp minds ready to pounce on any bugs, why don't we start taking an organized look at the database structure? Who wants to work on database structure with me -- or, at least, is willing to let me sit in and watch?!

Back in August/September last year, I started working on a completely new Wiki engine for Wikipedia. I never got anywhere, but the database schema I devised back then is still available (scroll down a bit):

http://lionking.org/~timwi/t/wikipedia/wikipedia-iv.html

I'm no longer wanting to re-implement the entire software, especially since so many new features have arrived ({{msg}}, TOCs) and are about to arrive (categories), but I believe the database schema should nevertheless be revised.

...

I agree with Brion, and I'm sorry I replied to Timwi before reading Brion's "I forbid" post.

Although I agree with Brion that someone should probably do some profiling, I see no harm in discussing improvements that may not turn out to be *major* improvements. I certainly do NOT agree that anything that is clearly on-topic and constructive, should be "forbidden" by anybody.

Timwi

Jimmy Wales

9:49 p.m.

New subject: Faster parsing

Timwi wrote:

...

Although I agree with Brion that someone should probably do some profiling, I see no harm in discussing improvements that may not turn out to be *major* improvements. I certainly do NOT agree that anything that is clearly on-topic and constructive, should be "forbidden" by anybody.

I agree with you, even though I also agreed with Brion. He was just joking, and so was I.

His point was that some discussions of optimizations really are premature until we spend some time working out just where bottlenecks actually are.

--Jimbo

Brion Vibber

10:10 p.m.

New subject: Faster parsing

On Feb 26, 2004, at 12:49, Jimmy Wales wrote:

...

Timwi wrote:

...
Although I agree with Brion that someone should probably do some profiling, I see no harm in discussing improvements that may not turn out to be *major* improvements. I certainly do NOT agree that anything that is clearly on-topic and constructive, should be "forbidden" by anybody.

I agree with you, even though I also agreed with Brion. He was just joking, and so was I.

His point was that some discussions of optimizations really are premature until we spend some time working out just where bottlenecks actually are.

Working on performance improvements costs developer time, and that is a precious resource. If we spend all our time talking about things that might not help much, we are in fact decreasing the actual improvements we can make. (In economics this lovely concept is called opportunity cost; think of all the fun things you could have done in your life if you hadn't been sitting around in front of the computer! ;)

Sure, it's possible to put a lot of effort into something that yields very little improvement, but, y'know, why?

-- brion vibber (brion @ pobox.com)

Timwi

10:52 p.m.

New subject: Faster parsing

Brion Vibber wrote:

...

Working on performance improvements costs developer time, and that is a precious resource.

This is assuming that the developers whose time you have to your disposition are all equally interested in various things. I, for example, am interested in database optimisation, but not in profiling. This is why I discuss database optimisation, but not profiling.

...

Sure, it's possible to put a lot of effort into something that yields very little improvement, but, y'know, why?

Similarly: Why not? - seeing as I'm not going to help with the profiling either way, I might as well help with the database optimisation.

Timwi

Brion Vibber

11:14 p.m.

New subject: Faster parsing

On Feb 26, 2004, at 13:52, Timwi wrote:

...

...
Sure, it's possible to put a lot of effort into something that yields very little improvement, but, y'know, why?

Similarly: Why not? - seeing as I'm not going to help with the profiling either way, I might as well help with the database optimisation.

If you can add something to http://meta.wikipedia.org/wiki/Proposed_Database_Schema_Changes please do.

-- brion vibber (brion @ pobox.com)

Timwi

27 Feb 27 Feb

12:25 a.m.

New subject: Faster parsing

Brion Vibber wrote:

...

On Feb 26, 2004, at 13:52, Timwi wrote:

...
...
Sure, it's possible to put a lot of effort into something that yields very little improvement, but, y'know, why?

Similarly: Why not? - seeing as I'm not going to help with the profiling either way, I might as well help with the database optimisation.

If you can add something to http://meta.wikipedia.org/wiki/Proposed_Database_Schema_Changes please do.

Hm. Looks okay for short-to-middle term. I suppose this suggestion is based on the thought of having to convert the current database easily. It doesn't account for several things we really really want to do in the middle-to-long term, such as combining all language Wikipedias in a single Wiki, so that people don't have to register hundreds of usernames, or more flexible handling of user properties and article properties.

The database schema I propsed last year: http://meta.wikipedia.org/wiki/Experimental_new_database_schema takes these into account, but is (obviously) a little dated by now. For example, it does not know about the MediaWiki namespace and how it now handles translatable site text.

Timwi

Evan Prodromou

3:45 p.m.

New subject: Development Roadmap (was Re: Re: Faster parsing)

So, the words "1.2" have been batted around, but we've got some important features on the table for "future" consideration (e.g. the schema changes).

I'm wondering if it's time to do a development roadmap and get an idea of what we want from 1.2, 1.3, 2.0, dot dot dot.

~ESP

-- Evan Prodromou evan@wikitravel.org Wikitravel - http://www.wikitravel.org/ The free, complete, up-to-date and reliable world-wide travel guide

Tomasz Wegrzanowski

3:58 p.m.

New subject: Development Roadmap (was Re: Re: Faster parsing)

On Fri, Feb 27, 2004 at 09:45:50AM -0500, Evan Prodromou wrote:

...

So, the words "1.2" have been batted around, but we've got some important features on the table for "future" consideration (e.g. the schema changes).

I'm wondering if it's time to do a development roadmap and get an idea of what we want from 1.2, 1.3, 2.0, dot dot dot.

Have you hired some people to do work on MediaWiki ? If not, such roadmaps aren't very useful - people will work on whatever they want.

Agon S. Buchholz

6:31 p.m.

New subject: Development Roadmap (was Re: Re: Faster parsing)

Tomasz Wegrzanowski wrote:

...

...
I'm wondering if it's time to do a development roadmap and get an idea of what we want from 1.2, 1.3, 2.0, dot dot dot.

...

Have you hired some people to do work on MediaWiki ? If not, such roadmaps aren't very useful - people will work on whatever they want.

Nobody forces anyone to do a fixed amount of work until a deadline ends or whatever.

I think a roadmap would be a great help for all of us to make development and prioritoes more transparent to ourselves and all users. A short notice could point out that given dates of a planned release 2.0 are rough estimates and no "product announcement".

Regards, -asb

Jimmy Wales

29 Feb 29 Feb

4:17 p.m.

New subject: Development Roadmap (was Re: Re: Faster parsing)

Agon S. Buchholz wrote:

...

I think a roadmap would be a great help for all of us to make development and prioritoes more transparent to ourselves and all users.

I think that's right. Having a consensus goal or direction can help us to move coherently, while of course the volunteer nature of the work means that people will do or not do whatever they like. But lots of time, people may be perfectly satisfied and interested in any number of different problems, and if we have a roadmap, they can pick one that's most sensible.

I'm not the right person to set this roadmap, though. I'm not close enough to the software development process.

--Jimbo

Ivo Köthnig

27 Feb 27 Feb

12:33 a.m.

New subject: Faster parsing

Am Donnerstag, 26. Februar 2004 21:49 schrieb Jimmy Wales:

...

Timwi wrote:

...
Although I agree with Brion that someone should probably do some profiling, I see no harm in discussing improvements that may not turn out to be *major* improvements. I certainly do NOT agree that anything that is clearly on-topic and constructive, should be "forbidden" by anybody.

I agree with you, even though I also agreed with Brion. He was just joking, and so was I.

His point was that some discussions of optimizations really are premature until we spend some time working out just where bottlenecks actually are.

I think the bottleneck is using a database mangement system (here mysql). So my question is: What is the advantage of using it? As I understand the toppic dbms is usefull for very structured redundant data.

But we have no stuctured data (just articles which point to some others) and no redundance in that sence.

All the nice tables we use could be implemented dircetly and handled much much faster by a pice of mediawiki-software (if not written in php) by using the right datastructures.

For me it always looks as we use the database just to use the database. So again: where is the advantage?

--Ivo Köthnig

Timwi

10:13 a.m.

New subject: Faster parsing

Ivo Köthnig wrote:

...

I think the bottleneck is using a database mangement system (here mysql). So my question is: What is the advantage of using it? [...] For me it always looks as we use the database just to use the database. So again: where is the advantage?

1) easy backup 2) easy construction of queries (SQL) 3) flexibility with construction of *new* queries that we haven't thought of before 4) don't (well, shouldn't) need to update hundreds of data structures upon every edit 5) databases are designed to do this sort of thing (esp. indexing)

There are probably *loads* more.

Timwi

Tomasz Wegrzanowski

10:39 a.m.

New subject: Faster parsing

On Fri, Feb 27, 2004 at 09:13:12AM +0000, Timwi wrote:

...

Ivo Köthnig wrote:

...
I think the bottleneck is using a database mangement system (here mysql). So my question is: What is the advantage of using it? [...] For me it always looks as we use the database just to use the database. So again: where is the advantage?

easy backup

easy construction of queries (SQL)

flexibility with construction of *new* queries that we haven't thought of before

don't (well, shouldn't) need to update hundreds of data structures upon every edit

databases are designed to do this sort of thing (esp. indexing)

There are probably *loads* more.

Mirroring is the biggest *disadvantage* of MySQL solution. UseMod wikis are extremely easy to efficiently mirror via rsync, and it was done with Phase I too. It was even easy to mirror it in "almost real-time" (like once an hour or so) over very small pipe (I did mirror Polish Wikipedia like that on 115kbit/s, and it took only small part of my bandwidth).

MySQL can only produce full dumps, and they are huge, take lot of time to generate, and lot of time to import at the mirroring site.

As far as queries are concerned, you can see it yourself on the Special Pages - most of the queries are turned off for performance reasons, so that's not really a good argument.

And we need to update the data structures all the time for performance reasons.

Timwi

3:35 p.m.

New subject: Faster parsing

Tomasz Wegrzanowski wrote:

...

Mirroring is the biggest *disadvantage* of MySQL solution. UseMod wikis are extremely easy to efficiently mirror via rsync, and it was done with Phase I too. It was even easy to mirror it in "almost real-time" (like once an hour or so) over very small pipe (I did mirror Polish Wikipedia like that on 115kbit/s, and it took only small part of my bandwidth).

There is MySQL replication. I don't know much about it though (nor do I know much about rsync other than what it's used for), so I don't know.

...

MySQL can only produce full dumps, and they are huge, take lot of time to generate, and lot of time to import at the mirroring site.

I suppose I could argue that you could couple this with CVS, which can produce patches to the dumps, but that's getting hacky :-)

...

As far as queries are concerned, you can see it yourself on the Special Pages - most of the queries are turned off for performance reasons, so that's not really a good argument.

Ah, but that's not an indication that MySQL, or RDBMS in general, is a bad idea. I just shows that the current database schema is ill-designed.

...

And we need to update the data structures all the time for performance reasons.

I'm not sure I fully understand what you mean by this. Apparently we do update something like the "recentchanges" table on every edit, because some people seem to have thought that it would make for better performance, but I'm pretty much convinced that this assumption is fallacious. Updating that table on every edit, I believe, is more costly than the benefit it brings in reading it. I'll gladly let an expert correct me on this, but this is what my past experience told me.

Timwi

Ivan Krstic

3:50 p.m.

New subject: Faster parsing

Timwi wrote:

...

There is MySQL replication. I don't know much about it though (nor do I know much about rsync other than what it's used for), so I don't know.

As far as I know, MySQL replication isn't practical for the one-to-many scenario that a lot of people would like to see it used for (home users keeping syncs of the wikipedia database). I've been writing some basic code that would let us do this type of 1-to-x replication nicely, but we'll see whether anything comes of that.

...

I suppose I could argue that you could couple this with CVS, which can produce patches to the dumps, but that's getting hacky :-)

It would hardly be a valid argument. My Feb 15 dump of 'cur' and Jan 30 dump of 'old' are roughly 15.1Gb total in size, uncompressed. I'm not even sure how diff would behave with files that size; if you compress them, you'd be looking into binary patches that would in all likelihood be useless (i.e. someone correct me if I'm wrong, but depending on your blocksize, a change early in the file would probably propagate downstream and require retransfer of most of the file).

...

...
As far as queries are concerned, you can see it yourself on the Special Pages - most of the queries are turned off for performance reasons, so that's not really a good argument.

Ah, but that's not an indication that MySQL, or RDBMS in general, is a bad idea. I just shows that the current database schema is ill-designed.

Possibly. In my mind, a database is certainly the way to go; the question becomes whether a less general implementation, tailored to our needed subset of SQL92 (as opposed to something implementing the entire thing and more) is needed. I have no data to speak to concretely, but I'd suspect that such a drop in complexity could likely produce strong performance improvements.

...

I'm not sure I fully understand what you mean by this. Apparently we do update something like the "recentchanges" table on every edit, because some people seem to have thought that it would make for better performance, but I'm pretty much convinced that this assumption is fallacious.

I lay no claims to expertise in this area, but I'm (and I'm sure others are) more than willing to hear better proposals. How would you do it?

Cheers, Ivan

Timwi

28 Feb 28 Feb

4 a.m.

New subject: Faster parsing

Ivan Krstic wrote:

...

Timwi wrote:

...
I'm not sure I fully understand what you mean by this. Apparently we do update something like the "recentchanges" table on every edit, because some people seem to have thought that it would make for better performance, but I'm pretty much convinced that this assumption is fallacious.

I lay no claims to expertise in this area, but I'm (and I'm sure others are) more than willing to hear better proposals. How would you do it?

Do what? Recent Changes?

Currently, on every edit, among lots of other things, this happens: - a row is inserted into "old" (involves a BLOB and several varchars) - the row in "cur" is updated (also involves a BLOB & several varchars) - another row is inserted into "recentchanges" (several varchars)

This is three operations for something that can be done in one. (Simply insert a row into some generic table, 'articlerevisions' or somesuch.) There is neither a reason for the cur/old split, nor is there a reason for the 'recentchanges' table.

Now, of course the database schema I proposed would indeed require two inserts (rather than just one) for this operation: - one insert for the metadata (edit summary being the only varchar) - one insert for the actual article text (the BLOB and nothing else)

However, this is not noticeably more costly than adding the same data into one table that contains both (which is bad because then reading the metadata when you don't need the article text is inefficient). Furthermore, it means you don't lock an entire table with central importance for quite as long.

Again, please let me stress that I do not claim to be an expert in this. Although I have created database schemata in the past, I have not had the benefit of testing them under load as heavy as that of Wikipedia. However, a major contribution to my experience is my involvement at LiveJournal, which is also a pretty heavily loaded website.

Timwi

Lars Aronsson

4:30 a.m.

New subject: Faster parsing

Timwi wrote:

...

This is three operations for something that can be done in one. (Simply insert a row into some generic table, 'articlerevisions' or somesuch.) There is neither a reason for the cur/old split, nor is there a reason for the 'recentchanges' table.

Good thinking, could be worth trying. However, you should consider that the number of edits is far smaller than the number of reads. And a read is a simple select from today's cur table.

-- Lars Aronsson (lars@aronsson.se) Aronsson Datateknik - http://aronsson.se/

Timwi

6:01 a.m.

New subject: Faster parsing

Lars Aronsson wrote:

...

Timwi wrote:

...
This is three operations for something that can be done in one. (Simply insert a row into some generic table, 'articlerevisions' or somesuch.) There is neither a reason for the cur/old split, nor is there a reason for the 'recentchanges' table.

Good thinking, could be worth trying. However, you should consider that the number of edits is far smaller than the number of reads. And a read is a simple select from today's cur table.

I know that; but you also need to consider that a write locks a database table, a read does not.

Timwi

Brion Vibber

6:18 a.m.

New subject: Faster parsing

On Feb 27, 2004, at 19:30, Lars Aronsson wrote:

...

Timwi wrote:

...
This is three operations for something that can be done in one. (Simply insert a row into some generic table, 'articlerevisions' or somesuch.) There is neither a reason for the cur/old split, nor is there a reason for the 'recentchanges' table.

Good thinking, could be worth trying. However, you should consider that the number of edits is far smaller than the number of reads. And a read is a simple select from today's cur table.

The primary difficulty is with reads, not writes. We have a fair number of operations that want to draw on both current and old revisions and working with two tables split between current and old is very difficult for that; it leads to uglier, more complicated, harder to maintain, buggier code. (The writes are also buggier with the split. We had to track down and fix a number of race conditions which could and did destroy near-simultaneous edits, and others may remain.)

-- brion vibber (brion @ pobox.com)

erik_moeller＠gmx.de

27 Feb 27 Feb

2:41 p.m.

New subject: Faster parsing

Ivo-

...

I think the bottleneck is using a database mangement system (here mysql). So my question is: What is the advantage of using it? As I understand the toppic dbms is usefull for very structured redundant data.

This is an eternal debate. You may want to look at MoinMoin, which stores all its pages in plaintext and still manages to be very powerful:

http://moin.sourceforge.net/

Also see my review of several DB and non-DB based wiki engines in c't 25/03.

Using a database allows us to run a set of highly complex queries to examine the relationships between articles and the metadata about articles. In fact we can even examine the relationship between articles and users (watchlist), or between the metadata of different articles (redirect status in "What links here").

All that would only be possible using ugly hacks without a database -- you would effectively have to build your own mini-DB to store this metadata, and most text-based wikis do this. They get extremely slow when it comes to very basic stuff such as listing backlinks or doing a fulltext search. Not on a small wiki, but on one of the size of Wikipedia. I have imported large datasets into TWiki etc., and many operations became practically unusable.

The MediaWiki FT search has locking issues. It's very fast for a normal wiki, but for a huge one like Wikipedia it tends to get in a deadlock state. That's why we have to disable it on Wikipedia. Some other queries could also be made faster. Many of our queries have been optimized now and every single one should eventually return results with a response time of milliseconds.

It is a fallacy that a database makes things more complicated. It makes them much easier because you don't have to worry where to store something before implementing it. How do you want to store your user prefs, the redirect info, the recent changes table, the page history, always keeping quick access in mind? Usemod, Twiki etc. all use ugly hacks for that kind of stuff.

In short, a DB is faster, cleaner, and more convenient. We benefit from the thousands of man hours that the MySQL developers have invested in building a scalable, fast, powerful multi-user database instead of having to reinvent the wheel whenever we want to do something that goes beyond "show me that file".

Regards,

Erik

Ivo Köthnig

4:19 p.m.

New subject: Faster parsing

Am Freitag, 27. Februar 2004 14:42 schrieb Erik Moeller:

...

Ivo-

...
I think the bottleneck is using a database mangement system (here mysql). So my question is: What is the advantage of using it? As I understand the toppic dbms is usefull for very structured redundant data.

This is an eternal debate. You may want to look at MoinMoin, which stores all its pages in plaintext and still manages to be very powerful:

Woudnt it make more sence to store the pages in plaintext and just indexing them in the db? What is the advantage of storing it in the db?

...

Using a database allows us to run a set of highly complex queries to examine the relationships between articles and the metadata about articles. In fact we can even examine the relationship between articles and users (watchlist), or between the metadata of different articles (redirect status in "What links here").

This also can be done without a database. Of course in that case you have to implement the needet datastructure and can not use sql-queries. This takes more time to develop, but you are free to code whatever datastructure you might think is the best. The try to emulate it by a db-structure can fail and even if you it does not fail you are slower since you have all the overhead-calculations the database produce since it wants to be as generell as possible.

...

All that would only be possible using ugly hacks without a database -- you

You only get ugly hacks if you don't think about your code before implementing something.

...

would effectively have to build your own mini-DB to store this metadata, and most text-based wikis do this. They get extremely slow when it comes to very basic stuff such as listing backlinks or doing a fulltext search.

Thats just a matter of using the right datastructure, independently from using a db or not.

...

Not on a small wiki, but on one of the size of Wikipedia. I have imported large datasets into TWiki etc., and many operations became practically unusable.

But the reason ist not that the use no database they just don't have the right datastructure for the operations.

...

The MediaWiki FT search has locking issues. It's very fast for a normal wiki, but for a huge one like Wikipedia it tends to get in a deadlock state. That's why we have to disable it on Wikipedia. Some other queries could also be made faster. Many of our queries have been optimized now and every single one should eventually return results with a response time of milliseconds.

There I have a question. Do we use SQL for queries or do we use directly the relational operations (join, semijoin, projection etc [relations algebra])? Is it possible to use that?

...

It is a fallacy that a database makes things more complicated. It makes

I dont say this. It makes coding easier. But easy code is always slow...

...

them much easier because you don't have to worry where to store something before implementing it. How do you want to store your user prefs, the

That is the only real argument for using a dbms. But than we have to ask ourself: Do we want something slow and easy or something fast but more complicated.

...

redirect info, the recent changes table, the page history, always keeping quick access in mind? Usemod, Twiki etc. all use ugly hacks for that kind of stuff.

Jeah, the features are implemented fast and in consequence running slow.

...

In short, a DB is faster, cleaner, and more convenient. We benefit from the thousands of man hours that the MySQL developers have invested in building a scalable, fast, powerful multi-user database instead of having to reinvent the wheel whenever we want to do something that goes beyond "show me that file".

Reinventing the wheel is a danger that always exists. But blowing up by trying to be too general is anotherone. I still think we do not really need all the features of the db. I think we need very few and this could be implemented directly and thus be much faster... But on the other hand it may is to late to change that again...

--Ivo Köthnig

erik_moeller＠gmx.de

5:59 p.m.

New subject: Faster parsing

Ivo-

...

You only get ugly hacks if you don't think about your code before implementing something.

True. Go ahead, do a rewrite and show us how much better you can do with text files. Or if you don't want to do that yet, try loading all of the English Wikipedia (CUR and OLD) into MoinMoin and then try to mirror MediaWiki's functionality in MoinMoin.

Really, this is not the place for philosophical discussions. Show us the code, please.

Regards,

Erik

Ivo Köthnig

9:22 p.m.

New subject: Faster parsing

Am Freitag, 27. Februar 2004 18:00 schrieb Erik Moeller:

...

Ivo-

...
You only get ugly hacks if you don't think about your code before implementing something.

True. Go ahead, do a rewrite and show us how much better you can do with text files. Or if you don't want to do that yet, try loading all of the English Wikipedia (CUR and OLD) into MoinMoin and then try to mirror MediaWiki's functionality in MoinMoin.

My intention was not to start a

...

Really, this is not the place for philosophical discussions. Show us the code, please.

My intention was not to start a philosphical discussion. I just want to know a good reason to do things that way (using a dbms). Implementing the features in MoinMoin would not be a good idea. There are just 2 options. Restart from scratch or slowly transform MediaWiki. I would prefer the last option. Since I dont have the time for coding I just can give you some thougts how to handle or improve things. Get rid of the dbms is just one way. Using relational algebra instead of sql is another (but since I dont know mysql that good I even dont know if this option really exists).

What is about indexing all articles in the database but storing it in files? that could help to hold the hole database in memory and thus also speed up the software.

--Ivo Köthnig

Timwi

28 Feb 28 Feb

4:22 a.m.

New subject: Faster parsing

Ivo Köthnig wrote:

...

...
Using a database allows us to run a set of highly complex queries to examine the relationships between articles and the metadata about articles. In fact we can even examine the relationship between articles and users (watchlist), or between the metadata of different articles (redirect status in "What links here").

This also can be done without a database. Of course in that case you have to implement the needet datastructure and can not use sql-queries. This takes more time to develop, but you are free to code whatever datastructure you might think is the best. The try to emulate it by a db-structure can fail and even if you it does not fail you are slower since you have all the overhead-calculations the database produce since it wants to be as generell as possible.

You have not provided examples of "overhead calculations" that a DB would have to perform that a file system wouldn't. You also haven't provided evidence of your claim that DBMSs cannot represent the required range of data structures.

There is a huge number of organisatorial tasks that you would need to do with your files (when reading, that would be parsing the file, locating data, possibly caching and perhaps sorting; when writing, that would be updating indexes, managing "holes" in the file, and a huge host of other horrendous things). Next, if your data structures are spread over several files, you will have huge problems implementing atomicity, and you will run into a number of race conditions.

DBMSs are specifically designed to do exactly these things. People often tend to forget that DBMSs aren't really much more than systems that manage files! But they have all these incredibly complicated and advanced things in it that you need and you really really don't want to have to re-implement from scratch.

For 99.9% of data you want to store, you *can* create *exactly* the data structure you need using an RDBMS.

...

...
It is a fallacy that a database makes things more complicated. It makes

I dont say this. It makes coding easier. But easy code is always slow...

It is pretty bold to think that you can do so much better (i.e. faster) than the hundreds of thousands of experienced developers who have put their effort into MySQL.

Timwi

Ivo Köthnig

10:34 a.m.

New subject: Faster parsing

Am Samstag, 28. Februar 2004 04:22 schrieb Timwi:

...

Ivo Köthnig wrote:

...
...
Using a database allows us to run a set of highly complex queries to examine the relationships between articles and the metadata about articles. In fact we can even examine the relationship between articles and users (watchlist), or between the metadata of different articles (redirect status in "What links here").

This also can be done without a database. Of course in that case you have to implement the needet datastructure and can not use sql-queries. This takes more time to develop, but you are free to code whatever datastructure you might think is the best. The try to emulate it by a db-structure can fail and even if you it does not fail you are slower since you have all the overhead-calculations the database produce since it wants to be as generell as possible.

You have not provided examples of "overhead calculations" that a DB would have to perform that a file system wouldn't. You also haven't provided evidence of your claim that DBMSs cannot represent the required range of data structures.

That starts with parsing and transforming a sql-statement into an expression of relational algebra (which can not always produce the optimal (fastets) expression) and ends by the point that you are restricted to these expressions which can be handeled only step by step (operation always on the huge table).

...

There is a huge number of organisatorial tasks that you would need to do with your files (when reading, that would be parsing the file, locating data, possibly caching and perhaps sorting; when writing, that would be updating indexes, managing "holes" in the file, and a huge host of other

that is no problem if you use good datastructures...

...

horrendous things). Next, if your data structures are spread over several files, you will have huge problems implementing atomicity, and you will run into a number of race conditions.

That is a good point and one of the points I wanted to hear from you to convince me.

...

DBMSs are specifically designed to do exactly these things. People often tend to forget that DBMSs aren't really much more than systems that manage files! But they have all these incredibly complicated and advanced things in it that you need and you really really don't want to have to re-implement from scratch.

Yes, but you don't want to implement each of that features.

...

For 99.9% of data you want to store, you *can* create *exactly* the data structure you need using an RDBMS.

By data structure I am not only mean the structure of data. I also mean the code which belongs to it to handle it (For example heaps, sorted lists and so on). The structure for the data surly can be created but not the code...

But yes: Indexing and sorting is something the dbms (hopefully) can handle perfectly... (But on the other hand sorting and indexing can be implemented very easy)

...

...
...
It is a fallacy that a database makes things more complicated. It makes

I dont say this. It makes coding easier. But easy code is always slow...

It is pretty bold to think that you can do so much better (i.e. faster) than the hundreds of thousands of experienced developers who have put their effort into MySQL.

The reason ist not that I can code much better (and I never told this). The argument was that we (may) don't need this general piece of software. Of course hundreds of thousands of experienced developers who have put their effort into MySQL can implement a better MySQL. But we also don't use M$ Windows just because hundreds of thousands of (may or may not) experienced developers coded there some functions we may could use.

I really just asked why we use it and what the advantages are. You pointed out a good point (atomicity). That is something I would accept. Thats a point that makes sence for me.

--Ivo Köthnig

Timwi

4:41 p.m.

New subject: Faster parsing

Ivo Köthnig wrote/schrieb/a écrit/skribis/etc.:

...

Am Samstag, 28. Februar 2004 04:22 schrieb Timwi:

...
You have not provided examples of "overhead calculations" that a DB would have to perform that a file system wouldn't. You also haven't provided evidence of your claim that DBMSs cannot represent the required range of data structures.

That starts with parsing and transforming a sql-statement into an expression of relational algebra (which can not always produce the optimal (fastets) expression) and ends by the point that you are restricted to these expressions which can be handeled only step by step (operation always on the huge table).

I think you are greatly underestimating the powerful optimisations that can be done (and are done) in MySQL. Parsing an SQL statement is trivial (in terms of required computational power); I don't know much about relational algebra, but it seems that you would somehow have to express and then parse statements in that algebra too. If you don't have an index, then operations on huge tables are slow no matter what data structure you use. Any kind of indexing alleviates this problem, and RDMBSs do indexing really well.

...

...
There is a huge number of organisatorial tasks that you would need to do with your files (when reading, that would be parsing the file, locating data, possibly caching and perhaps sorting; when writing, that would be updating indexes, managing "holes" in the file, and a huge host of other

that is no problem if you use good datastructures...

I think you haven't quite understood what I was getting at here. This is independent of what data structure you use.

...

...
DBMSs are specifically designed to do exactly these things. People often tend to forget that DBMSs aren't really much more than systems that manage files! But they have all these incredibly complicated and advanced things in it that you need and you really really don't want to have to re-implement from scratch.

Yes, but you don't want to implement each of that features.

Yes, you do. You do need to manage your files. You do need to store your data structures somehow, and parse them, and update them, etc.etc.

...

...
For 99.9% of data you want to store, you *can* create *exactly* the data structure you need using an RDBMS.

By data structure I am not only mean the structure of data. I also mean the code which belongs to it to handle it (For example heaps, sorted lists and so on). The structure for the data surly can be created but not the code...

You never specifically need [[heap]]s. A heap is a special way of implementing a [[priority queue]] - if you need a priority queue, then you can do that by creating an indexed table. Whether the index is implemented as a heap should not be your concern (it should be the DMBS's). It is well possible that MySQL uses heaps, I don't know.

Timwi

Ivo Köthnig

8:37 p.m.

New subject: Faster parsing

...

You never specifically need [[heap]]s. A heap is a special way of implementing a [[priority queue]] - if you need a priority queue, then you can do that by creating an indexed table. Whether the index is implemented as a heap should not be your concern (it should be the DMBS's). It is well possible that MySQL uses heaps, I don't know.

For sorted lists (index) you will need heaps. Sorting for the largest articles is one example. I think MySQL would use heaps (or hash-tables) for such thinks...

--Ivo Köthnig

Tomasz Wegrzanowski

8:47 p.m.

New subject: Faster parsing

On Sat, Feb 28, 2004 at 08:37:44PM +0100, Ivo Köthnig wrote:

...

...
You never specifically need [[heap]]s. A heap is a special way of implementing a [[priority queue]] - if you need a priority queue, then you can do that by creating an indexed table. Whether the index is implemented as a heap should not be your concern (it should be the DMBS's). It is well possible that MySQL uses heaps, I don't know.

For sorted lists (index) you will need heaps. Sorting for the largest articles is one example. I think MySQL would use heaps (or hash-tables) for such thinks...

No, it wouldn't use heaps. MySQL would use B+tree index. It would, if the indexes were set up right. As they aren't, it has to go through entire database (including cur_text) to find largest articles, project it so it contains only sizes, then use generalized merge sort to find the largest articles.

Ivo Köthnig

29 Feb 29 Feb

12:53 p.m.

New subject: Faster parsing

Am Samstag, 28. Februar 2004 20:47 schrieb Tomasz Wegrzanowski:

...

On Sat, Feb 28, 2004 at 08:37:44PM +0100, Ivo Köthnig wrote:

...
...
You never specifically need [[heap]]s. A heap is a special way of implementing a [[priority queue]] - if you need a priority queue, then you can do that by creating an indexed table. Whether the index is implemented as a heap should not be your concern (it should be the DMBS's). It is well possible that MySQL uses heaps, I don't know.

For sorted lists (index) you will need heaps. Sorting for the largest articles is one example. I think MySQL would use heaps (or hash-tables) for such thinks...

No, it wouldn't use heaps. MySQL would use B+tree index.

I think a B+tree is nothing else than a special heap...

...

indexes were set up right. As they aren't, it has to go through entire database (including cur_text) to find largest articles, project it so it contains only sizes, then use generalized merge sort to find the largest articles.

Yes, so we should have an index for that. It should be fastet to have an O(logn)-Operation more for each edit instead of a O(n*logn)-Search each time a user want the largest articles... (n = Number of articles)

--Ivo Köthnig

Tomasz Wegrzanowski

1:31 p.m.

New subject: Faster parsing

On Sun, Feb 29, 2004 at 12:53:46PM +0100, Ivo Köthnig wrote:

...

Am Samstag, 28. Februar 2004 20:47 schrieb Tomasz Wegrzanowski:

...
...
For sorted lists (index) you will need heaps. Sorting for the largest articles is one example. I think MySQL would use heaps (or hash-tables) for such thinks...

No, it wouldn't use heaps. MySQL would use B+tree index.

I think a B+tree is nothing else than a special heap...

If you think "heap" as in "priority queue", then no.

...

...
indexes were set up right. As they aren't, it has to go through entire database (including cur_text) to find largest articles, project it so it contains only sizes, then use generalized merge sort to find the largest articles.

Yes, so we should have an index for that. It should be fastet to have an O(logn)-Operation more for each edit instead of a O(n*logn)-Search each time a user want the largest articles... (n = Number of articles)

Don't think O(). n on Wikipedia is known. By separating cur_text from the rest of cur we can make all searches order of magnitude faster - without changing O().

Ivo Köthnig

28 Feb 28 Feb

8:43 p.m.

New subject: Faster parsing

Am Samstag, 28. Februar 2004 16:41 schrieb Timwi:

...

I think you are greatly underestimating the powerful optimisations that can be done (and are done) in MySQL. Parsing an SQL statement is trivial

Yes parsing ist easy... but transform it to an optimal expression of relational algebra is not easy... Thus it would be better to use it directly...

...

(in terms of required computational power); I don't know much about relational algebra, but it seems that you would somehow have to express and then parse statements in that algebra too. If you don't have an

That depends on how you have to write down the expression... But we both stated parsing is not the problem.

...

index, then operations on huge tables are slow no matter what data structure you use.

You have to know for what reason you want the index... Most of the special functions can be switched on, if we just use an index for each of them...

7609

Age (days ago)

7612

Last active (days ago)

wikitech-l@lists.wikimedia.org

31 comments

11 participants

tags (0)

participants (11)

Agon S. Buchholz
Brion Vibber
erik_moeller＠gmx.de
Evan Prodromou
Ivan Krstic
Ivo Köthnig
Jimmy Wales
Lars Aronsson
Poor, Edmund W
Timwi
Tomasz Wegrzanowski