If someone can answer some of the following questions, that would be great. I'm going to paste each SQL table definition, then ask questions about it. I'll number each question so I can keep track of which ones still aren't answered at the end of the week. :-)
I'm sure Ed Poor will have more insightful and relevant questions than the ones I'm posting below; Brion and you other developers, please don't be offended at the tone of the questions. I seek knowledge only.
CREATE TABLE archive ( ar_namespace tinyint(2) unsigned NOT NULL default '0', ar_title varchar(255) binary NOT NULL default '', ar_text mediumtext NOT NULL, ar_comment tinyblob NOT NULL, ar_user int(5) unsigned NOT NULL default '0', ar_user_text varchar(255) binary NOT NULL default '', ar_timestamp varchar(14) binary NOT NULL default '', ar_minor_edit tinyint(1) NOT NULL default '0', ar_flags tinyblob NOT NULL ) TYPE=MyISAM PACK_KEYS=1;
1) What is the "archive" table for? Is that where articles are stored?
2) Why is there a separate ar_user_text property? Is it because MySQL doesn't support subqueries?
3) What is ar_flags used for? And why is it a tinyblob?
4) Could we save a lot of disk space by replacing ar_title with article_id, and having all the article titles and their corresponding ids in a separate table?
CREATE TABLE brokenlinks ( bl_from int(8) unsigned NOT NULL default '0', bl_to varchar(255) binary NOT NULL default '', KEY bl_from (bl_from), KEY bl_to (bl_to(10)) ) TYPE=MyISAM;
5) Why is a broken link table necessary? Couldn't it be sufficient to do a query to find out which links are broken? Or is this another "we have no subqueries" issue?
6) I notice bl_from is an integer, while bl_to is a string; why the difference? I thought articles didn't have identifying id numbers; if so, what is bl_from then?
CREATE TABLE cur ( cur_id int(8) unsigned NOT NULL auto_increment, cur_namespace tinyint(2) unsigned NOT NULL default '0', cur_title varchar(255) binary NOT NULL default '', cur_text mediumtext NOT NULL, cur_comment tinyblob NOT NULL, cur_user int(5) unsigned NOT NULL default '0', cur_user_text varchar(255) binary NOT NULL default '', cur_timestamp varchar(14) binary NOT NULL default '', cur_restrictions tinyblob NOT NULL, cur_counter bigint(20) unsigned NOT NULL default '0', cur_ind_title varchar(255) NOT NULL default '', cur_ind_text mediumtext NOT NULL, cur_is_redirect tinyint(1) unsigned NOT NULL default '0', cur_minor_edit tinyint(1) unsigned NOT NULL default '0', cur_is_new tinyint(1) unsigned NOT NULL default '0', UNIQUE KEY cur_id (cur_id), KEY cur_namespace (cur_namespace), KEY cur_title (cur_title(20)), KEY cur_timestamp (cur_timestamp), FULLTEXT KEY cur_ind_title (cur_ind_title), FULLTEXT KEY cur_ind_text (cur_ind_text) ) TYPE=MyISAM PACK_KEYS=1;
7) What is the table cur for? I find myself understanding nothing about it; I even looked at the 3 default entries that are inserted, and they didn't clue me in.
CREATE TABLE image ( img_name varchar(255) binary NOT NULL default '', img_size int(8) unsigned NOT NULL default '0', img_description tinyblob NOT NULL, img_user int(5) unsigned NOT NULL default '0', img_user_text varchar(255) binary NOT NULL default '', img_timestamp varchar(14) binary NOT NULL default '', KEY img_name (img_name(10)), KEY img_size (img_size), KEY img_timestamp (img_timestamp) ) TYPE=MyISAM PACK_KEYS=1;
8) Is img_user_text present for the "MySQL has no subquery functionality" reason?
9) What format are timestamps in?
10) Is there special PHP code that ensures an image with the same name, but differing descriptions or img_user properties are present in the table? Does MySQL not have the "UNIQUE" keyword which could eliminate all need for that code, or is it wanted to have a record of all changes to the description of an image? If so, I would recommend a slightly different solution, similar to that for the articles table.
CREATE TABLE imagelinks ( il_from varchar(255) binary NOT NULL default '', il_to varchar(255) binary NOT NULL default '', KEY il_from (il_from(10)), KEY il_to (il_to(10)) ) TYPE=MyISAM;
11) Surely we could save space by using integer article ids?
CREATE TABLE ipblocks ( ipb_address varchar(40) binary NOT NULL default '', ipb_user int(8) unsigned NOT NULL default '0', ipb_by int(8) unsigned NOT NULL default '0', ipb_reason tinyblob NOT NULL, ipb_timestamp varchar(14) binary NOT NULL default '', KEY ipb_address (ipb_address), KEY ipb_user (ipb_user) ) TYPE=MyISAM PACK_KEYS=1;
12) No questions, but why is the KEY keyword used? Is this a MySQL trigger for making indexes? Also, theres a bunch of things that could use the UNIQUE keyword to save php code.
CREATE TABLE links ( l_from varchar(255) binary NOT NULL default '', l_to int(8) unsigned NOT NULL default '0', KEY l_from (l_from(10)), KEY l_to (l_to) ) TYPE=MyISAM;
13) Similar to the question about the orphanedlinks table, why is l_from a string, while l_to is an integer?
CREATE TABLE old ( old_id int(8) unsigned NOT NULL auto_increment, old_namespace tinyint(2) unsigned NOT NULL default '0', old_title varchar(255) binary NOT NULL default '', old_text mediumtext NOT NULL, old_comment tinyblob NOT NULL, old_user int(5) unsigned NOT NULL default '0', old_user_text varchar(255) binary NOT NULL default '', old_timestamp varchar(14) binary NOT NULL default '', old_minor_edit tinyint(1) NOT NULL default '0', old_flags tinyblob NOT NULL, UNIQUE KEY old_id (old_id), KEY old_title (old_title(20)), KEY old_timestamp (old_timestamp) ) TYPE=MyISAM PACK_KEYS=1;
14) What is table "old" used for? Deleted articles? Versions of articles that are no longer "the most current"?
CREATE TABLE oldimage ( oi_name varchar(255) binary NOT NULL default '', oi_archive_name varchar(255) binary NOT NULL default '', oi_size int(8) unsigned NOT NULL default '0', oi_description tinyblob NOT NULL, oi_user int(5) unsigned NOT NULL default '0', oi_user_text varchar(255) binary NOT NULL default '', oi_timestamp varchar(14) binary NOT NULL default '', KEY oi_name (oi_name(10)) ) TYPE=MyISAM PACK_KEYS=1;
15) So, history IS stored for each image? Then I recommend the scheme of having a separate table for unique image names, and their corresponding "image id"
CREATE TABLE random ( ra_current tinyint(1) unsigned NOT NULL default '0', ra_title varchar(255) binary NOT NULL default '' ) TYPE=MyISAM PACK_KEYS=1;
16) What is this "random" table for?
CREATE TABLE recentchanges ( rc_timestamp varchar(14) binary NOT NULL default '', rc_cur_time varchar(14) binary NOT NULL default '', rc_user int(10) unsigned NOT NULL default '0', rc_user_text varchar(255) binary NOT NULL default '', rc_namespace tinyint(3) unsigned NOT NULL default '0', rc_title varchar(255) binary NOT NULL default '', rc_comment varchar(255) binary NOT NULL default '', rc_minor tinyint(3) unsigned NOT NULL default '0', rc_new tinyint(3) unsigned NOT NULL default '0', rc_cur_id int(10) unsigned NOT NULL default '0', rc_this_oldid int(10) unsigned NOT NULL default '0', rc_last_oldid int(10) unsigned NOT NULL default '0' ) TYPE=MyISAM PACK_KEYS=1;
17) Could the recent changes table be replaced with an appropriately defined "view"? Or are views something MySQL doesn't do very well?
CREATE TABLE site_stats ( ss_row_id int(8) unsigned NOT NULL default '0', ss_total_views bigint(20) unsigned default '0', ss_total_edits bigint(20) unsigned default '0', ss_good_articles bigint(20) unsigned default '0', UNIQUE KEY ss_row_id (ss_row_id) ) TYPE=MyISAM;
18) What is ss_row_id and ss_good_articles? How does this table work? Does it contain stats for each article? Shouldn't that go in the table that already contains the info for an article? That would lessen the number of SQL updates, collapsing two UPDATE statements into one.
CREATE TABLE user ( user_id int(5) unsigned NOT NULL auto_increment, user_name varchar(255) binary NOT NULL default '', user_rights tinyblob NOT NULL, user_password tinyblob NOT NULL, user_newpassword tinyblob NOT NULL, user_email tinytext NOT NULL, user_options blob NOT NULL, user_watch mediumblob NOT NULL, user_newtalk tinyint(1) unsigned NOT NULL default '1', UNIQUE KEY user_id (user_id), KEY user_name (user_name(10)) ) TYPE=MyISAM PACK_KEYS=1;
19) Does user_rights correspond to ar_flags?
20) What is "newpassword" for?
21) Are all the "options" really necessary?
22) It might be the SQL way to make a "watchlist" table, where each row contains a user id, and an article id. Or maybe this would be useless in a database that doesn't have subqueries.
23) What is the user_newtalk field?
Cheers!
Jonathan
Jonathan Walther wrote:
CREATE TABLE archive ( ar_namespace tinyint(2) unsigned NOT NULL default '0', ar_title varchar(255) binary NOT NULL default '', ar_text mediumtext NOT NULL, ar_comment tinyblob NOT NULL, ar_user int(5) unsigned NOT NULL default '0', ar_user_text varchar(255) binary NOT NULL default '', ar_timestamp varchar(14) binary NOT NULL default '', ar_minor_edit tinyint(1) NOT NULL default '0', ar_flags tinyblob NOT NULL ) TYPE=MyISAM PACK_KEYS=1;
- What is the "archive" table for? Is that where articles are stored?
That's where "deleted" articles are stored, so they can be "undeleted". That table will have to be flushed from time to time in the future. The current articles are in the "cur" table, the old revisions in "old".
- Why is there a separate ar_user_text property? Is it because MySQL
doesn't support subqueries?
- What is ar_flags used for? And why is it a tinyblob?
I'm not sure myself about these, as I didn't really work with that table before. "tinyblob" is the same as "tinytext", but with case-sensitive search (basically).
- Could we save a lot of disk space by replacing ar_title with
article_id, and having all the article titles and their corresponding ids in a separate table?
No.
CREATE TABLE brokenlinks ( bl_from int(8) unsigned NOT NULL default '0', bl_to varchar(255) binary NOT NULL default '', KEY bl_from (bl_from), KEY bl_to (bl_to(10)) ) TYPE=MyISAM;
- Why is a broken link table necessary? Couldn't it be sufficient to
do a query to find out which links are broken? Or is this another "we have no subqueries" issue?
It is very useful for the "most wanted" page, among other things.
- I notice bl_from is an integer, while bl_to is a string; why the
difference? I thought articles didn't have identifying id numbers; if so, what is bl_from then?
Say, article A links to non-existing article B. bl_from is the cur_id of A, and bl_to is the name of the non-existent article B.
CREATE TABLE cur ( cur_id int(8) unsigned NOT NULL auto_increment, cur_namespace tinyint(2) unsigned NOT NULL default '0', cur_title varchar(255) binary NOT NULL default '', cur_text mediumtext NOT NULL, cur_comment tinyblob NOT NULL, cur_user int(5) unsigned NOT NULL default '0', cur_user_text varchar(255) binary NOT NULL default '', cur_timestamp varchar(14) binary NOT NULL default '', cur_restrictions tinyblob NOT NULL, cur_counter bigint(20) unsigned NOT NULL default '0', cur_ind_title varchar(255) NOT NULL default '', cur_ind_text mediumtext NOT NULL, cur_is_redirect tinyint(1) unsigned NOT NULL default '0', cur_minor_edit tinyint(1) unsigned NOT NULL default '0', cur_is_new tinyint(1) unsigned NOT NULL default '0', UNIQUE KEY cur_id (cur_id), KEY cur_namespace (cur_namespace), KEY cur_title (cur_title(20)), KEY cur_timestamp (cur_timestamp), FULLTEXT KEY cur_ind_title (cur_ind_title), FULLTEXT KEY cur_ind_text (cur_ind_text) ) TYPE=MyISAM PACK_KEYS=1;
- What is the table cur for? I find myself understanding nothing about
it; I even looked at the 3 default entries that are inserted, and they didn't clue me in.
Current article versions.
CREATE TABLE image ( img_name varchar(255) binary NOT NULL default '', img_size int(8) unsigned NOT NULL default '0', img_description tinyblob NOT NULL, img_user int(5) unsigned NOT NULL default '0', img_user_text varchar(255) binary NOT NULL default '', img_timestamp varchar(14) binary NOT NULL default '', KEY img_name (img_name(10)), KEY img_size (img_size), KEY img_timestamp (img_timestamp) ) TYPE=MyISAM PACK_KEYS=1;
- Is img_user_text present for the "MySQL has no subquery
functionality" reason?
That table is Lee's invention. img_user_text contains thename of the user who uploaded it.
- What format are timestamps in?
UNIX, I think.
- Is there special PHP code that ensures an image with the same name,
but differing descriptions or img_user properties are present in the table? Does MySQL not have the "UNIQUE" keyword which could eliminate all need for that code, or is it wanted to have a record of all changes to the description of an image? If so, I would recommend a slightly different solution, similar to that for the articles table.
I'm not sure I understand that question. If there's an image "A", and you upload a new image "A", you'll do the same as saving a new version of an article.
CREATE TABLE imagelinks ( il_from varchar(255) binary NOT NULL default '', il_to varchar(255) binary NOT NULL default '', KEY il_from (il_from(10)), KEY il_to (il_to(10)) ) TYPE=MyISAM;
- Surely we could save space by using integer article ids?
I guess so. But, the totally saved space would probably be less than a single "old" revision of a long article. Not worth the bother, I'd say.
CREATE TABLE ipblocks ( ipb_address varchar(40) binary NOT NULL default '', ipb_user int(8) unsigned NOT NULL default '0', ipb_by int(8) unsigned NOT NULL default '0', ipb_reason tinyblob NOT NULL, ipb_timestamp varchar(14) binary NOT NULL default '', KEY ipb_address (ipb_address), KEY ipb_user (ipb_user) ) TYPE=MyISAM PACK_KEYS=1;
- No questions, but why is the KEY keyword used? Is this a MySQL
trigger for making indexes? Also, theres a bunch of things that could use the UNIQUE keyword to save php code.
You're probably right on both.
CREATE TABLE links ( l_from varchar(255) binary NOT NULL default '', l_to int(8) unsigned NOT NULL default '0', KEY l_from (l_from(10)), KEY l_to (l_to) ) TYPE=MyISAM;
- Similar to the question about the orphanedlinks table, why is l_from
a string, while l_to is an integer?
I think l_from could be an integer as well. Would probably increase lookup speed, and save memory. Some things might get more complicated, though, like "what links here".
CREATE TABLE old ( old_id int(8) unsigned NOT NULL auto_increment, old_namespace tinyint(2) unsigned NOT NULL default '0', old_title varchar(255) binary NOT NULL default '', old_text mediumtext NOT NULL, old_comment tinyblob NOT NULL, old_user int(5) unsigned NOT NULL default '0', old_user_text varchar(255) binary NOT NULL default '', old_timestamp varchar(14) binary NOT NULL default '', old_minor_edit tinyint(1) NOT NULL default '0', old_flags tinyblob NOT NULL, UNIQUE KEY old_id (old_id), KEY old_title (old_title(20)), KEY old_timestamp (old_timestamp) ) TYPE=MyISAM PACK_KEYS=1;
- What is table "old" used for? Deleted articles? Versions of
articles that are no longer "the most current"?
The latter. Deleted articles go into "archive".
CREATE TABLE oldimage ( oi_name varchar(255) binary NOT NULL default '', oi_archive_name varchar(255) binary NOT NULL default '', oi_size int(8) unsigned NOT NULL default '0', oi_description tinyblob NOT NULL, oi_user int(5) unsigned NOT NULL default '0', oi_user_text varchar(255) binary NOT NULL default '', oi_timestamp varchar(14) binary NOT NULL default '', KEY oi_name (oi_name(10)) ) TYPE=MyISAM PACK_KEYS=1;
- So, history IS stored for each image? Then I recommend the scheme
of having a separate table for unique image names, and their corresponding "image id"
That's what "Image" does...
CREATE TABLE random ( ra_current tinyint(1) unsigned NOT NULL default '0', ra_title varchar(255) binary NOT NULL default '' ) TYPE=MyISAM PACK_KEYS=1;
- What is this "random" table for?
You got me there. Could be a relict. Lee once tried some stunts to get a faster "random page". If that's where it comes from, it's obsolete now.
CREATE TABLE recentchanges ( rc_timestamp varchar(14) binary NOT NULL default '', rc_cur_time varchar(14) binary NOT NULL default '', rc_user int(10) unsigned NOT NULL default '0', rc_user_text varchar(255) binary NOT NULL default '', rc_namespace tinyint(3) unsigned NOT NULL default '0', rc_title varchar(255) binary NOT NULL default '', rc_comment varchar(255) binary NOT NULL default '', rc_minor tinyint(3) unsigned NOT NULL default '0', rc_new tinyint(3) unsigned NOT NULL default '0', rc_cur_id int(10) unsigned NOT NULL default '0', rc_this_oldid int(10) unsigned NOT NULL default '0', rc_last_oldid int(10) unsigned NOT NULL default '0' ) TYPE=MyISAM PACK_KEYS=1;
- Could the recent changes table be replaced with an appropriately
defined "view"? Or are views something MySQL doesn't do very well?
That table is pretty new. I think Lee's still working on it, probably Brion as well.
CREATE TABLE site_stats ( ss_row_id int(8) unsigned NOT NULL default '0', ss_total_views bigint(20) unsigned default '0', ss_total_edits bigint(20) unsigned default '0', ss_good_articles bigint(20) unsigned default '0', UNIQUE KEY ss_row_id (ss_row_id) ) TYPE=MyISAM;
- What is ss_row_id and ss_good_articles? How does this table work?
Does it contain stats for each article? Shouldn't that go in the table that already contains the info for an article? That would lessen the number of SQL updates, collapsing two UPDATE statements into one.
That's where the "overall statistics" are parked. ss_good_articles is the number of articles you see on the Main Page.
CREATE TABLE user ( user_id int(5) unsigned NOT NULL auto_increment, user_name varchar(255) binary NOT NULL default '', user_rights tinyblob NOT NULL, user_password tinyblob NOT NULL, user_newpassword tinyblob NOT NULL, user_email tinytext NOT NULL, user_options blob NOT NULL, user_watch mediumblob NOT NULL, user_newtalk tinyint(1) unsigned NOT NULL default '1', UNIQUE KEY user_id (user_id), KEY user_name (user_name(10)) ) TYPE=MyISAM PACK_KEYS=1;
- Does user_rights correspond to ar_flags?
May be. I'm not that familiar with the archive table.
- What is "newpassword" for?
When you let the system mail you a new password, you can log in with both the old and the new one, until you use the new one, which then becomes the "standard".
- Are all the "options" really necessary?
Yes!
- It might be the SQL way to make a "watchlist" table, where each row
contains a user id, and an article id. Or maybe this would be useless in a database that doesn't have subqueries.
You're right on this one. Guess noone really paid attention to that before...
- What is the user_newtalk field?
When your talk page was edited, that flag is set, so it displays a "*" next to your talk link when you're logged in.
Hope that helps,
Magnus
On Sat, 2002-11-16 at 01:10, Magnus Manske wrote:
Jonathan Walther wrote:
- Similar to the question about the orphanedlinks table, why is l_from
a string, while l_to is an integer?
I think l_from could be an integer as well. Would probably increase lookup speed, and save memory. Some things might get more complicated, though, like "what links here".
On closer inspection, I think using an integer would make whatlinkshere *simpler*. Currently, every linking page's title has to be parsed into namespace/title and checked for redirect status: SELECT DISTINCT l_from FROM links WHERE l_to={$lid} (now loop over the results with:) SELECT cur_id,cur_is_redirect FROM cur WHERE cur_namespace={$ns} AND cur_title='{$t}'
If we used an integer index, I think we could do this in one step:
SELECT cur_id,cur_namespace,cur_title,cur_is_redirect FROM cur,links WHERE l_from=cur_id AND l_to={$lid}
Of course, I'm the guy flipping through "Learn SQL in a weekend" as I type, so take all my SQL advice with a suitably large grain of salt. ;)
- It might be the SQL way to make a "watchlist" table, where each row
contains a user id, and an article id. Or maybe this would be useless in a database that doesn't have subqueries.
You're right on this one. Guess noone really paid attention to that before...
I think I'm converted to this idea; subqueries might help for listing talk pages of watched pages, but I think we can work around it.
The most common case for the watchlist is checking whether the current article is in the list; this can be done a lot faster by checking two integers than by loading up a huge string and parsing it on every page load.
Additionally, when a page is renamed, it should stay in your watchlist under the new name. (I hate clearing out "X moved to Y" redirects from my watchlist...)
-- brion vibber (brion @ pobox.com)
----- Original Message ----- From: "Brion Vibber" brion@pobox.com To: wikitech-l@wikipedia.org Sent: Sunday, November 17, 2002 11:35 PM Subject: Re: [Wikitech-l] questions about the SQL structure
On Sat, 2002-11-16 at 01:10, Magnus Manske wrote:
Jonathan Walther wrote:
- Similar to the question about the orphanedlinks table, why is
l_from
a string, while l_to is an integer?
I think l_from could be an integer as well. Would probably increase lookup speed, and save memory. Some things might get more complicated, though, like "what links here".
On closer inspection, I think using an integer would make whatlinkshere *simpler*. Currently, every linking page's title has to be parsed into namespace/title and checked for redirect status: SELECT DISTINCT l_from FROM links WHERE l_to={$lid} (now loop over the results with:) SELECT cur_id,cur_is_redirect FROM cur WHERE cur_namespace={$ns} AND cur_title='{$t}'
If we used an integer index, I think we could do this in one step:
SELECT cur_id,cur_namespace,cur_title,cur_is_redirect FROM cur,links WHERE l_from=cur_id AND l_to={$lid}
Of course, I'm the guy flipping through "Learn SQL in a weekend" as I type, so take all my SQL advice with a suitably large grain of salt. ;)
An even better move would be to use a 64-bit number for the index rather than an integer. If this were to be done it would become possible to assign article id numbers even to articles which did not exist by using a cryptographic checksum function to create an article ID directly from the article title text. This then gives the simpler query benefits that Jonathan Walter was talking about earlier without the need to actually create a whole load of empty articles some of which will inevitably be orphaned before they are actually used. It is necessary to use at least a 64-bit number because anything smaller gives too much chance of a hash collision occurring and two articles accidentally ending up with the same id.
By the way an SQL clause like -- SELECT a FROM b WHERE c = NULL -- should always select zero records because comparing anything to NULL always produces FALSE as a result. The correct way to do this in standard SQL s -- SELECT a FROM b WHERE c IS NULL.
Cheers
Derek
On Fri, 2002-11-15 at 23:14, Jonathan Walther wrote:
If someone can answer some of the following questions, that would be great. I'm going to paste each SQL table definition, then ask questions about it. I'll number each question so I can keep track of which ones still aren't answered at the end of the week. :-)
I'm sure Ed Poor will have more insightful and relevant questions than the ones I'm posting below; Brion and you other developers, please don't be offended at the tone of the questions. I seek knowledge only.
Disclaimer: I'm not sure how all of this works myself. I came here to work on localization, all this database stuff is an ugly side trip for me.
See also 'schema.doc' in the 'docs' subdirectory in the source tarball, which lists the usage of some stuff.
CREATE TABLE archive ( ar_namespace tinyint(2) unsigned NOT NULL default '0', ar_title varchar(255) binary NOT NULL default '', ar_text mediumtext NOT NULL, ar_comment tinyblob NOT NULL, ar_user int(5) unsigned NOT NULL default '0', ar_user_text varchar(255) binary NOT NULL default '', ar_timestamp varchar(14) binary NOT NULL default '', ar_minor_edit tinyint(1) NOT NULL default '0', ar_flags tinyblob NOT NULL ) TYPE=MyISAM PACK_KEYS=1;
- What is the "archive" table for? Is that where articles are stored?
'archive' is where deleted articles are stored. They can be viewed or undeleted by a sysop via Special:Undelete or through manual queries.
- Why is there a separate ar_user_text property? Is it because MySQL
doesn't support subqueries?
Probably.
- What is ar_flags used for? And why is it a tinyblob?
I believe the flags field is not used for any purpose.
- Could we save a lot of disk space by replacing ar_title with
article_id, and having all the article titles and their corresponding ids in a separate table?
Not significantly for that table. Might be a useful trick elsewhere, though.
CREATE TABLE brokenlinks ( bl_from int(8) unsigned NOT NULL default '0', bl_to varchar(255) binary NOT NULL default '', KEY bl_from (bl_from), KEY bl_to (bl_to(10)) ) TYPE=MyISAM;
- Why is a broken link table necessary? Couldn't it be sufficient to
do a query to find out which links are broken? Or is this another "we have no subqueries" issue?
I believe the primary use of the brokenlinks table is for generating the 'most wanted' list.
- I notice bl_from is an integer, while bl_to is a string; why the
difference? I thought articles didn't have identifying id numbers; if so, what is bl_from then?
All articles have ID numbers; that's cur_id in the cur table. bl_from and l_to reference cur_id numbers. Articles that don't exist don't have cur_id numbers of course, so cannot be referenced by id, but only by title.
CREATE TABLE cur ( cur_id int(8) unsigned NOT NULL auto_increment, cur_namespace tinyint(2) unsigned NOT NULL default '0', cur_title varchar(255) binary NOT NULL default '', cur_text mediumtext NOT NULL, cur_comment tinyblob NOT NULL, cur_user int(5) unsigned NOT NULL default '0', cur_user_text varchar(255) binary NOT NULL default '', cur_timestamp varchar(14) binary NOT NULL default '', cur_restrictions tinyblob NOT NULL, cur_counter bigint(20) unsigned NOT NULL default '0', cur_ind_title varchar(255) NOT NULL default '', cur_ind_text mediumtext NOT NULL, cur_is_redirect tinyint(1) unsigned NOT NULL default '0', cur_minor_edit tinyint(1) unsigned NOT NULL default '0', cur_is_new tinyint(1) unsigned NOT NULL default '0', UNIQUE KEY cur_id (cur_id), KEY cur_namespace (cur_namespace), KEY cur_title (cur_title(20)), KEY cur_timestamp (cur_timestamp), FULLTEXT KEY cur_ind_title (cur_ind_title), FULLTEXT KEY cur_ind_text (cur_ind_text) ) TYPE=MyISAM PACK_KEYS=1;
- What is the table cur for? I find myself understanding nothing about
it; I even looked at the 3 default entries that are inserted, and they didn't clue me in.
The cur table contains the current revision of every article in the database. It's kind of important. ;)
The default entries aren't very informative because an empty database is, well, pretty empty!
CREATE TABLE image ( img_name varchar(255) binary NOT NULL default '', img_size int(8) unsigned NOT NULL default '0', img_description tinyblob NOT NULL, img_user int(5) unsigned NOT NULL default '0', img_user_text varchar(255) binary NOT NULL default '', img_timestamp varchar(14) binary NOT NULL default '', KEY img_name (img_name(10)), KEY img_size (img_size), KEY img_timestamp (img_timestamp) ) TYPE=MyISAM PACK_KEYS=1;
- Is img_user_text present for the "MySQL has no subquery
functionality" reason?
Prolly.
- What format are timestamps in?
Text strings formatted like so: 'YYYYMMDDHHMMSS'. Note that we will have a year 10000 problem unless we upgrade the software first! ;)
In phase II they were TIMESTAMP fields, which I believe came out formatted that way by default. This was annoying as TIMESTAMPs are updated automatically every time you touch the row unless you explicitly set them to themselves. (That's why you'll occasionally see some legacy cruft in update queries that set cur_timestamp=cur_timestamp or whatnot.)
There's probably a better way to do that.
- Is there special PHP code that ensures an image with the same name,
but differing descriptions or img_user properties are present in the table? Does MySQL not have the "UNIQUE" keyword which could eliminate all need for that code, or is it wanted to have a record of all changes to the description of an image? If so, I would recommend a slightly different solution, similar to that for the articles table.
The current solution is more or less like that for the articles tables. There's 'image' ~= 'cur' and 'oldimage' ~= 'old'. Uploading a file with the same name as an existing one is considered creating a new revision of the same file; the old one is pushed back into the old list, and the new one put in the current place.
CREATE TABLE imagelinks ( il_from varchar(255) binary NOT NULL default '', il_to varchar(255) binary NOT NULL default '', KEY il_from (il_from(10)), KEY il_to (il_to(10)) ) TYPE=MyISAM;
- Surely we could save space by using integer article ids?
The main use of this table is generating lists of pages which link to a given image. No subqueries again...
CREATE TABLE ipblocks ( ipb_address varchar(40) binary NOT NULL default '', ipb_user int(8) unsigned NOT NULL default '0', ipb_by int(8) unsigned NOT NULL default '0', ipb_reason tinyblob NOT NULL, ipb_timestamp varchar(14) binary NOT NULL default '', KEY ipb_address (ipb_address), KEY ipb_user (ipb_user) ) TYPE=MyISAM PACK_KEYS=1;
- No questions, but why is the KEY keyword used? Is this a MySQL
trigger for making indexes? Also, theres a bunch of things that could use the UNIQUE keyword to save php code.
Sounds like.
CREATE TABLE links ( l_from varchar(255) binary NOT NULL default '', l_to int(8) unsigned NOT NULL default '0', KEY l_from (l_from(10)), KEY l_to (l_to) ) TYPE=MyISAM;
- Similar to the question about the orphanedlinks table, why is l_from
a string, while l_to is an integer?
The primary use of the links table is to generate the 'Whatlinkshere' listing. No subqueries, and we need the titles to list the links, not the id numbers.
CREATE TABLE old ( old_id int(8) unsigned NOT NULL auto_increment, old_namespace tinyint(2) unsigned NOT NULL default '0', old_title varchar(255) binary NOT NULL default '', old_text mediumtext NOT NULL, old_comment tinyblob NOT NULL, old_user int(5) unsigned NOT NULL default '0', old_user_text varchar(255) binary NOT NULL default '', old_timestamp varchar(14) binary NOT NULL default '', old_minor_edit tinyint(1) NOT NULL default '0', old_flags tinyblob NOT NULL, UNIQUE KEY old_id (old_id), KEY old_title (old_title(20)), KEY old_timestamp (old_timestamp) ) TYPE=MyISAM PACK_KEYS=1;
- What is table "old" used for? Deleted articles? Versions of
articles that are no longer "the most current"?
Versions of articles that are no longer 'the most current', yes. It would be more efficient probably to dump the title and namespace fields and simply reference the cur_id.
CREATE TABLE oldimage ( oi_name varchar(255) binary NOT NULL default '', oi_archive_name varchar(255) binary NOT NULL default '', oi_size int(8) unsigned NOT NULL default '0', oi_description tinyblob NOT NULL, oi_user int(5) unsigned NOT NULL default '0', oi_user_text varchar(255) binary NOT NULL default '', oi_timestamp varchar(14) binary NOT NULL default '', KEY oi_name (oi_name(10)) ) TYPE=MyISAM PACK_KEYS=1;
- So, history IS stored for each image? Then I recommend the scheme
of having a separate table for unique image names, and their corresponding "image id"
...
CREATE TABLE random ( ra_current tinyint(1) unsigned NOT NULL default '0', ra_title varchar(255) binary NOT NULL default '' ) TYPE=MyISAM PACK_KEYS=1;
- What is this "random" table for?
Stores ~1000 randomly selected page names. When someone clicks 'Random page', one of those is selected rather than one of the whole bunch; every once in a while the list is regenerated. Allegedly this improves performance.
CREATE TABLE recentchanges ( rc_timestamp varchar(14) binary NOT NULL default '', rc_cur_time varchar(14) binary NOT NULL default '', rc_user int(10) unsigned NOT NULL default '0', rc_user_text varchar(255) binary NOT NULL default '', rc_namespace tinyint(3) unsigned NOT NULL default '0', rc_title varchar(255) binary NOT NULL default '', rc_comment varchar(255) binary NOT NULL default '', rc_minor tinyint(3) unsigned NOT NULL default '0', rc_new tinyint(3) unsigned NOT NULL default '0', rc_cur_id int(10) unsigned NOT NULL default '0', rc_this_oldid int(10) unsigned NOT NULL default '0', rc_last_oldid int(10) unsigned NOT NULL default '0' ) TYPE=MyISAM PACK_KEYS=1;
- Could the recent changes table be replaced with an appropriately
defined "view"? Or are views something MySQL doesn't do very well?
Yes, probably. It's all very hackish, and if it's not helping much, a cleaner and well-performaning replacement is welcome.
CREATE TABLE site_stats ( ss_row_id int(8) unsigned NOT NULL default '0', ss_total_views bigint(20) unsigned default '0', ss_total_edits bigint(20) unsigned default '0', ss_good_articles bigint(20) unsigned default '0', UNIQUE KEY ss_row_id (ss_row_id) ) TYPE=MyISAM;
- What is ss_row_id and ss_good_articles? How does this table work?
Does it contain stats for each article? Shouldn't that go in the table that already contains the info for an article? That would lessen the number of SQL updates, collapsing two UPDATE statements into one.
This contains a single row for the whole database. The good articles count is generated by faulty math anyway, though.
CREATE TABLE user ( user_id int(5) unsigned NOT NULL auto_increment, user_name varchar(255) binary NOT NULL default '', user_rights tinyblob NOT NULL, user_password tinyblob NOT NULL, user_newpassword tinyblob NOT NULL, user_email tinytext NOT NULL, user_options blob NOT NULL, user_watch mediumblob NOT NULL, user_newtalk tinyint(1) unsigned NOT NULL default '1', UNIQUE KEY user_id (user_id), KEY user_name (user_name(10)) ) TYPE=MyISAM PACK_KEYS=1;
- Does user_rights correspond to ar_flags?
No. user_rights is a comma-separated list of priveledges. Currently the only ones defined are 'sysop' and 'developer', so possible values are: '' (normal users) 'sysop' ("sysops" or "administrators") 'sysop,developer' ("developers")
- What is "newpassword" for?
If you click 'e-mail me a new password', a new password is generated and stored in this field. Then, you can log in either with this new temporary password (in which case it becomes your regular password until you manually change it) or with your current password. This prevents you from having your account locked out if you accidentally (or someone else maliciously) clicks the button with your user name given, but you don't intend to use (or can't get it) it.
- Are all the "options" really necessary?
Yes?
- It might be the SQL way to make a "watchlist" table, where each row
contains a user id, and an article id. Or maybe this would be useless in a database that doesn't have subqueries.
Could do.
- What is the user_newtalk field?
Set to 1 if your user talk page is edited, sets that annoying asterisk in the corner of your screen that doesn't get cleared till you look at the darn page.
-- brion vibber (brion @ pobox.com)
On Sat, Nov 16, 2002 at 01:28:31AM -0800, Brion Vibber wrote:
- Why is a broken link table necessary? Couldn't it be sufficient to
do a query to find out which links are broken? Or is this another "we have no subqueries" issue?
I believe the primary use of the brokenlinks table is for generating the 'most wanted' list.
All articles have ID numbers; that's cur_id in the cur table. bl_from and l_to reference cur_id numbers. Articles that don't exist don't have cur_id numbers of course, so cannot be referenced by id, but only by title.
What if there were no "broken links", only articles that had no text? ie, the text field would be NULL? And we could do with a single "links" table by collapsing article and image links together. And one could find the "dangling links" simply by searching for those articles with the text property set to NULL. The following query would do the trick:
SELECT * from cur where cur_text = NULL;
The current solution is more or less like that for the articles tables. There's 'image' ~= 'cur' and 'oldimage' ~= 'old'. Uploading a file with the same name as an existing one is considered creating a new revision of the same file; the old one is pushed back into the old list, and the new one put in the current place.
Hm. Maybe images could be collapsed into the "cur" table too.
The primary use of the links table is to generate the 'Whatlinkshere' listing. No subqueries, and we need the titles to list the links, not the id numbers.
I think we could reduce to the following tables:
current_articles, previous_articles, deleted_articles, links
We just add a "type" field to the articles tables to indicate "wiki text" or "image"; this gives us flexibility to support other types of object too.
Stores ~1000 randomly selected page names. When someone clicks 'Random page', one of those is selected rather than one of the whole bunch; every once in a while the list is regenerated. Allegedly this improves performance.
That makes sense, and it very well may improve performance.
- What is ss_row_id and ss_good_articles? How does this table work?
Does it contain stats for each article? Shouldn't that go in the table that already contains the info for an article? That would lessen the number of SQL updates, collapsing two UPDATE statements into one.
This contains a single row for the whole database. The good articles count is generated by faulty math anyway, though.
What is a "good article"? What uses that table? What would happen if it went away?
No. user_rights is a comma-separated list of priveledges. Currently the only ones defined are 'sysop' and 'developer', so possible values are: '' (normal users) 'sysop' ("sysops" or "administrators") 'sysop,developer' ("developers")
Ah. Oddly enough, I would recommend splitting that out into two tables, each containing just user ids; one called "developers", and the other "sysops".
The newpassword feature sounds nifty. I like it.
Jonathan
Jonathan Walther wrote:
What if there were no "broken links", only articles that had no text? ie, the text field would be NULL?
No, "broken links" are links to articles that don't exist at all, not to "blank" articles.
Hm. Maybe images could be collapsed into the "cur" table too.
But they have special properties, which would have to be blank in every "normal" article. Also, searches etc. would be slower.
I think we could reduce to the following tables:
current_articles, previous_articles, deleted_articles, links
Then we could drop deleted_articles as well, just set some "cur_deleted" tag in "current_articles". But, you'll be the one to change every line of source code for that, I guess...
We just add a "type" field to the articles tables to indicate "wiki text" or "image"; this gives us flexibility to support other types of object too.
"image" is already used for other files (e.g., ogg).
What is a "good article"? What uses that table? What would happen if it went away?
A "good article" contains at least a comma...
Ah. Oddly enough, I would recommend splitting that out into two tables, each containing just user ids; one called "developers", and the other "sysops".
OK, but why bother? Seems to work, and saves some queries.
Magnus
On Sat, Nov 16, 2002 at 11:22:47AM +0100, Magnus Manske wrote:
No, "broken links" are links to articles that don't exist at all, not to "blank" articles.
Correct. I am proposing that there should be no broken links because all articles are created, even if only blank articles. Periodically a process could sweep blank articles with nothing linking to them from the database. For instance,
REMOVE FROM current_articles WHERE ...
A single SQL statement could do it; Ed, can you back me up on this? The SQL statement would do a complement of the union of article id's in the link table and the current_articles table to find articles not linked, and then further narrow it down to articles with NULL text properties.
Then we could drop deleted_articles as well, just set some "cur_deleted" tag in "current_articles". But, you'll be the one to change every line of source code for that, I guess...
Tonight, I got a vision dancing through my head of an Apache module, called mod_wiki. I'm investigating it's possibilities.
Jonathan
On Sat, Nov 16, 2002 at 02:35:01AM -0800, Jonathan Walther wrote:
On Sat, Nov 16, 2002 at 11:22:47AM +0100, Magnus Manske wrote:
No, "broken links" are links to articles that don't exist at all, not to "blank" articles.
Correct. I am proposing that there should be no broken links because all articles are created, even if only blank articles. Periodically a process could sweep blank articles with nothing linking to them from the database. For instance,
REMOVE FROM current_articles WHERE ...
Ok, here is what the SQL query would look like:
REMOVE FROM current_articles WHERE text=NULL AND NOT article_id IN ( SELECT destination FROM links DISTINCT );
You see? No need for a separate table of dangling links, if you make sure that every article comes into existance. Subqueries are really nice.
Cheers!
<Location /wiki> SetHandler mod-wikipedia </Location>
Jonathan
Jonathan Walther wrote:
Correct. I am proposing that there should be no broken links because all articles are created, even if only blank articles. Periodically a process could sweep blank articles with nothing linking to them from the database. For instance, REMOVE FROM current_articles WHERE ...
So, you'll replace the broken_links table by more entries in the article table. Granted, we'd have one table less, but more articles instead. And these dummy articles would have to be distinguished from the "real" ones in *every* query regarding existing/broken articles. More storage (for the dummy articles), more code, for one table less.
Sure, it can be done, but I don't see the advantage. Maybe it's just me.
Magnus
You see? No need for a separate table of dangling links, if you make sure that every article comes into existance. Subqueries are really nice.
That's a bad idea, because often you will have articles with many links to non-existent pages, which would all have to be created. This would blow the CUR table out of proportion, with no real advantage.
Note that my impression is that the main reason the site is so slow is simply that MySQL doesn't handle very large tables very well (maybe subqueries would help here, I don't know). That's why we should try to avoid making them bigger while we still use it.
Regards,
Erik
On Sat, Nov 16, 2002 at 06:08:00PM +0100, Erik Moeller wrote:
You see? No need for a separate table of dangling links, if you make sure that every article comes into existance. Subqueries are really nice.
That's a bad idea, because often you will have articles with many links to non-existent pages, which would all have to be created. This would blow the CUR table out of proportion, with no real advantage.
I don't think so. With the right data definitions, the number of entries in the cur table would no more than double, and the amount of space taken up would definately NOT double; I doubt if it would add even 10 megs to the table size.
Note that my impression is that the main reason the site is so slow is simply that MySQL doesn't handle very large tables very well (maybe subqueries would help here, I don't know). That's why we should try to avoid making them bigger while we still use it.
Postgres on the other hand, does have mechanisms for dealing with large tables. First, you create an index. Or several indexes. Then you run the vacuum optimizer program every night. It just gets faster and faster. This is the advantage of using a real database.
If we port this to Postgres, and then expand yet again, we'll be in good shape if some kind soul wants to donate an Oracle license, because we won't have to worry about MySQLisms or PostgreSQLisms in doing the port to Oracle.
Jonathan
I don't think so. With the right data definitions, the number of entries in the cur table would no more than double, and the amount of space taken up would definately NOT double; I doubt if it would add even 10 megs to the table size.
Many Wikipedia users try to insert as many relevant links as possible, making the relationship between "broken" and real links often 10:1. I don't think it would make much sense to create blank rows for each one of them just to have simpler table definitions.
Postgres on the other hand, does have mechanisms for dealing with large tables. First, you create an index. Or several indexes. Then you run the vacuum optimizer program every night. It just gets faster and faster. This is the advantage of using a real database.
First, we're not on Postgres yet. I appreciate that you're helping with the port, but let's not make any changes that would negatively affect the system we're using now. We're already slow enough as it is.
MySQL also has indexes :-). It just doesn't use them in some special cases; it would probably already give us a significant performance boost to figure out what these cases are and to fix the code accordingly.
Note that Yahoo! uses MySQL, so it's quite possible to run very large sites with it.
Regards,
Erik
On Sun, Nov 17, 2002 at 12:06:00AM +0100, Erik Moeller wrote:
Many Wikipedia users try to insert as many relevant links as possible, making the relationship between "broken" and real links often 10:1. I don't think it would make much sense to create blank rows for each one of them just to have simpler table definitions.
That doesn't sound right. There are definately more "good" links in the database than dangling ones. Maybe you could run a query and tell us the results?
First, we're not on Postgres yet. I appreciate that you're helping with the port, but let's not make any changes that would negatively affect the system we're using now. We're already slow enough as it is.
If you can send over some "typical queries" I'm about ready to do testing.
Jonathan
On Sat, 2002-11-16 at 15:09, Jonathan Walther wrote:
On Sun, Nov 17, 2002 at 12:06:00AM +0100, Erik Moeller wrote:
Many Wikipedia users try to insert as many relevant links as possible, making the relationship between "broken" and real links often 10:1. I don't think it would make much sense to create blank rows for each one of them just to have simpler table definitions.
That doesn't sound right. There are definately more "good" links in the database than dangling ones. Maybe you could run a query and tell us the results?
Assuming we're talking about pages that are linked to rather than the raw total of links:
mysql> select count(distinct bl_to) from brokenlinks; +-----------------------+ | count(distinct bl_to) | +-----------------------+ | 188126 | +-----------------------+
mysql> select count(distinct l_to) from links; +----------------------+ | count(distinct l_to) | +----------------------+ | 115708 | +----------------------+
It's certainly not 10:1, but more like 5:3 in favor of not yet existing pages on the English wiki.
-- brion vibber (brion @ pobox.com)
I realize I'm arriving late to the database discussion, but has the use of an abstraction library been considered? There is a PHP library called ADODB (http://php.weblogs.com/ADODB/) that is used by PostNuke and others. I've personally used it on smaller projects very successfully. It's modeled after Microsoft's [[ADO]] library that performs a similar function (as I understand). An abstraction layer like this gives you the flexibility to use whatever database backend you want.
Description from Freshmeat.net:
"ADODB is a set of advanced PHP database abstraction classes. It supports MySQL, PostgreSQL, Interbase/Firebird, Informix, Sybase SQL Anywhere, Oracle, MS SQL 7, Sybase, DB2, FrontBase, Foxpro, Access, ADO, and generic ODBC. A metatype system is built in, making it possible to figure out that types such as CHAR, TEXT, and STRING are equivalent in different databases. It also features an SQL to HTML popup menu and SQL to HTML table support. It has code to support record paging and blob/clob support."
Of course nothing is ever free, so I suspect there could be a slight price to pay in terms of performance. I don't know enough about ADODB to say what the performance costs are, nor could I find any discussion on their site. However, I suspect that the cost of an extra PHP function call or two is small compared to the overall cost of performing the SQL query.
Anyway, it's just a suggestion. Others more knowledgeable about Wikipedia's current implementation and requirements are probably better judges of whether or not this could be useful.
Steve Rawlinson
On Sat, Nov 16, 2002 at 02:15:40PM -0800, Jonathan Walther wrote:
If we port this to Postgres, and then expand yet again, we'll be in good shape if some kind soul wants to donate an Oracle license, because we won't have to worry about MySQLisms or PostgreSQLisms in doing the port to Oracle.
Don't even dare to suggest such thing. Should Wikipedia ever cease to be runable on 100% Open Source system, it will cause immediate fork of the project.
Tomasz Wegrzanowski wrote:
On Sat, Nov 16, 2002 at 02:15:40PM -0800, Jonathan Walther wrote:
If we port this to Postgres, and then expand yet again, we'll be in good shape if some kind soul wants to donate an Oracle license, because we won't have to worry about MySQLisms or PostgreSQLisms in doing the port to Oracle.
Don't even dare to suggest such thing. Should Wikipedia ever cease to be runable on 100% Open Source system, it will cause immediate fork of the project.
Although I haven't followed this discussion closely, I think Jonathan was only saying that we should avoid MySQLisms and PostgreSQLisms so that we *could* run on Oracle. That makes sense to me, even though we never would run on Oracle.
We should use, to whatever extent possible, "vanilla" SQL so that we can easily port to a different DB engine if we want. I don't think we would ever use Oracle, and I am *certain* we would never make changes so that our software would *require* Oracle... for exactly the reason you mention.
--Jimbo
On 11/16/02 4:44 AM, "Jonathan Walther" krooger@debian.org wrote:
On Sat, Nov 16, 2002 at 01:28:31AM -0800, Brion Vibber wrote:
- Why is a broken link table necessary? Couldn't it be sufficient to
do a query to find out which links are broken? Or is this another "we have no subqueries" issue?
I believe the primary use of the brokenlinks table is for generating the 'most wanted' list.
All articles have ID numbers; that's cur_id in the cur table. bl_from and l_to reference cur_id numbers. Articles that don't exist don't have cur_id numbers of course, so cannot be referenced by id, but only by title.
What if there were no "broken links", only articles that had no text? ie, the text field would be NULL? And we could do with a single "links" table by collapsing article and image links together. And one could find the "dangling links" simply by searching for those articles with the text property set to NULL. The following query would do the trick:
We do distinguish between articles that have been never created and articles that are currently blank, and we need to.
Not that this is necessarily a bad idea, but I'd want to be very clear if/how this would alter current functionality.
On Sat, Nov 16, 2002 at 09:26:49AM -0500, The Cunctator wrote:
We do distinguish between articles that have been never created and articles that are currently blank, and we need to.
Quite right; so I can extend the query to see if the article is in the previous_articles table as well.
So, like this:
REMOVE FROM current_articles WHERE text=NULL AND NOT article_id IN ( SELECT destination FROM links DISTINCT ) AND NOT article_id IN ( SELECT article_id FROM previous_articles DISTINCT ) ;
As you can see, we doing this entirely with SQL; no need for extra parsing code on the PHP side at all.
Not that this is necessarily a bad idea, but I'd want to be very clear if/how this would alter current functionality.
That's what I'm hoping someone on this list will answer. I see no reason it should alter current functionality at ALL, although it should reduce the amount of PHP code needed.
Jonathan
As long as we're talking about different open source databases, there's also SAP DB (http://www.sapdb.org/). This is a once-commercial database that is now under open source, similar to Netscape/Mozilla. As this once was commercial, running databases for the largest companies, it seems to outclass both MySQL and PostgreSQL in speed, reliability, and supported SQL language. The strange thing is that it is hardly known in the open source scene. I didn't try it myself yet, either.
Anyone willing to give it a try? (I'm busy with something else right now...)
Magnus
On Sun, Nov 17, 2002 at 11:26:03AM +0100, Magnus Manske wrote:
As long as we're talking about different open source databases, there's also SAP DB (http://www.sapdb.org/). This is a once-commercial database that is now under open source, similar to Netscape/Mozilla. As this once was commercial, running databases for the largest companies, it seems to outclass both MySQL and PostgreSQL in speed, reliability, and supported SQL language. The strange thing is that it is hardly known in the open source scene. I didn't try it myself yet, either.
There's a reason for that. It's a bitch to get up and running, and tuned to run anywhere near at a reasonable performance. People that use the SAP DB invariably end up paying SAP to set it up for them.
Oh, and benchmarks show it's not superior to, only comparable to, PostgreSQL and Oracle.
Jonathan
On Sat, 2002-11-16 at 06:26, The Cunctator wrote:
On 11/16/02 4:44 AM, "Jonathan Walther" krooger@debian.org wrote:
On Sat, Nov 16, 2002 at 01:28:31AM -0800, Brion Vibber wrote:
All articles have ID numbers; that's cur_id in the cur table. bl_from and l_to reference cur_id numbers. Articles that don't exist don't have cur_id numbers of course, so cannot be referenced by id, but only by title.
What if there were no "broken links", only articles that had no text? ie, the text field would be NULL? And we could do with a single "links" table by collapsing article and image links together. And one could find the "dangling links" simply by searching for those articles with the text property set to NULL. The following query would do the trick:
We do distinguish between articles that have been never created and articles that are currently blank, and we need to.
Hmm, if I understand correctly, a blank article will have cur_text='', while a nonexistent article with a db entry would have cur_text=NULL. This is quite distinguishable.
Not that this is necessarily a bad idea, but I'd want to be very clear if/how this would alter current functionality.
In principle, it wouldn't affect functionality. On the backend, it would enable the various X_title and X_namespace string fields in other tables to be replaced with straightforward numeric references to the cur table. Less string parsing should give better performance.
However, I don't think this can be done with MySQL 3.x as subqueries are not available. Ergo, moving to postgresql or waiting for MySQL 4.1 (could be a while!) would be necessary to implement that.
While we're discussing future plans of totally restructuring the database, may I also request comments on: http://meta.wikipedia.org/wiki/Thoughts_on_language_integration
Putting data of all languages (and other subwikis such as meta and sep11) into a single database could simplify inter-language and inter-subwiki linking (including sharing of uploaded files).
Oh, a reminder before I forget -- the *_user_text fields store the IP address for non-logged in users, "conversion script" for usemod->phase II conversion step, or the old username of a usemod-era edit by a logged in user, for all of whom the *_user id field is 0.
-- brion vibebr (brion @ pobox.com)
wikitech-l@lists.wikimedia.org