Jonathan,
I thank you for the flattering remarks about my technical ability.
Bear in mind, though, that my mere 1 week of experience with the
Wikipedia database, combined with my utter lack of knowledge of the code
it supports, make me reluctant to "back up" any particular proposal to
change the database.
So for, the only change I've felt confident about was Jimbo's proposed
transformation of T.M.C.'s username to 'TMC'. The query Brion wrote was
just a data update, not a change in structure.
Some general rules:
1. If it ain't broke, don't fix it.
2. Don't "fix" something you don't understand.
3. Hacking at something usually leaves knife marks.
Ed Poor
On Sat, 2002-11-16 at 07:09, Pierre Abbat wrote:
> LearII uploaded the following, which appears to be an attempt to read from
> one port on one box and write to a port on another box. He also uploaded
> another PHP file, which was an attempt to list the files on the server. PHP
> doesn't run in the upload directory, and he left out the semicolons, but
> would someone please block him?
That reminds me -- although PHP scripts do not run in our upload
directory, HTML files containing JavaScript are possible. If no one
catches them, a wikipedia user could be tricked into viewing the file
and, if they have Javascript enabled, their login cookies could be
stolen, including the password hash if 'remember password across
sessions' is enabled.
Three ways around this come to mind:
1) A tighter cookie path could prevent untrusted files in /upload/ from
reading the cookie. But, we have three paths to worry about which need
to access the cookie:
/ (main page, raw host url)
/wiki/ (normal pages)
/w/ (directly linking the script for various options; edit etc)
Currently we can serve all three only by setting the cookie path to /,
which lets the whole site read them. Either we need to change the paths
(force / to redirect to /wiki/Main_Page and use /wiki/Foo?bar instead of
/w/wiki.phtml?title=Foo&bar) or think of something else.
2) Don't allow HTML files to be uploaded, or serve them with plain text
mimetype. This could be annoying.
3) Allow HTML files to be uploaded, but attempt to strip out javascript.
This could be annoying.
-- brion vibber (brion @ pobox.com)
If someone can answer some of the following questions, that would be
great. I'm going to paste each SQL table definition, then ask questions
about it. I'll number each question so I can keep track of which ones
still aren't answered at the end of the week. :-)
I'm sure Ed Poor will have more insightful and relevant questions than
the ones I'm posting below; Brion and you other developers, please don't
be offended at the tone of the questions. I seek knowledge only.
CREATE TABLE archive (
ar_namespace tinyint(2) unsigned NOT NULL default '0',
ar_title varchar(255) binary NOT NULL default '',
ar_text mediumtext NOT NULL,
ar_comment tinyblob NOT NULL,
ar_user int(5) unsigned NOT NULL default '0',
ar_user_text varchar(255) binary NOT NULL default '',
ar_timestamp varchar(14) binary NOT NULL default '',
ar_minor_edit tinyint(1) NOT NULL default '0',
ar_flags tinyblob NOT NULL
) TYPE=MyISAM PACK_KEYS=1;
1) What is the "archive" table for? Is that where articles are stored?
2) Why is there a separate ar_user_text property? Is it because MySQL
doesn't support subqueries?
3) What is ar_flags used for? And why is it a tinyblob?
4) Could we save a lot of disk space by replacing ar_title with
article_id, and having all the article titles and their corresponding
ids in a separate table?
CREATE TABLE brokenlinks (
bl_from int(8) unsigned NOT NULL default '0',
bl_to varchar(255) binary NOT NULL default '',
KEY bl_from (bl_from),
KEY bl_to (bl_to(10))
) TYPE=MyISAM;
5) Why is a broken link table necessary? Couldn't it be sufficient to
do a query to find out which links are broken? Or is this another "we
have no subqueries" issue?
6) I notice bl_from is an integer, while bl_to is a string; why the
difference? I thought articles didn't have identifying id numbers; if
so, what is bl_from then?
CREATE TABLE cur (
cur_id int(8) unsigned NOT NULL auto_increment,
cur_namespace tinyint(2) unsigned NOT NULL default '0',
cur_title varchar(255) binary NOT NULL default '',
cur_text mediumtext NOT NULL,
cur_comment tinyblob NOT NULL,
cur_user int(5) unsigned NOT NULL default '0',
cur_user_text varchar(255) binary NOT NULL default '',
cur_timestamp varchar(14) binary NOT NULL default '',
cur_restrictions tinyblob NOT NULL,
cur_counter bigint(20) unsigned NOT NULL default '0',
cur_ind_title varchar(255) NOT NULL default '',
cur_ind_text mediumtext NOT NULL,
cur_is_redirect tinyint(1) unsigned NOT NULL default '0',
cur_minor_edit tinyint(1) unsigned NOT NULL default '0',
cur_is_new tinyint(1) unsigned NOT NULL default '0',
UNIQUE KEY cur_id (cur_id),
KEY cur_namespace (cur_namespace),
KEY cur_title (cur_title(20)),
KEY cur_timestamp (cur_timestamp),
FULLTEXT KEY cur_ind_title (cur_ind_title),
FULLTEXT KEY cur_ind_text (cur_ind_text)
) TYPE=MyISAM PACK_KEYS=1;
7) What is the table cur for? I find myself understanding nothing about
it; I even looked at the 3 default entries that are inserted, and they
didn't clue me in.
CREATE TABLE image (
img_name varchar(255) binary NOT NULL default '',
img_size int(8) unsigned NOT NULL default '0',
img_description tinyblob NOT NULL,
img_user int(5) unsigned NOT NULL default '0',
img_user_text varchar(255) binary NOT NULL default '',
img_timestamp varchar(14) binary NOT NULL default '',
KEY img_name (img_name(10)),
KEY img_size (img_size),
KEY img_timestamp (img_timestamp)
) TYPE=MyISAM PACK_KEYS=1;
8) Is img_user_text present for the "MySQL has no subquery
functionality" reason?
9) What format are timestamps in?
10) Is there special PHP code that ensures an image with the same name,
but differing descriptions or img_user properties are present in the
table? Does MySQL not have the "UNIQUE" keyword which could eliminate
all need for that code, or is it wanted to have a record of all changes
to the description of an image? If so, I would recommend a slightly
different solution, similar to that for the articles table.
CREATE TABLE imagelinks (
il_from varchar(255) binary NOT NULL default '',
il_to varchar(255) binary NOT NULL default '',
KEY il_from (il_from(10)),
KEY il_to (il_to(10))
) TYPE=MyISAM;
11) Surely we could save space by using integer article ids?
CREATE TABLE ipblocks (
ipb_address varchar(40) binary NOT NULL default '',
ipb_user int(8) unsigned NOT NULL default '0',
ipb_by int(8) unsigned NOT NULL default '0',
ipb_reason tinyblob NOT NULL,
ipb_timestamp varchar(14) binary NOT NULL default '',
KEY ipb_address (ipb_address),
KEY ipb_user (ipb_user)
) TYPE=MyISAM PACK_KEYS=1;
12) No questions, but why is the KEY keyword used? Is this a MySQL
trigger for making indexes? Also, theres a bunch of things that could
use the UNIQUE keyword to save php code.
CREATE TABLE links (
l_from varchar(255) binary NOT NULL default '',
l_to int(8) unsigned NOT NULL default '0',
KEY l_from (l_from(10)),
KEY l_to (l_to)
) TYPE=MyISAM;
13) Similar to the question about the orphanedlinks table, why is l_from
a string, while l_to is an integer?
CREATE TABLE old (
old_id int(8) unsigned NOT NULL auto_increment,
old_namespace tinyint(2) unsigned NOT NULL default '0',
old_title varchar(255) binary NOT NULL default '',
old_text mediumtext NOT NULL,
old_comment tinyblob NOT NULL,
old_user int(5) unsigned NOT NULL default '0',
old_user_text varchar(255) binary NOT NULL default '',
old_timestamp varchar(14) binary NOT NULL default '',
old_minor_edit tinyint(1) NOT NULL default '0',
old_flags tinyblob NOT NULL,
UNIQUE KEY old_id (old_id),
KEY old_title (old_title(20)),
KEY old_timestamp (old_timestamp)
) TYPE=MyISAM PACK_KEYS=1;
14) What is table "old" used for? Deleted articles? Versions of
articles that are no longer "the most current"?
CREATE TABLE oldimage (
oi_name varchar(255) binary NOT NULL default '',
oi_archive_name varchar(255) binary NOT NULL default '',
oi_size int(8) unsigned NOT NULL default '0',
oi_description tinyblob NOT NULL,
oi_user int(5) unsigned NOT NULL default '0',
oi_user_text varchar(255) binary NOT NULL default '',
oi_timestamp varchar(14) binary NOT NULL default '',
KEY oi_name (oi_name(10))
) TYPE=MyISAM PACK_KEYS=1;
15) So, history IS stored for each image? Then I recommend the scheme
of having a separate table for unique image names, and their
corresponding "image id"
CREATE TABLE random (
ra_current tinyint(1) unsigned NOT NULL default '0',
ra_title varchar(255) binary NOT NULL default ''
) TYPE=MyISAM PACK_KEYS=1;
16) What is this "random" table for?
CREATE TABLE recentchanges (
rc_timestamp varchar(14) binary NOT NULL default '',
rc_cur_time varchar(14) binary NOT NULL default '',
rc_user int(10) unsigned NOT NULL default '0',
rc_user_text varchar(255) binary NOT NULL default '',
rc_namespace tinyint(3) unsigned NOT NULL default '0',
rc_title varchar(255) binary NOT NULL default '',
rc_comment varchar(255) binary NOT NULL default '',
rc_minor tinyint(3) unsigned NOT NULL default '0',
rc_new tinyint(3) unsigned NOT NULL default '0',
rc_cur_id int(10) unsigned NOT NULL default '0',
rc_this_oldid int(10) unsigned NOT NULL default '0',
rc_last_oldid int(10) unsigned NOT NULL default '0'
) TYPE=MyISAM PACK_KEYS=1;
17) Could the recent changes table be replaced with an appropriately
defined "view"? Or are views something MySQL doesn't do very well?
CREATE TABLE site_stats (
ss_row_id int(8) unsigned NOT NULL default '0',
ss_total_views bigint(20) unsigned default '0',
ss_total_edits bigint(20) unsigned default '0',
ss_good_articles bigint(20) unsigned default '0',
UNIQUE KEY ss_row_id (ss_row_id)
) TYPE=MyISAM;
18) What is ss_row_id and ss_good_articles? How does this table work?
Does it contain stats for each article? Shouldn't that go in the table
that already contains the info for an article? That would lessen the
number of SQL updates, collapsing two UPDATE statements into one.
CREATE TABLE user (
user_id int(5) unsigned NOT NULL auto_increment,
user_name varchar(255) binary NOT NULL default '',
user_rights tinyblob NOT NULL,
user_password tinyblob NOT NULL,
user_newpassword tinyblob NOT NULL,
user_email tinytext NOT NULL,
user_options blob NOT NULL,
user_watch mediumblob NOT NULL,
user_newtalk tinyint(1) unsigned NOT NULL default '1',
UNIQUE KEY user_id (user_id),
KEY user_name (user_name(10))
) TYPE=MyISAM PACK_KEYS=1;
19) Does user_rights correspond to ar_flags?
20) What is "newpassword" for?
21) Are all the "options" really necessary?
22) It might be the SQL way to make a "watchlist" table, where each row
contains a user id, and an article id. Or maybe this would be useless
in a database that doesn't have subqueries.
23) What is the user_newtalk field?
Cheers!
Jonathan
--
Geek House Productions, Ltd.
Providing Unix & Internet Contracting and Consulting,
QA Testing, Technical Documentation, Systems Design & Implementation,
General Programming, E-commerce, Web & Mail Services since 1998
Phone: 604-435-1205
Email: djw(a)reactor-core.org
Webpage: http://reactor-core.org
Address: 2459 E 41st Ave, Vancouver, BC V5R2W2
I mean, types of pages? Here are the ones I've noticed.
Regular pages include articles, talk pages, and user pages.
Here are the non-regular pages I've noticed:
Blocked IP's
Recent Changes
Watch List
User Preferences Page
History (now called Older Versions)
Related Changes
What Links Here (and maybe Where This Links To)
Which ones did I miss? I'm doing this enumeration so I can look at the
database schema and see what information is actually needed, and what
isn't, and what would allow for optimal access and use of space.
For instance, right now, the "options" field could be a simple integer
index which refers to a specific CSS file; when a user changes their
options they would get linked to a different CSS file, that is all. I
don't know, maybe that is already what happens.
Jonathan
--
Geek House Productions, Ltd.
Providing Unix & Internet Contracting and Consulting,
QA Testing, Technical Documentation, Systems Design & Implementation,
General Programming, E-commerce, Web & Mail Services since 1998
Phone: 604-435-1205
Email: djw(a)reactor-core.org
Webpage: http://reactor-core.org
Address: 2459 E 41st Ave, Vancouver, BC V5R2W2
Many vandals hide behind shared HTTP proxies, which makes them difficult
to ban.
Here's an idea:
Add a new attribute to user accounts:
* "authenticated" users are users **who have supplied a non-throwaway
E-mail address**: authentication to be done by sending them an E-mail
which they have to reply to, in the same way as mailing list authentication.
We can then "greylist" IP addresses or ranges, so that only
''authenticated'' logged-in users can post from behind these addresses.
We can point out to new users from these ISPs that the reason why they
are being asked to authenticate is that other users from the same ISP
have acted as vandals.
The good bit:
* At the same time, non-greylisted IP addresses can still allow
anonymous or non-authenticated user account edits, so we stay "open" to
>99.99% of all users.
We should greylist just the IP address for a proxy, or the whole /19
range for a user IP address: this is the minimum routable block on the
Internet, and will generally catch all users from a particular region.
This significantly increases the costs to vandals, and provides
traceability back to providers, or even real identities if necessary.
Vandals can go on making new accounts as many times as they like, but
they have to incur the costs of setting up new provider accounts every
time we ban their user account. (I believe that ISPs share phone
numbers and credit card numbers of persistent abusers, so these people
will either end up without access, or using rogue providers, who we can
then blacklist. )
Then, we can reserve "blacklisting" only for IP addresses that are
beyond hope, such as individual users who are non-cooperative, or
providers without a workable anti-abuse policy. "Blacklisting" should
then ban all editing.
We can also refuse to accept authentication E-mails from E-mail
providers who do not have a good abuse policy.
Neil
Simply viewing an article currently involves performing a lot of
queries. Since it's our most common event, we should perhaps be
directing optimizing effort in that direction.
(if logged in) check user info:
User::loadFromDatabase:SQL: SELECT user_name,user_password,
user_newpassword,user_email,user_options,user_rights,user_newtalk
FROM user WHERE user_id=1
Check if page exists:
LinkCache::addLink:SQL: SELECT cur_id FROM cur
WHERE (cur_namespace=0 AND cur_title='Test:stuff')
Load the content:
Article::loadContent:SQL: SELECT cur_text,cur_timestamp,cur_user,
cur_counter FROM cur WHERE cur_id=19
[These last two can be combined into one query.]
Now, separately for each and every linked page:
Check if it exists:
LinkCache::addLink:SQL: SELECT cur_id FROM cur
WHERE (cur_namespace=0 AND cur_title='A_link')
(if stub threshold set) check its size and redirect status:
SQL: SELECT length(cur_text) AS x, cur_namespace, cur_is_redirect
FROM cur WHERE cur_id='96'
[Checking the size can be done with checking existence, and it may be
better to chalk up a list of all linked pages during initial parsing,
then check them all in one big query.]
Check if current page is restricted for editing:
wfGetSQL:SQL: SELECT cur_restrictions FROM cur WHERE (cur_id=19)
[We should get cur_restrictions along with the rest in the initial
query.]
(if logged in) check if on watchlist:
wfGetSQL:SQL: SELECT user_watch FROM user WHERE (user_id=1)
[For users with big watchlists like me, it may be inefficient to load
and parse the entire list on every single page view. It should perhaps
either be grabbed along with the rest of the user info, or stored in a
separate table where the database can parse the gobs of strings for us.]
Check for existence of talk page (twice - sidebar and bottom bar):
LinkCache::addLink:SQL: SELECT cur_id FROM cur
WHERE (cur_namespace=1 AND cur_title='Test:stuff')
LinkCache::addLink:SQL: SELECT cur_id FROM cur
WHERE (cur_namespace=1 AND cur_title='Test:stuff')
[Should only need to do this once and store the result...]
...output the page...
Update counters:
ViewCountUpdate::doUpdate:SQL: UPDATE cur
SET cur_counter=(1+cur_counter),cur_timestamp=cur_timestamp
WHERE cur_id=19
SiteStatsUpdate::doUpdate:SQL: UPDATE site_stats
SET ss_total_views=(ss_total_views+1),
ss_total_edits=(ss_total_edits),
ss_good_articles=(ss_good_articles) WHERE ss_row_id=1
[Note that that single row of site_stats is updated on every single view
of any page in the database.]
-- brion vibber (brion @ pobox.com)
I just came upon an article [[Connected]] containing a definition of a
connected graph. I added a definition of a connected set and got an edit
conflict, containing lots of text about graphs. The web page title was "Edit
conflict: Connected". I added my text to the bottom and hit the save button.
I then found that someone had turned [[Connected]] into a redirect behind my
back and my definition of a connected set is now at the bottom of the [[Graph
theory]] page!
AFAICT this is the same as bug 602823. The bug is 5/2 months old and has not
been fixed.
phma
From the Postgres FAQ:
4.6) What is the maximum size for a row, table, database?
These are the limits:
Maximum size for a database? unlimited (60GB databases
exist)
Maximum size for a table? 16 TB
Maximum size for a row? unlimited in 7.1 and later
Maximum size for a field? 1GB in 7.1 and later
Maximum number of rows in a table? unlimited
Maximum number of columns in a table? 250-1600 depending on column
types
Maximum number of indexes on a table? unlimited
Of course, these are not actually unlimited, but limited to available
disk
space and memory/swap space. Performance may suffer when these values
get
unusually large.
The maximum table size of 16TB does not require large file support from
the operating system. Large tables are stored as multiple 1GB files.
The maximum table size and maximum number of columns can be increased if
the default block size is increased to 32k.
Jonathan
--
Geek House Productions, Ltd.
Providing Unix & Internet Contracting and Consulting,
QA Testing, Technical Documentation, Systems Design & Implementation,
General Programming, E-commerce, Web & Mail Services since 1998
Phone: 604-435-1205
Email: djw(a)reactor-core.org
Webpage: http://reactor-core.org
Address: 2459 E 41st Ave, Vancouver, BC V5R2W2
Could somebody change the displayed name for links to subject pages from "View
Article" to "View Subject Page" (or something similar). Per our own
definition, my user page is not an article and neither are any
Wikipedia:namespaced pages.
Is sourceforge a better place for this? Better question: is anybody reviewing
sourceforge bug reports and feature requests anymore?
-- Daniel Mayer (aka mav)
elian wrote:
>Toby Bartels wrote:
>>Or not have moderation.
>>Other than approving posts by nonmembers, why do we need it?
>>When was this decided?
>Never. The question was about administration, not moderation. Choosing
>taglines, defining welcome greetings and all the technical stuff.
Well, I certainly trust Ed to do *that*.
-- Toby