On 2013-10-24 10:01 AM, Élie Roux wrote:
1. If I change this limit to 1023 in the structure of
the database
('page_title' field of the 'page' base), will other things (such as
search engine) break? Is there a way to change it more cleanly?
Other things
won't break. But the limit is hardcoded into the software
so it won't actually change the limit. However if you change both places
you will encounter bugs unless you make sure to make a huge number of
other changes to the database and also handle user names by either
changing every column that accepts a user name or editing User.php to
use a limit other than what Title uses.
2. Could I propose you to make this limit 1023 instead
of 255 (or to
make it configurable easily)? This would allow at least 256 characters
(even for asian languages) instead of 256 bytes, which seems more
consistant with the fact that MediaWiki is well internationalized.
Just to clarify
(You're switching between 255 and 256 through your
message) it's 255 bytes. The choice of using 255 bytes for title length
wasn't an arbitrary choice. 255 bytes – not characters – was picked
because 255 bytes is the maximum string length that can be represented
using 1 byte. ie: 1 byte (8 bits -> 2^8=265 values not including 0) can
represent the range 0-255 so a varchar uses 1 byte to declare the length
of the varchar's contents. If you use ANY length larger than 255 the
varchar will always require 2 bytes to store the length of the column.
Changing the maximum length of a title will require a huge number of
changes and is not easy to make configurable so we should do this only
one time changing the maximum length of title to the maximum we can
feasibly make it.
* It's hardcoded in Title::secureAndSplit.
* User::getCanonicalName will need some extra code to restrict
usernames to 255 bytes as it currently depends on Title to do that.
(And increasing the user name length is far more problematic than
the title length)
* docs/title.txt and TitleTest.php will need an update.
* The canonical page.page_title needs an ALTER COLUMN.
* A number of other core tables will need column alters; (archive,
page, template, image, category)links.(a, p, t, i, c)l_title,
category.cat_title, (lang, iw)links.(l,iw)l_title,
recentchanges.rc_title, and so on.
* All extensions that store titles in the database will need to do
similar ALTER COLUMN updates or else they will have unexpected errors.
o It might be worth doing this first. Since a larger varchar
length in the database won't cause any bugs while the software
still uses 255 bytes.
The next (and tbh final) title max length shouldn't be something
arbitrary, we should pick the maximum we can theoretically store (max we
can store with 2 bytes/a 16bit uint and the max we can possibly store
are actually the same thing, at least on MySQL in regards to VARCHAR).
The next step up from 1 byte representing 0-255 is not 0-1023 as you're
thinking. 2 bytes can represent a length of 0-65535 bytes.
However we actually cannot use VARCHAR(65535) for the title. There are a
number of other limits we hit which cap varchar column lengths below
what can be represented with a 16bit uint length.
* MySQL Has a maximum row length of 65,535 bytes[1]; This includes the
storage of whatever all the columns on the table require in the row
storage.
* PostgreSQL doesn't seem to have the same max row size issues as
MySQL because it's row max – depending on whether you ask the about
page[2] or wiki FAQ[3] – is either 1.6TB or 400GB. And the column
max size is 1GB.
* However PostgreSQL seems to say "indexes can not be created on
columns longer than about 2,000 characters"[1]. I don't know the
precise details but it might make our limit 2000 bytes. ((We'll need
some more input on someone who knows PostgreSQL))
* The varchar WP page[4] says Oracle's limit is 4000 bytes.
* Before MySQL 5.0.3 VARCHAR colums could only be declared a maximum
of 255. ((This means we'll have to drop support for 5.0.2 and change
our "5.0.2 or later" MySQL requirements to "5.0.3 or later"))
* MyISAM's index prefix maximum is 1000 bytes and InnoDB is 767
(unless you use a dynamic/compressed field and innodb_large_prefix).
This means that [[A{1000 bytes}A]] and [[A{1000 bytes}B]] cannot
both exist as the index prefix is used to ensure uniqueness. This
limit will be database dependent. And the only fix would be to add a
new column containing a hash of the title text and drop the
uniqueness constraint on page_title.
Deciding the title max we should use will probably need some more
information than what I've gathered so far.
((Side topic: We use `varchar(n) binary` for the title now. However
anyone that feels like changing this to `varchar(n) CHARACTER SET utf8`
needs to be wary that MySQL triples the (n) so it can store "<n> utf8
chars" instead of "<n> bytes that happen to be utf8" so they may need
to
divide whatever we use as the new max by 3 (unless MySQL does additional
enforcing of char lengths in which case anyone doing that is stuck doing
significant changes to how MW determines the maxlength of a title).[1]))
[1]
https://dev.mysql.com/doc/refman/5.6/en/column-count-limit.html
[2]
http://www.postgresql.org/about/
[3]
https://wiki.postgresql.org/wiki/FAQ#What_is_the_maximum_size_for_a_row.2C_…
[4]
https://en.wikipedia.org/wiki/Varchar
[5]
https://dev.mysql.com/doc/refman/5.6/en/innodb-restrictions.html
~Daniel Friesen (Dantman, Nadir-Seen-Fire) [
http://danielfriesen.name/]