On 2013-10-24 10:01 AM, Élie Roux wrote:
- If I change this limit to 1023 in the structure of the database
('page_title' field of the 'page' base), will other things (such as search engine) break? Is there a way to change it more cleanly?
Other things won't break. But the limit is hardcoded into the software so it won't actually change the limit. However if you change both places you will encounter bugs unless you make sure to make a huge number of other changes to the database and also handle user names by either changing every column that accepts a user name or editing User.php to use a limit other than what Title uses.
- Could I propose you to make this limit 1023 instead of 255 (or to
make it configurable easily)? This would allow at least 256 characters (even for asian languages) instead of 256 bytes, which seems more consistant with the fact that MediaWiki is well internationalized.
Just to clarify (You're switching between 255 and 256 through your message) it's 255 bytes. The choice of using 255 bytes for title length wasn't an arbitrary choice. 255 bytes – not characters – was picked because 255 bytes is the maximum string length that can be represented using 1 byte. ie: 1 byte (8 bits -> 2^8=265 values not including 0) can represent the range 0-255 so a varchar uses 1 byte to declare the length of the varchar's contents. If you use ANY length larger than 255 the varchar will always require 2 bytes to store the length of the column.
Changing the maximum length of a title will require a huge number of changes and is not easy to make configurable so we should do this only one time changing the maximum length of title to the maximum we can feasibly make it.
* It's hardcoded in Title::secureAndSplit. * User::getCanonicalName will need some extra code to restrict usernames to 255 bytes as it currently depends on Title to do that. (And increasing the user name length is far more problematic than the title length) * docs/title.txt and TitleTest.php will need an update. * The canonical page.page_title needs an ALTER COLUMN. * A number of other core tables will need column alters; (archive, page, template, image, category)links.(a, p, t, i, c)l_title, category.cat_title, (lang, iw)links.(l,iw)l_title, recentchanges.rc_title, and so on. * All extensions that store titles in the database will need to do similar ALTER COLUMN updates or else they will have unexpected errors. o It might be worth doing this first. Since a larger varchar length in the database won't cause any bugs while the software still uses 255 bytes.
The next (and tbh final) title max length shouldn't be something arbitrary, we should pick the maximum we can theoretically store (max we can store with 2 bytes/a 16bit uint and the max we can possibly store are actually the same thing, at least on MySQL in regards to VARCHAR).
The next step up from 1 byte representing 0-255 is not 0-1023 as you're thinking. 2 bytes can represent a length of 0-65535 bytes. However we actually cannot use VARCHAR(65535) for the title. There are a number of other limits we hit which cap varchar column lengths below what can be represented with a 16bit uint length.
* MySQL Has a maximum row length of 65,535 bytes[1]; This includes the storage of whatever all the columns on the table require in the row storage. * PostgreSQL doesn't seem to have the same max row size issues as MySQL because it's row max – depending on whether you ask the about page[2] or wiki FAQ[3] – is either 1.6TB or 400GB. And the column max size is 1GB. * However PostgreSQL seems to say "indexes can not be created on columns longer than about 2,000 characters"[1]. I don't know the precise details but it might make our limit 2000 bytes. ((We'll need some more input on someone who knows PostgreSQL)) * The varchar WP page[4] says Oracle's limit is 4000 bytes. * Before MySQL 5.0.3 VARCHAR colums could only be declared a maximum of 255. ((This means we'll have to drop support for 5.0.2 and change our "5.0.2 or later" MySQL requirements to "5.0.3 or later")) * MyISAM's index prefix maximum is 1000 bytes and InnoDB is 767 (unless you use a dynamic/compressed field and innodb_large_prefix). This means that [[A{1000 bytes}A]] and [[A{1000 bytes}B]] cannot both exist as the index prefix is used to ensure uniqueness. This limit will be database dependent. And the only fix would be to add a new column containing a hash of the title text and drop the uniqueness constraint on page_title.
Deciding the title max we should use will probably need some more information than what I've gathered so far.
((Side topic: We use `varchar(n) binary` for the title now. However anyone that feels like changing this to `varchar(n) CHARACTER SET utf8` needs to be wary that MySQL triples the (n) so it can store "<n> utf8 chars" instead of "<n> bytes that happen to be utf8" so they may need to divide whatever we use as the new max by 3 (unless MySQL does additional enforcing of char lengths in which case anyone doing that is stuck doing significant changes to how MW determines the maxlength of a title).[1]))
[1] https://dev.mysql.com/doc/refman/5.6/en/column-count-limit.html [2] http://www.postgresql.org/about/ [3] https://wiki.postgresql.org/wiki/FAQ#What_is_the_maximum_size_for_a_row.2C_a... [4] https://en.wikipedia.org/wiki/Varchar [5] https://dev.mysql.com/doc/refman/5.6/en/innodb-restrictions.html
~Daniel Friesen (Dantman, Nadir-Seen-Fire) [http://danielfriesen.name/]