I just discovered that some code I've been working on, which passes all its unit tests against sqllite, fails when I run it against tools.db.svc.wikimedia.cloud. It's apparently due differences in the two drivers (see for example, this bug https://github.com/sqlalchemy/alembic/issues/912). The error I get (using SQL Alchemy 2.0.0rc3) is:
sqlalchemy.exc.CompileError: (in table 'bot_log', column 'title'): VARCHAR requires a length on dialect mysql
So, I've got two questions:
1) Has anybody seen this problem before and figured out how to work around it? I suppose I could just declare an explicit length for that column, but what part of VARCHAR did they not understand?
2) Is there a scratch database instance I can run unit tests against? I'd rather not do this kind of testing against anything that's in production, to ensure I don't accidentally cause any damage.
On Wed, Jan 25, 2023 at 8:43 AM Roy Smith roy@panix.com wrote:
I suppose I could just declare an explicit length for that column, but what part of VARCHAR did they not understand?
Yea, it's variable length but it still wants you to give it a maximum length.
upstream says: "The CHAR and VARCHAR types are declared with a length that indicates the maximum number of characters you want to store. "
So I think that is indeed your solution to just add _max_ length. But it can still vary up to this length.
Thanks. This field is intended to hold a wikipedia page title (including the namespace). Would 255 be long enough to hold any legal title?
On Jan 25, 2023, at 11:53 AM, Daniel Zahn dzahn@wikimedia.org wrote:
Yea, it's variable length but it still wants you to give it a maximum length.
On Wed, Jan 25, 2023 at 12:33 PM Roy Smith roy@panix.com wrote:
Thanks. This field is intended to hold a wikipedia page title (including the namespace). Would 255 be long enough to hold any legal title?
Looks like that should be the actual limit, yea.
"A pagename cannot exceed 255 bytes in length. Be aware that non-ASCII characters may take up to four bytes in UTF-8 encoding, so the total number of characters that can fit into a title may be less than 255."
found at https://en.wikipedia.org/wiki/Wikipedia:Page_name#Technical_restrictions_and...
The next problem is that apparently mysql can't index a VARCHAR(255). You get:
sqlalchemy.exc.OperationalError: (pymysql.err.OperationalError) (1709, 'Index column size too large. The maximum column size is 767 bytes.')
One solution I saw mentioned was to not index the string, but to compute a hash, store that as a separate column, and index the hash. Which feels like I'm working for the database instead of the other way around.
On Jan 25, 2023, at 3:52 PM, Daniel Zahn dzahn@wikimedia.org wrote:
On Wed, Jan 25, 2023 at 12:33 PM Roy Smith roy@panix.com wrote:
Thanks. This field is intended to hold a wikipedia page title (including the namespace). Would 255 be long enough to hold any legal title?
Looks like that should be the actual limit, yea.
"A pagename cannot exceed 255 bytes in length. Be aware that non-ASCII characters may take up to four bytes in UTF-8 encoding, so the total number of characters that can fit into a title may be less than 255."
found at https://en.wikipedia.org/wiki/Wikipedia:Page_name#Technical_restrictions_and...
-- Daniel Zahn dzahn@wikimedia.org Site Reliability Engineer _______________________________________________ Cloud mailing list -- cloud@lists.wikimedia.org List information: https://lists.wikimedia.org/postorius/lists/cloud.lists.wikimedia.org/
MediaWiki uses a VARBINARY(255) NOT NULL for page_title
in doubt, you would use VARCHAR(255) over other length, since that's the largest value for a VARCHAR (a value larger than 255 would automatically become a MEDIUMTEXT column)
sqlalchemy.exc.OperationalError: (pymysql.err.OperationalError) (1709,
'Index column size too large. The maximum column size is 767 bytes.')
This depends on the column collation. if you store ASCII characters, that's one byte per character. If you store utf-8 characters, each of them could be several bytes (also depending if you only support the Basic Multilingual Plane, as "old utf8", or all of them, "utf8_mb4"). It is possible to define the index as the first N bytes of the column (generally more than enough for what you will need).
But in this case, using varbinary should solve your issue;
Regards
Hi,
On 1/25/23 11:42, Roy Smith wrote:
- Is there a scratch database instance I can run unit tests against?
I'd rather not do this kind of testing against anything that's in production, to ensure I don't accidentally cause any damage.
Normally I just spin up a MariaDB container locally in podman/docker. But you can also create a second tool named "$tool-dev" or "$tool-beta"[1] to play around with.
[1] For example, https://toolsadmin.wikimedia.org/tools/id/ldap-beta
-- Kunal / Legoktm
Note also that the 255 byte limit for a title is without the namespace; if your bot_log.title includes the namespace name, 255 bytes might not be enough after all.
Am Do., 26. Jan. 2023 um 00:05 Uhr schrieb Kunal Mehta legoktm@debian.org:
Hi,
On 1/25/23 11:42, Roy Smith wrote:
- Is there a scratch database instance I can run unit tests against?
I'd rather not do this kind of testing against anything that's in production, to ensure I don't accidentally cause any damage.
Normally I just spin up a MariaDB container locally in podman/docker. But you can also create a second tool named "$tool-dev" or "$tool-beta"[1] to play around with.
[1] For example, https://toolsadmin.wikimedia.org/tools/id/ldap-beta
-- Kunal / Legoktm _______________________________________________ Cloud mailing list -- cloud@lists.wikimedia.org List information: https://lists.wikimedia.org/postorius/lists/cloud.lists.wikimedia.org/