Revision: 34388 `ug1`.`ug_user`=`user_id`";
Ugh. Please don't hardcode backticks: not only are they unneeded here, but they horribly break any chance we have of cross-database compatibility.
-- Greg Sabino Mullane greg@turnstep.com PGP Key: 0x14964AC8 200805071519 http://biglumber.com/x/web?pk=2529DF6AB8F79407E94445B4BC9B906714964AC8
Greg Sabino Mullane schreef:
Revision: 34388 `ug1`.`ug_user`=`user_id`";
Ugh. Please don't hardcode backticks: not only are they unneeded here, but they horribly break any chance we have of cross-database compatibility.
They're far from unneeded. Database::select() is broken in that it adds the backticks wrongly, generating invalid SQL [1]. I'll poke dantman to fix this, but in the meantime we should have a working API.
Roan Kattouw (Catrope)
[1] http://www.mediawiki.org/wiki/User_talk:Catrope#API_probleem
Ugh... this looks more like bad usage of Database functions on the API's part rather than a breakage in the Database functions. When I rewrote tableName talking to brion for feedback on what various inputs should be defined as, we cleaned up and strictly defined what inputting something into tableName meant: * Strings in the form 'table' were loose input which were expected to have a shared DB assigned if they were shared, prefixed with the right database, and quoted. * Strings in the form '`table`' or '`table.with.dots`' were literal, when one is seen with quotes on both sides, it should be left alone. * Strings in the form 'database.table', '`database`.table', 'database.`table`' are a database.table pair, no prefixes should be added, and the shared database should not override it. These should only be quoted.
But it looks like tableName is getting input it shouldn't be. The API is basically inputting: "`database`.`table` dbt ON dbt.column=value" Which, honestly is not a table. That's a full blown SQL JOIN query.
This is a result of the chain of calls: 1) ApiQueryUsers.php calls: $this->addTables($tables); With the input: "$userTable AS u1 LEFT JOIN $ug ON ug_user=u1.user_id LEFT JOIN $ipb ON ipb_user=u1.user_id LEFT JOIN $userTable AS u2 ON ipb_by=u2.user_id"; 2) ApiQueryBase.php's addTables adds those to $this->tables; 3) ApiQueryUsers.php calls: $this->select(__METHOD__); 3) ApiQueryBase.php's select calls: $db->select($this->tables, $this->fields, $this->where, $method, $this->options); 4) Database's select takes $this->tables and calls tableName for each one: 5) tableName receives the input "`user` AS u1 LEFT JOIN `user_groups` ON ug_user=u1.user_id LEFT JOIN `ipblocks` ON ipb_user=u1.user_id LEFT JOIN `user` AS u2 ON ipb_by=u2.user_id" 6) Because there is a ` at the start, but no ` at the end, and tableName is only supposed to handle table names it splits into "`user` AS u1 LEFT JOIN `user_groups` ON ug_user=u1.user_id LEFT JOIN `ipblocks` ON ipb_user=u1" and "user_id LEFT JOIN `ipblocks` ON ipb_user=u1.user_id LEFT JOIN `user` AS u2 ON ipb_by=u2.user_id" and quotes the second one because there is no ` at the start.
Additionally if you haven't caught it ApiQueryBase::addTables also calls $tables = $this->getDB()->tableName($tables) . ' ' . $alias; just a heads up, but that $alias normally has no `'s to quote it, right now it should probably check if it should quote the $alias just so that doesn't end up as an error while we sort out what to do to make tableName handle input it's not supposed to handle correctly.
We've got multiple options on how to fix this: 1) Find a better way to make complex queries using the Database class that doesn't involve building complex queries, and then trying to shove them into the simple handlers. This would greatly improve cross-db compatibility if done. 2) Have the API properly quote things it sends to simple database methods (using clean functions we could put into the Database class, such as tableQuote). This should be fine for working, as the MySQL handler is the only one which tests for ` the other DB tableName's don't do this. So with tableQuote always outputting the correct type of quoting it shouldn't break in other databases. That improves cross-db compatibility, but not as good as (1). 3) Don't use the Simple Database handlers. The API is already constructing near full SQL queries, calling simple DB methods to build the last portion, and expecting them to handle complex data without messing it up is kludge work and is bound to fail over and over. The select call inside of ApiQueryBase is only a light step away from simply dropping the data it has into a sql query and sending that directly to Query. Database::select is actually completely database generic, none of the other classes override it. It uses other functions inside of the Database class to handle the portions of data which are different with different database platforms. Api's select could easily call those same database methods to handle it's data in a cross-database compatible way then send the finalized data to query. 4) We could use some sort of code character, like the parser's \x07 which is never supposed to show up anywhere and have either tableName or select test for it and consider it as a signal of "I've fully handled all database stuff inside this string, don't touch it one bit, or else." And have query, select, or tableName strip it out when done with it.
4 is the quickest and easiest to do. Though of course, it's still a hack.... Then again, the Parser is a pretty big one to. 1 is fairly complex to do, but it would aid anyone doing joins and such and provide a new simple method to abstract the complex stuff. 2 may take some handling around the Api, though I'm not for it that much because of the reasons in 3 for not using simple functions to handle complex data. IMHO 3 is the best method for both cross-db compatibility and simplicity.
Input?
~Daniel Friesen(Dantman) of: -The Gaiapedia (http://gaia.wikia.com) -Wikia ACG on Wikia.com (http://wikia.com/wiki/Wikia_ACG) -and Wiki-Tools.com (http://wiki-tools.com)
Roan Kattouw wrote:
Revision 34353 [1] (by you) caused some breakage in Database::select(), which I worked around in in r34388 [2], a workaround Greg didn't like (see wikitech-l). Could you look into this piece of code again and see if you can get it to behave properly for complex things?
Roan Kattouw (Catrope)
[1] http://svn.wikimedia.org/viewvc/mediawiki?view=rev&revision=34353 [2] http://svn.wikimedia.org/viewvc/mediawiki?view=rev&revision=34388
Roan Kattouw wrote:
Greg Sabino Mullane schreef:
Revision: 34388 `ug1`.`ug_user`=`user_id`";
Ugh. Please don't hardcode backticks: not only are they unneeded here, but they horribly break any chance we have of cross-database compatibility.
They're far from unneeded. Database::select() is broken in that it adds the backticks wrongly, generating invalid SQL [1]. I'll poke dantman to fix this, but in the meantime we should have a working API.
Roan Kattouw (Catrope)
[1] http://www.mediawiki.org/wiki/User_talk:Catrope#API_probleem
Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Oh... I forgot one: 5) Have tableName test for words like ON, JOIN, etc... which signal this is a sql query, not a table.
~Daniel Friesen(Dantman) of: -The Gaiapedia (http://gaia.wikia.com) -Wikia ACG on Wikia.com (http://wikia.com/wiki/Wikia_ACG) -and Wiki-Tools.com (http://wiki-tools.com)
DanTMan wrote:
Ugh... this looks more like bad usage of Database functions on the API's part rather than a breakage in the Database functions. When I rewrote tableName talking to brion for feedback on what various inputs should be defined as, we cleaned up and strictly defined what inputting something into tableName meant:
- Strings in the form 'table' were loose input which were expected to
have a shared DB assigned if they were shared, prefixed with the right database, and quoted.
- Strings in the form '`table`' or '`table.with.dots`' were literal,
when one is seen with quotes on both sides, it should be left alone.
- Strings in the form 'database.table', '`database`.table',
'database.`table`' are a database.table pair, no prefixes should be added, and the shared database should not override it. These should only be quoted.
But it looks like tableName is getting input it shouldn't be. The API is basically inputting: "`database`.`table` dbt ON dbt.column=value" Which, honestly is not a table. That's a full blown SQL JOIN query.
This is a result of the chain of calls:
- ApiQueryUsers.php calls:
$this->addTables($tables); With the input: "$userTable AS u1 LEFT JOIN $ug ON ug_user=u1.user_id LEFT JOIN $ipb ON ipb_user=u1.user_id LEFT JOIN $userTable AS u2 ON ipb_by=u2.user_id"; 2) ApiQueryBase.php's addTables adds those to $this->tables; 3) ApiQueryUsers.php calls: $this->select(__METHOD__); 3) ApiQueryBase.php's select calls: $db->select($this->tables, $this->fields, $this->where, $method, $this->options); 4) Database's select takes $this->tables and calls tableName for each one: 5) tableName receives the input "`user` AS u1 LEFT JOIN `user_groups` ON ug_user=u1.user_id LEFT JOIN `ipblocks` ON ipb_user=u1.user_id LEFT JOIN `user` AS u2 ON ipb_by=u2.user_id" 6) Because there is a ` at the start, but no ` at the end, and tableName is only supposed to handle table names it splits into "`user` AS u1 LEFT JOIN `user_groups` ON ug_user=u1.user_id LEFT JOIN `ipblocks` ON ipb_user=u1" and "user_id LEFT JOIN `ipblocks` ON ipb_user=u1.user_id LEFT JOIN `user` AS u2 ON ipb_by=u2.user_id" and quotes the second one because there is no ` at the start.
Additionally if you haven't caught it ApiQueryBase::addTables also calls $tables = $this->getDB()->tableName($tables) . ' ' . $alias; just a heads up, but that $alias normally has no `'s to quote it, right now it should probably check if it should quote the $alias just so that doesn't end up as an error while we sort out what to do to make tableName handle input it's not supposed to handle correctly.
We've got multiple options on how to fix this:
- Find a better way to make complex queries using the Database class
that doesn't involve building complex queries, and then trying to shove them into the simple handlers. This would greatly improve cross-db compatibility if done. 2) Have the API properly quote things it sends to simple database methods (using clean functions we could put into the Database class, such as tableQuote). This should be fine for working, as the MySQL handler is the only one which tests for ` the other DB tableName's don't do this. So with tableQuote always outputting the correct type of quoting it shouldn't break in other databases. That improves cross-db compatibility, but not as good as (1). 3) Don't use the Simple Database handlers. The API is already constructing near full SQL queries, calling simple DB methods to build the last portion, and expecting them to handle complex data without messing it up is kludge work and is bound to fail over and over. The select call inside of ApiQueryBase is only a light step away from simply dropping the data it has into a sql query and sending that directly to Query. Database::select is actually completely database generic, none of the other classes override it. It uses other functions inside of the Database class to handle the portions of data which are different with different database platforms. Api's select could easily call those same database methods to handle it's data in a cross-database compatible way then send the finalized data to query. 4) We could use some sort of code character, like the parser's \x07 which is never supposed to show up anywhere and have either tableName or select test for it and consider it as a signal of "I've fully handled all database stuff inside this string, don't touch it one bit, or else." And have query, select, or tableName strip it out when done with it.
4 is the quickest and easiest to do. Though of course, it's still a hack.... Then again, the Parser is a pretty big one to. 1 is fairly complex to do, but it would aid anyone doing joins and such and provide a new simple method to abstract the complex stuff. 2 may take some handling around the Api, though I'm not for it that much because of the reasons in 3 for not using simple functions to handle complex data. IMHO 3 is the best method for both cross-db compatibility and simplicity.
Input?
~Daniel Friesen(Dantman) of: -The Gaiapedia (http://gaia.wikia.com) -Wikia ACG on Wikia.com (http://wikia.com/wiki/Wikia_ACG) -and Wiki-Tools.com (http://wiki-tools.com)
Roan Kattouw wrote:
Revision 34353 [1] (by you) caused some breakage in Database::select(), which I worked around in in r34388 [2], a workaround Greg didn't like (see wikitech-l). Could you look into this piece of code again and see if you can get it to behave properly for complex things?
Roan Kattouw (Catrope)
[1] http://svn.wikimedia.org/viewvc/mediawiki?view=rev&revision=34353 [2] http://svn.wikimedia.org/viewvc/mediawiki?view=rev&revision=34388
Roan Kattouw wrote:
Greg Sabino Mullane schreef:
Revision: 34388 `ug1`.`ug_user`=`user_id`";
Ugh. Please don't hardcode backticks: not only are they unneeded here, but they horribly break any chance we have of cross-database compatibility.
They're far from unneeded. Database::select() is broken in that it adds the backticks wrongly, generating invalid SQL [1]. I'll poke dantman to fix this, but in the meantime we should have a working API.
Roan Kattouw (Catrope)
[1] http://www.mediawiki.org/wiki/User_talk:Catrope#API_probleem
Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
On Wed, May 7, 2008 at 7:43 PM, DanTMan dan_the_man@telus.net wrote:
Oh... I forgot one: 5) Have tableName test for words like ON, JOIN, etc... which signal this is a sql query, not a table.
This is the obvious way. If anyone has a database table called "JOIN" they're already severe masochists and will probably enjoy being punished by breakage of database functions. In particular, this would enable LEFT JOINs to be done using the wrappers, which I'm pretty sure is impossible right now.
On Wed, May 7, 2008 at 3:23 PM, Roan Kattouw roan.kattouw@home.nl wrote:
They're far from unneeded. Database::select() is broken in that it adds the backticks wrongly, generating invalid SQL [1]. I'll poke dantman to fix this, but in the meantime we should have a working API.
You know, every other place in the entire code base where we have inner joins, we just use implicit joins with WHERE. So in other words, don't use ON at all. Instead of SELECT * FROM page INNER JOIN revision ON page_id=rev_page WHERE page_id=731, use SELECT * FROM page, revision WHERE page_id=731 AND page_id=rev_page (don't ever try either for arbitrary pages, BTW :)). The queries are totally equivalent. There should be no reason to use any other syntax for inner joins, although stylistically I do prefer ON clauses.
I can't patch your code, however, since as usual, the API uses its own syntax for database queries and non-API developers like me have no idea how it works.
Well... I was worried about people with tables like "onsomething" or whatever... But I suppose I can test strictly for things with whitespace around them "/(^|\s)(ON|JOIN)(\s|$)/i" that should solve the issue.
But I still think 3 is good. IMHO we should probably do 5 and 3 for best practice and avoidance of future issues. If someone feels like it, having 1 done would be great to. Think of it, helper functions for various types of joins. Unions to!
Actually, I used to do something to the Database class. I actually had a tweak for select and a few other functions. I basically added a extra parameter $doQuery to the end which defaulted to true. Basically it meant that you could explicitly send false and have select return a SQL statement for you instead. That way you could go as far as to use select, then manually UNION them together. But what would be absolutely beautiful would be: list( $user, $ug, $ugS) = $db->tableNameN('user', 'user_groups', $ug = $db->tableName("`$wgSharedDB`.`{$wgSharedPrefix}user_groups`"); $res = $db->union( $db-join( $db->select( $user, conds, etc..., /*doQuery=*/false ), $ug, conds, etc..., /*doQuery=*/false ), $db-join( $db->select( $user, conds, etc..., /*doQuery=*/false ), $ugS, conds, etc..., /*doQuery=*/false ), ); ^_^ Which basically, would have select return sql which join uses to construct a join statement for. And then union, well... Unions them together.
~Daniel Friesen(Dantman) of: -The Gaiapedia (http://gaia.wikia.com) -Wikia ACG on Wikia.com (http://wikia.com/wiki/Wikia_ACG) -and Wiki-Tools.com (http://wiki-tools.com)
Simetrical wrote:
On Wed, May 7, 2008 at 7:43 PM, DanTMan dan_the_man@telus.net wrote:
Oh... I forgot one: 5) Have tableName test for words like ON, JOIN, etc... which signal this is a sql query, not a table.
This is the obvious way. If anyone has a database table called "JOIN" they're already severe masochists and will probably enjoy being punished by breakage of database functions. In particular, this would enable LEFT JOINs to be done using the wrappers, which I'm pretty sure is impossible right now.
On Wed, May 7, 2008 at 3:23 PM, Roan Kattouw roan.kattouw@home.nl wrote:
They're far from unneeded. Database::select() is broken in that it adds the backticks wrongly, generating invalid SQL [1]. I'll poke dantman to fix this, but in the meantime we should have a working API.
You know, every other place in the entire code base where we have inner joins, we just use implicit joins with WHERE. So in other words, don't use ON at all. Instead of SELECT * FROM page INNER JOIN revision ON page_id=rev_page WHERE page_id=731, use SELECT * FROM page, revision WHERE page_id=731 AND page_id=rev_page (don't ever try either for arbitrary pages, BTW :)). The queries are totally equivalent. There should be no reason to use any other syntax for inner joins, although stylistically I do prefer ON clauses.
I can't patch your code, however, since as usual, the API uses its own syntax for database queries and non-API developers like me have no idea how it works.
Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
On Wed, May 7, 2008 at 8:44 PM, DanTMan dan_the_man@telus.net wrote:
Well... I was worried about people with tables like "onsomething" or whatever... But I suppose I can test strictly for things with whitespace around them "/(^|\s)(ON|JOIN)(\s|$)/i" that should solve the issue.
Or use \b, that's what it's there for.
But I still think 3 is good. IMHO we should probably do 5 and 3 for best practice and avoidance of future issues.
Maybe, given how the API has its own database handling layer anyway, seemingly.
Actually, I used to do something to the Database class. I actually had a tweak for select and a few other functions. I basically added a extra parameter $doQuery to the end which defaulted to true. Basically it meant that you could explicitly send false and have select return a SQL statement for you instead. That way you could go as far as to use select, then manually UNION them together. But what would be absolutely beautiful would be: list( $user, $ug, $ugS) = $db->tableNameN('user', 'user_groups', $ug = $db->tableName("`$wgSharedDB`.`{$wgSharedPrefix}user_groups`"); $res = $db->union( $db-join( $db->select( $user, conds, etc..., /*doQuery=*/false ), $ug, conds, etc..., /*doQuery=*/false ), $db-join( $db->select( $user, conds, etc..., /*doQuery=*/false ), $ugS, conds, etc..., /*doQuery=*/false ), ); ^_^ Which basically, would have select return sql which join uses to construct a join statement for. And then union, well... Unions them together.
Operator overloading would be the awesomest for this. Consider this Python statement:
res = SELECT + (field.field1, field.field2) + FROM( table.user, table.page ) + WHERE( field.page_title == field.user_name and field.page_namespace == NS_USER and field.user_id == 1729 ) + UNION + ...
with all operators (+, ==, and) overloaded, and SELECT/FROM/WHERE/UNION object names, and "field" and "table" dummy objects that just exist for __get__ methods (so table.user would be the same as $dbr->tableName( 'user' )).
Of course, I don't seriously suggest such a thing, so I guess it's off-topic, but I've pondered it before and thought it would be fun to mention as a counterpoint. The fact that it's possible is somewhat mind-boggling. Maybe an argument against operator overloading? :)
I did think of \b, however I'm wary of it. I tied the test to whitespace or starting/ending of string because of the remote case where someone has something like -on- as their table name. Because \b matches breaks there are a fair number of symbols which it also considers to be word breaks because they are not word characters. And it's possible for some of those to be valid inside of table names.
^_^ Actually your python example is kinda interesting. I was actually contemplating something not deep like what you were mentioning, but something which does do some sort of overloading. It came from the usage of tableName to generate table names with aliases. Imagine if tableName, or something else returned a object rather than a flat string. And that object could contain a table name, and an alias for that. It would even work beautifully in current uses because it would override the toString method and when used in string concatenation it would output a valid SQL string, with the Alias to if necessary. And could have extra methods for any special type of syntax for table alias pairs. We could even tweak tableName to accept an optional alias. And tweak the inputs of the other functions to allow some special table alias pair.
But yes... It is a bit of a tangent. However, it is worth looking at as a side project. Honestly, I see piles of shitty database abstraction systems which try and rewrite your SQL to make them fully 'cross database compatible' when most people consider that a load of bullshit because no matter how much you abstract you still need to understand a system you migrate to and it doesn't just magically work. But what I don't see a single instance of, is what MediaWiki has done a bit of. Converting the construction of SQL in strings, into a set of helper functions which aid in readability, portability, and cross-compliance of most sql calls. Which even have the amazing bonus of properly escaping things where they come in to eliminate most of the issues of accidentally forgetting to secure something which could be used for SQL injection. Honestly, would you rather call $db->select, or have to remember to mysql_real_escape_string every bit of data manually? The former always escapes cleanly and avoids having you forget to do it.
We may also want to add some helpers for a few things people might want to do in conditions. Basically IN, AND, OR, >, <, not, and a few other things I've found myself needing to construct a mini condition string rather than using the nice assignment that escapes everything.
~Daniel Friesen(Dantman) of: -The Gaiapedia (http://gaia.wikia.com) -Wikia ACG on Wikia.com (http://wikia.com/wiki/Wikia_ACG) -and Wiki-Tools.com (http://wiki-tools.com)
Simetrical wrote:
On Wed, May 7, 2008 at 8:44 PM, DanTMan dan_the_man@telus.net wrote:
Well... I was worried about people with tables like "onsomething" or whatever... But I suppose I can test strictly for things with whitespace around them "/(^|\s)(ON|JOIN)(\s|$)/i" that should solve the issue.
Or use \b, that's what it's there for.
But I still think 3 is good. IMHO we should probably do 5 and 3 for best practice and avoidance of future issues.
Maybe, given how the API has its own database handling layer anyway, seemingly.
Actually, I used to do something to the Database class. I actually had a tweak for select and a few other functions. I basically added a extra parameter $doQuery to the end which defaulted to true. Basically it meant that you could explicitly send false and have select return a SQL statement for you instead. That way you could go as far as to use select, then manually UNION them together. But what would be absolutely beautiful would be: list( $user, $ug, $ugS) = $db->tableNameN('user', 'user_groups', $ug = $db->tableName("`$wgSharedDB`.`{$wgSharedPrefix}user_groups`"); $res = $db->union( $db-join( $db->select( $user, conds, etc..., /*doQuery=*/false ), $ug, conds, etc..., /*doQuery=*/false ), $db-join( $db->select( $user, conds, etc..., /*doQuery=*/false ), $ugS, conds, etc..., /*doQuery=*/false ), ); ^_^ Which basically, would have select return sql which join uses to construct a join statement for. And then union, well... Unions them together.
Operator overloading would be the awesomest for this. Consider this Python statement:
res = SELECT + (field.field1, field.field2) + FROM( table.user, table.page ) + WHERE( field.page_title == field.user_name and field.page_namespace == NS_USER and field.user_id == 1729 ) + UNION + ...
with all operators (+, ==, and) overloaded, and SELECT/FROM/WHERE/UNION object names, and "field" and "table" dummy objects that just exist for __get__ methods (so table.user would be the same as $dbr->tableName( 'user' )).
Of course, I don't seriously suggest such a thing, so I guess it's off-topic, but I've pondered it before and thought it would be fun to mention as a counterpoint. The fact that it's possible is somewhat mind-boggling. Maybe an argument against operator overloading? :)
Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Ok, the test has been committed into core. http://svn.wikimedia.org/viewvc/mediawiki?view=rev&revision=34422
~Daniel Friesen(Dantman) of: -The Gaiapedia (http://gaia.wikia.com) -Wikia ACG on Wikia.com (http://wikia.com/wiki/Wikia_ACG) -and Wiki-Tools.com (http://wiki-tools.com)
DanTMan wrote:
I did think of \b, however I'm wary of it. I tied the test to whitespace or starting/ending of string because of the remote case where someone has something like -on- as their table name. Because \b matches breaks there are a fair number of symbols which it also considers to be word breaks because they are not word characters. And it's possible for some of those to be valid inside of table names.
^_^ Actually your python example is kinda interesting. I was actually contemplating something not deep like what you were mentioning, but something which does do some sort of overloading. It came from the usage of tableName to generate table names with aliases. Imagine if tableName, or something else returned a object rather than a flat string. And that object could contain a table name, and an alias for that. It would even work beautifully in current uses because it would override the toString method and when used in string concatenation it would output a valid SQL string, with the Alias to if necessary. And could have extra methods for any special type of syntax for table alias pairs. We could even tweak tableName to accept an optional alias. And tweak the inputs of the other functions to allow some special table alias pair.
But yes... It is a bit of a tangent. However, it is worth looking at as a side project. Honestly, I see piles of shitty database abstraction systems which try and rewrite your SQL to make them fully 'cross database compatible' when most people consider that a load of bullshit because no matter how much you abstract you still need to understand a system you migrate to and it doesn't just magically work. But what I don't see a single instance of, is what MediaWiki has done a bit of. Converting the construction of SQL in strings, into a set of helper functions which aid in readability, portability, and cross-compliance of most sql calls. Which even have the amazing bonus of properly escaping things where they come in to eliminate most of the issues of accidentally forgetting to secure something which could be used for SQL injection. Honestly, would you rather call $db->select, or have to remember to mysql_real_escape_string every bit of data manually? The former always escapes cleanly and avoids having you forget to do it.
We may also want to add some helpers for a few things people might want to do in conditions. Basically IN, AND, OR, >, <, not, and a few other things I've found myself needing to construct a mini condition string rather than using the nice assignment that escapes everything.
~Daniel Friesen(Dantman) of: -The Gaiapedia (http://gaia.wikia.com) -Wikia ACG on Wikia.com (http://wikia.com/wiki/Wikia_ACG) -and Wiki-Tools.com (http://wiki-tools.com)
Simetrical wrote:
On Wed, May 7, 2008 at 8:44 PM, DanTMan dan_the_man@telus.net wrote:
Well... I was worried about people with tables like "onsomething" or whatever... But I suppose I can test strictly for things with whitespace around them "/(^|\s)(ON|JOIN)(\s|$)/i" that should solve the issue.
Or use \b, that's what it's there for.
But I still think 3 is good. IMHO we should probably do 5 and 3 for best practice and avoidance of future issues.
Maybe, given how the API has its own database handling layer anyway, seemingly.
Actually, I used to do something to the Database class. I actually had a tweak for select and a few other functions. I basically added a extra parameter $doQuery to the end which defaulted to true. Basically it meant that you could explicitly send false and have select return a SQL statement for you instead. That way you could go as far as to use select, then manually UNION them together. But what would be absolutely beautiful would be: list( $user, $ug, $ugS) = $db->tableNameN('user', 'user_groups', $ug = $db->tableName("`$wgSharedDB`.`{$wgSharedPrefix}user_groups`"); $res = $db->union( $db-join( $db->select( $user, conds, etc..., /*doQuery=*/false ), $ug, conds, etc..., /*doQuery=*/false ), $db-join( $db->select( $user, conds, etc..., /*doQuery=*/false ), $ugS, conds, etc..., /*doQuery=*/false ), ); ^_^ Which basically, would have select return sql which join uses to construct a join statement for. And then union, well... Unions them together.
Operator overloading would be the awesomest for this. Consider this Python statement:
res = SELECT + (field.field1, field.field2) + FROM( table.user, table.page ) + WHERE( field.page_title == field.user_name and field.page_namespace == NS_USER and field.user_id == 1729 ) + UNION + ...
with all operators (+, ==, and) overloaded, and SELECT/FROM/WHERE/UNION object names, and "field" and "table" dummy objects that just exist for __get__ methods (so table.user would be the same as $dbr->tableName( 'user' )).
Of course, I don't seriously suggest such a thing, so I guess it's off-topic, but I've pondered it before and thought it would be fun to mention as a counterpoint. The fact that it's possible is somewhat mind-boggling. Maybe an argument against operator overloading? :)
Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Sorry, mixed up the definition of preg_match and broke things in that patch. Fixed in r34423.
~Daniel Friesen(Dantman) of: -The Gaiapedia (http://gaia.wikia.com) -Wikia ACG on Wikia.com (http://wikia.com/wiki/Wikia_ACG) -and Wiki-Tools.com (http://wiki-tools.com)
DanTMan wrote:
Ok, the test has been committed into core. http://svn.wikimedia.org/viewvc/mediawiki?view=rev&revision=34422
~Daniel Friesen(Dantman) of: -The Gaiapedia (http://gaia.wikia.com) -Wikia ACG on Wikia.com (http://wikia.com/wiki/Wikia_ACG) -and Wiki-Tools.com (http://wiki-tools.com)
DanTMan wrote:
I did think of \b, however I'm wary of it. I tied the test to whitespace or starting/ending of string because of the remote case where someone has something like -on- as their table name. Because \b matches breaks there are a fair number of symbols which it also considers to be word breaks because they are not word characters. And it's possible for some of those to be valid inside of table names.
^_^ Actually your python example is kinda interesting. I was actually contemplating something not deep like what you were mentioning, but something which does do some sort of overloading. It came from the usage of tableName to generate table names with aliases. Imagine if tableName, or something else returned a object rather than a flat string. And that object could contain a table name, and an alias for that. It would even work beautifully in current uses because it would override the toString method and when used in string concatenation it would output a valid SQL string, with the Alias to if necessary. And could have extra methods for any special type of syntax for table alias pairs. We could even tweak tableName to accept an optional alias. And tweak the inputs of the other functions to allow some special table alias pair.
But yes... It is a bit of a tangent. However, it is worth looking at as a side project. Honestly, I see piles of shitty database abstraction systems which try and rewrite your SQL to make them fully 'cross database compatible' when most people consider that a load of bullshit because no matter how much you abstract you still need to understand a system you migrate to and it doesn't just magically work. But what I don't see a single instance of, is what MediaWiki has done a bit of. Converting the construction of SQL in strings, into a set of helper functions which aid in readability, portability, and cross-compliance of most sql calls. Which even have the amazing bonus of properly escaping things where they come in to eliminate most of the issues of accidentally forgetting to secure something which could be used for SQL injection. Honestly, would you rather call $db->select, or have to remember to mysql_real_escape_string every bit of data manually? The former always escapes cleanly and avoids having you forget to do it.
We may also want to add some helpers for a few things people might want to do in conditions. Basically IN, AND, OR, >, <, not, and a few other things I've found myself needing to construct a mini condition string rather than using the nice assignment that escapes everything.
~Daniel Friesen(Dantman) of: -The Gaiapedia (http://gaia.wikia.com) -Wikia ACG on Wikia.com (http://wikia.com/wiki/Wikia_ACG) -and Wiki-Tools.com (http://wiki-tools.com)
Simetrical wrote:
On Wed, May 7, 2008 at 8:44 PM, DanTMan dan_the_man@telus.net wrote:
Well... I was worried about people with tables like "onsomething" or whatever... But I suppose I can test strictly for things with whitespace around them "/(^|\s)(ON|JOIN)(\s|$)/i" that should solve the issue.
Or use \b, that's what it's there for.
But I still think 3 is good. IMHO we should probably do 5 and 3 for best practice and avoidance of future issues.
Maybe, given how the API has its own database handling layer anyway, seemingly.
Actually, I used to do something to the Database class. I actually had a tweak for select and a few other functions. I basically added a extra parameter $doQuery to the end which defaulted to true. Basically it meant that you could explicitly send false and have select return a SQL statement for you instead. That way you could go as far as to use select, then manually UNION them together. But what would be absolutely beautiful would be: list( $user, $ug, $ugS) = $db->tableNameN('user', 'user_groups', $ug = $db->tableName("`$wgSharedDB`.`{$wgSharedPrefix}user_groups`"); $res = $db->union( $db-join( $db->select( $user, conds, etc..., /*doQuery=*/false ), $ug, conds, etc..., /*doQuery=*/false ), $db-join( $db->select( $user, conds, etc..., /*doQuery=*/false ), $ugS, conds, etc..., /*doQuery=*/false ), ); ^_^ Which basically, would have select return sql which join uses to construct a join statement for. And then union, well... Unions them together.
Operator overloading would be the awesomest for this. Consider this Python statement:
res = SELECT + (field.field1, field.field2) + FROM( table.user, table.page ) + WHERE( field.page_title == field.user_name and field.page_namespace == NS_USER and field.user_id == 1729 ) + UNION + ...
with all operators (+, ==, and) overloaded, and SELECT/FROM/WHERE/UNION object names, and "field" and "table" dummy objects that just exist for __get__ methods (so table.user would be the same as $dbr->tableName( 'user' )).
Of course, I don't seriously suggest such a thing, so I guess it's off-topic, but I've pondered it before and thought it would be fun to mention as a counterpoint. The fact that it's possible is somewhat mind-boggling. Maybe an argument against operator overloading? :)
Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
On Thu, May 8, 2008 at 4:21 AM, DanTMan dan_the_man@telus.net wrote:
I did think of \b, however I'm wary of it. I tied the test to whitespace or starting/ending of string because of the remote case where someone has something like -on- as their table name. Because \b matches breaks there are a fair number of symbols which it also considers to be word breaks because they are not word characters. And it's possible for some of those to be valid inside of table names.
The table can also be named ` on `. Whereas on the other hand, INNER JOIN table ON(cond) is perfectly valid syntax which your test would get wrong.
But what I don't see a single instance of, is what MediaWiki has done a bit of. Converting the construction of SQL in strings, into a set of helper functions which aid in readability, portability, and cross-compliance of most sql calls. Which even have the amazing bonus of properly escaping things where they come in to eliminate most of the issues of accidentally forgetting to secure something which could be used for SQL injection. Honestly, would you rather call $db->select, or have to remember to mysql_real_escape_string every bit of data manually? The former always escapes cleanly and avoids having you forget to do it.
Actually, I think the SQL queries themselves are most readable, by and large. Escaping can be done printf-style, with query( "SELECT * FROM table WHERE name=%s", $var ), the way Python and C (for instance) do it. (Of course, I imagine that might be fairly tricky with some complicated dynamically-adjusted queries.) The abstraction is maybe the most valuable part, so that things like FORCE INDEX can be included for MySQL and easily removed by other engines -- without having to parse a string.
We may also want to add some helpers for a few things people might want to do in conditions. Basically IN, AND, OR, >, <, not, and a few other things I've found myself needing to construct a mini condition string rather than using the nice assignment that escapes everything.
You can do IN and AND:
$conds = array( 'id' => array( 1, 2, 3 ), 'foreignkey' => 17 )
gives "id IN (1,2,3) AND foreignkey = 17" or equivalent. <, >, OR, and perhaps NOT I occasionally miss, but I don't think there's any good reason to abstract those if it would be awkward (and I bet it would be).
On Thu, May 8, 2008 at 5:56 AM, Roan Kattouw roan.kattouw@home.nl wrote:
Unfortunately, this isn't about INNER JOINs (which I'm smart enough to rewrite), but about LEFT JOINs, which are pretty much impossible to construct cleanly right now. I would very much welcome support for LEFT JOINs in the Database class, so I can port it to ApiQueryBase as well.
Yes, sorry. The first example was in fact an INNER JOIN. The others were LEFT JOINs, which usually in core code seem to end up just using query(). But that's obviously not ideal for anyone. I think with the changes to tableName we can now use LEFT JOIN with select(), yes?
On Thu, May 8, 2008 at 6:11 AM, DanTMan dan_the_man@telus.net wrote:
It looks like Semantic MediaWiki is sending in an ugly "DISTINCT $pagetable.page_title as title, $pagetable.page_namespace as namespace, $pagetable.page_id as id", So now I need to tweak the test to check for DISTINCT and AS or else we'll have yet another bug report.
That's the field list, though, not the table list. Why would your change affect it? I'd imagine that that field list could become array( 'page_title', 'page_namespace', 'page_id' ), with $options []= 'DISTINCT', in any case (and surrounding code changed to not use the aliases). You could commit the change yourself, you know, SMW is in the repo.
Ok, then how about we draft up a short list of symbols to test, or not test against. Simplifying down to word breaks or such leaves with to much that could break. As for spaces inside of table names, I'm pretty sure anyone insane enough to use a space inside of a name is already going to be having issues with SQL. If you're using ` on ` as a table name, you're pretty sure things are going to break if you don't go and quote that yourself. Underscores or dashes are always preferred over spaces.
~Daniel Friesen(Dantman) of: -The Gaiapedia (http://gaia.wikia.com) -Wikia ACG on Wikia.com (http://wikia.com/wiki/Wikia_ACG) -and Wiki-Tools.com (http://wiki-tools.com)
Simetrical wrote:
On Thu, May 8, 2008 at 4:21 AM, DanTMan dan_the_man@telus.net wrote:
I did think of \b, however I'm wary of it. I tied the test to whitespace or starting/ending of string because of the remote case where someone has something like -on- as their table name. Because \b matches breaks there are a fair number of symbols which it also considers to be word breaks because they are not word characters. And it's possible for some of those to be valid inside of table names.
The table can also be named ` on `. Whereas on the other hand, INNER JOIN table ON(cond) is perfectly valid syntax which your test would get wrong.
But what I don't see a single instance of, is what MediaWiki has done a bit of. Converting the construction of SQL in strings, into a set of helper functions which aid in readability, portability, and cross-compliance of most sql calls. Which even have the amazing bonus of properly escaping things where they come in to eliminate most of the issues of accidentally forgetting to secure something which could be used for SQL injection. Honestly, would you rather call $db->select, or have to remember to mysql_real_escape_string every bit of data manually? The former always escapes cleanly and avoids having you forget to do it.
Actually, I think the SQL queries themselves are most readable, by and large. Escaping can be done printf-style, with query( "SELECT * FROM table WHERE name=%s", $var ), the way Python and C (for instance) do it. (Of course, I imagine that might be fairly tricky with some complicated dynamically-adjusted queries.) The abstraction is maybe the most valuable part, so that things like FORCE INDEX can be included for MySQL and easily removed by other engines -- without having to parse a string.
We may also want to add some helpers for a few things people might want to do in conditions. Basically IN, AND, OR, >, <, not, and a few other things I've found myself needing to construct a mini condition string rather than using the nice assignment that escapes everything.
You can do IN and AND:
$conds = array( 'id' => array( 1, 2, 3 ), 'foreignkey' => 17 )
gives "id IN (1,2,3) AND foreignkey = 17" or equivalent. <, >, OR, and perhaps NOT I occasionally miss, but I don't think there's any good reason to abstract those if it would be awkward (and I bet it would be).
On Thu, May 8, 2008 at 5:56 AM, Roan Kattouw roan.kattouw@home.nl wrote:
Unfortunately, this isn't about INNER JOINs (which I'm smart enough to rewrite), but about LEFT JOINs, which are pretty much impossible to construct cleanly right now. I would very much welcome support for LEFT JOINs in the Database class, so I can port it to ApiQueryBase as well.
Yes, sorry. The first example was in fact an INNER JOIN. The others were LEFT JOINs, which usually in core code seem to end up just using query(). But that's obviously not ideal for anyone. I think with the changes to tableName we can now use LEFT JOIN with select(), yes?
On Thu, May 8, 2008 at 6:11 AM, DanTMan dan_the_man@telus.net wrote:
It looks like Semantic MediaWiki is sending in an ugly "DISTINCT $pagetable.page_title as title, $pagetable.page_namespace as namespace, $pagetable.page_id as id", So now I need to tweak the test to check for DISTINCT and AS or else we'll have yet another bug report.
That's the field list, though, not the table list. Why would your change affect it? I'd imagine that that field list could become array( 'page_title', 'page_namespace', 'page_id' ), with $options []= 'DISTINCT', in any case (and surrounding code changed to not use the aliases). You could commit the change yourself, you know, SMW is in the repo.
Hmm... it is inside of the fields, not the table. Well, I don't know what was wrong, but I do know that when I applied that change SMW suddenly started working for me, so it can't hurt.
What I find horribly disturbing is this:
|$res = $db->query( "SELECT title FROM $tablename", 'SMW::getQueryResult:DEBUG');
|
That's a blatant select statement which should be sent through $db->select.
But I'll look into improvements into Database methods sometime later. I'd honestly like to abstract queries away from raw SQL into using methods. We're doing PHP, not SQL, let the SQL portion of the code handle SQL while we write in PHP. Doing it right could also potentially allow the complex things like SMW, API, and DPL which build complex queries to abstract those as well with the helper functions, or heck... We could come up with some sort of actual Query builder class.
Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
On Thu, May 8, 2008 at 10:28 AM, DanTMan dan_the_man@telus.net wrote:
Ok, then how about we draft up a short list of symbols to test, or not test against. Simplifying down to word breaks or such leaves with to much that could break.
Such as? Nothing adhering to anything resembling normal table-naming practices will be affected.
As for spaces inside of table names, I'm pretty sure anyone insane enough to use a space inside of a name is already going to be having issues with SQL. If you're using ` on ` as a table name, you're pretty sure things are going to break if you don't go and quote that yourself. Underscores or dashes are always preferred over spaces.
It's just as valid as `-on-`, and probably just as likely. For that matter, `on` is a valid table name, which your regex would fail on too (as would mine).
Of course, you could just be more complete, and exclude only a case like
/\s*$tablenameregex\s+((LEFT|INNER|OUTER|CROSS|LEFT\s+OUTER|RIGHT|RIGHT\s+OUTER|FULL\s+OUTER)\s+)?JOIN\s+$tablenameregex\s*/
where $tablenameregex is something like
([a-z0-9_]*|`[^`]*`|"[^`]"|[[^]]])
In other words, you could parse the SQL. Just make sure you do it correctly. :)
What I find horribly disturbing is this:
|$res = $db->query( "SELECT title FROM $tablename", 'SMW::getQueryResult:DEBUG');
|
That's a blatant select statement which should be sent through $db->select.
Yes, there are a fair few of those in core too.
But I'll look into improvements into Database methods sometime later. I'd honestly like to abstract queries away from raw SQL into using methods. We're doing PHP, not SQL, let the SQL portion of the code handle SQL while we write in PHP. Doing it right could also potentially allow the complex things like SMW, API, and DPL which build complex queries to abstract those as well with the helper functions, or heck... We could come up with some sort of actual Query builder class.
Personally I would caution against getting too abstract. The current database abstraction layer uses familiar PHP constructions like arrays in a very natural way, on the whole. Adding tons of methods like join() is not necessarily a great way to go.
On Thu, May 8, 2008 at 10:28 AM, DanTMan dan_the_man@telus.net wrote:
Ok, then how about we draft up a short list of symbols to test, or not test against. Simplifying down to word breaks or such leaves with to much that could break.
Such as? Nothing adhering to anything resembling normal table-naming practices will be affected.
As for spaces inside of table names, I'm pretty sure anyone insane enough to use a space inside of a name is already going to be having issues with SQL. If you're using ` on ` as a table name, you're pretty sure things are going to break if you don't go and quote that yourself. Underscores or dashes are always preferred over spaces.
It's just as valid as `-on-`, and probably just as likely. For that matter, `on` is a valid table name, which your regex would fail on too (as would mine).
Of course, you could just be more complete, and exclude only a case like
/\s*$tablenameregex\s+((LEFT|INNER|OUTER|CROSS|LEFT\s+OUTER|RIGHT|RIGHT\s+OUTER|FULL\s+OUTER)\s+)?JOIN\s+$tablenameregex\s*/
where $tablenameregex is something like
([a-z0-9_]*|`[^`]*`|"[^`]"|[[^]]])
In other words, you could parse the SQL. Just make sure you do it correctly. :)
What I find horribly disturbing is this:
|$res = $db->query( "SELECT title FROM $tablename", 'SMW::getQueryResult:DEBUG');
|
That's a blatant select statement which should be sent through $db->select.
Yes, there are a fair few of those in core too.
But I'll look into improvements into Database methods sometime later. I'd honestly like to abstract queries away from raw SQL into using methods. We're doing PHP, not SQL, let the SQL portion of the code handle SQL while we write in PHP. Doing it right could also potentially allow the complex things like SMW, API, and DPL which build complex queries to abstract those as well with the helper functions, or heck... We could come up with some sort of actual Query builder class.
Personally I would caution against getting too abstract. The current database abstraction layer uses familiar PHP constructions like arrays in a very natural way, on the whole. Adding tons of methods like join() is not necessarily a great way to go.
Simetrical wrote:
On Thu, May 8, 2008 at 10:28 AM, DanTMan dan_the_man@telus.net wrote:
Ok, then how about we draft up a short list of symbols to test, or not test against. Simplifying down to word breaks or such leaves with to much that could break.
Such as? Nothing adhering to anything resembling normal table-naming practices will be affected.
As for spaces inside of table names, I'm pretty sure anyone insane enough to use a space inside of a name is already going to be having issues with SQL. If you're using ` on ` as a table name, you're pretty sure things are going to break if you don't go and quote that yourself. Underscores or dashes are always preferred over spaces.
It's just as valid as `-on-`, and probably just as likely. For that matter, `on` is a valid table name, which your regex would fail on too (as would mine).
Of course, you could just be more complete, and exclude only a case like
/\s*$tablenameregex\s+((LEFT|INNER|OUTER|CROSS|LEFT\s+OUTER|RIGHT|RIGHT\s+OUTER|FULL\s+OUTER)\s+)?JOIN\s+$tablenameregex\s*/
where $tablenameregex is something like
([a-z0-9_]*|`[^`]*`|"[^`]"|[[^]]])
In other words, you could parse the SQL. Just make sure you do it correctly. :)
`table` is already handled before the test for tablename stuff. We're just trying to see if anything that is part of a SQL query shows up inside. So completeness isn't what's being tested. You noted the use of () next to ON, rather than just whitespace. Any similar types of characters? If not all that would be needed is to add in () and it should be fine: "/(^|\s|))(JOIN|ON|AS)((|\s|$)/i" Technically... = should never really show up outside either... Or maybe we're overcomplicating this...
/((\S{,64}|`[^`]{,64}`).)?(\S{,64}|`[^`]{,64}`)/iS All we really want is to know if the input is valid. And all we're trying to catch is: - table - database.table And the variants of them with some amount of backticks. Yes " is usable if ANSI_QUOTES is enabled, but that is in no way widely compatible with MySQL itself. So none of the code should ever output that. Additionally tables and databases are not allowed any more than 64 characters, not to mention we shouldn't be using ones that long.
That regex should match any valid input, and exclude any invalid input which would simply be outputted.
What I find horribly disturbing is this:
|$res = $db->query( "SELECT title FROM $tablename", 'SMW::getQueryResult:DEBUG');
|
That's a blatant select statement which should be sent through $db->select.
Yes, there are a fair few of those in core too.
But I'll look into improvements into Database methods sometime later. I'd honestly like to abstract queries away from raw SQL into using methods. We're doing PHP, not SQL, let the SQL portion of the code handle SQL while we write in PHP. Doing it right could also potentially allow the complex things like SMW, API, and DPL which build complex queries to abstract those as well with the helper functions, or heck... We could come up with some sort of actual Query builder class.
Personally I would caution against getting too abstract. The current database abstraction layer uses familiar PHP constructions like arrays in a very natural way, on the whole. Adding tons of methods like join() is not necessarily a great way to go.
Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
~Daniel Friesen(Dantman) of: -The Gaiapedia (http://gaia.wikia.com) -Wikia ACG on Wikia.com (http://wikia.com/wiki/Wikia_ACG) -and Wiki-Tools.com (http://wiki-tools.com)
DanTMan schreef:
`table` is already handled before the test for tablename stuff. We're just trying to see if anything that is part of a SQL query shows up inside. So completeness isn't what's being tested. You noted the use of () next to ON, rather than just whitespace. Any similar types of characters? If not all that would be needed is to add in () and it should be fine: "/(^|\s|))(JOIN|ON|AS)((|\s|$)/i" Technically... = should never really show up outside either... Or maybe we're overcomplicating this...
Yeah, maybe we are. In an ideal world, we'd have wrapper functions for JOINs and other stuff too, so we don't *need* to pass SQL to Database::select().
BTW, Simetrical, I just documented ApiQueryBase, so if you still wanna know how the API builds its queries, look at [1] tomorrow.
Roan Kattouw (Catrope)
On Thu, May 8, 2008 at 11:34 AM, DanTMan dan_the_man@telus.net wrote:
`table` is already handled before the test for tablename stuff. We're just trying to see if anything that is part of a SQL query shows up inside. So completeness isn't what's being tested.
Doing it properly and parsing it is the only thing that will give correct results in pathological cases.
You noted the use of () next to ON, rather than just whitespace. Any similar types of characters?
Sure. You could have comments, you could probably have all sorts of weird stuff. Granted that comments would be pathological, but no more than a table named "-on-". There may be additional non-pathological cases.
Or maybe we're overcomplicating this...
/((\S{,64}|`[^`]{,64}`).)?(\S{,64}|`[^`]{,64}`)/iS All we really want is to know if the input is valid. And all we're trying to catch is:
- table
- database.table
And the variants of them with some amount of backticks.
Better version:
/^\s*([a-z0-9_]+.|`[^`]+`.)?([a-z0-9_]+|`[^`]+`)\s*$/i
There's no reason to complicate the regex with explicit maximum lengths. If additional characters are allowed in unquoted table names, add those to the [a-z0-9_] class. This should avoid all valid joins and similar. (It still fails on comments, though! :P )
Yes " is usable if ANSI_QUOTES is enabled, but that is in no way widely compatible with MySQL itself. So none of the code should ever output that.
Well, technically the code isn't outputting it, it's just passing it to the abstraction layer. I figured we should be generous. Notice that I also put in the MSSQLism [table name]. Not strictly any point, you're right, but it gives me a warm and fuzzy feeling to permit standards compliance in what we pass to abstraction layers, at least, even if MySQL sucks too much to allow it to actually be run. :)
Simetrical wrote:
On Thu, May 8, 2008 at 11:34 AM, DanTMan dan_the_man@telus.net wrote:
`table` is already handled before the test for tablename stuff. We're just trying to see if anything that is part of a SQL query shows up inside. So completeness isn't what's being tested.
Doing it properly and parsing it is the only thing that will give correct results in pathological cases.
You noted the use of () next to ON, rather than just whitespace. Any similar types of characters?
Sure. You could have comments, you could probably have all sorts of weird stuff. Granted that comments would be pathological, but no more than a table named "-on-". There may be additional non-pathological cases.
Or maybe we're overcomplicating this...
/((\S{,64}|`[^`]{,64}`).)?(\S{,64}|`[^`]{,64}`)/iS All we really want is to know if the input is valid. And all we're trying to catch is:
- table
- database.table
And the variants of them with some amount of backticks.
Better version:
/^\s*([a-z0-9_]+.|`[^`]+`.)?([a-z0-9_]+|`[^`]+`)\s*$/i
There's no reason to complicate the regex with explicit maximum lengths. If additional characters are allowed in unquoted table names, add those to the [a-z0-9_] class. This should avoid all valid joins and similar. (It still fails on comments, though! :P )
"/^\s*(\S+|`[^`]+`)(.\S+|.`[^`]+`)?\s*$/i" All we need to test for is non-whitespace. Anything which is completely unbroken by whitespace can't be a query at all. "SELECT*FROMfoo" ;) And the backquote testing makes sure that `...` still passes and you can still pass ` foo ` in and have it work. The second group is there so that `foo`.bar, and foo.`bar` don't break. And I handily moved the condition to the second group to optimize the regex. The \S+ alone will completely match 'foo.bar' without any backtracking or anything from the processor. Though, considering your \s*'s I'll probably have to run the table and database through trim.
Yes " is usable if ANSI_QUOTES is enabled, but that is in no way widely compatible with MySQL itself. So none of the code should ever output that.
Well, technically the code isn't outputting it, it's just passing it to the abstraction layer. I figured we should be generous. Notice that I also put in the MSSQLism [table name]. Not strictly any point, you're right, but it gives me a warm and fuzzy feeling to permit standards compliance in what we pass to abstraction layers, at least, even if MySQL sucks too much to allow it to actually be run. :)
Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
~Daniel Friesen(Dantman) of: -The Gaiapedia (http://gaia.wikia.com) -Wikia ACG on Wikia.com (http://wikia.com/wiki/Wikia_ACG) -and Wiki-Tools.com (http://wiki-tools.com)
On Thu, May 8, 2008 at 12:12 PM, DanTMan dan_the_man@telus.net wrote:
"/^\s*(\S+|`[^`]+`)(.\S+|.`[^`]+`)?\s*$/i" All we need to test for is non-whitespace. Anything which is completely unbroken by whitespace can't be a query at all. "SELECT*FROMfoo" ;)
But "database . table" is valid, since you make the point about whitespace. So
/^\s*(\S+|`[^`]+`)(\s*.\s*(\S+|`[^`]+`))?\s*$/
should work.
And I handily moved the condition to the second group to optimize the regex. The \S+ alone will completely match 'foo.bar' without any backtracking or anything from the processor.
I don't think we need to worry much about optimizing this regex. :)
Simetrical wrote:
On Thu, May 8, 2008 at 12:12 PM, DanTMan dan_the_man@telus.net wrote:
"/^\s*(\S+|`[^`]+`)(.\S+|.`[^`]+`)?\s*$/i" All we need to test for is non-whitespace. Anything which is completely unbroken by whitespace can't be a query at all. "SELECT*FROMfoo" ;)
But "database . table" is valid, since you make the point about whitespace. So
/^\s*(\S+|`[^`]+`)(\s*.\s*(\S+|`[^`]+`))?\s*$/
should work.
Heh... ok, then I'll trim it. Won't break anything because if you pass in `'s they'll armor against the stripping.
And I handily moved the condition to the second group to optimize the regex. The \S+ alone will completely match 'foo.bar' without any backtracking or anything from the processor.
I don't think we need to worry much about optimizing this regex. :)
^_^ Just don't want the regex engine to pointlessly backtrack over something so simple. Soo...
/^\s*(\S+|`[^`]+`)(\s*.\s*(\S+|`[^`]+`))?\s*$/iS
I do want to use an S flag since it's going to be used very often...
^_^ I'll commit this to trunk and port it into a 1.12 patch I'm hosting yet again... Well, once I'm not dazed enough by sleep I only have a partial understanding about how my own regex works, heh... http://wiki-tools.com/wiki/Shared_Tables_in_1.12 ^_^ How else do you think I'm using shared interwiki when I only have 2 wiki on 1.13... Ok, actually... I only have 1 wiki on 1.13... ^_^ The other one is on the titlerewrite branch... Which as you can see, is horribly broken... ie: Needs work, heh... http://titlerewrite.dev.wiki-tools.com/wiki/Main_Page Plus... I need to setup cron to bring over changes from the branch.
Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
~Daniel Friesen(Dantman) of: -The Gaiapedia (http://gaia.wikia.com) -Wikia ACG on Wikia.com (http://wikia.com/wiki/Wikia_ACG) -and Wiki-Tools.com (http://wiki-tools.com)
-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1
DanTMan wrote:
Simetrical wrote:
There's no reason to complicate the regex with explicit maximum lengths. If additional characters are allowed in unquoted table names, add those to the [a-z0-9_] class. This should avoid all valid joins and similar. (It still fails on comments, though! :P )
"/^\s*(\S+|`[^`]+`)(.\S+|.`[^`]+`)?\s*$/i"
Oh god, make the madness stop! :)
Parsing here sucks for several reasons:
* actual syntax differs between different DB backends * big ugly regexes are hard to read * it feels like "magic" trying to treat different strings differently based on content, which is always icky
Generally we try to make natural use of different *data types* for different kinds of input here.
Since we're talking about a case where we want to make an *exception* from the standard behavior -- string table names being for internal processing, leading to prefixing and quoting -- we should explicitly mark it as such.
Long ago, I tossed around the idea of using a 'RawSql' or similar data type to tell the query-building functions that yes, we were sure, we really want to pass some raw SQL here -- we know what we're doing, so please don't escape it for us.
This might look like:
$db->select( 'page', array( 'page_namespace', 'page_title' ), array( 'page_id' => new RawSql('RAND()*1000' ) );
or whatever.
For the case of wild & crazy custom joins, it might be:
$db->select( new RawSql("$page LEFT JOIN $barfo ON page_id=barfo_page"), array( 'barfo_key', 'page_namespace', 'page_title' ),
or whatever.
Now, I don't know if this is the best system ever, but I like that it's explicit about the use of unprocessed (and thus potentially unsafe) data, which'll make it easier to spot potential trouble spots when maintaining the code later.
- -- brion vibber (brion @ wikimedia.org)
^_^ That's basically the same idea as the \x07 suggestion of basically some sort of character to signal "I handled this... Don't touch it, or else..." cept an object would handle it nicer... I like it...
Actually, it would be nice to subclass the RawSQL to create abstraction for the few other <, >, not that we can't otherwise do. To still quote things nicely.
And then we can politely say 'tableName' ONLY handles table names, nothing else... It's not supposed to, and it wont... And when someone says "tableName broke my SQL"... You're not supposed to do that... Fix your code... SQL Error == Bad code... Need fixing...
~Daniel Friesen(Dantman) of: -The Gaiapedia (http://gaia.wikia.com) -Wikia ACG on Wikia.com (http://wikia.com/wiki/Wikia_ACG) -and Wiki-Tools.com (http://wiki-tools.com)
Brion Vibber wrote:
-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1
DanTMan wrote:
Simetrical wrote:
There's no reason to complicate the regex with explicit maximum lengths. If additional characters are allowed in unquoted table names, add those to the [a-z0-9_] class. This should avoid all valid joins and similar. (It still fails on comments, though! :P )
"/^\s*(\S+|`[^`]+`)(.\S+|.`[^`]+`)?\s*$/i"
Oh god, make the madness stop! :)
Parsing here sucks for several reasons:
- actual syntax differs between different DB backends
- big ugly regexes are hard to read
- it feels like "magic" trying to treat different strings differently
based on content, which is always icky
Generally we try to make natural use of different *data types* for different kinds of input here.
Since we're talking about a case where we want to make an *exception* from the standard behavior -- string table names being for internal processing, leading to prefixing and quoting -- we should explicitly mark it as such.
Long ago, I tossed around the idea of using a 'RawSql' or similar data type to tell the query-building functions that yes, we were sure, we really want to pass some raw SQL here -- we know what we're doing, so please don't escape it for us.
This might look like:
$db->select( 'page', array( 'page_namespace', 'page_title' ), array( 'page_id' => new RawSql('RAND()*1000' ) );
or whatever.
For the case of wild & crazy custom joins, it might be:
$db->select( new RawSql("$page LEFT JOIN $barfo ON page_id=barfo_page"), array( 'barfo_key', 'page_namespace', 'page_title' ),
or whatever.
Now, I don't know if this is the best system ever, but I like that it's explicit about the use of unprocessed (and thus potentially unsafe) data, which'll make it easier to spot potential trouble spots when maintaining the code later.
- -- brion vibber (brion @ wikimedia.org)
-----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.8 (Darwin) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org
iEYEARECAAYFAkgjMD4ACgkQwRnhpk1wk47TNACfQY+MG899wp4CgFHy20q3FM97 ZicAoIa1P2gkJtzl844MiWU+my7y/VW/ =5Fj+ -----END PGP SIGNATURE-----
Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1
DanTMan wrote:
^_^ That's basically the same idea as the \x07 suggestion of basically some sort of character to signal "I handled this... Don't touch it, or else..." cept an object would handle it nicer... I like it...
Yay for type safety! :)
- -- brion
Heh... ya... And just for a little more compatibility, we can even override the toString method and have it output the string version of the query.
What do you thing about a SqlBuilder class? Basically when doing a real complex SQL query, instead of writing everything by hand, you would instance a new SqlBuilder passing a database object to it. You build up queries by first selecting what type it's based on (SELECT, DELETE, INSERT, etc...). Then you use a variety of add functions for adding a variety of conditions, values, joins, and options. Then when you are finished you just run the query or doQuery, depending on what we decide to name it. And it gives back the result.
~Daniel Friesen(Dantman) of: -The Gaiapedia (http://gaia.wikia.com) -Wikia ACG on Wikia.com (http://wikia.com/wiki/Wikia_ACG) -and Wiki-Tools.com (http://wiki-tools.com)
Brion Vibber wrote:
-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1
DanTMan wrote:
^_^ That's basically the same idea as the \x07 suggestion of basically some sort of character to signal "I handled this... Don't touch it, or else..." cept an object would handle it nicer... I like it...
Yay for type safety! :)
- -- brion
-----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.8 (Darwin) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org
iEYEARECAAYFAkgjOuYACgkQwRnhpk1wk47QSQCgsoLij8hKsVoI1D7o4qgLzP9z q50AoKOIABSMURTNgaiOd7YoSwSxRiuJ =z0eD -----END PGP SIGNATURE-----
Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
On Thu, May 8, 2008 at 12:54 PM, Brion Vibber brion@wikimedia.org wrote:
Long ago, I tossed around the idea of using a 'RawSql' or similar data type to tell the query-building functions that yes, we were sure, we really want to pass some raw SQL here -- we know what we're doing, so please don't escape it for us.
This might look like:
$db->select( 'page', array( 'page_namespace', 'page_title' ), array( 'page_id' => new RawSql('RAND()*1000' ) );
or whatever.
Hmm, yes. I wish there were a slightly nicer way to write it. Maybe
array( 'page_id' => sql('RAND()*1000') )
or
array( 'page_id' => raw('RAND()*1000') ).
raw() could be used generically to indicate that the input should not be escaped in any fashion. Of course, it would take work for it to actually be used by all our magical escaping stuff.
What I really wish is that we could do this kind of type marking transparently, with different types for clean HTML, clean SQL, wikitext, etc., plus a type for unsafe input, and then appropriate concatenation methods that escape everything as necessary. That would be possible in a language supporting operator overloading, but not PHP, where there's no effective way to extend strings. (And anyway everything is a function, not a method, so stuff like str_replace() would just break hopelessly.)
Alternatively, we could always say that an array input means implode the array, whereas string input is always literal. That would work. But it would be somewhat confusing.
On Thu, May 8, 2008 at 12:54 PM, DanTMan dan_the_man@telus.net wrote:
Soo...
/^\s*(\S+|`[^`]+`)(\s*.\s*(\S+|`[^`]+`))?\s*$/iS
I do want to use an S flag since it's going to be used very often...
This is moot, but:
1) /i is unnecessary since you have no letters in the pattern.
2) Whether /S is useful is a matter for benchmarking, and the distinction should be completely trivial when the string you're analyzing is like 20 characters long.
3) I realized that you're wrong, you can't just rely on \S. Input "table1,table2" should not be backtick-quoted, it should be left alone. You need to use something like [a-z0-9_], as I said (maybe some more characters than that are valid).
But who cares. :)
Simetrical schreef:
On Wed, May 7, 2008 at 3:23 PM, Roan Kattouw roan.kattouw@home.nl wrote:
They're far from unneeded. Database::select() is broken in that it adds the backticks wrongly, generating invalid SQL [1]. I'll poke dantman to fix this, but in the meantime we should have a working API.
You know, every other place in the entire code base where we have inner joins, we just use implicit joins with WHERE. So in other words, don't use ON at all. Instead of SELECT * FROM page INNER JOIN revision ON page_id=rev_page WHERE page_id=731, use SELECT * FROM page, revision WHERE page_id=731 AND page_id=rev_page (don't ever try either for arbitrary pages, BTW :)). The queries are totally equivalent. There should be no reason to use any other syntax for inner joins, although stylistically I do prefer ON clauses.
Unfortunately, this isn't about INNER JOINs (which I'm smart enough to rewrite), but about LEFT JOINs, which are pretty much impossible to construct cleanly right now. I would very much welcome support for LEFT JOINs in the Database class, so I can port it to ApiQueryBase as well.
I can't patch your code, however, since as usual, the API uses its own syntax for database queries and non-API developers like me have no idea how it works.
If you're interested, take a look at the ApiQueryBase class. The basic idea is that all ApiQuery* classes inherit helper functions from ApiQueryBase which build the arrays Database::select() takes one element at a time. This is done because many WHERE clauses and selected fields have to be included or omitted dynamically (i.e. depending on the query parameters).
Roan Kattouw (Catrope)
DanTMan schreef:
We've got multiple options on how to fix this:
- Find a better way to make complex queries using the Database class
that doesn't involve building complex queries, and then trying to shove them into the simple handlers. This would greatly improve cross-db compatibility if done.
That should definitely happen. More specifically, we should have support for LEFT/RIGHT JOINs in the Database class. I think this kind of obsoletes 2, 3 and 4, because the only occasion on which the API sends full-blown SQL to Database::select() is to do LEFT JOINs.
Roan Kattouw (Catrope)
:/ Actually, the API people aren't the only ones doing ugly things like that. It looks like Semantic MediaWiki is sending in an ugly "DISTINCT $pagetable.page_title as title, $pagetable.page_namespace as namespace, $pagetable.page_id as id", So now I need to tweak the test to check for DISTINCT and AS or else we'll have yet another bug report.
~Daniel Friesen(Dantman) of: -The Gaiapedia (http://gaia.wikia.com) -Wikia ACG on Wikia.com (http://wikia.com/wiki/Wikia_ACG) -and Wiki-Tools.com (http://wiki-tools.com)
Roan Kattouw wrote:
DanTMan schreef:
We've got multiple options on how to fix this:
- Find a better way to make complex queries using the Database class
that doesn't involve building complex queries, and then trying to shove them into the simple handlers. This would greatly improve cross-db compatibility if done.
That should definitely happen. More specifically, we should have support for LEFT/RIGHT JOINs in the Database class. I think this kind of obsoletes 2, 3 and 4, because the only occasion on which the API sends full-blown SQL to Database::select() is to do LEFT JOINs.
Roan Kattouw (Catrope)
Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
wikitech-l@lists.wikimedia.org