So we've finally god the database schema updates in and have tried doing the MediaWiki software update.
Unfortunately, the updates of image table rows have brought the DB server running Commons to its knees. Tim's working on redoing the row upgrade system to be less burdensome...
-- brion
On 24/08/07, Brion Vibber brion@wikimedia.org wrote:
So we've finally god the database schema updates in and have tried doing the MediaWiki software update.
Freudian slip?
Unfortunately, the updates of image table rows have brought the DB server running Commons to its knees. Tim's working on redoing the row upgrade system to be less burdensome...
Guns don't kill software developers, irate DBAs do.
Rob Church
Rob Church wrote:
On 24/08/07, Brion Vibber brion@wikimedia.org wrote:
So we've finally god the database schema updates in and have tried doing the MediaWiki software update.
Freudian slip?
BEHOLD, I BRING YOU THE TEN DATABASE UPDATES! OBEY THEM OR YE SHALL BE STRICKEN DOWN WITH A PLAGUE OF SERVER OVERLOAD!
Unfortunately, the updates of image table rows have brought the DB server running Commons to its knees. Tim's working on redoing the row upgrade system to be less burdensome...
Guns don't kill software developers, irate DBAs do.
*watches over his shoulder for Domas*
-- brion
Brion Vibber wrote:
So we've finally god the database schema updates in and have tried doing the MediaWiki software update.
Unfortunately, the updates of image table rows have brought the DB server running Commons to its knees. Tim's working on redoing the row upgrade system to be less burdensome...
Ok, the image metadata for sha-1 hashes is now updated when actually needed for deletion (or in a batch process) instead of on metadata read, which was what was bogging down the system.
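A rough sketch of that lazy-plus-batch pattern, for the curious: compute the hash only when it is actually wanted, and let a background job backfill the rest a few rows at a time. Table and column names here are illustrative rather than the real schema, and a DB-API connection with MySQL-style %s placeholders is assumed.

    import hashlib

    def sha1_on_demand(conn, name, image_dir):
        # Compute and store the hash only when something (e.g. a deletion)
        # actually needs it, instead of on every metadata read.
        with open(image_dir + '/' + name, 'rb') as f:
            digest = hashlib.sha1(f.read()).hexdigest()
        cur = conn.cursor()
        cur.execute("UPDATE image SET img_sha1 = %s WHERE img_name = %s",
                    (digest, name))
        conn.commit()
        return digest

    def backfill_sha1(conn, image_dir, batch_size=100):
        # The batch process: fill in missing hashes a small batch at a time,
        # so the DB server never sees one enormous update.
        cur = conn.cursor()
        while True:
            cur.execute("SELECT img_name FROM image WHERE img_sha1 = '' LIMIT %s",
                        (batch_size,))
            rows = cur.fetchall()
            if not rows:
                break
            for (name,) in rows:
                sha1_on_demand(conn, name, image_dir)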
We've upgraded to current trunk, and things seem more or less ok. :)
Only regression I've come across so far was breakage of image height specification, which was a pretty easy fix. (r25136)
-- brion vibber (brion @ wikimedia.org)
On 8/25/07, Brion Vibber brion@wikimedia.org wrote:
Ok, the image metadata for sha-1 hashes is now updated when actually needed for deletion (or in a batch process) instead of on metadata read, which was what was bogging down the system.
Please set up a batch job to eventually populate the sha-1 metadata for non-deleted images. We'd like to use it for duplicate image detection.
We're already doing this against the deleted images... A bot downloads the image, computes that sha-1, checks the filearchive table based on the SHA1, and if there is a match it complains in IRC and the new image is tagged.
This would be easier to perform if we could skip the download/compute sha1 step.. and being able to test against non-deleted images would be handy too.
Of course, it doesn't detect anything that isn't bit-identical, but catching bit-identical duplicates is still useful.
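A minimal sketch of that bot loop. The hash-lookup URL is hypothetical (the real bot talks to its own HTTP service against filearchive), and the IRC reporting and tagging steps are omitted.

    import hashlib
    import urllib.request

    DELETED_HASH_LOOKUP = 'https://example.org/deleted-by-hash?sha1=%s'  # hypothetical

    def sha1_of_upload(url):
        # Download the newly uploaded file and hash it, as the bot does today.
        data = urllib.request.urlopen(url).read()
        return hashlib.sha1(data).hexdigest()

    def check_new_upload(upload_url):
        # Ask the lookup service whether any deleted file had the same hash;
        # if so, report it so a human can review and tag the new upload.
        digest = sha1_of_upload(upload_url)
        resp = urllib.request.urlopen(DELETED_HASH_LOOKUP % digest).read()
        matches = [m for m in resp.decode('utf-8').splitlines() if m]
        if matches:
            print('Possible re-upload of deleted file(s):', ', '.join(matches))
        return matches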
On 8/26/07, Gregory Maxwell gmaxwell@gmail.com wrote:
We're already doing this against the deleted images... A bot downloads the image, computes that sha-1, checks the filearchive table based on the SHA1, and if there is a match it complains in IRC and the new image is tagged.
What would be really useful is an sha archive of the internet ;P Imagine that we can find the source of an image by just looking it up in the archive.
I guess no such thing exists yet. But it may be worth setting up such a database, although it will become enormous and take ages to populate. But with a seti-at-home-like setup it might well be feasible. Ok just daydreaming :)
Bryan
On 8/26/07, Bryan Tong Minh bryan.tongminh@gmail.com wrote:
What would be really useful is an sha archive of the internet ;P Imagine that we can find the source of an image by just looking it up in the archive.
That actually should be doable for the major image search engines. I'll try to get the idea passed around a bit at least.
On 8/25/07, Erik Moeller erik@wikimedia.org wrote:
On 8/26/07, Bryan Tong Minh bryan.tongminh@gmail.com wrote:
What would be really useful is an sha archive of the internet ;P Imagine that we can find the source of an image by just looking it up in the archive.
That actually should be doable for the major image search engines. I'll try to get the idea passed around a bit at least.
If you do end up in one of these conversations: Also try to get a feel for how they'd feel about also generating a lookup key for fuzzy matching.
SHA-1 will allow us to catch bit identical duplicates, but it fails if someone resizes, crops, recompresses, or strips EXIF. Even if their change isn't visible. It would be a good first step but it is trivial to evade, even accidentally.
I've been working on writing software for doing fuzzy image matching. It has been a low priority project that I've worked on off-and-on for the last few months so it is slow in coming, but I will eventually produce something good or someone else will beat me to it.
It isn't something that we should allow to slow down the introduction of exact match searches, but it would be good to have the contacts ready when we can propose doing something more.
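This is not Gregory's software, just an illustration of the flavour of fuzzy matching being discussed: an 8x8 average hash. It survives recompression, mild resizing and EXIF stripping, though not cropping, and uses PIL/Pillow.

    from PIL import Image

    def average_hash(path, size=8):
        # Shrink to a size x size greyscale thumbnail, then emit one bit per
        # pixel: 1 if the pixel is brighter than the mean, else 0.
        img = Image.open(path).convert('L').resize((size, size))
        pixels = list(img.getdata())
        mean = sum(pixels) / len(pixels)
        bits = 0
        for p in pixels:
            bits = (bits << 1) | (1 if p > mean else 0)
        return bits

    def hamming(a, b):
        # Number of differing bits; a small distance suggests a near-duplicate.
        return bin(a ^ b).count('1')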
Also related to this subject is the request I sent to the board a while back on contacting copyright violation detection companies. I never heard any response:
---------- Forwarded message ----------
From: Gregory Maxwell gmaxwell@wikimedia.org
Date: Feb 28, 2007 7:34 PM
Subject: Contacting copyright violation detection companies.
To: board-l@lists.wikimedia.org
There are several commercial companies that exist to help copyright holders locate web sites which are infringing their copyrights.
The exact method of operation differs from company to company, but all appear to involve the company running a web spider that goes out and looks for possibly infringing content, and all that I've found sell this as a service to content holders.
For example, one company is Digimarc (http://www.digimarc.com/). With Digimarc's approach, content holders add invisible watermarks to their content which Digimarc web spiders detect. Digimarc also offers a no-cost software tool for Windows/Mac which decodes and displays any embedded watermarks.
What I'd like to do is contact one or more of these companies to explore opportunities for us to cooperate for our mutual benefit.
I see a number of benefits and a number of potential risks:
Benefits:
* Reduction in copyright violating content on our projects.
* Increased speed in detection of copyright violations.
* An independent indicator of the effectiveness of our communities' ability to detect copyright violations.
* An opportunity to make public statements about our efforts, differentiate ourselves from many other web 2.0 services, and highlight our higher goals.
* Increased evidence of due care on our part, which may be useful in future legal disputes.
* Improved efficiency - since some of these services would spider us anyway, cooperation may yield decreased bandwidth usage, and without our cooperation our method of notice will be DMCA takedown requests.
* Establishing a relationship before a possible change in legal climate switches these companies into a 'charge the service provider' business model.
Risks:
* Incorrect detection: some companies may falsely claim ownership of public domain content.
* The detection company may consider us a potential customer and nag us to purchase services.
* Loss of goodwill from interacting with companies whose purpose can be publicly unpopular - forcing the takedown of illegally copied videos on YouTube garners enough dislike, but many of these companies also play in the Digital restrictions management space.
It also may be possible that such companies might be interested in a live media feed, possibly a service we could sell them, or possibly income we forgo in the spirit of cooperation and mutual benefit. I suspect that we're an unattractive enough target and good enough at policing ourselves that no one would be interested in paying.
I believe that the first risk can be resolved by setting this up to provide input to the community rather than some sort of automatic upload restriction. The second point is harder to address.
What I'm looking for is permission to make contact and see what possibilities exist, I would then report back to the board with my findings.
I'm also interested in any guidance on what we are willing to do, which I could use in my initial discussions. For example, would we be willing to run a non-free company provided watermark detection tool to avoid having to send all our uploads off for checking?
I've been on the lookout for researchers interested in developing open source fuzzy image comparison tools for our own checking purposes (for example, to detect uploads of previously deleted content). I think that such tools will be important in the long term, but the proposed cooperation would not be mutually exclusive and would serve a different but related purpose.
(I'm not on the board list, so remember to copy me on replies)
Gregory Maxwell wrote:
On 8/25/07, Brion Vibber brion@wikimedia.org wrote:
Ok, the image metadata for sha-1 hashes is now updated when actually needed for deletion (or in a batch process) instead of on metadata read, which was what was bogging down the system.
Please set up a batch job to eventually populate the sha-1 metadata for non-deleted images. We'd like to use it for duplicate image detection.
It's been running since Friday.
We're already doing this against the deleted images... A bot downloads the image, computes that sha-1, checks the filearchive table based on the SHA1, and if there is a match it complains in IRC and the new image is tagged.
Why do that in a bot instead of the wiki?? :)
-- brion vibber (brion @ wikimedia.org)
On 26/08/07, Brion Vibber brion@wikimedia.org wrote:
Why do that in a bot instead of the wiki?? :)
Presumably because the batch process wasn't running when the bot was established, or because they weren't aware it was happening.
Rob Church
Rob Church wrote:
On 26/08/07, Brion Vibber brion@wikimedia.org wrote:
Why do that in a bot instead of the wiki?? :)
Presumably because the batch process wasn't running when the bot was established, or because they weren't aware it was happening.
Well, we actually have a blacklist for image uploads that works on sha-1 hash; it's just updated manually (and probably hasn't been updated in a couple years. ;)
-- brion vibber (brion @ wikimedia.org)
On 8/27/07, Brion Vibber brion@wikimedia.org wrote:
We're already doing this against the deleted images... A bot downloads the image, computes that sha-1, checks the filearchive table based on the SHA1, and if there is a match it complains in IRC and the new image is tagged.
Why do that in a bot instead of the wiki?? :)
Because bots are an immediate solution that people can write within their comfort-zone, whereas learning the MediaWiki codebase would be much more effort; as well as thinking that you need to either grovel, or be in an obscure clique of developers to have the patch applied. (They are wrong on both points).
On 8/26/07, Andrew Garrett andrew@epstone.net wrote:
Because bots are an immediate solution that people can write within their comfort-zone, whereas learning the MediaWiki codebase would be much more effort; as well as thinking that you need to either grovel, or be in an obscure clique of developers to have the patch applied. (They are wrong on both points).
Well, they're close enough to right about the second point. How many "patch, need-review" on Bugzilla? 149, over 50% more than the number of shell bugs. And yet we have a couple of times as many developers as sysadmins, at least.
On 8/27/07, Simetrical Simetrical+wikilist@gmail.com wrote:
Well, they're close enough to right about the second point. How many "patch, need-review" on Bugzilla? 149, over 50% more than the number of shell bugs. And yet we have a couple of times as many developers as sysadmins, at least.
Well, from the point of view of someone with a couple of patches awaiting review, it doesn't necessarily seem too hard for people to get involved to do a medium to large amount of work on MediaWiki, but people making very few or small contributions (like me :) ) may not feel like they can contribute very easily.
On 27/08/07, Simetrical Simetrical+wikilist@gmail.com wrote:
Well, they're close enough to right about the second point. How many "patch, need-review" on Bugzilla? 149, over 50% more than the number of shell bugs. And yet we have a couple of times as many developers as sysadmins, at least.
There are various reasons for this, just as there are (in addition to lack of shell users' time) reasons for the shell bugs being unfulfilled, but I think it's important to note that "need-review" means, "this bug has a patch which has not been reviewed by anyone other than the patch author" - in other words, *anyone* with a reasonable grasp of MediaWiki is free to leave comments on a possible patch.
A patch may be for an area of the code base which nobody available to review the patch is comfortable with changing, or able to test - patches against texvc often lurk for quite some time, mainly because if you don't have the environment set up, then compiling texvc is a pain in the backside. In general, I don't like applying l10n patches for right-to-left or Asian languages because I can pretty much guarantee my text editor will cause it to go wrong, for example.
A patch may offer to solve a problem that doesn't need solving, or that shouldn't be solved in the mainline code, or it may offer a solution that hasn't really been thoroughly discussed. A patch may be unacceptable in quality (bad code practice, bad documentation, performance nightmare) or incomplete (schema change with no associated updater provided). Some of the "patches" marked on BugZilla are not, in fact, patches - comments with, "you need to change line X to line Y" aren't actually very easy to review at all, for instance.
There are, no doubt, a multitude of counter-arguments to these points, many of which are fair. I think the main reminder here is that if you want your patch reviewed, *** keep pestering somebody *** - it may be that your patch was noticed, but forgotten about.
Rob Church
On 8/27/07, Rob Church robchur@gmail.com wrote:
There are, no doubt, a multitude of counter-arguments to these points, many of which are fair. I think the main reminder here is that if you want your patch reviewed, *** keep pestering somebody *** - it may be that your patch was noticed, but forgotten about.
Well in that spirit, could someone take a look at:
http://bugzilla.wikimedia.org/show_bug.cgi?id=8753
...which would solve the whole nofollow debate that came up several months ago by allowing nofollow to be applied to nonlocal interwiki links (in our case, non-Wikimedia sites), and:
http://bugzilla.wikimedia.org/show_bug.cgi?id=10890
...an API improvement, which I imagine that anyone could test though there are only a few people who do lots of work on the API.
On 27/08/07, Stephen Bain stephen.bain@gmail.com wrote:
Well in that spirit, could someone take a look at:
Note the comments on the bug; in particular, http://bugzilla.wikimedia.org/show_bug.cgi?id=8753#c2 - this is an example of a bug where it's not clear whether there is a problem to solve.
http://bugzilla.wikimedia.org/show_bug.cgi?id=10890
...an API improvement, which I imagine that anyone could test though there are only a few people who do lots of work on the API.
Have you run an EXPLAIN SELECT and checked the performance of the modified queries?
Rob Church
On 8/27/07, Rob Church robchur@gmail.com wrote:
On 27/08/07, Stephen Bain stephen.bain@gmail.com wrote:
Well in that spirit, could someone take a look at:
Note the comments on the bug; in particular, http://bugzilla.wikimedia.org/show_bug.cgi?id=8753#c2 - this is an example of a bug where it's not clear whether there is a problem to solve.
Hmm, it seems clear to me that some people want to put nofollow on external interwiki links but not on internal interwiki links.
At Wikimania I spoke with Domas regarding query perf on real wiki data. Doing EXPLAIN SELECT on a local db is inaccurate for two reasons:
* devs tend to run the latest MySQL on their machines, not the 4.xxx that the wiki is using (possibly with additional settings that might affect performance)
* devs tend to have a much smaller sample wiki (I use simple.wiki) to test, and the mysql optimizer might choose a different optimization strategy.
What we need is a simple web-based tool that allows anyone to type in a sql query, and get the result of "EXPLAIN" statement executed on the real enwiki. Obviously we must be careful not to introduce a SQL injection vulnerability.
--Yurik
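A sketch of the gatekeeping such a tool would need before handing anything to MySQL: accept a single plain SELECT and run it under EXPLAIN, so nothing is actually executed. As the rest of this thread shows, that alone is not enough - the EXPLAIN output itself can leak private data - and the MySQLdb connection and web plumbing are assumed.

    import re
    import MySQLdb  # assumes a read-only account against a test slave

    STACKING = re.compile(r';|#|--|/\*')  # crude: no second statement, no comments

    def run_explain(conn, query):
        q = query.strip()
        if not q.lower().startswith('select') or STACKING.search(q):
            raise ValueError('only a single plain SELECT is accepted')
        cur = conn.cursor()
        # The query is only planned, never executed, because of the EXPLAIN prefix.
        cur.execute('EXPLAIN ' + q)
        return cur.fetchall()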
On 8/27/07, Rob Church robchur@gmail.com wrote:
On 27/08/07, Stephen Bain stephen.bain@gmail.com wrote:
Well in that spirit, could someone take a look at:
Note the comments on the bug; in particular, http://bugzilla.wikimedia.org/show_bug.cgi?id=8753#c2 - this is an example of a bug where it's not clear whether there is a problem to solve.
http://bugzilla.wikimedia.org/show_bug.cgi?id=10890
...an API improvement, which I imagine that anyone could test though there are only a few people who do lots of work on the API.
Have you run an EXPLAIN SELECT and checked the performance of the modified queries?
Rob Church
On 27/08/07, Yuri Astrakhan yuriastrakhan@gmail.com wrote:
At Wikimania I spoke with Domas regarding query perf on real wiki data. Doing EXPLAIN SELECT on a local db is inaccurate for two reasons:
* devs tend to run the latest MySQL on their machines, not the 4.xxx that the wiki is using (possibly with additional settings that might affect performance)
* devs tend to have a much smaller sample wiki (I use simple.wiki) to test, and the mysql optimizer might choose a different optimization strategy.
I'm aware of both of these concerns, however, remembering to do an EXPLAIN SELECT is, I think, preferable to not doing one at all.
Rob Church
On 8/28/07, Rob Church robchur@gmail.com wrote:
Note the comments on the bug; in particular, http://bugzilla.wikimedia.org/show_bug.cgi?id=8753#c2 - this is an example of a bug where it's not clear whether there is a problem to solve.
Those comments really related to the first patch which applied nofollow to all interwiki links; the second patch allowed nofollow to be applied only to non-local links. This would have allowed what was more or less the consensus position as I read it after the long discussions about nofollow and interwiki links through March, April and May on Wikien-l.
This message from Anthere was a pretty good statement of what was agreed on:
http://lists.wikimedia.org/pipermail/wikien-l/2007-May/070291.html
The patch does avoid the need for a whitelist by using the existing "local" property on interwikis.
Have you run an EXPLAIN SELECT and checked the performance of the modified queries?
No, for the reasons Yuri mentions (internet broadly sucks in Australia and I can't afford to download any really useful data, plus I run MySQL 5). I thought about doing one on the toolserver but apparently that runs MySQL 5 too.
The query is, however, almost exactly the same as the existing one, it just uses a different (existing, but apparently unused) index on the table. The performance should be the same.
On 8/27/07, Rob Church robchur@gmail.com wrote:
There are various reasons for this, just as there are (in addition to lack of shell users' time) reasons for the shell bugs being unfulfilled, but I think it's important to note that "need-review" means, "this bug has a patch which has not been reviewed by anyone other than the patch author" - in other words, *anyone* with a reasonable grasp of MediaWiki is free to leave comments on a possible patch.
But in practice they don't, since it's almost pointless. Getting a developer to properly review a patch is both necessary and sufficient for it to be committed. A non-developer, or developer from outside the area or who otherwise isn't comfortable committing it, does nothing by reviewing the patch, because it still has to be reviewed by a developer to be committed. They can, of course, still point out flaws and get them fixed so they're likely to be more quickly committed by a developer, but the reviewing developer can equally do that, and there's no glory in reviewing things but not committing them.
A patch may offer to solve a problem that doesn't need solving, or that shouldn't be solved in the mainline code, or it may offer a solution that hasn't really been thoroughly discussed. A patch may be unacceptable in quality (bad code practice, bad documentation, performance nightmare) or incomplete (schema change with no associated updater provided). Some of the "patches" marked on BugZilla are not, in fact, patches - comments with, "you need to change line X to line Y" aren't actually very easy to review at all, for instance.
Then if someone actually reviewed them, they could mark them WONTFIX, or review the patches as unacceptable and remove the patch keyword.
There are, no doubt, a multitude of counter-arguments to these points, many of which are fair. I think the main reminder here is that if you want your patch reviewed, *** keep pestering somebody *** - it may be that your patch was noticed, but forgotten about.
Ideally that wouldn't be necessary. If I have some time I think I'll start reviewing some of the open patches.
On 8/27/07, Yuri Astrakhan yuriastrakhan@gmail.com wrote:
What we need is a simple web-based tool that allows anyone to type in a sql query, and get the result of "EXPLAIN" statement executed on the real enwiki. Obviously we must be careful not to introduce a SQL injection vulnerability.
Never mind SQL injection:
mysql> EXPLAIN SELECT * FROM user WHERE user_name='Simetrical' AND user_password='d3a0021210aace9fb3550d16552ebe22';
+----+-------------+-------+------+---------------+------+---------+------+------+-----------------------------------------------------+
| id | select_type | table | type | possible_keys | key  | key_len | ref  | rows | Extra                                               |
+----+-------------+-------+------+---------------+------+---------+------+------+-----------------------------------------------------+
|  1 | SIMPLE      | NULL  | NULL | NULL          | NULL | NULL    | NULL | NULL | Impossible WHERE noticed after reading const tables |
+----+-------------+-------+------+---------------+------+---------+------+------+-----------------------------------------------------+
1 row in set (0.00 sec)

mysql> EXPLAIN SELECT * FROM user WHERE user_name='Simetrical' AND user_password='da2183e5d268388e0793759df74fbb6a';
+----+-------------+-------+-------+---------------+-----------+---------+-------+------+-------+
| id | select_type | table | type  | possible_keys | key       | key_len | ref   | rows | Extra |
+----+-------------+-------+-------+---------------+-----------+---------+-------+------+-------+
|  1 | SIMPLE      | user  | const | user_name     | user_name | 257     | const |    1 |       |
+----+-------------+-------+-------+---------------+-----------+---------+-------+------+-------+
1 row in set (0.00 sec)
Anyone up for designing a robust SQL sanitizer?
(That's my password on my test wiki, BTW, I think it's a single space. Don't bother whipping out the rainbow tables. ;) )
On 8/27/07, Simetrical Simetrical+wikilist@gmail.com wrote:
Then if someone actually reviewed them, they could mark them WONTFIX, or review the patches as unacceptable and remove the patch keyword.
. . .
Then if someone actually reviewed them, they could mark them WONTFIX, or review the patches as unacceptable and remove the patch keyword.
. . .
Then if someone actually reviewed them, they could mark them WONTFIX, or review the patches as unacceptable and remove the patch keyword.
. . .
Whoa, that got screwed up. :(
Ouch, thanks Simetrical. I guess "user_password" should not be allowed in any "explain" queries. Any other risky ones out there?
On the other hand, this only introduces another way to attempt a login - same as going through the regular login page. I can't think how it would be a security risk beyond automated running through a list of md5 hashes. We could implement a memcached solution so that no more than 10 queries would run per minute.
On 8/27/07, Yuri Astrakhan yuriastrakhan@gmail.com wrote:
Ouch, thanks Simetrical. I guess "user_password" should not be allowed in any "explain" queries. Any other risky ones out there?
Basically the entire user table needs to be considered private, although certain prescribed uses might be okay. Any query that involves only a single data item (e.g. only user_options but no other columns from user) or that involves multiple data items that are publicly associated with each other anyway (e.g. user_name plus user_real_name plus user_registration plus user_edit_count) could be okay.
As for other tables: the entire archive table. IP info in recentchanges. filearchive. These decisions have already been made for the toolserver; those could probably be copied over to this. On the other hand, I'm not sure of the usefulness of this tool for the general public when developers can easily get toolserver access to run EXPLAINs on "public" columns.
In fact, someone could write an EXPLAIN engine and run it on the toolserver. But first it needs to be ensured that there are no injection opportunities, which *should* be straightforward: I'm *pretty* sure that if you start off with "EXPLAIN" and only allow one statement (as the PHP API does), there's no way to inject any actual queries, they'll just be EXPLAINed. But personally, I'm not quite sure enough of that to try running such a script myself . . .
On the other hand - this only introduces another possibility of a login - same as going through the regular login page. I can't think how it would be a security risk - automated running through a list of md5 hashes. We could implement a memcached solution so that no more than 10 queries would run per minute.
The regular login requires a captcha. Since there are more serious issues with other tables that can't reasonably be solved using captchas or other things that merely stop automated checks, we may as well blacklist the password field entirely along with the others if this is ever offered in any way.
On 8/27/07, Gregory Maxwell gmaxwell@gmail.com wrote:
My view has long been that having this stuff external to mediawiki is good for flexibility, good for development scaling, good for site performance (bots can't break caching), but bad for reliability (some bot author vanishes and the process dies). As a result I've considered it better for us to provide more robust APIs, facilities for running these tasks (toolserver), and standards for development and documentation so the tools can outlive the attention spans of their authors (which is where we suck the most). ... So this is what I've long thought, but I'm prepared to have my eyes opened.
I think that there's a place for a robust and general workflow-management system in the software. I think Rob or someone started a page on MW.org or somewhere that dealt with specs for a software implementation of task handling, but I can't find it in about a fifteen-second search.
On 27/08/07, Simetrical Simetrical+wikilist@gmail.com wrote:
I think that there's a place for a robust and general workflow-management system in the software. I think Rob or someone started a page on MW.org or somewhere that dealt with specs for a software implementation of task handling, but I can't find it in about a fifteen-second search.
http://www.mediawiki.org/wiki/User:Robchurch/Work_queue
Rob Church
On 8/27/07, Yuri Astrakhan yuriastrakhan@gmail.com wrote:
Ouch, thanks Simetrical. I guess "user_password" should not be allowed in any "explain" queries. Any other risky ones out there?
On the other hand - this only introduces another possibility of a login - same as going through the regular login page. I can't think how it would be a security risk - automated running through a list of md5 hashes. We could implement a memcached solution so that no more than 10 queries would run per minute.
EXPLAIN SELECT * FROM user WHERE user_name='Simetrical' AND user_password REGEXP '^[0-7].*';
...
It should only take a maximum of 128 queries to read the entire MD5, one bit at a time, starting with no information at all. This isn't just a bruteforce shortcut. You can't solve this with a delay or a captcha.
There are a bunch of fields that need to be protected (rc_ip, for example.. user prefs.. watchlist table...). To really make the tool useful it has to be very flexible in what it accepts. I looked at building something like this for toolserver users a while back and decided it would be too much work to get it right.
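To make the arithmetic concrete: an MD5 is 32 hex characters, and a binary search over 16 possible values costs 4 yes/no questions per character, hence 32 x 4 = 128 queries. A sketch of the extraction loop; oracle() is a hypothetical callable that submits the REGEXP query above and returns True if EXPLAIN resolved the row (const) and False if it reported an impossible WHERE.

    def recover_md5(oracle):
        hexdigits = '0123456789abcdef'
        recovered = ''
        for _ in range(32):
            lo, hi = 0, 15
            while lo < hi:
                mid = (lo + hi) // 2
                # e.g. '^[0-7]' for the first question on the first character.
                pattern = '^%s[%s-%s]' % (recovered, hexdigits[lo], hexdigits[mid])
                if oracle(pattern):
                    hi = mid
                else:
                    lo = mid + 1
            recovered += hexdigits[lo]
        return recovered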
Gregory, thanks for the warning. The toolserver will not work for this - it's using MySQL 5, which optimizes everything very differently. Plus it is using views instead of tables - again, there are some optimization bugs associated with that. This really has to be on the main wiki - this way we could even compare query optimizations on a large wiki vs a small wiki.
One solution would be to disable regex - most queries don't need that.
From my experience, most of the time I need to run simple join and lookup queries with various filters. "LIKE" has been used a few times, but more often it is simply <,=,>. I guess we can make a list of prohibited fields/tables.
On 8/27/07, Yuri Astrakhan yuriastrakhan@gmail.com wrote:
Gregory, thanks for the warning. The toolserver will not work for this - it's using MySQL 5, which optimizes everything very differently. Plus it is using views instead of tables - again, there are some optimization bugs associated with that. This really has to be on the main wiki - this way we could even compare query optimizations on a large wiki vs a small wiki.
It probably wouldn't be a huge security risk to open up a slave to this, but I don't expect it to happen anytime soon.
One solution would be to disable regex - most queries don't need that.
Er, what? The query doesn't depend on regex:
mysql> EXPLAIN SELECT * FROM user WHERE user_name='Simetrical' AND SUBSTRING(user_password, 1, 1) BETWEEN '8' AND 'f';
+----+-------------+-------+-------+---------------+-----------+---------+-------+------+-------+
| id | select_type | table | type  | possible_keys | key       | key_len | ref   | rows | Extra |
+----+-------------+-------+-------+---------------+-----------+---------+-------+------+-------+
|  1 | SIMPLE      | user  | const | user_name     | user_name | 257     | const |    1 |       |
+----+-------------+-------+-------+---------------+-----------+---------+-------+------+-------+
1 row in set (0.00 sec)

mysql> EXPLAIN SELECT * FROM user WHERE user_name='Simetrical' AND SUBSTRING(user_password, 1, 1) BETWEEN '8' AND 'f';
+----+-------------+-------+-------+---------------+-----------+---------+-------+------+-------+
| id | select_type | table | type  | possible_keys | key       | key_len | ref   | rows | Extra |
+----+-------------+-------+-------+---------------+-----------+---------+-------+------+-------+
|  1 | SIMPLE      | user  | const | user_name     | user_name | 257     | const |    1 |       |
+----+-------------+-------+-------+---------------+-----------+---------+-------+------+-------+
1 row in set (0.00 sec)
You could do the same thing with LIKE, or with any number of other string functions.
I guess we can make a list of prohibited fields/tables.
The only feasible option.
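A sketch of what that might look like. The list below is illustrative and incomplete, and the whole-word scan is deliberately paranoid: anything that so much as mentions a private table or column is refused. A real version would want proper SQL parsing, which is part of why it is so much work to get right.

    import re

    # Illustrative and incomplete; the real list of private data is longer.
    PRIVATE = {
        'user_password', 'user_newpassword', 'user_email', 'user_token',
        'user_options', 'rc_ip', 'archive', 'filearchive', 'watchlist',
    }

    def touches_private_data(query):
        # Reject the query if any bare identifier matches the blacklist.
        words = set(re.findall(r'[a-z_]+', query.lower()))
        return bool(words & PRIVATE)

    # touches_private_data("SELECT * FROM user WHERE user_password=''")  ->  True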
On 8/27/07, Simetrical Simetrical+wikilist@gmail.com wrote:
mysql> EXPLAIN SELECT * FROM user WHERE user_name='Simetrical' AND SUBSTRING(user_password, 1, 1) BETWEEN '8' AND 'f';
+----+-------------+-------+-------+---------------+-----------+---------+-------+------+-------+
| id | select_type | table | type  | possible_keys | key       | key_len | ref   | rows | Extra |
+----+-------------+-------+-------+---------------+-----------+---------+-------+------+-------+
|  1 | SIMPLE      | user  | const | user_name     | user_name | 257     | const |    1 |       |
+----+-------------+-------+-------+---------------+-----------+---------+-------+------+-------+
1 row in set (0.00 sec)

mysql> EXPLAIN SELECT * FROM user WHERE user_name='Simetrical' AND SUBSTRING(user_password, 1, 1) BETWEEN '8' AND 'f';
+----+-------------+-------+-------+---------------+-----------+---------+-------+------+-------+
| id | select_type | table | type  | possible_keys | key       | key_len | ref   | rows | Extra |
+----+-------------+-------+-------+---------------+-----------+---------+-------+------+-------+
|  1 | SIMPLE      | user  | const | user_name     | user_name | 257     | const |    1 |       |
+----+-------------+-------+-------+---------------+-----------+---------+-------+------+-------+
1 row in set (0.00 sec)
That first one should, of course, be:
mysql> EXPLAIN SELECT * FROM user WHERE user_name='Simetrical' AND SUBSTRING(user_password, 1, 1) BETWEEN '0' AND '7';
+----+-------------+-------+------+---------------+------+---------+------+------+-----------------------------------------------------+
| id | select_type | table | type | possible_keys | key  | key_len | ref  | rows | Extra                                               |
+----+-------------+-------+------+---------------+------+---------+------+------+-----------------------------------------------------+
|  1 | SIMPLE      | NULL  | NULL | NULL          | NULL | NULL    | NULL | NULL | Impossible WHERE noticed after reading const tables |
+----+-------------+-------+------+---------------+------+---------+------+------+-----------------------------------------------------+
1 row in set (0.00 sec)
Gregory, thanks for the warning. The toolserver will not work for this - it's using MySQL 5, which optimizes everything very differently. Plus it is using views instead of tables - again, there are some optimization bugs associated with that. This really has to be on the main wiki - this way we could even compare query optimizations on a large wiki vs a small wiki.
It probably wouldn't be a huge security risk to open up a slave to this, but I don't expect it to happen anytime soon.
I guess we can make a list of prohibited fields/tables.
The only feasible option.
What if there was a standalone MySQL ~4.0.29 database somewhere that had loaded an Enwiki pages-meta-history.xml database dump, and which allowed people to run EXPLAIN queries on it via a web interface - wouldn't that mostly solve the problem?
By definition, we know/assume that the dumps do not contain confidential information - so rather than trying to enumerate data that needs protection, this way there is simply no sensitive data there in the first place to protect.
And in the web interface for running the EXPLAIN queries, there could also be a big "RESET" button that would copy over backups of the MySQL data files, thus resetting the database back to a known state if it became messed up somehow.
Disclaimer: I don't personally feel any desire to build such a thing, I'm just trying to understand whether it would solve the problem or not.
-- All the best, Nick.
On 8/26/07, Andrew Garrett andrew@epstone.net wrote:
Because bots are an immediate solution that people can write within their comfort-zone, whereas learning the MediaWiki codebase would be much more effort; as well as thinking that you need to either grovel, or be in an obscure clique of developers to have the patch applied. (They are wrong on both points).
It was set up a while ago; at the time we didn't have a column for the sha1 data on non-deleted images... I saw someone setting up a bot to watch for naughty things in new uploads, decided I could help make it better, and shortly after I was able to give them a quick HTTP api to look up deleted edits by hash. So that was the initial reason.
And of course, it is more comfortable for a lot of people to code outside of MediaWiki. Laziness? Perhaps. But the changes we would have needed at the time would include a schema change. Time to gratification inside MW: at least a couple of weeks. Externally, we're talking about minutes.
For workflow reasons it's actually much better to accomplish this via bots and tagging. Right now what we're doing is adding a tag like:
{{deletedimage|old image name|old deletion reason}} which expands to a handy link to the history of the deleted duplicate.
Once someone has validated that the image is okay and doesn't need to be deleted again they change the tag to {{deletedimage|old image name|..reason|verified=~~~~}} and the tag changes the category that the image is in.
We could trivially extend mediawiki to display the links to deleted duplicates, and/or non-deleted duplicates, but extending mediawiki to also participate in the workflow automation is a far more ambitious project.
It's also the case that we're doing more than just duplicate matching. For example, here are things which are already done or are being worked on:
* File type double-checking (currently offsetting bug 10823)
* Malicious content scanning (with clamav)
* Automated google image lookup (google image search the file name, grab the top couple results and compare hashes)
* Image fingerprinting (to detect non-bit-identical duplicates)
* Suspect EXIF data (corbis, getty, AP copyright data in exif tags)
etc.
I'm not sure that putting all of that into MediaWiki makes sense. A lot of it works best asynchronously. A lot of it works best as part of a workflow where software and people work as peers, and we don't really have good ways for the mediawiki software to participate in workflows today.
Even things like "this was deleted as a copyvio don't upload it again" works best as a lagged process. Hard security would just result in the uploader figuring out he can twiddle a single bit in the file and upload it.
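The suspect-EXIF item in the list above is the easiest to sketch. Tag names come from PIL/Pillow's ExifTags table; the vendor strings to flag are guesses, not the actual bot's list.

    from PIL import Image
    from PIL.ExifTags import TAGS

    SUSPECT = ('getty', 'corbis', 'associated press', 'ap photo')

    def suspect_exif(path):
        # Return any EXIF fields whose text mentions a known stock agency.
        exif = Image.open(path).getexif()
        hits = []
        for tag_id, value in exif.items():
            name = TAGS.get(tag_id, str(tag_id))
            if isinstance(value, bytes):
                value = value.decode('utf-8', 'replace')
            if isinstance(value, str) and any(s in value.lower() for s in SUSPECT):
                hits.append((name, value))
        return hits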
Gregory Maxwell wrote:
It's also the case that we're doing more than just duplicate matching. For example, here are things which are already done or are being worked on:
* File type double-checking (currently offsetting bug 10823)
* Malicious content scanning (with clamav)
* Automated google image lookup (google image search the file name, grab the top couple results and compare hashes)
* Image fingerprinting (to detect non-bit-identical duplicates)
* Suspect EXIF data (corbis, getty, AP copyright data in exif tags)
etc.
I'm not sure that putting all of that into MediaWiki makes sense.
It does.
A lot of it works best asynchronously.
Yep!
A lot of it works best as part of a workflow where software and people work as peers, and we don't really have good ways for the mediawiki software to participate in workflows today.
A wiki is inherently an asynchronous people-oriented bit of software. ;) Certainly we'd like even more support for this.
-- brion vibber (brion @ wikimedia.org)
On 8/27/07, Brion Vibber brion@wikimedia.org wrote:
I'm not sure that putting all of that into MediaWiki makes sense.
It does.
A lot of it works best asynchronously.
Yep!
A lot of it works best as part of a workflow where software and people work as peers, and we don't really have good ways for the mediawiki software to participate in workflows today.
A wiki is inherently an asynchronous people-oriented bit of software. ;) Certainly we'd like even more support for this.
On this subject, if you say it's true it is... but is there a parallel to what I'm discussing someplace on our sites?
I've always seen a wiki as facilitating asynchronous collaboration between people, but not people and software, except in cases where software pretends to be a person (bots).
There are a LOT of people+software workflows on our projects. People have built tools to facilitate those workflows using templates, bots, and instructions for people. So far on our wikis, MediaWiki fits in by helping collaboration by acting as a version control system and data store, offering some reporting tools, and mostly staying out of the way.
So how should this work?
There have been a number of workflow enabling extensions built by people in the past. For example http://www.mediawiki.org/wiki/Extension:Tasks_Extension, but I'm not aware of any of them being used on the Wikimedia Wikis.
So... We have conditions; sometimes they are automatically detectable (duplicate images), or not (suspected copyvios). That detection happens, and somehow the related page needs to appear in a list as a result. Other things might happen which resolve the condition, or they don't, and software needs to move the object along through a series of steps until the issue is resolved. Today this is accomplished by a mixture of templates, bots, and obsessive humans (sometimes with userscript cybernetic augmentation ;) ). The procedures change a lot, and are sometimes invented then replaced in a few days' time. Some are long standing and deal with thousands of pages or files per week.
How should this be better done?
My view has long been that having this stuff external to mediawiki is good for flexibility, good for development scaling, good for site performance (bots can't break caching), but bad for reliability (some bot author vanishes and the process dies). As a result I've considered it better for us to provide more robust APIs, facilities for running these tasks (toolserver), and standards for development and documentation so the tools can outlive the attention spans of their authors (which is where we suck the most). ... So this is what I've long thought, but I'm prepared to have my eyes opened.
On Aug 27, 2007, at 1:02 PM, Gregory Maxwell wrote:
On 8/27/07, Brion Vibber brion@wikimedia.org wrote:
I'm not sure that putting all of that into MediaWiki makes sense.
It does.
A lot of it works best asynchronously.
Yep!
A lot of it works best as part of a workflow where software and people work as peers, and we don't really have good ways for the mediawiki software to participate in workflows today.
A wiki is inherently an asynchronous people-oriented bit of software. ;) Certainly we'd like even more support for this.
On this subject, if you say it's true it is... but is there a parallel to what I'm discussing someplace on our sites?
I've always seen a wiki as facilitating asynchronous collaboration between people, but not people and software, except in cases where software pretends to be a person (bots).
There are a LOT of people+software workflows on our projects. People have built tools to facilitate those workflows using templates, bots, and instructions for people. So far on our wikis, MediaWiki fits in by helping collaboration by acting as a version control system and data store, offering some reporting tools, and mostly staying out of the way.
So how should this work?
This is something I'm interested in too... but is it time to fork a new subject thread?
=====================================
Jim Hu
Associate Professor
Dept. of Biochemistry and Biophysics
2128 TAMU
Texas A&M Univ.
College Station, TX 77843-2128
979-862-4054