Hello all,
I was noticed today by a steward that one tool (I will message the author separately) does not respect the rev_deleted-field and displays revisions, which are already deleted, to the world.
Our rules forbids displaying of data which is not visible on the Wikimedia- wikis any more. I know that the usage of the rev_delete-field by mediaiwiki is quite new (the mediawiki-devs didn't update their documentation until now AFAIS), so please look at your tools. If you use the revision-table and do showing single edits, then add a "AND rev_delete=0" to the where-clause where needed (it's of corse ok to ignore the field if you just display abstract data like "How many edits have a wiki").
Please always remember: Just because you can see the data in the database, that is no allowance to show it to the world (if in doubt: ask a root).
Sincerly, DaB.
On Mon, 2010-09-27 at 16:45 +0200, DaB. wrote:
If you use the revision-table and do showing single edits, then add a "AND rev_delete=0" to the where-clause where needed (it's of corse ok to ignore the field if you just display abstract data like "How many edits have a wiki").
For the records, that should be "AND rev_deleted=0".
Erwin
On 09/27/2010 05:45 PM, DaB. wrote:
Our rules forbids displaying of data which is not visible on the Wikimedia- wikis any more. I know that the usage of the rev_delete-field by mediaiwiki is quite new (the mediawiki-devs didn't update their documentation until now AFAIS), so please look at your tools. If you use the revision-table and do showing single edits, then add a "AND rev_delete=0" to the where-clause where needed (it's of corse ok to ignore the field if you just display abstract data like "How many edits have a wiki").
Why do we even include those rows in the default views? I don't think most people expect to get deleted rows on a simple query on the revision table, given that this was never the case under the old system (when they were moved to the archive table instead).
If someone really needs those rows, I guess we could always set up alternative views that include then. But I'd think that would be rather uncommon use case. (How many TS users are actually doing something with the archive table, anyway?)
Please always remember: Just because you can see the data in the database, that is no allowance to show it to the world (if in doubt: ask a root).
True, but there's also no need to make it too easy to accidentally expose sensitive data.
(Compare with e.g. the absence of the user_password field from the views. Technically, there shouldn't be much risk in exposing it, since it's salted and hashed, but there's little if any valid need for it either.)
On Mon, Sep 27, 2010 at 2:55 PM, Ilmari Karonen nospam@vyznev.net wrote:
Why do we even include those rows in the default views? I don't think most people expect to get deleted rows on a simple query on the revision table, given that this was never the case under the old system (when they were moved to the archive table instead).
If someone really needs those rows, I guess we could always set up alternative views that include then. But I'd think that would be rather uncommon use case. (How many TS users are actually doing something with the archive table, anyway?)
We already have two alternative views for the logging table, so this is doable. If we add copies of all indexes prefixed with rev_deleted, everything should be just as efficient in a view as you suggest, so I don't see why we couldn't. In other cases there have been issues with efficiency, but I don't see why those should exist here, offhand.
(Compare with e.g. the absence of the user_password field from the views. Technically, there shouldn't be much risk in exposing it, since it's salted and hashed, but there's little if any valid need for it either.)
There'd be tremendous risk in publishing salted and hashed passwords. You can run an automated password cracker on those and try thousands of passwords per second or more, even millions per second. You'd be able to quickly recover the password of anyone who used a dictionary word or other common password if you had access to the hashes. You can't do this online, because there's throttling -- and even just the overhead of an HTTP request makes it hard to do thousands of attempts per second (let alone millions).
Hello, At Monday 27 September 2010 22:12:05 DaB. wrote:
If we add copies of all indexes prefixed with rev_deleted, everything should be just as efficient in a view as you suggest, so I don't see why we couldn't. In other cases there have been issues with efficiency, but I don't see why those should exist here, offhand.
I tried it today. It morphs a 1.36s query (simple count-query on my username on dewiki_p) into something I canceled after 2 minutes (I tried both: if- selecting and where-clause).
And I don't see the point to allow users to see the archives-table, but not the deleted revisions (the critical stuff is in the hidden-table anyway).
Sincerly, DaB.
On 09/27/2010 11:17 PM, DaB. wrote:
Hello, At Monday 27 September 2010 22:12:05 DaB. wrote:
If we add copies of all indexes prefixed with rev_deleted, everything should be just as efficient in a view as you suggest, so I don't see why we couldn't. In other cases there have been issues with efficiency, but I don't see why those should exist here, offhand.
I tried it today. It morphs a 1.36s query (simple count-query on my username on dewiki_p) into something I canceled after 2 minutes (I tried both: if- selecting and where-clause).
That's without any new indexes, I assume.
It seems to me that we may need those new indexes anyway, if we're going to have to sprinkle "and rev_deleted = 0" everywhere. Then again, the queries that are most affected seem to be precisely such simple count queries that could be answered using the indexes alone, so if it's OK to included deleted rows in such counts, it might not be so important.
On Mon, Sep 27, 2010 at 4:17 PM, DaB. WP@daniel.baur4.info wrote:
I tried it today. It morphs a 1.36s query (simple count-query on my username on dewiki_p) into something I canceled after 2 minutes (I tried both: if- selecting and where-clause).
What exactly did you try? You'd need to change indexes on the underlying table for this to work acceptably.
And I don't see the point to allow users to see the archives-table, but not the deleted revisions (the critical stuff is in the hidden-table anyway).
I'm not clear on why toolserver users are allowed to see deleted revisions to begin with. rev_deleted is supposed to supplant oversight, and that includes things that nobody is really supposed to see. I particularly don't understand why we hide stuff from the logging table but not the revision table.
Aryeh Gregor schrieb:
And I don't see the point to allow users to see the archives-table, but not the deleted revisions (the critical stuff is in the hidden-table anyway).
I'm not clear on why toolserver users are allowed to see deleted revisions to begin with. rev_deleted is supposed to supplant oversight, and that includes things that nobody is really supposed to see. I particularly don't understand why we hide stuff from the logging table but not the revision table.
I'ts generally no good idea to remove transparency. That makes it more difficult for tools to handle non-visible revisions.
Hello, At Tuesday 28 September 2010 14:00:46 DaB. wrote:
On Mon, Sep 27, 2010 at 4:17 PM, DaB. WP@daniel.baur4.info wrote:
I tried it today. It morphs a 1.36s query (simple count-query on my username on dewiki_p) into something I canceled after 2 minutes (I tried both: if- selecting and where-clause).
What exactly did you try? You'd need to change indexes on the underlying table for this to work acceptably.
I run a non-cached simple count-query on my user-name on dewiki (like I told in my email yesterday) and got a result in a sec. Then I took our spare db-server… - oh wait, we have no spare db-server. So I took 1 of the sql-s5-db-servers out of rotation… – oh wait we have only 1 server for sql-s5. So I just changed the view of revision (like I told in my email yesterday) in a way it hides rev_user_text if rev_deleted>0, to test the runtime of a non- cached simple count. I canceled that after 1 minute or 2. Then I changed the view in the way that all rows if rev_deleted>1 are hidden (with a where- clause) and run the non-cached count again and canceled it again after a minute or two. Then I changed the view back to its original config and run my non-cached query again and it returned the count in a sec.
If you think that is done wrong or not trust me or my results, just try it yourself, use another db-server or anything. If you find a fast solution, I will be the last to not implement it.
And I don't see the point to allow users to see the archives-table, but not the deleted revisions (the critical stuff is in the hidden-table anyway).
I'm not clear on why toolserver users are allowed to see deleted revisions to begin with. rev_deleted is supposed to supplant oversight, and that includes things that nobody is really supposed to see.
We hide already the only really critical field: rev_comment, if the revision is deleted in any way. Every table on the toolserver can contain hidden stuff (even the user-table!), so the only real solution to let nobody see hidden stuff is to delete the database (and shut down the toolserver).
I particularly don't understand why we hide stuff from the logging table but not the revision table.
The logging table contains more critical stuff then the revision-table, is not queried that often and is nearly never joined with other tables.
To let me be very clear: I (and I know River too) try to hide as much hidden stuff from the toolserver-users as possible (we try even to hide some data from the roots), but it is impossible to hide everything from them, without making the toolserver useless or really, really slow. So it is always a weighting. And as long as the toolserver-users follow our rules and think a bit, no critical/hidden stuff is visible to the world: And that's what matters.
Sincerly, DaB.
On 09/28/2010 03:32 PM, DaB. wrote:
Hello, At Tuesday 28 September 2010 14:00:46 DaB. wrote:
On Mon, Sep 27, 2010 at 4:17 PM, DaB.WP@daniel.baur4.info wrote:
I tried it today. It morphs a 1.36s query (simple count-query on my username on dewiki_p) into something I canceled after 2 minutes (I tried both: if- selecting and where-clause).
What exactly did you try? You'd need to change indexes on the underlying table for this to work acceptably.
I run a non-cached simple count-query on my user-name on dewiki (like I told in my email yesterday) and got a result in a sec. Then I took our spare db-server… - oh wait, we have no spare db-server. So I took 1 of the sql-s5-db-servers out of rotation… – oh wait we have only 1 server for sql-s5. So I just changed the view of revision (like I told in my email yesterday) in a way it hides rev_user_text if rev_deleted>0, to test the runtime of a non- cached simple count. I canceled that after 1 minute or 2. Then I changed the view in the way that all rows if rev_deleted>1 are hidden (with a where- clause) and run the non-cached count again and canceled it again after a minute or two. Then I changed the view back to its original config and run my non-cached query again and it returned the count in a sec.
There's a much easier way to test that:
$ sql enwiki_p ... mysql> select count(*) from revision where rev_user = 398996; +----------+ | count(*) | +----------+ | 13445 | +----------+ 1 row in set (1.19 sec)
mysql> select count(*) from revision where rev_user = 398996 and rev_deleted = 0; +----------+ | count(*) | +----------+ | 13445 | +----------+ 1 row in set (2 min 4.93 sec)
It doesn't really matter if the "rev_deleted = 0" condition is part of the query itself or included via the view definition.
If you think that is done wrong or not trust me or my results, just try it yourself, use another db-server or anything. If you find a fast solution, I will be the last to not implement it.
Adding the indexes Aryeh suggested ought to fix it (at the cost that having a bunch of extra indexes generally entails).
toolserver-l@lists.wikimedia.org