While tweaking my MIMEStatBot[1] to handle random Query Killer attacks better, I got to thinking...
Would it be practical to set up a local index on the toolserver for (img_major_mime, img_minor_mime, img_media_type) on the image table for commonswiki (and maybe others too)?
Such an index would hopefully cut the runtime of my weekly queries from hours to seconds or less. It might also enable other tools to make better use of the MIME type info stored in the database.
Of course, populating and maintaining such an index would consume some time and storage space. But I wonder if it might not be worth the tradeoff. What do the TS admins think?
[1] http://commons.wikimedia.org/wiki/User:MIMEStatBot
Ilmari Karonen wrote:
While tweaking my MIMEStatBot[1] to handle random Query Killer attacks better, I got to thinking...
Would it be practical to set up a local index on the toolserver for (img_major_mime, img_minor_mime, img_media_type) on the image table for commonswiki (and maybe others too)?
Such an index would hopefully cut the runtime of my weekly queries from hours to seconds or less. It might also enable other tools to make better use of the MIME type info stored in the database.
Of course, populating and maintaining such an index would consume some time and storage space. But I wonder if it might not be worth the tradeoff. What do the TS admins think?
Your bot doesn't need to do a full query on the mime fields. You could just keep track of uploads and deletions, and update the counters accordingly.
On 08/30/2011 01:17 AM, Platonides wrote:
Ilmari Karonen wrote:
While tweaking my MIMEStatBot[1] to handle random Query Killer attacks better, I got to thinking...
Would it be practical to set up a local index on the toolserver for (img_major_mime, img_minor_mime, img_media_type) on the image table for commonswiki (and maybe others too)?
Such an index would hopefully cut the runtime of my weekly queries from hours to seconds or less. It might also enable other tools to make better use of the MIME type info stored in the database.
Of course, populating and maintaining such an index would consume some time and storage space. But I wonder if it might not be worth the tradeoff. What do the TS admins think?
Your bot doesn't need to do a full query on the mime fields. You could just keep track of uploads and deletions, and update the counters accordingly.
That's actually a pretty good idea... hmm.
I do see a few problems, though. For example, uploads and deletions are not the only things that can change the MIME type of a file; changes to the MIME type detection code in MediaWiki can also cause that, possibly with some delay since it may take a long time for something to trigger a metadata update. This is not just a theoretical scenario -- I've seen it happen.
Also, I'm not sure what happens if a file is oversighted without going through normal deletion first. And I've also observed several instances where MediaWiki bugs or server errors have led to missing log entries.
Ilmari Karonen wrote:
On 08/30/2011 01:17 AM, Platonides wrote:
Your bot doesn't need to do a full query on the mime fields. You could just keep track of uploads and deletions, and update the counters accordingly.
That's actually a pretty good idea... hmm.
I do see a few problems, though. For example, uploads and deletions are not the only things that can change the MIME type of a file; changes to the MIME type detection code in MediaWiki can also cause that, possibly with some delay since it may take a long time for something to trigger a metadata update. This is not just a theoretical scenario -- I've seen it happen.
Some maintenance script being run?
Also, I'm not sure what happens if a file is oversighted without going through normal deletion first. And I've also observed several instances where MediaWiki bugs or server errors have led to missing log entries.
I didn't consider that. It seems indeed a source of inconsistencies, although probably negligible.
There is potential for a little skew, and a full query may still be needed from time to time, but such "fast" schema seems a good thing to implement.
toolserver-l@lists.wikimedia.org