Brad Jorsch schreef:
On Mon, Feb 09, 2009 at 02:27:15PM +0100, Roan Kattouw wrote:
In the comment above the default value there's a big warning saying you shouldn't set $wgAPIMaxResultSize to something lower than the maximum revision size. Wiki admins who do that anyway are asking for trouble.
Ok, as long as it's documented that should be good enough.
The metadata thing is also a non-issue, because it's stored in a database field that holds at most 4 MB.
It looks like img_metadata is a MEDIUMBLOB in MySQL, which can hold up to 2**24 bytes (16 MiB) according to MySQL's documentation. OTOH, in PostgreSQL it uses a BYTEA, which will go up to 512 MiB in 8.1 and there is no reason that maximum couldn't be increased in later versions. I don't see any code limiting the size of the stored data on the MediaWiki side of things, either.
Of course, I could be missing something.
No, just my math going wrong (2^4 is 16, not 4). Fortunately, the metadata is (currently*) extracted from the image's EXIF data, which is limited to 64 KB.
* There are plans floating around on Commons to store UI-readable and modifiable metadata for images, and using the existing img_metadata field for that is an option
amfrom is broken if ammessages is not sorted in whatever order PHP uses for string comparison.
Fixed in r47036 by checking for inequality (!=) rather than less-than (<), using the fact that the order of message retrieval is pretty much constant (unless of course extensions are installed/removed between two requests).
Now amfrom is broken differently: meta=allmessages&amfrom=aboutpage returns only aboutpage instead of everything starting with aboutpage.
Doh! Of course.
There may be a better way, but one easy way to do it would be like this: $skip = !is_null($params['from']); foreach( $messages_target as $message ) { if($message === $params['from']) $skip = false; if(!$skip) $messages[$message] = wfMsg( $message ); }
Did exactly that in r47048.
I'll look into this; sorting by i2.img_name as well is less efficient according to a quick EXPLAIN, but I'm not sure if it's too inefficient (will ask a database guru). If memory serves, this problem shouldn't occur on MySQL databases, though, because they sort stuff by the primary key (img_name) by default IIRC.
According to MySQL's documentation, you're right if InnoDB tables are being used and the query is doing a table scan (or presumably using the primary key for an index scan); if it uses some other index, this may not hold true. If MyISAM is used, I believe it's insertion order, although deletes, OPTIMIZE TABLE, ALTER TABLE ORDER BY, and probably other stuff screw that up.
If MySQL really is sorting by the i2.img_name by default, theoretically making that explicit shouldn't make a difference. But knowing MySQL, it wouldn't surprise me that it does.
Even if it's already sorted, that won't always stop MySQL from sorting it again, which takes time. IIRC the i2 table is scanned using the img_sha1 index, though, which means sorting by i2.img_name is probably a real filesort, but because the query sorts by i1.img_name (using the primary key) first, the filesort will only sort small subsets of the result set and not the entire result set at once (which makes a difference, because sorting happens in worse-than-linear time). I'll consult with Aryeh or Domas before adding an ORDER BY on i2.img_name.
I don't know what else could be done for search results, so I wouldn't worry about that.
prop=categoryinfo should be easy if ORDER BY cat_title doesn't kill MySQL. Then you can do the standard thing there using cat_title as the continue parameter.
You're right about that. Done in r47050.
prop=extlinks must already be using the el_from index, so hopefully adding el_to to the ORDER BY won't kill MySQL. Then el_to can be the continue parameter.
Actually, it can't, because the index only covers the first 40 chars of el_to, which means it can't use the index to resolve sorting conflicts past the 40th char of el_to. Whether it's actually smart enough to use the index optimally I don't know (it's MySQL after all); if it does, the filesort would presumably affect small subsets and might be acceptable. Again, I'll have to ask more performance-educated people.
list=exturlusage might be tough. The ideal thing would be to order by el_index and page_id (with page_id|el_index as the continue parameter), but I'm not at all confident that wouldn't filesort. If it does, I have no idea what we could do.
Actually, we'd have to page the other way around (first el_index, then el_from; the latter is equal to page_id) because the query has to use the el_index index to resolve the LIKE query on the el_index field. Again, only the first 60 chars of el_index are covered by the index, and there is no index that covers both el_index and el_from, so we're looking at filesorts again (again, probably on small subsets). I'll run this by the DB gurus as well.
prop=imageinfo looks somewhat tricky. I might change "continue" to be the concatenation of the image title and the appropriate start value, and change the logic something like this: $titles = array_keys( $pageIds[NS_FILE] ); sort($titles); // Unless we can depend on the order to always be the same
Actually, I believe these things are ordered by page IDs somewhere in ApiPageSet, so the order would be the same every time except when a file is moved/renamed.
$images = RepoGroup::singleton()->findFiles( $titles ); if(is_null($params['continue'])){ $skip = false; } else { $skip = true; list($continue_title, $continue_start) = explode('|', $params['continue']); } foreach ( $titles as $title ) { $start = $params['start']; if($skip) { if($title != $continue_title) continue; $skip = false; $start = $continue_start; }
if(!isset($images[$title])) { // Add the "missing" record, as in lines 143-148 of r47043 // except add a break in the !$fit case continue; } $img = $images[$title]; // Add the "not missing" record, as in lines 68-134 of r47043 // except use $start instead of $params['start']
}
Looks sane, I'll probably do this tomorrow.
prop=info could be done by merging $titles and $missing (if necessary), doing an asort, and using similar logic. The continue would be just the title.
I'll look into that tomorrow as well.
For prop=revisions, I wonder if MySQL would filesort if you order by rev_id/page_id (depending on which is being used in the WHERE clause), since I would guess that's the index being used to fetch the rows anyway. If not, that could easily work for the continue parameter in non-enum mode (enum mode is already OK).
Apart from non-enum mode, there are two other modes. In titles= mode, the query fetches rows using the PRIMARY (rev_page, rev_id) index, which can be paged by rev_page (equal to page_id) just fine. In revids= mode, it uses the rev_id (rev_id) index, which we can be on as well. All in all, this should be an easy one; I'll look into it tomorrow as well.
Roan Kattouw (Catrope)