Brad Jorsch schreef:
On Mon, Feb 09, 2009 at 02:27:15PM +0100, Roan Kattouw
wrote:
In the comment above the default value
there's a big warning saying you
shouldn't set $wgAPIMaxResultSize to something lower than the maximum
revision size. Wiki admins who do that anyway are asking for trouble.
Ok, as long as it's documented that should be good enough.
The metadata thing is also a non-issue, because
it's stored in a
database field that holds at most 4 MB.
It looks like img_metadata is a MEDIUMBLOB in MySQL, which can hold up
to 2**24 bytes (16 MiB) according to MySQL's documentation. OTOH, in
PostgreSQL it uses a BYTEA, which will go up to 512 MiB in 8.1 and there
is no reason that maximum couldn't be increased in later versions. I
don't see any code limiting the size of the stored data on the MediaWiki
side of things, either.
Of course, I could be missing something.
No, just my math going wrong (2^4 is 16, not 4). Fortunately, the
metadata is (currently*) extracted from the image's EXIF data, which is
limited to 64 KB.
* There are plans floating around on Commons to store UI-readable and
modifiable metadata for images, and using the existing img_metadata
field for that is an option
amfrom is broken if ammessages is not sorted in
whatever order PHP uses
for string comparison.
Fixed in r47036 by checking for inequality (!=) rather than less-than
(<), using the fact that the order of message retrieval is pretty much
constant (unless of course extensions are installed/removed between two
requests).
Now amfrom is broken differently: meta=allmessages&amfrom=aboutpage
returns only aboutpage instead of everything starting with aboutpage.
Doh! Of course.
There may be a better way, but one easy way to do it
would be like this:
$skip = !is_null($params['from']);
foreach( $messages_target as $message ) {
if($message === $params['from']) $skip = false;
if(!$skip) $messages[$message] = wfMsg( $message );
}
Did exactly that in r47048.
I'll look
into this; sorting by i2.img_name as well is less efficient
according to a quick EXPLAIN, but I'm not sure if it's too inefficient
(will ask a database guru). If memory serves, this problem shouldn't
occur on MySQL databases, though, because they sort stuff by the primary
key (img_name) by default IIRC.
According to MySQL's documentation, you're right if InnoDB tables are
being used and the query is doing a table scan (or presumably using the
primary key for an index scan); if it uses some other index, this may
not hold true. If MyISAM is used, I believe it's insertion order,
although deletes, OPTIMIZE TABLE, ALTER TABLE ORDER BY, and probably
other stuff screw that up.
If MySQL really is sorting by the i2.img_name by default, theoretically
making that explicit shouldn't make a difference. But knowing MySQL, it
wouldn't surprise me that it does.
Even if it's already sorted, that won't always stop MySQL from sorting
it again, which takes time. IIRC the i2 table is scanned using the
img_sha1 index, though, which means sorting by i2.img_name is probably a
real filesort, but because the query sorts by i1.img_name (using the
primary key) first, the filesort will only sort small subsets of the
result set and not the entire result set at once (which makes a
difference, because sorting happens in worse-than-linear time). I'll
consult with Aryeh or Domas before adding an ORDER BY on i2.img_name.
I don't know what else could be done for search
results, so I wouldn't
worry about that.
prop=categoryinfo should be easy if ORDER BY cat_title doesn't kill
MySQL. Then you can do the standard thing there using cat_title as the
continue parameter.
You're right about that. Done in r47050.
prop=extlinks must already be using the el_from index,
so hopefully
adding el_to to the ORDER BY won't kill MySQL. Then el_to can be the
continue parameter.
Actually, it can't, because the index only covers the first 40 chars of
el_to, which means it can't use the index to resolve sorting conflicts
past the 40th char of el_to. Whether it's actually smart enough to use
the index optimally I don't know (it's MySQL after all); if it does, the
filesort would presumably affect small subsets and might be acceptable.
Again, I'll have to ask more performance-educated people.
list=exturlusage might be tough. The ideal thing would
be to order by
el_index and page_id (with page_id|el_index as the continue parameter),
but I'm not at all confident that wouldn't filesort. If it does, I have
no idea what we could do.
Actually, we'd have to page the other way around (first el_index, then
el_from; the latter is equal to page_id) because the query has to use
the el_index index to resolve the LIKE query on the el_index field.
Again, only the first 60 chars of el_index are covered by the index, and
there is no index that covers both el_index and el_from, so we're
looking at filesorts again (again, probably on small subsets). I'll run
this by the DB gurus as well.
prop=imageinfo looks somewhat tricky. I might change
"continue" to be
the concatenation of the image title and the appropriate start value, and
change the logic something like this:
$titles = array_keys( $pageIds[NS_FILE] );
sort($titles); // Unless we can depend on the order to always be the same
Actually, I believe these things are ordered by page IDs somewhere in
ApiPageSet, so the order would be the same every time except when a file
is moved/renamed.
$images = RepoGroup::singleton()->findFiles(
$titles );
if(is_null($params['continue'])){
$skip = false;
} else {
$skip = true;
list($continue_title, $continue_start) = explode('|',
$params['continue']);
}
foreach ( $titles as $title ) {
$start = $params['start'];
if($skip) {
if($title != $continue_title) continue;
$skip = false;
$start = $continue_start;
}
if(!isset($images[$title])) {
// Add the "missing" record, as in lines 143-148 of r47043
// except add a break in the !$fit case
continue;
}
$img = $images[$title];
// Add the "not missing" record, as in lines 68-134 of r47043
// except use $start instead of $params['start']
}
Looks sane, I'll probably do this tomorrow.
prop=info could be done by merging $titles and
$missing (if necessary),
doing an asort, and using similar logic. The continue would be just the
title.
I'll look into that tomorrow as well.
For prop=revisions, I wonder if MySQL would filesort
if you order by
rev_id/page_id (depending on which is being used in the WHERE clause),
since I would guess that's the index being used to fetch the rows
anyway. If not, that could easily work for the continue parameter in
non-enum mode (enum mode is already OK).
Apart from non-enum mode, there are two other modes. In titles= mode,
the query fetches rows using the PRIMARY (rev_page, rev_id) index, which
can be paged by rev_page (equal to page_id) just fine. In revids= mode,
it uses the rev_id (rev_id) index, which we can be on as well. All in
all, this should be an easy one; I'll look into it tomorrow as well.
Roan Kattouw (Catrope)