Re: [Mediawiki-api] [Mediawiki-api-announce] BREAKING CHANGE: Queries may return fewer results than the limit and still set a query-continue

9 Feb 2009

Brad Jorsch schreef:
...
  On Mon, Feb 09, 2009 at 02:27:15PM +0100, Roan Kattouw
wrote:
  In the comment above the default value
there's a big warning saying you 
 shouldn't set $wgAPIMaxResultSize to something lower than the maximum 
 revision size. Wiki admins who do that anyway are asking for trouble.  
 Ok, as long as it's documented that should be good enough.

  The metadata thing is also a non-issue, because
it's stored in a 
 database field that holds at most 4 MB.  
 It looks like img_metadata is a MEDIUMBLOB in MySQL, which can hold up
 to 2**24 bytes (16 MiB) according to MySQL's documentation. OTOH, in
 PostgreSQL it uses a BYTEA, which will go up to 512 MiB in 8.1 and there
 is no reason that maximum couldn't be increased in later versions. I
 don't see any code limiting the size of the stored data on the MediaWiki
 side of things, either.

 Of course, I could be missing something.
  No, just my math going wrong (2^4 is 16, not 4). Fortunately, the 
metadata is (currently*) extracted from the image's EXIF data, which is 
limited to 64 KB.

* There are plans floating around on Commons to store UI-readable and 
modifiable metadata for images, and using the existing img_metadata 
field for that is an option

...
    amfrom is broken if ammessages is not sorted in
whatever order PHP uses
 for string comparison.
  Fixed in r47036 by checking for inequality (!=) rather than less-than 
 (<), using the fact that the order of message retrieval is pretty much 
 constant (unless of course extensions are installed/removed between two 
 requests).  
 Now amfrom is broken differently: meta=allmessages&amfrom=aboutpage
 returns only aboutpage instead of everything starting with aboutpage.
  Doh! Of course.

...
  There may be a better way, but one easy way to do it
would be like this:
   $skip = !is_null($params['from']);
   foreach( $messages_target as $message ) {
       if($message === $params['from']) $skip = false;
       if(!$skip) $messages[$message] = wfMsg( $message );
   }
  Did exactly that in r47048.

...
   I'll look
into this; sorting by i2.img_name as well is less efficient 
 according to a quick EXPLAIN, but I'm not sure if it's too inefficient 
 (will ask a database guru). If memory serves, this problem shouldn't 
 occur on MySQL databases, though, because they sort stuff by the primary 
 key (img_name) by default IIRC.  
 According to MySQL's documentation, you're right if InnoDB tables are
 being used and the query is doing a table scan (or presumably using the
 primary key for an index scan); if it uses some other index, this may
 not hold true. If MyISAM is used, I believe it's insertion order,
 although deletes, OPTIMIZE TABLE, ALTER TABLE ORDER BY, and probably
 other stuff screw that up.

 If MySQL really is sorting by the i2.img_name by default, theoretically
 making that explicit shouldn't make a difference. But knowing MySQL, it
 wouldn't surprise me that it does.
  Even if it's already sorted, that won't always stop MySQL from sorting 
it again, which takes time. IIRC the i2 table is scanned using the 
img_sha1 index, though, which means sorting by i2.img_name is probably a 
real filesort, but because the query sorts by i1.img_name (using the 
primary key) first, the filesort will only sort small subsets of the 
result set and not the entire result set at once (which makes a 
difference, because sorting happens in worse-than-linear time). I'll 
consult with Aryeh or Domas before adding an ORDER BY on i2.img_name.

...
  I don't know what else could be done for search
results, so I wouldn't
 worry about that. 

 prop=categoryinfo should be easy if ORDER BY cat_title doesn't kill
 MySQL. Then you can do the standard thing there using cat_title as the
 continue parameter.
  You're right about that. Done in r47050.

...
  prop=extlinks must already be using the el_from index,
so hopefully
 adding el_to to the ORDER BY won't kill MySQL. Then el_to can be the
 continue parameter.
  Actually, it can't, because the index only covers the first 40 chars of 
el_to, which means it can't use the index to resolve sorting conflicts 
past the 40th char of el_to. Whether it's actually smart enough to use 
the index optimally I don't know (it's MySQL after all); if it does, the 
filesort would presumably affect small subsets and might be acceptable. 
Again, I'll have to ask more performance-educated people.

...
  list=exturlusage might be tough. The ideal thing would
be to order by
 el_index and page_id (with page_id|el_index as the continue parameter),
 but I'm not at all confident that wouldn't filesort. If it does, I have
 no idea what we could do.
  Actually, we'd have to page the other way around (first el_index, then 
el_from; the latter is equal to page_id) because the query has to use 
the el_index index to resolve the LIKE query on the el_index field. 
Again, only the first 60 chars of el_index are covered by the index, and 
there is no index that covers both el_index and el_from, so we're 
looking at filesorts again (again, probably on small subsets). I'll run 
this by the DB gurus as well.

...
  prop=imageinfo looks somewhat tricky. I might change
"continue" to be
 the concatenation of the image title and the appropriate start value, and
 change the logic something like this:
   $titles = array_keys( $pageIds[NS_FILE] );
   sort($titles); // Unless we can depend on the order to always be the same
Actually, I believe these things are ordered by page IDs somewhere in 
ApiPageSet, so the order would be the same every time except when a file 
is moved/renamed.

...
    $images = RepoGroup::singleton()->findFiles(
$titles );
   if(is_null($params['continue'])){
       $skip = false;
   } else {
       $skip = true;
       list($continue_title, $continue_start) = explode('|',
$params['continue']);
   }
   foreach ( $titles as $title ) {
       $start = $params['start'];
       if($skip) {
           if($title != $continue_title) continue;
           $skip = false;
           $start = $continue_start;
       }

       if(!isset($images[$title])) {
           // Add the "missing" record, as in lines 143-148 of r47043
           // except add a break in the !$fit case
           continue;
       }

       $img = $images[$title];
       // Add the "not missing" record, as in lines 68-134 of r47043
       // except use $start instead of $params['start']
   }
  Looks sane, I'll probably do this tomorrow.

...
  prop=info could be done by merging $titles and
$missing (if necessary),
 doing an asort, and using similar logic. The continue would be just the
 title.
  I'll look into that tomorrow as well.

...
  For prop=revisions, I wonder if MySQL would filesort
if you order by
 rev_id/page_id (depending on which is being used in the WHERE clause),
 since I would guess that's the index being used to fetch the rows
 anyway. If not, that could easily work for the continue parameter in
 non-enum mode (enum mode is already OK).
  Apart from non-enum mode, there are two other modes. In titles= mode, 
the query fetches rows using the PRIMARY (rev_page, rev_id) index, which 
can be paged by rev_page (equal to page_id) just fine. In revids= mode, 
it uses the rev_id (rev_id) index, which we can be on as well. All in 
all, this should be an easy one; I'll look into it tomorrow as well.

Roan Kattouw (Catrope)

2024

2023

2022

2021

2020

2019

2018

2017

2016

2015

2014

2013

2012

2011

2010

2009

2008

2007

Re: [Mediawiki-api] [Mediawiki-api-announce] BREAKING CHANGE: Queries may return fewer results than the limit and still set a query-continue