Re: [Mediawiki-api] [Mediawiki-api-announce] BREAKING CHANGE: Queries may return fewer results than the limit and still set a query-continue

9 Feb 2009

      On Mon, Feb 09, 2009 at 02:27:15PM +0100, Roan Kattouw wrote:
...
In the comment above the default value there's a big warning saying you 
shouldn't set $wgAPIMaxResultSize to something lower than the maximum 
revision size. Wiki admins who do that anyway are asking for trouble.
Ok, as long as it's documented that should be good enough.
...
The metadata thing is also a non-issue, because it's stored in a 
database field that holds at most 4 MB.
It looks like img_metadata is a MEDIUMBLOB in MySQL, which can hold up
to 2**24 bytes (16 MiB) according to MySQL's documentation. OTOH, in
PostgreSQL it uses a BYTEA, which will go up to 512 MiB in 8.1 and there
is no reason that maximum couldn't be increased in later versions. I
don't see any code limiting the size of the stored data on the MediaWiki
side of things, either.
Of course, I could be missing something.
...
...
rvcontinue doesn't work, it seems that $continue is never extracted from
$params.
Fixed in r47036; forgot to update the patch I applied for the extract() 
crusade I did a while ago.
Looks good to me.
...
...
amfrom is broken if ammessages is not sorted in whatever order PHP uses
for string comparison.
Fixed in r47036 by checking for inequality (!=) rather than less-than 
(<), using the fact that the order of message retrieval is pretty much 
constant (unless of course extensions are installed/removed between two 
requests).
Now amfrom is broken differently: meta=allmessages&amfrom=aboutpage
returns only aboutpage instead of everything starting with aboutpage.
There may be a better way, but one easy way to do it would be like this:
  $skip = !is_null($params['from']);
  foreach( $messages_target as $message ) {
      if($message === $params['from']) $skip = false;
      if(!$skip) $messages[$message] = wfMsg( $message );
  }
...
...
alfrom should probably be used instead of alcontinue if alunique is
specified.
drcontinue must be used instead of drstart in "all" mode.
Fixed in r47036; forgot to update for the recent paging fixes in 
alllinks and the revamp of deletedrevs.
Looks good to me.
...
...
prop=duplicatefiles may have an issue similar to the allmessages problem
mentioned above, it looks like the results aren't sorted by i2.img_name
which could cause repeats or skips if database decides to return them in
any order other than would be given if i2.img_name were included in the
ORDER BY clause.
I'll look into this; sorting by i2.img_name as well is less efficient 
according to a quick EXPLAIN, but I'm not sure if it's too inefficient 
(will ask a database guru). If memory serves, this problem shouldn't 
occur on MySQL databases, though, because they sort stuff by the primary 
key (img_name) by default IIRC.
According to MySQL's documentation, you're right if InnoDB tables are
being used and the query is doing a table scan (or presumably using the
primary key for an index scan); if it uses some other index, this may
not hold true. If MyISAM is used, I believe it's insertion order,
although deletes, OPTIMIZE TABLE, ALTER TABLE ORDER BY, and probably
other stuff screw that up.
If MySQL really is sorting by the i2.img_name by default, theoretically
making that explicit shouldn't make a difference. But knowing MySQL, it
wouldn't surprise me that it does.
...
...
All of these continues that are based on a count (which includes the
existing eloffset, euoffset, and sroffset BTW) are potentially broken.
There are two issues:

If the query isn't strictly ordered, the database is under no
obligation to return the rows in the same order the second time as it
did the first. That means that titles=A|B|C&prop=info could return A
and B with incontinue=2, but titles=A|B|C&prop=info&incontinue=2
could then return A or B again instead of C if the database orders
the rows differently for whatever reason.

Even if the results are sorted, something like
titles=A|B|C|D|E&prop=info&incontinue=2 could skip C if A or B were
deleted between the first and second queries, or repeat B if "A2"
were inserted.

The existing UI for listing external links and search results also uses 
this offset-based approach, which is pretty much why the API does as 
well. You're right about these two issues, and I'll look into them 
deeper later (after I've caught up with what happened this weekend). The 
same thing about MySQL sorting by primary key (IIRC) applies here.
I don't know what else could be done for search results, so I wouldn't
worry about that.
prop=categoryinfo should be easy if ORDER BY cat_title doesn't kill
MySQL. Then you can do the standard thing there using cat_title as the
continue parameter.
prop=extlinks must already be using the el_from index, so hopefully
adding el_to to the ORDER BY won't kill MySQL. Then el_to can be the
continue parameter.
list=exturlusage might be tough. The ideal thing would be to order by
el_index and page_id (with page_id|el_index as the continue parameter),
but I'm not at all confident that wouldn't filesort. If it does, I have
no idea what we could do.
prop=imageinfo looks somewhat tricky. I might change "continue" to be
the concatenation of the image title and the appropriate start value, and
change the logic something like this:
  $titles = array_keys( $pageIds[NS_FILE] );
  sort($titles); // Unless we can depend on the order to always be the same
  $images = RepoGroup::singleton()->findFiles( $titles );
  if(is_null($params['continue'])){
      $skip = false;
  } else {
      $skip = true;
      list($continue_title, $continue_start) = explode('|', $params['continue']);
  }
  foreach ( $titles as $title ) {
      $start = $params['start'];
      if($skip) {
          if($title != $continue_title) continue;
          $skip = false;
          $start = $continue_start;
      }
if(!isset($images[$title])) {
          // Add the "missing" record, as in lines 143-148 of r47043
          // except add a break in the !$fit case
          continue;
      }
$img = $images[$title];
      // Add the "not missing" record, as in lines 68-134 of r47043
      // except use $start instead of $params['start']
  }
prop=info could be done by merging $titles and $missing (if necessary),
doing an asort, and using similar logic. The continue would be just the
title.
For prop=revisions, I wonder if MySQL would filesort if you order by
rev_id/page_id (depending on which is being used in the WHERE clause),
since I would guess that's the index being used to fetch the rows
anyway. If not, that could easily work for the continue parameter in
non-enum mode (enum mode is already OK).
...
Thanks for testing and reporting all this stuff (in just 2 days).
No problem.

2024

2023

2022

2021

2020

2019

2018

2017

2016

2015

2014

2013

2012

2011

2010

2009

2008

2007

Re: [Mediawiki-api] [Mediawiki-api-announce] BREAKING CHANGE: Queries may return fewer results than the limit and still set a query-continue