Charlotte Webb wrote:
> On 2/5/08, Tim Starling <tstarling@wikimedia.org> wrote:
>> Another way to find a previous revision is to order by rev_id. But rev_id is not monotonic with respect to time, due to various accidents of history. There are step discontinuities, such as those introduced by undeletion or the conversion script. It is however used by the older/newer links on the old revision views. This algorithm is chosen over the timestamp algorithm because at the time the revision is fetched, the timestamp is not known, only the rev_id is known. Finding out the timestamp would require an extra query.
> I'm guessing a one-time re-assignment of rev_id's for edits older than [timestamp of most recent discontinuity] is out of the question :-)
Yes. :)
To summarize, there are several sources of discontinuity:
1) The old schema
Originally, all current revisions sat in the 'cur' table. A revision was only assigned an ID once another edit kicked it out of the current slot and moved it to the 'old' table.
As a result, revision IDs between _different_ pages from 2002-2005 have no clear time relationship.
Later merging of multiple pages could then make the order within a page's history inconsistent.
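To make the effect concrete, here is a minimal sketch of that ID assignment in Python. It is a toy model, not the old MediaWiki code: the function name, the tuple layout and the integer timestamps are all made up for illustration.

```python
from itertools import count

def simulate_old_schema(edits):
    """Toy model of the cur/old split (hypothetical, for illustration only).

    edits: list of (timestamp, page) tuples in chronological order.
    Returns the rows of the 'old' table as (rev_id, timestamp, page).
    """
    next_id = count(1)
    cur = {}   # page -> timestamp of its current revision (no ID yet)
    old = []   # archived revisions, which is where IDs get handed out
    for ts, page in edits:
        if page in cur:
            # The previous current revision is displaced and only now gets an ID.
            old.append((next(next_id), cur[page], page))
        cur[page] = ts
    return old

# Page A is edited at t=1 and again at t=4; page B at t=2 and t=3.
old_rows = simulate_old_schema([(1, "A"), (2, "B"), (3, "B"), (4, "A")])
for rev_id, ts, page in old_rows:
    print(rev_id, ts, page)
```

In this run the revision of B saved at t=2 gets the lower ID, and the revision of A saved at t=1 gets the higher one, simply because B was edited again first.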
2) The double import from UseMod
Wikipedia originally ran on the Perl-based UseModWiki software until the predecessor to MediaWiki went live in January 2002. The initial conversion only kept the *current* version of each page; older revisions were added some months later on another pass reading from an old backup.
As a result, revisions from before the conversion got revision IDs from the time of the later import.
They also have no clear time relationship across multiple pages, as the converter's progression was sorted first by page, then by revisions within each page.
3) Undeletion
Until 2006, revision IDs were not maintained across delete+undelete; undeleted revisions would be given fresh new IDs, regardless of their original timestamp.
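A rough sketch of how this can scramble the order even within a single page, assuming a hypothetical scenario where a page is deleted, recreated, and then has its old revisions restored. The `edit`, `delete` and `undelete` helpers and the in-memory lists are stand-ins for illustration, not the real tables or code.

```python
from itertools import count

next_id = count(1)
revision = []   # live rows: (rev_id, timestamp, page)
archive = []    # deleted rows: (timestamp, page); the old ID is not kept

def edit(ts, page):
    revision.append((next(next_id), ts, page))

def delete(page):
    # Move the page's rows to the archive, dropping their IDs.
    archive.extend((ts, p) for _, ts, p in revision if p == page)
    revision[:] = [r for r in revision if r[2] != page]

def undelete(page):
    # Restore archived rows with fresh IDs, as happened before 2006.
    for ts, p in sorted(r for r in archive if r[1] == page):
        revision.append((next(next_id), ts, p))
    archive[:] = [r for r in archive if r[1] != page]

edit(1, "A")
edit(2, "A")
delete("A")
edit(3, "A")      # page recreated while the old revisions are deleted
undelete("A")     # old revisions come back with new, higher IDs

by_id = [ts for _, ts, _ in sorted(revision)]
by_time = sorted(by_id)
print("timestamps in rev_id order:", by_id)
print("timestamps in time order:  ", by_time)
```

Walking the history by rev_id here visits the newest revision first and the two older ones after it, which is the kind of step discontinuity the older/newer links would run into.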
4) Transwiki import
Pages moved across wikis through Special:Import or bulk imports (such as when moving pages off Incubator) are given fresh new local revision IDs, which do not relate to their timestamps.
5) Clock skew
Timestamps saved on revisions are taken from the local clock on the web server front end. Some machines occasionally have incorrectly set clocks, sometimes off by a few seconds, sometimes by years (there was an incident with bad BIOS backup batteries and clocks that reset themselves to 2003).
In these rarer cases, the timestamp is "wrong" whereas the revision ID may be seen to show a more accurate picture of where in the sequence it belongs.
Some of these have been detected and cleaned up, but others will have incorrect timestamps in the logs forever.
This is really the only one that's *wrong* as such, and we may want to better protect against it in the future. One possible protection could be to use the database server's timestamp for things instead of the web front-end's, though that may complicate various things.
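As a rough illustration of the database-side timestamp idea, here is a small sketch using Python's sqlite3 module with an in-memory database. The table and column names are simplified stand-ins and `save_revision` is a hypothetical helper; the real schema, timestamp format and database server are different, so treat this only as a picture of the approach.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE revision (
        rev_id        INTEGER PRIMARY KEY AUTOINCREMENT,
        rev_text      TEXT NOT NULL,
        -- Filled in by the database, not by the web front end.
        rev_timestamp TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
    )
""")

def save_revision(text):
    # The caller never supplies a timestamp, so a skewed front-end
    # clock has no way to leak into the stored row.
    cur = conn.execute("INSERT INTO revision (rev_text) VALUES (?)", (text,))
    return cur.lastrowid

first = save_revision("first edit")
second = save_revision("second edit")

rows = conn.execute(
    "SELECT rev_id, rev_timestamp FROM revision ORDER BY rev_id"
).fetchall()
for rev_id, ts in rows:
    print(rev_id, ts)
```

With a single clock stamping every row, rev_id order and timestamp order agree for new saves, as long as that one clock does not jump backwards. It does nothing for the historical discontinuities listed above.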
-- brion vibber (brion @ wikimedia.org)