Charlotte Webb wrote:
> On 2/5/08, Tim Starling <tstarling(a)wikimedia.org> wrote:
>> Another way to find a previous revision is to order by rev_id. But rev_id
>> is not monotonic with respect to time, due to various accidents of
>> history. There are step discontinuities, such as those introduced by
>> undeletion or the conversion script. It is however used by the older/newer
>> links on the old revision views. This algorithm is chosen over the
>> timestamp algorithm because at the time the revision is fetched, the
>> timestamp is not known, only the rev_id is known. Finding out the
>> timestamp would require an extra query.
>
> I'm guessing a one-time re-assignment of rev_id's for edits older than
> [timestamp of most recent discontinuity] is out of the question :-)
Yes. :)
To summarize, there are several sources of discontinuity:
1) The old schema
Originally, all current revisions sat in the 'cur' table. A revision was
only assigned an ID once another edit kicked it out of the current slot
and moved it to the 'old' table.
As a result, revision IDs between _different_ pages from 2002-2005 have
no clear time relationship.
Later merging of multiple pages could then make the order within a
page's history inconsistent.
2) The double import from UseMod
Wikipedia originally ran on the Perl-based UseModWiki software until the
predecessor to MediaWiki went live in January 2002. The initial
conversion only kept the *current* version of each page; older revisions
were added some months later on another pass reading from an old backup.
As a result, revisions from before the conversion got revision IDs from
the time of the later import.
They also would have no clear time relationship across multiple pages,
as the converter's progression was sorted first by page, then by
revisions within each page.
3) Undeletion
Until 2006, revision IDs were not maintained across delete+undelete;
undeleted revisions would be given fresh new IDs, regardless of their
original timestamp.
4) Transwiki import
Pages moved across wikis through Special:Import or bulk imports (such as
when moving pages off Incubator) are given fresh new local revision IDs,
which do not relate to their timestamps.
5) Clock skew
Timestamps saved on revisions are taken from the local clock on the web
server front end. Occasionally a machine has an incorrectly set clock,
sometimes off by a few seconds, sometimes by years (there was an incident
with bad BIOS backup batteries and clocks that reset themselves to 2003).
In these rarer cases, the timestamp is "wrong" whereas the revision ID
may give a more accurate picture of where in the sequence the revision
belongs.
Some of these have been detected and cleaned up, but others will have
incorrect timestamps in the logs forever.
This is really the only one that's *wrong* as such, and we may want to
better protect against it in the future. One possible protection could
be to use the database server's timestamp for things instead of the web
front-end's, though that may complicate various things.
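
To make the effect of these discontinuities concrete, here is a minimal
Python sketch, not MediaWiki code. It assumes each revision is just a
(rev_id, timestamp) pair with a 14-digit YYYYMMDDHHMMSS timestamp string;
the function names and the sample history are hypothetical. Revision 250
stands in for an old revision that got a fresh ID through undeletion or
import.

```python
from typing import List, NamedTuple, Optional, Tuple


class Revision(NamedTuple):
    rev_id: int
    timestamp: str  # assumed format: YYYYMMDDHHMMSS, so strings sort by time


def previous_by_rev_id(revs: List[Revision], rev_id: int) -> Optional[Revision]:
    """Previous revision using only the ID (no timestamp lookup needed)."""
    candidates = [r for r in revs if r.rev_id < rev_id]
    return max(candidates, key=lambda r: r.rev_id, default=None)


def previous_by_timestamp(revs: List[Revision], rev_id: int) -> Optional[Revision]:
    """Previous revision by time; needs the current revision's timestamp first."""
    # This lookup is the "extra query"; raises StopIteration if rev_id is absent.
    current = next(r for r in revs if r.rev_id == rev_id)
    key = (current.timestamp, current.rev_id)
    candidates = [r for r in revs if (r.timestamp, r.rev_id) < key]
    return max(candidates, key=lambda r: (r.timestamp, r.rev_id), default=None)


def discontinuities(revs: List[Revision]) -> List[Tuple[Revision, Revision]]:
    """Adjacent pairs (in rev_id order) where the timestamp goes backwards."""
    ordered = sorted(revs, key=lambda r: r.rev_id)
    return [(a, b) for a, b in zip(ordered, ordered[1:]) if b.timestamp < a.timestamp]


# Hypothetical history of a single page.
history = [
    Revision(100, "20040101000000"),
    Revision(101, "20040201000000"),
    Revision(250, "20030601000000"),  # old edit, given a fresh ID later
    Revision(300, "20050101000000"),
]
```

With this sample, previous_by_rev_id(history, 300) returns revision 250
(a 2003 edit), while previous_by_timestamp(history, 300) returns revision
101, and discontinuities(history) reports the single pair (101, 250).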
-- brion vibber (brion @ wikimedia.org)