The order of things

List overview All Threads
Download

newer

older

Re: [Wikitech-l] [MediaWiki-CVS]...

SVG validity checks

Magnus Manske

5 Feb 2008 5 Feb '08

11:17 p.m.

For [[Oxygen]], the first edit logged in the database is mine, September 15, *2001*. However, when you click on it:

http://en.wikipedia.org/w/index.php?title=Oxygen&oldid=271622

be surprised to find a link to "older version", pointing to an edit from September 2, *2002*:

http://en.wikipedia.org/w/index.php?title=Oxygen&direction=prev&oldi...

When you keep following the "older version" trail, you'll end up at the conversion script edit of February 12, *2002*:

http://en.wikipedia.org/w/index.php?title=Oxygen&direction=prev&oldi...

I vaguely remember a version-order bug from ancient times. Is that the same one?

Just curious, Magnus

Show replies by date

Simetrical

5 Feb 5 Feb

11:21 p.m.

On Feb 5, 2008 11:17 AM, Magnus Manske magnusmanske@googlemail.com wrote:

...

For [[Oxygen]], the first edit logged in the database is mine, September 15, *2001*. However, when you click on it:

http://en.wikipedia.org/w/index.php?title=Oxygen&oldid=271622

be surprised to find a link to "older version", pointing to an edit from September 2, *2002*:

http://en.wikipedia.org/w/index.php?title=Oxygen&direction=prev&oldi...

It's sorting by rev_id, apparently, not date. Presumably the import script didn't preserve the original ID order.

Charlotte Webb

11:31 p.m.

On 2/5/08, Magnus Manske magnusmanske@googlemail.com wrote:

...

For [[Oxygen]], the first edit logged in the database is mine, September 15, *2001*. However, when you click on it:

http://en.wikipedia.org/w/index.php?title=Oxygen&oldid=271622

be surprised to find a link to "older version", pointing to an edit from September 2, *2002*:

http://en.wikipedia.org/w/index.php?title=Oxygen&direction=prev&oldi...

When you keep following the "older version" trail, you'll end up at the conversion script edit of February 12, *2002*:

http://en.wikipedia.org/w/index.php?title=Oxygen&direction=prev&oldi...

Simply explained, the "history" tab sorts edits properly by date/time, but the arrows in the "diff view" navigate through edits in order of "oldid" number (which doesn't always correspond). The latter should probably be fixed to match the former.

—C.W.

Tim Starling

11:35 p.m.

Magnus Manske wrote:

...

For [[Oxygen]], the first edit logged in the database is mine, September 15, *2001*. However, when you click on it:

http://en.wikipedia.org/w/index.php?title=Oxygen&oldid=271622

be surprised to find a link to "older version", pointing to an edit from September 2, *2002*:

http://en.wikipedia.org/w/index.php?title=Oxygen&direction=prev&oldi...

When you keep following the "older version" trail, you'll end up at the conversion script edit of February 12, *2002*:

http://en.wikipedia.org/w/index.php?title=Oxygen&direction=prev&oldi...

I vaguely remember a version-order bug from ancient times. Is that the same one?

Timestamps have one-second resolution and so are not unique. So ordering by timestamp, "SELECT rev_id FROM revision WHERE rev_timestamp<$t ORDER BY rev_timestamp DESC LIMIT 1" to find a previous revision introduces bugs, because faced with non-unique timestamps, it would skip revisions. This is the algorithm used by the prev/next links on the history page.

Another way to find a previous revision is to order by rev_id. But rev_id is not monotonic with respect to time, due to various accidents of history. There are step discontinuities, such as those introduced by undeletion or the conversion script. It is however used by the older/newer links on the old revision views. This algorithm is chosen over the timestamp algorithm because at the time the revision is fetched, the timestamp is not known, only the rev_id is known. Finding out the timestamp would require an extra query.

Neither rev_timestamp nor rev_id provides a way to order revisions precisely. The way to do it properly is to use the rev_id only to resolve conflicts when multiple revisions share the requested page/timestamp pair. Unfortunately this is slower and more complex than either of the other two methods.

-- Tim Starling

Charlotte Webb

11:48 p.m.

On 2/5/08, Tim Starling tstarling@wikimedia.org wrote:

...

Another way to find a previous revision is to order by rev_id. But rev_id is not monotonic with respect to time, due to various accidents of history. There are step discontinuities, such as those introduced by undeletion or the conversion script. It is however used by the older/newer links on the old revision views. This algorithm is chosen over the timestamp algorithm because at the time the revision is fetched, the timestamp is not known, only the rev_id is known. Finding out the timestamp would require an extra query.

I'm guessing a one-time re-assignment of rev_id's for edits older than [timestamp of most recent discontinuity] is out of the question :-)

—C.W.

Brion Vibber

6 Feb 6 Feb

1:04 a.m.

Charlotte Webb wrote:

...

On 2/5/08, Tim Starling tstarling@wikimedia.org wrote:

...
Another way to find a previous revision is to order by rev_id. But rev_id is not monotonic with respect to time, due to various accidents of history. There are step discontinuities, such as those introduced by undeletion or the conversion script. It is however used by the older/newer links on the old revision views. This algorithm is chosen over the timestamp algorithm because at the time the revision is fetched, the timestamp is not known, only the rev_id is known. Finding out the timestamp would require an extra query.

I'm guessing a one-time re-assignment of rev_id's for edits older than [timestamp of most recent discontinuity] is out of the question :-)

Yes. :)

To summarize, there are several sources of discontinuity:

1) The old schema

Originally, all current revisions sat in the 'cur' table. Revision IDs were only assigned once another edit kicked it out of the current slot and moved it to the 'old' table.

As a result, revision IDs between _different_ pages from 2002-2005 have no clear time relationship.

Later merging of multiple pages could then make the order within a page's history inconsistent.

2) The double import from UseMod

Wikipedia originally ran on the perl-based UseModWiki software until the predecessor to MediaWiki went live in January 2002. The initial conversion only kept the *current* version of each page; older revisions were added some months later on another pass reading from an old backup.

As a result, revisions from before the conversion got revision IDs from the time of the later import.

They also would have no clear time relationship across multiple pages, as the convertor's progression was sorted first by page, then by revisions within each page.

3) Undeletion

Until 2006, revision IDs were not maintained across delete+undelete; undeleted revisions would be given fresh new IDs, regardless of their original timestamp.

4) Transwiki import

Pages moved across wikis through Special:Import or bulk imports (such as when moving pages off Incubator) are given fresh new local revision IDs, which do not relate to their timestamps.

5) Clock skew

Timestamps saved on revisions are made from the local clock on the web server front end. It sometimes happens that some machines have incorrectly set clocks; sometimes by a few seconds, sometimes by years (there was an incident with bad BIOS backing batteries and clocks that reset themselves to 2003).

In these rarer cases, the timestamp is "wrong" whereas the revision ID may be seen to show a more accurate picture of where in the sequence it belongs.

Some of these have been detected and cleaned up, but others will have incorrect timestamps in the logs forever.

This is really the only one that's *wrong* as such, and we may want to better protect against it in the future. One possible protection could be to use the database server's timestamp for things instead of the web front-end's, though that may complicate various things.

-- brion vibber (brion @ wikimedia.org)

Roan Kattouw

2:29 a.m.

On this whole revid vs. timestamp thing: I think I read something a while ago (don't remember where) about adding rev_previd and rev_nextid fields, each pointing to the previous/next revision ID (unless of course it doesn't exist, in which case it would be NULL; does MySQL distinguish between NULL and 0, anyway?). This would make the prev/next links less expensive to generate (only one DB query rather than 3), but would cause an extra UPDATE when adding revisions; also, merging histories could become a nightmare: in a worst-case scenario, merging N revisions into a history of 2N revisions would cause all rows to be changed (3 times as many as is currently the case). Also, filling these fields for existing revisions would be a huge operation (maybe this could be done on the fly?).

Thoughts?

Roan Kattouw (Catrope)

Brion Vibber

2:51 a.m.

Roan Kattouw wrote:

...

On this whole revid vs. timestamp thing: I think I read something a while ago (don't remember where) about adding rev_previd and rev_nextid fields, each pointing to the previous/next revision ID (unless of course it doesn't exist, in which case it would be NULL; does MySQL distinguish between NULL and 0, anyway?). This would make the prev/next links less expensive to generate (only one DB query rather than 3), but would cause an extra UPDATE when adding revisions; also, merging histories could become a nightmare: in a worst-case scenario, merging N revisions into a history of 2N revisions would cause all rows to be changed (3 times as many as is currently the case). Also, filling these fields for existing revisions would be a huge operation (maybe this could be done on the fly?).

There's been occasional talk, and at one point some code went in and I think got taken back out temporarily.

One thing to consider is that we could actually *indicate* those non-contiguous merged sections in some useful way, like a (shudder) tree view....... maybe.... sort of. :D

Everything gets horrifyingly ugly when you try to do anything other than straight timelines, though.

-- brion

Roan Kattouw

3:10 a.m.

Brion Vibber schreef:

...

One thing to consider is that we could actually *indicate* those non-contiguous merged sections in some useful way, like a (shudder) tree view....... maybe.... sort of. :D

Everything gets horrifyingly ugly when you try to do anything other than straight timelines, though.

We could make merging histories a separate feature (as opposed to the move+undelete hack we use now) and let the user specify where to insert revisions. Right now they're just inserted chronologically, which might not always make sense. Of course this requires a lot of coding and schema changes and stuff, so we probably wanna put that off until after the 1.12 release (if we're even gonna do it in the first place).

Roan Kattouw (Catrope)

P.S.: About that, any idea when the 1.12 release will be, Brion?

Thomas Dalton

3:32 a.m.

...

Right now they're just inserted chronologically, which might not always make sense.

You're being kind (something I've never been accused of)... it never makes any sense at all... it's far easier than any of the alternatives, so I understand why it's done that way, but it doesn't make sense.

Magnus Manske

4:45 a.m.

On Feb 5, 2008 7:51 PM, Brion Vibber brion@wikimedia.org wrote:

...

Roan Kattouw wrote:

...
On this whole revid vs. timestamp thing: I think I read something a while ago (don't remember where) about adding rev_previd and rev_nextid fields, each pointing to the previous/next revision ID (unless of course it doesn't exist, in which case it would be NULL; does MySQL distinguish between NULL and 0, anyway?). This would make the prev/next links less expensive to generate (only one DB query rather than 3), but would cause an extra UPDATE when adding revisions; also, merging histories could become a nightmare: in a worst-case scenario, merging N revisions into a history of 2N revisions would cause all rows to be changed (3 times as many as is currently the case). Also, filling these fields for existing revisions would be a huge operation (maybe this could be done on the fly?).

There's been occasional talk, and at one point some code went in and I think got taken back out temporarily.

One thing to consider is that we could actually *indicate* those non-contiguous merged sections in some useful way, like a (shudder) tree view....... maybe.... sort of. :D

Everything gets horrifyingly ugly when you try to do anything other than straight timelines, though.

Just thinking out loud here...

If timestamps are not unique, and revisions might not be ordered, wouldn't ORDER BY revision GROUP BY timestamp order it by timestamp, and then by revision number if two or more identical timestamps are found?

That would get around the two-query issue. Or is that too expensive?

BTW, thanks Brion for the detailed explanation! :-)

Cheers, Magnus

Brion Vibber

5:16 a.m.

Magnus Manske wrote:

...

Just thinking out loud here...

If timestamps are not unique, and revisions might not be ordered, wouldn't ORDER BY revision GROUP BY timestamp order it by timestamp, and then by revision number if two or more identical timestamps are found?

That would get around the two-query issue. Or is that too expensive?

That should work reasonably well in theory, if there was an index for it; without an index I'm not convinced it'll sort efficiently. In theory it'd be simple as you'd only have to reorder small bits here and there in the stream, but the database may not know that. :)

-- brion

Thomas Dalton

5:26 a.m.

On 05/02/2008, Brion Vibber brion@wikimedia.org wrote:

...

Magnus Manske wrote:

...
Just thinking out loud here...

If timestamps are not unique, and revisions might not be ordered, wouldn't ORDER BY revision GROUP BY timestamp order it by timestamp, and then by revision number if two or more identical timestamps are found?

That would get around the two-query issue. Or is that too expensive?

That should work reasonably well in theory, if there was an index for it; without an index I'm not convinced it'll sort efficiently. In theory it'd be simple as you'd only have to reorder small bits here and there in the stream, but the database may not know that. :)

Can you not create a new index for it?

Simetrical

8:56 a.m.

On Feb 5, 2008 2:29 PM, Roan Kattouw roan.kattouw@home.nl wrote:

...

does MySQL distinguish between NULL and 0, anyway?

Yes. Everything except Oracle does, as far as I know (and for Oracle maybe it's only empty strings that are NULL, I can't remember).

...

This would make the prev/next links less expensive to generate (only one DB query rather than 3), but would cause an extra UPDATE when adding revisions

Three very cheap selects versus one (since when do we need three selects for this anyway?) is not a big difference. Nor is an extra update in the extremely rare case of adding a new revision.

...

Also, filling these fields for existing revisions would be a huge operation (maybe this could be done on the fly?).

Probably not as big a deal as altering the table in the first place. :)

On Feb 5, 2008 2:51 PM, Brion Vibber brion@wikimedia.org wrote:

...

Everything gets horrifyingly ugly when you try to do anything other than straight timelines, though.

That's exactly what I always said to David about LQT. He's sticking to trees, and look where that's gotten. :P

On Feb 5, 2008 4:45 PM, Magnus Manske magnusmanske@googlemail.com wrote:

...

If timestamps are not unique, and revisions might not be ordered, wouldn't ORDER BY revision GROUP BY timestamp order it by timestamp, and then by revision number if two or more identical timestamps are found?

GROUP BY x ORDER BY y is a filesort in MySQL, always. No index use is possible for the sorting, in any version to date. You could, however, do ORDER BY rev_timestamp, rev_id. That would work with an appropriate index, one beginning with the columns needed for retrieval and ending with (rev_timestamp, rev_id), *if* the retrieval condition were appropriate -- no range scan on anything but rev_timestamp, and no range scan on rev_timestamp other than comparison, like no IN( ... ). Something of the pattern SELECT ... WHERE rev_page = X AND rev_timestamp >= Y ORDER BY rev_timestamp, rev_id would work as desired with an index on (rev_page, rev_timestamp, rev_id).

Note that I'm not sure if we want rev_timestamp > Y or rev_timestamp

...

= Y. Either one is, I think, actually wrong, in a case where you

have three values of (rev_id, rev_timestamp) like (1, 20080205000000), (2, 20080205000000), (3, 20080205000000), and are trying to find the revision before or after revision 2. I guess you would need WHERE rev_page = X AND (rev_timestamp > Y OR (rev_timestamp = Y AND rev_id > Z)). That would have to be rewritten as a UNION to permit index use before 5.0ish (and in 5.0ish still, unless you feel lucky), and I *think* the UNION would be perfectly efficient using the same index, but this is getting kind of hairy. :)

On Feb 5, 2008 5:26 PM, Thomas Dalton thomas.dalton@gmail.com wrote:

...

Can you not create a new index for it?

Every extra index means slower inserts and slower deletes. Unneeded indexes should be avoided. In this case, I wonder if it's not best to just do the logic in PHP in the extremely rare case where two revisions have the same timestamp, rather than adding what amounts to a penalty on *every single query on the revision table* of a particular form.

6160

Age (days ago)

6161

Last active (days ago)

wikitech-l@lists.wikimedia.org

13 comments

7 participants

tags (0)

participants (7)

Brion Vibber
Charlotte Webb
Magnus Manske
Roan Kattouw
Simetrical
Thomas Dalton
Tim Starling