Hello Herbert.
Herbert Van de Sompel wrote:
- Let me describe the actual status and challenges faced in the
Memento plug-in work:
2.1. The plug-in detects a client's X-Accept-Datetime header, and returns the mediawiki page that was active at the datetime specified in the header. Same for images, actually.
2.2. Display history pages with the template that was active at the time the history page acted as the current one. [Snip] So, we are looking at the mediawiki code to see whether a history page, when rendered, could itself retrieve the appropriate (old) template from the database. If we are successful, we will share that code also at http://www.mediawiki.org/wiki/Extension:Memento once available. It will obviously be up to the mediawiki community whether they are willing to adopt the proposed change to the codebase.
Obviously it's a server issue.
2.3. We have looked into another issue raised by Jakob: Display deleted pages as they existed at the datetime expressed in X-Accept-Datetime. We have actually implemented this. There are 2 caveats:
- as is the case with mediawiki in general, deleted pages are only
accessible by those with appropriate permissions;
- as is the case with mediawiki in general, deleted pages show up in
Edit mode. This code will soon be included at http://www.mediawiki.org/wiki/Extension:Memento
Showing deleted pages in edit mode is not always the case, since they can be rendered (albeit not with the old templates, which would be an interesting enhancement by your work).
It is impressive how far you have gone. However, I don't think you can do a *complete* implementation.
First, you should be aware that timemachining the pages has been tried in the past. Discussions about FlaggedRevs are also relevant for your project. FlaggedRevs is an extension which allows marking the status of a page (e.g. not vandalised) at a point in time. A naive implementation would store the timestamp and get the old version from the archive. They ended up storing the page content, with templates transcluded, in a table specific to the extension. However, FlaggedRevs is a tool to fight vandalism. Yours is an archival one. You could accept imperfect results under certain circumstances.
Problematic aspects:
Page moves/image moves:
* You want to see the content of Foo at epoch, but the history now at Foo is wrong. Instead you need to look at the history of the page now at Foo_(disambiguation). You need to follow the move logs (perhaps even many times) to find out the real page.
Page merges:
* When two pages have been merged, you will want to show the revision which was originally at the page the user wants to timemachine. You can no longer just rely on the timestamps. You may be able to get that by splitting the sources at the merge time and going back via rev_parent_id. Needless to say, this is very inefficient, so this piece wouldn't be put live at Wikipedia.
Partial undeletions:
* When a page is undeleted, the summary shows how many revisions were undeleted, but not *which* ones.
Case:
* Page A has two edits (#1 and #2).
* A vandal adds obscene content to it (#3).
* An admin deletes the page and restores the first two revisions.
* Several months later, the page is completely deleted.
When an admin wants to view what the page looked like during those months, an application is unable to determine whether the two revisions which had been shown were #1 and #2 or perhaps #2 and #3.
revdelete may have similar issues.
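To make the page-move case above concrete, here is a minimal sketch in Python. It assumes the move log is available as a simple list of (timestamp, old title, new title) entries; the `Move` type and the `current_home_of` function are hypothetical names for illustration and do not reflect MediaWiki's actual log schema. It also ignores merges, deletions and titles that did not exist at the requested time.

```python
from datetime import datetime
from typing import NamedTuple


class Move(NamedTuple):
    """One move-log entry (hypothetical shape, not MediaWiki's schema)."""
    when: datetime
    old_title: str
    new_title: str


def current_home_of(title, epoch, moves):
    """Return the title that now holds the page which lived at `title` at `epoch`.

    Every move made after `epoch` that starts from the title we are
    tracking carries the page (and its history) to a new title.
    """
    for move in sorted(moves, key=lambda m: m.when):
        if move.when > epoch and move.old_title == title:
            title = move.new_title
    return title


moves = [
    Move(datetime(2008, 3, 1), "Foo", "Foo_(disambiguation)"),
    Move(datetime(2009, 1, 1), "Foo_(disambiguation)", "Foo_(list)"),
]
# The page that was at "Foo" in mid-2007 has since been moved twice.
home = current_home_of("Foo", datetime(2007, 6, 1), moves)
```

The history to search for the revision active at epoch is then the one attached to the returned title, not the one attached to the title the user asked for.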
2.4. We do not feel that all pages should necessarily be subject to datetime content negotiation, in the same way that not all URIs are subject to content negotiation in other dimensions. We feel that the Special Pages fall under this category, as they do not have History.
2.5. We have ideas regarding how to address the issue raised by Daniel: the timestamp isn't a unique identifier, multiple revisions *might* have the same timestamp. From the perspective of Memento, a datetime is obviously the only "globally" recognizable value that can be used for negotiation. If cases occur where multiple versions of a page exist for the same second, the thing to do according to RFC 2295 would be to return a "300 Multiple Choices", listing the URIs (and metadata) of those versions in an Alternates header. The client then has to take it from there.
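The selection rule described in 2.5 could be sketched roughly as follows in Python. This is only an illustration under assumptions: revisions are given as (rev_id, timestamp) pairs with one-second resolution, the `negotiate` function is a hypothetical name, and the 404 returned when nothing is old enough is an arbitrary choice for the sketch, not something the proposal specifies.

```python
from datetime import datetime


def negotiate(revisions, accept_datetime):
    """Pick the revision(s) active at `accept_datetime`.

    revisions: list of (rev_id, timestamp) pairs.
    Returns (status, rev_ids): 200 with a single revision, 300 when several
    revisions share the winning timestamp, 404 (assumed) when none qualifies.
    """
    candidates = [ts for _, ts in revisions if ts <= accept_datetime]
    if not candidates:
        return 404, []
    latest = max(candidates)
    matches = sorted(rev_id for rev_id, ts in revisions if ts == latest)
    if len(matches) == 1:
        return 200, matches
    # Same-second revisions: the client has to choose among the alternates.
    return 300, matches


revisions = [
    (1, datetime(2009, 1, 1, 12, 0, 0)),
    (2, datetime(2009, 1, 2, 8, 30, 15)),
    (3, datetime(2009, 1, 2, 8, 30, 15)),
]
single = negotiate(revisions, datetime(2009, 1, 1, 18, 0, 0))
ambiguous = negotiate(revisions, datetime(2009, 1, 3, 0, 0, 0))
```

In the ambiguous case the server would list one URI per returned revision in the Alternates header.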
2.6. The caching issue is a general problem arising from introducing Memento in a web that does not (yet) do Memento: when in datetime content negotiation mode all caches between client and server (both included) need to be bypassed. As described in our paper, we currently address this problem by adding the following client headers:
Cache-Control: no-cache => to force cache revalidation, and
If-Modified-Since: Thu, 01 Jan 1970 00:00:00 GMT => to enforce validation failure
We very much understand this is not elegant but it tends to work ;-) .
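For reference, a client request carrying those headers might be built like this with the Python standard library. This is a sketch only: the `memento_request` helper and the example URL are hypothetical, and formatting the X-Accept-Datetime value as an HTTP-date is an assumption rather than something stated here. The request is constructed but not sent.

```python
from datetime import datetime, timezone
from email.utils import format_datetime
from urllib.request import Request


def memento_request(url, when):
    """Build (but do not send) a datetime-negotiation request.

    `when` must be a timezone-aware datetime in UTC.
    """
    headers = {
        # Assumed to use the HTTP-date format.
        "X-Accept-Datetime": format_datetime(when, usegmt=True),
        # Force caches to revalidate ...
        "Cache-Control": "no-cache",
        # ... and make that validation fail, so the origin server answers.
        "If-Modified-Since": "Thu, 01 Jan 1970 00:00:00 GMT",
    }
    return Request(url, headers=headers)


req = memento_request(
    "http://example.org/wiki/Foo",  # hypothetical URL
    datetime(2009, 11, 13, 12, 0, 0, tzinfo=timezone.utc),
)
sent_headers = {name.lower(): value for name, value in req.header_items()}
```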
The caching issue is IMHO the bigger problem in your approach using the new header. Disabling the cache on the request kind of works (although not in the long term), but you also need to disable caching at the server, so that someone requesting the current page through the same proxy (which is ignorant of X-Accept-Datetime) doesn't get the cached page you were served earlier.
RFC 2145 states very clearly that "A proxy MUST forward an unknown header", but in your case it'd have been preferable that the header wasn't forwarded if the proxy isn't memento aware.
Which leads us to another issue, which is that it seems your server implementation doesn't "acknowledge" memento, so given a response to an X-Accept-Datetime request, you don't know if what you're getting is the version you requested or the current one (because the server ignored it). It can be as simple as requiring a Last-Modified <= X-Accept-Datetime on Accept-Datetime responses (that would allow the server to explicitly tell since when it is valid), but extended to all response codes.
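On the client side, the check I have in mind would look something like this Python sketch. It assumes both header values are HTTP-dates in GMT and that headers are available as plain dictionaries; the `honoured_datetime` name is made up for the example.

```python
from email.utils import parsedate_to_datetime


def honoured_datetime(request_headers, response_headers):
    """Return True if the response can be the version valid at the requested time.

    The rule is Last-Modified <= X-Accept-Datetime. A missing header on
    either side is treated as "not acknowledged".
    """
    requested = request_headers.get("X-Accept-Datetime")
    last_modified = response_headers.get("Last-Modified")
    if requested is None or last_modified is None:
        return False
    return parsedate_to_datetime(last_modified) <= parsedate_to_datetime(requested)


request_headers = {"X-Accept-Datetime": "Fri, 13 Nov 2009 12:00:00 GMT"}
old_version = {"Last-Modified": "Sun, 01 Nov 2009 08:00:00 GMT"}
current_version = {"Last-Modified": "Mon, 16 Nov 2009 09:00:00 GMT"}

acknowledged = honoured_datetime(request_headers, old_version)
ignored = honoured_datetime(request_headers, current_version)
```

A response whose Last-Modified is later than the requested datetime cannot be the requested version, so the client knows the header was ignored somewhere along the way.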