One possibility is storing rendered HTML for old revisions. This lets wikitext (and hence the parser) evolve without breaking old revisions.
Plus, rendered HTML will use the template revision at the time it was rendered, rather than the latest revision (this is the problem Memento tries to solve).
Long-term HTML archival is something we have been gradually working towards with RESTBase.
Since HTML is about 10x larger than wikitext, a major concern is storage cost. Old estimates (https://phabricator.wikimedia.org/T97710) put the total storage needed to store one HTML copy of each revision at roughly 120T. To reduce this cost, we have since implemented several improvements (https://phabricator.wikimedia.org/T93751); a rough sketch of the resulting arithmetic follows the list:
- Brotli compression (https://en.wikipedia.org/wiki/Brotli), once deployed, is expected to reduce the total storage needs to about 1/4 to 1/5 of gzip's (https://phabricator.wikimedia.org/T122028#2004953).
- The ability to split latest revisions from old revisions lets us use cheaper and slower storage for old revisions.
- Retention policies let us specify how many renders per revision we want to archive. We currently archive only one (the latest) render per revision, but have the option to store one render per $time_unit. This is especially important for pages like [[Main Page]], which are rarely edited, but constantly change their content in meaningful ways via templates. It is currently not possible to reliably cite such pages without resorting to external services like archive.org.
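As a back-of-the-envelope illustration of what the compression gains buy us (the gzip baseline and brotli ratios are simply the figures cited above, used as assumptions rather than measurements):

    # Rough storage arithmetic for one archived HTML render per revision,
    # using the figures cited above (T97710, T122028) as assumptions.

    GZIP_TOTAL_TB = 120.0            # old estimate: one HTML copy per revision, gzipped
    BROTLI_VS_GZIP = (1 / 5, 1 / 4)  # brotli expected to need ~1/5 to 1/4 of gzip's space

    low = GZIP_TOTAL_TB * BROTLI_VS_GZIP[0]
    high = GZIP_TOTAL_TB * BROTLI_VS_GZIP[1]
    print(f"With brotli: roughly {low:.0f}-{high:.0f} TB for one render per revision")
    # -> With brotli: roughly 24-30 TB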
Another important requirement for making HTML a useful long-term archival medium is to establish a clear standard for the HTML structures used. The versioned Parsoid HTML spec (https://www.mediawiki.org/wiki/Specs/HTML/1.2.1), along with format migration logic for old content, is designed to make the stored HTML as future-proof as possible.
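To make the versioning concrete, here is a minimal sketch of how a consumer might decide whether stored HTML needs migration. The versioned profile URL in the content-type is how Parsoid advertises its spec version; the parsing and the migration rule here are illustrative assumptions, not RESTBase's actual code:

    import re

    # Decide whether a stored Parsoid HTML blob needs format migration,
    # based on the spec version advertised in its content-type "profile"
    # parameter. Illustrative sketch only.

    CURRENT_SPEC = (1, 2, 1)

    def spec_version(content_type):
        m = re.search(r'Specs/HTML/(\d+)\.(\d+)\.(\d+)', content_type)
        if not m:
            raise ValueError("no versioned HTML profile in content-type")
        return tuple(int(g) for g in m.groups())

    def needs_migration(content_type):
        # Content with an older major version gets run through migration
        # logic; same-major content can be consumed directly.
        return spec_version(content_type)[0] != CURRENT_SPEC[0]

    ct = 'text/html; profile="https://www.mediawiki.org/wiki/Specs/HTML/1.2.1"'
    print(spec_version(ct), needs_migration(ct))  # (1, 2, 1) False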
While we currently only have space for a few months' worth of HTML revisions, we do expect the changes above to make it possible to push this to years in the foreseeable future without unreasonable hardware needs. This means that we can start building up an archive of our content in a format that is not tied to the software.
Faithfully re-rendering old revisions after the fact is harder. We will likely have to make some trade-offs between fidelity & effort.
Gabriel
On Mon, Aug 1, 2016 at 2:01 PM, David Gerard dgerard@gmail.com wrote:
On 1 August 2016 at 17:37, Marc-Andre marc@uberbox.org wrote:
We need to find a long-term view to a solution. I don't mean just keeping old versions of the software around - that would be of limited help. It'd be an interesting nightmare to try to run early versions of phase3 nowadays, and probably require managing to make a very very old distro work and finding the right versions of an ancient apache and PHP. Even *building* those might end up being a challenge... when is the last time you saw a working egcs install? I shudder to think how nigh-impossible the task might be 100 years from now.
oh god yes. I'm having this now, trying to revive an old Slash installation. I'm not sure I could even reconstruct a box to run it without compiling half of CPAN circa 2002 from source.
Suggestion: set up a copy of WMF's setup on a VM (or two or three), save that VM and bundle it off to the Internet Archive as a dated archive resource. Do this regularly.
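A minimal sketch of what that regular "bundle it off" step could look like, assuming the `internetarchive` Python client and a pre-exported VM image; the item identifier, export path, and metadata are hypothetical:

    from datetime import date
    from internetarchive import upload

    # Upload a dated VM image of the stack to the Internet Archive.
    # Naming scheme and paths are hypothetical; internetarchive.upload()
    # is the client's documented entry point.
    today = date.today().isoformat()
    upload(
        f"wmf-mediawiki-vm-{today}",                # hypothetical item identifier
        files=[f"/exports/wmf-setup-{today}.ova"],  # hypothetical export path
        metadata={
            "title": f"WMF MediaWiki stack VM image, {today}",
            "mediatype": "software",
            "date": today,
        },
    )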
- d.