One possibility worth considering is storing rendered HTML for old
revisions. It lets wikitext (and hence the parser) evolve without breaking
old revisions. In addition, the rendered HTML will use the template
revisions from the time it was rendered, rather than the latest revisions
(this is the problem Memento tries to address).
Long-term HTML archival is something we have been gradually working
towards with RESTBase.
Since HTML is about 10x larger than wikitext, a major concern is storage
cost. Old estimates <https://phabricator.wikimedia.org/T97710> put the
total storage needed to store one HTML copy of each revision at roughly
120T. To reduce this cost, we have since implemented several improvements:
- Brotli compression <https://en.wikipedia.org/wiki/Brotli>, once
deployed, is expected to reduce the total storage needs to about
1/4 to 1/5 of what gzip requires <https://phabricator.wikimedia.org/T122028#2004953>.
- The ability to split latest revisions from old revisions lets us use
cheaper and slower storage for old revisions.
- Retention policies let us specify how many renders per revision we
want to archive. We currently only archive one (the latest) render per
revision, but have the option to store one render per $time_unit. This is
especially important for pages like [[Main Page]], which are rarely edited,
but constantly change their content in meaningful ways via templates. It is
currently not possible to reliably cite such pages without resorting to
external services like archive.org.
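To make the retention policy idea concrete, here is a minimal sketch of "one render per $time_unit". It is not RESTBase's actual implementation; the function name `select_renders` and its parameters are hypothetical, and it assumes render timestamps are naive UTC datetimes. Within each time bucket it keeps only the most recent render.

```python
from datetime import datetime, timedelta


def select_renders(render_times, interval):
    """Return the renders to keep: the latest one in each bucket of length `interval`.

    Hypothetical helper for illustration only; buckets are aligned to the
    Unix epoch, which is an assumption rather than documented behaviour.
    """
    epoch = datetime(1970, 1, 1)
    latest_per_bucket = {}
    for ts in sorted(render_times):
        # Integer index of the time bucket this render falls into.
        bucket = (ts - epoch) // interval
        # Timestamps are visited in ascending order, so later renders
        # overwrite earlier ones in the same bucket.
        latest_per_bucket[bucket] = ts
    return sorted(latest_per_bucket.values())


renders = [
    datetime(2016, 8, 1, 0, 5),
    datetime(2016, 8, 1, 13, 0),
    datetime(2016, 8, 2, 9, 0),
    datetime(2016, 8, 2, 23, 59),
]
kept = select_renders(renders, timedelta(days=1))
```

With a one-day interval, `kept` holds one render for each of the two days. The current "latest render only" policy corresponds to keeping just `max(renders)`.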
Another important requirement for making HTML a useful long-term archival
medium is to establish a clear standard for HTML structures used. The
versioned Parsoid HTML spec
<https://www.mediawiki.org/wiki/Specs/HTML/1.2.1>, along with format
migration logic for old content, are designed to make the stored HTML as
future-proof as possible.
While we currently only have space for a few months' worth of HTML
revisions, we do expect the changes above to make it possible to push this
to years in the foreseeable future without unreasonable hardware needs.
This means that we can start building up an archive of our content in a
format that is not tied to the software.
Faithfully re-rendering old revisions is harder in retrospect. We will
likely have to make some trade-offs between fidelity & effort.
On Mon, Aug 1, 2016 at 2:01 PM, David Gerard <dgerard(a)gmail.com> wrote:
On 1 August 2016 at 17:37, Marc-Andre
We need to find a long-term view to a solution. I don't mean just keeping
old versions of the software around - that would be of limited help. It
would be an interesting nightmare to try to run early versions of phase3,
and probably require managing to make a very very old distro work and
finding the right versions of an ancient apache and PHP. Even *building*
those might end up being a challenge... when is the last time you saw a
working egcs install? I shudder how nigh-impossible the task might be 100
years from now.
oh god yes. I'm having this now, trying to revive an old Slash
installation. I'm not sure I could even reconstruct a box to run it
without compiling half of CPAN circa 2002 from source.
Suggestion: set up a copy of WMF's setup on a VM (or two or three),
save that VM and bundle it off to the Internet Archive as a dated
archive resource. Do this regularly.
Wikitech-l mailing list
Principal Engineer, Wikimedia Foundation