On Mon, Aug 1, 2016 at 11:47 AM, Rob Lanphier <robla(a)wikimedia.org> wrote:
HTML storage comes with its own can of worms, but it seems like a
solution worth thinking about in some form.
1. storage costs (fully rendered HTML would be 5-10 times bigger than
wikitext for that same page, and the gap is much larger if the wikitext is
stored as diffs)
2. evolution of the HTML spec and its effect on old content (this affects the
entire web, so whatever solution works there will work for us as well)
3. newly discovered security holes and retroactively fixing them in
stored HTML and released dumps (not sure).
... and maybe others.
I think these are all reasons why I chose the word "seductive" as
opposed to more unambiguous praise :-) Beyond these reasons, the
bigger issue is that it's an invitation to be sloppy about our
formats. We should endeavor to make our wikitext to html conversion
more scientifically reproducible (i.e. "Nachvollziehbarkeit" as Daniel
Kinzler taught me). Holding a large data store of snapshots seems
like a crutch to avoid the hard work of specifying how this conversion
ought to work. Let's actually nail down the spec for this[2][3]
rather than kidding ourselves into believing we can just store enough
HTML snapshots to make the problem moot.
Specifying wikitext-html conversion sounds like a MediaWiki 2.0 type of
project (i.e. I wouldn't expect it to happen in this decade), and even then it
would not fully solve the problem - e.g. very old versions relied on the
default CSS of a different MediaWiki skin; you need site scripts for some
things such as infobox show/hide functionality to work, but the standard
library those scripts rely on has changed; same for Scribunto scripts.
HTML storage is actually not that bad - browsers are very good at backwards
compatibility with older HTML specs, and there is very little security
footprint in serving static HTML from a separate domain. Storage is a
problem, but there is no need to store every page revision - monthly or
yearly snapshots would be fine IMO. (cf. T17017 - again, Kiwix seems to do
this already, so maybe it's just a matter of coordination.) The only other
practical problem I can think of is that it would preserve
deleted/oversighted information - that problem already exists with the
dumps, but those are not kept for very long (on WMF servers at least).
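
To put rough numbers on the storage point, here is a back-of-the-envelope sketch. Only the 5-10x HTML-to-wikitext ratio comes from this thread; the page count, average wikitext size, and retention period are made-up placeholders, not measurements, and compression is ignored entirely.

```python
def html_snapshot_bytes(pages, avg_wikitext_bytes, html_ratio, snapshots_per_page):
    """Uncompressed bytes needed to keep `snapshots_per_page` rendered-HTML
    snapshots of every page, assuming HTML is `html_ratio` times the size
    of the page's wikitext."""
    return pages * avg_wikitext_bytes * html_ratio * snapshots_per_page


# Hypothetical inputs for illustration only (not real wiki statistics).
PAGES = 1_000_000
AVG_WIKITEXT_BYTES = 10_000
YEARS = 10

if __name__ == "__main__":
    for label, per_year in [("yearly", 1), ("monthly", 12)]:
        for ratio in (5, 10):  # the 5-10x range quoted above
            total = html_snapshot_bytes(
                PAGES, AVG_WIKITEXT_BYTES, ratio, per_year * YEARS
            )
            print(f"{label:8s} snapshots, {ratio:2d}x ratio: {total / 1e12:.1f} TB")
```

The cost scales linearly with the number of snapshots kept per page, so the choice between monthly and yearly snapshots is a factor of 12, and keeping every revision would multiply it by the average revision count instead.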