On Mon, Aug 1, 2016 at 11:47 AM, Rob Lanphier robla@wikimedia.org wrote:
HTML storage comes with its own can of worms, but it seems like a solution worth thinking about in some form.
1. storage costs (fully rendered HTML would be 5-10 times bigger than wikitext for that same page, and the ratio is much larger when the wikitext is stored as diffs)
2. evolution of the HTML spec and its effect on old content (this affects the entire web, so whatever solution works there will work for us as well)
3. newly discovered security holes and retroactively fixing them in stored HTML and released dumps (not sure)
... and maybe others.
I think these are all reasons why I chose the word "seductive" as opposed to more unambiguous praise :-) Beyond these reasons, the bigger issue is that it's an invitation to be sloppy about our formats. We should endeavor to make our wikitext-to-HTML conversion more scientifically reproducible (i.e. "Nachvollziehbarkeit", as Daniel Kinzler taught me). Holding a large data store of snapshots seems like a crutch to avoid the hard work of specifying how this conversion ought to work. Let's actually nail down the spec for this[2][3] rather than kidding ourselves into believing we can just store enough HTML snapshots to make the problem moot.
Specifying wikitext-to-HTML conversion sounds like a MediaWiki 2.0 type of project (i.e. I wouldn't expect it to happen in this decade), and even then it would not fully solve the problem. For example, very old versions relied on the default CSS of a different MediaWiki skin; some things, such as infobox show/hide functionality, need site scripts to work, but the standard library those scripts rely on has changed; and the same goes for Scribunto scripts.
HTML storage is actually not that bad - browsers are very good at backwards compatibility with older HTML specs, and serving static HTML from a separate domain has a very small security footprint. Storage is a problem, but there is no need to store every page revision - monthly or yearly snapshots would be fine IMO. (cf. T17017 - again, Kiwix seems to do this already, so maybe it's just a matter of coordination.) The only other practical problem I can think of is that it would preserve deleted/oversighted information - that problem already exists with the dumps, but those are not kept for very long (on WMF servers at least).
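To make the "monthly snapshots" idea concrete, here is a rough Python sketch of thinning a page's revision history down to one revision per calendar month. The (timestamp, rev_id) input format and the function name monthly_snapshots are made up for illustration; this is not how MediaWiki, the dumps, or Kiwix actually select revisions.

```python
from datetime import datetime


def monthly_snapshots(revisions):
    """Keep only the last revision of each calendar month.

    `revisions` is an iterable of (timestamp, rev_id) pairs. This input
    shape is a hypothetical one chosen for the sketch.
    """
    latest = {}
    # Sorting by timestamp means later revisions overwrite earlier ones
    # that fall in the same (year, month) bucket.
    for ts, rev_id in sorted(revisions):
        latest[(ts.year, ts.month)] = (ts, rev_id)
    # Return the surviving revisions in chronological order.
    return [latest[key] for key in sorted(latest)]


# Made-up revision history for a single page.
revisions = [
    (datetime(2016, 7, 3), 101),
    (datetime(2016, 7, 29), 102),
    (datetime(2016, 8, 1), 103),
]
snapshots = monthly_snapshots(revisions)
```

Only the revisions in `snapshots` would need their rendered HTML stored; a yearly variant would bucket on ts.year alone.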