On Mon, Aug 1, 2016 at 9:51 AM, Subramanya Sastry ssastry@wikimedia.org wrote:
On 08/01/2016 11:37 AM, Marc-Andre wrote:
Is there something we can do to make the passage of years hurt less? Should we be laying groundwork now to prevent issues decades away?
One possibility is to consider storing rendered HTML for old revisions. It lets wikitext (and hence the parser) evolve without breaking old revisions. Plus, rendered HTML will use the template revision at the time it was rendered rather than the latest revision (this is the problem Memento tries to solve).
This is a seductive path to choose. Maintaining backwards compatibility for poorly conceived (in retrospect) engineering decisions is really hard work. A lot of the cruft and awfulness of enterprise-focused software comes from dealing with the seemingly endless torrent of edge cases which are often backwards-compatibility issues in the systems/formats/databases/protocols that the software depends on. The [Y2K problem][1] was a global lesson in the importance of intelligently paying down technical debt.
You outline the problems with this approach in the remainder of your email....
HTML storage comes with its own can of worms, but it seems like a solution worth thinking about in some form.
1. storage costs (fully rendered HTML would be 5-10 times bigger than wikitext for that same page, and much larger if stored as wikitext diffs)
2. evolution of the HTML spec and its effect on old content (this affects the entire web, so whatever solution works there will work for us as well)
3. newly discovered security holes and retroactively fixing them in stored HTML and released dumps (not sure)
... and maybe others.
I think these are all reasons why I chose the word "seductive" as opposed to more unambiguous praise :-) Beyond these reasons, the bigger issue is that it's an invitation to be sloppy about our formats. We should endeavor to make our wikitext-to-HTML conversion more scientifically reproducible (i.e. "Nachvollziehbarkeit" as Daniel Kinzler taught me). Holding a large data store of snapshots seems like a crutch to avoid the hard work of specifying how this conversion ought to work. Let's actually nail down the spec for this[2][3] rather than kidding ourselves into believing we can just store enough HTML snapshots to make the problem moot.
Rob
[1]: https://en.wikipedia.org/wiki/Year_2000_problem
[2]: https://www.mediawiki.org/wiki/Markup_spec
[3]: https://www.mediawiki.org/wiki/Parsoid/MediaWiki_DOM_spec