On Mon, Aug 1, 2016 at 9:51 AM, Subramanya Sastry <ssastry(a)wikimedia.org> wrote:
On 08/01/2016 11:37 AM, Marc-Andre wrote:
Is there something we can do to make the passage of years hurt less?
Should we be laying groundwork now to prevent issues decades away?
One possibility is to store rendered HTML for old revisions. That lets
wikitext (and hence the parser) evolve without breaking old revisions. Plus,
the rendered HTML would reflect the template revision at the time it was
rendered rather than the latest revision (this is the problem Memento tries
to solve).
This is a seductive path to choose. Maintaining backwards
compatibility for poorly conceived (in retrospect) engineering
decisions is really hard work. A lot of the cruft and awfulness of
enterprise-focused software comes from dealing with the seemingly
endless torrent of edge cases which are often backwards-compatibility
issues in the systems/formats/databases/protocols that the software
depends on. The [Y2K problem] was a global lesson in the
importance of intelligently paying down technical debt.
You outline the problems with this approach in the remainder of your email....
HTML storage comes with its own can of worms, but it seems like a solution
worth thinking about in some form.
1. storage costs (fully rendered HTML would be 5-10 times bigger than the
wikitext for that same page, and the gap is much larger if the wikitext is
stored as diffs)
2. evolution of the HTML spec and its effect on old content (this affects the
entire web, so whatever solution works there will work for us as well)
3. newly discovered security holes and retroactively fixing them in stored
HTML and released dumps (not sure).
... and maybe others.
I think these are all reasons why I chose the word "seductive" as
opposed to more unambiguous praise :-) Beyond these reasons, the
bigger issue is that it's an invitation to be sloppy about our
formats. We should endeavor to make our wikitext-to-HTML conversion
more scientifically reproducible (i.e. "Nachvollziehbarkeit" as Daniel
Kinzler taught me). Holding a large data store of snapshots seems
like a crutch to avoid the hard work of specifying how this conversion
ought to work. Let's actually nail down the spec for this
rather than kidding ourselves into believing we can just store enough
HTML snapshots to make the problem moot.
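
To make the reproducibility point a bit more concrete, here is a rough
sketch (an illustration only, not how MediaWiki or Parsoid actually work).
It assumes the conversion can be treated as a pure function of the wikitext,
a pinned version of the conversion spec, and the exact template revisions in
use. All names here (RenderInputs, RENDERERS, render_toy_v1, "toy-v1") are
hypothetical, and the "renderer" is a toy that only expands {{name}}. Under
those assumptions, a stored fingerprint is enough to check that an old
revision still renders the same way, without keeping the HTML itself.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class RenderInputs:
    """Everything the (hypothetical) conversion is allowed to depend on."""
    wikitext: str
    spec_version: str          # hypothetical label for a pinned conversion spec
    template_revisions: tuple  # (template name, template text) pairs, pinned


def render_toy_v1(inputs: RenderInputs) -> str:
    """Toy stand-in for a real converter: expands {{name}} and wraps in <p>."""
    html = inputs.wikitext
    for name, body in inputs.template_revisions:
        html = html.replace("{{" + name + "}}", body)
    return "<p>" + html + "</p>"


# One renderer per spec version, so old revisions keep their old semantics.
RENDERERS = {"toy-v1": render_toy_v1}


def render(inputs: RenderInputs) -> str:
    return RENDERERS[inputs.spec_version](inputs)


def fingerprint(html: str) -> str:
    """Short, storable check value for a rendering."""
    return hashlib.sha256(html.encode("utf-8")).hexdigest()


# Same wikitext, two different pinned template revisions.
old = RenderInputs("Hello {{sig}}", "toy-v1", (("sig", "-- A"),))
new = RenderInputs("Hello {{sig}}", "toy-v1", (("sig", "-- B"),))

recorded = fingerprint(render(old))

# Re-rendering from the same pinned inputs reproduces the recorded result...
assert fingerprint(render(old)) == recorded
# ...while a changed template revision is detected as a different rendering.
assert fingerprint(render(new)) != recorded
```

The design choice being illustrated is that the stored artifact is the set
of pinned inputs plus a small fingerprint, so the rendering can be redone and
verified later instead of being archived wholesale.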