Hoi,
This is an inquiry from a friend of mine in academia who is researching Wikipedia.
He would like to know whether there is a way to acquire a list of templates that include external links. Here are some examples of such templates:
https://ja.wikipedia.org/wiki/Template:JOI/doc https://ja.wikipedia.org/wiki/Template:Twitter/doc
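(For reference, one way I imagine such a list could be pulled - a rough sketch via the Action API's list=exturlusage module restricted to the Template namespace - is below; I am not sure whether this is the recommended approach.)

# Sketch: list pages in the Template namespace (ns 10) that carry at least
# one registered external link, using list=exturlusage. Illustrative only;
# for bulk work, joining externallinks.sql.gz against page.sql.gz is likely
# the better source.
import requests

API = "https://ja.wikipedia.org/w/api.php"

params = {
    "action": "query",
    "list": "exturlusage",
    "eunamespace": 10,      # Template namespace
    "euprop": "title|url",
    "eulimit": 50,
    "format": "json",
}

resp = requests.get(API, params=params, timeout=30)
resp.raise_for_status()
for entry in resp.json()["query"]["exturlusage"]:
    print(entry["title"], entry["url"])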
Such links are stored in externallinks.sql.gz, in an expanded form.
When you want to check the increase/decrease of linked domains in chronological order through the edit history, you have to check pages-meta-history1.xml etc. In such a case, traditional links and links produced by templates are mixed together; therefore, the latter (links produced by templates) need to be expanded into the traditional link form.
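To make concrete the kind of per-revision counting I mean, here is a rough sketch (assuming a bz2-compressed pages-meta-history dump; note that it only sees links written directly in the wikitext, not those produced by templates, which is exactly the gap I am asking about):

# Sketch: per-revision external-link domains from a pages-meta-history dump.
# Assumes the dump file is bz2-compressed; adjust the open() call otherwise.
# Links generated by templates (e.g. {{Twitter}}) are NOT visible here until
# the templates are expanded -- that is the gap described above.
import bz2
import re
from urllib.parse import urlparse
from xml.etree import ElementTree as ET

URL_RE = re.compile(r'https?://[^\s\]|<>"]+')

def _local(tag):
    # Strip the XML namespace, e.g. "{http://...}revision" -> "revision"
    return tag.rsplit("}", 1)[-1]

def domains_per_revision(dump_path):
    """Yield (timestamp, set of linked domains) for each revision."""
    with bz2.open(dump_path, "rb") as f:
        for _, elem in ET.iterparse(f):
            if _local(elem.tag) != "revision":
                continue
            timestamp, text = None, ""
            for child in elem:
                if _local(child.tag) == "timestamp":
                    timestamp = child.text
                elif _local(child.tag) == "text":
                    text = child.text or ""
            yield timestamp, {urlparse(u).netloc for u in URL_RE.findall(text)}
            elem.clear()  # keep memory bounded on large dumps

# Example:
# for ts, domains in domains_per_revision("jawiki-pages-meta-history1.xml.bz2"):
#     print(ts, sorted(domains))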
Sorry if what I am saying does not make sense. Thanks in advance,
--Takashi Ota [[U:Takot]]
On 2016-07-31 10:53 AM, Takashi OTA wrote:
When you want to check the increase/decrease of linked domains in chronological order through the edit history
This is actually a harder problem than it seems, even at first glance: if you want to examine the links over time then, when you are looking at an old revision of an article, you have to contrive to expand the templates /as they existed at that time/ and not those that exist /now/, as the MediaWiki engine would do.
Clearly, all the data to do so is there in the database - and I seem to recall that there exists an extension that will allow you to use the parser in that way - but the Foundation projects do not have such an extension installed and cannot be convinced to render a page for you that would accurately show what ELs it might have had at a given date.
-- Coren / Marc
On Mon, Aug 1, 2016 at 7:46 AM, Marc-Andre marc@uberbox.org wrote:
Clearly, all the data to do so is there in the database - and I seem to recall that there exists an extension that will allow you to use the parser in that way - but the Foundation projects do not have such an extension installed and cannot be convinced to render a page for you that would accurately show what ELs it might have had at a given date.
That would be the Memento [1] extension. I'm not sure this is even theoretically possible - the parser has changed over time and old templates might not work anymore.
Your best bet is probably to find some old dumps. (Kiwix [2] maybe? I don't know if they preserve templates.)
[1] https://www.mediawiki.org/wiki/Extension:Memento [2] https://dumps.wikimedia.org/other/kiwix/zim/wikipedia/
On 2016-08-01 12:21 PM, Gergo Tisza wrote:
the parser has changed over time and old templates might not work anymore
Aaah. Good point. Also, the changes in extensions (or, indeed, what extensions are installed at all) might break attempts to parse the past, as it were.
You know, this is actually quite troublesome: as the platform evolves the older data becomes increasingly hard to use at all - making it effectively lost even if we kept the bits around. This is a rather widespread issue in computing as a rule; but I now find myself distressed at its unavoidable effect on what we've always intended to be a permanent contribution to humanity.
We need to find a long-term view to a solution. I don't mean just keeping old versions of the software around - that would be of limited help. It'd be an interesting nightmare to try to run early versions of phase3 nowadays, and would probably require managing to make a very, very old distro work and finding the right versions of an ancient Apache and PHP. Even *building* those might end up being a challenge... when is the last time you saw a working egcs install? I shudder to think how nigh-impossible the task might be 100 years from now.
Is there something we can do to make the passage of years hurt less? Should we be laying groundwork now to prevent issues decades away?
At the very least, I think those questions are worth asking.
-- Coren / Marc
On 08/01/2016 11:37 AM, Marc-Andre wrote:
... Is there something we can do to make the passage of years hurt less? Should we be laying groundwork now to prevent issues decades away?
One possibility is considering storing rendered HTML for old revisions. It lets wikitext (and hence parser) evolve without breaking old revisions. Plus rendered HTML will use the template revision at the time it was rendered vs. the latest revision (this is the problem Memento tries to solve).
HTML storage comes with its own can of worms, but it seems like a solution worth thinking about in some form.
1. storage costs (fully rendered HTML would be 5-10 times bigger than wikitext for that same page, and much larger still compared to wikitext stored as diffs)
2. evolution of the HTML spec and its effect on old content (this affects the entire web, so whatever solution works there will work for us as well)
3. newly discovered security holes and retroactively fixing them in stored HTML and released dumps (not sure).
... and maybe others.
Subbu.
On Mon, Aug 1, 2016 at 9:51 AM, Subramanya Sastry ssastry@wikimedia.org wrote:
On 08/01/2016 11:37 AM, Marc-Andre wrote:
Is there something we can do to make the passage of years hurt less? Should we be laying groundwork now to prevent issues decades away?
One possibility is considering storing rendered HTML for old revisions. It lets wikitext (and hence parser) evolve without breaking old revisions. Plus rendered HTML will use the template revision at the time it was rendered vs. the latest revision (this is the problem Memento tries to solve).
This is a seductive path to choose. Maintaining backwards compatibility for poorly conceived (in retrospect) engineering decisions is really hard work. A lot of the cruft and awfulness of enterprise-focused software comes from dealing with the seemingly endless torrent of edge cases which are often backwards-compatibility issues in the systems/formats/databases/protocols that the software depends on. The [Y2K problem][1] was a global lesson in the importance of intelligently paying down technical debt.
You outline the problems with this approach in the remainder of your email....
HTML storage comes with its own can of worms, but it seems like a solution worth thinking about in some form.
1. storage costs (fully rendered HTML would be 5-10 times bigger than wikitext for that same page, and much larger still compared to wikitext stored as diffs)
2. evolution of the HTML spec and its effect on old content (this affects the entire web, so whatever solution works there will work for us as well)
3. newly discovered security holes and retroactively fixing them in stored HTML and released dumps (not sure).
... and maybe others.
I think these are all reasons why I chose the word "seductive" as opposed to more unambiguous praise :-) Beyond these reasons, the bigger issue is that it's an invitation to be sloppy about our formats. We should endeavor to make our wikitext to html conversion more scientifically reproducible (i.e. "Nachvollziehbarkeit" as Daniel Kinzler taught me). Holding a large data store of snapshots seems like a crutch to avoid the hard work of specifying how this conversion ought to work. Let's actually nail down the spec for this[2][3] rather than kidding ourselves into believing we can just store enough HTML snapshots to make the problem moot.
Rob
[1]: https://en.wikipedia.org/wiki/Year_2000_problem [2]: https://www.mediawiki.org/wiki/Markup_spec [3]: https://www.mediawiki.org/wiki/Parsoid/MediaWiki_DOM_spec
On Mon, Aug 1, 2016 at 11:47 AM, Rob Lanphier robla@wikimedia.org wrote:
HTML storage comes with its own can of worms, but it seems like a solution worth thinking about in some form.
1. storage costs (fully rendered HTML would be 5-10 times bigger than wikitext for that same page, and much larger still compared to wikitext stored as diffs)
2. evolution of the HTML spec and its effect on old content (this affects the entire web, so whatever solution works there will work for us as well)
3. newly discovered security holes and retroactively fixing them in stored HTML and released dumps (not sure).
... and maybe others.
I think these are all reasons why I chose the word "seductive" as opposed to more unambiguous praise :-) Beyond these reasons, the bigger issue is that it's an invitation to be sloppy about our formats. We should endeavor to make our wikitext to html conversion more scientifically reproducible (i.e. "Nachvollziehbarkeit" as Daniel Kinzler taught me). Holding a large data store of snapshots seems like a crutch to avoid the hard work of specifying how this conversion ought to work. Let's actually nail down the spec for this[2][3] rather than kidding ourselves into believing we can just store enough HTML snapshots to make the problem moot.
Specifying wikitext-html conversion sounds like a MediaWiki 2.0 type of project (ie. wouldn't expect it to happen in this decade), and even then it would not fully solve the problem - e.g. very old versions relied on the default CSS of a different MediaWiki skin; you need site scripts for some things such as infobox show/hide functionality to work, but the standard library those scripts rely on has changed; same for Scribunto scripts.
HTML storage is actually not that bad - browsers are very good at backwards compatibility with older versions of the HTML spec and there is very little security footprint in serving static HTML from a separate domain. Storage is a problem, but there is no need to store every page revision - monthly or yearly snapshots would be fine IMO. (cf. T17017 - again, Kiwix seems to do this already, so maybe it's just a matter of coordination.) The only other practical problem I can think of is that it would preserve deleted/oversighted information - that problem already exists with the dumps, but those are not kept for very long (on WMF servers at least).
On Mon, Aug 1, 2016 at 12:19 PM, Gergo Tisza gtisza@wikimedia.org wrote:
Specifying wikitext-html conversion sounds like a MediaWiki 2.0 type of project (ie. wouldn't expect it to happen in this decade), and even then it would not fully solve the problem[...]
You seem to be suggesting that
1. Specifying wikitext-html conversion is really hard
2. It's not a silver bullet (i.e. it doesn't "fully solve the problem")
3. HTML storage looks more like a silver bullet, and is cheaper
4. Therefore, a specification is not really worth doing, or if it is, it's really low priority
Is that an accurate way of paraphrasing your email?
Rob
On Mon, Aug 1, 2016 at 1:01 PM, Rob Lanphier robla@wikimedia.org wrote:
On Mon, Aug 1, 2016 at 12:19 PM, Gergo Tisza gtisza@wikimedia.org wrote:
Specifying wikitext-html conversion sounds like a MediaWiki 2.0 type of project (ie. wouldn't expect it to happen in this decade), and even then it would not fully solve the problem[...]
You seem to be suggesting that
- Specifying wikitext-html conversion is really hard
- It's not a silver bullet (i.e. it doesn't "fully solve the problem")
- HTML storage looks more like a silver bullet, and is cheaper
- Therefore, a specification is not really worth doing, or if it is, it's really low priority
Is that an accurate way of paraphrasing your email?
Yes. The main problem with specifying wikitext-to-html is that extensions get to extend it in arbitrary ways; e.g. the specification for Scribunto would have to include the whole Lua compiler semantics.
On Mon, Aug 1, 2016 at 1:56 PM, Gergo Tisza gtisza@wikimedia.org wrote:
On Mon, Aug 1, 2016 at 1:01 PM, Rob Lanphier robla@wikimedia.org wrote:
On Mon, Aug 1, 2016 at 12:19 PM, Gergo Tisza gtisza@wikimedia.org wrote:
Specifying wikitext-html conversion sounds like a MediaWiki 2.0 type of project (ie. wouldn't expect it to happen in this decade), and even then it would not fully solve the problem[...]
You seem to be suggesting that
- Specifying wikitext-html conversion is really hard
- It's not a silver bullet (i.e. it doesn't "fully solve the problem")
- HTML storage looks more like a silver bullet, and is cheaper
- Therefore, a specification is not really worth doing, or if it is, it's really low priority
Is that an accurate way of paraphrasing your email?
Yes. The main problem with specifying wikitext-to-html is that extensions get to extend it in arbitrary ways; e.g. the specification for Scribunto would have to include the whole Lua compiler semantics.
Do you believe that declaring "the implementation is the spec" is a sustainable way of encouraging contribution to our projects?
Rob
On Mon, Aug 1, 2016 at 5:27 PM, Rob Lanphier robla@wikimedia.org wrote:
Do you believe that declaring "the implementation is the spec" is a sustainable way of encouraging contribution to our projects?
Reimplementing Wikipedia's parser (complete with template inclusions, Wikidata fetches, Lua scripts, LaTeX snippets and whatever else) is practically impossible. What we do or do not declare won't change that.
There are many other, more realistic ways to encourage contribution by users who are interested in wikis, but not in Wikimedia projects. (Supporting Markdown would certainly be one of them.) But historically the WMF has shown zero interest in the wiki-but-not-Wikimedia userbase, and no other actor has been both willing and able to step up in its place.
On Tue, Aug 2, 2016 at 8:34 AM, Gergo Tisza gtisza@wikimedia.org wrote:
On Mon, Aug 1, 2016 at 5:27 PM, Rob Lanphier robla@wikimedia.org wrote:
Do you believe that declaring "the implementation is the spec" is a sustainable way of encouraging contribution to our projects?
Reimplementing Wikipedia's parser (complete with template inclusions, Wikidata fetches, Lua scripts, LaTeX snippets and whatever else) is practically impossible. What we do or do not declare won't change that.
Correct, re-implementing the MediaWiki parser is a mission from hell. And yet, WMF is doing that with parsoid ... ;-) And, WMF will no doubt do it again in the future. Changing infrastructure is normal for systems that last many generations.
But the real problem of not using a versioned spec is that nobody can reliably do anything, at all, with the content.
Even basic tokenizing of wikitext has many undocumented gotchas, and even with the correct voodoo today there is no guarantee that WMF engineers won't break it tomorrow without informing everyone that the spec has changed.
There are many other, more realistic ways to encourage contribution by users who are interested in wikis, but not in Wikimedia projects. (Supporting Markdown would certainly be one of them.) But historically the WMF has shown zero interest in the wiki-but-not-Wikimedia userbase, and no other actor has been both willing and able to step up in its place.
The main reason for a spec should be the sanity of the Wikimedia technical user base, including WMF engineers paid by donors, who build parsers in other languages for various reasons, including supporting tools that account for a very large percent of the total edits to Wikimedia and are critical in preventing abuse and assisting admins performing critical tasks to keep the sites from falling apart.
-- John Vandenberg
TL;DR: You get to a spec by paying down the technical debt that untangles wikitext parsing from being intricately tied to the internals of the mediawiki implementation and its state.
In discussions, there is far too much focus on the fact that you cannot write a BNF grammar or use yacc / lex / bison / whatever, or that quote parsing is context-sensitive. I don't think that is as big a deal. For example, you could use Markdown for parsing, but that doesn't change much of the picture outlined below ... I think all of that is less of an issue compared to the following:
Right now, mediawiki HTML output depends on the following:
* input wikitext
* wiki config (including installed extensions)
* installed templates
* media resources (images, audio, video)
* PHP parser hooks that expose parsing internals and implementation details (not replicable in other parsers)
* wiki messages (ex: cite output)
* state of the corpus and other db state (ex: red links, bad images)
* user state (prefs, etc.)
* Tidy
So, one reason for the complexity in implementing a wikitext parser is because the output HTML is not simply a straightforward transformation of input wikitext (and some config). There is far too much other state that gets in the way.
The second reason for complexity is that markup errors aren't bounded to narrow contexts but can leak out and impact the output of the entire page. Some user pages even seem to exploit this as a feature (unclosed div tags).
The third source of complexity is that some parser hooks expose internals of the implementation (Before/After Strip/Tidy and other such hooks). An implementation without Tidy, or one that handles wikitext differently, might not have the same pipeline.
However, we can still get to a spec that is much more replicable if we start cleaning up some of this incrementally and paying down technical debt. Here are some things going on right now towards that.
* We are close to getting rid of Tidy, which removes it from the equation.
* There are RFCs that propose defining DOM scopes and propose that the output of templates (and extensions) be a DOM (vs. a string), with some caveats (that I will ignore here). If we can get to implementing these, we immediately isolate the parsing of a top-level page from the details of how extensions and transclusions are processed.
* There are RFCs that propose that things like red links, bad images, user state, and site messages not be an input into the core wikitext parse. From a spec point of view, they should be viewed as post-processing transformations. However, for efficiency reasons, an implementation might choose to integrate them as part of the parse, but that is not a requirement.
Separately, here is one other thing we can consider:
* Deprecate and replace tag hooks that expose parser internals.
When all of these are done, it becomes far more feasible to think of defining a spec for wikitext parsing that is not tied to the internals of mediawiki or its extensions. At that point, you could implement templating via Lua or via JS or via Ruby ... the specifics are immaterial. What matters is that those templating implementations and extensions produce output with certain properties. You can then specify that mediawiki-HTML is a series of transformations that are applied to the output of the wikitext parser ... and where there can be multiple spec-compliant implementations of that parser.
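To make that shape concrete, here is a toy sketch (purely illustrative; none of these function names exist in MediaWiki or Parsoid): the core parse is a pure function of wikitext plus config, and wiki/user/db state is layered on as post-processing passes over the HTML.

# Toy sketch of "mediawiki-HTML as a series of transformations".
# Purely illustrative: the function names below are invented for this email
# and do not correspond to any real MediaWiki or Parsoid API.
from typing import Callable, Dict, List

def core_parse(wikitext: str, config: Dict) -> str:
    """Stand-in for a spec'd, state-free wikitext -> HTML conversion."""
    return "<p>" + wikitext + "</p>"

# Post-processing passes that depend on wiki/user/db state, applied *after*
# the core parse instead of being woven into it.
def mark_red_links(html: str, state: Dict) -> str:
    return html  # would consult the link table and tag missing pages

def substitute_site_messages(html: str, state: Dict) -> str:
    return html  # would expand per-wiki messages, e.g. cite output

def apply_user_prefs(html: str, state: Dict) -> str:
    return html  # e.g. thumbnail size, date formatting

PIPELINE: List[Callable[[str, Dict], str]] = [
    mark_red_links,
    substitute_site_messages,
    apply_user_prefs,
]

def mediawiki_html(wikitext: str, config: Dict, state: Dict) -> str:
    html = core_parse(wikitext, config)   # the part a spec would cover
    for transform in PIPELINE:            # spec'd only via input/output properties
        html = transform(html, state)
    return html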
I think it is feasible to get there. But, whether we want a spec for wikitext and should work towards that is a different question.
Subbu.
On 08/01/2016 08:34 PM, Gergo Tisza wrote:
On Mon, Aug 1, 2016 at 5:27 PM, Rob Lanphier robla@wikimedia.org wrote:
Do you believe that declaring "the implementation is the spec" is a sustainable way of encouraging contribution to our projects?
Reimplementing Wikipedia's parser (complete with template inclusions, Wikidata fetches, Lua scripts, LaTeX snippets and whatever else) is practically impossible. What we do or do not declare won't change that.
There are many other, more realistic ways to encourage contribution by users who are interested in wikis, but not in Wikimedia projects. (Supporting Markdown would certainly be one of them.) But historically the WMF has shown zero interest in the wiki-but-not-Wikimedia userbase, and no other actor has been both willing and able to step up in its place.
Subramanya Sastry wrote:
Some user pages seem to exploit this as a feature even (unclosed div tags).
Not just some and not just seem. :-) Thank you for this detailed e-mail.
John Mark Vandenberg wrote:
The main reason for a spec should be the sanity of the Wikimedia technical user base, including WMF engineers paid by donors, who build parsers in other languages for various reasons, including supporting tools that account for a very large percent of the total edits to Wikimedia and are critical in preventing abuse and assisting admins performing critical tasks to keep the sites from falling apart.
In fairness, there's a Parsing team at the Wikimedia Foundation. We all recognize that there's a problem and I think we're making decent progress toward better defined, though not necessarily saner, behavior.
MZMcBride
On Mon, Aug 1, 2016 at 10:15 PM, Subramanya Sastry ssastry@wikimedia.org wrote:
When [a detailed list of stuff is] done, it becomes far more feasible to think of defining a spec for wikitext parsing that is not tied to the internals of mediawiki or its extensions. At that point, you could implement templating via Lua or via JS or via Ruby ... the specifics are immaterial. What matters is that those templating implementations and extensions produce output with certain properties. You can then specify that mediawiki-HTML is a series of transformations that are applied to the output of the wikitext parser ... and where there can be multiple spec-compliant implementations of that parser.
I think it is feasible to get there. But, whether we want a spec for wikitext and should work towards that is a different question.
In our planning meeting (E250), we discussed this issue as a possibility for next week's ArchCom office hour (E259). We don't (yet) have a specific RFC we can point to, but this seems ripe for a discussion to answer whether we should work toward a spec. Thoughts?
Rob
On 08/03/2016 07:17 PM, Rob Lanphier wrote:
On Mon, Aug 1, 2016 at 10:15 PM, Subramanya Sastry ssastry@wikimedia.org wrote:
When [a detailed list of stuff is] done, it becomes far more feasible to think of defining a spec for wikitext parsing that is not tied to the internals of mediawiki or its extensions. At that point, you could implement templating via Lua or via JS or via Ruby ... the specifics are immaterial. What matters is that those templating implementations and extensions produce output with certain properties. You can then specify that mediawiki-HTML is a series of transformations that are applied to the output of the wikitext parser ... and where there can be multiple spec-compliant implementations of that parser.
I think it is feasible to get there. But, whether we want a spec for wikitext and should work towards that is a different question.
In our planning meeting (E250), we discussed this issue as a possibility for next week's ArchCom office hour (E259). We don't (yet) have a specific RFC we can point to, but this seems ripe for a discussion to answer whether we should work toward a spec. Thoughts?
Works for me.
I can take the email I posted, clean it up a bit, and also pull additional thoughts from https://www.mediawiki.org/wiki/User:SSastry_(WMF)/Notes/Wikitext and elsewhere that are relevant. The idea I have is to provide a very high level view of what one possible spec might look like, and what that might enable.
Or, should I pull together something else that might be useful to guide the discussion?
Subbu.
On Wed, Aug 3, 2016 at 8:48 PM, Subramanya Sastry ssastry@wikimedia.org wrote:
On 08/03/2016 07:17 PM, Rob Lanphier wrote:
In our planning meeting (E250), we discussed this issue as a possibility for next week's ArchCom office hour (E259). We don't (yet) have a specific RFC we can point to, but this seems ripe for a discussion to answer whether we should work toward a spec. Thoughts?
Works for me.
Excellent!
I can take the email I posted, clean it up a bit, and also pull additional thoughts from https://www.mediawiki.org/wiki/User:SSastry_(WMF)/Notes/Wikitext and elsewhere that are relevant. The idea I have is to provide a very high level view of what one possible spec might look like, and what that might enable.
Or, should I pull together something else that might be useful to guide the discussion?
I suspect User:SSastry_(WMF)/Notes/Wikitext is a really good explanation for people who have a deep understanding of parsers and our parsing infrastructure. I say "suspect" because I'm operating from the perspective of someone whose knowledge of our system is wide and shallow.[1] I fear that we're coming from a wide enough set of perspectives about wikitext that we're doomed to talk past each other, despite ample preparation.
Perhaps a good place to start for our 2016-08-10 conversation is with this page: https://www.mediawiki.org/wiki/Markup_spec
As of this writing, the last really substantive addition to that page was in 2010. Maybe we can have a discussion about what should reside at that URL, and where the content currently on that page should go.
Of course, it would be a lot more fun to have a celebration about how awesome that page had become in the past week after I sent this email. That probably won't be because I made the first (or any) edits to the page ;-)
Rob
[1] Like the Platte River, which traveling pioneers described as "a mile wide and an inch deep" and "the most magnificent and useless of rivers": https://en.wikipedia.org/wiki/Missouri_River#cite_ref-188
On 08/03/2016 10:48 PM, Subramanya Sastry wrote:
On 08/03/2016 07:17 PM, Rob Lanphier wrote:
On Mon, Aug 1, 2016 at 10:15 PM, Subramanya Sastry ssastry@wikimedia.org wrote:
... I think it is feasible to get there. But, whether we want a spec for wikitext and should work towards that is a different question.
In our planning meeting (E250), we discussed this issue as a possibility for next week's ArchCom office hour (E259). We don't (yet) have a specific RFC we can point to, but this seems ripe for a discussion to answer whether we should work toward a spec. Thoughts?
Works for me.
I can take the email I posted, clean it up a bit, and also pull additional thoughts from https://www.mediawiki.org/wiki/User:SSastry_(WMF)/Notes/Wikitext and elsewhere that are relevant. The idea I have is to provide a very high level view of what one possible spec might look like, and what that might enable.
Or, should I pull together something else that might be useful to guide the discussion?
I have started working on https://www.mediawiki.org/wiki/Parsing/Notes/A_Spec_For_Wikitext to guide that discussion. It is definitely a rough and incomplete work in progress, but with enough initial material for sharing that link here ... mostly so that those of you interested in that discussion have some time to read at least some of that material.
Subbu.
"Should we be laying groundwork now to prevent issues decades away?" I'll answer that with "Yes". I could provide some interesting stories about technological and budgetary headaches that result from repeatedly delaying efforts to make legacy software be forwards-compatible. The technical details of the tools mentioned here are beyond me, but I saw what happened in another org that was dealing with legacy software and it wasn't pretty.
Pine
On Mon, Aug 1, 2016 at 9:37 AM, Marc-Andre marc@uberbox.org wrote:
On 2016-08-01 12:21 PM, Gergo Tisza wrote:
the parser has changed over time and old templates might not work anymore
Aaah. Good point. Also, the changes in extensions (or, indeed, what extensions are installed at all) might break attempts to parse the past, as it were.
You know, this is actually quite troublesome: as the platform evolves the older data becomes increasingly hard to use at all - making it effectively lost even if we kept the bits around. This is a rather widespread issue in computing as a rule; but I now find myself distressed at its unavoidable effect on what we've always intended to be a permanent contribution to humanity.
We need to find a long-term view to a solution. I don't mean just keeping old versions of the software around - that would be of limited help. It'd be an interesting nightmare to try to run early versions of phase3 nowadays, and would probably require managing to make a very, very old distro work and finding the right versions of an ancient Apache and PHP. Even *building* those might end up being a challenge... when is the last time you saw a working egcs install? I shudder to think how nigh-impossible the task might be 100 years from now.
Is there something we can do to make the passage of years hurt less? Should we be laying groundwork now to prevent issues decades away?
At the very least, I think those questions are worth asking.
-- Coren / Marc
On 1 August 2016 at 17:37, Marc-Andre marc@uberbox.org wrote:
We need to find a long-term view to a solution. I don't mean just keeping old versions of the software around - that would be of limited help. It'd be an interesting nightmare to try to run early versions of phase3 nowadays, and would probably require managing to make a very, very old distro work and finding the right versions of an ancient Apache and PHP. Even *building* those might end up being a challenge... when is the last time you saw a working egcs install? I shudder to think how nigh-impossible the task might be 100 years from now.
oh god yes. I'm having this now, trying to revive an old Slash installation. I'm not sure I could even reconstruct a box to run it without compiling half of CPAN circa 2002 from source.
Suggestion: set up a copy of WMF's setup on a VM (or two or three), save that VM and bundle it off to the Internet Archive as a dated archive resource. Do this regularly.
- d.
One possibility is considering storing rendered HTML for old revisions. It lets wikitext (and hence parser) evolve without breaking old revisions. Plus rendered HTML will use the template revision at the time it was rendered vs. the latest revision (this is the problem Memento tries to solve).
Long-term HTML archival is something we have been gradually working towards with RESTBase.
Since HTML is about 10x larger than wikitext, a major concern is storage cost. Old estimates https://phabricator.wikimedia.org/T97710 put the total storage needed to store one HTML copy of each revision at roughly 120T. To reduce this cost, we have since implemented several improvements https://phabricator.wikimedia.org/T93751:
- Brotli compression https://en.wikipedia.org/wiki/Brotli, once deployed, is expected to reduce the total storage needs to about 1/4-1/5 of what gzip requires https://phabricator.wikimedia.org/T122028#2004953 (see the rough sketch after this list).
- The ability to split latest revisions from old revisions lets us use cheaper and slower storage for old revisions.
- Retention policies let us specify how many renders per revision we want to archive. We currently only archive one (the latest) render per revision, but have the option to store one render per $time_unit. This is especially important for pages like [[Main Page]], which are rarely edited, but constantly change their content in meaningful ways via templates. It is currently not possible to reliably cite such pages without resorting to external services like archive.org.
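As a rough, standalone sketch of the gzip-vs-brotli gap (not how RESTBase measures it; the 1/4-1/5 figure comes from compressing many related revisions together, so a single page as below will show a smaller gap, but it illustrates the mechanics; assumes the brotli Python package is installed):

# Rough illustration of gzip vs. brotli on one rendered page. Not how
# RESTBase measures this; the production ratio comes from compressing many
# similar revisions together, so expect a smaller gap here.
import gzip

import brotli    # pip install brotli
import requests

html = requests.get(
    "https://en.wikipedia.org/api/rest_v1/page/html/Wikipedia",
    timeout=30,
).content

raw = len(html)
gz = len(gzip.compress(html, compresslevel=9))
br = len(brotli.compress(html, quality=11))

print("raw:    {:9,d} bytes".format(raw))
print("gzip:   {:9,d} bytes ({:.1%} of raw)".format(gz, gz / raw))
print("brotli: {:9,d} bytes ({:.1%} of raw)".format(br, br / raw))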
Another important requirement for making HTML a useful long-term archival medium is to establish a clear standard for HTML structures used. The versioned Parsoid HTML spec https://www.mediawiki.org/wiki/Specs/HTML/1.2.1, along with format migration logic for old content, are designed to make the stored HTML as future-proof as possible.
While we currently only have space for a few months worth of HTML revisions, we do expect the changes above to make it possible to push this to years in the foreseeable future without unreasonable hardware needs. This means that we can start building up an archive of our content in a format that is not tied to the software.
Faithfully re-rendering old revisions after the fact is harder. We will likely have to make some trade-offs between fidelity & effort.
Gabriel
On Mon, Aug 1, 2016 at 2:01 PM, David Gerard dgerard@gmail.com wrote:
On 1 August 2016 at 17:37, Marc-Andre marc@uberbox.org wrote:
We need to find a long-term view to a solution. I don't mean just keeping old versions of the software around - that would be of limited help. It'd be an interesting nightmare to try to run early versions of phase3 nowadays, and would probably require managing to make a very, very old distro work and finding the right versions of an ancient Apache and PHP. Even *building* those might end up being a challenge... when is the last time you saw a working egcs install? I shudder to think how nigh-impossible the task might be 100 years from now.
oh god yes. I'm having this now, trying to revive an old Slash installation. I'm not sure I could even reconstruct a box to run it without compiling half of CPAN circa 2002 from source.
Suggestion: set up a copy of WMF's setup on a VM (or two or three), save that VM and bundle it off to the Internet Archive as a dated archive resource. Do this regularly.
- d.
There is a slow-moving discussion about this at https://www.mediawiki.org/wiki/Talk:Requests_for_comment/Markdown
The bigger risk is that the rest of the world settles on using CommonMark Markdown once it is properly specified. That will mean, in the short term, that MediaWiki will need to support Markdown; eventually it would need to adopt Markdown as the primary text format, and ultimately we would lose our own ability to render old revisions, because the parser would bit-rot.
One practical way to add more discipline around this problem is to introduce a "mediawiki-wikitext-announce" list, similar to the mediawiki-api-announce list, and require that *every* breaking change to the wikitext parser is announced there.
wikitext is a file format, and there are alternative parsers, which need to be updated any time the PHP parser changes.
https://www.mediawiki.org/wiki/Alternative_parsers
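As one concrete example of what those parsers do (a sketch using mwparserfromhell, one of the parsers listed there; assumes it is installed):

# Sketch: pull external links and templates out of a revision's wikitext
# with mwparserfromhell, one of the alternative parsers listed above.
# A silent change in PHP-parser behaviour can quietly put output like this
# out of sync with what the wikis actually render -- hence the announce list.
import mwparserfromhell

wikitext = "Official site: [http://example.org Example] and {{Twitter|example}}."

code = mwparserfromhell.parse(wikitext)

for link in code.filter_external_links():
    print("external link:", link.url)

for tpl in code.filter_templates():
    print("template:", str(tpl.name).strip(),
          [str(p.value).strip() for p in tpl.params])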
It should be managed just like the MediaWiki API, with appropriate notices sent out, so that other tools can be kept up to date, and so there is an accurate record of when breaking changes occurred.
-- John Vandenberg
Hi,
On 07/31/2016 07:53 AM, Takashi OTA wrote:
Such links are stored in externallinks.sql.gz, in an expanded form.
When you want to check the increase/decrease of linked domains in chronological order through the edit history, you have to check pages-meta-history1.xml etc. In such a case, traditional links and links produced by templates are mixed together; therefore, the latter (links produced by templates) need to be expanded into the traditional link form.
If you have the revision ID, you can make an API query like: https://en.wikipedia.org/w/api.php?action=parse&oldid=387276926&prop=externallinks.
This will expand all templates and give you the same set of externallinks that would have ended up in the dump.
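In Python, that same query looks roughly like this (a minimal sketch):

# Minimal sketch of the API call above: parse a specific old revision and
# list the external links the parser sees after template expansion.
import requests

API = "https://en.wikipedia.org/w/api.php"

resp = requests.get(API, params={
    "action": "parse",
    "oldid": 387276926,
    "prop": "externallinks",
    "format": "json",
}, timeout=30)
resp.raise_for_status()

for url in resp.json()["parse"]["externallinks"]:
    print(url)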
-- Legoktm