How to read this post?
----------------------

* For those without time to read lengthy technical emails, read the TL;DR section.
* For those who don't care about all the details but want to help with this project, you can read sections 1 and 2 about Tidy, and then skip to section 7.
* For those who like all their details, read the post in its entirety, and follow the links.
Please ask follow-up questions on wiki, *on the FAQ’s talk page* [0]. If you find a bug, please report it *on Phabricator or on the page mentioned above*.
TL;DR
-----

The Parsing team wants to replace Tidy with a RemexHTML-based solution on the Wikimedia cluster by June 2018. This will require editors to fix pages and templates to address wikitext patterns that behave differently with RemexHTML. Please see the 'What editors will need to do' section on the Tidy replacement FAQ [1].
1. What is Tidy?
----------------

Tidy [2] is a library currently used by MediaWiki to fix some HTML errors found in wiki pages.
Badly formed markup is common on wiki pages when editors use HTML tags in templates and on the page itself. (Ex: unclosed HTML tags, such as a <small> without a </small>, are common.) In some cases, MediaWiki itself can generate erroneous HTML. If we didn't fix this markup before sending it to browsers, some of them would render pages incorrectly for readers.
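As a minimal, made-up illustration of the kind of fixup involved (the exact output depends on the tool and its configuration):

---
Input:  some text <small>fine print
Output: some text <small>fine print</small>
---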
But Tidy also does other "cleanup" on its own that is not required for correctness. Ex: it removes empty elements and adds whitespace between HTML tags, which can sometimes change rendering.
2. Why replace it?
------------------

Tidy is based on HTML4 semantics, but the Web has moved to HTML5, so Tidy makes some incorrect changes to 'fix' markup that used to not work but is now valid; for example, Tidy will unexpectedly move a bullet list out of a table caption even though HTML5 allows lists there. HTML4 Tidy is no longer maintained or packaged, and a number of bug reports have been filed against Tidy [3]. Finally, since Parsoid is based on HTML5 semantics, there are differences between Parsoid's rendering of a page and the current Tidy-based read view.
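To make the caption example above concrete, here is a sketch of the kind of wikitext involved (a made-up minimal case, not taken from a real page):

---
{|
|+ A caption holding
* a bullet
* list
|-
| a table cell
|}
---

An HTML5 parser keeps the list inside the <caption>; Tidy hoists it out of the caption, which changes where it renders.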
3. Project status
-----------------

Given all these considerations, the Parsing team started work to replace Tidy [4] around mid-2015. Tim Starling started this work and, after a survey of existing options, decided to write a wrapper over a Java-based HTML5 parser. At the time we started the project, we thought we could probably have Tidy replaced by mid-2016. Alas!
4. What is replacing Tidy?
--------------------------

Tidy will be replaced by a solution based on the RemexHTML [5] library, along with some Tidy-compatibility shims to ensure better parity with the current rendering. RemexHTML is a PHP library implementing the HTML5 parsing spec that Tim wrote with C. Scott’s input.
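To make "implements the HTML5 parsing spec" concrete, here is a minimal sketch of what spec-driven error recovery looks like, using html5lib, a Python HTML5 parser, purely as a stand-in (MediaWiki itself uses the PHP RemexHTML library; the names below are html5lib's, not RemexHTML's):

---
import xml.etree.ElementTree as ET
import html5lib

# Intentionally broken markup of the kind described above: an unclosed <small>.
broken = "some text <small>fine print"

# Any HTML5-spec parser must recover from this in the same way; html5lib is
# used here only as a convenient analogue for RemexHTML's behavior.
tree = html5lib.parse(broken, namespaceHTMLElements=False)

# The spec's tree-construction rules insert <html>/<head>/<body> and close
# the dangling <small> element.
print(ET.tostring(tree, method="html", encoding="unicode"))
# -> <html><head></head><body>some text <small>fine print</small></body></html>
---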
5. Testing and followup
-----------------------

We knew that some pages would be affected and need fixing due to this change. To identify precisely which pages those would be, we wanted to do some thorough testing. So, we built some new tools [6][7] and overhauled and upgraded other test infrastructure [8][9] to let us evaluate the impacts of replacing Tidy (and of other such changes in the future); that infrastructure could be the subject of a post all on its own.
You can find the details of our testing on the wiki [1][10]; we found that a large number of pages had rendering differences. We analyzed the results and categorized the sources of the differences. Based on that, to ease the replacement, we added a set of compatibility shims to mimic what Tidy does (I am skipping the details in this post). Even with those shims, newer testing showed a few remaining patterns that need fixing, which we cannot, or do not want to, work around automatically.
6. Tools to assist editors: Linter & ParserMigration
----------------------------------------------------

In October 2016, at the parsing team offsite, Kunal ([[User:Legoktm (WMF)]]) dusted off the stalled wikitext linting project [11] and (with help from a number of people in the Parsoid, database, security, and code review areas) built the Linter extension, which surfaces wikitext errors that Parsoid knows about so that editors can fix them.
Earlier this year, we decided to use Linter in service of the Tidy replacement. Based on our earlier testing results, we have added a set of high-priority linter categories that identify specific wikitext markup patterns on wiki pages that need to be fixed [12].
Separately, Tim built the ParserMigration extension to let editors evaluate their fixes to pages [13]. You can enable it in your editing preferences, or replace '&action=edit' in your URL bar with '&action=parsermigration-edit'.
7. What editors have to do
--------------------------

The part that you have all been waiting for!
Please see the 'What editors will need to do' section on the Tidy replacement FAQ [1]. We have added simplified instructions so that even community members who do not consider themselves "techies" can learn about ways to fix pages. We'll keep that section up to date based on feedback and questions. But since it is a wiki, please also edit and tweak it as required to make the text useful for yourselves! This is a first call for fixes, covering the problems defined as "high priority". We'll issue other calls in the future for any other necessary Tidy fixups.
Caveats:
* As noted on that page, the linter categories don't cover all the possible sources of rendering differences. For example, T157418 [14] is still left to address; if you have an opinion about it, please chime in on that task. We are still evaluating the best solution that neither adds more cruft to wikitext behavior nor kicks the cleanup can down the road.
* As the issues in the identified linter categories are fixed, we might be better able to isolate other issues that need addressing.
8. So, when will Tidy actually be replaced?
-------------------------------------------

We would really like to get Tidy removed from the cluster by June 2018 at the latest (or sooner if possible), and your assistance and prompt attention to these markup issues would be very helpful. We will do this in a phased manner on different wikis rather than all at once on all wikis.
We really want to do this as smoothly as possible without disrupting the work of editors or affecting the rendering of the large corpus of pages on the various wikis. As you might have gathered from the text above, we have built and leveraged a wide variety of tools to assist with this.
9. Monitoring progress
----------------------

To monitor progress, we plan to do a periodic (probably weekly) test run that compares the rendering of pages with Tidy and with RemexHTML on a large sample of pages (in the 50K range) from a large subset of Wikimedia wikis (~50 or so). This will give us a pulse on how fixups are going, and on when we might be able to flip the switch on different wikis.
Subramanya (Subbu) Sastry
Parsing Team.
References
----------

0. https://www.mediawiki.org/wiki/Talk:Parsing/Replacing_Tidy/FAQ
1. https://www.mediawiki.org/wiki/Parsing/Replacing_Tidy/FAQ#What_will_editors_need_to_do.3F
2. https://en.wikipedia.org/wiki/HTML_Tidy
3. https://phabricator.wikimedia.org/tag/tidy/
4. https://phabricator.wikimedia.org/T89331
5. https://github.com/wikimedia/mediawiki-libs-RemexHtml
6. https://phabricator.wikimedia.org/T120345
7. https://github.com/wikimedia/integration-uprightdiff
8. https://github.com/wikimedia/integration-visualdiff
9. https://github.com/wikimedia/mediawiki-services-parsoid-testreduce
10. https://www.mediawiki.org/wiki/Parsing/Replacing_Tidy
11. https://phabricator.wikimedia.org/T48705
12. https://www.mediawiki.org/wiki/Help:Extension:Linter#Goal:_Replacing_Tidy
13. https://www.mediawiki.org/wiki/Help:Extension:Linter#Verifying_fixes_for_these_lint_categories
14. https://phabricator.wikimedia.org/T157418
Thanks for the information.
I understand that moving from HTML 4 to HTML 5 is probably a good idea.
However, I am concerned about this statement: "This will require editors to fix pages and templates to address wikitext patterns that behave differently with RemexHTML".
As you probably know, the supply of content contributors' time is far too low to meet the demands of keeping up with everything that ideally would be done on the content projects.
I am thinking that instead of asking content contributors to spend lots of hours (do we know how many? Hundreds? Thousands?) fixing all of these issues, it would make more sense to develop bots to address them.
Here are a few questions:

1. How many fixes do you think will be needed, for the highest priority fixes as well as all fixes?
2. How many hours of volunteer time do you think that these fixes will require, for the highest priority fixes as well as all fixes?
3. How feasible would it be to build bots to make 90% of high priority fixes and 90% of all fixes?
I'm not trying to obstruct technical progress, but I am generally not a fan of WMF adding to volunteers' workloads. If the number of changes involved is small and the number of hours to make them is small, that is less of a concern than if we are talking about thousands of changes and hundreds or thousands of volunteer hours.
Thanks,
Pine
On 07/06/2017 09:59 AM, Pine W wrote:
Thanks for the information.
I understand that moving from HTML 4 to HTML 5 is probably a good idea.
It is a good and necessary step. We want MediaWiki (and wikipedia) output to keep up with web standards.
However, I am concerned about this statement: "This will require editors to fix pages and templates to address wikitext patterns that behave differently with RemexHTML".
Understandably. Let me try to address your questions below.
But before that, I want to point you back to my email, where I mentioned that before we arrived at the current proposal, we did a bunch of work to minimize work for editors: (1) we added Tidy compatibility shims where we could automatically preserve Tidy behavior (however much we might have liked to pull those bandaids off); (2) we built a bunch of tools and infrastructure to precisely identify which pages, and which specific pieces of markup on those pages, need fixing; (3) we built a tool to let editors compare their fixes before/after, so they can be sure that they are making the right fixes and that the fixes do the right thing.
I am re-emphasizing this to indicate that we arrived at this proposal after sufficient prior work to respect editors' time and efforts and to support them in what we are requesting them to do. We also re-aligned timelines to reflect the complexity of the task that we uncovered.
As you probably know, the supply of content contributors' time is far too low to meet the demands of keeping up with everything that ideally would be done on the content projects.
Do note that we started this process last year as somewhat of a trial balloon, and we found that editors on wikis were very willing and helpful with the process. See https://phabricator.wikimedia.org/T134423. More recently, earlier this year, we made some fixes to the preprocessor to fix some edge cases in language variant handling. Once again, this required fixes to markup on pages, and editors on a bunch of wikis were more than willing and quite helpful in making these changes. See https://www.mediawiki.org/wiki/Parsoid/Language_conversion/Preprocessor_fixu.... We are very happy with this collaboration and hope we can continue with it.
So, while I understand your concern, I am optimistic that we can work on this collaboratively and make the fixes necessary to address technical debt that has accumulated over the years in our wikis (and hence the MediaWiki codebase) while enabling the upgrade of our output to modern web standards.
I am thinking that instead of asking content contributors to spend lots of hours (do we know how many? Hundreds? Thousands?) fixing all of these issues, it would make more sense to develop bots to address them.
I cannot quantify the number of hours offhand. But, with some effort, we could perhaps come up with rough back-of-the-envelope numbers.
Here are a few questions:
- How many fixes do you think will be needed, for the highest priority fixes as well as all fixes?
On the large Wikipedias, thousands of fixup instances (not pages) are present in the high-priority categories ([1], [2], [3]), which are the only ones required for replacing Tidy. In reality, fixing a few templates will bring down these numbers greatly. For example, for one of the linter categories (a workaround for a paragraph-wrapping bug), fixing the nowrap / nowrap-begin template will likely address most problematic instances found on specific wikis.
[1] https://en.wikipedia.org/wiki/Special:LintErrors
[2] https://fr.wikipedia.org/wiki/Special:LintErrors
[3] https://es.wikipedia.org/wiki/Special:LintErrors
- How many hours of volunteer time do you think that these fixes will require, for the highest priority fixes as well as all fixes?
I do not know offhand, but I expect each (non-template) fix to take no more than a couple of minutes. So, let us go with that and do some rough back-of-the-envelope numbers. Assuming we have, say, 120,000 fixes required in total across all wikis, with 50% of them coming from non-templates, we have 60,000 * 2 minutes = 2,000 person-hours. Templates are going to take more time, but there are likely to be far fewer of them. Say, 1,000 * 15 minutes = 250 hours?
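For transparency, here is the same back-of-the-envelope arithmetic as a runnable sketch (every input is a guess taken from the paragraph above):

---
# Back-of-the-envelope estimate; all inputs are assumptions from the text.
total_fixes = 120000              # assumed fixes needed across all wikis
page_fixes = total_fixes // 2     # assume ~50% are on pages, not templates

page_hours = page_fixes * 2 / 60      # 2 minutes per page fix
template_hours = 1000 * 15 / 60       # ~1,000 template fixes at 15 min each

print(page_hours)      # 2000.0 person-hours
print(template_hours)  # 250.0 person-hours
---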
Take that 2-minute calculation for what it is worth, but I think the collective effort we are asking is not entirely unreasonable over a period of many months.
- How feasible would it be to build bots to make 90% of high priority fixes and 90% of all fixes?
Since the start of the Linter project (when we started off with the GSoC prototype in the summer of 2014, and again when Kunal picked it up in 2016), we have been in conversation with Nico V (from frwiki, who maintains WPCleaner) and with Marios Magioladitis and Bryan White (Checkwiki) about integrating its output with their projects / tools. On Nico's request, we have added API endpoints to Linter, Parsoid, and RESTBase so that tools can programmatically fetch linter issues and let editors / bots fix them appropriately.
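As a rough sketch of what that programmatic access can look like from a bot's side (the action-API module and 'lnt*' parameter names below are my recollection of the Linter extension's API and should be verified against the live api.php documentation):

---
import requests

# Fetch up to 50 lint errors in one category from frwiki. 'self-closed-tag'
# is a real category discussed later in this thread; the parameter and field
# names are assumptions to check against the Linter extension docs.
resp = requests.get(
    "https://fr.wikipedia.org/w/api.php",
    params={
        "action": "query",
        "list": "linterrors",            # module added by the Linter extension
        "lntcategories": "self-closed-tag",
        "lntlimit": 50,
        "format": "json",
    },
    headers={"User-Agent": "lint-survey-sketch/0.1 (example)"},
)
resp.raise_for_status()
for err in resp.json().get("query", {}).get("linterrors", []):
    print(err.get("pageid"), err.get("title"), err.get("category"))
---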
But I cannot answer right now how feasible it is to build bots to do what you are asking about. I welcome insights and perspectives from others.
I'm not trying to obstruct technical progress, but I am generally not a fan of WMF adding to volunteers' workloads. If the number of changes involved is small and the number of hours to make them is small, that is less of a concern than if we are talking about thousands of changes and hundreds or thousands of volunteer hours.
I hope my response here allays some, if not all, of your concerns.
Subbu.
Hi,
On Thu, Jul 6, 2017 at 5:53 PM, Subramanya Sastry ssastry@wikimedia.org wrote:
- How feasible would it be to build bots to make 90% of high priority fixes and 90% of all fixes?
Since the start of the Linter project (when we started off with the GSoC prototype in the summer of 2014, and again when Kunal picked it up in 2016), we have been in conversation with Nico V (from frwiki, who maintains WPCleaner) and with Marios Magioladitis and Bryan White (Checkwiki) about integrating its output with their projects / tools. On Nico's request, we have added API endpoints to Linter, Parsoid, and RESTBase so that tools can programmatically fetch linter issues and let editors / bots fix them appropriately.
I'm happy to announce that I have just released WPCleaner [1] version 1.43, which brings better integration with the Linter extension. It's a first step, but I hope it can already help in fixing errors reported by Linter.
The features related to the Linter extension are the following:
- On the main WPCleaner window, there's a "Linter categories" button which gives the list of categories that Linter is detecting. Clicking on one of the categories returns the list of pages detected by Linter for this category. From the list of pages, you can go to the full analysis window for the pages that you want to fix.
- On the Full Analysis window, the second button with a globe and a broom (Subbu, would you have a recommended icon for Linter-related stuff?) retrieves the list of errors still detected by Linter on the current text: for each error, there's a magnifying glass that brings you to the location of the error in the text. You can then fix errors and check whether Linter still finds something in your current version.
- On the Check Wiki window, there's a similar button.
Subbu, I have a question about the result returned by the API to transform wikitext to lint: in the "dsr" fields, what is the meaning of the 4th value? (The first one is the beginning of the error, the second one is the end of the error, the third one is the length of the error...)
Nico
On 07/06/2017 05:09 PM, Nicolas Vervelle wrote:
[Snip]
I'm happy to announce that I have just released WPCleaner [1] version 1.43, which brings better integration with the Linter extension. It's a first step, but I hope it can already help in fixing errors reported by Linter.
Good to hear! Thanks for your work on this.
- On the Full Analysis window, the second button with a globe and a broom (Subbu, would you have a recommended icon for Linter-related stuff?)
I will have to get back to you on this. I'll have to get some help from someone who can design / recommend something appropriate here.
Subbu, I have a question about the result returned by the API to transform wikitext to lint: in the "dsr" fields, what is the meaning of the 4th value? (The first one is the beginning of the error, the second one is the end of the error, the third one is the length of the error...)
It is [ start-offset, end-offset, start-tag-width, end-tag-width ].
Note that in some cases, the 3rd and/or 4th values might be null.
So:
-> For "x\n\n<div>foo</div>" input wikitext, the DSR generated on the div tag will be [3,17,5,6].
-> For "x\n\n* foo" input wikitext, the DSR generated on the li tag will be [3,8,1,0].
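A quick sketch of how a client can slice the wikitext with those four numbers (this just replays the two examples above):

---
# Replaying the DSR examples: [start, end, start-tag-width, end-tag-width].
wikitext = "x\n\n<div>foo</div>"
start, end, stw, etw = 3, 17, 5, 6
print(wikitext[start:end])          # <div>foo</div>  -- the whole lint range
print(wikitext[start:start + stw])  # <div>           -- the start tag
print(wikitext[end - etw:end])      # </div>          -- the end tag

wikitext = "x\n\n* foo"
start, end, stw, etw = 3, 8, 1, 0
print(wikitext[start:end])          # * foo
print(wikitext[start:start + stw])  # *  -- the list marker acts as start tag
# etw == 0: wikitext list items have no explicit end tag
---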
Subbu.
- On the Full Analysis window, the second button with a globe and a broom (Subbu, would you have a recommended icon for Linter-related stuff?)
I will have to get back to you on this. I'll have to get some help from someone who can design / recommend something appropriate here.
I added a logo to https://www.mediawiki.org/wiki/Extension:Linter
Subbu.
Thanks very much for the detailed comments, Subbu. And thanks to the folks who are working on tools to help automate the necessary changes. It sounds like there's cooperative effort and careful planning that hopefully will make the Tidy to RemexHTML process a smooth one.
Pine
Thanks Pine!
One other related comment that perhaps I should have made earlier, and that is relevant to your broader question about the efforts of editors and bots, and where to spend time fixing things:
https://www.mediawiki.org/wiki/Help:Extension:Linter#Why_and_what_to_fix tries to clarify how we plan to leverage Linter going forward. It has guidance on which goals are served by which fixes, and also some language about priorities based on which goals are worth pursuing. This thread is about goal #1 (and, indirectly, goal #2) in that list.
Subbu.
On Sat, Jul 8, 2017 at 12:54 AM, Subramanya Sastry ssastry@wikimedia.org wrote:
- On the Full Analysis window, the second button with a globe and a broom (Subbu, would you have a recommended icon for Linter-related stuff?)
I will have to get back to you on this. I'll have to get some help from someone who can design / recommend something appropriate here.
I added a logo to https://www.mediawiki.org/wiki/Extension:Linter
Thanks, I've included it in WPCleaner :-)
Nico
On Thu, 6 Jul 2017 at 08:01 Pine W wiki.pine@gmail.com wrote:
I understand that moving from HTML 4 to HTML 5 is probably a good idea.
However, I am concerned about this statement: "This will require editors to fix pages and templates to address wikitext patterns that behave differently with RemexHTML".
As you probably know, the supply of content contributors' time is far too low to meet the demands of keeping up with everything that ideally would be done on the content projects.
The interpretation of wikitext changes from time to time, and has done ever since we started inventing it. New features get added, old features get removed, and existing ones get altered. Consequently, lines of wikitext that previously did one thing will then do another, which may or may not be desired.
When these kinds of changes happen, there's normally a brief notice in Tech/News with a few weeks' warning, and often a few community members do a quick scan for issues. Sometimes the effects of the change can be fixed with a bot on some wikis, though this is hampered by the lack of a cluster-wide bot policy; the one called the "global bot policy" on Meta doesn't allow technical fixes like this, and even if it did, it doesn't apply to all Wikimedia wikis. Some changes aren't automatically fixable, however; they instead require a human editor, ideally from that community, to judge what effect was intended and how to correct it, rather than making a simple substitution.
This set of changes is no different, except that we're being particularly cautious in alerting communities to those changes, taking our time to make sure this goes well, and providing a suite of tools to identify and fix these occurrences (which will be useful for future changes).
[Snip]
Here are a few questions:
- How many fixes do you think will be needed, for the highest priority fixes as well as all fixes?
In the document linked from the e-mail to which you are replying, it is stated that communities will need to fix the three "High" priority categories on https://www.mediawiki.org/wiki/Special:LintErrors and its equivalent on each wiki.
- How many hours of volunteer time do you think that these fixes will require, for the highest priority fixes as well as all fixes?
It will vary by wiki, especially regarding the point below. For MW.org it took maybe a few hours, spread over a half dozen individuals.
- How feasible would it be to build bots to make 90% of high priority fixes and 90% of all fixes?
That's a question for each community. In this case, the majority of complex fixes will need to be made to templates rather than directly in-text. I'm sure that semi-automated fixes will be appropriate for some communities, but others will feel that the fixes need to be made manually.
J.
On Thu, Jul 6, 2017 at 5:02 AM Subramanya Sastry ssastry@wikimedia.org wrote:
- Tools to assist editors: Linter & ParserMigration
In October 2016, at the parsing team offsite, Kunal ([[User:Legoktm (WMF)]]) dusted off the stalled wikitext linting project [11] and (with the help from a bunch of people on the Parsoid, db/security/code review areas) built the Linter extension that surfaces wikitext errors that Parsoid knows about to let editors fix them.
Earlier this year, we decided to use Linter in service of Tidy replacement. Based on our earlier testing results, we have added a set of high-priority linter categories that identifies specific wikitext markup patterns on wiki pages that need to be fixed [12].
Linter is certainly awesome, and kudos to Kunal for getting that done and pushed out. [[Special:LintErrors]] is super useful. I'm wondering if there's a dashboard somewhere that summarizes this across all wikis? If so, I missed it. If not, it should be pretty easy to wire something up to grab info from api.php on all wikis.
I think it'd help for coordinating cross-wiki efforts (bots, tools) as well as seeing which wikis are "done" and could be early candidates for migration.
-Chad
On 07/07/2017 04:05 PM, Chad wrote:
[Snip]
Linter is certainly awesome and kudos to Kunal for getting that done and pushed out. [[Special:LintErrors]] is super useful, I'm wondering if there's a dashboard somewhere that summarizes this across all wikis? If so, I missed it. If not, it should be pretty easy to wire something up to grab info from api.php on all wikis.
I think it'd help for coordinating cross-wiki efforts (bots, tools) as well as seeing which wikis are "done" and could be early candidates for migration.
Kunal did an early pass on https://tools.wmflabs.org/wikitext-deprecation/ but it needs to be picked up again and worked on. Help welcome. :-)
Subbu.
Hi Subbu !
I have barely started using WPCleaner to fix some errors reported by Linter, and I know I still have work to do on WPCleaner to make it easier for users. But I have a few questions / suggestions regarding Linter for the moment:
- Is it possible to also retrieve the localized names of the Linter categories and priorities? For example, on frwiki, you can see on the Linter page [1] that the high priority is translated into "Priorité haute" and that self-closed-tag has a user-friendly name, "Balises auto-fermantes". I don't see the localized names in the information sent by the API for siteinfo.
- Where is it possible to change the description displayed on each page dedicated to a category? For example, the page for self-closed-tag [2] is very short. It would be nice to be able to add a description of what the error is, what problems it can cause, and what the solutions are to fix it (or to be able to link to a page explaining all that).
- On the page dedicated to a category, there's a column telling whether the problem is caused by one template (and which one) or by several templates, but I don't get this information in the REST API for Linter. Is it possible to have it in the API result, or should I deduce it myself by checking whether the offset given by the API matches a call to a template?
[1] https://fr.wikipedia.org/wiki/Sp%C3%A9cial:LintErrors
[2] https://fr.wikipedia.org/wiki/Sp%C3%A9cial:LintErrors/self-closed-tag
On Jul 11, 2017, at 6:13 AM, Nicolas Vervelle nvervelle@gmail.com wrote:
- Where is it possible to change the description displayed on each page dedicated to a category?
https://phabricator.wikimedia.org/source/mediawiki-extensions-Linter/browse/...
For example, the page for self-closed-tag [2] is very short. It would be nice to be able to add a description of what the error is, what problems it can cause, and what the solutions are to fix it (or to be able to link to a page explaining all that).
In the top right corner, there's a link to "Aide"
https://www.mediawiki.org/wiki/Help:Extension:Linter/self-closed-tag
On 07/11/2017 05:13 AM, Nicolas Vervelle wrote:
But I have a few questions / suggestions regarding Linter for the moment:
- Is it possible to also retrieve the localized names of the Linter categories and priorities? For example, on frwiki, you can see on the Linter page [1] that the high priority is translated into "Priorité haute" and that self-closed-tag has a user-friendly name, "Balises auto-fermantes". I don't see the localized names in the information sent by the API for siteinfo.
Okay, will file a bug and take a look at this.
- Where is it possible to change the description displayed on each page dedicated to a category? For example, the page for self-closed-tag [2] is very short. It would be nice to be able to add a description of what the error is, what problems it can cause, and what the solutions are to fix it (or to be able to link to a page explaining all that).
Arlo already responded to this, but yes, you can update the message via translatewiki, I think.
- On the page dedicated to a category, there's a column telling whether the problem is caused by one template (and which one) or by several templates, but I don't get this information in the REST API for Linter. Is it possible to have it in the API result, or should I deduce it myself by checking whether the offset given by the API matches a call to a template?
Look for this in the API response for the lint entry:
|"templateInfo": { "multiPartTemplateBlock": true }|
Subbu.
On Tue, Jul 11, 2017 at 8:05 AM, Subramanya Sastry ssastry@wikimedia.org wrote:
On 07/11/2017 05:13 AM, Nicolas Vervelle wrote:
- Where is it possible to change the description displayed on each page dedicated to a category? For example, the page for self-closed-tag [2] is very short. It would be nice to be able to add a description of what the error is, what problems it can cause, and what the solutions are to fix it (or to be able to link to a page explaining all that).
Arlo already responded to this, but yes, you can update the message via translatewiki, I think.
Specifically: https://translatewiki.net/wiki/Special:Translate/ext-linter?language=fr&... (Translatewiki links can be found in the infobox, on all Extension pages. If anyone is not familiar with the project, see https://www.mediawiki.org/wiki/Translatewiki.net )
On Tue, Jul 11, 2017 at 5:05 PM, Subramanya Sastry ssastry@wikimedia.org wrote:
On 07/11/2017 05:13 AM, Nicolas Vervelle wrote:
But I have a few questions / suggestions regarding Linter for the moment:
- Is it possible to also retrieve the localized names of the Linter categories and priorities? For example, on frwiki, you can see on the Linter page [1] that the high priority is translated into "Priorité haute" and that self-closed-tag has a user-friendly name, "Balises auto-fermantes". I don't see the localized names in the information sent by the API for siteinfo.
Okay, will file a bug and take a look at this.
I used Arlo's answer, and I'm getting the localized names from the messages, so I can do without the localized names in the Linter responses.
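For reference, a sketch of that workaround using the standard allmessages API (the message key shown is my guess at the Linter extension's key-naming convention; check the extension's i18n files for the real keys):

---
import requests

# Ask frwiki for the localized label of a lint category. The key
# "linter-category-self-closed-tag" is an assumed name, not verified.
resp = requests.get(
    "https://fr.wikipedia.org/w/api.php",
    params={
        "action": "query",
        "meta": "allmessages",
        "ammessages": "linter-category-self-closed-tag",
        "format": "json",
    },
)
for msg in resp.json()["query"]["allmessages"]:
    print(msg["name"], "=>", msg.get("*"))  # e.g. "Balises auto-fermantes"
---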
Nico
Hi Subbu,
Using the localized names, I've found that not all Linter categories are listed in the API result. Is that normal? For example, on frwiki, Linter reports 3 "mixed-content" errors for "Les Trolls (film)", but this category is not in the API siteinfo call.
Nico
On 07/12/2017 01:12 AM, Nicolas Vervelle wrote:
Hi Subbu,
Using the localized names, I've found that not all Linter categories are listed in the API result. Is that normal? For example, on frwiki, Linter reports 3 "mixed-content" errors for "Les Trolls (film)", but this category is not in the API siteinfo call.
Yup.
Parsoid currently detects more patterns than are exposed via the Linter extension. Mixed content is more informational at this point; it will become relevant when we are ready to start nudging markup towards being more well-formed / well-balanced than it is now.
This was raised earlier on the Linter Extension talk page as well ( https://www.mediawiki.org/w/index.php?title=Topic:Tszvb85ccd0thbeo&topic... )
Subbu.
On Wed, Jul 12, 2017 at 4:43 PM, Subramanya Sastry ssastry@wikimedia.org wrote:
[Snip]
Ok, I will only report patterns known by the Linter extension then.
On Tue, Jul 11, 2017 at 5:05 PM, Subramanya Sastry ssastry@wikimedia.org wrote:
[Snip]
Thanks! I have updated WPCleaner to display the information about the template (template name or multiple templates).
I think I've found a discrepancy between Linter reports. On frwiki, the page "Discussion:Yasser Arafat" is reported in the list for self-closed-tag [1], but when I run the text of the page through the transform API [2], I only get errors for obsolete-tag and mixed-content, and nothing for self-closed-tag.
[1] https://fr.wikipedia.org/wiki/Sp%C3%A9cial:LintErrors/self-closed-tag
[2] https://fr.wikipedia.org/api/rest_v1/#!/Transforms/post_transform_wikitext_t...
On 07/13/2017 02:18 AM, Nicolas Vervelle wrote:
I think I've found a discrepancy between Linter reports. On frwiki, the page "Discussion:Yasser Arafat" is reported in the list for self-closed-tag [1], but when I run the text of the page through the transform API [2], I only get errors for obsolete-tag and mixed-content, and nothing for self-closed-tag.
When I pasted the wikitext for the Discussion:Yasser_Arafat page in the wikitext box AND entered the page title in the title box on https://fr.wikipedia.org/api/rest_v1/#!/Transforms/post_transform_wikitext_t..., I do see the following among others: ...
{
  "type": "self-closed-tag",
  "params": { "name": "span" },
  "dsr": [ 183063, 183134, null, null ],
  "templateInfo": { "name": "Modèle:Censuré" }
},
...
However, if I don't add the page title in the title box, I can reproduce your problem... so clearly this has something to do with a template that depends on the page title.
I can reproduce this on the command line with the specific wikitext substring that the Linter interface shows you. The output below shows that the linter error depends on having the page title there.
---
[subbu@earth parsoid] echo '{{Censuré|Tu remarqueras que je ne te retourne pas la question.<br />}}' | parse.js --page Discussion:Yasser_Arafat --prefix frwiki --lint > /dev/null
[info/lint/self-closed-tag][frwiki/Discussion:Yasser_Arafat] {"type":"self-closed-tag","params":{"name":"span"},"dsr":[0,71,null,null],"templateInfo":{"name":"Modèle:Censuré"}}
[info/lint/stripped-tag][frwiki/Discussion:Yasser_Arafat] {"type":"stripped-tag","params":{"name":"SPAN"},"dsr":[0,71,null,null],"templateInfo":{"name":"Modèle:Censuré"}}
[subbu@earth parsoid] echo '{{Censuré|Tu remarqueras que je ne te retourne pas la question.<br />}}' | parse.js --prefix frwiki --lint > /dev/null
[subbu@earth parsoid]
---
When I add the --dump tplsrc flag to Parsoid (you can also get this output by using the expandtemplates action API endpoint), I see the following:
---
<span class="censure" style="background-color:#EEF;color:#EEF;" title="Tu remarqueras que je ne te retourne pas la question.<br />"><span style="visibility:hidden">Tu remarqueras que je ne te retourne pas la question.<br /></span></span>
---
So, it looks like Parsoid's tokenizer is tripping on the "/>" that is present in the span's title attribute and falsely assumes it is a self-closing tag.
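As a toy illustration of that failure mode (this is emphatically not Parsoid's actual code, just a demonstration of why a scanner that ignores attribute-quote context misreads such markup):

---
import re

# Template-expanded HTML where the title attribute itself contains "<br />".
html = ('<span class="censure" title="Tu remarqueras ... <br />">'
        '<span style="visibility:hidden">Tu remarqueras ...</span></span>')

# A naive self-closing-tag scan that ignores quoting stops at the first "/>",
# which here sits inside the attribute value.
match = re.search(r'<(\w+)[^>]*/>', html)
print(match.group(1))  # span  -- a false positive, mirroring the lint report
---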
In any case, in conclusion:
(1) Please provide the page title when you use the API.
(2) There is a Parsoid bug in the detection of self-closing tags, where the presence of a "/>" in an HTML attribute triggers a false positive. This has been reported previously ... so I suppose it is not as uncommon as I thought. We'll take a look at that.
Subbu.
On Jul 13, 2017, at 10:35 AM, Subramanya Sastry ssastry@wikimedia.org wrote:
(2) There is a Parsoid bug in detection of self-closing tags where presence of a "/>" in an HTML attribute triggers a false positive. This has been reported previously ... so I suppose it is not as uncommon as I thought. We'll take a look at that.
No, Parsoid is doing that by design, to match the PHP parser.
See T97157 and https://phabricator.wikimedia.org/T170582#3435855
Hi folks,
Do you think that the implementation discussion should move to Phabricator?
Pine
I've started adding a detection in WPCleaner (error #532) for the missing-end-tag error reported by Linter (I'm starting with easy ones).
Is it normal that errors inside a gallery tag are reported as being in a "multiPartTemplateBlock" when they are directly inside the page wikitext? Examples on frwiki: Manali (https://fr.wikipedia.org/w/index.php?title=Manali&action=edit&lintid=4555235), Zillis-Reischen (https://fr.wikipedia.org/w/index.php?title=Zillis-Reischen&action=edit&lintid=4555585) ...
Nico
Hi Nico,
If you don't mind, let us move this more bug/feature-specific discussion to Phabricator by filing bugs where appropriate. Or, we can have discussions on-wiki at https://www.mediawiki.org/wiki/Help_talk:Extension:Linter. I'll copy your query to the talk page there and we can discuss it there.
Subbu.
On 07/06/2017 08:02 AM, Subramanya Sastry wrote:
......
- Monitoring progress
In order to monitor progress, we plan to do a weekly (or similarly periodic) test run that compares the rendering of pages with Tidy and with RemexHTML on a large sample of pages (in the 50K range) from a large subset of Wikimedia wikis (~50 or so). This will give us a pulse on how the fixups are going, and on when we might be able to flip the switch on different wikis.
I wanted to post some followups on this.
1. We have a revived dashboard that tracks linter error counts on wikis for all linter categories.
See https://tools.wmflabs.org/wikitext-deprecation/
2. We track the error counts as they change and publish weekly snapshots comparing the counts to a July 24th baseline (which is when I first started collecting stats).
See https://www.mediawiki.org/wiki/Parsing/Replacing_Tidy/Linter/Stats
3. We also have a pixel-diffs test run (previously called visual diffs) that compares page rendering with Tidy and with RemexHTML. The test set has 73K pages sampled from 60 wikis. These diffs more accurately reflect what kind of rendering differences we can expect to see if pages are not fixed.
See http://mw-expt-tests.wmflabs.org/
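The comparison step itself is conceptually simple. Here is a minimal sketch of the idea, assuming the two renderings of a page have already been screenshotted to PNG files of the same dimensions; this is an illustration, not the actual test harness:

    from PIL import Image, ImageChops

    def pixel_diff(path_a, path_b):
        # Normalize both screenshots to RGB so they can be compared directly.
        a = Image.open(path_a).convert("RGB")
        b = Image.open(path_b).convert("RGB")
        # difference() is black wherever the two images agree; getbbox()
        # returns the bounding box of the non-black region, or None if
        # the images are pixel-identical.
        return ImageChops.difference(a, b).getbbox()

    bbox = pixel_diff("page-tidy.png", "page-remex.png")
    if bbox:
        print("rendering differs within", bbox)
    else:
        print("renderings are identical")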
4. Based on the runs above, I identified one more high-priority linter category, a Tidy whitespace bug, that needs to be fixed (expect mostly templates, especially navboxes, based on what I've seen in the test runs above). Once the code is reviewed and deployed to the cluster, we'll start populating this category.
See https://gerrit.wikimedia.org/r/#/c/371068/ and https://gerrit.wikimedia.org/r/#/c/371071/
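As an illustration of the kind of markup involved (my own example): Tidy can alter the whitespace between adjacent inline tags, so tightly packed markup like the following may render with different spacing under Tidy than under RemexHTML, which is why navbox-style templates show up so often in this category:

    <span>[[Alpha]]</span><span>[[Beta]]</span>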
Thanks, Subbu.
Hello and thank you for this. Is there a phab ticket to follow the deployment process? Igal (User:IKhitron)
We have the original Tidy replacement ticket (https://phabricator.wikimedia.org/T89331) but, as we get closer to starting phased deployments, we'll create phab tickets to track the deployments separately.
Subbu.
Sorry for the misunderstanding, I was asking about the whitespace fix. Igal
Ah! No, there wasn't one. But I created https://phabricator.wikimedia.org/T173096 now and added you to the ticket. We are expecting to deploy it by the end of next week.
Subbu.
Saw it now. Thank you very much. Igal