Getting rid of $wgWellFormedXml = false;


Brian Wolff

3 May 2016

12:42 a.m.

So currently, we have two ways of outputting HTML: $wgWellFormedXml = true (the default) outputs HTML that happens to conform to the rules of XML. $wgWellFormedXml = false, on the other hand, uses the more lax HTML5 rules to save a few bytes.

Having two modes of output feels rather silly to me. Originally I think this was meant as a feature flag while $wgWellFormedXml=false stabilized, but it never got turned on, and here we are 7 years later.

Having $wgWellFormedXml=false increases the complexity of the code, and not all that many people use it (a notable exception is translatewiki). I think it's important that security-critical code be as simple as possible. Furthermore, there seems to be very little benefit to having the second mode (after you account for gzip, saving a few bytes by writing <img> instead of <img/> really doesn't matter, imo).
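The gzip point can be checked with a rough experiment. The sketch below is in Python rather than MediaWiki's PHP, and the page content, the file names and the `build_page` helper are all made up for illustration; real wiki pages will give different numbers.

```python
import gzip


def build_page(void_close: str, count: int = 200) -> bytes:
    """Build a synthetic page of img tags that end in void_close ('>' or '/>')."""
    tags = [
        f'<img src="File_{i}.png" alt="Figure {i}"{void_close}'
        for i in range(count)
    ]
    return ("<div>" + "".join(tags) + "</div>").encode("utf-8")


# Hypothetical pages: identical except for how the void element is closed.
xml_style = build_page("/>")
html5_style = build_page(">")

# One byte per tag is saved before compression.
raw_saving = len(xml_style) - len(html5_style)

# After gzip, the repeated "/>" is absorbed into back-references,
# so the difference shrinks to a small fraction of the raw saving.
gzip_saving = len(gzip.compress(xml_style)) - len(gzip.compress(html5_style))

print(f"raw bytes saved:     {raw_saving}")
print(f"gzipped bytes saved: {gzip_saving}")
```

On a page this repetitive the compressed difference should be far smaller than the raw one, which is the argument being made above.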

With that in mind, I would like to propose killing $wgWellFormedXml = false. I'm not so much attached to the true mode (although I do feel the true mode is significantly more sane); I simply want there to be a single mode. Changing the default to false was vetoed in T52040, so I think that true would be the best choice going forward if we are getting rid of one of the modes.

If there are aspects of the other mode that people really want, then I think we should simply merge them into the default behavior instead of having two separate modes.

See gerrit patch https://gerrit.wikimedia.org/r/286495 I would appreciate everyone's feedback.

Thanks, Brian


Brion Vibber

3 May

3:32 a.m.

I'd say an HTML5 output mode *ought* to work like this:

*Don't try to be clever.*
* Consistency and predictability are key to both security review and data consumability.

*Quote attributes consistently and predictably.*
* Always use double-quotes on attributes in output.

*Output specced empty tags in HTML style.*
* <img>, <hr>, <br> are fine and not ambiguous at all to an HTML parser. There's no need to go adding a "/" in at the end!
* These are already whitelisted in the Html class so it's easy to not mess this up.

*Don't do other silly things for old-school XHTML 1.*
* CDATA wrapping of <script>s and <style>s is not needed.
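The rules above could be sketched roughly as follows. This is a minimal Python illustration, not MediaWiki's Html class: the `element` helper and the `VOID_ELEMENTS` set are hypothetical names, and the set follows the HTML specification's list of void elements rather than whatever the Html class actually whitelists.

```python
from html import escape

# Void elements per the HTML specification (assumed list for this sketch);
# the message above names <img>, <hr> and <br>.
VOID_ELEMENTS = {
    "area", "base", "br", "col", "embed", "hr", "img",
    "input", "link", "meta", "source", "track", "wbr",
}


def element(name, attrs=None, text=""):
    """Serialize one element using a single, predictable set of rules."""
    # Always double-quote attribute values and escape them the same way.
    attr_str = "".join(
        f' {key}="{escape(str(value), quote=True)}"'
        for key, value in (attrs or {}).items()
    )
    if name in VOID_ELEMENTS:
        # HTML-style void element: no trailing "/" and no closing tag.
        return f"<{name}{attr_str}>"
    return f"<{name}{attr_str}>{escape(text, quote=False)}</{name}>"


print(element("img", {"src": "a.png", "alt": 'say "hi"'}))
print(element("br"))
print(element("p", {"class": "x"}, "1 < 2"))
```

There is no mode switch here; every caller gets the same quoting and the same void-element handling, which is what makes the output easy to review.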

The only benefit of $wgWellFormedXml was that you could toss your "well-formed" tag soup into an XML parser that didn't grok HTML. I have no idea if that worked reliably or was actually useful to anyone, but it's probably worth confirming that before actually removing the funky self-closing tags.

-- brion


Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Brian Wolff

4:04 a.m.

...

The only benefit of $wgWellFormedXml was that you could toss your "well-formed" tag soup into an XML parser that didn't grok HTML. I have no idea if that worked reliably or was actually useful to anyone, but it's probably worth confirming that before actually removing the funky self-closing tags.

There are references to it breaking people's screen scraping bots last time it was turned on. That was like 5 years ago though.

--bawolff

Max Semenik

6:43 a.m.

On Mon, May 2, 2016 at 3:04 PM, Brian Wolff bawolff@gmail.com wrote:

...

There are references to it breaking people's screen scraping bots last time it was turned on. That was like 5 years ago though.

At this point, I would say that everybody who screen-scrapes saw it coming and breaking them is a good thing as sometimes, lessons just have to be learned.

Best regards, Max Semenik ([[User:MaxSem]])

Gergo Tisza

8:34 p.m.

On Tue, May 3, 2016 at 2:43 AM, Max Semenik maxsem.wiki@gmail.com wrote:

...

At this point, I would say that everybody who screen-scrapes saw it coming and breaking them is a good thing as sometimes, lessons just have to be learned.

There aren't many options other than content-scraping if you want to transform Wikipedia articles into some semblance of structured data. We even do it ourselves, for media metadata (and use an XML parser for it, as PHP doesn't offer much in the way of parsing HTML5, so outputting HTML5-style empty tags might break it - although IIRC there is a hack to work around that as file pages can contain ill-formed HTML anyway).

Gergo Tisza

8:40 p.m.

On Tue, May 3, 2016 at 4:34 PM, Gergo Tisza gtisza@wikimedia.org wrote:

...

There aren't many options other than content-scraping if you want to transform Wikipedia articles into some semblance of structured data. We even do it ourselves, for media metadata (and use an XML parser for it

Actually the XML parser has been replaced with DOMDocument a while ago, which can handle HTML5 fine. But the point stands: HTML scraping is hardly an unusual requirement for reusers of our content.
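The difference between a strict XML parser and an HTML-aware one can be illustrated with a small sketch. It uses Python's standard library as a stand-in for the PHP tools mentioned above (so `xml.etree.ElementTree` and `html.parser` here, not DOMDocument), and the sample markup and helper names are invented for the example.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Hypothetical fragments: the same content in the two output styles.
XML_STYLE = '<p><img src="a.png"/></p>'
HTML5_STYLE = '<p><img src="a.png"></p>'


def xml_image_sources(markup):
    """Return img src values via a strict XML parser, or None if parsing fails."""
    try:
        root = ET.fromstring(markup)
    except ET.ParseError:
        # An unclosed <img> is not well-formed XML, so the parser gives up.
        return None
    return [img.get("src") for img in root.iter("img")]


class ImageCollector(HTMLParser):
    """Collect img src values; copes with both <img> and <img/>."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            self.sources.extend(value for key, value in attrs if key == "src")


def html_image_sources(markup):
    collector = ImageCollector()
    collector.feed(markup)
    collector.close()
    return collector.sources


for label, markup in (("XML style", XML_STYLE), ("HTML5 style", HTML5_STYLE)):
    print(label, xml_image_sources(markup), html_image_sources(markup))
```

A scraper built on the strict XML parser only works while the output stays well-formed, whereas one built on an HTML parser is unaffected by which style is emitted.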

Brian Wolff

4 May

1:19 a.m.

On Monday, May 2, 2016, Max Semenik maxsem.wiki@gmail.com wrote:

...

On Mon, May 2, 2016 at 3:04 PM, Brian Wolff bawolff@gmail.com wrote:

...

...

At this point, I would say that everybody who screen-scrapes saw it coming and breaking them is a good thing as sometimes, lessons just have to be learned.

Personally, I don't think we should shy away from breaking screen scrapers if we get something out of it, but in this case I don't see the benefit. Breaking things because we can, without getting any benefit (or only trivial benefits), seems rather pointless and kind of mean to those who do scrape.

-- bawolff

Legoktm

14 May

7:07 a.m.

Hi,

On 05/02/2016 11:42 AM, Brian Wolff wrote:

...

See gerrit patch https://gerrit.wikimedia.org/r/286495 I would appreciate everyone's feedback.

Given the lack of objections here and on Gerrit, I went ahead and merged it today.

-- Legoktm

Strainu

8 p.m.

2016-05-14 4:07 GMT+03:00 Legoktm legoktm.wikipedia@gmail.com:

...

Hi,

On 05/02/2016 11:42 AM, Brian Wolff wrote:

...
See gerrit patch https://gerrit.wikimedia.org/r/286495 I would appreciate everyone's feedback.

Given the lack of objections here and on Gerrit, I went ahead and merged it today.

Can you please clarify if this change will have any effect on non-valid HTML in the Wikitext? I suppose no change will occur, since this was the default anyway, but I'd like a confirmation.

Strainu

Brian Wolff

15 May

3:02 a.m.

On Saturday, May 14, 2016, Strainu strainu10@gmail.com wrote:

...


Can you please clarify if this change will have any effect on non-valid HTML in the Wikitext? I suppose no change will occur, since this was the default anyway, but I'd like a confirmation.

Strainu

That is correct. Nothing will change about invalid HTML: if you have Tidy enabled, the invalid HTML gets fixed; if you don't, it does not.

-- bawolff

Antoine Musso

1:12 a.m.

Le 14/05/2016 à 03:07, Legoktm a écrit :

...

Hi,

On 05/02/2016 11:42 AM, Brian Wolff wrote:

...
See gerrit patch https://gerrit.wikimedia.org/r/286495 I would appreciate everyone's feedback.

Given the lack of objections here and on Gerrit, I went ahead and merged it today.

Hello,

That sounds good. I would suggest applying it to REL1_27 as well.

-- Antoine "hashar" Musso
