I'm in the process of rewriting the old Sanitizer::removeHTMLTags to work much better. The new code properly closes implied end-tags and obeys some additional HTML rules about what can go where.
In-progress patches posted at: http://bugzilla.wikimedia.org/show_bug.cgi?id=5497
Before I finish this up, though, it would be good if we can agree on how to handle a few things.
** HTML across template boundaries
Right now there's a big behavior difference between the regular mode and the behavior with Tidy enabled. In regular mode, the HTML nesting and closing rules are applied separately to every transcluded text chunk. In Tidy mode, only the allowed-HTML check is applied at that stage, and nesting and closing are left for Tidy to fix up at the very end.
An example of a construct that breaks is a template that defines a table header like:
<table class="fooba">
and is included like this:
{{cool-table-start}} {{cool-row|blah}} {{cool-table-end}}
In current non-Tidy mode, this breaks violently as the <table> gets closed in the first template, and then all the following <tr>, <td> etc are rejected as they're not allowed in body text.
In current Tidy mode this is allowed to pass on through just fine; the pieces are assembled and then checked for nesting later.
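To make the difference concrete, here is a minimal Python sketch of the two orderings. It is not the Sanitizer code: balance() is a hypothetical toy normalizer that only knows that <tr>, <td> and <th> need an open <table>, and the three chunks stand in for the expanded text of the three templates above.

```python
import re

TAG_RE = re.compile(r"<(/?)([a-zA-Z]+)[^>]*>")
TABLE_ONLY = {"tr", "td", "th"}  # toy rule: only legal inside an open <table>

def balance(html):
    """Toy normalizer: drop table-only tags outside a table, close what is left open."""
    out, stack, pos = [], [], 0
    for m in TAG_RE.finditer(html):
        out.append(html[pos:m.start()])  # keep the text before this tag
        pos = m.end()
        closing, name = m.group(1) == "/", m.group(2).lower()
        if closing:
            if name in stack:
                # close everything up to and including the matching open tag
                while stack:
                    top = stack.pop()
                    out.append("</%s>" % top)
                    if top == name:
                        break
            # a close tag with no matching open tag is dropped
        elif name in TABLE_ONLY and "table" not in stack:
            pass  # rejected: not allowed in body text
        else:
            stack.append(name)
            out.append(m.group(0))
    out.append(html[pos:])
    while stack:  # close implied end tags
        out.append("</%s>" % stack.pop())
    return "".join(out)

# Hypothetical expansions of {{cool-table-start}}, {{cool-row|blah}}, {{cool-table-end}}
chunks = ['<table class="fooba">', "<tr><td>blah</td></tr>", "</table>"]

per_chunk = "".join(balance(c) for c in chunks)  # non-Tidy style: balance each chunk
whole = balance("".join(chunks))                 # Tidy style: assemble, then balance

print(per_chunk)
print(whole)
```

Balancing each chunk separately closes the table inside the first chunk and discards the row markup, while balancing the assembled text leaves the table intact.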
I really don't like this kind of construct as it makes it harder to treat transclusions at the abstract-parse-tree level in the future; in order to understand the markup _following_ the transclusion you need to have already expanded it. Yucky!
However the current system allows the same thing to work with wiki tables (eg {|class="fooba") in either mode. I'm pretty sure at least the latter are in fairly common use on Wikipedia.
So we either need to decide to Kill Them All, or accept the sacrifice for compatibility.
** Inline HTML across wiki blocks
Currently, removeHTMLTags is applied before most other parsing steps, most notably doBlockLevels which handles paragraph splitting, wiki lists, etc.
A consequence of this is that bad nesting / illegal overlapping can occur with a construct like this:
<b>First paragraph
Second paragraph
The HTML normalizer adds the missing close tag:
<b>First paragraph
Second paragraph</b>
and later the wiki block-level pass adds <p> tags:
<p><b>First paragraph </p><p>Second paragraph</b> </p>
This is fairly obviously incorrect; it _probably_ would make a reasonable amount of sense to rework how the block levels interact with the normalizer, so that block handling happens either before, or in concert with, the HTML normalization.
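The ordering problem can be reproduced with a small Python sketch. All three helpers are hypothetical stand-ins (close_open_tags for the normalizer, add_paragraphs for doBlockLevels, is_well_nested as a checker), and the sketch assumes the two paragraphs are separated by a blank line.

```python
import re

def split_blocks(text):
    """Split on blank lines, the way paragraph breaks are written in wikitext."""
    return [b.strip() for b in re.split(r"\n\s*\n", text) if b.strip()]

def close_open_tags(text, tag="b"):
    """Toy normalizer: append one close tag for every unclosed <tag>."""
    opens = len(re.findall(r"<%s>" % tag, text))
    closes = len(re.findall(r"</%s>" % tag, text))
    return text + ("</%s>" % tag) * max(0, opens - closes)

def add_paragraphs(text):
    """Toy block pass: wrap each blank-line-separated block in <p>...</p>."""
    return "".join("<p>%s</p>" % b for b in split_blocks(text))

def blocks_then_normalize(text):
    """Alternative order: split into blocks first, then balance each block."""
    return "".join("<p>%s</p>" % close_open_tags(b) for b in split_blocks(text))

def is_well_nested(html):
    """True if every close tag matches the most recently opened tag."""
    stack = []
    for m in re.finditer(r"<(/?)([a-z]+)>", html):
        if m.group(1):
            if not stack or stack.pop() != m.group(2):
                return False
        else:
            stack.append(m.group(2))
    return not stack

source = "<b>First paragraph\n\nSecond paragraph"

current = add_paragraphs(close_open_tags(source))  # normalize, then block levels
reordered = blocks_then_normalize(source)          # block levels, then normalize

print(current, is_well_nested(current))
print(reordered, is_well_nested(reordered))
```

Note that the reordered variant ends the bold at the paragraph break, so the second paragraph is no longer bold. Whether that is the desired behavior is a separate decision; the sketch only shows that the nesting comes out valid.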
** Mixing of HTML and wiki tables
Running tests on pages from French Wikipedia, I found a cute bugger that does something like this:
{|
<caption>A table caption</caption>
|-
|blah
|}
Since tables haven't been replaced in the output yet, this <caption> is in a <body> context as far as the HTML normalizer sees and it fails. But the old code let it through, in both tidy and non-tidy mode.
While this kind of admixture looks *supremely ugly* to me, do we have any reason to disallow it?
Should we think of the wiki table syntax as just a shortcut/transformation to HTML table tags, or should they be entirely separate entities?
-- brion vibber (brion @ pobox.com)
Moin,
On Saturday 03 June 2006 11:50, Brion Vibber wrote:
I'm in the process of rewriting the old Sanitizer::removeHTMLTags to work much better. The new code properly closes implied end-tags and obeys some additional HTML rules about what can go where.
In-progress patches posted at: http://bugzilla.wikimedia.org/show_bug.cgi?id=5497
Before I finish this up, though, it would be good if we can agree on how to handle a few things.
** HTML across template boundaries
[snip]
However the current system allows the same thing to work with wiki tables (eg {|class="fooba") in either mode. I'm pretty sure at least the latter are in fairly common use on Wikipedia.
So we either need to decide to Kill Them All, or accept the sacrifice for compatibility.
It is very convenient to be able to create templates that start tables with a lot of predefined markup, so users can just say:
{{start-table-for-specific-purpose-foo}}
instead of manually creating border sizes etc. It also makes mass-changing tables easier.
However, I am not sure you really need the <table> tag for this. Raw HTML shouldn't be necessary except in very limited circumstances. So what can <table> do that "{| |}" can't, and can't we teach the wiki code these things and finally kill normal HTML tables altogether?
** Mixing of HTML and wiki tables
Running tests on pages from French Wikipedia, I found a cute bugger that does something like this:
{|
<caption>A table caption</caption>
|-
|blah
|}
Since tables haven't been replaced in the output yet, this <caption> is in a <body> context as far as the HTML normalizer sees and it fails. But the old code let it through, in both tidy and non-tidy mode.
While this kind of admixture looks *supremely ugly* to me, do we have any reason to disallow it?
Three: Ugliness, cleanliness, simplicity. KISS. Can we please kill HTML tables? Now? :)
Should we think of the wiki table syntax as just a shortcut/transformation to HTML table tags, or should they be entirely separate entities?
I think the wiki syntax should be the norm.
My € 0.02,
Tels
On 03/06/06, Adrian Buehlmann ligulem@pobox.com wrote:
Tels wrote:
.... Can we please kill HTML tables? Now? :)
Certainly not. We need them because the "|" of wiki table interferes with template "|" and ParserFunctions "|".
I would support the opposite. Kill wikitext tables.
Rob Church
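The "|" conflict mentioned above can be illustrated with a deliberately naive Python sketch. split_template_call is a hypothetical helper that splits a template call on every "|"; the real parser is more careful about nested braces and links, but the pipes of a wiki table passed as an argument are not protected by either.

```python
def split_template_call(call):
    """Naively split '{{name|arg1|arg2}}' into (name, [args]) on every '|'."""
    inner = call[2:-2]  # strip the surrounding {{ and }}
    name, *args = inner.split("|")
    return name, args

# A wiki table passed as the single intended argument
wiki_call = "{{cool table|{|\n|-\n| first cell\n|}}}"
# The same content written with HTML table tags
html_call = "{{cool table|<tr><td>first cell</td></tr>}}"

print(split_template_call(wiki_call))
print(split_template_call(html_call))
```

The wiki-table version falls apart into five fragments, while the HTML version arrives as one argument.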
On 6/3/06, Tels nospam-abuse@bloodgate.com wrote:
It is very convenient to be able to create templates that start tables with a lot of predefined markup, so users can just say:
{{start-table-for-specific-purpose-foo}}
instead of manually creating border sizes etc. It also makes mass-changing tables easier.
Is it possible to use templates that instead have the form {{table for specific purpose|all the HTML code for the guts of the table}}?
ie:
{{cool table|<TR><TD>first cell</TD><TD>second cell</TD></TR>}}
Having templates that "open" some HTML code and then others that "close" it is inherently contradictory to the goal of always producing well formed HTML.
Another case to look at is {{col-begin}} {{col-break}} etc...
Steve
I've done a little analysis of the templates in the last en.wikipedia data dump to look for these Evil Templates with Incomplete Tables.
The entire enwiki data set contained 55,921 template pages as of May 18.
13,310 templates appear to contain complete wiki tables.
493 appear to have a wiki table start "{|" but no end
883 have an end "|}" but no start
1,717 contain table rows "|-" but neither start nor finish

2,218 templates appear to contain complete HTML tables.
34 have a <table> but no </table>
39 have a </table> but no <table>
274 have a <tr> but neither <table> nor </table>
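For readers who want to try a similar count on their own data, here is a rough Python sketch of one way to bucket templates. It is not the script used for the numbers above: classify, tally and the sample data are hypothetical, and the plain substring tests are an assumption about how such a count might be done.

```python
from collections import Counter

def classify(text, start, end, row):
    """Bucket a template by which table markers appear in it (plain substring tests)."""
    has_start, has_end, has_row = start in text, end in text, row in text
    if has_start and has_end:
        return "complete"
    if has_start:
        return "start only"
    if has_end:
        return "end only"
    if has_row:
        return "rows only"
    return "no table"

def tally(templates):
    """templates: dict of template title -> wikitext. Returns (wiki, html) Counters."""
    wiki, html = Counter(), Counter()
    for text in templates.values():
        wiki[classify(text, "{|", "|}", "|-")] += 1
        html[classify(text.lower(), "<table", "</table", "<tr")] += 1
    return wiki, html

# Hypothetical sample data, not taken from the dump
sample = {
    "Cool-table-start": '{| class="fooba"',
    "Cool-row": "|-\n| {{{1}}}",
    "Cool-table-end": "|}",
    "Html-start": '<table class="x"><tr>',
}

wiki, html = tally(sample)
print(dict(wiki))
print(dict(html))
```

Substring tests like these are crude. For example, a parameter with an empty default such as {{{foo|}}} contains "|}", and "complete" only means that both a start and an end marker are present, not that they are balanced or in order.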
I haven't looked up usage to see whether these are under- or over-represented in use.
The lists of templates matching the above criteria, the Jython script I used to run the counts, and a pre-extracted set of templates for those of you who don't want to download a 2-gigabyte file just to get 9 megs of compressed templates are at http://leuksman.com/misc/templates/
-- brion vibber (brion @ pobox.com)
The lists of templates matching the above criteria, the Jython script I used to run the counts, and a pre-extracted set of templates for those of you who don't want to download a 2-gigabyte file just to get 9 megs of compressed templates are at http://leuksman.com/misc/templates/
-- brion vibber (brion @ pobox.com)
A quick look shows several templates with horrible markup syntax. They are not closed, due to ignorance! For example, see http://en.wikipedia.org/w/index.php?title=Template:History_of_China, which has a <tr> left open: it starts as a wiki table, then changes to an HTML table, and finishes with wiki syntax again. Maybe someone who only knows HTML was adding rows...
Some nonsense notes: When I started, I preferred HTML tables, as I understood them. Wiki table syntax is not as easy as writing text or formatting quotes. I have also seen people uploading tables as images. HTML tables are also easily copied from HTML-exporting programs.
Tels wrote:
It is very convenient to be able to create templates that start tables with a lot of predefined markup, so users can just say:
{{start-table-for-specific-purpose-foo}}
instead of manually creating border sizes etc. It also makes mass-changing tables easier.
Uuuh.... bordersizes? "etc."??
You're supposed to write:
{| class="specific-purpose"
and define the look of the table in the CSS.
Timwi
Moin,
On Sunday 04 June 2006 15:04, Timwi wrote:
Tels wrote:
It is very convenient to be able to create templates that start tables with a lot of predefined markup, so users can just say:
{{start-table-for-specific-purpose-foo}}
instead of manually creating border sizes etc. It also makes mass-changing tables easier.
Uuuh.... bordersizes? "etc."??
You're supposed to write:
{| class="specific-purpose"
and define the look of the table in the CSS.
Normal users can't edit the CSS. That's why you add the CSS stuff (e.g. background colors, border sizes etc.) to a template, which gives you:
* easy changes by normal users
* revision history
* etc., all the normal wiki advantages
Adding the CSS to monobook.css or similar wouldn't work for that, nor would adding <style> tags (these aren't normally allowed), nor can normal users access the <head> section.
If we followed your logic then
{{red|txt=Red text}}
should be written as:
<span class="red">red text</span>
which isn't the wiki way, and is what turned many people off hand-editing HTML documents in the first place. :)
Best wishes,
Tels
Tels wrote:
Moin,
You're supposed to write:
{| class="specific-purpose"
and define the look of the table in the CSS.
Normal users can't edit the CSS.
Users can't edit [[MediaWiki:Monobook.css]], but that doesn't mean all the CSS has to go there. You could use templates (or at least one template) to include user-editable CSS from elsewhere.
If we followed your logic then {{red|txt=Red text}} should be written as: <span class="red">red text</span>
No, that is a strawman. I am neither arguing against wiki-syntax nor against templates.
Timwi
Moin,
On Monday 05 June 2006 01:28, Timwi wrote:
Tels wrote:
Moin,
You're supposed to write:
{| class="specific-purpose"
and define the look of the table in the CSS.
Normal users can't edit the CSS.
Users can't edit [[MediaWiki:Monobook.css]], but that doesn't mean all the CSS has to go there. You could use templates (or at least one template) to include user-editable CSS from elsewhere.
Except that you can't insert CSS into a template unless the style tag gets explicitly enabled (and last I checked that's, for instance, not possible on meta).
In any event, what's the difference between using:
class="foo"
and a template to set foo, and using a template directly anyway?
I think we agree here, we just don't know it yet :)
If we followed your logic then {{red|txt=Red text}} should be written as: <span class="red">red text</span>
No, that is a strawman. I am neither arguing against wiki-syntax nor against templates.
Then please don't inject arguments that aren't in the current discussion without adding OT :)
I was arguing for templates and for wiki-syntax. CSS doesn't have anything to do with it if I understood you correctly above :)
Best wishes,
Tels
Tels wrote:
Users can't edit [[MediaWiki:Monobook.css]], but that doesn't mean all the CSS has to go there. You could use templates (or at least one template) to include user-editable CSS from elsewhere.
Except that you can't insert CSS into a template unless the style tag gets explicitly enabled (and last I checked that's, for instance, not possible on meta).
This is simply not true. Go to [[Template:CSS specificpurpose]] and type
.specificpurpose { border: 1px solid black; background: red; }
Then go to [[MediaWiki:Monobook.css]] and insert
@import "/w/index.php?title=Template:CSS_specificpurpose&action=raw&ctype=text/css";
and voilà, the template is user-editable CSS.
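As a small illustration of how that URL is put together, here is a Python sketch that builds the @import line from a page title. raw_css_import is a hypothetical helper, the default script path /w/index.php is simply the one used in the example above, and the action=raw and ctype=text/css parameters are taken from that same example. The generated rule ends with a semicolon, which CSS expects after @import.

```python
from urllib.parse import urlencode

def raw_css_import(title, script_path="/w/index.php"):
    """Build an @import rule that loads a wiki page as raw CSS."""
    query = urlencode(
        {"title": title.replace(" ", "_"), "action": "raw", "ctype": "text/css"},
        safe=":/",  # keep ':' and '/' readable, as in the hand-written URL
    )
    return '@import "%s?%s";' % (script_path, query)

print(raw_css_import("Template:CSS specificpurpose"))
```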
However, I do not actually advocate this practice. I don't think CSS is suitable for editing by the average user; it should be reserved for admins who (1) know to leave it alone if they don't know what they're doing, and (2) can ensure project-wide layout consistency.
In any event, what's the difference between using:
class="foo"
and a template to set foo, and using a template directly anyway?
If you use something like this:
{| style="border: 1px solid black;"
then (1) the CSS applies no matter what skin anyone is using, but people who use MySkin probably don't want any CSS applied because they want to define it all themselves; and (2) if I don't want that border, I can't even remove it. However, in my example above I could just add to my User CSS:
.specificpurpose { border: none; }
No, that is a strawman. I am neither arguing against wiki-syntax nor against templates.
Then please don't inject arguments that aren't in the current discussion without adding OT :)
Huh? I was responding to this argument from you:
It is very convenient to be able to create templates that start tables with a lot of predefined markup, so users can just say:
{{start-table-for-specific-purpose-foo}}
instead of manually creating border sizes etc. It also makes mass-changing tables easier.
Under my proposal, users can just say:
{| class="specific-purpose"
which is surely shorter, cleaner, easier on the server (no template look-up), and more future-proof (does not make crazy assumptions about the working principles of the parser) and you _still_ don't need to manually create any style definitions, and mass-changing the style of those tables is no more or less easy.
I was arguing for templates and for wiki-syntax. CSS doesn't have anything to do with it if I understood you correctly above :)
Then according to your logic, surely we should use {{strike|text}} instead of <s>text</s>. (This is to demonstrate your strawman.)
No, you were actually arguing for templates _in a specific situation_ (starting a table with a certain style). When I refuted that argument, you suddenly claimed that my logic would apply to completely different situations. That is simply a strawman.
Timwi
Brion Vibber wrote:
<snip>
** Mixing of HTML and wiki tables
Running tests on pages from French Wikipedia, I found a cute bugger that does something like this:
{|
<caption>A table caption</caption>
|-
|blah
|}
Since tables haven't been replaced in the output yet, this <caption> is in a <body> context as far as the HTML normalizer sees and it fails. But the old code let it through, in both tidy and non-tidy mode.
While this kind of admixture looks *supremely ugly* to me, do we have any reason to disallow it?
This is probably a case of users not realizing that there is a wiki syntax for table captions ( |+ ). I did this a lot at the Vietnamese Wikipedia before I realized that this syntax even existed, because you don't see it as often as |- or ! for example. But couldn't cases like these be handled by having a bot replace the misplaced HTML with wikimarkup beforehand?
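A bot pass along those lines might look roughly like the following Python sketch. It is only an illustration: caption_to_wikitext is a hypothetical helper, it handles only a bare <caption>...</caption> that sits alone on its own line, and it does not check that the line is actually inside a wiki table or carry over any attributes.

```python
import re

# A bare <caption>...</caption> on a line of its own (no attributes)
CAPTION_RE = re.compile(
    r"^[ \t]*<caption>(.*?)</caption>[ \t]*$",
    re.IGNORECASE | re.MULTILINE,
)

def caption_to_wikitext(wikitext):
    """Rewrite each stand-alone HTML caption line as a wiki '|+' caption line."""
    return CAPTION_RE.sub(lambda m: "|+ " + m.group(1).strip(), wikitext)

before = "{|\n<caption>A table caption</caption>\n|-\n|blah\n|}"
print(caption_to_wikitext(before))
```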
Should we think of the wiki table syntax as just a shortcut/transformation to HTML table tags, or should they be entirely separate entities?
-- brion vibber (brion @ pobox.com)
wikitech-l@lists.wikimedia.org