Hi, I would appreciate your help with wrestling Tidy into submission: I have a revision, https://gerrit.wikimedia.org/r/#/c/80578/ which allows disabling the TOC in ParserOutput. Because only a marker, rather than the full TOC, now passes through Tidy, it sometimes gets wrapped in <p> tags. I tried various things, such as different types of marker, but Tidy is stubborn. Does anyone know a way to work around this?
Yes, this is somewhat similar to what I had to deal with when adding mw:editsection.
Take a look at includes/parser/Tidy.php, specifically the MWTidyWrapper class. You'll find that in the end tidy couldn't play nice at all. To support the mw:editsection placeholder I had to explicitly add a new wrapper class to our tidy code. It holds a mapping of UNIQ style tokens to the contents of mw:editsection tokens and completely hides the fact that mw:editsection exists from Tidy. (And later I had to add a meta/link -> html-meta/html-link mapping just to stop it from screwing up the RDFa/Microdata style meta/link tags in the body.)
You'll have to tweak the tidy code to also wrap your tokens.
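A minimal sketch of the token-wrapping idea described above, written in Python for brevity rather than as MediaWiki's actual PHP code. The class name `TokenWrapper`, the token format, and the `fake_tidy` stand-in are all assumptions made for illustration; only the general approach (swap the placeholder for an opaque UNIQ style token, run the cleaner, swap it back) comes from the description.

```python
import re
import uuid

# Hypothetical placeholder syntax, loosely modelled on mw:editsection.
PLACEHOLDER_RE = re.compile(
    r"<mw:editsection\b[^>]*>.*?</mw:editsection>", re.DOTALL
)


class TokenWrapper:
    """Swap placeholders for opaque tokens and restore them afterwards."""

    def __init__(self):
        self.tokens = {}
        # Random prefix so that user-submitted text cannot collide with a token.
        self.prefix = "UNIQ-" + uuid.uuid4().hex

    def hide(self, html):
        """Replace each placeholder with a token and remember the mapping."""
        def replace(match):
            token = f"{self.prefix}-{len(self.tokens)}-QINU"
            self.tokens[token] = match.group(0)
            return token
        return PLACEHOLDER_RE.sub(replace, html)

    def restore(self, html):
        """Put the original placeholders back in place of their tokens."""
        for token, original in self.tokens.items():
            html = html.replace(token, original)
        return html


def fake_tidy(html):
    """Stand-in for the real tidy pass (an assumption for this sketch)."""
    return html.strip()


def clean(html):
    """Hide placeholders, run the cleaner, then restore the placeholders."""
    wrapper = TokenWrapper()
    return wrapper.restore(fake_tidy(wrapper.hide(html)))
```

The cleaner only ever sees plain text tokens, so it has no custom tag to mangle. Whether it still wraps a bare token in <p> is a separate question.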
Also beware of parameters/data you need passed with placeholder tags. I had some language converter issues I had to deal with. Some params you explicitly do want converted and others you don't. So you end up having to mix specific attribute names and contents just to get specific behavior out of the converter. I had failing unit tests in my changes till I found a way to deal with this.
I've been wanting a generic API to do this kind of post-processing in the Parser for a while. Though I just thought of an idea on how to generically solve the "To language convert or not to language convert" issue:
  <mw:sometoken>
    <mw:param name="some-key" value="..." /><!-- Language converter should ignore this -->
    <mw:param name="some-key">...</mw:param><!-- Language converter should convert this -->
  </mw:sometoken>
Or we could stick with attributes but automatically prefix every attribute name with convert- or noconvert-. Then hack all the LanguageConverters to handle that.
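To make the prefix idea concrete, here is a rough Python sketch. Nothing like this exists in MediaWiki as far as this thread says; the function names and the handling of unprefixed attributes are assumptions for illustration only.

```python
CONVERT = "convert-"
NOCONVERT = "noconvert-"


def split_attributes(attrs):
    """Partition attributes by their convert-/noconvert- prefix."""
    to_convert, verbatim = {}, {}
    for name, value in attrs.items():
        # Check noconvert- first; it does not start with convert-, but
        # keeping the more specific prefix first makes the intent obvious.
        if name.startswith(NOCONVERT):
            verbatim[name[len(NOCONVERT):]] = value
        elif name.startswith(CONVERT):
            to_convert[name[len(CONVERT):]] = value
        else:
            # Assumption: unprefixed attributes are left untouched.
            verbatim[name] = value
    return to_convert, verbatim


def apply_converter(attrs, convert):
    """Run `convert` only on the values that asked for conversion."""
    to_convert, verbatim = split_attributes(attrs)
    result = dict(verbatim)
    result.update({name: convert(value) for name, value in to_convert.items()})
    return result
```

For example, `apply_converter({"convert-title": "abc", "noconvert-page": "Main_Page"}, str.upper)` converts only the title value, with `str.upper` standing in for a real language converter.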
~Daniel Friesen (Dantman, Nadir-Seen-Fire) [http://danielfriesen.name/]
On 2013-10-07 5:45 PM, Max Semenik wrote:
> Hi, I would appreciate your help with wrestling Tidy into submission: I have a revision, https://gerrit.wikimedia.org/r/#/c/80578/ which allows disabling the TOC in ParserOutput. Because only a marker, rather than the full TOC, now passes through Tidy, it sometimes gets wrapped in <p> tags. I tried various things, such as different types of marker, but Tidy is stubborn. Does anyone know a way to work around this?
On 08.10.2013, 5:27 Daniel wrote:
> Yes, this is somewhat similar to what I had to deal with when adding mw:editsection.
> Take a look at includes/parser/Tidy.php, specifically the MWTidyWrapper class. You'll find that in the end tidy couldn't play nice at all. To support the mw:editsection placeholder I had to explicitly add a new wrapper class to our tidy code. It holds a mapping of UNIQ style tokens to the contents of mw:editsection tokens and completely hides the fact that mw:editsection exists from Tidy. (And later I had to add a meta/link -> html-meta/html-link mapping just to stop it from screwing up the RDFa/Microdata style meta/link tags in the body.)
> You'll have to tweak the tidy code to also wrap your tokens.
Thanks Daniel - but my problem is not that Tidy corrupts the tag itself, but that it wraps it in <p>. This is different from mw:editsection, which is always surrounded by other tags. After much experimenting, I came to the conclusion that the only way to reliably prevent the <p> addition is... to make the marker a <p> itself :) I used a unique attribute, so there's no chance it will clash with anything, including user-submitted tags.
> Also beware of parameters/data you need passed with placeholder tags. I had some language converter issues I had to deal with. Some params you explicitly do want converted and others you don't. So you end up having to mix specific attribute names and contents just to get specific behavior out of the converter. I had failing unit tests in my changes till I found a way to deal with this.
I need no parameters, so no danger here.
The patchset that passes all tests is at https://gerrit.wikimedia.org/r/#/c/80578/ . Can someone take a look at it, please?
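For readers following along, the <p>-marker approach can be sketched roughly as follows. This is Python, not the code in the patchset; the attribute name `data-example-toc-marker` and the `finalize` function are hypothetical and chosen only for illustration.

```python
import re

# Hypothetical attribute name; the actual patch may use something different.
MARKER_ATTR = "data-example-toc-marker"
MARKER = f'<p {MARKER_ATTR}=""></p>'
# Tolerate whitespace that a cleaner might insert inside the empty <p>.
MARKER_RE = re.compile(rf'<p {MARKER_ATTR}="">\s*</p>')


def finalize(html, toc_html, show_toc=True):
    """Replace the marker with the TOC, or drop it if the TOC is disabled."""
    replacement = toc_html if show_toc else ""
    # A callable replacement avoids backslash handling in toc_html.
    return MARKER_RE.sub(lambda match: replacement, html, count=1)
```

Because the marker is already a <p>, a cleaner has no reason to wrap it in another one; the marker is swapped out only once the output is assembled.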
On 2013-10-09 8:04 AM, Max Semenik wrote:
> On 08.10.2013, 5:27 Daniel wrote:
>> You'll have to tweak the tidy code to also wrap your tokens.
> Thanks Daniel - but my problem is not that Tidy corrupts the tag itself, but that it wraps it in <p>. This is different from mw:editsection, which is always surrounded by other tags. After much experimenting, I came to the conclusion that the only way to reliably prevent the <p> addition is... to make the marker a <p> itself :) I used a unique attribute, so there's no chance it will clash with anything, including user-submitted tags.
Please please, *please* do not give up on implementing this with a proper placeholder and put an even worse hack into the Parser.
There should be some way to find out exactly what is happening, where, and then target only that spot with a work-around.
Some things to start with:
* Is this really Tidy adding the <p>? Or is it actually the Parser adding them? (In the latter case I'd try adding a UNIQ there.)
* You tried wrapping with a UNIQ inside that tidy wrapper code, right? If Tidy is wrapping the text itself in a <p>, then how about targeting Tidy specifically by wrapping the UNIQ in one single expected <div> and then stripping that? Or maybe, instead of a <div>, add "mw-uniq" to new-blocklevel-tags and wrap the UNIQ in <mw-uniq> in the tidy wrapper.
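The second suggestion above could look roughly like this, sketched in Python rather than in the actual tidy wrapper code. The token text, the class name `mw-uniq-wrap`, and both function names are made up for illustration, and whether the real Tidy leaves such a <div> alone has not been verified here.

```python
import re

# Hypothetical token and wrapper markup, for illustration only.
TOKEN = "UNIQ-toc-marker-QINU"
WRAP_OPEN = '<div class="mw-uniq-wrap">'
WRAP_CLOSE = "</div>"


def wrap_for_tidy(html, token=TOKEN):
    """Wrap the bare token in one known block-level element before tidy runs."""
    return html.replace(token, WRAP_OPEN + token + WRAP_CLOSE)


def unwrap_after_tidy(html, token=TOKEN):
    """Strip exactly the wrapper we added, tolerating reflowed whitespace."""
    pattern = re.compile(
        re.escape(WRAP_OPEN) + r"\s*" + re.escape(token) + r"\s*" + re.escape(WRAP_CLOSE)
    )
    return pattern.sub(lambda match: token, html)
```

The point of the design is that the work-around lives entirely around the tidy call: the token is block-level while the cleaner looks at it, and the Parser output itself is unchanged when tidy is not in use.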
Something like mw:foo shouldn't be changed to a <p> in the Parser just because of tidy. It doesn't make sense to make the parser worse just because of tidy when not everyone is using tidy and we have an open bug [1] planning to kill tidy and replace it with something that won't suck so much (and theoretically won't give you the same issue with adding a mw:foo).
[1] https://bugzilla.wikimedia.org/show_bug.cgi?id=54617
~Daniel Friesen (Dantman, Nadir-Seen-Fire) [http://danielfriesen.name/]
(Tangentially related: I made a proposal to eradicate Tidy some days ago over at https://bugzilla.wikimedia.org/show_bug.cgi?id=54617 . I haven't looked into alternatives much yet – not enough free time – but http://www.bioinformatics.org/phplabware/internal_utilities/htmLawed/index.p... looks promising. The bug is open for taking :) )
wikitech-l@lists.wikimedia.org