On 3/24/06, Gabriel Wicke <gabrielwicke(a)users.sourceforge.net> wrote:
Update of /cvsroot/wikipedia/phase3/includes
In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv12319/includes
Modified Files:
Parser.php
Log Message:
Provide some cleanup if tidy is disabled:
* fix invalid nesting of anchors and i/b
* remove empty i/b tags
* remove divs inside anchors
Fixes several test cases
Index: Parser.php
===================================================================
RCS file: /cvsroot/wikipedia/phase3/includes/Parser.php,v
retrieving revision 1.602
retrieving revision 1.603
diff -u -d -r1.602 -r1.603
--- Parser.php 22 Mar 2006 04:57:14 -0000 1.602
+++ Parser.php 24 Mar 2006 16:36:29 -0000 1.603
@@ -250,6 +250,32 @@
if (($wgUseTidy and $this->mOptions->mTidy) or $wgAlwaysUseTidy) {
$text = Parser::tidy($text);
+ } else {
+ # attempt to sanitize at least some nesting problems
+ # (bug #2702 and quite a few others)
+ $tidyregs = array(
+ # ''Something [http://www.cool.com cool''] -->
+ # <i>Something</i><a href="http://www.cool.com"..><i>cool></i></a>
+ '/(<([bi])>)(<([bi])>)?([^<]*)(<\/?a[^<]*>)([^<]*)(<\/\\4>)?(<\/\\2>)/' =>
+ '\\1\\3\\5\\8\\9\\6\\1\\3\\7\\8\\9',
+ # fix up an anchor inside another anchor, only
+ # at least for a single single nested link (bug 3695)
+ '/(<a[^>]+>)([^<]*)(<a[^>]+>[^<]*)<\/a>(.*)<\/a>/' =>
+ '\\1\\2</a>\\3</a>\\1\\4</a>',
+ # fix div inside inline elements- doBlockLevels won't wrap a line which
+ # contains a div, so fix it up here; replace
+ # div with escaped text
+ '/(<([aib]) [^>]+>)([^<]*)(<div([^>]*)>)(.*)(<\/div>)([^<]*)(<\/\\2>)/' =>
+ '\\1\\3&lt;div\\5&gt;\\6&lt;/div&gt;\\8\\9',
+ # remove empty italic or bold tag pairs, some
+ # introduced by rules above
+ '/<([bi])><\/\\1>/' => ''
+ );
+
+ $text = preg_replace(
+ array_keys( $tidyregs ),
+ array_values( $tidyregs ),
+ $text );
}
wfRunHooks( 'ParserAfterTidy', array( &$this, &$text ) );
This fixes the "Bug 2702: Mismatched <i>, <b> and <a> tags are
invalid" test case, but it's not really an improvement. The test case
was supposed to demonstrate that we don't balance tags, and this change
doesn't fix that. It merely hacks around a few very specific cases with
regular expressions, which fail as soon as you insert more tags. A
parser that balanced tags properly would handle those cases too.
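To make the point concrete, here is a rough sketch of my own, not anything in CVS. It runs the first rule from the commit on the bug 2702 input and on the same input with one extra <b> pair, then runs both through a small stack-based balancer. The function name balanceInlineTags is hypothetical, it only knows about a/b/i, it treats every other tag as plain content, and it uses modern PHP type declarations. Treat it as an illustration of the approach under those assumptions, not as a proposed patch.

```php
<?php
// Hypothetical helper, not part of MediaWiki: balance a small set of inline
// tags with a stack instead of pattern-matching specific bad nestings.
function balanceInlineTags(string $html, array $tags = ['a', 'b', 'i']): string
{
    $alt = implode('|', array_map('preg_quote', $tags));
    // Split into the tags we manage and everything else (kept verbatim).
    $tokens = preg_split(
        '#(</?(?:' . $alt . ')\b[^>]*>)#i',
        $html,
        -1,
        PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY
    );

    $out = '';
    $open = [];     // stack of [name, full opening tag] currently open in $out
    $pending = [];  // tags we closed early; reopened only if more content follows

    foreach ($tokens as $tok) {
        $isTag = preg_match('#^<(/?)(' . $alt . ')\b[^>]*>$#i', $tok, $m);
        if (!$isTag || $m[1] === '') {
            // Content or an opening tag: first reopen what we closed early.
            foreach ($pending as $p) {
                $out .= $p[1];
                $open[] = $p;
            }
            $pending = [];
            $out .= $tok;
            if ($isTag) {
                $open[] = [strtolower($m[2]), $tok];
            }
            continue;
        }

        $name = strtolower($m[2]);

        // Closing tag for something we already closed early: nothing to emit.
        $idx = null;
        foreach ($pending as $i => $p) {
            if ($p[0] === $name) {
                $idx = $i;
            }
        }
        if ($idx !== null) {
            array_splice($pending, $idx, 1);
            continue;
        }

        // Find the matching opener, searching from the top of the stack.
        $k = null;
        for ($i = count($open) - 1; $i >= 0; $i--) {
            if ($open[$i][0] === $name) {
                $k = $i;
                break;
            }
        }
        if ($k === null) {
            continue; // stray closing tag: drop it
        }

        // Close everything opened inside $name, innermost first, then $name.
        $above = array_splice($open, $k + 1);
        foreach (array_reverse($above) as $p) {
            $out .= "</{$p[0]}>";
        }
        array_pop($open);
        $out .= "</$name>";
        $pending = array_merge($above, $pending);
    }

    // Close whatever is still open at the end of the input.
    while ($open) {
        $p = array_pop($open);
        $out .= "</{$p[0]}>";
    }
    return $out;
}

// First rule from the commit, copied verbatim.
$rule = '/(<([bi])>)(<([bi])>)?([^<]*)(<\/?a[^<]*>)([^<]*)(<\/\\4>)?(<\/\\2>)/';
$repl = '\\1\\3\\5\\8\\9\\6\\1\\3\\7\\8\\9';

$simple = '<i>Something <a href="http://www.cool.com">cool</i></a>';
$extra  = '<i>Some <b>bold</b> thing <a href="http://www.cool.com">cool</i></a>';

$regexSimple = preg_replace($rule, $repl, $simple); // rewritten by the rule
$regexExtra  = preg_replace($rule, $repl, $extra);  // rule no longer matches

echo $regexSimple, "\n", $regexExtra, "\n";
echo balanceInlineTags($simple), "\n", balanceInlineTags($extra), "\n";
```

With the extra <b>...</b> in front of the link, the regex finds nothing to rewrite and the mismatched </i></a> goes through untouched, while the stack version closes the anchor before the italic in both inputs. A real fix would also have to decide what to do about block-level elements inside inline ones, which this sketch ignores.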
I'm all for fixing the parser, but making the existing parser test
cases pass by writing a hack that targets just those tests, rather than
fixing the core issue, is not an improvement.