On 3/24/06, Gabriel Wicke gabrielwicke@users.sourceforge.net wrote:
Update of /cvsroot/wikipedia/phase3/includes
In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv12319/includes
Modified Files:
	Parser.php
Log Message:
Provide some cleanup if tidy is disabled:
- fix invalid nesting of anchors and i/b
- remove empty i/b tags
- remove divs inside anchors
Fixes several test cases
Index: Parser.php
RCS file: /cvsroot/wikipedia/phase3/includes/Parser.php,v
retrieving revision 1.602
retrieving revision 1.603
diff -u -d -r1.602 -r1.603
--- Parser.php	22 Mar 2006 04:57:14 -0000	1.602
+++ Parser.php	24 Mar 2006 16:36:29 -0000	1.603
@@ -250,6 +250,32 @@
if (($wgUseTidy and $this->mOptions->mTidy) or $wgAlwaysUseTidy) {
	$text = Parser::tidy($text);
} else {
# attempt to sanitize at least some nesting problems
# (bug #2702 and quite a few others)
$tidyregs = array(
# ''Something [http://www.cool.com cool''] -->
# <i>Something</i><a href="http://www.cool.com"..><i>cool></i></a>
'/(<([bi])>)(<([bi])>)?([^<]*)(<\/?a[^<]*>)([^<]*)(<\/\\4>)?(<\/\\2>)/' =>
'\\1\\3\\5\\8\\9\\6\\1\\3\\7\\8\\9',
# fix up an anchor inside another anchor, but only
# for a single nested link (bug 3695)
'/(<a[^>]+>)([^<]*)(<a[^>]+>[^<]*)<\/a>(.*)<\/a>/' =>
'\\1\\2</a>\\3</a>\\1\\4</a>',
# fix div inside inline elements- doBlockLevels won't wrap a line which
# contains a div, so fix it up here; replace
# div with escaped text
'/(<([aib]) [^>]+>)([^<]*)(<div([^>]*)>)(.*)(<\/div>)([^<]*)(<\/\\2>)/' =>
'\\1\\3<div\\5>\\6</div>\\8\\9',
# remove empty italic or bold tag pairs, some
# introduced by rules above
'/<([bi])><\/\\1>/' => ''
);
$text = preg_replace(
array_keys( $tidyregs ),
array_values( $tidyregs ),
	$text );
}
wfRunHooks( 'ParserAfterTidy', array( &$this, &$text ) );
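As an illustration (not part of the commit), the first rule above can be exercised on its own against the kind of output the parser emits for ''Something [http://www.cool.com cool''] before the cleanup runs; the sample string below is assumed for demonstration.

<?php
// Standalone sketch: apply only the first $tidyregs pattern to a string
// with an <i> that is closed inside the <a> it overlaps with.
$pattern     = '/(<([bi])>)(<([bi])>)?([^<]*)(<\/?a[^<]*>)([^<]*)(<\/\\4>)?(<\/\\2>)/';
$replacement = '\\1\\3\\5\\8\\9\\6\\1\\3\\7\\8\\9';

$dirty = '<i>Something <a href="http://www.cool.com">cool</i></a>';

echo preg_replace( $pattern, $replacement, $dirty ), "\n";
// Prints: <i>Something </i><a href="http://www.cool.com"><i>cool</i></a>
// The italic element is closed before the anchor and reopened inside it,
// so both the <i> and the <a> end up properly nested.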
This fixes the "Bug 2702: Mismatched <i>, <b> and <a> tags are invalid" test case but it's not really an improvement. The test case was supposed to demonstrate that we don't balance tags, which this doesn't fix, it merely hacks around very specific cases with regular expressions which fail if you insert more tags which would be fixed in a parser that balanced tags properly.
I'm all for fixing the parser, but it's not an improvement to make that parser test cases we have pass by basically writing a hack in the parser to make just that test pass rather than fixing the core issue.
On Sun, 2006-03-26 at 03:08 +0000, Ævar Arnfjörð Bjarmason wrote:
This fixes the "Bug 2702: Mismatched <i>, <b> and <a> tags are invalid" test case but it's not really an improvement. The test case was supposed to demonstrate that we don't balance tags, which this doesn't fix, it merely hacks around very specific cases with regular expressions which fail if you insert more tags which would be fixed in a parser that balanced tags properly.
I'm all for fixing the parser, but it's not an improvement to make that parser test cases we have pass by basically writing a hack in the parser to make just that test pass rather than fixing the core issue.
Ævar,
I'm the last one to disagree with you about the core issue. I did start an (unfinished) attempt to write a Bison-based parser, and I added tidy support to MediaWiki to get at least close to XHTML. Timwi and others got further with the flexbisonparse module in CVS.
But still, as it stands, the parser in use is a sophisticated regexp collection.
While I would welcome all effort being directed towards a proper solution of the core issue, this is simply not happening right now, for various reasons. Shifting the focus of MediaWiki development from the current parser to a better solution is difficult at best; I doubt anything short of doing the full implementation, or funding it, will do. I would love to be wrong on this, though, and I would really like to work on such a parser.
My task at hand was to fix bugs in the current parser, not to work on a replacement for it. The change above makes MediaWiki-generated (X)HTML slightly less invalid, so I consider it an improvement overall.
On Wed, Mar 29, 2006 at 07:14:52PM +0200, Gabriel Wicke wrote:
While I would welcome all effort being directed towards a proper solution of the core issue, this is simply not happening right now, for various reasons. Shifting the focus of MediaWiki development from the current parser to a better solution is difficult at best; I doubt anything short of doing the full implementation, or funding it, will do. I would love to be wrong on this, though, and I would really like to work on such a parser.
Since you appear to be the person sleeping with the parser this month (:-), I'll direct this question to you:
Just how complex is it?
I'd like to get maybe 75-80% of the pertinent parts of the parser glued into WebGUI as a replacement for its current article markup facilities, but that will entail rewriting it in Perl.
Is that even remotely practical-sounding?
Cheers, -- jra
Jay R. Ashworth wrote:
I'd like to get maybe 75-80% of the pertinent parts of the parser glued into WebGUI as a replacement for its current article markup facilities, but that will entail rewriting it in Perl.
You might take a look at existing attempts; I think a couple are Perl-ish: http://meta.wikimedia.org/wiki/Alternative_parsers
-- brion vibber (brion @ pobox.com)
On Wed, 2006-03-29 at 12:55 -0500, Jay R. Ashworth wrote:
Just how complex is it?
Take a look at Parser.php, function parse().
I'd like to get maybe 75-80% of the pertinent parts of the parser glued into WebGUI as a replacement for its current article markup facilities, but that will entail rewriting it in Perl.
Is that even remotely practical-sounding?
Hmm- depends on how you weigh things to get those 75-80% ;-)
On Wed, Mar 29, 2006 at 08:29:20PM +0200, Gabriel Wicke wrote:
I'd like to get maybe 75-80% of the pertinent parts of the parser glued into WebGUI as a replacement for its current article markup facilities, but that will entail rewriting it in Perl.
Is that even remotely practical-sounding?
Hmm- depends on how you weigh things to get those 75-80% ;-)
Understood. Most of what I'm after is easier markup: graf breaks, links, italics and bold, maybe lists. The table stuff is nice, but lower on my list. Transclusion is probably Right Out, since there's already a way to accomplish that, and the page framing of WebGUI is sufficiently different, semantically, to make it hard to describe what it would do, anyway.
I'll check out that article brion pointed me to.
Cheers, -- jra
On 3/29/06, Gabriel Wicke lists@wikidev.net wrote:
My task at hand was to fix bugs in the current parser, not to work on a replacement for it. The change above makes MediaWiki-generated (X)HTML slightly less invalid, so I consider it an improvement overall.
It is, I agree with you on that.
What I have an issue with is that the test case I'm talking about (and there might be others) is supposed to test whether the parser can balance (X)HTML tags, which it can't (with or without your fix). Even though you've fixed, with a hack, some cases where the parser will generate output containing unbalanced tags, the core issue has still not been solved, which is okay; we have a lot of unsolved core issues. But what's worse is that the test case that tested whether the parser could balance tags now passes although the parser still can't balance tags (because your fix only fixes a very small subset of the unbalanced tags that the parser might generate).
Thus what I'm asking you to do is go over the test cases that were solved with your fix(es) and add new ones that do fail if you didn't solve the real problem they were supposed to test for (e.g. if it tested whether the parser can balance tags and you only fixed <i><b></i></b> or some other things like that), because having a test suite that passes when we still have some core parsing problems doesn't mean that we have a decent parser; it means that the test suite has become worthless.
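To illustrate the kind of general fix being asked for, a stack-based rebalancing pass over inline tags might look like the sketch below. This is a generic illustration of the technique, not proposed MediaWiki code; the function name and the restriction to <i>/<b> are assumptions.

<?php
// Sketch: balance a small set of inline tags (<i>, <b>) with a stack,
// instead of pattern-matching individual broken cases.
function balanceInlineTags( $html ) {
	$tokens = preg_split( '/(<\/?[bi]>)/', $html, -1, PREG_SPLIT_DELIM_CAPTURE );
	$stack  = array();
	$out    = '';
	foreach ( $tokens as $tok ) {
		if ( preg_match( '/^<([bi])>$/', $tok, $m ) ) {
			$stack[] = $m[1];   // opening tag: remember it
			$out .= $tok;
		} elseif ( preg_match( '/^<\/([bi])>$/', $tok, $m ) ) {
			if ( !in_array( $m[1], $stack ) ) {
				continue;       // stray close tag with no matching open: drop it
			}
			// Close more recently opened tags first, then reopen them,
			// so </i></b> never appears for <b><i> input.
			$reopen = array();
			while ( ( $top = array_pop( $stack ) ) !== $m[1] ) {
				$out .= "</$top>";
				$reopen[] = $top;
			}
			$out .= $tok;
			foreach ( array_reverse( $reopen ) as $t ) {
				$stack[] = $t;
				$out .= "<$t>";
			}
		} else {
			$out .= $tok;       // plain text passes through unchanged
		}
	}
	while ( $stack ) {
		$out .= '</' . array_pop( $stack ) . '>';   // close anything left open
	}
	return $out;
}

echo balanceInlineTags( '<b>bold <i>both</b> italic</i>' ), "\n";
// Prints: <b>bold <i>both</i></b><i> italic</i>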
On 3/29/06, Ævar Arnfjörð Bjarmason avarab@gmail.com wrote:
<i><b></i></b> or some other things like that), because having a test suite that passes when we still have some core parsing problems doesn't mean that we have a decent parser; it means that the test suite has become worthless.
Maybe "insufficient" would be a nicer way of putting it than "worthless"? :)
Steve
On Wed, Mar 29, 2006 at 10:43:54PM +0200, Steve Bennett wrote:
On 3/29/06, Ævar Arnfjörð Bjarmason avarab@gmail.com wrote:
<i><b></i></b> or some other things like that), because having a test suite that passes when we still have some core parsing problems doesn't mean that we have a decent parser; it means that the test suite has become worthless.
Maybe "insufficient" would be a nicer way of putting it than "worthless"? :)
It might well be nicer, but optimizing the language to avoid offending people may not be the most effective solution, and a test suite -- or at least any component thereof -- *is* sort of a binary proposition.
In fact, "worse than useless" fits this case nicely; the test suite *implies* that all is well, but it's providing a false sense of security; all is not well.
Cheers, -- jra
On Wed, 2006-03-29 at 18:48 +0000, Ævar Arnfjörð Bjarmason wrote:
Thus what I'm asking you to do is go over the test cases that were solved with your fix(es) and add new ones that do fail if you didn't solve the real problem they were supposed to test for (e.g. if it tested whether the parser can balance tags and you only fixed <i><b></i></b> or some other things like that), because having a test suite that passes when we still have some core parsing problems doesn't mean that we have a decent parser; it means that the test suite has become worthless.
Ævar,
I do not agree that the test suite has become worthless: it currently provides 301 regression tests and a sound specification of what a future parser needs to do. Any drop-in replacement parser will have to pass the same tests as the current one, plus more tricky ones.
Having many failing tests to document obvious and known shortcomings that won't be fixed without a rewrite tends to decrease the motivation to run the test suite and distracts from important regressions. Who cares if there are 50 or 52 failing tests? Having all tests pass and then suddenly two fail generates far more attention.
In my opinion deep structural shortcomings are better documented in other places (bug reports and design documents come to mind) or in not-yet-implemented/wishlist tests that are not run by default, but can be enabled with a switch to test new features or implementations.
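Such switch-enabled wishlist tests could be as simple as the sketch below. The test list, the 'passes' field standing in for actually running a test, and the --wishlist flag are all hypothetical; this is not the real parserTests.php runner.

<?php
// Sketch: skip tests marked as wishlist unless --wishlist is given.
$runWishlist = isset( $argv ) && in_array( '--wishlist', $argv );

$tests = array(
	array( 'name' => 'Bug 2702 regression case',          'wishlist' => false, 'passes' => true ),
	array( 'name' => 'Balance arbitrary mismatched tags', 'wishlist' => true,  'passes' => false ),
);

foreach ( $tests as $t ) {
	if ( $t['wishlist'] && !$runWishlist ) {
		echo "SKIPPED (wishlist): {$t['name']}\n";
		continue;
	}
	echo ( $t['passes'] ? 'PASSED' : 'FAILED' ) . ": {$t['name']}\n";
}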
In XP (Extreme Programming), unit tests are mainly a way to enable changes without having to worry too much about unforeseen consequences: if the tests pass, you should be fine. Having all tests pass does not mean that the product is perfect. It only means the product is doing what the tests specify; it says nothing about what they don't specify.
On 3/30/06, Gabriel Wicke lists@wikidev.net wrote:
Having many failing tests to document obvious and known shortcomings that won't be fixed without a rewrite tends to decrease the motivation to run the test suite and distracts from important regressions. Who cares if there are 50 or 52 failing tests? Having all tests pass and then suddenly two fail generates far more attention.
I agree. I was a bit shocked when I started receiving "20 tests failed" messages; normally a failed regression test is a cause for serious alarm.
On the other hand, you could definitely create tests for features that have not been implemented, and run them separately, perhaps replacing "FAILED" with "Sorry, not yet".
Steve
Gabriel Wicke wrote:
On Sun, 2006-03-26 at 03:08 +0000, Ævar Arnfjörð Bjarmason wrote:
This fixes the "Bug 2702: Mismatched <i>, <b> and <a> tags are invalid" test case but it's not really an improvement. The test case was supposed to demonstrate that we don't balance tags, which this doesn't fix, it merely hacks around very specific cases with regular expressions which fail if you insert more tags which would be fixed in a parser that balanced tags properly.
I'm all for fixing the parser, but it's not an improvement to make that parser test cases we have pass by basically writing a hack in the parser to make just that test pass rather than fixing the core issue.
Ævar,
I'm the last one to disagree with you about the core issue. I did start an (unfinished) attempt to write a Bison-based parser, and I added tidy support to MediaWiki to get at least close to XHTML. Timwi and others got further with the flexbisonparse module in CVS.
I'm not joking (and not trying to spam the mailing list yet again with my stuff;-) when I say that my PHP-based wiki2xml parser/converter is probably the only working alternative out there. Note that this is not one of my many abandoned, half-finished toys; this one works, on the whole syntax, on real pages.
Among other nice habits, it nests HTML tags correctly (or converts them to plain text, which makes it easy to see the error). The same, of course, goes for wiki markup. It might be a little more strict than the current MediaWiki parser; that might be a good thing, actually.
It is also surprisingly fast for its approach (which is similar to the code structure flexbisonparse generates as C code). It converts [[en:Biology]], including all templates, to XML in less than 0.5 seconds (not counting the time it takes to load the source texts via the web). I am certain these times can be reduced by further tweaking. However, an additional step to generate XHTML from XML has to be added (which should be quite fast IMHO).
During the XML generation, it can collect all links in the page, which can then be looked up with the usual single query before converting it to XHTML, making the current replacement hell obsolete. Actually, it doesn't use *any* regexps, except to remove HTML comments.
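A rough sketch of that single batched lookup is given below; the page table, the page_title column and the PDO connection are illustrative assumptions, not wiki2xml or MediaWiki code.

<?php
// Sketch: collect every link title while walking the XML, then ask the
// database once which of them exist.
function lookupExistingTitles( PDO $db, array $titles ) {
	if ( !$titles ) {
		return array();
	}
	$placeholders = implode( ',', array_fill( 0, count( $titles ), '?' ) );
	$stmt = $db->prepare(
		"SELECT page_title FROM page WHERE page_title IN ($placeholders)"
	);
	$stmt->execute( array_values( $titles ) );
	return $stmt->fetchAll( PDO::FETCH_COLUMN );
}

// One round trip then decides which links render as existing pages and
// which as "new page" links during the XML-to-XHTML step:
// $existing = lookupExistingTitles( $db, $linksCollectedFromXml );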
While there are likely lots of new, fun bugs to discover, IMHO they can be squashed far more easily than in the current parser, due to its cleaner structure (hey, don't laugh! I can write structured code if I really want to! :-)
I will go ahead and add an XHTML export function to the wiki2xml code. Anyone interested in outfitting MediaWiki with a decent almost-parser is invited to help ;-) especially with a potential integration into MediaWiki itself.
Magnus
Gabriel Wicke wrote:
On Sun, 2006-03-26 at 03:08 +0000, Ævar Arnfjörð Bjarmason wrote:
This fixes the "Bug 2702: Mismatched <i>, <b> and <a> tags are invalid" test case but it's not really an improvement. The test case was supposed to demonstrate that we don't balance tags, which this doesn't fix, it merely hacks around very specific cases with regular expressions which fail if you insert more tags which would be fixed in a parser that balanced tags properly.
I'm all for fixing the parser, but it's not an improvement to make that parser test cases we have pass by basically writing a hack in the parser to make just that test pass rather than fixing the core issue.
Ævar,
I'm the last one to disagree with you about the core issue. I did start an (unfinished) attempt to write a Bison-based parser, and I added tidy support to MediaWiki to get at least close to XHTML. Timwi and others got further with the flexbisonparse module in CVS.
But still, as it stands, the parser in use is a sophisticated regexp collection.
While I would welcome all effort being directed towards a proper solution of the core issue, this is simply not happening right now, for various reasons. Shifting the focus of MediaWiki development from the current parser to a better solution is difficult at best; I doubt anything short of doing the full implementation, or funding it, will do. I would love to be wrong on this, though, and I would really like to work on such a parser.
My task at hand was to fix bugs in the current parser, not to work on a replacement for it. The change above makes MediaWiki-generated (X)HTML slightly less invalid, so I consider it an improvement overall.
Has anyone considered using a packrat parser for implementing MediaWiki parsing? Packrat parsers use simple grammar-driven top-down recursion, are naturally ambiguity-resolving, and use memoization to prevent the otherwise inevitable combinatorial explosion. They are quite fast-running and yet easy to write, and are much easier to understand and hack than most compiler-compiler-style parsers.
-- Neil
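To make the suggestion concrete, a toy packrat parser in PHP could look like the sketch below: ordinary recursive descent plus a memo table keyed by rule and input position, applied here to a tiny arithmetic grammar rather than wikitext. The class and grammar are illustrative only, not Neil's code.

<?php
// Sketch: each rule is tried at most once per input offset, because
// apply() caches the (result, next position) pair per rule and offset.
class PackratParser {
	private $input;
	private $memo = array();

	public function __construct( $input ) {
		$this->input = $input;
	}

	private function apply( $rule, $pos ) {
		$key = "$rule:$pos";
		if ( !isset( $this->memo[$key] ) ) {
			$this->memo[$key] = $this->$rule( $pos );
		}
		return $this->memo[$key];
	}

	// Sum  -> Prod ( '+' Prod )*
	private function parseSum( $pos ) {
		list( $val, $pos ) = $this->apply( 'parseProd', $pos );
		if ( $val === null ) return array( null, $pos );
		while ( isset( $this->input[$pos] ) && $this->input[$pos] === '+' ) {
			list( $rhs, $next ) = $this->apply( 'parseProd', $pos + 1 );
			if ( $rhs === null ) break;
			$val += $rhs;
			$pos  = $next;
		}
		return array( $val, $pos );
	}

	// Prod -> Num ( '*' Num )*
	private function parseProd( $pos ) {
		list( $val, $pos ) = $this->apply( 'parseNum', $pos );
		if ( $val === null ) return array( null, $pos );
		while ( isset( $this->input[$pos] ) && $this->input[$pos] === '*' ) {
			list( $rhs, $next ) = $this->apply( 'parseNum', $pos + 1 );
			if ( $rhs === null ) break;
			$val *= $rhs;
			$pos  = $next;
		}
		return array( $val, $pos );
	}

	// Num  -> [0-9]+
	private function parseNum( $pos ) {
		if ( $pos >= strlen( $this->input ) ) return array( null, $pos );
		$len = strspn( $this->input, '0123456789', $pos );
		if ( $len === 0 ) return array( null, $pos );
		return array( (int)substr( $this->input, $pos, $len ), $pos + $len );
	}

	public function parse() {
		list( $val, $pos ) = $this->apply( 'parseSum', 0 );
		return $pos === strlen( $this->input ) ? $val : null;
	}
}

$p = new PackratParser( '2+3*4' );
echo $p->parse(), "\n"; // prints 14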