Hi,
in response to Brion Vibber's reply to my posting to the wrong mailing list (apologies), I'm attaching here a patch that I think fixes the bug where five apostrophes produce invalid HTML.
This takes care of quite a lot of weird cases and nestings. However, of course I'm aware that it is not perfect. I can construct cases where it will still fail, but those are cases that I don't think will ever actually come up in an encyclopedia article (or if they do, the author was feeling fancy and deserves to be shot ;-) ).
Here's the patch:
--- includes/OutputPage-old.php	Sun Jun 29 04:06:12 2003
+++ includes/OutputPage.php	Sat Jun 28 19:04:42 2003
@@ -675,6 +675,15 @@
 /* private */ function doQuotes( $text )
 {
+	/* prevent invalid HTML (<strong><em>...</strong></em>) */
+	$text = preg_replace( "/'''''(.+)'''(.*)''/mU", "<em><strong>$1</strong>$2</em>", $text );
+	$text = preg_replace( "/'''''(.+)''(.+)'''/mU", "<strong><em>$1</em>$2</strong>", $text );
+	$text = preg_replace( "/'''(.*[^'])''([^'].*)'''''/mU", "<strong>$1<em>$2</em></strong>", $text );
+	$text = preg_replace( "/''(.+)'''(.+)'''''/mU", "<em>$1<strong>$2</strong></em>", $text );
+
+	$text = preg_replace( "/(<strong>.*)'''(.+)'''(.*<\/strong>)/mU", "$1</strong>$2<strong>$3", $text );
+	$text = preg_replace( "/(<em>.*)''(.+)''(.*<\/em>)/mU", "$1</em>$2<em>$3", $text );
+
 	$text = preg_replace( "/'''(.+)'''/mU", "<strong>$1</strong>", $text );
 	$text = preg_replace( "/''(.+)''/mU", "<em>$1</em>", $text );
 	return $text;
How do you usually handle casual programmer contributions? Are they usually posted to this mailing list? Or do you have a BugZilla installation somewhere where I should upload this as a patch? Or is there an actual chance I might get write access to the repository?
Greetings, Timwi
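The bug under discussion can be reproduced in isolation. The following is a minimal sketch, assuming the two substitutions that appear as context lines in the patch above are all the unpatched doQuotes does; the standalone function name doQuotesBeforePatch is made up for illustration.

```php
<?php
// Standalone copy of the two pre-patch substitutions (context lines of the
// diff above). The name doQuotesBeforePatch is hypothetical.
function doQuotesBeforePatch(string $text): string
{
    // ''' is handled first, so it swallows the outer three of five apostrophes
    $text = preg_replace("/'''(.+)'''/mU", '<strong>$1</strong>', $text);
    // the remaining '' pairs now straddle the closing </strong>
    $text = preg_replace("/''(.+)''/mU", '<em>$1</em>', $text);
    return $text;
}

echo doQuotesBeforePatch("'''''text'''''"), "\n";
// prints: <strong><em>text</strong></em>
```

The closing tags come out in the same order as the opening ones instead of the reverse, which is the invalid nesting the patch tries to prevent.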
Timwi wrote:
> in response to Brion Vibber's reply to my posting to the wrong mailing list (apologies),
No problem. :)
> This takes care of quite a lot of weird cases and nestings. However, of course I'm aware that it is not perfect. I can construct cases where it will still fail, but those are cases that I don't think will ever actually come up in an encyclopedia article (or if they do, the author was feeling fancy and deserves to be shot ;-) ).
Hmm, here are some cases that really oughtn't fail; even if they're probably rare, they're perfectly legit. Using a pattern twice on the same line shouldn't cause it to fail. :(
* ''em '''em-strong''''' normal ''em '''em-strong'''''
  goes to: <em>em <strong>em-strong<em><strong> normal </em>em </strong>em-strong</em></strong>
* '''strong ''em-strong''''' normal '''strong ''em-strong'''''
  goes to: <strong>strong <em>em-strong<em><strong> normal </strong>strong </em>em-strong</em></strong>
* '''''em-strong'' strong''' normal '''''em-strong'' strong'''
  goes to: <em><strong>em-strong<em> strong</strong> normal </em><strong>em-strong</em> strong</strong>
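Outputs like the three above can be checked mechanically rather than by eye. A minimal sketch, assuming only <em> and <strong> tags matter and that "valid" here just means tags close in the reverse order they were opened; isProperlyNested is a hypothetical helper, not something in the code base.

```php
<?php
// Hypothetical helper: returns true when every <em>/<strong> tag in $html
// is closed in the reverse order it was opened.
function isProperlyNested(string $html): bool
{
    // Collect opening and closing em/strong tags in document order.
    preg_match_all('#<(/?)(em|strong)>#', $html, $tags, PREG_SET_ORDER);
    $open = [];
    foreach ($tags as $tag) {
        if ($tag[1] === '') {
            $open[] = $tag[2];                   // opening tag: push
        } elseif (array_pop($open) !== $tag[2]) {
            return false;                        // closing tag does not match
        }
    }
    return $open === [];                         // nothing left unclosed
}

var_dump(isProperlyNested("<em>em <strong>em-strong</strong></em>"));
// bool(true)
var_dump(isProperlyNested(
    "<em>em <strong>em-strong<em><strong> normal </em>em </strong>em-strong</em></strong>"
));
// bool(false)
```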
An alternate implementation, incidentally, might be to treat '' and ''' as toggles rather than nestable delimiters, and ensure that the correct nesting level in the output HTML is maintained upon changing the toggles.
> How do you usually handle casual programmer contributions? Are they usually posted to this mailing list? Or do you have a BugZilla installation somewhere where I should upload this as a patch? Or is there an actual chance I might get write access to the repository?
In theory you can upload patches on SourceForge, but no one will ever look at them. :) Posting to the list is the most likely way to get people to look it over.
If we ever get around to setting up a CVS commit notification mailer that would do as well...
-- brion vibber (brion @ pobox.com)
> Hmm, here are some cases that really oughtn't fail; even if they're probably rare, they're perfectly legit. Using a pattern twice on the same line shouldn't cause it to fail. :(
> * ''em '''em-strong''''' normal ''em '''em-strong'''''
>   goes to: <em>em <strong>em-strong<em><strong> normal </em>em </strong>em-strong</em></strong>
Ah, I see.
Well, I've already sort of concluded (and now for certain) that regexps won't do this. You shouldn't have made Wiki-markup use the same character (') for two concepts that can be nested. You should have used '' for em and "" for strong. :-p
Now, the case you described at least produces completely wrong output, which is better than wrong HTML that renders right. This forces the author to go back and check. Eventually they will realise that they can get it to work by putting a linebreak in between (which obviously won't put a linebreak in the final output, so you're okay).
> An alternate implementation, incidentally, might be to treat '' and ''' as toggles rather than nestable delimiters, and ensure that the correct nesting level in the output HTML is maintained upon changing the toggles.
Yes, that would have been my second implementation plan, but I don't think that'll be possible with just regexps.
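The toggle idea can be sketched with a regexp that only tokenizes and a small loop that keeps the state. This is a rough illustration under stated assumptions, not a proposed patch: renderQuotes and toggleTag are hypothetical names, the input is taken to be a single line, runs of four or more than five apostrophes are handled by leaving the extras literal, and anything still open at the end of the line is simply closed.

```php
<?php
// Sketch only: renderQuotes() and toggleTag() are hypothetical names.

// Flip one tag on or off while keeping the open-tag stack properly nested.
function toggleTag(array &$open, string $tag): string
{
    $pos = array_search($tag, $open, true);
    if ($pos === false) {                 // not open yet: open it
        $open[] = $tag;
        return "<$tag>";
    }
    // Tags opened after $tag must be closed first and reopened afterwards.
    $inner = array_slice($open, $pos + 1);
    $open  = array_merge(array_slice($open, 0, $pos), $inner);
    $html  = '';
    foreach (array_reverse($inner) as $t) {
        $html .= "</$t>";
    }
    $html .= "</$tag>";
    foreach ($inner as $t) {
        $html .= "<$t>";
    }
    return $html;
}

function renderQuotes(string $line): string
{
    // The regexp only tokenizes; odd indexes hold the apostrophe runs.
    $parts = preg_split("/('{2,})/", $line, -1, PREG_SPLIT_DELIM_CAPTURE);
    $open  = [];                          // open tags, innermost last
    $out   = '';
    foreach ($parts as $i => $part) {
        if ($i % 2 === 0) {               // even indexes are plain text
            $out .= $part;
            continue;
        }
        $n = strlen($part);
        if ($n === 2) {
            $tags = ['em'];
        } elseif ($n === 3) {
            $tags = ['strong'];
        } elseif ($n === 4) {             // assumption: one literal ' plus '''
            $out .= "'";
            $tags = ['strong'];
        } else {                          // 5 or more: extras stay literal
            $out .= str_repeat("'", $n - 5);
            // Close whatever is open (innermost first), then open the rest.
            $tags = array_merge(array_reverse($open), array_diff(['em', 'strong'], $open));
        }
        foreach ($tags as $tag) {
            $out .= toggleTag($open, $tag);
        }
    }
    // Assumption: anything still open at the end of the line is closed.
    foreach (array_reverse($open) as $t) {
        $out .= "</$t>";
    }
    return $out;
}

echo renderQuotes("''em '''em-strong''''' normal ''em '''em-strong'''''"), "\n";
// prints: <em>em <strong>em-strong</strong></em> normal <em>em <strong>em-strong</strong></em>
```

Because each toggle is resolved against the stack of open tags, using a pattern twice on the same line no longer matters, and a mis-nested input such as '' text ''' text '' text ''' comes out as <em> text <strong> text </strong></em><strong> text </strong>.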
> Posting to the list is the most likely way to get people to look it over. If we ever get around to setting up a CVS commit notification mailer that would do as well...
Hm, I'm sorry, but this didn't really answer my question. Are you going to commit the patch for me (once I've made it work satisfactorily), or are you going to give me CVS write access? I'm wondering this because copying & pasting patches from e-mails seems like a very inefficient way of handling it; certainly that can't be how you handle contributions from other developers?
Thanks a lot for your help, insight, etc.
Greetings, Timwi
Here is another (much more elaborate) attempt at fixing the pentuple-apostrophe problem (patch below).
This should handle all situations correctly where users enter correctly nested mark-up, e.g. '' text ''' text '''''.
Even for incorrectly nested mark-up, it often returns desirable results:
    '' text ''' text '' text '''
is turned into
    <em> text <strong> text </strong></em><strong> text </strong>
However, for incorrect mark-up, it sometimes returns weird results. For example, for the following input:
    '' text ''' text '' text
if I'm not mistaken (remember I don't have an installation to test it on), it will return:
    <em> text <strong> text </strong></em>''' text
which is at least correct HTML.
Anyway, here's the patch.
Greetings, Timwi
--- OutputPage-orig.php	Sun Jun 29 15:36:00 2003
+++ OutputPage.php	Sun Jun 29 15:35:08 2003
@@ -675,9 +675,67 @@
 /* private */ function doQuotes( $text )
 {
-	$text = preg_replace( "/'''(.+)'''/mU", "<strong>$1</strong>", $text );
-	$text = preg_replace( "/''(.+)''/mU", "<em>$1</em>", $text );
-	return $text;
+	if ( preg_match( "/^(.*)''(.*)$/mU", $text, $m ) ) {
+		if ( substr ($m[2], 0, 1) == "'" ) {
+			return $m[1] . doQuotesStrong ( substr ($m[2], 1) );
+		} else {
+			return $m[1] . doQuotesEm ( $m[2] );
+		}
+	} else {
+		return $text;
+	}
+}
+
+/* private */ function doQuotesEm( $text )
+{
+	if ( preg_match( "/^(.*)''(.*)$/mU", $text, $m ) ) {
+		if ( substr ($m[2], 0, 1) == "'" ) {
+			return doQuotesEmStrong ( $m[1], substr ($m[2], 1) );
+		} else {
+			return "<em>" . $m[1] . "</em>" . doQuotes ( $m[2] );
+		}
+	} else {
+		return "''" . $text;
+	}
+}
+
+/* private */ function doQuotesStrong( $text )
+{
+	if ( preg_match( "/^(.*)''(.*)$/mU", $text, $m ) ) {
+		if ( substr ($m[2], 0, 1) == "'" ) {
+			return "<strong>" . $m[1] . "</strong>" . doQuotes ( substr ($m[2], 1) );
+		} else {
+			return doQuotesStrongEm ( $m[1], $m[2] );
+		}
+	} else {
+		return "'''" . $text;
+	}
+}
+
+/* private */ function doQuotesEmStrong( $pre, $text )
+{
+	if ( preg_match( "/^(.*)''(.*)$/mU", $text, $m ) ) {
+		if ( substr ($m[2], 0, 1) == "'" ) {
+			return doQuotesEm ( $pre . "<strong>" . $m[1] . "</strong>" . substr ($m[2], 1) );
+		} else {
+			return "<em>" . $pre . "<strong>" . $m[1] . "</strong></em>" . doQuotesStrong ( $m[2] );
+		}
+	} else {
+		return "'''''" . $text;
+	}
+}
+
+/* private */ function doQuotesStrongEm( $pre, $text )
+{
+	if ( preg_match( "/^(.*)''(.*)$/mU", $text, $m ) ) {
+		if ( substr ($m[2], 0, 1) == "'" ) {
+			return "<strong>" . $pre . "<em>" . $m[1] . "</em></strong>" . doQuotesEm ( substr ($m[2], 1) );
+		} else {
+			return doQuotesStrong ( $pre . "<em>" . $m[1] . "</em>" . $m[2] );
+		}
+	} else {
+		return "'''''" . $text;
+	}
 }
 /* private */ function doHeadings( $text )
Timwi wrote in part:
> However, for incorrect mark-up, it sometimes returns weird results. For example, for the following input:
>     '' text ''' text '' text
> if I'm not mistaken (remember I don't have an installation to test it on), it will return:
>     <em> text <strong> text </strong></em>''' text
> which is at least correct HTML.
I see what you're doing. Clever!
-- Toby
wikitech-l@lists.wikimedia.org