Seeing as 1.5 is UTF-8 for all languages (yay), can we parse -- and --- into en and em dashes? I know there were some problems with wiki table syntax last time this was attempted, but those are easy to eliminate if you require spaces around the hyphen sequence. In that case, as a rule, the resulting display would have spaces as well (except between numbers... more later). The spaces around dashes are a mini-dispute of their own over at [[Wikipedia talk:Manual of Style (dashes)]], but I think that people would be happy just to have an easy, standard way to enter dashes.
Last time this came up someone mentioned other languages that might forbid a space around dashes. I know that in French it's common to use dashes at the beginning of lines in dialogue. This problem has an easy solution: French keyboards also have a dash key, unless my memory is failing me completely. American Mac keyboards have option-hyphen. On Windows, Word inserts dashes all over the place. People who aren't happy with the " --- " and " -- " solution would be free to enter Unicode dashes however they wish, as they can today on UTF-8 wikipedias. (Stubborn space-conscious anglophones might also resort to this method... so be it.)
I've written a patch that I think is fairly well placed, since it's adjacent to the existing code that inserts non-breaking spaces around guillemets. This method would make a lot of people happy, and it promotes compliance with the Manual of Style as much as is possible. Here's how it works:
1. Replace any ' -- ' with the UTF-8 sequence equivalent to ' – '
2. Replace any '--' between numbers with '–' alone.
3. Replace any ' --- ' with the UTF-8 sequence equivalent to ' — '
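For illustration only, the three rules can be sketched in Python rather than in the PHP of the patch itself. The function name `dashify` and the constants are made up for this example; the patterns and the non-breaking space placed before the spaced dashes follow the patch.

```python
import re

NBSP = "\u00a0"     # non-breaking space, keeps the dash on the preceding line
EN_DASH = "\u2013"
EM_DASH = "\u2014"

# (pattern, replacement) pairs mirroring the three rules above.
# The spaced patterns cannot overlap: ' -- ' needs a space right after
# two hyphens, so it never matches inside ' --- '.
RULES = [
    (re.compile(r" -- "), NBSP + EN_DASH + " "),
    (re.compile(r"([0-9])--([0-9])"), r"\1" + EN_DASH + r"\2"),
    (re.compile(r" --- "), NBSP + EM_DASH + " "),
]


def dashify(text):
    """Apply the three dash rules to a piece of wikitext (hypothetical helper)."""
    for pattern, replacement in RULES:
        text = pattern.sub(replacement, text)
    return text


if __name__ == "__main__":
    print(dashify("The war (1914--1918) -- and its aftermath --- changed Europe."))
```

Hyphen runs without surrounding spaces and without digits on both sides are left untouched, which is what keeps this away from table markup.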
See below for the code.
Nathan
Index: Parser.php
===================================================================
RCS file: /cvsroot/wikipedia/phase3/includes/Parser.php,v
retrieving revision 1.383
diff --unified=3 -w -B -r1.383 Parser.php
--- Parser.php 6 Feb 2005 16:13:05 -0000 1.383
+++ Parser.php 9 Feb 2005 21:15:50 -0000
@@ -185,6 +185,9 @@
             '/<br *>/i' => '<br />',
             '/<center *>/i' => '<div class="center">',
             '/<\/center *>/i' => '</div>',
+            '/ -- /i' => "\xC2\xA0\xE2\x80\x93 ", # –<normal space>
+            '/([0-9])--([0-9])/i' => "\1\xE2\x80\x93\2", # –
+            '/ --- /i' => "\xC2\xA0\xE2\x80\x94 " # —<normal space>
         );
         $text = preg_replace( array_keys($fixtags), array_values($fixtags), $text );
         $text = Sanitizer::normalizeCharReferences( $text );
@@ -195,7 +198,10 @@
             # french spaces, Guillemet-right
             '/(\302\253) /i' => '\1 ',
             '/<center *>/i' => '<div class="center">',
-            '/<\/center *>/i' => '</div>'
+            '/<\/center *>/i' => '</div>',
+            '/ -- /i' => "\xC2\xA0\xE2\x80\x93 ", # –<normal space>
+            '/([0-9])--([0-9])/i' => "\1\xE2\x80\x93\2", # –
+            '/ --- /i' => "\xC2\xA0\xE2\x80\x94 " # —<normal space>
         );
         $text = preg_replace( array_keys($fixtags), array_values($fixtags), $text );
     }
On Wed, 09 Feb 2005 16:25:27 -0500, Nathan Hamblen nhamblen@mac.com wrote:
2. Replace any '--' between numbers with '–' alone.
Does this work for "2mm-3m"? "50KHz-1GHz"? What happens if it doesn't work?
See below for the code.
Maybe you could have attached it to a feature request at http://bugzilla.wikipedia.org and that would keep track of this better. :)
Tomer Chachamu wrote:
On Wed, 09 Feb 2005 16:25:27 -0500, Nathan Hamblen nhamblen@mac.com wrote:
2. Replace any '--' between numbers with '–' alone.
Does this work for "2mm-3m"? "50KHz-1GHz"? What happens if it doesn't work?
It works: that pattern wouldn't match, so there would be no replacement.
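A quick way to check this, sketched in Python rather than PHP (the name `convert_number_ranges` is invented for the example): the pattern needs a digit directly on each side of a double hyphen, so a single hyphen, or a unit letter next to the hyphens, leaves the text alone.

```python
import re

# Same pattern as rule 2: a digit, two hyphens, a digit.
NUMBER_RANGE = re.compile(r"([0-9])--([0-9])")


def convert_number_ranges(text):
    """Replace '--' between two digits with an en dash (hypothetical helper)."""
    return NUMBER_RANGE.sub("\\1\u2013\\2", text)


if __name__ == "__main__":
    for sample in ["2mm-3m", "50KHz-1GHz", "2mm--3m", "10--20"]:
        print(repr(sample), "->", repr(convert_number_ranges(sample)))
```

Only the last sample changes; the others come back exactly as entered.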
Maybe you could have attached it to a feature request at http://bugzilla.wikipedia.org and that would keep track of this better. :)
Maybe. There is a bug there, I queried "dash", but it looks kind of lonely (no comments at all). I thought I would post here first.
Nathan
Nathan Hamblen wrote:
problem has an easy solution: French keyboards also have a dash key
They don't, at least normal PC keyboards don't.
1. Replace any ' -- ' with the UTF-8 sequence equivalent to ' – '
2. Replace any '--' between numbers with '–' alone.
3. Replace any ' --- ' with the UTF-8 sequence equivalent to ' — '
Sounds like LaTeX.
David Monniaux wrote:
Nathan Hamblen wrote:
problem has an easy solution: French keyboards also have a dash key
They don't, at least normal PC keyboards don't.
Damn. Well, anyway, at least you have º. ;) If people want it, we could also match '\n--- ', but I doubt there are too many dialogues being recounted over at fr.wikipedia.
Sounds like LaTeX.
Right, I think that's where people got the idea to do it that way here. (Wasn't me...)
Nathan
wikitech-l@lists.wikimedia.org