What you can do is to run some experiments on the
existing dump.
How many cases are there where ''''' is hard to resolve? Did
anybody count?
It is of course possible to write articles with unbalanced
apostrophes. If I write '''hey'' it will render as
'<i>hey</i>,
and that's also how a conversion program should leave it. How many
such user mistakes are there in the current dump?
I can't answer that question for a current dump, but I can answer it
for a dump of EN that's about 15 months old (this was done as part of
http://en.wikipedia.org/wiki/Wikipedia:WikiProject_Wiki_Syntax ).
The formats and figures are shown below, and I've added examples to
show a single line that would cause it to be logged (assuming the rest
of the wikitext in that article is well-formed). Basically they're all
about _balance_ - if you open a bit of paired syntax, you should close
it. Some syntaxes must be closed on the same line (e.g. ''' ), and some
must be closed in the same article (e.g. {| ).
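
To make the two scopes concrete, here is a minimal sketch of such a balance check. It is not the code that produced the figures below; the function name `check_article` is hypothetical, and it assumes only runs of exactly three apostrophes count as bold markers.

```python
import re


def check_article(wikitext):
    """Return a list of unbalanced-syntax labels for one article (a sketch)."""
    problems = []

    # Line-scoped: bold (''') has to be opened and closed on the same line.
    for line in wikitext.splitlines():
        # Assumption: only runs of exactly three apostrophes are bold markers;
        # other run lengths are ignored in this sketch.
        bold_markers = [run for run in re.findall(r"'+", line) if len(run) == 3]
        if len(bold_markers) % 2 == 1:
            problems.append("'''")

    # Article-scoped: a table ({| ... |}) may span many lines, so it is
    # balanced over the whole text rather than per line.
    depth = 0
    for token in re.findall(r"\{\||\|\}", wikitext):
        if token == "{|":
            depth += 1
        elif depth > 0:
            depth -= 1
        else:
            problems.append("|}")  # closed without ever being opened
    problems.extend(["{|"] * depth)  # opened but never closed
    return problems
```

A bold span split across two lines is reported once per line, whereas a table spread over several lines is fine as long as it is closed somewhere in the article.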
Note, however, that these figures come from a run made only a few months
after a previous run (I think - it *was* a while ago). My guesstimate is
that the figures now would be between 2 and 4 times higher, because
it's been so long since it was last done, and because there
are probably more contributions now, and "wikitext format errors
introduced" is probably directly proportional to the number of
contributions.
----------------------------------------------
mysql> select format, count(*) as count from malformed_page group by format order by count desc;
+-------------------------+-------+
| format | count |
+-------------------------+-------+
| '' | 7161 | example: this is a ''test
| ''' | 1248 | example: this is a '''test
| '' and ''' | 1155 | example: this is a '''test''
| ''' and '' | 1091 | example: this is a ''test'''
| ] | 587 | example: this is a] test
| [ and ]] | 507 | example: this is a [test]]
| ]] | 417 | example: this is a test]]
| [[ | 413 | example: this is a [[test
| [ | 372 | example: this is [a test
| [[ and ] | 347 | example: this is a [[test]
| {| | 261 | example: {| (and never close it)
| |} | 238 | example: |} (and never open it)
| --> | 67 | example: <!-- blah --> -->
| <div> | 60 | example: <div> <div> blah </div>
| {{ | 46 | example: {{ {{delete}}
| <!-- | 43 | example: <!-- <!-- blah -->
| </div> | 39 | example: <div> blah </div> </div>
| }} | 34 | example: {{delete}} }}
| ]] and [[ | 24 | example: this is a ]]test[[
| == and === | 20 | example: ==heading===
| ] and [ | 14 | example: this is a ]test[
| [[image: | 11 | example: [[image: [[image:test.gif]]
| === and == | 8 | example: ===heading==
| '' and [[ | 5 | example: this ''is a [[test
| [ and '' | 5 | example: this [is a ''test
| '' and ]] | 5 | example: this ''is a]] test
| <code> | 5 | example: <code> <code> for i=1 </code>
| </pre> | 4 | example: <pre> for i=1 </pre> </pre>
| </nowiki> | 4 | example: <nowiki> for i=1 </nowiki> </nowiki>
| '' and ] | 3 | etc ....
| ]] and ''' | 3 |
| ]] and '' | 3 |
| ] and '' | 2 |
| </math> | 2 |
| [[ and '' | 2 |
| '' and [[ and ] | 2 |
| </code> | 2 |
| === and ==== | 2 |
| '' and ''' and ] | 2 |
| [ and ''' | 1 |
| ]] and '' and ''' | 1 |
| [[ and ] and [ | 1 |
| [ and ]] and '' | 1 |
| ''' and [[ | 1 |
| <math> | 1 |
| ==== and === | 1 |
| ]] and [[ and '' | 1 |
| ] and [ and '' | 1 |
| ''' and '' and ]] | 1 |
| ''' and '' and [[ and [ | 1 |
| ''' and ]] | 1 |
| ]] and ] and ''' and '' | 1 |
| [ and '' and ''' | 1 |
+-------------------------+-------+
53 rows in set (0.57 sec)
mysql>
----------------------------------------------
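
For anyone without the database to hand, the same tally can be sketched in plain Python. The page titles and the `tally_formats` name below are made up for illustration; only the idea of grouping logged pages by their `format` label and sorting by count comes from the query above.

```python
from collections import Counter


def tally_formats(logged):
    """Group (page, format) pairs by format, most frequent first."""
    counts = Counter(fmt for _page, fmt in logged)
    # most_common() sorts by count, descending, like "order by count desc".
    return counts.most_common()


# Hypothetical log entries: (page title, unbalanced-format label).
logged = [
    ("Page A", "''"),
    ("Page B", "'''"),
    ("Page C", "''"),
    ("Page D", "{|"),
    ("Page E", "''"),
    ("Page F", "'''"),
]

for fmt, count in tally_formats(logged):
    print(fmt, count)
```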
Note: ''''' was treated as ''' + '' (rather than as a separate category), so
it will be mixed in with the above figures for ''' and ''.
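
As a rough illustration of that decomposition (a sketch, not the original checker), the following splits each run of apostrophes on a line into '' and ''' markers and reports which of them are left unpaired. The function names are hypothetical, runs of other lengths are simply ignored, and the order of the labels in the result is an arbitrary choice here.

```python
import re
from collections import Counter


def apostrophe_markers(line):
    """Split each apostrophe run on a line into '' and ''' markers."""
    markers = []
    for run in re.findall(r"'{2,}", line):
        if len(run) == 2:
            markers.append("''")
        elif len(run) == 3:
            markers.append("'''")
        elif len(run) == 5:
            # ''''' is counted as one ''' plus one '', not as its own category.
            markers.extend(["'''", "''"])
        # Assumption: runs of any other length are ignored in this sketch.
    return markers


def unbalanced_label(line):
    """Return a label such as "''' and ''" for markers left unpaired."""
    counts = Counter(apostrophe_markers(line))
    odd = [marker for marker in ("'''", "''") if counts[marker] % 2 == 1]
    return " and ".join(odd)
```

So a line containing a single unclosed ''''' ends up logged under both ''' and '', which is why it cannot be separated out of the figures above.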
Perhaps somebody
is already running a robot to find and fix such errors?
Not that I'm aware of - humans were better at it anyway, because some of the above
were false positives (e.g. some math formulas), and the '' and '''
& ''' and '' tests
had lots of false positives. If a robot went around blindly fixing these
automatically, it'd be banned for vandalism. However, some automated approach could be good,
as it's an ongoing problem with no real closure (like sorting the mail), so people
eventually have enough of doing it (as I did), and move on to other stuff.
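
One way an automated approach might cut down on the math-formula false positives is to blank out regions that aren't ordinary wikitext before checking balance. This is only a sketch of that idea, not something the original run did; the tag list and the `mask_literal_regions` name are assumptions.

```python
import re

# Assumption: contents of these tags should not be checked as wikitext.
SKIP_TAGS = ("math", "nowiki", "pre")

_LITERAL_REGION = re.compile(
    r"<(%s)>.*?</\1>" % "|".join(SKIP_TAGS),
    re.DOTALL | re.IGNORECASE,
)


def mask_literal_regions(wikitext):
    """Remove well-formed <math>/<nowiki>/<pre> regions, keeping line breaks."""
    # Keep the newlines so that line-scoped checks still see the same lines.
    return _LITERAL_REGION.sub(
        lambda match: "\n" * match.group(0).count("\n"), wikitext
    )
```

Only properly closed regions are masked, so a stray </math> or </pre> would still be left in place to be flagged, and a human would still need to review whatever remains.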
All the best,
Nick.