On 07/01/11 00:49, Happy-melon wrote:
"Jay Ashworth"<jra(a)baylink.com> wrote
in message
news:32162150.4910.1294292017738.JavaMail.root@benjamin.baylink.com...
----- Original Message -----
The thing you want expanded, George, is "Last Five Percent"; I refer
there to (I think it was) David Gerard's comment earlier that the
first 95% of wikisyntax fits reasonably well into current parser
building frameworks, and the last 5% causes well adjusted programmers
to consider heroin... or something like that. :-)
The argument advanced was always "there's too much usage of that ugly
stuff to consider Just Not Supporting It", and I always asked whether
anyone with larger computers than mine had ever extracted actual
statistics; no one ever answered.
This is a key point. Every other parser discussion has
floundered *before*
the stage of saying "here is a working parser which does *something*
interesting, now we can see how it behaves". Everyone before has got to
that last 5% and said "I can't make this work; I can do *this* which is
kinda similar, but when you combine it with *this* and *that* and *the
other* we're now in a totally different set of edge cases". And stopped
there. Obviously it's impossible to quantify all the edge cases of the
current parser *because of* the lack of a schema, but until we actually get
a new parser churning through real wikitext, we're blind in the dark to say
whether those edge cases make up 5%, 0.5% or 50% of the corpus that's out
there.
--HM
Am I right in assuming that "working" in this case means:
(a) being able to parse an article as a valid production of its grammar,
and then
(b) being able to complete the round trip by generating
character-for-character identical wikitext output from that parse tree?
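For concreteness, here is a minimal sketch of that round-trip criterion.
parse() and serialize() are hypothetical stand-ins for a real parser's
API (no such interface exists today); they are stubbed as identity
functions so the sketch actually runs:

```python
def parse(wikitext):
    # (a) would build a real parse tree; stubbed as identity here
    return wikitext

def serialize(tree):
    # (b) would regenerate wikitext from the tree; stubbed as identity
    return tree

def round_trips(wikitext):
    """True iff the text survives parse -> serialize byte-for-byte."""
    try:
        tree = parse(wikitext)
    except Exception:
        return False                      # fails criterion (a)
    return serialize(tree) == wikitext    # criterion (b)
```

The point of the boolean is that it makes "working" mechanically
checkable per revision, rather than a matter of opinion.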
If so, what would count as a statistically useful sample of articles to
test? 1000? 10,000? 100,000? Or, if someone has access to serious
computing resources, and a recent dump, is it worth just trying all of
them? In any case, it would be interesting to have a list of failed
revisions, so developers can study the problems involved.
Given the generality of wikimarkup, and that user-editability means
editors can provide absolutely any string as input to it, it might also
make sense to try it on random garbage inputs and on "fuzzed" versions
of articles, as well as on real articles.
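A crude byte-level fuzzer along those lines might look like the
following; the mutation strategy (flip a few random bytes) is purely
illustrative:

```python
import random

def fuzz(wikitext, mutations=3, seed=None):
    """Return wikitext with a few bytes replaced at random."""
    rng = random.Random(seed)
    data = bytearray(wikitext, "utf-8")
    for _ in range(mutations):
        if not data:
            break
        i = rng.randrange(len(data))
        data[i] = rng.randrange(256)
    # random bytes may not form valid UTF-8, hence errors="replace"
    return data.decode("utf-8", errors="replace")
```

Feeding fuzz(article) alongside article itself into the round-trip test
would exercise exactly the near-valid inputs real editors produce.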
Flexbisonparser looks like the most plausible candidate for testing.
Does anyone know if it is currently buildable?
-- Neil