On 07/01/11 00:49, Happy-melon wrote:
"Jay Ashworth"jra@baylink.com wrote in message news:32162150.4910.1294292017738.JavaMail.root@benjamin.baylink.com...
----- Original Message -----
The thing you want expanded, George, is "Last Five Percent"; I refer there to (I think it was) David Gerard's comment earlier that the first 95% of wikisyntax fits reasonably well into current parser-building frameworks, and the last 5% causes well-adjusted programmers to consider heroin... or something like that. :-)
The argument advanced was always "there's too much usage of that ugly stuff to consider Just Not Supporting It", and I always asked whether anyone with larger computers than mine had ever extracted actual statistics, and no one ever answered.
This is a key point. Every other parser discussion has foundered *before* the stage of saying "here is a working parser which does *something* interesting, now we can see how it behaves". Everyone before has gotten to that last 5% and said "I can't make this work; I can do *this*, which is kinda similar, but when you combine it with *this* and *that* and *the other* we're now in a totally different set of edge cases". And stopped there. Obviously it's impossible to quantify all the edge cases of the current parser *because of* the lack of a schema, but until we actually get a new parser churning through real wikitext, we're in the dark about whether those edge cases make up 5%, 0.5% or 50% of the corpus that's out there.
--HM
Am I right in assuming that "working" in this case means:
(a) being able to parse an article as a valid production of its grammar, and then
(b) being able to complete the round trip by generating character-for-character identical wikitext output from that parse tree?
If so, what would count as a statistically useful sample of articles to test? 1000? 10,000? 100,000? Or, if someone has access to serious computing resources, and a recent dump, is it worth just trying all of them? In any case, it would be interesting to have a list of failed revisions, so developers can study the problems involved.
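A minimal sketch of what such a round-trip harness might look like (Python; parse_wikitext and serialize_tree are hypothetical placeholders for whatever candidate parser is under test, e.g. flexbisonparser wrapped behind a subprocess call, and the dump path and sample rate are illustrative):

#!/usr/bin/env python
# Hypothetical round-trip harness over a MediaWiki XML dump.
# parse_wikitext / serialize_tree are placeholders to be wired up
# to the actual candidate parser.
import random
import xml.etree.ElementTree as ET

SAMPLE_RATE = 0.01           # fraction of revisions to test
DUMP = "pages-articles.xml"  # uncompressed dump file

def parse_wikitext(text):
    """Placeholder: return a parse tree, or raise on failure."""
    raise NotImplementedError

def serialize_tree(tree):
    """Placeholder: regenerate wikitext from the parse tree."""
    raise NotImplementedError

def round_trip_ok(text):
    return serialize_tree(parse_wikitext(text)) == text

def main():
    tested = failed = 0
    title = None
    for event, elem in ET.iterparse(DUMP):
        tag = elem.tag.rsplit("}", 1)[-1]  # strip XML namespace
        if tag == "title":
            title = elem.text
        elif tag == "text" and elem.text is not None:
            if random.random() < SAMPLE_RATE:
                tested += 1
                try:
                    if not round_trip_ok(elem.text):
                        failed += 1
                        print("MISMATCH:", title)
                except Exception as e:
                    failed += 1
                    print("PARSE ERROR:", title, e)
        elem.clear()  # keep memory bounded on multi-GB dumps
    print("tested %d, failed %d" % (tested, failed))

if __name__ == "__main__":
    main()

Logging the failing titles (or revision IDs) as it goes would give exactly the list of failed revisions for developers to study.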
Given the generality of wikimarkup, and that user editability means editors can provide absolutely any string as input, it might also make sense to try it on random garbage inputs and on "fuzzed" versions of real articles, as well as on the articles themselves.
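For instance, something along these lines could feed the same harness above; purely illustrative, and the token list is just a guess at interesting constructs, not exhaustive:

# Hypothetical fuzzing helpers for generating adversarial inputs.
import random
import string

WIKI_TOKENS = ["[[", "]]", "{{", "}}", "'''", "''", "|", "=", "\n*",
               "<ref>", "</ref>", "{|", "|}", "~~~~"]

def random_garbage(length=2000):
    """A string of arbitrary characters mixed with wiki meta-tokens."""
    pool = list(string.printable) + WIKI_TOKENS
    return "".join(random.choice(pool) for _ in range(length))

def fuzz_article(text, mutations=20):
    """Mutate a real article by inserting, deleting and overwriting
    characters/tokens at random offsets."""
    chars = list(text)
    for _ in range(mutations):
        i = random.randrange(len(chars) + 1)
        op = random.random()
        if op < 0.4:
            chars[i:i] = random.choice(WIKI_TOKENS)  # insert token
        elif op < 0.7 and chars:
            del chars[i % len(chars)]                # delete a char
        elif chars:
            chars[i % len(chars)] = random.choice(WIKI_TOKENS)[0]
    return "".join(chars)

The interesting question for fuzzed input is not round-trip fidelity (which is only defined for parseable text) but whether the parser terminates cleanly without crashing or hanging.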
Flexbisonparser looks like the most plausible candidate for testing. Does anyone know if it is currently buildable?
-- Neil