On 07/01/11 00:49, Happy-melon wrote:
"Jay Ashworth"<jra(a)baylink.com> wrote
in message
news:32162150.4910.1294292017738.JavaMail.root@benjamin.baylink.com...
----- Original Message -----
The thing you want expanded, George, is "Last Five Percent"; I refer
there to (I think it was) David Gerard's comment earlier that the
first 95% of wikisyntax fits reasonably well into current parser
building frameworks, and the last 5% causes well adjusted programmers
to consider heroin... or something like that. :-)
The argument advanced was always "there's too much usage of that ugly
stuff to consider Just Not Supporting It", and I always asked whether
anyone with larger computers than mine had ever extracted actual
statistics; no one ever answered.
This is a key point. Every other parser discussion has
floundered *before*
the stage of saying "here is a working parser which does *something*
interesting, now we can see how it behaves". Everyone before has got to
that last 5% and said "I can't make this work; I can do *this* which is
kinda similar, but when you combine it with *this* and *that* and *the
other* we're now in a totally different set of edge cases". And stopped
there. Obviously it's impossible to quantify all the edge cases of the
current parser *because of* the lack of a schema, but until we actually get
a new parser churning through real wikitext, we're blind in the dark to say
whether those edge cases make up 5%, 0.5% or 50% of the corpus that's out
there.
--HM
Am I right in assuming that "working" in this case means:
(a) being able to parse an article as a valid production of its grammar,
and then
(b) being able to complete the round trip by generating
character-for-character identical wikitext output from that parse tree?
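For concreteness, here is a minimal sketch of that round-trip criterion.
parse() and serialize() are hypothetical stand-ins for a real parser's
API (no such interface exists today); they are stubbed as identity
functions so the sketch actually runs:

```python
def parse(wikitext):
    # (a) would build a real parse tree; stubbed as identity here
    return wikitext

def serialize(tree):
    # (b) would regenerate wikitext from the tree; stubbed as identity
    return tree

def round_trips(wikitext):
    """True iff the text survives parse -> serialize byte-for-byte."""
    try:
        tree = parse(wikitext)
    except Exception:
        return False                      # fails criterion (a)
    return serialize(tree) == wikitext    # criterion (b)
```

The point of the boolean is that it makes "working" mechanically
checkable per revision, rather than a matter of opinion.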
If so, what would count as a statistically useful sample of articles to
test? 1000? 10,000? 100,000? Or, if someone has access to serious
computing resources, and a recent dump, is it worth just trying all of
them? In any case, it would be interesting to have a list of failed
revisions, so developers can study the problems involved.
Given the generality of wikimarkup, and that user-editability means
editors can provide absolutely any string as input to it, it might also
make sense to try it on random garbage inputs and on "fuzzed" versions
of articles, as well as on real articles.
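A crude byte-level fuzzer along those lines might look like the
following; the mutation strategy (flip a few random bytes) is purely
illustrative:

```python
import random

def fuzz(wikitext, mutations=3, seed=None):
    """Return wikitext with a few bytes replaced at random."""
    rng = random.Random(seed)
    data = bytearray(wikitext, "utf-8")
    for _ in range(mutations):
        if not data:
            break
        i = rng.randrange(len(data))
        data[i] = rng.randrange(256)
    # random bytes may not form valid UTF-8, hence errors="replace"
    return data.decode("utf-8", errors="replace")
```

Feeding fuzz(article) alongside article itself into the round-trip test
would exercise exactly the near-valid inputs real editors produce.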
Flexbisonparser looks like the most plausible candidate for testing.
Does anyone know if it is currently buildable?
-- Neil