Hi,
Still, doesn't this mean the parser needs to
recognise
"#REDIRECT <linkpattern>" as a special token? And doesn't that, in
turn, present a problem if we want to retain MagicWord i18n?
Not really. We can still recognise redirects with a regexp (or anything
else in PHP) before passing the page to the parser.
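Something along these lines would do (just a sketch, with a made-up
keyword list; the real list would come from the MagicWord i18n data for
the wiki's language):

    <?php
    // Sketch only: detect a redirect before the text reaches the parser.
    // The keyword list would be supplied per-language by the MagicWord
    // i18n data; the entries below are just examples.
    $redirectKeywords = array( '#REDIRECT', '#WEITERLEITUNG' );

    function isRedirect( $text, $keywords ) {
        // Look for "#KEYWORD [[Target]]" at the very start of the page,
        // case-insensitively, for any of the localised keywords.
        $alternation = implode( '|', array_map( 'preg_quote', $keywords ) );
        return (bool) preg_match(
            '/^(?:' . $alternation . ')\s*\[\[[^\]]+\]\]/i',
            ltrim( $text )
        );
    }
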
I'm not
sure why you think allowing all translations on all Wikipedias
would be a "step backwards"? Or do you seriously think someone would use
the Chinese translation of <math> on the English Wikipedia? :)
It's a step backwards because instead of being able to say "yes, we
have an i18n system for you to customise these 'magic words' according
to your language and preference", we will be saying "the parser knows
a few variants for each of these 'magic words'; if you want to add any
more, tell us, and we'll recompile the parser for you".
First of all, even in the current system there is no way for server
admins to customise the magic words without modifying actual source
code. Secondly, you're making it sound like recompiling the parser was
some sort of monumental task.
Here's an idea. One could provide a .c or .h file where #define
statements are used to define the magic words, and then make sure that
if you modify it, you only need to recompile the binary (i.e. invoke
gcc) but you don't need flex, bison, or swig. But even if you were to
require flex, bison and swig, even then the recompilation can be
automated by a simple script.
Here's another reason why I think the parser should recognise all
variants of the magic words. Think about the alternative. The
alternative is to have "<xyz>" mean "invoke the math extension" on one
Wikipedia, but be plain literal text on another. What is the point in
allowing articles to exist that rely on "<xyz>" meaning the literal
text "<xyz>" when it means "math" elsewhere?
And no, I don't think anyone will need the Chinese translation of
<math> on the English Wikipedia, and for that precise reason I see no
reason the English Wikipedia should be parsing for it.
It doesn't make a performance difference, if that's what you're worrying
about.
* We could
replace the "other-language" words with the "this-language"
words upon save. I.e. if someone wrote <math> on the Chinese
Wikipedia, it would automatically be changed into "<" + some Chinese
characters + ">" before storing it in the DB.
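A rough sketch of what I mean (the function and mapping are made up,
just to illustrate; the real mapping would come from the language file):

    <?php
    // Sketch only: before storing the page, rewrite every known foreign
    // variant of a magic word into this wiki's own form. The mapping
    // would come from the local language file; these names are invented.
    function canonicaliseMagicWords( $text, $variantToLocal ) {
        foreach ( $variantToLocal as $variant => $local ) {
            // Rewrite both the opening and the closing tag.
            $text = preg_replace(
                '/<(\/?)' . preg_quote( $variant, '/' ) . '>/i',
                '<$1' . $local . '>',
                $text
            );
        }
        return $text;
    }
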
Sorry, I'm not with you on this one - are you suggesting that the
Chinese parser be specifically compiled to only cope with the Chinese
magic words? If so, perhaps you misunderstood my problem with having
all variants coded in: it's not that they will all work everywhere, but
that adding new ones would (if I understand the whole yacc concept
correctly) require recompiling a new parser, rather than just tweaking
the appropriate language or settings file.
To the first part: No, the idea was to have one parser that recognises
everything, but to expose to the Chinese users only the Chinese variant
even if someone typed in the Swahili one. (Haha.) But of course that
doesn't solve your problem. I mentioned above that recompiling the
parser is by no means difficult, and that the current system also
requires editing source code, but if you still think that it's a
problem, then we're stuck because I don't know what else we can do.
So maybe you're right, and the only workable
solution is to have all
variants hard-coded in the parser. I guess this is where we come to
regret adopting an "extension" syntax that matches/conflicts with the
syntax used by "allowed bits of HTML".
True. If we had something like [!math x^2 + y^2 = z^2 !], then we could
say "everything in [! ... !] is an extension". Would make life much easier.
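Just as an illustration (not working parser code), a single pattern
would then find every extension block, whatever its name:

    <?php
    // Sketch only: find all [!name ... !] blocks in one pass, without
    // the parser having to know any tag names in advance.
    $wikiText = 'Pythagoras: [!math x^2 + y^2 = z^2 !] and so on.';
    preg_match_all(
        '/\[!\s*(\w+)(.*?)!\]/s',   // name in group 1, raw contents in group 2
        $wikiText,
        $matches,
        PREG_SET_ORDER
    );
    // Here $matches[0][1] is "math" and $matches[0][2] is the formula.
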
If we want to keep true customisation of magic words (where "editing
the source" != "customisation"), the best idea I've come up with is:
1) hard-code all allowed HTML into the parser. This means maximum
efficiency for those bits, and the ability to handle relationships
between them, etc.
2) treat everything else matching "<"+some_letters+">" as an
"extension" and spew out its contents as one element of the parse
tree. If the receiving PHP script then says "there's no such
extension", it escapes the "<" and ">", and passes the
contents back
to be parsed normally.
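Roughly like this, as a sketch (the function names are invented, just
to show the fallback):

    <?php
    // Sketch only: what the PHP side might do with an "extension" node
    // coming out of the parse tree. The names here are invented.
    function handleExtensionNode( $name, $contents, $extensions ) {
        if ( isset( $extensions[$name] ) ) {
            // A registered extension renders its own contents.
            return call_user_func( $extensions[$name], $contents );
        }
        // No such extension: escape the tags; $contents would then be
        // handed back to the parser to be treated as ordinary wiki text.
        return '&lt;' . htmlspecialchars( $name ) . '&gt;'
             . $contents
             . '&lt;/' . htmlspecialchars( $name ) . '&gt;';
    }
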
Unfortunately, this opens a whole other can of worms. What if there is
no end tag? What if there is other mark-up partly inside and partly
outside the "extension" block?
He said, "What the ''<swearword>'' is going on?"
I think, considering all of these problems we have discussed, it makes
a lot of sense to formulate a "rule" that the design of the parser
should fulfil: The parser must know in advance how to parse everything.
The resulting parse tree must not depend on anything other than the
input wiki text.
Greetings,
Timwi