Hi, thanks to everybody for the responses. Some comments.
+ Using C in Java makes no sense for my project, and I personally have had a lot of bad experiences with JNI.
+ JavaCC would be the best choice from a performance point of view; however, it would be very hard to write a JavaCC parser for the Wikipedia syntax, since the syntax is not very strict.
+ Axel's parser looks interesting since - first of all, it already exists :-) and the concept of using the Radeox Wiki Engine makes sense. However, as Axel already mentioned, the code is not very 'contribution' friendly and it is not shipped as its own 'jar'.
Before I heard of Axel's parser I spent some hours writing a parser based on Apache NekoHTML: http://www.apache.org/~andyc/neko/doc/html/
The advantage is that NekoHTML can be extended with filters, uses the Xerces Native Interface (XNI) framework, and will already parse any HTML snippets that appear in the text. Note: the result of the NekoHTML parser extended by a Wikipedia filter set is not HTML but an in-memory DOM object, which can then be transformed to XHTML.
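The last step of that pipeline (in-memory DOM to XHTML string) needs no NekoHTML-specific code at all; a minimal sketch using only the JDK's built-in identity transformer, with a hand-built DOM standing in for the parser output (class and method names are mine, not from my actual code):

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class DomToXhtml {

    // Serialize an in-memory DOM (as a parser like NekoHTML would produce)
    // to an XHTML string using the JDK's identity transformer.
    static String serialize(Document doc) throws TransformerException {
        Transformer t = TransformerFactory.newInstance().newTransformer();
        t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        t.setOutputProperty(OutputKeys.METHOD, "xml");
        StringWriter out = new StringWriter();
        t.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        // Build a tiny DOM by hand to stand in for real parser output.
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().newDocument();
        Element p = doc.createElement("p");
        Element b = doc.createElement("b");
        b.setTextContent("bold");
        p.appendChild(b);
        doc.appendChild(p);
        System.out.println(serialize(doc)); // <p><b>bold</b></p>
    }
}
```

The point is just that once the filter set has produced a DOM, the XHTML output step is trivial and standard.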
I have more or less written the basic stuff and defined an interface that needs to be implemented to handle the content of a 'tagged' text snippet. The downside is that I now need to implement as many classes as there are tags in the Wikipedia syntax. :-o
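To make the "one class per tag" idea concrete, here is a hypothetical sketch of such an interface plus a registry that dispatches by tag name. All names here are made up for illustration; my actual interface differs:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical per-tag handler: one small implementation per wiki
// construct, looked up in a registry by tag name.
interface TagHandler {
    String handle(String content);
}

public class WikiTagRegistry {
    private final Map<String, TagHandler> handlers = new HashMap<>();

    public WikiTagRegistry() {
        // One implementation per wiki tag; only two shown here.
        handlers.put("bold", content -> "<b>" + content + "</b>");
        handlers.put("italic", content -> "<i>" + content + "</i>");
    }

    public String render(String tag, String content) {
        TagHandler h = handlers.get(tag);
        // Unknown tags pass their content through unchanged.
        return h != null ? h.handle(content) : content;
    }

    public static void main(String[] args) {
        WikiTagRegistry r = new WikiTagRegistry();
        System.out.println(r.render("bold", "hello"));   // <b>hello</b>
        System.out.println(r.render("nowiki", "plain")); // plain
    }
}
```

The registry keeps the parser core ignorant of individual tags, so contributing a new tag means adding one class and one `put` call.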
For me, two possibilities would be interesting: refactor and separate the parser from Axel's code, or I can contribute my code to a CVS somewhere and we glue things together.
Anyway, I am personally looking for a quick solution. ;-)
Stefan
---------------------------------------------------------------
company: http://www.media-style.com
forum: http://www.text-mining.org
blog: http://www.find23.net