Hi there,
Does anyone know if there is a Java-based parser for the MediaWiki formatting language? I would love to transform an article into HTML, converting e.g. [image:xxx] into <img src=...>. I found the Lucene project in CVS, but it only cleans up the strings.
Also, can someone give me a hint where I can find full documentation of the allowed annotations? The help pages weren't available, or at least weren't complete.
Thanks for any hints. Stefan
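[For illustration, the kind of [image:xxx] -> <img> rewrite asked about above could be sketched as a toy regex pass in plain Java. Class and method names here are made up, and a real parser has to handle far more edge cases than a single regex ever can:]

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical one-rule converter: turns [image:xxx] into an <img> tag.
// Only a sketch of the idea, not a substitute for a real wiki parser.
class ImageLinkConverter {
    private static final Pattern IMAGE = Pattern.compile("\\[image:([^\\]]+)\\]");

    static String convert(String wikiText) {
        Matcher m = IMAGE.matcher(wikiText);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // quoteReplacement keeps '$' and '\' in file names literal
            m.appendReplacement(out,
                "<img src=\"" + Matcher.quoteReplacement(m.group(1)) + "\" />");
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // prints: See <img src="Example.png" /> here.
        System.out.println(convert("See [image:Example.png] here."));
    }
}
```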
Stefan Groschupf wrote:
> Does anyone know if there is a Java-based parser for the MediaWiki
> formatting language? I would love to transform an article into HTML,
> converting e.g. [image:xxx] into <img src=...>. I found the Lucene
> project in CVS, but it only cleans up the strings.
Try the flexbisonparse module from the CVS. It contains a mostly-finished Bison-generated C parser that turns wiki text into XML.
<cardealer> It needs a few minor fixes and a total rework of the HTML part, but otherwise it's in good shape... </cardealer>
Be sure to write a *real* parser, not "merely" a converter. We have plenty of those, half-finished. Wiki syntax is really tricky when it comes to special cases and syntax combinations.
If you can get a "real" parser to work, please let us know.
> Or can someone give me a hint where I can find full documentation of
> the allowed annotations, since the help pages weren't available, or at
> least weren't complete.
http://en.wikipedia.org/wiki/Wikipedia:How_to_edit_a_page
Magnus
Magnus Manske wrote:
> Try the flexbisonparse module from the CVS. It contains a
> mostly-finished Bison-generated C parser that turns wiki text into XML.
If a Java parser is tried, this might be of interest:
It might be possible to adapt the Bison specs to work with javacc, although using the JNI to call the Bison generated C parser would be a lot simpler, depending on what the Java based parser is being used for.
On Monday 14 February 2005 17:53, Jim Higson wrote:
> If a Java parser is tried, this might be of interest:
An alternative might be http://www.antlr.org/. If someone would try to write a proper parser in Java, I'd be glad to help.
> It might be possible to adapt the Bison specs to work with javacc,
> although using the JNI to call the Bison generated C parser would be a
> lot simpler, depending on what the Java based parser is being used for.
Not a good idea - one of the best things about Java is that you can just throw in a JAR file and it works on all supported platforms, if you do it right.
Playing around with FUSE and Java has taught me one thing: JNI is slow, clumsy and absolutely no fun. Polluting your code with dependencies on native code leads to far too many problems.
I'd avoid it whenever possible.
daniel
Daniel Wunsch wrote:
> An alternative might be http://www.antlr.org/. If someone would try to
> write a proper parser in Java, I'd be glad to help.
Hi
A better Java parser would also be interesting for our Eclipse plugin project. Although we already have a handwritten parser, it might be better to create a new one with a parser generator.
I made a short description of the existing CVS modules and of further possible improvements (like PDF generation) here: http://www.plog4u.org/index.php/Using_Eclipse_Wikipedia_Editor:Development Maybe some of you can comment on it.
How can we join our development efforts?
Hi, thanks to everybody for the responses. Some comments:
+ Using C from Java makes no sense for my project, and I have personally had a lot of bad experiences with JNI.
+ JavaCC would be the best choice from a performance point of view; however, it would be very hard to write a JavaCC parser for the Wikipedia syntax, since the syntax is not very strict.
+ Axel's parser looks interesting since, first of all, it already exists :-) and the concept of using the Radeox wiki engine makes sense. However, as Axel already mentioned, the code is not very contribution-friendly and is not its own JAR.
Before I heard of Axel's parser, I spent some hours writing a parser based on Apache NekoHTML: http://www.apache.org/~andyc/neko/doc/html/
The advantage is that NekoHTML can be extended with filters, uses the Xerces Native Interface (XNI) framework, and already parses any HTML snippets that occur in the text. Note that the result of the NekoHTML parser extended with a Wikipedia filter set is not HTML but an in-memory DOM object, which can then be transformed to XHTML.
I have more or less written the basic infrastructure and defined an interface that needs to be implemented to handle the content of a 'tagged' text snippet. However, I now need to implement as many classes as there are tags in the Wikipedia syntax. :-o
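[A minimal sketch of what such a per-tag handler interface could look like. All names here are hypothetical, not the actual code being described, and it builds plain JDK DOM nodes rather than going through NekoHTML/XNI:]

```java
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical handler contract: one implementation per wiki tag,
// each turning the content of a tagged snippet into a DOM fragment.
interface TagHandler {
    /** The wiki tag this handler is responsible for, e.g. "image". */
    String tagName();

    /** Builds the DOM fragment for one occurrence of the tag. */
    Element handle(Document doc, String content);
}

// Example implementation for [image:xxx] snippets.
class ImageTagHandler implements TagHandler {
    public String tagName() {
        return "image";
    }

    public Element handle(Document doc, String content) {
        Element img = doc.createElement("img");
        img.setAttribute("src", content);
        return img;
    }
}
```

A registry mapping tag names to handlers would then let the parser dispatch each recognized snippet without a giant switch statement.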
Two possibilities would be interesting for me: refactor Axel's code and separate out the parser, or I contribute my code to CVS somewhere and we glue things together.
Anyway, I am personally searching for a quick solution. ;-)
Stefan
---------------------------------------------------------------
company: http://www.media-style.com
forum: http://www.text-mining.org
blog: http://www.find23.net
Stefan Groschupf wrote:
> Two possibilities would be interesting for me: refactor Axel's code and
> separate out the parser, or I contribute my code to CVS somewhere and
> we glue things together.
Sounds interesting. Could you please send me the current sources, so that I can run some tests with the parser inside the Eclipse plugin?
If you like I can also give you CVS access.
Stefan Groschupf wrote:
> Anyway, I now need to implement as many classes as there are tags in
> the Wikipedia syntax. :-o
The Bison parser does something like this:
<div attr1="foo" attr2="bar">stuff</div>
to
<extension name='div'><attr name="attr1">foo</attr><attr name="attr2">bar</attr>stuff</extension>
Maybe that would be easier to do?
Magnus
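[The generic tag-to-<extension> rewrite Magnus shows above could be sketched in Java like this. This is a hypothetical regex-based toy that only handles a single well-formed, non-nested tag with double-quoted attributes; the real Bison parser of course works on a parse tree:]

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Toy rewrite of <tag a="x">body</tag> into the generic
// <extension name='tag'><attr name="a">x</attr>body</extension> form.
class ExtensionRewriter {
    // One well-formed tag: name, optional attr="value" pairs, body, close.
    private static final Pattern TAG =
        Pattern.compile("<(\\w+)((?:\\s+\\w+=\"[^\"]*\")*)\\s*>(.*?)</\\1>");
    private static final Pattern ATTR = Pattern.compile("(\\w+)=\"([^\"]*)\"");

    static String rewrite(String html) {
        Matcher m = TAG.matcher(html);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            StringBuilder repl =
                new StringBuilder("<extension name='" + m.group(1) + "'>");
            Matcher a = ATTR.matcher(m.group(2));
            while (a.find()) {
                repl.append("<attr name=\"").append(a.group(1)).append("\">")
                    .append(a.group(2)).append("</attr>");
            }
            repl.append(m.group(3)).append("</extension>");
            m.appendReplacement(out, Matcher.quoteReplacement(repl.toString()));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

The nice property of this representation is exactly the one Magnus points at: the downstream XML consumer needs one handler for <extension> instead of one class per tag.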
Stefan Groschupf wrote:
> Does anyone know if there is a Java-based parser for the MediaWiki
> formatting language? I would love to transform an article into HTML.
> Or can someone give me a hint where I can find full documentation of
> the allowed annotations?
Hi
I started a Java parser at http://www.plog4u.org (German site: http://www.plog4u.de).
It's an Eclipse plugin which contains a Wikipedia syntax parser for the internal previewer. The internal previewer creates HTML files. The parser isn't perfect yet, but many things work.
The parser/render engine is based on radeox: http://www.radeox.org
Radeox is also used in wiki engines like http://www.snipsnap.org and http://www.xwiki.org
In CVS you can also find the beginnings of an Eclipse HTML export wizard and a PDF export wizard (menu: File->Export...) for saving the HTML files to disk, or for converting the internal HTML to PDF with the iText packages: http://www.lowagie.com/iText
If you would like to help improve the plugin or the parser, please mail me.
wikitech-l@lists.wikimedia.org