Hi there,
Does anyone know if there is a Java-based parser for the MediaWiki formatting language? I would love to transform an article into HTML, converting e.g. [image:xxx] into <img src=...>. I found the Lucene project in CVS, but it only cleans up the strings.
Also, can someone give me a hint where I can find full documentation of the allowed annotations? The help pages weren't available, or at least weren't complete.
Thanks for any hints. Stefan
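[For illustration, the kind of [image:xxx] -> <img> rewrite asked about above could be sketched as a toy regex pass in plain Java. Class and method names here are made up, and a real parser has to handle far more edge cases than a single regex ever can:]

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical one-rule converter: turns [image:xxx] into an <img> tag.
// Only a sketch of the idea, not a substitute for a real wiki parser.
class ImageLinkConverter {
    private static final Pattern IMAGE = Pattern.compile("\\[image:([^\\]]+)\\]");

    static String convert(String wikiText) {
        Matcher m = IMAGE.matcher(wikiText);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // quoteReplacement keeps '$' and '\' in file names literal
            m.appendReplacement(out,
                "<img src=\"" + Matcher.quoteReplacement(m.group(1)) + "\" />");
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // prints: See <img src="Example.png" /> here.
        System.out.println(convert("See [image:Example.png] here."));
    }
}
```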
Stefan Groschupf wrote:
> Does anyone know if there is a Java-based parser for the MediaWiki
> formatting language? I would love to transform an article into HTML,
> converting e.g. [image:xxx] into <img src=...>. I found the Lucene
> project in CVS, but it only cleans up the strings.
Try the flexbisonparse module from the CVS. It contains a mostly-finished Bison-generated C parser that turns wiki text into XML.
<cardealer> It needs a few minor fixes and a total rework of the HTML part, but otherwise it's in good shape... </cardealer>
Be sure to write a *real* parser, not "merely" a converter. We have plenty of those, half-finished. Wiki syntax is really tricky when it comes to special cases and syntax combinations.
If you can get a "real" parser to work, please let us know.
> Or can someone give me a hint where I can find full documentation of
> the allowed annotations, since the help pages weren't available, or at
> least weren't complete.
http://en.wikipedia.org/wiki/Wikipedia:How_to_edit_a_page
Magnus
Magnus Manske wrote:
> Try the flexbisonparse module from the CVS. It contains a
> mostly-finished Bison-generated C parser that turns wiki text into XML.
If a Java parser is tried, this might be of interest:
It might be possible to adapt the Bison specs to work with javacc, although using the JNI to call the Bison generated C parser would be a lot simpler, depending on what the Java based parser is being used for.
On Monday 14 February 2005 17:53, Jim Higson wrote:
> If a Java parser is tried, this might be of interest:
An alternative might be http://www.antlr.org/. If someone would try to write a proper parser in Java, I'd be glad to help.
> It might be possible to adapt the Bison specs to work with javacc,
> although using the JNI to call the Bison generated C parser would be a
> lot simpler, depending on what the Java based parser is being used for.
Not a good idea - one of the best things about Java is that you can just throw in a JAR file and it works on all supported platforms, if you do it right.
Playing around with FUSE and Java has taught me one thing: JNI is slow, clumsy and absolutely no fun. Polluting your code with dependencies on native code leads to far too many problems.
I'd avoid it whenever possible.
daniel
Daniel Wunsch wrote:
> An alternative might be http://www.antlr.org/. If someone would try to
> write a proper parser in Java, I'd be glad to help.
Hi
A better Java parser would also be interesting for our Eclipse plugin project. Although we already have a handwritten parser, it might be better to create a new one with a parser generator.
I made a short description of the existing CVS modules and of further possible improvements (like PDF generation) here: http://www.plog4u.org/index.php/Using_Eclipse_Wikipedia_Editor:Development Maybe some of you can comment on it.
How can we join our development efforts?
Hi, thanks to everybody for the responses. Some comments:
+ Using C from Java makes no sense for my project, and I have personally had a lot of bad experiences with JNI.
+ JavaCC would be the best choice from a performance point of view; however, it would be very hard to write a JavaCC parser for the Wikipedia syntax, since the syntax is not very strict.
+ Axel's parser looks interesting since, first of all, it already exists :-) and the concept of using the Radeox wiki engine makes sense. However, as Axel already mentioned, the code is not very contribution-friendly and is not its own JAR.
Before I heard of Axel's parser, I spent some hours writing a parser based on Apache NekoHTML: http://www.apache.org/~andyc/neko/doc/html/
The advantage is that NekoHTML can be extended with filters, uses the Xerces Native Interface (XNI) framework, and already parses any HTML snippets that occur in the text. Note that the result of the NekoHTML parser extended with a Wikipedia filter set is not HTML but an in-memory DOM object, which can then be transformed to XHTML.
I have more or less written the basic infrastructure and defined an interface that needs to be implemented to handle the content of a 'tagged' text snippet. However, I now need to implement as many classes as there are tags in the Wikipedia syntax. :-o
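[A minimal sketch of what such a per-tag handler interface could look like. All names here are hypothetical, not the actual code being described, and it builds plain JDK DOM nodes rather than going through NekoHTML/XNI:]

```java
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical handler contract: one implementation per wiki tag,
// each turning the content of a tagged snippet into a DOM fragment.
interface TagHandler {
    /** The wiki tag this handler is responsible for, e.g. "image". */
    String tagName();

    /** Builds the DOM fragment for one occurrence of the tag. */
    Element handle(Document doc, String content);
}

// Example implementation for [image:xxx] snippets.
class ImageTagHandler implements TagHandler {
    public String tagName() {
        return "image";
    }

    public Element handle(Document doc, String content) {
        Element img = doc.createElement("img");
        img.setAttribute("src", content);
        return img;
    }
}
```

A registry mapping tag names to handlers would then let the parser dispatch each recognized snippet without a giant switch statement.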
Two possibilities would be interesting for me: refactor Axel's code and separate out the parser, or I contribute my code to CVS somewhere and we glue things together.
Anyway, I am personally searching for a quick solution. ;-)
Stefan
---------------------------------------------------------------
company: http://www.media-style.com
forum: http://www.text-mining.org
blog: http://www.find23.net
Stefan Groschupf wrote:
> Two possibilities would be interesting for me: refactor Axel's code and
> separate out the parser, or I contribute my code to CVS somewhere and
> we glue things together.
Sounds interesting. Could you please send me the current sources, so that I can run some tests with the parser inside the Eclipse plugin?
If you like I can also give you CVS access.
Stefan Groschupf wrote:
> Anyway, I now need to implement as many classes as there are tags in
> the Wikipedia syntax. :-o
The Bison parser does something like this:
<div attr1="foo" attr2="bar">stuff</div>
to
<extension name='div'><attr name="attr1">foo</attr><attr name="attr2">bar</attr>stuff</extension>
Maybe that would be easier to do?
Magnus
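[The generic tag-to-<extension> rewrite Magnus shows above could be sketched in Java like this. This is a hypothetical regex-based toy that only handles a single well-formed, non-nested tag with double-quoted attributes; the real Bison parser of course works on a parse tree:]

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Toy rewrite of <tag a="x">body</tag> into the generic
// <extension name='tag'><attr name="a">x</attr>body</extension> form.
class ExtensionRewriter {
    // One well-formed tag: name, optional attr="value" pairs, body, close.
    private static final Pattern TAG =
        Pattern.compile("<(\\w+)((?:\\s+\\w+=\"[^\"]*\")*)\\s*>(.*?)</\\1>");
    private static final Pattern ATTR = Pattern.compile("(\\w+)=\"([^\"]*)\"");

    static String rewrite(String html) {
        Matcher m = TAG.matcher(html);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            StringBuilder repl =
                new StringBuilder("<extension name='" + m.group(1) + "'>");
            Matcher a = ATTR.matcher(m.group(2));
            while (a.find()) {
                repl.append("<attr name=\"").append(a.group(1)).append("\">")
                    .append(a.group(2)).append("</attr>");
            }
            repl.append(m.group(3)).append("</extension>");
            m.appendReplacement(out, Matcher.quoteReplacement(repl.toString()));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

The nice property of this representation is exactly the one Magnus points at: the downstream XML consumer needs one handler for <extension> instead of one class per tag.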
Stefan Groschupf wrote:
> Does anyone know if there is a Java-based parser for the MediaWiki
> formatting language? I would love to transform an article into HTML.
> Or can someone give me a hint where I can find full documentation of
> the allowed annotations?
Hi
I started a Java parser at http://www.plog4u.org (German site: http://www.plog4u.de).
It's an Eclipse plugin which contains a Wikipedia syntax parser for the internal previewer. The internal previewer creates HTML files. The parser isn't perfect yet, but many things work.
The parser/render engine is based on radeox: http://www.radeox.org
Radeox is also used in wiki engines like http://www.snipsnap.org and http://www.xwiki.org
In CVS you can also find the beginnings of an Eclipse HTML export wizard and a PDF export wizard (menu: File->Export...) for saving the HTML files to disk, or for converting the internal HTML to PDF with the iText packages: http://www.lowagie.com/iText
If you would like to help improve the plugin or the parser, please mail me.
wikitech-l@lists.wikimedia.org