Thank you all for your contribs :).
Hi,
So... I was over-optimistic about managing to extract the first paragraph of a "Wikipedia" article out of its "Wikitext" easily...
Yet, I managed (1), for instance for the "Wikipedia" article "Čokot", to get the following "Wikitext" sentence:
-------------------------------------------------------------------------
'''Cokot''', en [[serbe]] [[Alphabet cyrillique serbe|cyrillique]] {{lang|sr|?????}}, est une localité de [[Serbie]] située dans la municipalité de [[Palilula (Niš)]], district de [[Nišava (district)| Nišava]]. En [[2002]], elle comptait {{formatnum:1401}} habitants<ref name="stats1">{{Historique de la population (Serbie)}}</ref>, dont une majorité de [[Serbes]].
-------------------------------------------------------------------------
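To make the idea in (1) concrete, here is a minimal sketch of it (not necessarily the exact code I used). It assumes that the first paragraph ends at the first blank line ("\n\n"); the class and method names (FirstParagraph, extract) are made up for illustration. It will not work in all cases, for instance when the article starts with an infobox template.

```java
final class FirstParagraph {

    /**
     * Returns the text up to the first blank line.
     * Assumption: a paragraph ends with "\n\n"; leading templates
     * such as infoboxes are NOT skipped by this sketch.
     */
    static String extract(String wikitext) {
        // Normalize Windows line endings and drop surrounding whitespace.
        String text = wikitext.replace("\r\n", "\n").trim();
        int end = text.indexOf("\n\n");
        // No blank line found: the whole text is one paragraph.
        return end < 0 ? text : text.substring(0, end);
    }
}
```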
I then used the "Bliki" (2) engine to convert this "Wikitext" sentence to "HTML". Here is what I got:
-------------------------------------------------------------------------
<p>Cokot, en http://fr.wikipedia.org/wiki/Serbe serbe http://fr.wikipedia.org/wiki/Alphabet_cyrillique_serbe cyrillique {{lang}}, est une localité de http://fr.wikipedia.org/wiki/Serbie Serbie située dans la municipalité de http://fr.wikipedia.org/wiki/Palilula_(Ni%C2%9A) Palilula (Niš) , district de http://fr.wikipedia.org/wiki/Ni%C2%9Aava_(district) Nišava . En http://fr.wikipedia.org/wiki/2002 2002 , elle comptait {{formatnum:1401}} habitants<sup id="_ref-stats1_a" class="reference"> #_note-stats1 [1] </sup>, dont une majorité de http://fr.wikipedia.org/wiki/Serbes Serbes .</p>
-------------------------------------------------------------------------
This "HTML" sentence still contains two "Wikitext" chunks:
- {{lang}} and
- {{formatnum:1401}}.

=> "{{lang}}" should have been suppressed.
=> "{{formatnum:1401}}" should have been replaced by "1401".
So, I posted on the "Bliki" forum (3), and someone told me that they had not yet implemented what is needed to handle the two chunks of "Wikitext" that remain in the example above... and that I would have to do it myself...
I chose "Bliki" because a Java ".jar" archive was available, ready to be embedded in my Eclipse project, which is quite convenient for me.
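For what it's worth, "doing it myself" could be a pre-processing step applied to the "Wikitext" before it is handed to the converter. The following is only a rough sketch under my own assumptions: the class and method names (TemplatePreprocessor, preprocess) are hypothetical, it only covers the two templates from the example, it uses plain regular expressions rather than anything provided by "Bliki", and it does not handle nested templates.

```java
import java.util.regex.Pattern;

final class TemplatePreprocessor {

    // {{formatnum:1401}} -> 1401 (keeps the raw number, no locale formatting)
    private static final Pattern FORMATNUM =
            Pattern.compile("\\{\\{formatnum:([^{}|]*)\\}\\}");

    // {{lang|sr|...}} -> removed entirely, as wished for in the example above
    private static final Pattern LANG =
            Pattern.compile("\\{\\{lang\\|[^{}]*\\}\\}");

    /** Rewrites the two templates; everything else is left untouched. */
    static String preprocess(String wikitext) {
        String out = FORMATNUM.matcher(wikitext).replaceAll("$1");
        return LANG.matcher(out).replaceAll("");
    }
}
```

Note that suppressing "{{lang|...}}" this way leaves the surrounding whitespace as it was, so a stray space before the following comma would still have to be cleaned up separately.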
MY FIRST QUESTION IS:
=====================
I was wondering if you knew a better tool than this one... one which wouldn't "miss" some "Wikitext" chunks of code like in the above example (or maybe which at least would handle usual templates like "lang" and "formatnum")?
MY SECOND QUESTION IS:
======================
I was also wondering: the parser which is used in "Wikipedia" works pretty well... I mean: such things as above never happen... as far as I know... So my question is: is this parser available? Where? Can I use it with my Java code? And please, forgive me if this question is naïve...
Thank you for your help and indulgence.
All the best,
--
Lmhelp
(1) Really, it is something which probably wouldn't work in all cases; it is based on the fact that a paragraph ends with "\n\n", as "Platonides" said in his first post.
(2) http://code.google.com/p/gwtwiki/
(3) http://groups.google.com/group/bliki/browse_thread/thread/7ed33272b206826f