Hi,
Thank you for reading my post.
I am wondering if there exists a "grammar" for the "Wikicode"/"Wikitext" language (or an *exhaustive* (and formal) set of rules about how a "Wikitext" is constructed). I've looked for such a grammar/set of rules on the Web but I couldn't find one...
I need to automatically extract the first paragraph of a Wiki article...
I did it from the HTML version of a Wiki article (because I noticed the first paragraph was the first <p> element child of a <div> element whose id is "bodyContent"...) but I need to work with the "Wikitext" itself...
- Is a grammar available somewhere?
- Do you have any idea how to extract the first paragraph of a Wiki article?
- Any advice?
- Does a Java "Wikitext" "parser" exist which would do it?
Thank you for your help. All the best,
-- Lmhelp
lmhelp wrote:
Hi,
Thank you for reading my post.
I am wondering if there exists a "grammar" for the "Wikicode"/"Wikitext" language (or an *exhaustive* (and formal) set of rules about how a "Wikitext" is constructed). I've looked for such a grammar/set of rules on the Web but I couldn't find one...
No. But see http://www.mediawiki.org/wiki/Markup_spec for grammars which "kind of work".
I need to automatically extract the first paragraph of a Wiki article...
I did it from the HTML version of a Wiki article (because I noticed the first paragraph was the first <p> element child of a <div> element whose id is "bodyContent"...) but I need to work with the "Wikitext" itself...
- Is a grammar available somewhere?
- Do you have any idea how to extract the first paragraph of a Wiki article?
- Any advice?
- Does a Java "Wikitext" "parser" exist which would do it?
Get the first text before a double new line (\n\n), which is what splits paragraphs in wikitext.
However, pages commonly begin with templates, so if the page begins with {{, you would remove everything up to the matching }} (and remove leading whitespace).
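For illustration, a minimal Java sketch of that heuristic (the class and method names are made up; it only handles the simple case described above, not every template layout):
-----------------------------------------------------------------------
// Sketch: strip leading templates, then take the text up to the first blank line.
// Assumes reasonably well-formed wikitext; template braces may be nested.
public class FirstParagraph {

    public static String extract(String wikitext) {
        String s = wikitext.trim();
        // Remove leading templates such as infoboxes: {{ ... }}, possibly nested.
        while (s.startsWith("{{")) {
            int depth = 0, i = 0;
            for (; i < s.length() - 1; i++) {
                if (s.charAt(i) == '{' && s.charAt(i + 1) == '{') { depth++; i++; }
                else if (s.charAt(i) == '}' && s.charAt(i + 1) == '}') {
                    depth--; i++;
                    if (depth == 0) { i++; break; }
                }
            }
            s = s.substring(Math.min(i, s.length())).trim();
        }
        // Paragraphs are separated by a blank line (\n\n).
        int end = s.indexOf("\n\n");
        return end >= 0 ? s.substring(0, end) : s;
    }
}
-----------------------------------------------------------------------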
On 4 August 2010 20:45, lmhelp lmbox@wanadoo.fr wrote:
I am wondering if there exists a "grammar" for the "Wikicode"/"Wikitext" language (or an *exhaustive* (and formal) set of rules about how a "Wikitext" is constructed). I've looked for such a grammar/set of rules on the Web but I couldn't find one...
There isn't one. The "parser" is not actually a parser - it takes wikitext in, does things to it and spits HTML out. Much of its expected behaviour is actually emergent properties of the vagaries of PHP.
Many have tried to write a description of wikitext that isn't the code itself, all so far have failed ...
- Is a grammar available somewhere?
- Do you have any idea how to extract the first paragraph of a Wiki article?
- Any advice?
- Does a Java "Wikitext" "parser" exist which would do it?
If anyone ever does come up with an algorithm that accurately
On 4 August 2010 23:58, David Gerard dgerard@gmail.com wrote:
On 4 August 2010 20:45, lmhelp lmbox@wanadoo.fr wrote:
- Is a grammar available somewhere?
- Do you have any idea how to extract the first paragraph of a Wiki article?
- Any advice?
- Does a Java "Wikitext" "parser" exist which would do it?
If anyone ever does come up with an algorithm that accurately
... renders the existing body of wikitext the way the present code does, it will be a day of great joy.
- d.
The current "parser" is, as David Gerard said, not much of a parser by any conventional definition. It's more of a macro-expander (for parser tags and templates) and a series of mostly-regular-expression-based replacement routines, which result in partially valid HTML which is then repaired in most cases to be valid HTML.
This past spring I wrote a parser which tokenizes and parses wikitext into a node-tree. It understands template nesting and it completely ignores HTML comments and parser tags using a masking technique.
/start of long-winded explanation/
The key to parsing wikitext is to use a mental model of what's going on, not to get stuck on the source code of the "parser" or get too worked up about BNF and its variants. Wikitext is based on blocks - blocks are one or more consecutive lines which share a rendering intent, such as a paragraph, list, table, or heading. Some blocks (one or more lines) should be merged together with neighboring blocks of the same type, such as list items, while some mixed lines (single lines containing more than one logical block) should be broken apart, such as raw text typed on the same line just after a table closes.
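To make the block model concrete, a rough Java sketch (this is not Trevor's PHP meta-language; just a hypothetical classifier that groups consecutive lines of the same kind):
-----------------------------------------------------------------------
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the "block" mental model: classify each line,
// then merge consecutive lines that share a rendering intent.
class BlockSketch {
    enum Kind { HEADING, LIST_ITEM, TABLE, PARAGRAPH, BLANK }

    static Kind classify(String line) {
        if (line.trim().isEmpty()) return Kind.BLANK;
        if (line.startsWith("=")) return Kind.HEADING;
        if (line.startsWith("*") || line.startsWith("#")) return Kind.LIST_ITEM;
        if (line.startsWith("{|") || line.startsWith("|")) return Kind.TABLE;
        return Kind.PARAGRAPH;
    }

    // Group lines into blocks: neighbouring lines of the same kind are merged.
    static List<List<String>> blocks(String wikitext) {
        List<List<String>> result = new ArrayList<>();
        Kind previous = null;
        for (String line : wikitext.split("\n")) {
            Kind kind = classify(line);
            if (kind == Kind.BLANK) { previous = null; continue; }
            if (kind != previous) result.add(new ArrayList<>());
            result.get(result.size() - 1).add(line);
            previous = kind;
        }
        return result;
    }
}
-----------------------------------------------------------------------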
The parser I wrote expresses these rules and all syntax in a simple meta-language written as PHP arrays. I've been running real Wikipedia articles through it for a while with excellent results. I don't have a template expander or HTML renderer yet, so right now the results are merely syntax-highlighted wikitext visually broken into logical blocks, or raw JSON/XML dumps of the node-tree.
The reason I went about writing this parser was to solve a problem on the front-end, which is that there's no way to know where any given portion of a page came from, and the current parser doesn't follow any rules of encapsulation. It could have been text directly within the article, the result of expanding one or more templates, or the output of processing a parser tag. By parsing the wikitext into a node-tree, it can be rendered in an encapsulated way, and IDs and classes can be added to the output to explain where each bit of text came from.
By encapsulation, I specifically mean that the result of any generated content, such as template expansion or parser-hooks, should be complete, validated HTML, opening all tags it closes and closing all tags it opens. This is different from the way templates and parser-hooks currently work, and would require adjustments to some templates, but such template reform is feasible, and such use of templates is defensibly evil anyway.
I showed a demo of this parser working in Berlin this year, and got more done on it while stuck in Berlin thanks to the volcano of death, but since I've been back at work I have not had much time to complete it. I intend to get this code up on our SVN soon as part of the flying-cars version of MediaWiki I've been hacking away at on my laptop.
Just wanted to throw this all in here, hopefully it will be useful. I'm glad to share more about what I learned embarking on this endeavor and share my code as well - might commit it within a week or two.
/end of long-winded explanation/
In short, the current "parser" is a bad example of how to write a parser, but it does work. I have found that studying how it works is far less useful than observing what it does in practice and reverse engineering it with more scalable and flexible parsing techniques in mind.
- Trevor
On 8/4/10 3:58 PM, David Gerard wrote:
On 4 August 2010 20:45, lmhelp lmbox@wanadoo.fr wrote:
I am wondering if there exists a "grammar" for the "Wikicode"/"Wikitext" language (or an *exhaustive* (and formal) set of rules about how a "Wikitext" is constructed). I've looked for such a grammar/set of rules on the Web but I couldn't find one...
There isn't one. The "parser" is not actually a parser - it takes wikitext in, does things to it and spits HTML out. Much of its expected behaviour is actually emergent properties of the vagaries of PHP.
Many have tried to write a description of wikitext that isn't the code itself, all so far have failed ...
- Is a grammar available somewhere?
- Do you have any idea how to extract the first paragraph of a Wiki article?
- Any advice?
- Does a Java "Wikitext" "parser" exist which would do it?
If anyone ever does come up with an algorithm that accurately
On 6 August 2010 18:59, Trevor Parscal tparscal@wikimedia.org wrote:
In short, the current "parser" is a bad example of how to write a parser,
I forgot to call it "a box of pure malevolent evil, a purveyor of insidious insanity, an eldritch manifestation that would make Bill Gates let out a low whistle of admiration," but it's all those, too.
but it does work. I have found that studying how it works is far less useful than observing what it does in practice and reverse engineering it with more scalable and flexible parsing techniques in mind.
Dude, if you've got what you've described here, you may in fact be a genius.
- d.
Thank you all for your contribs :).
Hi,
So... I was over-optimistic about managing to extract the first paragraph of a "Wikipedia" article out of its "Wikitext" easily...
Yet, I managed (1), for instance for the "Wikipedia" article "Čokot", to get the following "Wikitext" sentence:
-------------------------------------------------------------------------
'''Čokot''', en [[serbe]] [[Alphabet cyrillique serbe|cyrillique]] {{lang|sr|Чокот}}, est une localité de [[Serbie]] située dans la municipalité de [[Palilula (Niš)]], district de [[Nišava (district)| Nišava]]. En [[2002]], elle comptait {{formatnum:1401}} habitants<ref name="stats1">{{Historique de la population (Serbie)}}</ref>, dont une majorité de [[Serbes]].
-------------------------------------------------------------------------
I then used the "Bliki" (2) engine to convert this "Wikitext" sentence to "HTML". Here is what I got:
-------------------------------------------------------------------------
<p>Cokot, en http://fr.wikipedia.org/wiki/Serbe serbe http://fr.wikipedia.org/wiki/Alphabet_cyrillique_serbe cyrillique {{lang}}, est une localité de http://fr.wikipedia.org/wiki/Serbie Serbie située dans la municipalité de http://fr.wikipedia.org/wiki/Palilula_(Ni%C2%9A) Palilula (Niš) , district de http://fr.wikipedia.org/wiki/Ni%C2%9Aava_(district) Nišava . En http://fr.wikipedia.org/wiki/2002 2002 , elle comptait {{formatnum:1401}} habitants<sup id="_ref-stats1_a" class="reference"> #_note-stats1 [1] </sup>, dont une majorité de http://fr.wikipedia.org/wiki/Serbes Serbes .</p>
-------------------------------------------------------------------------
This "HTML" sentence still contains two "Wikitext" chunks:
- {{lang}} and
- {{formatnum:1401}}.
=> "{{lang}}" should have been suppressed. => "{{formatnum:1401}}" should have been replaced by "1401".
So, I posted on the "Bliki" forum (3) and someone told me they hadn't yet implemented what was necessary to handle the two chunks of "Wikitext" that remain in the example above... and that I had to do it myself...
The reason I chose "Bliki" is because there was a Java ".jar" archive available (and ready to be embedded in my Eclipse project) which is quite convenient for me.
MY FIRST QUESTION IS:
=====================
I was wondering if you knew a better tool than this one... one which wouldn't "miss" some "Wikitext" chunks of code like in the above example (or maybe which at least would handle usual templates like "lang" and "formatnum")?
MY SECOND QUESTION IS:
======================
I was also wondering: the parser which is used in "Wikipedia" works pretty well... I mean: such things as above never happen... as far as I know... So my question is: is this parser available? Where? Can I use it with my Java code? And please, forgive me if this question is naïve...
Thank you for your help and indulgence. All the best, -- Lmhelp
(1) Really, it is something which probably wouldn't work in all cases and which relies on the fact that a paragraph ends with "\n\n", as "Platonides" said in his first post.
(2) http://code.google.com/p/gwtwiki/
(3) http://groups.google.com/group/bliki/browse_thread/thread/7ed33272b206826f
On Sat, Aug 7, 2010 at 9:21 AM, lmhelp lmbox@wanadoo.fr wrote:
MY FIRST QUESTION IS:
I was wondering if you knew a better tool than this one... one which wouldn't "miss" some "Wikitext" chunks of code like in the above example (or maybe which at least would handle usual templates like "lang" and "formatnum")?
mwlib is the best parser available for folks who want to do a quick job such as yours.
MY SECOND QUESTION IS:
I was also wondering: the parser which is used in "Wikipedia" works pretty well... I mean: such things as above never happen... as far as I know... So my question is: is this parser available? Where? Can I use it with my Java code? And please, forgive me if this question is naïve...
You can use the dumpHTML maintenance script to convert wikitext to html, and then you can use a dom library such as BeautifulSoup to grab all of the text nodes. This approach is very similar to using mwlib, except that it trades off using a lot of cpu time for being a little bit easier.
Hi,
Thank you for your answer.
mwlib is the best parser available for folks who want to do a quick job such as yours.
Maybe it is, I don't know... I have recently learned that it is not an easy task to construct a parser for "Wikitext"... but, to be fair, it is not really satisfactory to have {{lang}} and {{formatnum:1401}} left in the generated "HTML" code, is it (I mean, given that this never happens on "Wikipedia")?
You can use the dumpHTML maintenance script to convert wikitext to html
Would "dumpHTML" work with only one "Wikitext" sentence having to be translated to "HTML"?
Actually, on: http://www.mediawiki.org/wiki/Extension:DumpHTML one can read: "dumpHTML is an extension for generating a simple HTML dump, including images and media files, of a MediaWiki installation". It looks a bit oversized in my case... doesn't it?
All the best, -- Lmhelp
On Sat, Aug 7, 2010 at 10:54 AM, lmhelp lmbox@wanadoo.fr wrote:
Hi,
Thank you for your answer.
mwlib is the best parser available for folks who want to do a quick job such as yours.
Maybe it is, I don't know... I have recently learned that it is not an easy task to construct a parser for "Wikitext"... but, to be fair, it is not really satisfactory to have {{lang}} and {{formatnum:1401}} left in the generated "HTML" code, is it (I mean, given that this never happens on "Wikipedia")?
mwlib was written in conjunction with the WMF, and IIRC had at least some input from Brion Vibber. It's high quality and works well. There is a 2-3 hour learning curve for navigating the python modules and methods using dir and help.
You can use the dumpHTML maintenance script to convert wikitext to html
Would "dumpHTML" work with only one "Wikitext" sentence having to be translated to "HTML"?
Actually, on: http://www.mediawiki.org/wiki/Extension:DumpHTML one can read: "dumpHTML is an extension for generating a simple HTML dump, including images and media files, of a MediaWiki installation". It looks a bit oversized in my case... doesn't it?
IIRC dumpHTML is a maintenance script that is included with mediawiki. I don't believe that it requires you to have images. I have used both of the approaches I described to you in the past, and found them both to be straightforward.
All the best,
Lmhelp
So why not use the "real" parser?
* Get rendered HTML page
* Extract <div id="bodyContent">
* Take the first <p> element in there
Profit!
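For what it's worth, a minimal Java sketch of that approach, assuming the jsoup HTML parser is available on the classpath (the URL is only an example):
-----------------------------------------------------------------------
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Fetch the rendered page and take the first <p> inside div#bodyContent.
public class FirstParagraphFromHtml {
    public static void main(String[] args) throws Exception {
        Document doc = Jsoup.connect("http://fr.wikipedia.org/wiki/%C4%8Cokot").get();
        Element p = doc.select("div#bodyContent > p").first();
        System.out.println(p != null ? p.text() : "(no paragraph found)");
    }
}
-----------------------------------------------------------------------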
Magnus
On Sat, Aug 7, 2010 at 6:19 PM, Brian J Mingus brian.mingus@colorado.edu wrote:
On Sat, Aug 7, 2010 at 10:54 AM, lmhelp lmbox@wanadoo.fr wrote:
Hi,
Thank you for your answer.
mwlib is the best parser available for folks who want to do a quick job such as yours.
Maybe it is, I don't know... I have recently learned that it is not an easy task to construct a parser for "Wikitext"... but, to be fair, it is not really satisfactory to have {{lang}} and {{formatnum:1401}} left in the generated "HTML" code, is it (I mean, given that this never happens on "Wikipedia")?
mwlib was written in conjunction with the WMF, and IIRC had at least some input from Brion Vibber. It's high quality and works well. There is a 2-3 hour learning curve for navigating the python modules and methods using dir and help.
You can use the dumpHTML maintenance script to convert wikitext to html
Would "dumpHTML" work with only one "Wikitext" sentence having to be translated to "HTML"?
Actually, on: http://www.mediawiki.org/wiki/Extension:DumpHTML one can read: "dumpHTML is an extension for generating a simple HTML dump, including images and media files, of a MediaWiki installation". It looks a bit oversized in my case... doesn't it?
IIRC dumpHTML is a maintenance script that is included with mediawiki. I don't believe that it requires you to have images. I have used both of the approaches I described to you in the past, and found them both to be straightforward.
All the best,
Lmhelp
So why not use the "real" parser?
Exactly. Where can it be found, please?
Thanks and all the best, -- Lmhelp
Hi,
I have abandoned "Bliki". Look at what happened:
Here is what I gave to "Bliki" as an input:
-------------------------------------------------------------------------------
Le {{Guil|'''parti philosophique'''}} désignait globalement au {{s-|XVIII|e|}}, en [[France]], les intellectuels partisans du mouvement des [[Lumières (philosophie)|Lumières]], par opposition au parti dit dévôt.
-------------------------------------------------------------------------------
And here is what I got as an (HTML) output:
-------------------------------------------------------------------------------
<p>Le {{Guil}} désignait globalement au {{s-}}, en <a href="http://fr.wikipedia.org/wiki/France" title="France">France , les intellectuels partisans du mouvement des <a href="http://fr.wikipedia.org/wiki/Lumi%C3%A8res_(philosophie)" title="Lumières (philosophie)">Lumières , par opposition au parti dit dévôt. </p>
-------------------------------------------------------------------------------
The two templates:
- {{Guil|'''parti philosophique'''}}
- {{s-|XVIII|e|}}
haven't been resolved correctly; they were reduced to, respectively:
- {{Guil}}
- {{s-}}
I think this is worse than not being resolved at all because, here for instance, we lose information.
I don't know if this is normal or not. I don't know either if this is the right forum for such a thread... Indeed, I have the impression that you mostly work with "wiki"s, not with programs, like Java programs, that deal with "Wikitext"...
Someone told me: why not use the real parser? Ok. I am not against it but where can it be found? Can you give me the URL?
Thanks. Best regards, -- Lmhelp
On Sun, Aug 8, 2010 at 9:49 PM, lmhelp lmbox@wanadoo.fr wrote:
Hi,
I have abandoned "Bliki". Look at what happened:
Here is what I gave to "Bliki" as an input:
Le {{Guil|'''parti philosophique'''}} désignait globalement au {{s-|XVIII|e|}}, en [[France]], les intellectuels partisans du mouvement des [[Lumières (philosophie)|Lumières]], par opposition au parti dit dévôt.
And here is what I got as an (HTML) output:
<p>Le {{Guil}} désignait globalement au {{s-}}, en <a href="http://fr.wikipedia.org/wiki/France" title="France">France , les intellectuels partisans du mouvement des <a href="http://fr.wikipedia.org/wiki/Lumi%C3%A8res_(philosophie)" title="Lumières (philosophie)">Lumières , par opposition au parti dit dévôt. </p>
-------------------------------------------------------------------------------
The two templates:
- {{Guil|'''parti philosophique'''}}
- {{s-|XVIII|e|}}
haven't been resolved correctly, respectively:
- {{Guil}}
- {{s-}}
In your WikiModel you have to implement how to get the raw text of these templates. In the APIWikiModel, there's an example getRawWikiContent() method which reads the template through http://en.wikipedia.org/w/api.php
http://code.google.com/p/gwtwiki/source/browse/trunk/info.bliki.wiki/bliki-p...
See these examples for reading articles from wikipedia: http://code.google.com/p/gwtwiki/source/browse/trunk/info.bliki.wiki/bliki-p...
Hi Axel,
Thank you for your answer.
I am wondering... how do you explain that the two templates "{{Guil|'''parti philosophique'''}}" and "{{s-|XVIII|e|}}" in my example are not processed correctly (by default) (*)?
Is it because "Bliki" works correctly with English "wiki" articles but not with, for instance, French ones? I mean: the template "century" ("{{s-|XVIII|e|}}") exists in English "wiki"s but doesn't have the same "schema"/"structure" as the French one, and this would be the reason why it is processed weirdly?
Thank you for answering. All the best, -- Lmhelp
(*) In Java, I am only doing:
-----------------------------------------------------------------------
htmlStr = wikiModel.render(sWikiText);
-----------------------------------------------------------------------
with "sWikiText" being exactly (for instance):
-----------------------------------------------------------------------
Le {{Guil|'''parti philosophique'''}} désignait globalement au {{s-|XVIII|e|}}, en [[France]], les intellectuels partisans du mouvement des [[Lumières (philosophie)|Lumières]], par opposition au parti dit dévôt.
-----------------------------------------------------------------------
and without having done anything else in addition.
OK, I can answer the question myself: no. It doesn't depend on the Wikipedia language.
-- Lmhelp
On 8/9/2010 9:19 AM, lmhelp2 wrote:
Hi Axel,
Thank you for your answer.
I am wondering... how do you explain that the two templates "{{Guil|'''parti philosophique'''}}" and "{{s-|XVIII|e|}}" in my example are not processed correctly (by default) (*)?
Is it because "Bliki" works correctly with English "wiki" articles but not with, for instance, French ones? I mean: the template "century" ("{{s-|XVIII|e|}}") exists in English "wiki"s but doesn't have the same "schema"/"structure" as the French one, and this would be the reason why it is processed weirdly?
Thank you for answering. All the best, -- Lmhelp
(*) In Java, I am only doing:
htmlStr = wikiModel.render(sWikiText);
-----------------------------------------------------------------------
with "sWikiText" being exactly (for instance):
-----------------------------------------------------------------------
Le {{Guil|'''parti philosophique'''}} désignait globalement au {{s-|XVIII|e|}}, en [[France]], les intellectuels partisans du mouvement des [[Lumières (philosophie)|Lumières]], par opposition au parti dit dévôt.
-----------------------------------------------------------------------
and without having done anything else in addition.
2010-08-07 20:24, lmhelp skrev:
So why not use the "real" parser?
Exactly. Where can it be found, please?
Thanks and all the best,
Lmhelp
Fetch the HTML from wikipedia.org with something like wget (playing nicely and using delays!) and then extract the first <p> element with something which parses the HTML into a tree. I've done that using Perl with HTML::Tree. Alternatively, a regular expression like /<p\b.+?</p>/ might do the extraction just as well, but cheaper and faster, if you're just after the first <p> element! Really cheap, I know!
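The same crude regex idea, sketched in Java (fragile on unusual markup, as noted above):
-----------------------------------------------------------------------
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Crude extraction of the first <p>...</p> from already-fetched HTML.
// Fine for a quick job, but it will break on nested or unusual markup.
class CrudeParagraphGrabber {
    static String firstParagraph(String html) {
        Pattern p = Pattern.compile("<p\\b.*?</p>", Pattern.DOTALL | Pattern.CASE_INSENSITIVE);
        Matcher m = p.matcher(html);
        return m.find() ? m.group() : null;
    }
}
-----------------------------------------------------------------------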
/BP
Hi Magnus,
This would be really great if I could do that!
Where can I download the "real" parser?
Can I use it in the following way:
=> let's suppose:
- the parser's name is "wiki_to_html_parser",
- I have a "Wikipedia" article in its "Wikitext" version, "article.wikitext",
- I want to generate the corresponding "HTML" file, "article.html"
=> could I execute something like:
-------------------------------------------------------------------------------
command_line> wiki_to_html_parser -wikitext article.wikitext -html article.html
------------------------------------------------------------------------------- which would generate "article.html" from "article.wikitext" using the "real" parser?
And what would be even better for me, would be to be able to do that from inside a Java program. Is it possible?
Thank you for your help. Sincerely, -- Léa
On 8/7/2010 8:19 PM, Magnus Manske wrote:
So why not use the "real" parser?
- Get rendered HTML page
- Extract <div id="bodyContent">
- Take the first <p> element in there
Profit!
Magnus
mwlib was written in conjunction with the WMF, and IIRC had at least some input from Brion Vibber. It's high quality and works well. There is a 2-3 hour learning curve for navigating the python modules and methods using dir and help.
Well, faced with the evidence... "{{lang}}" and "{{formatnum...}}" are not processed... Well, I guess I should stop insisting on that.
IIRC dumpHTML is a maintenance script that is included with mediawiki. I don't believe that it requires you to have images. I have used both of the approaches I described to you in the past, and found them both to be straightforward.
Ok. I have looked closer. I have downloaded "DumpHTML" and tried to execute it with a file containing my "Wikitext" sentence. "DumpHTML" is not OK for me.
See from: http://www.mediawiki.org/wiki/Special:ExtensionDistributor
"The tar archive should be extracted into your extensions directory. For example, on a unix-like OS: tar -xzf DumpHTML-MW1.16-r59064.tar.gz -C /var/www/mediawiki/extensions"
I have no "extensions directory" because I have no "wiki". I have sentences, isolated sentences (actually "Java" "String" objects) which I am manipulating through a "Java" program.
Thanks and all the best, -- Lmhelp
You may have a look at the DPL (DynamicPageList) extension. It provides lots of functions to render segments of pages; it may already do what you want...
Bernhard
-----Original Message----- From: lmhelp [mailto:lmbox@wanadoo.fr] Sent: Wednesday, 4 August 2010 21:45 To: mediawiki-l@lists.wikimedia.org Subject: [Mediawiki-l] Wikitext grammar
Hi,
Thank you for reading my post.
I am wondering if there exists a "grammar" for the "Wikicode"/"Wikitext" language (or an *exhaustive* (and formal) set of rules about how a "Wikitext" is constructed). I've looked for such a grammar/set of rules on the Web but I couldn't find one...
I need to automatically extract the first paragraph of a Wiki article...
I did it from the HTML version of a Wiki article (because I noticed the first paragraph was the first <p> element child of a <div> element whose id is "bodyContent"...) but I need to work with the "Wikitext" itself...
- Is a grammar available somewhere?
- Do you have any idea how to extract the first paragraph of a Wiki article?
- Any advice?
- Does a Java "Wikitext" "parser" exist which would do it?
Thank you for your help. All the best,
-- Lmhelp
We have written a Java utility program that accesses the first few paragraphs of a wiki article (or the article as a whole). Of course, with trying to parse templates and the wiki syntax, it isn't perfect.
If you are trying to get the non-template text, you'll have to do the following:
- Determine if you're parsing to the end of the first paragraph (\n\n) or the first section (line beginning with ===)
- Remove all templates (there's no way to parse the templates into their presentation formats easily), or at least the infoboxes
- Transform any links to their visible representation (look for square brackets)
- Remove citations, and other markup
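To make those steps concrete, here is a rough, regex-based Java sketch (the class name is made up, and the patterns are approximations rather than a faithful wikitext parser):
-----------------------------------------------------------------------
// Hypothetical, regex-based approximation of the steps listed above.
// It will not survive every article, but it illustrates the idea.
class LeadTextExtractor {
    static String leadText(String wikitext) {
        String s = wikitext;
        s = s.replaceAll("(?s)<!--.*?-->", "");                          // HTML comments
        s = s.replaceAll("(?s)<ref[^>/]*>.*?</ref>|<ref[^>]*/>", "");    // citations
        for (int i = 0; i < 3; i++) {                                    // a few passes to peel nested templates
            s = s.replaceAll("(?s)\\{\\{[^{}]*\\}\\}", "");
        }
        s = s.replaceAll("\\[\\[(?:[^\\]|]*\\|)?([^\\]]*)\\]\\]", "$1"); // links -> visible text
        int firstSection = s.indexOf("\n==");                            // stop at the first section heading
        if (firstSection >= 0) s = s.substring(0, firstSection);
        String t = s.trim();
        int blank = t.indexOf("\n\n");                                   // or stop at the first blank line
        return blank >= 0 ? t.substring(0, blank) : t;
    }
}
-----------------------------------------------------------------------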
We are trying to work through some IP issues, but I might be able to post source code.
-----Original Message----- From: mediawiki-l-bounces@lists.wikimedia.org [mailto:mediawiki-l-bounces@lists.wikimedia.org] On Behalf Of lmhelp Sent: Wednesday, August 04, 2010 3:45 PM To: mediawiki-l@lists.wikimedia.org Subject: [Mediawiki-l] Wikitext grammar
Hi,
Thank you for reading my post.
I am wondering if there exists a "grammar" for the "Wikicode"/"Wikitext" language (or an *exhaustive* (and formal) set of rules about how a "Wikitext" is constructed). I've looked for such a grammar/set of rules on the Web but I couldn't find one...
I need to automatically extract the first paragraph of a Wiki article...
I did it from the HTML version of a Wiki article (because I noticed the first paragraph was the first <p> element child of a <div> element whose id is "bodyContent"...) but I need to work with the "Wikitext" itself...
- Is a grammar available somewhere?
- Do you have any idea how to extract the first paragraph of a Wiki article?
- Any advice?
- Does a Java "Wikitext" "parser" exist which would do it?
Thank you for your help. All the best,
-- Lmhelp
Hi,
Thanks to all of you for your answers.
I have decided (in the light of what you told me) to read the "Wikitext" line after line.
I must "ignore" leading: - templates (including the ones which span over several consecutive lines like "infoboxes"): {{...}}, - isolated internal links: [[...]].
I return the first line which is not one of the two "things" above.
Of course, I hope I am not forgetting any other "things" which could appear at the beginning of the "Wikitext". I have examined several articles at random and I saw nothing else... If I am forgetting something, please tell me... :)
All the best, -- Lmhelp
On Thu, Aug 5, 2010 at 1:10 PM, lmhelp2 lea.massiot@ign.fr wrote:
Hi,
Thanks to all of you for your answers.
I have decided (in the light of what you told me) to read the "Wikitext" line after line.
I must "ignore" leading:
- templates (including the ones which span
over several consecutive lines like "infoboxes"): {{...}},
- isolated internal links: [[...]].
I return the first line which is not one of the two "things" above.
Of course, I hope I am not forgetting any other "things" which could appear at the beginning of the "Wikitext". I have examined several articles at random and I saw nothing else... If I am forgetting something, please tell me... :)
For starters, {{ can be nested, and do not need to be at the beginning of a line, but could be anywhere. Same goes for the matching }}. Of course, {{{ has a different meaning...
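Counting brace depth is the usual way to cope with that nesting; a small hedged Java sketch (it does not attempt to handle {{{...}}} parameters):
-----------------------------------------------------------------------
// Find the index just past the "}}" that closes the "{{" starting at 'start',
// taking nesting into account. Returns -1 if the template is never closed.
// Note: "{{{...}}}" parameters would confuse this simple counter, as noted above.
class TemplateMatcher {
    static int endOfTemplate(String text, int start) {
        int depth = 0;
        for (int i = start; i + 1 < text.length(); i++) {
            if (text.charAt(i) == '{' && text.charAt(i + 1) == '{') { depth++; i++; }
            else if (text.charAt(i) == '}' && text.charAt(i + 1) == '}') {
                depth--; i++;
                if (depth == 0) return i + 1;
            }
        }
        return -1;
    }
}
-----------------------------------------------------------------------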
Magnus
Hi,
there might be an occurrence of __TOC__ or __NOTOC__ before the first "real" paragraph.
Good luck with finding all exceptions. :) Katharina
Am 05.08.2010 14:10 schrieb lmhelp2:
Hi,
Thanks to all of you for your answers.
I have decided (in the light of what you told me) to read the "Wikitext" line after line.
I must "ignore" leading:
- templates (including the ones which span over several consecutive lines like "infoboxes"): {{...}},
- isolated internal links: [[...]].
I return the first line which is not one of the two "things" above.
Of course, I hope I am not forgetting any other "things" which could appear at the beginning of the "Wikitext". I have examined several articles at random and I saw nothing else... If I am forgetting something, please tell me... :)
All the best,
Lmhelp
Thank you!
So here is the list I have for the moment: I need to ignore lines:
- containing: {{...}} => possibly spreading over several lines, => being possibly nested {{... {{ ... }} ... }}.
- containing: [[...]] => being possibly nested [[... [[ ... ]] ... ]].
- equal to: __TOC__
- equal to: __NOTOC__
- beginning with the '=' character
- beginning with the '*' character
Feel free to help me complement that list!
Cheers, -- Lmhelp
Hi,
Am 05.08.2010 16:47 schrieb lmhelp2:
Thank you!
So here is the list I have for the moment: I need to ignore lines:
- containing: {{...}} => possibly spreading over several lines, => being possibly nested {{... {{ ... }} ... }}.
- containing: [[...]] => being possibly nested [[... [[ ... ]] ... ]].
- equal to: __TOC__
- equal to: __NOTOC__
- beginning with the '=' character
- beginning with the '*' character
I don't think you should ignore lines beginning with the '*' character - those may include the wanted first paragraph of the text as the '*' is just a way of formatting the page...
Greetings Katharina
If you only need to extract the first paragraph of Wikipedia articles, no problem.
2010/8/6 Katharina Wolkwitz wolkwitz@fh-swf.de
Hi,
Am 05.08.2010 16:47 schrieb lmhelp2:
Thank you!
So here is the list I have for the moment: I need to ignore lines:
- containing: {{...}} => possibly spreading over several lines, => being possibly nested {{... {{ ... }} ... }}.
- containing: [[...]] => being possibly nested [[... [[ ... ]] ... ]].
- equal to: __TOC__
- equal to: __NOTOC__
- beginning with the '=' character
- beginning with the '*' character
I don't think you should ignore lines beginning with the '*' character - those may include the wanted first paragraph of the text as the '*' is just a way of formatting the page...
Greetings Katharina
Also ignore lines starting with "#", ":", " " (space), or ";" .
Then there are (potentially nested) tables, which start with a line beginning with "{|" and end in a line beginning with "|}".
There are more "magic words" with the general pattern "__SOMEUPPERCASECHARACTERS__", IIRC.
Note that sometimes people start the paragraph right after a closure that should be alone on its line, such as: "|} First line of text".
[[ and ]] pairs should not extend over a line, but they can be nested, e.g. for images.
Or, and there are HTML comments to remove, and <nowiki>...</nowiki>
That's all I can come up with right now...
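Pulling the thread's suggestions together, a hypothetical line filter might look like this in Java (a sketch only, surely still incomplete):
-----------------------------------------------------------------------
// Heuristic: should this wikitext line be skipped when looking for the lead paragraph?
// Collects the rules suggested in this thread; surely still incomplete.
// Lines starting with '*' are deliberately NOT skipped, per Katharina's remark.
class LeadLineFilter {
    static boolean skipLine(String line) {
        String t = line.trim();
        if (t.isEmpty()) return true;
        if (t.startsWith("{{") || t.startsWith("}}")) return true;       // templates and their closings
        if (t.startsWith("[[") && t.endsWith("]]")) return true;         // isolated links (images, categories)
        if (t.matches("__[A-Z]+__")) return true;                        // magic words: __TOC__, __NOTOC__, ...
        if (t.startsWith("=")) return true;                              // headings
        if (t.startsWith("#") || t.startsWith(":") || t.startsWith(";")) return true;
        if (line.startsWith(" ")) return true;                           // preformatted text
        if (t.startsWith("{|") || t.startsWith("|")) return true;        // tables, including the closing |}
        if (t.startsWith("<!--") || t.startsWith("<nowiki")) return true;
        return false;
    }
}
-----------------------------------------------------------------------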
Magnus
On Fri, Aug 6, 2010 at 4:07 PM, nevio carlos de alarcão nevinhoalarcao@gmail.com wrote:
If you only need to extract the first paragraph of Wikipedia articles, no problem.
2010/8/6 Katharina Wolkwitz wolkwitz@fh-swf.de
Hi,
Am 05.08.2010 16:47 schrieb lmhelp2:
Thank you!
So here is the list I have for the moment: I need to ignore lines:
- containing: {{...}}
=> possibly spreading over several lines, => being possibly nested {{... {{ ... }} ... }}.
- containing: [[...]]
=> being possibly nested [[... [[ ... ]] ... ]].
- equal to: __TOC__
- equal to: __NOTOC__
- beginning with the '=' character
- beginning with the '*' character
I don't think you should ignore lines beginning with the '*' character - those may include the wanted first paragraph of the text as the '*' is just a way of formatting the page...
Greetings Katharina
-- {+}Nevinho
Join the Collaborative Movement at http://sextapoetica.com.br !!
On Wed, Aug 4, 2010 at 1:45 PM, lmhelp lmbox@wanadoo.fr wrote:
I need to automatically extract the first paragraph of a Wiki article...
See the "Extracted page abstracts for Yahoo" dump: http://download.wikimedia.org/enwiki/20100730/
A colleague told me about that... so we had a look at it. Unfortunately, abstracts are not correct most of the time...
-----------------------------------------------------------------
Example (in French):
-----------------------------------------------------------------
<title>Wikipédia : Arabie saoudite</title>
<url>http://fr.wikipedia.org/wiki/Arabie_saoudite</url>
<abstract>| lien_villes=Villes d'Arabie saoudite</abstract>
<links>
[...]
-----------------------------------------------------------------
Cheers, -- Lmhelp
On 8/6/2010 5:49 PM, Brian J Mingus wrote:
On Wed, Aug 4, 2010 at 1:45 PM, lmhelp lmbox@wanadoo.fr wrote:
I need to automatically extract the first paragraph of a Wiki article...
See the "Extracted page abstracts for Yahoo" dump: http://download.wikimedia.org/enwiki/20100730/
On Fri, Aug 6, 2010 at 10:06 AM, Léa Massiot lea.massiot@ign.fr wrote:
A colleague told me about that... so we had a look at it. Unfortunately, abstracts are not correct most of the time...
Example (in French):
<title>Wikipédia : Arabie saoudite</title>
<url>http://fr.wikipedia.org/wiki/Arabie_saoudite</url>
<abstract>| lien_villes=Villes d'Arabie saoudite</abstract>
<links>
[...]
-----------------------------------------------------------------
Cheers,
Lmhelp
In that case I recommend the parser in mwlib for fast extraction of the text in the first paragraph: http://code.pediapress.com/wiki/wiki/mwlib
Are you sure this will be able to extract the introductory paragraph (only) which is not in any section... (because it is not trivial).
There is only one example I could find at http://code.pediapress.com/wiki/wiki/mwlib ... which is not so easy to understand by the way...
Cheers, -- Lmhelp
On 8/6/2010 6:09 PM, Brian J Mingus wrote:
In that case I recommend the parser in mwlib for fast extraction of the text in the first paragraph: http://code.pediapress.com/wiki/wiki/mwlib
On Fri, Aug 6, 2010 at 10:18 AM, Léa Massiot lea.massiot@ign.fr wrote:
Are you sure this will be able to extract the introductory paragraph (only) which is not in any section... (because it is not trivial).
There is only one example I could find at http://code.pediapress.com/wiki/wiki/mwlib ... which is not so easy to understand by the way...
Cheers,
Lmhelp
mwlib is a fairly full featured parser for wikitext. It is not documented, but by using the dir and help python commands you can easily navigate its methods. It creates a parse tree from which you can reconstruct the plain text. Once you have the plain text extracting paragraphs is straightforward.