Hello! I don't know whether this question is within the scope of this list. In any case, I would be glad to get an answer. I'm writing some Java code to perform NLP tasks on texts from Wikipedia. How can I extract the first paragraph of a Wikipedia article? Thanks a lot.
Truly yours Ben Sidi Ahmed
Do you want this for all articles or just one? See here: http://stackoverflow.com/questions/627594/is-there-a-wikipedia-api
mike
_______________________________________________
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
I have already read the responses given in that post.
I want to extract the first paragraph (or the first sentence) for a list of fewer than 100 articles. I could not use JWPL because I don't have enough hard disk space to create the database. I am trying to use JSoup, but I need examples.
Look, for 100 articles, just create a list of them and export them as XML. Or use the book creator: http://en.wikipedia.org/wiki/Help:Books
There is also a JSON API to pull single articles: http://www.barattalo.it/2010/08/29/php-bot-to-get-wikipedia-definitions/
mike
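Since the project is in Java, here is a minimal sketch of how the two request URLs mentioned above could be built. The base URLs assume the English Wikipedia. The `prop=extracts` parameters assume the TextExtracts extension is enabled on the wiki; that extension is not mentioned in this thread. The class and method names are made up for illustration.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.List;

final class WikiUrls {
    // Assumed base URLs for the English Wikipedia.
    static final String EXPORT_BASE = "https://en.wikipedia.org/wiki/Special:Export/";
    static final String API_BASE = "https://en.wikipedia.org/w/api.php";

    /** MediaWiki page names use underscores in place of spaces. */
    static String encodeTitle(String title) {
        String underscored = title.trim().replace(' ', '_');
        return URLEncoder.encode(underscored, StandardCharsets.UTF_8);
    }

    /** URL that returns the page as export XML (wikitext inside). */
    static String exportUrl(String title) {
        return EXPORT_BASE + encodeTitle(title);
    }

    /** URL that asks the JSON API for the plain-text lead section (assumes TextExtracts). */
    static String introJsonUrl(String title) {
        return API_BASE
                + "?action=query&format=json&prop=extracts&exintro=1&explaintext=1"
                + "&titles=" + encodeTitle(title);
    }

    public static void main(String[] args) {
        for (String title : List.of("Empire", "Oil well")) {
            System.out.println(exportUrl(title));
            System.out.println(introJsonUrl(title));
        }
    }
}
```

The resulting URLs can be fetched with any HTTP client. The export XML still contains wikitext, so the JSON variant is probably closer to what is needed for a first paragraph.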
http://code.google.com/p/jwpl/ This also looks good: "JWPL (Java Wikipedia Library) is a free, Java-based application programming interface that allows to access all information contained in Wikipedia."
I have not tried it, but you said you wanted to do it in Java. mike
-- James Michael DuPont Member of Free Libre Open Source Software Kosova http://flossk.org
I'm developing my project in Java. I'm not a good PHP developer.
JWPL first needs to create a database of 158 GB, and at least 2 GB of RAM is required. I have neither a big hard disk nor much RAM. In addition, creating such a big database just to extract the first sentence of each article does not seem to me to be the appropriate solution.
If someone has already used JSoup for this purpose, I would like to see just a few examples to learn how to use this API.
The dumps on http://dumps.wikimedia.org/backup-index.html have "page abstracts", which typically contain the first sentence. I've found that http://inamidst.com/phenny/modules/wikipedia.py (part of an IRC bot) works quite well, at least on the English version. I'd probably use my http://cutycapt.sf.net/ utility like so:
% CutyCapt --url=http://en.wikipedia.org/wiki/Empire --user-style-string="
    .mw-content-ltr > * { display: none }
    .mw-content-ltr > p:first-of-type,
    .mw-content-ltr > p:first-of-type * { display: inline }
  " --out=output.txt
Where output.txt would then be something like
Please read: A personal appeal from Wikipedia founder Jimmy Wales Read now Empire From Wikipedia, the free encyclopedia The term empire derives from the Latin imperium (power, authority)...
You would then just have to strip the leading gibberish and possibly fiddle with the user style sheet, to remove references for instance. You could also use a sophisticated HTML parser and simply pick the `.mw-content-ltr > p:first-of-type` paragraph, but for just a few articles that would require some setup cost.
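The clean-up step could be sketched in Java roughly as follows. This is only an illustration: it assumes the captured text always contains the tagline "From Wikipedia, the free encyclopedia" (as in the sample output above), and it removes reference markers of the form `[1]` with a regular expression instead of through the style sheet. The class and method names are made up.

```java
final class CaptureCleanup {
    // Marker taken from the sample output above; other wikis or skins may differ.
    static final String MARKER = "From Wikipedia, the free encyclopedia";

    /** Drops everything up to and including the site tagline, if it is present. */
    static String stripLeading(String captured) {
        int idx = captured.indexOf(MARKER);
        String body = idx < 0 ? captured : captured.substring(idx + MARKER.length());
        return body.trim();
    }

    /** Removes numeric reference markers such as [1] or [23]. */
    static String stripReferenceMarkers(String text) {
        return text.replaceAll("\\[\\d+\\]", "");
    }

    public static void main(String[] args) {
        String captured = "Please read: A personal appeal from Wikipedia founder Jimmy Wales "
                + "Read now Empire From Wikipedia, the free encyclopedia "
                + "The term empire[1] derives from the Latin imperium (power, authority)...";
        System.out.println(stripReferenceMarkers(stripLeading(captured)));
    }
}
```

If the tagline is missing, `stripLeading` returns the input unchanged apart from trimming, so a failed match is easy to spot.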
Thank you Hoehrmann. I will try to apply the options you've mentioned. However, if someone can help me with using JSoup, their ideas are welcome.
If you are focused on jsoup, then it is best to ask on its mailing list: http://jsoup.org/discussion
Hello,
This is the answer that was given to my question: http://stackoverflow.com/questions/8286786/wikipedia-first-paragraph
It works perfectly; the code may be useful for you.
Truly yours Khalida Ben Sidi Ahmed
The list of articles I will need is not known from the beginning. In my project, I will start with a list of words (50) and try to find definitions for them in Wikipedia. After that, I will extract the hypernym of each word. This gives me a new list, for which I then retrieve the respective articles, and so on. So the list of articles I need grows as the algorithm runs.
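The growing list described here amounts to a worklist loop. A rough sketch, assuming a callback `hypernymsOf` that looks up a term's definition and returns the hypernyms found in it; the callback, the `maxTerms` cap and the toy data are all hypothetical and only illustrate the control flow:

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

final class TermExpansion {
    /**
     * Processes the seed terms, then the hypernyms found for them, and so on.
     * Each term is processed once; maxTerms caps how far the list may grow.
     */
    static Set<String> expand(List<String> seeds,
                              Function<String, List<String>> hypernymsOf,
                              int maxTerms) {
        Set<String> seen = new LinkedHashSet<>();
        Deque<String> pending = new ArrayDeque<>(seeds);
        while (!pending.isEmpty() && seen.size() < maxTerms) {
            String term = pending.removeFirst();
            if (!seen.add(term)) {
                continue; // already processed
            }
            for (String hypernym : hypernymsOf.apply(term)) {
                if (!seen.contains(hypernym)) {
                    pending.addLast(hypernym);
                }
            }
        }
        return seen;
    }

    public static void main(String[] args) {
        // Toy lookup table standing in for "fetch definition, extract hypernym".
        Map<String, List<String>> toy = Map.of(
                "drill bit", List.of("tool"),
                "mud pump", List.of("pump"),
                "pump", List.of("machine"));
        Set<String> all = expand(List.of("drill bit", "mud pump"),
                term -> toy.getOrDefault(term, List.of()), 100);
        System.out.println(all);
    }
}
```

Because each term is fetched only when it is first taken from the queue, the number of requests stays equal to the number of distinct terms.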
Have you seen WordNet (http://en.wikipedia.org/wiki/WordNet) or Wiktionary? In any case, you can call the export routine as you need it, even for single articles. I think it would not cause too much load on the server. Anyway, good luck.
mike
The words I use belong to a specialized domain (the oil and gas industry), so WordNet and even Wiktionary are not useful enough (they are general-purpose corpora).
Thank you very much indeed.
Dear Ben Sidi Ahmed, DBpedia might have all the data you need, already extracted (also in over 80 languages): http://wiki.dbpedia.org/Downloads37
Here are the first 2 sentences of each article in a structured format: http://downloads.dbpedia.org/3.7/en/short_abstracts_en.nt.bz2
Here is the longer abstract of each article: http://downloads.dbpedia.org/3.7/en/long_abstracts_en.nt.bz2
If you just want them for single articles you can also query the DBpedia API: The first 2 sentences for London (all languages) : http://dbpedia.org/snorql/?query=SELECT+*+WHERE+%7B%0D%0A%3Chttp%3A%2F%2Fdbp...
The first 2 sentences for London (only English): http://dbpedia.org/snorql/?query=SELECT+*+WHERE+%7B%0D%0A%3Chttp%3A%2F%2Fdbp...
All that contain the keyword "London": http://dbpedia.org/snorql/?query=SELECT+*+WHERE+%7B%0D%0A%3Fs+rdfs%3Acomment...
You can also query them on a synchronized database (which gets updates every 5 minutes from Wikipedia): http://live.dbpedia.org/
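For use from Java, a query of this kind could be assembled as in the sketch below. It is not a reconstruction of the shortened links above. It assumes that the short abstract is stored under `rdfs:comment` (as the keyword query suggests), that a SPARQL endpoint is reachable at `http://dbpedia.org/sparql`, and that the resource name is already in DBpedia form (for example `London`). The class and method names are made up.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

final class DbpediaQuery {
    // Assumed endpoint; the links above go through the /snorql front end instead.
    static final String ENDPOINT = "http://dbpedia.org/sparql";

    /** Builds a SPARQL query for the rdfs:comment of one resource in one language. */
    static String shortAbstractQuery(String resourceName, String lang) {
        return "SELECT ?comment WHERE { "
                + "<http://dbpedia.org/resource/" + resourceName + "> "
                + "<http://www.w3.org/2000/01/rdf-schema#comment> ?comment . "
                + "FILTER (lang(?comment) = \"" + lang + "\") }";
    }

    /** Wraps the query in a GET URL using the standard "query" parameter. */
    static String requestUrl(String sparql) {
        return ENDPOINT + "?query=" + URLEncoder.encode(sparql, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String sparql = shortAbstractQuery("London", "en");
        System.out.println(sparql);
        System.out.println(requestUrl(sparql));
    }
}
```

The result format (JSON or XML) would normally be requested through the HTTP Accept header when the URL is fetched.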
Hope that helps, Sebastian
wikitech-l@lists.wikimedia.org