Extracting text from Wikipedia


Khalida BEN SIDI AHMED

27 Nov 2011

9:02 a.m.

Hello! I don't know whether this question is within the scope of this group, but I would be glad to find an answer. I'm writing some Java code to perform NLP tasks on Wikipedia text. How can I extract the first paragraph of a Wikipedia article? Thanks a lot.

Truly yours Ben Sidi Ahmed


Mike Dupont

27 Nov 2011

9:10 a.m.

Do you want this for all articles or just one? See here: http://stackoverflow.com/questions/627594/is-there-a-wikipedia-api

mike


-- James Michael DuPont Member of Free Libre Open Source Software Kosova http://flossk.org

Khalida BEN SIDI AHMED

9:15 a.m.

I have already read the responses given in this post.

I want to extract the first paragraph (or the first sentence) for a list of fewer than 100 articles. I could not use JWPL because I don't have enough hard disk space to create the database. I am trying to use JSoup, but I need examples.

Mike Dupont

9:20 a.m.

Look, for 100 articles, just create a list of them and export them as XML, or use the book creator: http://en.wikipedia.org/wiki/Help:Books

There is also a JSON API to pull single articles: http://www.barattalo.it/2010/08/29/php-bot-to-get-wikipedia-definitions/
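For the JSON API route, a minimal Java sketch follows. It only builds the request URL for the introduction of one article. It assumes the target wiki has the TextExtracts query module (`prop=extracts`) enabled; the class and method names are illustrative and do not come from this thread.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: the names are illustrative, not part of any library.
final class IntroUrl {
    // Assumption: the wiki behind this endpoint has TextExtracts (prop=extracts) enabled.
    private static final String API = "https://en.wikipedia.org/w/api.php";

    /** Builds a query URL asking for the plain-text introduction of one article as JSON. */
    static String forTitle(String title) {
        return API
                + "?action=query&prop=extracts"
                + "&exintro=1"      // only the text before the first section heading
                + "&explaintext=1"  // plain text instead of HTML
                + "&redirects=1"    // follow redirects to the target article
                + "&format=json"
                + "&titles=" + URLEncoder.encode(title, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(forTitle("Oil well"));
    }
}
```

The response should carry the text under `query.pages.<pageid>.extract`, so any HTTP client plus a JSON parser can take it from there.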

mike


Mike Dupont

9:21 a.m.

http://code.google.com/p/jwpl/ also looks good: JWPL (Java Wikipedia Library) is a free, Java-based application programming interface that provides access to all information contained in Wikipedia.

I have not tried it, but you said you wanted to do it in Java. mike


Khalida BEN SIDI AHMED

9:29 a.m.

I'm developing my project in Java, and I'm not a good PHP developer.

JWPL first needs to create a database of 158 GB, and at least 2 GB of RAM are required. I have neither a large hard disk nor much RAM. In addition, creating such a big database just to extract the first sentence of each article does not seem to me to be the appropriate solution.

If someone has already used JSoup for this purpose, I would like to see a few examples to learn how to use this API.

Bjoern Hoehrmann

10:12 a.m.


The dumps on http://dumps.wikimedia.org/backup-index.html have "page abstracts", which typically contain the first sentence. I've found that http://inamidst.com/phenny/modules/wikipedia.py (part of an IRC bot) works quite well, at least on the English version. I'd probably use my http://cutycapt.sf.net/ utility like so:

% CutyCapt --url=http://en.wikipedia.org/wiki/Empire --user-style-string=" .mw-content-ltr > * { display: none } .mw-content-ltr > p:first-of-type, .mw-content-ltr > p:first-of-type * { display: inline } " --out=output.txt

Where output.txt would then be something like

Please read: A personal appeal from Wikipedia founder Jimmy Wales Read now Empire From Wikipedia, the free encyclopedia The term empire derives from the Latin imperium (power, authority)...

You would then just have to strip the leading gibberish and possibly fiddle with the user style sheet to remove references, for instance. You could also use a sophisticated HTML parser and simply pick the `.mw-content-ltr > p:first-of-type` paragraph, but for just a few articles that would require some setup cost.
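The clean-up step can also be done after extraction. The following is a minimal Java sketch using only the standard library: it drops bracketed reference markers such as `[1]` from an already extracted paragraph and returns its first sentence. The class name and the sample text are illustrative assumptions. Sentence detection relies on `java.text.BreakIterator`, which may split in the wrong place around some abbreviations.

```java
import java.text.BreakIterator;
import java.util.Locale;

// Hypothetical helper: the names are illustrative, not part of any library.
final class FirstSentence {
    /** Drops bracketed markers such as [1] or [citation needed] and collapses whitespace. */
    static String clean(String paragraph) {
        return paragraph
                .replaceAll("\\[[^\\]]*\\]", "")
                .replaceAll("\\s+", " ")
                .trim();
    }

    /** Returns the first sentence of the cleaned paragraph, using English sentence rules. */
    static String of(String paragraph) {
        String text = clean(paragraph);
        BreakIterator sentences = BreakIterator.getSentenceInstance(Locale.ENGLISH);
        sentences.setText(text);
        int start = sentences.first();
        int end = sentences.next();
        // No boundary found (for example, empty input): return the text unchanged.
        return end == BreakIterator.DONE ? text : text.substring(start, end).trim();
    }

    public static void main(String[] args) {
        // Made-up sample text standing in for an extracted paragraph.
        System.out.println(of("Petroleum is a liquid.[1] It is found underground.[2]"));
    }
}
```

Note that `clean` removes every bracketed span, so legitimate bracketed text in a paragraph would be lost as well.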

-- Björn Höhrmann · mailto:bjoern@hoehrmann.de · http://bjoern.hoehrmann.de Am Badedeich 7 · Telefon: +49(0)160/4415681 · http://www.bjoernsworld.de 25899 Dagebüll · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/

Khalida BEN SIDI AHMED

10:51 a.m.

Thank you, Hoehrmann. I will try the options you've mentioned. However, if someone can help me with using JSoup, their ideas are welcome.

Mike Dupont

10:57 a.m.

If you are focused on jsoup, it is best to ask on its mailing list: http://jsoup.org/discussion


Khalida BEN SIDI AHMED

3:13 p.m.

Hello,

This is the answer that was given for my question: http://stackoverflow.com/questions/8286786/wikipedia-first-paragraph

It works perfectly, the code may be useful for you.

Truly yours Khalida Ben Sidi Ahmed

Khalida BEN SIDI AHMED

9:36 a.m.

The list of articles I will need is not known from the beginning. In my project, I will find a list of words (50) and try to find definitions for them in Wikipedia. After that, I will extract the hypernym of each word. This gives me a new list, for which I then retrieve the respective articles, and so on. So the list of articles I need grows as the algorithm runs.
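A growing list like this is usually handled with a work queue plus a set of terms already seen. The sketch below is one possible shape under stated assumptions: `hypernymOf` is a hypothetical hook that would fetch a term's definition and return the hypernym found in it, and `maxTerms` is an added safety limit. Neither comes from this thread.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.Set;
import java.util.function.Function;

// Hypothetical helper: the names are illustrative, not part of any library.
final class TermCrawler {
    /**
     * Visits the seed terms and every hypernym reachable from them, each term once.
     * hypernymOf is a placeholder for the real lookup (fetch the definition, extract
     * the hypernym); it returns empty when no hypernym was found.
     */
    static Set<String> expand(Iterable<String> seeds,
                              Function<String, Optional<String>> hypernymOf,
                              int maxTerms) {
        Set<String> seen = new LinkedHashSet<>();
        Deque<String> pending = new ArrayDeque<>();
        seeds.forEach(pending::add);
        while (!pending.isEmpty() && seen.size() < maxTerms) {
            String term = pending.poll();
            if (!seen.add(term)) {
                continue; // already processed, so no second lookup
            }
            hypernymOf.apply(term).ifPresent(pending::add);
        }
        return seen;
    }

    public static void main(String[] args) {
        // Toy data standing in for real Wikipedia lookups.
        Map<String, String> toy = Map.of("drill bit", "tool", "tool", "object");
        System.out.println(expand(List.of("drill bit"),
                term -> Optional.ofNullable(toy.get(term)), 100));
    }
}
```

The `seen` set ensures each article is requested at most once, which also keeps the load on the server low.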

Mike Dupont

9:51 a.m.

Have you seen WordNet (http://en.wikipedia.org/wiki/WordNet) or Wiktionary? In any case, you can call the export routine as you need it, even for single articles. I think it would not cause too much load on the server. Anyway, good luck.
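For single articles, the export routine is reachable at `Special:Export/<title>`. A minimal Java sketch, assuming the export XML keeps the page source in a `<text>` element: it builds the export URL and pulls the raw wikitext out of an export document. The class and method names are illustrative, and the result is wiki markup, not clean prose.

```java
import java.io.StringReader;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper: the names are illustrative, not part of any library.
final class ExportText {
    /** URL of the XML export of a single page; spaces in the title become underscores. */
    static String exportUrl(String title) {
        return "https://en.wikipedia.org/wiki/Special:Export/"
                + URLEncoder.encode(title.replace(' ', '_'), StandardCharsets.UTF_8);
    }

    /** Returns the content of the first <text> element, or "" if there is none. */
    static String wikitext(String exportXml) {
        try {
            XMLInputFactory factory = XMLInputFactory.newFactory();
            factory.setProperty(XMLInputFactory.SUPPORT_DTD, false); // no DTDs needed here
            XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(exportXml));
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && "text".equals(reader.getLocalName())) {
                    return reader.getElementText();
                }
            }
            return "";
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("not well-formed export XML", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(exportUrl("Oil well"));
        // Made-up, heavily shortened export document.
        System.out.println(wikitext("<mediawiki><page><title>Empire</title>"
                + "<revision><text>An '''empire''' is a state.</text></revision>"
                + "</page></mediawiki>"));
    }
}
```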

mike


Khalida BEN SIDI AHMED

9:57 a.m.

The words I use belong to a specialized domain (the oil and gas industry), so WordNet and even Wiktionary are not useful enough (they are general-purpose corpora).

Thank you very much indeed.

Sebastian Hellmann

11:06 p.m.

Dear Ben Sidi Ahmed, DBpedia might have all the data you need, already extracted (also in over 80 languages): http://wiki.dbpedia.org/Downloads37

Here are the first two sentences of each article in a structured format: http://downloads.dbpedia.org/3.7/en/short_abstracts_en.nt.bz2
Here are the long abstracts: http://downloads.dbpedia.org/3.7/en/long_abstracts_en.nt.bz2
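The `.nt` files are N-Triples, with one statement per line. As a minimal sketch, assuming the file has already been decompressed and is read line by line, the following Java code extracts the resource IRI and the abstract text from one line. The class name and the sample line are illustrative. Only the `\"`, `\\`, `\n`, `\t` and four-digit `\uXXXX` escapes are handled, and well-formed input is assumed.

```java
import java.util.Map;
import java.util.Optional;

// Hypothetical helper: the names are illustrative, not part of any library.
final class AbstractLine {
    /**
     * Parses one N-Triples line of the form
     *   <subject> <predicate> "literal"@lang .
     * and returns (subject IRI, unescaped literal), or empty for any other line.
     */
    static Optional<Map.Entry<String, String>> parse(String line) {
        int subjectEnd = line.indexOf('>');
        if (!line.startsWith("<") || subjectEnd < 0) {
            return Optional.empty(); // comment, blank, or malformed line
        }
        String subject = line.substring(1, subjectEnd);
        int quote = line.indexOf('"', subjectEnd);
        if (quote < 0) {
            return Optional.empty(); // the object is not a literal
        }
        StringBuilder text = new StringBuilder();
        for (int i = quote + 1; i < line.length(); i++) {
            char c = line.charAt(i);
            if (c == '"') {
                return Optional.of(Map.entry(subject, text.toString()));
            }
            if (c != '\\' || i + 1 >= line.length()) {
                text.append(c);
                continue;
            }
            char escaped = line.charAt(++i);
            if (escaped == 'u' && i + 4 < line.length()) {
                // backslash-u escape followed by four hex digits
                text.append((char) Integer.parseInt(line.substring(i + 1, i + 5), 16));
                i += 4;
            } else if (escaped == 'n') {
                text.append('\n');
            } else if (escaped == 't') {
                text.append('\t');
            } else {
                text.append(escaped); // escaped quote or escaped backslash
            }
        }
        return Optional.empty(); // unterminated literal
    }

    public static void main(String[] args) {
        // Made-up sample line in the shape of a short-abstract statement.
        String line = "<http://dbpedia.org/resource/London> "
                + "<http://www.w3.org/2000/01/rdf-schema#comment> "
                + "\"London is the capital of England.\"@en .";
        parse(line).ifPresent(e -> System.out.println(e.getKey() + " -> " + e.getValue()));
    }
}
```

Reading the `.bz2` file directly would need a bzip2 decoder, which the Java standard library does not include.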

If you just want them for single articles you can also query the DBpedia API: The first 2 sentences for London (all languages) : http://dbpedia.org/snorql/?query=SELECT+*+WHERE+%7B%0D%0A%3Chttp%3A%2F%2Fdbp...

The first 2 sentences for London (only English ): http://dbpedia.org/snorql/?query=SELECT+*+WHERE+%7B%0D%0A%3Chttp%3A%2F%2Fdbp...

All that contain the keyword "London": http://dbpedia.org/snorql/?query=SELECT+*+WHERE+%7B%0D%0A%3Fs+rdfs%3Acomment...
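The query links above are truncated in the archive, so the following Java sketch does not reproduce them. It builds a comparable request from scratch, assuming that the short abstract is stored under `rdfs:comment`, that the public endpoint is `https://dbpedia.org/sparql`, and that it accepts a `format` parameter. The class and method names are illustrative.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: the names are illustrative, not part of any library.
final class DbpediaQuery {
    // Assumption: the public SPARQL endpoint (the thread itself links the SNORQL web form).
    private static final String ENDPOINT = "https://dbpedia.org/sparql";

    /**
     * SPARQL text selecting the rdfs:comment of one resource in one language.
     * resourceName must be a trusted, IRI-safe name such as "London" or "Oil_well".
     */
    static String shortAbstractQuery(String resourceName, String lang) {
        return "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n"
                + "SELECT ?comment WHERE {\n"
                + "  <http://dbpedia.org/resource/" + resourceName + "> rdfs:comment ?comment .\n"
                + "  FILTER (lang(?comment) = \"" + lang + "\")\n"
                + "}";
    }

    /** Full request URL asking for the result as SPARQL JSON. */
    static String url(String resourceName, String lang) {
        return ENDPOINT
                + "?query=" + URLEncoder.encode(
                        shortAbstractQuery(resourceName, lang), StandardCharsets.UTF_8)
                + "&format=" + URLEncoder.encode(
                        "application/sparql-results+json", StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(url("London", "en"));
    }
}
```

Dropping the `FILTER` line would return the comment in every available language.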

You can also query them on a synchronized database (which gets updates every 5 minutes from Wikipedia): http://live.dbpedia.org/

Hope that helps, Sebastian


-- Dipl. Inf. Sebastian Hellmann Department of Computer Science, University of Leipzig Projects: http://nlp2rdf.org , http://dbpedia.org Homepage: http://bis.informatik.uni-leipzig.de/SebastianHellmann Research Group: http://aksw.org
