* Khalida BEN SIDI AHMED wrote:
JWPL first needs to create a database whose size is about 158 GB,
and at least 2 GB of RAM are necessary. I have neither a big hard
disk nor that much RAM. Besides, creating such a big database just
to extract the first sentence of each article does not seem to me
to be the appropriate solution.
The dumps on http://dumps.wikimedia.org/backup-index.html have "page
abstracts" which typically contain the first sentence. I've found that
http://inamidst.com/phenny/modules/wikipedia.py (part of an IRC bot)
works quite well, at least on the English version. I'd probably use my
http://cutycapt.sf.net/ utility like so:
  % CutyCapt --url=http://en.wikipedia.org/wiki/Empire \
      --user-style-string="
        .mw-content-ltr > * { display: none }
        .mw-content-ltr > p:first-of-type,
        .mw-content-ltr > p:first-of-type * { display: inline }
      " \
      --out=output.txt
Where output.txt would then contain something like:
Please read:
A personal appeal from
Wikipedia founder Jimmy Wales
Read now
Empire
From Wikipedia, the free encyclopedia
The term empire derives from the Latin imperium (power, authority)...
You would then just have to strip the leading gibberish and possibly
fiddle with the user style sheet, for instance to remove references.
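If the banner text varies but the "From Wikipedia, the free
encyclopedia" line is stable, stripping could be as simple as cutting
everything up to that line. A minimal Python sketch, assuming that
marker appears verbatim in the CutyCapt text output:

```python
# Sketch: drop everything up to and including the marker line that
# ends the Wikipedia page boilerplate. Assumes the marker appears
# verbatim in the text rendering; returns the input unchanged if not.
MARKER = "From Wikipedia, the free encyclopedia"

def strip_boilerplate(text):
    lines = text.splitlines()
    for i, line in enumerate(lines):
        if line.strip() == MARKER:
            return "\n".join(lines[i + 1:]).strip()
    return text  # marker not found: leave the text as-is
```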
You could also just use a sophisticated HTML parser and simply pick
the `.mw-content-ltr > p:first-of-type` paragraph, but for just a few
articles that would involve some setup cost.
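For the parser route, even the standard library would do to grab that
first paragraph. A rough Python sketch, assuming the article body sits
in a container with class "mw-content-ltr" as in current Wikipedia
markup:

```python
# Sketch: collect the text of the first <p> inside the element with
# class "mw-content-ltr", using only the standard library.
from html.parser import HTMLParser

class FirstParagraph(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_content = False  # inside the .mw-content-ltr container
        self.p_depth = 0         # inside the first paragraph
        self.done = False        # first paragraph fully read
        self.text = []

    def handle_starttag(self, tag, attrs):
        cls = (dict(attrs).get("class") or "").split()
        if "mw-content-ltr" in cls:
            self.in_content = True
        elif self.in_content and tag == "p" and not self.done:
            self.p_depth += 1

    def handle_endtag(self, tag):
        if tag == "p" and self.p_depth:
            self.p_depth -= 1
            if self.p_depth == 0:
                self.done = True

    def handle_data(self, data):
        if self.p_depth and not self.done:
            self.text.append(data)

def first_paragraph(html):
    parser = FirstParagraph()
    parser.feed(html)
    return "".join(parser.text).strip()
```

Inline markup such as <b> and <i> is flattened to its text, so
first_paragraph() returns the plain first sentence plus the rest of
that paragraph.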
--
Björn Höhrmann · mailto:bjoern@hoehrmann.de · http://bjoern.hoehrmann.de
Am Badedeich 7 · Telefon: +49(0)160/4415681 · http://www.bjoernsworld.de
25899 Dagebüll · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/