* Khalida BEN SIDI AHMED wrote:
JWPL first needs to create a database whose size is 158 GB, and at least 2 GB of RAM are necessary. I have neither a big hard disk nor that much RAM. In addition, creating such a big database just to extract the first sentence of each article does not seem like the appropriate solution to me.
The dumps on http://dumps.wikimedia.org/backup-index.html include "page abstracts", which typically contain the first sentence (there is a short sketch for reading them at the end of this message). I've found that http://inamidst.com/phenny/modules/wikipedia.py (part of an IRC bot) works quite well, at least on the English Wikipedia. I'd probably use my http://cutycapt.sf.net/ utility like so:
% CutyCapt --url=http://en.wikipedia.org/wiki/Empire \
    --user-style-string=".mw-content-ltr > * { display: none } .mw-content-ltr > p:first-of-type, .mw-content-ltr > p:first-of-type * { display: inline }" \
    --out=output.txt
Where output.txt would then contain something like:
Please read: A personal appeal from Wikipedia founder Jimmy Wales Read now Empire From Wikipedia, the free encyclopedia The term empire derives from the Latin imperium (power, authority)...
You would then just have to strip the leading gibberish, and possibly fiddle with the user style sheet to remove references, for instance. You could also use a proper HTML parser and simply pick the `.mw-content-ltr > p:first-of-type` paragraph, though for just a few articles that carries some setup cost.
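For the reference markers, adding something like `sup.reference { display: none }` to the style string above should do; `sup.reference` is the class MediaWiki normally puts on footnote markers, though I haven't checked it against the current skin.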
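If you do go the parser route, here is a minimal sketch of what I have in mind, assuming Python with lxml (plus the cssselect package) installed; the URL and selector are the ones from above, the User-Agent string is made up:

    import urllib.request
    import lxml.html

    # Fetch the article; Wikipedia tends to reject requests without
    # a User-Agent header, so set a descriptive one.
    url = 'http://en.wikipedia.org/wiki/Empire'
    req = urllib.request.Request(
        url, headers={'User-Agent': 'first-sentence-demo/0.1'})
    html = urllib.request.urlopen(req).read()

    # Parse and pull the same paragraph the user style sheet isolates.
    doc = lxml.html.fromstring(html)
    paras = doc.cssselect('.mw-content-ltr > p:first-of-type')
    if paras:
        print(paras[0].text_content().strip())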
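And if you would rather not render pages at all, the "page abstracts" file mentioned at the start is plain XML with one <doc> element per article; something like the following should stream through it (the file name is a guess, take the real one from the dump index for your wiki):

    import gzip
    import xml.etree.ElementTree as ET

    # Hypothetical file name; the dump index lists the actual one.
    path = 'enwiki-latest-abstract.xml.gz'

    with gzip.open(path, 'rb') as f:
        # Stream the document; the file is far too large to load at once.
        for _event, elem in ET.iterparse(f):
            if elem.tag == 'doc':
                title = elem.findtext('title', '')
                abstract = elem.findtext('abstract', '')
                print('%s: %s' % (title, abstract))
                elem.clear()  # discard processed elements to keep memory flat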