* Khalida BEN SIDI AHMED wrote:
JWPL first needs to create a database whose size is about 158 GB,
and at least 2 GB of RAM are necessary. I have neither a big hard
disk nor that much RAM. Besides, creating such a big database just
to extract the first sentence of each article does not seem to me
to be the appropriate solution.
The dumps on http://dumps.wikimedia.org/backup-index.html have "page
abstracts" which typically contain the first sentence. I've found that
http://inamidst.com/phenny/modules/wikipedia.py (part of an IRC bot)
works quite well, at least on the English version. I'd probably use my
http://cutycapt.sf.net/ utility like so:
  % CutyCapt --url=http://en.wikipedia.org/wiki/Empire \
      --user-style-string="
        .mw-content-ltr > * { display: none }
        .mw-content-ltr > p:first-of-type,
        .mw-content-ltr > p:first-of-type * { display: inline }
      " \
      --out=output.txt
Where output.txt would then contain something like:
Please read:
A personal appeal from
Wikipedia founder Jimmy Wales
Read now
Empire
From Wikipedia, the free encyclopedia
The term empire derives from the Latin imperium (power, authority)...
You would then just have to strip the leading gibberish and possibly
fiddle with the user style sheet, for instance to remove references.
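If the banner text varies but the "From Wikipedia, the free
encyclopedia" line is stable, stripping could be as simple as cutting
everything up to that line. A minimal Python sketch, assuming that
marker appears verbatim in the CutyCapt text output:

```python
# Sketch: drop everything up to and including the marker line that
# ends the Wikipedia page boilerplate. Assumes the marker appears
# verbatim in the text rendering; returns the input unchanged if not.
MARKER = "From Wikipedia, the free encyclopedia"

def strip_boilerplate(text):
    lines = text.splitlines()
    for i, line in enumerate(lines):
        if line.strip() == MARKER:
            return "\n".join(lines[i + 1:]).strip()
    return text  # marker not found: leave the text as-is
```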
You could also just use a sophisticated HTML parser and simply pick
the `.mw-content-ltr > p:first-of-type` paragraph, but for just a few
articles that would involve some setup cost.
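For the parser route, even the standard library would do to grab that
first paragraph. A rough Python sketch, assuming the article body sits
in a container with class "mw-content-ltr" as in current Wikipedia
markup:

```python
# Sketch: collect the text of the first <p> inside the element with
# class "mw-content-ltr", using only the standard library.
from html.parser import HTMLParser

class FirstParagraph(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_content = False  # inside the .mw-content-ltr container
        self.p_depth = 0         # inside the first paragraph
        self.done = False        # first paragraph fully read
        self.text = []

    def handle_starttag(self, tag, attrs):
        cls = (dict(attrs).get("class") or "").split()
        if "mw-content-ltr" in cls:
            self.in_content = True
        elif self.in_content and tag == "p" and not self.done:
            self.p_depth += 1

    def handle_endtag(self, tag):
        if tag == "p" and self.p_depth:
            self.p_depth -= 1
            if self.p_depth == 0:
                self.done = True

    def handle_data(self, data):
        if self.p_depth and not self.done:
            self.text.append(data)

def first_paragraph(html):
    parser = FirstParagraph()
    parser.feed(html)
    return "".join(parser.text).strip()
```

Inline markup such as <b> and <i> is flattened to its text, so
first_paragraph() returns the plain first sentence plus the rest of
that paragraph.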
--
Björn Höhrmann · mailto:bjoern@hoehrmann.de · http://bjoern.hoehrmann.de
Am Badedeich 7 · Telefon: +49(0)160/4415681 · http://www.bjoernsworld.de
25899 Dagebüll · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/