Re: [Wikipedia-l] The Book

27 Mar 2011


      ...
Now we can do this also in Wikipedia. I wrote a Perl script which scans
the dumps of a language and sorts the titles. Another script gets, via the
API, the first paragraph and the first image of all articles of one page.
Looks like there was a slight duplication of efforts.
http://toolserver.org/~dschwen/synopsis/?l=en&t=Synopsis
I developed the synopsis script on the toolserver for the
WikiMiniAtlas, where it allows a quick preview of the articles on the
map.
I found the task to be not entirely trivial. At first I tried fetching
the raw wikitext and stripping the markup. However, templates (some
Wikipedias use templates to insert population numbers!!), comments,
references and links make this tedious. If you want to retain basic
formatting such as bold/italic, it becomes a near-impossible task. So I
switched to fetching action=render and using PHP's DOMDocument to
extract the first paragraph (minus tables, minus short paragraph
elements that contain coordinates, and with internal links to the
reference section removed, etc.). Works quite well.
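
For illustration only, here is a rough sketch of that extraction step in
Python rather than PHP. It is not the toolserver script. It assumes the
rendered HTML is already in a string, treats any paragraph below a
made-up length threshold (min_length) as a coordinate-style stub, and
drops every <sup> element as a stand-in for removing the links to the
reference section. The names FirstParagraph and first_paragraph are
hypothetical.

```python
from html import escape
from html.parser import HTMLParser


class FirstParagraph(HTMLParser):
    """Collect the first substantial <p> outside any <table>, keeping <b>/<i>."""

    KEEP = {"b", "i"}  # basic formatting retained in the output

    def __init__(self, min_length=50):
        super().__init__(convert_charrefs=True)
        self.min_length = min_length  # assumed cutoff for "short" paragraphs
        self.table_depth = 0          # > 0 while inside a table
        self.sup_depth = 0            # > 0 while inside a <sup> (reference link)
        self.in_p = False
        self.parts = []
        self.text_len = 0
        self.result = None

    def handle_starttag(self, tag, attrs):
        if self.result is not None:
            return
        if tag == "table":
            self.table_depth += 1
        elif tag == "p" and self.table_depth == 0:
            self.in_p = True
            self.parts, self.text_len = [], 0
        elif self.in_p and tag == "sup":
            self.sup_depth += 1
        elif self.in_p and self.sup_depth == 0 and tag in self.KEEP:
            self.parts.append(f"<{tag}>")

    def handle_endtag(self, tag):
        if self.result is not None:
            return
        if tag == "table" and self.table_depth > 0:
            self.table_depth -= 1
        elif tag == "p" and self.in_p:
            self.in_p = False
            self.sup_depth = 0
            # Short paragraphs (e.g. bare coordinates) are skipped.
            if self.text_len >= self.min_length:
                self.result = "".join(self.parts).strip()
        elif self.in_p and tag == "sup" and self.sup_depth > 0:
            self.sup_depth -= 1
        elif self.in_p and self.sup_depth == 0 and tag in self.KEEP:
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        if (self.result is None and self.in_p
                and self.sup_depth == 0 and self.table_depth == 0):
            self.parts.append(escape(data, quote=False))
            self.text_len += len(data.strip())


def first_paragraph(html, min_length=50):
    """Return the first qualifying paragraph as a string, or None."""
    parser = FirstParagraph(min_length)
    parser.feed(html)
    parser.close()
    return parser.result
```

Link tags are dropped but their text is kept, so the result reads as
plain prose with only bold/italic markup. A real page would probably
need more rules than this (hidden spans, nested templates output, and
so on).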
