I've unsubscribed myself twice from this stupid list!!!
Why am I still receiving messages?!
On 26 March 2011 22:43, Daniel Schwen <lists(a)schwen.de> wrote:
Now we can do this in Wikipedia as well. I wrote a Perl script which scans
the dumps of a language and sorts the titles. Another script gets, via the
API, the first paragraph and the first image for all articles on one page.
Looks like there was a slight duplication of efforts.
http://toolserver.org/~dschwen/synopsis/?l=en&t=Synopsis
I developed the synopsis script on the toolserver for the
WikiMiniAtlas, where it allows a quick preview of the articles on the
map.
I found the task to be not entirely trivial. At first I tried fetching
the raw wikitext and stripping the markup. However, templates (some
Wikipedias use templates to insert population numbers!), comments,
references, and links make this tedious. If you want to retain basic
formatting such as bold/italic, it becomes a near-impossible task. So I
switched to fetching action=render and using PHP's DOMDocument to
extract the first paragraph (minus tables, minus short paragraph
elements that contain coordinates, and with internal links to the
reference section removed, etc.). Works quite well.
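
For anyone who wants to try the same idea, here is a minimal sketch of the
extraction step described above. It is not the toolserver script: it is
written in Python with the standard-library html.parser instead of PHP's
DOMDocument, it returns plain text rather than keeping bold/italic, and it
works on an HTML string you have already fetched. The names
FirstParagraphParser, first_paragraph and the min_length cutoff of 50
characters are assumptions made for illustration, as is the guess that
footnote markers appear as <sup class="reference"> elements.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag; ignored when tracking nesting.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class FirstParagraphParser(HTMLParser):
    """Collect the text of the first substantial <p> outside any <table>."""

    def __init__(self, min_length=50):
        super().__init__(convert_charrefs=True)
        self.min_length = min_length  # assumed cutoff for "short" paragraphs
        self.table_depth = 0          # > 0 while inside a table (e.g. an infobox)
        self.skip_depth = 0           # > 0 while inside a footnote marker
        self.in_paragraph = False
        self.parts = []
        self.result = None

    def handle_starttag(self, tag, attrs):
        if self.result is not None or tag in VOID_TAGS:
            return
        if self.skip_depth:
            self.skip_depth += 1
        elif tag == "table":
            self.table_depth += 1
        elif tag == "p" and not self.table_depth:
            self.in_paragraph, self.parts = True, []
        elif self.in_paragraph and tag == "sup":
            classes = (dict(attrs).get("class") or "").split()
            if "reference" in classes:  # assumed markup for footnote links
                self.skip_depth = 1

    def handle_endtag(self, tag):
        if self.result is not None or tag in VOID_TAGS:
            return
        if self.skip_depth:
            self.skip_depth -= 1
        elif tag == "table" and self.table_depth:
            self.table_depth -= 1
        elif tag == "p" and self.in_paragraph:
            self.in_paragraph = False
            # Collapse runs of whitespace into single spaces.
            text = " ".join("".join(self.parts).split())
            # Short paragraphs (such as a bare coordinate line) are skipped.
            if len(text) >= self.min_length:
                self.result = text

    def handle_data(self, data):
        if self.in_paragraph and not self.skip_depth and self.result is None:
            self.parts.append(data)


def first_paragraph(html, min_length=50):
    """Return the first paragraph's plain text, or None if none qualifies."""
    parser = FirstParagraphParser(min_length)
    parser.feed(html)
    parser.close()
    return parser.result
```

Call first_paragraph(html) on the rendered HTML of one article. Filtering
short paragraphs by length alone is cruder than checking for coordinate
markup, so the cutoff will likely need tuning per wiki.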
_______________________________________________
Wikipedia-l mailing list
Wikipedia-l(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikipedia-l