Hello gurus,
we (Kolossos and I) are working on a new update for Wikipedia-World and Vorlagenauswertung (Templatetiger):
http://de.wikipedia.org/wiki/Wikipedia:WikiProjekt_Georeferenzierung/Wikiped... http://de.wikipedia.org/wiki/Wikipedia:WikiProjekt_Vorlagenauswertung/en
For the last few years we have used the dumps from the Wikimedia Foundation. We download the dump for each of the main languages, unpack these big files on the toolserver, and scan each file with a Perl script for geocoordinates and templates.
Problem: This takes a lot of time, causes a lot of traffic, and needs a lot of disk space on the toolserver. Also, an update is only possible when a new dump is available.
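The scan itself is simple; here is a minimal sketch of the idea (the file name and the {{Koordinate ...}} pattern are only placeholders, not our real script):

#!/usr/bin/perl
# Sketch only: stream an unpacked XML dump and grep each line for a
# coordinate template. File name and template pattern are placeholders.
use strict;
use warnings;

my $dump = 'dewiki-latest-pages-articles.xml';   # hypothetical path
open my $fh, '<', $dump or die "open $dump: $!";

my $title = '';
while ( my $line = <$fh> ) {
    $title = $1 if $line =~ /<title>(.*?)<\/title>/;
    if ( $line =~ /\{\{(Koordinate[^|}]*)\|([^}]*)\}\}/ ) {
        print "$title\t$1\t$2\n";                # article, template, parameters
    }
}
close $fh;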
Now we want to create a new workflow. We use an up-to-date list of articles of a language, and the Perl script gets the text of every article from the toolserver.
It works very well, but at the moment reading the text of one article takes 2 seconds. If we get all 677,000 German articles at 2 seconds each, we need 15.6 days. That is too much! With the old workflow I need no more than 30 minutes to scan the complete DE dump.
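(The arithmetic: 677,000 articles × 2 s ≈ 1.35 million seconds, i.e. more than two weeks; at 0.2 s per article it is about a tenth of that, roughly a day and a half.)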
I tried different ways to get the text of an article with my Perl script:
1.) http://localhost/~daniel/WikiSense/WikiProxy.php?wiki=de.wikipedia.org&t...
2.) http://tools.wikimedia.de/~kolossos/rss-test/fopen.php?pro=wikipedia&lan...
but every time the Perl script needs 2 seconds on average. With the second PHP script I can see in a browser the time for getting the text, and this time is mostly 0.2 seconds (a factor of 10 lower!). At that speed we would need only 1.5 days for the scan. That would be OK!
Here is the Perl code of my script, which gets the article text. I hope a Perl guru can help us to reduce the time.
Thanks for any help! Stefan Kühn
use URI::Escape;
use LWP::UserAgent;

sub get_article_text_from_web {
    my $title         = $_[0];
    my $page_id       = 0;
    my $revision_id   = 0;
    my $revision_time = 0;
    my $text          = "";

    print "get\t".$title."\n";
    my $test1 = time();
    print localtime($test1)."\n";

    # http://localhost/~daniel/WikiSense/WikiProxy.php?wiki=$lang.wikipedia.org&am...
    my $url = 'http://localhost/~daniel/WikiSense/WikiProxy.php?wiki=de.wikipedia.org&t...';   # http://www.webkuehn.de
    uri_escape($url);
    my $ua = LWP::UserAgent->new;
    my $response = $ua->get( $url );
    $response->is_success or die "$url: ", $response->status_line;
    my $result = $response->content;
    if ( $result ) {
        $text = $result;
        #print "$result\n";
        print "ok\t".$title."\n";
        my $test2 = time();
        my $test3 = $test2 - $test1;
        print localtime($test2)."\n";
        print "second\t".$test3."\n\n";
    } else {
        print "No result $title\n";
    }
    #print "ok2\n";
    return($title, $page_id, $revision_id, $revision_time, $text);
}
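One idea I want to test next (untested, only a guess): measure with Time::HiRes instead of time(), because time() only has whole-second resolution, and reuse a single UserAgent with keep_alive so the connection to the proxy is not opened again for every article. A rough sketch:

# Untested sketch: sub-second timing and one reused connection.
use strict;
use warnings;
use Time::HiRes qw(time);
use LWP::UserAgent;

# one UserAgent for all requests, with HTTP keep-alive
my $ua = LWP::UserAgent->new( keep_alive => 1, timeout => 30 );

sub fetch_timed {
    my $url   = shift;
    my $start = time();                      # Time::HiRes gives fractions of a second
    my $response = $ua->get($url);
    printf "%.3f s  %s\n", time() - $start, $url;
    return $response->is_success ? $response->content : undef;
}

If the 2 seconds really comes from opening a new connection for every article, the keep-alive version should show it.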
On Saturday 17 November 2007 10:56:20 Stefan Kühn wrote:
> For the last few years we have used the dumps from the Wikimedia Foundation. We download the dump for each of the main languages, unpack these big files on the toolserver, and scan each file with a Perl script for geocoordinates and templates.
What about scanning the externallinks table [1] to find every link to http://tools.wikimedia.de/~magnus/geo/geohack.php?language=de&... ? If all the information you need is in the link, this could be an alternative method.
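Roughly like this (untested sketch; the host and database names on the toolserver are only assumptions):

#!/usr/bin/perl
# Untested sketch: pull all geohack links for dewiki from the externallinks
# table. Host and database names are assumptions, adjust for the toolserver.
use strict;
use warnings;
use DBI;

my $dbh = DBI->connect(
    'DBI:mysql:database=dewiki_p;host=sql',   # assumed toolserver names
    'USER', 'PASSWORD', { RaiseError => 1 }
);

my $sth = $dbh->prepare(q{
    SELECT page_title, el_to
    FROM externallinks
    JOIN page ON page_id = el_from
    WHERE el_to LIKE '%/geohack.php?%'
});
$sth->execute();
while ( my ($title, $link) = $sth->fetchrow_array ) {
    print "$title\t$link\n";     # coordinates are encoded in the link parameters
}
$dbh->disconnect;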
Jelte a.k.a. WebBoy