Hello gurus,
we (Kolossos and I) are working on a new update for Wikipedia-World and Vorlagenauswertung (Templatetiger):
http://de.wikipedia.org/wiki/Wikipedia:WikiProjekt_Georeferenzierung/Wikiped... http://de.wikipedia.org/wiki/Wikipedia:WikiProjekt_Vorlagenauswertung/en
For the last few years we have used the dumps from the Wikimedia Foundation. We download the dump for each of the main languages, unpack these big files on the toolserver, and scan each file with a Perl script for geocoordinates and templates.
Problem: This takes a lot of time, causes a lot of traffic, and needs a lot of disk space on the toolserver. Also, an update is only possible when a new dump is available.
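The scan itself is simple; here is a minimal sketch of the idea (the file name and the {{Koordinate ...}} pattern are only placeholders, not our real script):

#!/usr/bin/perl
# Sketch only: stream an unpacked XML dump and grep each line for a
# coordinate template. File name and template pattern are placeholders.
use strict;
use warnings;

my $dump = 'dewiki-latest-pages-articles.xml';   # hypothetical path
open my $fh, '<', $dump or die "open $dump: $!";

my $title = '';
while ( my $line = <$fh> ) {
    $title = $1 if $line =~ /<title>(.*?)<\/title>/;
    if ( $line =~ /\{\{(Koordinate[^|}]*)\|([^}]*)\}\}/ ) {
        print "$title\t$1\t$2\n";                # article, template, parameters
    }
}
close $fh;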
Now we want to create a new workflow. We use an up-to-date list of articles of a language, and the Perl script gets the text of every article from the toolserver.
It works very well, but at the moment reading the text of one article takes 2 seconds. If we get all 677,000 German articles at 2 seconds each, we need 15.6 days. That is too much! With the old workflow I need no more than 30 minutes to scan the complete DE dump.
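(The arithmetic: 677,000 articles × 2 s ≈ 1.35 million seconds, i.e. more than two weeks; at 0.2 s per article it is about a tenth of that, roughly a day and a half.)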
I tried different ways to get the text of an article with my Perl script:
1.) http://localhost/~daniel/WikiSense/WikiProxy.php?wiki=de.wikipedia.org&t...
2.) http://tools.wikimedia.de/~kolossos/rss-test/fopen.php?pro=wikipedia&lan...
but every time the Perl script needs 2 seconds on average. With the second PHP script I can see in a browser the time for getting the text, and this time is mostly 0.2 seconds (a factor of 10 lower!). At that speed we would need only 1.5 days for the scan. That would be OK!
Here is the Perl code of my script, which gets the article text. I hope a Perl guru can help us to reduce the time.
Thanks for any help! Stefan Kühn
use URI::Escape;
use LWP::UserAgent;

sub get_article_text_from_web {
    my $title         = $_[0];
    my $page_id       = 0;
    my $revision_id   = 0;
    my $revision_time = 0;
    my $text          = "";

    print "get\t".$title."\n";
    my $test1 = time();
    print localtime($test1)."\n";

    # http://localhost/~daniel/WikiSense/WikiProxy.php?wiki=$lang.wikipedia.org&am...
    my $url = 'http://localhost/~daniel/WikiSense/WikiProxy.php?wiki=de.wikipedia.org&t...';   # http://www.webkuehn.de
    uri_escape($url);
    my $ua = LWP::UserAgent->new;
    my $response = $ua->get( $url );
    $response->is_success or die "$url: ", $response->status_line;
    my $result = $response->content;
    if ( $result ) {
        $text = $result;
        #print "$result\n";
        print "ok\t".$title."\n";
        my $test2 = time();
        my $test3 = $test2 - $test1;
        print localtime($test2)."\n";
        print "second\t".$test3."\n\n";
    } else {
        print "No result $title\n";
    }
    #print "ok2\n";
    return($title, $page_id, $revision_id, $revision_time, $text);
}
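One idea I want to test next (untested, only a guess): measure with Time::HiRes instead of time(), because time() only has whole-second resolution, and reuse a single UserAgent with keep_alive so the connection to the proxy is not opened again for every article. A rough sketch:

# Untested sketch: sub-second timing and one reused connection.
use strict;
use warnings;
use Time::HiRes qw(time);
use LWP::UserAgent;

# one UserAgent for all requests, with HTTP keep-alive
my $ua = LWP::UserAgent->new( keep_alive => 1, timeout => 30 );

sub fetch_timed {
    my $url   = shift;
    my $start = time();                      # Time::HiRes gives fractions of a second
    my $response = $ua->get($url);
    printf "%.3f s  %s\n", time() - $start, $url;
    return $response->is_success ? $response->content : undef;
}

If the 2 seconds really comes from opening a new connection for every article, the keep-alive version should show it.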
On Saturday 17 November 2007 10:56:20 Stefan Kühn wrote:
> For the last few years we have used the dumps from the Wikimedia Foundation. We download the dump for each of the main languages, unpack these big files on the toolserver, and scan each file with a Perl script for geocoordinates and templates.
What about scanning the externallinks table [1] to find every link to http://tools.wikimedia.de/~magnus/geo/geohack.php?language=de&... ? If all the information you need is in the link, this could be an alternative method.
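Roughly like this (untested sketch; the host and database names on the toolserver are only assumptions):

#!/usr/bin/perl
# Untested sketch: pull all geohack links for dewiki from the externallinks
# table. Host and database names are assumptions, adjust for the toolserver.
use strict;
use warnings;
use DBI;

my $dbh = DBI->connect(
    'DBI:mysql:database=dewiki_p;host=sql',   # assumed toolserver names
    'USER', 'PASSWORD', { RaiseError => 1 }
);

my $sth = $dbh->prepare(q{
    SELECT page_title, el_to
    FROM externallinks
    JOIN page ON page_id = el_from
    WHERE el_to LIKE '%/geohack.php?%'
});
$sth->execute();
while ( my ($title, $link) = $sth->fetchrow_array ) {
    print "$title\t$link\n";     # coordinates are encoded in the link parameters
}
$dbh->disconnect;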
Jelte a.k.a. WebBoy