[Toolserver-l] Looking for utility to perform text search in dump

Christian Thiele apper at apper.de
Sat Jul 25 15:27:58 UTC 2009


Hi,

On 25.07.2009 at 17:13, River Tarnell
<river at loreley.flyingparchment.org.uk> wrote:

> .NET is not, itself, non-free.  Microsoft's implementation (the most common
> one) is, but Mono (http://mono-project.com/Main_Page) is not.  perhaps the AWB
> developers could make whatever changes are needed to run it on a free
> implementation.

Mono works great; I run bots built on the DotNetWikiBot framework on the
toolserver.

For simple parsing of a pages-articles.xml file, you could try a script I
used some time ago. It is a very simple line-based XML parser (written for
the pages-articles.xml structure) that calls a function named "test" with
the title and the text of each article. It's not the perfect solution, but
a solution implemented in five minutes ;)

   <?php
   // Callback invoked once per article with its title and text.
   function test($title, $text)
   {
     // do something here
   }

   $filename = "enwiki-200XXXXX-pages-articles.xml";
   $dataFile = fopen($filename, "r");
   if ($dataFile)
   {
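     // 0: outside <page>, 1: inside <page>, 2: inside <revision>, 3: inside <text>
     $status = 0;
     $title = ""; $text = "";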
     while (!feof($dataFile))
     {
       // Read the whole line; a 4096-byte limit could split a tag across reads.
       $buffer = fgets($dataFile);
       if ($buffer === false) break;
       if (($status == 0) && (stripos($buffer, "<page>") !== false))
         $status = 1;
       elseif (($status == 1) && (stripos($buffer, "<title>") !== false))
         $title = strip_tags($buffer);
       elseif (($status == 1) && (stripos($buffer, "<revision>") !== false))
         $status = 2;
       elseif (($status == 2) && (stripos($buffer, "<text ") !== false))
       {
         $status = 3;
         $text = strip_tags($buffer);
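         // The text may end on the same line, or the element may be empty (<text ... />).
         if ((stripos($buffer, "</text>") !== false) || preg_match('/<text[^>]*\/>/i', $buffer)) { $status = 2; }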
       }
       elseif (($status == 3) && (stripos($buffer, "</text>") === false))
         $text .= strip_tags($buffer);
       elseif ($status == 3)
       {
         $text .= strip_tags($buffer);
         $status = 2;
       }
       elseif (($status == 2) && (stripos($buffer, "</revision>") !== false))
         $status = 1;
       elseif (($status == 1) && (stripos($buffer, "</page>") !== false))
       {
         test(trim($title), trim($text));
         $title = ""; $text = "";
         $status = 0;
       }
     }
     fclose($dataFile);
   }
   else
   {
     die("File not found: $filename");
   }
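
For the text search this thread is about, the "test" callback could look
roughly like the sketch below. The helper name articleContains and the
search string are made up for illustration. The article text in the dump is
XML-escaped and strip_tags() does not undo that, so the sketch decodes
entities before searching.

```php
<?php
// Hypothetical helper (not part of the script above): returns true when
// $needle occurs in the decoded article text, ignoring case.
function articleContains($text, $needle)
{
    // Dump text is XML-escaped (&lt;, &gt;, &amp;, ...), so decode it first.
    $decoded = html_entity_decode($text, ENT_QUOTES, 'UTF-8');
    return stripos($decoded, $needle) !== false;
}

// Callback with the signature the parser loop expects.
function test($title, $text)
{
    // The search string is only an example.
    if (articleContains($text, '<ref name=')) {
        echo $title, "\n";
    }
}

// Small self-check with made-up input; only the first title is printed.
test('Example article', 'Some text &lt;ref name="a"&gt;source&lt;/ref&gt;');
test('Other article', 'No references here.');
```

Printing only the title keeps the output small enough to redirect into a
file and work through afterwards.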


