[Toolserver-l] Looking for utility to perform text search in dump
Christian Thiele
apper at apper.de
Sat Jul 25 15:27:58 UTC 2009
Hi,
On 25.07.2009 at 17:13, River Tarnell
<river at loreley.flyingparchment.org.uk> wrote:
> .NET is not, itself, non-free. Microsoft's implementation (the most common
> one) is, but Mono (http://mono-project.com/Main_Page) is not. perhaps the AWB
> developers could make whatever changes are needed to run it on a free
> implementation.
Mono works great; I run bots built on the DotNetWikiBot framework on the
toolserver.
For simple parsing of a pages-articles.xml file, you could try a script I
used some time ago. It is a very simple XML parser (written only for the
pages-articles.xml structure) that calls a function named "test" with the
title and the text of each article. It's not the perfect solution, but the
solution implemented in five minutes ;)
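<?php
// Called once per article with its title and its wikitext.
function test($title, $text)
{
    // do something here
}

$filename = "enwiki-200XXXXX-pages-articles.xml";
$dataFile = fopen($filename, "r");
if ($dataFile)
{
    // $status: 0 = outside <page>, 1 = inside <page>,
    //          2 = inside <revision>, 3 = inside a multi-line <text>
    $status = 0;
    $title = ""; $text = "";
    // Read whole lines: with a fixed length limit, a long line is cut into
    // pieces and a tag such as </text> can be split and missed.
    while (($buffer = fgets($dataFile)) !== false)
    {
        if (($status == 0) && (stripos($buffer, "<page>") !== false))
            $status = 1;
        elseif (($status == 1) && (stripos($buffer, "<title>") !== false))
            $title = strip_tags($buffer);
        elseif (($status == 1) && (stripos($buffer, "<revision>") !== false))
            $status = 2;
        elseif (($status == 2) && (stripos($buffer, "<text ") !== false))
        {
            $text = strip_tags($buffer);
            // The element may end on this same line, either with </text> or
            // (for an empty page) as a self-closing tag; otherwise the text
            // continues on the following lines.
            $closed = (stripos($buffer, "</text>") !== false)
                   || (stripos($buffer, "/>") !== false);
            $status = $closed ? 2 : 3;
        }
        elseif (($status == 3) && (stripos($buffer, "</text>") === false))
            $text .= strip_tags($buffer);
        elseif ($status == 3)
        {
            // Last line of the article text.
            $text .= strip_tags($buffer);
            $status = 2;
        }
        elseif (($status == 2) && (stripos($buffer, "</revision>") !== false))
            $status = 1;
        elseif (($status == 1) && (stripos($buffer, "</page>") !== false))
        {
            // Page complete: hand it over and reset for the next one.
            test(trim($title), trim($text));
            $title = ""; $text = "";
            $status = 0;
        }
    }
    fclose($dataFile);
}
else
{
    die("File not found: $filename");
}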
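
For the text search this thread is about, test() might look roughly like the
sketch below. It is only an illustration: the search term, the $needle and
$matches globals, and the two sample articles are made up, and the call to
html_entity_decode assumes that the dump stores the wikitext with XML
entities escaped (&lt;, &amp;, ...). It would take the place of the empty
test() stub, since PHP does not allow declaring a function twice.

```php
<?php
// Hypothetical search term, not part of the original script.
$needle = "toolserver";
// Titles of all matching articles, collected for later use.
$matches = array();

function test($title, $text)
{
    global $needle, $matches;
    // Assumption: the text is still entity-encoded as in the dump,
    // so decode it before searching.
    $plain = html_entity_decode($text, ENT_QUOTES, "UTF-8");
    // Case-insensitive substring search.
    if (stripos($plain, $needle) !== false) {
        $matches[] = $title;
        echo $title . "\n";
    }
}

// Stand-alone demonstration with made-up articles:
test("Example A", "Hosted on the Toolserver &amp; elsewhere");
test("Example B", "Nothing relevant here");
```

Run on its own, the demonstration should print only the title of the first
sample article, because stripos ignores case and the second text does not
contain the search term.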