Hi all,
I've noticed that revisions from before about April 2007 have no
rev_len stored in the database. I assume this is because the field
was only added around that time. Currently, to obtain the rev_len
for these revisions, I'm using the following function, which issues
a HEAD request to the live production wiki:
function get_revision_length($lang, $rev_id) {
    // Cache DNS lookups across calls; a plain local would be lost each time.
    static $cached_ip = array();
    $host = $lang . ".wikipedia.org";
    if (!isset($cached_ip[$host])) {
        $cached_ip[$host] = gethostbyname($host);
    }
    $socket = socket_create(AF_INET, SOCK_STREAM, SOL_TCP);
    if ($socket === false) {
        return null;
    }
    if (!socket_connect($socket, $cached_ip[$host], 80)) {
        socket_close($socket);
        return null;
    }
    // HTTP requires CRLF line endings.
    $request = "HEAD /w/index.php?oldid=" . urlencode($rev_id) .
               "&action=raw HTTP/1.1\r\n" .
               "Host: " . $host . "\r\n" .
               "User-Agent: Mozilla/5.0\r\n" .
               "Connection: close\r\n\r\n";
    socket_send($socket, $request, strlen($request), 0);
    // With "Connection: close", this returns once the server hangs up.
    socket_recv($socket, $buffer, 2048, MSG_WAITALL);
    socket_close($socket);
    // Header names are case-insensitive, hence the "i" flag.
    if (preg_match('/^Content-Length:[ \t]*([0-9]+)/mi', (string) $buffer, $matches)) {
        return (int) $matches[1];
    }
    return null;
}
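In case it helps anyone reproduce this, here is the header-parsing step
from the function above on its own, as a minimal sketch.
parse_content_length is a hypothetical helper name used only for
illustration, the sample response is made up, and it assumes the server
includes a Content-Length header in its reply to the HEAD request:

```php
<?php
// Hypothetical helper: extract Content-Length from raw HTTP response headers.
// Returns the length as an int, or null if the header is absent.
function parse_content_length($raw_headers) {
    // Header names are case-insensitive, so match with the "i" flag;
    // "m" lets ^ anchor at the start of each header line.
    if (preg_match('/^Content-Length:[ \t]*([0-9]+)/mi', $raw_headers, $m)) {
        return (int) $m[1];
    }
    return null;
}

// Made-up sample response, for illustration only.
$sample = "HTTP/1.1 200 OK\r\n" .
          "Content-Type: text/x-wiki; charset=UTF-8\r\n" .
          "content-length: 5120\r\n" .
          "Connection: close\r\n\r\n";

var_dump(parse_content_length($sample));
var_dump(parse_content_length("HTTP/1.1 200 OK\r\n\r\n"));
```

A missing header comes back as null rather than 0, so an empty revision
can still be told apart from a failed lookup.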
This is not too bad, but it is still my app's bottleneck. I had little
luck with Duesentrieb's WikiProxy: it was much slower than this
approach, even when fetching only metadata. Is there a better way?
Is there any plan to populate the rev_len field for old revisions?
Thanks!
--
Derrick Coetzee
User:Dcoetzee, En/Commons admin