[Toolserver-l] Getting lengths of old revisions

Derrick Coetzee dc at moonflare.com
Thu Feb 18 03:02:39 UTC 2010


Hi all,

I've noticed that revisions from before about April 2007 have no
rev_len stored in the database. I assume this is because the field was
added around that time. Currently, to obtain the rev_len for these
revisions, I'm using the following function to issue a HEAD request to
the live production wiki and read the Content-Length header:

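function get_revision_length($lang, $rev_id) {
  // Cache DNS lookups across calls; a plain local array would be
  // empty again on every call.
  static $cached_ip = array();
  $host = $lang . ".wikipedia.org";
  if (!isset($cached_ip[$host])) {
    $cached_ip[$host] = gethostbyname($host);
  }
  $socket = socket_create(AF_INET, SOCK_STREAM, SOL_TCP);
  if ($socket === false) {
    return null;
  }
  if (!socket_connect($socket, $cached_ip[$host], 80)) {
    socket_close($socket);
    return null;
  }
  // HTTP header lines end in CRLF, and a blank line ends the request.
  $request = 'HEAD /w/index.php?oldid=' . $rev_id .
             "&action=raw HTTP/1.1\r\n" .
             'Host: ' . $host . "\r\n" .
             "User-Agent: Mozilla/5.0\r\n" .
             "Connection: close\r\n\r\n";
  socket_send($socket, $request, strlen($request), 0);
  // With "Connection: close" the server hangs up after the headers,
  // so MSG_WAITALL returns even if fewer than 2048 bytes arrive.
  socket_recv($socket, $buffer, 2048, MSG_WAITALL);
  socket_close($socket);
  // Header names are case-insensitive; match at the start of a line.
  if (preg_match('/^Content-Length:\s*([0-9]+)/mi', (string) $buffer,
                 $matches)) {
    $result = $matches[1];
  } else {
    $result = null;
  }
  return $result;
}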
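
The header parsing can be checked on its own, without touching the
network. Below is a minimal sketch, assuming a hypothetical helper
named parse_content_length (it is not part of the function above) and
a made-up response head:

```php
<?php
// Hypothetical helper: extract Content-Length from a raw HTTP
// response head. Returns the length as an int, or null if the
// header is absent.
function parse_content_length($response_head) {
  // Header names are case-insensitive (i flag). The m flag lets ^
  // match at the start of each line, so a header such as
  // "X-Content-Length" is not matched by accident.
  if (preg_match('/^Content-Length:\s*([0-9]+)/mi', $response_head,
                 $matches)) {
    return (int) $matches[1];
  }
  return null;
}

// Made-up response head, for illustration only.
$sample = "HTTP/1.1 200 OK\r\n" .
          "Content-Type: text/x-wiki\r\n" .
          "content-length: 1234\r\n\r\n";
var_dump(parse_content_length($sample));  // int(1234)
```

Matching case-insensitively and at the start of a line is a defensive
choice here; I have not checked how the production servers actually
capitalise the header.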

This is not too bad, but it is still my app's bottleneck. I had little
luck with Duesentrieb's WikiProxy: it was much slower than this
approach, even when just fetching metadata. Is there a better way?
Is there any plan to populate the rev_len field for old revisions?
Thanks!
-- 
Derrick Coetzee
User:Dcoetzee, En/Commons admin


