Hi all,
I've noticed that revisions from before about April 2007 have no rev_len stored in the database. I assume this is because the field was only added around that time. To obtain the length of these revisions, I currently use the following function, which issues a HEAD request to the live production wiki:
function get_revision_length($lang, $rev_id) {
    // Cache DNS lookups across calls (a plain local would be reset every call).
    static $cached_ip = array();
    $host = $lang . ".wikipedia.org";
    if (!isset($cached_ip[$host])) {
        $cached_ip[$host] = gethostbyname($host);
    }

    $socket = socket_create(AF_INET, SOCK_STREAM, SOL_TCP);
    if ($socket === false) {
        return null;
    }
    if (!socket_connect($socket, $cached_ip[$host], 80)) {
        socket_close($socket);
        return null;
    }

    // HTTP header lines must end in CRLF.
    $request = "HEAD /w/index.php?oldid=" . $rev_id . "&action=raw HTTP/1.1\r\n"
             . "Host: " . $host . "\r\n"
             . "User-Agent: Mozilla/5.0\r\n"
             . "Connection: close\r\n\r\n";
    socket_send($socket, $request, strlen($request), 0);

    // Blocks until 2048 bytes arrive or the server closes the connection.
    $buffer = '';
    socket_recv($socket, $buffer, 2048, MSG_WAITALL);
    socket_close($socket);

    if (preg_match('/Content-Length: ([0-9]+)/i', (string) $buffer, $matches)) {
        return $matches[1];
    }
    return null;
}
This is not too bad, but it is still my app's bottleneck. I had little luck with Duesentrieb's WikiProxy: it was much slower than this approach, even when fetching only metadata. Is there a better way? Is there any plan to populate the rev_len field for old revisions? Thanks!