Hi all,
I've noticed that revisions in the database from before about April 2007 have no rev_len stored. I assume this is because the field was only added around then. Currently, to obtain rev_len for these revisions, I use the following function, which issues a HEAD request to the live production wiki:
function get_revision_length($lang, $rev_id) {
    // Keep resolved IPs across calls so each host is looked up only once.
    static $cached_ip = array();
    $host = $lang . ".wikipedia.org";
    if (!isset($cached_ip[$host])) {
        $cached_ip[$host] = gethostbyname($host);
    }
    $socket = socket_create(AF_INET, SOCK_STREAM, SOL_TCP);
    socket_connect($socket, $cached_ip[$host], 80);
    // HEAD returns only the headers; Content-Length gives the raw text size.
    $request = 'HEAD ' . "/w/index.php?oldid=" . $rev_id . "&action=raw" . ' HTTP/1.1' . "\n" .
               'Host: ' . $host . "\n" .
               'User-Agent: Mozilla/5.0' . "\n" .
               'Connection: close' . "\n\n";
    socket_send($socket, $request, strlen($request), 0);
    socket_recv($socket, $buffer, 2048, MSG_WAITALL);
    if (preg_match('/Content-Length: ([0-9]+)/', $buffer, $matches)) {
        $result = $matches[1];
    } else {
        $result = null;
    }
    socket_close($socket);
    return $result;
}
This is not too bad, but it is still my app's bottleneck. I had little luck with Duesentrieb's WikiProxy: it was much slower than this approach, even when just fetching metadata. Is there a better way? Is there any plan to populate the rev_len field for old revisions? Thanks!
On Wed, Feb 17, 2010 at 10:02 PM, Derrick Coetzee dc@moonflare.com wrote:
'User-Agent: Mozilla/5.0' . "\n" .
Do not lie in your User-Agent. If your script causes any significant load, and a Wikimedia sysadmin notices it pretending to be a browser, they wouldn't be able to do much except block the entire toolserver until we're able to figure out which script is causing the problem, and disable it. Use a descriptive User-Agent, preferably one that gives contact information.
I don't have any suggestions as to your actual problem, though. I don't see a script to populate rev_len in maintenance/, so I guess someone would have to write one.
Aryeh Gregor wrote:
On Wed, Feb 17, 2010 at 10:02 PM, Derrick Coetzee dc@moonflare.com wrote:
'User-Agent: Mozilla/5.0' . "\n" .
Do not lie in your User-Agent. [...]
That's funny, I asked about this on IRC a few minutes ago.
In full:
<irc>
bayo_O: is anybody know what the User-Agent should look like when we want to connect a bot to Wikipedia?
dispenser: its typically TOOLNAME/VER (+SUPPORTURL)
atglenn: more specific is better:
dispenser: Googlebot/2.1 (+http://www.google.com/bot.html)
atglenn: contact info in case the bot goes awry, is good
Daniel_WMDE: dispenser: is the + somehow significant? some convention?
dispenser: No clue why its there, but everyone does it
</irc>
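The TOOLNAME/VER (+SUPPORTURL) pattern from the log can be assembled in one place, so every request a script sends carries the same header. This is only a minimal sketch: build_user_agent() is a hypothetical helper, and the tool name, version and URL in the usage note are placeholders, not a real tool.

```php
<?php
// Hypothetical helper: builds a "TOOLNAME/VER (+SUPPORTURL)" User-Agent value.
// CR/LF characters are collapsed to spaces so an argument cannot inject
// extra header lines into a hand-built HTTP request.
function build_user_agent(string $tool, string $version, string $supportUrl): string {
    $clean = function (string $s): string {
        return trim(preg_replace('/[\r\n]+/', ' ', $s));
    };
    return sprintf('%s/%s (+%s)', $clean($tool), $clean($version), $clean($supportUrl));
}
```

In a request like the one at the top of the thread, the 'User-Agent: Mozilla/5.0' line would then become 'User-Agent: ' . build_user_agent('RevLenFetcher', '0.1', 'https://example.org/revlen'), with a support URL that actually reaches the tool's maintainer.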
-bayo
Hi there,
I would also love to see such a script. The next step would be to run it on all old edits of every database and add the corresponding values to the replicated revision tables.
Maybe this can be done even more easily with the XML dumps. I advise you to ask Felipe Ortega, who wrote WikiXRay:
http://meta.wikimedia.org/wiki/WikiXRay
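As a rough illustration of the XML dump idea (not WikiXRay's method), revision lengths could be collected in one streaming pass with PHP's XMLReader. This sketch assumes the usual dump layout, where each <revision> holds its own <id> before any <contributor> and a <text> element, and it assumes rev_len corresponds to the byte length of that text; both should be checked against a real dump before relying on the numbers. revision_lengths() is a hypothetical name.

```php
<?php
// Sketch: map revision id => byte length of the revision text.
// The reader must already be opened on a dump, e.g. via $reader->open($path).
function revision_lengths(XMLReader $reader): array {
    $lengths = array();
    $inRevision = false;
    $revId = null;
    while ($reader->read()) {
        if ($reader->nodeType === XMLReader::ELEMENT) {
            if ($reader->localName === 'revision') {
                $inRevision = true;
                $revId = null;
            } elseif ($inRevision && $revId === null && $reader->localName === 'id') {
                // First <id> inside <revision> is the revision id;
                // later ones (e.g. the contributor's) are ignored.
                $revId = (int) $reader->readString();
            } elseif ($inRevision && $revId !== null && $reader->localName === 'text') {
                // strlen() counts bytes, not characters.
                $lengths[$revId] = strlen($reader->readString());
            }
        } elseif ($reader->nodeType === XMLReader::END_ELEMENT
                  && $reader->localName === 'revision') {
            $inRevision = false;
        }
    }
    return $lengths;
}
```

Because the reader streams the file, memory use stays flat even for large dumps. The resulting array would still have to be written back to the database by a separate script.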
Greetings and good luck
euro
8< 8< 8< 8< 8<
Message from Aryeh Gregor on 18.02.2010 18:52:
On Wed, Feb 17, 2010 at 10:02 PM, Derrick Coetzee dc@moonflare.com wrote:
'User-Agent: Mozilla/5.0' . "\n" .
Do not lie in your User-Agent. If your script causes any significant load, and a Wikimedia sysadmin notices it pretending to be a browser, they wouldn't be able to do much except block the entire toolserver until we're able to figure out which script is causing the problem, and disable it. Use a descriptive User-Agent, preferably one that gives contact information.
I don't have any suggestions as to your actual problem, though. I don't see a script to populate rev_len in maintenance/, so I guess someone would have to write one.
Toolserver-l mailing list (Toolserver-l@lists.wikimedia.org) https://lists.wikimedia.org/mailman/listinfo/toolserver-l Posting guidelines for this list: https://wiki.toolserver.org/view/Mailing_list_etiquette
Aryeh Gregor Simetrical+wikilist@gmail.com wrote:
[...] I don't have any suggestions as to your actual problem, though. I don't see a script to populate rev_len in maintenance/, so I guess someone would have to write one.
If someone does, please update bug #18881. There is Bryan's patch in bug #12188 to build upon.
Tim