Михајло Анђелковић wrote:
I would like to ask permission to run a query that could be resource-intensive if not properly scaled:
SELECT page.page_title as title, rev_user_text as user, rev_timestamp as timestamp, rev_len as len FROM revision JOIN page ON page.page_id = rev_page WHERE rev_id > 0 AND rev_id < [...] AND rev_deleted = 0;
This is intended to extract basic data about all publicly visible revisions from 1 to [...]. Info about each revision would be a 4-tuple title/user name/time/length. I need this data to start generating a timeline of editing of srwiki, so it is intended to be run only once for each revision.
If this is generally allowed, my question is how large a chunk of data I can take at once, and how long I should wait between two chunks.
srwiki_p isn't very large (3665333 revisions and 413987 pages), so I personally wouldn't worry about performance very much at all. If you were going to run this query on enwiki_p or another larger database, it might be more of a concern. Run the queries that you need to run.
The "Queries" page on the Toolserver wiki might be helpful to you.[1]
Looking at your query, you should pull page.page_namespace or specify page_namespace = 0. Pulling only page.page_title without specifying a namespace will output useless results, since a title is unique only within its namespace. I'm also unclear why you'd need to specify rev_id > 0, though you might have your reasons for doing so.
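As a rough illustration of the namespace point, here is a minimal Python sketch that builds the query with page.page_namespace included and, should you still want to split the work, restricts it to a rev_id range. The helper names chunk_ranges and build_query, the chunk size, and the upper bound are all made up for the example; the real upper bound is whatever you had in mind for [...]. It only produces SQL strings and does not connect to anything.

```python
from typing import Iterator, Tuple


def chunk_ranges(max_rev_id: int, chunk_size: int) -> Iterator[Tuple[int, int]]:
    """Yield half-open (start, end) rev_id ranges covering 1..max_rev_id."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be positive")
    start = 1
    while start <= max_rev_id:
        # end is exclusive, so the last range stops at max_rev_id + 1
        end = min(start + chunk_size, max_rev_id + 1)
        yield start, end
        start = end


def build_query(start: int, end: int) -> str:
    """Return the revision query for rev_id in [start, end).

    page_namespace is selected alongside page_title so that each
    title can be attributed to the right namespace.
    """
    return (
        "SELECT page.page_namespace AS namespace, page.page_title AS title, "
        "rev_user_text AS user, rev_timestamp AS timestamp, rev_len AS len "
        "FROM revision JOIN page ON page.page_id = rev_page "
        f"WHERE rev_id >= {int(start)} AND rev_id < {int(end)} "
        "AND rev_deleted = 0;"
    )


if __name__ == "__main__":
    # Hypothetical values for illustration only: neither the upper bound
    # nor the chunk size comes from this thread.
    for start, end in chunk_ranges(max_rev_id=1000, chunk_size=400):
        print(build_query(start, end))
```

Using half-open ranges means consecutive chunks never overlap, so each revision is fetched exactly once. If you only care about articles, replacing the extra column with "AND page_namespace = 0" in the WHERE clause would work as well.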
Your Toolserver account has a quota (viewable with 'quota -v') that you might hit if you're outputting a lot of data to disk. You can always use /mnt/user-store/ or file a ticket in JIRA if you need an increased quota.
MZMcBride