[Toolserver-l] Extracting basic revision data

MZMcBride z at mzmcbride.com
Mon Nov 29 20:09:17 UTC 2010


Михајло Анђелковић wrote:
> I would like to ask permission to run a query that could be
> resource-intensive if not properly scaled:
> 
> SELECT page.page_title as title, rev_user_text as user, rev_timestamp
> as timestamp, rev_len as len FROM revision JOIN page ON page.page_id =
> rev_page WHERE rev_id > 0 AND rev_id < [...] AND rev_deleted = 0;
> 
> This is intended to extract basic data about all publicly visible
> revisions from 1 to [...]. Info about each revision would be a 4-tuple
> title/user name/time/length. I need this data to start generating a
> timeline of editing of srwiki, so it is intended to be run only once
> for each revision.
> 
> If this is generally allowed, my question is how large a chunk of
> data I can take at once, and how long I should wait between two
> takes.

srwiki_p isn't very large (3665333 revisions and 413987 pages), so I
personally wouldn't worry about performance very much at all. If you were
going to run this query on enwiki_p or another larger database, it might be
more of a concern. Run the queries that you need to run.
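
If you did need to throttle on a larger database, one option is to walk
rev_id in fixed-size ranges so each query stays bounded. A minimal sketch,
using an in-memory SQLite table as a stand-in for the replica; the function
name iter_revision_chunks, the chunk size, and the toy rows are all made up
for illustration and are not Toolserver specifics:

```python
import sqlite3


def iter_revision_chunks(conn, max_rev_id, chunk_size):
    """Yield one list of rows per half-open rev_id range [lo, hi)."""
    query = (
        "SELECT rev_id, rev_user_text, rev_timestamp, rev_len "
        "FROM revision "
        "WHERE rev_id >= ? AND rev_id < ? AND rev_deleted = 0 "
        "ORDER BY rev_id"
    )
    lo = 1
    while lo <= max_rev_id:
        hi = min(lo + chunk_size, max_rev_id + 1)
        yield conn.execute(query, (lo, hi)).fetchall()
        lo = hi


# Toy stand-in for the revision table (hypothetical data, 10 revisions,
# with revision 4 marked as deleted).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE revision ("
    "rev_id INTEGER PRIMARY KEY, rev_user_text TEXT, "
    "rev_timestamp TEXT, rev_len INTEGER, rev_deleted INTEGER)"
)
conn.executemany(
    "INSERT INTO revision VALUES (?, ?, ?, ?, ?)",
    [
        (i, "User%d" % (i % 3), "2010112920%04d" % i, 100 + i,
         1 if i == 4 else 0)
        for i in range(1, 11)
    ],
)

chunks = list(iter_revision_chunks(conn, max_rev_id=10, chunk_size=4))
```

Against a real replica you would swap in your own connection and pick a
chunk size (and any pause between chunks) to suit the server load.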

The "Queries" page on the Toolserver wiki might be helpful to you.[1]

Looking at your query, you should either pull page.page_namespace or
specify page_namespace = 0. Pulling only page.page_title without a
namespace gives ambiguous results, since the same title can exist in
several namespaces (an article and its talk page, for example). I'm also
unclear why you'd need to specify rev_id > 0, though you might have your
reasons for doing so.
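
To make the namespace point concrete, here is a rough sketch of the query
with page_namespace pulled alongside the title and the rev_id > 0 condition
dropped. It runs against a toy in-memory SQLite schema, not the real
srwiki_p tables, and the sample rows are invented for illustration:

```python
import sqlite3

# The adjusted query: page_namespace is selected so that identical titles
# in different namespaces can be told apart.
QUERY = """
SELECT page.page_namespace,
       page.page_title,
       rev_user_text,
       rev_timestamp,
       rev_len
FROM revision
JOIN page ON page.page_id = rev_page
WHERE rev_deleted = 0
ORDER BY rev_id
"""

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE page ("
    "page_id INTEGER PRIMARY KEY, page_namespace INTEGER, page_title TEXT)"
)
conn.execute(
    "CREATE TABLE revision ("
    "rev_id INTEGER PRIMARY KEY, rev_page INTEGER, rev_user_text TEXT, "
    "rev_timestamp TEXT, rev_len INTEGER, rev_deleted INTEGER)"
)

# Hypothetical rows: an article (namespace 0) and its talk page
# (namespace 1) share the same page_title.
conn.executemany(
    "INSERT INTO page VALUES (?, ?, ?)",
    [(1, 0, "Beograd"), (2, 1, "Beograd")],
)
conn.executemany(
    "INSERT INTO revision VALUES (?, ?, ?, ?, ?, ?)",
    [
        (1, 1, "A", "20101129200917", 120, 0),
        (2, 2, "B", "20101129201000", 40, 0),
        (3, 1, "C", "20101129201500", 130, 1),  # hidden revision, skipped
    ],
)

rows = conn.execute(QUERY).fetchall()
```

Both result rows carry the title "Beograd"; only the namespace column
separates the article edit from the talk page edit.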

Your Toolserver account has a quota (viewable with 'quota -v') that you
might hit if you're outputting a lot of data to disk. You can always use
/mnt/user-store/ or file a ticket in JIRA if you need an increased quota.

MZMcBride

[1] https://wiki.toolserver.org/view/Queries




