Hello,
I would like to ask for permission to run a query that could be resource-consuming if not properly scaled:
SELECT page.page_title as title, rev_user_text as user, rev_timestamp as timestamp, rev_len as len FROM revision JOIN page ON page.page_id = rev_page WHERE rev_id > 0 AND rev_id < [...] AND rev_deleted = 0;
This is intended to extract basic data about all publicly visible revisions from 1 to [...]. Info about each revision would be a 4-tuple title/user name/time/length. I need this data to start generating a timeline of editing of srwiki, so it is intended to be run only once for each revision.
If this is generally allowed, my question is how large a chunk of data I can take at once, and how long I should wait between two fetches?
M
2010/11/29 Михајло Анђелковић michael.angelkovich@gmail.com:
This is intended to extract basic data about all publicly visible revisions from 1 to [...]. Info about each revision would be a 4-tuple title/user name/time/length. I need this data to start generating a timeline of editing of srwiki, so it is intended to be run only once for each revision.
It seems to me you might want to use the dumps instead http://download.wikimedia.org/, even though they are unavailable at the moment…
-- [[ cs:User:Mormegil | Petr Kadlec ]]
Unfortunately, the complete dumps contain lots of data I don't actually need, and I'm afraid I'm not willing to sacrifice that much of my small HDD to them. What's more, they really have been unavailable since 10.11, which is already quite a long time.
Right now I have time for this research and I want to use the Toolserver, unless the admins do not allow it. I am asking per 'ask first'.
M
Михајло Анђелковић:
WHERE rev_id > 0 AND rev_id < [...] AND rev_deleted = 0;
Please check that MySQL plans this correctly (using the rev_id index).
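For example, something along these lines (the upper bound below is just a placeholder for whatever value you end up using):

  EXPLAIN SELECT page.page_title, rev_user_text, rev_timestamp, rev_len
  FROM revision JOIN page ON page.page_id = rev_page
  WHERE rev_id > 0 AND rev_id < 1000000 AND rev_deleted = 0;
  -- for the revision table, expect type=range and key=PRIMARY (rev_id),
  -- not a full table scan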
If this is generally allowed, my question is how large a chunk of data I can take at once, and how long I should wait between two fetches?
A simple fetch like this doesn't usually take many resources. You can probably pick any random number for chunking. If you run each chunk under SGE[0] as a separate job, and request an sqlprocs resource, SGE will handle scheduling for you so there's no need to worry about how long to wait between runs.
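An untested sketch of what that might look like, assuming the resource is simply called sqlprocs (check [0] for the exact resource name and submission details):

  # one hypothetical wrapper script per chunk, each running one of the chunked SELECTs
  qsub -l sqlprocs=1 -N revs-chunk-01 fetch_chunk_01.sh
  qsub -l sqlprocs=1 -N revs-chunk-02 fetch_chunk_02.sh
  # ...and so on; SGE queues the jobs so they don't all hit the database at once

The fetch_chunk_*.sh names are just placeholders for whatever scripts you write.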
- river.
[0] https://wiki.toolserver.org/view/Job_scheduling
Михајло Анђелковић wrote:
Hello,
I would like to ask for permission to run a query that could be resource-consuming if not properly scaled:
SELECT page.page_title as title, rev_user_text as user, rev_timestamp as timestamp, rev_len as len FROM revision JOIN page ON page.page_id = rev_page WHERE rev_id > 0 AND rev_id < [...] AND rev_deleted = 0;
This is intended to extract basic data about all publicly visible revisions from 1 to [...]. Info about each revision would be a 4-tuple title/user name/time/length. I need this data to start generating a timeline of editing of srwiki, so it is intended to be run only once for each revision.
If this is generally allowed, my question is how large a chunk of data I can take at once, and how long I should wait between two fetches?
M
Have you considered generating the early timeline from dumps?
Михајло Анђелковић wrote:
I would like to ask for permission to run a query that could be resource-consuming if not properly scaled:
SELECT page.page_title as title, rev_user_text as user, rev_timestamp as timestamp, rev_len as len FROM revision JOIN page ON page.page_id = rev_page WHERE rev_id > 0 AND rev_id < [...] AND rev_deleted = 0;
This is intended to extract basic data about all publicly visible revisions from 1 to [...]. Info about each revision would be a 4-tuple title/user name/time/length. I need this data to start generating a timeline of editing of srwiki, so it is intended to be run only once for each revision.
If this is generally allowed, my question is how large a chunk of data I can take at once, and how long I should wait between two fetches?
srwiki_p isn't very large (3,665,333 revisions and 413,987 pages), so I personally wouldn't worry about performance very much at all. If you were going to run this query on enwiki_p or another larger database, it might be more of a concern. Run the queries that you need to run.
The "Queries" page on the Toolserver wiki might be helpful to you.[1]
Looking at your query, you should pull page.page_namespace or specify page_namespace = 0. Pulling only page.page_title without specifying a namespace will output useless results. I'm also unclear why you'd need to specify rev_id > 0, though you might have your reasons for doing so.
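For example, a sketch of the main-namespace variant (the rev_id bound is just a placeholder):

  SELECT page.page_namespace AS ns, page.page_title AS title,
         rev_user_text AS user, rev_timestamp AS timestamp, rev_len AS len
  FROM revision
  JOIN page ON page.page_id = rev_page
  WHERE rev_id < 1000000            -- placeholder upper bound
    AND rev_deleted = 0
    AND page.page_namespace = 0;    -- main namespace only; drop this line to keep every namespace

Keeping the ns column in the output lets you tell articles apart from talk and other pages later.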
Your Toolserver account has a quota (viewable with 'quota -v') that you might hit if you're outputting a lot of data to disk. You can always use /mnt/user-store/ or file a ticket in JIRA if you need an increased quota.
MZMcBride
Thank you, guys, I've already taken what I needed.
Namespaces are easily determined from the page title prefix; I am not bothered if there are anomalies out there (e.g. a page starting with "User talk:" that is actually in NS 0), and the query is lighter if the namespace isn't pulled from the DB. Overall, it took about 17 seconds to write out the data for 1M revisions.
Setting a span of rev_ids was only there to take the data in chunks; I wouldn't have had to specify it if I wanted to take everything at once. But hey, such chunks are even easier to sort by rev_id.
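Roughly like this, just for illustration (these are not the exact bounds I used):

  SELECT page.page_title AS title, rev_user_text AS user, rev_timestamp AS timestamp, rev_len AS len
  FROM revision JOIN page ON page.page_id = rev_page
  WHERE rev_id > 0 AND rev_id < 1000000 AND rev_deleted = 0;
  -- next chunk: WHERE rev_id >= 1000000 AND rev_id < 2000000, and so on,
  -- until the highest rev_id is covered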
M
Михајло Анђелковић:
Namespaces are easily determined from the page title prefix; I am not bothered if there are anomalies out there (e.g. a page starting with "User talk:" that is actually in NS 0)
There are no page namespace prefixes in the databases. IOW, "Foo" and "Talk:Foo" both have page_title="Foo".
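For instance (hypothetical title):

  SELECT page_namespace, page_title FROM page WHERE page_title = 'Foo';
  -- can return one row per namespace the title exists in,
  -- e.g. (0, 'Foo') for [[Foo]] and (1, 'Foo') for [[Talk:Foo]]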
I can't believe that's what you want, since the output will be useless.
- river.
Hm, you are quite right. The data I've got doesn't include this info.
But I won't run such a query again soon, since what I have still does the job: for now I only want to establish when somebody left sr.wp and to record the reason by reviewing the talk and other relevant pages from that time.
Thank you both for the tip.
M