On 10/29/07, Fawad Nazir <fawad.nazir(a)gmail.com> wrote:
I am trying to run the following query on the
Wikipedia data and this
query is running for the last 36 hours now. Can anyone please tell me
is this normal time for such queries on Wikipedia data?
. . .
select rev_page, rev_user, rev_user_text, count(rev_user)
from revision
where
rev_user IN
(
select r1.rev_user
from revision r1, revision r2
where r1.rev_user <> r2.rev_user and r1.rev_user <> 0 and r2.rev_user
<> 0 and r1.rev_page = r2.rev_page
group by r1.rev_user
)
group by rev_user
order by rev_page, rev_user;
Your outer query is attempting to scan the entire revision table, and
do a filesort on it, and that's even without considering the subquery.
You need to write more efficient queries if you want to expect them
not to take days.
Another questions is: Do we have some service where we
can query
offline wikipedia data and get our desired results?
No, because perfectly well-intentioned queries like the one you posted
would slow it down so much as to make it useless. Such a service did
exist, long ago, but was stopped once the databases got to some
reasonable size.