Thanks a lot for the help, Finn. My query can now draw a sample of newly registered editors.
Best,
Haifeng Zhang
________________________________
From: Wiki-research-l <wiki-research-l-bounces(a)lists.wikimedia.org> on behalf of
fn(a)imm.dtu.dk <fn(a)imm.dtu.dk>
Sent: Wednesday, March 13, 2019 12:01:59 PM
To: wiki-research-l(a)lists.wikimedia.org
Subject: Re: [Wiki-research-l] Sampling new editors in English Wikipedia
Haifeng,
On 13/03/2019 15:56, Haifeng Zhang wrote:
Thanks for pointing me to Quarry, Finn.
I tried a couple of queries, but I am not sure why they all took forever to return a result.
I am not familiar with Quarry. It might have a timeout. The user table
associated with the English Wikipedia is quite large, so any operation
on it may take a long time.
You might be able to get "timein" with a simplified SQL query. For instance,
the query below takes 52.35 seconds:
USE enwiki_p;
SELECT user_id, user_name, user_registration, user_editcount
FROM user
LIMIT 1000
OFFSET 32000000
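A side note on the query above: with a large OFFSET the server generally still has to step over all the skipped rows. One possible alternative, sketched here in Python only to build the SQL text, is to filter on user_id instead. This assumes user_id is the indexed primary key (as in the standard MediaWiki schema); the helper name page_query is hypothetical, the timing has not been measured, and because user IDs can have gaps, "user_id > 32000000" does not select exactly the same rows as "OFFSET 32000000".

```python
def page_query(after_user_id: int, page_size: int = 1000) -> str:
    """Build a query returning the next page of accounts after a given user_id.

    Hypothetical helper: it only formats SQL text and does not connect
    to any database.
    """
    return (
        "SELECT user_id, user_name, user_registration, user_editcount\n"
        "FROM enwiki_p.user\n"
        # Filter on the key instead of skipping rows with OFFSET.
        f"WHERE user_id > {int(after_user_id)}\n"
        "ORDER BY user_id\n"
        f"LIMIT {int(page_size)};"
    )


print(page_query(32000000))
```

To fetch the following page, pass the largest user_id from the previous result as after_user_id.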
Is it possible to download the relevant MediaWiki
database tables (e.g., user, user_groups, logging) and run SQL on my local machine?
There are SQL files available here
https://dumps.wikimedia.org/enwiki/20190301/ but I do not think the user
table is there; at least I cannot identify it. Perhaps other people
would know.
You might be able to try Toolforge
https://tools.wmflabs.org/ You
should be able to access the tables via mysql at the prompt.
Log in to
dev.tools.wmflabs.org
Then do "sql enwiki"
Read more about Toolforge here:
https://wikitech.wikimedia.org/wiki/Help:Toolforge
/Finn
Thanks,
Haifeng Zhang
________________________________
From: Wiki-research-l <wiki-research-l-bounces(a)lists.wikimedia.org> on behalf of
fn(a)imm.dtu.dk <fn(a)imm.dtu.dk>
Sent: Tuesday, March 12, 2019 7:25:53 PM
To: wiki-research-l(a)lists.wikimedia.org
Subject: Re: [Wiki-research-l] Sampling new editors in English Wikipedia
Haifeng,
While some suggest the dumps or notice boards, my immediate thought was
a database query, e.g., through Quarry. It just happens that Jonathan T.
Morgan has created a query there:
https://quarry.wmflabs.org/query/310
SELECT user_id, user_name, user_registration, user_editcount
FROM enwiki_p.user
WHERE user_registration > DATE_FORMAT(DATE_SUB(NOW(), INTERVAL 1 DAY), '%Y%m%d%H%i%s')
  AND user_editcount > 10
  AND user_id NOT IN (SELECT ug_user FROM enwiki_p.user_groups WHERE ug_group = 'bot')
  AND user_name NOT IN (
    SELECT REPLACE(log_title, "_", " ")
    FROM enwiki_p.logging
    WHERE log_type = "block" AND log_action = "block"
      AND log_timestamp > DATE_FORMAT(DATE_SUB(NOW(), INTERVAL 2 DAY), '%Y%m%d%H%i%s')
  );
You may fork from that query. As another example, there is R. Stuart
Geiger (Staeiou)'s fork, which queries for a month:
https://quarry.wmflabs.org/query/34256
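For the original request (about 100 new editors per month), the day-based filter in the query above could be adapted to a calendar month. The sketch below is one possible way to do that, not the method used in either Quarry query: it uses Python only to compute the month bounds and build the SQL text, and it assumes user_registration is stored as a YYYYMMDDHHMMSS timestamp string (as the DATE_FORMAT pattern above suggests). The helper names month_bounds and monthly_sample_query are hypothetical. ORDER BY RAND() has to evaluate every matching row, so it may be slow for busy months.

```python
from datetime import date
from typing import Tuple


def month_bounds(year: int, month: int) -> Tuple[str, str]:
    """Return the [start, end) bounds of a month as YYYYMMDDHHMMSS strings."""
    start = date(year, month, 1)
    # First day of the following month; December rolls over to January.
    end = date(year + 1, 1, 1) if month == 12 else date(year, month + 1, 1)
    return start.strftime("%Y%m%d") + "000000", end.strftime("%Y%m%d") + "000000"


def monthly_sample_query(year: int, month: int, n: int = 100) -> str:
    """Build a query drawing n random accounts registered in the given month.

    Hypothetical helper: it only formats SQL text and does not connect
    to any database.
    """
    start, end = month_bounds(year, month)
    return (
        "SELECT user_id, user_name, user_registration, user_editcount\n"
        "FROM enwiki_p.user\n"
        f"WHERE user_registration >= '{start}'\n"
        f"  AND user_registration < '{end}'\n"
        # Random order, then keep the first n rows.
        "ORDER BY RAND()\n"
        f"LIMIT {int(n)};"
    )


print(monthly_sample_query(2019, 2))
```

The bot and block exclusions from the query above could be appended to the WHERE clause in the same way if they are needed for the sample.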
Finn Årup Nielsen
http://people.compute.dtu.dk/faan/
On 12/03/2019 19:18, Haifeng Zhang wrote:
Hi folks,
My work requires randomly sampling new editors in each month, e.g., 100 editors per month.
Do any of you have good suggestions for how to do this efficiently?
I could think of using the dump files, but I wonder whether there are other options.
Thanks,
Haifeng Zhang
_______________________________________________
Wiki-research-l mailing list
Wiki-research-l(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l