Hi folks,
For my work I need to randomly sample new editors each month, e.g., 100 editors per month.
Do any of you have good suggestions for how to do this efficiently?
I could use the dump files, but I wonder whether there are other options.
Thanks,
Haifeng Zhang
Hi, can you expand on what you mean by "sample"? If you're referring to analyzing users' edit histories, then that should be fine. However, if you're planning to send them surveys, messages, or barnstars, or otherwise manipulate their on-wiki experience, that would be problematic.
Pine ( https://meta.wikimedia.org/wiki/User:Pine )
_______________________________________________
Wiki-research-l mailing list
Wiki-research-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
Hi Pine,
Haifeng has a simple question about how to sample editors other than via dumps. It would be great if someone who knows the answer could help them move forward.
If you are interested in learning more about their research, rather than answering their question, my recommendation would be to start the conversation with a question like "Can you tell us more about your research?" I find the current way of communicating very speculative, and that is not good for building a vibrant research community that can help us address some of our big questions.
Best, Leila
There are a number of new-editor-heavy noticeboards. I would suggest posting an invite to your survey (or whatever) there. If you ask for editors' usernames, you can filter out those who don't meet your definition of 'new'.
I'm thinking of places like: https://en.wikipedia.org/wiki/Wikipedia:Teahouse and https://en.wikipedia.org/wiki/Wikipedia:Help_desk
cheers stuart
-- ...let us be heard from red core to black sky
Pine and Stuart,
I meant extracting a random sample of new editors (month by month) from the Wikipedia edit history.
It is not about surveying new editors, but thanks for your suggestions all the same.
Thanks, Haifeng Zhang
Postdoctoral Research Fellow, Human-Computer Interaction Institute, Carnegie Mellon University
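As a rough illustration of the month-by-month sampling described in this message, here is a minimal Python sketch. It assumes you already have each account's first-edit timestamp (for example, extracted from a dump). The function name `sample_new_editors`, the `per_month` and `seed` parameters, and the definition of "new" as "the account's first edit falls in that month" are assumptions for illustration, not something specified in this thread.

```python
import random
from collections import defaultdict


def sample_new_editors(first_edits, per_month=100, seed=0):
    """Draw up to `per_month` accounts at random for each calendar month.

    first_edits: dict mapping account name -> ISO 8601 timestamp of that
    account's first edit, e.g. "2019-03-01T12:34:56Z" (assumed input format).
    Returns a dict mapping "YYYY-MM" -> list of sampled account names.
    """
    # Group accounts by the month ("YYYY-MM") of their first edit.
    by_month = defaultdict(list)
    for user, timestamp in first_edits.items():
        by_month[timestamp[:7]].append(user)

    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    samples = {}
    for month, users in sorted(by_month.items()):
        users.sort()  # stable order, so the seed alone determines the draw
        k = min(per_month, len(users))  # months with few newcomers are kept whole
        samples[month] = rng.sample(users, k)
    return samples
```

Sampling without replacement within each month gives every new account in that month the same chance of selection; fixing the seed makes the draw repeatable if the input is unchanged.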
Hi Haifeng, thanks for the information. I think that your idea of looking in the dumps makes sense. Am I understanding correctly that you would like advice regarding how to do that in the most efficient way?
Hi Leila, I believe that I asked for more information regarding Haifeng's work. There has been discussion on English Wikipedia regarding volunteers being unhappy with the interventions or proposed interventions of researchers. I think that asking about the nature of Haifeng's research is legitimate, and I tried to provide some examples of possible types of research. I'm trying to protect the community from problematic interventions, while also welcoming research that is accepted by the community.
Pine ( https://meta.wikimedia.org/wiki/User:Pine )
Hey Haifeng, if you decide to process the dumps, you should be able to easily repurpose some quick code that I wrote for a similar project: https://github.com/geohci/miscellaneous-wikimedia/tree/master/editor-turnove...
Notably, I'd suggest using the stub history dumps, as they are much smaller because they do not include the actual content. For instance, for the March 1st dump of English Wikipedia (https://dumps.wikimedia.org/enwiki/20190301/), this file would be enwiki-20190301-stub-meta-history.xml.gz, which is 57.9 GB.
Best, Isaac
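A minimal sketch of how a stub history dump like the one mentioned above might be streamed to find each registered account's earliest edit. This is not the code from the linked repository; the function names are hypothetical, and the sketch assumes the usual MediaWiki XML export layout (page > revision > timestamp and contributor > username). Edits without a username, such as anonymous (IP) edits, are skipped.

```python
import gzip
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the XML namespace, e.g. '{http://...}revision' -> 'revision'."""
    return tag.rsplit("}", 1)[-1]


def first_edit_per_account(stream):
    """Return {username: earliest revision timestamp} from a dump stream."""
    first = {}
    context = ET.iterparse(stream, events=("start", "end"))
    _, root = next(context)  # grab the root so finished pages can be freed
    for event, elem in context:
        if event != "end":
            continue
        name = local_name(elem.tag)
        if name == "revision":
            timestamp = user = None
            for child in elem:
                child_name = local_name(child.tag)
                if child_name == "timestamp":
                    timestamp = child.text
                elif child_name == "contributor":
                    for sub in child:
                        if local_name(sub.tag) == "username":
                            user = sub.text
            # Timestamps share one ISO 8601 format, so string order is time order.
            if user and timestamp and (user not in first or timestamp < first[user]):
                first[user] = timestamp
            elem.clear()
        elif name == "page":
            root.clear()  # drop processed pages to keep memory bounded
    return first


def first_edit_from_gz(path):
    """Convenience wrapper for a gzip-compressed dump file on disk."""
    with gzip.open(path, "rb") as stream:
        return first_edit_per_account(stream)
```

Calling `first_edit_from_gz("enwiki-20190301-stub-meta-history.xml.gz")` would in principle work on the file named above, but expect a long run time on a file of that size, and note that the resulting dictionary holds one entry per account that has ever edited, which may itself need a lot of memory.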
Note that this code deals with accounts, whereas Haifeng asked about editors.
There are many reasons, both licit and illicit, for editors to have more than one account. I know I have more than ten for policy-compliant reasons.
cheers stuart
-- ...let us be heard from red core to black sky
Yes, thanks for the clarification Stuart. I don't know of any statistics to suggest how widespread this is, but it might be worth checking, especially if you are focusing on editors with higher edit counts (who I suspect are more likely to have multiple accounts for licit reasons).
-- Isaac Johnson -- Research Scientist -- Wikimedia Foundation
There are thousands and thousands of editors with multiple accounts. Those who have bothered to add the category are listed at https://en.wikipedia.org/wiki/Category:Wikipedians_with_alternative_accounts
Many editors who engage in outreach are advised to create new accounts for themselves regularly, simply because the experience of new account creation changes over time, and helping users streamline that (especially in situations such as editathons) requires thorough knowledge of account creation and the things that can make it go wrong. This is pretty much a prerequisite for the old accountcreator userright https://en.wikipedia.org/wiki/Wikipedia:Account_creator (which I've had on several occasions) and the new eventcoordinator userright https://en.wikipedia.org/wiki/Wikipedia:Event_coordinator (which is too new for me to have had yet).
cheers stuart -- ...let us be heard from red core to black sky
On Wed, 13 Mar 2019 at 10:40, Isaac Johnson isaac@wikimedia.org wrote:
Yes, thanks for the clarification Stuart. I don't know of any statistics to suggest how widespread this is, but it might be worth checking, especially if you are focusing on editors with higher edit counts (who I suspect are more likely to have multiple accounts for licit reasons).
On Tue, Mar 12, 2019 at 4:34 PM Stuart A. Yeates syeates@gmail.com wrote:
Note that this code deals with accounts, not editors, which is what Haifeng asked for.
There are many reasons, both licit and illicit for editors to have more than one account. I know I have more than ten for policy-compliant reasons.
cheers stuart
-- ...let us be heard from red core to black sky
On Wed, 13 Mar 2019 at 10:21, Isaac Johnson isaac@wikimedia.org wrote:
Hey Haifeng, If you decide to process the dumps, you should be able to easily
repurpose
some quick code that I wrote for a similar project:
https://github.com/geohci/miscellaneous-wikimedia/tree/master/editor-turnove...
Notably, I'd suggest using the stub history dumps as they are much
smaller
because they do not include the actual content. For instance, for March
1st
and English Wikipedia (https://dumps.wikimedia.org/enwiki/20190301/),
this
file would be enwiki-20190301-stub-meta-history.xml.gz and is 57.9 GB.
Best, Isaac
-- Isaac Johnson -- Research Scientist -- Wikimedia Foundation
This can be a good option too. Thanks, Isaac.
Haifeng Zhang
Postdoctoral Research Fellow Human-Computer Interaction Institute Carnegie Mellon University ________________________________ From: Wiki-research-l wiki-research-l-bounces@lists.wikimedia.org on behalf of Isaac Johnson isaac@wikimedia.org Sent: Tuesday, March 12, 2019 5:21:11 PM To: Research into Wikimedia content and communities Subject: Re: [Wiki-research-l] Sampling new editors in English Wikipedia
Hey Haifeng, If you decide to process the dumps, you should be able to easily repurpose some quick code that I wrote for a similar project: https://github.com/geohci/miscellaneous-wikimedia/tree/master/editor-turnove...
Notably, I'd suggest using the stub history dumps as they are much smaller because they do not include the actual content. For instance, for March 1st and English Wikipedia (https://dumps.wikimedia.org/enwiki/20190301/), this file would be enwiki-20190301-stub-meta-history.xml.gz and is 57.9 GB.
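For readers who go the dump route, the idea above can be sketched in Python. This is a minimal sketch under stated assumptions, not the code from the linked repository: it treats a "new editor" in a month as a registered username whose earliest revision in the dump falls in that month, and the function names (first_edit_timestamps, sample_per_month) are made up for illustration. It keeps one dictionary entry per username, so memory use on the full English Wikipedia dump may be substantial.

```python
import random
import xml.etree.ElementTree as ET
from collections import defaultdict


def local(tag):
    """Strip the XML namespace, e.g. '{http://...}revision' -> 'revision'."""
    return tag.rsplit("}", 1)[-1]


def first_edit_timestamps(stream):
    """Map each registered username to the timestamp of its earliest revision.

    `stream` is a binary file-like object containing MediaWiki export XML,
    for example the result of gzip.open(path, "rb") on a stub history dump.
    Anonymous (IP) revisions carry no <username> and are skipped.
    """
    first_seen = {}
    for _, elem in ET.iterparse(stream, events=("end",)):
        name = local(elem.tag)
        if name == "revision":
            ts, user = None, None
            for child in elem:
                tag = local(child.tag)
                if tag == "timestamp":
                    ts = child.text
                elif tag == "contributor":
                    for sub in child:
                        if local(sub.tag) == "username":
                            user = sub.text
            # Timestamps are ISO 8601 in UTC, so string comparison orders them.
            if ts and user and (user not in first_seen or ts < first_seen[user]):
                first_seen[user] = ts
        elif name == "page":
            elem.clear()  # free the finished page's revisions
    return first_seen


def sample_per_month(first_seen, n, seed=0):
    """Draw up to n usernames per 'YYYY-MM' of first edit, reproducibly."""
    by_month = defaultdict(list)
    for user, ts in first_seen.items():
        by_month[ts[:7]].append(user)
    rng = random.Random(seed)
    return {
        month: rng.sample(sorted(users), min(n, len(users)))
        for month, users in by_month.items()
    }
```

A call would look like sample_per_month(first_edit_timestamps(gzip.open("enwiki-20190301-stub-meta-history.xml.gz", "rb")), n=100), after importing gzip. Note that this samples by first edit, not by registration date, and that revisions absent from the dump are invisible to it.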
Best, Isaac
On Tue, Mar 12, 2019 at 3:56 PM Pine W wiki.pine@gmail.com wrote:
Hi Haifeng, thanks for the information. I think that your idea of looking in the dumps makes sense. Am I understanding correctly that you would like advice regarding how to do that in the most efficient way?
Hi Leila, I believe that I asked for more information regarding Haifeng's work. There has been discussion on English Wikipedia regarding volunteers being unhappy with the interventions or proposed interventions of researchers. I think that asking about the nature of Haifeng's research is legitimate, and I tried to provide some examples of possible types of research. I'm trying to protect the community from problematic interventions, while also welcoming research that is accepted by the community.
Pine ( https://meta.wikimedia.org/wiki/User:Pine )
On Tue, Mar 12, 2019 at 8:00 PM Haifeng Zhang haifeng1@andrew.cmu.edu wrote:
Pine and Stuart,
I meant extracting a random sample of new editors (month by month) from Wikipedia edit history.
It is not about a survey of new editors, but thanks anyway for your suggestions.
Thanks, Haifeng Zhang
Postdoctoral Research Fellow Human-Computer Interaction Institute Carnegie Mellon University ________________________________ From: Wiki-research-l wiki-research-l-bounces@lists.wikimedia.org on behalf of Stuart A. Yeates syeates@gmail.com Sent: Tuesday, March 12, 2019 3:46:19 PM To: Research into Wikimedia content and communities Subject: Re: [Wiki-research-l] Sampling new editors in English Wikipedia
There are a number of new-editor-heavy noticeboards. I would suggest posting an invite there to your survey (or whatever). If you ask for editors' usernames, you can filter out those who don't meet your definition of 'new'.
I'm thinking of places like: https://en.wikipedia.org/wiki/Wikipedia:Teahouse and https://en.wikipedia.org/wiki/Wikipedia:Help_desk
cheers stuart
-- ...let us be heard from red core to black sky
On Tue, Mar 12, 2019 at 1:56 PM Pine W wiki.pine@gmail.com wrote:
Hi Leila, I believe that I asked for more information regarding Haifeng's work.
You stated
"However, if you're planning to send surveys or messages to them, sending them barnstars, or otherwise manipulating their on-wiki experience, that would be problematic."
and I'm suggesting that you enter from a question angle, please.
There has been discussion on English Wikipedia regarding volunteers being unhappy with the interventions or proposed interventions of researchers. I think that asking about the nature of Haifeng's research is legitimate, and I tried to provide some examples of possible types of research.
Please check your email. There was no question there in the part related to this discussion. Also, even if there was a question posed, I highly recommend you enter from a different angle to these conversations. There are many reasons someone may need the sampled data of newcomers. A few examples: they may want to test the assumption whether the arrivals (registrations) to a specific Wikipedia language follow a Poisson process or not, they may want to learn about the distribution of topics editors in a given language edit in the first 24 hours after they open the account, they may want to build a prediction model to predict whether the editor will make the n-th edit or not given that they have started at time x, they may want to see whether external events have strong correlations with account registration and Wikipedia activity, they may want to see if the change to HTTPS had impact on registrations, etc. There are literally millions of questions people may ask (given that the data is available to them) with respect to Wikipedia. The answer to some of them may require interaction with Wikipedia editors, the answer to some may not. So the safest bet to start having a fruitful conversation is to ask: can you tell us more about what you're trying to do?
I'm trying to protect the community from problematic interventions, while also welcoming research that is accepted by the community.
I understand and I'm looking forward to having conversations with you all about how to achieve that.
Best, Leila
Leila, can we discuss this off list?
Thanks,
Pine ( https://meta.wikimedia.org/wiki/User:Pine )
Let's do it.
Haifeng,
While some suggest the dumps or noticeboards, my immediate thought was a database query, e.g., through Quarry. It just happens that Jonathan T. Morgan has created a query there:
https://quarry.wmflabs.org/query/310
SELECT user_id, user_name, user_registration, user_editcount
FROM enwiki_p.user
WHERE user_registration > DATE_FORMAT(DATE_SUB(NOW(), INTERVAL 1 DAY), '%Y%m%d%H%i%s')
  AND user_editcount > 10
  AND user_id NOT IN (SELECT ug_user FROM enwiki_p.user_groups WHERE ug_group = 'bot')
  AND user_name NOT IN (
    SELECT REPLACE(log_title, "_", " ")
    FROM enwiki_p.logging
    WHERE log_type = "block" AND log_action = "block"
      AND log_timestamp > DATE_FORMAT(DATE_SUB(NOW(), INTERVAL 2 DAY), '%Y%m%d%H%i%s')
  );
You may fork that query. As another example, R. Stuart Geiger (Staeiou) has a fork here that queries by month: https://quarry.wmflabs.org/query/34256
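To adapt such a query to "100 editors per month", one option is to generate the SQL for a given month from a small script. The sketch below is an assumption-laden illustration, not Jonathan's or Stuart's query: the helper names (month_bounds, monthly_sample_query) are hypothetical, it selects by registration date rather than first edit, and it relies on MediaWiki storing timestamps as YYYYMMDDHHMMSS strings. ORDER BY RAND() has to look at every row that matches the month filter, so the generated query may still be slow on a table as large as the English Wikipedia user table.

```python
from datetime import date


def month_bounds(year, month):
    """Return (start, end) as MediaWiki timestamps; end is exclusive."""
    start = date(year, month, 1)
    end = date(year + 1, 1, 1) if month == 12 else date(year, month + 1, 1)
    return start.strftime("%Y%m%d000000"), end.strftime("%Y%m%d000000")


def monthly_sample_query(year, month, n=100):
    """Build SQL that draws n random non-bot accounts registered in a month."""
    start, end = month_bounds(year, month)
    return (
        "SELECT user_id, user_name, user_registration, user_editcount\n"
        "FROM enwiki_p.user\n"
        f"WHERE user_registration >= '{start}'\n"
        f"  AND user_registration < '{end}'\n"
        "  AND user_id NOT IN "
        "(SELECT ug_user FROM enwiki_p.user_groups WHERE ug_group = 'bot')\n"
        "ORDER BY RAND()\n"
        f"LIMIT {int(n)};"
    )


# Example: the text of the query for March 2019, ready to paste into Quarry.
march_2019_sql = monthly_sample_query(2019, 3, n=100)
```

The output is only a query string; running it still requires Quarry or another connection to the database replicas.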
Finn Årup Nielsen http://people.compute.dtu.dk/faan/
Thanks for pointing me to Quarry, Finn.
I tried a couple of queries, but I'm not sure why they all took forever to return results.
Is it possible to download the relevant MediaWiki database tables (e.g., user, user_groups, logging) and run SQL on my local machine?
Thanks,
Haifeng Zhang
Haifeng,
On 13/03/2019 15:56, Haifeng Zhang wrote:
Thanks for pointing me to Quarry, Finn.
I tried a couple of queries, but I'm not sure why they all took forever to return results.
I am not familiar with Quarry. It might have a timeout. The user table associated with the English Wikipedia is quite large, so any operation on it may take a long time.
You might be able to stay within the time limit with a simplified SQL query. For instance, the query below takes 52.35 seconds:
USE enwiki_p;
SELECT user_id, user_name, user_registration, user_editcount FROM user LIMIT 1000 OFFSET 32000000
Is it possible to download the relevant MediaWiki database tables (e.g., user, user_groups, logging) and run SQL on my local machine?
There are SQL files available here https://dumps.wikimedia.org/enwiki/20190301/ but I do not think the user table is there; at least, I cannot identify it. Perhaps other people would know.
You might be able to try Toolforge https://tools.wmflabs.org/ where you should be able to access the tables via mysql at the prompt.
Log in to dev.tools.wmflabs.org, then run "sql enwiki".
Read more about Toolforge here: https://wikitech.wikimedia.org/wiki/Help:Toolforge
/Finn
Thanks a lot for the help, Finn. Now my query can draw a sample of newly registered editors.
Best,
Haifeng Zhang
On Thu, 14 Mar 2019 at 09:16, Haifeng Zhang haifeng1@andrew.cmu.edu wrote:
Thanks a lot for the help, Finn. Now my query can draw a sample of newly registered editors.
To repeat a point I made earlier in the thread: this query deals with accounts, not editors. Many at the coalface consider this to be a very important difference. You appear not to have shared enough of your research project for us to tell whether it's going to matter for you.
cheers stuart
Stuart,
I'm building an agent-based simulation of Wikipedia collaboration.
I would like my model to be empirically grounded, so I need to collect data for new editors.
Alternative accounts can be an issue. Is there a way to identify editors who have multiple accounts?
Thanks,
Haifeng Zhang
Hi Haifeng,
Some users will state on user pages that an account is an alternate account. However, this practice is not followed by everyone, and those who do follow this practice aren't required to do so in a uniform way.
Alternate accounts which are not labeled as such, and which are used for illegitimate purposes such as double voting, are an ongoing problem. You might be interested in the English Wikipedia page https://en.wikipedia.org/wiki/Wikipedia:Sock_puppetry.
Alternate accounts can also be used for legitimate purposes, such as people who have one account for their professional or academic activities and another account for their personal use.
Good luck with your project.
Pine ( https://meta.wikimedia.org/wiki/User:Pine )
Apart from the legitimate alternate accounts and the illegitimate sockpuppet accounts, there are other ways that alternate accounts exist.
Occasional contributors often forget their username and/or password. Password recovery isn't possible unless you provide an email address at sign-up (it's optional, but you can add it later). So what such people then do is just create a new user account (I'm not sure there is anything else they can do). I see this sort of behaviour a lot at events. The other variation of the problem is that they did provide an email address but it is one not easily accessible to them at the event (i.e. a librarian who signed up with a work email address that cannot be accessed outside of the organisation).
The other group of people with multiple accounts are those who edit anonymously as serial IPs. The same person can use a number of IP addresses over time. Often you don't realise it is the same person unless you see a lot of their work and can see a pattern in it. For example, at the moment I notice on my watchlist a person with a series of IP accounts who is changing a common section of Queensland place articles to be a subsection of another. This person appears to acquire a new IP address every week or so, but the pattern of editing makes it obvious it's the same person behind it. Whether or not an IP address can be considered "an account" depends on your purposes. A single IP address can also be used by multiple people (e.g. coming through a shared organisational network in a library or school). Some people claim that many new users do their first edits anonymously, so if you are serious about studying "new contributors", then maybe you have to look at anonymous editing. Even regular contributors may sometimes choose to edit anonymously, e.g. when they are in an insecure IT environment and reluctant to use their username/password in that situation (particularly people with administrator or other significant access rights).
Because I do outreach, I look for new accounts that turn up on my watchlist and send them welcome messages etc. Because I also do training, I see a lot of genuinely new people in action where I can observe their edits. So when I see new accounts or IPs doing far more "sophisticated" edits than I see new users do, I tend to suspect they are not genuinely new contributors.
I think the best you can do is look for new accounts and be prepared to omit any that show signs of sophisticated editing (either in terms of what they are doing technically or what they say on Talk pages or in edit summaries). For example, no genuine new user will mention a policy (they don't know that policies exist). Also, genuine new users don't tend to edit that quickly, so any rapid-fire series of successful edits is unlikely to come from a genuine new user. I think this inability to know whether a new account represents a genuinely new user is an inherent limitation for your research and should be documented as such, explaining the many circumstances in which new accounts might belong to non-new users.
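The two signals described above (policy mentions and rapid-fire editing) could be turned into a rough screening pass along the following lines. This is only a sketch: the function name looks_experienced, the regular expression for "WP:" and "MOS:" shortcuts, and the thresholds of 5 edits within 5 minutes are all assumptions chosen for illustration, not values from this thread, and any real study would need to calibrate and document them.

```python
import re
from datetime import datetime, timedelta

# Assumption: a policy or guideline mention shows up as a shortcut such as
# "WP:NPOV" or "MOS:DATE" in an edit summary.
POLICY_HINT = re.compile(r"\b(?:WP|MOS):[A-Z]", re.IGNORECASE)


def looks_experienced(edits, burst_size=5, burst_window=timedelta(minutes=5)):
    """Flag an account whose early edits look too sophisticated for a newcomer.

    `edits` is a list of (timestamp, summary) pairs, where timestamp is a
    datetime and summary is the edit summary text (possibly empty or None).
    """
    edits = sorted(edits, key=lambda e: e[0])

    # Signal 1: any edit summary that cites a policy shortcut.
    if any(POLICY_HINT.search(summary or "") for _, summary in edits):
        return True

    # Signal 2: burst_size consecutive edits packed into burst_window.
    times = [t for t, _ in edits]
    for i in range(len(times) - burst_size + 1):
        if times[i + burst_size - 1] - times[i] <= burst_window:
            return True
    return False
```

Accounts flagged this way would be candidates for exclusion or manual review rather than confirmed non-new users, since a heuristic like this will produce both false positives and false negatives.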
Kerry
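[Editorial sketch] The screening idea in Kerry's message (drop new accounts whose edit summaries mention a policy, or that make a rapid-fire series of edits) could be roughed out in Python as below. This is only a minimal sketch under stated assumptions: the function name looks_experienced, the policy-shortcut pattern, and the thresholds of 5 edits within 10 minutes are all hypothetical choices, not values from the thread, and the check does not look at whether the edits were "successful" (i.e. not reverted).

```python
import re
from datetime import datetime, timedelta

# Assumption: policy mentions show up as shortcuts such as "WP:NPOV" or "MOS:DATE".
POLICY_SHORTCUT = re.compile(r"\b(?:WP|MOS):[A-Z]+")


def looks_experienced(edits, burst_size=5, burst_window=timedelta(minutes=10)):
    """Return True if a new account's edits show signs of prior experience.

    edits is a list of (datetime, edit_summary) tuples for one account.
    The thresholds are illustrative defaults, not empirically derived.
    """
    edits = sorted(edits)  # oldest first

    # Signal 1: an edit summary cites a policy or guideline shortcut.
    if any(POLICY_SHORTCUT.search(summary) for _, summary in edits):
        return True

    # Signal 2: burst_size edits fall inside one burst_window.
    times = [timestamp for timestamp, _ in edits]
    for i in range(len(times) - burst_size + 1):
        if times[i + burst_size - 1] - times[i] <= burst_window:
            return True

    return False
```

Accounts flagged this way would be candidates for exclusion rather than confirmed non-new users, so the number excluded is worth reporting alongside the sample.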
-----Original Message----- From: Wiki-research-l [mailto:wiki-research-l-bounces@lists.wikimedia.org] On Behalf Of Pine W Sent: Tuesday, 19 March 2019 5:27 AM To: Research into Wikimedia content and communities wiki-research-l@lists.wikimedia.org Subject: Re: [Wiki-research-l] Sampling new editors in English Wikipedia
Hi Haifeng,
Some users will state on user pages that an account is an alternate account. However, this practice is not followed by everyone, and those who do follow this practice aren't required to do so in a uniform way.
Alternate accounts which are not labeled as such, and which are used for illegitimate purposes such as double voting, are an ongoing problem. You might be interested in the English Wikipedia page https://en.wikipedia.org/wiki/Wikipedia:Sock_puppetry.
Alternate accounts can also be used for legitimate purposes, such as people who have one account for their professional or academic activities and another account for their personal use.
Good luck with your project.
Pine ( https://meta.wikimedia.org/wiki/User:Pine )
On Thu, Mar 14, 2019 at 1:30 PM Haifeng Zhang haifeng1@andrew.cmu.edu wrote:
Stuart,
I'm building an agent-based simulation of Wikipedia collaboration.
I would like my model to be empirically grounded, so I need to collect data for new editors.
Alternative accounts can be an issue, but I wonder whether there is a way to identify editors who have multiple accounts?
Thanks,
Haifeng Zhang
In addition to Kerry's excellent examples, there are users editing Wikipedia through Tor, the anonymity and censorship circumvention network. These users face extra scrutiny.
cheers stuart
-- ...let us be heard from red core to black sky
On Tue, 19 Mar 2019 at 13:04, Kerry Raymond kerry.raymond@gmail.com wrote:
Apart from the legitimate alternate accounts and the illegitimate sockpuppet accounts, there are other ways that alternate accounts exist.
Occasional contributors often forget their username and/or password. Password recovery isn't possible unless you have provided an email address (it's optional at sign-up, but you can add it later). So what such people then do is just create a new user account (I'm not sure there is anything else they can do). I see this sort of behaviour a lot at events. The other variation of the problem is that they did provide an email address but it is one not easily accessible to them at the event (e.g. a librarian who signed up with a work email address that cannot be accessed outside of the organisation).
Does anybody know how prevalent sockpuppets are? Has anybody tried estimating the percentage of editors that have created at least one additional account (legitimate or otherwise)?
Giovanni
The thing about sockpuppets is that we only know about the ones that have been detected (and some of them have been large groups of hundreds of accounts). The problem is that we don't know about the undetected ones. I am sure many of us have had suspicions about the behaviour of certain accounts, but requesting a sockpuppet investigation requires a level of evidence above suspicious behaviour (specifically, identifying another account). New users with sophisticated editing skills who write in a positive way on topics associated with living individuals, businesses or products often seem to me to be the kind of account likely to be doing undisclosed paid editing, and therefore almost certainly a sockpuppet of a paid PR person. But if each account writes about a different topic, it is difficult to work out what the other accounts might be in order to look for evidence of sockpuppeting.
How far underwater does the iceberg go?
Kerry
Thanks Kerry.
You raise a valid point: focusing only on detected cases may not be representative or indicative of how widespread the behavior is.
Has anybody ever thought about running a survey with a representative sample of registered editors? Given the nature of the behavior, I imagine the responses would still be affected by social desirability bias, but it could at least be a starting point for a slightly less biased estimate.
Giovanni Luca Ciampaglia ∙ glciampaglia.com Assistant Professor Computer Science and Engineering https://www.usf.edu/engineering/cse/ ∙ University of South Florida https://www.usf.edu/News New email address: glc3@mail.usf.edu Hoaxy Botometer: Check out our new tool: https://hoaxy.iuni.iu.edu/
A quick and dirty solution might be to use the HostBot list from the Teahouse at https://en.wikipedia.org/wiki/Wikipedia:Teahouse/Hosts/Database_reports. The list is regularly refreshed, so you could pull the account names from there over the course of a month and then randomly select your sample, noting that it is biased towards new editors who have made more than 10 edits.
Otherwise perhaps using recent changes, but filtering for logged actions by new users? https://en.wikipedia.org/wiki/Special:RecentChanges?userExpLevel=newcomer&am...
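[Editorial sketch] As one possible alternative to the dumps, the account-creation log can be read through the MediaWiki Action API (list=logevents with letype=newusers) and a sample drawn from the result. The Python below is a rough sketch under stated assumptions, not a tested pipeline: the function names fetch_new_accounts and sample_editors are hypothetical, accounts auto-created by a login from another wiki are skipped as a judgment call, and a newly registered account is not necessarily a new editor, since many accounts never edit.

```python
import json
import random
import urllib.parse
import urllib.request

API_URL = "https://en.wikipedia.org/w/api.php"


def fetch_new_accounts(start, end, user_agent):
    """Yield names of accounts created between two ISO 8601 timestamps.

    Makes live requests to the API; pass a descriptive user_agent.
    """
    params = {
        "action": "query",
        "format": "json",
        "list": "logevents",
        "letype": "newusers",
        "lestart": start,
        "leend": end,
        "ledir": "newer",
        "lelimit": "max",
        "leprop": "title|type|timestamp",
    }
    while True:
        url = API_URL + "?" + urllib.parse.urlencode(params)
        request = urllib.request.Request(url, headers={"User-Agent": user_agent})
        with urllib.request.urlopen(request) as response:
            data = json.load(response)
        for event in data["query"]["logevents"]:
            title = event.get("title")  # missing when the entry is hidden
            if title and event.get("action") != "autocreate":
                # The log title is the new account's user page, e.g. "User:Example".
                yield title.split(":", 1)[1]
        if "continue" not in data:
            break
        params.update(data["continue"])  # follow API pagination


def sample_editors(names, k, seed=None):
    """Deduplicate names and draw up to k of them uniformly at random."""
    unique = sorted(set(names))  # sorted so a fixed seed gives a repeatable draw
    rng = random.Random(seed)
    return rng.sample(unique, min(k, len(unique)))
```

For a monthly sample of 100, one might call sample_editors(fetch_new_accounts("2019-03-01T00:00:00Z", "2019-04-01T00:00:00Z", "my-research-script (contact address here)"), 100, seed=2019), where the user agent string and seed are placeholders. The same sample_editors function works on names pooled from the Teahouse report snapshots, with the bias noted above.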