One missing item is:


Submit an application to the IRB.





From: Wiki-research-l On Behalf Of siddhartha banerjee
Sent: Monday, 15 August 2016 8:17 AM
Subject: Re: [Wiki-research-l] Research on automatically created articles




Based on the discussion and suggestions on the Admin incidents page, I have gone through each of the articles that still exist and made the necessary corrections, both to the written content and to the unreliable sources. I have asked administrators to check whether my edits still have issues, and I will go back and change anything else that is required. I believe my advisor will post to this thread only later this week, so before that I wanted to summarize everything I learned during the discussion here and on the incidents page.


1. Multiple accounts policy: Do not use multiple user accounts to post content. 

2. Research ethics: There was a serious problem with the assumptions made (even by other researchers working in this area, as can be seen from the multiple papers mentioned). Furthermore, when our previous work was mentioned in the Wikimedia newsletter, nothing indicated to us that there were concerns about the legitimacy of this kind of research. Nevertheless, the assumptions we made on that basis were inappropriate. It is better to involve the WMF community by letting them know about any project before it starts and engaging them, so that the best decisions can be taken and similar situations do not arise.

As an administrator mentioned in the discussion, and I think this is very important to note: 'you not only denied the community the opportunity to decide whether we wish to allow/participate in this research, you precluded any efforts we might have made to minimize the disruption and affect a quick clean-up'.

Based on the last few emails, it seems that the IRB requirement is waived; however, that waiver should be stamped (and this should happen only after the community has been informed of the task; if a piece of research might cause disruption, it should not be done under any circumstances). Also, it would be better to create articles in a different namespace. The problem here was that clicking on red links went directly to the article creation markup page, whereas the content should have been put into draft space. Still, even creating drafts implies that other editors will look at them, which should not happen without prior consent. Testing of any content should be done offline and not on Wikipedia, as it can be disruptive. Even moderate-quality content wastes editors' time. I plan to bring all of this to the notice of the research committee that approved this work so that similar issues do not happen in the future. I also plan to write about this and share it with the wider community of people who have worked or are working on similar problems [I am not sure if they have already been contacted by someone from WMF]. I think it would be better if they could also be brought into the discussion.

One thing I would quote from the discussion on the incidents page: "Because researchers and institutions need to realize that this project is not a laboratory for their work, not unless they make an effort to work with the community". This is also very important.

My apologies for the extra work that numerous editors had to do to edit and clean up the content. That cannot be undone now, but it can definitely be prevented in the future. We have not added any content since February of this year and have promised in that discussion not to create anything more. If we want to do further analysis, we plan to use other crowdsourcing techniques (such as Amazon Mechanical Turk) to assess the quality of the generated content.


Please add anything you think I have missed, including anything regarding the clean-up, as I have tried to remove the irrelevant material from all the articles edited using those usernames.









On Fri, Aug 12, 2016 at 10:02 AM, siddhartha banerjee wrote:



My advisor, Prof. Mitra, is busy traveling this week. He said he will post his thoughts to this thread later next week.


Also, one thing he wanted me to mention here is the following: 

Although the content of the articles was generated by an algorithm, a human (I) took those articles and posted them online. We randomly chose a few articles and checked whether any objectionable content had been collected from the web, and we planned to remove it before posting on Wikipedia. We did not create a bot that went and created articles randomly. We generated the content offline and then copy-pasted the content of randomly selected articles. While we decided to remove objectionable content, we did not change sentences anywhere else, because that would have invalidated the check for linguistic consistency, which was our sole purpose. Also, it was done in 'good faith', and hence we worked on only a bare minimum of articles to get an idea, rather than letting a bot create random junk. Our algorithm is not capable of judging whether the cited references (found when we search on Google) are reliable, but we thought that reviewers on Wikipedia would remove content from such links, as well as the references themselves, if they were unreliable. Some references were indeed removed for such reasons (e.g., some articles were removed as promotional content, which our algorithm also cannot really determine).


Thanks for the comments here; we will keep them in mind if we do anything similar in the future, and I will try to inform other researchers who work in this area.


