One missing item is:
Submit an application to the IRB.
Kerry
From: Wiki-research-l [mailto:wiki-research-l-bounces@lists.wikimedia.org] On Behalf Of
siddhartha banerjee
Sent: Monday, 15 August 2016 8:17 AM
To: wiki-research-l(a)lists.wikimedia.org
Subject: Re: [Wiki-research-l] Research on automatically created articles
Hello,
Based on the discussion and suggestion in the Admin incidents page:
https://en.wikipedia.org/wiki/Wikipedia:Administrators%27_noticeboard/Incid…,
I have gone through each of the articles that still exist and made the necessary
corrections, both to the written content and to unreliable sources. I have asked
administrators to check whether my edits still have issues, and I will go back and
change anything else required. I guess my advisor will be posting to this thread only
later this week, so before that I wanted to summarize all that I learnt during the
discussion here and on the incidents page.
1. Multiple accounts policy: Do not use multiple user accounts to post content.
2. Research ethics: There was a serious issue with the assumptions made (even by other
researchers who work in this area, as can be seen from the multiple papers mentioned).
Furthermore, when our previous work
(
https://en.wikipedia.org/wiki/Wikipedia:Wikipedia_Signpost/2015-01-28/Recen…)
was mentioned in the Wikimedia newsletter, nothing indicated to us that there were
issues with the legitimacy of this kind of research. Even so, the assumptions we drew
from that were inappropriate. It is better to involve the WMF community by letting them
know about any project before it starts and engaging them, so that the best decisions can
be taken and similar situations do not arise.
As an administrator mentioned in the discussion, which I think is very important to note:
'you not only denied the community the opportunity to decide whether we wish to
allow/participate in this research, you precluded any efforts we might have made to
minimize the disruption and affect a quick clean-up'.
Based on the last few emails, it seems that IRB review is waived; however, that waiver
should be stamped (and only after the community has been informed of the task; if
research might cause some disruption, it should not be done at any cost). Also, it would
be better to create articles in a different namespace. The problem here was that clicking
on red links went directly to the article creation markup page; the content should have
been put into draft space instead. But even creating drafts implies that other editors
will be looking at them, which should not happen without prior consent. Testing of any
content should be done offline, not on Wikipedia, as it can be disruptive. Even
moderate-quality content wastes editors' time. I plan to bring all of this to
the notice of the research committee that approved this work so that similar issues
do not happen in the future. I also plan to write about this and share it with the wider
community of people who have worked or are working on similar problems (I am not sure
whether they have already been contacted by someone from WMF). I think it would be better
if they could also be brought into the discussion.
One thing I would quote from the discussion on the incidents page: "Because researchers
and institutions need to realize that this project is not a laboratory for their work, not
unless they make an effort to work with the community". This is also very
important.
My apologies for the extra work that numerous editors had to do to edit the
content and clean it up. That cannot be undone now, but it can definitely be avoided in
future. We have not added any content since February this year and have promised in that
discussion not to create anything more. If we want to do further analysis, we plan to use
other crowdsourcing techniques (such as Amazon Mechanical Turk) to assess the quality of
the generated content.
Please add anything you think I have missed, including about the clean-up; I have
tried to remove the irrelevant material from all the articles edited using the usernames.
Thanks,
Sidd
On Fri, Aug 12, 2016 at 10:02 AM, siddhartha banerjee <sidd2006(a)gmail.com> wrote:
Hi,
My advisor, Prof. Mitra, is traveling this week. He said he will post his thoughts to
this thread later next week.
Also, one thing he wanted me to mention here is the following:
Although the content of the articles was generated by an algorithm, a human (me) took
those articles and posted them online. We randomly chose a few articles and checked whether
any objectionable content had been collected from the web; we planned to remove any such
content before posting to Wikipedia. We did not create a bot that went and created
articles at random. We generated the content offline and then copy-pasted the content of
the randomly selected articles. While we decided to remove objectionable content, we did
not make any other changes to the sentences, because that would have defeated the check
for linguistic consistency, which was our sole purpose. Also, this was done in 'good
faith', and hence we worked on only a bare minimum of articles to get an idea, rather than
letting a bot create random junk. Our algo is not capable of judging whether the cited
references (found by searching on Google) are reliable, but we thought that reviewers
on Wikipedia would remove content from such links, as well as the references themselves,
if they were unreliable. While some references were removed for such reasons (e.g.
https://en.wikipedia.org/wiki/Atripliceae), some articles were removed as
promotional content (which our algo cannot really determine either).
Thanks for the comments here; we will keep them in mind if we do anything similar to this
in the future, and I will try to inform other researchers who work in this area.
Thanks,
Sidd