Message: 6
Date: Wed, 26 Oct 2011 11:11:57 +0100
From: Oliver Keyes <scire.facias@gmail.com>
Subject: Re: [Foundation-l] Office Hours on the article feedback tool
To: Wikimedia Foundation Mailing List <foundation-l@lists.wikimedia.org>
Message-ID: <CAPYupWA34CujYan_vV_cHgYxWfCT3EJnb4d-Nrav_U20QejZ1A@mail.gmail.com>
Content-Type: text/plain; charset=ISO-8859-1
No, the data will remain; you can find it at http://toolserver.org/~catrope/articlefeedback/ (we really need to advertise that more widely, actually).
To be clear, we're not talking about junking the idea; we will still have an "Article Feedback Tool" that lets readers provide feedback to editors. The goal is rather to move away from a subjective rating system and towards something editors can look at and go "huh, that's a reasonable suggestion as to how to fix the article, I'll go do that" or "aw, that's really nice! I'm glad they liked it so much".
O.
As someone who was never exactly a fan of the Article Feedback Tool, I'm glad to hear that the current version is to be canned. The sort of subjective ratings it produced were never going to be useful for improving articles, certainly not useful enough to justify the screen space. My fear was that it might divert people from improving articles to complaining about them; since we skipped a key stage in the testing, we will never know whether it did. I didn't realise at the time that it was going to abuse our readers' trust by collecting shed loads of data that we weren't going to use.
We took a big risk in implementing the Article Feedback Tool without first testing whether it would do more harm than good, and it is hard to tell in hindsight whether its effect has been negative or neutral. Yes, recruitment of new editors has fallen sharply - September's new editors on the English Wikipedia are down to levels not seen since 2005 (http://stats.wikimedia.org/EN/TablesWikipediaEN.htm#editdistribution) - but things were on the decline anyway, so we don't know whether, and to what extent, the Article Feedback Tool exacerbated the trend. My concern about turning it into something that collects more meaningful comments is that this could exacerbate the pernicious shift from improving articles to tagging them for others to improve. I appreciate that there are various competing theories as to why the community went off the boil circa 2007, but for me and anyone else who considers the tendency to template rather than improve articles a major cause of community decline, an "improved" version of the Article Feedback Tool is a worrying prospect.
Can we make sure that any new-generation Article Feedback Tool is properly tested, and that testing includes:
1. Implementing it on a random group of articles and comparing them with a control sample to see which group of articles had the more edits from newbies;
2. Whether the collecting of feedback on ways to improve the article generates additional comments or diverts some editors away from actually fixing the article;
3. Which group of articles recruited the most new editors to the pedia.
Please don't implement it if the testing shows that it diverts people from fixing articles to pointing out things that others can fix.
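To make the comparison in points 1-3 concrete, here is a minimal sketch of how such a randomised trial could be analysed (Python; assign_groups and newbie_edits are hypothetical names, real counts would come from the wiki databases, and a serious analysis would also control for traffic and topic):

    import random
    from scipy.stats import mannwhitneyu

    def assign_groups(article_ids, seed=42):
        # Randomly split articles into an AFT (treatment) group and a
        # no-AFT (control) group of equal size.
        rng = random.Random(seed)
        shuffled = list(article_ids)
        rng.shuffle(shuffled)
        half = len(shuffled) // 2
        return shuffled[:half], shuffled[half:]

    def compare_newbie_edits(treatment, control, newbie_edits):
        # newbie_edits maps article_id -> edits by newly registered users
        # during the trial window (hypothetical data source).
        t = [newbie_edits.get(a, 0) for a in treatment]
        c = [newbie_edits.get(a, 0) for a in control]
        # Non-parametric test, since edit counts per article are heavily skewed.
        stat, p = mannwhitneyu(t, c, alternative="two-sided")
        return stat, p

The same treatment/control split would serve for point 3 by counting first-time editors per article instead of edits.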
On a broader note, I suggested some time ago that, for the community to give meaningful input into development, we need a process for the community to give feedback on the priority of various potential developments. Wikimania does something like that in the way the programme is put together. The image filter "referendum" came close in that it asked people to rate the image filter for importance; unfortunately it didn't include other proposals, so people couldn't put them in order of relative importance (we also need a quite separate question on whether you think something is worth doing at all). In your new role as liaison between the community and the development team, please could you initiate something like that, so that those of us who would give a higher priority to global watchlists, or to enhancing Cat-a-lot so that it works on uncategorised articles, can say so?
Regards
WereSpielChequers
On Wed, 26 Oct 2011 15:33:30 +0100, WereSpielChequers <werespielchequers@gmail.com> wrote:

Can we make sure that any new-generation Article Feedback Tool is properly tested, and that testing includes:
1. Implementing it on a random group of articles and comparing them with a control sample to see which group of articles had the more edits from newbies;
2. Whether the collecting of feedback on ways to improve the article generates additional comments or diverts some editors away from actually fixing the article;
3. Which group of articles recruited the most new editors to the pedia.
Please don't implement it if the testing shows that it diverts people from fixing articles to pointing out things that others can fix.
And I think this time showing it to the Research Committee prior to running the tests would be a good idea.
Cheers,
Yaroslav
On Wed, Oct 26, 2011 at 3:46 PM, Yaroslav M. Blanter <putevod@mccme.ru> wrote:
[...]
And I think this time showing it to the Research Committee prior to running the tests would be a good idea.
Cheers,
Yaroslav
I'll bring these points up with the folks :). If you have any others, do come to office hours.
O.
Hi WereSpielChequers,
I worked on the data analysis for previous AFT versions, and I believe I've already answered your questions on a number of occasions as to what we could and couldn't test in the previous phase, but I am happy to do so again here and clarify what the research plans for the next version are.
Subjective ratings
We have definitely seen a lot of love/hate rating happen in the case of popular articles (e.g. Lady Gaga, Justin Bieber). Teasing apart ratings of the quality of the article from rater attitudes towards its topic is pretty hard, given that an average enwiki article gets a very small number of ratings per day, and that articles which do get a sufficient number of ratings tend to attract particularly opinionated or polarized visitors.
To give you a measure of the problem: of the 3.7M articles in the main namespace of the English Wikipedia, only 40 (about 0.001%) obtain 10 or more ratings per day. The vast majority of articles don't get any ratings for days or weeks, or ever. Finding ways to increase the volume of ratings per article is one of the issues we're discussing in the context of v.5.
The second problem is that we don't have enough observations of multiple ratings by the same user. Only 0.02% of unique raters rate more than one article, which means that on a single-article basis we cannot easily filter out users who only rated a topic they love or hate and still have enough good data to process. This is unfortunate: the more rating data we can get per rater, the better we can identify gaming or rating biases and control for them in public article feedback reports.
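A minimal sketch of that kind of filter, assuming a flat ratings table with rater_id and article_id columns (hypothetical schema and toy data):

    import pandas as pd

    # One row per rating event; schema is illustrative only.
    ratings = pd.DataFrame({
        "rater_id":   ["a", "a", "b", "c", "c"],
        "article_id": [1, 2, 3, 3, 4],
        "score":      [5, 4, 1, 5, 2],
    })

    # Keep only raters who rated more than one distinct article; raters
    # who rated a single page are more likely to be pure love/hate votes.
    articles_per_rater = ratings.groupby("rater_id")["article_id"].nunique()
    multi_raters = articles_per_rater[articles_per_rater > 1].index
    filtered = ratings[ratings["rater_id"].isin(multi_raters)]

With only 0.02% of raters rating more than one article, such a filter would of course discard almost all of the current data, which is exactly the problem described above.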
Effects of AFT on participation
I ran a number of pre/post analyses comparing editing activity before and after AFT was activated on a random sample of English Wikipedia articles, controlling for page views before and after the activation, and found no statistically significant difference in the volume of edits. As I noted elsewhere, comparing two random samples of articles is problematic because we cannot easily control for the multiple factors that affect editing activity in independent samples, so any result from such a coarse analysis would be questionable. I agree that this is a very important issue, and the proper way to address it is by A/B testing different AFT interfaces (including no AFT widget whatsoever) on the same articles and measuring the effects on edit activity across different user groups: this is one of the plans we are considering for v.5.
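For concreteness, a sketch of one such pre/post comparison (Python; the per-article counts are assumed to be extracted for equal-length windows before and after activation, and normalising edits by page views is just one simple way to control for traffic):

    from scipy.stats import wilcoxon

    def edit_rates(articles):
        # articles: list of dicts with hypothetical per-article counts for
        # equal-length windows before and after AFT activation.
        before, after = [], []
        for a in articles:
            # Normalise edit counts by page views to control for traffic shifts.
            before.append(a["edits_before"] / max(a["views_before"], 1))
            after.append(a["edits_after"] / max(a["views_after"], 1))
        return before, after

    def pre_post_test(articles):
        before, after = edit_rates(articles)
        # Paired non-parametric test of the per-article change in edit rate.
        stat, p = wilcoxon(before, after)
        return stat, p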
Another important limitation of AFT v.4 is that we only collected aggregate event counts for calls to action, and we didn't mark edits or new accounts created via AFT, which means that we couldn't directly study the effects of AFT as an on-ramping tool for new editors (e.g. how many readers it converts to registered users, and what the quality of edits generated via AFT is). How many users who create an account via an AFT call to action actually end up becoming editors? What is their survival compared to users who create an account in the standard way? And how many of the edits created via AFT are vandalism, and how many are good-faith tests that get reverted? These are all questions that we will be addressing as of v.5.
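If v.5 does tag accounts by their origin, the survival comparison could start as simply as this sketch (all field names hypothetical; 30-day activity stands in for whatever survival definition is eventually adopted):

    # Each account record: {"via_aft": bool, "active_after_30d": bool}
    def survival_by_origin(accounts):
        def rate(group):
            # Fraction of accounts still active 30 days after creation.
            return sum(a["active_after_30d"] for a in group) / len(group) if group else 0.0

        aft = [a for a in accounts if a["via_aft"]]
        standard = [a for a in accounts if not a["via_aft"]]
        return {"aft_survival": rate(aft), "standard_survival": rate(standard)}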
We'll still be working on analyzing the current AFT data to support the design of v.5. In particular, we will be focusing on (1) correlations between consistently low ratings and poor quality, vandalism, or the likelihood of an article being nominated for deletion, and (2) the relation between ratings and changes in other quality-related metrics on a per-article basis.
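One way to probe correlation (1), assuming a per-article mean rating and a binary nominated-for-deletion flag (toy numbers for illustration):

    from scipy.stats import pointbiserialr

    # Hypothetical per-article summaries.
    mean_ratings = [1.2, 4.5, 2.0, 3.8, 1.5]
    nominated    = [1, 0, 1, 0, 1]  # 1 = later nominated for deletion

    # Point-biserial correlation: continuous ratings vs. binary outcome.
    r, p = pointbiserialr(nominated, mean_ratings)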
I have also pitched the existing data to a number of external researchers interested in article quality measurements and/or rating systems, and I invite you to do the same.
Hope this helps. I look forward to a more in-depth discussion during the office hours.
Dario
On Oct 26, 2011, at 7:33 AM, WereSpielChequers wrote:
[...]