Ditty,
Article quality is inherently subjective in the hard-AI sense. A panel of judges will consider accurate articles full of spelling, grammar, and formatting errors superior in quality to hoax, biased, spam, or out-of-date articles with perfect grammar, impeccable spelling, and immaculate formatting.
In my studies of the short popular vital articles (WP:SPVA), the feature that correlates most closely with subjective mean-opinion-score quality so far is sentence length. But it has diminishing returns, and the raw correlation is +0.2 at best.
The entirely subjective nature of article quality is additional support for automating accuracy review.
Best regards, James
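For readers who want to reproduce this kind of figure, here is a minimal sketch of a raw Pearson correlation between a per-article feature (such as mean sentence length) and a mean opinion score. The function name and the sample numbers are hypothetical illustrations; they are not data from the SPVA study.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of co-deviations and sums of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Hypothetical values for five articles (illustration only, not study data).
sentence_lengths = [12.0, 18.5, 22.0, 15.0, 30.0]  # mean words per sentence
opinion_scores = [3.0, 3.5, 3.0, 2.5, 4.0]         # mean opinion score
r = pearson_r(sentence_lengths, opinion_scores)
```

A value near +0.2 on real data would indicate only a weak linear relationship, which is consistent with the caution above.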
And just to add to the complexity of James' comments: there are some people who think that a general-interest encyclopaedia should be written for a general audience, so articles with long sentences should be improved by rewriting them into more, shorter sentences.
On 24 October 2014 19:44, James Salsman jsalsman@gmail.com wrote:
[...]
Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
On Sat, Oct 25 2014, WereSpielChequers wrote:
And just to add to the complexity of James' comments: there are some people who think that a general-interest encyclopaedia should be written for a general audience, so articles with long sentences should be improved by rewriting them into more, shorter sentences.
How about an even simpler version of the problem: an encyclopedia written by robots for robots. I speak, of course, of DBpedia. We could equally ask, what makes for quality entries there?
Hello Ditty,
It is difficult for me to understand your question unless you are more specific about what you consider a "poorly written article". "Poorly" can refer here to many different things, like readability, grammar, balance, statements supported by 'sources', good division of knowledge over several articles, etc.
I think that software tools can only give a hint, and that the judgement (how "good" is an article) can be made only by a human, on the basis of concrete criteria for what is meant by "good", and for what target group. I tend to say that some Wikipedia articles are "good" for experts but at the same time unsuitable for the general public.
E.g., a software tool can count the words per sentence, but long sentences are not necessarily good or bad by themselves.
Etc. :-)
Kind regards Ziko
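The words-per-sentence count mentioned above can be sketched in a few lines. This is only an illustration with a deliberately naive sentence splitter; the function name is hypothetical, and the result says nothing by itself about whether the sentences are good or bad.

```python
import re


def mean_words_per_sentence(text):
    """Average number of whitespace-separated words per sentence."""
    # Naive split: a sentence ends at '.', '!' or '?' followed by whitespace.
    # Abbreviations such as "e.g. this" will be split incorrectly.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    if not sentences:
        return 0.0
    word_counts = [len(s.split()) for s in sentences]
    return sum(word_counts) / len(sentences)


sample = "One two three. Four five."
average = mean_words_per_sentence(sample)
```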
2014-10-25 1:47 GMT+02:00 Joe Corneli holtzermann17@gmail.com:
[...]
Hi Ziko,
You are right. But if an article has very little content, or few references, few edits, few images, few links, etc., it is of poor quality. Based on these factors, we can estimate the quality of an article to some extent.
with regards
Ditty
On Sat, Oct 25, 2014 at 8:23 AM, Ziko van Dijk zvandijk@gmail.com wrote:
[...]
Okay. What do you think of the wikibu tool from Switzerland? It treats the number of editors, readers, etc. as indicators of quality, or at least as a basis for discussion.
Kind regards Ziko
http://www.wikibu.ch/search.php?search=Frankfurter+Nationalversammlung
2014-10-25 14:44 GMT+02:00 Ditty Mathew dittyvkm@gmail.com:
[...]
I think it's pointless to argue over what we mean by "quality" or "well written" in general. It is fair to say that there are a lot of mechanically derivable metrics for articles including:
* number of citations
* number of unique citations
* article length
* density of citations, unique citations relative to article length
* ditto for photos, infoboxes, navbox, categories etc
* linguistic analysis like sentence length, Flesch-Kincaid readability scores
* Age of article
* Number of editors
* Number of page views
* Density of ...
* number of reverts
* reverts per editor/year/etc ..
* number of inbound links, number of outbound links, number of redlinks
* manual quality assessments (usually in project tags)
* presence of "issue" tags, e.g. refimprove, citation needed, etc
It seems to me that if we had a tool that could generate a wide range of these sorts of metrics, folks could then put their own algorithm over the top to compute and weight whatever combination of them makes sense for their particular purpose.
Kerry
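As a minimal sketch of the idea above, the following computes a handful of the listed metrics from raw wikitext and lets the caller supply their own weights. Everything here is an assumption made for illustration: the function names, the regular expressions (which only approximate real wikitext parsing), the metric names, and the example weights are hypothetical and do not describe any existing tool.

```python
import re


def article_metrics(wikitext):
    """Compute a few mechanically derivable metrics from raw wikitext."""
    words = len(wikitext.split())
    # Opening <ref> tags, including named and self-closing ones;
    # "<references />" is not counted because 'e' follows "<ref".
    citations = len(re.findall(r"<ref[\s>/]", wikitext, flags=re.IGNORECASE))
    # Every "[[" starts an internal link (this also counts files and categories).
    wikilinks = wikitext.count("[[")
    # Two example "issue" templates; a real tool would need a longer list.
    issue_tags = len(
        re.findall(r"\{\{\s*(?:refimprove|citation needed)\b", wikitext, flags=re.IGNORECASE)
    )
    density = 100.0 * citations / words if words else 0.0
    return {
        "words": words,
        "citations": citations,
        "citations_per_100_words": density,
        "wikilinks": wikilinks,
        "issue_tags": issue_tags,
    }


def weighted_score(metrics, weights):
    """Combine metrics using caller-chosen weights (KeyError on unknown names)."""
    return sum(weight * metrics[name] for name, weight in weights.items())


sample = (
    "Paris is the capital of [[France]].<ref>Source A</ref> "
    "It lies on the [[Seine]].{{citation needed}}"
)
metrics = article_metrics(sample)
# Hypothetical weights: reward citation density, penalise issue tags.
score = weighted_score(metrics, {"citations_per_100_words": 1.0, "issue_tags": -2.0})
```

Keeping metric extraction separate from scoring means different users can reuse the same metrics with different weightings for their own purposes.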
-----Original Message----- From: wiki-research-l-bounces@lists.wikimedia.org [mailto:wiki-research-l-bounces@lists.wikimedia.org] On Behalf Of Ziko van Dijk Sent: Saturday, 25 October 2014 11:28 PM To: Research into Wikimedia content and communities Subject: Re: [Wiki-research-l] Tool to find poorly written articles
[...]
I agree with this *so much*. Give us infrastructure to make views, and we'll use it to make amazing things!
On 25 Oct 2014, at 21:41, "Kerry Raymond" kerry.raymond@gmail.com wrote:
[...]
I think it may be a while yet, but there are several open-source, Watson-like machine-reading tools coming along, any one of which could be put to the task of interest here. They could do more than that, but it's a start.
On Sat, Oct 25, 2014 at 1:41 PM, Kerry Raymond kerry.raymond@gmail.com wrote:
[...]
My "Wikipedia research and tools: Review and comments." has a section on "Automated quality tools" starting on page 35 in http://www2.compute.dtu.dk/pubdb/views/edoc_download.php/6012/pdf/imm6012.pd...
It is far from covering all papers on the issue, but might help a bit.
For vandalism detection there are quite a few suggestions for quality/vandalism features. There is a good overview of some of the systems here:
Martin Potthast, Benno Stein, and Teresa Holfeld. Overview of the 1st international competition on Wikipedia vandalism detection. In PAN 2010, 2010. http://www.inf.u-szeged.hu/~ihegedus/publ/clef2010labs_submission_126.pdf
I suppose some of the features might work for more general quality assessment.
best regards Finn Årup Nielsen http://people.compute.dtu.dk/faan/
On 10/25/2014 10:41 PM, Kerry Raymond wrote:
[...]
On Sat, Oct 25 2014, Ditty Mathew wrote:
Hi Ziko,
You are right. But if an article has very little content, or few references, few edits, few images, few links, etc., it is of poor quality. Based on these factors, we can estimate the quality of an article to some extent.
To some extent I would agree with you, and there's a comparison of just this nature on pp. 96-98 of my thesis (http://oro.open.ac.uk/40775/).
However, the classic Hannah Arendt [1] vs Pamela Anderson [2] example seems like it might be a challenge: I'd be curious to know which one of those articles your metrics would describe as better quality. And how would you compare those to the biography of Meredith L. Patterson [3]?
Further, if you try to compare biographical articles with articles on technical topics, like the article on ultrafilters [4] mentioned in my thesis, then you'll really be comparing apples and oranges. At the very least it seems like you should take into consideration "network" properties of the article relative to other *related* articles -- although then you will quickly get into the business of evaluating sub-sections of the encyclopedia.
You may also have to consider the role that the article is meant to play: e.g. is it just there to present simple facts, or is it meant to be more expository? A print encyclopedia would have zero links, and a given article might be "impressionistic" and still high-calibre: http://www.newyorker.com/magazine/2001/07/16/encyclopaedia-anderson
Joe
[1] https://en.wikipedia.org/wiki/Hannah_Arendt
[2] https://en.wikipedia.org/wiki/Pamela_Anderson
[3] https://en.wikipedia.org/wiki/Meredith_L._Patterson
[4] https://en.wikipedia.org/wiki/Ultrafilter