Here are a few examples of readability measures: a side-by-side comparison of the text of the GWB article on en.wp, simple.wp, and de.wp. I plan on parsing en, de and simple in full and exploring how these measures might be correlated with quality.
ps: Does anyone know of a script that can strip out wiki syntax? This is pertinent. It will also be necessary to leave only paragraphs of text in the articles; the data below is noticeably skewed in some (but not all) of the measures.
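For what it's worth, here is a minimal sketch of the kind of script I mean, assuming a handful of regular expressions is good enough. It is not a real wiki parser: the function name strip_wiki_markup and the list of line prefixes treated as non-prose are my own choices, and nested or unusual markup will slip through.

```python
import re

# Hypothetical helper patterns; real wiki markup has many more cases.
FILE_LINK = re.compile(r"\[\[(?:File|Image):[^\[\]]*\]\]", re.IGNORECASE)
WIKI_LINK = re.compile(r"\[\[(?:[^\[\]|]*\|)?([^\[\]|]*)\]\]")
EXT_LINK = re.compile(r"\[https?://[^\s\]]+(?:\s+([^\]]*))?\]")
TEMPLATE = re.compile(r"\{\{[^{}]*\}\}")
# Lines starting with these are headings, lists, or tables, not paragraphs.
NON_PROSE = ("=", "*", "#", ":", ";", "{|", "|", "!")


def strip_wiki_markup(text):
    """Return only the paragraph text of a wiki page (rough approximation)."""
    # Comments and footnotes; self-closing <ref/> must go before paired refs.
    text = re.sub(r"<!--.*?-->", "", text, flags=re.DOTALL)
    text = re.sub(r"<ref[^>]*/>", "", text)
    text = re.sub(r"<ref[^>]*>.*?</ref>", "", text, flags=re.DOTALL)
    # Remove innermost templates repeatedly so nested ones disappear too.
    while True:
        text, count = TEMPLATE.subn("", text)
        if count == 0:
            break
    # Drop image links, turn [[target|label]] into label, then drop image
    # links again (their captions may have contained links).
    text = FILE_LINK.sub("", text)
    text = WIKI_LINK.sub(r"\1", text)
    text = FILE_LINK.sub("", text)
    # [http://example.org label] becomes label; bare [http://...] vanishes.
    text = EXT_LINK.sub(lambda m: m.group(1) or "", text)
    # Bold and italic quote runs.
    text = re.sub(r"'{2,}", "", text)
    lines = [line.strip() for line in text.splitlines()]
    return "\n".join(l for l in lines if l and not l.startswith(NON_PROSE))
```

The output of something like this could be piped to style in place of the lynx dump, which should remove most of the skew from navigation text and link lists.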
pps: I recall from the Wikimania meeting that someone had a script to convert a dump to tab-delimited data. That would be useful to me... could someone provide a link?
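Failing a link, something along these lines might do. This is only a sketch, assuming the dump is the usual XML export with page, title and text elements; the function names are made up, only the latest text seen per page is kept, and tabs and newlines inside the article text are collapsed to single spaces so each page fits on one line.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace}' prefix ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def flatten(value):
    """Collapse all whitespace (tabs, newlines) to single spaces."""
    return " ".join(value.split())


def dump_to_tsv(xml_stream, out_stream):
    """Write one 'title<TAB>text' line per <page>; return the page count."""
    rows = 0
    title, text = "", ""
    # iterparse streams the file, so the whole dump never sits in memory.
    for _event, elem in ET.iterparse(xml_stream, events=("end",)):
        tag = local_name(elem.tag)
        if tag == "title":
            title = elem.text or ""
        elif tag == "text":
            text = elem.text or ""
        elif tag == "page":
            out_stream.write(f"{flatten(title)}\t{flatten(text)}\n")
            rows += 1
            title, text = "", ""
            elem.clear()  # free the finished page
    return rows

# Possible use (file names are placeholders):
#   with open("dump.xml", "rb") as src, open("dump.tsv", "w") as dst:
#       dump_to_tsv(src, dst)
```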
Erik: The largest of articles takes approx. 1/10 of a second to run through the binary produced by this C code. Using Inline::C in Perl, I could fairly easily embed the code (style.c from GNU Diction) into your script. It would take and return strings. "Simple!" =) Otherwise I can just produce the data in CSV etc. and provide it to you.
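To illustrate the CSV route (not the Inline::C one), here is a rough sketch that shells out to the style binary and scrapes the "readability grades" block from its report. The helper names are mine, and the regular expression is written against the report layout shown further down in this mail, so another version of style may need adjusting.

```python
import csv
import re
import subprocess

GRADES = ["Kincaid", "ARI", "Coleman-Liau", "Flesch Index",
          "Fog Index", "Lix", "SMOG-Grading"]
# Matches e.g. "Kincaid: 11.7" or "Lix: 51.3 = school year 10".
GRADE_RE = re.compile(
    r"\b(" + "|".join(map(re.escape, GRADES)) + r"):\s*(-?\d+(?:\.\d+)?)"
)


def run_style(text, language=None):
    """Feed text to style on stdin and return its report as a string."""
    cmd = ["style"]
    if language:
        cmd += ["-L", language]  # e.g. "de", as in the de.wp command below
    result = subprocess.run(cmd, input=text, capture_output=True,
                            text=True, check=True)
    return result.stdout


def parse_grades(style_output):
    """Extract the readability grades from a style report as floats."""
    return {name: float(value)
            for name, value in GRADE_RE.findall(style_output)}


def write_csv(rows, out):
    """rows maps an article label to its grades dict; one CSV row each."""
    writer = csv.DictWriter(out, fieldnames=["article"] + GRADES)
    writer.writeheader()
    for article, grades in rows.items():
        writer.writerow({"article": article, **grades})
```

Calling write_csv({"en": parse_grades(run_style(article_text))}, sys.stdout) would then emit one row per article, with article_text being whatever plain text was extracted beforehand.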
See [[Readability]] and Google to get an idea of what these readability grades mean. Briefly, all of these are explained quite simply at http://www.readability.info/info.shtml

Kincaid: http://en.wikipedia.org/wiki/Flesch-Kincaid_Readability_Test#Flesch-Kincaid_...
ARI: http://en.wikipedia.org/wiki/Automated_Readability_Index
Coleman-Liau: http://en.wikipedia.org/wiki/Coleman-Liau_Index
Flesch Index: http://en.wikipedia.org/wiki/Flesch-Kincaid_Readability_Test#Flesch_Reading_...
Fog Index: http://en.wikipedia.org/wiki/Gunning-Fog_Index
Lix: http://www.readability.info/info.shtml
SMOG-Grading: http://en.wikipedia.org/wiki/SMOG_Index
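To give a feel for how three of these grades are built, the sketch below uses the commonly published formulas for Flesch-Kincaid, the Flesch Reading Ease score and ARI. I am assuming style computes something close to this; I have not checked its source, and the function names are my own.

```python
def flesch_kincaid_grade(words_per_sentence, syllables_per_word):
    """Flesch-Kincaid grade level (US school grade)."""
    return 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59


def flesch_reading_ease(words_per_sentence, syllables_per_word):
    """Flesch Reading Ease; higher means easier to read."""
    return 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word


def automated_readability_index(chars_per_word, words_per_sentence):
    """ARI; uses characters per word instead of syllables."""
    return 4.71 * chars_per_word + 0.5 * words_per_sentence - 21.43


# Averages taken from the en.wp report in this mail:
# 12376 words, 513 sentences, 4.85 characters and 1.52 syllables per word.
wps = 12376 / 513
print(round(flesch_kincaid_grade(wps, 1.52), 1))
print(round(flesch_reading_ease(wps, 1.52), 1))
print(round(automated_readability_index(4.85, wps), 1))
```

Fed with the en.wp averages, these land within a few tenths of the Kincaid, Flesch and ARI values that style reports; the small gaps presumably come from the averages in the report being rounded.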
This data is very easy to reproduce. For each article I provide a unix command that assumes you have installed the lynx text browser, whose -dump option strips out the HTML and leaves the text, and the GNU Diction package, which provides style. Style supports English and German.
----------------------------------------------------------------
[[George W. Bush]] on en.wp:
lynx -dump "http://en.wikipedia.org/wiki/George W. Bush" | style
YMMV: I removed all the hyperlinks in this article before running style
----------------------------------------------------------------
readability grades:
        Kincaid: 11.7
        ARI: 13.5
        Coleman-Liau: 12.8
        Flesch Index: 54.0
        Fog Index: 15.3
        Lix: 51.3 = school year 10
        SMOG-Grading: 13.1
sentence info:
        60081 characters
        12376 words, average length 4.85 characters = 1.52 syllables
        513 sentences, average length 24.1 words
        58% (299) short sentences (at most 19 words)
        18% (97) long sentences (at least 34 words)
        65 paragraphs, average length 7.9 sentences
        0% (3) questions
        22% (114) passive sentences
        longest sent 294 wds at sent 507; shortest sent 1 wds at sent 5
word usage:
        verb types:
        to be (155) auxiliary (49)
        types as % of total:
        conjunctions 4% (544) pronouns 3% (336) prepositions 11% (1311)
        nominalizations 3% (311)
sentence beginnings:
        pronoun (47) interrogative pronoun (3) article (40)
        subordinating conjunction (23) conjunction (5) preposition (40)
----------------------------------------------------------------
[[George W. Bush]] on simple.wp:
lynx -dump "http://simple.wikipedia.org/wiki/George W. Bush" | style
----------------------------------------------------------------
readability grades:
        Kincaid: 3.3
        ARI: 0.7
        Coleman-Liau: 6.0
        Flesch Index: 88.6
        Fog Index: 6.5
        Lix: 23.6 = below school year 5
        SMOG-Grading: 7.4
sentence info:
        8659 characters
        2344 words, average length 3.69 characters = 1.28 syllables
        248 sentences, average length 9.5 words
        65% (163) short sentences (at most 4 words)
        10% (26) long sentences (at least 19 words)
        14 paragraphs, average length 17.7 sentences
        0% (0) questions
        10% (27) passive sentences
        longest sent 253 wds at sent 39; shortest sent 1 wds at sent 4
word usage:
        verb types:
        to be (40) auxiliary (1)
        types as % of total:
        conjunctions 1% (24) pronouns 1% (33) prepositions 4% (95)
        nominalizations 1% (24)
sentence beginnings:
        pronoun (10) interrogative pronoun (0) article (3)
        subordinating conjunction (3) conjunction (1) preposition (2)
----------------------------------------------------------------
[[George W. Bush]] on de.wp:
lynx -dump "http://de.wikipedia.org/wiki/George W. Bush" | style -L de
----------------------------------------------------------------
readability grades:
        Kincaid: 8.0
        ARI: 6.7
        Coleman-Liau: 12.3
        Flesch Index: 57.7
        Fog Index: 10.8
        Lix: 34.4 = school year 5
        SMOG-Grading: 5.3
sentence info:
        37740 characters
        7909 words, average length 4.77 characters = 1.63 syllables
        694 sentences, average length 11.4 words
        63% (441) short sentences (at most 6 words)
        16% (116) long sentences (at least 21 words)
        56 paragraphs, average length 12.4 sentences
        0% (2) questions
        6% (44) passive sentences
        longest sent 274 wds at sent 256; shortest sent 1 wds at sent 191
sentence beginnings:
        pronoun (14) interrogative pronoun (3) article (37)
----------------------------------------------------------------
Cheers,
Brian Mingus
My 'a' key is sticky, sorry for the lack of readability of my e-mail =)
On 8/18/06, Brian reflection@gmail.com wrote:
wiki-research-l@lists.wikimedia.org