My 'a' key is sticky, sorry for the lack of readability of my e-mail =)
On 8/18/06, Brian <reflection(a)gmail.com> wrote:
Here are a few examples of readability measures: a side-by-side
comparison of the text of the GWB article on en.wp, simple.wp, and
de.wp. I plan on parsing en, de and simple in full and exploring
how these measures might be correlated with quality.
ps: Does anyone know of a script that can strip out wiki syntax? This
is pertinent. It will also be necessary to leave only paragraphs of
text in the articles; the data below is noticeably skewed in some (but
not all) of the measures.
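
A minimal sketch of what such a stripping script might look like, in
Python. It is regex-based and only handles the common cases (templates,
refs, links, bold/italic), so treat it as a starting point rather than a
real wiki parser. The names strip_wiki_markup and prose_paragraphs are
made up for this example.

```python
import re


def strip_wiki_markup(text):
    """Rough, regex-based removal of common wiki syntax (not a full parser)."""
    text = re.sub(r"<!--.*?-->", "", text, flags=re.DOTALL)           # HTML comments
    text = re.sub(r"<ref[^>]*?/>", "", text)                           # self-closing refs
    text = re.sub(r"<ref[^>]*>.*?</ref>", "", text, flags=re.DOTALL)   # footnotes
    # Templates can nest, so strip the innermost {{...}} until nothing changes.
    previous = None
    while previous != text:
        previous = text
        text = re.sub(r"\{\{[^{}]*\}\}", "", text)
    # Image/category links; captions containing nested links are not handled.
    text = re.sub(r"\[\[(?:Image|File|Category):[^\]]*\]\]", "", text)
    text = re.sub(r"\[\[(?:[^|\]]*\|)?([^\]]*)\]\]", r"\1", text)      # [[target|label]] -> label
    text = re.sub(r"\[https?://\S+ ([^\]]*)\]", r"\1", text)           # [url label] -> label
    text = re.sub(r"\[https?://\S+\]", "", text)                       # bare [url]
    text = re.sub(r"'{2,}", "", text)                                  # bold / italic quotes
    text = re.sub(r"<[^>]+>", "", text)                                # leftover HTML tags
    return text


def prose_paragraphs(text):
    """Keep only blocks that look like running prose (no headings, lists, tables)."""
    paragraphs = []
    for block in re.split(r"\n\s*\n", strip_wiki_markup(text)):
        block = block.strip()
        if not block or block[0] in "=*#:;{|!":
            continue
        paragraphs.append(" ".join(block.split()))
    return paragraphs
```

Joining the result of prose_paragraphs() with blank lines gives plain
text that can be piped into style.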
pps: I recall from the Wikimania meeting that someone had a script to
convert a dump to tab-delimited data. That would be useful to me;
could someone provide a link?
Erik: Even the largest articles take approx. 1/10 of a second each when
run through the binary produced by this C code. Using Inline::C in Perl,
I could fairly easily embed the code (style.c from GNU Diction) into your
script. It would take and return strings. "Simple!" =) Otherwise I can
just produce the data as CSV etc. and provide it to you.
See [[Readability]] and Google to get an idea of what these
readability grades mean. All of them are explained quite simply at:
http://www.readability.info/info.shtml
Kincaid:
http://en.wikipedia.org/wiki/Flesch-Kincaid_Readability_Test#Flesch-Kincaid…
ARI:
http://en.wikipedia.org/wiki/Automated_Readability_Index
Coleman-Liau:
http://en.wikipedia.org/wiki/Coleman-Liau_Index
Flesch Index:
http://en.wikipedia.org/wiki/Flesch-Kincaid_Readability_Test#Flesch_Reading…
Fog Index:
http://en.wikipedia.org/wiki/Gunning-Fog_Index
Lix:
http://www.readability.info/info.shtml
SMOG-Grading:
http://en.wikipedia.org/wiki/SMOG_Index
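
For a quick feel of how these grades are built, here is a small Python
sketch of three of them, using the commonly published formulas for
Kincaid, Flesch and ARI. That style.c uses exactly these constants is an
assumption on my part, and the function names are just illustrative.

```python
def kincaid_grade(words, sentences, syllables_per_word):
    """Flesch-Kincaid grade level from word/sentence counts."""
    return 0.39 * (words / sentences) + 11.8 * syllables_per_word - 15.59


def flesch_reading_ease(words, sentences, syllables_per_word):
    """Flesch reading ease: higher scores mean easier text."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * syllables_per_word


def automated_readability_index(characters, words, sentences):
    """ARI uses characters per word instead of syllables."""
    return 4.71 * (characters / words) + 0.5 * (words / sentences) - 21.43
```

Fed the en.wp counts reported further down (60081 characters, 12376
words, 513 sentences, 1.52 syllables per word), ARI comes out at about
13.5, and Kincaid and Flesch land close to the 11.7 and 54.0 that style
prints. The small gaps are consistent with the syllable average being
rounded to two decimals in the report.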
This data is very easy to reproduce. I provide a Unix command for each
that assumes you have installed the lynx text browser, which has a
dump option to strip out HTML and leave text, and the GNU Diction
package, which provides style. style supports English and German.
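
If you would rather drive the same pipeline from a script, something
along these lines should work. It is only a sketch: it assumes lynx and
style are on your PATH, and article_url and style_report are names I
made up for the example. Spaces in the title are turned into
underscores, which is how MediaWiki forms its page URLs.

```python
import subprocess
from urllib.parse import quote


def article_url(lang, title):
    """Build a Wikipedia page URL; MediaWiki uses underscores for spaces."""
    return "http://%s.wikipedia.org/wiki/%s" % (lang, quote(title.replace(" ", "_")))


def style_report(lang, title, style_args=()):
    """Run `lynx -dump URL | style [args]` and return style's report as text."""
    dump = subprocess.run(
        ["lynx", "-dump", article_url(lang, title)],
        capture_output=True,
        check=True,
    )
    report = subprocess.run(
        ["style", *style_args],
        input=dump.stdout,      # feed lynx's text dump to style on stdin
        capture_output=True,
        check=True,
    )
    return report.stdout.decode("utf-8", errors="replace")
```

For the German article the call would be
style_report("de", "George W. Bush", ["-L", "de"]).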
----------------------------------------------------------------
[[George W. Bush]] on en.wp:
lynx -dump "http://en.wikipedia.org/wiki/George_W._Bush" | style
YMMV: I removed all the hyperlinks in this article before running style
----------------------------------------------------------------
readability grades:
Kincaid: 11.7
ARI: 13.5
Coleman-Liau: 12.8
Flesch Index: 54.0
Fog Index: 15.3
Lix: 51.3 = school year 10
SMOG-Grading: 13.1
sentence info:
60081 characters
12376 words, average length 4.85 characters = 1.52 syllables
513 sentences, average length 24.1 words
58% (299) short sentences (at most 19 words)
18% (97) long sentences (at least 34 words)
65 paragraphs, average length 7.9 sentences
0% (3) questions
22% (114) passive sentences
longest sent 294 wds at sent 507; shortest sent 1 wds at sent 5
word usage:
verb types:
to be (155) auxiliary (49)
types as % of total:
conjunctions 4% (544) pronouns 3% (336) prepositions 11% (1311)
nominalizations 3% (311)
sentence beginnings:
pronoun (47) interrogative pronoun (3) article (40)
subordinating conjunction (23) conjunction (5) preposition (40)
----------------------------------------------------------------
[[George W. Bush]] on simple.wp:
lynx -dump "http://simple.wikipedia.org/wiki/George_W._Bush" | style
----------------------------------------------------------------
readability grades:
Kincaid: 3.3
ARI: 0.7
Coleman-Liau: 6.0
Flesch Index: 88.6
Fog Index: 6.5
Lix: 23.6 = below school year 5
SMOG-Grading: 7.4
sentence info:
8659 characters
2344 words, average length 3.69 characters = 1.28 syllables
248 sentences, average length 9.5 words
65% (163) short sentences (at most 4 words)
10% (26) long sentences (at least 19 words)
14 paragraphs, average length 17.7 sentences
0% (0) questions
10% (27) passive sentences
longest sent 253 wds at sent 39; shortest sent 1 wds at sent 4
word usage:
verb types:
to be (40) auxiliary (1)
types as % of total:
conjunctions 1% (24) pronouns 1% (33) prepositions 4% (95)
nominalizations 1% (24)
sentence beginnings:
pronoun (10) interrogative pronoun (0) article (3)
subordinating conjunction (3) conjunction (1) preposition (2)
----------------------------------------------------------------
[[George W. Bush]] on de.wp:
lynx -dump "http://de.wikipedia.org/wiki/George_W._Bush" | style -L de
----------------------------------------------------------------
readability grades:
Kincaid: 8.0
ARI: 6.7
Coleman-Liau: 12.3
Flesch Index: 57.7
Fog Index: 10.8
Lix: 34.4 = school year 5
SMOG-Grading: 5.3
sentence info:
37740 characters
7909 words, average length 4.77 characters = 1.63 syllables
694 sentences, average length 11.4 words
63% (441) short sentences (at most 6 words)
16% (116) long sentences (at least 21 words)
56 paragraphs, average length 12.4 sentences
0% (2) questions
6% (44) passive sentences
longest sent 274 wds at sent 256; shortest sent 1 wds at sent 191
sentence beginnings:
pronoun (14) interrogative pronoun (3) article (37)
----------------------------------------------------------------
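
For anyone who wants the grades above as CSV rather than as raw style
reports, here is a rough Python sketch. It assumes the report layout
shown in this message (a "readability grades:" heading followed by
"Name: value" lines); parse_grades and grades_to_csv are hypothetical
helper names, not part of GNU Diction.

```python
import csv
import io
import re

# "Lix: 51.3 = school year 10" -> ("Lix", "51.3"); trailing text is ignored.
GRADE_LINE = re.compile(r"^([A-Za-z][A-Za-z -]*?):\s*(-?\d+(?:\.\d+)?)")


def parse_grades(report):
    """Extract the 'readability grades' section of a style report as a dict."""
    grades = {}
    in_grades = False
    for line in report.splitlines():
        stripped = line.strip()
        if stripped == "readability grades:":
            in_grades = True
            continue
        if in_grades:
            match = GRADE_LINE.match(stripped)
            if not match:          # first non-grade line ends the section
                break
            grades[match.group(1)] = float(match.group(2))
    return grades


def grades_to_csv(rows):
    """rows maps an article label to its grades dict; returns CSV text."""
    columns = sorted({name for grades in rows.values() for name in grades})
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["article"] + columns)
    for label, grades in rows.items():
        writer.writerow([label] + [grades.get(name, "") for name in columns])
    return buffer.getvalue()
```

Calling grades_to_csv with one parsed report per wiki gives one row per
article, with a column for each grade.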
Cheers,
Brian Mingus