I just replicated this analysis. I think you might have made some
mistakes.
I took a random sample of non-redirect articles from English Wikipedia and
compared the byte_length (from database) to the content_length (from API,
tags and comments stripped).
I get a Pearson correlation coefficient of *0.9514766*.
See the attached scatter plot, which includes a linear regression line, and
the regression output below.
Call:
lm(formula = page_len ~ content_length, data = pages)

Residuals:
   Min     1Q Median     3Q    Max
-38263   -419     82    592  37605

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)
(Intercept)     -97.40412   72.46523  -1.344    0.179
content_length    1.14991    0.00832 138.210   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2722 on 1998 degrees of freedom
Multiple R-squared:  0.9053,    Adjusted R-squared:  0.9053
F-statistic: 1.91e+04 on 1 and 1998 DF,  p-value: < 2.2e-16
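
For anyone who wants to re-check numbers like these without R, here is a minimal Python sketch of the two statistics reported above: the Pearson correlation and a one-predictor least-squares fit. The function names and the toy data are made up for illustration; this is not the script that produced the output above. As a sanity check on the reported figures, in a one-predictor regression R-squared equals the squared Pearson correlation, and 0.9514766^2 is about 0.9053, which matches.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def fit_line(xs, ys):
    """Least-squares (slope, intercept) for the model ys ~ xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


if __name__ == "__main__":
    # Toy values only (hypothetical), standing in for the real sample.
    content_length = [1000, 2500, 4000, 8000, 12000]
    page_len = [1100, 2900, 4500, 9300, 13600]

    r = pearson_r(content_length, page_len)
    slope, intercept = fit_line(content_length, page_len)
    print("r =", r, "r^2 =", r * r)
    print("page_len ~", intercept, "+", slope, "* content_length")
```

With the real data, `xs` would be the stripped content lengths and `ys` the byte lengths, mirroring `page_len ~ content_length` in the R call.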
On Fri, Aug 2, 2013 at 12:24 PM, Floeck, Fabian (AIFB) <fabian.floeck(a)kit.edu> wrote:
Hi,
to whoever is interested in this (and I hope I didn't just repeat someone
else's experiments on this):
I wanted to know whether how "long" or "short" an article is, in terms of
how much readable material (excluding pictures) is presented to the reader
in the front-end, is correlated with the byte size of the Wikisyntax, which
can be obtained from the DB or API. People often define the "length" of an
article by its length in bytes.
TL;DR: Turns out size in bytes is a really, really bad indicator of the
actual, readable content of a Wikipedia article, even worse than I thought.
We "curl"ed the front-end HTML of all articles of the English Wikipedia
(ns=0, no disambiguation pages, no redirects) between 5800 and 6000 bytes
in size, since around 5900 bytes is the overall en.wiki average for these
articles. That gave 41981 articles.
Results for size in characters (with whitespace) after cleaning the HTML
out:
Min = 95, Max = 49441, Mean = 4794.41, Std. Deviation = 1712.748
The gap between Min and Max was especially interesting, but templates make
it possible.
(See e.g. "Veer Teja Vidhya Mandir School" and "Martin Callanan", although
for the latter you could argue that expandable template listings are not
really main "reading" content.)
Effectively, the correlation of readable character size with byte size is
0.04 (i.e. none) in the sample.
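
To make the "cleaning the HTML out" step concrete, here is a rough Python sketch of one way to count readable characters (whitespace included) in a page's HTML. It is an assumption about what such a cleaning step could look like, not the script actually used for the numbers above; the names `TextExtractor` and `readable_length` are made up for illustration, and skipping `script` and `style` contents is a choice of this sketch.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes, ignoring tags, comments, scripts and styles."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside <script> or <style>.
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        return "".join(self._chunks)


def readable_length(html):
    """Number of characters of text left after removing the markup."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return len(parser.text())


if __name__ == "__main__":
    sample = "<p>Hello <b>world</b></p><!-- note --><script>var x = 1;</script>"
    print(readable_length(sample))
```

Whether navigation boxes, collapsed template listings and similar elements should count as "readable" content is a separate decision that this sketch does not make; it counts every text node outside scripts and styles.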
If someone already did this or a similar analysis, I'd appreciate pointers.
Best,
Fabian
--
Karlsruhe Institute of Technology (KIT)
Institute of Applied Informatics and Formal Description Methods
Dipl.-Medwiss. Fabian Flöck
Research Associate
Building 11.40, Room 222
KIT-Campus South
D-76128 Karlsruhe
Phone: +49 721 608 4 6584
Fax: +49 721 608 4 6580
Skype: f.floeck_work
E-Mail: fabian.floeck(a)kit.edu
WWW:
http://www.aifb.kit.edu/web/Fabian_Flöck
KIT – University of the State of Baden-Wuerttemberg and
National Research Center of the Helmholtz Association
_______________________________________________
Wiki-research-l mailing list
Wiki-research-l(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l