(Note: I posted this yesterday, but the message bounced because of the
attached scatter plot. I have uploaded the plot to Commons and re-sent.)
I just replicated this analysis. I think you might have made some
mistakes.
I took a random sample of non-redirect articles from English Wikipedia and
compared the byte_length (from database) to the content_length (from API,
tags and comments stripped).
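The message does not say how the tags and comments were stripped. As a rough sketch only, assuming the API returns raw wikitext and that a simple regex pass is good enough, it could look like the following in Python. The function name content_length and both patterns are hypothetical, not the script that was actually used.

```python
import re

# Assumption: comments look like <!-- ... --> and tags like <ref ...> or </ref>.
COMMENT_RE = re.compile(r"<!--.*?-->", re.DOTALL)
TAG_RE = re.compile(r"<[^>]+>")

def content_length(wikitext: str) -> int:
    """Return the character count after dropping comments and tags.

    Text between an opening and a closing tag is kept; only the tags
    themselves are removed. Templates and wiki markup are left untouched.
    """
    without_comments = COMMENT_RE.sub("", wikitext)
    without_tags = TAG_RE.sub("", without_comments)
    return len(without_tags)

def byte_length(wikitext: str) -> int:
    """Return the size in bytes of the UTF-8 encoded text."""
    return len(wikitext.encode("utf-8"))
```

Note that the first function counts characters while the second counts bytes, so the two can differ for non-ASCII text even when nothing is stripped.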
I get a Pearson correlation coefficient of *0.9514766*.
See the scatter plot, including a linear regression line:
<http://commons.wikimedia.org/wiki/File:Bytes.content_length.scatter…ng>
See also the regression output below.
Call:
lm(formula = byte_len ~ content_length, data = pages)

Residuals:
    Min      1Q  Median      3Q     Max
 -38263    -419      82     592   37605

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)
(Intercept)     -97.40412   72.46523  -1.344    0.179
content_length    1.14991    0.00832 138.210   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2722 on 1998 degrees of freedom
Multiple R-squared: 0.9053, Adjusted R-squared: 0.9053
F-statistic: 1.91e+04 on 1 and 1998 DF, p-value: < 2.2e-16
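
For readers who want to reproduce this kind of check without R, here is a minimal Python sketch of the two quantities reported above: the Pearson correlation coefficient and a one-predictor least-squares fit. The function names and the toy numbers are illustrative assumptions, not the data behind the output above.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def ols_fit(xs, ys):
    """Least-squares fit of y = intercept + slope * x; returns (intercept, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

# Toy values for illustration only (not the sampled articles).
content_length = [1, 2, 3, 4, 5]
byte_len = [2, 4, 5, 4, 5]
r = pearson_r(content_length, byte_len)
intercept, slope = ols_fit(content_length, byte_len)
print(r, intercept, slope)
```

With a single predictor, R-squared equals the square of the Pearson coefficient, which is consistent with the figures above: 0.9514766 squared is about 0.9053.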
On Mon, Aug 5, 2013 at 12:59 AM, WereSpielChequers <
werespielchequers(a)gmail.com> wrote:
Hi Fabian,
That's interesting. When you say you stripped out the HTML, did you also
strip out the other parts of the references? Some citation styles will take
up more bytes than others, and citation style is supposed to be consistent
at the article level.
It would also make a difference whether you included or excluded alt text
from readable material, as I suspect it is non-granular, i.e. if someone is
going to create alt text for one picture in an article, they will do so for
all pictures.
More significantly, there is a big difference in standards of referencing:
broadly, the higher the assessed quality and/or the more contentious the
article, the more references there will be.
I would expect that if you factored that in, there would be some
correlation between readable length and bytes within assessed classes of
quality, and the outliers would include some of the controversial articles
like Jerusalem (353 references).
Hope that helps.
Jonathan
On 2 August 2013 18:24, Floeck, Fabian (AIFB) <fabian.floeck(a)kit.edu> wrote:
Hi,
to whoever is interested in this (and I hope I didn't just repeat someone
else's experiments on this):
I wanted to know if a "long" or "short" article in terms of how much
readable material (excluding pictures) is presented to the reader in the
front-end is correlated to the byte size of the Wikisyntax which can be
obtained from the DB or API; as people often define the "length" of an
article by its length in bytes.
TL;DR: Turns out size in bytes is a really, really bad indicator for the
actual, readable content of a Wikipedia article, even worse than I thought.
We "curl"ed the front-end HTML of all articles of the English Wikipedia
(ns=0, no disambiguation, no redirects) between 5800 and 6000 bytes (as
around 5900 bytes is the total en.wiki average for these articles). This
yielded 41981 articles.
Results for size in characters (w/ whitespaces) after cleaning the HTML
out:
Min= 95 Max= 49441 Mean=4794.41 Std. Deviation=1712.748
Especially the gap between Min and Max was interesting. But templates
make it possible.
(See e.g. "Veer Teja Vidhya Mandir School", "Martin Callanan", although
for the latter you could argue that expandable template listings are not
really main "reading" content.)
Effectively, correlation for readable character size with byte size =
0.04 (i.e. none) in the sample.
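
One factor that may be worth checking here is restriction of range: the sample above only covers articles between 5800 and 6000 bytes, so byte size barely varies within it. The Python sketch below uses purely synthetic data to show how the same underlying relationship can give a high correlation over a wide range of sizes and a near-zero one inside a narrow band. The slope, noise level and size ranges are assumptions chosen for illustration, not measured values, and this is not a claim about what happened in the actual analysis.

```python
import random
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def simulate(lo, hi, n, rng):
    """Draw n synthetic (byte_len, readable_len) pairs with byte_len in [lo, hi].

    Assumed model: readable_len = byte_len / 1.15 + Gaussian noise (sd 1500).
    """
    byte_len = [rng.uniform(lo, hi) for _ in range(n)]
    readable_len = [b / 1.15 + rng.gauss(0, 1500) for b in byte_len]
    return byte_len, readable_len

rng = random.Random(0)  # fixed seed so repeated runs draw the same sample
r_full = pearson_r(*simulate(1000, 20000, 2000, rng))  # wide range of sizes
r_band = pearson_r(*simulate(5800, 6000, 2000, rng))   # narrow 200-byte band
print(f"wide range: r = {r_full:.3f}; narrow band: r = {r_band:.3f}")
```

In this toy setup the noise is identical in both cases. Only the spread of byte sizes differs, and that alone is enough to make the correlation inside the band close to zero.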
If someone already did this or a similar analysis, I'd appreciate
pointers.
Best,
Fabian
--
Karlsruhe Institute of Technology (KIT)
Institute of Applied Informatics and Formal Description Methods
Dipl.-Medwiss. Fabian Flöck
Research Associate
Building 11.40, Room 222
KIT-Campus South
D-76128 Karlsruhe
Phone: +49 721 608 4 6584
Fax: +49 721 608 4 6580
Skype: f.floeck_work
E-Mail: fabian.floeck(a)kit.edu
WWW:
http://www.aifb.kit.edu/web/Fabian_Flöck
KIT – University of the State of Baden-Wuerttemberg and
National Research Center of the Helmholtz Association
_______________________________________________
Wiki-research-l mailing list
Wiki-research-l(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l